CyberHunt 2025 | UMBC CSEE Department
| Authors | Bharath Kumar Swargam, Prajna Bhandary, Charles Nicholas |
| Institution | University of Maryland, Baltimore County (UMBC) |
| Contact | swargambharath987@gmail.com |
Malware clustering groups samples by shared behavioral and structural characteristics to support automated family classification and threat tracking. A persistent challenge in this process is the presence of outliers — samples that do not clearly belong to any cluster.
This work investigates whether outliers produced by clustering algorithms are merely noise, or whether they carry meaningful intelligence value. We apply two complementary unsupervised methods — K-Means Least Trimmed Squares (LTS) and HDBSCAN — across two real-world malware datasets of contrasting scale and quality, using DLL import features as lightweight behavioral indicators.
Antivirus (AV) labels are used post hoc to characterize and interpret the flagged outliers, not during clustering.
- Outliers are not noise. They frequently represent mislabeled samples, emerging threats, or structurally unique binaries
- K-Means LTS surfaces feature-dense, high-functionality malware with large DLL import counts (mean: 229 on MOTIF, 1,339 on SOREL)
- HDBSCAN captures sparse, obfuscated, or evasive binaries with fewer DLL imports (mean: 104 on MOTIF, 561 on SOREL)
- The two methods are complementary — each surfaces a different type of anomaly; neither alone covers the full outlier landscape
- On MOTIF: HDBSCAN flagged 61.5% of samples as noise, versus only 9.9% flagged by K-Means LTS
- On SOREL: at scale, both methods converged on an outlier rate of roughly 10–11% (10.0% for K-Means LTS, 11.3% for HDBSCAN)
- 1,586 DLL functions were exclusive to outliers from both methods on MOTIF (2,881 on SOREL), suggesting niche or novel malware behavior
| Dataset | Description | Original Size | Used in Analysis |
|---|---|---|---|
| MOTIF | Well-labeled dataset spanning 454 malware families (ransomware, banking trojans, RATs, droppers) | 3,090 samples | 2,380 samples |
| SOREL-20M | Large-scale corpus; filtered to ransomware samples with VirusTotal labels and valid import info | 627,298 samples | 157,848 samples |
Feature representation: Each sample is encoded as a binary vector of DLL-imported functions (78,614 unique functions identified across both datasets). Samples with fewer than 10 unique DLL imports were excluded.
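
The encoding described above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual preprocessing: the helper names (`build_vocabulary`, `encode`), the `dll:function` string format, and the toy samples are all assumptions made for the example. Only the binary encoding and the 10-import threshold come from the text.

```python
from typing import Dict, Iterable, List, Tuple

MIN_IMPORTS = 10  # samples with fewer unique imports are excluded (threshold from the text)


def build_vocabulary(samples: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Map every distinct imported function name to a column index."""
    names = sorted({fn for imports in samples.values() for fn in imports})
    return {name: idx for idx, name in enumerate(names)}


def encode(
    samples: Dict[str, Iterable[str]],
    vocab: Dict[str, int],
    min_imports: int = MIN_IMPORTS,
) -> Tuple[List[str], List[List[int]]]:
    """Return (kept_ids, rows); rows[i][j] is 1 if sample i imports function j."""
    kept_ids: List[str] = []
    rows: List[List[int]] = []
    for sample_id, imports in samples.items():
        unique = set(imports)
        if len(unique) < min_imports:
            continue  # too few unique imports to be informative
        row = [0] * len(vocab)
        for fn in unique:
            row[vocab[fn]] = 1
        kept_ids.append(sample_id)
        rows.append(row)
    return kept_ids, rows


# Hypothetical toy input: sample_a has 12 unique imports, sample_b only 2.
toy = {
    "sample_a": [f"kernel32.dll:Func{i}" for i in range(12)],
    "sample_b": ["kernel32.dll:Func0", "user32.dll:MessageBoxA", "user32.dll:MessageBoxA"],
}
vocab = build_vocabulary(toy)
kept_ids, rows = encode(toy, vocab)
print(kept_ids, len(vocab), sum(rows[0]))
```

With tens of thousands of columns, a sparse matrix would likely be a better fit than nested lists; the dense form is used here only to keep the sketch dependency-free.
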
A robust variant of K-Means that trims the 10% of samples farthest from their assigned cluster centroids and treats them as outliers.
| Parameter | MOTIF | SOREL |
|---|---|---|
| Clusters (k) | 5 | 7 |
| k selection | Elbow method | Elbow method on 10% stratified sample via Truncated SVD (200 components) |
| Outlier trim | 10% | 10% |
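
The trimming step can be illustrated with scikit-learn. This is a simplified single-pass sketch under stated assumptions: it fits ordinary K-Means once and flags the farthest 10% of samples, whereas trimmed K-Means variants usually re-estimate centroids on the retained samples. The function name `kmeans_trim_outliers`, the Euclidean distance, `n_init`, `random_state`, and the random toy matrix are illustrative choices, not the notebook's settings, so the counts will not necessarily match the reported ones.

```python
import numpy as np
from sklearn.cluster import KMeans


def kmeans_trim_outliers(X, k, trim_fraction=0.10, random_state=0):
    """Fit K-Means once, then flag the trim_fraction of samples farthest from their centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    # Euclidean distance from each sample to the centroid of its assigned cluster.
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    n_trim = int(round(trim_fraction * len(X)))
    is_outlier = np.zeros(len(X), dtype=bool)
    if n_trim > 0:
        # The n_trim largest distances are treated as outliers.
        is_outlier[np.argsort(dist)[-n_trim:]] = True
    return km.labels_, dist, is_outlier


# Hypothetical toy data: 200 random binary feature vectors with 30 columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30)).astype(float)
labels, dist, is_outlier = kmeans_trim_outliers(X, k=5)
print(int(is_outlier.sum()), "of", len(X), "samples flagged")
```
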
Hierarchical density-based clustering that naturally labels low-density points as noise without requiring a fixed outlier percentage.
| Parameter | Value |
|---|---|
| min_cluster_size | 10 |
| min_samples | 10 |
| Selection method | Parameter sweep on MOTIF evaluating Silhouette Score and Davies–Bouldin Index |
| Metric | MOTIF K-Means LTS | MOTIF HDBSCAN | SOREL K-Means LTS | SOREL HDBSCAN |
|---|---|---|---|---|
| Silhouette Score | 0.1981 | 0.551 | 0.2498 | 0.6094 |
| Davies–Bouldin Index | 1.4320 | 0.910 | 1.6612 | 1.7235 |
| Outlier Count | 235 | 1,464 | 15,783 | 17,892 |
| Total Samples | 2,380 | 2,380 | 157,848 | 157,848 |
| Outlier Percentage | 9.9% | 61.5% | 10.0% | 11.3% |
| Overlap Count | 157 | 157 | 5,716 | 5,716 |
| Overlap % (K-Means view) | 66.8% | — | 36.2% | — |
| Overlap % (HDBSCAN view) | — | 10.7% | — | 31.9% |
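
The overlap rows are simple set arithmetic: the shared outliers divided by each method's own outlier count. The sketch below reproduces the MOTIF percentages from the table using synthetic integer IDs sized to match the reported counts (235, 1,464, and 157 shared); the function name `overlap_stats` and the IDs are illustrative, not real sample hashes.

```python
def overlap_stats(outliers_a, outliers_b):
    """Overlap between two outlier sets, as a share of each set."""
    shared = outliers_a & outliers_b
    return {
        "overlap_count": len(shared),
        "pct_of_a": 100.0 * len(shared) / len(outliers_a),
        "pct_of_b": 100.0 * len(shared) / len(outliers_b),
    }


# Synthetic IDs chosen so the set sizes match the MOTIF column of the table.
kmeans_outliers = set(range(235))              # 235 K-Means LTS outliers
hdbscan_outliers = set(range(78, 78 + 1464))   # 1,464 HDBSCAN noise points
stats = overlap_stats(kmeans_outliers, hdbscan_outliers)
print(stats["overlap_count"], round(stats["pct_of_a"], 1), round(stats["pct_of_b"], 1))
```
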
| DLL import count | MOTIF K-Means LTS | MOTIF HDBSCAN | SOREL K-Means LTS | SOREL HDBSCAN |
|---|---|---|---|---|
| Mean (Outliers) | 228.99 | 103.75 | 1,339.20 | 560.54 |
| Mean (Inliers) | 108.06 | 145.97 | 647.40 | 736.52 |
| Median (Outliers) | 195 | 90 | 1,096 | 326 |
| Median (Inliers) | 86 | 92 | 438 | 514 |
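
Statistics like those in the table above can be derived by counting the set bits in each binary feature row and splitting by outlier flag. This is a sketch under the assumption that the "import count" of a sample is the number of 1s in its feature vector; the function name `import_count_summary` and the toy rows are hypothetical.

```python
from statistics import mean, median


def import_count_summary(rows, is_outlier):
    """Mean and median number of imported functions for outliers and inliers."""
    counts = [sum(row) for row in rows]
    outlier_counts = [c for c, flag in zip(counts, is_outlier) if flag]
    inlier_counts = [c for c, flag in zip(counts, is_outlier) if not flag]
    return {
        "outlier_mean": mean(outlier_counts),
        "outlier_median": median(outlier_counts),
        "inlier_mean": mean(inlier_counts),
        "inlier_median": median(inlier_counts),
    }


# Hypothetical toy matrix: four samples over four functions; the last is an outlier.
rows = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
is_outlier = [False, False, False, True]
summary = import_count_summary(rows, is_outlier)
print(summary)
```
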
MOTIF
| Rank | K-Means LTS | % | HDBSCAN | % |
|---|---|---|---|---|
| 1 | zerot | 6.4% | icedid | 7.7% |
| 2 | flawedammyy | 4.7% | phorpiex | 2.3% |
| 3 | xmrig | 3.4% | gandcrab | 2.0% |
| 4 | qbot | 3.0% | maze | 1.8% |
| 5 | nymaim | 3.0% | bazarbackdoor | 1.7% |
SOREL
| Rank | K-Means LTS | % | HDBSCAN | % |
|---|---|---|---|---|
| 1 | cerber | 31.6% | cerber | 28.5% |
| 2 | bunitu | 7.0% | expiro | 5.6% |
| 3 | expiro | 4.2% | tofsee | 3.8% |
| 4 | cryptxxx | 3.9% | zbot | 3.8% |
| 5 | icloader | 2.8% | bunitu | 3.1% |
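
Family rankings like the ones above can be produced by tallying AV family labels over the flagged outliers after clustering. The sketch assumes each percentage is a family's share of all outliers flagged by one method, which the source does not state explicitly; `top_families` and the toy label list are hypothetical.

```python
from collections import Counter


def top_families(family_labels, n=5):
    """Return the n most common families with their share of the labelled outliers."""
    counts = Counter(family_labels)
    total = sum(counts.values())
    return [(family, 100.0 * count / total) for family, count in counts.most_common(n)]


# Hypothetical AV family labels attached to outliers post hoc.
toy_labels = ["cerber", "cerber", "cerber", "expiro"]
ranking = top_families(toy_labels)
for family, pct in ranking:
    print(f"{family}: {pct:.1f}%")
```
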
All plots are in the Plots/ directory.
MOTIF
| Plot | Description |
|---|---|
| Plots/Motif/hdbscan_hyperparameter_heatmaps.png | HDBSCAN parameter sweep: Silhouette Score, DBI, cluster count, and outlier % across min_cluster_size and min_samples |
| Plots/Motif/hdbscan_pca_tsne_plots_20250625_104901.png | PCA and t-SNE projections of HDBSCAN results showing inliers vs. noise |
| Plots/Motif/kmeans_k_selection_analysis.png | Elbow method, Silhouette Score, and DBI across k values; optimal k=5 selected |
| Plots/Motif/kmeans_lts_pca_outlier_analysis.png | PCA projection of K-Means LTS results with outliers highlighted |
| Plots/Motif/kmeans_lts_tsne_outlier_analysis.png | t-SNE projection of K-Means LTS results with outliers highlighted |
SOREL
| Plot | Description |
|---|---|
| Plots/SOREL/kselection_kmeans_lts_SOREL.png | Elbow method, Silhouette Score, and DBI across k values; optimal k=7 selected |
| Plots/SOREL/PCA_KmeansLTS_SOREL.png | PCA projection of K-Means LTS clustering on SOREL with outliers (red x) |
| Plots/SOREL/tSNE_KmeansLTS_SOREL.png | t-SNE projection of K-Means LTS clustering on SOREL with outliers (red x) |
Outlier_Analysis/
├── outlier.ipynb # Main analysis notebook
├── results.csv # Clustering results summary
├── outliers_data_kmeans_lts.csv # K-Means LTS outlier records (MOTIF)
├── outliers_data_cof.xlsx # COF outlier analysis (MOTIF)
├── outliers_data_lof.xlsx # LOF outlier analysis (MOTIF)
├── requirements.txt # Python dependencies
├── Dockerfile # Docker image definition
├── MOTIF/
│   └── Data/
│       ├── func_maps.pkl # DLL function → integer mapping dictionary
│       ├── transformed_data_new.pkl # Preprocessed binary feature matrix
│       ├── hdbscan_complete_analysis.pkl # Full HDBSCAN clustering results
│       └── kmeans_lts_complete_analysis.pkl # Full K-Means LTS clustering results
└── Plots/
    ├── Motif/
    │   ├── hdbscan_hyperparameter_heatmaps.png
    │   ├── hdbscan_pca_tsne_plots_20250625_104901.png
    │   ├── kmeans_k_selection_analysis.png
    │   ├── kmeans_lts_pca_outlier_analysis.png
    │   └── kmeans_lts_tsne_outlier_analysis.png
    └── SOREL/
        ├── kselection_kmeans_lts_SOREL.png
        ├── PCA_KmeansLTS_SOREL.png
        └── tSNE_KmeansLTS_SOREL.png
1. Clone the repository
git clone https://github.com/UMBC-DREAM-Lab/Outlier_Analysis.git
cd Outlier_Analysis
2. Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
3. Install dependencies
pip install -r requirements.txt
4. Launch the notebook
jupyter notebook outlier.ipynb
Build
docker build -t outlier-analysis .
Run
docker run -p 8888:8888 outlier-analysis
Then open http://localhost:8888 in your browser.
| Metric | What it measures | Better when |
|---|---|---|
| Silhouette Score | Similarity of a sample to its own cluster vs. other clusters | Higher (max 1.0) |
| Davies–Bouldin Index (DBI) | Ratio of intra-cluster scatter to inter-cluster separation | Lower (min 0.0) |
Metrics are computed on inlier samples only after outlier removal to assess cluster quality.
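
Scoring inliers only can be sketched with scikit-learn's `silhouette_score` and `davies_bouldin_score`. The wrapper `inlier_cluster_quality`, the Euclidean default, and the toy blobs are assumptions for illustration; the notebook's exact evaluation code is not shown in this README.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score


def inlier_cluster_quality(X, labels, is_outlier):
    """Compute Silhouette Score and Davies-Bouldin Index on non-outlier samples only."""
    keep = ~np.asarray(is_outlier, dtype=bool)
    X_in = np.asarray(X)[keep]
    y_in = np.asarray(labels)[keep]
    return silhouette_score(X_in, y_in), davies_bouldin_score(X_in, y_in)


# Hypothetical toy data: two tight blobs plus one flagged point between them.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 2)),
    rng.normal(5.0, 0.1, size=(30, 2)),
    [[2.5, 2.5]],
])
labels = np.array([0] * 30 + [1] * 30 + [0])
is_outlier = np.array([False] * 60 + [True])

sil, dbi = inlier_cluster_quality(X, labels, is_outlier)
print(f"silhouette={sil:.3f}, DBI={dbi:.3f}")
```

Because outliers are removed before scoring, a method that discards more points (as HDBSCAN does on MOTIF) is evaluated on a smaller, denser subset, which is worth keeping in mind when comparing scores across methods.
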
The outlier detection pipeline functions as an automated pre-triage mechanism for Cyber Threat Intelligence (CTI) workflows:
- Flag anomalous binaries for deeper sandbox or reverse-engineering analysis
- Surface mislabeled samples in AV label datasets
- Detect emerging variants that deviate structurally from known family clusters
- Identify obfuscated/evasive payloads via sparse DLL profiles (HDBSCAN signal)
- Identify modular or packed malware via dense DLL profiles (K-Means LTS signal)
- Analysis relies solely on static DLL import features — behavioral nuances from dynamic execution are not captured
- The DLL import table may be intentionally corrupted or obfuscated in packed/evasive samples
- LOF produced inconsistent scoring on MOTIF due to high dimensionality; COF produced infinite scores — neither was applied to SOREL
- Only two clustering algorithms were evaluated; ensemble or deep anomaly detection approaches remain unexplored
- Extend beyond DLL imports with opcode n-grams, PE header entropy, and selected dynamic traces
- Cross-validate outlier sets against consensus AV labels, threat intelligence feeds, and sandbox outputs
- Build lightweight visualization and summarization tools to translate clustering outputs into analyst-ready intelligence reports
- Explore LLM-assisted summarization to auto-describe behavioral themes within outlier clusters
- Map outlier indicators (hashes, API imports) to STIX/TAXII formats for pipeline integration
- R. J. Joyce, D. Amlani, C. Nicholas, E. Raff — "MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels" — arXiv:2111.15031, 2021
- R. Harang, E. M. Rudd — "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection" — arXiv:2012.07634, 2020
- T. Estella et al. — "Outlier Handling in Clustering: K-Means, Robust Trimmed K-Means, and K-Means LTS" — IJCDS, vol. 15, no. 1, 2024
- B. V. Sanchez-Vinces et al. — "A comparative evaluation of clustering-based outlier detection" — DMKD, vol. 39, no. 2, 2025
- P. Bhandary, R. J. Joyce, C. Nicholas — "Ransomware evolution: Unveiling patterns using HDBSCAN" — CAMLIS '24, CEUR-WS vol. 3920
- K. Ghosh et al. — "Unsupervised Parameter-free Outlier Detection using HDBSCAN Outlier Profiles" — arXiv:2411.08867, 2024
- Q. Li, S. Wang — "Detecting outliers by clustering algorithms" — arXiv:2412.05669, 2024
- A. Nowak-Brzezinska, C. Horyn — "Outliers in rules — the comparison of LOF, COF and KMEANS algorithms" — Procedia CS, vol. 176, 2020
@inproceedings{swargam2025outliers,
title = {Analyzing the Impact of Outliers in Malware Clustering},
author = {Swargam, Bharath Kumar and Bhandary, Prajna and Nicholas, Charles},
booktitle = {CyberHunt 2025},
year = {2025},
institution = {University of Maryland, Baltimore County (UMBC)}
}