Analyzing the Impact of Outliers in Malware Clustering

CyberHunt 2025 | UMBC CSEE Department


Authors	Bharath Kumar Swargam, Prajna Bhandary, Charles Nicholas
Institution	University of Maryland, Baltimore County (UMBC)
Contact	swargambharath987@gmail.com

Overview

Malware clustering groups samples by shared behavioral and structural characteristics to support automated family classification and threat tracking. A persistent challenge in this process is the presence of outliers — samples that do not clearly belong to any cluster.

This work investigates whether outliers produced by clustering algorithms are merely noise, or whether they carry meaningful intelligence value. We apply two complementary unsupervised methods — K-Means Least Trimmed Squares (LTS) and HDBSCAN — across two real-world malware datasets of contrasting scale and quality, using DLL import features as lightweight behavioral indicators.

Antivirus (AV) labels are used post hoc to characterize and interpret the flagged outliers, not during clustering.

Key Findings

Outliers are not noise. They frequently represent mislabeled samples, emerging threats, or structurally unique binaries
K-Means LTS surfaces feature-dense, high-functionality malware with large DLL import counts (mean: 229 on MOTIF, 1,339 on SOREL)
HDBSCAN captures sparse, obfuscated, or evasive binaries with fewer DLL imports (mean: 104 on MOTIF, 561 on SOREL)
The two methods are complementary — each surfaces a different type of anomaly; neither alone covers the full outlier landscape
On MOTIF: HDBSCAN flagged 61.5% of samples as noise, versus only 9.9% trimmed by K-Means LTS
On SOREL: both methods converged on an outlier rate of roughly 10–11% (10.0% for K-Means LTS, 11.3% for HDBSCAN)
1,586 DLL functions were exclusive to outliers from both methods on MOTIF (2,881 on SOREL), suggesting niche or novel malware behavior
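
The "exclusive to outliers" count in the last finding can be illustrated with plain set operations. This is a minimal sketch, not the notebook's code: it assumes "outliers from both methods" means samples flagged by both K-Means LTS and HDBSCAN, and the function and variable names are hypothetical.

```python
from typing import Dict, Set


def exclusive_to_outliers(imports_by_sample: Dict[str, Set[str]],
                          outlier_ids: Set[str]) -> Set[str]:
    """Return functions imported by at least one outlier and by no inlier."""
    outlier_funcs: Set[str] = set()
    inlier_funcs: Set[str] = set()
    for sample_id, imports in imports_by_sample.items():
        target = outlier_funcs if sample_id in outlier_ids else inlier_funcs
        target.update(imports)
    return outlier_funcs - inlier_funcs


# Toy data: sample IDs and function names are placeholders.
imports_by_sample = {
    "s1": {"CreateFileW", "ReadFile"},
    "s2": {"ReadFile", "WriteFile"},
    "s3": {"WriteFile", "CryptEncrypt"},
}
kmeans_outliers = {"s1", "s3"}
hdbscan_outliers = {"s3"}

# Assumption: use the samples flagged by both methods (swap & for | to use the union).
shared_outliers = kmeans_outliers & hdbscan_outliers
print(exclusive_to_outliers(imports_by_sample, shared_outliers))
```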

Datasets

Dataset	Description	Original Size	Used in Analysis
MOTIF	Well-labeled dataset spanning 454 malware families (ransomware, banking trojans, RATs, droppers)	3,090 samples	2,380 samples
SOREL-20M	Large-scale corpus; filtered to ransomware samples with VirusTotal labels and valid import info	627,298 samples	157,848 samples

Feature representation: Each sample is encoded as a binary vector of DLL-imported functions (78,614 unique functions identified across both datasets). Samples with fewer than 10 unique DLL imports were excluded.
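
As a rough illustration of this encoding, the sketch below builds a function-to-column vocabulary and turns each sample's import list into a 0/1 vector, dropping samples below the import threshold. The helper names and the input layout (a dict of sample ID to imported function names) are assumptions for illustration; the notebook's actual preprocessing may differ, for example in whether the vocabulary is built before or after filtering.

```python
from typing import Dict, Iterable, List

MIN_IMPORTS = 10  # threshold stated above: fewer unique imports means the sample is excluded


def build_vocabulary(samples: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Map every distinct imported function name to a column index."""
    names = sorted({fn for imports in samples.values() for fn in imports})
    return {name: idx for idx, name in enumerate(names)}


def encode(samples: Dict[str, Iterable[str]],
           vocab: Dict[str, int],
           min_imports: int = MIN_IMPORTS) -> Dict[str, List[int]]:
    """Return {sample_id: binary vector}, skipping samples below the threshold."""
    encoded: Dict[str, List[int]] = {}
    for sample_id, imports in samples.items():
        unique = set(imports)
        if len(unique) < min_imports:
            continue  # too few unique imports to be informative
        row = [0] * len(vocab)
        for fn in unique:
            row[vocab[fn]] = 1
        encoded[sample_id] = row
    return encoded
```

For a quick check on toy data, pass a smaller `min_imports`, e.g. `encode(samples, build_vocabulary(samples), min_imports=2)`.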

Methods

K-Means Least Trimmed Squares (LTS)

A robust variant of K-Means that trims the 10% of samples farthest from their assigned cluster centroids and treats them as outliers.

Parameter	MOTIF	SOREL
Clusters (k)	5	7
k selection	Elbow method	Elbow method on a 10% stratified sample, reduced with Truncated SVD (200 components)
Outlier trim	10%	10%
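
The trimming idea can be sketched in a few lines of standard-library Python. This is one common formulation of trimmed k-means (re-estimate centroids from the retained samples on every pass), written under stated assumptions: Euclidean distance, centroids initialised from the first k points, and a fixed iteration count. It is not taken from the notebook, whose K-Means LTS procedure may differ.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _nearest(point: Point, centroids: List[List[float]]) -> Tuple[int, float]:
    """Return (index, distance) of the centroid closest to `point`."""
    dists = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=dists.__getitem__)
    return best, dists[best]


def trimmed_kmeans(points: Sequence[Point], k: int, trim: float = 0.10,
                   n_iter: int = 20) -> Tuple[List[int], List[bool]]:
    """Lloyd-style k-means that ignores the farthest `trim` fraction on each update.

    Returns (labels, outlier_mask) from the final assignment pass.
    """
    centroids = [list(p) for p in points[:k]]  # deterministic init, for brevity only
    n_trim = int(round(trim * len(points)))
    labels: List[int] = []
    outliers: List[bool] = []
    for _ in range(n_iter):
        assigned = [_nearest(p, centroids) for p in points]
        labels = [a[0] for a in assigned]
        # Rank samples by distance to their own centroid, farthest first.
        order = sorted(range(len(points)), key=lambda i: assigned[i][1], reverse=True)
        trimmed = set(order[:n_trim])
        outliers = [i in trimmed for i in range(len(points))]
        # Recompute each centroid from its retained (non-trimmed) members only.
        for c in range(k):
            members = [points[i] for i in range(len(points))
                       if labels[i] == c and i not in trimmed]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, outliers
```

Calling `trimmed_kmeans(X, k=5, trim=0.10)` mirrors the MOTIF settings in the table above; in practice a vectorised implementation would be needed for data at SOREL scale.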

HDBSCAN

Hierarchical density-based clustering that naturally labels low-density points as noise without requiring a fixed outlier percentage.

Parameter	Value
`min_cluster_size`	10
`min_samples`	10
Selection method	Parameter sweep on MOTIF evaluating Silhouette Score and Davies–Bouldin Index

Results

Clustering Performance

Metric	MOTIF K-Means LTS	MOTIF HDBSCAN	SOREL K-Means LTS	SOREL HDBSCAN
Silhouette Score	0.1981	0.551	0.2498	0.6094
Davies–Bouldin Index	1.4320	0.910	1.6612	1.7235
Outlier Count	235	1,464	15,783	17,892
Total Samples	2,380	2,380	157,848	157,848
Outlier Percentage	9.9%	61.5%	10.0%	11.3%
Overlap Count	157	157	5,716	5,716
Overlap % (K-Means view)	66.8%	—	36.2%	—
Overlap % (HDBSCAN view)	—	10.7%	—	31.9%
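
The two overlap percentages are the same overlap count divided by each method's own outlier total. The sketch below reproduces the MOTIF figures from the table using placeholder integer IDs chosen only so the set sizes match (235, 1,464, and 157 shared); the function name is hypothetical.

```python
from typing import Dict, Set


def overlap_report(kmeans_outliers: Set[int], hdbscan_outliers: Set[int]) -> Dict[str, float]:
    """Overlap count plus its share of each method's outlier set."""
    shared = kmeans_outliers & hdbscan_outliers
    return {
        "overlap_count": len(shared),
        "pct_of_kmeans": 100 * len(shared) / len(kmeans_outliers),
        "pct_of_hdbscan": 100 * len(shared) / len(hdbscan_outliers),
    }


# Placeholder IDs sized to match the MOTIF columns above.
kmeans_ids = set(range(235))              # 235 K-Means LTS outliers
hdbscan_ids = set(range(78, 78 + 1464))   # 1,464 HDBSCAN noise points, 157 shared

report = overlap_report(kmeans_ids, hdbscan_ids)
print({key: round(value, 1) for key, value in report.items()})
```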

DLL Import Statistics

	MOTIF K-Means LTS	MOTIF HDBSCAN	SOREL K-Means LTS	SOREL HDBSCAN
Mean (Outliers)	228.99	103.75	1,339.20	560.54
Mean (Inliers)	108.06	145.97	647.40	736.52
Median (Outliers)	195	90	1,096	326
Median (Inliers)	86	92	438	514
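
Statistics of this kind can be computed by splitting per-sample import counts on the outlier flag. The sketch below uses only the standard library; the function name and the toy numbers are illustrative and are not the values from the table.

```python
from statistics import mean, median
from typing import Dict, Sequence


def import_count_summary(import_counts: Sequence[int],
                         outlier_mask: Sequence[bool]) -> Dict[str, Dict[str, float]]:
    """Mean and median import count for outliers and for inliers."""
    outliers = [c for c, flag in zip(import_counts, outlier_mask) if flag]
    inliers = [c for c, flag in zip(import_counts, outlier_mask) if not flag]
    return {
        "outliers": {"mean": mean(outliers), "median": median(outliers)},
        "inliers": {"mean": mean(inliers), "median": median(inliers)},
    }


# Toy example: each count is the number of unique imports in one sample.
summary = import_count_summary([10, 20, 30, 200, 400],
                               [False, False, False, True, True])
print(summary)
```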

Top Malware Families in Outliers

MOTIF

Rank	K-Means LTS	%	HDBSCAN	%
1	zerot	6.4%	icedid	7.7%
2	flawedammyy	4.7%	phorpiex	2.3%
3	xmrig	3.4%	gandcrab	2.0%
4	qbot	3.0%	maze	1.8%
5	nymaim	3.0%	bazarbackdoor	1.7%

SOREL

Rank	K-Means LTS	%	HDBSCAN	%
1	cerber	31.6%	cerber	28.5%
2	bunitu	7.0%	expiro	5.6%
3	expiro	4.2%	tofsee	3.8%
4	cryptxxx	3.9%	zbot	3.8%
5	icloader	2.8%	bunitu	3.1%
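
A ranking like the ones above can be produced by counting post hoc AV family labels among the flagged samples. The sketch assumes each percentage is a family's share of that method's outlier set, which the tables do not state explicitly; the function name and toy labels are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_families(families: Sequence[str], outlier_mask: Sequence[bool],
                 n: int = 5) -> List[Tuple[str, float]]:
    """Top-n family labels among outliers, with each family's share in percent."""
    flagged = [fam for fam, flag in zip(families, outlier_mask) if flag]
    counts = Counter(flagged)
    return [(fam, round(100 * cnt / len(flagged), 1))
            for fam, cnt in counts.most_common(n)]


# Toy labels; the mask marks which samples a method flagged as outliers.
labels_av = ["cerber", "cerber", "expiro", "zbot", "cerber"]
mask = [True, True, True, False, True]
print(top_families(labels_av, mask))
```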

Visualizations

All plots are in the Plots/ directory.

MOTIF

Plot	Description
`Plots/Motif/hdbscan_hyperparameter_heatmaps.png`	HDBSCAN parameter sweep — Silhouette Score, DBI, cluster count, and outlier % across `min_cluster_size` and `min_samples`
`Plots/Motif/hdbscan_pca_tsne_plots_20250625_104901.png`	PCA and t-SNE projections of HDBSCAN results showing inliers vs. noise
`Plots/Motif/kmeans_k_selection_analysis.png`	Elbow method, Silhouette Score, and DBI across k values — optimal k=5 selected
`Plots/Motif/kmeans_lts_pca_outlier_analysis.png`	PCA projection of K-Means LTS results with outliers highlighted
`Plots/Motif/kmeans_lts_tsne_outlier_analysis.png`	t-SNE projection of K-Means LTS results with outliers highlighted

SOREL

Plot	Description
`Plots/SOREL/kselection_kmeans_lts_SOREL.png`	Elbow method, Silhouette Score, and DBI across k values — optimal k=7 selected
`Plots/SOREL/PCA_KmeansLTS_SOREL.png`	PCA projection of K-Means LTS clustering on SOREL with outliers (red x)
`Plots/SOREL/tSNE_KmeansLTS_SOREL.png`	t-SNE projection of K-Means LTS clustering on SOREL with outliers (red x)

Repository Structure

Outlier_Analysis/
├── outlier.ipynb                            # Main analysis notebook
├── results.csv                              # Clustering results summary
├── outliers_data_kmeans_lts.csv             # K-Means LTS outlier records (MOTIF)
├── outliers_data_cof.xlsx                   # COF outlier analysis (MOTIF)
├── outliers_data_lof.xlsx                   # LOF outlier analysis (MOTIF)
├── requirements.txt                         # Python dependencies
├── Dockerfile                               # Docker image definition
├── MOTIF/
│   └── Data/
│       ├── func_maps.pkl                    # DLL function → integer mapping dictionary
│       ├── transformed_data_new.pkl         # Preprocessed binary feature matrix
│       ├── hdbscan_complete_analysis.pkl    # Full HDBSCAN clustering results
│       └── kmeans_lts_complete_analysis.pkl # Full K-Means LTS clustering results
└── Plots/
    ├── Motif/
    │   ├── hdbscan_hyperparameter_heatmaps.png
    │   ├── hdbscan_pca_tsne_plots_20250625_104901.png
    │   ├── kmeans_k_selection_analysis.png
    │   ├── kmeans_lts_pca_outlier_analysis.png
    │   └── kmeans_lts_tsne_outlier_analysis.png
    └── SOREL/
        ├── kselection_kmeans_lts_SOREL.png
        ├── PCA_KmeansLTS_SOREL.png
        └── tSNE_KmeansLTS_SOREL.png

Setup & Usage

1. Clone the repository

git clone https://github.com/UMBC-DREAM-Lab/Outlier_Analysis.git
cd Outlier_Analysis

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate        # Linux/macOS
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Launch the notebook

jupyter notebook outlier.ipynb

Docker

Build

docker build -t outlier-analysis .

Run

docker run -p 8888:8888 outlier-analysis

Then open http://localhost:8888 in your browser.

Evaluation Metrics

Metric	What it measures	Better when
Silhouette Score	Similarity of a sample to its own cluster vs. other clusters	Higher (max 1.0)
Davies–Bouldin Index (DBI)	Ratio of intra-cluster scatter to inter-cluster separation	Lower (min 0.0)

Both metrics are computed only on inlier samples, after outlier removal, so they assess the quality of the retained clusters.
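
To make the two definitions concrete, here is a small standard-library sketch that drops flagged outliers and then computes both metrics with Euclidean distance. It assumes at least two clusters remain among the inliers and is meant for tiny inputs only; scikit-learn's `silhouette_score` and `davies_bouldin_score` are the usual library equivalents. The helper names are hypothetical.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Collect points per cluster label."""
    clusters = defaultdict(list)
    for p, lab in zip(points, labels):
        clusters[lab].append(p)
    return clusters


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention."""
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # mean intra-cluster distance
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (scatter_i + scatter_j) / centroid distance."""
    clusters = _group(points, labels)
    cent = {m: [sum(col) / len(ps) for col in zip(*ps)] for m, ps in clusters.items()}
    scat = {m: sum(math.dist(p, cent[m]) for p in ps) / len(ps) for m, ps in clusters.items()}
    worst = [max((scat[i] + scat[j]) / math.dist(cent[i], cent[j])
                 for j in clusters if j != i) for i in clusters]
    return sum(worst) / len(worst)


def inlier_metrics(points, labels, outlier_mask):
    """Drop flagged outliers, then score the remaining clustering."""
    kept = [(p, lab) for p, lab, out in zip(points, labels, outlier_mask) if not out]
    pts, labs = zip(*kept)
    return silhouette(pts, labs), davies_bouldin(pts, labs)
```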

CTI / Threat Hunting Applications

The outlier detection pipeline functions as an automated pre-triage mechanism for Cyber Threat Intelligence (CTI) workflows:

Flag anomalous binaries for deeper sandbox or reverse-engineering analysis
Surface mislabeled samples in AV label datasets
Detect emerging variants that deviate structurally from known family clusters
Identify obfuscated/evasive payloads via sparse DLL profiles (HDBSCAN signal)
Identify modular or packed malware via dense DLL profiles (K-Means LTS signal)
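
One way such a pre-triage step could be wired up is to tag each sample by which method flagged it, reflecting the sparse-profile (HDBSCAN) versus dense-profile (K-Means LTS) distinction in the list above. The tag names and the function are hypothetical and only illustrate the routing idea.

```python
def triage_tag(kmeans_outlier: bool, hdbscan_noise: bool) -> str:
    """Return a coarse triage tag based on which method flagged the sample."""
    if kmeans_outlier and hdbscan_noise:
        return "flagged-by-both"        # anomalous under both views
    if kmeans_outlier:
        return "dense-import-profile"   # K-Means LTS signal
    if hdbscan_noise:
        return "sparse-import-profile"  # HDBSCAN signal
    return "inlier"


# Toy queue: (sample_id, flagged by K-Means LTS, flagged by HDBSCAN).
queue = [("s1", True, False), ("s2", False, True), ("s3", True, True), ("s4", False, False)]
tags = {sample_id: triage_tag(km, hd) for sample_id, km, hd in queue}
print(tags)
```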

Limitations

Analysis relies solely on static DLL import features — behavioral nuances from dynamic execution are not captured
The DLL import table may be intentionally corrupted or obfuscated in packed/evasive samples
LOF produced inconsistent scoring on MOTIF due to high dimensionality; COF produced infinite scores — neither was applied to SOREL
Only two clustering algorithms were evaluated; ensemble or deep anomaly detection approaches remain unexplored

Future Work

Extend beyond DLL imports with opcode n-grams, PE header entropy, and selected dynamic traces
Cross-validate outlier sets against consensus AV labels, threat intelligence feeds, and sandbox outputs
Build lightweight visualization and summarization tools to translate clustering outputs into analyst-ready intelligence reports
Explore LLM-assisted summarization to auto-describe behavioral themes within outlier clusters
Map outlier indicators (hashes, API imports) to STIX/TAXII formats for pipeline integration

References

R. J. Joyce, D. Amlani, C. Nicholas, E. Raff — "MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels" — arXiv:2111.15031, 2021
R. Harang, E. M. Rudd — "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection" — arXiv:2012.07634, 2020
T. Estella et al. — "Outlier Handling in Clustering: K-Means, Robust Trimmed K-Means, and K-Means LTS" — IJCDS, vol. 15, no. 1, 2024
B. V. Sanchez-Vinces et al. — "A comparative evaluation of clustering-based outlier detection" — DMKD, vol. 39, no. 2, 2025
P. Bhandary, R. J. Joyce, C. Nicholas — "Ransomware evolution: Unveiling patterns using HDBSCAN" — CAMLIS '24, CEUR-WS vol. 3920
K. Ghosh et al. — "Unsupervised Parameter-free Outlier Detection using HDBSCAN Outlier Profiles" — arXiv:2411.08867, 2024
Q. Li, S. Wang — "Detecting outliers by clustering algorithms" — arXiv:2412.05669, 2024
A. Nowak-Brzezinska, C. Horyn — "Outliers in rules — the comparison of LOF, COF and KMEANS algorithms" — Procedia CS, vol. 176, 2020

Citation

@inproceedings{swargam2025outliers,
  title     = {Analyzing the Impact of Outliers in Malware Clustering},
  author    = {Swargam, Bharath Kumar and Bhandary, Prajna and Nicholas, Charles},
  booktitle = {CyberHunt 2025},
  year      = {2025},
  institution = {University of Maryland, Baltimore County (UMBC)}
}
