Analyzing the Impact of Outliers in Malware Clustering

CyberHunt 2025 | UMBC CSEE Department

Authors: Bharath Kumar Swargam, Prajna Bhandary, Charles Nicholas
Institution: University of Maryland, Baltimore County (UMBC)
Contact: swargambharath987@gmail.com

Overview

Malware clustering groups samples by shared behavioral and structural characteristics to support automated family classification and threat tracking. A persistent challenge in this process is the presence of outliers — samples that do not clearly belong to any cluster.

This work investigates whether outliers produced by clustering algorithms are merely noise, or whether they carry meaningful intelligence value. We apply two complementary unsupervised methods — K-Means Least Trimmed Squares (LTS) and HDBSCAN — across two real-world malware datasets of contrasting scale and quality, using DLL import features as lightweight behavioral indicators.

Antivirus (AV) labels are used post hoc to characterize and interpret the flagged outliers, not during clustering.


Key Findings

  • Outliers are not noise. They frequently represent mislabeled samples, emerging threats, or structurally unique binaries
  • K-Means LTS surfaces feature-dense, high-functionality malware with large DLL import counts (mean: 229 on MOTIF, 1,339 on SOREL)
  • HDBSCAN captures sparse, obfuscated, or evasive binaries with fewer DLL imports (mean: 104 on MOTIF, 561 on SOREL)
  • The two methods are complementary — each surfaces a different type of anomaly; neither alone covers the full outlier landscape
  • On MOTIF, HDBSCAN flagged 61.5% of samples as noise, versus 9.9% under the fixed 10% K-Means LTS trim
  • On SOREL, the two outlier rates were close: 10.0% for K-Means LTS (set by the 10% trim) and 11.3% for HDBSCAN
  • 1,586 DLL functions were exclusive to outliers from both methods on MOTIF (2,881 on SOREL), suggesting niche or novel malware behavior
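
The last finding comes down to a set difference. The sketch below shows one reading of it, assuming "exclusive to outliers" means functions imported by at least one sample flagged by both methods and by no other sample. The function name, sample ids, and import sets are hypothetical and are not taken from the notebook.

```python
def outlier_exclusive_functions(imports_by_sample, outlier_ids):
    """Return functions imported by at least one outlier and by no inlier.

    imports_by_sample: dict mapping a sample id to its set of imported functions.
    outlier_ids: set of sample ids flagged as outliers.
    """
    outlier_funcs, inlier_funcs = set(), set()
    for sample_id, funcs in imports_by_sample.items():
        if sample_id in outlier_ids:
            outlier_funcs |= funcs
        else:
            inlier_funcs |= funcs
    return outlier_funcs - inlier_funcs


# Toy data (hypothetical): only s3 is flagged by both methods here.
imports_by_sample = {
    "s1": {"kernel32.CreateFileW", "kernel32.ReadFile"},
    "s2": {"kernel32.CreateFileW", "ws2_32.connect"},
    "s3": {"kernel32.CreateFileW", "advapi32.CryptEncrypt"},
}
kmeans_outliers = {"s3"}
hdbscan_outliers = {"s2", "s3"}

# Samples flagged by both methods, then the functions only they import.
shared_outliers = kmeans_outliers & hdbscan_outliers
exclusive = outlier_exclusive_functions(imports_by_sample, shared_outliers)
print(sorted(exclusive))
```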

Datasets

| Dataset | Description | Original Size | Used in Analysis |
|---|---|---|---|
| MOTIF | Well-labeled dataset spanning 454 malware families (ransomware, banking trojans, RATs, droppers) | 3,090 samples | 2,380 samples |
| SOREL-20M | Large-scale corpus; filtered to ransomware samples with VirusTotal labels and valid import info | 627,298 samples | 157,848 samples |

Feature representation: Each sample is encoded as a binary vector of DLL-imported functions (78,614 unique functions identified across both datasets). Samples with fewer than 10 unique DLL imports were excluded.
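
The encoding can be illustrated with a short sketch: build a function-to-column map, drop samples below the import threshold, and emit one 0/1 vector per remaining sample. `encode_samples`, `MIN_IMPORTS`, and the toy import names are hypothetical. The notebook's own preprocessing may differ, and a corpus of this size would normally use a sparse matrix rather than Python lists.

```python
MIN_IMPORTS = 10  # samples with fewer unique imports are excluded


def encode_samples(imports_by_sample, min_imports=MIN_IMPORTS):
    """Build a function -> column map and one binary vector per kept sample."""
    # Deduplicate imports and keep only samples that meet the threshold.
    kept = {
        sid: set(funcs)
        for sid, funcs in imports_by_sample.items()
        if len(set(funcs)) >= min_imports
    }
    # Sorted vocabulary gives a stable column order.
    vocabulary = sorted(set().union(*kept.values())) if kept else []
    func_map = {func: col for col, func in enumerate(vocabulary)}
    vectors = {}
    for sid, funcs in kept.items():
        row = [0] * len(func_map)
        for func in funcs:
            row[func_map[func]] = 1  # 1 means the function is imported
        vectors[sid] = row
    return func_map, vectors


# Toy input (hypothetical import names).
samples = {
    "a": [f"dll.func{i}" for i in range(10)],
    "b": [f"dll.func{i}" for i in range(5, 17)],
    "c": ["dll.func0", "dll.func1", "dll.func2"],  # fewer than 10: dropped
}
func_map, vectors = encode_samples(samples)
print(len(func_map), sorted(vectors))
```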


Methods

K-Means Least Trimmed Squares (LTS)

A robust variant of K-Means that trims the 10% of samples farthest from their assigned cluster centroids and treats them as outliers.

| Parameter | MOTIF | SOREL |
|---|---|---|
| Clusters (k) | 5 | 7 |
| k selection | Elbow method | Elbow method on 10% stratified sample via Truncated SVD (200 components) |
| Outlier trim | 10% | 10% |
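
The trimming idea can be sketched with a small NumPy loop: assign samples to centroids, flag the farthest fraction, and update centroids from the retained samples only. This is a generic trimmed k-means sketch, not the notebook's implementation. `trimmed_kmeans` and its parameters are hypothetical, and the dense distance computation is meant for toy data only.

```python
import numpy as np


def trimmed_kmeans(X, k, trim_fraction=0.10, n_iter=20, seed=0):
    """Lloyd-style k-means whose centroid updates ignore the farthest samples.

    Returns (labels, outlier_mask, centroids) from the final iteration.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_trim = int(round(trim_fraction * len(X)))
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2[np.arange(len(X)), labels]
        # Flag the n_trim samples farthest from their assigned centroid.
        outlier_mask = np.zeros(len(X), dtype=bool)
        if n_trim:
            outlier_mask[np.argsort(nearest)[-n_trim:]] = True
        # Recompute each centroid from its retained (non-trimmed) members.
        for j in range(k):
            members = X[(labels == j) & ~outlier_mask]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, outlier_mask, centroids


# Synthetic data: two blobs plus scattered points.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(45, 2)),
    rng.normal(5.0, 0.5, size=(45, 2)),
    rng.uniform(-10, 15, size=(10, 2)),
])
labels, outlier_mask, centroids = trimmed_kmeans(X, k=2)
print(int(outlier_mask.sum()), "of", len(X), "samples trimmed")
```

Because the trim fraction is fixed, the number of flagged samples is decided before clustering starts; only which samples get flagged depends on the data.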

HDBSCAN

Hierarchical density-based clustering that naturally labels low-density points as noise without requiring a fixed outlier percentage.

| Parameter | Value |
|---|---|
| min_cluster_size | 10 |
| min_samples | 10 |
| Selection method | Parameter sweep on MOTIF evaluating Silhouette Score and Davies–Bouldin Index |

Results

Clustering Performance

| Metric | MOTIF K-Means LTS | MOTIF HDBSCAN | SOREL K-Means LTS | SOREL HDBSCAN |
|---|---|---|---|---|
| Silhouette Score | 0.1981 | 0.551 | 0.2498 | 0.6094 |
| Davies–Bouldin Index | 1.4320 | 0.910 | 1.6612 | 1.7235 |
| Outlier Count | 235 | 1,464 | 15,783 | 17,892 |
| Total Samples | 2,380 | 2,380 | 157,848 | 157,848 |
| Outlier Percentage | 9.9% | 61.5% | 10.0% | 11.3% |
| Overlap Count | 157 | 157 | 5,716 | 5,716 |
| Overlap % (K-Means view) | 66.8% | - | 36.2% | - |
| Overlap % (HDBSCAN view) | - | 10.7% | - | 31.9% |
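
The two overlap percentages share a numerator and differ only in the denominator: the overlap count is divided by each method's own outlier count. The short check below reproduces them from the counts reported above; `overlap_views` is a hypothetical helper name.

```python
def overlap_views(kmeans_count, hdbscan_count, overlap_count):
    """Share of each method's outliers that the other method also flagged."""
    return overlap_count / kmeans_count, overlap_count / hdbscan_count


# Counts taken from the Clustering Performance results.
motif = overlap_views(kmeans_count=235, hdbscan_count=1464, overlap_count=157)
sorel = overlap_views(kmeans_count=15783, hdbscan_count=17892, overlap_count=5716)

for name, (kmeans_view, hdbscan_view) in (("MOTIF", motif), ("SOREL", sorel)):
    print(f"{name}: {kmeans_view:.1%} (K-Means view), {hdbscan_view:.1%} (HDBSCAN view)")
```

On MOTIF the two views diverge sharply because HDBSCAN flags far more samples than K-Means LTS, so the same 157 shared samples make up a much smaller share of its outlier set.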

DLL Import Statistics

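| Statistic (DLL imports per sample) | MOTIF K-Means LTS | MOTIF HDBSCAN | SOREL K-Means LTS | SOREL HDBSCAN |
|---|---|---|---|---|
| Mean (Outliers) | 228.99 | 103.75 | 1,339.20 | 560.54 |
| Mean (Inliers) | 108.06 | 145.97 | 647.40 | 736.52 |
| Median (Outliers) | 195 | 90 | 1,096 | 326 |
| Median (Inliers) | 86 | 92 | 438 | 514 |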
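
Statistics of this kind can be computed by splitting per-sample import counts on the outlier flag. The sketch below uses only the standard library; `import_count_summary` and the toy counts are hypothetical and do not come from either dataset.

```python
from statistics import mean, median


def import_count_summary(import_counts, is_outlier):
    """Mean and median import count for outliers and for inliers."""
    outliers = [c for c, flag in zip(import_counts, is_outlier) if flag]
    inliers = [c for c, flag in zip(import_counts, is_outlier) if not flag]
    return {
        "outliers": {"mean": mean(outliers), "median": median(outliers)},
        "inliers": {"mean": mean(inliers), "median": median(inliers)},
    }


# Toy data: number of unique DLL imports per sample, with an outlier flag.
counts = [12, 40, 85, 90, 300, 410]
flags = [False, False, False, False, True, True]
summary = import_count_summary(counts, flags)
print(summary)
```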

Top Malware Families in Outliers

MOTIF

| Rank | K-Means LTS | % | HDBSCAN | % |
|---|---|---|---|---|
| 1 | zerot | 6.4% | icedid | 7.7% |
| 2 | flawedammyy | 4.7% | phorpiex | 2.3% |
| 3 | xmrig | 3.4% | gandcrab | 2.0% |
| 4 | qbot | 3.0% | maze | 1.8% |
| 5 | nymaim | 3.0% | bazarbackdoor | 1.7% |

SOREL

| Rank | K-Means LTS | % | HDBSCAN | % |
|---|---|---|---|---|
| 1 | cerber | 31.6% | cerber | 28.5% |
| 2 | bunitu | 7.0% | expiro | 5.6% |
| 3 | expiro | 4.2% | tofsee | 3.8% |
| 4 | cryptxxx | 3.9% | zbot | 3.8% |
| 5 | icloader | 2.8% | bunitu | 3.1% |

Visualizations

All plots are in the Plots/ directory.

MOTIF

| Plot | Description |
|---|---|
| Plots/Motif/hdbscan_hyperparameter_heatmaps.png | HDBSCAN parameter sweep: Silhouette Score, DBI, cluster count, and outlier % across min_cluster_size and min_samples |
| Plots/Motif/hdbscan_pca_tsne_plots_20250625_104901.png | PCA and t-SNE projections of HDBSCAN results showing inliers vs. noise |
| Plots/Motif/kmeans_k_selection_analysis.png | Elbow method, Silhouette Score, and DBI across k values; optimal k=5 selected |
| Plots/Motif/kmeans_lts_pca_outlier_analysis.png | PCA projection of K-Means LTS results with outliers highlighted |
| Plots/Motif/kmeans_lts_tsne_outlier_analysis.png | t-SNE projection of K-Means LTS results with outliers highlighted |

SOREL

| Plot | Description |
|---|---|
| Plots/SOREL/kselection_kmeans_lts_SOREL.png | Elbow method, Silhouette Score, and DBI across k values; optimal k=7 selected |
| Plots/SOREL/PCA_KmeansLTS_SOREL.png | PCA projection of K-Means LTS clustering on SOREL with outliers (red x) |
| Plots/SOREL/tSNE_KmeansLTS_SOREL.png | t-SNE projection of K-Means LTS clustering on SOREL with outliers (red x) |

Repository Structure

Outlier_Analysis/
├── outlier.ipynb                            # Main analysis notebook
├── results.csv                              # Clustering results summary
├── outliers_data_kmeans_lts.csv             # K-Means LTS outlier records (MOTIF)
├── outliers_data_cof.xlsx                   # COF outlier analysis (MOTIF)
├── outliers_data_lof.xlsx                   # LOF outlier analysis (MOTIF)
├── requirements.txt                         # Python dependencies
├── Dockerfile                               # Docker image definition
├── MOTIF/
│   └── Data/
│       ├── func_maps.pkl                    # DLL function → integer mapping dictionary
│       ├── transformed_data_new.pkl         # Preprocessed binary feature matrix
│       ├── hdbscan_complete_analysis.pkl    # Full HDBSCAN clustering results
│       └── kmeans_lts_complete_analysis.pkl # Full K-Means LTS clustering results
└── Plots/
    ├── Motif/
    │   ├── hdbscan_hyperparameter_heatmaps.png
    │   ├── hdbscan_pca_tsne_plots_20250625_104901.png
    │   ├── kmeans_k_selection_analysis.png
    │   ├── kmeans_lts_pca_outlier_analysis.png
    │   └── kmeans_lts_tsne_outlier_analysis.png
    └── SOREL/
        ├── kselection_kmeans_lts_SOREL.png
        ├── PCA_KmeansLTS_SOREL.png
        └── tSNE_KmeansLTS_SOREL.png

Setup & Usage

1. Clone the repository

git clone https://github.com/UMBC-DREAM-Lab/Outlier_Analysis.git
cd Outlier_Analysis

2. Create a virtual environment (recommended)

python -m venv venv
source venv/bin/activate        # Linux/macOS
venv\Scripts\activate           # Windows

3. Install dependencies

pip install -r requirements.txt

4. Launch the notebook

jupyter notebook outlier.ipynb

Docker

Build

docker build -t outlier-analysis .

Run

docker run -p 8888:8888 outlier-analysis

Then open http://localhost:8888 in your browser.


Evaluation Metrics

| Metric | What it measures | Better when |
|---|---|---|
| Silhouette Score | Similarity of a sample to its own cluster vs. other clusters | Higher (max 1.0) |
| Davies–Bouldin Index (DBI) | Ratio of intra-cluster scatter to inter-cluster separation | Lower (min 0.0) |

Both metrics are computed on inlier samples only, after the flagged outliers have been removed, so they describe the quality of the remaining clusters.
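
A minimal sketch of inlier-only scoring, assuming scikit-learn's `silhouette_score` and `davies_bouldin_score` and a boolean outlier mask. `inlier_cluster_quality` and the toy data are hypothetical; for HDBSCAN output the mask would be `labels == -1`.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score


def inlier_cluster_quality(X, labels, outlier_mask):
    """Score cluster quality after dropping the samples flagged as outliers."""
    X, labels = np.asarray(X), np.asarray(labels)
    keep = ~np.asarray(outlier_mask, dtype=bool)
    return {
        "silhouette": silhouette_score(X[keep], labels[keep]),
        "davies_bouldin": davies_bouldin_score(X[keep], labels[keep]),
    }


# Toy data: two tight clusters and one distant point flagged as an outlier.
X = np.array([
    [0, 0], [0, 1], [1, 0],
    [10, 10], [10, 11], [11, 10],
    [5, 30],
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 0])
outlier_mask = np.array([False] * 6 + [True])

scores = inlier_cluster_quality(X, labels, outlier_mask)
print(scores)
```

Scoring this way rewards a method for leaving clean clusters behind, which means a method that discards many samples can post better numbers than one that trims only a few.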


CTI / Threat Hunting Applications

The outlier detection pipeline functions as an automated pre-triage mechanism for Cyber Threat Intelligence (CTI) workflows:

  • Flag anomalous binaries for deeper sandbox or reverse-engineering analysis
  • Surface mislabeled samples in AV label datasets
  • Detect emerging variants that deviate structurally from known family clusters
  • Identify obfuscated/evasive payloads via sparse DLL profiles (HDBSCAN signal)
  • Identify modular or packed malware via dense DLL profiles (K-Means LTS signal)

Limitations

  • Analysis relies solely on static DLL import features — behavioral nuances from dynamic execution are not captured
  • The DLL import table may be intentionally corrupted or obfuscated in packed/evasive samples
  • LOF produced inconsistent scoring on MOTIF due to high dimensionality; COF produced infinite scores — neither was applied to SOREL
  • Only two clustering algorithms were evaluated; ensemble or deep anomaly detection approaches remain unexplored

Future Work

  • Extend beyond DLL imports with opcode n-grams, PE header entropy, and selected dynamic traces
  • Cross-validate outlier sets against consensus AV labels, threat intelligence feeds, and sandbox outputs
  • Build lightweight visualization and summarization tools to translate clustering outputs into analyst-ready intelligence reports
  • Explore LLM-assisted summarization to auto-describe behavioral themes within outlier clusters
  • Map outlier indicators (hashes, API imports) to STIX/TAXII formats for pipeline integration

References

  1. R. J. Joyce, D. Amlani, C. Nicholas, E. Raff, "MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels", arXiv:2111.15031, 2021
  2. R. Harang, E. M. Rudd, "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection", arXiv:2012.07634, 2020
  3. T. Estella et al., "Outlier Handling in Clustering: K-Means, Robust Trimmed K-Means, and K-Means LTS", IJCDS, vol. 15, no. 1, 2024
  4. B. V. Sanchez-Vinces et al., "A comparative evaluation of clustering-based outlier detection", DMKD, vol. 39, no. 2, 2025
  5. P. Bhandary, R. J. Joyce, C. Nicholas, "Ransomware evolution: Unveiling patterns using HDBSCAN", CAMLIS '24, CEUR-WS vol. 3920
  6. K. Ghosh et al., "Unsupervised Parameter-free Outlier Detection using HDBSCAN Outlier Profiles", arXiv:2411.08867, 2024
  7. Q. Li, S. Wang, "Detecting outliers by clustering algorithms", arXiv:2412.05669, 2024
  8. A. Nowak-Brzezinska, C. Horyn, "Outliers in rules - the comparison of LOF, COF and KMEANS algorithms", Procedia CS, vol. 176, 2020

Citation

@inproceedings{swargam2025outliers,
  title     = {Analyzing the Impact of Outliers in Malware Clustering},
  author    = {Swargam, Bharath Kumar and Bhandary, Prajna and Nicholas, Charles},
  booktitle = {CyberHunt 2025},
  year      = {2025},
  institution = {University of Maryland, Baltimore County (UMBC)}
}
