
Commit ddb09cb

feat(scan): add LLM-based malicious-code scanner for installed packages
Adds `cull scan PATH...` alongside the existing `cull check`. Discovers installed npm/Python packages from `node_modules` / `site-packages`, filters to scannable source/manifest files, splits each file into character-based sliding-window chunks (full coverage, including minified bundles), and classifies every chunk through an OpenAI-compatible chat-completions API (Gemini 3.1 Flash-Lite by default; any local server such as Ollama, llama.cpp, vLLM, or LM Studio works via `--base-url` / `--model`).

- Two-phase flow: preflight estimate (packages, files, chunks, tokens, cost) before any LLM call; `--estimate-only` to stop there; `--budget-usd` to abort once actual spend exceeds the cap.
- Verdict cache at `~/.cache/cull/verdicts.json` keyed by `sha256(chunk):model:prompt_version`, so model swaps and prompt bumps invalidate automatically and identical chunks dedup across packages.
- Concurrent scanning via `ThreadPoolExecutor` (default 8 workers); cancels pending futures on budget abort and flushes cache from a `finally` so partial-run results are never lost.
- JSON / Markdown report output (`-o report.{json,md}`) plus a stdout `--json` mode for piping.
- `Dockerfile.sandbox` + `scripts/fetch-datadog-npm-samples.sh` for safely benchmarking against Datadog's malicious-package dataset.
- 24 unit tests covering chunker coverage on minified payloads, filter rules, verdict merging/dedup, cache round-trip, and pricing.
- README documents both the default (Gemini) and local-model setup, the dataset benchmark workflow, and what is/isn't stored in cache.

Pure stdlib, no new runtime dependencies.
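The character-based sliding-window chunking described above could look like the following minimal sketch. The function name, default chunk size, and overlap are illustrative assumptions, not the commit's actual values; the key property it demonstrates is full coverage regardless of line structure, which is what makes minified one-line bundles scannable.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 6000, overlap: int = 200) -> Iterator[str]:
    """Yield character-based sliding windows covering all of `text`.

    Never depends on newlines or token boundaries, so a minified
    single-line bundle is chunked just like normal source.
    """
    if len(text) <= size:
        yield text
        return
    step = size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + size]
        if start + size >= len(text):
            break
```

Because consecutive windows share `overlap` characters, a payload that straddles a chunk boundary still appears intact in at least one window.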
1 parent 242ab89 commit ddb09cb

26 files changed

Lines changed: 2005 additions & 106 deletions

.dockerignore

Lines changed: 13 additions & 0 deletions
```text
.git
.github
__pycache__/
*.py[cod]
*.egg-info/
.coverage
.venv/
venv/
build/
dist/
samples/
.env
**/.env
```

.env.example

Lines changed: 1 addition & 0 deletions
```text
GEMINI_API_KEY=your-google-ai-studio-key-here
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -6,3 +6,6 @@ __pycache__/
 venv/
 build/
 dist/
+samples/
+out/
+.env
```

Dockerfile.sandbox

Lines changed: 18 additions & 0 deletions
```dockerfile
FROM python:3.12-slim

RUN apt-get update \
    && apt-get install -y --no-install-recommends git openssh-client \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -s /bin/bash sandbox
USER sandbox
WORKDIR /home/sandbox
ENV PATH="/home/sandbox/.local/bin:${PATH}"

RUN mkdir -p ~/.ssh \
    && ssh-keyscan github.com >> ~/.ssh/known_hosts

COPY --chown=sandbox:sandbox . /home/sandbox/cull
RUN pip install --no-cache-dir --user /home/sandbox/cull

ENTRYPOINT ["bash"]
```

README.md

Lines changed: 89 additions & 28 deletions
````diff
@@ -1,6 +1,6 @@
 # cull
 
-Find compromised npm packages across your infrastructure. Only Python stdlib, no dependencies.
+Find compromised packages and suspicious package code. Only Python stdlib, no dependencies.
 
 ## Install
 
@@ -12,63 +12,124 @@ python3 -m pip install .
 
 We deliberately do not publish to PyPI to avoid creating another supply-chain distribution point for a security tool. Install from a reviewed git clone instead.
 
-## Usage
+## Commands
+
+### `cull check`
+
+Deterministically search for known compromised package names/versions across lock files, `node_modules`, GitHub code search, Docker images, GCR, and Artifact Registry.
 
 ```bash
-
+cull check [email protected] [email protected] plain-crypto-js
+cull check [email protected] --dirs ~/projects/app1 ~/projects/app2
+cull check [email protected] --github-org myorg
+cull check [email protected] --docker
 ```
 
-Checks lock files (`pnpm-lock.yaml`, `package-lock.json`, `yarn.lock`, `bun.lock`), `node_modules`, GitHub code search, and Docker image layers (legacy + OCI). Version-aware — distinguishes compromised versions from safe pins. Exit code `0` when clean, `1` when a compromised package is found, and `2` when the scan could not complete reliably.
+Bare usage remains an alias for one release:
+
+```bash
+
+```
 
-## Scan targets
+### `cull scan`
 
-### Local directories
+LLM-scan installed package source files for suspicious supply-chain behavior.
 
 ```bash
-cull [email protected] --dirs ~/projects/app1 ~/projects/app2
+export GEMINI_API_KEY=...
+cull scan ./node_modules
+cull scan ./.venv/lib/python3.12/site-packages
+cull scan ./node_modules ./.venv/lib/python3.12/site-packages -o report.json
 ```
 
-Default: current directory.
+`PATH` must point at a package install directory: `node_modules`, `site-packages`, or a directory that clearly looks like one.
+
+Every run prints a preflight estimate first:
+
+```text
+packages: 342
+files: 4,127 kept, 8,901 skipped
+chunks: 4,143
+tokens: ~2.1M in / ~0.16M out
+cost: $0.25
+```
+
+Then it scans unless you pass:
+
+```bash
+cull scan ./node_modules --estimate-only
+```
 
-### GitHub
+Useful flags:
 
 ```bash
-export GITHUB_TOKEN=ghp_...
-cull [email protected] --github-org myorg
+--budget-usd 1.00   # abort if estimate or actual cost exceeds budget
+--concurrency 4     # default 8
+--no-cache          # disable ~/.cache/cull/verdicts.json
+--include-tests     # include test/spec dirs
+-o report.json      # full JSON report
+-o report.md        # full Markdown report
+--json              # write JSON result to stdout
 ```
 
-Searches lock files via [code search API](https://docs.github.com/en/rest/search/search#search-code). Token can also be passed via `--github-token`.
+Default model is `gemini-3.1-flash-lite-preview` against `https://generativelanguage.googleapis.com/v1beta/openai`, reading `GEMINI_API_KEY`. Swap in any OpenAI-compatible provider with `--model`, `--base-url`, and `--api-key-env`.
+
+See [examples/ngx-perfect-scrollbar.md](examples/ngx-perfect-scrollbar.md) for a real report against a known-malicious Shai-Hulud sample.
 
-**Creating a PAT** at [github.com/settings/tokens](https://github.com/settings/tokens): classic → `repo` scope, fine-grained → set resource owner to your org, grant `Contents: Read-only`.
+### Local model
 
-### Docker images
+Any local server that speaks the OpenAI `/v1/chat/completions` protocol (Ollama, llama.cpp `llama-server`, vLLM, LM Studio, …) works. Point `cull` at it with `--base-url` and `--model`. Most local servers ignore the API key — set any non-empty value so `cull` proceeds.
 
 ```bash
-cull [email protected] --docker              # all local images
-cull [email protected] --images app:latest   # specific images
+export LOCAL_API_KEY=local
+cull scan ./node_modules \
+  --base-url http://localhost:11434/v1 \
+  --model qwen2.5-coder:7b \
+  --api-key-env LOCAL_API_KEY
 ```
 
-Requires `docker` CLI. Remote images are auto-pulled; use `--no-pull` to skip.
+Local providers usually report no token usage, so the cost column will read `$0.0000`. The verdict cache is keyed by model id, so switching providers re-scans every chunk.
 
-### Google Cloud
+## Sandbox
+
+Build a minimal Docker sandbox with `cull` installed:
 
 ```bash
-cull [email protected] --gar-repo us-central1-docker.pkg.dev/proj/repo   # Artifact Registry
-cull [email protected] --gcr-project my-project                         # Container Registry (legacy)
+docker build -f Dockerfile.sandbox -t cull-sandbox .
+docker run --rm cull-sandbox -lc 'cull --help'
 ```
 
-Requires `gcloud` CLI with `gcloud auth login` and `gcloud auth configure-docker REGION-docker.pkg.dev`.
+## Benchmark against Datadog's malicious package dataset
 
-## Requirements
+Datadog publishes ~3,000 real-world malicious npm packages as password-protected zips ([dataset](https://github.com/DataDog/malicious-software-packages-dataset)). The password is `infected`.
 
-Python 3.9+. Optional CLIs: `docker`, `gcloud`. If you request a scan target whose CLI is missing or whose backend calls fail, `cull` reports an error and exits non-zero instead of silently treating that target as clean.
+Fetch the npm samples (sparse-checkout, no full repo history):
 
-## Security
+```bash
+scripts/fetch-datadog-npm-samples.sh
+```
 
-We intentionally do not publish this to PyPI. The goal is to avoid creating another supply-chain distribution point for a security tool. Install from a reviewed git clone instead.
+Extract and scan **inside the sandbox** — the archives are real malware, never `npm install` them on your host.
 
-Stdlib only — nothing else to supply-chain. External CLIs invoked only when their flags are used. Images are exported via `docker save` / `docker pull` — never `docker run`.
+```bash
+docker run --rm \
+  -v "$PWD/samples:/samples:ro" \
+  -v "$HOME/.cache/cull:/home/sandbox/.cache/cull" \
+  -e GEMINI_API_KEY \
+  cull-sandbox -lc '
+    set -e
+    src=/samples/datadog-malicious-software-packages-dataset/samples/npm
+    work=$(mktemp -d)
+    python3 -c "
+import sys, zipfile, pathlib
+with zipfile.ZipFile(sys.argv[1]) as z:
+    z.extractall(pathlib.Path(sys.argv[2]), pwd=b\"infected\")
+" "$src/2024-01/some-package.zip" "$work/node_modules"
+    cull scan "$work/node_modules" --budget-usd 0.50 -o /tmp/report.md
+    cat /tmp/report.md
+  '
+```
 
-## Contributing
+## Security
 
-PRs welcome — GitLab, Bitbucket, AWS ECR, and Azure ACR are natural next targets.
+Stdlib only. `cull scan` reads installed package files and sends selected source chunks to the configured LLM provider. It does not execute package code.
````
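The verdict-cache keying that the README and commit message describe (`sha256(chunk):model:prompt_version`) could be sketched as follows. The function and constant names are illustrative, not the commit's actual identifiers; the sketch shows why identical chunks dedup across packages while a model swap or prompt bump invalidates every entry.

```python
import hashlib

# Bumped whenever the classification prompt changes, invalidating old verdicts.
PROMPT_VERSION = "v1"


def cache_key(chunk: str, model: str, prompt_version: str = PROMPT_VERSION) -> str:
    """Key depends only on chunk content, model id, and prompt version."""
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    return f"{digest}:{model}:{prompt_version}"
```

Because the key hashes only the chunk text, the same vendored file appearing in ten packages costs one LLM call; because the model id and prompt version are appended in the clear, switching either one simply misses the cache rather than serving stale verdicts.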

cull/check/__init__.py

Lines changed: 1 addition & 0 deletions
```python
"""Deterministic package compromise checker."""
```

cull/check/cli.py

Lines changed: 92 additions & 0 deletions
```python
from __future__ import annotations

import argparse
import os
import sys

from ..models import Finding
from ..output import bold, green, print_error, print_header, red, tprint, yellow
from ..parsers import parse_pkg_arg
from ..scanners import collect_images, scan_docker, scan_github, scan_local


def add_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "packages",
        nargs="+",
        metavar="PKG",
        help="packages to search for (e.g. [email protected] plain-crypto-js)",
    )

    local = parser.add_argument_group("local")
    local.add_argument("--dirs", nargs="+", metavar="DIR", help="directories to scan (default: current directory)")

    github = parser.add_argument_group("github")
    github.add_argument(
        "--github-token",
        metavar="TOKEN",
        default=os.environ.get("GITHUB_TOKEN"),
        help="GitHub PAT (default: $GITHUB_TOKEN)",
    )
    github.add_argument("--github-org", metavar="ORG")

    docker = parser.add_argument_group("docker")
    docker.add_argument("--docker", action="store_true", help="scan all local Docker images")
    docker.add_argument("--images", nargs="+", metavar="IMG", help="specific images to scan")
    docker.add_argument("--no-pull", action="store_true", help="don't auto-pull remote images before scanning")

    cloud = parser.add_argument_group("cloud registries")
    cloud.add_argument("--gcr-project", metavar="PROJECT", help="Google Container Registry project")
    cloud.add_argument("--gar-repo", metavar="REPO", help="Artifact Registry repo (e.g. us-central1-docker.pkg.dev/proj/repo)")


def run(args: argparse.Namespace) -> None:
    targets = [parse_pkg_arg(raw) for raw in args.packages]
    auto_pull = not args.no_pull

    labels = ", ".join(target.label for target in targets)
    tprint(bold(f"━━━ cull check: searching for {labels} ━━━"))

    all_findings: list[Finding] = []
    has_other_source = args.github_org or args.docker or args.images or args.gcr_project or args.gar_repo
    scan_dirs = args.dirs or (None if has_other_source else ["."])

    for target in targets:
        print_header(f"▸ {target.label}")

        if scan_dirs:
            print_header("  LOCAL DIRECTORIES")
            all_findings.extend(scan_local(scan_dirs, target.name, target.version))

        if args.github_org and not args.github_token:
            detail = "GitHub token required when --github-org is set"
            print_error(f"org:{args.github_org}", detail)
            all_findings.append(Finding("github", f"org:{args.github_org}", "error", detail))
        elif args.github_token and args.github_org:
            print_header("  GITHUB")
            all_findings.extend(scan_github(args.github_token, args.github_org, target.name, target.version))

    all_images, image_findings = collect_images(args)
    all_findings.extend(image_findings)

    if all_images:
        print_header("  IMAGES")
        all_findings.extend(scan_docker(all_images, targets, auto_pull=auto_pull))

    infected = [finding for finding in all_findings if finding.status == "found"]
    pinned = [finding for finding in all_findings if finding.status == "pinned"]
    errors = [finding for finding in all_findings if finding.status == "error"]

    tprint()
    parts: list[str] = []
    if infected:
        parts.append(red(f"{len(infected)} infected"))
    if pinned:
        parts.append(green(f"{len(pinned)} pinned (safe)"))
    if errors:
        parts.append(yellow(f"{len(errors)} errors"))
    if not infected and not pinned and not errors:
        parts.append(green("clean"))

    tprint(bold(f"━━━ Result: {', '.join(parts)} ━━━"))
    sys.exit(1 if infected else 2 if errors else 0)
```
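The `add_arguments(parser)` / `run(args)` pair in `cull/check/cli.py` suggests a plugin-style CLI where each subcommand module registers itself on a shared subparser. The following is a self-contained sketch of that wiring under stated assumptions: the dispatcher function name, the stubbed arguments, and the example package spec are all hypothetical, not taken from the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical top-level dispatcher: each subcommand module exposes
    add_arguments(parser) and run(args), as cull/check/cli.py does."""
    parser = argparse.ArgumentParser(prog="cull")
    sub = parser.add_subparsers(dest="command", required=True)

    check = sub.add_parser("check", help="deterministic compromise check")
    # In the real tree this would presumably be:
    #   from cull.check import cli as check_cli
    #   check_cli.add_arguments(check)
    # Stubbed here with two of its arguments:
    check.add_argument("packages", nargs="+", metavar="PKG")
    check.add_argument("--dirs", nargs="+", metavar="DIR")

    scan = sub.add_parser("scan", help="LLM scan of installed package sources")
    scan.add_argument("paths", nargs="+", metavar="PATH")
    scan.add_argument("--estimate-only", action="store_true")
    return parser


args = build_parser().parse_args(["check", "some-pkg@1.0.0", "--dirs", "."])
print(args.command, args.packages)  # check ['some-pkg@1.0.0']
```

Keeping registration inside each subcommand module means adding a command like `scan` never touches `check`'s argument definitions, which matches how this commit adds `cull scan` without modifying the existing check flags.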
