
Commit ddb09cb

feat(scan): add LLM-based malicious-code scanner for installed packages
Adds `cull scan PATH...` alongside the existing `cull check`. Discovers installed npm/Python packages from `node_modules` / `site-packages`, filters to scannable source/manifest files, splits each file into character-based sliding-window chunks (full coverage, including minified bundles), and classifies every chunk through an OpenAI-compatible chat-completions API (Gemini 3.1 Flash-Lite by default; any local server such as Ollama, llama.cpp, vLLM, or LM Studio works via `--base-url` / `--model`).

- Two-phase flow: preflight estimate (packages, files, chunks, tokens, cost) before any LLM call; `--estimate-only` to stop there; `--budget-usd` to abort once actual spend exceeds the cap.
- Verdict cache at `~/.cache/cull/verdicts.json` keyed by `sha256(chunk):model:prompt_version`, so model swaps and prompt bumps invalidate automatically and identical chunks dedup across packages.
- Concurrent scanning via `ThreadPoolExecutor` (default 8 workers); cancels pending futures on budget abort and flushes cache from a `finally` so partial-run results are never lost.
- JSON / Markdown report output (`-o report.{json,md}`) plus a stdout `--json` mode for piping.
- `Dockerfile.sandbox` + `scripts/fetch-datadog-npm-samples.sh` for safely benchmarking against Datadog's malicious-package dataset.
- 24 unit tests covering chunker coverage on minified payloads, filter rules, verdict merging/dedup, cache round-trip, and pricing.
- README documents both the default (Gemini) and local-model setup, the dataset benchmark workflow, and what is/isn't stored in cache.

Pure stdlib, no new runtime dependencies.
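The character-based sliding-window chunking described above could look like the following minimal sketch. The function name, default chunk size, and overlap are illustrative assumptions, not the commit's actual values; the key property it demonstrates is full coverage regardless of line structure, which is what makes minified one-line bundles scannable.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 6000, overlap: int = 200) -> Iterator[str]:
    """Yield character-based sliding windows covering all of `text`.

    Never depends on newlines or token boundaries, so a minified
    single-line bundle is chunked just like normal source.
    """
    if len(text) <= size:
        yield text
        return
    step = size - overlap
    for start in range(0, len(text), step):
        yield text[start:start + size]
        if start + size >= len(text):
            break
```

Because consecutive windows share `overlap` characters, a payload that straddles a chunk boundary still appears intact in at least one window.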
1 parent 242ab89 commit ddb09cb

26 files changed

Lines changed: 2005 additions & 106 deletions

.dockerignore

Lines changed: 13 additions & 0 deletions
```text
.git
.github
__pycache__/
*.py[cod]
*.egg-info/
.coverage
.venv/
venv/
build/
dist/
samples/
.env
**/.env
```

.env.example

Lines changed: 1 addition & 0 deletions
```text
GEMINI_API_KEY=your-google-ai-studio-key-here
```

.gitignore

Lines changed: 3 additions & 0 deletions
```diff
@@ -6,3 +6,6 @@ __pycache__/
 venv/
 build/
 dist/
+samples/
+out/
+.env
```

Dockerfile.sandbox

Lines changed: 18 additions & 0 deletions
```dockerfile
FROM python:3.12-slim

RUN apt-get update \
    && apt-get install -y --no-install-recommends git openssh-client \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -s /bin/bash sandbox
USER sandbox
WORKDIR /home/sandbox
ENV PATH="/home/sandbox/.local/bin:${PATH}"

RUN mkdir -p ~/.ssh \
    && ssh-keyscan github.com >> ~/.ssh/known_hosts

COPY --chown=sandbox:sandbox . /home/sandbox/cull
RUN pip install --no-cache-dir --user /home/sandbox/cull

ENTRYPOINT ["bash"]
```

README.md

Lines changed: 89 additions & 28 deletions
````diff
@@ -1,6 +1,6 @@
 # cull
 
-Find compromised npm packages across your infrastructure. Only Python stdlib, no dependencies.
+Find compromised packages and suspicious package code. Only Python stdlib, no dependencies.
 
 ## Install
 
@@ -12,63 +12,124 @@ python3 -m pip install .
 
 We deliberately do not publish to PyPI to avoid creating another supply-chain distribution point for a security tool. Install from a reviewed git clone instead.
 
-## Usage
+## Commands
+
+### `cull check`
+
+Deterministically search for known compromised package names/versions across lock files, `node_modules`, GitHub code search, Docker images, GCR, and Artifact Registry.
 
 ```bash
-
+cull check [email protected] [email protected] plain-crypto-js
+cull check [email protected] --dirs ~/projects/app1 ~/projects/app2
+cull check [email protected] --github-org myorg
+cull check [email protected] --docker
 ```
 
-Checks lock files (`pnpm-lock.yaml`, `package-lock.json`, `yarn.lock`, `bun.lock`), `node_modules`, GitHub code search, and Docker image layers (legacy + OCI). Version-aware — distinguishes compromised versions from safe pins. Exit code `0` when clean, `1` when a compromised package is found, and `2` when the scan could not complete reliably.
+Bare usage remains an alias for one release:
+
+```bash
+
+```
 
-## Scan targets
+### `cull scan`
 
-### Local directories
+LLM-scan installed package source files for suspicious supply-chain behavior.
 
 ```bash
-cull [email protected] --dirs ~/projects/app1 ~/projects/app2
+export GEMINI_API_KEY=...
+cull scan ./node_modules
+cull scan ./.venv/lib/python3.12/site-packages
+cull scan ./node_modules ./.venv/lib/python3.12/site-packages -o report.json
 ```
 
-Default: current directory.
+`PATH` must point at a package install directory: `node_modules`, `site-packages`, or a directory that clearly looks like one.
+
+Every run prints a preflight estimate first:
+
+```text
+packages: 342
+files: 4,127 kept, 8,901 skipped
+chunks: 4,143
+tokens: ~2.1M in / ~0.16M out
+cost: $0.25
+```
+
+Then it scans unless you pass:
+
+```bash
+cull scan ./node_modules --estimate-only
+```
 
-### GitHub
+Useful flags:
 
 ```bash
-export GITHUB_TOKEN=ghp_...
-cull [email protected] --github-org myorg
+--budget-usd 1.00   # abort if estimate or actual cost exceeds budget
+--concurrency 4     # default 8
+--no-cache          # disable ~/.cache/cull/verdicts.json
+--include-tests     # include test/spec dirs
+-o report.json      # full JSON report
+-o report.md        # full Markdown report
+--json              # write JSON result to stdout
 ```
 
-Searches lock files via [code search API](https://docs.github.com/en/rest/search/search#search-code). Token can also be passed via `--github-token`.
+Default model is `gemini-3.1-flash-lite-preview` against `https://generativelanguage.googleapis.com/v1beta/openai`, reading `GEMINI_API_KEY`. Swap in any OpenAI-compatible provider with `--model`, `--base-url`, and `--api-key-env`.
+
+See [examples/ngx-perfect-scrollbar.md](examples/ngx-perfect-scrollbar.md) for a real report against a known-malicious Shai-Hulud sample.
 
-**Creating a PAT** at [github.com/settings/tokens](https://github.com/settings/tokens): classic → `repo` scope, fine-grained → set resource owner to your org, grant `Contents: Read-only`.
+### Local model
 
-### Docker images
+Any local server that speaks the OpenAI `/v1/chat/completions` protocol (Ollama, llama.cpp `llama-server`, vLLM, LM Studio, …) works. Point `cull` at it with `--base-url` and `--model`. Most local servers ignore the API key — set any non-empty value so `cull` proceeds.
 
 ```bash
-cull [email protected] --docker              # all local images
-cull [email protected] --images app:latest   # specific images
+export LOCAL_API_KEY=local
+cull scan ./node_modules \
+  --base-url http://localhost:11434/v1 \
+  --model qwen2.5-coder:7b \
+  --api-key-env LOCAL_API_KEY
 ```
 
-Requires `docker` CLI. Remote images are auto-pulled; use `--no-pull` to skip.
+Local providers usually report no token usage, so the cost column will read `$0.0000`. The verdict cache is keyed by model id, so switching providers re-scans every chunk.
 
-### Google Cloud
+## Sandbox
+
+Build a minimal Docker sandbox with `cull` installed:
 
 ```bash
-cull [email protected] --gar-repo us-central1-docker.pkg.dev/proj/repo   # Artifact Registry
-cull [email protected] --gcr-project my-project                         # Container Registry (legacy)
+docker build -f Dockerfile.sandbox -t cull-sandbox .
+docker run --rm cull-sandbox -lc 'cull --help'
 ```
 
-Requires `gcloud` CLI with `gcloud auth login` and `gcloud auth configure-docker REGION-docker.pkg.dev`.
+## Benchmark against Datadog's malicious package dataset
 
-## Requirements
+Datadog publishes ~3,000 real-world malicious npm packages as password-protected zips ([dataset](https://github.com/DataDog/malicious-software-packages-dataset)). The password is `infected`.
 
-Python 3.9+. Optional CLIs: `docker`, `gcloud`. If you request a scan target whose CLI is missing or whose backend calls fail, `cull` reports an error and exits non-zero instead of silently treating that target as clean.
+Fetch the npm samples (sparse-checkout, no full repo history):
 
-## Security
+```bash
+scripts/fetch-datadog-npm-samples.sh
+```
 
-We intentionally do not publish this to PyPI. The goal is to avoid creating another supply-chain distribution point for a security tool. Install from a reviewed git clone instead.
+Extract and scan **inside the sandbox** — the archives are real malware, never `npm install` them on your host.
 
-Stdlib only — nothing else to supply-chain. External CLIs invoked only when their flags are used. Images are exported via `docker save` / `docker pull` — never `docker run`.
+```bash
+docker run --rm \
+  -v "$PWD/samples:/samples:ro" \
+  -v "$HOME/.cache/cull:/home/sandbox/.cache/cull" \
+  -e GEMINI_API_KEY \
+  cull-sandbox -lc '
+    set -e
+    src=/samples/datadog-malicious-software-packages-dataset/samples/npm
+    work=$(mktemp -d)
+    python3 -c "
+import sys, zipfile, pathlib
+with zipfile.ZipFile(sys.argv[1]) as z:
+    z.extractall(pathlib.Path(sys.argv[2]), pwd=b\"infected\")
+" "$src/2024-01/some-package.zip" "$work/node_modules"
+    cull scan "$work/node_modules" --budget-usd 0.50 -o /tmp/report.md
+    cat /tmp/report.md
+  '
+```
 
-## Contributing
+## Security
 
-PRs welcome — GitLab, Bitbucket, AWS ECR, and Azure ACR are natural next targets.
+Stdlib only. `cull scan` reads installed package files and sends selected source chunks to the configured LLM provider. It does not execute package code.
````
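The verdict-cache keying that the README and commit message describe (`sha256(chunk):model:prompt_version`) could be sketched as follows. The function and constant names are illustrative, not the commit's actual identifiers; the sketch shows why identical chunks dedup across packages while a model swap or prompt bump invalidates every entry.

```python
import hashlib

# Bumped whenever the classification prompt changes, invalidating old verdicts.
PROMPT_VERSION = "v1"


def cache_key(chunk: str, model: str, prompt_version: str = PROMPT_VERSION) -> str:
    """Key depends only on chunk content, model id, and prompt version."""
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    return f"{digest}:{model}:{prompt_version}"
```

Because the key hashes only the chunk text, the same vendored file appearing in ten packages costs one LLM call; because the model id and prompt version are appended in the clear, switching either one simply misses the cache rather than serving stale verdicts.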

cull/check/__init__.py

Lines changed: 1 addition & 0 deletions
```python
"""Deterministic package compromise checker."""
```

cull/check/cli.py

Lines changed: 92 additions & 0 deletions
```python
from __future__ import annotations

import argparse
import os
import sys

from ..models import Finding
from ..output import bold, green, print_error, print_header, red, tprint, yellow
from ..parsers import parse_pkg_arg
from ..scanners import collect_images, scan_docker, scan_github, scan_local


def add_arguments(parser: argparse.ArgumentParser) -> None:
    parser.add_argument(
        "packages",
        nargs="+",
        metavar="PKG",
        help="packages to search for (e.g. [email protected] plain-crypto-js)",
    )

    local = parser.add_argument_group("local")
    local.add_argument("--dirs", nargs="+", metavar="DIR", help="directories to scan (default: current directory)")

    github = parser.add_argument_group("github")
    github.add_argument(
        "--github-token",
        metavar="TOKEN",
        default=os.environ.get("GITHUB_TOKEN"),
        help="GitHub PAT (default: $GITHUB_TOKEN)",
    )
    github.add_argument("--github-org", metavar="ORG")

    docker = parser.add_argument_group("docker")
    docker.add_argument("--docker", action="store_true", help="scan all local Docker images")
    docker.add_argument("--images", nargs="+", metavar="IMG", help="specific images to scan")
    docker.add_argument("--no-pull", action="store_true", help="don't auto-pull remote images before scanning")

    cloud = parser.add_argument_group("cloud registries")
    cloud.add_argument("--gcr-project", metavar="PROJECT", help="Google Container Registry project")
    cloud.add_argument("--gar-repo", metavar="REPO", help="Artifact Registry repo (e.g. us-central1-docker.pkg.dev/proj/repo)")


def run(args: argparse.Namespace) -> None:
    targets = [parse_pkg_arg(raw) for raw in args.packages]
    auto_pull = not args.no_pull

    labels = ", ".join(target.label for target in targets)
    tprint(bold(f"━━━ cull check: searching for {labels} ━━━"))

    all_findings: list[Finding] = []
    has_other_source = args.github_org or args.docker or args.images or args.gcr_project or args.gar_repo
    scan_dirs = args.dirs or (None if has_other_source else ["."])

    for target in targets:
        print_header(f"▸ {target.label}")

        if scan_dirs:
            print_header("  LOCAL DIRECTORIES")
            all_findings.extend(scan_local(scan_dirs, target.name, target.version))

        if args.github_org and not args.github_token:
            detail = "GitHub token required when --github-org is set"
            print_error(f"org:{args.github_org}", detail)
            all_findings.append(Finding("github", f"org:{args.github_org}", "error", detail))
        elif args.github_token and args.github_org:
            print_header("  GITHUB")
            all_findings.extend(scan_github(args.github_token, args.github_org, target.name, target.version))

    all_images, image_findings = collect_images(args)
    all_findings.extend(image_findings)

    if all_images:
        print_header("  IMAGES")
        all_findings.extend(scan_docker(all_images, targets, auto_pull=auto_pull))

    infected = [finding for finding in all_findings if finding.status == "found"]
    pinned = [finding for finding in all_findings if finding.status == "pinned"]
    errors = [finding for finding in all_findings if finding.status == "error"]

    tprint()
    parts: list[str] = []
    if infected:
        parts.append(red(f"{len(infected)} infected"))
    if pinned:
        parts.append(green(f"{len(pinned)} pinned (safe)"))
    if errors:
        parts.append(yellow(f"{len(errors)} errors"))
    if not infected and not pinned and not errors:
        parts.append(green("clean"))

    tprint(bold(f"━━━ Result: {', '.join(parts)} ━━━"))
    sys.exit(1 if infected else 2 if errors else 0)
```
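The `add_arguments(parser)` / `run(args)` pair in `cull/check/cli.py` suggests a plugin-style CLI where each subcommand module registers itself on a shared subparser. The following is a self-contained sketch of that wiring under stated assumptions: the dispatcher function name, the stubbed arguments, and the example package spec are all hypothetical, not taken from the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical top-level dispatcher: each subcommand module exposes
    add_arguments(parser) and run(args), as cull/check/cli.py does."""
    parser = argparse.ArgumentParser(prog="cull")
    sub = parser.add_subparsers(dest="command", required=True)

    check = sub.add_parser("check", help="deterministic compromise check")
    # In the real tree this would presumably be:
    #   from cull.check import cli as check_cli
    #   check_cli.add_arguments(check)
    # Stubbed here with two of its arguments:
    check.add_argument("packages", nargs="+", metavar="PKG")
    check.add_argument("--dirs", nargs="+", metavar="DIR")

    scan = sub.add_parser("scan", help="LLM scan of installed package sources")
    scan.add_argument("paths", nargs="+", metavar="PATH")
    scan.add_argument("--estimate-only", action="store_true")
    return parser


args = build_parser().parse_args(["check", "some-pkg@1.0.0", "--dirs", "."])
print(args.command, args.packages)  # check ['some-pkg@1.0.0']
```

Keeping registration inside each subcommand module means adding a command like `scan` never touches `check`'s argument definitions, which matches how this commit adds `cull scan` without modifying the existing check flags.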
