Binoc generates changelogs for datasets that don't have them. Given a series of snapshots of a dataset downloaded at different times, Binoc detects what changed, expresses those changes as a minimal structured diff, and produces human-readable summaries that distinguish substantive policy changes from clerical housekeeping.
The core workflow: an archivist, data scientist, or steward has five copies of a government dataset containing CSVs, downloaded over two years. Some are identical. Some have reordered columns. One has a new category relevant to their research. Binoc tells them exactly what changed, when, and whether (by their definition) it matters.
To submit feedback on Binoc, please fill out our form or send us an email at [email protected].
A dataset ships as a zip of CSVs alongside a SQLite database. Between quarterly releases, the CSV columns were reordered and the database grew:
binoc diff release-q3/ release-q4/
# Changelog: release-q3/ → release-q4/
## Clerical Changes
- **data.zip/agencies.csv**: Columns reordered (content unchanged)
## Substantive Changes
- **summary.sqlite**: Content changed (12.0 KB → 12.0 KB)
Binoc looked inside the zip and compared the CSV column by column: the reorder is flagged as clerical housekeeping, not a real data change. But .sqlite files are opaque to Binoc's standard library, so you only learn that the bytes differ.
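To make the clerical/substantive distinction concrete, here is a minimal Python sketch of how a column reorder can be told apart from a real content change. It is an illustration only: `classify_csv_change` is a hypothetical helper, not Binoc's API, and it assumes unique column names and the same row order in both files.

```python
import csv
import io


def classify_csv_change(old_text: str, new_text: str) -> str:
    """Classify the difference between two CSV documents.

    Returns "identical", "columns_reordered", or "content_changed".
    Hypothetical helper for illustration; assumes unique column names.
    """
    old_header = next(csv.reader(io.StringIO(old_text)), [])
    new_header = next(csv.reader(io.StringIO(new_text)), [])
    # DictReader keys each value by column name, so comparing the resulting
    # dicts ignores the physical order of the columns.
    old_rows = list(csv.DictReader(io.StringIO(old_text)))
    new_rows = list(csv.DictReader(io.StringIO(new_text)))

    if old_header == new_header and old_rows == new_rows:
        return "identical"
    # Same set of column names and the same name -> value mapping in every
    # row: only the presentation changed.
    if sorted(old_header) == sorted(new_header) and old_rows == new_rows:
        return "columns_reordered"
    return "content_changed"
```

A reordered file such as `name,id` versus `id,name` with matching values would be classified as `columns_reordered`, while any changed cell falls through to `content_changed`.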
pip install binoc-sqlite
binoc diff release-q3/ release-q4/
# Changelog: release-q3/ → release-q4/
## Clerical Changes
- **data.zip/agencies.csv**: Columns reordered (content unchanged)
## Substantive Changes
- **summary.sqlite/allocations**: 3 rows added (84 → 87 rows)
Same command, richer output. The plugin parsed the database and found the actual change: three new rows in the allocations table. Plugins install via pip and work immediately — no configuration required.
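The kind of comparison that turns "bytes differ" into "3 rows added" can be sketched with Python's built-in `sqlite3` module. This is not the binoc-sqlite implementation (which the repository layout describes as a Rust comparator); the function names below are hypothetical, and the sketch only compares per-table row counts.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return {table name: row count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        counts = {}
        for name in names:
            # Quote the identifier, doubling any embedded double quotes.
            quoted = name.replace('"', '""')
            counts[name] = conn.execute(
                f'SELECT COUNT(*) FROM "{quoted}"'
            ).fetchone()[0]
        return counts
    finally:
        conn.close()


def diff_row_counts(
    old: dict[str, int], new: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return {table: (before, after)} for tables whose row count changed."""
    changes = {}
    for table in sorted(set(old) | set(new)):
        before, after = old.get(table, 0), new.get(table, 0)
        if before != after:
            changes[table] = (before, after)
    return changes
```

Row counts alone cannot see an update that leaves the count unchanged, so a real comparator would also need to look at schema and row content.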
Datasets published by governments, research institutions, and public bodies are living artifacts, and can change without warning or documentation (or without consistent documentation). The archival and data science communities need tooling to:
- Detect whether a new snapshot of a dataset actually differs from the previous one.
- Describe changes precisely — not just "the file changed," but "three columns were reordered (clerical) and one column was split into two (substantive)."
- Produce changelogs that are machine-readable for automated pipelines and human-readable for policy analysis.
- Handle real-world messiness: datasets inside zip archives, nested containers, mixed formats, renamed files.
Generic diff tools don't understand data formats, while version control systems track lines, not columns or schemas. Binoc bridges this gap.
- Compare directory snapshots recursively
- Diff zip archives, including nested zip contents
- Diff tar/tar.gz/tgz archives, including nested tar contents
- Compare CSV files with row, column, and cell awareness
- Compare text files at line level
- Compare binary files by content hash
- Detect moves and copies from content hashes
- Extract actual changed data from changeset nodes (added rows, text diffs, etc.)
- Render changesets as JSON or Markdown changelogs
- Extend comparison and transformation pipelines via Rust native plugins (C ABI), Python plugins, or in-workspace stdlib plugins
- tutorial.md: end-to-end contributor walkthrough.
- test-vectors/: fixtures demonstrating major capabilities.
- docs/adr/: records of architectural decisions.
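Of the capabilities listed above, detecting moves and copies from content hashes is easy to illustrate in isolation. The following Python sketch is an assumption about the general technique, not Binoc's algorithm: `hash_tree` and `detect_moves_and_copies` are hypothetical names, SHA-256 is an arbitrary choice of hash, and only byte-identical files are matched.

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    base = Path(root)
    return {
        p.relative_to(base).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }


def detect_moves_and_copies(old: dict[str, str], new: dict[str, str]):
    """Return (moves, copies) as lists of (source_path, destination_path).

    A new path whose content existed before is a move if the old path no
    longer holds that content, and a copy if it still does.
    """
    old_by_hash: dict[str, list[str]] = {}
    for path, digest in sorted(old.items()):
        old_by_hash.setdefault(digest, []).append(path)

    moves, copies = [], []
    for path, digest in sorted(new.items()):
        if path in old:
            continue  # the path already existed; not a new location
        sources = old_by_hash.get(digest)
        if not sources:
            continue  # content was never seen before: a genuine addition
        vanished = [s for s in sources if new.get(s) != digest]
        if vanished:
            moves.append((vanished[0], path))
        else:
            copies.append((sources[0], path))
    return moves, copies
```

Typical use would be `detect_moves_and_copies(hash_tree("snapshot-a"), hash_tree("snapshot-b"))`. Files that were both moved and modified have different hashes and are not matched by this approach.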
pip install binoc
Or run without installing:
uvx binoc diff path/to/snapshot-a path/to/snapshot-b
Diff two snapshots (prints a Markdown changelog to stdout by default):
binoc diff path/to/snapshot-a path/to/snapshot-b
Get raw changeset JSON instead:
binoc diff path/to/snapshot-a path/to/snapshot-b --format json
Save outputs to files (format inferred from extension, or use format:path syntax):
binoc diff path/to/snapshot-a path/to/snapshot-b \
  -o changeset.json -o CHANGELOG.md -q
Combine saved changesets into a changelog:
binoc changelog changesets/*.json
Extract the actual changed data from a changeset node (requires original snapshots):
binoc extract changeset.json data.csv rows_added
Third-party plugin packs extend binoc with domain-specific comparators and transformers. Install a plugin and its formats are available automatically:
pip install binoc-sqlite # SQLite schema + row count diffing
binoc diff snapshots/v1 snapshots/v2 # .sqlite/.db files now get semantic diffs
Or with uvx, no install needed:
uvx --with binoc-sqlite binoc diff snapshots/v1 snapshots/v2
Plugins can be Rust crates (compiled as native shared libraries via the export_plugin! macro) or pure Python. See docs/adr/ for architecture and model-plugins/ for reference implementations.
Rust plugin authors should depend on the published SDK crate:
cargo add binoc-sdk
The workspace also includes a standalone binoc-cli crate for contributors and a future Rust-only distribution path, but the SDK is the only Rust package published on crates.io for now.
Prerequisites: Rust, just (brew install just), and uv.
just build # Rust workspace + Python bindings
just test # full suite: Rust + Python
just docs # regenerate tutorial after code changes
To test the full Python CLI with local plugin crates (no PyPI needed):
uv run --with ./binoc-python --with ./model-plugins/binoc-sqlite \
  binoc diff path/to/snapshot-a path/to/snapshot-b
This builds both packages from source and wires up entry-point discovery automatically. The same pattern works for any local plugin crate that has a pyproject.toml with a [project.entry-points."binoc.plugins"] section. For a self-contained plugin example (install, run, test vectors), see model-plugins/binoc-sqlite/.
| Path | Role |
|---|---|
| binoc-sdk/ | Plugin SDK: traits, IR types, DataAccess, export_plugin! macro, C ABI wire types |
| binoc-core/ | Controller loop, config, plugin registry, output functions |
| binoc-stdlib/ | Standard comparators and transformers (architecturally identical to third-party plugins) |
| binoc-cli/ | CLI library + standalone Rust binary |
| binoc-python/ | PyO3 bindings, native plugin loader (libloading), Python plugin bridges, binoc CLI entry point |
| model-plugins/ | Reference plugin implementations: binoc-sqlite (Rust comparator), binoc-row-reorder (Rust transformer), binoc-html (Python renderer) |
| test-vectors/ | Shared test fixtures for standard library plugins |
| docs/ | Documentation, design notes, and ADRs |
- Additional plugins such as Excel, Parquet, PDF
- binoc plugin install / binoc plugin list CLI subcommands
- Richer Python notebook ergonomics
- LLM-summarized output formatter
- WASM/IPC plugin transport (ABI designed for it; not yet implemented)
- Memory-bounded processing for very large trees
- Similarity-based rename detection for modified-and-moved files
- Fixed-point transformer iteration (transformers currently run in a single pass and may miss optimizations)
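To illustrate the last item, the following Python sketch contrasts a single pass with fixed-point iteration, where the transformer chain is re-run until its output stops changing. Everything here is hypothetical: the function names, the toy transformers, and the list-of-lists stand-in for a changeset are illustrative assumptions, not Binoc's data model or planned design.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_single_pass(value: T, transformers: Sequence[Callable[[T], T]]) -> T:
    """Apply each transformer once, in order."""
    for transform in transformers:
        value = transform(value)
    return value


def run_to_fixed_point(
    value: T, transformers: Sequence[Callable[[T], T]], max_rounds: int = 10
) -> T:
    """Repeat full passes until one leaves the value unchanged."""
    for _ in range(max_rounds):
        result = run_single_pass(value, transformers)
        if result == value:
            return result
        value = result
    raise RuntimeError(f"no fixed point reached after {max_rounds} rounds")


# Toy transformers over a list of change groups (hypothetical data model).
def drop_empty_groups(groups: list) -> list:
    return [group for group in groups if group]


def drop_noops(groups: list) -> list:
    return [[change for change in group if change != "noop"] for group in groups]
```

With the order `[drop_empty_groups, drop_noops]`, a single pass over `[["noop"], ["edit"]]` leaves an empty group behind because the group only becomes empty after the first transformer has already run; iterating to a fixed point removes it. The `max_rounds` bound guards against transformer sets that never settle.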