Skip to content

bulkload-db: snapshot source once instead of per-chunk LIMIT/OFFSET#22

Merged
kevinschaper merged 4 commits into
mainfrom
fix/snapshot-bulkload
May 22, 2026
Merged

bulkload-db: snapshot source once instead of per-chunk LIMIT/OFFSET#22
kevinschaper merged 4 commits into
mainfrom
fix/snapshot-bulkload

Conversation

@kevinschaper
Copy link
Copy Markdown
Contributor

Fixes #19, #20, #21.

Summary

The prior bulkload_duckdb paginated the source by issuing a fresh SELECT * FROM <table> [ORDER BY ...] LIMIT N OFFSET M from each parallel worker, each on its own connection. That one design choice caused three separate problems:

Change

Replace the offset paradigm with a snapshot-and-stream model in bulkload_duckdb:

  1. ATTACH the user's DB read-only into a fresh tempfile DuckDB.
  2. CREATE TABLE snapshot AS SELECT … FROM src.<table> [WHERE …] [ORDER BY …] — one materialization.
  3. Stream the snapshot through a single read-only cursor (fetchmany(chunk_size)).
  4. Fan out only the Solr POSTs across a ThreadPoolExecutor with backpressure (at most 2 × workers chunks in flight). DuckDB reads stay serial in one connection — the OOM and view-recomputation pathologies can't happen.

Each row in the snapshot is yielded exactly once by the cursor, so the #19 row-loss is gone even without an explicit ORDER BY. --order-by is now optional and only useful for run-to-run reproducibility.

table_name may also be a parenthesized subquery (e.g. '(SELECT ... FROM src.edges JOIN src.nodes ...) v') so callers can drive joins through the snapshot without first defining a view in the source DB.

query_duckdb_chunk and upload_duckdb_chunk (the old offset workers) are removed — they only existed to serve the broken pattern.

Default workers: 12 → 4

get_optimal_worker_count used to scale to cpu * 1.5 capped at 12. Benchmarks (below) show parallelism past 4 has no benefit because Solr indexing dominates. New default: 4. Override with --parallel-workers.

Benchmarks (M-series Mac, Solr 8 in Docker, separate lsolr_bench_solr on :8984)

Source: monarch-kg.duckdb (~7.7 GB) from monarch-ingest-graph-schema-verify.

Correctness — synthetic view (LEFT JOIN edges → nodes ×2 → closure_label ×2, LIMIT 500 000)

Code Workers Reported success Actually committed Wall time
OLD 4 500 000 ✅ 302 262 (40% row loss) 40.2 s
NEW 4 500 000 500 000 36.8 s

Same workload, same data — reproduces #19 (silent ~40% loss with parallel + no ORDER BY on a view) and shows the fix.

Throughput — nodes table (1 457 305 rows), NEW code

Workers numFound Wall time Throughput
1 1 457 305 55.1 s 27 k docs/s
4 1 457 305 29.2 s 52 k docs/s
12 1 457 305 27.9 s 55 k docs/s

~2× from 1→4 workers, then plateau — Solr indexing is the bottleneck. Hence the new default of 4.

Smoke — denormalized_edges table (14 988 609 rows), OLD code w=12

~3 161 s, 4 742 docs/s, full count. (The OLD nodes run at default workers also happened to land the full count this run; #19 is timing-dependent, but the view test above demonstrates the failure mode reliably.)

A heavier ordered LEFT JOIN view (the shape #21 describes) couldn't complete a single OLD chunk in 5 minutes locally, consistent with the report there.

Test plan

  • make test (unit suite, unaffected — only bulkload_file is exercised)
  • Re-run on the real Monarch KG to confirm counts at full scale
  • Sanity-check --order-by still works and produces a stable snapshot

Fixes #19, #20, #21. The prior implementation paginated the source by
issuing a fresh `SELECT * FROM <table> [ORDER BY ...] LIMIT N OFFSET M`
from each parallel worker. Three problems followed from that one
choice:

  * Without ORDER BY, separate connections were free to return rows in
    different orders, so chunks could overlap (rows overwritten by id
    in Solr) and miss rows entirely. Loads silently dropped ~20% of
    input while reporting success.
  * With ORDER BY, every worker re-sorted the entire source. On
    multi-GB inputs the simultaneous sorts exhausted memory; failed
    chunks contributed zero with no propagated error.
  * On a DuckDB view, every chunk re-evaluated the full view.

Replace the offset paradigm with: materialize the (filtered/ordered)
source once into a tempfile DuckDB via ATTACH ... READ_ONLY, stream
rows through a single read cursor with fetchmany, and fan out only the
Solr POSTs across a thread pool with backpressure. One pass, bounded
memory, every row uploaded exactly once.

Defaults: parallel-workers drops from `cpu * 1.5 (max 12)` to 4 —
benchmarks on the Monarch KG (1.46M nodes, 500k joined edges) show
~2x going from 1→4 workers and a hard plateau after that, because
Solr indexing becomes the bottleneck.
GitHub now hard-fails workflows that use actions/cache@v2, which the
previous pr-test workflow did — so this PR's tests couldn't run at all.
Bump checkout@v4, setup-python@v5, cache@v4, install-poetry@v1, and
move the matrix off EOL 3.8/3.9 onto 3.10/3.11. Includes the Python
version in the venv cache key so the two matrix entries don't collide.
@kevinschaper kevinschaper merged commit cdb6de4 into main May 22, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bulkload-db: chunked LIMIT/OFFSET without ORDER BY can drop ~20% of rows

1 participant