bulkload-db: snapshot source once instead of per-chunk LIMIT/OFFSET by kevinschaper · Pull Request #22 · linkml/linkml-solr

kevinschaper · 2026-05-19T23:30:58Z

Summary

The prior bulkload_duckdb paginated the source by issuing a fresh SELECT * FROM <table> [ORDER BY ...] LIMIT N OFFSET M from each parallel worker, each on its own connection. That one design choice caused three separate problems:

bulkload-db: chunked LIMIT/OFFSET without ORDER BY can drop ~20% of rows #19: Without ORDER BY, different connections were free to return rows in different orders, so chunks could overlap (rows overwritten by id in Solr) and miss rows entirely. Loads silently dropped ~20% of input while reporting success.
bulkload-db: parallel chunks each re-sort the entire source, causing silent OOM #20: With ORDER BY, every worker re-sorted the entire source. On multi-GB inputs the 12-way simultaneous sorts exhausted memory; failed chunks contributed zero with no propagated error.
bulkload-db against a DuckDB VIEW recomputes the entire view per chunk #21: On a DuckDB view, every chunk re-evaluated the full view (50× slowdown reported in the issue).

Change

Replace the offset paradigm with a snapshot-and-stream model in bulkload_duckdb:

ATTACH the user's DB read-only into a fresh tempfile DuckDB.
CREATE TABLE snapshot AS SELECT … FROM src.<table> [WHERE …] [ORDER BY …] — one materialization.
Stream the snapshot through a single read-only cursor (fetchmany(chunk_size)).
Fan out only the Solr POSTs across a ThreadPoolExecutor with backpressure (at most 2 × workers chunks in flight). DuckDB reads stay serial in one connection — the OOM and view-recomputation pathologies can't happen.

Each row in the snapshot is yielded exactly once by the cursor, so the #19 row-loss is gone even without an explicit ORDER BY. --order-by is now optional and only useful for run-to-run reproducibility.

table_name may also be a parenthesized subquery (e.g. '(SELECT ... FROM src.edges JOIN src.nodes ...) v') so callers can drive joins through the snapshot without first defining a view in the source DB.

query_duckdb_chunk and upload_duckdb_chunk (the old offset workers) are removed — they only existed to serve the broken pattern.

Default workers: 12 → 4

get_optimal_worker_count used to scale to cpu * 1.5 capped at 12. Benchmarks (below) show parallelism past 4 has no benefit because Solr indexing dominates. New default: 4. Override with --parallel-workers.

Benchmarks (M-series Mac, Solr 8 in Docker, separate `lsolr_bench_solr` on :8984)

Source: monarch-kg.duckdb (~7.7 GB) from monarch-ingest-graph-schema-verify.

Correctness — synthetic view (LEFT JOIN edges → nodes ×2 → closure_label ×2, LIMIT 500 000)

Code	Workers	Reported success	Actually committed	Wall time
OLD	4	500 000 ✅	302 262 (40% row loss)	40.2 s
NEW	4	500 000	500 000	36.8 s

Same workload, same data — reproduces #19 (silent ~40% loss with parallel + no ORDER BY on a view) and shows the fix.

Throughput — `nodes` table (1 457 305 rows), NEW code

Workers	numFound	Wall time	Throughput
1	1 457 305	55.1 s	27 k docs/s
4	1 457 305	29.2 s	52 k docs/s
12	1 457 305	27.9 s	55 k docs/s

~2× from 1→4 workers, then plateau — Solr indexing is the bottleneck. Hence the new default of 4.

Smoke — `denormalized_edges` table (14 988 609 rows), OLD code w=12

~3 161 s, 4 742 docs/s, full count. (The OLD nodes run at default workers also happened to land the full count this run; #19 is timing-dependent, but the view test above demonstrates the failure mode reliably.)

A heavier ordered LEFT JOIN view (the shape #21 describes) couldn't complete a single OLD chunk in 5 minutes locally, consistent with the report there.

Test plan

make test (unit suite, unaffected — only bulkload_file is exercised)
Re-run on the real Monarch KG to confirm counts at full scale
Sanity-check --order-by still works and produces a stable snapshot

Fixes #19, #20, #21. The prior implementation paginated the source by issuing a fresh `SELECT * FROM <table> [ORDER BY ...] LIMIT N OFFSET M` from each parallel worker. Three problems followed from that one choice: * Without ORDER BY, separate connections were free to return rows in different orders, so chunks could overlap (rows overwritten by id in Solr) and miss rows entirely. Loads silently dropped ~20% of input while reporting success. * With ORDER BY, every worker re-sorted the entire source. On multi-GB inputs the simultaneous sorts exhausted memory; failed chunks contributed zero with no propagated error. * On a DuckDB view, every chunk re-evaluated the full view. Replace the offset paradigm with: materialize the (filtered/ordered) source once into a tempfile DuckDB via ATTACH ... READ_ONLY, stream rows through a single read cursor with fetchmany, and fan out only the Solr POSTs across a thread pool with backpressure. One pass, bounded memory, every row uploaded exactly once. Defaults: parallel-workers drops from `cpu * 1.5 (max 12)` to 4 — benchmarks on the Monarch KG (1.46M nodes, 500k joined edges) show ~2x going from 1→4 workers and a hard plateau after that, because Solr indexing becomes the bottleneck.

GitHub now hard-fails workflows that use actions/cache@v2, which the previous pr-test workflow did — so this PR's tests couldn't run at all. Bump checkout@v4, setup-python@v5, cache@v4, install-poetry@v1, and move the matrix off EOL 3.8/3.9 onto 3.10/3.11. Includes the Python version in the venv cache key so the two matrix entries don't collide.

kevinschaper added 4 commits May 19, 2026 16:30

ci: install setuptools so pysolr's pkg_resources import works on py3.10+

deb4203

ci: pin setuptools<81 (pysolr needs pkg_resources, removed in 81)

2338171

kevinschaper merged commit cdb6de4 into main May 22, 2026
2 checks passed

kevinschaper mentioned this pull request May 22, 2026

ci: rewrite main.yaml so it actually runs #23

Merged

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bulkload-db: snapshot source once instead of per-chunk LIMIT/OFFSET#22

bulkload-db: snapshot source once instead of per-chunk LIMIT/OFFSET#22
kevinschaper merged 4 commits into
mainfrom
fix/snapshot-bulkload

kevinschaper commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kevinschaper commented May 19, 2026

Summary

Change

Default workers: 12 → 4

Benchmarks (M-series Mac, Solr 8 in Docker, separate lsolr_bench_solr on :8984)

Correctness — synthetic view (LEFT JOIN edges → nodes ×2 → closure_label ×2, LIMIT 500 000)

Throughput — nodes table (1 457 305 rows), NEW code

Smoke — denormalized_edges table (14 988 609 rows), OLD code w=12

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Benchmarks (M-series Mac, Solr 8 in Docker, separate `lsolr_bench_solr` on :8984)

Throughput — `nodes` table (1 457 305 rows), NEW code

Smoke — `denormalized_edges` table (14 988 609 rows), OLD code w=12