optimized load_blocks_for_chunks and find_chunks_by_dedup_key queries #9526
dannyzaken wants to merge 1 commit into noobaa:master
Conversation
📝 Walkthrough
Replaced separate chunk and block lookups with SQL joins and in-function assembly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller as Client
    participant MD as MDStore
    participant DB as Postgres
    participant Processor as In-memory assembler
    Caller->>MD: find_chunks_by_dedup_key(bucket, dedup_keys)
    MD->>DB: Execute JOIN query (datachunks JOIN datablocks) with dedup_keys
    DB-->>MD: Rows with chunk + block JSON
    MD->>Processor: Decode rows, build unique chunk map
    Processor-->>MD: Chunks with grouped frags and blocks assigned
    MD-->>Caller: Return chunks with frags[].blocks populated
    Caller->>MD: load_blocks_for_chunks(chunkIds, sorter)
    MD->>DB: Execute unnest($1::text[]) + LATERAL query for blocks
    DB-->>MD: Block rows (non-deleted)
    MD->>Processor: Group blocks by chunk -> frag, apply sorter
    Processor-->>MD: Updated chunks with frag.blocks
    MD-->>Caller: Return updated chunks
```
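The `load_blocks_for_chunks` assembly step in the diagram (group blocks by chunk and frag, apply the sorter) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual `md_store.js` code; the field names `chunk`, `frag`, and `_id` are assumptions about the decoded row shape.

```javascript
// Sketch of the in-memory assembly for load_blocks_for_chunks:
// group the fetched block rows per chunk, sort them, then attach
// each chunk's blocks to the matching frag. Field names are illustrative.
function attach_blocks(chunks, block_rows, sorter) {
    const blocks_by_chunk = new Map(); // chunk id -> array of blocks
    for (const block of block_rows) {
        const list = blocks_by_chunk.get(block.chunk) || [];
        list.push(block);
        blocks_by_chunk.set(block.chunk, list);
    }
    for (const chunk of chunks) {
        // sort once per chunk, then split the blocks among its frags
        const blocks = (blocks_by_chunk.get(chunk._id) || []).sort(sorter);
        for (const frag of chunk.frags) {
            frag.blocks = blocks.filter(b => b.frag === frag._id);
        }
    }
    return chunks;
}
```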
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/server/object_services/md_store.js`:
- Around lines 1548-1561: The SQL `INNER JOIN` between `this._chunks` and `this._blocks` omits chunks that have no blocks. Change the `JOIN` to a `LEFT JOIN` in the query that selects `c.data AS chunk_data, b.data AS block_data`, so that chunks without blocks are still returned. Then update the post-processing loop that builds `frag.blocks` (which uses `blocks_by_frag` and `frag._id`) to guard against null `block_data`: only map/push block entries when `block_data` exists (skip, or treat as an empty array, when it is null), so that `frag.blocks = blocks_by_frag[frag._id.toHexString()] || []` still works for chunks with no blocks.
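The suggested null guard can be sketched as below. This is a hypothetical illustration of the reviewer's point, not the actual patch; the join condition on the JSON fields and the row shape are assumptions.

```javascript
// Sketch of the suggested fix: LEFT JOIN keeps chunks that have no blocks,
// producing rows where block_data is null; the grouping loop must skip those.
// The ON condition below is an assumed shape, not the actual md_store.js query.
const query = `
    SELECT c.data AS chunk_data, b.data AS block_data
    FROM datachunks c
    LEFT JOIN datablocks b ON b.data->>'chunk' = c._id
    WHERE c.data->>'dedup_key' = ANY($1)
`;

function group_blocks_by_frag(rows) {
    const blocks_by_frag = {};
    for (const { block_data } of rows) {
        if (!block_data) continue; // LEFT JOIN row for a chunk with no blocks
        (blocks_by_frag[block_data.frag] = blocks_by_frag[block_data.frag] || [])
            .push(block_data);
    }
    // frag.blocks = blocks_by_frag[frag._id] || [] then still yields []
    // for chunks that have no blocks at all.
    return blocks_by_frag;
}
```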
📒 Files selected for processing (1)
src/server/object_services/md_store.js
- Replace mongo-to-pg translation in `load_blocks_for_chunks` with a raw LATERAL + OFFSET 0 query that forces per-chunk index lookups on `idx_btree_datablocks_chunk`, reducing execution from ~2600ms to ~2ms for 38 chunks.
- Combine the two-step `find_chunks_by_dedup_key` into a single JOIN query that eliminates an extra DB round-trip while the planner reliably picks nested loop + index scans on both tables.

Signed-off-by: Danny Zaken <[email protected]>
Describe the Problem
Two functions in `md_store.js` (`load_blocks_for_chunks` and `find_chunks_by_dedup_key`) that query the `datablocks` table (~3M rows) suffered from PostgreSQL choosing sequential scans when the number of chunk IDs in the `IN` clause exceeded ~10 values. The default `random_page_cost = 4.0` caused the planner to prefer a seq scan over the existing `idx_btree_datablocks_chunk` B-tree index. `load_blocks_for_chunks` took ~2600ms for 38 chunks, and `find_chunks_by_dedup_key` added an extra DB round-trip by calling `load_blocks_for_chunks` separately after fetching chunks.

Explain the Changes
- `load_blocks_for_chunks`: Replaced the mongo-to-pg translation layer (`this._blocks.find(...)`) with a raw SQL query using `UNNEST + LATERAL + OFFSET 0`. The `OFFSET 0` acts as an optimization barrier that forces PostgreSQL into a nested loop with a per-chunk Bitmap Index Scan on `idx_btree_datablocks_chunk`, reducing execution from ~2600ms to ~2ms for 38 chunks (~1000x faster).
- `find_chunks_by_dedup_key`: Combined the two-step approach (find chunks by dedup_key, then call `load_blocks_for_chunks`) into a single `INNER JOIN` query between `datachunks` and `datablocks`. The dedup_key index filters chunks to a small set, so the planner reliably picks Nested Loop + index scans on both tables. The denormalized result set is deduplicated in JS using a `Map` keyed by chunk_id. This eliminates one DB round-trip entirely.
- Bumped the `max-lines` directive from 2200 to 2210 to accommodate the additional lines from the optimized implementations.

Issues: Fixed #xxx / Gap #xxx
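The two query shapes described above can be sketched roughly as follows. This is a hedged reconstruction from the PR text only: the table and index names come from the description, but the JSON field accessors used in the join and filter conditions are assumptions, not the actual `md_store.js` SQL.

```sql
-- load_blocks_for_chunks (sketch): UNNEST the chunk id array, then a LATERAL
-- subquery per chunk. OFFSET 0 is the optimization barrier that keeps the
-- planner on a per-chunk index scan instead of one big seq scan.
SELECT b.data
FROM unnest($1::text[]) AS c(chunk_id)
CROSS JOIN LATERAL (
    SELECT data
    FROM datablocks
    WHERE data->>'chunk' = c.chunk_id
      AND data->>'deleted' IS NULL   -- non-deleted blocks only (assumed filter)
    OFFSET 0
) AS b;

-- find_chunks_by_dedup_key (sketch): one JOIN instead of two round-trips;
-- the dedup_key index narrows datachunks first, then nested-loop index scans
-- pick up the matching blocks. Rows are deduplicated by chunk_id in JS.
SELECT c.data AS chunk_data, b.data AS block_data
FROM datachunks c
JOIN datablocks b ON b.data->>'chunk' = c._id
WHERE c.data->>'dedup_key' = ANY($1);
```

The trade-off in the second query is a denormalized result set (one row per chunk-block pair), which is why the in-memory `Map` keyed by chunk_id is needed to rebuild unique chunks.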
Testing Instructions:
- `node src/test/integration_tests/db/test_md_store.js` — covers `find_chunks_by_dedup_key` with matching keys, non-matching keys, and empty arrays.
- Exercise `load_blocks_for_chunks` via any flow that reads object data (e.g. `map_reader`, `map_builder`, `map_deleter`).
- Run `EXPLAIN ANALYZE` on both queries to confirm index scans are used instead of seq scans.

Summary by CodeRabbit