
optimized load_blocks_for_chunks and find_chunks_by_dedup_key queries#9526

Open
dannyzaken wants to merge 1 commit into noobaa:master from dannyzaken:danny-fixes

Conversation

@dannyzaken (Member) commented Apr 2, 2026

Describe the Problem

Two functions in md_store.js (load_blocks_for_chunks and find_chunks_by_dedup_key) that query the datablocks table (~3M rows) suffered from PostgreSQL choosing sequential scans when the number of chunk IDs in the IN clause exceeded ~10 values. The default random_page_cost = 4.0 caused the planner to prefer seq scan over the existing idx_btree_datablocks_chunk B-tree index. load_blocks_for_chunks took ~2600ms for 38 chunks, and find_chunks_by_dedup_key added an extra DB round-trip by calling load_blocks_for_chunks separately after fetching chunks.

Explain the Changes

  1. load_blocks_for_chunks: Replaced the mongo-to-pg translation layer (this._blocks.find(...)) with a raw SQL query using UNNEST + LATERAL + OFFSET 0. The OFFSET 0 acts as an optimization barrier that forces PostgreSQL into a nested loop with per-chunk Bitmap Index Scan on idx_btree_datablocks_chunk, reducing execution from ~2600ms to ~2ms for 38 chunks (~1000x faster).
  2. find_chunks_by_dedup_key: Combined the two-step approach (find chunks by dedup_key, then call load_blocks_for_chunks) into a single INNER JOIN query between datachunks and datablocks. The dedup_key index filters chunks to a small set, so the planner reliably picks Nested Loop + index scans on both tables. The denormalized result set is deduplicated in JS using a Map keyed by chunk _id. This eliminates one DB round-trip entirely.
  3. Bumped the eslint max-lines directive from 2200 to 2210 to accommodate the additional lines from the optimized implementations.
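The LATERAL + OFFSET 0 pattern from item 1 can be sketched as follows. This is a minimal illustration, not the actual md_store.js query: the column layout (a `data` JSONB column holding `chunk` and `deleted` fields) is an assumption for the example.

```javascript
// Hypothetical sketch of the query shape: unnest() expands the chunk-id
// array into rows, and the LATERAL subquery is evaluated once per chunk id.
// OFFSET 0 is a planner optimization barrier that keeps PostgreSQL from
// flattening the subquery into one big IN (...) scan, so each iteration
// does an index lookup on the chunk column instead of a seq scan over
// ~3M rows. Column names are assumptions, not the real schema.
const LOAD_BLOCKS_FOR_CHUNKS_SQL = `
    SELECT b.data
    FROM unnest($1::text[]) AS chunk_ids(chunk_id)
    CROSS JOIN LATERAL (
        SELECT data
        FROM datablocks
        WHERE data->>'chunk' = chunk_ids.chunk_id
          AND data->>'deleted' IS NULL
        OFFSET 0
    ) AS b
`;
```

A caller would pass all chunk ids as a single array parameter, e.g. `client.query(LOAD_BLOCKS_FOR_CHUNKS_SQL, [chunk_ids])`, so the statement is planned once regardless of how many ids are supplied.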
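The single-query assembly in item 2 can be sketched like this. The row shape (`chunk_data`/`block_data` JSON, a `frags` array on each chunk, and a `frag` id on each block) is an assumption for illustration, not the exact md_store.js structures.

```javascript
// Sketch of the in-memory assembly described in item 2. The JOIN returns
// one denormalized row per (chunk, block) pair; chunks are deduplicated
// with a Map keyed by chunk _id, and each block is attached to the frag
// it belongs to. Field names are illustrative assumptions.
function assemble_chunks(rows) {
    const chunks_by_id = new Map();
    for (const { chunk_data, block_data } of rows) {
        let chunk = chunks_by_id.get(chunk_data._id);
        if (!chunk) {
            chunk = { ...chunk_data };
            for (const frag of chunk.frags) frag.blocks = [];
            chunks_by_id.set(chunk._id, chunk);
        }
        const frag = chunk.frags.find(f => f._id === block_data.frag);
        if (frag) frag.blocks.push(block_data);
    }
    return [...chunks_by_id.values()];
}
```

The Map makes the dedup O(1) per row, so the denormalized result set costs only the extra bytes on the wire, not extra query time.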

Issues: Fixed #xxx / Gap #xxx

  1. https://redhat.atlassian.net/browse/DFBUGS-5326

Testing Instructions:

  1. Run existing integration tests: node src/test/integration_tests/db/test_md_store.js — covers find_chunks_by_dedup_key with matching keys, non-matching keys, and empty arrays.
  2. Verify load_blocks_for_chunks via any flow that reads object data (e.g. map_reader, map_builder, map_deleter).
  3. On a production-scale DB (~3M datablocks rows), run EXPLAIN ANALYZE on both queries to confirm index scans are used instead of seq scans.
  • Doc added/updated
  • Tests added

Summary by CodeRabbit

  • Refactor
    • Data retrieval now returns chunk fragments populated with their associated blocks in one step, improving consistency and reducing extra lookups; optional ordering is applied when provided.
  • Bug Fixes
    • Soft-deleted chunks and deleted blocks are now excluded from results.
  • Tests
    • Integration tests extended to verify fragments include correct blocks, grouping, ordering, and deletion handling.

@coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Replaced separate chunk and block lookups with SQL joins and in-function assembly: find_chunks_by_dedup_key now selects joined datachunks+datablocks, decodes rows, groups blocks by chunk→frag and assigns frag.blocks; load_blocks_for_chunks uses unnest($1::text[]) + LATERAL to fetch and attach blocks. Tests extended for Postgres.

Changes

Cohort / File(s) Summary:

  • MD store DB/query changes (src/server/object_services/md_store.js):
    Rewrote find_chunks_by_dedup_key() to use a joined SQL query returning chunk+block JSON, decode rows in-function, dedupe chunks, group blocks by chunk/frag and set frag.blocks. Replaced the prior Mongo-style block lookup in load_blocks_for_chunks() with a raw SQL unnest($1::text[]) + LATERAL query; grouping and optional sorting applied. ESLint max-lines increased from 2200 to 2210.
  • Integration tests, Postgres-focused (src/test/integration_tests/db/test_md_store.js):
    Expanded tests for find_chunks_by_dedup_key and load_blocks_for_chunks: multiple chunks/frags/blocks insertion, assertions that each frag.blocks is populated and correctly sized, exclusions for soft-deleted chunks and deleted blocks, sorter behavior, and handling of empty/null inputs.
  • Lint config (.eslintrc.js):
    Bumped max-lines rule to accommodate added code (+10).

Sequence Diagram(s)

sequenceDiagram
  participant Caller as Client
  participant MD as MDStore
  participant DB as Postgres
  participant Processor as In-memory assembler

  Caller->>MD: find_chunks_by_dedup_key(bucket, dedup_keys)
  MD->>DB: Execute JOIN query (datachunks JOIN datablocks) with dedup_keys
  DB-->>MD: Rows with chunk + block JSON
  MD->>Processor: Decode rows, build unique chunk map
  Processor-->>MD: Chunks with grouped frags and blocks assigned
  MD-->>Caller: Return chunks with frags[].blocks populated

  Caller->>MD: load_blocks_for_chunks(chunkIds, sorter)
  MD->>DB: Execute unnest($1::text[]) + LATERAL query for blocks
  DB-->>MD: Block rows (non-deleted)
  MD->>Processor: Group blocks by chunk -> frag, apply sorter
  Processor-->>MD: Updated chunks with frag.blocks
  MD-->>Caller: Return updated chunks

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes


Suggested labels

size/L

Suggested reviewers

  • liranmauda
  • tangledbytes
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check ✅ Passed: The title directly and accurately describes the main objective of the pull request: optimizing two query functions (load_blocks_for_chunks and find_chunks_by_dedup_key) by replacing inefficient mongo-to-postgres translations with optimized raw SQL queries.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.




@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/server/object_services/md_store.js`:
- Around lines 1548-1561: The SQL INNER JOIN between this._chunks and this._blocks omits chunks that have no blocks. Change it to a LEFT JOIN in the query that selects c.data AS chunk_data, b.data AS block_data, so chunks without blocks are still returned. Then update the post-processing loop that builds frag.blocks (which uses blocks_by_frag and frag._id) to guard against null block_data: only map/push block entries when block_data exists (skip, or treat as an empty array, when block_data is null), so that frag.blocks = blocks_by_frag[frag._id.toHexString()] || [] still works for chunks with no blocks.
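A hedged sketch of the null guard this comment asks for, using the same illustrative row shape as above rather than the actual md_store.js code:

```javascript
// Hypothetical sketch: with a LEFT JOIN, a chunk that has no blocks still
// yields one row, but its block_data is null. Skipping those rows keeps
// the grouping correct, and a later `blocks_by_frag[key] || []` lookup
// gives such frags an empty blocks array. Field names are illustrative
// assumptions, not the exact md_store.js structures.
function group_blocks_by_frag(rows) {
    const blocks_by_frag = {};
    for (const { block_data } of rows) {
        if (!block_data) continue; // LEFT JOIN row for a chunk without blocks
        (blocks_by_frag[block_data.frag] ??= []).push(block_data);
    }
    return blocks_by_frag;
}
```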

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4d20403b-b76d-4b95-8b7f-4fb3137d60ab

📥 Commits

Reviewing files that changed from the base of the PR and between efb5389 and c0f1f25.

📒 Files selected for processing (1)
  • src/server/object_services/md_store.js

- Replace mongo-to-pg translation in load_blocks_for_chunks with a raw LATERAL + OFFSET 0 query that forces per-chunk index lookups on idx_btree_datablocks_chunk, reducing execution from ~2600ms to ~2ms for 38 chunks.
- Combine the two-step find_chunks_by_dedup_key into a single JOIN query that eliminates an extra DB round-trip while the planner reliably picks nested loop + index scans on both tables.

Signed-off-by: Danny Zaken <[email protected]>
