feat: allow different population lists for HGDP haplotype and gnomAD variant inputs#52
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR consolidates separate HGDP and gnomAD population legends into a unified joint legend throughout the DivRef pipeline. The workflow configuration now splits population inputs into independent HGDP+1KG and gnomAD variant lists; DuckDB indexing computes remapping arrays to align both sources to the joint legend; and the ChangesPopulation Legend Consolidation Across HGDP and gnomAD Data Sources
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
divref/divref/tools/remap_divref.py (1)
193-197: 💤 Low valueConsider clarifying empty-string handling rationale.
The
_MISSING_AF_TOKENSfrozenset now includes""(empty string) alongside"null"and"NA". While the test coverage confirms this is intentional, a brief inline comment explaining why empty strings should be treated as missing (e.g., "empty string occurs when Hail exports missing values in certain TSV configurations") would help future maintainers understand the design choice.📝 Suggested comment addition
-_MISSING_AF_TOKENS = frozenset({"null", "NA", ""}) +# Treat "null" (DuckDB NULL), "NA" (Hail missing token), and "" (Hail TSV empty cell) as missing. +_MISSING_AF_TOKENS = frozenset({"null", "NA", ""})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@divref/divref/tools/remap_divref.py` around lines 193 - 197, The frozenset _MISSING_AF_TOKENS and the _parse_pop_freqs function now treat the empty string as a missing allele frequency token; add a concise inline comment above _MISSING_AF_TOKENS explaining the rationale (e.g., that some exporters like Hail or certain TSV configurations emit empty strings for missing values), so future maintainers understand why "" is included alongside "null" and "NA". Keep the comment short, reference the exporters or formats that produce empty tokens, and place it directly above the _MISSING_AF_TOKENS declaration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 141-159: Ensure table_pairs is non-empty and validate that each
pair's haplotype/sites tables share the same pops legend before reusing remaps:
check table_pairs length and raise a clear error if empty, then iterate over
table_pairs and read each table_pairs[i].haplotype_table_path and
table_pairs[i].sites_table_path to collect their .pops and confirm they match
hgdp_pops_legend and gnomad_pops_legend (or recompute joint_pops_legend
per-pair) instead of blindly using table_pairs[0]; if any mismatch is found,
either compute per-pair remaps (hgdp_to_joint, gnomad_to_joint, hgdp_at_joint,
gnomad_at_joint) for that pair or raise an explicit exception explaining the
inconsistent pops so remapping and exported gnomAD_AF_* columns remain correct.
---
Nitpick comments:
In `@divref/divref/tools/remap_divref.py`:
- Around line 193-197: The frozenset _MISSING_AF_TOKENS and the _parse_pop_freqs
function now treat the empty string as a missing allele frequency token; add a
concise inline comment above _MISSING_AF_TOKENS explaining the rationale (e.g.,
that some exporters like Hail or certain TSV configurations emit empty strings
for missing values), so future maintainers understand why "" is included
alongside "null" and "NA". Keep the comment short, reference the exporters or
formats that produce empty tokens, and place it directly above the
_MISSING_AF_TOKENS declaration.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 148e100a-1954-48d8-a1bb-c1c1090c2aa3
⛔ Files ignored due to path filters (1)
pixi.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
divref/divref/tools/compute_haplotypes.pydivref/divref/tools/create_duckdb_index.pydivref/divref/tools/remap_divref.pydivref/tests/tools/test_remap_divref.pypixi.tomlworkflows/config/config_schema.ymlworkflows/generate_divref.smk
Summary by CodeRabbit
Configuration
hgdp_1kg_populationsandgnomad_variant_populationsparameters for more granular control over population selections in extraction workflows.Chores
Refactor