Skip to content

feat: allow different population lists for HGDP haplotype and gnomAD variant inputs#52

Merged
ameynert merged 5 commits into
mainfrom
am_diff_pops
May 15, 2026
Merged

feat: allow different population lists for HGDP haplotype and gnomAD variant inputs#52
ameynert merged 5 commits into
mainfrom
am_diff_pops

Conversation

@ameynert
Copy link
Copy Markdown
Collaborator

@ameynert ameynert commented May 13, 2026

Summary by CodeRabbit

  • Configuration

    • Split population configuration into separate hgdp_1kg_populations and gnomad_variant_populations parameters for more granular control over population selections in extraction workflows.
  • Chores

    • Added samtools dependency (≥1.23.1, <2).
  • Refactor

    • Enhanced internal haplotype population frequency handling to support dynamic population legends.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 13, 2026

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03526d2b-9087-475e-9e1c-c1ebac8dc16b

📥 Commits

Reviewing files that changed from the base of the PR and between 20fbd14 and 9956358.

📒 Files selected for processing (6)
  • divref/divref/tools/compute_haplotypes.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/divref/tools/remap_divref.py
  • divref/tests/tools/test_remap_divref.py
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk
📝 Walkthrough

Walkthrough

This PR consolidates separate HGDP and gnomAD population legends into a unified joint legend throughout the DivRef pipeline. The workflow configuration now splits population inputs into independent HGDP+1KG and gnomAD variant lists; DuckDB indexing computes remapping arrays to align both sources to the joint legend; and the Haplotype model is refactored to accept dynamically-named per-population AF columns.

Changes

Population Legend Consolidation Across HGDP and gnomAD Data Sources

Layer / File(s) Summary
Configuration and Workflow Split
workflows/config/config_schema.yml, workflows/generate_divref.smk
Schema replaces the generic populations field with hgdp_1kg_populations and introduces gnomad_variant_populations. The workflow removes the single POPS variable and introduces HGDP_1KG_POPS and GNOMAD_VARIANT_POPS to separately parameterize the two gnomAD extraction rules.
DuckDB Index Legend Consolidation and Remapping
divref/divref/tools/create_duckdb_index.py
Separate HGDP and gnomAD legends are extracted and merged into joint_pops_legend (gnomAD populations first, then HGDP-only). Bidirectional remapping arrays are computed (source→joint and joint→source with -1 padding) and threaded through build_hgdp_haplotype_table_entries and build_gnomad_variant_table_entries to align both sources to the joint legend. DuckDB now persists three legend tables instead of one.
TSV Export and Chunk Reading
divref/divref/tools/create_duckdb_index.py
Export functions are updated to generate per-population gnomAD_AF_{pop} columns from pre-remapped gnomad_freqs arrays using joint-legend order. Schema overrides in chunk reading are keyed off joint_pops_legend to preserve string typing for variant identifiers and AF columns.
Haplotype Model Refactoring
divref/divref/tools/remap_divref.py
Haplotype model replaces fixed per-population gnomAD AF fields with a generic gnomad_afs: dict[str, str] dictionary. A new from_row classmethod populates this dict from ordered population legends read from the DuckDB index. Population frequency computation now derives all_pop_freqs from the generic dict, and missing-token handling is broadened to treat "NA" and empty strings as 0.0.
Test Fixture Updates and Dependencies
divref/tests/tools/test_remap_divref.py, pixi.toml
Test helper create_haplotype is refactored to accept a gnomad_afs dictionary with sensible per-population defaults. Existing tests are updated to pass the new dict pattern. New test coverage is added for NA token handling and Haplotype.from_row constructor behavior. Samtools dependency version >=1.23.1,<2 is added to pixi.
Compute Haplotypes Output Finalization
divref/divref/tools/compute_haplotypes.py
The final union haplotype table is annotated with a pops global containing the gnomAD population legend before writing output.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fg-labs/divref-wf#8: Introduces compute_haplotypes tool; this PR extends it to attach gnomAD pop_legend metadata to the final output table.
  • fg-labs/divref-wf#44: Also decouples HGDP+1KG haplotype inputs from gnomAD variant AF extraction in the workflow configuration and rule parameterization.
  • fg-labs/divref-wf#34: Updates to DuckDB legend remapping and per-population gnomAD_AF_* column generation affect downstream allele-frequency comparison schemas consumed by this PR's tooling.

Poem

🐰 Two legends walk into a merged table,
HGDP meets gnomAD, a unified fable,
Remap arrays shuffle and pad with such grace,
Per-pop frequencies find their rightful place,
One joint legend rules them all, in style! 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title clearly and accurately describes the main change: allowing different population lists for HGDP haplotype and gnomAD variant inputs, which is the primary objective across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_diff_pops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:56 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
divref/divref/tools/remap_divref.py (1)

193-197: 💤 Low value

Consider clarifying empty-string handling rationale.

The _MISSING_AF_TOKENS frozenset now includes "" (empty string) alongside "null" and "NA". While the test coverage confirms this is intentional, a brief inline comment explaining why empty strings should be treated as missing (e.g., "empty string occurs when Hail exports missing values in certain TSV configurations") would help future maintainers understand the design choice.

📝 Suggested comment addition
-_MISSING_AF_TOKENS = frozenset({"null", "NA", ""})
+# Treat "null" (DuckDB NULL), "NA" (Hail missing token), and "" (Hail TSV empty cell) as missing.
+_MISSING_AF_TOKENS = frozenset({"null", "NA", ""})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/divref/tools/remap_divref.py` around lines 193 - 197, The frozenset
_MISSING_AF_TOKENS and the _parse_pop_freqs function now treat the empty string
as a missing allele frequency token; add a concise inline comment above
_MISSING_AF_TOKENS explaining the rationale (e.g., that some exporters like Hail
or certain TSV configurations emit empty strings for missing values), so future
maintainers understand why "" is included alongside "null" and "NA". Keep the
comment short, reference the exporters or formats that produce empty tokens, and
place it directly above the _MISSING_AF_TOKENS declaration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 141-159: Ensure table_pairs is non-empty and validate that each
pair's haplotype/sites tables share the same pops legend before reusing remaps:
check table_pairs length and raise a clear error if empty, then iterate over
table_pairs and read each table_pairs[i].haplotype_table_path and
table_pairs[i].sites_table_path to collect their .pops and confirm they match
hgdp_pops_legend and gnomad_pops_legend (or recompute joint_pops_legend
per-pair) instead of blindly using table_pairs[0]; if any mismatch is found,
either compute per-pair remaps (hgdp_to_joint, gnomad_to_joint, hgdp_at_joint,
gnomad_at_joint) for that pair or raise an explicit exception explaining the
inconsistent pops so remapping and exported gnomAD_AF_* columns remain correct.

---

Nitpick comments:
In `@divref/divref/tools/remap_divref.py`:
- Around line 193-197: The frozenset _MISSING_AF_TOKENS and the _parse_pop_freqs
function now treat the empty string as a missing allele frequency token; add a
concise inline comment above _MISSING_AF_TOKENS explaining the rationale (e.g.,
that some exporters like Hail or certain TSV configurations emit empty strings
for missing values), so future maintainers understand why "" is included
alongside "null" and "NA". Keep the comment short, reference the exporters or
formats that produce empty tokens, and place it directly above the
_MISSING_AF_TOKENS declaration.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 148e100a-1954-48d8-a1bb-c1c1090c2aa3

📥 Commits

Reviewing files that changed from the base of the PR and between 801fd91 and 20fbd14.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • divref/divref/tools/compute_haplotypes.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/divref/tools/remap_divref.py
  • divref/tests/tools/test_remap_divref.py
  • pixi.toml
  • workflows/config/config_schema.yml
  • workflows/generate_divref.smk

Comment thread divref/divref/tools/create_duckdb_index.py
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 23:51 — with GitHub Actions Inactive
@ameynert ameynert merged commit 24a5258 into main May 15, 2026
4 checks passed
@ameynert ameynert deleted the am_diff_pops branch May 15, 2026 19:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant