feat: allow different population lists for HGDP haplotype and gnomAD variant inputs by ameynert · Pull Request #52 · fg-labs/divref-wf

ameynert · 2026-05-13T21:55:02Z

Summary by CodeRabbit

Configuration
- Split population configuration into separate hgdp_1kg_populations and gnomad_variant_populations parameters for more granular control over population selections in extraction workflows.
Chores
- Added samtools dependency (≥1.23.1, <2).
Refactor
- Enhanced internal haplotype population frequency handling to support dynamic population legends.

coderabbitai · 2026-05-13T21:55:13Z

Warning

Rate limit exceeded

@ameynert has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03526d2b-9087-475e-9e1c-c1ebac8dc16b

📥 Commits

Reviewing files that changed from the base of the PR and between 20fbd14 and 9956358.

📒 Files selected for processing (6)

divref/divref/tools/compute_haplotypes.py
divref/divref/tools/create_duckdb_index.py
divref/divref/tools/remap_divref.py
divref/tests/tools/test_remap_divref.py
workflows/config/config_schema.yml
workflows/generate_divref.smk

📝 Walkthrough

Walkthrough

This PR consolidates separate HGDP and gnomAD population legends into a unified joint legend throughout the DivRef pipeline. The workflow configuration now splits population inputs into independent HGDP+1KG and gnomAD variant lists; DuckDB indexing computes remapping arrays to align both sources to the joint legend; and the Haplotype model is refactored to accept dynamically-named per-population AF columns.

Changes

Population Legend Consolidation Across HGDP and gnomAD Data Sources

Layer / File(s)	Summary
Configuration and Workflow Split `workflows/config/config_schema.yml`, `workflows/generate_divref.smk`	Schema replaces the generic `populations` field with `hgdp_1kg_populations` and introduces `gnomad_variant_populations`. The workflow removes the single `POPS` variable and introduces `HGDP_1KG_POPS` and `GNOMAD_VARIANT_POPS` to separately parameterize the two gnomAD extraction rules.
DuckDB Index Legend Consolidation and Remapping `divref/divref/tools/create_duckdb_index.py`	Separate HGDP and gnomAD legends are extracted and merged into `joint_pops_legend` (gnomAD populations first, then HGDP-only). Bidirectional remapping arrays are computed (source→joint and joint→source with `-1` padding) and threaded through `build_hgdp_haplotype_table_entries` and `build_gnomad_variant_table_entries` to align both sources to the joint legend. DuckDB now persists three legend tables instead of one.
TSV Export and Chunk Reading `divref/divref/tools/create_duckdb_index.py`	Export functions are updated to generate per-population `gnomAD_AF_{pop}` columns from pre-remapped `gnomad_freqs` arrays using joint-legend order. Schema overrides in chunk reading are keyed off `joint_pops_legend` to preserve string typing for variant identifiers and AF columns.
Haplotype Model Refactoring `divref/divref/tools/remap_divref.py`	`Haplotype` model replaces fixed per-population gnomAD AF fields with a generic `gnomad_afs: dict[str, str]` dictionary. A new `from_row` classmethod populates this dict from ordered population legends read from the DuckDB index. Population frequency computation now derives `all_pop_freqs` from the generic dict, and missing-token handling is broadened to treat `"NA"` and empty strings as `0.0`.
Test Fixture Updates and Dependencies `divref/tests/tools/test_remap_divref.py`, `pixi.toml`	Test helper `create_haplotype` is refactored to accept a `gnomad_afs` dictionary with sensible per-population defaults. Existing tests are updated to pass the new dict pattern. New test coverage is added for `NA` token handling and `Haplotype.from_row` constructor behavior. Samtools dependency version `>=1.23.1,<2` is added to pixi.
Compute Haplotypes Output Finalization `divref/divref/tools/compute_haplotypes.py`	The final union haplotype table is annotated with a `pops` global containing the gnomAD population legend before writing output.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fg-labs/divref-wf#8: Introduces compute_haplotypes tool; this PR extends it to attach gnomAD pop_legend metadata to the final output table.
fg-labs/divref-wf#44: Also decouples HGDP+1KG haplotype inputs from gnomAD variant AF extraction in the workflow configuration and rule parameterization.
fg-labs/divref-wf#34: Updates to DuckDB legend remapping and per-population gnomAD_AF_* column generation affect downstream allele-frequency comparison schemas consumed by this PR's tooling.

Poem

🐰 Two legends walk into a merged table,
HGDP meets gnomAD, a unified fable,
Remap arrays shuffle and pad with such grace,
Per-pop frequencies find their rightful place,
One joint legend rules them all, in style! 🌟

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title clearly and accurately describes the main change: allowing different population lists for HGDP haplotype and gnomAD variant inputs, which is the primary objective across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch am_diff_pops

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…dp inputs

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

divref/divref/tools/remap_divref.py (1)
193-197: 💤 Low value

Consider clarifying empty-string handling rationale.

The _MISSING_AF_TOKENS frozenset now includes "" (empty string) alongside "null" and "NA". While the test coverage confirms this is intentional, a brief inline comment explaining why empty strings should be treated as missing (e.g., "empty string occurs when Hail exports missing values in certain TSV configurations") would help future maintainers understand the design choice.
📝 Suggested comment addition
-_MISSING_AF_TOKENS = frozenset({"null", "NA", ""})
+# Treat "null" (DuckDB NULL), "NA" (Hail missing token), and "" (Hail TSV empty cell) as missing.
+_MISSING_AF_TOKENS = frozenset({"null", "NA", ""})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/divref/tools/remap_divref.py` around lines 193 - 197, The frozenset
_MISSING_AF_TOKENS and the _parse_pop_freqs function now treat the empty string
as a missing allele frequency token; add a concise inline comment above
_MISSING_AF_TOKENS explaining the rationale (e.g., that some exporters like Hail
or certain TSV configurations emit empty strings for missing values), so future
maintainers understand why "" is included alongside "null" and "NA". Keep the
comment short, reference the exporters or formats that produce empty tokens, and
place it directly above the _MISSING_AF_TOKENS declaration.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@divref/divref/tools/create_duckdb_index.py`:
- Around line 141-159: Ensure table_pairs is non-empty and validate that each
pair's haplotype/sites tables share the same pops legend before reusing remaps:
check table_pairs length and raise a clear error if empty, then iterate over
table_pairs and read each table_pairs[i].haplotype_table_path and
table_pairs[i].sites_table_path to collect their .pops and confirm they match
hgdp_pops_legend and gnomad_pops_legend (or recompute joint_pops_legend
per-pair) instead of blindly using table_pairs[0]; if any mismatch is found,
either compute per-pair remaps (hgdp_to_joint, gnomad_to_joint, hgdp_at_joint,
gnomad_at_joint) for that pair or raise an explicit exception explaining the
inconsistent pops so remapping and exported gnomAD_AF_* columns remain correct.

---

Nitpick comments:
In `@divref/divref/tools/remap_divref.py`:
- Around line 193-197: The frozenset _MISSING_AF_TOKENS and the _parse_pop_freqs
function now treat the empty string as a missing allele frequency token; add a
concise inline comment above _MISSING_AF_TOKENS explaining the rationale (e.g.,
that some exporters like Hail or certain TSV configurations emit empty strings
for missing values), so future maintainers understand why "" is included
alongside "null" and "NA". Keep the comment short, reference the exporters or
formats that produce empty tokens, and place it directly above the
_MISSING_AF_TOKENS declaration.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 148e100a-1954-48d8-a1bb-c1c1090c2aa3

📥 Commits

Reviewing files that changed from the base of the PR and between 801fd91 and 20fbd14.

⛔ Files ignored due to path filters (1)

pixi.lock is excluded by !**/*.lock

📒 Files selected for processing (7)

divref/divref/tools/compute_haplotypes.py
divref/divref/tools/create_duckdb_index.py
divref/divref/tools/remap_divref.py
divref/tests/tools/test_remap_divref.py
pixi.toml
workflows/config/config_schema.yml
workflows/generate_divref.smk

ameynert added 4 commits May 13, 2026 14:56

feat: allow different populations for gnomad and haplotypes

0ab491b

feat: annotate the populations as a global

be628dc

feat: handle potentially different population lists b/w gnomad and hg…

6d40944

…dp inputs

feat: remove hard-coding of populations from remap_divref script

57c2997

ameynert force-pushed the am_diff_pops branch from 20fbd14 to 57c2997 Compare May 13, 2026 21:56

ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:56 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread divref/divref/tools/create_duckdb_index.py

feat: address coderabbitai comments

9956358

ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 23:51 — with GitHub Actions Inactive

ameynert merged commit 24a5258 into main May 15, 2026
4 checks passed

ameynert deleted the am_diff_pops branch May 15, 2026 19:52

coderabbitai Bot mentioned this pull request May 19, 2026

feat: only add gnomAD variant sites for chrX and chrY; no haplotypes #53

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: allow different population lists for HGDP haplotype and gnomAD variant inputs#52

feat: allow different population lists for HGDP haplotype and gnomAD variant inputs#52
ameynert merged 5 commits into
mainfrom
am_diff_pops

ameynert commented May 13, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 13, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ameynert commented May 13, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ameynert commented May 13, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading