Skip to content

feat: implement sample adjacency algorithm for computing haplotypes#48

Open
ameynert wants to merge 17 commits into
mainfrom
am_sample_adjacency
Open

feat: implement sample adjacency algorithm for computing haplotypes#48
ameynert wants to merge 17 commits into
mainfrom
am_sample_adjacency

Conversation

@ameynert
Copy link
Copy Markdown
Collaborator

@ameynert ameynert commented May 5, 2026

See next PR for details and comparison

Summary by CodeRabbit

  • New Features

    • Haplotype records now include per-variant component data and new per-haplotype summary metrics.
  • Improvements

    • Unified haplotype computation into a single adjacency/gap-based pipeline with containment-based deduplication and checkpointed intermediate outputs.
    • Filtering now uses variant- and haplotype-frequency thresholds and containment-aware deduplication.
  • Tests

    • Expanded unit-style tests covering block/fragment formation, containment aggregation, metric calculation, and deduplication.
  • Chores

    • Removed legacy window-size config property and updated workflows and cleanup behavior.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR replaces two-pass/windowed haplotype assembly with an adjacency-gap pipeline: compute locus groups, form per-(sample,strand) parent blocks split at gaps, enumerate adjacency-contiguous subfragments, aggregate per-pop containment AC, attach variant component info, compute per-haplotype metrics, apply containment dedup, update callers/workflows, and expand unit tests.

Changes

Haplotype Aggregation Algorithm Overhaul

Layer / File(s) Summary
Data Shape: locus grouping
divref/divref/tools/compute_haplotypes.py (_compute_locus_groups)
Assigns variants to locus_group by cutting gaps ≥ window_size across sorted variants; returns group mapping and count.
Core: parent block formation
divref/divref/tools/compute_haplotypes.py (_form_parent_blocks)
Per-(sample,strand) carrier-ordered variants are split into parent blocks at gaps ≥ window_size; blocks shorter than 2 are dropped.
Core: sub-fragment enumeration
divref/divref/tools/compute_haplotypes.py (_enumerate_subfragments)
Enumerates all adjacency-contiguous sub-fragments (length ≥2) from each parent block, emitting one row per subfragment.
Core: containment AC aggregation
divref/divref/tools/compute_haplotypes.py (_aggregate_containment_ac)
Groups subfragments by haplotype indices and computes per_pop_AC by counting parent-block containment per population.
Enrichment: attach components & metrics
divref/divref/tools/compute_haplotypes.py (_attach_component_info, _compute_metrics)
Driver-side broadcast of variants_ht maps haplotype indices to variant components and per-pop frequencies; computes max_pop, empirical AF/AC, fraction_phased, min_variant_frequency, and estimated_gnomad_AF.
Deduplication
divref/divref/tools/compute_haplotypes.py (_apply_containment_dedup, helpers)
Drops haplotypes subsumed contiguously within longer ones when per_pop_AC vectors match element-wise (driver grouping and contiguous-subarray checks).
Orchestration & checkpoints
divref/divref/tools/compute_haplotypes.py (compute_haplotypes body)
Replaces two-window flow with single pipeline (variants → locus groups → blocks → parents → subfragments → aggregate → attach → metrics → filter → dedup), checkpoints intermediate tables, and writes final keyed haplotype table.

Downstream API & Workflow Simplification

Layer / File(s) Summary
API change
divref/divref/tools/create_duckdb_index.py
build_hgdp_haplotype_table_entries signature removed window_size; split_haplotypes import removed; callers updated to stop passing window_size.
Config schema
workflows/config/config_schema.yml
Removed hgdp_1kg_haplotype_window_size property from the schema.
Workflow wiring
workflows/generate_divref.smk, workflows/create_test_data.smk
compute_haplotypes now uses SEQUENCE_WINDOW_SIZE; test WINDOW_SIZE_COMPUTE_HAPLOTYPES changed 5000→25; intermediate-file cleanup targets updated to explicit checkpoint filenames.

Test Coverage Expansion

Layer / File(s) Summary
Unit tests for helpers & E2E update
divref/tests/tools/test_compute_haplotypes.py
Added extensive unit tests for _form_parent_blocks, _enumerate_subfragments, _aggregate_containment_ac, _attach_component_info, _compute_metrics, and _apply_containment_dedup using synthetic Hail fixtures; updated parametrized test_compute_haplotypes expected row counts. (+~815 lines)

Sequence Diagram

sequenceDiagram
  participant VCF as "VariantsHT"
  participant Prep as "LocusGrouper"
  participant Carriers as "CarrierBuilder"
  participant Blocks as "ParentBlocks"
  participant Frags as "SubFragments"
  participant Agg as "ContainmentAgg"
  participant Enrich as "Enricher"
  participant Dedup as "Deduper"
  participant Out as "OutputTable"

  VCF->>Prep: load and filter variants
  Prep->>Carriers: build per sample strand carrier arrays
  Carriers->>Blocks: sort and split into parent blocks
  Blocks->>Frags: enumerate adjacency contiguous subfragments
  Frags->>Agg: group by haplotype indices and compute per_pop_AC
  Agg->>Enrich: attach variant components and per pop freqs
  Enrich->>Enrich: compute metrics and fractions
  Enrich->>Dedup: emit candidate haplotypes with metrics
  Dedup->>Dedup: remove subsumed haplotypes with identical AC
  Dedup->>Out: write final haplotype table
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • fg-labs/divref-wf#8: Prior PR that introduced the two-pass/windowed compute_haplotypes flow which this PR replaces.
  • fg-labs/divref-wf#31: Modifies test-data workflow window-size usage; likely overlaps with adjusted workflow constants here.
  • fg-labs/divref-wf#39: Related changes touching HGDP haplotype processing and window_size handling used by create_duckdb_index.

Poem

🐰
I hop where variant gaps are told,
I gather fragments, brave and bold,
I count the pops and map each phase,
I prune the small and mark the maze,
A rabbit cheers this tidy mold!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main algorithmic change: replacing a two-pass windowing approach with a sample adjacency algorithm for haplotype computation.
Docstring Coverage ✅ Passed Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch am_sample_adjacency

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ameynert ameynert self-assigned this May 5, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
divref/divref/tools/compute_haplotypes.py (1)

180-197: ⚡ Quick win

Combine row_idx and struct collection into a single aggregate call to eliminate fragile coupling.

Two separate calls—variants_ht.row_idx.collect() and variants_ht.aggregate(hl.agg.collect(...)) on an unkeyed table—have no ordering guarantee between them. Hail's documentation states that "the ordering of rows in a table without a key is undefined." If rows reorder between the two passes, the zip(..., strict=True) will silently pair the wrong components to indices, corrupting the resulting dict. While strict=True catches length mismatches, it cannot detect row-order mismatches.

Combining both into a single aggregate call makes the pairing explicit, eliminates the fragility, and saves one pass over the table.

♻️ Proposed refactor
-    components_dict = hl.literal(
-        dict(
-            zip(
-                variants_ht.row_idx.collect(),
-                variants_ht.aggregate(
-                    hl.agg.collect(
-                        hl.struct(
-                            locus=variants_ht.locus,
-                            alleles=variants_ht.alleles,
-                            freq=variants_ht.freq,
-                            frequencies_by_pop=variants_ht.frequencies_by_pop,
-                        )
-                    )
-                ),
-                strict=True,
-            )
-        )
-    )
+    collected = variants_ht.aggregate(
+        hl.agg.collect(
+            hl.struct(
+                row_idx=variants_ht.row_idx,
+                locus=variants_ht.locus,
+                alleles=variants_ht.alleles,
+                freq=variants_ht.freq,
+                frequencies_by_pop=variants_ht.frequencies_by_pop,
+            )
+        )
+    )
+    components_dict = hl.literal({c.row_idx: c.drop("row_idx") for c in collected})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/divref/tools/compute_haplotypes.py` around lines 180 - 197, The
current code builds components_dict by separately collecting variants_ht.row_idx
and then collecting structs from variants_ht in a second pass and zipping them,
which is fragile because an unkeyed table’s row order is undefined; instead,
change components_dict to a single aggregate over variants_ht that collects
pairs (row_idx, hl.struct(locus=..., alleles=..., freq=...,
frequencies_by_pop=...)) and then turns that single-aggregate result into the
dict (and wrap with hl.literal as before). Target the symbols components_dict,
variants_ht, row_idx, hl.agg.collect, and the hl.struct creation and perform the
pairing inside one variants_ht.aggregate call so the index-to-struct association
cannot get reordered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workflows/generate_divref.smk`:
- Around line 183-185: The cleanup only removes {params.output_base}.variants.ht
but compute_haplotypes produces four checkpoint directories (.variants.ht,
.cols.ht, .parents.ht, .hap_ac.ht); update the cleanup command in the rule (the
block that currently runs rm -r {params.output_base}.variants.ht) to also remove
{params.output_base}.cols.ht, {params.output_base}.parents.ht, and
{params.output_base}.hap_ac.ht so the three leftover checkpoints are deleted
after the rule completes.

---

Nitpick comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 180-197: The current code builds components_dict by separately
collecting variants_ht.row_idx and then collecting structs from variants_ht in a
second pass and zipping them, which is fragile because an unkeyed table’s row
order is undefined; instead, change components_dict to a single aggregate over
variants_ht that collects pairs (row_idx, hl.struct(locus=..., alleles=...,
freq=..., frequencies_by_pop=...)) and then turns that single-aggregate result
into the dict (and wrap with hl.literal as before). Target the symbols
components_dict, variants_ht, row_idx, hl.agg.collect, and the hl.struct
creation and perform the pairing inside one variants_ht.aggregate call so the
index-to-struct association cannot get reordered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38b69893-2136-4f09-ac41-82c26b3d543a

📥 Commits

Reviewing files that changed from the base of the PR and between bd81b8e and a67bb80.

📒 Files selected for processing (6)
  • divref/divref/tools/compute_haplotypes.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/tests/tools/test_compute_haplotypes.py
  • workflows/config/config_schema.yml
  • workflows/create_test_data.smk
  • workflows/generate_divref.smk
💤 Files with no reviewable changes (1)
  • workflows/config/config_schema.yml

Comment thread workflows/generate_divref.smk
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 5, 2026 18:03 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 5, 2026 23:17 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 45-50: The grouping logic uses prev_end as the last variant end
which can shrink when a short variant is seen; change it to track the running
maximum end of the active group: when evaluating each row in the loop over rows
(variable names: v, prev_end, prev_contig, current_group, group_of,
window_size), only start a new group if v.locus.contig != prev_contig or
v.locus.position - prev_end >= window_size, and otherwise update prev_end =
max(prev_end, v.locus.position + v.ref_len); when you do start a new group, set
prev_end to v.locus.position + v.ref_len (and update
prev_contig/current_group/group_of as before).
- Around line 87-96: The current sorting and breakpoint logic in
blocks_ht.annotate (carriers) and the breakpoints calculation is unstable:
change the hl.sorted key to a deterministic tuple (e.g., lambda v:
(v.locus.position, v.ref_len, v.alt or v.some_unique_id)) so equal-position
carriers have a stable order, and replace the single-previous-gap check used in
breakpoints with a running-max-end comparison (compute a running max of
carriers[i-1].locus.position + carriers[i-1].ref_len and test
carriers[i].locus.position against that running max using hl.scan/hl.fold
semantics) so splits occur only when the new variant starts beyond the maximal
end of the current block. Ensure you update references to carriers, n, and
breakpoints accordingly.
- Around line 208-223: The code currently aggregates all variants into a Python
dict (pairs/components_dict) and broadcasts it, risking OOM; instead build a
Hail-side mapping keyed by variants_ht.row_idx and use Hail table lookup/join:
create a components_table from variants_ht selecting the small struct (locus,
alleles, freq, frequencies_by_pop) and key it by row_idx (e.g., components_table
= variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage
in hap_table.annotate by mapping hap_table.haplotype to table lookups like
components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail
and avoids materializing the full payload on the driver.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1d54da1-0a56-4447-8943-410e0deb4173

📥 Commits

Reviewing files that changed from the base of the PR and between 489354b and 47b45e5.

📒 Files selected for processing (2)
  • divref/divref/tools/compute_haplotypes.py
  • workflows/generate_divref.smk

Comment thread divref/divref/tools/compute_haplotypes.py
Comment thread divref/divref/tools/compute_haplotypes.py
Comment on lines +208 to +223
pairs = variants_ht.aggregate(
hl.agg.collect(
hl.tuple([
variants_ht.row_idx,
hl.struct(
locus=variants_ht.locus,
alleles=variants_ht.alleles,
freq=variants_ht.freq,
frequencies_by_pop=variants_ht.frequencies_by_pop,
),
])
)
)
components_dict = hl.literal(dict(pairs))
hap_table = hap_table.annotate(
_components=hap_table.haplotype.map(lambda idx: components_dict[hl.int32(idx)])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file size and locate the function
wc -l divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 107


🏁 Script executed:

# Read the function containing lines 208-223 with context
sed -n '170,230p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 3307


🏁 Script executed:

# Check the function signature and any default parameters
sed -n '1,250p' divref/divref/tools/compute_haplotypes.py | grep -A 50 "^def compute_haplotypes"

Repository: fg-labs/divref-wf

Length of output: 43


🏁 Script executed:

# Find the function that calls _attach_component_info
grep -n "_attach_component_info" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 311


🏁 Script executed:

# Read the compute_haplotypes function signature and check for spark_driver_memory_gb default
sed -n '1,150p' divref/divref/tools/compute_haplotypes.py | head -100

Repository: fg-labs/divref-wf

Length of output: 4172


🏁 Script executed:

# Look for spark_driver_memory_gb definition in the file
grep -n "spark_driver_memory_gb" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 374


🏁 Script executed:

# Let's check what's at lines 208-223 in the file
sed -n '208,223p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 747


🏁 Script executed:

# Also check context around those lines to understand the function
sed -n '200,240p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2084


🏁 Script executed:

# Check if there's variant filtering before the aggregation at line 518
sed -n '500,525p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1658


🏁 Script executed:

# Look for variant_freq_threshold to understand filtering
grep -n "variant_freq_threshold" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 573


🏁 Script executed:

# Let's check the data flow: when is variants_ht filtered before being passed to _attach_component_info?
sed -n '440,480p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2000


🏁 Script executed:

# Check the actual variant struct schema to understand memory footprint - look for freq and frequencies_by_pop definitions
grep -A 20 "freq.*=" divref/divref/tools/compute_haplotypes.py | head -40

Repository: fg-labs/divref-wf

Length of output: 1804


🏁 Script executed:

# Let's see the full compute_haplotypes function context around the _attach_component_info call
sed -n '510,530p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1248


🏁 Script executed:

# Let's check how large variants_ht might be by looking at the filtering logic more closely
sed -n '449,481p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1623


🏁 Script executed:

# Check if there's any documentation or test data that shows typical variant counts
find . -name "*.py" -o -name "*.md" | xargs grep -l "variant_freq_threshold" | head -5

Repository: fg-labs/divref-wf

Length of output: 153


🏁 Script executed:

# Verify the exact line numbers by checking the actual line count from the beginning of _attach_component_info
sed -n '182,225p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2478


🏁 Script executed:

# Let's verify the exact line ranges - check what's at lines 208-223 (absolute file positions)
sed -n '206,225p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 915


🏁 Script executed:

# Check if there are any recent changes or TODOs about this memory pattern
grep -n "TODO\|FIXME\|memory\|aggregate\|driver" divref/divref/tools/compute_haplotypes.py | grep -i "component\|attach"

Repository: fg-labs/divref-wf

Length of output: 43


🏁 Script executed:

# Let's understand the actual variant count constraints - check if variants_ht is filtered before line 518
sed -n '475,520p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2566


Avoid collecting the full variant payload onto the driver here.

This materializes every retained variant, including nested frequency structs, in Python via aggregate(collect(...)), then duplicates it with dict(pairs) before broadcasting back as a literal. With chromosome-scale inputs and a 1 GB default driver, this is a realistic OOM failure path. Prefer a Hail-side join/indexing strategy keyed by row_idx instead of a driver dictionary.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/divref/tools/compute_haplotypes.py` around lines 208 - 223, The code
currently aggregates all variants into a Python dict (pairs/components_dict) and
broadcasts it, risking OOM; instead build a Hail-side mapping keyed by
variants_ht.row_idx and use Hail table lookup/join: create a components_table
from variants_ht selecting the small struct (locus, alleles, freq,
frequencies_by_pop) and key it by row_idx (e.g., components_table =
variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage
in hap_table.annotate by mapping hap_table.haplotype to table lookups like
components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail
and avoids materializing the full payload on the driver.

@ameynert ameynert force-pushed the am_sample_adjacency branch from 47b45e5 to 229c37e Compare May 11, 2026 16:30
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 11, 2026 16:30 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 11, 2026 16:32 — with GitHub Actions Inactive
ameynert added a commit that referenced this pull request May 13, 2026
Within a parent block, gap detection compared only against the
immediately preceding variant end, so a shorter overlapping carrier
could shrink the boundary and force a false split (e.g. 100-150,
120-121, 160 with window_size=20 splits instead of staying connected).
Compute breakpoints against the running max end of carriers[0..i-1].

The carrier sort also keyed only on locus.position, so equal-position
carriers could end up in different row_idx orders across samples even
for the same fragment, which downstream grouping keys on. Sort by
(position, ref_len, row_idx) for deterministic ordering.

Addresses CodeRabbit review on PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:03 — with GitHub Actions Inactive
ameynert added a commit that referenced this pull request May 13, 2026
Within a parent block, gap detection compared only against the
immediately preceding variant end, so a shorter overlapping carrier
could shrink the boundary and force a false split (e.g. 100-150,
120-121, 160 with window_size=20 splits instead of staying connected).
Compute breakpoints against the running max end of carriers[0..i-1].

The carrier sort also keyed only on locus.position, so equal-position
carriers could end up in different row_idx orders across samples even
for the same fragment, which downstream grouping keys on. Sort by
(position, ref_len, row_idx) for deterministic ordering.

Addresses CodeRabbit review on PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_sample_adjacency branch from 42916cf to 4394cf4 Compare May 13, 2026 21:19
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:19 — with GitHub Actions Inactive
ameynert added a commit that referenced this pull request May 15, 2026
Within a parent block, gap detection compared only against the
immediately preceding variant end, so a shorter overlapping carrier
could shrink the boundary and force a false split (e.g. 100-150,
120-121, 160 with window_size=20 splits instead of staying connected).
Compute breakpoints against the running max end of carriers[0..i-1].

The carrier sort also keyed only on locus.position, so equal-position
carriers could end up in different row_idx orders across samples even
for the same fragment, which downstream grouping keys on. Sort by
(position, ref_len, row_idx) for deterministic ordering.

Addresses CodeRabbit review on PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_sample_adjacency branch from 4394cf4 to 0b5e3e3 Compare May 15, 2026 21:45
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 15, 2026 21:45 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
divref/tests/tools/test_compute_haplotypes.py (1)

11-16: 🏗️ Heavy lift

Avoid coupling tests to private helper internals.

The new suite heavily imports/targets underscored helpers, which makes tests fragile to internal refactors. Prefer validating these invariants via stable public behavior (or promote helpers to explicit public contracts if they must be tested directly).

As per coding guidelines, "Test behavior, not implementation—tests should survive refactoring."

Also applies to: 85-797

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/tests/tools/test_compute_haplotypes.py` around lines 11 - 16, Tests
import and assert on private underscored helpers (_aggregate_containment_ac,
_apply_containment_dedup, _attach_component_info, _compute_metrics,
_enumerate_subfragments, _form_parent_blocks), which couples tests to
implementation; update the tests to exercise the stable public API instead (call
the top-level public function(s) that use these helpers and assert on observable
outputs/side effects), or if a helper truly needs direct testing, promote it to
a public symbol by renaming and exporting it (remove the leading underscore and
update imports) so tests target a stable contract rather than internal details.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@divref/tests/tools/test_compute_haplotypes.py`:
- Around line 11-16: Tests import and assert on private underscored helpers
(_aggregate_containment_ac, _apply_containment_dedup, _attach_component_info,
_compute_metrics, _enumerate_subfragments, _form_parent_blocks), which couples
tests to implementation; update the tests to exercise the stable public API
instead (call the top-level public function(s) that use these helpers and assert
on observable outputs/side effects), or if a helper truly needs direct testing,
promote it to a public symbol by renaming and exporting it (remove the leading
underscore and update imports) so tests target a stable contract rather than
internal details.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51760d1b-fc53-4a20-8353-f20d8f9fc103

📥 Commits

Reviewing files that changed from the base of the PR and between b6ff60c and 0b5e3e3.

📒 Files selected for processing (6)
  • divref/divref/tools/compute_haplotypes.py
  • divref/divref/tools/create_duckdb_index.py
  • divref/tests/tools/test_compute_haplotypes.py
  • workflows/config/config_schema.yml
  • workflows/create_test_data.smk
  • workflows/generate_divref.smk
💤 Files with no reviewable changes (1)
  • workflows/config/config_schema.yml
🚧 Files skipped from review as they are similar to previous changes (3)
  • workflows/create_test_data.smk
  • workflows/generate_divref.smk
  • divref/divref/tools/create_duckdb_index.py

ameynert and others added 12 commits May 19, 2026 16:32
Forms per-(sample, strand) parent blocks from a cols-table of alt-carrier
arrays via adjacency-gap cutting at window_size. Drops blocks of length < 2.
First step toward replacing the fixed-bin two-windower build in
compute_haplotypes with single-window per-sample adjacency.

Lives alongside the existing _get_haplotypes for now; the swap happens in a
later commit once all helpers are in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each parent block of length N, emits N(N-1)/2 sub-fragments — every
adjacency-contiguous slice of length >= 2, including the full block. Lays
the groundwork for downstream containment-AC aggregation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Groups sub-fragment rows by haplotype (row_idx tuple) and produces a per-
population AC vector of length n_pops, indexed by pop_int. Counts the
parent blocks (one row per (sample, strand, parent block, sub-fragment))
containing each haplotype as adjacency-contiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joins each haplotype row's row_idx array against the variants table to
attach parallel-length variants/gnomad_freqs/frequencies_by_pop arrays.
Uses a broadcast dict (driver-side collect of variants_ht) keyed by
row_idx; lookup casts to int32 since Python ints encode that way through
hl.literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Annotates per-haplotype frequency / phasing summary fields:
max_pop, max_empirical_AF, max_empirical_AC, min_variant_frequency,
all_pop_freqs (sorted by AF desc, missing AF sorts to end),
fraction_phased, estimated_gnomad_AF.

Per-population empirical_AF = per_pop_AC[p] / min over component variants
of frequencies_by_pop[p].AN; missing when min_AN == 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops haplotype rows subsumed by a strictly longer adjacency-contiguous
fragment with identical per-pop AC. Driver-side scan: collect rows, group
by AC tuple, pairwise contiguous-substring check within each group;
filter the Hail table by stringified-haplotype set. Helpers
_is_contiguous_subarray and _find_subsumed_haplotypes factored out for
testability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rithm

Replaces the fixed-bin two-windower union with the per-sample adjacency
pipeline composed of the six helpers added in prior commits.

Deletes _get_haplotypes and its three nested helpers. Removes the
.[12].ht intermediate writes; adds .cols.ht, .parents.ht, .hap_ac.ht
checkpoints between major stages to break Hail IR plan accumulation.

_compute_metrics: switch frequencies_by_pop[p] direct indexing to
.get(p, missing) so populations absent from the per-variant call-stats
dict do not raise.

Updates the test_compute_haplotypes parametrize from
[517, 295, 30, 33] to [457, 283, 43, 35].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ypes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop split_haplotypes + distinct() transformation logic from
build_hgdp_haplotype_table_entries. Haplotypes are already at the
correct granularity from compute_haplotypes (per-sample adjacency
with containment dedup), so re-fragmenting them here would only
underestimate AC by discarding counts via distinct().

- Remove split_haplotypes import (still used by compute_haplotype_statistics)
- Remove window_size parameter from build_hgdp_haplotype_table_entries
- Drop early .distinct(), post-split key_by("haplotype").distinct(),
  and the diagnostic count() log lines
- Update build_contig_sequences_table caller and create_duckdb_index
  docstring (window_size now solely drives FASTA flanking context)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new compute_haplotypes per-sample-adjacency algorithm uses a single
window size, and create_duckdb_index no longer re-splits haplotypes —
so the dual hgdp_1kg_haplotype_window_size / sequence_window_size knobs
collapse into one.

- generate_divref.smk: remove HGDP_1KG_HAPLOTYPE_WINDOW_SIZE constant;
  compute_haplotypes rule now uses SEQUENCE_WINDOW_SIZE; cleanup line
  drops .[12].ht (no longer produced)
- create_test_data.smk: WINDOW_SIZE_COMPUTE_HAPLOTYPES 5000 -> 25 to
  match sequence_window_size schema default
- config_schema.yml: remove hgdp_1kg_haplotype_window_size property

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Within each per-pop AC group, candidates are already sorted by haplotype
length descending. Add an explicit `len(h_long) <= n_short` break so the
inner scan exits as soon as the running candidate can no longer strictly
subsume the target — proper containment requires len(h_long) > len(h_short).

Worst case stays O(G²); large groups with a wide length range now prune
most pair comparisons. Mitigation for risk #2 in the algorithm rewrite plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ameynert and others added 4 commits May 19, 2026 16:33
Within a parent block, gap detection compared only against the
immediately preceding variant end, so a shorter overlapping carrier
could shrink the boundary and force a false split (e.g. 100-150,
120-121, 160 with window_size=20 splits instead of staying connected).
Compute breakpoints against the running max end of carriers[0..i-1].

The carrier sort also keyed only on locus.position, so equal-position
carriers could end up in different row_idx orders across samples even
for the same fragment, which downstream grouping keys on. Sort by
(position, ref_len, row_idx) for deterministic ordering.

Addresses CodeRabbit review on PR #48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_sample_adjacency branch from 0b5e3e3 to e284b31 Compare May 19, 2026 23:33
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 19, 2026 23:33 — with GitHub Actions Inactive
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting May 20, 2026 00:03 — with GitHub Actions Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant