feat: implement sample adjacency algorithm for computing haplotypes by ameynert · Pull Request #48 · fg-labs/divref-wf

ameynert · 2026-05-05T00:47:23Z

See next PR for details and comparison

Summary by CodeRabbit

New Features
- Haplotype records now include per-variant component data and new per-haplotype summary metrics.
Improvements
- Unified haplotype computation into a single adjacency/gap-based pipeline with containment-based deduplication and checkpointed intermediate outputs.
- Filtering now uses variant- and haplotype-frequency thresholds and containment-aware deduplication.
Tests
- Expanded unit-style tests covering block/fragment formation, containment aggregation, metric calculation, and deduplication.
Chores
- Removed legacy window-size config property and updated workflows and cleanup behavior.

coderabbitai · 2026-05-05T00:47:34Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

This PR replaces two-pass/windowed haplotype assembly with an adjacency-gap pipeline: compute locus groups, form per-(sample,strand) parent blocks split at gaps, enumerate adjacency-contiguous subfragments, aggregate per-pop containment AC, attach variant component info, compute per-haplotype metrics, apply containment dedup, update callers/workflows, and expand unit tests.

Changes

Haplotype Aggregation Algorithm Overhaul

Layer / File(s)	Summary
Data Shape: locus grouping `divref/divref/tools/compute_haplotypes.py` (`_compute_locus_groups`)	Assigns variants to `locus_group` by cutting gaps ≥ `window_size` across sorted variants; returns group mapping and count.
Core: parent block formation `divref/divref/tools/compute_haplotypes.py` (`_form_parent_blocks`)	Per-(sample,strand) carrier-ordered variants are split into parent blocks at gaps ≥ `window_size`; blocks shorter than 2 are dropped.
Core: sub-fragment enumeration `divref/divref/tools/compute_haplotypes.py` (`_enumerate_subfragments`)	Enumerates all adjacency-contiguous sub-fragments (length ≥2) from each parent block, emitting one row per subfragment.
Core: containment AC aggregation `divref/divref/tools/compute_haplotypes.py` (`_aggregate_containment_ac`)	Groups subfragments by haplotype indices and computes `per_pop_AC` by counting parent-block containment per population.
Enrichment: attach components & metrics `divref/divref/tools/compute_haplotypes.py` (`_attach_component_info`, `_compute_metrics`)	Driver-side broadcast of `variants_ht` maps haplotype indices to variant components and per-pop frequencies; computes `max_pop`, empirical AF/AC, `fraction_phased`, `min_variant_frequency`, and `estimated_gnomad_AF`.
Deduplication `divref/divref/tools/compute_haplotypes.py` (`_apply_containment_dedup`, helpers)	Drops haplotypes subsumed contiguously within longer ones when `per_pop_AC` vectors match element-wise (driver grouping and contiguous-subarray checks).
Orchestration & checkpoints `divref/divref/tools/compute_haplotypes.py` (compute_haplotypes body)	Replaces two-window flow with single pipeline (variants → locus groups → blocks → parents → subfragments → aggregate → attach → metrics → filter → dedup), checkpoints intermediate tables, and writes final keyed haplotype table.

Downstream API & Workflow Simplification

Layer / File(s)	Summary
API change `divref/divref/tools/create_duckdb_index.py`	`build_hgdp_haplotype_table_entries` signature removed `window_size`; `split_haplotypes` import removed; callers updated to stop passing `window_size`.
Config schema `workflows/config/config_schema.yml`	Removed `hgdp_1kg_haplotype_window_size` property from the schema.
Workflow wiring `workflows/generate_divref.smk`, `workflows/create_test_data.smk`	`compute_haplotypes` now uses `SEQUENCE_WINDOW_SIZE`; test `WINDOW_SIZE_COMPUTE_HAPLOTYPES` changed 5000→25; intermediate-file cleanup targets updated to explicit checkpoint filenames.

Test Coverage Expansion

Layer / File(s)	Summary
Unit tests for helpers & E2E update `divref/tests/tools/test_compute_haplotypes.py`	Added extensive unit tests for `_form_parent_blocks`, `_enumerate_subfragments`, `_aggregate_containment_ac`, `_attach_component_info`, `_compute_metrics`, and `_apply_containment_dedup` using synthetic Hail fixtures; updated parametrized `test_compute_haplotypes` expected row counts. (+~815 lines)

Sequence Diagram

sequenceDiagram
  participant VCF as "VariantsHT"
  participant Prep as "LocusGrouper"
  participant Carriers as "CarrierBuilder"
  participant Blocks as "ParentBlocks"
  participant Frags as "SubFragments"
  participant Agg as "ContainmentAgg"
  participant Enrich as "Enricher"
  participant Dedup as "Deduper"
  participant Out as "OutputTable"

  VCF->>Prep: load and filter variants
  Prep->>Carriers: build per sample strand carrier arrays
  Carriers->>Blocks: sort and split into parent blocks
  Blocks->>Frags: enumerate adjacency contiguous subfragments
  Frags->>Agg: group by haplotype indices and compute per_pop_AC
  Agg->>Enrich: attach variant components and per pop freqs
  Enrich->>Enrich: compute metrics and fractions
  Enrich->>Dedup: emit candidate haplotypes with metrics
  Dedup->>Dedup: remove subsumed haplotypes with identical AC
  Dedup->>Out: write final haplotype table

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

fg-labs/divref-wf#8: Prior PR that introduced the two-pass/windowed compute_haplotypes flow which this PR replaces.
fg-labs/divref-wf#31: Modifies test-data workflow window-size usage; likely overlaps with adjusted workflow constants here.
fg-labs/divref-wf#39: Related changes touching HGDP haplotype processing and window_size handling used by create_duckdb_index.

Poem

🐰
I hop where variant gaps are told,
I gather fragments, brave and bold,
I count the pops and map each phase,
I prune the small and mark the maze,
A rabbit cheers this tidy mold!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main algorithmic change: replacing a two-pass windowing approach with a sample adjacency algorithm for haplotype computation.
Docstring Coverage	✅ Passed	Docstring coverage is 95.83% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch am_sample_adjacency

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

divref/divref/tools/compute_haplotypes.py (1)

180-197: ⚡ Quick win

Combine row_idx and struct collection into a single aggregate call to eliminate fragile coupling.

Two separate calls—variants_ht.row_idx.collect() and variants_ht.aggregate(hl.agg.collect(...)) on an unkeyed table—have no ordering guarantee between them. Hail's documentation states that "the ordering of rows in a table without a key is undefined." If rows reorder between the two passes, the zip(..., strict=True) will silently pair the wrong components to indices, corrupting the resulting dict. While strict=True catches length mismatches, it cannot detect row-order mismatches.

Combining both into a single aggregate call makes the pairing explicit, eliminates the fragility, and saves one pass over the table.

♻️ Proposed refactor

-    components_dict = hl.literal(
-        dict(
-            zip(
-                variants_ht.row_idx.collect(),
-                variants_ht.aggregate(
-                    hl.agg.collect(
-                        hl.struct(
-                            locus=variants_ht.locus,
-                            alleles=variants_ht.alleles,
-                            freq=variants_ht.freq,
-                            frequencies_by_pop=variants_ht.frequencies_by_pop,
-                        )
-                    )
-                ),
-                strict=True,
-            )
-        )
-    )
+    collected = variants_ht.aggregate(
+        hl.agg.collect(
+            hl.struct(
+                row_idx=variants_ht.row_idx,
+                locus=variants_ht.locus,
+                alleles=variants_ht.alleles,
+                freq=variants_ht.freq,
+                frequencies_by_pop=variants_ht.frequencies_by_pop,
+            )
+        )
+    )
+    components_dict = hl.literal({c.row_idx: c.drop("row_idx") for c in collected})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/divref/tools/compute_haplotypes.py` around lines 180 - 197, The
current code builds components_dict by separately collecting variants_ht.row_idx
and then collecting structs from variants_ht in a second pass and zipping them,
which is fragile because an unkeyed table’s row order is undefined; instead,
change components_dict to a single aggregate over variants_ht that collects
pairs (row_idx, hl.struct(locus=..., alleles=..., freq=...,
frequencies_by_pop=...)) and then turns that single-aggregate result into the
dict (and wrap with hl.literal as before). Target the symbols components_dict,
variants_ht, row_idx, hl.agg.collect, and the hl.struct creation and perform the
pairing inside one variants_ht.aggregate call so the index-to-struct association
cannot get reordered.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@workflows/generate_divref.smk`:
- Around line 183-185: The cleanup only removes {params.output_base}.variants.ht
but compute_haplotypes produces four checkpoint directories (.variants.ht,
.cols.ht, .parents.ht, .hap_ac.ht); update the cleanup command in the rule (the
block that currently runs rm -r {params.output_base}.variants.ht) to also remove
{params.output_base}.cols.ht, {params.output_base}.parents.ht, and
{params.output_base}.hap_ac.ht so the three leftover checkpoints are deleted
after the rule completes.

---

Nitpick comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 180-197: The current code builds components_dict by separately
collecting variants_ht.row_idx and then collecting structs from variants_ht in a
second pass and zipping them, which is fragile because an unkeyed table’s row
order is undefined; instead, change components_dict to a single aggregate over
variants_ht that collects pairs (row_idx, hl.struct(locus=..., alleles=...,
freq=..., frequencies_by_pop=...)) and then turns that single-aggregate result
into the dict (and wrap with hl.literal as before). Target the symbols
components_dict, variants_ht, row_idx, hl.agg.collect, and the hl.struct
creation and perform the pairing inside one variants_ht.aggregate call so the
index-to-struct association cannot get reordered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 38b69893-2136-4f09-ac41-82c26b3d543a

📥 Commits

Reviewing files that changed from the base of the PR and between bd81b8e and a67bb80.

📒 Files selected for processing (6)

divref/divref/tools/compute_haplotypes.py
divref/divref/tools/create_duckdb_index.py
divref/tests/tools/test_compute_haplotypes.py
workflows/config/config_schema.yml
workflows/create_test_data.smk
workflows/generate_divref.smk

💤 Files with no reviewable changes (1)

workflows/config/config_schema.yml

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 45-50: The grouping logic uses prev_end as the last variant end
which can shrink when a short variant is seen; change it to track the running
maximum end of the active group: when evaluating each row in the loop over rows
(variable names: v, prev_end, prev_contig, current_group, group_of,
window_size), only start a new group if v.locus.contig != prev_contig or
v.locus.position - prev_end >= window_size, and otherwise update prev_end =
max(prev_end, v.locus.position + v.ref_len); when you do start a new group, set
prev_end to v.locus.position + v.ref_len (and update
prev_contig/current_group/group_of as before).
- Around line 87-96: The current sorting and breakpoint logic in
blocks_ht.annotate (carriers) and the breakpoints calculation is unstable:
change the hl.sorted key to a deterministic tuple (e.g., lambda v:
(v.locus.position, v.ref_len, v.alt or v.some_unique_id)) so equal-position
carriers have a stable order, and replace the single-previous-gap check used in
breakpoints with a running-max-end comparison (compute a running max of
carriers[i-1].locus.position + carriers[i-1].ref_len and test
carriers[i].locus.position against that running max using hl.scan/hl.fold
semantics) so splits occur only when the new variant starts beyond the maximal
end of the current block. Ensure you update references to carriers, n, and
breakpoints accordingly.
- Around line 208-223: The code currently aggregates all variants into a Python
dict (pairs/components_dict) and broadcasts it, risking OOM; instead build a
Hail-side mapping keyed by variants_ht.row_idx and use Hail table lookup/join:
create a components_table from variants_ht selecting the small struct (locus,
alleles, freq, frequencies_by_pop) and key it by row_idx (e.g., components_table
= variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage
in hap_table.annotate by mapping hap_table.haplotype to table lookups like
components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail
and avoids materializing the full payload on the driver.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1d54da1-0a56-4447-8943-410e0deb4173

📥 Commits

Reviewing files that changed from the base of the PR and between 489354b and 47b45e5.

📒 Files selected for processing (2)

divref/divref/tools/compute_haplotypes.py
workflows/generate_divref.smk

coderabbitai · 2026-05-05T23:28:27Z

+    pairs = variants_ht.aggregate(
+        hl.agg.collect(
+            hl.tuple([
+                variants_ht.row_idx,
+                hl.struct(
+                    locus=variants_ht.locus,
+                    alleles=variants_ht.alleles,
+                    freq=variants_ht.freq,
+                    frequencies_by_pop=variants_ht.frequencies_by_pop,
+                ),
+            ])
+        )
    )
+    components_dict = hl.literal(dict(pairs))
+    hap_table = hap_table.annotate(
+        _components=hap_table.haplotype.map(lambda idx: components_dict[hl.int32(idx)])


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file size and locate the function wc -l divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 107

🏁 Script executed:

# Read the function containing lines 208-223 with context sed -n '170,230p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 3307

🏁 Script executed:

# Check the function signature and any default parameters sed -n '1,250p' divref/divref/tools/compute_haplotypes.py | grep -A 50 "^def compute_haplotypes"

Repository: fg-labs/divref-wf

Length of output: 43

🏁 Script executed:

# Find the function that calls _attach_component_info grep -n "_attach_component_info" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 311

🏁 Script executed:

# Read the compute_haplotypes function signature and check for spark_driver_memory_gb default sed -n '1,150p' divref/divref/tools/compute_haplotypes.py | head -100

Repository: fg-labs/divref-wf

Length of output: 4172

🏁 Script executed:

# Look for spark_driver_memory_gb definition in the file grep -n "spark_driver_memory_gb" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 374

🏁 Script executed:

# Let's check what's at lines 208-223 in the file sed -n '208,223p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 747

🏁 Script executed:

# Also check context around those lines to understand the function sed -n '200,240p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2084

🏁 Script executed:

# Check if there's variant filtering before the aggregation at line 518 sed -n '500,525p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1658

🏁 Script executed:

# Look for variant_freq_threshold to understand filtering grep -n "variant_freq_threshold" divref/divref/tools/compute_haplotypes.py

Repository: fg-labs/divref-wf

Length of output: 573

🏁 Script executed:

# Let's check the data flow: when is variants_ht filtered before being passed to _attach_component_info? sed -n '440,480p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2000

🏁 Script executed:

# Check the actual variant struct schema to understand memory footprint - look for freq and frequencies_by_pop definitions grep -A 20 "freq.*=" divref/divref/tools/compute_haplotypes.py | head -40

Repository: fg-labs/divref-wf

Length of output: 1804

🏁 Script executed:

# Let's see the full compute_haplotypes function context around the _attach_component_info call sed -n '510,530p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1248

🏁 Script executed:

# Let's check how large variants_ht might be by looking at the filtering logic more closely sed -n '449,481p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 1623

🏁 Script executed:

# Check if there's any documentation or test data that shows typical variant counts find . -name "*.py" -o -name "*.md" | xargs grep -l "variant_freq_threshold" | head -5

Repository: fg-labs/divref-wf

Length of output: 153

🏁 Script executed:

# Verify the exact line numbers by checking the actual line count from the beginning of _attach_component_info sed -n '182,225p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2478

🏁 Script executed:

# Let's verify the exact line ranges - check what's at lines 208-223 (absolute file positions) sed -n '206,225p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 915

🏁 Script executed:

# Check if there are any recent changes or TODOs about this memory pattern grep -n "TODO\|FIXME\|memory\|aggregate\|driver" divref/divref/tools/compute_haplotypes.py | grep -i "component\|attach"

Repository: fg-labs/divref-wf

Length of output: 43

🏁 Script executed:

# Let's understand the actual variant count constraints - check if variants_ht is filtered before line 518 sed -n '475,520p' divref/divref/tools/compute_haplotypes.py | cat -n

Repository: fg-labs/divref-wf

Length of output: 2566

Avoid collecting the full variant payload onto the driver here.

This materializes every retained variant, including nested frequency structs, in Python via aggregate(collect(...)), then duplicates it with dict(pairs) before broadcasting back as a literal. With chromosome-scale inputs and a 1 GB default driver, this is a realistic OOM failure path. Prefer a Hail-side join/indexing strategy keyed by row_idx instead of a driver dictionary.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@divref/divref/tools/compute_haplotypes.py` around lines 208 - 223, The code currently aggregates all variants into a Python dict (pairs/components_dict) and broadcasts it, risking OOM; instead build a Hail-side mapping keyed by variants_ht.row_idx and use Hail table lookup/join: create a components_table from variants_ht selecting the small struct (locus, alleles, freq, frequencies_by_pop) and key it by row_idx (e.g., components_table = variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage in hap_table.annotate by mapping hap_table.haplotype to table lookups like components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail and avoids materializing the full payload on the driver.

Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai

🧹 Nitpick comments (1)

divref/tests/tools/test_compute_haplotypes.py (1)
11-16: 🏗️ Heavy lift

Avoid coupling tests to private helper internals.

The new suite heavily imports/targets underscored helpers, which makes tests fragile to internal refactors. Prefer validating these invariants via stable public behavior (or promote helpers to explicit public contracts if they must be tested directly).

As per coding guidelines, "Test behavior, not implementation—tests should survive refactoring."

Also applies to: 85-797
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@divref/tests/tools/test_compute_haplotypes.py` around lines 11 - 16, Tests
import and assert on private underscored helpers (_aggregate_containment_ac,
_apply_containment_dedup, _attach_component_info, _compute_metrics,
_enumerate_subfragments, _form_parent_blocks), which couples tests to
implementation; update the tests to exercise the stable public API instead (call
the top-level public function(s) that use these helpers and assert on observable
outputs/side effects), or if a helper truly needs direct testing, promote it to
a public symbol by renaming and exporting it (remove the leading underscore and
update imports) so tests target a stable contract rather than internal details.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@divref/tests/tools/test_compute_haplotypes.py`:
- Around line 11-16: Tests import and assert on private underscored helpers
(_aggregate_containment_ac, _apply_containment_dedup, _attach_component_info,
_compute_metrics, _enumerate_subfragments, _form_parent_blocks), which couples
tests to implementation; update the tests to exercise the stable public API
instead (call the top-level public function(s) that use these helpers and assert
on observable outputs/side effects), or if a helper truly needs direct testing,
promote it to a public symbol by renaming and exporting it (remove the leading
underscore and update imports) so tests target a stable contract rather than
internal details.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 51760d1b-fc53-4a20-8353-f20d8f9fc103

📥 Commits

Reviewing files that changed from the base of the PR and between b6ff60c and 0b5e3e3.

📒 Files selected for processing (6)

divref/divref/tools/compute_haplotypes.py
divref/divref/tools/create_duckdb_index.py
divref/tests/tools/test_compute_haplotypes.py
workflows/config/config_schema.yml
workflows/create_test_data.smk
workflows/generate_divref.smk

💤 Files with no reviewable changes (1)

workflows/config/config_schema.yml

🚧 Files skipped from review as they are similar to previous changes (3)

workflows/create_test_data.smk
workflows/generate_divref.smk
divref/divref/tools/create_duckdb_index.py

Forms per-(sample, strand) parent blocks from a cols-table of alt-carrier arrays via adjacency-gap cutting at window_size. Drops blocks of length < 2. First step toward replacing the fixed-bin two-windower build in compute_haplotypes with single-window per-sample adjacency. Lives alongside the existing _get_haplotypes for now; the swap happens in a later commit once all helpers are in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For each parent block of length N, emits N(N-1)/2 sub-fragments — every adjacency-contiguous slice of length >= 2, including the full block. Lays the groundwork for downstream containment-AC aggregation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Groups sub-fragment rows by haplotype (row_idx tuple) and produces a per- population AC vector of length n_pops, indexed by pop_int. Counts the parent blocks (one row per (sample, strand, parent block, sub-fragment)) containing each haplotype as adjacency-contiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Joins each haplotype row's row_idx array against the variants table to attach parallel-length variants/gnomad_freqs/frequencies_by_pop arrays. Uses a broadcast dict (driver-side collect of variants_ht) keyed by row_idx; lookup casts to int32 since Python ints encode that way through hl.literal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Annotates per-haplotype frequency / phasing summary fields: max_pop, max_empirical_AF, max_empirical_AC, min_variant_frequency, all_pop_freqs (sorted by AF desc, missing AF sorts to end), fraction_phased, estimated_gnomad_AF. Per-population empirical_AF = per_pop_AC[p] / min over component variants of frequencies_by_pop[p].AN; missing when min_AN == 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops haplotype rows subsumed by a strictly longer adjacency-contiguous fragment with identical per-pop AC. Driver-side scan: collect rows, group by AC tuple, pairwise contiguous-substring check within each group; filter the Hail table by stringified-haplotype set. Helpers _is_contiguous_subarray and _find_subsumed_haplotypes factored out for testability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rithm Replaces the fixed-bin two-windower union with the per-sample adjacency pipeline composed of the six helpers added in prior commits. Deletes _get_haplotypes and its three nested helpers. Removes the .[12].ht intermediate writes; adds .cols.ht, .parents.ht, .hap_ac.ht checkpoints between major stages to break Hail IR plan accumulation. _compute_metrics: switch frequencies_by_pop[p] direct indexing to .get(p, missing) so populations absent from the per-variant call-stats dict do not raise. Updates the test_compute_haplotypes parametrize from [517, 295, 30, 33] to [457, 283, 43, 35]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ypes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drop split_haplotypes + distinct() transformation logic from build_hgdp_haplotype_table_entries. Haplotypes are already at the correct granularity from compute_haplotypes (per-sample adjacency with containment dedup), so re-fragmenting them here would only underestimate AC by discarding counts via distinct(). - Remove split_haplotypes import (still used by compute_haplotype_statistics) - Remove window_size parameter from build_hgdp_haplotype_table_entries - Drop early .distinct(), post-split key_by("haplotype").distinct(), and the diagnostic count() log lines - Update build_contig_sequences_table caller and create_duckdb_index docstring (window_size now solely drives FASTA flanking context) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new compute_haplotypes per-sample-adjacency algorithm uses a single window size, and create_duckdb_index no longer re-splits haplotypes — so the dual hgdp_1kg_haplotype_window_size / sequence_window_size knobs collapse into one. - generate_divref.smk: remove HGDP_1KG_HAPLOTYPE_WINDOW_SIZE constant; compute_haplotypes rule now uses SEQUENCE_WINDOW_SIZE; cleanup line drops .[12].ht (no longer produced) - create_test_data.smk: WINDOW_SIZE_COMPUTE_HAPLOTYPES 5000 -> 25 to match sequence_window_size schema default - config_schema.yml: remove hgdp_1kg_haplotype_window_size property Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Within each per-pop AC group, candidates are already sorted by haplotype length descending. Add an explicit `len(h_long) <= n_short` break so the inner scan exits as soon as the running candidate can no longer strictly subsume the target — proper containment requires len(h_long) > len(h_short). Worst case stays O(G²); large groups with a wide length range now prune most pair comparisons. Mitigation for risk #2 in the algorithm rewrite plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…types

Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ameynert self-assigned this May 5, 2026

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

Comment thread workflows/generate_divref.smk

ameynert temporarily deployed to github-actions-snakemake-linting May 5, 2026 18:03 — with GitHub Actions Inactive

ameynert temporarily deployed to github-actions-snakemake-linting May 5, 2026 23:17 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 5, 2026

View reviewed changes

ameynert force-pushed the am_sample_adjacency branch from 47b45e5 to 229c37e Compare May 11, 2026 16:30

ameynert temporarily deployed to github-actions-snakemake-linting May 11, 2026 16:30 — with GitHub Actions Inactive

ameynert temporarily deployed to github-actions-snakemake-linting May 11, 2026 16:32 — with GitHub Actions Inactive

ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:03 — with GitHub Actions Inactive

ameynert force-pushed the am_sample_adjacency branch from 42916cf to 4394cf4 Compare May 13, 2026 21:19

ameynert temporarily deployed to github-actions-snakemake-linting May 13, 2026 21:19 — with GitHub Actions Inactive

ameynert force-pushed the am_sample_adjacency branch from 4394cf4 to 0b5e3e3 Compare May 15, 2026 21:45

ameynert temporarily deployed to github-actions-snakemake-linting May 15, 2026 21:45 — with GitHub Actions Inactive

coderabbitai Bot reviewed May 15, 2026

View reviewed changes

ameynert and others added 12 commits May 19, 2026 16:32

feat: fix up test outputs

796e056

chore: drop unused assert_path_is_writable import from compute_haplot…

f4c3946

…ypes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ameynert and others added 4 commits May 19, 2026 16:33

feat: split up variants into blocks by locus before calculating haplo…

e1d9bf6

…types

feat: suggestions from coderabbit

779fc94

fix: guard prev-end

f8d2fef

ameynert force-pushed the am_sample_adjacency branch from 0b5e3e3 to e284b31 Compare May 19, 2026 23:33

ameynert temporarily deployed to github-actions-snakemake-linting May 19, 2026 23:33 — with GitHub Actions Inactive

fix: leftover window_size param removed

edc18a2

ameynert temporarily deployed to github-actions-snakemake-linting May 20, 2026 00:03 — with GitHub Actions Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: implement sample adjacency algorithm for computing haplotypes#48

feat: implement sample adjacency algorithm for computing haplotypes#48
ameynert wants to merge 17 commits into
mainfrom
am_sample_adjacency

ameynert commented May 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 5, 2026 •

edited

Loading

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 5, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ameynert commented May 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot May 5, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ameynert commented May 5, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 5, 2026 •

edited

Loading