feat: implement sample adjacency algorithm for computing haplotypes#48
feat: implement sample adjacency algorithm for computing haplotypes#48ameynert wants to merge 17 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR replaces two-pass/windowed haplotype assembly with an adjacency-gap pipeline: compute locus groups, form per-(sample,strand) parent blocks split at gaps, enumerate adjacency-contiguous subfragments, aggregate per-pop containment AC, attach variant component info, compute per-haplotype metrics, apply containment dedup, update callers/workflows, and expand unit tests. ChangesHaplotype Aggregation Algorithm Overhaul
Downstream API & Workflow Simplification
Test Coverage Expansion
Sequence DiagramsequenceDiagram
participant VCF as "VariantsHT"
participant Prep as "LocusGrouper"
participant Carriers as "CarrierBuilder"
participant Blocks as "ParentBlocks"
participant Frags as "SubFragments"
participant Agg as "ContainmentAgg"
participant Enrich as "Enricher"
participant Dedup as "Deduper"
participant Out as "OutputTable"
VCF->>Prep: load and filter variants
Prep->>Carriers: build per sample strand carrier arrays
Carriers->>Blocks: sort and split into parent blocks
Blocks->>Frags: enumerate adjacency contiguous subfragments
Frags->>Agg: group by haplotype indices and compute per_pop_AC
Agg->>Enrich: attach variant components and per pop freqs
Enrich->>Enrich: compute metrics and fractions
Enrich->>Dedup: emit candidate haplotypes with metrics
Dedup->>Dedup: remove subsumed haplotypes with identical AC
Dedup->>Out: write final haplotype table
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
divref/divref/tools/compute_haplotypes.py (1)
180-197: ⚡ Quick winCombine row_idx and struct collection into a single aggregate call to eliminate fragile coupling.
Two separate calls—
variants_ht.row_idx.collect()andvariants_ht.aggregate(hl.agg.collect(...))on an unkeyed table—have no ordering guarantee between them. Hail's documentation states that "the ordering of rows in a table without a key is undefined." If rows reorder between the two passes, thezip(..., strict=True)will silently pair the wrong components to indices, corrupting the resulting dict. Whilestrict=Truecatches length mismatches, it cannot detect row-order mismatches.Combining both into a single aggregate call makes the pairing explicit, eliminates the fragility, and saves one pass over the table.
♻️ Proposed refactor
- components_dict = hl.literal( - dict( - zip( - variants_ht.row_idx.collect(), - variants_ht.aggregate( - hl.agg.collect( - hl.struct( - locus=variants_ht.locus, - alleles=variants_ht.alleles, - freq=variants_ht.freq, - frequencies_by_pop=variants_ht.frequencies_by_pop, - ) - ) - ), - strict=True, - ) - ) - ) + collected = variants_ht.aggregate( + hl.agg.collect( + hl.struct( + row_idx=variants_ht.row_idx, + locus=variants_ht.locus, + alleles=variants_ht.alleles, + freq=variants_ht.freq, + frequencies_by_pop=variants_ht.frequencies_by_pop, + ) + ) + ) + components_dict = hl.literal({c.row_idx: c.drop("row_idx") for c in collected})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@divref/divref/tools/compute_haplotypes.py` around lines 180 - 197, The current code builds components_dict by separately collecting variants_ht.row_idx and then collecting structs from variants_ht in a second pass and zipping them, which is fragile because an unkeyed table’s row order is undefined; instead, change components_dict to a single aggregate over variants_ht that collects pairs (row_idx, hl.struct(locus=..., alleles=..., freq=..., frequencies_by_pop=...)) and then turns that single-aggregate result into the dict (and wrap with hl.literal as before). Target the symbols components_dict, variants_ht, row_idx, hl.agg.collect, and the hl.struct creation and perform the pairing inside one variants_ht.aggregate call so the index-to-struct association cannot get reordered.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@workflows/generate_divref.smk`:
- Around line 183-185: The cleanup only removes {params.output_base}.variants.ht
but compute_haplotypes produces four checkpoint directories (.variants.ht,
.cols.ht, .parents.ht, .hap_ac.ht); update the cleanup command in the rule (the
block that currently runs rm -r {params.output_base}.variants.ht) to also remove
{params.output_base}.cols.ht, {params.output_base}.parents.ht, and
{params.output_base}.hap_ac.ht so the three leftover checkpoints are deleted
after the rule completes.
---
Nitpick comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 180-197: The current code builds components_dict by separately
collecting variants_ht.row_idx and then collecting structs from variants_ht in a
second pass and zipping them, which is fragile because an unkeyed table’s row
order is undefined; instead, change components_dict to a single aggregate over
variants_ht that collects pairs (row_idx, hl.struct(locus=..., alleles=...,
freq=..., frequencies_by_pop=...)) and then turns that single-aggregate result
into the dict (and wrap with hl.literal as before). Target the symbols
components_dict, variants_ht, row_idx, hl.agg.collect, and the hl.struct
creation and perform the pairing inside one variants_ht.aggregate call so the
index-to-struct association cannot get reordered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 38b69893-2136-4f09-ac41-82c26b3d543a
📒 Files selected for processing (6)
divref/divref/tools/compute_haplotypes.pydivref/divref/tools/create_duckdb_index.pydivref/tests/tools/test_compute_haplotypes.pyworkflows/config/config_schema.ymlworkflows/create_test_data.smkworkflows/generate_divref.smk
💤 Files with no reviewable changes (1)
- workflows/config/config_schema.yml
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@divref/divref/tools/compute_haplotypes.py`:
- Around line 45-50: The grouping logic uses prev_end as the last variant end
which can shrink when a short variant is seen; change it to track the running
maximum end of the active group: when evaluating each row in the loop over rows
(variable names: v, prev_end, prev_contig, current_group, group_of,
window_size), only start a new group if v.locus.contig != prev_contig or
v.locus.position - prev_end >= window_size, and otherwise update prev_end =
max(prev_end, v.locus.position + v.ref_len); when you do start a new group, set
prev_end to v.locus.position + v.ref_len (and update
prev_contig/current_group/group_of as before).
- Around line 87-96: The current sorting and breakpoint logic in
blocks_ht.annotate (carriers) and the breakpoints calculation is unstable:
change the hl.sorted key to a deterministic tuple (e.g., lambda v:
(v.locus.position, v.ref_len, v.alt or v.some_unique_id)) so equal-position
carriers have a stable order, and replace the single-previous-gap check used in
breakpoints with a running-max-end comparison (compute a running max of
carriers[i-1].locus.position + carriers[i-1].ref_len and test
carriers[i].locus.position against that running max using hl.scan/hl.fold
semantics) so splits occur only when the new variant starts beyond the maximal
end of the current block. Ensure you update references to carriers, n, and
breakpoints accordingly.
- Around line 208-223: The code currently aggregates all variants into a Python
dict (pairs/components_dict) and broadcasts it, risking OOM; instead build a
Hail-side mapping keyed by variants_ht.row_idx and use Hail table lookup/join:
create a components_table from variants_ht selecting the small struct (locus,
alleles, freq, frequencies_by_pop) and key it by row_idx (e.g., components_table
= variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage
in hap_table.annotate by mapping hap_table.haplotype to table lookups like
components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail
and avoids materializing the full payload on the driver.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f1d54da1-0a56-4447-8943-410e0deb4173
📒 Files selected for processing (2)
divref/divref/tools/compute_haplotypes.pyworkflows/generate_divref.smk
| pairs = variants_ht.aggregate( | ||
| hl.agg.collect( | ||
| hl.tuple([ | ||
| variants_ht.row_idx, | ||
| hl.struct( | ||
| locus=variants_ht.locus, | ||
| alleles=variants_ht.alleles, | ||
| freq=variants_ht.freq, | ||
| frequencies_by_pop=variants_ht.frequencies_by_pop, | ||
| ), | ||
| ]) | ||
| ) | ||
| ) | ||
| components_dict = hl.literal(dict(pairs)) | ||
| hap_table = hap_table.annotate( | ||
| _components=hap_table.haplotype.map(lambda idx: components_dict[hl.int32(idx)]) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the file size and locate the function
wc -l divref/divref/tools/compute_haplotypes.pyRepository: fg-labs/divref-wf
Length of output: 107
🏁 Script executed:
# Read the function containing lines 208-223 with context
sed -n '170,230p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 3307
🏁 Script executed:
# Check the function signature and any default parameters
sed -n '1,250p' divref/divref/tools/compute_haplotypes.py | grep -A 50 "^def compute_haplotypes"Repository: fg-labs/divref-wf
Length of output: 43
🏁 Script executed:
# Find the function that calls _attach_component_info
grep -n "_attach_component_info" divref/divref/tools/compute_haplotypes.pyRepository: fg-labs/divref-wf
Length of output: 311
🏁 Script executed:
# Read the compute_haplotypes function signature and check for spark_driver_memory_gb default
sed -n '1,150p' divref/divref/tools/compute_haplotypes.py | head -100Repository: fg-labs/divref-wf
Length of output: 4172
🏁 Script executed:
# Look for spark_driver_memory_gb definition in the file
grep -n "spark_driver_memory_gb" divref/divref/tools/compute_haplotypes.pyRepository: fg-labs/divref-wf
Length of output: 374
🏁 Script executed:
# Let's check what's at lines 208-223 in the file
sed -n '208,223p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 747
🏁 Script executed:
# Also check context around those lines to understand the function
sed -n '200,240p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 2084
🏁 Script executed:
# Check if there's variant filtering before the aggregation at line 518
sed -n '500,525p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 1658
🏁 Script executed:
# Look for variant_freq_threshold to understand filtering
grep -n "variant_freq_threshold" divref/divref/tools/compute_haplotypes.pyRepository: fg-labs/divref-wf
Length of output: 573
🏁 Script executed:
# Let's check the data flow: when is variants_ht filtered before being passed to _attach_component_info?
sed -n '440,480p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 2000
🏁 Script executed:
# Check the actual variant struct schema to understand memory footprint - look for freq and frequencies_by_pop definitions
grep -A 20 "freq.*=" divref/divref/tools/compute_haplotypes.py | head -40Repository: fg-labs/divref-wf
Length of output: 1804
🏁 Script executed:
# Let's see the full compute_haplotypes function context around the _attach_component_info call
sed -n '510,530p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 1248
🏁 Script executed:
# Let's check how large variants_ht might be by looking at the filtering logic more closely
sed -n '449,481p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 1623
🏁 Script executed:
# Check if there's any documentation or test data that shows typical variant counts
find . -name "*.py" -o -name "*.md" | xargs grep -l "variant_freq_threshold" | head -5Repository: fg-labs/divref-wf
Length of output: 153
🏁 Script executed:
# Verify the exact line numbers by checking the actual line count from the beginning of _attach_component_info
sed -n '182,225p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 2478
🏁 Script executed:
# Let's verify the exact line ranges - check what's at lines 208-223 (absolute file positions)
sed -n '206,225p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 915
🏁 Script executed:
# Check if there are any recent changes or TODOs about this memory pattern
grep -n "TODO\|FIXME\|memory\|aggregate\|driver" divref/divref/tools/compute_haplotypes.py | grep -i "component\|attach"Repository: fg-labs/divref-wf
Length of output: 43
🏁 Script executed:
# Let's understand the actual variant count constraints - check if variants_ht is filtered before line 518
sed -n '475,520p' divref/divref/tools/compute_haplotypes.py | cat -nRepository: fg-labs/divref-wf
Length of output: 2566
Avoid collecting the full variant payload onto the driver here.
This materializes every retained variant, including nested frequency structs, in Python via aggregate(collect(...)), then duplicates it with dict(pairs) before broadcasting back as a literal. With chromosome-scale inputs and a 1 GB default driver, this is a realistic OOM failure path. Prefer a Hail-side join/indexing strategy keyed by row_idx instead of a driver dictionary.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@divref/divref/tools/compute_haplotypes.py` around lines 208 - 223, The code
currently aggregates all variants into a Python dict (pairs/components_dict) and
broadcasts it, risking OOM; instead build a Hail-side mapping keyed by
variants_ht.row_idx and use Hail table lookup/join: create a components_table
from variants_ht selecting the small struct (locus, alleles, freq,
frequencies_by_pop) and key it by row_idx (e.g., components_table =
variants_ht.select(...).key_by('row_idx')), then replace the driver dict usage
in hap_table.annotate by mapping hap_table.haplotype to table lookups like
components_table[hl.int32(idx)] (or perform a join) so all work stays in Hail
and avoids materializing the full payload on the driver.
47b45e5 to
229c37e
Compare
Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
42916cf to
4394cf4
Compare
Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4394cf4 to
0b5e3e3
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
divref/tests/tools/test_compute_haplotypes.py (1)
11-16: 🏗️ Heavy liftAvoid coupling tests to private helper internals.
The new suite heavily imports/targets underscored helpers, which makes tests fragile to internal refactors. Prefer validating these invariants via stable public behavior (or promote helpers to explicit public contracts if they must be tested directly).
As per coding guidelines, "Test behavior, not implementation—tests should survive refactoring."
Also applies to: 85-797
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@divref/tests/tools/test_compute_haplotypes.py` around lines 11 - 16, Tests import and assert on private underscored helpers (_aggregate_containment_ac, _apply_containment_dedup, _attach_component_info, _compute_metrics, _enumerate_subfragments, _form_parent_blocks), which couples tests to implementation; update the tests to exercise the stable public API instead (call the top-level public function(s) that use these helpers and assert on observable outputs/side effects), or if a helper truly needs direct testing, promote it to a public symbol by renaming and exporting it (remove the leading underscore and update imports) so tests target a stable contract rather than internal details.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@divref/tests/tools/test_compute_haplotypes.py`:
- Around line 11-16: Tests import and assert on private underscored helpers
(_aggregate_containment_ac, _apply_containment_dedup, _attach_component_info,
_compute_metrics, _enumerate_subfragments, _form_parent_blocks), which couples
tests to implementation; update the tests to exercise the stable public API
instead (call the top-level public function(s) that use these helpers and assert
on observable outputs/side effects), or if a helper truly needs direct testing,
promote it to a public symbol by renaming and exporting it (remove the leading
underscore and update imports) so tests target a stable contract rather than
internal details.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 51760d1b-fc53-4a20-8353-f20d8f9fc103
📒 Files selected for processing (6)
divref/divref/tools/compute_haplotypes.pydivref/divref/tools/create_duckdb_index.pydivref/tests/tools/test_compute_haplotypes.pyworkflows/config/config_schema.ymlworkflows/create_test_data.smkworkflows/generate_divref.smk
💤 Files with no reviewable changes (1)
- workflows/config/config_schema.yml
🚧 Files skipped from review as they are similar to previous changes (3)
- workflows/create_test_data.smk
- workflows/generate_divref.smk
- divref/divref/tools/create_duckdb_index.py
Forms per-(sample, strand) parent blocks from a cols-table of alt-carrier arrays via adjacency-gap cutting at window_size. Drops blocks of length < 2. First step toward replacing the fixed-bin two-windower build in compute_haplotypes with single-window per-sample adjacency. Lives alongside the existing _get_haplotypes for now; the swap happens in a later commit once all helpers are in place. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For each parent block of length N, emits N(N-1)/2 sub-fragments — every adjacency-contiguous slice of length >= 2, including the full block. Lays the groundwork for downstream containment-AC aggregation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Groups sub-fragment rows by haplotype (row_idx tuple) and produces a per- population AC vector of length n_pops, indexed by pop_int. Counts the parent blocks (one row per (sample, strand, parent block, sub-fragment)) containing each haplotype as adjacency-contiguous. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joins each haplotype row's row_idx array against the variants table to attach parallel-length variants/gnomad_freqs/frequencies_by_pop arrays. Uses a broadcast dict (driver-side collect of variants_ht) keyed by row_idx; lookup casts to int32 since Python ints encode that way through hl.literal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Annotates per-haplotype frequency / phasing summary fields: max_pop, max_empirical_AF, max_empirical_AC, min_variant_frequency, all_pop_freqs (sorted by AF desc, missing AF sorts to end), fraction_phased, estimated_gnomad_AF. Per-population empirical_AF = per_pop_AC[p] / min over component variants of frequencies_by_pop[p].AN; missing when min_AN == 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops haplotype rows subsumed by a strictly longer adjacency-contiguous fragment with identical per-pop AC. Driver-side scan: collect rows, group by AC tuple, pairwise contiguous-substring check within each group; filter the Hail table by stringified-haplotype set. Helpers _is_contiguous_subarray and _find_subsumed_haplotypes factored out for testability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rithm Replaces the fixed-bin two-windower union with the per-sample adjacency pipeline composed of the six helpers added in prior commits. Deletes _get_haplotypes and its three nested helpers. Removes the .[12].ht intermediate writes; adds .cols.ht, .parents.ht, .hap_ac.ht checkpoints between major stages to break Hail IR plan accumulation. _compute_metrics: switch frequencies_by_pop[p] direct indexing to .get(p, missing) so populations absent from the per-variant call-stats dict do not raise. Updates the test_compute_haplotypes parametrize from [517, 295, 30, 33] to [457, 283, 43, 35]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ypes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop split_haplotypes + distinct() transformation logic from
build_hgdp_haplotype_table_entries. Haplotypes are already at the
correct granularity from compute_haplotypes (per-sample adjacency
with containment dedup), so re-fragmenting them here would only
underestimate AC by discarding counts via distinct().
- Remove split_haplotypes import (still used by compute_haplotype_statistics)
- Remove window_size parameter from build_hgdp_haplotype_table_entries
- Drop early .distinct(), post-split key_by("haplotype").distinct(),
and the diagnostic count() log lines
- Update build_contig_sequences_table caller and create_duckdb_index
docstring (window_size now solely drives FASTA flanking context)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new compute_haplotypes per-sample-adjacency algorithm uses a single window size, and create_duckdb_index no longer re-splits haplotypes — so the dual hgdp_1kg_haplotype_window_size / sequence_window_size knobs collapse into one. - generate_divref.smk: remove HGDP_1KG_HAPLOTYPE_WINDOW_SIZE constant; compute_haplotypes rule now uses SEQUENCE_WINDOW_SIZE; cleanup line drops .[12].ht (no longer produced) - create_test_data.smk: WINDOW_SIZE_COMPUTE_HAPLOTYPES 5000 -> 25 to match sequence_window_size schema default - config_schema.yml: remove hgdp_1kg_haplotype_window_size property Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Within each per-pop AC group, candidates are already sorted by haplotype length descending. Add an explicit `len(h_long) <= n_short` break so the inner scan exits as soon as the running candidate can no longer strictly subsume the target — proper containment requires len(h_long) > len(h_short). Worst case stays O(G²); large groups with a wide length range now prune most pair comparisons. Mitigation for risk #2 in the algorithm rewrite plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Within a parent block, gap detection compared only against the immediately preceding variant end, so a shorter overlapping carrier could shrink the boundary and force a false split (e.g. 100-150, 120-121, 160 with window_size=20 splits instead of staying connected). Compute breakpoints against the running max end of carriers[0..i-1]. The carrier sort also keyed only on locus.position, so equal-position carriers could end up in different row_idx orders across samples even for the same fragment, which downstream grouping keys on. Sort by (position, ref_len, row_idx) for deterministic ordering. Addresses CodeRabbit review on PR #48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0b5e3e3 to
e284b31
Compare
See next PR for details and comparison
Summary by CodeRabbit
New Features
Improvements
Tests
Chores