2 changes: 2 additions & 0 deletions README.md
@@ -10,3 +10,5 @@ panct is a collection of tools for working pangenomes
Homepage: [https://panct.readthedocs.io/](https://panct.readthedocs.io/)

Visit our homepage for installation and usage instructions.

Precomputed region-level complexity scores for hg38 (at 50kb, 100kb, and 1Mb resolution) are available in [precomputed-scores](./precomputed-scores).
44 changes: 44 additions & 0 deletions precomputed-scores/README.md
@@ -0,0 +1,44 @@
# Precomputed pangenome complexity scores

Scores are computed for 50kb, 100kb, and 1Mb windows on hg38 and are provided in the following files:

* `hprc-v1.1-mc-grch38_complexity_50000.tab`
* `hprc-v1.1-mc-grch38_complexity_100000.tab`
* `hprc-v1.1-mc-grch38_complexity_1000000.tab`

Columns give:

* chrom, start, end: genomic location of the window in hg38
* numnodes: number of minigraph-cactus nodes identified in the window
* total_length: sum of the lengths of all nodes in the window. This is usually close to, but not exactly equal to, the length of the hg38 window, because it also includes the lengths of non-reference nodes.
* numwalks: number of walks identified through the subgraph. This is based on the output of the `query` command from gbz-base and in some cases exceeds the total number of assemblies used to build the graph.
* sequniq-normwalk: sum_n len(n)*p_n*(1-p_n)/L where L is the average walk length
* sequniq-normnode: sum_n len(n)*p_n*(1-p_n)/L where L is the average node length
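
As a rough illustration of the two `sequniq` formulas above, the sketch below evaluates them on a toy subgraph. It assumes that `p_n` is the fraction of walks that visit node `n` and that a walk's length is the summed length of its nodes; neither definition is stated here, and the function and variable names are hypothetical rather than part of panct.

```python
from statistics import mean

def sequniq(node_lengths, walks, norm):
    """Toy version of sum_n len(n) * p_n * (1 - p_n) / L.

    ASSUMPTION: p_n is the fraction of walks that visit node n.
    ASSUMPTION: a walk's length is the summed length of its nodes.
    """
    if not walks:
        return None  # no walks through the region, so no score
    num_walks = len(walks)
    total = 0.0
    for node, length in node_lengths.items():
        p = sum(node in walk for walk in walks) / num_walks
        total += length * p * (1 - p)
    if norm == "walk":
        # L = average walk length
        scale = mean(sum(node_lengths[n] for n in walk) for walk in walks)
    elif norm == "node":
        # L = average node length
        scale = mean(list(node_lengths.values()))
    else:
        raise ValueError(f"unknown normalisation: {norm}")
    return total / scale

# Toy subgraph: node "a" is shared by every walk, "b" and "c" are alternatives.
node_lengths = {"a": 10, "b": 4, "c": 6}
walks = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "c"]]

normwalk = sequniq(node_lengths, walks, "walk")
normnode = sequniq(node_lengths, walks, "node")
```

Under these assumptions, a node present in every walk (or in none) contributes nothing, so only the variable nodes `b` and `c` add to the score.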

A score of `None` indicates that no walks were identified through the region.
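
When loading these tables, the `None` entries need handling before any numeric work. The following is a minimal sketch, assuming the files are tab-delimited with a header row that uses the column names listed above; that layout is an assumption, and the sample rows are made up for illustration.

```python
import csv
import io

SCORE_COLUMNS = ("sequniq-normwalk", "sequniq-normnode")

def parse_score(text):
    """Convert a score field to float, mapping the literal "None" to None."""
    return None if text == "None" else float(text)

def read_scores(handle):
    """Read a complexity table. ASSUMES tab-delimited text with a header row."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        for column in SCORE_COLUMNS:
            record[column] = parse_score(record[column])
        rows.append(record)
    return rows

# Made-up rows for illustration only; these are not real scores.
sample = (
    "chrom\tstart\tend\tnumnodes\ttotal_length\tnumwalks\t"
    "sequniq-normwalk\tsequniq-normnode\n"
    "chr1\t0\t50000\t0\t0\t0\tNone\tNone\n"
    "chr1\t50000\t100000\t120\t50310\t90\t0.25\t3.5\n"
)
rows = read_scores(io.StringIO(sample))

# Keep only windows that actually received a score.
scored = [row for row in rows if row["sequniq-normwalk"] is not None]
```

To read one of the real files, an open file handle can be passed to `read_scores` in place of the `io.StringIO` wrapper.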

## Computing scores

Scores were generated using the following commands:

1. Make windows of different sizes across hg38
```
for window in 50000 100000 1000000
do
    bedtools makewindows -g hg38.txt -w ${window} > windows/hg38_windows_${window}.bed
done
```

2. Compute complexity scores for each window based on the HPRC minigraph-cactus v1.1 graph

Scores are based on hprc-v1.1-mc-grch38 available here: https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz
```
for window in 50000 100000 1000000
do
    panct complexity \
        --region windows/hg38_windows_${window}.bed \
        --out hprc-v1.1-mc-grch38_complexity_${window}.tab \
        --metrics sequniq-normwalk,sequniq-normnode \
        ../testdata/hprc-v1.1-mc-grch38.gbz
done
```
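
For readers who want to check the window files from step 1, the sketch below reproduces the tiling that `bedtools makewindows -g ... -w ...` is expected to produce: consecutive, non-overlapping windows in 0-based half-open BED coordinates, with the last window of each chromosome truncated at the chromosome end. It is an illustration only, not bedtools' implementation, and `make_windows` is a hypothetical helper name.

```python
def make_windows(chrom_sizes, width):
    """Yield (chrom, start, end) tiles of at most `width` bases per chromosome.

    Coordinates are 0-based and half-open, as in BED. The final window on each
    chromosome is shortened so that it ends at the chromosome length.
    """
    for chrom, size in chrom_sizes:
        for start in range(0, size, width):
            yield chrom, start, min(start + width, size)

# Sizes taken from hg38.txt in this directory.
chrom_sizes = [("chr21", 46709983), ("chrM", 16569)]

windows_1mb = list(make_windows(chrom_sizes, 1_000_000))
chr21_windows = [w for w in windows_1mb if w[0] == "chr21"]
chrm_windows = [w for w in windows_1mb if w[0] == "chrM"]
```

Short contigs such as chrM yield a single window that is smaller than the nominal window size.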
25 changes: 25 additions & 0 deletions precomputed-scores/hg38.txt
@@ -0,0 +1,25 @@
chr1 248956422
chr2 242193529
chr3 198295559
chr4 190214555
chr5 181538259
chr6 170805979
chr7 159345973
chrX 156040895
chr8 145138636
chr9 138394717
chr11 135086622
chr10 133797422
chr12 133275309
chr13 114364328
chr14 107043718
chr15 101991189
chr16 90338345
chr17 83257441
chr18 80373285
chr20 64444167
chr19 58617616
chrY 57227415
chr22 50818468
chr21 46709983
chrM 16569
Binary file not shown.