Create utilities or add examples that demonstrate how to store bgen zarr efficiently

Similar to https://github.com/pystatgen/sgkit/issues/80, we need a better way to store genotype probabilities and/or dosages from bgen.  UKB chromosomes stored as 32-bit dosages in zstd compressed zarrs are almost twice as large as the original bgen, and the loss of probabilities means even less information is present.

Using a [FixedScaleOffset](https://numcodecs.readthedocs.io/en/stable/fixedscaleoffset.html?highlight=FixedScaleOffset#numcodecs.fixedscaleoffset.FixedScaleOffset) filter like this looks to be equivalent to what's described in the v1 [bgen spec](https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html) (see "Probability data storage"):

```python
numcodecs.FixedScaleOffset(
    offset=0, 
    scale=32_768, # 2**15
    dtype='float16', # Decoded data type (doesn't have to be 16-bit)
    astype='uint16' # Encoded data type
)
```

The choice of a 2**15 scale is deliberate even with uint16 to allow for probabilities to take values in [0, 2).  The v2 spec uses parameterized scale factor and assumes probabilities are in [0, 1] instead.  The rounding is a little more complicated in the v2 case but it doesn't seem like the difference is likely to be important -- plus it means you need all genotype probabilities for a call to do the rounding. 

Lastly, the [bgen paper](https://www.biorxiv.org/content/10.1101/308296v2.full.pdf) shows that using 8-bit probabilities (256 scale factor) leads to virtually indistinguishable downstream results in association testing vs 20-bit probabilities, but similarity degrades quickly below 8 bits.  The results section concludes by saying:

> Given the findings above we recommend the use of b=8, 12 or 16 bits for imputed data, with b=8 providing a particularly good balance between implementation efficiency, accuracy, and overall file size

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create utilities or add examples that demonstrate how to store bgen zarr efficiently #14

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Create utilities or add examples that demonstrate how to store bgen zarr efficiently #14

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions