Similar to https://github.com/pystatgen/sgkit/issues/80, we need a better way to store genotype probabilities and/or dosages from bgen. UKB chromosomes stored as 32-bit dosages in zstd compressed zarrs are almost twice as large as the original bgen, and the loss of probabilities means even less information is present.
Using a FixedScaleOffset filter like this looks to be equivalent to what's described in the v1 bgen spec (see "Probability data storage"):
numcodecs.FixedScaleOffset(
offset=0,
scale=32_768, # 2**15
dtype='float16', # Decoded data type (doesn't have to be 16-bit)
astype='uint16' # Encoded data type
)
The choice of a 2**15 scale is deliberate even with uint16 to allow for probabilities to take values in [0, 2). The v2 spec uses parameterized scale factor and assumes probabilities are in [0, 1] instead. The rounding is a little more complicated in the v2 case but it doesn't seem like the difference is likely to be important -- plus it means you need all genotype probabilities for a call to do the rounding.
Lastly, the bgen paper shows that using 8-bit probabilities (256 scale factor) leads to virtually indistinguishable downstream results in association testing vs 20-bit probabilities, but similarity degrades quickly below 8 bits. The results section concludes by saying:
Given the findings above we recommend the use of b=8, 12 or 16 bits for imputed data, with b=8 providing a particularly good balance between implementation efficiency, accuracy, and overall file size
Similar to https://github.com/pystatgen/sgkit/issues/80, we need a better way to store genotype probabilities and/or dosages from bgen. UKB chromosomes stored as 32-bit dosages in zstd compressed zarrs are almost twice as large as the original bgen, and the loss of probabilities means even less information is present.
Using a FixedScaleOffset filter like this looks to be equivalent to what's described in the v1 bgen spec (see "Probability data storage"):
The choice of a 2**15 scale is deliberate even with uint16 to allow for probabilities to take values in [0, 2). The v2 spec uses parameterized scale factor and assumes probabilities are in [0, 1] instead. The rounding is a little more complicated in the v2 case but it doesn't seem like the difference is likely to be important -- plus it means you need all genotype probabilities for a call to do the rounding.
Lastly, the bgen paper shows that using 8-bit probabilities (256 scale factor) leads to virtually indistinguishable downstream results in association testing vs 20-bit probabilities, but similarity degrades quickly below 8 bits. The results section concludes by saying: