This will primarily help us understand whether using dask for parallelism over a thread-safe reader library has any obvious disadvantages compared with putting parallelism in the reader library itself. Drawing this conclusion from different libraries isn't ideal, but Carl had these numbers handy for bgen-reader-py, so we should make sure cbgen is comparable once #20 is done:
I added multithreading to the NumPy-inspired reader. Using this API on my 6-processor machine, reading from an SSD, I was able to read 109 variants/second (53 million distributions/second). This was on the file ‘merged_487400x220000.bgen’, which tries to be like the UK Biobank data.
(Single-threaded performance is 31 variants/second and 15 million distributions/second. We also verified that the cbgen interface is thread-safe.)
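For the cbgen side of the comparison, something like the following could be used as a rough harness. It is only a sketch: `read_variant` is a hypothetical placeholder (it just sleeps) that would need to be replaced with the real thread-safe read call, and it uses the standard library's `ThreadPoolExecutor` as a stand-in for dask's threaded scheduler. The conversion to distributions/second assumes one distribution is counted per sample per variant, which is consistent with the figures quoted above (109 × 487,400 ≈ 53 million).

```python
import time
from concurrent.futures import ThreadPoolExecutor

N_SAMPLES = 487_400  # sample count taken from the file name quoted above


def read_variant(index):
    """Hypothetical placeholder for a thread-safe read; swap in the real call."""
    time.sleep(0.001)  # simulates I/O that releases the GIL
    return index


def variants_per_second(n_variants, n_threads):
    """Read n_variants with n_threads workers and return the measured throughput."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(read_variant, range(n_variants)))
    elapsed = time.perf_counter() - start
    assert len(results) == n_variants
    return n_variants / elapsed


def distributions_per_second(rate, n_samples=N_SAMPLES):
    """Assumes one distribution is counted per sample per variant."""
    return rate * n_samples


if __name__ == "__main__":
    for n_threads in (1, 6):
        rate = variants_per_second(200, n_threads)
        print(f"{n_threads} thread(s): {rate:.0f} variants/s, "
              f"{distributions_per_second(rate) / 1e6:.0f} million distributions/s")
```

Running the same harness with the reader's built-in multithreading and with external threads over single-variant reads should show whether the external approach costs anything noticeable.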