Dear authors,
Thank you for developing and sharing EsViritu — it is a very useful tool.
I am currently using EsViritu with a large custom reference database (~190k contigs) and noticed that some steps become very slow. I would like to share these observations in case there is room for future improvement.
⸻
Background / use case
My workflow is roughly as follows:
1. Run EsViritu with the original database to detect known viruses
2. Run a second analysis using the unmapped reads from step 1 with a custom viral contig database
The custom reference database contains approximately 190,000 de-replicated viral assemblies.
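For step 2, the unmapped reads need to be pulled out of the step-1 alignment before re-mapping. A minimal pysam sketch of how that can be done, assuming a BAM from step 1 that still contains the unmapped records is available (the file names are placeholders, not actual EsViritu outputs):

```python
import pysam

# Placeholder file names; assumes the step-1 BAM still contains unmapped records.
with pysam.AlignmentFile("step1.sorted.bam", "rb") as bam, \
        open("unmapped_reads.fastq", "w") as out:
    # until_eof=True also iterates over the unplaced reads at the end of the file.
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped and not read.is_secondary and not read.is_supplementary:
            quals = read.query_qualities
            qual_str = ("".join(chr(q + 33) for q in quals)
                        if quals is not None
                        else "I" * len(read.query_sequence))
            out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{qual_str}\n")
```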
⸻
Observed bottlenecks
When using this large database, the pipeline becomes very slow or difficult to complete at the following steps:
• cluster_assemblies_by_read_sharing
• bam_coverage_windows
From inspecting the code, the main reasons appear to be:
cluster_assemblies_by_read_sharing
• Per-assembly statistics (e.g. maximum read identity) are computed by repeatedly scanning the full read-sharing table.
• With a large number of assemblies, this results in many repeated full DataFrame scans at the Python level.
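For illustration, here is a minimal pandas sketch of the single-pass alternative. The column names and values are made up for the example and are not EsViritu's actual read-sharing schema:

```python
import pandas as pd

# Toy read-sharing table; column names are assumptions, not the real schema.
read_share = pd.DataFrame({
    "assembly": ["A1", "A1", "A2", "A2", "A3"],
    "read_id":  ["r1", "r2", "r1", "r3", "r4"],
    "identity": [97.1, 99.4, 96.0, 98.2, 95.5],
})

# Per-assembly loop, one full-table scan per assembly (the pattern that scales poorly):
per_loop = {
    asm: read_share.loc[read_share["assembly"] == asm, "identity"].max()
    for asm in read_share["assembly"].unique()
}

# Single groupby pass computing the same maxima once for all assemblies:
per_groupby = read_share.groupby("assembly")["identity"].max().to_dict()

assert per_loop == per_groupby
```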
bam_coverage_windows
• Coverage is calculated using repeated pysam.pileup calls for each window.
• This leads to many redundant pileup operations per contig, and the computation is largely single-threaded, which becomes slow at scale.
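And a rough pysam/numpy sketch of the one-pass idea for window coverage: query per-base depth once per contig, then average it into fixed windows. The BAM name, window size, and the handling of the final window are simplifications, and the filters/quality thresholds would have to match the original pileup settings to reproduce identical output:

```python
import numpy as np
import pysam

WINDOW = 100  # placeholder window size

with pysam.AlignmentFile("mapped.sorted.bam", "rb") as bam:  # assumes an indexed BAM
    for contig, length in zip(bam.references, bam.lengths):
        # One per-base coverage query per contig (four arrays: A, C, G, T counts).
        acgt = bam.count_coverage(contig, 0, length)
        depth = np.sum(acgt, axis=0)

        # Zero-pad so the length is a multiple of WINDOW, then average per window.
        # (The trailing window is diluted by the padding; real code would handle it explicitly.)
        padded = np.pad(depth, (0, -len(depth) % WINDOW))
        window_means = padded.reshape(-1, WINDOW).mean(axis=1)
        # window_means[i] is the mean depth of window i on this contig.
```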
⸻
Experimental changes
I am not an expert in algorithm optimization, but with the help of ChatGPT (Codex) I experimented with small code changes aimed at:
• Reducing repeated full-table scans
• Avoiding redundant pileup operations
• Keeping the original logic and outputs unchanged
In my test runs (including runs with the original EsViritu database), the modified code produced identical results and allowed the pipeline to complete successfully on my large custom database, with reduced runtime.
I’ve shared these experimental changes here in case they are useful as reference material: https://github.com/XuhanDeng/try_change_esv
⸻
Notes
• This is not a request for immediate changes
• I understand this is a non-standard use case
• I am mainly documenting these observations so they can be referenced in future updates, tutorials, or documentation related to custom databases
Thanks again for developing and sharing EsViritu — it has been very useful for my work.
Best regards,
Xuhan Deng
⸻
P.S. If others are interested in building a custom database, I am planning to further refine my Snakemake workflow and share it on GitHub in the near future.