Dear authors,
Thank you for developing and sharing EsViritu — it is a very useful tool.
I am currently using EsViritu with a large custom reference database (~190k contigs) and noticed that some steps become very slow. I would like to share these observations in case there is room for future improvement.
⸻
Background / use case
My workflow is roughly as follows:
1. Run EsViritu with the original database to detect known viruses
2. Run a second analysis using the unmapped reads from step 1 with a custom viral contig database
The custom reference database contains approximately 190,000 de-replicated viral assemblies.
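For step 2, the unmapped reads need to be pulled out of the step-1 alignment before re-mapping. A minimal pysam sketch of how that can be done, assuming a BAM from step 1 that still contains the unmapped records is available (the file names are placeholders, not actual EsViritu outputs):

```python
import pysam

# Placeholder file names; assumes the step-1 BAM still contains unmapped records.
with pysam.AlignmentFile("step1.sorted.bam", "rb") as bam, \
        open("unmapped_reads.fastq", "w") as out:
    # until_eof=True also iterates over the unplaced reads at the end of the file.
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped and not read.is_secondary and not read.is_supplementary:
            quals = read.query_qualities
            qual_str = ("".join(chr(q + 33) for q in quals)
                        if quals is not None
                        else "I" * len(read.query_sequence))
            out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{qual_str}\n")
```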
⸻
Observed bottlenecks
When using this large database, the pipeline becomes very slow or difficult to complete at the following steps:
• cluster_assemblies_by_read_sharing
• bam_coverage_windows
From inspecting the code, the main reasons appear to be:
cluster_assemblies_by_read_sharing
• Per-assembly statistics (e.g. maximum read identity) are computed by repeatedly scanning the full read-sharing table.
• With a large number of assemblies, this results in many repeated full DataFrame scans at the Python level.
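For illustration, here is a minimal pandas sketch of the single-pass alternative. The column names and values are made up for the example and are not EsViritu's actual read-sharing schema:

```python
import pandas as pd

# Toy read-sharing table; column names are assumptions, not the real schema.
read_share = pd.DataFrame({
    "assembly": ["A1", "A1", "A2", "A2", "A3"],
    "read_id":  ["r1", "r2", "r1", "r3", "r4"],
    "identity": [97.1, 99.4, 96.0, 98.2, 95.5],
})

# Per-assembly loop, one full-table scan per assembly (the pattern that scales poorly):
per_loop = {
    asm: read_share.loc[read_share["assembly"] == asm, "identity"].max()
    for asm in read_share["assembly"].unique()
}

# Single groupby pass computing the same maxima once for all assemblies:
per_groupby = read_share.groupby("assembly")["identity"].max().to_dict()

assert per_loop == per_groupby
```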
bam_coverage_windows
• Coverage is calculated using repeated pysam.pileup calls for each window.
• This leads to many redundant pileup operations per contig, and the computation is largely single-threaded, which becomes slow at scale.
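And a rough pysam/numpy sketch of the one-pass idea for window coverage: query per-base depth once per contig, then average it into fixed windows. The BAM name, window size, and the handling of the final window are simplifications, and the filters/quality thresholds would have to match the original pileup settings to reproduce identical output:

```python
import numpy as np
import pysam

WINDOW = 100  # placeholder window size

with pysam.AlignmentFile("mapped.sorted.bam", "rb") as bam:  # assumes an indexed BAM
    for contig, length in zip(bam.references, bam.lengths):
        # One per-base coverage query per contig (four arrays: A, C, G, T counts).
        acgt = bam.count_coverage(contig, 0, length)
        depth = np.sum(acgt, axis=0)

        # Zero-pad so the length is a multiple of WINDOW, then average per window.
        # (The trailing window is diluted by the padding; real code would handle it explicitly.)
        padded = np.pad(depth, (0, -len(depth) % WINDOW))
        window_means = padded.reshape(-1, WINDOW).mean(axis=1)
        # window_means[i] is the mean depth of window i on this contig.
```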
⸻
Experimental changes
I am not an expert in algorithm optimization, but with the help of ChatGPT (Codex) I experimented with small code changes aimed at:
• Reducing repeated full-table scans
• Avoiding redundant pileup operations
• Keeping the original logic and outputs unchanged
In my test runs (including runs with the original EsViritu database), the modified code produced identical results and allowed the pipeline to complete successfully on my large custom database, with reduced runtime.
I’ve shared these experimental changes here in case they are useful as reference material: https://github.com/XuhanDeng/try_change_esv
⸻
Notes
• This is not a request for immediate changes
• I understand this is a non-standard use case
• I am mainly documenting these observations so they can be referenced in future updates, tutorials, or documentation related to custom databases
Thanks again for developing and sharing EsViritu — it has been very useful for my work.
Best regards,
Xuhan Deng
⸻
P.S. If others are interested in building a custom database, I am planning to further refine my Snakemake workflow and share it on GitHub in the near future.