Give kraken2 100G from 100MB input #1606

Open
mira-miracoli wants to merge 1 commit into usegalaxy-eu:master from mira-miracoli:kraken2-rule

@mira-miracoli
Contributor

@mira-miracoli mira-miracoli commented Jul 29, 2025

I think I need some help interpreting my analysis :D
Does this mean that, on average, 1 GB of input needed 506 GB of memory?

galaxy@sn09:/data/dnb01/test3/mira$ cat kraken2_memory.tsv | awk '{print $10}' | ./histogram.py --percentage --max=256
# NumSamples = 122212; Min = 0.00; Max = 256.00
# 72742 values outside of min/max
# Mean = 5085761.854932; Variance = 147305090613418592.000000; SD = 383803453.102520; Median 506.000000
# each ∎ represents a count of 238
0 - 25.6 [17861] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎  (14.61%)
25.6 - 51.2 [10820] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎  (8.85%)
51.2 - 76.8 [5866] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎  (4.80%)
76.8 - 102.4 [3500] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎  (2.86%)
102.4 - 128.0 [2575] ∎∎∎∎∎∎∎∎∎∎∎  (2.11%)
128.0 - 153.6 [2092] ∎∎∎∎∎∎∎∎∎  (1.71%)
153.6 - 179.2 [2072] ∎∎∎∎∎∎∎∎∎  (1.70%)
179.2 - 204.8 [1589] ∎∎∎∎∎∎∎  (1.30%)
204.8 - 230.4 [1533] ∎∎∎∎∎∎  (1.25%)
230.4 - 256.0 [1562] ∎∎∎∎∎∎∎  (1.28%)

@mira-miracoli mira-miracoli requested a review from bgruening July 29, 2025 14:55
@mira-miracoli mira-miracoli requested a review from wm75 August 20, 2025 09:52
@bgruening
Member

I don't see in your screenshot where the data is coming from. A similar figure that I know of does not take the input size into consideration. You just see that 14% of all jobs finish with less than 30 GB of memory, 30% of all jobs with less than 100 GB of memory, etc.
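
To make that reading concrete, the bin counts from the histogram output earlier in this thread can be accumulated into cumulative percentages. This is a minimal Python sketch: the bin edges and counts are copied from that output, while the function name `cumulative_percentages` is chosen here for illustration only.

```python
# Upper bin edges and counts copied from the histogram output in this thread.
BINS = [
    (25.6, 17861), (51.2, 10820), (76.8, 5866), (102.4, 3500), (128.0, 2575),
    (153.6, 2092), (179.2, 2072), (204.8, 1589), (230.4, 1533), (256.0, 1562),
]
NUM_SAMPLES = 122212  # total reported by the tool, including out-of-range values


def cumulative_percentages(bins, total):
    """Return (upper_edge, percent of all samples at or below that edge)."""
    running = 0
    result = []
    for upper, count in bins:
        running += count
        result.append((upper, round(100 * running / total, 2)))
    return result


if __name__ == "__main__":
    for upper, pct in cumulative_percentages(BINS, NUM_SAMPLES):
        print(f"<= {upper:6.1f}: {pct:5.2f}%")
```

Read this way, the ten bins together cover 49470 of the 122212 samples; the remaining 72742 are the values the tool reports as outside min/max. What the column actually measures depends on the query that produced it, which the output alone does not show.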

@mira-miracoli
Contributor Author

gxadmin query tool-memory-per-inputs

@pvanheus

pvanheus commented Jan 5, 2026

Hi there @mira-miracoli - the RAM requirement is more connected to the DB size than to the input size. Since the kraken2 wrapper pulls the DB location from a tool data table, I am not sure whether this is included in the input size. For example, for the 8.1 GB Mycobacterium database, 10 GB is enough; for the 94 GB "standard" database, 100 GB works.

So the challenge then becomes: how to access this information? Right now kraken2 relies on the kraken2_databases tool data table, which includes fields "value, name, path". If TPV is able to access the path where tool data is stored, the size of the database can be calculated and the RAM usage inferred. If, however, TPV is isolated from this filesystem, it will be necessary to update the data manager to store the size of the DB in the tool data table.
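
The first option (measure the database directory, then infer RAM) could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the function names, the 5% headroom, and the rounding to the next 10 GB were picked only because they reproduce the two data points mentioned above (8.1 GB needs 10 GB, 94 GB needs 100 GB). How such a helper would be wired into a TPV rule is not shown here.

```python
import math
import os


def directory_size_gb(path):
    """Sum the sizes of all regular files under `path`, in GB (10**9 bytes)."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):  # skips broken symlinks
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1e9


def estimate_memory_gb(db_size_gb, headroom=1.05, step_gb=10):
    """Hypothetical heuristic: add headroom, then round up to the next step.

    The defaults are assumptions, not measured values.
    """
    steps = math.ceil(db_size_gb * headroom / step_gb)
    return max(step_gb, steps * step_gb)


if __name__ == "__main__":
    for size in (8.1, 94):
        print(f"{size} GB database -> request {estimate_memory_gb(size)} GB")
```

A fixed step keeps the number of distinct memory requests small, which may or may not matter for scheduling; a plain multiplicative headroom would work just as well for the two examples given.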

In the future, kraken2 could have the database passed as an input (galaxyproject/tools-iuc#7257) in which case the size of the database input could be used.

I am very naive when it comes to TPV, so please let me know if this line of thinking makes sense.

@mira-miracoli
Contributor Author

mira-miracoli commented Jan 12, 2026

I see, maybe we can wait then for this to happen:

> In the future, kraken2 could have the database passed as an input (galaxyproject/tools-iuc#7257) in which case the size of the database input could be used.

> If TPV is able to access the path where tool data is stored, the size of the database can be calculated and the RAM usage inferred.

I think it could access this file, but could it slow down the job handlers maybe?
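
One way to limit that cost, offered as a sketch rather than a measured answer, is to walk each database directory only once per process and reuse the result. The Python example below uses `functools.lru_cache` for this; the function name is hypothetical, and it assumes a database directory does not change in place after it has been installed.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def cached_directory_size_bytes(path):
    """Walk `path` once and remember the result for later calls in this process."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total
```

The cache lives in memory per process, so each handler would pay for the first lookup of every database after a restart; `cached_directory_size_bytes.cache_clear()` resets it if a directory is replaced.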

@bgruening
Member

I have this here for a TPV rule. @pvanheus wanted to look at it:

1.5G    2019-07-13T223742Z_silva_kmer-len_35_minimizer-len_31_minimizer-spaces_6
266M    2019-07-13T223747Z_greengenes_kmer-len_35_minimizer-len_31_minimizer-spaces_6
9.0G    2019-07-13T223756Z_minikraken2_v2_8GB
248M    2019-07-13T223807Z_rdp_kmer-len_35_minimizer-len_31_minimizer-spaces_6
45G     2019-07-14T163343Z_standard_kmer-len_35_minimizer-len_31_minimizer-spaces_6
266M    2020-06-24T164430Z_greengenes_kmer-len_35_minimizer-len_31_minimizer-spaces_6
55G     2020-06-24T164454Z_standard_kmer-len_35_minimizer-len_31_minimizer-spaces_6
248M    2020-06-24T164505Z_rdp_kmer-len_35_minimizer-len_31_minimizer-spaces_6
1.5G    2020-06-24T164526Z_silva_kmer-len_35_minimizer-len_31_minimizer-spaces_6
2.3G    2022-02-02T162953Z_greengenes_kmer-len_35_minimizer-len_31_minimizer-spaces_6_load-factor_0.7
2.1G    2022-02-02T162959Z_silva_kmer-len_35_minimizer-len_31_minimizer-spaces_6_load-factor_0.7
70G     2022-07-06T094102Z_standard_prebuilt_standard_2022-06-07
636M    2022-08-04T105935Z_standard_prebuilt_viral_2022-06-07
74G     2022-09-04T165121Z_standard_prebuilt_pluspf_2022-06-07
156G    2022-09-05T092205Z_standard_prebuilt_pluspfp_2022-06-07
18G     2023-08-17T071759Z_standard_prebuilt_standard_16gb_2022-06-07
9.1G    2023-08-17T071804Z_standard_prebuilt_standard_08gb_2022-06-07
9.1G    2024-07-15T083656Z_standard_prebuilt_pluspf_08gb_2024-01-12
19G     2024-07-15T175900Z_standard_prebuilt_pluspf_16gb_2024-01-12
93G     2024-07-15T185612Z_standard_prebuilt_pluspf_2024-01-12
313G    2025-01-04T202436Z_standard_prebuilt_core_nt_2024-09-04
97G     2025-07-23T080057Z_standard_prebuilt_standard_2024-09-04
2.4G    kalamari
9.7G    2024-10-12T100500Z_Mycobacterium_v1_doi_10.5281_zenodo.8339822
2.1G    fungi2019-03
1.4G    plasmid2019-03
440M    viral2019-03
114G    pluspf2021-05
185M    16S_SILVA138_k2db2021-08
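
If a listing like this is meant to feed a rule, it first has to become numbers. Below is a minimal Python sketch, assuming the listing is `du -sh`-style output (the comment does not say how it was produced), where the K/M/G/T suffixes are powers of 1024. The function name `parse_du_listing` is made up for this example.

```python
import re

# Multipliers for the binary suffixes printed by `du -h`.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
_LINE = re.compile(r"^\s*([0-9.]+)([KMGT])\s+(\S+)\s*$")


def parse_du_listing(text):
    """Map each database directory name to its size in GiB."""
    sizes = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # skip blank or unrecognised lines
        number, unit, name = match.groups()
        sizes[name] = float(number) * _UNITS[unit] / 1024**3
    return sizes


if __name__ == "__main__":
    # Three lines copied from the listing above.
    sample = """\
2.4G    kalamari
440M    viral2019-03
114G    pluspf2021-05
"""
    for name, gib in sorted(parse_du_listing(sample).items()):
        print(f"{name}: {gib:.2f} GiB")
```

The human-readable sizes are already rounded by `du`, so the parsed values are only approximate; for a rule that sits close to a memory limit, exact byte counts would be safer.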
