I have one sequence (hCoV_19_Norway_1539_2020_EPI_ISL_417487) that tn93 keeps reporting as one character shorter than it actually is (or at least seems to be). I have attached a minimal working example below:
I tried to run tn93 as follows:
cat example.aln | tn93 -l 1 -t 1
But I get the following error message:
All sequences must have the same length (29811), but sequence 'hCoV_19_Norway_1539_2020_EPI_ISL_417487' had length 29810
However, I tried checking it in Python (lines[3] is the problematic sequence):
lines = open('example.txt').readlines()
len(lines[1]) # prints 29812 (includes the newline at the end)
lines[1][:10] # 'CTTCCCAGGT'
lines[1][-10:] # 'AATTTTAGT\n'
set(lines[1]) # {'\n', 'R', 'G', 'A', 'C', 'T', 'M'}
len(lines[3]) # prints 29812 (includes the newline at the end)
lines[3][:10] # 'CTTCCCAGGT'
lines[3][-10:] # 'AATTTTAGT\n'
set(lines[3]) # {'V', 'S', '\n', 'R', 'G', 'I', 'A', 'C', 'Y', 'T'}
len(lines[5]) # prints 29812 (includes the newline at the end)
lines[5][:10] # '----------'
lines[5][-10:] # 'AATTTTAGT\n'
set(lines[5]) # {'\n', 'G', 'A', '-', 'C', 'T'}
Excluding the newline character after every line (which is included in the lengths printed by the above code), each sequence has exactly 29811 characters.
The only weird character I see in the problematic sequence is 'I', which doesn't seem to be a standard IUPAC character. Thoughts?
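One way to probe this suspicion is to count, for every sequence, the characters that fall outside the IUPAC nucleotide alphabet. The sketch below is only an illustration: the names `ALLOWED`, `non_iupac_counts`, and `report` are hypothetical, the allowed set (IUPAC codes plus `-`) is an assumption rather than tn93's actual accepted alphabet, and whether tn93 really drops unrecognized characters is not confirmed here.

```python
from collections import Counter

# IUPAC nucleotide codes plus the gap symbol. This set is an assumption for
# illustration; the alphabet tn93 actually accepts may differ.
ALLOWED = set("ACGTURYSWKMBDHVN-")


def non_iupac_counts(sequence, allowed=ALLOWED):
    """Count the characters in `sequence` that are not in `allowed`."""
    return Counter(ch for ch in sequence.upper() if ch not in allowed)


def report(fasta_lines):
    """Yield (header, length, unexpected_counts) for each FASTA record.

    Sequences may span one or several lines; line endings are ignored.
    """
    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                seq = "".join(chunks)
                yield header, len(seq), non_iupac_counts(seq)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        seq = "".join(chunks)
        yield header, len(seq), non_iupac_counts(seq)
```

Iterating over `report(open('example.aln'))` and printing each tuple would show whether the number of unexpected characters in a sequence matches the length discrepancy in the error message. If the problematic sequence contained exactly one 'I', that would be consistent with the off-by-one length, though it would not prove the cause.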