Invalid character results in wrong error message ("All sequences must have the same length") #19

@niemasd

I have one sequence (hCoV_19_Norway_1539_2020_EPI_ISL_417487) that tn93 keeps reporting as one character shorter than it actually is (or at least appears to be). I have attached a minimal working example below:

example.txt

I tried to run tn93 as follows:

cat example.aln | tn93 -l 1 -t 1

But I get the following error message:

All sequences must have the same length (29811), but sequence 'hCoV_19_Norway_1539_2020_EPI_ISL_417487' had length 29810

However, I tried checking it in Python (lines[3] is the problematic sequence):

lines = open('example.txt').readlines()

len(lines[1])  # prints 29812 (includes the newline at the end)
lines[1][:10]  # 'CTTCCCAGGT'
lines[1][-10:] # 'AATTTTAGT\n'
set(lines[1])  # {'\n', 'R', 'G', 'A', 'C', 'T', 'M'}

len(lines[3])  # prints 29812 (includes the newline at the end)
lines[3][:10]  # 'CTTCCCAGGT'
lines[3][-10:] # 'AATTTTAGT\n'
set(lines[3])  # {'V', 'S', '\n', 'R', 'G', 'I', 'A', 'C', 'Y', 'T'}

len(lines[5])  # prints 29812 (includes the newline at the end)
lines[5][:10]  # '----------'
lines[5][-10:] # 'AATTTTAGT\n'
set(lines[5])  # {'\n', 'G', 'A', '-', 'C', 'T'}

Excluding the newline character after every line (which is included in the lengths printed by the above code), each sequence has exactly 29811 characters.

The only weird character I see in the problematic sequence is I, which doesn't seem to be a standard IUPAC character. Thoughts?
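
One way to test that suspicion is to list every character that falls outside the IUPAC nucleotide alphabet, together with its position. The following is a minimal sketch, not part of tn93: the helper names `find_unexpected` and `scan_fasta_lines` are hypothetical, the file is assumed to alternate header and sequence lines (as `example.txt` does), and the assumption that tn93 accepts exactly the IUPAC codes plus `-` is mine.

```python
# Standard IUPAC nucleotide codes plus the gap character.
# ASSUMPTION: this is the alphabet tn93 accepts; the issue does not confirm it.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")
ALLOWED = IUPAC_NUCLEOTIDES | {"-"}


def find_unexpected(seq, allowed=ALLOWED):
    """Return (index, char) pairs for characters outside the allowed set."""
    stripped = seq.rstrip("\r\n")  # ignore the trailing newline
    return [(i, c) for i, c in enumerate(stripped) if c.upper() not in allowed]


def scan_fasta_lines(lines):
    """Scan a one-line-per-sequence FASTA given as a list of lines.

    ASSUMPTION: even-indexed lines are headers, odd-indexed lines are sequences.
    Returns {sequence name: [(index, char), ...]} for sequences with hits only.
    """
    report = {}
    for header, seq in zip(lines[0::2], lines[1::2]):
        hits = find_unexpected(seq)
        if hits:
            report[header.strip().lstrip(">")] = hits
    return report


if __name__ == "__main__":
    # Tiny synthetic input, not the real alignment.
    demo = [">ok\n", "ACGT-RM\n", ">bad\n", "ACGIT\n"]
    print(scan_fasta_lines(demo))  # {'bad': [(3, 'I')]}
```

Running `scan_fasta_lines(open('example.txt').readlines())` on the attachment should show whether `I` is the only unexpected character in the problematic sequence. If it reports exactly one hit, that would be consistent with the reported length being one short, though it would not prove that tn93 skips the character.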
