We've identified a corner case where a clinically relevant non-coding gene (RNU2-2[P]) overlaps a non-clinically-relevant gene (WDR74). bcftools csq's logic only searches for non-coding transcript consequences if there are no coding-transcript hits.
https://github.com/samtools/bcftools/blob/develop/csq.c#L3694
We are using BCFtools in a workflow where csq annotates variant consequences and is also used to associate variants with genes, so annotation on non-coding genes is still important. This csq behaviour went unnoticed for a while because in our hands it was annotating some non-coding genes just fine (e.g. RNU4-2). That now appears to be because RNU4-2 doesn't overlap a coding transcript, so this condition was never triggered.
We were able to overcome this issue by splitting the GFF into coding and non-coding, and doing two non-conflicting annotation loops, but we've also solved the problem in code and wondered if this was a change you might be interested in adopting.
- Original logic: If a CDS, UTR, or splice consequence is annotated on the variant record, don't run the transcript scan (the transcript scan is the only point in the code where non-coding CSQs originate)
- New logic: Record whether a CDS, UTR, or splice consequence is annotated on the variant record, then run the transcript scan. If a coding variant was detected, skip all coding transcripts, but annotate non-coding transcripts as normal.
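To make the difference between the two behaviours concrete, here is a toy model in Python. It is only a sketch of the control flow described in the two bullets, not the csq.c code: the `Transcript` class, the function names, the consequence labels, and the coordinates are all made up for illustration, and "coding hit" is simplified to "the position falls inside the CDS range".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transcript:
    gene: str
    start: int
    end: int
    cds: Optional[Tuple[int, int]] = None  # None means a non-coding transcript

    @property
    def coding(self) -> bool:
        return self.cds is not None


def coding_hits(transcripts: List[Transcript], pos: int) -> List[str]:
    """Stand-in for the CDS/UTR/splice tests: hit if pos lies inside a CDS."""
    return [
        f"coding|{t.gene}"
        for t in transcripts
        if t.coding and t.cds[0] <= pos <= t.cds[1]
    ]


def transcript_scan(transcripts: List[Transcript], pos: int, skip_coding: bool) -> List[str]:
    """Stand-in for the transcript scan, the only source of non-coding CSQs."""
    out = []
    for t in transcripts:
        if not (t.start <= pos <= t.end):
            continue
        if t.coding and skip_coding:
            continue  # coding transcripts were already handled
        out.append(f"{'other' if t.coding else 'non_coding'}|{t.gene}")
    return out


def annotate_original(transcripts: List[Transcript], pos: int) -> List[str]:
    hits = coding_hits(transcripts, pos)
    if hits:
        return hits  # the transcript scan never runs
    return transcript_scan(transcripts, pos, skip_coding=False)


def annotate_proposed(transcripts: List[Transcript], pos: int) -> List[str]:
    hits = coding_hits(transcripts, pos)
    # Always scan, but skip coding transcripts if a coding hit was recorded.
    return hits + transcript_scan(transcripts, pos, skip_coding=bool(hits))


# Hypothetical coordinates: a non-coding gene nested inside a coding one.
transcripts = [
    Transcript("WDR74", 100, 500, cds=(150, 450)),
    Transcript("RNU2-2P", 200, 260),
]
print(annotate_original(transcripts, 220))
print(annotate_proposed(transcripts, 220))
```

In this toy model, a position inside both genes yields only the WDR74 consequence under the original logic, while the proposed logic reports the WDR74 consequence plus the RNU2-2P one. A position that has no coding hit is handled identically by both.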
develop...populationgenomics:bcftools:develop
In practice this leaves the coding annotation unchanged, and always checks for overlapping non-coding gene annotations, removing the conflict between the two entities.
I appreciate that non-coding annotation is not always wanted, so this change might not be useful for most users. Ideally it would sit behind a CLI switch so that users can opt in to the additional non-coding annotation.
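For anyone who needs the GFF-splitting workaround mentioned earlier in the meantime, a split along the following lines should be enough. This is a minimal sketch rather than our exact script, and it rests on assumptions: an Ensembl-style GFF3 in which top-level gene records carry a `biotype` attribute, children reference their parent via `Parent=`, and parents appear before their children. The function names are hypothetical. Each output would then be passed to its own csq run.

```python
from typing import Dict, Iterable, List, Tuple


def parse_attrs(field: str) -> Dict[str, str]:
    """Parse the GFF3 attribute column (key=value pairs separated by ';')."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def split_gff(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split GFF3 lines into (coding, non_coding) by the root gene's biotype."""
    coding: List[str] = []
    non_coding: List[str] = []
    is_coding_by_id: Dict[str, bool] = {}

    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Comments, directives, blank or malformed lines go to both outputs.
        if line.startswith("#") or len(cols) < 9:
            coding.append(line)
            non_coding.append(line)
            continue

        attrs = parse_attrs(cols[8])
        parent = attrs.get("Parent")
        if parent is None:
            # Top-level record: classify by its own biotype (assumed attribute).
            is_coding = attrs.get("biotype") == "protein_coding"
        else:
            # Child record: inherit the classification of its (first) parent.
            is_coding = is_coding_by_id.get(parent.split(",")[0], False)

        if "ID" in attrs:
            is_coding_by_id[attrs["ID"]] = is_coding
        (coding if is_coding else non_coding).append(line)

    return coding, non_coding
```

Note that this classifies by gene, so non-coding transcripts belonging to a protein-coding gene stay in the coding file.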