Open
Description
Need
When inferring the reference assembly used for a genomic file (e.g., BAM, VCF, GTF), we must align the result to a controlled vocabulary of valid, recognized reference assemblies.
The findability subset currently defines reference_genome as "A reference to the collection of sequences taken as the standard for a given organism. May be defined by https://www.ncbi.nlm.nih.gov/grc." We should make this definition more exact.
An initial proposal is to capture the minimal fields below and review the "Other fields" for possible inclusion.
Minimal Fields
- assembly_accession — Unique stable identifier, typically from RefSeq or GenBank (e.g., GCF_000001405.39)
- assembly_name — Human-readable name of the assembly (e.g., GRCh38)
- assembly_version — Patch version or subversion (e.g., p13)
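As a rough illustration of how the three minimal fields could be carried together, here is a sketch in Python. The class name `AssemblyRef`, the `label` property, and the accession pattern are assumptions made for this example and are not part of any existing schema; the pattern assumes NCBI-style accessions (`GCA_` or `GCF_`, nine digits, a dot, and a version number).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: accessions follow the NCBI assembly form, i.e. GCA_ (GenBank)
# or GCF_ (RefSeq), nine digits, then ".<version>".
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


@dataclass(frozen=True)
class AssemblyRef:
    """Hypothetical container for the proposed minimal fields."""

    assembly_accession: str                  # e.g. "GCF_000001405.39"
    assembly_name: str                       # e.g. "GRCh38"
    assembly_version: Optional[str] = None   # e.g. "p13"; None if unpatched

    def __post_init__(self):
        # Reject values that cannot be a recognized accession or name.
        if not ACCESSION_RE.match(self.assembly_accession):
            raise ValueError(
                f"unrecognized assembly accession: {self.assembly_accession!r}"
            )
        if not self.assembly_name:
            raise ValueError("assembly_name must not be empty")

    @property
    def label(self) -> str:
        """Name and patch version joined with a dot, or the name alone."""
        if self.assembly_version:
            return f"{self.assembly_name}.{self.assembly_version}"
        return self.assembly_name


# Example values taken from the field descriptions above.
grch38_p13 = AssemblyRef("GCF_000001405.39", "GRCh38", "p13")
```

Keeping `assembly_version` optional is one possible choice; it lets an unpatched assembly be recorded without inventing a placeholder value.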
Other fields to consider
- organism_name — Scientific name of the species (e.g., Homo sapiens)
- tax_id — NCBI taxonomy ID (e.g., 9606)
- source_database — Originating database or authority (e.g., RefSeq, GenBank, Ensembl)
- source_uri — Public URI to the full assembly metadata (e.g., NCBI GRCh38.p13)
- refget_digest — GA4GH refget-compliant digest of the full FASTA
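To make the "align to a controlled vocabulary" step concrete, the sketch below resolves an inferred label against a small in-memory vocabulary. Everything here is hypothetical: the `VOCABULARY` list, the `resolve_assembly` function, and the choice of lookup keys are assumptions for illustration, and the single entry only reuses the example values from the field lists above. `source_uri` and `refget_digest` are left out because the issue gives no concrete values for them.

```python
from typing import Optional

# Hypothetical controlled vocabulary. The one entry is built only from the
# example values given in this issue.
VOCABULARY = [
    {
        "assembly_accession": "GCF_000001405.39",
        "assembly_name": "GRCh38",
        "assembly_version": "p13",
        "organism_name": "Homo sapiens",
        "tax_id": 9606,
        "source_database": "RefSeq",
    },
]


def _lookup_keys(entry: dict) -> set:
    """Return the lowercase strings under which an entry can be found."""
    keys = {entry["assembly_accession"]}
    version = entry.get("assembly_version")
    if version:
        keys.add(f"{entry['assembly_name']}.{version}")
    return {key.lower() for key in keys}


# Index every entry by accession and by "name.version".
INDEX = {key: entry for entry in VOCABULARY for key in _lookup_keys(entry)}


def resolve_assembly(inferred: str) -> Optional[dict]:
    """Map an inferred label to a vocabulary entry, or None if unrecognized."""
    return INDEX.get(inferred.strip().lower())
```

In this sketch a bare name such as `GRCh38` deliberately does not resolve, since it does not identify a single patch release; whether the vocabulary should accept such labels (or aliases from other sources) is left open.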