Define allowed values for reference assemblies used during reference_genome inference #5

@NoopDog

Need

When inferring the reference assembly used for a genomic file (e.g., BAM, VCF, GTF), we must align the result to a controlled vocabulary of valid, recognized reference assemblies.

The findability subset currently defines reference_genome as "A reference to the collection of sequences taken as the standard for a given organism. May be defined by https://www.ncbi.nlm.nih.gov/grc." That definition is too open-ended to validate inferred values against, so we should strive to be more exact.

An initial proposal is to capture the minimal fields below and to review the fields under "Other fields to consider" for possible inclusion.

Minimal Fields

  • assembly_accession — Unique stable identifier, typically from RefSeq or GenBank (e.g., GCF_000001405.39)
  • assembly_name — Human-readable name of the assembly (e.g., GRCh38)
  • assembly_version — Patch version or subversion (e.g., p13)
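
To make the proposal concrete, here is a minimal sketch of how the three minimal fields could back an allowed-values list. The names `ReferenceAssembly`, `ALLOWED_ASSEMBLIES`, and `match_assembly` are hypothetical, and the list holds only the example values quoted in this issue; a real vocabulary would need to be curated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReferenceAssembly:
    """One entry in the controlled vocabulary (minimal fields only)."""
    assembly_accession: str  # e.g. GCF_000001405.39
    assembly_name: str       # e.g. GRCh38
    assembly_version: str    # e.g. p13


# Hypothetical allowed-values list; only the example from this issue is included.
ALLOWED_ASSEMBLIES = (
    ReferenceAssembly("GCF_000001405.39", "GRCh38", "p13"),
)


def match_assembly(inferred: str) -> Optional[ReferenceAssembly]:
    """Return the allowed assembly matching an inferred label, else None.

    Matching is case-insensitive against the accession, the name, and
    "name.version". Matching on the bare name is an assumption: it becomes
    ambiguous once several patches of the same assembly are allowed.
    """
    key = inferred.strip().lower()
    for asm in ALLOWED_ASSEMBLIES:
        candidates = {
            asm.assembly_accession.lower(),
            asm.assembly_name.lower(),
            f"{asm.assembly_name}.{asm.assembly_version}".lower(),
        }
        if key in candidates:
            return asm
    return None
```

With this shape, an inferred label such as `grch38.p13` resolves to the entry with accession `GCF_000001405.39`, while an unrecognized label returns `None` and can be flagged for review instead of being stored as free text.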

Other fields to consider

  • organism_name — Scientific name of the species (e.g., Homo sapiens)
  • tax_id — NCBI taxonomy ID (e.g., 9606)
  • source_database — Originating database or authority (e.g., RefSeq, GenBank, Ensembl)
  • source_uri — Public URI to the full assembly metadata (e.g., NCBI GRCh38.p13)
  • refget_digest — GA4GH refget-compliant digest of the full FASTA
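
If refget_digest is adopted, note that GA4GH refget digests are defined per sequence, so a digest for a "full FASTA" would need an additional convention that is not settled here. As a rough sketch of the per-sequence part only, the GA4GH truncated digest (SHA-512, first 24 bytes, base64url) can be computed as below. The function names are hypothetical, and the whitespace stripping and uppercasing are assumptions about input normalization.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def refget_sequence_digest(sequence: str) -> str:
    """Digest a single sequence, prefixed with 'SQ.'.

    Assumption: whitespace is removed and bases are uppercased before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    return "SQ." + sha512t24u(normalized.encode("ascii"))
```

Because 24 bytes encode to exactly 32 base64url characters with no padding, every digest produced this way has a fixed length, which makes the field easy to validate.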
