Daylily-Informatics/daylily-omics-analysis


Daylily Omics Analysis

Daylily Omics Analysis contains the Snakemake workflows, shell entrypoints, profile configuration, and run documentation used for Daylily whole-genome sequencing analysis. It is specifically tuned to run inside infrastructure created by daylily-ephemeral-cluster, with Daylily omics/reference data mounted on the headnode and compute nodes under /fsx/data.

This repository does not create, update, or destroy AWS infrastructure. Cluster lifecycle, FSx mounts, and production sample staging belong to daylily-ephemeral-cluster and its daylily-ec CLI. Use daylily-ec to stage reads and create or deliver the samples.tsv and units.tsv manifests for production worksets; this repo consumes those manifests from each analysis clone.

Current Inputs

Current workflows use paired manifest tables:

| File | Purpose |
| --- | --- |
| `config/samples.tsv` | One row per biological sample and truth/control metadata. |
| `config/units.tsv` | One row per sequencing unit, lane, read pair, CRAM/BAM, or downsampled analysis unit. |

The legacy config/analysis_manifest.csv path is historical. Keep it only for old-run conversion notes.

SUBSAMPLE_PCT in units.tsv is supported for inline FASTQ downsampling. Values must be floats in (0.0, 1.0]; use na or an empty value when no downsampling is intended.
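The range rule above can be checked before a run. The following is a minimal sketch, not part of the repository: the function name check_subsample_pct_column is hypothetical, and it assumes the header is the first line, that na is written in lowercase, and that values use plain decimal notation rather than scientific notation.

```shell
# check_subsample_pct_column FILE   (hypothetical helper, not shipped with Daylily)
# Exit 0 when every SUBSAMPLE_PCT value in the tab-separated FILE is empty,
# "na", or a decimal in (0.0, 1.0]; exit 1 on a bad value; exit 2 when the
# header row has no SUBSAMPLE_PCT column.
check_subsample_pct_column() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "SUBSAMPLE_PCT") col = i
      if (!col) { missing = 1; exit }
      next
    }
    {
      v = $col
      if (v == "" || v == "na") next   # no downsampling requested
      ok = (v ~ /^([0-9]+\.?[0-9]*|\.[0-9]+)$/) && (v + 0 > 0) && (v + 0 <= 1)
      if (!ok) {
        printf "line %d: invalid SUBSAMPLE_PCT \"%s\"\n", NR, v > "/dev/stderr"
        bad = 1
      }
    }
    END {
      if (NR == 0 || missing) {
        print "no SUBSAMPLE_PCT column in header" > "/dev/stderr"
        exit 2
      }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

For example, check_subsample_pct_column config/units.tsv reports each offending line on stderr and returns a non-zero status if any value falls outside (0.0, 1.0].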

For production analyses, prefer manifests generated or staged by daylily-ec from the operator side. Hand-written copies are acceptable for focused debugging only when the paths, genome build, and /fsx/data reference resources have been verified.
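When a hand-written manifest is used for debugging, a quick shape check can catch stray or missing tabs before Snakemake parses the file. This is a sketch under stated assumptions: check_tsv_shape is a hypothetical helper, it only compares field counts against the header row, and it does not validate column names, paths, or the schema described in docs/ops/config.md.

```shell
# check_tsv_shape FILE   (hypothetical helper, not shipped with Daylily)
# Exit 0 when FILE has a header plus at least one data row and every row has
# the same number of tab-separated fields as the header; exit 1 otherwise.
check_tsv_shape() {
  awk -F'\t' '
    NR == 1 { ncol = NF; next }
    NF != ncol {
      printf "line %d: %d fields, header has %d\n", NR, NF, ncol > "/dev/stderr"
      bad = 1
    }
    END {
      if (NR < 2) { print "no data rows" > "/dev/stderr"; exit 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}
```

A typical call would be check_tsv_shape config/samples.tsv && check_tsv_shape config/units.tsv. Blank lines count as malformed rows in this sketch.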

Quick Start

For a local smoke test from a fresh checkout, use the existing DAY-EC environment. This verifies wiring and small fixtures; routine full workflows are expected to run on a prepared headnode. The fixture copy commands below write config/samples.tsv and config/units.tsv, so run them in a scratch checkout or preserve existing manifests first:

eval "$(conda shell.zsh hook)"
conda activate DAY-EC
source dyoainit
dy-a local hg38

mkdir -p config
cp .test_data/data/0.01xwgs_HG002_hg38.samples.tsv config/samples.tsv
cp .test_data/data/0.01xwgs_HG002_hg38.units.tsv config/units.tsv

dy-r produce_alignstats -p -j 1 -n
dy-r produce_alignstats -p -j 1
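Because the fixture copy commands overwrite config/samples.tsv and config/units.tsv, existing manifests can be preserved first. The helper below is a minimal sketch: backup_manifests and the .bak naming are hypothetical conventions chosen for illustration, not something the repository provides.

```shell
# backup_manifests [CONFIG_DIR]   (hypothetical helper; CONFIG_DIR defaults to "config")
# Copy any existing samples.tsv / units.tsv to a timestamped .bak file in the
# same directory so the fixture copies do not destroy them.
backup_manifests() {
  local dir stamp name
  dir="${1:-config}"
  stamp="$(date +%Y%m%d-%H%M%S)"
  for name in samples.tsv units.tsv; do
    if [ -f "$dir/$name" ]; then
      cp -p "$dir/$name" "$dir/$name.$stamp.bak" || return 1
      echo "saved $dir/$name.$stamp.bak"
    fi
  done
}
```

Running backup_manifests before the two cp commands leaves the originals recoverable. A file that does not exist is skipped silently.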

For a Slurm-backed headnode run, connect through daylily-ec/SSM, then use a persistent workset clone. Stage production reads and manifests with daylily-ec before running workflow targets:

cd /fsx/analysis_results/ubuntu
day-clone -t <git-ref-or-tag> -d <workset-name>
cd /fsx/analysis_results/ubuntu/<workset-name>/daylily-omics-analysis

source dyoainit
dy-a slurm hg38

dy-r produce_snv_concordances -p -k -j 20 -n
dy-r produce_snv_concordances -p -k -j 20
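Both Quick Start examples follow the same pattern: a dry-run with -n, then the identical command without it. A small wrapper can enforce that order. This is an illustrative sketch only: dry_then_run is a hypothetical name, and it assumes the wrapped command accepts a trailing -n as its dry-run flag, as dy-r does in the examples above.

```shell
# dry_then_run COMMAND [ARGS...]   (hypothetical wrapper)
# Run COMMAND with a trailing -n first; run it for real only if the dry-run
# exits with status 0.
dry_then_run() {
  if [ "$#" -eq 0 ]; then
    echo "usage: dry_then_run COMMAND [ARGS...]" >&2
    return 2
  fi
  echo "dry-run: $* -n" >&2
  if ! "$@" -n; then
    echo "dry-run failed; skipping the real run" >&2
    return 1
  fi
  echo "run: $*" >&2
  "$@"
}
```

With this defined after source dyoainit, dry_then_run dy-r produce_snv_concordances -p -k -j 20 would be equivalent to the two commands above, except that a failing dry-run stops the sequence.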

Run dy-r help for available targets and use tab completion after source dyoainit.

CLI Entry Points

| Command | Purpose |
| --- | --- |
| `source dyoainit` | Initialize Daylily shell functions, environment checks, and completion. |
| `dy-a <local\|slurm> <build>` | Select the run profile (local or slurm) and genome build, for example dy-a slurm hg38. |
| `dy-r <targets...> [flags]` | Compose and run the Snakemake command. |
| `dy-m [--workdir PATH] [--interval N]` | Monitor command history, master log, Slurm jobs, and recent task logs. |
| `dy-g <hg38\|hg38_broad…` | |
| `dy-d reset` | Reset Daylily shell state. |

Common flags passed through dy-r:

| Flag | Meaning |
| --- | --- |
| `-n` | Dry-run. |
| `-p` | Print shell commands. |
| `-k` | Keep independent jobs running after a failure. |
| `-j N` | Limit concurrent Snakemake jobs. |
| `-T N` | Snakemake retry/attempt flag used by existing Daylily run commands. |
| `--rerun-incomplete` | Re-run incomplete outputs. |
| `--keep-incomplete` | Keep incomplete outputs for debugging failed jobs. |
| `--keep-temp` | Daylily convenience flag translated by bin/day_run to Snakemake --notemp. |
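To make the --keep-temp translation concrete, the sketch below shows one way such a rewrite could work. It is not the code in bin/day_run, whose implementation is not shown here; translate_keep_temp is a hypothetical function that only illustrates mapping the Daylily flag to Snakemake's --notemp while passing every other argument through unchanged.

```shell
# translate_keep_temp ARGS...   (hypothetical illustration, not bin/day_run)
# Print ARGS one per line, replacing --keep-temp with Snakemake's --notemp.
translate_keep_temp() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --keep-temp) printf '%s\n' "--notemp" ;;
      *)           printf '%s\n' "$arg" ;;
    esac
  done
}
```

Calling translate_keep_temp -p --keep-temp -j 1 prints -p, --notemp, -j, and 1 on separate lines.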

Common Workflow Targets

| Target | Typical use |
| --- | --- |
| `produce_alignstats` | Alignment statistics and aggregate alignstats_combo_mqc.tsv. |
| `produce_snv_concordances` | GIAB/RTG concordance outputs where truth metadata is present. |
| `produce_sentd_snv_vcf` | Illumina Sentieon DNAscope SNV calling. |
| `produce_deep19_snv_vcf` | DeepVariant 1.9 SNV calling. |
| `produce_sentdont_snv_vcf` | ONT Sentieon SNV calling. |
| `produce_sentdpb_snv_vcf` | PacBio Sentieon SNV calling. |
| `produce_sentdug_snv_vcf` | Ultima Genomics SNV calling, usually on hg38_broad. |
| `produce_cgt7p_snv_vcf` | Complete Genomics/MGI Sentieon DNAscope path using sentcg and cgt7p. |
| `produce_sentdhiom_snv_vcf` | Modular Illumina+ONT hybrid Sentieon workflow. |
| `produce_sentdhuom_snv_vcf` | Modular Ultima+ONT hybrid Sentieon workflow. |
| `produce_dmd_dedup_cram`, `produce_smd_dedup_cram`, `produce_na_dedup_cram` | Canonical dedup selector targets; dppl is accepted only as a deprecated alias for dmd. |
| `produce_all_align`, `produce_all_dedup_cram`, `produce_all_snv_vcf`, `produce_all_sv_vcf` | Run every registered selector in that stage, subject to manifest/platform compatibility. |
| `produce_bclconvert_fastqs`, `produce_bclconvert_metrics`, `produce_bclconvert_multiqc`, `produce_bclconvert_fastqs_and_metrics` | Illumina BCL Convert bootstrap, generated units, demux metrics, and MultiQC-ready BCL metric tables. |
| `produce_manta_sv_vcf`, `produce_tiddit_sv_vcf`, `produce_dysgu_sv_vcf` | Structural variant callers. |
| `produce_htd_calls` | Selected HTD/special callers from `--config htd_callers=[...]`. |
| `produce_verifybamid2_panel_comparison` | Runs selected VerifyBamID2 SNP panels from `--config verifybamid2_panels=[...]` and writes a comparison TSV. |
| `produce_multiqc_input_data` | MultiQC for input sequence-data QC. |
| `produce_multiqc_cram` | MultiQC for CRAM/alignment QC. |
| `produce_multiqc_snv`, `produce_multiqc_sv` | MultiQC for SNV and SV QC scopes. |
| `produce_multiqc_sample_qc` | MultiQC for sample-level QC such as contamination and relatedness. |
| `produce_multiqc_variant_annotation` | MultiQC for enabled annotation QC such as VEP. |
| `produce_multiqc_all` | Canonical final routine MultiQC aggregation. |

Legacy selector targets such as produce_sentD_vcf, produce_manta, and produce_multiqc_final_wgs remain available for now, but are marked as deprecated in the workflow and docs. Current examples should use the canonical selector names above.

Complete Genomics / MGI WGS

Complete Genomics T7+ and MGI-style WGS uses the dedicated sentcg -> smd -> cgt7p path. Naming the canonical selector targets directly avoids passing selector lists through --config:

dy-r produce_sentcg_align produce_smd_dedup_cram produce_cgt7p_snv_vcf \
  produce_alignstats produce_snv_concordances \
  -p -j 20 -k -T 1 --retries 0 --rerun-incomplete --keep-incomplete

This path uses Sentieon BWA MEM with DNAscopeMGIWGS2.1.bundle/bwa.model, read group platform DNBSEQ, Sentieon duplicate marking, and DNAscope with DNAscopeMGIWGS2.1.bundle/dnascope.model plus --pcr_indel_model none.

See docs/workflows/complete_genomics_sentieon.md for model paths, output names, downsampling, and monitoring details.

Results And Logs

| Item | Location |
| --- | --- |
| Results | `results/day/<build>/` |
| Per-sample outputs | `results/day/<build>/<sample>/` |
| Aggregate reports | `results/day/<build>/other_reports/` |
| Benchmark summary | `results/day/<build>/reports/benchmarks_summary.tsv` |
| Snakemake master logs | `.snakemake/log/<timestamp>.snakemake.log` |
| Slurm logs | `logs/slurm/<rule>/*.{out,err}` |
| Command history | `day_cmd.log` |
| Completion markers | `daylily.successful_run`, `daylily.failed_run` |

When debugging, inspect logs in this order: latest .snakemake/log by mtime, relevant logs/slurm files by mtime, then the stable rule log under results/day/<build>/<sample>/.../logs/.
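The first two steps of that order can be scripted. The helpers below are a sketch, assuming bash or a POSIX-style sh and the log locations listed in the table above. The function names are hypothetical, and paths containing newlines are not handled because the output of ls is parsed. The third step, the stable rule log under results/day/<build>/<sample>/.../logs/, depends on the rule and is left to manual inspection.

```shell
# latest_snakemake_log [WORKDIR]   (hypothetical helper)
# Print the most recently modified Snakemake master log, or nothing if none exist.
latest_snakemake_log() {
  local dir
  dir="${1:-.}"
  ls -t "$dir"/.snakemake/log/*.snakemake.log 2>/dev/null | head -n 1
}

# recent_slurm_logs [WORKDIR] [COUNT]   (hypothetical helper)
# Print up to COUNT (default 5) Slurm .out/.err logs, newest first.
recent_slurm_logs() {
  local dir n
  dir="${1:-.}"
  n="${2:-5}"
  ls -t "$dir"/logs/slurm/*/*.out "$dir"/logs/slurm/*/*.err 2>/dev/null | head -n "$n"
}
```

From a workset clone, tail -n 50 "$(latest_snakemake_log)" followed by recent_slurm_logs . 10 gives a quick view of where a failed run stopped.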

Documentation Map

| Document | Purpose |
| --- | --- |
| daylily-ephemeral-cluster | Cluster lifecycle, headnode access, sample staging, and manifest generation. |
| `docs/README.md` | Documentation index and current/historical doc policy. |
| `docs/quickest_start.md` | Minimal smoke-test checklist. |
| `docs/first_ephemeral_cluster_analysis.md` | First headnode workset run. |
| `docs/ops/dycli.md` | CLI command behavior and monitoring. |
| `docs/ops/config.md` | Profiles, config precedence, sample/unit schema notes. |
| `docs/ops/tests.md` | Local validation commands. |
| `docs/ops/multiqc_qc_targets.md` | Staged MultiQC targets, runtime gating, and routine vs optional QC policy. |
| `docs/catalog_of_tools.md` | Code-sourced catalog of Daylily tool integrations, evidence, outputs, and tests. |
| `docs/ops/dir_and_file_scheme.md` | Current result layout and naming conventions. |
| `docs/ops/workflow_catalog.md` | Packaged workflow catalog API and current contents. |
| `docs/workflows/complete_genomics_sentieon.md` | Complete Genomics/MGI sentcg/smd/cgt7p workflow. |
| `docs/workflows/bclconvert_bootstrap.md` | Illumina BCL Convert bootstrap path, generated units, BCL metrics, and MultiQC custom-data integration. |
| `docs/workflows/ensemble_vcf.md` | Ensemble VCF workflow notes. |
| `docs/remote_test_execution.md` | Remote tmux/Slurm execution pattern. |

Top-level run notes such as run_cg.md, gotimeplan.md, hyb_runbook.md, hybrun.md, and ugdata.md are historical records for specific executions. Prefer the canonical docs above for new runs.

Repository Layout

.
├── bin/                         # dy-cli wrappers and utility scripts
├── config/                      # profiles, genome/supporting config, sample/unit inputs
├── daylily_omics_analysis/      # packaged Python helpers, including workflow catalog
├── docs/                        # canonical docs and historical notes
├── resources/                   # staged supporting data
├── tests/                       # shell and Python validation tests
└── workflow/                    # Snakemake rules, envs, scripts, and schemas

Development Checks

For a documentation-only change, run:

git diff --check
bash tests/test_cli_commands.sh
bash tests/test_bclconvert_bootstrap.sh
python -m pytest tests/test_complete_genomics_sentieon.py tests/test_workflow_catalog.py

For broad workflow changes, run the relevant target dry-run through dy-r after source dyoainit and dy-a <profile> <build>.

About

(current) Comprehensive WGS omics analysis pipeline (reads to variants and QC). Sequencing platforms: Illumina, PacBio, ONT, Ultima, Roche.
