Contact: Nicole Gay (nicolerg@stanford.edu)
Use randomization.R to make well-balanced batches of MoTrPAC human samples in terms of clinical site, intervention group, age, and sex.
Required inputs:
- Shipment manifest Excel file(s) from the Biorepository, e.g. Stanford_ADU830-10060_120720.xlsx (--shipment-manifest-excel, -ship)
- Corresponding CSV file(s) from the Biospecimen Metadata Download API on motrpac.org, e.g. ADU830-10060.csv (--api-metadata-csv, -api)
- Maximum number of samples per batch (--max-n-per-batch, -max)
IMPORTANT: Include manifests for both adult and pediatric samples to randomize studies together.
IMPORTANT: If the manifests include multiple aliquots of the same samples for different assays, you MUST add an "assay" column to either of these input files to distinguish the different sets of aliquots. For example, if muscle samples are being processed for both ATAC-seq and RNA-seq at Stanford, add an "assay" column with the values "rnaseq" and "atacseq". The values themselves do not matter as long as they separate the sets of aliquots.
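One way to add the "assay" column from the command line is sketched below. This is only an illustration under stated assumptions: the helper name add_assay_column is hypothetical and not part of randomization.R, it assumes that every row in a given CSV file belongs to the same assay (for example, when each assay's aliquots are listed in their own file), and it assumes no quoted field contains a line break. If one file mixes aliquots for several assays, the column has to be filled in row by row instead.

```shell
# Hypothetical helper (not part of randomization.R): append a constant
# "assay" column to a CSV file.
# Assumptions: every row in the file belongs to the same assay, and no
# quoted field contains a line break.
add_assay_column() {
  in_csv="$1"   # input CSV, e.g. an API metadata CSV
  assay="$2"    # any label, e.g. rnaseq or atacseq
  out_csv="$3"  # written fresh; the input file is left untouched
  awk -v assay="$assay" '
    { sub(/\r$/, "") }                    # tolerate Windows line endings
    NR == 1 { print $0 ",assay"; next }   # header row gains the new column name
    NF > 0  { print $0 "," assay }        # data rows gain the label; blank lines are dropped
  ' "$in_csv" > "$out_csv"
}
```

For example, add_assay_column ADU830-10060.csv rnaseq ADU830-10060_rnaseq.csv would write a labeled copy; the output file name here is made up for illustration.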
Optional arguments:
- The --strict-size (-s) flag should be used for batches with small numbers of samples
- The --max-full-batches (-f) flag should be used to force as many batches as possible to have exactly --max-n-per-batch samples
- --vars-to-balance (-v) defines the list of variables for which more than one group should be present in each batch (default: c('codedsiteid','randomgroupcode','sex_psca','older_than_40'))
- --tissue-subset [TISSUE_CODE] restricts balancing to a single tissue specified by [TISSUE_CODE], which must match a value in either the 'Sample Type' column of one of the --shipment-manifest-excel files or the 'SampleTypeCode' column of one of the --api-metadata-csv files
- --block-randomization can be used to include an "injection_order" column based on block randomization (i.e. individuals within a batch are randomized, keeping all samples from an individual together, and then samples within an individual are randomized)
- --separate-batch-files can be used to write out one blinded batch assignment file per batch per tissue per assay instead of the default behavior of one file per tissue per assay
- --outdir (-o) can be used to specify an output directory other than the current working directory
- --quietly (-q) can be used to silence progress messages
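The ordering that --block-randomization describes (shuffle individuals within a batch, keep each individual's samples together, then shuffle samples within each individual) can be sketched with standard command-line tools. This is a minimal illustration of the idea, not the script's actual R implementation. The function name block_randomize, the two-column input layout, and the seed argument are all assumptions made for the example.

```shell
# Hypothetical sketch of block randomization for ONE batch.
# stdin:  "individual_id,sample_id" lines, no header (assumed layout)
# stdout: "injection_order,individual_id,sample_id"
block_randomize() {
  seed="${1:-1}"   # optional seed so an ordering can be reproduced
  LC_ALL=C awk -F, -v seed="$seed" '
    BEGIN { srand(seed) }
    NF >= 2 {
      # one random key per individual, shared by all of their samples
      if (!($1 in ind_key)) ind_key[$1] = rand()
      # a second random key per sample orders samples within the individual
      printf "%.10f,%s,%.10f,%s\n", ind_key[$1], $1, rand(), $2
    }' |
  # sort by individual key, then individual ID (guards against key ties),
  # then by the per-sample key
  LC_ALL=C sort -t, -k1,1n -k2,2 -k3,3n |
  # drop the random keys and number the rows
  awk -F, -v OFS=, '{ print NR, $2, $4 }'
}
```

Sorting on the individual ID as a tie-breaker keeps an individual's samples contiguous even in the unlikely case that two individuals receive the same printed key.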
Expert level:
- If you want the script to check more random combinations of samples before compromising the ideal batch sizes, increase --max-inner-loop-iter (default: 1e6)
- If you want the script to check more random combinations of samples before decreasing the stringency of the batch balance checks, increase --max-outer-loop-iter (default: 1000 for --strict-size and 5000 otherwise)
- --balance-strictness (-b) can be used to specify the initial strictness of the balance checks, with 10 being the strictest and 1 being the most lenient. By default, --balance-strictness is 1 when --strict-size is used and 10 otherwise
By default, two files are written for each assay & tissue combination:
- Blinded batch assignments in the format files/precovid_[SAMPLE_TYPE]-samples_BLINDED-batch-assignments.csv (see example)
- Unblinded batching metadata in the format files/precovid_[SAMPLE_TYPE]-samples_UNBLINDED-batch-characteristics.csv
Use the --separate-batch-files flag to output separate blinded batch assignment files per batch, e.g. files/precovid_[SAMPLE_TYPE]-samples_BLINDED-batch_3-assignments.csv.
One plot is saved for each assay & tissue combination. This plot includes the number of individuals and samples per batch as well as the balance across each level of each --vars-to-balance. These plots should be visually examined to confirm that batches are adequately balanced, i.e. that numbers are reasonably distributed across each ROW. (see examples).
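As a complement to the visual check, the minimum condition behind --vars-to-balance (at least two groups of a variable present in each batch) can be tabulated directly. The sketch below is an illustration only: the function name batches_with_single_group is hypothetical, and it expects a two-column extract of batch and group per sample on standard input. How to produce that extract from the output files is not covered here, because their column layout is not described in this document.

```shell
# Hypothetical check (not part of randomization.R).
# stdin:  "batch,group" pairs, one line per sample, no header (assumed layout)
# stdout: batches in which fewer than two distinct groups are present
batches_with_single_group() {
  awk -F, '
    # count each (batch, group) pair only the first time it is seen
    NF >= 2 && !seen[$1 FS $2]++ { n_groups[$1]++ }
    END { for (b in n_groups) if (n_groups[b] < 2) print b }
  ' | sort
}
```

Empty output means every batch contains at least two groups of the variable. That is a weaker condition than the even distribution the plots are meant to show, so it does not replace the visual check.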
Dependencies (R packages):
- data.table
- readxl
- testit
- argparse
- ggplot2
- gtsummary
- pheatmap

Here is an example of how to run the script from the command line, assuming the shipment manifest Excel files and API metadata CSV files are in the same directory as this script. Include manifests and metadata for all pre-COVID clinical samples, i.e. both adult and pediatric shipments.
Rscript randomization.R \
--shipment-manifest-excel \
Stanford_ADU830-10060_120720.xlsx \
Stanford_PED830-10062_120720.xlsx \
--api-metadata-csv \
ADU830-10060.csv \
PED830-10062.csv \
--max-n-per-batch 94 \
--outdir ~/Desktop/stanford_batches

Equivalently:
Rscript randomization.R \
-ship \
Stanford_ADU830-10060_120720.xlsx \
Stanford_PED830-10062_120720.xlsx \
-api \
ADU830-10060.csv \
PED830-10062.csv \
-max 94 \
-o ~/Desktop/stanford_batches

Remember to add the --strict-size or -s flag if the maximum number of samples per batch is small, e.g.:
Rscript randomization.R \
-ship ShipmentContents_BroadCarr_012521.xlsx \
-api ADU822-10074.csv \
-max 15 \
-s \
-o ~/Desktop/broad_batches

Add the --max-full-batches or -f flag to force as many batches as possible to have exactly --max-n-per-batch samples, e.g.:
Rscript randomization.R \
-ship Stanford_ADU830-10060_120720.xlsx \
-api ADU830-10060.csv \
-max 88 \
-o ~/Desktop/stanford_batches \
--max-full-batches

To run the randomization script for a single tissue, use the --tissue-subset argument, where the supplied value must be a value in the 'Sample Type' column of one --shipment-manifest-excel OR a value in the 'SampleTypeCode' column of one --api-metadata-csv, e.g.:
Rscript randomization.R \
-ship \
Stanford_ADU830-10060_120720.xlsx \
Stanford_PED830-10062_120720.xlsx \
-api \
ADU830-10060.csv \
PED830-10062.csv \
-max 94 \
-o ~/Desktop/stanford_batches \
--tissue-subset 06 \
--overwrite

The --overwrite flag ignores existing batching outputs and overwrites the files. Without this flag, batching for a sample type will be skipped if a batching output already exists.
To add an "injection_order" column based on block randomization, add the --block-randomization flag; to output separate blinded batch assignment files for each batch, add the --separate-batch-files flag:
Rscript randomization.R \
-ship \
Stanford_ADU830-10060_120720.xlsx \
Stanford_PED830-10062_120720.xlsx \
-api \
ADU830-10060.csv \
PED830-10062.csv \
-max 94 \
-o ~/Desktop/stanford_batches \
--tissue-subset 06 \
--overwrite \
--block-randomization \
--separate-batch-files

See examples of the stdout for large batches and small batches (--strict-size).
Alternatively, run the script interactively in RStudio by commenting out lines 19-66 and manually defining arguments below (see examples on lines 68-123), though this is not recommended.
Run Rscript randomization.R -h to see this help message:
usage: randomization.R [-h] -ship SHIPMENT_MANIFEST_EXCEL
[SHIPMENT_MANIFEST_EXCEL ...] -api API_METADATA_CSV
[API_METADATA_CSV ...] -max MAX_N_PER_BATCH [-s] [-f]
[-v VARS_TO_BALANCE] [-o OUTDIR] [-q]
[-inner MAX_INNER_LOOP_ITER]
[-outer MAX_OUTER_LOOP_ITER] [-b BALANCE_STRICTNESS]
[--overwrite] [--tissue-subset TISSUE_SUBSET]
[--block-randomization] [--separate-batch-files]
optional arguments:
-h, --help show this help message and exit
-ship SHIPMENT_MANIFEST_EXCEL [SHIPMENT_MANIFEST_EXCEL ...], --shipment-manifest-excel SHIPMENT_MANIFEST_EXCEL [SHIPMENT_MANIFEST_EXCEL ...]
Path(s) to shipment manifest Excel files, e.g.
Stanford_ADU830-10060_120720.xlsx
Stanford_PED830-10062_120720.xlsx
-api API_METADATA_CSV [API_METADATA_CSV ...], --api-metadata-csv API_METADATA_CSV [API_METADATA_CSV ...]
Path(s) to sample metadata from web API, e.g.
ADU830-10060.csv PED830-10062.csv
-max MAX_N_PER_BATCH, --max-n-per-batch MAX_N_PER_BATCH
Max number of samples per batch
-s, --strict-size Force *all* batches to be as close to --max-n-per-
batch as possible. Most applicable for small batches
(e.g. < 20)
-f, --max-full-batches
Force as many batches as possible to have *exactly*
--max-n-per-batch samples.
-v VARS_TO_BALANCE, --vars-to-balance VARS_TO_BALANCE
Force batches to include samples from at least two
groups of each of these variables. Must be defined in
--api-metadata-csv
-o OUTDIR, --outdir OUTDIR
Path to output directory
-q, --quietly Silence progress messages
-inner MAX_INNER_LOOP_ITER, --max-inner-loop-iter MAX_INNER_LOOP_ITER
Max number of failed attempts to fit all samples in
batches before increasing the number of batches
-outer MAX_OUTER_LOOP_ITER, --max-outer-loop-iter MAX_OUTER_LOOP_ITER
Max number of failed attempts to find optimally
balanced bacthes before relaxing the stringency of the
balance checks
-b BALANCE_STRICTNESS, --balance-strictness BALANCE_STRICTNESS
Initial strictness of balance checks, with 10 being
the strictest and 1 being the most lenient
--overwrite Overwrite existing batching results
--tissue-subset TISSUE_SUBSET
Run batching for a single tissue. Must be a value in
the 'Sample Type' column of one --shipment-manifest-
excel OR a value in the 'SampleTypeCode' column of one
--api-metadata-csv
--block-randomization
Block randomization for metabolomics sites: samples
within a batch are ordered by individual; samples
within an individual are randomized. This adds an
'injection_order' column.
--separate-batch-files
Write separate BLINDED output files per batch

For questions about the documentation or any issues with the code, please submit an issue or contact Nicole at nicolerg@stanford.edu.