-
Notifications
You must be signed in to change notification settings - Fork 16
Description
Hello!
I have recently run into an issue regarding the output of CoGAPS when run in different environments.
I have tested the tool in:
- local system R environment
- local conda environment
- Galaxy development instance using
bioconductor-cogapsand Galaxy development instance usingghcr.io/fertiglab/cogaps:master(produce same output)
For context, here are some links that may be helpful for Galaxy:
- https://galaxyproject.org/
- https://training.galaxyproject.org/training-material/topics/dev/index.html?utm_source=gtn&utm_medium=home&utm_campaign=home-quickstart#st-dependencies
I have tested 8 different Visium datasets found here: https://github.com/FridleyLab/spatialGE_Data/blob/main/tnbc_bassiouni.zip. For each sample, I get varying degrees of consistent/inconsistent output when looking a the featureLoadings and sampleFactors slots. For example, when I run sample 092a, the output is the same across all environments. When I run sample 094c, the output is identical between galaxy and conda, but different on my local R. Sample 092b is consistent between local R and galaxy but different on conda, while Sample 120d is different between all environments. There are several samples that do output identical values between all environments, but it is very unpredictable.
Below are some example values for samples and environments:
I'm staying consistent at CoGAPS Version 3.26.0, but have also tested with 3.28.0 and 3.29.1, with the same results.
Do you have any insights as to why this might be happening? Is there a minimum data requirement? Do all dependencies need to be the same version in order for the output to be consistent? What level of difference is to be expected?
I will include the code I'm using below, however it is the same script used for CoGAPS in the BTC pipeline. The parameters have stayed consistent at 1000 genes, 5 patterns, 100 iterations, and false for sparsity.
Any help would be much appreciated!
#--------#
# COGAPS #
#--------#
library(CoGAPS)
#libraries fot plotting added later
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)
#specify args
args <- commandArgs(TRUE)
dgcmatrix <- args[1]
n_top_genes <- args[2]
n_patterns <- args[3]
n_iterations <- args[4]
sparse_opt <- args[5]
#use sparse matrix to find top genes
sparse <- readRDS(dgcmatrix)
#select top 5K genes
message("finding top ", n_top_genes, " genes")
vars <- apply(sparse, 1, var)
if (length(vars) > length(n_top_genes)) {
ngenes <- n_top_genes
} else {
ngenes <- vars
}
top_genes <- order(vars, decreasing=TRUE)[1:ngenes]
sparse <- sparse[top_genes,]
message("selected top ", length(top_genes), " genes of ", length(vars))
data <- as.matrix(sparse)
#avoid errors with distributed params
dist_param <- NULL
#set params for patterns and iterations
params <- CogapsParams(seed=42,
nPatterns = as.numeric(n_patterns), #number of patterns CoGAPS will learn
nIterations = as.numeric(n_iterations), #number of iterations for each phase of the algorithm
sparseOptimization = as.logical(sparse_opt), #speeds up performance with sparse data (roughly >80 default uncertainty)
distributed=dist_param) #either "genome-wide" or "single-cell" indicating which distributed algorithm should be used
#run cogaps
cogapsResult <- CoGAPS(data = data, params = params, outputFrequency = 100)
sample_id <- tools::file_path_sans_ext(basename(dgcmatrix))
outdir <- "cogaps_result"
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
outfile <- file.path(outdir, paste0(sample_id, "_cogapsResult.rds"))
saveRDS(cogapsResult, file = outfile)
message("Saving cogapsResult to ", outfile)