Skip to content

Inconsistencies in CoGAPS results across environments #147

@MauraKautz

Description

@MauraKautz

Hello!

I have recently run into an issue regarding the output of CoGAPS when run in different environments.

I have tested the tool in:

  • local system R environment
  • local conda environment
  • Galaxy development instance using bioconductor-cogaps and Galaxy development instance using ghcr.io/fertiglab/cogaps:master (produce same output)

For context, here are some links that may be helpful for Galaxy:

I have tested 8 different Visium datasets found here: https://github.com/FridleyLab/spatialGE_Data/blob/main/tnbc_bassiouni.zip. For each sample, I get varying degrees of consistent/inconsistent output when looking a the featureLoadings and sampleFactors slots. For example, when I run sample 092a, the output is the same across all environments. When I run sample 094c, the output is identical between galaxy and conda, but different on my local R. Sample 092b is consistent between local R and galaxy but different on conda, while Sample 120d is different between all environments. There are several samples that do output identical values between all environments, but it is very unpredictable.

Below are some example values for samples and environments:

CoGAPS_Output.xlsx

I'm staying consistent at CoGAPS Version 3.26.0, but have also tested with 3.28.0 and 3.29.1, with the same results.

Do you have any insights as to why this might be happening? Is there a minimum data requirement? Do all dependencies need to be the same version in order for the output to be consistent? What level of difference is to be expected?

I will include the code I'm using below, however it is the same script used for CoGAPS in the BTC pipeline. The parameters have stayed consistent at 1000 genes, 5 patterns, 100 iterations, and false for sparsity.

Any help would be much appreciated!

#--------#
# COGAPS #
#--------#

library(CoGAPS)

#libraries fot plotting added later
library(dplyr)
library(tidyr)
library(ggplot2)
library(tibble)

#specify args
args <- commandArgs(TRUE)

dgcmatrix <- args[1]
n_top_genes <- args[2]
n_patterns <- args[3]
n_iterations <- args[4]
sparse_opt <- args[5]

#use sparse matrix to find top genes
sparse <- readRDS(dgcmatrix)
      
#select top 5K genes
message("finding top ", n_top_genes, " genes")

vars <- apply(sparse, 1, var)
if (length(vars) > length(n_top_genes)) {
  ngenes <- n_top_genes
} else {
  ngenes <- vars
}
top_genes <- order(vars, decreasing=TRUE)[1:ngenes]
sparse <- sparse[top_genes,]

message("selected top ", length(top_genes), " genes of ", length(vars))

data <- as.matrix(sparse)

#avoid errors with distributed params
dist_param <- NULL

#set params for patterns and iterations
params <- CogapsParams(seed=42,
                       nPatterns = as.numeric(n_patterns), #number of patterns CoGAPS will learn
                       nIterations = as.numeric(n_iterations), #number of iterations for each phase of the algorithm
                       sparseOptimization = as.logical(sparse_opt), #speeds up performance with sparse data (roughly >80 default uncertainty)
                       distributed=dist_param) #either "genome-wide" or "single-cell" indicating which distributed algorithm should be used

#run cogaps
cogapsResult <- CoGAPS(data = data, params = params, outputFrequency = 100)

sample_id <- tools::file_path_sans_ext(basename(dgcmatrix))
outdir <- "cogaps_result"
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)

outfile <- file.path(outdir, paste0(sample_id, "_cogapsResult.rds"))
saveRDS(cogapsResult, file = outfile)
message("Saving cogapsResult to ", outfile)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions