CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

What is prolfqua

An R package for mass spectrometry-based label-free quantification (LFQ) proteomics analysis. It provides a complete workflow: QC, normalization, protein aggregation, statistical modelling, hypothesis testing, and sample size estimation. Data is always in long (tidy) format. Branch Modelling2R6 is the active development branch.

Build & Test Commands

make test            # Run testthat suite (runs document first)
make check-fast      # R CMD check without vignettes (quick validation)
make check           # Full R CMD check (document → build → check)
make document        # Generate roxygen2 docs (NAMESPACE + man/)
make install         # Install package locally
make lint            # Run lintr static analysis
make format          # Format with air
make build-vignettes # Build vignettes into inst/doc
make site            # Build pkgdown site locally

Single test file:

Rscript -e "testthat::test_file('tests/testthat/test-LFQData.R')"

Library setup:

Rscript -e ".libPaths()"

Use the normal user / system R libraries for this workspace; renv autoload is disabled.

Code Style

  • Line length: 120 chars, indentation: 2 spaces (.lintr)
  • object_name_linter is disabled — the codebase uses camelCase for R6 classes and snake_case/mixed for functions
  • NAMESPACE is auto-generated by roxygen2 — never edit directly; run make document
  • Roxygen is configured with r6 = TRUE for R6 class documentation
  • Never use \dontrun{} or \donttest{} in @examples — all examples must run during R CMD check. If an example is too slow, optimize it instead of skipping it.

Architecture

Core Data Flow

Raw Data + AnalysisConfiguration → LFQData
    ├── get_Transformer() → LFQDataTransformer  (log2, robscale, normalize)
    ├── get_Aggregator()  → LFQDataAggregator   (peptide → protein rollup)
    ├── get_Stats()       → LFQDataStats         (CV, variance per group)
    ├── get_Plotter()     → LFQDataPlotter       (heatmaps, PCA, boxplots)
    ├── get_Summariser()  → LFQDataSummariser    (missingness, hierarchy counts)
    └── get_Imputer()     → LFQDataImp           (missing value imputation)

LFQData → build_contrast_analysis(lfqdata, modelstr, contrasts, method)
    └── Returns a Facade with uniform API:
        $get_contrasts(), $get_missing(), $get_Plotter(), $to_wide()

Facade Pattern (ContrastsFacades.R, build_contrast_analysis.R)

build_contrast_analysis() is the recommended entry point. Each method dispatches to a Facade class that wires strategy → model → contrasts → moderation internally.

Aggregated input (protein-level, subject_Id == hierarchy_keys): lm, rlm, lm_missing, lm_impute, limma, deqms, firth

Nested input (peptide-level, subject_Id is a strict subset of hierarchy_keys): lmer, ropeca
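
The aggregated/nested distinction above comes down to a set relation, which can be sketched in base R. The helper classify_input() is hypothetical and not part of prolfqua; it assumes subject_Id and hierarchy_keys are character vectors of column names.

```r
# Hypothetical helper, not part of prolfqua: classifies the input level
# from the relation between subject_Id and hierarchy_keys.
classify_input <- function(subject_Id, hierarchy_keys) {
  if (setequal(subject_Id, hierarchy_keys)) {
    "aggregated" # protein-level input, e.g. for lm, limma, deqms
  } else if (all(subject_Id %in% hierarchy_keys)) {
    "nested"     # strict subset: peptide-level input, e.g. for lmer, ropeca
  } else {
    stop("subject_Id must be contained in hierarchy_keys")
  }
}

classify_input("protein_Id", "protein_Id")
classify_input("protein_Id", c("protein_Id", "peptide_Id"))
```

The first call describes aggregated input, the second nested input.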

Weights & nr_children

config$nr_children names the column tracking child-feature counts (e.g. peptides per protein). After get_Aggregator() rollup, each protein×sample row gets its own count — nr_children is sample-wise. For peptide/precursor-level data it is typically 1.

Two distinct uses:

  1. Fitting weights (sample-wise): Aggregated facades (lm, limma, lm_missing, lm_impute, deqms) pass nr_children as weights by default to lm() or limma::lmFit(). This down-weights protein intensities derived from fewer peptides in a given sample. Disable with weights = NULL.

  2. DEqMS variance moderation (experiment-wide): ContrastsDEqMSFacade additionally aggregates nr_children via max() per protein across all samples for count-dependent variance shrinkage. This is separate from the fitting weights.

Protein-level input must carry nr_children. If the column is missing, setup_analysis() adds it set to 1 with a warning — but this defeats the purpose for aggregated data where the actual peptide count matters.
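
The two uses of nr_children can be illustrated with base R alone. This is a minimal sketch on made-up data for a single protein; it calls stats::lm() directly and does not go through the prolfqua facades, and the column names sample, group and intensity are assumptions.

```r
# Simulated long-format data for one protein; all values are made up.
set.seed(1)
d <- data.frame(
  sample      = paste0("s", 1:8),
  group       = rep(c("A", "B"), each = 4),
  nr_children = c(1, 5, 4, 6, 2, 5, 1, 6)
)
# Rows backed by fewer peptides are simulated with more noise.
d$intensity <- 20 + (d$group == "B") + rnorm(8, sd = 1 / sqrt(d$nr_children))

# 1. Fitting weights (sample-wise): rows backed by more peptides count more.
fit_weighted <- lm(intensity ~ group, data = d, weights = nr_children)

# Counterpart of weights = NULL: every row counts equally.
fit_unweighted <- lm(intensity ~ group, data = d)

# 2. Experiment-wide count: one number per protein, the maximum over samples.
nr_children_protein <- max(d$nr_children)
```

If nr_children were 1 in every row, as happens when the column is filled in with a warning, the weighted and unweighted fits would coincide and the peptide counts would carry no information.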

Key Design Patterns

Decorator/Composition: LFQData factory methods (get_Transformer(), get_Plotter(), etc.) return decorator objects that wrap the LFQData. Decorators hold a reference in their lfq field.

Method chaining: Transformer methods return self so that calls can be chained; access the result via $lfq:

lfqdata <- lfqdata$get_Transformer()$log2()$robscale()$lfq
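
How such a chain works can be sketched with two toy R6 classes. ToyData and ToyTransformer are hypothetical stand-ins, not the prolfqua classes. The toy transformer works on a deep copy of its input, which is an assumption of this sketch, and the center() step is invented for illustration.

```r
library(R6)

# Hypothetical stand-in for the data container.
ToyData <- R6Class("ToyData",
  public = list(
    values = NULL,
    initialize = function(values) self$values <- values,
    # Factory method: hands a copy of this object to the decorator.
    get_Transformer = function() ToyTransformer$new(self$clone(deep = TRUE))
  )
)

# Hypothetical decorator: holds the wrapped object in its lfq field.
ToyTransformer <- R6Class("ToyTransformer",
  public = list(
    lfq = NULL,
    initialize = function(lfq) self$lfq <- lfq,
    log2 = function() {
      self$lfq$values <- base::log2(self$lfq$values)
      invisible(self) # returning self is what makes chaining possible
    },
    center = function() {
      self$lfq$values <- self$lfq$values - median(self$lfq$values)
      invisible(self)
    }
  )
)

raw <- ToyData$new(c(2, 4, 8))
transformed <- raw$get_Transformer()$log2()$center()$lfq
transformed$values
raw$values
```

Each step mutates the wrapped copy and returns the transformer, so the final $lfq yields the transformed object while raw stays unchanged.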

Strategy pattern for models: the R6 classes StrategyLM, StrategyRLM, StrategyLmer, and StrategyLogistf each provide model_fun, isSingular, contrast_fun, df_residual, and sigma methods. The wrapper functions strategy_lm(), strategy_rlm(), strategy_lmer(), and strategy_logistf() create instances. strategy_limma() is the exception: it returns a plain list (formula, trend, robust, weights) consumed by build_model_limma().

Config immutability: AnalysisConfiguration is always deep-cloned when passed to new LFQData instances. Never modify config in-place on an existing LFQData.
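
The reason for deep-cloning is that R6 objects have reference semantics. A minimal sketch with hypothetical classes (ToyConfig and ToyContainer are not part of prolfqua) shows the difference between sharing a config and copying it:

```r
library(R6)

# Hypothetical config with a single field.
ToyConfig <- R6Class("ToyConfig",
  public = list(work_intensity = "intensity")
)

# Hypothetical container that copies the config it receives.
ToyContainer <- R6Class("ToyContainer",
  public = list(
    config = NULL,
    initialize = function(config) {
      self$config <- config$clone(deep = TRUE)
    }
  )
)

cfg <- ToyConfig$new()
container <- ToyContainer$new(cfg)
alias <- cfg # plain assignment shares the same object

# Changing the original affects the alias but not the cloned copy.
cfg$work_intensity <- "log2_intensity"
alias$work_intensity
container$config$work_intensity
```

The alias follows the change, while the container keeps the value it was constructed with.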

R6 Classes (22 classes across R/)

| Category | Classes | Files |
|---|---|---|
| Core data | LFQData, AnalysisConfiguration | LFQData.R, AnalysisConfiguration.R |
| Decorators | LFQDataTransformer, LFQDataAggregator, LFQDataStats, LFQDataPlotter, LFQDataSummariser, LFQDataImp | LFQData*.R |
| Model interfaces | ModelInterface, Model, ModelFirth, ModelLimma | Model*.R, ContrastsLimma.R |
| Contrast interfaces | ContrastsInterface, Contrasts, ContrastsModerated, ContrastsLimma, ContrastsROPECA, ContrastsMissing, ContrastsFirth, ContrastsTable | Contrasts*.R, ContrastFirth.R, ContrastsSimpleImpute.R |
| Visualization | ContrastsPlotter | ContrastsPlotter.R |
| Utilities | MissingHelpers | tidyMS_missingness_imputation.R |

AnalysisConfiguration

Flat R6 class that maps column roles in the data:

  • hierarchy: ordered measurement levels (protein_Id → peptide_Id → precursor_Id → fragment_Id). hierarchy_depth controls which level is modelled.
  • factors: explanatory variables (group, treatment). factor_depth controls interaction depth.
  • work_intensity: response column. Uses a stack (set_response() / pop_response() / get_response()) for working with multiple intensity columns.
  • file_name: sample identifier column.
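
The response stack can be pictured as a character vector whose last element is the active intensity column. The class below is a hypothetical sketch that only borrows the three method names; the return values and internals of the real AnalysisConfiguration methods are assumptions here.

```r
library(R6)

# Hypothetical minimal class, not the prolfqua AnalysisConfiguration.
ToyResponseStack <- R6Class("ToyResponseStack",
  public = list(
    work_intensity = character(0),
    # Push a new response column; it becomes the active one.
    set_response = function(column) {
      self$work_intensity <- c(self$work_intensity, column)
      invisible(self)
    },
    # Peek at the active response column without removing it.
    get_response = function() {
      tail(self$work_intensity, 1)
    },
    # Remove the active response column and return it.
    pop_response = function() {
      top <- self$get_response()
      self$work_intensity <- head(self$work_intensity, -1)
      top
    }
  )
)

stack <- ToyResponseStack$new()
stack$set_response("intensity")
stack$set_response("log2_intensity")
active <- stack$get_response()
popped <- stack$pop_response()
previous <- stack$get_response()
```

After a transformation pushes a new column, popping it makes the earlier intensity column the active response again.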

Concrete config factories (e.g. create_config_Skyline(), create_config_Spectronaut_Peptide()) used to live in tidyMS_R6_ConcreteConfigurations.R. That file has been removed (create_config_MQ_peptide() was dead code); the remaining factories are in downstream packages.

Key Functions (not in classes)

  • build_contrast_analysis(lfqdata, modelstr, contrasts, method) — main entry point, returns a Facade (in build_contrast_analysis.R)
  • setup_analysis(data, config) — prepare data for analysis (in tidyMS_data_setup.R)
  • build_model(data, strategy, subject_Id) — fit per-protein models (in tidyMS_build_model.R)
  • build_model_impute(lfqdata, strategy) — fit with LOD imputation + borrowed covariance for missing groups (in tidyMS_build_model.R)
  • build_model_limma(lfqdata, strategy) — fit limma matrix model (in ContrastsLimma.R)
  • StrategyLM, StrategyRLM, StrategyLmer R6 classes + strategy_lm/rlm/lmer() wrappers (tidyMS_R6_Modelling.R); StrategyLogistf + strategy_logistf() (logistf.R)
  • strategy_limma() — limma matrix model strategy (in ContrastsLimma.R)
  • sim_lfq_data_peptide_config() — simulate test data (in simulate_LFQ_data.R)

File Naming Convention

  • R/LFQData*.R: Core data container and its decorator classes
  • R/Model*.R, R/Contrasts*.R: Modelling and hypothesis testing
  • R/AnalysisConfiguration.R: Configuration (column role mapping + serialization)
  • R/tidyMS_data_setup.R: setup_analysis, complete_cases, sample_subset
  • R/tidyMS_summarize_hierarchy.R: table_factors, hierarchy_counts, etc.
  • R/tidyMS_R6_Modelling.R: Strategy R6 classes (StrategyLM, StrategyRLM, StrategyLmer)
  • R/tidyMS_build_model.R: build_model, model_analyse, imputation internals
  • R/tidyMS_contrasts.R: linfct_* family, compute_contrast, contrasts_linfct, pivot_model_contrasts_to_wide
  • R/tidyMS_moderation.R: moderated_p_limma*, adjust_p_values, ROPECA, Fisher
  • R/tidyMS_*.R: Other utility functions (plotting, stats, aggregation, missingness)
  • R/utilities.R: Shared helpers (make_interaction_column, .error_handler)

Vectorized mode

options(prolfqua.vectorize = TRUE) activates vectorized implementations of compute_contrast and linfct_matrix_contrasts (matrix multiplication instead of per-row loops). Affects all Wald test facades (lm, rlm, firth, lmer) and limma's linfct path. Results are numerically identical. Default is FALSE.
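
The idea behind the vectorized path can be shown in base R: contrast estimates for all proteins follow from one matrix product instead of a loop over rows. This is a sketch on made-up coefficients, not the prolfqua implementation; the names coefs and linfct and the way the option is read are assumptions.

```r
# Made-up coefficient estimates: one row per protein, one column per coefficient.
coefs <- rbind(
  P1 = c(`(Intercept)` = 20, groupB = 1.5),
  P2 = c(`(Intercept)` = 18, groupB = -0.5),
  P3 = c(`(Intercept)` = 22, groupB = 0.0)
)

# Contrast matrix: one row per contrast, one column per coefficient.
linfct <- rbind(
  B_vs_A = c(0, 1),
  avg    = c(1, 0.5)
)
colnames(linfct) <- colnames(coefs)

# Per-row loop: one small product per protein.
est_loop <- matrix(NA_real_, nrow(coefs), nrow(linfct),
                   dimnames = list(rownames(coefs), rownames(linfct)))
for (i in seq_len(nrow(coefs))) {
  est_loop[i, ] <- linfct %*% coefs[i, ]
}

# Vectorized: a single matrix product for all proteins.
est_vec <- coefs %*% t(linfct)

# Both paths give the same estimates.
stopifnot(isTRUE(all.equal(est_loop, est_vec)))

# Assumed way of reading the switch; FALSE when the option is unset.
use_vectorized <- isTRUE(getOption("prolfqua.vectorize", FALSE))
```

The loop and the matrix product compute the same linear combinations, so only the run time differs between them.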

Testing

  • When fixing a bug, first add a test that reproduces it, then fix. This ensures regressions are caught.
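
A minimal sketch of the test-first workflow with testthat follows. The helper safe_mean() and its bug are hypothetical: the unfixed version is assumed to have returned NaN for all-missing input, because mean(x, na.rm = TRUE) does so.

```r
library(testthat)

# Hypothetical helper after the fix: all-missing input yields NA, not NaN.
safe_mean <- function(x) {
  if (all(is.na(x))) {
    return(NA_real_)
  }
  mean(x, na.rm = TRUE)
}

# Written before the fix, this test fails and so reproduces the bug;
# after the fix it guards against regressions.
test_that("safe_mean returns NA for all-missing input", {
  expect_identical(safe_mean(c(NA_real_, NA_real_)), NA_real_)
  expect_equal(safe_mean(c(1, NA, 3)), 2)
})
```

In this repository such a test would go into the matching file under tests/testthat/.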

11 test files in tests/testthat/:

  • test-LFQData.R — Core data container and decorators
  • test-Model.R — Model fitting and coefficient extraction
  • test-Contrasts.R — Contrast computation (Wald test path)
  • test-ContrastsFacades.R — All facade classes and build_contrast_analysis()
  • test-ContrastsLimma.R — Limma backend (ModelLimma, ContrastsLimma, merge, 2-factor)
  • test-ContrastsModeratedDEqMS.R — DEqMS moderation and facade
  • test-ContrastsPlotter.R — Contrast visualization
  • test-ImputeModel.R — LOD imputation with borrowed covariance
  • test-plotting_functions.R — Low-level plots
  • test-tidyconfig_functions.R — Configuration and utilities
  • test-vectorize-contrasts.R — Side-by-side original vs vectorized contrast functions

Cross-Package Context

prolfqua is part of the prolfqua ecosystem (see ../CLAUDE.md). Downstream packages depend on its R6 classes and exported API:

  • prolfquapp — CLI wrapper for core facility workflows
  • prophosqua — Phosphoproteomics analysis
  • prolfquabenchmark — Benchmarking vignettes

Renaming R6 methods, changing exported function signatures, or modifying AnalysisConfiguration fields can silently break these packages.