A comprehensive R-based data processing pipeline for the "Narrating the Future" (FTOLP) research project. This pipeline handles LimeSurvey data from multiple countries and survey waves, performing data splitting, quality control cleaning, and merging operations.
- Project Overview
- Directory Structure
- Installation
- Configuration
- Pipeline Workflow
- Scripts Documentation
- Dataset Groupings
- License
This pipeline processes survey data from LimeSurvey across multiple countries (Brazil, Portugal, China, USA, and others) and handles:
- Data preprocessing: Splitting and normalizing raw survey data
- Quality control: Removing invalid responses, detecting straightlining, zigzag patterns, and outliers
- Data validation: Checking control items and analyzing response durations
- Data merging: Consolidating cleaned datasets across countries and waves
FTOLP_Data_Pipeline/
├── config/
│   └── paths.R                     # Central configuration for paths and dataset groupings
├── src/
│   ├── pipeline/                   # Main processing pipeline scripts (run in order)
│   │   ├── 01_split_raw.R          # Split and preprocess raw LimeSurvey data
│   │   ├── 02_clean.R              # Quality control and data cleaning
│   │   └── 03_merge_general.R      # Merge cleaned datasets
│   ├── utils/                      # Utility functions
│   │   ├── cleaning_functions.R    # Core cleaning utilities (steps, filters, outlier detection)
│   │   ├── merge_functions.R       # Merging and labeling utilities
│   │   └── comparison_functions.R  # Dataset comparison and validation functions
│   └── analysis/                   # Analysis and diagnostic scripts
│       ├── consolidate_datasets.R  # Generate dataset overlap matrices
│       └── duration_analysis.R     # Survey completion time analysis
├── docs/                           # Additional documentation
├── README.md                       # This file
└── LICENSE                         # MIT License
- R (≥ 4.0.0)
- RStudio (recommended)
install.packages(c(
"tidyverse", # Data manipulation
"haven", # SPSS file I/O
"labelled", # Variable labels
"readxl", # Excel files
"lubridate", # Date handling
"writexl", # Excel output
"ggplot2", # Visualization
"here", # Path management
"rstatix", # Statistical tests
"PerFit" # Person-fit analysis
))

Before running the pipeline, configure your paths in config/paths.R:
# Base project directory
PROJECT_ROOT <- "~/Library/CloudStorage/Nextcloud-6161138@soliscom.uu.nl@surfdrive.surf.nl/Narrating the Future (Bogdan)"
# Data directories
DIR_RAW <- file.path(PROJECT_ROOT, "LimeSurvey Raw")
DIR_REAL_RAW <- file.path(PROJECT_ROOT, "Real Raw Data")
DIR_PROCESSED <- file.path(PROJECT_ROOT, "LimeSurvey Processed")
DIR_CLEAN <- file.path(PROJECT_ROOT, "LimeSurvey Processed", "clean")

Modify PROJECT_ROOT to match your local setup.
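Before running the pipeline on a new machine, it can help to confirm that the configured directories actually exist. The following is a minimal sketch, not part of the pipeline itself: `check_dirs()` is a hypothetical helper, and a temporary directory stands in for PROJECT_ROOT so the example runs anywhere.

```r
# Hypothetical helper (not defined in config/paths.R): report which
# configured directories exist on this machine.
check_dirs <- function(dirs) {
  data.frame(
    name = names(dirs),
    path = unname(dirs),
    exists = unname(dir.exists(dirs)),
    stringsAsFactors = FALSE
  )
}

# A temporary directory stands in for PROJECT_ROOT in this example.
root <- tempfile("ftolp_root_")
dir.create(file.path(root, "LimeSurvey Raw"), recursive = TRUE)

dirs <- c(
  DIR_RAW       = file.path(root, "LimeSurvey Raw"),
  DIR_PROCESSED = file.path(root, "LimeSurvey Processed")  # not created above
)

status <- check_dirs(dirs)
print(status)
```

In a real setup, the vector would hold the `DIR_*` values from config/paths.R after sourcing that file.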
Run scripts in the following order:
Purpose: Load and preprocess raw .sav files from LimeSurvey
Key Operations:
- Load raw survey data
- Normalize column names (handle variations like IT_IT1 → IT_1)
- Filter out test participants
- Define dataset groupings by country/wave
- Export processed .sav files to LimeSurvey Processed/
Output: Individual country/wave datasets ready for cleaning
Important: The script uses absolute paths via the here package and configuration files, so it can be run multiple times without path issues.
source("src/pipeline/01_split_raw.R")

Purpose: Apply comprehensive quality control filters
Cleaning Steps:
- Missing responses: Remove rows with entire scale blocks missing
- Constant answers: Detect straightlining (identical responses across items)
- Zigzag patterns: Identify alternating response patterns
- Control items: Validate attention check responses
- Age filtering: Keep respondents 18-65 (configurable)
- Duration filtering: Remove suspiciously fast completions
- Nationality filtering: Apply country-specific filters
- Mahalanobis distance: Multivariate outlier detection
- Guttman scaling: Person-fit analysis using PerFit package
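To make two of the pattern checks above concrete, here is a minimal base-R sketch of straightlining and zigzag detection. The function names and the exact definitions are assumptions for illustration (zigzag is taken to mean strict alternation between two values, with at least four answers); the pipeline's own `step_constant_answers()` and `step_detect_zigzag()` may use different rules.

```r
# TRUE for rows where every non-missing item has the same value.
is_straightlined <- function(items) {
  apply(as.matrix(items), 1, function(x) {
    x <- x[!is.na(x)]
    length(x) > 1 && length(unique(x)) == 1
  })
}

# TRUE for rows whose answers strictly alternate between two values,
# e.g. 1, 5, 1, 5. Requires at least four non-missing answers (assumption).
is_zigzag <- function(items) {
  apply(as.matrix(items), 1, function(x) {
    x <- x[!is.na(x)]
    if (length(x) < 4) return(FALSE)
    d <- diff(x)
    # Each step must be non-zero and the exact negative of the previous step.
    all(d != 0) && all(d[-1] == -d[-length(d)])
  })
}

# Toy scale block with four respondents.
responses <- data.frame(
  q1 = c(3, 1, 2, 4),
  q2 = c(3, 5, 4, 4),
  q3 = c(3, 1, 1, 2),
  q4 = c(3, 5, 5, 3)
)

straight <- unname(is_straightlined(responses))
zigzag   <- unname(is_zigzag(responses))
```

Here the first respondent answers 3 throughout and is flagged as straightlining, while the second alternates 1, 5, 1, 5 and is flagged as zigzag.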
Output:
- Cleaned datasets in LimeSurvey Processed/clean/
- Summary reports with removal counts per step

source("src/pipeline/02_clean.R")

Purpose: Consolidate cleaned datasets (in development)
Key Operations:
- Fix and harmonize demographic variables (e.g., Adults_* columns)
- Apply missing value coding logic
- Merge across countries/waves
Status: In development
source("src/pipeline/03_merge_general.R")

- Function: Preprocesses raw LimeSurvey exports
- Key Functions:
  - normalize_column_names(): Standardizes variable names
  - write_clean(): Removes all-NA columns before saving
- Input: .sav files in DIR_RAW
- Output: Processed .sav files in DIR_PROCESSED
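The two key operations above can be sketched in base R as follows. This is an illustration under assumptions, not the code of `normalize_column_names()` or `write_clean()`: the renaming rule is inferred from the single IT_IT1 → IT_1 example (a scale prefix repeated after the underscore), and both function names are hypothetical.

```r
# Collapse a duplicated scale prefix, e.g. "IT_IT1" -> "IT_1".
# Assumption: the variation always has the form PREFIX_PREFIX<number>.
normalize_names_sketch <- function(nms) {
  sub("^([A-Za-z]+)_\\1([0-9]+)$", "\\1_\\2", nms, perl = TRUE)
}

# Drop columns that contain only missing values.
drop_all_na_cols <- function(df) {
  keep <- vapply(df, function(col) !all(is.na(col)), logical(1))
  df[, keep, drop = FALSE]
}

raw <- data.frame(IT_IT1 = c(1, 2), IT_2 = c(3, NA), empty = c(NA, NA))
names(raw) <- normalize_names_sketch(names(raw))
processed <- drop_all_na_cols(raw)

print(names(processed))
```

A real implementation would also need to guard against name collisions, for example when a file contains both IT_IT2 and IT_2.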
- Function: Main cleaning pipeline orchestrator
- Defines: Multi-step cleaning workflow with dataset-specific gating
- Uses: Functions from cleaning_functions.R
- Input: Processed .sav files from step 1
- Output: Clean datasets + audit reports
- Function: Merges cleaned datasets
- Special handling:
  - Adults demographic variable (wide → long transformation)
  - Missing value reason codes
- Input: Clean .sav files from step 2
- Output: Merged dataset(s)
Core cleaning utilities:
Structure Builders:
- mk_step(): Define individual cleaning steps
- mk_group(): Group steps with shared logic
Data Quality Functions:
- step_drop_na_block(): Remove rows with missing scale blocks
- step_constant_answers(): Detect straightlining
- step_detect_zigzag(): Identify alternating patterns
- step_check_control(): Validate attention checks
- step_filter_age(): Age-based filtering (default 18-65)
- step_filter_min_duration(): Remove fast completions
- step_remove_foreigners(): Nationality filtering
Statistical Outlier Detection:
- step_mahalanobis(): Multivariate outlier detection
- step_guttman(): Person-fit analysis (via PerFit)
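As an illustration of multivariate outlier detection, the sketch below flags rows whose squared Mahalanobis distance exceeds a chi-square cutoff. The function name, the cutoff (`alpha = 0.001`), and the handling of incomplete rows are assumptions; `step_mahalanobis()` may be configured differently.

```r
# Flag rows whose squared Mahalanobis distance exceeds the chi-square
# quantile at 1 - alpha (df = number of items). Rows with missing values
# are not scored and are never flagged (assumption).
flag_mahalanobis <- function(items, alpha = 0.001) {
  x <- as.matrix(items)
  complete <- stats::complete.cases(x)
  xc <- x[complete, , drop = FALSE]

  d2 <- rep(NA_real_, nrow(x))
  d2[complete] <- stats::mahalanobis(xc, center = colMeans(xc), cov = stats::cov(xc))

  cutoff <- stats::qchisq(1 - alpha, df = ncol(x))
  data.frame(d2 = d2, outlier = !is.na(d2) & d2 > cutoff)
}

# Simulated scale scores plus one clearly aberrant respondent in the last row.
set.seed(1)
scores <- rbind(matrix(rnorm(200 * 3), ncol = 3), c(8, -8, 8))
flags <- flag_mahalanobis(scores)

print(tail(flags, 3))
```

Because the mean and covariance are estimated from the same data, extreme rows inflate the estimates slightly; a robust covariance estimator is a common alternative when many outliers are expected.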
Pipeline Execution:
- run_cleaning_pipeline(): Execute steps with gating logic
- build_wide_summary(): Generate removal audit reports
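The idea of running steps in sequence, gating some of them by dataset, and recording removal counts can be sketched as below. Everything here is hypothetical: `make_step()` and `run_steps_sketch()` are stand-ins for illustration and do not reproduce the signatures of `mk_step()`, `run_cleaning_pipeline()`, or `build_wide_summary()`.

```r
# A step is a function returning a logical "keep" vector, plus an optional
# list of dataset IDs it applies to (NULL means every dataset).
make_step <- function(fn, datasets = NULL) list(fn = fn, datasets = datasets)

run_steps_sketch <- function(df, steps, dataset_id) {
  audit <- vector("list", length(steps))
  for (i in seq_along(steps)) {
    step <- steps[[i]]
    gated_out <- !is.null(step$datasets) && !(dataset_id %in% step$datasets)
    if (gated_out) {
      n_removed <- NA_integer_          # step skipped for this dataset
    } else {
      keep <- step$fn(df)
      n_removed <- sum(!keep)
      df <- df[keep, , drop = FALSE]
    }
    audit[[i]] <- data.frame(step = names(steps)[i],
                             n_removed = n_removed,
                             n_remaining = nrow(df))
  }
  list(data = df, audit = do.call(rbind, audit))
}

survey <- data.frame(age = c(17, 25, 40, 70), duration_s = c(300, 20, 400, 500))
steps <- list(
  age_18_65    = make_step(function(d) d$age >= 18 & d$age <= 65),
  min_duration = make_step(function(d) d$duration_s >= 60, datasets = "US_all")
)

result <- run_steps_sketch(survey, steps, dataset_id = "CH_277273")
print(result$audit)
```

In this run the age step removes two respondents, and the duration step is skipped because it is gated to a different dataset, which the audit table records as NA.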
Merging and labeling utilities:
- get_schema(): Extract variable labels, value labels, and metadata
- apply_schema_from(): Apply label schema from one variable to another
- augment_with_reasons(): Add missing value reason codes
- Missing value codes:
  - 990: By design
  - 991: Unknown missing
  - 992: Technical error
  - 993: Refused
  - 994: Dubious
  - 995: Nonresponse
  - 999: Not applicable
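The code table above can be represented as a named vector and used to recode missing values. This is a simplified sketch: `MISSING_CODES` and `code_missing()` are hypothetical names, and assigning one reason to every NA in a variable is an assumption that may not match how `augment_with_reasons()` works.

```r
# Missing value reason codes, taken from the list above.
MISSING_CODES <- c(
  by_design       = 990,
  unknown_missing = 991,
  technical_error = 992,
  refused         = 993,
  dubious         = 994,
  nonresponse     = 995,
  not_applicable  = 999
)

# Replace every NA in x with the code for the given reason.
code_missing <- function(x, reason) {
  stopifnot(reason %in% names(MISSING_CODES))
  x[is.na(x)] <- MISSING_CODES[[reason]]
  x
}

answers <- c(4, NA, 2, NA)
coded <- code_missing(answers, "refused")

# Coded values must be excluded again before computing statistics.
is_code <- coded %in% MISSING_CODES
mean_valid <- mean(coded[!is_code])
```

When writing SPSS files, such codes are usually declared as user-defined missing values so that downstream software does not treat them as real answers.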
Dataset comparison and validation:
- compare_dfs_compact(): Comprehensive dataset comparison
  - Column name differences
  - Type mismatches
  - Label inconsistencies
  - Value differences
- Useful for QA and merging preparation
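A reduced version of such a comparison, covering only column names and types, might look like the sketch below. `compare_columns_sketch()` is a hypothetical name and its return value is an assumption; it does not reproduce the output of `compare_dfs_compact()`, and it ignores labels and values.

```r
# Compare two data frames by column names and by the first class of each
# shared column.
compare_columns_sketch <- function(a, b) {
  shared <- intersect(names(a), names(b))
  type_a <- vapply(a[shared], function(col) class(col)[1], character(1))
  type_b <- vapply(b[shared], function(col) class(col)[1], character(1))
  list(
    only_in_a     = setdiff(names(a), names(b)),
    only_in_b     = setdiff(names(b), names(a)),
    type_mismatch = shared[type_a != type_b]
  )
}

wave1 <- data.frame(id = 1:2, age = c(30, 41), country = c("BR", "PT"))
wave2 <- data.frame(id = 1:2, age = c("30", "41"), grit_1 = c(4, 5))

diffs <- compare_columns_sketch(wave1, wave2)
str(diffs)
```

Running a check like this before merging surfaces variables such as `age` that were exported as text in one wave and as numbers in another.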
Purpose: Generate comprehensive overlap matrix
Creates three matrices showing:
- Existence: Which columns exist in which datasets
- All-NA detection: Columns with no data
- Type tracking: Data types (including haven-specific types)
Output: CSV with visual indicators:
- Present marker + type: Column present with data
- All-NA marker + type: Column present but all NA
Use case: Quickly identify which scales were administered in which countries/waves
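The three kinds of information (existence, all-NA detection, and type) can be combined into a single column-by-dataset matrix, as in this sketch. The function name and the plain-text cell values ("absent", "all_na", or the column class) are assumptions and differ from the visual indicators used in the script's CSV output.

```r
# Build a matrix with one row per column name and one column per dataset.
# Each cell is "absent", "all_na", or the first class of the column.
build_overlap_sketch <- function(datasets) {
  all_cols <- sort(unique(unlist(lapply(datasets, names))))
  status <- matrix("absent",
                   nrow = length(all_cols), ncol = length(datasets),
                   dimnames = list(all_cols, names(datasets)))
  for (ds in names(datasets)) {
    df <- datasets[[ds]]
    for (col in names(df)) {
      status[col, ds] <- if (all(is.na(df[[col]]))) "all_na" else class(df[[col]])[1]
    }
  }
  status
}

# Toy datasets keyed by IDs from the dataset groupings.
datasets <- list(
  CH_277273 = data.frame(FTOS_1 = c(1, 2), Grit_1 = c(NA, NA)),
  US_all    = data.frame(FTOS_1 = c(3, 4), DASS_1 = c(0L, 1L))
)

overlap <- build_overlap_sketch(datasets)
print(overlap)
```

The resulting matrix can be written out with `write.csv(overlap, "overlap.csv")`, where the file name is only an example.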
source("src/analysis/consolidate_datasets.R")

Purpose: Statistical analysis of survey completion times
Key Function: analyze_duration_histograms()
Features:
- Per-page duration distributions
- Robust outlier detection (median ± 3×MAD)
- Z-scores based on Median Absolute Deviation (MAD)
- Faceted and individual page plots
- Outlier percentage tables
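The median ± 3×MAD rule from the feature list can be sketched as follows. The function name is hypothetical, and using R's default scaling constant in `stats::mad()` (1.4826, which makes the MAD comparable to a standard deviation under normality) is an assumption; `analyze_duration_histograms()` may scale the MAD differently.

```r
# Flag values whose robust z-score, (x - median) / MAD, exceeds the threshold
# in absolute value.
flag_mad_outliers <- function(x, threshold = 3) {
  med <- stats::median(x, na.rm = TRUE)
  mad_scaled <- stats::mad(x, na.rm = TRUE)  # includes the 1.4826 constant
  robust_z <- (x - med) / mad_scaled
  data.frame(value = x, robust_z = robust_z, outlier = abs(robust_z) > threshold)
}

# Seconds spent on one survey page by nine respondents.
page_seconds <- c(42, 38, 45, 40, 41, 39, 44, 300, 3)
flags <- flag_mad_outliers(page_seconds)

print(flags[flags$outlier, ])
```

In this example the median is 41 seconds and the unscaled MAD is 3 seconds, so both the 300-second and the 3-second responses fall far outside the band. If the MAD is zero (more than half of the values identical), the z-scores are undefined and a fallback rule is needed.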
Output: Duration plots and CSV reports
Note: After analysis, duration filters were NOT applied to final cleaning
source("src/analysis/duration_analysis.R")

Defined in config/paths.R:
DATASETS <- list(
br_pt = c("br_pilot", "PTBR_277273", "PTBR_999625"),
ch = c("CH_277273", "CH_999625"),
us = c("US_all", "US_216254", "US_868141"),
first_stage = c("CH_277273", "EN_277273", "ES_277273",
"IT_277273", "PTBR_277273", "SL_277273",
"US_all", "IT_extra", "US_216254", "US_868141")
)

Country codes:
- BR/PTBR: Brazil
- PT: Portugal
- CH: China
- US: United States
- EN: English
- ES: Spanish (Spain)
- IT: Italian (Italy)
- SL: Slovenian (Slovenia)
Survey instruments (examples):
- FTOS: Future Time Orientation Scale
- AS: Authenticity Scale
- MiLQ/MLQ: Meaning in Life Questionnaire
- CIPIP: Circumplex of Interpersonal Problems
- LS: Life Satisfaction
- Grit: Grit Scale
- CAAS: Career Adapt-Abilities Scale
- DASS: Depression, Anxiety and Stress Scale
MIT License - see LICENSE file for details.
Copyright (c) 2025 Qixiang Fang
Project: Narrating the Future (Bogdan)
Last Updated: November 2025