NSAPH-Data-Processing/medpar_outcomes

Medicare Provider Analysis and Review (MedPAR) Outcomes

This repository provides a workflow that generates tagged outcomes from the MedPAR denominator file using a user-defined set of ICD (International Classification of Diseases) diagnosis code lists. The workflow also generates outcome counts per year or per zipcode-year.

About MedPAR dataset

The MedPAR dataset contains detailed records of hospital inpatient and skilled nursing facility (SNF) stays for Medicare beneficiaries in the United States. It includes key information such as dates of admission and discharge, diagnoses, procedures, and billing details, with diagnoses and procedures recorded as ICD codes.

Project Overview

The MedPAR Outcomes Processing repository aims to provide a streamlined approach for labeling hospital admissions in the MedPAR dataset using ICD codes. This process is essential for researchers and healthcare analysts who need to filter and analyze hospital data based on specific medical conditions.

Repository Structure

The repository is organized into the following directories and files:

  • data/: Directory containing subfolders for input and output files. For NSAPH internal processes, data/input/README.md includes the symlink commands for use on the shared secure cluster.
    • input/: Contains the MedPAR datasets that need to be processed.
    • output/: The directory where the processed datasets are saved. These datasets include admission details along with labels indicating the presence of conditions as defined by the ICD codes.
  • icd_codes/: Contains YAML files that list the ICD codes used to label and categorize hospital admissions. Each file contains multiple conditions for a given study.
  • scripts/: Python scripts for processing the MedPAR data and applying the ICD code lists.
  • notes/: Includes exploratory notebooks and project-specific details.
  • README.md: Provides an overview of the project and instructions for usage.
  • requirements.yaml: Conda environment file, containing the Python package dependencies and versions required to run the scripts.

Getting Started

Clone the repository and create a conda environment:

git clone https://github.com/<user>/<repo>
cd <repo>

conda env create -f requirements.yml
conda activate <env_name>

Usage

Step 1: Prepare Your Data

  • Add symlinks to the input and output folders inside the corresponding data/ subfolders.

For example:

export HOME_DIR=$(pwd)

cd $HOME_DIR/data/input/
ln -s <input_path> .

cd $HOME_DIR/data/output/
ln -s <output_path> .

The README.md files inside the /data subfolders contain path documentation for NSAPH internal purposes.

Step 2: Define ICD Code Lists

  • The ICD code lists are central to labeling the admissions. These lists should be defined in YAML format and placed in the icd_codes/ directory. Each file in this directory should represent a specific list of conditions and include the relevant ICD codes.

Step 3: Run the Processing Script

  • The processing script reads ICD codes from a YAML file, constructs SQL queries to match these codes against the diagnoses in the MedPAR data, and tags each hospitalization accordingly. The tagged data is then saved in a specified format (such as Parquet, Feather, or CSV) for further analysis. To process the data and label the admissions, run the following command:
python src/get_outcomes.py

In addition, .sbatch templates are provided for SLURM users. Be mindful that each HPC cluster has a different configuration, so the .sbatch files might need to be modified accordingly.
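
The query-building step described in Step 3 can be illustrated with a minimal sketch. It is not the repository's implementation: it uses Python's built-in sqlite3 module as a stand-in for whatever SQL engine the script actually uses, and the table name `admissions`, the columns `adm_id`, `diag1`, and `diag2`, and the helper names are all hypothetical.

```python
import sqlite3

# Hypothetical layout: one row per admission, diagnosis codes in diag1/diag2.
DIAG_COLUMNS = ["diag1", "diag2"]


def build_tag_query(condition: str, codes: list[str]) -> tuple[str, list[str]]:
    """Return SQL plus parameters that flag admissions matching any code."""
    # The condition key becomes a column name, so only accept plain identifiers.
    if not condition.isidentifier():
        raise ValueError(f"unsafe column name: {condition!r}")
    placeholders = ", ".join("?" for _ in codes)
    match = " OR ".join(f"{col} IN ({placeholders})" for col in DIAG_COLUMNS)
    sql = (
        f"SELECT adm_id, CASE WHEN {match} THEN 1 ELSE 0 END AS {condition} "
        "FROM admissions ORDER BY adm_id"
    )
    # The code list is bound once per diagnosis column.
    return sql, codes * len(DIAG_COLUMNS)


def tag_admissions(conn: sqlite3.Connection, condition: str, codes: list[str]):
    """Run the tagging query and return (adm_id, flag) rows."""
    sql, params = build_tag_query(condition, codes)
    return conn.execute(sql, params).fetchall()


# Toy data standing in for MedPAR admissions (assumed, not real records).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE admissions (adm_id INTEGER, diag1 TEXT, diag2 TEXT)")
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [(1, "G3183", "I10"), (2, "I10", None), (3, "E119", "33182")],
)
rows = tag_admissions(conn, "lewy", ["33182", "G3183"])
print(rows)
```

In this toy example, admissions 1 and 3 carry a listed code and receive a flag of 1, while admission 2 receives 0. Binding the codes as query parameters, rather than pasting them into the SQL string, keeps the query safe if a code list contains unexpected characters.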

Step 4: Review the Output

  • The output of the processing script will be saved in the output/ directory. The labeled dataset will include original admission details along with additional columns indicating the presence of conditions as defined by the ICD code lists.

ICD Code Lists

# Icd list created for project ....
# Example: Dementia with Lewy bodies
lewy:
  long_name: "Dementia with Lewy Bodies"
  icd9: ['33182']
  icd10: ['G3183']

YAML Format for ICD Code Lists

The ICD code lists are defined in YAML format to ensure they are easy to read and maintain. Each YAML file should have the following structure:

  • #Icd list created for project: This comment should precede the listing of ICD codes to indicate the purpose of the file.
  • lewy: This is a unique key representing the condition (in this case, Dementia with Lewy Bodies).
  • long_name: A human-readable name for the condition.
  • icd9: A list of ICD-9 codes associated with this condition.
  • icd10: A list of ICD-10 codes associated with this condition.
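
As a rough illustration of the structure described above, the sketch below checks a parsed ICD list before it is used. It assumes the YAML file has already been loaded into a Python dictionary (for example with a YAML parser), and that codes are stored upper-case and without the decimal point, as in the `lewy` example. The function names are hypothetical and not part of the repository.

```python
from typing import Any

REQUIRED_KEYS = ("long_name", "icd9", "icd10")


def normalize_code(code: str) -> str:
    """Upper-case a code and drop the decimal point (assumed convention)."""
    return code.strip().upper().replace(".", "")


def validate_icd_lists(conditions: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Check every condition entry and return a copy with normalized codes."""
    cleaned = {}
    for key, entry in conditions.items():
        missing = [k for k in REQUIRED_KEYS if k not in entry]
        if missing:
            raise ValueError(f"condition '{key}' is missing keys: {missing}")
        cleaned[key] = {
            "long_name": str(entry["long_name"]),
            "icd9": [normalize_code(str(c)) for c in entry["icd9"]],
            "icd10": [normalize_code(str(c)) for c in entry["icd10"]],
        }
    return cleaned


# This mapping mirrors what a YAML parser would return for the example file,
# with the codes deliberately written in dotted / lower-case form.
example = {
    "lewy": {
        "long_name": "Dementia with Lewy Bodies",
        "icd9": ["331.82"],
        "icd10": ["g31.83"],
    }
}
print(validate_icd_lists(example)["lewy"])
```

Failing early on a missing key makes a malformed list visible before any admissions are tagged with it.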

Adding New ICD Code Lists

To add a new ICD code list:

  • Create a New YAML File: In the icd_codes/ directory, create a new YAML file with a descriptive name (e.g., icd_code_.yml).
  • Add ICD Codes: Define the condition and associated ICD codes in the YAML format, as shown above.
  • Update the Script: Ensure that the processing script references this new ICD code list file if necessary in the argument parsing.
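
For the last step, one possible shape for the argument parsing is sketched below. The option names `--icd_yaml` and `--output_format` are assumptions made for illustration; check the actual script for the names it defines. The format choices follow the ones mentioned in Step 3.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with hypothetical options for the ICD list and output format."""
    parser = argparse.ArgumentParser(
        description="Tag admissions using an ICD code list."
    )
    # Hypothetical option: path to the YAML file inside icd_codes/.
    parser.add_argument("--icd_yaml", type=Path, required=True,
                        help="YAML file with the ICD code list to apply")
    # Hypothetical option: file format for the tagged output.
    parser.add_argument("--output_format", default="parquet",
                        choices=["parquet", "feather", "csv"],
                        help="format used to save the tagged data")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args(["--icd_yaml", "icd_codes/my_study.yml"])
print(args.icd_yaml, args.output_format)
```

With an interface like this, pointing the script at a new list would only require passing a different file path, with no change to the code.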

Output

  • After processing, the output will be a labeled dataset saved in the output/ directory. The dataset includes all original hospital admission details along with labels indicating the presence of conditions or procedures based on the ICD code lists.
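
The outcome counts per year or per zipcode-year that the workflow produces amount to summing the 0/1 labels within each group. A minimal sketch using only the standard library follows; the column names `year`, `zipcode`, and `lewy` and the sample rows are hypothetical and do not reflect the actual output schema.

```python
from collections import Counter

# Hypothetical tagged admissions; real output columns may be named differently.
rows = [
    {"year": 2010, "zipcode": "02138", "lewy": 1},
    {"year": 2010, "zipcode": "02138", "lewy": 0},
    {"year": 2010, "zipcode": "02139", "lewy": 1},
    {"year": 2011, "zipcode": "02138", "lewy": 0},
]


def count_outcomes(rows, outcome, keys):
    """Sum a 0/1 outcome flag within each group defined by `keys`."""
    counts = Counter()
    for row in rows:
        group = tuple(row[k] for k in keys)
        # Adding 0 still creates the key, so groups without events are kept.
        counts[group] += row[outcome]
    return dict(counts)


per_year = count_outcomes(rows, "lewy", ["year"])
per_zip_year = count_outcomes(rows, "lewy", ["zipcode", "year"])
print(per_year)
print(per_zip_year)
```

Keeping groups with zero events matters for downstream analyses in which a zipcode-year without admissions for a condition should be counted as zero, not treated as missing.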

About

Tag hospitalizations in the medpar denominator file using icd code lists
