Precision Medicine MiniDemo

QUESTION:

Which patients with condition D1 would you recommend be given treatment W to reduce their risk of outcome Y1?

ANSWER:

  1. Running 'tx_effect.py' in Python 3.13 with the dependencies installed generates 'docs/results/tx_recommendations.csv', which contains a treatment recommendation for every patient with diagnosis D1. Patients are further grouped by similar treatment-effect scores for easier outreach/follow-up assignment and prioritization.

  2. For a light review of the methodology used, please see the example whitepaper: "/docs/results/PrecisionMedicine_miniWhitepaper.pdf"

Project Structure

precision-medicine/
├── data/                           
│   └── data.csv                        ← source patient data flat file 
├── docs/
│   ├── assets                          ← figures, images, tables
│   └── results                         ← deliverables
│       ├── tx_recommendations.csv      ← treatment recommendations flat file artifact (deliverable)
│       └── uplift_distribution.png     ← data-analytic plot of Average Tx Effect (CATE/uplift) distribution
├── utils/
│   ├── meta_author.json                ← corresponding author metadata
│   ├── git_init.py                     ← python script to initialize git repo
│   └── logging_config.py               ← custom logging config to track runtime kickoffs
├── utils/
│   ├── 1_descriptive_stats.py          ← descriptive statistics of source dataframe and its characteristics
│   └── 2_treatment_effect.py           ← derive ATE/CATE and cluster results for patient engagement prioritization
├── README.md                           ← this file you're viewing / describes data-analytic procedure
├── SECURITY.md                         ← example infosec policy to report CVEs or other exposures
└── requirements.txt                    ← python 3.13 library dependencies

Here is my procedure:

Step 0: Acquire source patient data

  • Download the patient data file
  • Initialize this git repository as the response
  • Generate descriptive stats to learn attribute dtypes (view in console manually):
import pandas as pd


def describe_csv(csv_file_path: str, max_cats: int = 10) -> dict:
    """
    Purpose: I manually view dummy variable data types
            - specifically 1/0 integers/numbers/decimals/floats versus True/False boolean/strings
    Method: Ingest a csv and compute descriptive statistics and freq counts.
    Args: csv_file_path (str): Path to the CSV file; max_cats (int): Max number of value counts to display per column.
    Returns: dictionary summary
    """
    df = pd.read_csv(csv_file_path)

    # Basic descriptive stats for numeric columns
    numeric_summary = df.describe(include=[float, int]).to_dict()

    # Column data types
    dtypes_summary = df.dtypes.astype(str).to_dict()

    # Frequency counts for each column
    freq_summary = {}
    for col in df.columns:
        freq_summary[col] = (
            df[col]
            .value_counts(dropna=False)
            .head(max_cats)
            .to_dict()
        )

    summary = {
        "shape": df.shape,
        "columns": df.columns.tolist(),
        "dtypes": dtypes_summary,
        "numeric_summary": numeric_summary,
        "frequency_summary": freq_summary
    }

    return summary


# call it and scroll through console output manually perusing dtypes
if __name__ == "__main__":
    result = describe_csv("data/data.csv")
    print("Dataset shape:", result["shape"])
    print("\nColumn data types:")
    for col, dtype in result["dtypes"].items():
        print(f"  {col}: {dtype}")
    print("\n--- Numeric Summary ---")
    print(pd.DataFrame(result["numeric_summary"]))
    print("\n--- Frequency Counts ---")
    for col, freqs in result["frequency_summary"].items():
        print(f"\n{col}:")
        for val, count in freqs.items():
            print(f"  {val}: {count}")

Step 1: Frame the problem precisely

| Category | Details |
|---|---|
| Target Cohort | Patients with disease of interest where D1 == 1 |
| Treatment (Tx) | W (1 / TRUE = received treatment) |
| Outcome to Reduce | Y1 (1 = bad outcome) |
| Goal | Part A: for each patient in the D1 cohort, estimate the individual treatment effect (uplift) on outcome Y1. Part B: recommend treatment W for those with predicted risk reduction (> 0). |
| Assumptions | 1) Consistency: each observation's outcome is the potential outcome under the treatment actually received (no hidden versions of treatment). 2) Ignorability: given X1–X80 (+ DDx, site), treatment assignment is as-if random (no unmeasured confounders). 3) Positivity: every profile had a non-zero chance of being both treated and untreated. |
| Deliverables | 1) Baseline for the D1 cohort (treated/untreated rates, outcome rates). 2) Average treatment effect (ATE/CATE) estimates (propensity weighting + doubly-robust checks). 3) Uplift (individual Tx effect) score per patient. 4) Recommended-treatment list of patient IDs with positive estimated benefit. |

Step 2: Look at the baseline

Compute basic rates for D1 == 1 to get a naive (unadjusted) risk comparison from the data.

import pandas as pd
import numpy as np
from prettyprinter import pprint
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# adjust path/file
precision_df = pd.read_csv("data/data.csv")

# normalize booleans to 0/1 if needed
# if we do this, we can just use sum() and mean() for descriptive stats if the boolean flag was text 'TRUE/FALSE'
bool_map = {"TRUE": 1, "True": 1, True: 1, "FALSE": 0, "False": 0, False: 0}
if precision_df['W'].dtype == object:
    precision_df['W'] = precision_df['W'].map(bool_map).astype(int)
if precision_df['Y1'].dtype == object:
    precision_df['Y1'] = precision_df['Y1'].map(bool_map).astype(int)

# cohort: D1 == 1
cohort = precision_df[precision_df['D1'] == 1].copy()

baseline_summary = {
    "n_total_D1": len(cohort),
    "treated_n": int(cohort['W'].sum()),
    "untreated_n": int((1 - cohort['W']).sum()),
    "treated_rate": float(cohort['W'].mean()),

    "Y1_rate_overall": float(cohort['Y1'].mean()),
    "Y1_rate_treated": float(cohort.loc[cohort['W'] == 1, 'Y1'].mean()),
    "Y1_rate_untreated": float(cohort.loc[cohort['W'] == 0, 'Y1'].mean()),
}

baseline_summary["naive_risk_diff_treated_minus_untreated"] = (
        baseline_summary["Y1_rate_treated"] - baseline_summary["Y1_rate_untreated"]
)

pd.DataFrame([baseline_summary])
pprint(baseline_summary)


Step 3: Estimate the average treatment effect (ATE) inside the D1 cohort

  • We now estimate propensity scores and apply a doubly-robust estimator so the result is less biased by baseline differences between treated and untreated patients.

Propensity Score Model

The model estimates the probability that a patient receives treatment (W == 1) given their features (X):

$p(X) = P(W = 1 \mid X)$

Using logistic regression, this is:

$p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top X)}}$

Where:

  • $X = (X_1, X_2, \ldots, X_{80}, \text{site\_id})$
  • $\beta$ is the coefficient vector
  • $\beta_0$ is the intercept
  • Propensity model predicts the statistical likelihood that each patient would receive the treatment (W) given all baseline features (X1 - X80 and site)
  • Outcome model predicts the likelihood of a bad outcome for Y1 given treatment assignment and features.
  • Combine both → the doubly-robust estimator stays consistent even if one of the two models is misspecified (as long as the other is correct).
  • Here is an explanation of Average Tx Effect:

| Term | Meaning | Formula | Typical Usage |
|---|---|---|---|
| ATE | Average Treatment Effect | $E[Y(1) - Y(0)]$ | Overall average effect of treatment in a population. |
| CATE | Conditional Average Treatment Effect | $E[Y(1) - Y(0) \mid X=x]$ | Average effect conditional on some covariates (e.g., age, comorbidities, site, severity). |
  • ATE tells you “on average, what’s the treatment’s effect overall.”
  • CATE tells you “on average, what’s the treatment’s effect for patients like this one (given X ).”
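To make the ATE/CATE distinction concrete, here is a toy example with known potential outcomes (purely illustrative, not project data; in real data we never observe both y0 and y1 for the same patient):

```python
import pandas as pd

# Toy cohort with KNOWN potential outcomes:
# y0/y1 = bad-outcome indicator without/with treatment; severe = a covariate.
toy = pd.DataFrame({
    "severe": [0, 0, 0, 0, 1, 1, 1, 1],
    "y0":     [1, 1, 0, 0, 1, 1, 1, 1],
    "y1":     [1, 0, 0, 0, 0, 0, 1, 0],
})

toy["effect"] = toy["y1"] - toy["y0"]

# ATE: average effect over the whole cohort
ate = toy["effect"].mean()                       # -0.5 → treatment helps overall

# CATE: average effect within each covariate stratum
cate = toy.groupby("severe")["effect"].mean()    # -0.25 (mild) vs -0.75 (severe)

print(f"ATE = {ate}")
print(cate)
```

Here the severe stratum benefits three times as much as the mild stratum, which is exactly the per-patient signal the CATE step exploits.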
# reload data and filter to the D1 cohort (all patients, treated and untreated)
precision_df = pd.read_csv("data/data.csv")
cohort = precision_df[precision_df['D1'] == 1].copy()

# 1) prep feature matrix
feature_cols = [f"X{i}" for i in range(1, 81)] + ["site_id"]
X = cohort[feature_cols]
W = cohort["W"].astype(int)   # assumes W/Y1 were normalized to 0/1 as in Step 2
Y = cohort["Y1"].astype(int)

# 2a) propensity model: P(W=1|X)
ps_model = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(max_iter=500))
])
ps_model.fit(X, W)
cohort["p_hat"] = ps_model.predict_proba(X)[:, 1]

# 2b) remove extreme probabilities to avoid huge weights
cohort["p_hat"] = cohort["p_hat"].clip(0.01, 0.99)

# 3a) outcome model: E[Y|W,X] - fit the model separately for treated and untreated (t: with Tx W; c: without)
outcome_model_t = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(max_iter=500))
])
outcome_model_c = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(max_iter=500))
])

outcome_model_t.fit(X[W == 1], Y[W == 1])
outcome_model_c.fit(X[W == 0], Y[W == 0])

# 3b) predicted outcomes under each condition
mu1 = outcome_model_t.predict_proba(X)[:, 1]  # if treated
mu0 = outcome_model_c.predict_proba(X)[:, 1]  # if untreated

# 4a) create the doubly robust ATE composite estimate (aka ate_dr)
cohort["dr_term"] = (
    (W * (Y - mu1) / cohort["p_hat"])
    - ((1 - W) * (Y - mu0) / (1 - cohort["p_hat"]))
    + (mu1 - mu0)
)
ate_dr = cohort["dr_term"].mean()  # just take the mean

print(f"Estimated ATE (Y1 bad outcome difference, treated - untreated): {ate_dr:.4f}")
print(f"Estimated Risk reduction by treatment: {-ate_dr:.4f}")
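As a sanity check of the DR formula itself (not part of the project scripts), the same dr_term expression recovers a known effect on synthetic randomized data when the true nuisance values are plugged in:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data with a KNOWN effect: treatment lowers bad-outcome risk by 0.10
x = rng.normal(size=n)
w = rng.integers(0, 2, size=n)               # randomized treatment
p0 = 1 / (1 + np.exp(-x))                    # untreated risk
y = rng.binomial(1, np.clip(p0 - 0.10 * w, 0, 1))

# Plug in the TRUE nuisance values to isolate the DR formula
p_hat = np.full(n, 0.5)                      # true propensity (randomized)
mu1 = np.clip(p0 - 0.10, 0.0, 1.0)           # true E[Y | W=1, X]
mu0 = p0                                     # true E[Y | W=0, X]

dr_term = (w * (y - mu1) / p_hat
           - (1 - w) * (y - mu0) / (1 - p_hat)
           + (mu1 - mu0))
ate_dr = dr_term.mean()
print(f"Doubly-robust ATE estimate: {ate_dr:.3f} (design truth ~ -0.10)")
```

The estimate lands close to the designed -0.10, confirming the composite estimate is wired correctly before trusting it on the real cohort.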

Tx Risk MGMT

| ATE Sign | Interpretation |
|---|---|
| Negative | Treatment reduces bad outcomes → good news |
| Positive | Treatment worsens outcomes → bad news |

  • We can convert ate_dr to a "% reduction in bad-outcome risk."
  • ate_dr is the doubly-robust average treatment effect on Y1.
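For example (toy numbers, not project results), converting the absolute estimate into relative terms:

```python
# Toy numbers: convert an absolute DR estimate into a relative risk reduction.
ate_dr = -0.04            # treated minus untreated bad-outcome rate
y1_rate_untreated = 0.20  # baseline risk without treatment

abs_reduction = -ate_dr                           # 4 percentage points
rel_reduction = abs_reduction / y1_rate_untreated # fraction of baseline risk removed

print(f"Absolute risk reduction: {abs_reduction:.1%}")   # 4.0%
print(f"Relative risk reduction: {rel_reduction:.1%}")   # 20.0%
```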


Step 4: Compute CATE (uplift) for each pat_id

  • The goal was to generate individual-level treatment recommendations — the CATE/uplift for each patient with D1.
  • So far, we have already fit these models:
    • Propensity model: $\hat p(X) = P(W=1 \mid X)$
    • Outcome models: $\mu_1(X) = E[Y \mid W=1, X]$ for treated and $\mu_0(X) = E[Y \mid W=0, X]$ for untreated pat_ids
  • I want to try and keep it simple, stable, and interpretable.

Now CATE is: $\text{CATE}_i = \mu_1(X_i) - \mu_0(X_i)$

Since Y1 == 1 means “bad outcome,” we want:

  • Negative CATE → treatment reduces risk → recommend treatment
  • Positive → treatment increases/worsens risk → do NOT treat
  • Near zero → “no strong effect” meaning probably do NOT treat
# 4b) calculate individual-level CATE (uplift)
cohort["cate"] = mu1 - mu0   # predicted bad-outcome difference

# risk reduction is the negative of that (because Y1=1 means bad)
cohort["risk_reduction"] = -cohort["cate"]

# generate the Tx recommendation: treat if predicted risk reduction > 0
cohort["recommend_treatment"] = cohort["risk_reduction"] > 0

# recommendation table
recommend_df = cohort[
    ["patient_id", "cate", "risk_reduction", "recommend_treatment"]
].sort_values("risk_reduction", ascending=False)

# export to csv (deliverable path per the project structure)
recommend_df.to_csv("docs/results/tx_recommendations.csv", index=False)
print()
pprint(recommend_df.head(10))

Output columns:

| Column | Meaning |
|---|---|
| cate | Predicted change in risk if treated vs untreated |
| risk_reduction | Negative of CATE (lower Y1 = better) |
| recommend_treatment | Boolean: True = predicted benefit |

Let's interpret the values:

  • risk_reduction = 0.12 → treatment lowers bad-outcome risk by 12 percentage points.
  • risk_reduction = -0.05 → treatment increases risk by 5 points (do not treat).

Want to prioritize patients for treatment-screening appointments?

  • Sorting by risk reduction gives the highest-priority patients to treat.
  • We can organize them into treatment-priority clusters in step 5.
  • Direct clinicians to contact/follow up with prioritized treatment-recommended individuals or groups first when working on a clinical schedule and timeline.

Treatment-Recommendation Map (when to treat):

| Scenario | Interpretation | Recommendation |
|---|---|---|
| CATE < 0 | Treatment lowers risk of Y1 (good) | Treat |
| CATE > 0 | Treatment increases risk of Y1 (bad) | Do not treat |
| CATE ≈ 0 | No meaningful difference | Neutral / optional |
| risk_reduction = −CATE > 0 | Predicts fewer bad outcomes with treatment | Treat |
| risk_reduction = −CATE < 0 | Predicts more bad outcomes with treatment | Avoid |
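The map above can be sketched as a small helper. The `eps` "near zero" band is a hypothetical threshold added here for illustration; the main pipeline uses the strict risk_reduction > 0 rule with no band:

```python
def recommend(cate: float, eps: float = 0.01) -> str:
    """Map a CATE estimate to a treatment recommendation.

    eps is a hypothetical 'near zero' band; the pipeline itself
    treats whenever risk_reduction = -cate > 0.
    """
    if cate < -eps:        # treatment lowers Y1 risk
        return "treat"
    if cate > eps:         # treatment raises Y1 risk
        return "do not treat"
    return "neutral"       # no strong effect either way

print(recommend(-0.12))    # treat
print(recommend(0.05))     # do not treat
print(recommend(0.003))    # neutral
```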


Step 5a: Plot the uplift distribution

We can see that the distribution of good/bad treatment effects is approximately normal:

import matplotlib.pyplot as plt

print()
print('--------------------------------------------------------------------------------------------')
print('PLOT UPLIFT DISTRIBUTION:')
print()
print('please see docs/results/tx_recommendations.csv')

# Histogram of risk reduction
plt.figure(figsize=(10, 6))
plt.hist(cohort["risk_reduction"], bins=50)
plt.title("Distribution of Predicted Risk Reduction (–CATE)")
plt.xlabel("Risk Reduction (negative = harmful treatment; positive = beneficial)")
plt.ylabel("Number of Patients")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('docs/results/uplift_distribution.png')
plt.show()

Step 5b: Cluster patients by similar uplift

from sklearn.cluster import KMeans

# reshape dataframe
risk_vals = cohort["risk_reduction"].values.reshape(-1, 1)

# k-means: three uplift groups
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
cohort["uplift_cluster"] = kmeans.fit_predict(risk_vals)

# sort the clusters by mean risk reduction for consistent labels
cluster_means = cohort.groupby("uplift_cluster")["risk_reduction"].mean().sort_values()

# map the clusters -> ordered labels (0 = worst, 2 = best)
cluster_order_map = {
    old: new for new, old in enumerate(cluster_means.index)
}
cohort["uplift_cluster"] = cohort["uplift_cluster"].map(cluster_order_map)

pprint(cohort[["patient_id", "risk_reduction", "uplift_cluster"]].head(10))
uplift_clusters_summary = cohort.groupby("uplift_cluster")["risk_reduction"].agg(["count", "mean", "min", "max"])
print(uplift_clusters_summary)

Cluster Number Meaning Map:

| uplift_cluster | Interpretation | Meaning for Treatment |
|---|---|---|
| 0 | treatment harmful | negative risk_reduction → avoid |
| 1 | neutral or weak effect | little/no benefit |
| 2 | strong positive effect | major benefit → treat |

Developer Notes

Why use uplift / doubly-robust causal-inference estimation instead of a T-learner with XGBoost? (The average treatment effect (ATE/CATE) is also known as 'uplift.')

  • A T-learner with XGBoost is not a causal model.
  • It is a predictive model trained separately on treated and untreated outcomes.

Because of that, it suffers from three fundamental weaknesses and would be misapplied in our 'Tx Recommendations' use case:

1. T-Learning does not correct for treatment bias (confounding).

T-Learner trains:

  • $f_1(X)$ on treated patients
  • $f_0(X)$ on untreated patients

But those two groups almost always have different X distributions:

treated patients != untreated patients

so XGBoost ends up learning:

  • differences in who gets treated
  • NOT differences in treatment effect

This produces biased uplift.

ATE/CATE/Doubly-Robust handles this using:

  • propensity scores: $\hat{p}(X)$ (i.e., the estimated treatment probability as a function of X)
  • outcome models: $\hat{\mu}_1(X)$, $\hat{\mu}_0(X)$
  • inverse probability weighting (IPW, i.e., a causal method that reweights each observation by the inverse of its probability of receiving the treatment it actually received, making the treated and untreated groups comparable as if randomized)
  • the doubly-robust correction, written as ate_dr in my python script. The DR correction combines propensity weighting and outcome modeling so the treatment-effect estimate remains unbiased if either model is correctly specified.

This makes treatment comparisons as if randomized; T-learning cannot do that.
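IPW's balancing effect can be shown on synthetic data (a toy setup assumed here, numpy only): with confounded assignment the raw covariate means differ between groups, but the IPW-weighted means are nearly equal, as they would be under randomization.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Confounded assignment: sicker patients (higher x) are treated more often.
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))            # true propensity P(W=1 | x)
w = rng.binomial(1, p)

# Raw group means of x differ (confounding) ...
raw_gap = x[w == 1].mean() - x[w == 0].mean()

# ... but IPW-weighted means are comparable, as if randomized.
wt = w / p + (1 - w) / (1 - p)      # inverse-probability weights
m1 = np.average(x[w == 1], weights=wt[w == 1])
m0 = np.average(x[w == 0], weights=wt[w == 0])

print(f"Unweighted gap in x: {raw_gap:.3f}")   # clearly nonzero
print(f"IPW-weighted gap:    {m1 - m0:.3f}")   # near zero
```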

2. Doubly-Robust ATE remains correct even if one model is wrong.

The DR estimator is called "doubly robust" because $\widehat{ATE_{DR}}$ is unbiased if either $\hat{p}(X)$ or $\hat{\mu}(X)$ is correct.

T-learning has zero robustness:

  • if either XGBoost model is even slightly mis-specified
  • we get biased estimates

3. T-Learner overfits easily with imbalanced treatments

For example, if treatment W=1 is rare (common in medicine), then:

  • $f_1(X)$ treated model is weak
  • $f_0(X)$ untreated model is strong
  • the uplift is dominated by noise from the treated model

ATE/DR uses:

  • weights
  • correction terms
  • variance stabilization

So it handles imbalance safely.

Also: T-Learning estimates outcomes, not causal effects

This is the most important point of why not to use it in this use case.

T-Learner model:

$E[Y | X,W]$

But we want:

$E[Y(1) - Y(0) \mid X]$

T-learner predicts based on observed outcomes only, not counterfactuals.

ATE/DR explicitly reconstructs counterfactuals using:

  • propensity adjustment
  • residual correction
  • cross-world estimation (i.e., a hypothetical comparison of Tx and no Tx)

This is known as causal inference (CI), not standard prediction.

To summarize the above:

| Method | Purpose | Problem | Causal? |
|---|---|---|---|
| XGBoost T-Learner | Predicts Y for treated & untreated separately | Learns treatment bias; no confounding correction | no |
| ATE/Doubly-Robust | Estimates effect of treatment holding confounding constant | Requires assumptions but statistically robust | yes |

The ATE/doubly-robust method is better because it removes confounding and identifies true causal effects, whereas an XGBoost T-learner only builds two prediction models and cannot produce unbiased Tx effects.

Here is a full variable mapping for this project:

| Variable | Meaning | Model Role |
|---|---|---|
| patient_id | Unique patient identifier | 🟦 ID only (not used in models) |
| site_id | Hospital/site ID | 🟩 Propensity + Outcome |
| D1–D4 | Diagnoses | 🟩 Propensity + Outcome |
| X1–X80 | Patient covariates | 🟩 Propensity + Outcome |
| W | Treatment indicator | 🟥 Target (Propensity), Feature (Outcome) |
| Y1 | Main clinical outcome | 🟥 Target (Outcome model) |
| Y2–Y5 | Other outcomes | 🟧 Optional Outcome targets |
| p̂(X) | Propensity score | 🟪 Propensity model output |
| μ̂₁(X) | Predicted outcome if treated | 🟪 Outcome model output |
| μ̂₀(X) | Predicted outcome if untreated | 🟪 Outcome model output |
| CATE_i | Individual treatment effect | 🟨 Uplift/Causal Model |
| risk_reduction_i | −CATEᵢ | 🟨 Decision Rule |
| uplift_cluster | K-means uplift group | 🟫 Clustering Model |
| ATE_DR | Doubly-robust ATE | 🟫 Causal Estimator |

Legend: 🟦 ID 🟩 Features 🟥 Targets 🟪 Model Outputs 🟨 Causal/Uplift 🟫 Meta-analysis

Here is a data-pipeline / dataflow diagram showing where each variable is touched:

                  ┌───────────────────────────────────────────┐
                  │                  DATA                     │
                  │ patient_id, site_id, D1–D4, X1–X80, W, Y1 │
                  └───────────────────────────────────────────┘
                                      |
                                      v
        ┌─────────────────────────────────────────────────────────┐
        │                 PROPENSITY MODEL p̂(X)                   │
        │   Inputs: site_id, D1–D4, X1–X80                        │
        │   Output: p̂(X) = P(W=1 | X)                             │
        └─────────────────────────────────────────────────────────┘
                                      |
                                      v
      ┌─────────────────────────────────────────────────────────────┐
      │                OUTCOME MODELS μ̂₁(X), μ̂₀(X)                  │
      │   Inputs: W, Y1, site_id, D1–D4, X1–X80                     │
      │   Outputs: μ̂₁(X), μ̂₀(X)                                     │
      └─────────────────────────────────────────────────────────────┘
                                      |
                                      v
          ┌──────────────────────────────────────────────────┐
          │              DOUBLY-ROBUST ATE_DR                │
          │   Uses: Y1, W, p̂(X), μ̂₁(X), μ̂₀(X)                │
          └──────────────────────────────────────────────────┘
                                      |
                                      v
     ┌──────────────────────────────────────────────────────────────┐
     │                    CATE / UPLIFT MODEL                       │
     │   CATE = μ̂₁(X) − μ̂₀(X)                                       │
     │   risk_reduction = −CATE                                     │
     │   uplift_cluster (K-means on risk_reduction)                 │
     └──────────────────────────────────────────────────────────────┘
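The diagram above can be traced end-to-end on synthetic data. This is a condensed sketch, not the project script: 5 covariates stand in for X1–X80, the column names and effect size are assumed, and sklearn defaults are used throughout.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 5_000

# DATA: synthetic stand-in for data/data.csv (5 covariates instead of 80)
X = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"X{i}" for i in range(1, 6)])
w = rng.binomial(1, 1 / (1 + np.exp(-X["X1"])))               # confounded assignment
y = rng.binomial(1, np.clip(1 / (1 + np.exp(-X["X2"])) - 0.15 * w, 0, 1))

# PROPENSITY MODEL: p_hat(X) = P(W=1 | X)
ps = LogisticRegression(max_iter=500).fit(X, w)
p_hat = np.clip(ps.predict_proba(X)[:, 1], 0.01, 0.99)

# OUTCOME MODELS: mu1(X), mu0(X)
mu1 = LogisticRegression(max_iter=500).fit(X[w == 1], y[w == 1]).predict_proba(X)[:, 1]
mu0 = LogisticRegression(max_iter=500).fit(X[w == 0], y[w == 0]).predict_proba(X)[:, 1]

# DOUBLY-ROBUST ATE_DR
ate_dr = (w * (y - mu1) / p_hat
          - (1 - w) * (y - mu0) / (1 - p_hat)
          + (mu1 - mu0)).mean()

# CATE / UPLIFT: decision rule + K-means grouping
out = pd.DataFrame({"cate": mu1 - mu0})
out["risk_reduction"] = -out["cate"]
out["recommend_treatment"] = out["risk_reduction"] > 0
out["uplift_cluster"] = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(
    out[["risk_reduction"]]
)
print(f"ATE_DR = {ate_dr:.3f}")
```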

Please see the mini/demo whitepaper and causal DAG: PrecisionMedicine_miniWhitepaper.pdf
