skerk001/README.md

Samir Kerkar — Data Scientist

Healthcare data scientist working in causal inference, quasi-experimental program evaluation, and applied ML. Four years evaluating pharmacist-led clinical programs at a 60,000-patient managed care system, most recently a propensity score matching (PSM) + difference-in-differences (DiD) evaluation that documented an $83.50 per-member-per-month (PMPM) cost reduction (p = 0.0027) in a COPD cohort.

M.S. Data Science, UC San Diego (starting Fall 2026) · B.S. Mathematics, UC Irvine

Available: Open to contract or full-time roles · Remote-friendly

📧 Samir2000VIP@gmail.com · LinkedIn · Irvine, CA


Selected Work — Desert Oasis Healthcare

Data Scientist across 11 clinics. Methods chosen to defend causal claims a reviewer would actually question.

  • COPD program evaluation (n = 997, PSM + DiD): $83.50 PMPM cost reduction (p = 0.0027), driven by lower ED (p = 0.002), inpatient (p = 0.04), and readmission (p = 0.002) utilization.
  • Post-discharge pharmacist intervention (n = 878, negative binomial regression): 22% reduction in 30-day readmissions (IRR = 0.78, p = 0.02).
  • Heart failure outcomes manuscript — under peer review.
  • AFib anticoagulation care-gap analysis — poster, ASHP National Conference.
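The two headline numbers above rest on simple arithmetic once the models are fit. The sketch below is illustrative only: the cost figures are made up, the function names are hypothetical, and the actual evaluation ran DiD on a propensity-matched cohort rather than on raw group means. It shows the 2x2 difference-in-differences contrast and how an incidence rate ratio (IRR) of 0.78 reads as a 22% reduction.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 difference-in-differences on group means.

    Returns (change in treated group) minus (change in control group).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


def irr_to_percent_change(irr):
    """Convert an incidence rate ratio into a percent change in the event rate."""
    return (irr - 1.0) * 100.0


# Made-up PMPM costs, for illustration only (not study data).
treated_pre = [900.0, 1100.0, 1000.0]    # mean 1000
treated_post = [850.0, 1050.0, 950.0]    # mean 950
control_pre = [950.0, 1050.0, 1000.0]    # mean 1000
control_post = [980.0, 1080.0, 1030.0]   # mean 1030

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(f"DiD estimate: {effect:+.2f} PMPM")

pct = irr_to_percent_change(0.78)
print(f"IRR 0.78 -> {pct:.0f}% change in 30-day readmission rate")
```

The control group's change stands in for what would have happened to the treated group without the program, which is why the estimate is only as credible as the parallel-trends assumption behind it.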

Featured Projects

🧪 CausalCare — Causal inference on ICU mortality

Five-method stack (PSM, IPW, AIPW, Double ML, Causal Forest) built on DoWhy's identify–estimate–refute workflow. Placebo and random-common-cause refuters at each stage; method agreement used as a robustness check rather than a single point estimate. Why it matters: demonstrates the full causal pipeline most healthcare ML projects skip.
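As a rough illustration of using method agreement as a robustness check, the sketch below compares a naive difference in means, a normalized IPW estimate, and an AIPW estimate on synthetic data with a known effect. Everything here is an assumption made for the example: the data-generating process, the function names, and the use of the true propensity score (in practice it would be estimated). It does not use DoWhy and is not the CausalCare code.

```python
import math
import random


def simulate(n=20000, true_ate=2.0, seed=0):
    """Synthetic rows (x, propensity, treatment, outcome) with one confounder x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-x))          # true propensity P(T=1 | x)
        t = 1 if rng.random() < p else 0
        y = true_ate * t + x + rng.gauss(0.0, 1.0)
        rows.append((x, p, t, y))
    return rows


def naive_diff(rows):
    """Unadjusted difference in mean outcome (confounded by x)."""
    y1 = [y for _, _, t, y in rows if t == 1]
    y0 = [y for _, _, t, y in rows if t == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


def ipw_ate(rows):
    """Normalized (Hajek) inverse-probability-weighted ATE."""
    w1 = sum(t / p for _, p, t, _ in rows)
    w0 = sum((1 - t) / (1 - p) for _, p, t, _ in rows)
    m1 = sum(t * y / p for _, p, t, y in rows) / w1
    m0 = sum((1 - t) * y / (1 - p) for _, p, t, y in rows) / w0
    return m1 - m0


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def aipw_ate(rows):
    """Augmented IPW: per-arm outcome regressions plus weighted residuals."""
    a1, b1 = fit_line([x for x, _, t, _ in rows if t == 1],
                      [y for _, _, t, y in rows if t == 1])
    a0, b0 = fit_line([x for x, _, t, _ in rows if t == 0],
                      [y for _, _, t, y in rows if t == 0])
    total = 0.0
    for x, p, t, y in rows:
        mu1, mu0 = a1 + b1 * x, a0 + b0 * x
        total += mu1 - mu0 + t * (y - mu1) / p - (1 - t) * (y - mu0) / (1 - p)
    return total / len(rows)


rows = simulate()
estimates = {"naive": naive_diff(rows), "ipw": ipw_ate(rows), "aipw": aipw_ate(rows)}
adjusted_spread = abs(estimates["ipw"] - estimates["aipw"])
for name, value in estimates.items():
    print(f"{name:>5}: {value:.3f}")
print(f"spread between adjusted estimators: {adjusted_spread:.3f}")
```

When the adjusted estimators land close together and away from the naive contrast, that is some evidence the result is not an artifact of one modeling choice. It says nothing about unmeasured confounding, which is where refutation tests come in.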

🧬 GenomicsGPT — Variant interpretation at scale

XGBoost / LightGBM ensemble over 1.69M ClinVar variants. Leakage-corrected AUC 0.985, macro-F1 0.948. Feature ablation defends against gene-name memorization: consequence + LoF alone reach AUC 0.97, while gene-only collapses to 0.78. SHAP per-variant audit; Llama 3 / Claude narrative engine generates ACMG/AMP-style reports. Why it matters: pre-empts the leakage critique a reviewer would lead with.
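The ablation idea can be sketched in a few lines: score the same variants using different feature subsets and compare AUC. The example below is a toy under stated assumptions. The six rows, the column names, and the stand-in scorer (a plain mean of the selected columns) are all hypothetical and replace the real XGBoost / LightGBM ensemble and ClinVar data.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. O(n_pos * n_neg), which is fine for a toy example.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ablate(rows, labels, feature_sets):
    """Return AUC per feature subset using a stand-in scorer (mean of columns)."""
    results = {}
    for name, cols in feature_sets.items():
        scores = [sum(row[c] for c in cols) / len(cols) for row in rows]
        results[name] = auc(labels, scores)
    return results


# Hypothetical toy variants: 1 = pathogenic, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
rows = [
    {"consequence": 0.9, "lof": 1.0, "gene_prior": 0.7},
    {"consequence": 0.8, "lof": 1.0, "gene_prior": 0.2},
    {"consequence": 0.6, "lof": 0.0, "gene_prior": 0.6},
    {"consequence": 0.3, "lof": 0.0, "gene_prior": 0.6},
    {"consequence": 0.2, "lof": 0.0, "gene_prior": 0.1},
    {"consequence": 0.4, "lof": 0.0, "gene_prior": 0.5},
]
feature_sets = {
    "consequence+lof": ["consequence", "lof"],
    "gene_only": ["gene_prior"],
}

results = ablate(rows, labels, feature_sets)
for name, value in results.items():
    print(f"{name}: AUC = {value:.3f}")
```

If gene identity alone scored nearly as well as the functional features, that would suggest the model is memorizing which genes are well studied rather than learning variant-level signal.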


Other Projects

  • ClinicalRAG — RAG over 220 clinical documents with retrieval and refusal as first-class metrics: 97.6% condition recall, 85.7% citation rate, 95.2% abstention accuracy.
  • Diabetic Retinopathy — Custom CNN for 5-class DR grading. Weighted F1 = 0.94, outperforming ResNet-50 and VGG-16 on the same split. Grad-CAM confirms attention to clinically meaningful pathology. Paper.
  • REIGN — Cross-era NBA impact models over 29,969 player-seasons with era-specific z-score normalization.
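Era-specific z-score normalization, as used in REIGN, can be sketched as follows. This is a minimal illustration, not the project's code: the record layout, the field names, and the choice of population standard deviation are assumptions, and the numbers are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def era_zscores(records, era_key="era", value_key="value"):
    """Standardize each value against the mean and std of its own era."""
    by_era = defaultdict(list)
    for rec in records:
        by_era[rec[era_key]].append(rec[value_key])

    # Population std is an assumption; a sample std would work the same way.
    stats = {era: (mean(vals), pstdev(vals)) for era, vals in by_era.items()}

    out = []
    for rec in records:
        mu, sd = stats[rec[era_key]]
        z = 0.0 if sd == 0 else (rec[value_key] - mu) / sd
        out.append({**rec, "z": z})
    return out


# Invented per-season stat lines for two eras with different scoring levels.
records = [
    {"player": "A", "era": "1960s", "value": 20.0},
    {"player": "B", "era": "1960s", "value": 30.0},
    {"player": "C", "era": "1960s", "value": 40.0},
    {"player": "D", "era": "2010s", "value": 10.0},
    {"player": "E", "era": "2010s", "value": 20.0},
    {"player": "F", "era": "2010s", "value": 30.0},
]

scored = era_zscores(records)
for rec in scored:
    print(f'{rec["player"]} ({rec["era"]}): raw={rec["value"]:.0f} z={rec["z"]:+.2f}')
```

Players C and F end up with the same z-score despite different raw values, because each is the same distance above their own era's average in that era's units of spread.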

Stack

  • Causal & Statistics: PSM, DiD, IPW, AIPW, Double ML, Causal Forest, negative binomial & other GLMs, survival analysis
  • ML: XGBoost, LightGBM, scikit-learn, TensorFlow / Keras, SHAP
  • LLM / NLP: RAG, LangChain, ChromaDB, HuggingFace Transformers
  • Healthcare data: EHR, claims, pharmacy, ICD-10, HCC, PMPM
  • Languages & delivery: Python, SQL, R, Power BI, FastAPI, Git


Outside work: 2500+ rated chess · basketball · piano

Pinned

  1. diabetic-retinopathy-classification (Public)

    CNN-based 5-class diabetic retinopathy severity classification from retinal fundus images (F1 = 0.94)

  2. gene-cancer-prediction (Public)

    ML classification of AML vs. ALL leukemia subtypes from gene expression data (F1 = 0.95)

    Jupyter Notebook

  3. clinical-rag (Public)

    RAG system for clinical question answering over 220 discharge summaries with hallucination guardrails, citation tracking, and chunking strategy evaluation (97.6% condition recall)

    Python

  4. genomicsgpt (Public)

    ML + LLM pipeline for genetic variant pathogenicity prediction (AUC 0.9949, 1.69M ClinVar variants) with SHAP explainability and clinical report generation via Llama 3 / Claude

    Jupyter Notebook

  5. CausalCare (Public)

    Causal inference analysis of ICU beta-blocker treatment effects using propensity matching, IPW, doubly robust estimation, Double ML, and Causal Forest on eICU data

    Python

  6. reign-web (Public)

    NBA player impact analytics across 80 years. Era-specific composite models, playoff opponent adjustments, and interactive visualizations for 3,484 players (1946–2025).

    JavaScript