This project is part of the INFO 531: Data Warehousing and Analytics in the Cloud course (Fall 2025). It focuses on healthcare analytics, specifically aiming to predict disease risks based on lifestyle and health metrics. By analyzing factors such as physical activity, diet, and physiological markers, the project seeks to identify key drivers of disease and healthy lifestyles.
The analysis utilizes a dataset of 100,000 records containing 16 health and lifestyle-related features.
- Source: Kaggle
- Target Variable: `disease_risk` (binary)
  - 0: Low Risk
  - 1: High Risk
The dataset includes 14 predictor variables across several categories:
- Demographic: Age, Gender (Encoded)
- Anthropometric: BMI
- Physical Activity: Daily Steps
- Lifestyle/Habit: Sleep Hours, Water Intake, Smoking Status, Alcohol Use
- Diet: Calories Consumed
- Physiological: Resting Heart Rate, Systolic & Diastolic Blood Pressure
- Biomarker: Cholesterol Level
- Medical History: Family History
To ensure high-quality input for machine learning models, the following preparation steps are implemented:
- **Data Cleaning:**
  - Scan for and remove duplicate entries.
  - Verify missing values (though none are expected).
  - Feature Exclusion: Drop the `ID` column as it holds no predictive value.
- **Transformation & Encoding:**
  - Categorical Encoding: Convert `gender` to numerical format using One-Hot Encoding (`gender_Male`, `gender_Female`).
  - Scaling: Apply `StandardScaler` to continuous numerical features to normalize variance (critical for distance-based algorithms).
- **Handling Class Imbalance:**
  - The dataset has a 25% High Risk / 75% Low Risk split.
  - Strategies include adjusting class weights (`class_weight='balanced'`) and applying SMOTE (Synthetic Minority Over-sampling Technique) to the training set if necessary.
- **Split Ratio:** 80% Training / 20% Testing.
- **Validation:** k-fold cross-validation ($k=5$) is applied to the training set for hyperparameter tuning.
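The preparation steps above can be sketched end to end. This is a minimal illustration on a small synthetic frame (the real Kaggle file and its exact column names are assumptions here); note that the scaler is fit only on the training split to avoid leakage, and SMOTE, if used, would be applied to the training set only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the real dataset (schema is assumed).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "ID": range(n),
    "age": rng.integers(18, 80, n),
    "gender": rng.choice(["Male", "Female"], n),
    "bmi": rng.normal(26, 4, n),
    "disease_risk": rng.choice([0, 1], n, p=[0.75, 0.25]),
})

# Data cleaning: drop duplicates, confirm no missing values, drop ID.
df = df.drop_duplicates()
assert df.isna().sum().sum() == 0
df = df.drop(columns=["ID"])

# One-hot encode gender into gender_Male / gender_Female.
df = pd.get_dummies(df, columns=["gender"])

# Stratified 80/20 split preserves the 25/75 class ratio.
X = df.drop(columns=["disease_risk"])
y = df["disease_risk"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scale continuous features; fit on the training set only to avoid leakage.
num_cols = ["age", "bmi"]
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
```

On the real data, the same pattern extends to all continuous predictors (steps, sleep hours, blood pressure, etc.); SMOTE from `imbalanced-learn` would be inserted after the split, resampling `X_train`/`y_train` only.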
The project employs a two-pronged modeling approach:
- Baseline Model: Logistic Regression (LR)
- Chosen for its interpretability and ability to provide clear coefficients for risk factors.
- Primary Model: Random Forest Classifier (RFC)
- Selected for its robustness to non-linear interactions, resistance to overfitting, and ability to provide feature importance scores.
- Additional Models: Support Vector Machines (SVMs) and XGBoost may also be explored to benchmark accuracy.
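The two-pronged approach can be sketched as follows, using a synthetic imbalanced dataset in place of the real one (sample size and hyperparameters here are illustrative assumptions, not the project's tuned values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in mirroring the 75/25 class imbalance and 14 predictors.
X, y = make_classification(
    n_samples=1000, n_features=14, weights=[0.75, 0.25], random_state=42
)

# Baseline: interpretable coefficients; class_weight counters the imbalance.
lr = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Primary: handles non-linear interactions and reports feature importances.
rfc = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=42
).fit(X, y)

coefs = lr.coef_[0]                      # one coefficient per risk factor
importances = rfc.feature_importances_   # normalized importance scores
```

The fitted `lr.coef_` gives the direction and magnitude of each risk factor's association, while `rfc.feature_importances_` ranks the same predictors by contribution to the forest's splits.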
Given the class imbalance, the project prioritizes metrics beyond standard accuracy:
- F1-Score: To balance precision and recall.
- ROC-AUC: To measure discrimination ability across thresholds.
- Recall (Sensitivity): To minimize false negatives (missing high-risk patients), which is critical in a clinical setting.
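These metrics can all be computed with scikit-learn. A minimal sketch on synthetic imbalanced data (the classifier and split here are illustrative, not the project's tuned pipeline); note that ROC-AUC is computed from predicted probabilities, not hard labels:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data mirroring the 25% positive rate.
X, y = make_classification(
    n_samples=1000, n_features=14, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # probabilities for the positive class

f1 = f1_score(y_te, y_pred)           # balances precision and recall
recall = recall_score(y_te, y_pred)   # sensitivity: fraction of high-risk caught
auc = roc_auc_score(y_te, y_prob)     # threshold-free discrimination
```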
| Model | Accuracy | Recall | F1 Score | ROC AUC |
|---|---|---|---|---|
| Logistic Regression | 0.49980 | 0.503828 | 0.333333 | 0.503064 |
| Random Forest | 0.71055 | 0.065673 | 0.101227 | 0.495568 |
| XGBoost | 0.74365 | 0.017123 | 0.032094 | 0.494139 |
| SVC | 0.50435 | 0.479654 | 0.324497 | 0.490195 |
- Language: Python
- Libraries:
  - `scikit-learn` (models, scaling, and metrics)
  - `pandas` (data manipulation)
  - `numpy` (numerical operations)


