🔭 Exoplanet Transit Detection Using NASA Kepler Data

A Multi-Method AI Approach with Classification, Clustering, and Probabilistic Reasoning

Eesha Fatima (31647) · Fatima Kaleem (31620) · Uroos Fatima (31094)
Institute of Business Administration, Karachi
Spring 2026 · Introduction to Artificial Intelligence · Dr. Syed Ali Raza


📌 Project Overview

This project builds a multi-method AI pipeline to classify transit signals in the NASA Kepler Cumulative KOI (Kepler Objects of Interest) table. The goal is to identify which unconfirmed candidate signals are most likely to be genuine exoplanet transits, using:

  • Decision Trees
  • Naive Bayes
  • K-Means Clustering
  • Bayesian Probabilistic Reasoning

These four methods are implemented from scratch using NumPy. A TensorFlow/Keras CNN (Phase 8) serves as a baseline for comparison.


📁 Project Structure

Exoplanet-Transit-Detection-Using-NASA-Kepler-Data/
│
├── cumulative_2026.04.12_06.34.10.csv   ← Raw NASA dataset
├── koi_clean.csv                         ← Cleaned dataset (Phase 2 output)
│
├── cleaningdata.py                       ← Phase 2: data cleaning
├── preprocessing.py                      ← Phase 3: preprocessing + SMOTE
├── decision_tree.py                      ← Phase 4: Decision Tree 
├── naive_bayes.py                        ← Phase 5: Gaussian Naive Bayes 
├── kmeans.py                             ← Phase 6: K-Means Clustering 
├── bayesian_reasoning.py                 ← Phase 7: Bayesian probabilistic reasoning
├── cnn_baseline.py                       ← Phase 8: CNN baseline 
├── candidate_ranking.py                  ← Phase 9: Final candidate ranking
├── gui.py                                ← Phase 10: Interactive GUI
│
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── X_train.npy                           ← Balanced training features (post-SMOTE)
├── y_train.npy                           ← Balanced training labels
├── X_val.npy                             ← Validation features
├── y_val.npy                             ← Validation labels
├── X_test.npy                            ← Test features
├── y_test.npy                            ← Test labels
├── X_candidates.npy                      ← 1,979 candidate features (inference)
│
├── bayesian_candidate_scores.npy         ← Phase 7 output
├── bayesian_candidate_labels.npy         ← Phase 7 output
├── cnn_candidate_scores.npy              ← Phase 8 output
├── cnn_candidate_labels.npy              ← Phase 8 output
├── final_candidate_scores.npy            ← Phase 9 output
├── final_candidate_labels.npy            ← Phase 9 output
├── final_candidate_tiers.npy             ← Phase 9 output
├── final_candidate_ranking.npy           ← Phase 9 output
├── final_agreement_counts.npy            ← Phase 9 output
├── final_candidate_summary.txt           ← Phase 9 output
│
└── README.md

📊 Performance Summary

| Metric | Score |
|---|---|
| Validation Accuracy | 95.61% |
| Test Accuracy | 94.55% |
| AUC-ROC | 0.9437 |
| Precision | 91% |
| Recall | 94% |

Key Finding: The Decision Tree (Phase 4) correctly identifies 94% of confirmed planets (recall) while maintaining 91% precision.



🚀 Progress Log

✅ Phase 1 — Dataset Acquisition & Understanding

Dataset: NASA Kepler Cumulative KOI Table (cumulative_2026.04.12_06.34.10.csv)
Source: NASA Exoplanet Archive

The raw dataset contains 9,564 KOI entries, each representing a stellar signal flagged by the Kepler pipeline. Each entry carries over 40 engineered photometric and orbital features including transit depth, orbital period, transit duration, stellar radius, SNR, and centroid offset metrics.

Class Distribution:

| Label | Count | Role |
|---|---|---|
| FALSE POSITIVE | 4,839 | Training data (negative class) |
| CONFIRMED | 2,746 | Training data (positive class) |
| CANDIDATE | 1,979 | Inference targets (unknowns) |

Key Insight: The 1,979 CANDIDATE entries have never been confirmed or ruled out — primarily because the Kepler mission ended in 2018, ground-based telescope time is limited, and many candidates have weak signals or orbit faint stars. These are the primary scientific target of this pipeline.


✅ Phase 2 — Data Cleaning (cleaningdata.py)

  1. Loaded raw CSV using pd.read_csv(..., comment='#') to skip header comment rows
  2. Inspected missing values — identified columns with >50% null rates
  3. Dropped high-missingness columns (>50% null threshold)
  4. Dropped non-feature columns (identifiers, provenance fields, administrative metadata):
     rowid, kepid, kepoi_name, koi_vet_stat, koi_vet_date,
     koi_pdisposition, koi_disp_prov, koi_comment, koi_fittype,
     koi_limbdark_mod, koi_parm_prov, koi_tce_delivname,
     koi_quarters, koi_trans_mod, koi_datalink_dvr,
     koi_datalink_dvs, koi_sparprov, koi_eccen
  5. Saved cleaned dataset as koi_clean.csv

Output shape: 9,564 rows × 103 columns (102 features + 1 target)
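
The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the contents of cleaningdata.py: the function name `clean_koi`, the tiny inline CSV, and the column `koi_sparse` are hypothetical, and only three of the dropped identifier columns are listed.

```python
import io

import pandas as pd

# Illustrative subset of the non-feature columns listed above (not exhaustive).
NON_FEATURE_COLUMNS = ["rowid", "kepid", "kepoi_name"]


def clean_koi(df, null_threshold=0.5, drop_columns=NON_FEATURE_COLUMNS):
    """Drop columns whose null rate exceeds the threshold, then non-feature columns."""
    null_rate = df.isna().mean()  # fraction of missing values per column
    sparse = null_rate[null_rate > null_threshold].index
    df = df.drop(columns=sparse)
    present = [c for c in drop_columns if c in df.columns]
    return df.drop(columns=present)


# Hypothetical miniature of the raw file; the '#' line mimics the header comments.
raw = io.StringIO(
    "# comment line skipped by comment='#'\n"
    "rowid,kepid,koi_period,koi_depth,koi_sparse\n"
    "1,100,9.5,615.8,\n"
    "2,101,54.4,874.8,\n"
    "3,102,19.9,,1.0\n"
)
df = pd.read_csv(raw, comment="#")
clean = clean_koi(df)
print(list(clean.columns))  # ['koi_period', 'koi_depth']
```

Here `koi_sparse` is missing in two of three rows and is dropped by the threshold, while `koi_depth` (one missing value) survives for later imputation.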

False positive flags retained as features:

| Column | Meaning |
|---|---|
| koi_fpflag_nt | Not transit-like shape |
| koi_fpflag_ss | Secondary eclipse (eclipsing binary) |
| koi_fpflag_co | Centroid offset (background contamination) |
| koi_fpflag_ec | Ephemeris match to known false positive |

✅ Phase 3 — Preprocessing (preprocessing.py)

Train / Candidate Split

CANDIDATE rows were separated before any fitting to prevent data leakage.

Training pool  →  7,585 rows  (CONFIRMED + FALSE POSITIVE)
Candidates     →  1,979 rows  (held out entirely)

Steps Performed

| Step | Method |
|---|---|
| Missing values | Median imputation (fit on train only) |
| Label encoding | CONFIRMED → 0, FALSE POSITIVE → 1 |
| Feature scaling | Z-score standardization (fit on train only) |
| Train/Val/Test split | Stratified 70% / 15% / 15% |
| Class rebalancing | SMOTE oversampling |
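
The "fit on train only" rule for imputation and scaling can be illustrated with a short NumPy sketch. The function names `fit_preprocessor` and `transform` are hypothetical and the arrays are toy data; the point is only that medians, means, and standard deviations are computed from the training rows and then reused unchanged for validation, test, and candidate rows.

```python
import numpy as np


def fit_preprocessor(X_train):
    """Learn per-feature medians, means, and standard deviations from training data."""
    medians = np.nanmedian(X_train, axis=0)
    filled = np.where(np.isnan(X_train), medians, X_train)
    mean = filled.mean(axis=0)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return medians, mean, std


def transform(X, medians, mean, std):
    """Apply the training statistics to any split without refitting."""
    filled = np.where(np.isnan(X), medians, X)
    return (filled - mean) / std


X_train = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_cand = np.array([[np.nan, 20.0]])  # a candidate row with a missing value

stats = fit_preprocessor(X_train)
Z_train = transform(X_train, *stats)
Z_cand = transform(X_cand, *stats)
```

Because the candidate row is filled with the training median and sits at the training mean, it maps to zero in both features.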

After SMOTE:

CONFIRMED = 3,387 | FALSE POSITIVE = 3,387
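
For readers unfamiliar with SMOTE, the sketch below shows the core idea in NumPy: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. This is an illustration of the general technique, not necessarily how preprocessing.py does it; the function name `smote`, its parameters, and the random demo data are assumptions.

```python
import numpy as np


def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Pairwise squared distances between minority samples.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)           # seed samples
    pick = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                             # position along segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])


X_min = np.random.default_rng(1).normal(size=(20, 3))  # toy minority class
synthetic = smote(X_min, n_new=15)
```

Since every synthetic point is a convex combination of two real minority samples, it always lies inside the bounding box of the original minority data.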

Saved splits:

X_train.npy  y_train.npy
X_val.npy    y_val.npy
X_test.npy   y_test.npy
X_candidates.npy

🛠️ Model Logic — Decision Tree (Phase 4)

The classifier is built from scratch using NumPy.

Core Functions

entropy(y)

Measures the disorder/impurity of a dataset.

$$H(y) = -\sum_i p_i \log_2(p_i)$$

Here $p_i$ is the fraction of samples in $y$ that belong to class $i$. Entropy = 0 when all labels are the same; for two classes, entropy = 1 at a 50/50 split.
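
A minimal NumPy sketch of this formula is shown below. It assumes a non-empty label array and may differ in detail from the entropy function in decision_tree.py.

```python
import numpy as np


def entropy(y):
    """Shannon entropy (in bits) of a non-empty label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()  # class proportions, all strictly positive
    return float(-(p * np.log2(p)).sum()) + 0.0  # "+ 0.0" turns -0.0 into 0.0


pure = entropy(np.array([0, 0, 0, 0]))    # 0.0
mixed = entropy(np.array([0, 1, 0, 1]))   # 1.0
```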

information_gain(...)

Measures the reduction in entropy after splitting on a feature threshold.

$$\text{Gain} = H(\text{parent}) - \text{Weighted Entropy of Children}$$

best_split(X, y)

Iterates through every feature and every unique threshold value to find the split that maximizes Information Gain.
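
The gain formula and the exhaustive search can be sketched together as follows. This is an illustrative version, not the repository's code: the `<=` comparison for the left branch and the exact function signatures are assumptions.

```python
import numpy as np


def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def information_gain(y, left_mask):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n, n_left = len(y), int(left_mask.sum())
    if n_left == 0 or n_left == n:  # split leaves one side empty
        return 0.0
    w_left = n_left / n
    children = w_left * entropy(y[left_mask]) + (1 - w_left) * entropy(y[~left_mask])
    return entropy(y) - children


def best_split(X, y):
    """Try every feature and every unique value as a threshold; keep the best gain."""
    best_feature, best_threshold, best_gain = None, None, 0.0
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            gain = information_gain(y, X[:, j] <= t)
            if gain > best_gain:
                best_feature, best_threshold, best_gain = j, float(t), gain
    return best_feature, best_threshold, best_gain


X = np.array([[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])
feature, threshold, gain = best_split(X, y)  # feature 0, threshold 2.0, gain 1.0
```

The second feature is constant, so it can never produce a non-empty split on both sides and contributes zero gain. The search cost grows with the number of features times the number of unique values per feature, which is why a depth limit matters on a dataset of this size.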

build_tree(...)

Recursively grows the tree. Stops when:

  • A node is pure (only one class remains), or
  • max_depth = 10 is reached (to prevent overfitting)

predict(...)

Traverses the finished tree for new data until it reaches a leaf node (0 = Confirmed, 1 = False Positive).

Node Structure

Each Node object contains:

  • Feature / Threshold — the question the node asks (e.g., Is transit depth < 0.05?)
  • Left / Right — pointers to child branches
  • Value — only present in leaf nodes; the final classification
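
Putting the pieces together, a compact sketch of the node structure, recursive growth, and traversal might look like this. The dataclass layout, the majority-vote leaf at the depth limit, and the helper name `predict_one` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    feature: Optional[int] = None      # column index tested at this node
    threshold: Optional[float] = None  # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[int] = None        # class label; set only on leaves


def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def best_split(X, y):
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:  # the largest value would empty the right side
            mask = X[:, j] <= t
            w = mask.mean()
            gain = entropy(y) - w * entropy(y[mask]) - (1 - w) * entropy(y[~mask])
            if gain > best[2]:
                best = (j, float(t), gain)
    return best


def build_tree(X, y, depth=0, max_depth=10):
    values, counts = np.unique(y, return_counts=True)
    majority = int(values[np.argmax(counts)])
    if len(values) == 1 or depth >= max_depth:  # pure node or depth limit
        return Node(value=majority)
    feature, threshold, _ = best_split(X, y)
    if feature is None:  # no split improves purity
        return Node(value=majority)
    mask = X[:, feature] <= threshold
    return Node(feature, threshold,
                build_tree(X[mask], y[mask], depth + 1, max_depth),
                build_tree(X[~mask], y[~mask], depth + 1, max_depth))


def predict_one(node, x):
    while node.value is None:  # descend until a leaf is reached
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


X = np.array([[0.1], [0.2], [0.8], [0.9]])
y = np.array([0, 0, 1, 1])  # 0 = Confirmed, 1 = False Positive
tree = build_tree(X, y)
preds = [predict_one(tree, x) for x in np.array([[0.15], [0.85]])]  # [0, 1]
```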

⚠️ Feature Selection — Removing "Cheat Codes"

Initially, the model achieved 99.03% accuracy — but analysis of decision paths revealed it was primarily using NASA-derived flags:

koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_co, koi_fpflag_ec, koi_score

These flags are assigned after scientists already know the classification. A model using them isn't predicting planets — it's memorising NASA's notes.

Solution: These columns were removed, forcing the model to learn from raw physical observations only:

| Feature | Description |
|---|---|
| Transit Depth | How much light the planet blocks |
| Orbital Period | How long it takes to orbit the star |
| Stellar Radius | The size of the parent star |

This dropped accuracy to 94.55%, but produced a far more robust and scientifically honest model capable of generalising to new stars where these flags don't yet exist.
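
Removing the leaky columns from a NumPy feature matrix takes only a name-based column filter. The sketch below is hypothetical: the helper `drop_leaky` and the demo feature names (`koi_period`, `koi_depth`, `koi_srad`) are assumptions about how the columns might be labelled, and the matrix is toy data.

```python
import numpy as np

# The five NASA-derived columns identified above.
LEAKY = {"koi_fpflag_nt", "koi_fpflag_ss", "koi_fpflag_co", "koi_fpflag_ec", "koi_score"}


def drop_leaky(X, feature_names, leaky=LEAKY):
    """Return X and the name list with the leaky columns removed."""
    keep = [i for i, name in enumerate(feature_names) if name not in leaky]
    return X[:, keep], [feature_names[i] for i in keep]


names = ["koi_period", "koi_fpflag_nt", "koi_depth", "koi_score", "koi_srad", "koi_fpflag_co"]
X = np.arange(12, dtype=float).reshape(2, 6)  # toy matrix: 2 rows, 6 features
X_kept, names_kept = drop_leaky(X, names)
```

The same column filter has to be applied to the validation, test, and candidate matrices so that all splits share one feature layout.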


📈 Results

Test Set Performance (1,138 unseen samples)

| Metric | Value |
|---|---|
| Accuracy | 94.55% |
| Precision | 0.91 |
| Recall | 0.94 |
| Total errors | 62 / 1,138 |

Confusion Matrix

| | Predicted Confirmed | Predicted False Positive |
|---|---|---|
| Actual Confirmed | 386 ✅ | 26 ❌ |
| Actual False Positive | 36 ❌ | 690 ✅ |

Recall is the priority metric — a missed genuine planet (false negative) is more scientifically costly than a false alarm.

Candidate Predictions (1,979 unresolved signals)

| Prediction | Count |
|---|---|
| 🟢 CONFIRMED (likely planet) | 743 |
| 🔴 FALSE POSITIVE | 1,236 |

This provides a prioritised list of 743 high-probability candidates for astronomers to focus follow-up observations on.


✅ Phase 5 — Naive Bayes (naive_bayes.py)

Gaussian Naive Bayes built from scratch using NumPy. Assumes each feature follows a Gaussian (normal) distribution per class. Uses log-probabilities throughout for numerical stability. An epsilon (1e-9) is added to variance and PDF values to prevent division by zero.
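
A compact sketch of a Gaussian Naive Bayes classifier in log space is given below. It is an illustration rather than the code in naive_bayes.py: the class name and method layout are assumptions, and the epsilon is applied to the variance only, because working directly with the log-density removes the need to guard the PDF value itself.

```python
import numpy as np

EPS = 1e-9  # variance floor, matching the epsilon mentioned above


class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mean = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) for c in self.classes]) + EPS
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        return self

    def predict(self, X):
        # Per-feature Gaussian log-density, shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.mean[None, :, :]
        log_pdf = -0.5 * (np.log(2 * np.pi * self.var)[None, :, :]
                          + diff ** 2 / self.var[None, :, :])
        # Naive independence assumption: sum log-densities over features.
        scores = self.log_prior[None, :] + log_pdf.sum(axis=2)
        return self.classes[np.argmax(scores, axis=1)]


X = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y = np.array([0, 0, 0, 1, 1, 1])
model = GaussianNaiveBayes().fit(X, y)
preds = model.predict(np.array([[0.1], [5.1]]))  # [0, 1]
```

Summing log-densities instead of multiplying densities keeps the score finite even with around a hundred features, where a raw product would underflow to zero.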

| Metric | Value |
|---|---|
| Test Accuracy | 85.50% |
| Precision | 71.40% |
| Recall | 100% |
| F1-Score | 83.32% |

Confusion Matrix (test set):

| | Pred Planet | Pred FP |
|---|---|---|
| Actual Planet | 412 ✅ | 0 ❌ |
| Actual FP | 165 ❌ | 561 ✅ |

Naive Bayes achieves perfect recall on the test set (none of the 412 real planets is missed) at the cost of lower precision (165 false alarms). This makes it a valuable evidence source for the Bayesian combiner in Phase 7.


✅ Phase 6 — K-Means Clustering (kmeans.py)

K-Means clustering built from scratch using NumPy. Uses the Elbow Method (K = 1 to 6) to identify optimal K. Operates fully unsupervised — groups stars by physical similarity without using NASA labels. The planet-rich cluster is identified post-hoc by checking cluster composition against training labels.

Result: The planet-rich cluster contained ~54% confirmed planets, suggesting that physical features carry some discriminative signal even without supervision. K-Means is a weak evidence source (FPR of 0.851 on the validation set), so its contribution to the final ensemble is downweighted in Phase 9.
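
The clustering loop and the inertia value used by the Elbow Method can be sketched as follows. This is a generic Lloyd's-algorithm illustration, not the code in kmeans.py: the random initialisation from data points, the function names, and the toy one-dimensional data are all assumptions.

```python
import numpy as np


def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    for _ in range(n_iter):
        labels = assign(X, centroids)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):  # converged
            break
        centroids = updated
    labels = assign(X, centroids)
    inertia = float(((X - centroids[labels]) ** 2).sum())  # within-cluster sum of squares
    return labels, centroids, inertia


X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
inertias = {k: kmeans(X, k)[2] for k in (1, 2, 3)}  # values plotted in an elbow curve
labels, _, _ = kmeans(X, 2)
```

On this toy data the inertia collapses between K = 1 and K = 2 and barely changes afterwards, which is the "elbow" the method looks for.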


✅ Phase 7 — Bayesian Probabilistic Reasoning (bayesian_reasoning.py)

A Bayesian combiner that sequentially updates a probability estimate using evidence from all three from-scratch classifiers (Naive Bayes, K-Means, Decision Tree).

How it works:

  1. Start with the prior: P(planet) = 0.50 (from SMOTE-balanced training set)
  2. For each classifier, apply Bayes' theorem:
    P(planet | evidence) = P(evidence | planet) × P(planet) / P(evidence)
  3. The output of each update becomes the prior for the next classifier
  4. Final score = P(planet | all three classifiers)

Likelihoods (recall and FPR) are measured on the validation set to produce honest, non-overfit estimates.

| Classifier | Recall | FPR | Notes |
|---|---|---|---|
| Naive Bayes | 0.993 | 0.220 | Strong recall, moderate FPR |
| K-Means | 0.990 | 0.851 | Weak separator, an unsupervised limitation |
| Decision Tree | 0.985 | 0.004 | Dominant evidence source |

| Metric | Value |
|---|---|
| Test Accuracy | 99.12% |
| Precision | 99.26% |
| Recall | 98.30% |
| F1-Score | 98.78% |

Confusion Matrix (test set, 1,138 samples):

| | Pred Planet | Pred FP |
|---|---|---|
| Actual Planet | 405 ✅ | 7 ❌ |
| Actual FP | 3 ❌ | 723 ✅ |

Candidate Predictions: 1,192 likely planets, 787 false positives
Of the 1,192 planet predictions, 1,105 had all three classifiers in unanimous agreement (score = 0.9992).

All 8 vote combinations verified (sanity check passed):

| NB | KM | DT | Candidates | Score | Label |
|---|---|---|---|---|---|
| planet | planet | planet | 1,105 | 0.9992 | CONFIRMED |
| planet | FP | planet | 54 | 0.9859 | CONFIRMED |
| FP | planet | planet | 33 | 0.7215 | CONFIRMED |
| planet | planet | FP | 284 | 0.0712 | FALSE POSITIVE |
| FP | FP | planet | 42 | 0.1269 | FALSE POSITIVE |
| planet | FP | FP | 38 | 0.0043 | FALSE POSITIVE |
| FP | planet | FP | 64 | 0.0002 | FALSE POSITIVE |
| FP | FP | FP | 359 | 0.0000 | FALSE POSITIVE |
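
The sequential update described in this phase can be sketched in a few lines. The code below is an illustration, not the contents of bayesian_reasoning.py: it assumes the classifiers are conditionally independent given the true class, treats a "planet" vote as evidence with likelihood equal to recall (true planet) or FPR (true false positive), and uses the rounded validation likelihoods quoted in this phase. Because those inputs are rounded, some vote combinations may differ from the reported scores in the third decimal place.

```python
# (recall, FPR) per classifier, as quoted in the validation likelihood table.
LIKELIHOODS = {
    "NB": (0.993, 0.220),
    "KM": (0.990, 0.851),
    "DT": (0.985, 0.004),
}


def bayes_update(prior, says_planet, recall, fpr):
    """One application of Bayes' theorem for a single classifier's vote."""
    if says_planet:
        like_planet, like_fp = recall, fpr          # P(vote | planet), P(vote | FP)
    else:
        like_planet, like_fp = 1 - recall, 1 - fpr
    evidence = like_planet * prior + like_fp * (1 - prior)  # P(vote)
    return like_planet * prior / evidence


def combine(votes, prior=0.5):
    """Feed each posterior back in as the prior for the next classifier."""
    p = prior
    for name, says_planet in votes.items():
        p = bayes_update(p, says_planet, *LIKELIHOODS[name])
    return p


unanimous = combine({"NB": True, "KM": True, "DT": True})
none = combine({"NB": False, "KM": False, "DT": False})
print(round(unanimous, 4))  # 0.9992
```

The Decision Tree dominates because its likelihood ratio for a planet vote (0.985 / 0.004) is far larger than that of Naive Bayes or K-Means, which is why a single Decision Tree "FP" vote is enough to pull the score below 0.5.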

✅ Phase 8 — CNN Deep Learning Baseline (cnn_baseline.py)

A 1D Convolutional Neural Network built with TensorFlow/Keras as an industry-standard baseline. Treats the 102 tabular features as a 1D sequence and applies two convolutional layers to extract local feature patterns, followed by dense layers for classification.

Architecture:

Conv1D(32 filters, kernel=3, ReLU) → MaxPooling1D(2)
Conv1D(64 filters, kernel=3, ReLU) → MaxPooling1D(2)
Flatten → Dense(64, ReLU) → Dropout(0.3) → Dense(1, sigmoid)

Trained for 20 epochs, batch size 32, Adam optimizer, binary cross-entropy loss.
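
To make the architecture concrete, the sketch below walks the 102-feature input through the layers and tallies trainable parameters using plain arithmetic. It assumes an input of shape (102, 1) and the Keras defaults of stride 1 and "valid" (no) padding for Conv1D and MaxPooling1D; if cnn_baseline.py sets other options, the numbers will differ.

```python
def conv1d_out(length, kernel, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel) // stride + 1


def pool1d_out(length, pool):
    """Output length of 1D max pooling with stride equal to the pool size."""
    return (length - pool) // pool + 1


length, channels, params = 102, 1, 0
for filters in (32, 64):
    params += 3 * channels * filters + filters  # kernel weights + biases
    length = pool1d_out(conv1d_out(length, 3), 2)
    channels = filters

flat = length * channels       # size after Flatten
params += flat * 64 + 64       # Dense(64)
params += 64 * 1 + 1           # Dense(1, sigmoid); Dropout has no parameters

print(length, flat, params)    # 24 1536 104769
```

Under these assumptions the sequence length shrinks 102 → 100 → 50 → 48 → 24, and the first dense layer accounts for most of the parameters.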

| Metric | Value |
|---|---|
| Test Accuracy | 99.30% |
| Precision | 99.27% |
| Recall | 98.79% |
| F1-Score | 99.03% |

Confusion Matrix (test set, 1,138 samples):

| | Pred Planet | Pred FP |
|---|---|---|
| Actual Planet | 407 ✅ | 5 ❌ |
| Actual FP | 3 ❌ | 723 ✅ |

Candidate Predictions: 1,164 likely planets, 815 false positives

Key finding: The CNN (99.30%) outperforms the hand-coded Bayesian ensemble (99.12%) by only 0.18 percentage points on the test set, showing that the from-scratch NumPy implementations are competitive with this deep learning baseline.


✅ Phase 9 — Candidate Ranking & Final Predictions (candidate_ranking.py)

Combines the outputs of all five classifiers into a single ensemble score for each of the 1,979 unresolved candidates, using a weighted average whose weights reflect each method's test-set performance.

Ensemble weights:

| Classifier | Weight | Rationale |
|---|---|---|
| CNN | 40% | Highest test accuracy (99.30%) |
| Bayesian Reasoning | 40% | Near-equal accuracy (99.12%), fully interpretable |
| Decision Tree | 12% | Solid performance (94.55%), from scratch |
| Naive Bayes | 5% | Lower precision but strong recall signal |
| K-Means | 3% | Weakest separator (unsupervised, high FPR) |

Confidence tiers:

| Tier | Threshold | Count |
|---|---|---|
| HIGH | score >= 0.80 | 1,112 |
| MEDIUM | 0.50 <= score < 0.80 | 90 |
| LOW / False Positive | score < 0.50 | 777 |
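
The weighting and tiering logic can be sketched as below. This is an illustration under assumptions: the dictionary keys, function names, and three toy candidates are hypothetical, and each classifier is assumed to contribute a planet score between 0 and 1 per candidate (whether candidate_ranking.py uses probabilities or hard votes for each method is not stated here).

```python
import numpy as np

# Ensemble weights from the table above.
WEIGHTS = {"cnn": 0.40, "bayes": 0.40, "tree": 0.12, "nb": 0.05, "kmeans": 0.03}


def ensemble_score(scores):
    """Weighted average of per-classifier planet scores."""
    return sum(w * np.asarray(scores[name], dtype=float) for name, w in WEIGHTS.items())


def assign_tier(final):
    """Map ensemble scores to the three confidence tiers."""
    return np.where(final >= 0.80, "HIGH", np.where(final >= 0.50, "MEDIUM", "LOW"))


# Three toy candidates: unanimous planet, Bayesian combiner dissenting, unanimous FP.
scores = {
    "cnn":    [1.0, 1.0, 0.0],
    "bayes":  [1.0, 0.0, 0.0],
    "tree":   [1.0, 1.0, 0.0],
    "nb":     [1.0, 1.0, 0.0],
    "kmeans": [1.0, 1.0, 0.0],
}
final = ensemble_score(scores)
tiers = assign_tier(final)
ranking = np.argsort(-final, kind="stable")  # candidate indices, best first
```

The second candidate shows the effect of the weights: four of five classifiers vote planet, yet losing the 40% Bayesian weight drops the score to 0.60 and the candidate to the MEDIUM tier.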

Classifier agreement across all 1,979 candidates:

| Classifiers agreeing | Candidates |
|---|---|
| 5 / 5 (unanimous) | 1,066 |
| 4 / 5 | 88 |
| 3 / 5 | 70 |
| 2 / 5 | 267 |
| 1 / 5 | 129 |
| 0 / 5 | 359 |

Top candidates (score = 0.9997, all 5 classifiers unanimous): indices 2, 9, 40, 23, 1517, 45, 53, 1520, 1487, 1509 — and 1,056 more.

Outputs saved:

| File | Contents |
|---|---|
| final_candidate_scores.npy | Ensemble probability per candidate |
| final_candidate_labels.npy | 1 = planet, 0 = false positive |
| final_candidate_tiers.npy | HIGH / MEDIUM / LOW per candidate |
| final_candidate_ranking.npy | Candidate indices sorted best to worst |
| final_agreement_counts.npy | How many classifiers agreed per candidate |
| final_candidate_summary.txt | Human-readable ranked list of all 1,979 |

✅ Phase 10 — Interactive GUI (gui.py)

A desktop application built with Tkinter. Loads all Phase 9 outputs and provides an interactive interface for exploring the full ranked candidate list.

To run:

python gui.py

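Requires only Python 3, NumPy, and Tkinter. Tkinter ships with most Python installers, though some Linux distributions package it separately (for example, python3-tk on Debian/Ubuntu). All .npy output files from Phase 9 must be in the same directory.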

Features:

  • Animated starfield header with project and team information
  • Stat bar showing total candidates, tier counts, Bayesian accuracy (99.12%), and CNN accuracy (99.30%)
  • Full ranked table of all 1,979 candidates, colour-coded by confidence tier, with search bar, tier filter buttons, and click-to-sort column headers
  • Confidence distribution bar chart (HIGH / MEDIUM / LOW)
  • Candidate detail panel — click any row to see that candidate's ensemble score, tier, 5-dot classifier agreement indicator, and individual Bayesian and CNN score bars

📐 Evaluation Metrics

All classifiers are evaluated on the held-out test set using:

  • Accuracy — overall correctness
  • Precision — of predicted planets, how many are real
  • Recall — of real planets, how many did we catch (priority metric)
  • F1-Score — harmonic mean of precision and recall
  • AUC-ROC — discrimination ability across thresholds
  • Confusion Matrix — breakdown of TP, FP, TN, FN
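
As a worked example, the threshold-based metrics can be recomputed directly from confusion-matrix counts. The sketch below uses the Phase 7 test-set counts (405 true positives, 3 false positives, 7 false negatives, 723 true negatives), with "planet" as the positive class; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of predicted planets, how many are real
    recall = tp / (tp + fn)     # of real planets, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = classification_metrics(tp=405, fp=3, fn=7, tn=723)
print({name: round(100 * value, 2) for name, value in metrics.items()})
# {'accuracy': 99.12, 'precision': 99.26, 'recall': 98.3, 'f1': 98.78}
```

AUC-ROC is not covered by this sketch because it needs the continuous scores rather than a single confusion matrix.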

🏁 Final Results

Of the 1,979 unresolved Kepler candidates that have never been confirmed or ruled out:

  • 1,112 signals classified as HIGH confidence planet candidates (ensemble score >= 0.80)
  • 1,066 of those have all five classifiers in unanimous agreement
  • 90 signals in the MEDIUM confidence tier — worth follow-up investigation
  • 777 signals classified as likely false positives

This pipeline provides astronomers with a prioritised, ranked list of candidates for ground-based follow-up observation — maximising the scientific return from limited telescope time.


Last updated: April 26, 2026
