🔭 Exoplanet Transit Detection Using NASA Kepler Data

A Multi-Method AI Approach with Classification, Clustering, and Probabilistic Reasoning

Eesha Fatima (31647) · Fatima Kaleem (31620) · Uroos Fatima (31094)

Institute of Business Administration, Karachi

Spring 2026 · Introduction to Artificial Intelligence · Dr. Syed Ali Raza

📌 Project Overview

This project builds a multi-method AI pipeline to classify transit signals catalogued in the NASA Kepler Cumulative KOI (Kepler Objects of Interest) table. The goal is to identify which unconfirmed candidate signals are most likely to be genuine exoplanet transits, using:

Decision Trees
Naive Bayes
K-Means Clustering
Bayesian Probabilistic Reasoning

These four methods are implemented from scratch using NumPy. A TensorFlow/Keras CNN (Phase 8) serves as a baseline for comparison.

📁 Project Structure

Exoplanet-Transit-Detection-Using-NASA-Kepler-Data/
│
├── cumulative_2026.04.12_06.34.10.csv   ← Raw NASA dataset
├── koi_clean.csv                         ← Cleaned dataset (Phase 2 output)
│
├── cleaningdata.py                       ← Phase 2: data cleaning
├── preprocessing.py                      ← Phase 3: preprocessing + SMOTE
├── decision_tree.py                      ← Phase 4: Decision Tree 
├── naive_bayes.py                        ← Phase 5: Gaussian Naive Bayes 
├── kmeans.py                             ← Phase 6: K-Means Clustering 
├── bayesian_reasoning.py                 ← Phase 7: Bayesian probabilistic reasoning
├── cnn_baseline.py                       ← Phase 8: CNN baseline 
├── candidate_ranking.py                  ← Phase 9: Final candidate ranking
├── gui.py                                ← Phase 10: Interactive GUI
│
├── requirements.txt
├── .gitignore
├── LICENSE
│
├── X_train.npy                           ← Balanced training features (post-SMOTE)
├── y_train.npy                           ← Balanced training labels
├── X_val.npy                             ← Validation features
├── y_val.npy                             ← Validation labels
├── X_test.npy                            ← Test features
├── y_test.npy                            ← Test labels
├── X_candidates.npy                      ← 1,979 candidate features (inference)
│
├── bayesian_candidate_scores.npy         ← Phase 7 output
├── bayesian_candidate_labels.npy         ← Phase 7 output
├── cnn_candidate_scores.npy              ← Phase 8 output
├── cnn_candidate_labels.npy              ← Phase 8 output
├── final_candidate_scores.npy            ← Phase 9 output
├── final_candidate_labels.npy            ← Phase 9 output
├── final_candidate_tiers.npy             ← Phase 9 output
├── final_candidate_ranking.npy           ← Phase 9 output
├── final_agreement_counts.npy            ← Phase 9 output
├── final_candidate_summary.txt           ← Phase 9 output
│
└── README.md

📊 Performance Summary

Metric	Score
Validation Accuracy	95.61%
Test Accuracy	94.55%
AUC-ROC	0.9437
Precision	91%
Recall	94%

Key Finding: The Decision Tree (Phase 4) correctly identifies 94% of confirmed planets (recall) while maintaining 91% precision on the test set.

🚀 Progress Log

✅ Phase 1 — Dataset Acquisition & Understanding

Dataset: NASA Kepler Cumulative KOI Table (cumulative_2026.04.12_06.34.10.csv) Source: NASA Exoplanet Archive

The raw dataset contains 9,564 KOI entries, each representing a stellar signal flagged by the Kepler pipeline. Each entry carries over 40 engineered photometric and orbital features, including transit depth, orbital period, transit duration, stellar radius, signal-to-noise ratio (SNR), and centroid offset metrics.

Class Distribution:

Label	Count	Role
FALSE POSITIVE	4,839	Training data (negative class)
CONFIRMED	2,746	Training data (positive class)
CANDIDATE	1,979	Inference targets (unknowns)

Key Insight: The 1,979 CANDIDATE entries have never been confirmed or ruled out — primarily because the Kepler mission ended in 2018, ground-based telescope time is limited, and many candidates have weak signals or orbit faint stars. These are the primary scientific target of this pipeline.

✅ Phase 2 — Data Cleaning (`cleaningdata.py`)

Loaded raw CSV using pd.read_csv(..., comment='#') to skip header comment rows
Inspected missing values — identified columns with >50% null rates
Dropped high-missingness columns (>50% null threshold)
Dropped non-feature columns (identifiers, provenance fields, administrative metadata):

rowid, kepid, kepoi_name, koi_vet_stat, koi_vet_date,
koi_pdisposition, koi_disp_prov, koi_comment, koi_fittype,
koi_limbdark_mod, koi_parm_prov, koi_tce_delivname,
koi_quarters, koi_trans_mod, koi_datalink_dvr,
koi_datalink_dvs, koi_sparprov, koi_eccen

Saved cleaned dataset as koi_clean.csv

Output shape: 9,564 rows × 103 columns (102 features + 1 target)
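The cleaning steps above can be sketched in a few lines of pandas. This is a minimal illustration rather than the contents of `cleaningdata.py`: the helper name `clean_koi`, the synthetic rows, and the `koi_mostly_null` column are assumptions made up for the example.

```python
import io
import pandas as pd

# Synthetic stand-in for the raw export (values are made up for illustration).
# The real file begins with '#' comment rows, which comment="#" skips.
raw_text = """# header comment row
# another header comment row
rowid,kepid,koi_disposition,koi_period,koi_depth,koi_mostly_null
1,101,CONFIRMED,9.5,600.0,
2,102,CANDIDATE,19.9,10800.0,
3,103,FALSE POSITIVE,1.7,8000.0,
4,104,CONFIRMED,2.5,610.0,0.1
"""

def clean_koi(source, non_feature_cols, null_threshold=0.5):
    """Load a KOI table, drop sparse columns, then drop non-feature columns."""
    df = pd.read_csv(source, comment="#")
    # Fraction of missing values in each column.
    null_rate = df.isna().mean()
    sparse_cols = null_rate[null_rate > null_threshold].index
    df = df.drop(columns=sparse_cols)
    # errors="ignore" tolerates names that are absent or were already dropped.
    return df.drop(columns=non_feature_cols, errors="ignore")

clean = clean_koi(io.StringIO(raw_text), non_feature_cols=["rowid", "kepid"])
print(clean.columns.tolist())
```

In the synthetic table, `koi_mostly_null` is 75% empty and is dropped by the threshold rule, while `rowid` and `kepid` are dropped by name.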

False positive flags retained as features:

Column	Meaning
`koi_fpflag_nt`	Not transit-like shape
`koi_fpflag_ss`	Secondary eclipse (eclipsing binary)
`koi_fpflag_co`	Centroid offset (background contamination)
`koi_fpflag_ec`	Ephemeris match to known false positive

✅ Phase 3 — Preprocessing (`preprocessing.py`)

Train / Candidate Split

CANDIDATE rows were separated before any fitting to prevent data leakage.

Training pool  →  7,585 rows  (CONFIRMED + FALSE POSITIVE)
Candidates     →  1,979 rows  (held out entirely)

Steps Performed

Step	Method
Missing values	Median imputation (fit on train only)
Label encoding	CONFIRMED → 0, FALSE POSITIVE → 1
Feature scaling	Z-score standardization (fit on train only)
Train/Val/Test split	Stratified 70% / 15% / 15%
Class rebalancing	SMOTE oversampling
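The "fit on train only" rule for imputation and scaling can be sketched as follows. The function names `fit_preprocessor` and `apply_preprocessor` are hypothetical, and the arrays are toy data; the point is that medians, means, and standard deviations are learned from the training split and then reused unchanged on every other split.

```python
import numpy as np

def fit_preprocessor(X_train):
    """Learn per-feature medians, means and standard deviations from training data."""
    medians = np.nanmedian(X_train, axis=0)
    filled = np.where(np.isnan(X_train), medians, X_train)
    mean = filled.mean(axis=0)
    std = filled.std(axis=0)
    std[std == 0] = 1.0  # constant features: avoid dividing by zero
    return medians, mean, std

def apply_preprocessor(X, medians, mean, std):
    """Impute and standardize any split using statistics learned on the training split."""
    filled = np.where(np.isnan(X), medians, X)
    return (filled - mean) / std

# Toy data (made up): two features, with missing values marked as NaN.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 50.0]])
X_val = np.array([[np.nan, 20.0]])

params = fit_preprocessor(X_train)
X_train_scaled = apply_preprocessor(X_train, *params)
X_val_scaled = apply_preprocessor(X_val, *params)  # no statistics come from X_val
```

Because the validation row is imputed with the training median and scaled with the training mean, nothing about the validation, test, or candidate rows leaks into the fitted parameters.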

After SMOTE:

CONFIRMED = 3,387 | FALSE POSITIVE = 3,387
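For readers unfamiliar with SMOTE, the idea is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified NumPy version under stated assumptions: the function name `smote`, the choice of k = 5, and the toy data are illustrative, and the source does not say whether `preprocessing.py` uses this approach or a library implementation.

```python
import numpy as np

def smote(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Requires at least two minority samples. The pairwise distance matrix makes
    this suitable for small arrays only.
    """
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    k = min(k, n - 1)
    # Squared Euclidean distance between every pair of minority samples.
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    # Pick a base sample and one of its k neighbours for each synthetic row.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])

# Toy minority class (made up): 20 samples with 3 features.
X_min = np.random.default_rng(42).normal(size=(20, 3))
X_new = smote(X_min, n_new=15)
```

Oversampling is applied to the training split only, so the validation and test sets keep their natural class balance.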

Saved splits:

X_train.npy  y_train.npy
X_val.npy    y_val.npy
X_test.npy   y_test.npy
X_candidates.npy

🛠️ Model Logic — Decision Tree (Phase 4)

The classifier is built from scratch using NumPy.

Core Functions

`entropy(y)`

Measures the disorder/impurity of a dataset.

$$H(y) = -\sum_{i} p_i \log_2 p_i$$

Here p_i is the fraction of samples in y that belong to class i. Entropy = 0 when all labels are the same; for two classes, entropy peaks at 1 when the split is 50/50.

`information_gain(...)`

Measures the reduction in entropy after splitting on a feature threshold.

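$$\text{Gain} = H(\text{parent}) - \left( \frac{n_L}{n} H(\text{left}) + \frac{n_R}{n} H(\text{right}) \right)$$

Here n_L and n_R are the numbers of samples sent to the left and right children, and n = n_L + n_R.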
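The two formulas translate directly into NumPy. This is a minimal sketch; the signature of `information_gain` (a label array plus a boolean mask selecting the left child) is an assumption, since the original signature is not shown.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(y, left_mask):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    left, right = y[left_mask], y[~left_mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that sends everything one way gains nothing
    n = len(y)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(y) - weighted

y = np.array([0, 0, 1, 1])
perfect = np.array([True, True, False, False])   # separates the classes
useless = np.array([True, False, True, False])   # each child is still 50/50

gain_perfect = information_gain(y, perfect)
gain_useless = information_gain(y, useless)
```

A split that separates the two classes completely recovers the full parent entropy, while a split that leaves both children at 50/50 recovers none of it.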

`best_split(X, y)`

Iterates through every feature and every unique threshold value to find the split that maximizes Information Gain.

`build_tree(...)`

Recursively grows the tree. Stops when:

A node is pure (only one class remains), or
max_depth = 10 is reached (to prevent overfitting)

`predict(...)`

Traverses the finished tree for new data until it reaches a leaf node (0 = Confirmed, 1 = False Positive).

Node Structure

Each Node object contains:

Feature / Threshold — the question the node asks (e.g., Is transit depth < 0.05?)
Left / Right — pointers to child branches
Value — only present in leaf nodes; the final classification
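Putting the pieces together, a compact version of the tree might look like the sketch below. It is an illustration under assumptions, not the code in `decision_tree.py`: the constructor arguments, the helper `predict_one`, the "go left when feature < threshold" convention, and the toy data are all chosen for the example.

```python
import numpy as np

class Node:
    """Internal nodes hold a feature/threshold question; leaves hold a value."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.value = value

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(X, y):
    """Try every feature and every unique value as a threshold; keep the best gain."""
    best_gain, best_feature, best_threshold = 0.0, None, None
    parent = entropy(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] < threshold
            n_left = left.sum()
            if n_left == 0 or n_left == len(y):
                continue  # split does not separate anything
            children = (n_left * entropy(y[left])
                        + (len(y) - n_left) * entropy(y[~left])) / len(y)
            gain = parent - children
            if gain > best_gain:
                best_gain, best_feature, best_threshold = gain, feature, threshold
    return best_feature, best_threshold

def build_tree(X, y, depth=0, max_depth=10):
    majority = int(np.bincount(y).argmax())
    if len(np.unique(y)) == 1 or depth >= max_depth:
        return Node(value=majority)  # pure node or depth limit reached
    feature, threshold = best_split(X, y)
    if feature is None:
        return Node(value=majority)  # no split improves on the parent
    left = X[:, feature] < threshold
    return Node(feature, threshold,
                build_tree(X[left], y[left], depth + 1, max_depth),
                build_tree(X[~left], y[~left], depth + 1, max_depth))

def predict_one(node, x):
    while node.value is None:
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.value

def predict(tree, X):
    return np.array([predict_one(tree, x) for x in X])

# Toy data (made up): column 0 separates the classes, column 1 does not.
X = np.array([[0.01, 5.0], [0.02, 7.0], [0.08, 6.0], [0.09, 4.0]])
y = np.array([0, 0, 1, 1])  # 0 = Confirmed, 1 = False Positive
tree = build_tree(X, y)
predictions = predict(tree, X)
```

The exhaustive threshold search is simple but costs roughly one entropy evaluation per feature per unique value at every node, which is why a depth limit also helps keep training time manageable.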

⚠️ Feature Selection — Removing "Cheat Codes"

Initially, the model achieved 99.03% accuracy — but analysis of decision paths revealed it was primarily using NASA-derived flags:

koi_fpflag_nt, koi_fpflag_ss, koi_fpflag_co, koi_fpflag_ec, koi_score

These flags are assigned after scientists already know the classification. A model using them isn't predicting planets — it's memorising NASA's notes.

Solution: These columns were removed, forcing the model to learn from raw physical observations only:

Feature	Description
Transit Depth	How much light the planet blocks
Orbital Period	How long it takes to orbit the star
Stellar Radius	The size of the parent star

This dropped accuracy to 94.55%, but produced a far more robust and scientifically honest model capable of generalising to new stars where these flags don't yet exist.

📈 Results

Test Set Performance (1,138 unseen samples)

Metric	Value
Accuracy	94.55%
Precision	0.91
Recall	0.94
Total errors	62 / 1,138

Confusion Matrix

	Predicted Confirmed	Predicted False Positive
Actual Confirmed	386 ✅	26 ❌
Actual False Positive	36 ❌	690 ✅

Recall is the priority metric — a missed genuine planet (false negative) is more scientifically costly than a false alarm.

Candidate Predictions (1,979 unresolved signals)

Prediction	Count
🟢 CONFIRMED (likely planet)	743
🔴 FALSE POSITIVE	1,236

This provides a prioritised list of 743 high-probability candidates for astronomers to focus follow-up observations on.

✅ Phase 5 — Naive Bayes (`naive_bayes.py`)

Gaussian Naive Bayes built from scratch using NumPy. It assumes each feature follows a Gaussian (normal) distribution within each class and uses log-probabilities throughout for numerical stability. An epsilon (1e-9) is added to variance and PDF values to prevent division by zero and log(0).
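A minimal sketch of this classifier is shown below. The class name `TinyGaussianNB` and its method names are hypothetical. One difference from the description above: this version evaluates the Gaussian log-density analytically, so the epsilon is only needed on the variance.

```python
import numpy as np

class TinyGaussianNB:
    """Gaussian Naive Bayes working entirely in log space."""

    def __init__(self, eps=1e-9):
        self.eps = eps  # keeps variances strictly positive

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + self.eps
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def joint_log_likelihood(self, X):
        # diff has shape (n_samples, n_classes, n_features)
        diff = X[:, None, :] - self.mean_[None, :, :]
        log_pdf = -0.5 * (np.log(2.0 * np.pi * self.var_)[None] + diff ** 2 / self.var_[None])
        # Naive independence assumption: sum the per-feature log densities.
        return self.log_prior_ + log_pdf.sum(axis=2)

    def predict(self, X):
        return self.classes_[np.argmax(self.joint_log_likelihood(X), axis=1)]

# Toy data (made up): two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])
nb_predictions = TinyGaussianNB().fit(X, y).predict(X)
```

Summing log-densities instead of multiplying densities avoids underflow when there are many features, which is the numerical stability point made above.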

Metric	Value
Test Accuracy	85.50%
Precision	71.40%
Recall	100%
F1-Score	83.32%

Confusion Matrix (test set):

	Pred Planet	Pred FP
Actual Planet	412 ✅	0 ❌
Actual FP	165 ❌	561 ✅

Naive Bayes achieves perfect recall on the test set: it misses none of the 412 real planets, at the cost of lower precision (165 false alarms). This makes it a valuable evidence source for the Bayesian combiner in Phase 7.

✅ Phase 6 — K-Means Clustering (`kmeans.py`)

K-Means clustering built from scratch using NumPy. Uses the Elbow Method (K = 1 to 6) to identify optimal K. Operates fully unsupervised — groups stars by physical similarity without using NASA labels. The planet-rich cluster is identified post-hoc by checking cluster composition against training labels.
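The sketch below shows one way to implement the clustering loop, the elbow curve, and the post-hoc composition check. It is an illustration under assumptions: the function name `kmeans`, the random initialisation, and the synthetic two-blob data are not taken from `kmeans.py`.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Empty clusters keep their previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to the nearest centroid).
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    return labels, centroids, float(dist.min(axis=1).sum())

# Synthetic data (made up): two blobs of 30 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(6.0, 0.5, (30, 2))])
y = np.array([0] * 30 + [1] * 30)  # used only after clustering, never during it

# Elbow method: inertia for K = 1..6; look for the K where the curve flattens.
inertias = [kmeans(X, k)[2] for k in range(1, 7)]

# Post-hoc check: fraction of class 0 (CONFIRMED in the Phase 3 encoding) per cluster.
labels, _, _ = kmeans(X, 2)
confirmed_fraction = [float((y[labels == j] == 0).mean()) for j in range(2) if np.any(labels == j)]
```

The labels `y` appear only in the last step, which mirrors the description above: clustering is unsupervised, and the planet-rich cluster is named afterwards by inspecting its composition.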

Result: The planet cluster contained ~54% confirmed planets, which suggests that physical features carry some discriminative signal even without supervision. As expected for an unsupervised method, K-Means is a weak evidence source (FPR of 0.851 on the validation set), so its contribution to the final ensemble is downweighted in Phase 9.

✅ Phase 7 — Bayesian Probabilistic Reasoning (`bayesian_reasoning.py`)

A Bayesian combiner that sequentially updates a probability estimate using evidence from all three from-scratch classifiers (Naive Bayes, K-Means, Decision Tree).

How it works:

Start with the prior: P(planet) = 0.50 (from SMOTE-balanced training set)
For each classifier, apply Bayes' theorem:
P(planet | evidence) = P(evidence | planet) × P(planet) / P(evidence)
The output of each update becomes the prior for the next classifier
Final score = P(planet | all three classifiers)

Likelihoods (recall and FPR) are measured on the validation set to produce honest, non-overfit estimates.

Classifier	Recall	FPR	Notes
Naive Bayes	0.993	0.220	Strong recall, moderate FPR
K-Means	0.990	0.851	Weak separator — unsupervised limitation
Decision Tree	0.985	0.004	Dominant evidence source
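The sequential update can be sketched with the recall and FPR values from the table above. The function names and the dictionary layout are assumptions for illustration. When a classifier votes "planet", the likelihoods are its recall and FPR; when it votes "false positive", they are the complements. Chaining the updates this way treats the classifiers' votes as conditionally independent given the true class.

```python
# (recall, FPR) per classifier, taken from the validation-set table above.
LIKELIHOODS = {
    "naive_bayes": (0.993, 0.220),
    "kmeans": (0.990, 0.851),
    "decision_tree": (0.985, 0.004),
}

def bayes_update(prior, voted_planet, recall, fpr):
    """One application of Bayes' theorem for a single classifier's vote."""
    if voted_planet:
        like_planet, like_fp = recall, fpr              # P(vote=planet | class)
    else:
        like_planet, like_fp = 1.0 - recall, 1.0 - fpr  # P(vote=FP | class)
    evidence = like_planet * prior + like_fp * (1.0 - prior)
    return like_planet * prior / evidence

def combine(votes, prior=0.5):
    """Feed each posterior back in as the prior for the next classifier."""
    p = prior
    for name, (recall, fpr) in LIKELIHOODS.items():
        p = bayes_update(p, votes[name], recall, fpr)
    return p

all_planet = combine({"naive_bayes": True, "kmeans": True, "decision_tree": True})
all_fp = combine({"naive_bayes": False, "kmeans": False, "decision_tree": False})
print(round(all_planet, 4), round(all_fp, 4))
```

With three unanimous "planet" votes the score comes out at about 0.9992. Because the table values are rounded to three decimals, scores for mixed votes computed this way can differ slightly in the later decimals from those reported by the project.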

Metric	Value
Test Accuracy	99.12%
Precision	99.26%
Recall	98.30%
F1-Score	98.78%

Confusion Matrix (test set, 1,138 samples):

	Pred Planet	Pred FP
Actual Planet	405 ✅	7 ❌
Actual FP	3 ❌	723 ✅

Candidate Predictions: 1,192 likely planets, 787 false positives
Of the 1,192 planet predictions, 1,105 had all three classifiers in unanimous agreement (score = 0.9992).

All 8 vote combinations verified (sanity check passed):

NB	KM	DT	Candidates	Score	Label
planet	planet	planet	1,105	0.9992	CONFIRMED
planet	FP	planet	54	0.9859	CONFIRMED
FP	planet	planet	33	0.7215	CONFIRMED
planet	planet	FP	284	0.0712	FALSE POSITIVE
FP	FP	planet	42	0.1269	FALSE POSITIVE
planet	FP	FP	38	0.0043	FALSE POSITIVE
FP	planet	FP	64	0.0002	FALSE POSITIVE
FP	FP	FP	359	0.0000	FALSE POSITIVE

✅ Phase 8 — CNN Deep Learning Baseline (`cnn_baseline.py`)

A 1D Convolutional Neural Network built with TensorFlow/Keras as an industry-standard baseline. Treats the 102 tabular features as a 1D sequence and applies two convolutional layers to extract local feature patterns, followed by dense layers for classification.

Architecture:

Conv1D(32 filters, kernel=3, ReLU) → MaxPooling1D(2)
Conv1D(64 filters, kernel=3, ReLU) → MaxPooling1D(2)
Flatten → Dense(64, ReLU) → Dropout(0.3) → Dense(1, sigmoid)

Trained for 20 epochs, batch size 32, Adam optimizer, binary cross-entropy loss.
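To make the architecture concrete, the arithmetic below traces how the sequence length and parameter count evolve through the layers. It assumes Keras defaults that the description does not state explicitly: stride 1 and `valid` (no) padding for `Conv1D`, a pooling stride equal to the pool size, and a single input channel, i.e. an input of shape (102, 1).

```python
def conv1d_len(length, kernel, stride=1):
    """Output length of a 1D convolution with no padding."""
    return (length - kernel) // stride + 1

def pool1d_len(length, pool):
    """Output length of max pooling with stride equal to the pool size, no padding."""
    return length // pool

steps = 102                      # one step per tabular feature
steps = conv1d_len(steps, 3)     # Conv1D(32, kernel=3)
steps = pool1d_len(steps, 2)     # MaxPooling1D(2)
steps = conv1d_len(steps, 3)     # Conv1D(64, kernel=3)
steps = pool1d_len(steps, 2)     # MaxPooling1D(2)
flat = steps * 64                # Flatten: remaining steps x 64 filters

# Trainable parameters: (inputs per unit + 1 bias) x number of units.
params = {
    "conv1": (3 * 1 + 1) * 32,
    "conv2": (3 * 32 + 1) * 64,
    "dense": (flat + 1) * 64,
    "output": (64 + 1) * 1,
}
total_params = sum(params.values())
print(flat, total_params)
```

Under these assumptions most of the parameters sit in the first dense layer, which is typical when a flattened convolutional output feeds a fully connected layer.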

Metric	Value
Test Accuracy	99.30%
Precision	99.27%
Recall	98.79%
F1-Score	99.03%

Confusion Matrix (test set, 1,138 samples):

	Pred Planet	Pred FP
Actual Planet	407 ✅	5 ❌
Actual FP	3 ❌	723 ✅

Candidate Predictions: 1,164 likely planets, 815 false positives

Key finding: The CNN (99.30%) outperforms the hand-coded Bayesian ensemble (99.12%) by only 0.18 percentage points on this test set, suggesting that our from-scratch NumPy implementations are competitive with a standard deep learning baseline.

✅ Phase 9 — Candidate Ranking & Final Predictions (`candidate_ranking.py`)

Combines outputs from all five classifiers into a single final ensemble score for each of the 1,979 unresolved candidates. Uses a weighted average reflecting each method's validated test performance.

Ensemble weights:

Classifier	Weight	Rationale
CNN	40%	Highest test accuracy (99.30%)
Bayesian Reasoning	40%	Near-equal accuracy (99.12%), fully interpretable
Decision Tree	12%	Solid performance (94.55%), from scratch
Naive Bayes	5%	Lower precision but strong recall signal
K-Means	3%	Weakest separator (unsupervised, high FPR)

Confidence tiers:

Tier	Threshold	Count
HIGH	score >= 0.80	1,112
MEDIUM	0.50 <= score < 0.80	90
LOW / False Positive	score < 0.50	777
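The weighting, tiering, and ranking steps can be sketched as below, using the weights and thresholds from the tables above. The per-classifier outputs are made-up values for three imaginary candidates, and the sketch assumes each classifier contributes a number between 0 and 1 (a probability, or a hard 0/1 vote with 1 = planet); the source does not specify which form each classifier supplies.

```python
import numpy as np

WEIGHTS = {
    "cnn": 0.40,
    "bayesian": 0.40,
    "decision_tree": 0.12,
    "naive_bayes": 0.05,
    "kmeans": 0.03,
}

def ensemble_score(outputs):
    """Weighted average of per-classifier outputs, one score per candidate."""
    return sum(w * np.asarray(outputs[name], dtype=float) for name, w in WEIGHTS.items())

def tier(score):
    if score >= 0.80:
        return "HIGH"
    if score >= 0.50:
        return "MEDIUM"
    return "LOW"

# Illustrative outputs for three imaginary candidates (not real project data).
outputs = {
    "cnn": [0.99, 0.60, 0.10],
    "bayesian": [0.98, 0.70, 0.01],
    "decision_tree": [1, 1, 0],
    "naive_bayes": [1, 0, 0],
    "kmeans": [1, 1, 1],
}

scores = ensemble_score(outputs)
tiers = [tier(s) for s in scores]
ranking = np.argsort(-scores)  # candidate indices, best first
```

Because the CNN and the Bayesian combiner together carry 80% of the weight, a candidate needs support from at least one of them to reach the MEDIUM threshold of 0.50; the other three classifiers sum to only 0.20.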

Classifier agreement across all 1,979 candidates:

Classifiers agreeing	Candidates
5 / 5 (unanimous)	1,066
4 / 5	88
3 / 5	70
2 / 5	267
1 / 5	129
0 / 5	359

Top candidates (score = 0.9997, all 5 classifiers unanimous): indices 2, 9, 40, 23, 1517, 45, 53, 1520, 1487, 1509 — and 1,056 more.

Outputs saved:

File	Contents
`final_candidate_scores.npy`	Ensemble probability per candidate
`final_candidate_labels.npy`	1 = planet, 0 = false positive
`final_candidate_tiers.npy`	HIGH / MEDIUM / LOW per candidate
`final_candidate_ranking.npy`	Candidate indices sorted best to worst
`final_agreement_counts.npy`	How many classifiers agreed per candidate
`final_candidate_summary.txt`	Human-readable ranked list of all 1,979

✅ Phase 10 — Interactive GUI (`gui.py`)

A desktop application built with Tkinter. Loads all Phase 9 outputs and provides an interactive interface for exploring the full ranked candidate list.

To run:

python gui.py

Requires only Python 3, NumPy, and Tkinter. Tkinter ships with most Python installations, though some Linux distributions package it separately. All .npy output files from Phase 9 must be in the same directory.

Features:

Animated starfield header with project and team information
Stat bar showing total candidates, tier counts, Bayesian accuracy (99.12%), and CNN accuracy (99.30%)
Full ranked table of all 1,979 candidates, colour-coded by confidence tier, with search bar, tier filter buttons, and click-to-sort column headers
Confidence distribution bar chart (HIGH / MEDIUM / LOW)
Candidate detail panel — click any row to see that candidate's ensemble score, tier, 5-dot classifier agreement indicator, and individual Bayesian and CNN score bars

📐 Evaluation Metrics

All classifiers are evaluated on the held-out test set using:

Accuracy — overall correctness
Precision — of predicted planets, how many are real
Recall — of real planets, how many did we catch (priority metric)
F1-Score — harmonic mean of precision and recall
AUC-ROC — discrimination ability across thresholds
Confusion Matrix — breakdown of TP, FP, TN, FN
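As a worked example, the threshold-based metrics can be recomputed from the four cells of a confusion matrix. The sketch below uses the Naive Bayes test-set matrix from Phase 5 (412 true positives, 0 false negatives, 165 false positives, 561 true negatives); the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Naive Bayes on the 1,138-sample test set (Phase 5).
nb = classification_metrics(tp=412, fn=0, fp=165, tn=561)
for name, value in nb.items():
    print(f"{name}: {value:.2%}")
```

These reproduce the Phase 5 figures: 85.50% accuracy, 71.40% precision, 100% recall, and 83.32% F1. AUC-ROC is not covered here because it needs the ranked scores rather than a single confusion matrix.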

🏁 Final Results

Of the 1,979 unresolved Kepler candidates that have never been confirmed or ruled out:

1,112 signals classified as HIGH confidence planet candidates (ensemble score >= 0.80)
1,066 of those have all five classifiers in unanimous agreement
90 signals in the MEDIUM confidence tier — worth follow-up investigation
777 signals classified as likely false positives

This pipeline provides astronomers with a prioritised, ranked list of candidates for ground-based follow-up observation — maximising the scientific return from limited telescope time.

Last updated: April 26, 2026
