PillCare is a deep learning system that identifies medications from real-world images using both visual features (shape, color) and text imprints.
It is designed for high-risk scenarios like elderly medication management, where misidentification can have serious consequences.
Most pill identification systems rely on image classification alone or text lookup alone, and both break down in real-world conditions:
- Pills without clear imprints → OCR fails
- Visually similar pills → vision models fail
- Noisy lighting / angles → both degrade
PillCare solves this by combining both modalities, learning when to trust each signal.
Instead of treating vision and OCR equally, PillCare uses a gated fusion architecture:
- Vision model extracts shape/color features
- OCR model extracts imprint text
- A learned confidence gate decides when OCR should influence predictions
This allows the system to:
- Ignore OCR when no readable text exists
- Use OCR when visual ambiguity is high
```
Image Input
    ↓
[Vision Model: MobileNetV2] → visual features
    ↓
[OCR Model: CRNN] → text features
    ↓
[Confidence Gate] → weighs OCR usefulness
    ↓
[Fusion Layer] → final classification
```
- Vision-only model: 80% accuracy
- Fusion model: 84% accuracy (+3.6%)
- Macro F1: 0.85 across 16 drug classes
Key takeaway:
Multi-modal fusion improves performance only when the model learns when to ignore bad signals.
- Built a multi-stage ML pipeline (preprocessing → OCR → fusion → inference)
- Designed a gated fusion mechanism to reduce noisy feature influence
- Implemented class-weighted training to handle severe dataset imbalance
- Optimized models for edge deployment using TensorFlow Lite
- Automated dataset expansion using openFDA APIs + augmentation pipelines
- Data > architecture: 3× augmentation gave larger gains than model complexity
- Naive fusion fails: OCR hurts performance unless selectively gated
- Small datasets require restraint: freezing backbones outperformed fine-tuning
- Class imbalance can silently break models
| Component | Tool |
|---|---|
| Deep Learning | TensorFlow / Keras |
| Model Architectures | MobileNetV2 (visual), CRNN (text), Gated Fusion (combined) |
| Image Preprocessing | OpenCV, NumPy |
| Text Processing | TensorFlow Text, Regular Expressions |
| Data Augmentation | OpenCV (offline), TensorFlow (online) |
| Dataset Expansion | openFDA API, ePillID |
| Model Deployment | TensorFlow Lite |
- 707 images across 16 drug classes, expanded from ePillID segmented NIH pill images
- Augmented offline with rotation, flip, brightness, contrast, perspective, noise, and crop transforms
- Preprocessed to 224x224 resolution
- Balanced via class-weighted training + targeted augmentation
| Drug | Train Images | Drug | Train Images |
|---|---|---|---|
| Diltiazem HCl | 70 | Simvastatin | 50 |
| Lisinopril | 67 | Gabapentin | 52 |
| Metformin HCl | 49 | Hydrocodone/APAP | 43 |
| Warfarin Sodium | 32 | Prednisone | 19 |
| Hydrochlorothiazide | 19 | Levothyroxine Sodium | 19 |
| Metoprolol Tartrate | 18 | Amlodipine Besylate | 16 |
| Losartan Potassium | 16 | Carvedilol | 15 |
| Pantoprazole Sodium | 15 | Amoxicillin | 12 |
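Part of the offline augmentation described above can be sketched in numpy alone (the function name `augment` is illustrative; the actual pipeline uses OpenCV and also applies rotation, perspective, and crop transforms):

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Random flip, brightness jitter, and gaussian noise for a float
    image in [0, 1]. A simplified stand-in for the full OpenCV pipeline
    (which also rotates, warps perspective, and crops)."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                                # horizontal flip
    img = img * rng.uniform(0.8, 1.2)                     # brightness jitter
    img = img + rng.normal(0.0, 0.02, size=img.shape)     # sensor-style noise
    return np.clip(img, 0.0, 1.0)                         # keep valid pixel range

rng = np.random.default_rng(0)
pill = rng.random((224, 224, 3))
batch = [augment(pill, rng) for _ in range(3)]  # 3x expansion, as in the project
```

Each call produces a distinct variant of the same pill, which is how underrepresented classes were padded out toward the augmentation targets.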
- Processed pill images with corresponding text imprints
- Character set includes alphanumeric characters and common symbols
- Preprocessed using adaptive thresholding and resizing while maintaining aspect ratio
- Located in the `ocr_dataset_epillid/` directory
- Image Preprocessing (`preprocess.py`)
  - Converts images to grayscale
  - Applies adaptive thresholding for better text visibility
  - Resizes images to a fixed height while maintaining aspect ratio
  - Normalizes pixel values to the [0, 1] range
- Data Generation (`create_ocr_dataset.py`)
  - Processes raw images and labels
  - Generates character-to-index mappings
  - Creates train/validation/test splits
  - Handles variable-length sequences with padding
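The character-mapping and padding steps can be sketched in plain Python (function names `build_char_map` and `encode_labels` are illustrative, not the script's actual API; index 0 is reserved for the CTC blank / padding token):

```python
def build_char_map(labels):
    """Map each character in the imprint vocabulary to an integer index.
    Index 0 is reserved for the CTC blank / padding token."""
    chars = sorted(set("".join(labels)))
    return {c: i + 1 for i, c in enumerate(chars)}

def encode_labels(labels, char_map):
    """Encode imprint strings as index sequences, padded to the longest label."""
    encoded = [[char_map[c] for c in s] for s in labels]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [0] * (max_len - len(seq)) for seq in encoded]

imprints = ["M367", "IP 110"]   # hypothetical imprint labels
cmap = build_char_map(imprints)
padded = encode_labels(imprints, cmap)  # all rows padded to length 6
```

Padding to a common length lets variable-length imprints be batched; CTC loss later ignores the blank positions.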
```
Input → Conv2D → BatchNorm → ReLU → MaxPool2D → Dropout →
Conv2D → BatchNorm → ReLU → Conv2D → BatchNorm → ReLU → MaxPool2D → Dropout →
Conv2D → BatchNorm → ReLU → Dropout → Reshape →
Bidirectional(GRU) → BatchNorm → Dropout →
Bidirectional(GRU) → Dense → Softmax → CTCLoss
```
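At inference time, CTC output is typically decoded greedily: take the argmax class per timestep, collapse consecutive repeats, and drop blanks. A minimal sketch of this standard decoding rule (blank index 0 assumed; not necessarily the project's exact code):

```python
def ctc_greedy_decode(timestep_argmax, blank=0):
    """Collapse consecutive repeats, then remove blanks — the standard
    greedy CTC decoding rule."""
    decoded, prev = [], None
    for idx in timestep_argmax:
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded

# A blank between repeats is what lets CTC distinguish "11" from "1":
ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0])  # → [1, 2] collapses, blank splits → [1, 1, 2]
```

The decoded indices are then mapped back to characters via the inverse of the character-to-index mapping.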
```
Vision Branch: Input(224x224x3) → MobileNetV2 → GAP → Dense(128) → Dropout(0.3)
OCR Branch:    Input(32x128x1) → CRNN Conv Layers → GAP → (256-dim features)
        ↓
Gate:   OCR features → Dense(64) → Sigmoid → scale OCR features
        ↓
Fusion: Concatenate(vision_features, gated_ocr) → Dense(256) → Dropout(0.4) → Softmax
```
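The gate's effect can be illustrated in numpy, independent of the Keras implementation. Reducing the Dense(64) gate to a scalar via its mean is an assumption made here for brevity; the real layer may scale features element-wise:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(vision_feat, ocr_feat, W_gate, b_gate):
    """vision_feat: (128,), ocr_feat: (256,), W_gate: (256, 64), b_gate: (64,).
    The sigmoid gate (values in [0, 1]) down-weights OCR features before fusion."""
    gate = sigmoid(ocr_feat @ W_gate + b_gate)       # (64,) confidence scores
    gated_ocr = ocr_feat * gate.mean()               # scalar summary (illustrative)
    return np.concatenate([vision_feat, gated_ocr])  # (384,) fused vector

rng = np.random.default_rng(0)
vision = rng.standard_normal(128)
ocr = rng.standard_normal(256)
W = rng.standard_normal((256, 64)) * 0.01
# A strongly negative bias saturates the gate near 0, suppressing OCR:
closed = gated_fusion(vision, ocr, W, b_gate=np.full(64, -10.0))
```

When the gate saturates near zero (e.g. no readable imprint), the fused vector is effectively vision-only; when it opens, OCR features pass through to the classifier.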
- Phase 1: Freeze both backbones, train fusion head + gate (25 epochs, LR=1e-4, class weights)
- Phase 2: Disabled — causes catastrophic forgetting with small datasets
- Fine-tuned MobileNetV2 weights loaded from separately trained vision model
| Metric | Score |
|---|---|
| Test Accuracy | 80% |
| Macro F1 | 0.83 |
| Classes | 16 |
| Metric | Score |
|---|---|
| Test Accuracy | 84% |
| Macro F1 | 0.85 |
| Classes | 16 |
With increased augmentation (200 samples per class), the late-fusion model outperformed the vision-only baseline by +3.6%, demonstrating the benefit of multi-modal integration when sufficient data is available.
| Pill | Precision | Recall | F1 |
|---|---|---|---|
| Amlodipine Besylate | 0.67 | 0.80 | 0.73 |
| Amoxicillin | 0.57 | 1.00 | 0.73 |
| Carvedilol | 1.00 | 1.00 | 1.00 |
| Diltiazem HCl | 0.73 | 0.57 | 0.64 |
| Gabapentin | 1.00 | 0.78 | 0.88 |
| Hydrochlorothiazide | 0.50 | 0.60 | 0.55 |
| Hydrocodone/APAP | 0.86 | 1.00 | 0.92 |
| Levothyroxine Sodium | 0.83 | 1.00 | 0.91 |
| Lisinopril | 0.88 | 0.58 | 0.70 |
| Losartan Potassium | 1.00 | 1.00 | 1.00 |
| Metformin HCl | 1.00 | 0.88 | 0.93 |
| Metoprolol Tartrate | 0.71 | 1.00 | 0.83 |
| Pantoprazole Sodium | 0.80 | 1.00 | 0.89 |
| Prednisone | 1.00 | 1.00 | 1.00 |
| Simvastatin | 1.00 | 0.92 | 0.96 |
| Warfarin Sodium | 0.89 | 1.00 | 0.94 |
- Python 3.10–3.12
- TensorFlow 2.16+
- OpenCV
- NumPy, scikit-learn
- Matplotlib (for visualization)
- WSL2 recommended for GPU training on Windows
```bash
# Clone the repository
git clone https://github.com/d2r3v/PillCare.git
cd PillCare

# Install dependencies
pip install -r requirements.txt
```

```bash
# Train the vision model
python models/train_v2.py

# Train the CRNN OCR model
python scripts/train_crnn.py --data_dir=ocr_dataset_epillid --epochs=100 --batch_size=32

# Train the gated fusion model
python scripts/train_fusion.py
```

```bash
# Expand dataset from ePillID source images
python scripts/expand_dataset.py

# Offline data augmentation (3-5x more training images)
python scripts/augment_dataset.py --target 150
```

```bash
# Evaluate baseline
python scripts/evaluate_baseline.py

# Evaluate fusion model
python scripts/evaluate_fusion.py
```

```bash
# Run the full pipeline on an image
python scripts/run_fusion_pipeline.py --image path_to_pill_image.jpg
```

```
PillCare/
├── pill_dataset_split/        # Visual dataset (train/val/test, 16 classes)
├── ocr_dataset_epillid/       # OCR dataset (images + labels)
├── data/
│   ├── ePillID_data/          # Source ePillID images
│   └── label_map.json         # Class index → drug name mapping
├── models/
│   ├── train.py               # Original vision training (Keras 2)
│   ├── train_v2.py            # Vision training (Keras 3, class weights)
│   ├── vision_model_v2.keras  # Trained vision model
│   ├── crnn_epillid.h5        # Trained CRNN weights
│   └── fusion_model.h5        # Trained fusion model
├── scripts/
│   ├── train_fusion.py        # Fusion model training (gated fusion)
│   ├── train_crnn.py          # CRNN training script
│   ├── evaluate_fusion.py     # Fusion evaluation
│   ├── evaluate_baseline.py   # Baseline evaluation
│   ├── expand_dataset.py      # Dataset expansion via openFDA API
│   ├── augment_dataset.py     # Offline data augmentation
│   ├── run_fusion_pipeline.py # Fusion inference
│   ├── preprocess.py          # Image preprocessing
│   ├── classify.py            # Vision classification
│   └── ocr_crnn.py            # OCR prediction
├── logs/                      # Training logs & evaluation reports
├── plots/                     # Training history plots
└── README.md
```
- Combine visual and text recognition for more accurate identification
- Re-train vision model in Keras 3 for full weight transfer to fusion
- Expand dataset to 16 drug classes (707 images)
- Implement confidence gating for OCR branch
- Further augmentation to 250+ images per class
- Late fusion (ensemble) approach for combining models
- Improve weakest classes (hydrochlorothiazide, lisinopril)
- Develop mobile application with TFLite deployment
- Implement real-time inference on mobile devices
This project went through several iterations. Here are the key challenges encountered and how they were resolved:
Problem: Unfreezing the full MobileNetV2 backbone in Phase 2 caused train accuracy to drop 30-40% immediately, destroying all progress from Phase 1.
What was tried:
- Full unfreeze with low LR (1e-5) — accuracy crashed
- Partial unfreeze (last 30 layers) with LR 1e-6 — still dropped ~10%
- Various learning rate schedules
Solution: Disabled Phase 2 entirely. With small datasets (<1000 images), training only the fusion head on top of frozen pretrained features is more stable and performant.
Problem: The fusion model (53%) initially performed worse than vision-only (77%). The CRNN text features added noise for pills without readable imprints.
| Approach | Accuracy | Outcome |
|---|---|---|
| Vision-only | 77% | Best standalone |
| Naive fusion (concat) | 53% | OCR noise dominated |
| Fusion + class weights | 72% | Improved but still behind |
| Fusion + confidence gating | 68% | Gate added too many params |
| Fusion + gating + augmentation | 81% | Finally competitive |
Key insight: The confidence gating concept was right (learn when to ignore OCR), but it only worked once we had enough data for the gate to learn meaningful patterns. Architecture improvements without data are ineffective.
Problem: Every architecture trick gave diminishing returns. The real bottleneck was always data.
| Change | Impact |
|---|---|
| Confidence gating | -4% (hurt with small data) |
| Class weights | +19% (53% → 72%) |
| Disable Phase 2 | +3% |
| Data augmentation (3x) | +9% (biggest single gain) |
Lesson: With small datasets, simple approaches (frozen backbone + class weights + more data) consistently outperform complex ones (gating, multi-phase fine-tuning).
Problem: After expanding to 16 classes, 6 drugs had 0% precision/recall — the model simply never predicted them.
Root cause: Amoxicillin had 12 training images while diltiazem had 70. The cross-entropy loss optimized for the majority classes.
Solution: `sklearn.utils.class_weight.compute_class_weight("balanced")` upweighted rare classes proportionally. Combined with targeted augmentation (generating more images for underrepresented classes), all 16 classes achieved non-zero performance.
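The "balanced" heuristic weights each class by `n_samples / (n_classes * count_c)`, so a class with half the images gets twice the loss weight. A numpy sketch of the same formula (replicating sklearn's documented behavior, not the project's code):

```python
import numpy as np

def balanced_class_weights(y):
    """Replicates compute_class_weight("balanced"):
    weight_c = n_samples / (n_classes * count_c)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy imbalance mirroring diltiazem (70 images) vs amoxicillin (12 images):
y = [0] * 70 + [1] * 12
weights = balanced_class_weights(y)  # the rare class gets ~5.8x the weight
```

The resulting dict can be passed directly to Keras via `model.fit(..., class_weight=weights)`.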
Problem: The original vision model (`best_model.h5`) couldn't be loaded in TensorFlow 2.16+ (Keras 3) due to Sequential model format changes and the deprecated `.h5` save format.
Solution: Rewrote training with the Functional API and the `.keras` save format. Pre-trained the vision model separately, then loaded its MobileNetV2 weights into the fusion model.
Problem: Label smoothing (0.1) + cosine LR decay reduced test accuracy (fusion 83.78% → 82.88%, vision 80% → 78%).
Key insight: With small, noisy datasets, the model benefits more from stronger supervision (hard labels) than softer targets. Label smoothing dilutes the learning signal when every training example counts. Cosine decay reduced the LR too aggressively before the model fully converged. Reverted both changes — the simple constant LR + hard cross-entropy remained the best configuration.
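For reference, label smoothing with factor ε replaces the one-hot target with `y * (1 - ε) + ε / K` over K classes; with ε = 0.1 and 16 classes, the correct-class target drops from 1.0 to about 0.906. A quick numpy check of the standard formula (not project code):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Standard label smoothing: soften hard targets toward uniform."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

hard = np.eye(16)[3]        # one-hot target for class 3 of 16
soft = smooth_labels(hard)  # correct class: 0.90625, all others: 0.00625
```

That ~9% reduction in target confidence per example is the "diluted supervision" referred to above, which a 707-image dataset could not afford.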
This project explores:
- Transfer learning for visual recognition
- Sequence learning with CRNN and CTC loss
- Multi-modal fusion with confidence gating for pill identification
- Class-balanced training for imbalanced medical datasets
- Automated dataset expansion using FDA drug databases
- Model optimization for edge devices
[Your License Here]