PCA-Based Digit Visualization

This repository demonstrates how Principal Component Analysis (PCA) can be applied to pixel-based handwritten digit images.
The project focuses on dimensionality reduction, visual interpretation of components, and understanding how PCA transforms high-dimensional image data into a meaningful latent space.


Overview

This project applies Principal Component Analysis (PCA) to the classic Handwritten Digits dataset.
The workflow focuses on:

  • Extracting pixel data from an 8×8 grayscale image grid
  • Visualizing sample digits as heatmaps and images
  • Reducing 64-dimensional pixel data to 2D and 3D PCA space
  • Understanding variance explained by principal components
  • Visualizing digit clusters based on PCA compressed features

PCA suits image data of this kind, where each pixel is a feature and neighbouring pixels tend to be strongly correlated.


Dataset

  • Name: Digits Dataset (Scikit-Learn)
  • Source: Data set
  • Records: 1,797 handwritten digits (0–9)
  • Features:
    • 64 pixel intensity values (8×8 image)
    • number_label (target digit)

Each image is stored as a flattened array of 64 grayscale values.


Project Workflow

1) Loading the Dataset

Data is loaded from a CSV file containing the 64 pixel values and the digit label for each image.

  • Display dataset info
  • Verify 64 pixel columns
  • Confirm no missing values
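The checks above can be sketched as follows. The CSV path is not given here, so this sketch builds an equivalent DataFrame from scikit-learn's bundled digits data as a stand-in. Renaming the `target` column to `number_label` is an assumption made to match the column name used in this README.

```python
import pandas as pd
from sklearn.datasets import load_digits

# Stand-in for the CSV file (assumption): with the real file you would
# call pd.read_csv(<path to the CSV>) instead of the two lines below.
digits = load_digits(as_frame=True)
df = digits.frame.rename(columns={"target": "number_label"})

# Display dataset info: column dtypes and non-null counts.
df.info()

# Verify that there are 64 pixel columns besides the label.
pixel_columns = [c for c in df.columns if c != "number_label"]

# Confirm there are no missing values anywhere in the table.
n_missing = int(df.isna().sum().sum())

print(len(df), len(pixel_columns), n_missing)  # 1797 64 0
```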

2) Pixel Extraction

Created a new DataFrame:

pixels = df.drop("number_label", axis=1)

Extracted a single digit (the first row), converted it to a NumPy array, and reshaped it into an 8×8 grid.
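A minimal sketch of this step, again using scikit-learn's bundled digits as a stand-in for the CSV (an assumption). The name `first_digit` is illustrative.

```python
from sklearn.datasets import load_digits

# Stand-in for the CSV-backed DataFrame (assumption).
df = load_digits(as_frame=True).frame.rename(columns={"target": "number_label"})

# Keep only the 64 pixel columns.
pixels = df.drop("number_label", axis=1)

# First row -> NumPy array of length 64 -> 8x8 image grid.
first_digit = pixels.iloc[0].to_numpy().reshape(8, 8)

print(first_digit.shape)            # (8, 8)
print(df["number_label"].iloc[0])   # label of the first sample
```

The reshape works because each row holds the image in row-major order, so every consecutive run of 8 values becomes one image row.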

3) Visualizing Digits

Three visualizations:

  • matplotlib.imshow (default colormap)
  • matplotlib.imshow(cmap='gray')
  • seaborn.heatmap with pixel intensities

These help interpret the pixel intensities and confirm data structure.
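The three views could be produced side by side roughly like this. The figure size, the `viridis` colormap for the heatmap, and the choice of the first sample are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits

# First sample of scikit-learn's bundled digits, reshaped to 8x8.
image = load_digits().data[0].reshape(8, 8)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# 1) imshow with matplotlib's default colormap.
axes[0].imshow(image)
axes[0].set_title("imshow (default colormap)")

# 2) imshow in grayscale.
axes[1].imshow(image, cmap="gray")
axes[1].set_title("imshow (cmap='gray')")

# 3) seaborn heatmap with the intensity printed in each cell.
sns.heatmap(image, annot=True, fmt=".0f", cmap="viridis", cbar=False, ax=axes[2])
axes[2].set_title("seaborn heatmap")

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to display or save the figure.
```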

4) Scaling Pixel Features

Used:

StandardScaler()

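Scaling matters because PCA is driven by variance: without it, pixel positions with a wider spread of intensities dominate the leading components. StandardScaler standardizes each pixel column (not each image) to zero mean and unit variance.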
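As a rough check of what the scaler does, the sketch below standardizes the bundled scikit-learn digits (a stand-in for the CSV) and inspects per-column statistics. One detail worth knowing: a pixel column that is constant across all images has zero variance, and StandardScaler leaves such a column at zero rather than dividing by zero.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data  # shape (n_samples, 64)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Statistics are computed per column, i.e. per pixel position.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# Columns that never vary in the raw data stay at 0 after scaling.
constant_cols = np.flatnonzero(X.std(axis=0) == 0)

print("max |mean| after scaling:", np.abs(col_means).max())
print("distinct std values:", np.unique(col_stds.round(6)))
print("constant pixel columns:", constant_cols)
```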

5) PCA with 2 Components

Performed dimensionality reduction:

  • PCA(n_components=2)
  • Projected all digits to 2D PCA space
  • Visualized using a color-coded scatter plot (hue = digit label)

Several digit classes form visible groups even with only 2 components, although some classes overlap at this level of compression.

Explained variance of PC1+PC2:

~21.59%
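A minimal sketch of the 2-component step, assuming the bundled scikit-learn digits as the data source. The scatter plot here uses plain matplotlib with the label as the color value, which is a swap for a seaborn `hue` plot; the `tab10` colormap and marker size are illustrative choices.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Project the 64 standardized pixel features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each component, and their sum.
ratios = pca.explained_variance_ratio_
total = ratios.sum()
print(f"PC1: {ratios[0]:.2%}  PC2: {ratios[1]:.2%}  total: {total:.2%}")

# Scatter plot colored by digit label.
fig, ax = plt.subplots(figsize=(7, 6))
points = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=12)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.colorbar(points, ax=ax, label="digit label")
```

The printed total is the quantity reported above as the explained variance of PC1+PC2.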

6) PCA with 3 Components

Extended the model:

  • PCA(n_components=3)
  • 3D scatter plot using Matplotlib's 3D axis
  • Color-coded by digit label

This provides an even clearer separation for some digit classes.
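The 3-component variant could look like the sketch below, under the same assumptions as before (bundled scikit-learn digits, `tab10` colormap chosen for illustration). The cumulative explained variance shows how much the third component adds.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Project onto the first 3 principal components.
pca3 = PCA(n_components=3)
X_3d = pca3.fit_transform(X_scaled)

# Running total of explained variance after 1, 2 and 3 components.
cumulative = np.cumsum(pca3.explained_variance_ratio_)
print("cumulative explained variance:", cumulative.round(4))

# 3D scatter on Matplotlib's 3D axis, colored by digit label.
fig = plt.figure(figsize=(8, 7))
ax = fig.add_subplot(projection="3d")
points = ax.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2],
                    c=digits.target, cmap="tab10", s=10)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
fig.colorbar(points, ax=ax, label="digit label", shrink=0.7)
```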


Libraries Used

  • numpy
  • pandas
  • seaborn
  • matplotlib
  • scikit-learn (PCA, StandardScaler)

How to Run

Clone the repository:

GitHub Repository

Install dependencies:

pip install -r requirements.txt
  • requirements.txt → File

or directly:

pip install numpy pandas seaborn matplotlib scikit-learn

Run the script to generate all visualizations.


Results Summary

PCA with 2 Components

  • Captures ~21.6% of the variance.
  • Despite low variance percentage, digits form recognizable clusters.
  • Demonstrates that a 2D PCA projection can retain useful class structure while discarding most of the pixel-level variance.

PCA with 3 Components

  • Better separation of digits in 3D space.
  • Useful for interactive visualization and cluster analysis.

Image Reconstruction Insight

Each digit's raw pixel row can be reshaped back into an 8×8 grid to visually confirm the sample. This is a reshape of the original pixel data, not a reconstruction from the principal components.


Key Takeaways

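  • PCA does not cluster data itself, but its projections reveal digit groupings even with few components
  • 64-dimensional pixel data can be projected to 2D and 3D for visualization, although the 2D projection keeps only ~21.6% of the variance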
  • Variance captured by early components contains meaningful structure
  • PCA is suitable for visualization, preprocessing, and noise reduction

Conclusion

This project demonstrates how PCA transforms high-dimensional pixel data into low-dimensional latent space:

  • Meaningful digit clusters emerge even with 2 or 3 components
  • Useful for visualization, feature extraction, and preprocessing for ML models
  • Highlights the power of dimensionality reduction on image datasets

PCA remains a foundational tool in exploratory data analysis for image-based machine learning tasks.


Author ✍️

Author: Ali
Field: Data Science & Machine Learning Student
Email: ali.hz87980@gmail.com
GitHub: ali-119