This repository contains a comprehensive collection of Natural Language Processing (NLP) experiments, focusing on text preprocessing, feature engineering techniques, and model implementation for sentiment analysis and text classification.
The project demonstrates the end-to-end workflow of building NLP applications, from raw text cleaning to training advanced machine learning models. It explores various vectorization strategies and their impact on model performance.
Rigorous text cleaning steps were implemented to prepare raw text for modeling:
- Text Cleaning: Removal of HTML tags, special characters, and punctuation using Regex.
- Normalization: Lowercasing and whitespace trimming.
- Tokenization: Splitting text into individual tokens.
- Stopword Removal: Eliminating common words (e.g., "the", "is") using NLTK stopwords.
- Stemming & Lemmatization: Reducing words to their root forms (implied usage in notebooks).
We explored multiple methods to convert text into numerical vectors:
- Bag of Words (BOW): Implemented using `CountVectorizer` to create sparse frequency matrices.
- TF-IDF (Term Frequency-Inverse Document Frequency): Implemented using `TfidfVectorizer` to weight terms based on importance.
- Word2Vec:
  - Trained custom Word2Vec embeddings using Gensim on the Game of Thrones dataset to capture semantic relationships between words.
  - Explored both CBOW and Skip-gram architectures.
- One-Hot Encoding (OHE) / Label Encoding:
  - Used `LabelEncoder` for transforming categorical target variables (sentiment labels).
  - (OHE is conceptually covered for handling categorical metadata features.)
Implemented and evaluated various machine learning models:
- Sentiment Analysis (IMDB Dataset):
  - Classifying movie reviews as Positive or Negative.
  - Models Used:
    - Naive Bayes (`GaussianNB`): Probabilistic classifier suitable for high-dimensional text data.
    - Random Forest (`RandomForestClassifier`): Ensemble learning method for robust classification.
  - Performance Comparison: Analyzed accuracy differences between BOW and TF-IDF features.
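A minimal sketch of the training-and-comparison workflow, using a tiny invented corpus in place of the full IMDB dataset (the reviews and labels below are placeholders for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in corpus; the notebooks use the real IMDB reviews instead.
reviews = [
    "a wonderful heartfelt film", "great acting and a great plot",
    "truly enjoyable from start to finish", "brilliant direction",
    "a dull and boring mess", "terrible acting throughout",
    "a complete waste of time", "awful plot and worse dialogue",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42)

# GaussianNB requires dense input, hence .toarray() on the sparse matrix.
nb = GaussianNB().fit(X_train.toarray(), y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("NB accuracy:", accuracy_score(y_test, nb.predict(X_test.toarray())))
print("RF accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```

Swapping `TfidfVectorizer` for `CountVectorizer` here reproduces the BOW-vs-TF-IDF comparison described above.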
| File / Notebook | Description |
|---|---|
| `text-classification.ipynb` | Main notebook for Sentiment Analysis on the IMDB dataset. Includes preprocessing, vectorization (BOW/TF-IDF), and model training (NB, RF). |
| `game-of-thrones-word2vec.ipynb` | Training a custom Word2Vec model on Game of Thrones text to find semantic similarities (e.g., similar characters/houses). |
| `bow-with-preprocessing...ipynb` | Deep dive into Bag-of-Words with various preprocessing configurations. |
| `text-preprocessing.ipynb` | Standalone guide and experiments with text cleaning and tokenization techniques. |
| `initial_EDA.ipynb` | Exploratory Data Analysis to understand dataset distribution and text characteristics. |
| `word2vec_demo.ipynb` | Simple demonstration of Word2Vec capabilities. |
To replicate the experiments, install the required dependencies:

    pip install numpy pandas scikit-learn gensim nltk matplotlib seaborn

- Preprocessing Matters: Proper cleaning significantly reduces noise and the dimensionality of the feature space.
- BOW vs TF-IDF: TF-IDF generally outperforms BOW by filtering out frequent but uninformative terms.
- Embeddings: Word2Vec captures semantic meaning (e.g., "King" - "Man" + "Woman" ≈ "Queen") which simple frequency-based methods miss.