This repository contains a comprehensive collection of Natural Language Processing (NLP) experiments, focusing on text preprocessing, feature engineering techniques, and model implementation for sentiment analysis and text classification.
The project demonstrates the end-to-end workflow of building NLP applications, from raw text cleaning to training advanced machine learning models. It explores various vectorization strategies and their impact on model performance.
Rigorous text cleaning steps were implemented to prepare raw text for modeling:
- Text Cleaning: Removal of HTML tags, special characters, and punctuation using Regex.
- Normalization: Lowercasing and whitespace trimming.
- Tokenization: Splitting text into individual tokens.
- Stopword Removal: Eliminating common words (e.g., "the", "is") using NLTK stopwords.
- Stemming & Lemmatization: Reducing words to their root forms (implied usage in notebooks).
We explored multiple methods to convert text into numerical vectors:
- Bag of Words (BOW): Implemented using `CountVectorizer` to create sparse frequency matrices.
- TF-IDF (Term Frequency-Inverse Document Frequency): Implemented using `TfidfVectorizer` to weight terms based on importance.
- Word2Vec:
  - Trained custom Word2Vec embeddings using Gensim on the Game of Thrones dataset to capture semantic relationships between words.
  - Explored both CBOW and Skip-gram architectures.
- One-Hot Encoding (OHE) / Label Encoding:
  - Used `LabelEncoder` for transforming categorical target variables (sentiment labels).
  - (OHE is conceptually covered for handling categorical metadata features.)
Implemented and evaluated various machine learning models:
- Sentiment Analysis (IMDB Dataset):
  - Classifying movie reviews as Positive or Negative.
  - Models Used:
    - Naive Bayes (`GaussianNB`): Probabilistic classifier suitable for high-dimensional text data.
    - Random Forest (`RandomForestClassifier`): Ensemble learning method for robust classification.
  - Performance Comparison: Analyzed accuracy differences between BOW and TF-IDF features.
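A minimal sketch of the training-and-comparison workflow, using a tiny invented corpus in place of the full IMDB dataset (the reviews and labels below are placeholders for illustration only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in corpus; the notebooks use the real IMDB reviews instead.
reviews = [
    "a wonderful heartfelt film", "great acting and a great plot",
    "truly enjoyable from start to finish", "brilliant direction",
    "a dull and boring mess", "terrible acting throughout",
    "a complete waste of time", "awful plot and worse dialogue",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=42)

# GaussianNB requires dense input, hence .toarray() on the sparse matrix.
nb = GaussianNB().fit(X_train.toarray(), y_train)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("NB accuracy:", accuracy_score(y_test, nb.predict(X_test.toarray())))
print("RF accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```

Swapping `TfidfVectorizer` for `CountVectorizer` here reproduces the BOW-vs-TF-IDF comparison described above.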
| File / Notebook | Description |
|---|---|
| `text-classification.ipynb` | Main notebook for Sentiment Analysis on the IMDB dataset. Includes preprocessing, vectorization (BOW/TF-IDF), and model training (NB, RF). |
| `game-of-thrones-word2vec.ipynb` | Training a custom Word2Vec model on Game of Thrones text to find semantic similarities (e.g., similar characters/houses). |
| `bow-with-preprocessing...ipynb` | Deep dive into Bag-of-Words with various preprocessing configurations. |
| `text-preprocessing.ipynb` | Standalone guide and experiments with text cleaning and tokenization techniques. |
| `initial_EDA.ipynb` | Exploratory Data Analysis to understand dataset distribution and text characteristics. |
| `word2vec_demo.ipynb` | Simple demonstration of Word2Vec capabilities. |
To replicate the experiments, install the required dependencies:

    pip install numpy pandas scikit-learn gensim nltk matplotlib seaborn

- Preprocessing Matters: Proper cleaning significantly reduces noise and the dimensionality of the feature space.
- BOW vs TF-IDF: TF-IDF generally outperforms BOW by filtering out frequent but uninformative terms.
- Embeddings: Word2Vec captures semantic meaning (e.g., "King" - "Man" + "Woman" ≈ "Queen") which simple frequency-based methods miss.