nirdesho6o/NLP-Progress

NLP Project: Feature Engineering & Model Experiments

This repository contains a comprehensive collection of Natural Language Processing (NLP) experiments, focusing on text preprocessing, feature engineering techniques, and model implementation for sentiment analysis and text classification.

🚀 Project Overview

The project demonstrates the end-to-end workflow of building NLP applications, from raw text cleaning to training advanced machine learning models. It explores various vectorization strategies and their impact on model performance.

🛠 Built Components & Functionality

1. Preprocessing Pipeline

Rigorous text cleaning steps were implemented to prepare raw text for modeling:

  • Text Cleaning: Removal of HTML tags, special characters, and punctuation using regular expressions.
  • Normalization: Lowercasing and whitespace trimming.
  • Tokenization: Splitting text into individual tokens.
  • Stopword Removal: Eliminating common words (e.g., "the", "is") using NLTK stopwords.
  • Stemming & Lemmatization: Reducing words to their root forms (implied usage in notebooks).
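The pipeline above can be sketched as a single function. This is a minimal, self-contained version: it uses a tiny inline stopword list as a stand-in for NLTK's full English list, and plain whitespace splitting for tokenization.

```python
import re

# Small stand-in stopword list; the notebooks use NLTK's full English list.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it"}

def preprocess(text: str) -> list[str]:
    """Clean raw text and return a list of tokens."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # drop special characters/punctuation
    text = text.lower().strip()               # normalize case and whitespace
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<br>The movie is GREAT, truly!"))  # → ['movie', 'great', 'truly']
```

Stemming/lemmatization would slot in as a final step, e.g. mapping each surviving token through NLTK's `PorterStemmer` or `WordNetLemmatizer`.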

2. Feature Engineering Techniques

We explored multiple methods to convert text into numerical vectors:

  • Bag of Words (BOW): Implemented using CountVectorizer to create sparse frequency matrices.
  • TF-IDF (Term Frequency-Inverse Document Frequency): Implemented using TfidfVectorizer to weight terms based on importance.
  • Word2Vec:
    • Trained custom Word2Vec embeddings using Gensim on the Game of Thrones dataset to capture semantic relationships between words.
    • Explored both CBOW and Skip-gram architectures.
  • One-Hot Encoding (OHE) / Label Encoding:
    • Used LabelEncoder for transforming categorical target variables (Sentiment labels).
    • (OHE is conceptually covered for handling categorical metadata features).

3. NLP Models & Use Cases

Implemented and evaluated various machine learning models:

  • Sentiment Analysis (IMDB Dataset):
    • Classifying movie reviews as Positive or Negative.
    • Models Used:
      • Naive Bayes (GaussianNB): Probabilistic classifier; note that GaussianNB requires dense input, so sparse BOW/TF-IDF matrices must be converted (e.g., with .toarray()).
      • Random Forest (RandomForestClassifier): Ensemble learning method for robust classification.
    • Performance Comparison: Analyzed accuracy differences between BOW and TF-IDF features.
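The training loop can be sketched as follows. The four reviews below are a hypothetical stand-in for the IMDB data, and the hyperparameters (`n_estimators=50`) are illustrative, not the notebooks' actual settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Tiny stand-in for the IMDB reviews; labels: 1 = positive, 0 = negative.
reviews = [
    "great film loved it", "wonderful acting brilliant plot",
    "terrible movie hated it", "boring script awful acting",
]
labels = [1, 1, 0, 0]

X = TfidfVectorizer().fit_transform(reviews)

# GaussianNB does not accept sparse matrices, so the features are densified.
nb = GaussianNB().fit(X.toarray(), labels)
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, labels)

print(nb.predict(X.toarray()), rf.predict(X))
```

Swapping `TfidfVectorizer` for `CountVectorizer` here is all it takes to run the BOW-vs-TF-IDF comparison on held-out data.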

📂 Repository Structure

| File / Notebook | Description |
| --- | --- |
| text-classification.ipynb | Main notebook for Sentiment Analysis on the IMDB dataset. Includes preprocessing, vectorization (BOW/TF-IDF), and model training (NB, RF). |
| game-of-thrones-word2vec.ipynb | Training a custom Word2Vec model on Game of Thrones text to find semantic similarities (e.g., similar characters/houses). |
| bow-with-preprocessing...ipynb | Deep dive into Bag-of-Words with various preprocessing configurations. |
| text-preprocessing.ipynb | Standalone guide and experiments with text cleaning and tokenization techniques. |
| initial_EDA.ipynb | Exploratory Data Analysis to understand dataset distribution and text characteristics. |
| word2vec_demo.ipynb | Simple demonstration of Word2Vec capabilities. |

🔧 Setup & Installation

To replicate the experiments, install the required dependencies:

```shell
pip install numpy pandas scikit-learn gensim nltk matplotlib seaborn
```
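NLTK ships without its corpora, so the stopword list (and, for lemmatization, WordNet) must be downloaded once; a sketch, assuming these are the packages the notebooks rely on:

```python
import nltk

# One-time download of NLTK data: stopwords for filtering,
# punkt for tokenization, wordnet for lemmatization.
for pkg in ("stopwords", "punkt", "wordnet"):
    nltk.download(pkg, quiet=True)
```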

📊 Key Insights

  • Preprocessing Matters: Proper cleaning significantly reduces noise and feature space dimension.
  • BOW vs TF-IDF: TF-IDF generally outperforms BOW by down-weighting frequent but uninformative terms.
  • Embeddings: Word2Vec captures semantic meaning (e.g., "King" - "Man" + "Woman" ≈ "Queen") which simple frequency-based methods miss.

About

Repository to store and share my learnings in NLP. This includes pre-processing, feature engineering, text representations such as n-grams, TF-IDF, word2vec, etc., and projects.
