
IMDB Movie Ratings Prediction


An end-to-end Machine Learning pipeline for predicting movie ratings with high precision.

Leveraging XGBoost, Random Forest, and advanced feature engineering to analyze cinematic success.





Technical Architecture

The project is structured as a robust regression pipeline designed to predict IMDB scores based on historical cinematic data. The workflow follows a systematic approach to ensure data integrity and model generalization:

  1. Exploratory Data Analysis (EDA): Statistical analysis of features such as genre, budget, and cast influence.
  2. Data Preprocessing: Handling of missing values via imputation, categorical encoding, and log transformations for skewed numeric distributions.
  3. Feature Engineering: Extraction of meaningful insights from cast and crew data to enhance predictive power.
  4. Model Selection & Evaluation: Comparative analysis of multiple regressors (Linear, KNN, Decision Trees, Random Forest, XGBoost) to optimize for RMSE and R² metrics.


Project Structure

```
IMDB_Prediction/
├── LICENSE                                   # MIT License
├── README.md                                 # Project documentation
├── .gitattributes                            # Git configuration for attributes
│
├── Code and Dataset/                         # Core machine learning development
│   ├── IMDB Movie Ratings Prediction.ipynb   # Comprehensive Jupyter Notebook (EDA + Modeling)
│   └── movie_metadata.csv                    # Raw dataset containing 5000+ movie records
│
└── Documents/                                # Scientific and presentational materials
    ├── IEEE_Report.pdf                       # Technical research paper in IEEE format
    └── Poster.pdf                            # Visual summary and presentation poster
```


Processing Pipeline

1. Data Cleaning

The system identifies and handles missing values within the movie_metadata.csv. Features with excessive null values are pruned, while others are imputed based on statistical medians or modes.
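As a sketch of this step (the tiny frame, column names, and the 60% null-ratio threshold below are illustrative, not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for movie_metadata.csv
df = pd.DataFrame({
    "budget": [1e6, np.nan, 3e6, 4e6],
    "content_rating": ["PG", None, "PG", "R"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Prune features whose null ratio exceeds a threshold (assumed 60% here)
null_ratio = df.isnull().mean()
df = df.drop(columns=null_ratio[null_ratio > 0.6].index)

# Impute numeric columns with the median, categorical with the mode
df["budget"] = df["budget"].fillna(df["budget"].median())
df["content_rating"] = df["content_rating"].fillna(df["content_rating"].mode()[0])
```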

2. Feature Transformation

To handle the high variance in movie budgets and gross earnings, log transformations are applied. This normalizes the distribution, allowing models like Linear Regression to perform more effectively.

3. Categorical Encoding

Categorical variables such as genre and director_name are transformed into numerical representations using encoding techniques, ensuring the models can ingest non-numeric cinematic data.

```python
import numpy as np

# Conceptual transformation logic: log1p compresses heavy right tails
df['gross_log'] = np.log1p(df['gross'])
df['budget_log'] = np.log1p(df['budget'])
```
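The categorical-encoding step can be sketched with scikit-learn's `LabelEncoder`; the tiny frame and column values below are illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of categorical movie metadata
df = pd.DataFrame({
    "director_name": ["Nolan", "Scott", "Nolan"],
    "genres": ["Sci-Fi", "Action", "Sci-Fi"],
})

# Map each category to an integer code (classes are sorted alphabetically)
for col in ["director_name", "genres"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```

One-hot encoding is an alternative when the models should not infer an ordering from the integer codes.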

4. Model Training & Testing

The dataset is split into training and testing sets (typically 75/25 or 80/20) to evaluate the model's ability to generalize to unseen movie data.
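A minimal sketch of the split, assuming a 75/25 ratio and a synthetic stand-in for the engineered feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # stand-in feature matrix
y = np.arange(50, dtype=float)      # stand-in IMDB scores

# 75/25 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```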

5. Performance Monitoring

During training, the system monitors Root Mean Squared Error (RMSE) and the R² score. The XGBoost regressor is iteratively tuned to achieve the lowest possible error.
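A minimal sketch of this monitoring step on synthetic data, using scikit-learn's `RandomForestRegressor` as a stand-in (XGBoost's `XGBRegressor` exposes the same `fit`/`predict` interface):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression data standing in for the movie features
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(0, 0.1, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Monitor RMSE and R² on the fitted data
pred = model.predict(X)
rmse = mean_squared_error(y, pred) ** 0.5
r2 = r2_score(y, pred)
```

In practice these metrics are computed on both the train and test splits, as in the results table below, to detect overfitting.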



Detailed Module Specifications

1. Code and Dataset (Core Implementation)

This directory contains the primary intellectual property of the project.

  • IMDB Movie Ratings Prediction.ipynb: This notebook is the heart of the project. It includes the full data science lifecycle: from importing pandas and numpy to visualizing data distributions with seaborn. It implements five distinct machine learning algorithms, providing a comparative framework for cinematic success prediction.
  • movie_metadata.csv: A comprehensive dataset sourced from Kaggle, featuring 28 attributes for over 5000 movies. Key features include director names, lead actors, genres, and social media metrics (Facebook likes).

2. Documents (Scientific Reporting)

This section provides the formal academic context for the project.

  • IEEE_Report.pdf: A high-level technical document detailing the methodology, mathematical foundations of the algorithms used (e.g., the loss functions in XGBoost), and a deep dive into the results. It follows standard IEEE publication guidelines.
  • Poster.pdf: A condensed, visual representation of the project designed for academic conferences or project showcases. It highlights the key findings, such as the superiority of ensemble methods over linear models.


Technical Specifications

| Component | Details |
| --- | --- |
| Programming Language | Python 3.8+ |
| Primary Libraries | Pandas, NumPy, Scikit-Learn, XGBoost, Matplotlib, Seaborn |
| Dataset Volume | 5043 records, 28 columns |
| Preprocessing Techniques | Label Encoding, Log Transformation, Missing Value Imputation |
| Algorithm Suite | Linear, Lasso, Ridge Regression, KNN, Decision Tree, Random Forest, XGBoost |
| Development Environment | Jupyter Notebook / Anaconda |
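The algorithm suite can be benchmarked in one uniform loop, since every scikit-learn regressor (and `XGBRegressor`) shares the `fit`/`predict` interface. The sketch below uses synthetic data in place of the engineered movie features, and illustrative hyperparameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed feature matrix and scores
rng = np.random.default_rng(1)
X = rng.random((300, 5))
y = X.sum(axis=1) + rng.normal(0, 0.05, 300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Lasso": Lasso(alpha=0.001),
    "Ridge": Ridge(),
    "KNN": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
}

# Collect (test RMSE, test R²) per model
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    results[name] = (mean_squared_error(y_te, pred) ** 0.5, r2_score(y_te, pred))
```

Adding `XGBRegressor()` to the `models` dictionary extends the same loop to XGBoost.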


Model Performance Results

| Model | RMSE (Train) | RMSE (Test) | R² (Train) | R² (Test) | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Linear Regression | 0.119 | 0.120 | 0.411 | 0.387 | 95.43% |
| Decision Tree | 0.029 | 0.049 | 0.716 | 0.148 | 97.0% |
| Random Forest | 0.014 | 0.035 | 0.925 | 0.565 | 97.74% |
| KNN | 0.040 | 0.049 | 0.443 | 0.159 | 96.74% |
| Lasso Regression | 0.044 | 0.043 | 0.351 | 0.346 | 97.15% |
| Ridge Regression | 0.044 | 0.043 | 0.351 | 0.346 | 97.15% |
| XGBoost | 0.004 | 0.033 | 0.991 | 0.611 | 97.86% |


Deployment & Installation

Repository Acquisition

To initialize a local instance of this repository, execute the following commands in your terminal:

```shell
git clone https://github.com/Zer0-Bug/IMDB_Prediction.git
cd IMDB_Prediction
```

Environment Configuration

The project dependencies are managed via pip. It is highly recommended to use an isolated virtual environment to prevent dependency conflicts:

```shell
# Optional: create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install numpy pandas scikit-learn xgboost matplotlib seaborn jupyterlab
```

Running the Analysis

The primary analytical engine is contained within the Jupyter Notebook. To reproduce the results:

  1. Launch JupyterLab:

     ```shell
     jupyter lab
     ```

  2. Navigate to the Code and Dataset/ folder via the sidebar.
  3. Open IMDB Movie Ratings Prediction.ipynb.
  4. Execute all cells (Run > Run All Cells) to observe the EDA and model benchmarking.


Future Improvements

  • Incorporating NLP Techniques: Analyzing movie reviews to enhance prediction accuracy.
  • Using Deep Learning: Implementing neural networks to capture complex relationships in the data.
  • Expanding Feature Set: Adding social media metrics, box-office earnings, and critic scores.


Contribution

Contributions are always appreciated. Open-source projects grow through collaboration, and any improvement—whether a bug fix, new feature, documentation update, or suggestion—is valuable.

To contribute, please follow the steps below:

  1. Fork the repository.
  2. Create a new branch for your change:
    git checkout -b feature/your-feature-name
  3. Commit your changes with a clear and descriptive message:
    git commit -m "Add: brief description of the change"
  4. Push your branch to your fork:
    git push origin feature/your-feature-name
  5. Open a Pull Request describing the changes made.

All contributions are reviewed before being merged. Please ensure that your changes follow the existing code style and include relevant documentation or tests where applicable.

References

  1. Sharda, R., & Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 31(3), 481–490.

  2. Lee, K., Park, J., Kim, I., & Choi, Y. (2018). Predicting movie success with machine learning techniques: ways to improve accuracy. Information Systems Frontiers, 20(3), 577–588.

  3. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  4. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

  5. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).

  6. Saurav, S. (2023). IMDB score prediction for movies [Notebook]. Kaggle.


