An end-to-end Machine Learning pipeline for predicting movie ratings with high precision.
Leveraging XGBoost, Random Forest, and advanced feature engineering to analyze cinematic success.
The project is structured as a robust regression pipeline designed to predict IMDB scores based on historical cinematic data. The workflow follows a systematic approach to ensure data integrity and model generalization:
- Exploratory Data Analysis (EDA): Statistical analysis of features such as genre, budget, and cast influence.
- Data Preprocessing: Handling of missing values, categorical encoding, and log transformations for skewed distributions.
- Feature Engineering: Extraction of meaningful insights from cast and crew data to enhance predictive power.
- Model Selection & Evaluation: Comparative analysis of multiple regressors (Linear, KNN, Decision Trees, Random Forest, XGBoost) to optimize for RMSE and R² metrics.
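The comparative step above can be sketched as a loop over candidate regressors sharing scikit-learn's `fit`/`predict` interface (XGBoost's `XGBRegressor` plugs into the same loop). The data here is a synthetic stand-in, not the actual movie features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in for the engineered movie features and IMDB scores
X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Linear": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE={rmse:.3f}, R2={r2_score(y_test, preds):.3f}")
```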
```
IMDB_Prediction/
├── LICENSE                                   # MIT License
├── README.md                                 # Project documentation
├── .gitattributes                            # Git configuration for attributes
│
├── Code and Dataset/                         # Core machine learning development
│   ├── IMDB Movie Ratings Prediction.ipynb   # Comprehensive Jupyter Notebook (EDA + Modeling)
│   └── movie_metadata.csv                    # Raw dataset containing 5000+ movie records
│
└── Documents/                                # Scientific and presentational materials
    ├── IEEE_Report.pdf                       # Technical research paper in IEEE format
    └── Poster.pdf                            # Visual summary and presentation poster
```
The system identifies and handles missing values within the movie_metadata.csv. Features with excessive null values are pruned, while others are imputed based on statistical medians or modes.
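A minimal sketch of that pruning-and-imputation step, assuming a pandas DataFrame loaded from the CSV (the column names and the 50% null threshold here are illustrative, not taken from the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for movie_metadata.csv
df = pd.DataFrame({
    "budget": [1e6, np.nan, 3e6, 2e6],
    "content_rating": ["PG", None, "R", "PG"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# Prune features whose null ratio is excessive (threshold is an assumption)
df = df.loc[:, df.isnull().mean() < 0.5]

# Impute numeric columns with the median, categorical columns with the mode
df["budget"] = df["budget"].fillna(df["budget"].median())
df["content_rating"] = df["content_rating"].fillna(df["content_rating"].mode()[0])
```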
To handle the high variance in movie budgets and gross earnings, log transformations are applied. This normalizes the distribution, allowing models like Linear Regression to perform more effectively.
Categorical variables such as genre and director_name are transformed into numerical representations using encoding techniques, ensuring the models can ingest non-numeric cinematic data.
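One way to sketch that encoding step is with scikit-learn's `LabelEncoder`; the director names below are placeholders:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"director_name": ["Nolan", "Scott", "Nolan", "Cameron"]})

# Map each unique category to a stable integer code (sorted alphabetically)
encoder = LabelEncoder()
df["director_encoded"] = encoder.fit_transform(df["director_name"])
```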
```python
# Conceptual transformation logic
df['gross_log'] = np.log1p(df['gross'])
df['budget_log'] = np.log1p(df['budget'])
```

The dataset is split into training and testing sets (typically 75/25 or 80/20) to evaluate the model's ability to generalize to unseen movie data.
During training, the system monitors Mean Squared Error (MSE) and R² Score. The XGBoost regressor is iteratively tuned to achieve the lowest possible error rates.
This directory contains the primary intellectual property of the project.
- IMDB Movie Ratings Prediction.ipynb: This notebook is the heart of the project. It covers the full data science lifecycle, from importing `pandas` and `numpy` to visualizing data distributions with `seaborn`. It implements seven distinct machine learning algorithms, providing a comparative framework for cinematic success prediction.
- movie_metadata.csv: A comprehensive dataset sourced from Kaggle, featuring 28 attributes for over 5000 movies. Key features include director names, lead actors, genres, and social media metrics (Facebook likes).
This section provides the formal academic context for the project.
- IEEE_Report.pdf: A high-level technical document detailing the methodology, mathematical foundations of the algorithms used (e.g., the loss functions in XGBoost), and a deep dive into the results. It follows standard IEEE publication guidelines.
- Poster.pdf: A condensed, visual representation of the project designed for academic conferences or project showcases. It highlights the key findings, such as the superiority of ensemble methods over linear models.
| Component | Details |
|---|---|
| Programming Language | Python 3.8+ |
| Primary Libraries | Pandas, NumPy, Scikit-Learn, XGBoost, Matplotlib, Seaborn |
| Dataset Volume | 5043 Records, 28 Columns |
| Preprocessing Techniques | Label Encoding, Log Transformation, Missing Value Imputation |
| Algorithm Suite | Linear, Lasso, Ridge Regression, KNN, Decision Tree, Random Forest, XGBoost |
| Development Environment | Jupyter Notebook / Anaconda |
| Model | RMSE (Train) | RMSE (Test) | R² (Train) | R² (Test) | Accuracy |
|---|---|---|---|---|---|
| Linear Regression | 0.119 | 0.120 | 0.411 | 0.387 | 95.43% |
| Decision Tree | 0.029 | 0.049 | 0.716 | 0.148 | 97.0% |
| Random Forest | 0.014 | 0.035 | 0.925 | 0.565 | 97.74% |
| KNN | 0.040 | 0.049 | 0.443 | 0.159 | 96.74% |
| Lasso Regression | 0.044 | 0.043 | 0.351 | 0.346 | 97.15% |
| Ridge Regression | 0.044 | 0.043 | 0.351 | 0.346 | 97.15% |
| XGBoost | 0.004 | 0.033 | 0.991 | 0.611 | 97.86% |
To initialize a local instance of this repository, execute the following commands in your terminal:
```
git clone https://github.com/Zer0-Bug/IMDB_Prediction.git
cd IMDB_Prediction
```

The project dependencies are managed via pip. It is highly recommended to use an isolated virtual environment to prevent dependency conflicts:
```
# Optional: Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`

pip install numpy pandas scikit-learn xgboost matplotlib seaborn jupyterlab
```

The primary analytical engine is contained within the Jupyter Notebook. To reproduce the results:
- Launch JupyterLab: `jupyter lab`
- Navigate to the `Code and Dataset/` folder via the sidebar.
- Open `IMDB Movie Ratings Prediction.ipynb`.
- Execute all cells (`Run > Run All Cells`) to observe the EDA and model benchmarking.
- Incorporating NLP Techniques: Analyzing movie reviews to enhance prediction accuracy.
- Using Deep Learning: Implementing neural networks to capture complex relationships in the data.
- Expanding Feature Set: Adding social media metrics, box-office earnings, and critic scores.
Contributions are always appreciated. Open-source projects grow through collaboration, and any improvement—whether a bug fix, new feature, documentation update, or suggestion—is valuable.
To contribute, please follow the steps below:
- Fork the repository.
- Create a new branch for your change: `git checkout -b feature/your-feature-name`
- Commit your changes with a clear and descriptive message: `git commit -m "Add: brief description of the change"`
- Push your branch to your fork: `git push origin feature/your-feature-name`
- Open a Pull Request describing the changes made.
All contributions are reviewed before being merged. Please ensure that your changes follow the existing code style and include relevant documentation or tests where applicable.
- Sharda, R., & Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 31(3), 481–490.
- Lee, K., Park, J., Kim, I., & Choi, Y. (2018). Predicting movie success with machine learning techniques: ways to improve accuracy. Information Systems Frontiers, 20(3), 577–588.
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
- Saurav, S. (2023). IMDB score prediction for movies [Notebook]. Kaggle.