The official deployment platform for the MixCap Multimodal Captioning Project. This full-stack application automates the entire pipeline: video processing, feature extraction (BLIP-2 + Wav2Vec2), and caption generation.
See the application in action, including real-time caption generation and the user interface workflow.
Click the image above to watch the full demo on YouTube.
This application bridges the gap between complex deep learning research and end-users. It allows users to upload raw video files and receive accurate, natural language descriptions generated by the MixCap Model.
This repository contains the full-stack web platform for MixCap.
- Frontend: React-based UI for video upload and caption display
- Backend: Flask API that handles preprocessing, feature extraction, and inference
- AI Pipeline: BLIP‑2 (vision) + Wav2Vec2 (audio) fused by the custom MixCap model
This repository does not include large model weights.
This repository contains the Application Logic (React/Flask). To explore the core research, model architecture, and training notebooks, visit the Research Repository:
MixCap Research & Model Repository
- User uploads a video from the React frontend
- Flask backend extracts frames and audio using FFmpeg
- Visual and audio features are extracted
- MixCap model generates a caption
- Caption is returned to the frontend
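The steps above can be sketched as one orchestration function. All names and stub bodies here are illustrative stand-ins, not the actual functions in `Be/app.py`:

```python
# Illustrative sketch of the caption pipeline; the stub bodies stand in
# for the real FFmpeg, BLIP-2, Wav2Vec2, and MixCap calls.
from pathlib import Path


def extract_frames_and_audio(video: Path):
    # Real backend: FFmpeg splits the upload into frames and a WAV track.
    return ["frame_0.jpg"], Path("audio.wav")


def extract_features(frames, audio):
    # Real backend: BLIP-2 encodes the frames, Wav2Vec2 encodes the audio.
    return {"visual": [0.1], "audio": [0.2]}


def run_mixcap(features):
    # Real backend: the fused MixCap model decodes a caption.
    return "a placeholder caption"


def caption_video(video: Path) -> str:
    frames, audio = extract_frames_and_audio(video)
    features = extract_features(frames, audio)
    return run_mixcap(features)
```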
The directory structure below matches this repository exactly. Folder names are case‑sensitive.
MIXCAP-WEB-PLATFORM/
├── Be/ # Backend (Flask)
│ ├── app.py # Flask API entry point
│ ├── features/ # Feature-related logic
│ ├── scripts/ # FFmpeg + extraction scripts
│ ├── utils/ # Model loading & inference
│ ├── tokenizer/ # SentencePiece tokenizer
│ ├── uploads/ # Temporarily stored uploads
│ └── models/
│ ├── blip2-opt-2/ # BLIP-2 model files
│ └── MixCap/ # MixCap_model_only.pth
│
├── Fe/ # Frontend (React)
│ ├── public/
│ └── src/
│ ├── assets/
│ ├── components/
│ ├── pages/
│ ├── App.js
│ ├── App.css
│ ├── index.js
│ └── index.css
│
├── README.md
└── LICENSE

Large AI model files are NOT included in this repository.
Why:
- They exceed GitHub's file size limits
- Excluding them keeps the repository clean and lightweight
This is intentional.
Download the official BLIP‑2 model from Hugging Face:
https://huggingface.co/Salesforce/blip2-opt-2.7b
Setup:
- Go to Be/models/
- Create a folder named blip2-opt-2
- Place all BLIP‑2 files inside it
MixCap_model_only.pth is not uploaded to this repository. It is provided separately via the research project or direct distribution.
Setup:
- Go to Be/models/
- Create a folder named MixCap
- Place MixCap_model_only.pth inside it
Without this file, inference will not run.
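Before starting the backend, you can confirm the weights are in place with a small check like the one below. The paths follow the layout above; the helper itself is illustrative, not part of the repository:

```python
# Verify that the locally downloaded model files exist before launching Flask.
from pathlib import Path


def models_present(base: str = "Be/models") -> dict[str, bool]:
    """Return which of the two required model locations are populated."""
    root = Path(base)
    return {
        "blip2": (root / "blip2-opt-2").is_dir(),
        "mixcap": (root / "MixCap" / "MixCap_model_only.pth").is_file(),
    }
```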
- Python 3.9+
- Node.js 18+
- FFmpeg installed and available in PATH
  - macOS: brew install ffmpeg
  - Linux: sudo apt install ffmpeg
  - Windows: manual install
- CUDA‑enabled GPU (recommended)
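A quick way to verify these prerequisites before setup is a small check script (a hypothetical helper, not included in the repository):

```python
# Check that the external tools the backend relies on are reachable.
import shutil
import sys


def missing_prereqs() -> list[str]:
    """Return a list of missing prerequisites (empty means all present)."""
    missing = []
    if sys.version_info < (3, 9):
        missing.append("Python 3.9+")
    if shutil.which("ffmpeg") is None:
        missing.append("FFmpeg (must be on PATH)")
    if shutil.which("node") is None:
        missing.append("Node.js 18+")
    return missing


if __name__ == "__main__":
    problems = missing_prereqs()
    print("All prerequisites found" if not problems else problems)
```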
cd Be
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

Create requirements.txt:
flask==3.0.0
flask-cors==4.0.0
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.30.0
numpy
Pillow
sentencepiece
werkzeug

Install & run:
pip install -r requirements.txt
python app.py

Backend runs at: http://127.0.0.1:5000
cd Fe
npm install
npm start

Frontend runs at: http://localhost:3000
| Method | Endpoint | Description |
|---|---|---|
| POST | /upload | Uploads a video, runs FFmpeg and feature extraction |
| POST | /generate_caption | Runs MixCap inference on the extracted features |
| POST | /save_caption | Saves the result to CSV for user history |
| GET | /health | Checks that the backend and model are loaded correctly |
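For example, the /health and /generate_caption endpoints can be exercised from a small standard-library client. The JSON field name "video_id" is an assumption about the request shape, not a documented contract:

```python
# Minimal client sketch for the backend API (assumed request/response shapes).
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"


def health_ok() -> bool:
    # GET /health: expect HTTP 200 when the backend and model are loaded.
    try:
        with urllib.request.urlopen(f"{BASE_URL}/health") as resp:
            return resp.status == 200
    except OSError:
        # Backend not running or unreachable.
        return False


def generate_caption(video_id: str) -> dict:
    # POST /generate_caption with an assumed JSON body {"video_id": ...}.
    req = urllib.request.Request(
        f"{BASE_URL}/generate_caption",
        data=json.dumps({"video_id": video_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```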
Ravindu Layanga
