The official deployment platform for the MixCap Multimodal Captioning Project. This full-stack application automates the entire pipeline: video processing, feature extraction (BLIP-2 + Wav2Vec2), and caption generation.
See the application in action, including real-time caption generation and the user interface workflow.
Click the image above to watch the full demo on YouTube.
This application bridges the gap between complex deep learning research and end-users. It allows users to upload raw video files and receive accurate, natural language descriptions generated by the MixCap Model.
This repository contains the full-stack web platform for MixCap.
- Frontend: React-based UI for video upload and caption display
- Backend: Flask API that handles preprocessing, feature extraction, and inference
- AI Pipeline: BLIP‑2 (vision) + Wav2Vec2 (audio) fused by the custom MixCap model
This repository does not include large model weights.
This repository contains the Application Logic (React/Flask). To explore the core research, model architecture, and training notebooks, visit the Research Repository:
MixCap Research & Model Repository
- User uploads a video from the React frontend
- Flask backend extracts frames and audio using FFmpeg
- Visual and audio features are extracted
- MixCap model generates a caption
- Caption is returned to the frontend
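The steps above can be sketched as one orchestration function. All names and stub bodies here are illustrative stand-ins, not the actual functions in `Be/app.py`:

```python
# Illustrative sketch of the caption pipeline; the stub bodies stand in
# for the real FFmpeg, BLIP-2, Wav2Vec2, and MixCap calls.
from pathlib import Path


def extract_frames_and_audio(video: Path):
    # Real backend: FFmpeg splits the upload into frames and a WAV track.
    return ["frame_0.jpg"], Path("audio.wav")


def extract_features(frames, audio):
    # Real backend: BLIP-2 encodes the frames, Wav2Vec2 encodes the audio.
    return {"visual": [0.1], "audio": [0.2]}


def run_mixcap(features):
    # Real backend: the fused MixCap model decodes a caption.
    return "a placeholder caption"


def caption_video(video: Path) -> str:
    frames, audio = extract_frames_and_audio(video)
    features = extract_features(frames, audio)
    return run_mixcap(features)
```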
The directory structure below matches this repository exactly. Folder names are case‑sensitive.
MIXCAP-WEB-PLATFORM/
├── Be/ # Backend (Flask)
│ ├── app.py # Flask API entry point
│ ├── features/ # Feature-related logic
│ ├── scripts/ # FFmpeg + extraction scripts
│ ├── utils/ # Model loading & inference
│ ├── tokenizer/ # SentencePiece tokenizer
│ ├── uploads/ # Temporarily stored uploads
│ └── models/
│ ├── blip2-opt-2/ # BLIP-2 model files
│ └── MixCap/ # MixCap_model_only.pth
│
├── Fe/ # Frontend (React)
│ ├── public/
│ └── src/
│ ├── assets/
│ ├── components/
│ ├── pages/
│ ├── App.js
│ ├── App.css
│ ├── index.js
│ └── index.css
│
├── README.md
└── LICENSE

Large AI model files are NOT included in this repository.
Why:
- They exceed GitHub's file size limits
- Excluding them keeps the repository clean and lightweight
This is intentional.
Download the official BLIP‑2 model from Hugging Face:
https://huggingface.co/Salesforce/blip2-opt-2.7b
Setup:
- Go to Be/models/
- Create a folder named blip2-opt-2
- Place all BLIP‑2 files inside it
MixCap_model_only.pth is not uploaded to this repository. It is provided separately via the research project or direct distribution.
Setup:
- Go to Be/models/
- Create a folder named MixCap
- Place MixCap_model_only.pth inside it
Without this file, inference will not run.
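Before starting the backend, you can confirm the weights are in place with a small check like the one below. The paths follow the layout above; the helper itself is illustrative, not part of the repository:

```python
# Verify that the locally downloaded model files exist before launching Flask.
from pathlib import Path


def models_present(base: str = "Be/models") -> dict[str, bool]:
    """Return which of the two required model locations are populated."""
    root = Path(base)
    return {
        "blip2": (root / "blip2-opt-2").is_dir(),
        "mixcap": (root / "MixCap" / "MixCap_model_only.pth").is_file(),
    }
```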
- Python 3.9+
- Node.js 18+
- FFmpeg installed and available in PATH
  - macOS: brew install ffmpeg
  - Linux: sudo apt install ffmpeg
  - Windows: manual install
- CUDA‑enabled GPU (recommended)
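A quick way to verify these prerequisites before setup is a small check script (a hypothetical helper, not included in the repository):

```python
# Check that the external tools the backend relies on are reachable.
import shutil
import sys


def missing_prereqs() -> list[str]:
    """Return a list of missing prerequisites (empty means all present)."""
    missing = []
    if sys.version_info < (3, 9):
        missing.append("Python 3.9+")
    if shutil.which("ffmpeg") is None:
        missing.append("FFmpeg (must be on PATH)")
    if shutil.which("node") is None:
        missing.append("Node.js 18+")
    return missing


if __name__ == "__main__":
    problems = missing_prereqs()
    print("All prerequisites found" if not problems else problems)
```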
cd Be
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

Create requirements.txt:
flask==3.0.0
flask-cors==4.0.0
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.30.0
numpy
Pillow
sentencepiece
werkzeug

Install & run:
pip install -r requirements.txt
python app.py

Backend runs at: http://127.0.0.1:5000
cd Fe
npm install
npm start

Frontend runs at: http://localhost:3000
| Method | Endpoint | Description |
|---|---|---|
| POST | /upload | Uploads a video, runs FFmpeg and feature extraction |
| POST | /generate_caption | Runs MixCap inference on the extracted features |
| POST | /save_caption | Saves the result to CSV for user history |
| GET | /health | Checks that the backend and model are loaded correctly |
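For example, the /health and /generate_caption endpoints can be exercised from a small standard-library client. The JSON field name "video_id" is an assumption about the request shape, not a documented contract:

```python
# Minimal client sketch for the backend API (assumed request/response shapes).
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"


def health_ok() -> bool:
    # GET /health: expect HTTP 200 when the backend and model are loaded.
    try:
        with urllib.request.urlopen(f"{BASE_URL}/health") as resp:
            return resp.status == 200
    except OSError:
        # Backend not running or unreachable.
        return False


def generate_caption(video_id: str) -> dict:
    # POST /generate_caption with an assumed JSON body {"video_id": ...}.
    req = urllib.request.Request(
        f"{BASE_URL}/generate_caption",
        data=json.dumps({"video_id": video_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```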
Ravindu Layanga
