
MixCap Web Platform


The official deployment platform for the MixCap Multimodal Captioning Project. This full-stack application automates the entire pipeline: video processing, feature extraction (BLIP-2 + Wav2Vec2), and caption generation.


Live Demo & Walkthrough

See the application in action, including real-time caption generation and the user interface workflow.

MixCap Demo Video

Click the image above to watch the full demo on YouTube.


Overview

This application bridges the gap between complex deep learning research and end-users. It allows users to upload raw video files and receive accurate, natural language descriptions generated by the MixCap Model.

This repository contains the full-stack web platform for MixCap.

  • Frontend: React-based UI for video upload and caption display
  • Backend: Flask API that handles preprocessing, feature extraction, and inference
  • AI Pipeline: BLIP‑2 (vision) + Wav2Vec2 (audio) fused by the custom MixCap model

This repository does not include large model weights.

The Brain Behind the App

This repository contains the Application Logic (React/Flask). To explore the core research, model architecture, and training notebooks, visit the Research Repository:

MixCap Research & Model Repository


System Flow

  1. User uploads a video from the React frontend
  2. Flask backend extracts frames and audio using FFmpeg
  3. Visual and audio features are extracted
  4. MixCap model generates a caption
  5. Caption is returned to the frontend
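Step 2 of the flow above can be sketched as follows. This is a hypothetical illustration of how the backend might build its FFmpeg commands; the output paths, frame rate, and audio sampling rate are assumptions, not taken from the repository's scripts.

```python
# Hypothetical sketch of the FFmpeg step: split an uploaded video into
# sampled frames and a mono 16 kHz audio track (the rate Wav2Vec2 expects).
from pathlib import Path

def ffmpeg_commands(video: str, workdir: str, fps: int = 1):
    out = Path(workdir)
    frames_cmd = [
        "ffmpeg", "-i", video,
        "-vf", f"fps={fps}",            # sample `fps` frames per second
        str(out / "frame_%04d.jpg"),
    ]
    audio_cmd = [
        "ffmpeg", "-i", video,
        "-vn",                          # drop the video stream
        "-ar", "16000", "-ac", "1",     # 16 kHz mono for Wav2Vec2
        str(out / "audio.wav"),
    ]
    return frames_cmd, audio_cmd
```

Each list can be passed to `subprocess.run(...)`; the real scripts in `Be/scripts/` may use different options.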

Project Structure (Exact)

The directory structure below matches this repository exactly. Folder names are case‑sensitive.

MIXCAP-WEB-PLATFORM/
├── Be/                              # Backend (Flask)
│   ├── app.py                       # Flask API entry point
│   ├── features/                    # Feature-related logic
│   ├── scripts/                     # FFmpeg + extraction scripts
│   ├── utils/                       # Model loading & inference
│   ├── tokenizer/                   # SentencePiece tokenizer
│   ├── uploads/                     # Temporarily stored uploads
│   └── models/                      
│       ├── blip2-opt-2/             # BLIP-2 model files
│       └── MixCap/                  # MixCap_model_only.pth
│
├── Fe/                              # Frontend (React)
│   ├── public/
│   └── src/
│       ├── assets/
│       ├── components/
│       ├── pages/
│       ├── App.js
│       ├── App.css
│       ├── index.js
│       └── index.css
│
├── README.md
└── LICENSE

Model Files (Important)

Large AI model files are NOT included in this repository.

Why:

  • They exceed GitHub's file size limits
  • Excluding them keeps the repository clean and lightweight

This is intentional.


Model Setup

BLIP‑2 (Visual Encoder)

Download the official BLIP‑2 model from Hugging Face:

https://huggingface.co/Salesforce/blip2-opt-2.7b

Setup:

  1. Go to Be/models/
  2. Create folder: blip2-opt-2
  3. Place all BLIP‑2 files inside it
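Once the files are in place, the local copy can be loaded with the Hugging Face `transformers` library. This is a minimal sketch, not the repository's actual loading code in `Be/utils/`; the device handling is an assumption.

```python
# Sketch: load the local BLIP-2 copy from the folder created above.
from pathlib import Path

BLIP2_DIR = Path("Be/models/blip2-opt-2")

def load_blip2(device: str = "cpu"):
    # Import deferred so the path constant works without transformers installed.
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    processor = Blip2Processor.from_pretrained(BLIP2_DIR)
    model = Blip2ForConditionalGeneration.from_pretrained(BLIP2_DIR).to(device)
    return processor, model
```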

MixCap Model Checkpoint

  • MixCap_model_only.pth is not uploaded to this repository
  • Provided separately via the research project or direct distribution

Setup:

  1. Go to Be/models/
  2. Create folder: MixCap
  3. Place MixCap_model_only.pth inside

Without this file, inference will not run.
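A hedged sketch of loading the checkpoint with PyTorch is below. It assumes `MixCap_model_only.pth` is a plain state dict that matches the MixCap model class defined in `Be/utils/`; if the file is saved differently, adapt accordingly.

```python
# Sketch: load the MixCap checkpoint into an already-constructed model.
from pathlib import Path

CKPT = Path("Be/models/MixCap/MixCap_model_only.pth")

def load_mixcap(model):
    import torch  # deferred so the path check works without PyTorch
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state)
    model.eval()  # inference mode: disable dropout etc.
    return model
```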


System Requirements

  • Python 3.9+
  • Node.js 18+
  • FFmpeg installed and available in PATH
    • macOS: brew install ffmpeg
    • Linux: sudo apt install ffmpeg
    • Windows: download a build manually and add its bin directory to PATH
  • CUDA‑enabled GPU (recommended)
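Since a missing FFmpeg only surfaces at upload time, a quick startup check can save debugging. A standard-library-only sketch (not part of the repository):

```python
# Sanity check that FFmpeg is on PATH before starting the backend.
import shutil
import sys

def check_ffmpeg() -> bool:
    path = shutil.which("ffmpeg")
    if path is None:
        print("FFmpeg not found on PATH; install it and re-run.", file=sys.stderr)
        return False
    return True
```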


Installation

Backend (Flask)

cd Be
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

Create requirements.txt:

flask==3.0.0
flask-cors==4.0.0
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.30.0
numpy
Pillow
sentencepiece
werkzeug

Install & run:

pip install -r requirements.txt
python app.py

Backend runs at: http://127.0.0.1:5000


Frontend (React)

cd Fe
npm install
npm start

Frontend runs at: http://localhost:3000


API Endpoints

Method  Endpoint           Description
------  -----------------  ---------------------------------------------------
POST    /upload            Uploads a video, runs FFmpeg and feature extraction
POST    /generate_caption  Runs MixCap inference on the extracted features
POST    /save_caption      Saves the result to CSV for user history
GET     /health            Checks that the backend and model loaded correctly
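The /health endpoint makes a convenient smoke test once both servers are up. A standard-library client sketch (the JSON shape of the response is not documented here, so only the status code is checked):

```python
# Ping the backend's /health endpoint; returns False if it is unreachable.
from urllib import request

def backend_healthy(base: str = "http://127.0.0.1:5000") -> bool:
    try:
        with request.urlopen(base + "/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False
```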

Author

Ravindu Layanga
