
📄🔉 Real-Time Video Generation 📽️🖼️

Modular, adaptive serving stack for real-time multi-modal generation (e.g., video, audio, images). It dynamically balances latency, cost, and quality, and supports both streaming generation (real-time playback) and offline workloads.

It uses a cluster manager called StreamWise. We have implemented multiple applications that run on top of StreamWise. For example, StreamCast is an application that generates real-time video podcasts from input documents (e.g., PDFs).


Important

This project focuses on systems research — specifically the infrastructure, scheduling, provisioning, and serving aspects of multi-modal generation workloads. The application workloads are used to stress-test and evaluate the system, not to assess or guarantee the quality of the generated content. Outputs may be inconsistent, contain visual artifacts, or otherwise be degraded — this is orthogonal to the research goals. This project is not designed for production use.


🚀 Features

  • Model on-boarding for 25+ multi-modal models (video, audio, image, LLMs)
  • Provisioning of GPUs, replicas, and model variants
  • Deadline-aware request scheduler for streaming workloads
  • Adaptive quality (resolution, FPS, sampling steps)
  • Multi-GPU + cross-region support
  • Spot-aware optimization to reduce cost
  • Caching, batching, and GPU frequency scaling

🏗 Architecture

StreamWise consists of four components:

  • Model on-boarding: packaging and standardizing multi-modal models
  • Provisioning: selecting hardware, GPUs, and model replicas
  • Scheduling: orchestrating requests under latency constraints
  • Execution: running requests efficiently inside a model instance

*(Figure: StreamWise architecture overview)*

📦 Model wrapper and on-boarding

We package each model as a Docker container based on an NVIDIA image with GPU drivers and runtime tools. Each container embeds our Instance Manager, which standardizes the interface for executing inference requests. We adapt existing inference code (typically from Hugging Face) to this interface and bundle it with the model weights. A Python wrapper exposes an HTTP endpoint for existing multi-modal generation models (e.g., Flux or Wan), allowing clients to trigger multi-modal generations (e.g., video from text) and collect statistics. The manager also handles request batching and adjusts GPU frequencies to optimize resource usage.
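To make the Instance Manager's role concrete, here is a minimal sketch of a request-batching manager in Python. The names (`GenerationRequest`, `InstanceManager`, `_run_model`) and the fields are illustrative assumptions, not the actual API of this repository:

```python
import time
from dataclasses import dataclass, field

@dataclass
class GenerationRequest:
    prompt: str
    modality: str = "video"          # "video", "audio", or "image"
    resolution: tuple = (1280, 720)
    num_frames: int = 48

@dataclass
class InstanceManager:
    batch_size: int = 4
    stats: list = field(default_factory=list)
    _queue: list = field(default_factory=list)

    def submit(self, req: GenerationRequest):
        """Queue a request; flush when a full batch has accumulated."""
        self._queue.append(req)
        if len(self._queue) >= self.batch_size:
            return self._flush()
        return []

    def _flush(self):
        batch, self._queue = self._queue, []
        start = time.time()
        outputs = [self._run_model(r) for r in batch]
        # Collect per-batch statistics, as the real manager does for the scheduler.
        self.stats.append({"batch": len(batch), "latency_s": time.time() - start})
        return outputs

    def _run_model(self, req: GenerationRequest):
        # Placeholder: real code would invoke the wrapped Hugging Face pipeline here.
        return f"{req.modality}:{req.num_frames}f@{req.resolution[1]}p"
```

In the real system this logic sits behind the wrapper's HTTP endpoint inside each container; GPU frequency scaling would hook in around `_flush`.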

For the complete list of wrapped models with full details and classification, see Model Wrapper documentation.

The characteristics for each model are in (services.json). These characteristics include quality (Elo ranking), frame rate (FPS), maximum number of frames (video length), number of attention heads, VAE compression ratios, supported resolutions, and other relevant attributes. More details here.
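To illustrate how these characteristics might be consumed, the snippet below loads a hypothetical `services.json` fragment and filters models by supported resolution. The field names follow the attributes listed above (Elo quality, FPS, max frames, resolutions), but the exact schema is an assumption — see the linked documentation for the real one:

```python
import json

# Illustrative services.json entries; schema and values are made up.
services = json.loads("""
{
  "wan-t2v":    {"elo": 1120, "fps": 16,   "max_frames": 81,
                 "resolutions": [[832, 480], [1280, 720]]},
  "flux-image": {"elo": 1185, "fps": null, "max_frames": 1,
                 "resolutions": [[1024, 1024]]}
}
""")

def models_supporting(width, height):
    """Return names of models whose listed resolutions include (width, height)."""
    return [name for name, spec in services.items()
            if [width, height] in spec["resolutions"]]
```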

⚙️ Provisioning hardware and models

We frame hardware and model selection for a workload (e.g., a 10-minute medium-quality video podcast) as an optimization problem. After selecting a configuration, the hardware and model provisioners handle setup accordingly. More details here.
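The flavor of this optimization can be sketched as a toy brute-force search: pick the cheapest (GPU, model-variant) pair that satisfies a quality floor and a per-clip latency deadline. All names and numbers below are made up for illustration; the actual provisioner's formulation and solver may differ:

```python
from itertools import product

# Hypothetical candidate hardware and model variants.
gpus = {"A100": {"cost_per_hr": 3.7, "speedup": 1.0},
        "H100": {"cost_per_hr": 6.9, "speedup": 1.8}}
variants = {"fast": {"elo": 1050, "base_sec_per_clip": 20},
            "hq":   {"elo": 1180, "base_sec_per_clip": 55}}

def provision(min_elo, deadline_sec_per_clip):
    """Cheapest feasible (gpu, variant) pair, or None if none qualifies."""
    best, best_cost = None, float("inf")
    for g, v in product(gpus, variants):
        latency = variants[v]["base_sec_per_clip"] / gpus[g]["speedup"]
        if variants[v]["elo"] < min_elo or latency > deadline_sec_per_clip:
            continue  # violates the quality floor or the deadline
        cost = gpus[g]["cost_per_hr"] * latency / 3600  # $ per clip
        if cost < best_cost:
            best, best_cost = (g, v), cost
    return best
```

A spot-aware variant of this search would additionally weight each GPU's cost by preemption risk.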

📅 Request scheduler

The request scheduler orchestrates execution using a live, iterative version of our greedy algorithm informed by the request DAG. More details here.
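A minimal sketch of a deadline-aware greedy pass over a request DAG: repeatedly run the ready request with the earliest deadline, releasing its dependents as it completes. This mirrors the idea described above but is not the repository's actual algorithm:

```python
import heapq

def schedule(deadlines, deps):
    """deadlines: {request: deadline}; deps: {request: set of prerequisites}.
    Returns an execution order (earliest-deadline-first among ready requests)."""
    indeg = {r: len(deps.get(r, ())) for r in deadlines}
    children = {r: [] for r in deadlines}
    for r, prereqs in deps.items():
        for p in prereqs:
            children[p].append(r)
    ready = [(d, r) for r, d in deadlines.items() if indeg[r] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, r = heapq.heappop(ready)   # earliest-deadline ready request first
        order.append(r)
        for c in children[r]:         # release dependents as prerequisites finish
            indeg[c] -= 1
            if indeg[c] == 0:
                heapq.heappush(ready, (deadlines[c], c))
    return order
```

In the live, iterative version, deadlines and the DAG are updated as segments stream out, and the greedy choice is re-evaluated on each completion.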

⚙️ Applications

We implemented multiple workflows for multi-modal generation. More details here.

🚀 Deployment

We build StreamWise on top of a Kubernetes (K8s) cluster: a widely adopted cluster manager that enables modular deployment, auto-scaling, service discovery, and fault tolerance. More details here.
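As a rough illustration, a wrapped model instance could be deployed as a standard K8s Deployment requesting one GPU. The resource name, labels, and image below are hypothetical, not the project's actual manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: streamwise-model-instance      # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: streamwise-model-instance
  template:
    metadata:
      labels:
        app: streamwise-model-instance
    spec:
      containers:
        - name: instance-manager
          image: example.azurecr.io/streamwise/wan:latest  # hypothetical image
          ports:
            - containerPort: 8000      # wrapper HTTP endpoint
          resources:
            limits:
              nvidia.com/gpu: 1        # one GPU per model instance
```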

☸️ Kubernetes

Our Docker containers are built to run on K8s.

☁️ Azure Kubernetes Service (AKS)

To deploy on Azure Kubernetes Service (AKS) follow the instructions here.

📄 Citation

If you use StreamWise in research, please cite:

@article{streamwise2026,
  title={{StreamWise: Adaptive Serving for Real-Time Multi-Modal Generation}},
  author={Qiu, Haoran and Chaudry, Gohar Irfan and Zhang, Chaojie and Goiri, {\'I}{\~n}igo and Choukse, Esha and Fonseca, Rodrigo and Bianchini, Ricardo},
  journal={arXiv:2603.05800},
  year={2026}
}

🤝 Contributing

Pull requests are welcome! Please open an issue for major changes. More details here.

📜 License

MIT License.