
📦 Model wrapper and on-boarding

We wrap publicly available Hugging Face models behind our own HTTP REST interface. Each model is packaged as a Docker container based on an NVIDIA image that provides GPU drivers and runtime tools. Each container embeds our Instance Manager, which standardizes the interface for executing inference requests; we adapt existing inference code (typically from Hugging Face) to this interface and bundle it with the model weights, as sketched below.
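As an illustration of the adaptation pattern, a wrapped model might expose a single request handler like the one below. The class name, method signature, and request fields are hypothetical stand-ins for our internal Instance Manager API, not its actual definition.

```python
# Hypothetical adapter sketch: Hugging Face inference code behind a
# uniform request handler. Names and request fields are illustrative.
import torch
from diffusers import DiffusionPipeline

class FluxInstance:
    """Wraps a Diffusers pipeline behind one standardized entry point."""

    def __init__(self, model_dir: str):
        # Weights are bundled inside the container, so load from a local path.
        self.pipe = DiffusionPipeline.from_pretrained(
            model_dir, torch_dtype=torch.bfloat16
        ).to("cuda")

    def infer(self, request: dict) -> dict:
        # The Instance Manager routes incoming HTTP REST requests here,
        # regardless of the underlying model family.
        image = self.pipe(
            prompt=request["prompt"],
            num_inference_steps=request.get("steps", 28),
            width=request.get("width", 1024),
            height=request.get("height", 1024),
        ).images[0]
        return {"image": image}
```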

📚 Libraries

We leverage the following:

  • Diffusers for a simple and unified inference interface.
  • xDiT for multi-GPU parallelization of the models.
  • vLLM for the OpenAI-compatible interface (see the sketch after this list).
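For example, once a container's vLLM server is running, clients can talk to it through the standard OpenAI client; the endpoint, port, and model name below are illustrative.

```python
# Query a vLLM OpenAI-compatible server (assumed here to be serving
# Llama 3.2 locally on port 8000; endpoint and model name are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```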

🤖 Models

| Model Name | Class |
| --- | --- |
| Fantasy Talking | 🔤🖼️🔊➔🎥 Text+Image+Audio to Video |
| FLUX | 🔤➔🖼️ Text to Image |
| FLUX Upscaler | 🖼️➔🖼️ Image to Image |
| FLUX Krea | 🔤➔🖼️ Text to Image |
| FLUX Kontext | 🖼️➔🖼️ Image to Image |
| 4KAgent | 🖼️➔🖼️ Image to Image |
| HiDream I1 | 🔤➔🖼️ Text to Image |
| Qwen Image | 🔤➔🖼️ Text to Image |
| Qwen Image Edit | 🖼️➔🖼️ Image to Image |
| Janus Pro | 🔤➔🖼️ Text to Image |
| LlamaGen | 🔤➔🖼️ Text to Image |
| Bagel | 🖼️➔🖼️ Image to Image |
| Hunyuan Image | 🔤➔🖼️ Text to Image |
| Hunyuan FramePack | 🔤🖼️➔🎥 Text+Image to Video |
| Hunyuan FramePack F1 | 🔤🖼️➔🎥 Text+Image to Video |
| Hunyuan Avatar | 🔤🖼️🔊➔🎥 Text+Image+Audio to Video |
| Kokoro | 🔤➔🔊 Text to Audio |
| XTTS | 🔤➔🔊 Text to Audio |
| ThinkSound | 🎥➔🔊 Video to Audio |
| VibeVoice | 🔤➔🔊 Text to Audio |
| Wan 2.1 | 🔤🖼️➔🎥 Text+Image to Video |
| Wan 2.2 | 🔤🖼️➔🎥 Text+Image to Video |
| YOLO | 🖼️➔🖼️ Image to Image |
| Image Resize | 🖼️➔🖼️ Image to Image |
| Real-ESRGAN | 🖼️➔🖼️ Image to Image |
| LTX-Video | 🔤🖼️➔🎥 Text+Image to Video |
| LongCat-Video | 🔤🖼️➔🎥 Text+Image to Video |
| Gemma 3 | 🤖 LLM |
| Llama 3.2 | 🤖 LLM |
| whisper | 🔊➔🔤 Audio to Text |

The characteristics of each model are listed in services.json. These include quality (Elo ranking), frame rate (FPS), maximum number of frames (video length), number of attention heads, VAE compression ratios, supported resolutions, and other relevant attributes.
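A consumer of these characteristics might query them as follows; the schema and key names (class, elo, resolutions, ...) are illustrative, not the actual layout of services.json.

```python
# Hypothetical sketch of selecting a model from services.json;
# the key names used here are illustrative, not the real schema.
import json

with open("services.json") as f:
    services = json.load(f)

# e.g., pick the highest-Elo text+image-to-video model supporting 720p.
candidates = [
    s for s in services
    if s["class"] == "Text+Image to Video" and "1280x720" in s["resolutions"]
]
best = max(candidates, key=lambda s: s["elo"])
print(best["name"], best["fps"], best["max_frames"])
```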

📊 Profiling

We generate simple model profiles to estimate runtime and resource usage, exploiting the fact that runtime scales roughly proportionally with key parameters (e.g., pixel count, frame count). We benchmark a representative configuration (e.g., 1+16 frames, 10 steps, 640 x 400 resolution) and validate the resulting profile against additional test points. We also measure peak power, energy, and temperature. These data feed predictive models for performance, cost, and quality under different configurations.
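Estimation then reduces to scaling the benchmarked reference point, roughly as in the sketch below; the reference runtime is a placeholder, not a measured value.

```python
# Minimal sketch of profile-based runtime estimation: scale one
# benchmarked reference configuration proportionally in each key
# parameter. The reference runtime below is a placeholder.

REF = {"pixels": 640 * 400, "frames": 17, "steps": 10, "runtime_s": 42.0}

def estimate_runtime(width: int, height: int, frames: int, steps: int) -> float:
    """Assumes runtime grows proportionally with pixel count,
    frame count, and diffusion steps."""
    return (
        REF["runtime_s"]
        * (width * height) / REF["pixels"]
        * (frames / REF["frames"])
        * (steps / REF["steps"])
    )

print(estimate_runtime(1280, 720, 33, 20))  # extrapolated estimate
```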

⚑ Parallelism

Many diffusion models include native support for multi-GPU inference (e.g., Wan). For those that do not, we use USP (Unified Sequence Parallelism) from xDiT. We have enabled parallelism this way for four models (e.g., Fantasy Talking, Hunyuan FramePack), each requiring under two hours of work. The xfuser repository provides examples, and the process could be further streamlined with LLM-based coding agents.
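For intuition, the sketch below renders the Ulysses-style all-to-all exchange that USP builds on in plain PyTorch; it is a conceptual illustration, not the xfuser API we actually call. It would run under torchrun with one process per GPU, and assumes head count and sequence length are divisible by the world size.

```python
# Conceptual Ulysses-style sequence parallelism (the idea behind USP),
# not the xfuser API. Each rank holds a sequence shard; all-to-all
# trades it for a head shard over the full sequence, and back.
import torch
import torch.distributed as dist

def ulysses_attention(q, k, v):
    """q, k, v: per-rank shards of shape [seq/P, heads, dim]."""
    world = dist.get_world_size()

    def seq_to_head(x):
        # [S/P, H, D] -> [S, H/P, D]: full sequence, a subset of heads.
        s, h, d = x.shape
        x = x.reshape(s, world, h // world, d).permute(1, 0, 2, 3).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x)
        return out.reshape(world * s, h // world, d)

    def head_to_seq(x):
        # [S, H/P, D] -> [S/P, H, D]: back to the local sequence shard.
        s, h, d = x.shape
        x = x.reshape(world, s // world, h, d).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x)
        return out.permute(1, 0, 2, 3).reshape(s // world, world * h, d)

    q, k, v = seq_to_head(q), seq_to_head(k), seq_to_head(v)
    o = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)  # attention over the full sequence, local heads
    return head_to_seq(o)
```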

🎯 Accuracy

We use scikit-learn to fit linear models to the profiling data. The resulting runtime and cost profiles are over 99.9% accurate on our validation points.
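A fit of this kind might look as follows; the benchmark values in X and y are placeholders, not our measurements.

```python
# Sketch of fitting a linear runtime profile with scikit-learn.
# Rows of X are (pixels, frames, steps); y holds measured runtimes.
# All numbers below are placeholders, not real benchmark data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([
    [640 * 400, 17, 10],
    [640 * 400, 33, 10],
    [960 * 544, 17, 10],
    [640 * 400, 17, 20],
])
y = np.array([42.0, 81.5, 85.9, 83.6])

profile = LinearRegression().fit(X, y)
print(profile.score(X, y))                      # R^2 on training points
print(profile.predict([[1280 * 720, 33, 20]]))  # runtime estimate (s)
```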

πŸ† Quality

When on-boarding a model, StreamWise uses its Elo ranking from public leaderboards.