We wrap publicly available Hugging Face models behind our own HTTP REST interface. We package each model as a Docker container based on an NVIDIA image that provides GPU drivers and runtime tools. Each container embeds our Instance Manager, which standardizes the interface for executing inference requests. We adapt the existing inference code (typically from Hugging Face) to this interface and bundle it with the model weights.
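To illustrate the idea of a standardized inference interface, the sketch below shows what a per-model backend contract could look like. The class and method names (`InferenceBackend`, `load`, `infer`) are hypothetical and purely illustrative; they are not the actual Instance Manager API.

```python
from abc import ABC, abstractmethod
from typing import Any


class InferenceBackend(ABC):
    """Hypothetical sketch of the interface a container-level manager could
    standardize on; names and signatures are illustrative only."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights onto the GPU once, at container start."""

    @abstractmethod
    def infer(self, request: dict[str, Any]) -> dict[str, Any]:
        """Run one inference request and return a JSON-serializable result."""


class EchoBackend(InferenceBackend):
    """Trivial stand-in backend used here only to exercise the interface."""

    def load(self) -> None:
        self.ready = True

    def infer(self, request: dict[str, Any]) -> dict[str, Any]:
        return {"status": "ok", "echo": request}


backend = EchoBackend()
backend.load()
result = backend.infer({"prompt": "a red fox"})
```

Adapting a Hugging Face pipeline then amounts to implementing `load` and `infer` for that model and packaging the result into the container image.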
We leverage the following:
- Diffusers to provide a simple and unified interface.
- xDiT for parallelization of the models.
- vLLM for the OpenAI interface.
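Because vLLM exposes an OpenAI-compatible endpoint, any standard chat-completions client works against it. The sketch below builds such a request with only the standard library; the host, port, and model name are illustrative placeholders, not our actual deployment.

```python
import json
import urllib.request

# Illustrative OpenAI-style chat-completions payload for a vLLM server.
# Endpoint URL and model name below are placeholders.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Describe this scene."}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send it; here we only construct it.
```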
| Model Name | Class |
|---|---|
| Fantasy Talking | π€πΌοΈπβπ₯ Text+Image+Audio to Video |
| FLUX | π€βπΌοΈ Text to Image |
| FLUX Upscaler | πΌοΈβπΌοΈ Image to Image |
| FLUX Krea | π€βπΌοΈ Text to Image |
| FLUX Kontext | πΌοΈβπΌοΈ Image to Image |
| 4KAgent | πΌοΈβπΌοΈ Image to Image |
| HiDream I1 | π€βπΌοΈ Text to Image |
| Qwen Image | π€βπΌοΈ Text to Image |
| Qwen Image Edit | πΌοΈβπΌοΈ Image to Image |
| Janus Pro | π€βπΌοΈ Text to Image |
| LlamaGen | π€βπΌοΈ Text to Image |
| Bagel | πΌοΈβπΌοΈ Image to Image |
| Hunyuan Image | π€βπΌοΈ Text to Image |
| Hunyuan FramePack | π€πΌοΈβπ₯ Text+Image to Video |
| Hunyuan FramePack F1 | π€πΌοΈβπ₯ Text+Image to Video |
| Hunyuan Avatar | π€πΌοΈπβπ₯ Text+Image+Audio to Video |
| Kokoro | π€βπ Text to Audio |
| XTTS | π€βπ Text to Audio |
| ThinkSound | π₯βπ Video to Audio |
| VibeVoice | π€βπ Text to Audio |
| Wan 2.1 | π€πΌοΈβπ₯ Text+Image to Video |
| Wan 2.2 | π€πΌοΈβπ₯ Text+Image to Video |
| YOLO | πΌοΈβπΌοΈ Image to Image |
| Image Resize | πΌοΈβπΌοΈ Image to Image |
| Real-ESRGAN | πΌοΈβπΌοΈ Image to Image |
| LTX-Video | π€πΌοΈβπ₯ Text+Image to Video |
| LongCat-Video | π€πΌοΈβπ₯ Text+Image to Video |
| Gemma 3 | π€ LLM |
| Llama 3.2 | π€ LLM |
| Whisper | πβπ€ Audio to Text |
The characteristics of each model are listed in `services.json`. These include quality (Elo ranking), frame rate (FPS), maximum number of frames (video length), number of attention heads, VAE compression ratios, supported resolutions, and other relevant attributes.
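For illustration, an entry in `services.json` might look like the following; the field names and values here are hypothetical, not the actual schema:

```json
{
  "wan-2.1": {
    "class": "text+image-to-video",
    "elo": 1120,
    "fps": 16,
    "max_frames": 81,
    "attention_heads": 40,
    "vae_compression": {"spatial": 8, "temporal": 4},
    "resolutions": ["480p", "720p"]
  }
}
```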
We generate simple model profiles to estimate runtime and resource usage, exploiting the fact that runtime scales roughly proportionally with key parameters (e.g., pixel count, frame count). We benchmark a representative configuration (e.g., 1+16 frames, 10 steps, 640×400 resolution) and validate it against additional test points. We also measure peak power, energy, and temperature. These data inform predictive models for performance, cost, and quality under different configurations.
Many diffusion models include native support for multi-GPU inference (e.g., Wan). For those that do not, we use USP (Unified Sequence Parallelism) from xDiT. We have enabled parallelism for four models (e.g., Fantasy Talking, Hunyuan FramePack), each requiring under two hours of work. The xfuser repository provides examples, and this process could be streamlined with LLM-based coding agents.
We use scikit-learn to fit these linear models; the resulting runtime and cost profiles are over 99.9% accurate on our validation points.
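The profiling idea can be sketched as fitting runtime against total work (pixels × frames × steps). We use scikit-learn in practice; the closed-form least-squares fit below keeps the example dependency-free, and the benchmark numbers are made up for illustration.

```python
# Illustrative benchmark points: (width, height, frames, steps, runtime_s).
# These numbers are fabricated for the sketch, not measured values.
bench = [
    (640, 400, 17, 10, 21.0),
    (640, 400, 33, 10, 40.5),
    (960, 544, 17, 10, 43.2),
]
xs = [w * h * f * s for (w, h, f, s, _) in bench]
ys = [t for (*_, t) in bench]

# Ordinary least squares for y = slope * x + intercept (what
# sklearn.linear_model.LinearRegression computes for one feature).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x


def predict_runtime(width: int, height: int, frames: int, steps: int) -> float:
    """Predicted runtime (seconds) for a given generation configuration."""
    return slope * (width * height * frames * steps) + intercept
```

A handful of benchmark points per model is enough because the relationship is close to linear; the remaining test points then serve as the validation set.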
When onboarding a model, StreamWise takes its Elo ranking from public leaderboards.
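As a reminder of how these rankings translate into quality comparisons, the Elo expected score of one model being preferred over another is a logistic function of the rating gap. The sketch below uses the standard Elo constants (base 10, scale 400), not StreamWise-specific values.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the
    standard Elo model (base 10, divisor 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Equal ratings give a 50/50 preference.
print(elo_expected_score(1200, 1200))  # 0.5
```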