We wrap publicly available Hugging Face models behind our own HTTP REST interface. We package each model as a Docker container based on an NVIDIA image that provides GPU drivers and runtime tools. Each container embeds our Instance Manager, which standardizes the interface for executing inference requests. We adapt the existing inference code (typically from Hugging Face) to this interface and bundle it with the model weights.
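To illustrate the idea of a standardized inference interface, the sketch below shows what a per-model backend contract could look like. The class and method names (`InferenceBackend`, `load`, `infer`) are hypothetical and purely illustrative; they are not the actual Instance Manager API.

```python
from abc import ABC, abstractmethod
from typing import Any


class InferenceBackend(ABC):
    """Hypothetical sketch of the interface a container-level manager could
    standardize on; names and signatures are illustrative only."""

    @abstractmethod
    def load(self) -> None:
        """Load model weights onto the GPU once, at container start."""

    @abstractmethod
    def infer(self, request: dict[str, Any]) -> dict[str, Any]:
        """Run one inference request and return a JSON-serializable result."""


class EchoBackend(InferenceBackend):
    """Trivial stand-in backend used here only to exercise the interface."""

    def load(self) -> None:
        self.ready = True

    def infer(self, request: dict[str, Any]) -> dict[str, Any]:
        return {"status": "ok", "echo": request}


backend = EchoBackend()
backend.load()
result = backend.infer({"prompt": "a red fox"})
```

Adapting a Hugging Face pipeline then amounts to implementing `load` and `infer` for that model and packaging the result into the container image.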
We leverage the following:
- Diffusers to provide a simple and unified interface.
- xDiT for parallelization of the models.
- vLLM for the OpenAI interface.
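Because vLLM exposes an OpenAI-compatible endpoint, any standard chat-completions client works against it. The sketch below builds such a request with only the standard library; the host, port, and model name are illustrative placeholders, not our actual deployment.

```python
import json
import urllib.request

# Illustrative OpenAI-style chat-completions payload for a vLLM server.
# Endpoint URL and model name below are placeholders.
payload = {
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Describe this scene."}],
    "max_tokens": 64,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would send it; here we only construct it.
```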
| Model Name | Class |
|---|---|
| Fantasy Talking | π€πΌοΈπβπ₯ Text+Image+Audio to Video |
| FLUX | π€βπΌοΈ Text to Image |
| FLUX Upscaler | πΌοΈβπΌοΈ Image to Image |
| FLUX Krea | π€βπΌοΈ Text to Image |
| FLUX Kontext | πΌοΈβπΌοΈ Image to Image |
| 4KAgent | πΌοΈβπΌοΈ Image to Image |
| HiDream I1 | π€βπΌοΈ Text to Image |
| Qwen Image | π€βπΌοΈ Text to Image |
| Qwen Image Edit | πΌοΈβπΌοΈ Image to Image |
| Janus Pro | π€βπΌοΈ Text to Image |
| LlamaGen | π€βπΌοΈ Text to Image |
| Bagel | πΌοΈβπΌοΈ Image to Image |
| Hunyuan Image | π€βπΌοΈ Text to Image |
| Hunyuan FramePack | π€πΌοΈβπ₯ Text+Image to Video |
| Hunyuan FramePack F1 | π€πΌοΈβπ₯ Text+Image to Video |
| Hunyuan Avatar | π€πΌοΈπβπ₯ Text+Image+Audio to Video |
| Kokoro | π€βπ Text to Audio |
| XTTS | π€βπ Text to Audio |
| ThinkSound | π₯βπ Video to Audio |
| VibeVoice | π€βπ Text to Audio |
| Wan 2.1 | π€πΌοΈβπ₯ Text+Image to Video |
| Wan 2.2 | π€πΌοΈβπ₯ Text+Image to Video |
| YOLO | πΌοΈβπΌοΈ Image to Image |
| Image Resize | πΌοΈβπΌοΈ Image to Image |
| Real-ESRGAN | πΌοΈβπΌοΈ Image to Image |
| LTX-Video | π€πΌοΈβπ₯ Text+Image to Video |
| LongCat-Video | π€πΌοΈβπ₯ Text+Image to Video |
| Gemma 3 | π€ LLM |
| Llama 3.2 | π€ LLM |
| Whisper | πβπ€ Audio to Text |
The characteristics of each model are listed in `services.json`. These include quality (Elo ranking), frame rate (FPS), maximum number of frames (video length), number of attention heads, VAE compression ratios, supported resolutions, and other relevant attributes.
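For illustration, an entry in `services.json` might look like the following; the field names and values here are hypothetical, not the actual schema:

```json
{
  "wan-2.1": {
    "class": "text+image-to-video",
    "elo": 1120,
    "fps": 16,
    "max_frames": 81,
    "attention_heads": 40,
    "vae_compression": {"spatial": 8, "temporal": 4},
    "resolutions": ["480p", "720p"]
  }
}
```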
We generate simple model profiles to estimate runtime and resource usage, exploiting the fact that runtime scales roughly proportionally with key parameters (e.g., pixel count, frame count). We benchmark a representative configuration (e.g., 1+16 frames, 10 steps, 640×400 resolution) and validate it against additional test points. We also measure peak power, energy, and temperature. These data inform predictive models for performance, cost, and quality under different configurations.
Many diffusion models include native support for multi-GPU inference (e.g., Wan). For those that do not, we use USP (Unified Sequence Parallelism) from xDiT. We have enabled parallelism for four models (e.g., Fantasy Talking, Hunyuan FramePack), each requiring under two hours of work. The xfuser repository provides examples, and this process could be streamlined with LLM-based coding agents.
We use scikit-learn to fit these linear models; the resulting runtime and cost profiles are over 99.9% accurate on our validation points.
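The profiling idea can be sketched as fitting runtime against total work (pixels × frames × steps). We use scikit-learn in practice; the closed-form least-squares fit below keeps the example dependency-free, and the benchmark numbers are made up for illustration.

```python
# Illustrative benchmark points: (width, height, frames, steps, runtime_s).
# These numbers are fabricated for the sketch, not measured values.
bench = [
    (640, 400, 17, 10, 21.0),
    (640, 400, 33, 10, 40.5),
    (960, 544, 17, 10, 43.2),
]
xs = [w * h * f * s for (w, h, f, s, _) in bench]
ys = [t for (*_, t) in bench]

# Ordinary least squares for y = slope * x + intercept (what
# sklearn.linear_model.LinearRegression computes for one feature).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x


def predict_runtime(width: int, height: int, frames: int, steps: int) -> float:
    """Predicted runtime (seconds) for a given generation configuration."""
    return slope * (width * height * frames * steps) + intercept
```

A handful of benchmark points per model is enough because the relationship is close to linear; the remaining test points then serve as the validation set.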
When onboarding a model, StreamWise takes its Elo ranking from public leaderboards.
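As a reminder of how these rankings translate into quality comparisons, the Elo expected score of one model being preferred over another is a logistic function of the rating gap. The sketch below uses the standard Elo constants (base 10, scale 400), not StreamWise-specific values.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A is preferred over model B under the
    standard Elo model (base 10, divisor 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Equal ratings give a 50/50 preference.
print(elo_expected_score(1200, 1200))  # 0.5
```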