The 2026 Python Stack for Real-Time Multimodal Agents

Build production-ready vision-language agents in under 300 lines of Python.

import asyncio

from src.core.agent import AgentLoop
from src.inputs.webcam import WebcamInput
from src.models import create_model
from src.memory import SlidingWindowMemory
from src.tools import SlackAlertTool

# Create agent
agent = AgentLoop(
    model=create_model("openai", "gpt-4o-mini"),
    memory=SlidingWindowMemory(),
)
agent.register_tool(SlackAlertTool(webhook_url="..."))

# Run on webcam. `await` is only valid inside a coroutine, so a script
# drives the agent with asyncio.run().
asyncio.run(agent.run(WebcamInput()))

Features

Minimal Core: ~150 lines for the complete agent loop
6 Model Providers: OpenAI, Anthropic, Google, Groq, Fireworks, Together
5 Input Sources: Webcam, microphone, files, RTSP streams, URLs
Plug-and-Play Tools: Slack, Notion, PLC, robot arm
Battle-Tested: Protocol-driven, async-first design
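
"Protocol-driven" most likely means components are matched by structure (`typing.Protocol`) rather than by inheritance. The sketch below illustrates that idea only; `Tool`, `EchoTool`, `call`, and `register` are hypothetical names and not necessarily the interface defined in `src/tools`.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Tool(Protocol):
    """Hypothetical tool interface: any object with these members qualifies."""

    name: str

    async def call(self, **kwargs: Any) -> str: ...


class EchoTool:
    """Illustrative tool. Note that it does not inherit from Tool."""

    name = "echo"

    async def call(self, **kwargs: Any) -> str:
        return f"echo: {kwargs}"


def register(registry: dict[str, Tool], tool: Tool) -> None:
    # isinstance() on a runtime_checkable Protocol only checks that the
    # members exist; it does not verify their signatures.
    if not isinstance(tool, Tool):
        raise TypeError(f"{tool!r} does not satisfy the Tool protocol")
    registry[tool.name] = tool


registry: dict[str, Tool] = {}
register(registry, EchoTool())
result = asyncio.run(registry["echo"].call(text="hi"))
print(result)
```

Structural typing keeps third-party tools free of any import from the agent package: a class only has to expose the right attributes.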

Quick Start

1. Install Dependencies

# Using pip
pip install -r requirements.txt

# Using uv (recommended)
uv pip install -r requirements.txt

2. Set Up API Keys

cp .env.example .env
# Edit .env with your API keys
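
Before running an example it can help to check which providers actually have a key configured. The sketch below is an illustration, not part of the repository: the variable names are the ones each provider's SDK conventionally reads, and whether `.env.example` uses exactly these names is an assumption.

```python
# Conventional environment variable names per provider (assumed, see above).
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "fireworks": "FIREWORKS_API_KEY",
    "together": "TOGETHER_API_KEY",
}


def parse_env_file(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def configured_providers(env: dict[str, str]) -> list[str]:
    """Return the providers whose key is present and non-empty."""
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


sample = """
# .env
OPENAI_API_KEY=sk-example
ANTHROPIC_API_KEY=
"""
providers = configured_providers(parse_env_file(sample))
print(providers)
```

In a real script, pass the contents of `.env` (or `dict(os.environ)`) instead of the inline sample.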

3. Run an Example

# Basic webcam demo
python examples/01_basic_webcam.py

# Security monitor with Slack alerts
python examples/02_security_monitor.py

# Manufacturing quality inspector
python examples/03_quality_inspector.py

# Meeting assistant
python examples/04_meeting_assistant.py

# Benchmark all providers
python examples/05_benchmark_providers.py

Architecture

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Inputs    │────▶│    Agent    │────▶│   Tools     │
│             │     │    Loop     │     │             │
│ • Webcam    │     │             │     │ • Slack     │
│ • Mic       │     │ ┌─────────┐ │     │ • Notion    │
│ • Files     │     │ │ Buffer  │ │     │ • PLC       │
│ • RTSP      │     │ └────┬────┘ │     │ • Robot     │
│ • URLs      │     │      ▼      │     └─────────────┘
└─────────────┘     │ ┌─────────┐ │
                    │ │ Model   │ │     ┌─────────────┐
                    │ └────┬────┘ │────▶│   Memory    │
                    │      ▼      │     │             │
                    │ ┌─────────┐ │     │ • Sliding   │
                    │ │ Output  │ │     │   Window    │
                    │ └─────────┘ │     └─────────────┘
                    └─────────────┘
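
The data flow in the diagram can be sketched as a single async loop. This is a toy stand-in under stated assumptions, not the code in `src/core/agent.py`: `fake_frames`, `fake_model`, `run_loop`, and their parameters are hypothetical, and strings take the place of real frames and model responses.

```python
import asyncio
from collections import deque
from typing import AsyncIterator, Awaitable, Callable


async def fake_frames(n: int) -> AsyncIterator[str]:
    """Stand-in for an input source; yields labels instead of images."""
    for i in range(n):
        yield f"frame-{i}"
        await asyncio.sleep(0)  # let other tasks run between frames


async def fake_model(frames: list[str], history: list[str]) -> str:
    """Stand-in for a vision-language model call."""
    return f"saw {len(frames)} frame(s), {len(history)} earlier message(s)"


async def run_loop(
    source: AsyncIterator[str],
    model: Callable[[list[str], list[str]], Awaitable[str]],
    on_output: Callable[[str], None],
    batch_size: int = 2,
    memory_size: int = 3,
) -> None:
    buffer: list[str] = []
    memory: deque[str] = deque(maxlen=memory_size)  # sliding window
    async for frame in source:
        buffer.append(frame)  # Inputs -> Buffer
        if len(buffer) < batch_size:
            continue
        output = await model(buffer, list(memory))  # Buffer -> Model
        memory.append(output)  # Output -> Memory
        on_output(output)  # Output -> Tools
        buffer = []  # start a fresh batch


outputs: list[str] = []
asyncio.run(run_loop(fake_frames(5), fake_model, outputs.append))
print(outputs)
```

With five frames and a batch size of two, the model is called twice and the fifth frame stays in the buffer, which is the behavior a batching threshold implies.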

Model Providers

| Provider | Models | Best For |
|---|---|---|
| OpenAI | gpt-4o, gpt-4o-mini | Tool calling, general use |
| Anthropic | claude-3.5-sonnet, claude-3.5-haiku | Reasoning, safety |
| Google | gemini-1.5-flash, gemini-1.5-pro | Cost efficiency, video |
| Groq | llama-3.2-90b-vision, llama-3.2-11b-vision | Speed |
| Fireworks | firellava-13b, phi-3-vision | Open models |
| Together | llama-3.2-11b-vision, llama-3.2-90b-vision | Open models |

Input Sources

# Webcam
from src.inputs import WebcamInput
source = WebcamInput(device_id=0, fps=1.0)

# Microphone
from src.inputs import MicrophoneInput
source = MicrophoneInput(sample_rate=16000)

# Video/Audio Files
from src.inputs import FileInput
source = FileInput("recording.mp4")

# RTSP Stream (IP Cameras)
from src.inputs import RTSPInput
source = RTSPInput("rtsp://user:<password>@<camera-host>:554/stream")  # fill in your camera's credentials and host

# URLs
from src.inputs import URLInput
source = URLInput("https://example.com/image.jpg")

# Combined (webcam + mic)
from src.inputs.base import CompositeInput
source = CompositeInput(WebcamInput(), MicrophoneInput())
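
Combining two sources amounts to merging two async streams so that items are delivered as soon as either produces one. One common way to do that with `asyncio.Queue` is sketched below; `merge` and `ticker` are hypothetical helpers, and `CompositeInput` may well be implemented differently.

```python
import asyncio
from typing import Any, AsyncIterator

_DONE = object()  # sentinel marking that one source is exhausted


async def merge(*sources: AsyncIterator[Any]) -> AsyncIterator[Any]:
    """Yield items from several async iterators in arrival order."""
    queue: asyncio.Queue[Any] = asyncio.Queue()

    async def pump(source: AsyncIterator[Any]) -> None:
        try:
            async for item in source:
                queue.put_nowait(item)  # unbounded queue, never blocks
        finally:
            queue.put_nowait(_DONE)

    tasks = [asyncio.create_task(pump(s)) for s in sources]
    remaining = len(tasks)
    try:
        while remaining:
            item = await queue.get()
            if item is _DONE:
                remaining -= 1
            else:
                yield item
    finally:
        for task in tasks:
            task.cancel()  # stop pumps if the consumer exits early


async def ticker(label: str, count: int, delay: float) -> AsyncIterator[str]:
    """Stand-in source that emits `count` labelled items."""
    for i in range(count):
        await asyncio.sleep(delay)
        yield f"{label}-{i}"


async def main() -> list[str]:
    merged = merge(ticker("video", 2, 0.01), ticker("audio", 3, 0.01))
    return [item async for item in merged]


items = asyncio.run(main())
print(items)
```

The interleaving order depends on timing, so consumers should not rely on it. This sketch also does not surface exceptions raised inside a source; a production version would need to.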

Tools

Slack Alerts

from src.tools import SlackAlertTool

tool = SlackAlertTool(
    webhook_url="https://hooks.slack.com/...",
    default_channel="#alerts"
)
agent.register_tool(tool)

Notion Run-Sheet

from src.tools import NotionRunSheetTool

tool = NotionRunSheetTool(
    api_key="secret_...",
    database_id="abc123..."
)
agent.register_tool(tool)

PLC Control (Industrial)

from src.tools import PLCWriteTool

tool = PLCWriteTool(simulate=True)  # Use simulate=False for real PLCs
agent.register_tool(tool)

Robot Arm

from src.tools import RobotArmTool

tool = RobotArmTool(simulate=True)  # Use simulate=False for real robots
agent.register_tool(tool)

Configuration

from src.core.config import AgentConfig

config = AgentConfig(
    # Frame processing
    max_frames=4,           # Max frames per request
    frame_batch_size=1,     # Process after N frames
    frame_interval_ms=1000, # Min ms between captures

    # Audio processing
    min_audio_chars=50,     # Process after N transcribed chars

    # Context management
    max_context_tokens=4000,
    max_context_messages=20,

    # System prompt
    system_prompt="You are a helpful assistant...",
)

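# `model` and `memory` are built as in the opening example, e.g.
# create_model("openai", "gpt-4o-mini") and SlidingWindowMemory()
agent = AgentLoop(model=model, memory=memory, config=config)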
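
To make the two context limits concrete, here is a minimal sketch of a window bounded by both a message count and a token budget. It is an assumption about how `max_context_messages` and `max_context_tokens` might interact, not the code in `src/memory/sliding_window.py`; `WindowMemory` and `estimate_tokens` are hypothetical, and the token count is a rough heuristic rather than a real tokenizer.

```python
from collections import deque


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token), not a real tokenizer."""
    return max(1, len(text) // 4)


class WindowMemory:
    """Hypothetical sliding window bounded by message count and token budget."""

    def __init__(self, max_messages: int = 20, max_tokens: int = 4000) -> None:
        self.max_messages = max_messages
        self.max_tokens = max_tokens
        self._messages: deque[str] = deque()

    def add(self, message: str) -> None:
        self._messages.append(message)
        # Evict the oldest messages until both limits hold,
        # but always keep the newest message.
        while len(self._messages) > 1 and (
            len(self._messages) > self.max_messages
            or self.total_tokens() > self.max_tokens
        ):
            self._messages.popleft()

    def total_tokens(self) -> int:
        return sum(estimate_tokens(m) for m in self._messages)

    def messages(self) -> list[str]:
        return list(self._messages)


memory = WindowMemory(max_messages=3, max_tokens=10)
for text in ["a" * 8, "b" * 8, "c" * 8, "d" * 24]:
    memory.add(text)
print(memory.messages(), memory.total_tokens())
```

Whichever limit is hit first triggers eviction, so a few long messages can shrink the window well below the message cap.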

Benchmarking

Run benchmarks across all providers:

python examples/05_benchmark_providers.py

Or programmatically:

import asyncio

from src.utils.benchmark import BenchmarkRunner
from src.models import create_model

async def main():
    runner = BenchmarkRunner()
    runner.add_model(create_model("openai", "gpt-4o-mini"))
    runner.add_model(create_model("anthropic", "claude-3-5-haiku-latest"))

    # run_all is awaited, so it must be called from inside a coroutine
    results = await runner.run_all(iterations=10)
    print(runner.to_markdown_table())

asyncio.run(main())
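
For readers who want to see what such a benchmark measures, the sketch below times sequential async calls and renders a markdown table. It is a standalone illustration, not `BenchmarkRunner` itself: `time_calls`, `to_markdown_table`, and `fake_model` are hypothetical, and the choice of mean, median, and max as reported statistics is an assumption.

```python
import asyncio
import statistics
import time
from typing import Awaitable, Callable


async def time_calls(
    fn: Callable[[], Awaitable[object]], iterations: int
) -> list[float]:
    """Return per-call latency in milliseconds, one call at a time."""
    latencies: list[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        await fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def to_markdown_table(results: dict[str, list[float]]) -> str:
    """Render one row per model with mean, median, and max latency."""
    lines = [
        "| Model | Mean (ms) | Median (ms) | Max (ms) |",
        "|---|---|---|---|",
    ]
    for name, ms in results.items():
        lines.append(
            f"| {name} | {statistics.mean(ms):.1f} "
            f"| {statistics.median(ms):.1f} | {max(ms):.1f} |"
        )
    return "\n".join(lines)


async def fake_model() -> str:
    """Stand-in for a provider call; sleeps instead of hitting an API."""
    await asyncio.sleep(0.01)
    return "ok"


async def main() -> dict[str, list[float]]:
    return {"fake-model": await time_calls(fake_model, iterations=3)}


results = asyncio.run(main())
table = to_markdown_table(results)
print(table)
```

Calls run sequentially on purpose: concurrent requests would measure throughput under load rather than per-request latency.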

Project Structure

multimodal-python-stack/
├── src/
│   ├── core/
│   │   ├── agent.py      # Main AgentLoop (~150 lines)
│   │   ├── types.py      # Frame, AudioChunk, Message, etc.
│   │   └── config.py     # AgentConfig
│   ├── inputs/
│   │   ├── webcam.py     # WebcamInput
│   │   ├── microphone.py # MicrophoneInput
│   │   ├── file.py       # FileInput, VideoFileInput, AudioFileInput
│   │   ├── rtsp.py       # RTSPInput
│   │   └── url.py        # URLInput
│   ├── models/
│   │   ├── openai.py     # OpenAIVisionModel
│   │   ├── anthropic.py  # AnthropicVisionModel
│   │   ├── google.py     # GoogleVisionModel
│   │   ├── groq.py       # GroqVisionModel
│   │   ├── fireworks.py  # FireworksVisionModel
│   │   └── together.py   # TogetherVisionModel
│   ├── tools/
│   │   ├── slack.py      # SlackAlertTool
│   │   ├── notion.py     # NotionRunSheetTool
│   │   ├── plc.py        # PLCWriteTool
│   │   └── robot.py      # RobotArmTool
│   ├── memory/
│   │   └── sliding_window.py
│   └── utils/
│       ├── audio.py      # WhisperTranscriber
│       ├── image.py      # Frame utilities
│       └── benchmark.py  # BenchmarkRunner
├── examples/
│   ├── 01_basic_webcam.py
│   ├── 02_security_monitor.py
│   ├── 03_quality_inspector.py
│   ├── 04_meeting_assistant.py
│   └── 05_benchmark_providers.py
└── docs/
    └── blog_post.md

Requirements

Python 3.11+
API keys for at least one provider

License

MIT

Contributing

Contributions welcome! Please read the contributing guidelines first.

argotdev/multimodal-python-stack
