A production-grade, memory-efficient C++20 WebSocket server for real-time audio transcription, powered by whisper.cpp. The protocol is aligned with Azure Cognitive Services Speech-to-Text conventions.
whisper.cpp — a high-performance C/C++ implementation of OpenAI's Whisper speech recognition model, using the GGML tensor library.
- Pure C/C++ — no Python, no PyTorch, no ONNX
- CPU optimized (AVX, AVX2, NEON) with optional GPU acceleration (CUDA, Vulkan, Metal)
- Uses GGML quantized model format for reduced memory and faster inference
- Model loaded once at startup, shared across all concurrent sessions
- Embedded Whisper inference — pure C++, no Python dependencies
- WebSocket streaming — Azure-style protocol with continuous binary audio
- PCM and Opus audio at any sample rate (8–96 kHz, resampled internally to 16 kHz)
- Stateless checkpointing — resume sessions on any server node
- API key authentication — Bearer token in header
- Rate limiting — per-IP auth brute-force protection (429) and per-session message throttling
- Kubernetes-ready — multi-stage Docker image, health checks, NetworkPolicy, Helm-friendly
⚠️ Deployment Notice: This service is designed to run behind infrastructure safeguards — a Kubernetes Ingress (or reverse proxy) for TLS termination and a NetworkPolicy for network segmentation. Internal endpoints (`/health`, `/ready`, `/metrics`) are intentionally unauthenticated for K8s probes and Prometheus scraping, and must not be exposed to untrusted networks. See SECURITY.md for the full deployment security model.
Pre-built images are available with popular models embedded — no downloads or volume mounts needed:
```bash
# Just run — model is included in the image
docker run -p 9090:9090 -e WSS_API_KEY=your-secret-key \
  ghcr.io/vbomfim/openasr:base.en
```

| Image tag | Model | Image size | Best for |
|---|---|---|---|
| `ghcr.io/vbomfim/openasr:tiny.en` | Whisper tiny (English) | ~140 MB | Development, testing |
| `ghcr.io/vbomfim/openasr:base.en` | Whisper base (English) | ~210 MB | Low-latency production |
| `ghcr.io/vbomfim/openasr:large-v3-turbo` | Whisper large v3 turbo | ~1.7 GB | Best speed/quality balance |
| `ghcr.io/vbomfim/openasr:large-v3` | Whisper large v3 | ~3.1 GB | Maximum accuracy |
| `ghcr.io/vbomfim/openasr:latest` | No model (bring your own) | ~63 MB | Production K8s with external model storage |
Use the slim latest image and mount your model as a volume:
```bash
# Download a model
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Run with volume mount
docker run -p 9090:9090 \
  -v $(pwd)/models:/models \
  -e WHISPER_MODEL_PATH=/models/ggml-base.en.bin \
  -e WSS_API_KEY=your-secret-key \
  ghcr.io/vbomfim/openasr:latest
```

Or with Docker Compose:

```yaml
services:
  openasr:
    image: ghcr.io/vbomfim/openasr:base.en
    ports:
      - "9090:9090"
    environment:
      - WSS_API_KEY=your-secret-key
      - WSS_INFERENCE_THREADS=4
      - WSS_MAX_SESSIONS=10
```

To build from source:

```bash
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)

# Download a model
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Run
WHISPER_MODEL_PATH=models/ggml-base.en.bin ./transcription_server
```

The WebSocket endpoint is:

```
ws://host:9090/transcribe
```
Every WebSocket connection must be authenticated (unless WSS_API_KEY is unset for dev mode).
The API key is any string you choose — there is no specific format or token service. Use a cryptographically random string of at least 32 characters:
```bash
# Generate a secure random key
openssl rand -hex 32
# Example output: a3f1b9c7e8d2f4a6b0c5e7d9f1a3b5c7e9d1f3a5b7c9e1d3f5a7b9c1e3d5f7

# Or using Python
python3 -c "import secrets; print(secrets.token_hex(32))"
```

Set the key as an environment variable when starting the server:
```bash
# Docker
docker run -e WSS_API_KEY=<your-api-key-here> ghcr.io/vbomfim/openasr:base.en

# Kubernetes (use a Secret, not a ConfigMap)
kubectl create secret generic openasr-api-key \
  --from-literal=WSS_API_KEY=<your-api-key-here> \
  -n whisperx
```

In production, set `WSS_REQUIRE_AUTH=true` to prevent accidental startup without a key.
Pass the key in the Authorization header when opening the WebSocket connection:
```
Authorization: Bearer a3f1b9c7e8d2f4a6b0c5e7d9f1a3b5c7...
```

```python
# Python
headers = {"Authorization": f"Bearer {api_key}"}
async with websockets.connect("ws://host:9090/transcribe",
                              additional_headers=headers) as ws:
    ...
```

```javascript
// JavaScript (Node.js "ws" package — browser WebSocket cannot set custom headers)
const ws = new WebSocket("ws://host:9090/transcribe", {
  headers: { "Authorization": `Bearer ${apiKey}` }
});
```

```bash
# curl (for testing the WebSocket upgrade)
curl -H "Authorization: Bearer $API_KEY" \
  -H "Upgrade: websocket" -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
  -H "Sec-WebSocket-Version: 13" \
  http://localhost:9090/transcribe
```

Query string authentication is not supported — API keys in URLs are logged by proxies and intermediaries.
Unauthenticated or invalid connections receive HTTP 401 Unauthorized before the WebSocket handshake completes.
For multi-tenant deployments with per-user tokens, token revocation, and SSO, use an ingress-level identity provider instead of a shared API key. The server stays lean — authentication is handled by your infrastructure:
```
Client (JWT) → [ Ingress / API Gateway ] → OpenASR server
                validates token here       receives pre-authenticated request
```
Example with nginx-ingress + OAuth2 Proxy (Azure Entra ID):
```yaml
# Ingress annotations for OAuth2 Proxy
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openasr  # example name
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.example.com/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://oauth2-proxy.example.com/oauth2/start"
spec:
  rules:
    - host: openasr.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: whisperx-server
                port:
                  number: 9090
```

Compatible identity providers:
- Azure Entra ID (formerly Azure AD) — via OAuth2 Proxy
- Okta / Auth0 — via OAuth2 Proxy or Kong OIDC plugin
- Google Identity — via IAP or OAuth2 Proxy
- Keycloak — self-hosted, via OAuth2 Proxy
- Istio — native JWT validation in the service mesh
When using ingress-level auth, set WSS_API_KEY to empty (disable the built-in check) and rely on the gateway to reject unauthenticated requests before they reach the server.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: WebSocket upgrade (Authorization: Bearer <key>)
    S-->>C: 101 Switching Protocols
    C->>S: speech.config (JSON)
    S-->>C: speech.config.ack (JSON)
    Note over C,S: Client streams continuous binary audio frames
    loop Every audio window (e.g. 5s)
        C->>S: binary audio frames...
        Note right of S: Window full → inference runs
        S-->>C: speech.hypothesis (interim, per segment)
        S-->>C: speech.phrase (finalized turn, status=Success)
        S-->>C: speech.checkpoint (full transcript so far)
    end
    C->>S: speech.end (JSON)
    S-->>C: speech.phrase (status=EndOfStream, full transcript)
    S-->>C: speech.checkpoint (final)
```
Each completed inference window produces three events in this order:
| # | Event | Purpose | Content |
|---|---|---|---|
| 1 | `speech.hypothesis` | Interim result per segment | Individual segment text, offset, duration |
| 2 | `speech.phrase` | Finalized turn (stable, won't change) | Combined text for this window, `status: "Success"` |
| 3 | `speech.checkpoint` | Full accumulated transcript | Everything transcribed so far (for session resume) |
On speech.end, the server sends:
| # | Event | Purpose | Content |
|---|---|---|---|
| 1 | `speech.phrase` | End of stream marker | Full transcript, `status: "EndOfStream"` |
| 2 | `speech.checkpoint` | Final session state | Complete transcript + resume data |
```
Time   Direction  Event
────   ─────────  ──────────────────────────────────────────────
0.0s   C→S        speech.config (window=5000ms, overlap=500ms)
0.0s   S→C        speech.config.ack
0.0s   C→S        binary audio frames (streaming continuously)
...
5.0s   ─────      Window 1 ready [0ms–5000ms] → inference starts
6.2s   S→C        speech.hypothesis: "Hello, this is a test"
6.2s   S→C        speech.phrase: "Hello, this is a test." (status=Success)
6.2s   S→C        speech.checkpoint: "Hello, this is a test."
...
9.5s   ─────      Window 2 ready [4500ms–9500ms] → inference starts
11.0s  S→C        speech.hypothesis: "of the streaming server."
11.0s  S→C        speech.phrase: "of the streaming server." (status=Success)
11.0s  S→C        speech.checkpoint: "Hello, this is a test. of the streaming server."
...
12.0s  C→S        speech.end
12.0s  S→C        speech.phrase: "Hello, this is a test. of the ..." (status=EndOfStream)
12.0s  S→C        speech.checkpoint: (final state, usable for resume)
```
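The window boundaries in the timeline follow directly from the configured window and overlap: each new window starts `window - overlap` ms after the previous one. A small illustrative helper (not part of the server API) makes the arithmetic explicit:

```python
def window_bounds(index, window_ms=5000, overlap_ms=500):
    """Start and end (ms) of the index-th inference window (0-based).

    Consecutive windows advance by (window - overlap) ms, so with the
    defaults window 0 covers 0-5000 ms and window 1 covers 4500-9500 ms,
    matching the timeline above.
    """
    start = index * (window_ms - overlap_ms)
    return start, start + window_ms

print(window_bounds(0))  # (0, 5000)
print(window_bounds(1))  # (4500, 9500)
```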
Sent once after WebSocket connect. Configures language, audio format, and windowing.
```json
{
  "type": "speech.config",
  "payload": {
    "language": "en",
    "sample_rate": 16000,
    "encoding": "pcm_s16le",
    "window_duration_ms": 5000,
    "overlap_duration_ms": 500,
    "model_id": "whisper-large-v3",
    "resume_checkpoint": null
  }
}
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `language` | string | yes | `"en"` | Language code (max 16 chars) |
| `sample_rate` | int | yes | `16000` | Audio sample rate in Hz (8000–96000) |
| `encoding` | string | yes | `"pcm_s16le"` | `"pcm_s16le"` or `"opus"` |
| `window_duration_ms` | int | yes | `5000` | Transcription window size in ms (1000–60000) |
| `overlap_duration_ms` | int | yes | `500` | Window overlap in ms (0 to `window_duration_ms - 1`) |
| `model_id` | string | no | `"whisper-tiny.en"` | Model identifier (max 128 chars) |
| `resume_checkpoint` | object \| null | no | `null` | Checkpoint from a previous session to resume |
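The ranges in the table can be checked client-side before sending `speech.config`; a minimal sketch (the `validate_speech_config` helper is hypothetical, its rules simply mirror the table above):

```python
def validate_speech_config(payload):
    """Raise ValueError if a speech.config payload violates the documented ranges."""
    if not (1 <= len(payload.get("language", "")) <= 16):
        raise ValueError("language must be 1-16 chars")
    if not (8000 <= payload["sample_rate"] <= 96000):
        raise ValueError("sample_rate must be 8000-96000 Hz")
    if payload["encoding"] not in ("pcm_s16le", "opus"):
        raise ValueError("encoding must be pcm_s16le or opus")
    window = payload["window_duration_ms"]
    if not (1000 <= window <= 60000):
        raise ValueError("window_duration_ms must be 1000-60000")
    if not (0 <= payload["overlap_duration_ms"] < window):
        raise ValueError("overlap_duration_ms must be in [0, window_duration_ms)")
```

Catching bad values locally produces a clearer error than a round-trip ending in a `speech.error` with code `INVALID_MESSAGE`.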
After receiving speech.config.ack, send audio as continuous binary WebSocket frames. No JSON metadata per chunk — just raw audio bytes.
| Encoding | Format |
|---|---|
| `pcm_s16le` | Raw PCM, 16-bit signed little-endian, mono |
| `opus` | Opus-encoded frames, mono |
Audio at any sample rate is internally resampled to 16 kHz for Whisper inference.
Recommended chunk size: 200ms of audio (e.g., 6400 bytes at 16 kHz PCM).
Maximum frame size: 16 MB.
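The recommended 200 ms chunk size scales with the sample rate: bytes = rate × seconds × 2 for 16-bit mono PCM. A tiny helper (illustrative, not part of any client API) shows the arithmetic:

```python
def chunk_bytes(sample_rate_hz, chunk_ms=200, bytes_per_sample=2):
    """Bytes of mono PCM audio in one chunk (16-bit = 2 bytes per sample)."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample

print(chunk_bytes(16000))  # 6400  (matches the recommendation above)
print(chunk_bytes(48000))  # 19200
```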
Signals no more audio will be sent. Server sends final transcript and checkpoint.
```json
{
  "type": "speech.end",
  "payload": {}
}
```

Confirms session creation with effective configuration.
```json
{
  "type": "speech.config.ack",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
    "effective_config": {
      "language": "en",
      "sample_rate": 16000,
      "encoding": "pcm_s16le",
      "window_duration_ms": 5000,
      "overlap_duration_ms": 500,
      "model_id": "whisper-large-v3"
    }
  }
}
```

Emitted after each inference window completes. May change as more context arrives.
```json
{
  "type": "speech.hypothesis",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "offset": 0,
    "duration": 5000,
    "text": "Hello, this is a test of the Whisper streaming transcription server."
  }
}
```

| Field | Type | Description |
|---|---|---|
| `offset` | int | Start time in milliseconds from session start |
| `duration` | int | Duration of the transcribed segment in milliseconds |
| `text` | string | Transcribed text |
Emitted after each inference window completes. This is the confirmed, stable transcription for that segment — it will not change. Use speech.checkpoint for the full accumulated transcript.
```json
{
  "type": "speech.phrase",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "offset": 0,
    "duration": 5000,
    "text": "Hello, this is a test of the Whisper streaming transcription server.",
    "confidence": 1.0,
    "status": "Success"
  }
}
```

On speech.end, a final speech.phrase with `"status": "EndOfStream"` is sent containing the full accumulated transcript.
Emitted after each inference window. Store this to resume the session on any server node.
```json
{
  "type": "speech.checkpoint",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
    "last_audio_ms": 5000,
    "last_text_offset": 69,
    "full_transcript": "Hello, this is a test of the Whisper streaming transcription server.",
    "buffer_config": {
      "window_duration_ms": 5000,
      "overlap_duration_ms": 500
    },
    "backend_model_id": "whisper-large-v3"
  }
}
```

To resume, pass the checkpoint as `resume_checkpoint` in the next `speech.config`.
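Resuming amounts to echoing a stored checkpoint payload back in the next `speech.config`. A sketch, under the assumption that the client persists the last checkpoint payload itself (the `build_resume_config` helper is hypothetical; field names come from the messages above):

```python
def build_resume_config(checkpoint_payload):
    """Build a speech.config that resumes from a stored speech.checkpoint payload."""
    buf = checkpoint_payload["buffer_config"]
    return {
        "type": "speech.config",
        "payload": {
            "language": "en",                  # assumed: reuse the session's language
            "sample_rate": 16000,
            "encoding": "pcm_s16le",
            "window_duration_ms": buf["window_duration_ms"],
            "overlap_duration_ms": buf["overlap_duration_ms"],
            "model_id": checkpoint_payload["backend_model_id"],
            "resume_checkpoint": checkpoint_payload
        }
    }
```

Because the checkpoint is self-contained, the resumed session can land on any server node, which is what makes the protocol stateless.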
Sent when the server's audio buffer is filling up. Slow down or resume sending.
```json
{
  "type": "speech.backpressure",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "action": "slow_down"
  }
}
```

| Action | Meaning |
|---|---|
| `"slow_down"` | Buffer >80% full — reduce audio send rate |
| `"ok"` | Buffer <50% — safe to resume normal rate |
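On the client, backpressure maps naturally onto the delay between audio chunks. A minimal pacing sketch (the `Pacer` class, the doubling factor, and the 2 s ceiling are arbitrary illustrative choices, not part of the protocol):

```python
class Pacer:
    """Adjust the delay between audio chunks based on speech.backpressure events."""

    def __init__(self, base_delay_s=0.2):   # 0.2 s matches 200 ms chunks
        self.base = base_delay_s
        self.delay = base_delay_s

    def on_backpressure(self, action):
        if action == "slow_down":            # buffer >80% full: back off
            self.delay = min(self.delay * 2, 2.0)
        elif action == "ok":                 # buffer drained: resume normal rate
            self.delay = self.base

pacer = Pacer()
pacer.on_backpressure("slow_down")
print(pacer.delay)  # 0.4
pacer.on_backpressure("ok")
print(pacer.delay)  # 0.2
```

In a real client the send loop would `await asyncio.sleep(pacer.delay)` between chunks.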
```json
{
  "type": "speech.error",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "code": "INVALID_MESSAGE",
    "message": "Missing or invalid 'language'"
  }
}
```

| Code | Reason | Recovery |
|---|---|---|
| `INVALID_MESSAGE` | Malformed JSON, missing required fields, invalid field types, values out of range, or unknown message type | Fix the message and resend |
| `INVALID_STATE` | Message received in wrong connection state (e.g., binary audio before `speech.config`) | Follow the correct protocol flow |
| `SESSION_LIMIT` | Maximum concurrent sessions reached (default: 20) | Wait for a session to end, or increase `WSS_MAX_SESSIONS` |
| `SESSION_NOT_FOUND` | Session ID not found (session expired or destroyed) | Create a new session |
| `AUDIO_ERROR` | Audio decoding or resampling failed (corrupt Opus frame, invalid PCM data) | Check audio encoding and format |
HTTP-level errors (before WebSocket upgrade):
| HTTP Status | Reason |
|---|---|
| 401 Unauthorized | Missing or invalid API key |
| 404 Not Found | Invalid endpoint (use `/transcribe`) |
| Encoding | Description | Notes |
|---|---|---|
| `pcm_s16le` | Raw PCM, 16-bit signed, little-endian, mono | Most common, zero overhead |
| `opus` | Opus-encoded frames, mono | Lower bandwidth, decoder initialized per-session |
Sample rates: Any rate from 8,000 to 96,000 Hz. Audio is internally resampled to 16,000 Hz (Whisper's native rate) using libsamplerate with SRC_SINC_MEDIUM_QUALITY.
Channel layout: Mono only. Multi-channel audio must be downmixed before sending.
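Downmixing interleaved 16-bit stereo to mono can be done by averaging each L/R sample pair before sending. A dependency-free sketch using only the standard library (assumes a little-endian host, where `array("h")` matches the s16le wire format):

```python
import array

def stereo_to_mono_s16le(raw: bytes) -> bytes:
    """Average interleaved L/R 16-bit samples into mono.

    array("h") uses native byte order, so this matches pcm_s16le on
    little-endian hosts (x86, most ARM).
    """
    samples = array.array("h")
    samples.frombytes(raw)
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    return mono.tobytes()

# Two stereo frames: (L=100, R=300) and (L=-200, R=0)
mono = stereo_to_mono_s16le(array.array("h", [100, 300, -200, 0]).tobytes())
print(array.array("h", mono).tolist())  # [200, -100]
```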
| Parameter | Default | Configurable via |
|---|---|---|
| Max concurrent sessions | 20 | WSS_MAX_SESSIONS |
| Max WebSocket connections | 100 | Compile-time (kMaxConnections) |
| Max audio frame size | 16 MB | Compile-time (maxPayloadLength) |
| Idle connection timeout | 120s | Compile-time (idleTimeout) |
| Send buffer backpressure | 1 MB | Compile-time (maxBackpressure) |
| Audio ring buffer | 30s per session | Code default |
| Inference queue depth | 100 jobs | Compile-time |
| Max transcript length | 1 MB per session | Compile-time (kMaxTranscriptLength) |
| Max session duration | 2 hours | Compile-time (kMaxSessionDurationMs) |
| Inference threads | 4 | WSS_INFERENCE_THREADS |
Memory per session and model RAM: To be measured. See benchmark tool for running your own sizing tests.
Measured inference latency (5s audio window, 12-core ARM CPU, single session):
| Model | Latency |
|---|---|
| tiny.en | ~1.1s |
| base.en | ~2.4s |
| small.en | ~8.1s |
| large-v3 | ~50s |
Latency varies by hardware, concurrency, and audio content. GPU (CUDA) reduces latency 10–50×.
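A useful way to read the table is the real-time factor (inference time divided by audio length): anything below 1.0 keeps up with live audio. Applied to the 5 s window above (the helper is illustrative):

```python
def real_time_factor(latency_s, window_s=5.0):
    """RTF < 1.0 means inference keeps up with live audio."""
    return latency_s / window_s

# From the table: only tiny.en and base.en stay under 1.0 on this CPU
for model, latency in [("tiny.en", 1.1), ("base.en", 2.4),
                       ("small.en", 8.1), ("large-v3", 50.0)]:
    print(f"{model}: RTF {real_time_factor(latency):.2f}")
```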
All models are downloaded from ggerganov/whisper.cpp on Hugging Face in GGML format.
| Model | File | Size | Languages | Download |
|---|---|---|---|---|
| Tiny (EN) | `ggml-tiny.en.bin` | 75 MB | English | ⬇ |
| Tiny | `ggml-tiny.bin` | 75 MB | 99 languages | ⬇ |
| Base (EN) | `ggml-base.en.bin` | 142 MB | English | ⬇ |
| Base | `ggml-base.bin` | 142 MB | 99 languages | ⬇ |
| Small (EN) | `ggml-small.en.bin` | 466 MB | English | ⬇ |
| Small | `ggml-small.bin` | 466 MB | 99 languages | ⬇ |
| Medium (EN) | `ggml-medium.en.bin` | 1.5 GB | English | ⬇ |
| Medium | `ggml-medium.bin` | 1.5 GB | 99 languages | ⬇ |
| Large v3 | `ggml-large-v3.bin` | 3 GB | 99 languages | ⬇ |
| Large v3 Turbo | `ggml-large-v3-turbo.bin` | 1.6 GB | 99 languages | ⬇ |
Quantized variants use less RAM and run faster with minimal quality loss:
| Quantization | Size reduction | Example |
|---|---|---|
| Q8_0 | ~50% smaller | ggml-large-v3-turbo-q8_0.bin ⬇ |
| Q5_0 | ~65% smaller | ggml-large-v3-q5_0.bin ⬇ |
Browse all available models: huggingface.co/ggerganov/whisper.cpp
| Model | Description | Download |
|---|---|---|
| `ggml-small.en-tdrz.bin` | Small English + speaker diarization (tinydiarize) | ⬇ |
1. Download a model:
```bash
# English-only (faster, recommended for English)
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Multilingual (supports 99 languages)
curl -L -o models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Quantized (smaller + faster, good for production)
curl -L -o models/ggml-large-v3-turbo-q8_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q8_0.bin
```

2. Set the model path and restart:

```bash
# Docker
docker run -e WHISPER_MODEL_PATH=/models/ggml-base.en.bin ...

# Kubernetes
kubectl -n whisperx set env deployment/whisperx-server \
  WHISPER_MODEL_PATH=/models/ggml-base.en.bin
kubectl -n whisperx rollout restart deployment whisperx-server
```

```toml
# server.toml
[model]
path = "/models/ggml-base.en.bin"
```

The server loads one model at startup. Switching models requires a restart. `.en` models are English-only and faster.
All settings via environment variables (or server.toml with WSS_CONFIG_PATH). Environment variables take precedence over the config file.
| Variable | Default | Description |
|---|---|---|
| `WHISPER_MODEL_PATH` | (required) | Path to GGML model file |
| `WSS_API_KEY` | (empty = dev mode) | API key for authentication |
| `WSS_PORT` | `9090` | Server listen port |
| `WSS_HOST` | `0.0.0.0` | Server bind address |
| `WSS_MAX_SESSIONS` | `20` | Maximum concurrent sessions |
| `WSS_LANGUAGE` | `en` | Default language |
| `WSS_BEAM_SIZE` | `5` | Whisper beam search size |
| `WSS_INFERENCE_THREADS` | `4` | Threads per inference call |
| `WSS_WINDOW_DURATION_MS` | `20000` | Default window duration (ms) |
| `WSS_OVERLAP_DURATION_MS` | `2000` | Default window overlap (ms) |
| `WSS_LOG_LEVEL` | `info` | Log level (trace/debug/info/warn/error) |
| `WSS_LOG_FORMAT` | `text` | Log format: `text` (human-readable) or `json` (structured) |
| `WSS_CONFIG_PATH` | `config/server.toml` | Path to TOML config file |
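The env-over-TOML precedence can be reproduced in a few lines. A sketch of the lookup order (the `config_value` helper is illustrative, not the server's actual configuration code):

```python
import os

def config_value(name, toml_table, default=None):
    """Environment variable wins; then server.toml; then the built-in default."""
    if name in os.environ:
        return os.environ[name]
    return toml_table.get(name, default)

toml_table = {"WSS_PORT": "8080"}        # value from server.toml
os.environ["WSS_PORT"] = "9090"          # environment override
print(config_value("WSS_PORT", toml_table))             # 9090 (env wins)
print(config_value("WSS_LANGUAGE", toml_table, "en"))   # en (built-in default)
```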
For log aggregation systems (Datadog, Loki, ELK, CloudWatch), enable JSON log output:
```bash
WSS_LOG_FORMAT=json ./transcription_server
```

Or in `server.toml`:

```toml
[logging]
level = "info"
format = "json"
```

JSON output (one object per line):

```json
{"level":"info","message":"whisperx-streaming-server v0.1.0 starting...","timestamp":"2026-03-03T23:23:16.957Z"}
{"level":"debug","message":"Binary ingested: session=94a0d5 bytes=6400 written=3200","timestamp":"2026-03-03T23:33:54.283Z"}
```

The default format is `text` (human-readable) for backward compatibility.
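One-object-per-line JSON logs are trivially machine-parseable without an aggregator; for example, filtering by level (a sketch, with field names taken from the sample output above):

```python
import json

def filter_level(lines, level):
    """Yield parsed log records whose 'level' field matches."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("level") == level:
            yield rec

logs = [
    '{"level":"info","message":"server starting","timestamp":"2026-03-03T23:23:16.957Z"}',
    '{"level":"debug","message":"Binary ingested","timestamp":"2026-03-03T23:33:54.283Z"}',
]
print([r["message"] for r in filter_level(logs, "debug")])  # ['Binary ingested']
```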
```
GET /health   # liveness probe
GET /ready    # readiness probe (checks model loaded + queue capacity)
```

Both return JSON (no authentication required):

```jsonc
// /health
{"status": "ok", "active_sessions": 3, "max_sessions": 20, "inference_pending": 1}

// /ready
{"status": "ready", "model_ready": true, "queue_ok": true, "sessions_available": true}
```

```
GET /metrics
```
Returns metrics in Prometheus text format (no authentication required). Scrape interval recommendation: 15s.
| Metric | Description |
|---|---|
| `openasr_active_sessions` | Current number of transcription sessions |
| `openasr_active_connections` | Current WebSocket connections |
| `openasr_inference_queue_depth` | Inference jobs waiting in queue |
| Metric | Description |
|---|---|
| `openasr_connections_total` | Total WebSocket connections |
| `openasr_connections_rejected_auth_total` | Connections rejected (401 Unauthorized) |
| `openasr_connections_rejected_limit_total` | Connections rejected (limit reached) |
| `openasr_sessions_created_total` | Sessions created |
| `openasr_sessions_destroyed_total` | Sessions destroyed |
| `openasr_audio_bytes_received_total` | Audio data received (bytes) |
| `openasr_audio_chunks_received_total` | Binary WebSocket frames received |
| `openasr_inference_jobs_submitted_total` | Inference jobs submitted to thread pool |
| `openasr_inference_jobs_completed_total` | Inference jobs completed successfully |
| `openasr_inference_jobs_dropped_total` | Inference jobs dropped (queue full) |
| `openasr_transcription_segments_total` | Transcription segments produced |
| `openasr_errors_total` | Errors sent to clients |
| `openasr_backpressure_events_total` | Backpressure signals sent |
| Metric | Buckets | Description |
|---|---|---|
| `openasr_inference_duration_seconds` | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120 s | Per-window inference time |
| `openasr_audio_window_duration_seconds` | 1, 2, 5, 10, 20, 30, 60 s | Audio window size |
```yaml
# prometheus.yml
scrape_configs:
  - job_name: openasr
    scrape_interval: 15s
    static_configs:
      - targets: ["openasr-server:9090"]
```

Useful queries:

```promql
# Request rate (connections/sec)
rate(openasr_connections_total[5m])

# Inference latency p95
histogram_quantile(0.95, rate(openasr_inference_duration_seconds_bucket[5m]))

# Error rate (%)
rate(openasr_errors_total[5m]) / rate(openasr_connections_total[5m]) * 100

# Audio throughput (MB/sec)
rate(openasr_audio_bytes_received_total[5m]) / 1024 / 1024

# Queue saturation
openasr_inference_queue_depth / 100  # queue limit is 100
```
```bash
# Local development
kubectl apply -f k8s/local/all-in-one.yaml

# Production (customize first)
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
```

Health probes are configured for `/health` on port 9090.
```python
import asyncio, json, wave, websockets

async def transcribe(wav_path, api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    async with websockets.connect("ws://localhost:9090/transcribe",
                                  additional_headers=headers) as ws:
        # Configure session
        with wave.open(wav_path) as wf:
            sample_rate = wf.getframerate()
            pcm = wf.readframes(wf.getnframes())
        await ws.send(json.dumps({
            "type": "speech.config",
            "payload": {
                "language": "en",
                "sample_rate": sample_rate,
                "encoding": "pcm_s16le",
                "window_duration_ms": 5000,
                "overlap_duration_ms": 500
            }
        }))
        ack = json.loads(await ws.recv())
        print(f"Session: {ack['payload']['session_id']}")

        # Stream audio as binary frames (200ms chunks)
        chunk_size = sample_rate * 2 // 5
        for i in range(0, len(pcm), chunk_size):
            await ws.send(pcm[i:i + chunk_size])

        # Signal end of audio, then collect results until EndOfStream
        await ws.send(json.dumps({"type": "speech.end", "payload": {}}))
        while True:
            msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=120))
            if msg["type"] == "speech.hypothesis":
                print(f"  Partial: {msg['payload']['text']}")
            elif msg["type"] == "speech.checkpoint":
                print(f"  Transcript: {msg['payload']['full_transcript']}")
            elif (msg["type"] == "speech.phrase"
                  and msg["payload"].get("status") == "EndOfStream"):
                print(f"Final: {msg['payload']['text']}")
                break

asyncio.run(transcribe("audio.wav", "your-api-key"))
```

A full-featured test client is included at `tools/test_client.py`.
See docs/ARCHITECTURE.md for the full system design, threading model, memory management strategy, and protocol specification.
MIT