A production-grade, memory-efficient C++20 WebSocket server for real-time audio transcription, powered by whisper.cpp. The protocol is aligned with Azure Cognitive Services Speech-to-Text conventions.
whisper.cpp — a high-performance C/C++ implementation of OpenAI's Whisper speech recognition model, using the GGML tensor library.
- Pure C/C++ — no Python, no PyTorch, no ONNX
- CPU optimized (AVX, AVX2, NEON) with optional GPU acceleration (CUDA, Vulkan, Metal)
- Uses GGML quantized model format for reduced memory and faster inference
- Model loaded once at startup, shared across all concurrent sessions
- Embedded Whisper inference — pure C++, no Python dependencies
- WebSocket streaming — Azure-style protocol with continuous binary audio
- PCM and Opus audio at any sample rate (8–96 kHz, resampled internally to 16 kHz)
- Stateless checkpointing — resume sessions on any server node
- API key authentication — Bearer token in header
- Rate limiting — per-IP auth brute-force protection (429) and per-session message throttling
- Kubernetes-ready — multi-stage Docker image, health checks, NetworkPolicy, Helm-friendly
⚠️ Deployment Notice: This service is designed to run behind infrastructure safeguards — a Kubernetes Ingress (or reverse proxy) for TLS termination and a NetworkPolicy for network segmentation. Internal endpoints (`/health`, `/ready`, `/metrics`) are intentionally unauthenticated for K8s probes and Prometheus scraping, and must not be exposed to untrusted networks. See SECURITY.md for the full deployment security model.
Pre-built images are available with popular models embedded — no downloads or volume mounts needed:
```bash
# Just run — model is included in the image
docker run -p 9090:9090 -e WSS_API_KEY=your-secret-key \
  ghcr.io/vbomfim/openasr:base.en
```

| Image tag | Model | Image size | Best for |
|---|---|---|---|
| `ghcr.io/vbomfim/openasr:tiny.en` | Whisper tiny (English) | ~140 MB | Development, testing |
| `ghcr.io/vbomfim/openasr:base.en` | Whisper base (English) | ~210 MB | Low-latency production |
| `ghcr.io/vbomfim/openasr:large-v3-turbo` | Whisper large v3 turbo | ~1.7 GB | Best speed/quality balance |
| `ghcr.io/vbomfim/openasr:large-v3` | Whisper large v3 | ~3.1 GB | Maximum accuracy |
| `ghcr.io/vbomfim/openasr:latest` | No model (bring your own) | ~63 MB | Production K8s with external model storage |
Use the slim latest image and mount your model as a volume:
```bash
# Download a model
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Run with volume mount
docker run -p 9090:9090 \
  -v $(pwd)/models:/models \
  -e WHISPER_MODEL_PATH=/models/ggml-base.en.bin \
  -e WSS_API_KEY=your-secret-key \
  ghcr.io/vbomfim/openasr:latest
```

Or with Docker Compose:

```yaml
services:
  openasr:
    image: ghcr.io/vbomfim/openasr:base.en
    ports:
      - "9090:9090"
    environment:
      - WSS_API_KEY=your-secret-key
      - WSS_INFERENCE_THREADS=4
      - WSS_MAX_SESSIONS=10
```

To build from source:

```bash
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)

# Download a model
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Run
WHISPER_MODEL_PATH=models/ggml-base.en.bin ./transcription_server
```

The WebSocket endpoint is:

```
ws://host:9090/transcribe
```
Every WebSocket connection must be authenticated (unless WSS_API_KEY is unset for dev mode).
The API key is any string you choose — there is no specific format or token service. Use a cryptographically random string of at least 32 characters:
```bash
# Generate a secure random key
openssl rand -hex 32
# Example output: a3f1b9c7e8d2f4a6b0c5e7d9f1a3b5c7e9d1f3a5b7c9e1d3f5a7b9c1e3d5f7

# Or using Python
python3 -c "import secrets; print(secrets.token_hex(32))"
```

Set the key as an environment variable when starting the server:
```bash
# Docker
docker run -e WSS_API_KEY=<your-api-key-here> ghcr.io/vbomfim/openasr:base.en

# Kubernetes (use a Secret, not a ConfigMap)
kubectl create secret generic openasr-api-key \
  --from-literal=WSS_API_KEY=<your-api-key-here> \
  -n whisperx
```

In production, set `WSS_REQUIRE_AUTH=true` to prevent accidental startup without a key.
Pass the key in the Authorization header when opening the WebSocket connection:
```
Authorization: Bearer a3f1b9c7e8d2f4a6b0c5e7d9f1a3b5c7...
```

```python
# Python
headers = {"Authorization": f"Bearer {api_key}"}
async with websockets.connect("ws://host:9090/transcribe",
                              additional_headers=headers) as ws:
    ...
```

```javascript
// JavaScript (Node.js "ws" package — browser WebSocket cannot set custom headers)
const ws = new WebSocket("ws://host:9090/transcribe", {
  headers: { "Authorization": `Bearer ${apiKey}` }
});
```

```bash
# curl (for testing the WebSocket upgrade)
curl -H "Authorization: Bearer $API_KEY" \
  -H "Upgrade: websocket" -H "Connection: Upgrade" \
  -H "Sec-WebSocket-Key: $(openssl rand -base64 16)" \
  -H "Sec-WebSocket-Version: 13" \
  http://localhost:9090/transcribe
```

Query string authentication is not supported — API keys in URLs are logged by proxies and intermediaries.
Unauthenticated or invalid connections receive HTTP 401 Unauthorized before the WebSocket handshake completes.
For multi-tenant deployments with per-user tokens, token revocation, and SSO, use an ingress-level identity provider instead of a shared API key. The server stays lean — authentication is handled by your infrastructure:
```
Client (JWT) → [ Ingress / API Gateway ] → OpenASR server
                validates token here       receives pre-authenticated request
```
Example with nginx-ingress + OAuth2 Proxy (Azure Entra ID):
```yaml
# Ingress annotations for OAuth2 Proxy
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: openasr  # example name
  annotations:
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2-proxy.example.com/oauth2/auth"
    nginx.ingress.kubernetes.io/auth-signin: "https://oauth2-proxy.example.com/oauth2/start"
spec:
  rules:
    - host: openasr.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: whisperx-server
                port:
                  number: 9090
```

Compatible identity providers:
- Azure Entra ID (formerly Azure AD) — via OAuth2 Proxy
- Okta / Auth0 — via OAuth2 Proxy or Kong OIDC plugin
- Google Identity — via IAP or OAuth2 Proxy
- Keycloak — self-hosted, via OAuth2 Proxy
- Istio — native JWT validation in the service mesh
When using ingress-level auth, set WSS_API_KEY to empty (disable the built-in check) and rely on the gateway to reject unauthenticated requests before they reach the server.
```mermaid
sequenceDiagram
    participant C as Client
    participant S as Server
    C->>S: WebSocket upgrade (Authorization: Bearer <key>)
    S-->>C: 101 Switching Protocols
    C->>S: speech.config (JSON)
    S-->>C: speech.config.ack (JSON)
    Note over C,S: Client streams continuous binary audio frames
    loop Every audio window (e.g. 5s)
        C->>S: binary audio frames...
        Note right of S: Window full → inference runs
        S-->>C: speech.hypothesis (interim, per segment)
        S-->>C: speech.phrase (finalized turn, status=Success)
        S-->>C: speech.checkpoint (full transcript so far)
    end
    C->>S: speech.end (JSON)
    S-->>C: speech.phrase (status=EndOfStream, full transcript)
    S-->>C: speech.checkpoint (final)
```
Each completed inference window produces three events in this order:
| # | Event | Purpose | Content |
|---|---|---|---|
| 1 | `speech.hypothesis` | Interim result per segment | Individual segment text, offset, duration |
| 2 | `speech.phrase` | Finalized turn (stable, won't change) | Combined text for this window, `status: "Success"` |
| 3 | `speech.checkpoint` | Full accumulated transcript | Everything transcribed so far (for session resume) |
On speech.end, the server sends:
| # | Event | Purpose | Content |
|---|---|---|---|
| 1 | `speech.phrase` | End of stream marker | Full transcript, `status: "EndOfStream"` |
| 2 | `speech.checkpoint` | Final session state | Complete transcript + resume data |
```
Time   Direction  Event
────   ─────────  ──────────────────────────────────────────────
0.0s   C→S        speech.config (window=5000ms, overlap=500ms)
0.0s   S→C        speech.config.ack
0.0s   C→S        binary audio frames (streaming continuously)
...
5.0s   ─────      Window 1 ready [0ms–5000ms] → inference starts
6.2s   S→C        speech.hypothesis: "Hello, this is a test"
6.2s   S→C        speech.phrase: "Hello, this is a test." (status=Success)
6.2s   S→C        speech.checkpoint: "Hello, this is a test."
...
9.5s   ─────      Window 2 ready [4500ms–9500ms] → inference starts
11.0s  S→C        speech.hypothesis: "of the streaming server."
11.0s  S→C        speech.phrase: "of the streaming server." (status=Success)
11.0s  S→C        speech.checkpoint: "Hello, this is a test. of the streaming server."
...
12.0s  C→S        speech.end
12.0s  S→C        speech.phrase: "Hello, this is a test. of the ..." (status=EndOfStream)
12.0s  S→C        speech.checkpoint: (final state, usable for resume)
```
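The window boundaries in the timeline follow directly from the configured window and overlap: each new window starts `window - overlap` ms after the previous one. A small illustrative helper (not part of the server API) makes the arithmetic explicit:

```python
def window_bounds(index, window_ms=5000, overlap_ms=500):
    """Start and end (ms) of the index-th inference window (0-based).

    Consecutive windows advance by (window - overlap) ms, so with the
    defaults window 0 covers 0-5000 ms and window 1 covers 4500-9500 ms,
    matching the timeline above.
    """
    start = index * (window_ms - overlap_ms)
    return start, start + window_ms

print(window_bounds(0))  # (0, 5000)
print(window_bounds(1))  # (4500, 9500)
```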
Sent once after WebSocket connect. Configures language, audio format, and windowing.
```json
{
  "type": "speech.config",
  "payload": {
    "language": "en",
    "sample_rate": 16000,
    "encoding": "pcm_s16le",
    "window_duration_ms": 5000,
    "overlap_duration_ms": 500,
    "model_id": "whisper-large-v3",
    "resume_checkpoint": null
  }
}
```

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `language` | string | yes | `"en"` | Language code (max 16 chars) |
| `sample_rate` | int | yes | `16000` | Audio sample rate in Hz (8000–96000) |
| `encoding` | string | yes | `"pcm_s16le"` | `"pcm_s16le"` or `"opus"` |
| `window_duration_ms` | int | yes | `5000` | Transcription window size in ms (1000–60000) |
| `overlap_duration_ms` | int | yes | `500` | Window overlap in ms (0 to `window_duration_ms - 1`) |
| `model_id` | string | no | `"whisper-tiny.en"` | Model identifier (max 128 chars) |
| `resume_checkpoint` | object \| null | no | `null` | Checkpoint from a previous session to resume |
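The ranges in the table can be checked client-side before sending `speech.config`; a minimal sketch (the `validate_speech_config` helper is hypothetical, its rules simply mirror the table above):

```python
def validate_speech_config(payload):
    """Raise ValueError if a speech.config payload violates the documented ranges."""
    if not (1 <= len(payload.get("language", "")) <= 16):
        raise ValueError("language must be 1-16 chars")
    if not (8000 <= payload["sample_rate"] <= 96000):
        raise ValueError("sample_rate must be 8000-96000 Hz")
    if payload["encoding"] not in ("pcm_s16le", "opus"):
        raise ValueError("encoding must be pcm_s16le or opus")
    window = payload["window_duration_ms"]
    if not (1000 <= window <= 60000):
        raise ValueError("window_duration_ms must be 1000-60000")
    if not (0 <= payload["overlap_duration_ms"] < window):
        raise ValueError("overlap_duration_ms must be in [0, window_duration_ms)")
```

Catching bad values locally produces a clearer error than a round-trip ending in a `speech.error` with code `INVALID_MESSAGE`.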
After receiving speech.config.ack, send audio as continuous binary WebSocket frames. No JSON metadata per chunk — just raw audio bytes.
| Encoding | Format |
|---|---|
| `pcm_s16le` | Raw PCM, 16-bit signed little-endian, mono |
| `opus` | Opus-encoded frames, mono |
Audio at any sample rate is internally resampled to 16 kHz for Whisper inference.
Recommended chunk size: 200ms of audio (e.g., 6400 bytes at 16 kHz PCM).
Maximum frame size: 16 MB.
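The recommended 200 ms chunk size scales with the sample rate: bytes = rate × seconds × 2 for 16-bit mono PCM. A tiny helper (illustrative, not part of any client API) shows the arithmetic:

```python
def chunk_bytes(sample_rate_hz, chunk_ms=200, bytes_per_sample=2):
    """Bytes of mono PCM audio in one chunk (16-bit = 2 bytes per sample)."""
    return sample_rate_hz * chunk_ms // 1000 * bytes_per_sample

print(chunk_bytes(16000))  # 6400  (matches the recommendation above)
print(chunk_bytes(48000))  # 19200
```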
Signals no more audio will be sent. Server sends final transcript and checkpoint.
```json
{
  "type": "speech.end",
  "payload": {}
}
```

Confirms session creation with effective configuration.
```json
{
  "type": "speech.config.ack",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
    "effective_config": {
      "language": "en",
      "sample_rate": 16000,
      "encoding": "pcm_s16le",
      "window_duration_ms": 5000,
      "overlap_duration_ms": 500,
      "model_id": "whisper-large-v3"
    }
  }
}
```

Emitted after each inference window completes. May change as more context arrives.
```json
{
  "type": "speech.hypothesis",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "offset": 0,
    "duration": 5000,
    "text": "Hello, this is a test of the Whisper streaming transcription server."
  }
}
```

| Field | Type | Description |
|---|---|---|
| `offset` | int | Start time in milliseconds from session start |
| `duration` | int | Duration of the transcribed segment in milliseconds |
| `text` | string | Transcribed text |
Emitted after each inference window completes. This is the confirmed, stable transcription for that segment — it will not change. Use speech.checkpoint for the full accumulated transcript.
```json
{
  "type": "speech.phrase",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "offset": 0,
    "duration": 5000,
    "text": "Hello, this is a test of the Whisper streaming transcription server.",
    "confidence": 1.0,
    "status": "Success"
  }
}
```

On speech.end, a final speech.phrase with `"status": "EndOfStream"` is sent containing the full accumulated transcript.
Emitted after each inference window. Store this to resume the session on any server node.
```json
{
  "type": "speech.checkpoint",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
    "last_audio_ms": 5000,
    "last_text_offset": 69,
    "full_transcript": "Hello, this is a test of the Whisper streaming transcription server.",
    "buffer_config": {
      "window_duration_ms": 5000,
      "overlap_duration_ms": 500
    },
    "backend_model_id": "whisper-large-v3"
  }
}
```

To resume, pass the checkpoint as `resume_checkpoint` in the next `speech.config`.
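Resuming amounts to echoing a stored checkpoint payload back in the next `speech.config`. A sketch, under the assumption that the client persists the last checkpoint payload itself (the `build_resume_config` helper is hypothetical; field names come from the messages above):

```python
def build_resume_config(checkpoint_payload):
    """Build a speech.config that resumes from a stored speech.checkpoint payload."""
    buf = checkpoint_payload["buffer_config"]
    return {
        "type": "speech.config",
        "payload": {
            "language": "en",                  # assumed: reuse the session's language
            "sample_rate": 16000,
            "encoding": "pcm_s16le",
            "window_duration_ms": buf["window_duration_ms"],
            "overlap_duration_ms": buf["overlap_duration_ms"],
            "model_id": checkpoint_payload["backend_model_id"],
            "resume_checkpoint": checkpoint_payload
        }
    }
```

Because the checkpoint is self-contained, the resumed session can land on any server node, which is what makes the protocol stateless.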
Sent when the server's audio buffer is filling up. Slow down or resume sending.
```json
{
  "type": "speech.backpressure",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "action": "slow_down"
  }
}
```

| Action | Meaning |
|---|---|
| `"slow_down"` | Buffer >80% full — reduce audio send rate |
| `"ok"` | Buffer <50% — safe to resume normal rate |
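On the client, backpressure maps naturally onto the delay between audio chunks. A minimal pacing sketch (the `Pacer` class, the doubling factor, and the 2 s ceiling are arbitrary illustrative choices, not part of the protocol):

```python
class Pacer:
    """Adjust the delay between audio chunks based on speech.backpressure events."""

    def __init__(self, base_delay_s=0.2):   # 0.2 s matches 200 ms chunks
        self.base = base_delay_s
        self.delay = base_delay_s

    def on_backpressure(self, action):
        if action == "slow_down":            # buffer >80% full: back off
            self.delay = min(self.delay * 2, 2.0)
        elif action == "ok":                 # buffer drained: resume normal rate
            self.delay = self.base

pacer = Pacer()
pacer.on_backpressure("slow_down")
print(pacer.delay)  # 0.4
pacer.on_backpressure("ok")
print(pacer.delay)  # 0.2
```

In a real client the send loop would `await asyncio.sleep(pacer.delay)` between chunks.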
```json
{
  "type": "speech.error",
  "session_id": "d785d7fad5ecd9ee3eed4ddeb2953e59",
  "payload": {
    "code": "INVALID_MESSAGE",
    "message": "Missing or invalid 'language'"
  }
}
```

| Code | Reason | Recovery |
|---|---|---|
| `INVALID_MESSAGE` | Malformed JSON, missing required fields, invalid field types, values out of range, or unknown message type | Fix the message and resend |
| `INVALID_STATE` | Message received in wrong connection state (e.g., binary audio before `speech.config`) | Follow the correct protocol flow |
| `SESSION_LIMIT` | Maximum concurrent sessions reached (default: 20) | Wait for a session to end, or increase `WSS_MAX_SESSIONS` |
| `SESSION_NOT_FOUND` | Session ID not found (session expired or destroyed) | Create a new session |
| `AUDIO_ERROR` | Audio decoding or resampling failed (corrupt Opus frame, invalid PCM data) | Check audio encoding and format |
HTTP-level errors (before WebSocket upgrade):
| HTTP Status | Reason |
|---|---|
| 401 Unauthorized | Missing or invalid API key |
| 404 Not Found | Invalid endpoint (use `/transcribe`) |
| Encoding | Description | Notes |
|---|---|---|
| `pcm_s16le` | Raw PCM, 16-bit signed, little-endian, mono | Most common, zero overhead |
| `opus` | Opus-encoded frames, mono | Lower bandwidth, decoder initialized per-session |
Sample rates: Any rate from 8,000 to 96,000 Hz. Audio is internally resampled to 16,000 Hz (Whisper's native rate) using libsamplerate with SRC_SINC_MEDIUM_QUALITY.
Channel layout: Mono only. Multi-channel audio must be downmixed before sending.
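Downmixing interleaved 16-bit stereo to mono can be done by averaging each L/R sample pair before sending. A dependency-free sketch using only the standard library (assumes a little-endian host, where `array("h")` matches the s16le wire format):

```python
import array

def stereo_to_mono_s16le(raw: bytes) -> bytes:
    """Average interleaved L/R 16-bit samples into mono.

    array("h") uses native byte order, so this matches pcm_s16le on
    little-endian hosts (x86, most ARM).
    """
    samples = array.array("h")
    samples.frombytes(raw)
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    return mono.tobytes()

# Two stereo frames: (L=100, R=300) and (L=-200, R=0)
mono = stereo_to_mono_s16le(array.array("h", [100, 300, -200, 0]).tobytes())
print(array.array("h", mono).tolist())  # [200, -100]
```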
| Parameter | Default | Configurable via |
|---|---|---|
| Max concurrent sessions | 20 | WSS_MAX_SESSIONS |
| Max WebSocket connections | 100 | Compile-time (kMaxConnections) |
| Max audio frame size | 16 MB | Compile-time (maxPayloadLength) |
| Idle connection timeout | 120s | Compile-time (idleTimeout) |
| Send buffer backpressure | 1 MB | Compile-time (maxBackpressure) |
| Audio ring buffer | 30s per session | Code default |
| Inference queue depth | 100 jobs | Compile-time |
| Max transcript length | 1 MB per session | Compile-time (kMaxTranscriptLength) |
| Max session duration | 2 hours | Compile-time (kMaxSessionDurationMs) |
| Inference threads | 4 | WSS_INFERENCE_THREADS |
Memory per session and model RAM: To be measured. See benchmark tool for running your own sizing tests.
Measured inference latency (5s audio window, 12-core ARM CPU, single session):
| Model | Latency |
|---|---|
| tiny.en | ~1.1s |
| base.en | ~2.4s |
| small.en | ~8.1s |
| large-v3 | ~50s |
Latency varies by hardware, concurrency, and audio content. GPU (CUDA) reduces latency 10–50×.
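A useful way to read the table is the real-time factor (inference time divided by audio length): anything below 1.0 keeps up with live audio. Applied to the 5 s window above (the helper is illustrative):

```python
def real_time_factor(latency_s, window_s=5.0):
    """RTF < 1.0 means inference keeps up with live audio."""
    return latency_s / window_s

# From the table: only tiny.en and base.en stay under 1.0 on this CPU
for model, latency in [("tiny.en", 1.1), ("base.en", 2.4),
                       ("small.en", 8.1), ("large-v3", 50.0)]:
    print(f"{model}: RTF {real_time_factor(latency):.2f}")
```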
All models are downloaded from ggerganov/whisper.cpp on Hugging Face in GGML format.
| Model | File | Size | Languages | Download |
|---|---|---|---|---|
| Tiny (EN) | `ggml-tiny.en.bin` | 75 MB | English | ⬇ |
| Tiny | `ggml-tiny.bin` | 75 MB | 99 languages | ⬇ |
| Base (EN) | `ggml-base.en.bin` | 142 MB | English | ⬇ |
| Base | `ggml-base.bin` | 142 MB | 99 languages | ⬇ |
| Small (EN) | `ggml-small.en.bin` | 466 MB | English | ⬇ |
| Small | `ggml-small.bin` | 466 MB | 99 languages | ⬇ |
| Medium (EN) | `ggml-medium.en.bin` | 1.5 GB | English | ⬇ |
| Medium | `ggml-medium.bin` | 1.5 GB | 99 languages | ⬇ |
| Large v3 | `ggml-large-v3.bin` | 3 GB | 99 languages | ⬇ |
| Large v3 Turbo | `ggml-large-v3-turbo.bin` | 1.6 GB | 99 languages | ⬇ |
Quantized variants use less RAM and run faster with minimal quality loss:
| Quantization | Size reduction | Example |
|---|---|---|
| Q8_0 | ~50% smaller | ggml-large-v3-turbo-q8_0.bin ⬇ |
| Q5_0 | ~65% smaller | ggml-large-v3-q5_0.bin ⬇ |
Browse all available models: huggingface.co/ggerganov/whisper.cpp
| Model | Description | Download |
|---|---|---|
| `ggml-small.en-tdrz.bin` | Small English + speaker diarization (tinydiarize) | ⬇ |
1. Download a model:
```bash
# English-only (faster, recommended for English)
curl -L -o models/ggml-base.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin

# Multilingual (supports 99 languages)
curl -L -o models/ggml-large-v3-turbo.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo.bin

# Quantized (smaller + faster, good for production)
curl -L -o models/ggml-large-v3-turbo-q8_0.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-turbo-q8_0.bin
```

2. Set the model path and restart:

```bash
# Docker
docker run -e WHISPER_MODEL_PATH=/models/ggml-base.en.bin ...

# Kubernetes
kubectl -n whisperx set env deployment/whisperx-server \
  WHISPER_MODEL_PATH=/models/ggml-base.en.bin
kubectl -n whisperx rollout restart deployment whisperx-server
```

```toml
# server.toml
[model]
path = "/models/ggml-base.en.bin"
```

The server loads one model at startup. Switching models requires a restart. `.en` models are English-only and faster.
All settings via environment variables (or server.toml with WSS_CONFIG_PATH). Environment variables take precedence over the config file.
| Variable | Default | Description |
|---|---|---|
| `WHISPER_MODEL_PATH` | (required) | Path to GGML model file |
| `WSS_API_KEY` | (empty = dev mode) | API key for authentication |
| `WSS_PORT` | `9090` | Server listen port |
| `WSS_HOST` | `0.0.0.0` | Server bind address |
| `WSS_MAX_SESSIONS` | `20` | Maximum concurrent sessions |
| `WSS_LANGUAGE` | `en` | Default language |
| `WSS_BEAM_SIZE` | `5` | Whisper beam search size |
| `WSS_INFERENCE_THREADS` | `4` | Threads per inference call |
| `WSS_WINDOW_DURATION_MS` | `20000` | Default window duration (ms) |
| `WSS_OVERLAP_DURATION_MS` | `2000` | Default window overlap (ms) |
| `WSS_LOG_LEVEL` | `info` | Log level (trace/debug/info/warn/error) |
| `WSS_LOG_FORMAT` | `text` | Log format: `text` (human-readable) or `json` (structured) |
| `WSS_CONFIG_PATH` | `config/server.toml` | Path to TOML config file |
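The env-over-TOML precedence can be reproduced in a few lines. A sketch of the lookup order (the `config_value` helper is illustrative, not the server's actual configuration code):

```python
import os

def config_value(name, toml_table, default=None):
    """Environment variable wins; then server.toml; then the built-in default."""
    if name in os.environ:
        return os.environ[name]
    return toml_table.get(name, default)

toml_table = {"WSS_PORT": "8080"}        # value from server.toml
os.environ["WSS_PORT"] = "9090"          # environment override
print(config_value("WSS_PORT", toml_table))             # 9090 (env wins)
print(config_value("WSS_LANGUAGE", toml_table, "en"))   # en (built-in default)
```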
For log aggregation systems (Datadog, Loki, ELK, CloudWatch), enable JSON log output:
```bash
WSS_LOG_FORMAT=json ./transcription_server
```

Or in `server.toml`:

```toml
[logging]
level = "info"
format = "json"
```

JSON output (one object per line):

```json
{"level":"info","message":"whisperx-streaming-server v0.1.0 starting...","timestamp":"2026-03-03T23:23:16.957Z"}
{"level":"debug","message":"Binary ingested: session=94a0d5 bytes=6400 written=3200","timestamp":"2026-03-03T23:33:54.283Z"}
```

The default format is `text` (human-readable) for backward compatibility.
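One-object-per-line JSON logs are trivially machine-parseable without an aggregator; for example, filtering by level (a sketch, with field names taken from the sample output above):

```python
import json

def filter_level(lines, level):
    """Yield parsed log records whose 'level' field matches."""
    for line in lines:
        rec = json.loads(line)
        if rec.get("level") == level:
            yield rec

logs = [
    '{"level":"info","message":"server starting","timestamp":"2026-03-03T23:23:16.957Z"}',
    '{"level":"debug","message":"Binary ingested","timestamp":"2026-03-03T23:33:54.283Z"}',
]
print([r["message"] for r in filter_level(logs, "debug")])  # ['Binary ingested']
```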
```
GET /health   # liveness probe
GET /ready    # readiness probe (checks model loaded + queue capacity)
```

Both return JSON (no authentication required):

```jsonc
// /health
{"status": "ok", "active_sessions": 3, "max_sessions": 20, "inference_pending": 1}

// /ready
{"status": "ready", "model_ready": true, "queue_ok": true, "sessions_available": true}
```

```
GET /metrics
```
Returns metrics in Prometheus text format (no authentication required). Scrape interval recommendation: 15s.
| Metric | Description |
|---|---|
| `openasr_active_sessions` | Current number of transcription sessions |
| `openasr_active_connections` | Current WebSocket connections |
| `openasr_inference_queue_depth` | Inference jobs waiting in queue |
| Metric | Description |
|---|---|
| `openasr_connections_total` | Total WebSocket connections |
| `openasr_connections_rejected_auth_total` | Connections rejected (401 Unauthorized) |
| `openasr_connections_rejected_limit_total` | Connections rejected (limit reached) |
| `openasr_sessions_created_total` | Sessions created |
| `openasr_sessions_destroyed_total` | Sessions destroyed |
| `openasr_audio_bytes_received_total` | Audio data received (bytes) |
| `openasr_audio_chunks_received_total` | Binary WebSocket frames received |
| `openasr_inference_jobs_submitted_total` | Inference jobs submitted to thread pool |
| `openasr_inference_jobs_completed_total` | Inference jobs completed successfully |
| `openasr_inference_jobs_dropped_total` | Inference jobs dropped (queue full) |
| `openasr_transcription_segments_total` | Transcription segments produced |
| `openasr_errors_total` | Errors sent to clients |
| `openasr_backpressure_events_total` | Backpressure signals sent |
| Metric | Buckets | Description |
|---|---|---|
| `openasr_inference_duration_seconds` | 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 30, 60, 120 s | Per-window inference time |
| `openasr_audio_window_duration_seconds` | 1, 2, 5, 10, 20, 30, 60 s | Audio window size |
```yaml
# prometheus.yml
scrape_configs:
  - job_name: openasr
    scrape_interval: 15s
    static_configs:
      - targets: ["openasr-server:9090"]
```

Useful queries:

```promql
# Request rate (connections/sec)
rate(openasr_connections_total[5m])

# Inference latency p95
histogram_quantile(0.95, rate(openasr_inference_duration_seconds_bucket[5m]))

# Error rate (%)
rate(openasr_errors_total[5m]) / rate(openasr_connections_total[5m]) * 100

# Audio throughput (MB/sec)
rate(openasr_audio_bytes_received_total[5m]) / 1024 / 1024

# Queue saturation
openasr_inference_queue_depth / 100  # queue limit is 100
```
```bash
# Local development
kubectl apply -f k8s/local/all-in-one.yaml

# Production (customize first)
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/pvc.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
```

Health probes are configured for `/health` on port 9090.
```python
import asyncio, json, wave, websockets

async def transcribe(wav_path, api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    async with websockets.connect("ws://localhost:9090/transcribe",
                                  additional_headers=headers) as ws:
        # Configure session
        with wave.open(wav_path) as wf:
            sample_rate = wf.getframerate()
            pcm = wf.readframes(wf.getnframes())
        await ws.send(json.dumps({
            "type": "speech.config",
            "payload": {
                "language": "en",
                "sample_rate": sample_rate,
                "encoding": "pcm_s16le",
                "window_duration_ms": 5000,
                "overlap_duration_ms": 500
            }
        }))
        ack = json.loads(await ws.recv())
        print(f"Session: {ack['payload']['session_id']}")

        # Stream audio as binary frames (200ms chunks)
        chunk_size = sample_rate * 2 // 5
        for i in range(0, len(pcm), chunk_size):
            await ws.send(pcm[i:i + chunk_size])

        # Signal end of audio, then collect results until EndOfStream
        await ws.send(json.dumps({"type": "speech.end", "payload": {}}))
        while True:
            msg = json.loads(await asyncio.wait_for(ws.recv(), timeout=120))
            if msg["type"] == "speech.hypothesis":
                print(f"  Partial: {msg['payload']['text']}")
            elif msg["type"] == "speech.checkpoint":
                print(f"  Transcript: {msg['payload']['full_transcript']}")
            elif (msg["type"] == "speech.phrase"
                  and msg["payload"].get("status") == "EndOfStream"):
                print(f"Final: {msg['payload']['text']}")
                break

asyncio.run(transcribe("audio.wav", "your-api-key"))
```

A full-featured test client is included at `tools/test_client.py`.
See docs/ARCHITECTURE.md for the full system design, threading model, memory management strategy, and protocol specification.
MIT