# Web Speech API Proposal: On-Device Speech Recognition Quality Levels

### Problem

The recent updates to the Web Speech API enabling on-device speech recognition (via `processLocally: true`) are a significant step forward for privacy and latency. However, the current API treats all on-device models as functionally equivalent.

In reality, on-device models vary drastically in capability. A lightweight model optimized for voice commands (e.g., "turn on the lights") is insufficient for high-stakes use cases like video-conferencing transcription or accessibility captioning, which require handling continuous speech, multiple speakers, and background noise.

Currently, developers have no way to query whether the local device can handle these complex tasks. Because they cannot verify the *quality* of the on-device model, applications like Google Meet must often default to cloud-based recognition to guarantee a minimum user experience. This effectively negates the privacy and bandwidth benefits of the on-device API for high-end use cases.

### Proposed Solution

I propose extending the `SpeechRecognitionOptions` dictionary (used in `SpeechRecognition.available()` and `SpeechRecognition.install()`) to include a `quality` field.

This field uses a "Levels" approach, similar to video codec profiles, to describe the task complexity the model can handle, rather than raw accuracy numbers.

#### WebIDL Definition

```webidl
enum SpeechRecognitionQuality {
  "command",      // Level 1: Short phrases, single speaker, limited vocab (e.g. smart home)
  "dictation",    // Level 2: Continuous speech, moderate noise, single primary speaker (e.g. SMS/email)
  "conversation"  // Level 3: Multi-speaker, complex vocab, high noise tolerance (e.g. meetings/captions)
};

dictionary SpeechRecognitionOptions {
  required sequence<DOMString> langs;
  boolean processLocally = false;

  // New optional field. Defaults to "command" (lowest bar) if omitted.
  SpeechRecognitionQuality quality = "command";
};
```
| 36 | + |
### Proposed Behavior

The `quality` field acts as a hard constraint, similar to `langs`.

1. **SpeechRecognition.available(options):** Resolves to `"available"` only if the device has an installed model that meets or exceeds the requested quality level for the specified language.
2. **SpeechRecognition.install(options):** Attempts to download/install a model that meets the requested `quality`. If the device hardware is incapable of running a model of that complexity (e.g., lack of NPU or RAM), the promise should reject.
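
The "meets or exceeds" rule in step 1 implies a total ordering over the three levels. As a minimal sketch (the `QUALITY_RANK` table and `meetsQuality` helper are illustrative, not part of the proposal), a user agent or polyfill could implement the comparison like this:

```javascript
// Illustrative only: one way to implement "meets or exceeds".
// The ranks mirror the enum order: command < dictation < conversation.
const QUALITY_RANK = { command: 1, dictation: 2, conversation: 3 };

// True if a model of quality `installed` satisfies a request for `requested`.
function meetsQuality(installed, requested) {
  return QUALITY_RANK[installed] >= QUALITY_RANK[requested];
}
```

Under this ordering, a `conversation` model satisfies a `dictation` request, but not the reverse.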

### Usage Example

This allows developers to write "progressive enhancement" logic that prefers on-device processing but falls back gracefully if the device isn't powerful enough.

```javascript
const meetConfig = {
  langs: ['en-US'],
  processLocally: true,
  // We specifically need a model capable of handling a meeting environment
  quality: 'conversation'
};

async function setupSpeech() {
  // 1. Check if a high-quality model is already available
  const availability = await SpeechRecognition.available(meetConfig);

  if (availability === 'available') {
    startOnDeviceSpeech();
    return;
  }

  // 2. If not, try to install one
  try {
    await SpeechRecognition.install(meetConfig);
    startOnDeviceSpeech();
  } catch (e) {
    // 3. Fallback: the device hardware is too weak for "conversation" quality,
    // or the user denied the download. Fall back to the cloud to keep the
    // transcript usable.
    console.log("High-quality on-device model unavailable. Using Cloud.");
    startCloudSpeech();
  }
}
```
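
An app that can tolerate a lower level (for example, by warning the user that multi-speaker handling is unavailable) could probe levels best-first before resorting to the cloud. The `acceptableQualities` helper below is a hypothetical sketch, not part of the proposal:

```javascript
// Hypothetical helper (not part of the proposal): list the quality levels an
// app could accept, best first, down to a stated minimum. A caller can then
// probe SpeechRecognition.available() for each level before using the cloud.
const QUALITY_LEVELS = ['conversation', 'dictation', 'command'];

function acceptableQualities(preferred, minimum = preferred) {
  const start = QUALITY_LEVELS.indexOf(preferred);
  const end = QUALITY_LEVELS.indexOf(minimum);
  return QUALITY_LEVELS.slice(start, end + 1);
}
```

For instance, `acceptableQualities('conversation', 'dictation')` yields the two levels to probe, in preference order, before falling back to cloud recognition.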

### Alternatives Considered

**1. Exposing Word Error Rate (WER) or accuracy scores**
* *Proposal:* Allow developers to request `minAccuracy: 0.95`.
* *Why rejected:*
  * **Fingerprinting:** Precise accuracy metrics (e.g., 95.4%) are a high-entropy fingerprinting vector that can identify specific hardware/model versions.
  * **Future-proofing:** Accuracy metrics drift over time. A "95%" score on a 2025 test set might be considered poor performance in 2030.

**2. Relative enums (low, medium, high)**
* *Proposal:* Allow developers to request `quality: 'high'`.
* *Why rejected:* "High" is relative to the device. "High" quality on a low-end smartwatch might still be insufficient for a meeting transcript. The API needs to guarantee an *absolute* floor of utility (semantic capability) rather than a relative hardware setting.

### Benefits

* **Privacy:** Applications can confidently use on-device models for complex tasks, keeping audio data off the cloud.
* **Predictability:** Developers know exactly what class of tasks the model can handle (e.g., `conversation` implies a higher level of accuracy).
* **Safety:** By grouping devices into coarse buckets (`command`, `dictation`, `conversation`), we minimize fingerprinting risks compared to exposing model versions or raw metrics.