
Commit 4af839c

Add explainer for quality levels proposal (#183)
Co-authored-by: Evan Liu <[email protected]>
1 parent 895c5dd commit 4af839c

1 file changed: +94 −0 lines

explainers/quality-levels.md

# Web Speech API Proposal: On-Device Speech Recognition Quality Levels

**Author:** [email protected]

### Problem

The recent updates to the Web Speech API enabling on-device speech recognition (via `processLocally: true`) are a significant step forward for privacy and latency. However, the current API treats all on-device models as functionally equivalent.

In reality, on-device models vary drastically in capability. A lightweight model optimized for voice commands (e.g., "turn on the lights") is insufficient for high-stakes use cases like video conferencing transcription or accessibility captioning, which require handling continuous speech, multiple speakers, and background noise.

Currently, developers have no way to query whether the local device can handle these complex tasks. Because they cannot verify the *quality* of the on-device model, applications like Google Meet must often default to cloud-based recognition to guarantee a minimum user experience. This effectively negates the privacy and bandwidth benefits of the on-device API for high-end use cases.

### Proposed Solution

I propose extending the `SpeechRecognitionOptions` dictionary (used in `SpeechRecognition.available()` and `SpeechRecognition.install()`) to include a `quality` field.

This field uses a "Levels" approach, similar to video codec profiles, to describe the task complexity the model can handle, rather than raw accuracy numbers.

#### WebIDL Definition
```webidl
enum SpeechRecognitionQuality {
  "command",      // Level 1: Short phrases, single speaker, limited vocab (e.g. Smart Home)
  "dictation",    // Level 2: Continuous speech, moderate noise, single primary speaker (e.g. SMS/Email)
  "conversation"  // Level 3: Multi-speaker, complex vocab, high noise tolerance (e.g. Meetings/Captions)
};

dictionary SpeechRecognitionOptions {
  required sequence<DOMString> langs;
  boolean processLocally = false;

  // New optional field. Defaults to "command" (lowest bar) if omitted.
  SpeechRecognitionQuality quality = "command";
};
```
### Proposed Behavior

The `quality` field acts as a hard constraint, similar to `langs`.

1. **SpeechRecognition.available(options):** Resolves with `'available'` only if the device has an installed model that meets or exceeds the requested quality level for the specified language.
2. **SpeechRecognition.install(options):** Attempts to download and install a model that meets the requested `quality`. If the device hardware is incapable of running a model of that complexity (e.g., lack of NPU or RAM), the promise should reject.
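Because the levels are ordered by capability, a page could also probe which levels a device currently supports before choosing a configuration. The helper below is an illustrative sketch, not part of the proposal; it takes the (proposed) `SpeechRecognition.available` function as a parameter so the logic stands on its own:

```javascript
// Quality levels in ascending order of capability, mirroring the proposed enum.
const QUALITY_LEVELS = ['command', 'dictation', 'conversation'];

// Returns the highest quality level for which an on-device model is already
// installed for the given languages, or null if none is. `available` is the
// (proposed) static SpeechRecognition.available, passed in explicitly.
async function highestAvailableQuality(available, langs) {
  let best = null;
  for (const quality of QUALITY_LEVELS) {
    const status = await available({ langs, processLocally: true, quality });
    if (status === 'available') best = quality;
  }
  return best;
}
```

On a real page this would be invoked as `highestAvailableQuality(SpeechRecognition.available.bind(SpeechRecognition), ['en-US'])`.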
### Usage Example

This allows developers to write "progressive enhancement" logic that prefers on-device processing but falls back gracefully if the device isn't powerful enough.

```javascript
const meetConfig = {
  langs: ['en-US'],
  processLocally: true,
  // We specifically need a model capable of handling a meeting environment.
  quality: 'conversation'
};

async function setupSpeech() {
  // 1. Check if a high-quality model is already available.
  const availability = await SpeechRecognition.available(meetConfig);

  if (availability === 'available') {
    startOnDeviceSpeech();
    return;
  }

  // 2. If not, try to install one.
  try {
    await SpeechRecognition.install(meetConfig);
    startOnDeviceSpeech();
  } catch (e) {
    // 3. Fallback: device hardware is too weak for "conversation" quality,
    // or the user denied the download. Fall back to the cloud to ensure
    // transcript usability.
    console.log("High-quality on-device model unavailable. Using Cloud.");
    startCloudSpeech();
  }
}
```
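An application with laxer requirements could also step down through quality levels before abandoning local processing entirely. The cascade below is a sketch of that pattern; `installAtBestQuality` is a hypothetical helper, not part of the proposal, and takes the (proposed) `SpeechRecognition.install` function as a parameter:

```javascript
// Try to install a local model at the highest quality level the device can
// support, stepping down from the preferred level. Resolves with the level
// that succeeded, or null if every install attempt rejected (in which case
// the caller would fall back to cloud recognition).
async function installAtBestQuality(install, langs, preferred = 'conversation') {
  const levels = ['conversation', 'dictation', 'command'];
  for (const quality of levels.slice(levels.indexOf(preferred))) {
    try {
      await install({ langs, processLocally: true, quality });
      return quality; // Device accepted a model at this level.
    } catch (e) {
      // Hardware too weak (or download refused) at this level; step down.
    }
  }
  return null;
}
```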
### Alternatives Considered

**1. Exposing Word Error Rate (WER) or Accuracy scores**
* *Proposal:* Allow developers to request `minAccuracy: 0.95`.
* *Why rejected:*
  * **Fingerprinting:** Precise accuracy metrics (e.g., 95.4%) are a high-entropy fingerprinting vector that can identify specific hardware/model versions.
  * **Future-proofing:** Accuracy metrics drift over time. A "95%" score on a 2025 test set might be considered poor performance in 2030.

**2. Relative Enums (Low, Medium, High)**
* *Proposal:* Allow developers to request `quality: 'high'`.
* *Why rejected:* "High" is relative to the device. "High" quality on a low-end smartwatch might still be insufficient for a meeting transcript. The API needs to guarantee an *absolute* floor of utility (semantic capability) rather than a relative hardware setting.

### Benefits

* **Privacy:** Applications can confidently use on-device models for complex tasks, keeping audio data off the cloud.
* **Predictability:** Developers know exactly what class of tasks the model can handle (e.g., `conversation` implies robustness to multiple speakers and noisy environments).
* **Safety:** By grouping devices into coarse buckets (`command`, `dictation`, `conversation`), we minimize fingerprinting risks compared to exposing model versions or raw metrics.