Changes:
- Silence detection: 3.0s → 2.0s (faster response)
- Typing speed: 0.01s/char → 0.005s/char (2x faster typing)
- Overall: Should feel more responsive now
Note about "typing as you speak":
- Current limitation: Whisper needs complete audio to transcribe
- Can't do real-time streaming with standard Whisper
- Alternative: Could switch to faster-whisper or streaming model (more complex)
Enhanced Matching:
Old: "move left" in text
New: any(phrase in text for phrase in ["move left", "go left", "left arrow", "cursor left"])
More command variations now recognized:
- "move left" OR "go left" OR "left arrow" OR "cursor left"
- "move to start" OR "beginning" OR "start of line"
- "move to end" OR "end of line"
- etc.
Now shows:
📝 Raw transcription: 'move left five'
📝 Lowercase: 'move left five'
🎯 Command detected: move_left(5)
✅ Command executed successfully
Or if not a command:
📝 Raw transcription: 'echo hello'
📝 Lowercase: 'echo hello'
💬 No command detected, typing text...
⌨️ Typing: echo hello
./test-commands.shor
tail -f /tmp/voice-dictation.logIf command works:
📝 Raw transcription: 'move left'
🎯 Command detected: move_left(1)
If command doesn't work:
📝 Raw transcription: 'muv left' ← Whisper misheard!
💬 No command detected, typing text...
This tells you exactly how Whisper heard your voice.
Possible causes:
- Background noise - Whisper mishears words
- Speech clarity - Speaking too fast/slow
- Whisper transcription variance - May spell words differently
Solutions:
- Speak clearly and at normal pace
- Reduce background noise
- Check logs to see what Whisper actually transcribed
- If Whisper consistently mishears, add that variant to code
Current bottleneck: Whisper transcription takes ~1-2 seconds
Transcription time depends on:
- Audio length (shorter = faster)
- Model size (tiny = fastest, base = medium, small = slower)
- CPU speed
To speed up:
# Stop current
./toggle.sh
# Start with tiny model (fastest)
export PICOVOICE_ACCESS_KEY="your-key"
python main.py --no-gui --model tinyCurrent settings:
- 0.005s per character (200 chars/second)
- Can be reduced further but may cause input buffer issues
To make even faster, edit main.py line 63:
self.injector = InputInjector(delay=0.001) # Even fasterCurrent flow:
Say "computer" → <instant>
System hears wake word → <instant>
Start recording → <instant>
Speak your command → <you control>
Silence for 2 seconds → 2 seconds
Transcribe with Whisper → 1-2 seconds
Type the text → 0.5-1 second (depends on length)
───────────────────────────────────────
Total: ~4-5 seconds from silence to completion
The 1-2 second Whisper transcription is unavoidable with current setup.
To get true "type as you speak", would need:
- Switch to faster-whisper library (supports streaming)
- Transcribe in small chunks (0.5s segments)
- Type each word as it's transcribed
This is more complex but achievable. Let me know if you want this!
| Metric | Before | After | Improvement |
|---|---|---|---|
| Silence detection | 3.0s | 2.0s | 33% faster |
| Typing speed | 0.01s/char | 0.005s/char | 2x faster |
| Command variations | 2-3 | 4-5 | More flexible |
| Debug visibility | None | Full logging | Much better |
Try these phrases and watch the logs:
✅ Navigation:
- "move left"
- "move right five"
- "go to start"
- "end of line"
✅ Editing:
- "delete word"
- "delete line"
- "backspace"
✅ Special:
- "press enter"
- "press tab"
✅ Regular text:
- "echo hello world"
- "ls dash la"
See logs: ./test-commands.sh