feat: TTS audio mode — Kokoro voice personas, seekbar, conversational AI #236
alichherawalla wants to merge 98 commits into main from feat/tts-implementation
Conversation
Implements on-device text-to-speech using OuteTTS 0.3 (454 MB) + WavTokenizer (73 MB) via llama.rn, with react-native-audio-api for playback.

Two interface modes (user-switchable from Settings):
- Chat Mode: play/stop TTSButton on each assistant message bubble
- Audio Mode: waveform bubbles with auto-TTS after streaming, transcript expand, speed cycling, and PCM audio persisted to disk per message for repeat playback

New files:
- src/constants/ttsModels.ts — model URLs, RAM thresholds, cache config
- src/services/ttsService.ts — download, load, generate, persist, play
- src/stores/ttsStore.ts — Zustand store with Chat + Audio Mode actions
- src/hooks/useTTS.ts — convenience hook with RAM gate and weighted progress
- src/components/TTSButton/index.tsx — Chat Mode play/stop per message
- src/components/AudioMessageBubble/index.tsx — waveform bubble component
- src/screens/TTSSettingsScreen/index.tsx — download, mode, speed, cache

Modified:
- Message type: audioPath, waveformData, audioDurationSeconds, isGeneratingAudio
- ChatMessage: Audio Mode branch + TTSButton in meta row
- SettingsScreen: Text to Speech nav row
- Navigation: TTSSettings route
- stores/index.ts, services/index.ts: exports

Tests: 42 unit + integration tests covering service, store, and full flows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Revert ChatMessage to main (avoids a pre-existing complexity lint failure when the file enters the push-range diff)
- Add Audio Mode + TTSButton to MessageRenderer instead — clean, under the limit
- Move the audioPath/waveformData/audioDurationSeconds/isGeneratingAudio fields from types/index.ts to types/tts.ts via module augmentation (keeps index.ts under the 350-line max)
- Add a react-native-audio-api global mock to jest.setup.ts so all test suites that transitively import ttsService can resolve the native module

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In finalizeStreamingMessage, after addMessage() saves the assistant reply, check if Audio Mode is active and model is loaded — if so, fire useTTSStore.generateAndSave() in the background so the waveform bubble auto-generates instead of spinning indefinitely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
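The suggested trigger can be sketched as a small pure function. The store shape here (interfaceMode, isModelLoaded, generateAndSave) is an assumption based on the PR description, not the actual ttsStore API:

```typescript
// Hypothetical sketch of the suggested background trigger: after the assistant
// reply is saved, fire TTS generation when Audio Mode is active and the model
// is loaded. Field and action names are assumptions from the PR text.
type TTSState = {
  settings: { interfaceMode: "chat" | "audio" };
  isModelLoaded: boolean;
  generateAndSave: (messageId: string) => Promise<void>;
};

export function maybeAutoGenerateAudio(tts: TTSState, messageId: string): boolean {
  if (tts.settings.interfaceMode !== "audio" || !tts.isModelLoaded) return false;
  // Fire and forget — the waveform bubble shows a spinner until audio fields land.
  void tts.generateAndSave(messageId).catch(() => {
    // In the real store the error path would clear isGeneratingAudio.
  });
  return true;
}
```

Returning a boolean makes the guard easy to unit-test without mocking the whole store.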
Code Review
This pull request implements a Text-to-Speech (TTS) service and store, enabling both Chat and Audio interface modes. The implementation includes model management, audio generation, file persistence, and playback controls. My feedback highlights that btoa and atob are not natively available in React Native and require polyfills or alternative base64 utilities, and suggests adding user feedback and logging when TTS generation fails due to unloaded models.
```typescript
    for (let i = 0; i < uint8.length; i++) {
      binary += String.fromCharCode(uint8[i]);
    }
    return btoa(binary);
  }

  private base64ToFloat32(base64: string): Float32Array {
    const binary = atob(base64);
```
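The review's btoa/atob concern can be addressed with Buffer-based helpers — a sketch assuming the `buffer` npm polyfill is available (React Native ships neither btoa/atob nor Buffer natively; in Node the module is built in):

```typescript
// Sketch: base64 conversion without btoa/atob, and without the per-character
// String.fromCharCode loop (which is also slow for large sample buffers).
import { Buffer } from "buffer";

export function float32ToBase64(samples: Float32Array): string {
  // View the Float32Array's underlying bytes directly.
  const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
  return Buffer.from(bytes).toString("base64");
}

export function base64ToFloat32(base64: string): Float32Array {
  const decoded = Buffer.from(base64, "base64");
  // Copy into a fresh buffer so the Float32Array view is correctly aligned
  // and independent of Buffer's internal pooled memory.
  const copy = new Uint8Array(decoded.length);
  copy.set(decoded);
  return new Float32Array(copy.buffer);
}
```

An alternative is a dedicated package such as base64-js; the key point is to avoid relying on browser globals in React Native.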
src/stores/ttsStore.ts (outdated)
```typescript
if (!settings.enabled || !isModelLoaded) {
  return;
}
```
The check `if (!settings.enabled || !isModelLoaded)` is correct, but it would be better to give the user feedback when they try to speak while the model is not loaded, rather than failing silently. Also log the failure to aid debugging — swallowed failures make issues harder to trace.
References
- When catching errors or handling failures, log them instead of swallowing them to ensure failures are visible and to aid in debugging.
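One way to apply this feedback is to pull the guard into a testable function that logs and notifies instead of returning silently. `logger` and `notifyUser` below are stand-ins for whatever logging and toast utilities the app actually uses — an assumption, not the real API:

```typescript
// Hypothetical guard sketch: surface why speak() was skipped instead of
// silently returning. Dependency injection keeps it unit-testable.
type SpeakGuardDeps = {
  enabled: boolean;
  isModelLoaded: boolean;
  logger: { warn: (msg: string) => void };
  notifyUser: (msg: string) => void;
};

export function canSpeak(deps: SpeakGuardDeps): boolean {
  if (!deps.enabled) {
    deps.logger.warn("TTS speak skipped: TTS is disabled in settings");
    return false;
  }
  if (!deps.isModelLoaded) {
    deps.logger.warn("TTS speak skipped: model not loaded");
    deps.notifyUser("Load the TTS model in Settings to enable speech.");
    return false;
  }
  return true;
}
```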
Codecov Report

❌ Your patch check has failed because the patch coverage (47.79%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #236      +/-   ##
==========================================
- Coverage   85.65%   83.66%    -2.00%
==========================================
  Files         217      224       +7
  Lines       10766    11289     +523
  Branches     2888     3023     +135
==========================================
+ Hits         9222     9445     +223
- Misses        870     1138     +268
- Partials      674      706      +32
```
…, TTSButton placement

Critical fixes for TTS Audio Mode:
- Add updateMessageAudio() to chatStore — writes audioPath, waveformData, audioDurationSeconds, and isGeneratingAudio back to the conversation message (without this, the waveform bubble spun forever after generation)
- Wire the auto-TTS trigger in useChatScreen via a useEffect on isStreamingForThisConversation: detects streaming → stopped, checks Audio Mode + model loaded, calls triggerAudioModeGeneration(), which sets isGeneratingAudio:true, fires generateAndSave, then writes the audio fields or clears the flag on error
- Fix isGenerating logic: show the spinner only when isGeneratingAudio===true, not for every assistant message missing audioPath (which made all old messages spin forever in Audio Mode)
- Fix TTSButton placement: add a metaExtra prop to ChatMessage/MessageMetaRow so TTSButton renders inline in the timestamp row rather than below the bubble

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a Voice row (volume icon + Chat/Audio/N/A badge) to the quick settings popover in the chat input. Tapping it:
- Toggles between Chat and Audio mode when models are downloaded
- Auto-loads/unloads the TTS model on switch
- Navigates to TTSSettings when models are not yet downloaded

This makes Audio Mode accessible without leaving the chat screen.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The ChatInput test mock for src/stores was missing useTTSStore, causing Popovers.tsx (which now uses useTTSStore) to throw on render. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. checkDownloadStatus() was never called on TTSSettingsScreen mount → the store always showed models as not downloaded after a fresh app start
2. speak() race condition: stop() during generation didn't prevent playback → set isSpeakingFlag=true before generate(), check it after, use finally
3. RNFS.stat() on a directory reports block size (~0), not total file size → replaced with a readDir() recursive sum of individual .pcm file sizes
4. Historical messages without audio showed a broken play button in Audio Mode → AudioMessageBubble is only rendered when msg.audioPath || msg.isGeneratingAudio

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
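The recursive size summation from fix 3 can be sketched as a pure function over a readDir-like API. The ReadDirItem shape loosely mirrors react-native-fs's readDir result; treat the exact field names as assumptions:

```typescript
// Sketch: sum file sizes by walking the directory tree, because stat() on a
// directory reports block size rather than the total content size.
type ReadDirItem = {
  path: string;
  size: number;
  isFile: () => boolean;
  isDirectory: () => boolean;
};
type ReadDir = (path: string) => Promise<ReadDirItem[]>;

export async function directorySize(readDir: ReadDir, path: string): Promise<number> {
  let total = 0;
  for (const entry of await readDir(path)) {
    if (entry.isFile()) {
      total += entry.size;
    } else if (entry.isDirectory()) {
      total += await directorySize(readDir, entry.path); // recurse into subdirs
    }
  }
  return total;
}
```

Injecting readDir keeps the function testable with a fake filesystem, which matches the test update described later in this PR.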
Replaced stat() mock with readDir() mocks matching the new recursive file-size summation approach. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nto feat/tts-implementation
Replaces slider controls with a [–] value [+] stepper row for precise numeric input in settings screens. Supports min/max/step, optional decimal formatting, and testID for E2E automation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removes @react-native-community/slider from GenerationSettingsModal, ModelSettingsScreen, and TTSSettingsScreen. Every numeric control (temperature, top-p, GPU layers, speed, etc.) now uses the stepper for touch-friendly precise adjustment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- MediaAttachment gains audioFormat and audioDurationSeconds fields
- audioRecorderService.stopRecording() now returns { path, durationSeconds } instead of just the path, enabling accurate audio bubble scrubbing
- ChatInput/Attachments.addAudioAttachment stores the duration
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…send

In Audio Mode, user voice recordings now appear as right-aligned audio bubbles instead of text messages, making both sides of the conversation audio-native.
- Voice.ts: adds a file-based transcription path (audioRecorderService + whisperService.transcribeFile) and an onAutoSend callback for atomic send with the audio attachment. Multimodal models skip transcription entirely.
- ChatInput: passes onAutoSend in Audio Mode; builds the MediaAttachment inline to avoid an async state-update race; uses attachmentsRef for sync reads.
- AudioMessageBubble: adds an isUser prop for the right-aligned, primary-tinted style.
- MessageRenderer: renders user audio attachments as AudioMessageBubble before the normal message path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The streaming-complete useEffect only listed isStreamingForThisConversation in its deps, so activeConversation was captured stale. When streaming ended, the last message was always the old value — TTS generation was never triggered. Fix: read conversation and last message directly from useChatStore.getState() inside the effect instead of relying on the closed-over activeConversation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
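The stale-closure failure above is the classic snapshot-vs-getState distinction. A minimal store sketch (mirroring the Zustand pattern, with hypothetical field names) shows why reading inside the effect must go through getState():

```typescript
// A callback created earlier keeps the snapshot it closed over, while
// getState() always returns the latest state at call time.
function createStore<T>(initial: T) {
  let state = initial;
  return {
    getState: () => state,
    setState: (next: T) => { state = next; },
  };
}

const chatStore = createStore({ lastMessage: "old reply" });

const snapshot = chatStore.getState();        // closed-over value (stale after updates)
const readFresh = () => chatStore.getState(); // reads at call time (always current)

chatStore.setState({ lastMessage: "new reply" });
// snapshot.lastMessage is still "old reply"; readFresh().lastMessage is "new reply"
```

In the effect, the fix reads `useChatStore.getState()` at the moment streaming ends instead of trusting the `activeConversation` captured when the effect was created.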
When no Whisper model is installed and the user taps the mic, show a CustomAlert offering to download Whisper Small (466 MB) immediately, rather than navigating away to VoiceSettings. UnavailableButton also now shows a download icon + percentage while the model is being fetched, so feedback is in-place. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a TEXT TO SPEECH section alongside IMAGE GENERATION and TEXT GENERATION in the chat settings modal. Shows mode toggle (chat/audio), enable switch, speed stepper, and auto-play toggle. Deep-links to TTSSettingsScreen for full configuration. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
WHISPER_MODELS grows from 5 to 10 entries covering English-only and Multilingual variants for tiny/base/small/medium, plus Large v3 Turbo and Large v3. whisperService.downloadFromUrl(url, modelId) downloads any ggml .bin file from an arbitrary URL — enables installing community models from HuggingFace. whisperStore exposes it as downloadFromUrl action. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites the voice settings screen with three sections:
- Active model card with inline download progress and a remove action
- Curated models grouped by English-only / Multilingual (all sizes, tiny → large-v3)
- Live HuggingFace search bar (500 ms debounce) that queries ASR repos; tap a repo to expand and browse its ggml .bin files; tap a file to confirm and download via downloadFromUrl

huggingFaceService gains searchWhisperRepos() and getWhisperFiles() to power the HF search without coupling to the LLM model browser.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
llmMessages builds an input_audio content block from audio attachments when the active model reports audio support, bypassing Whisper entirely. llm.ts exposes getMultimodalSupport() so the voice layer can detect this. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ttsStore: adds interfaceMode, speed, autoPlay, enabled settings; generateAndSave flow for Audio Mode; updateMessageAudio
- ttsService: OuteTTS generate+save path for AI audio bubbles
- TTSButton: play/stop per message with a generation spinner
- KokoroTTSManager + kokoroModels: scaffold for Tier 1 Kokoro TTS (not yet wired to react-native-executorch, marked not started)
- App.tsx: mounts KokoroTTSManager near the root
- packages: react-native-executorch, background-downloader, dr.pogodin/react-native-fs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ChatMessage: long-press action sheet gains a Speak option (delegates to ttsStore)
- ModelSettingsScreen: suppress a pre-existing exhaustive-deps lint warning
- Tests: update GenerationSettingsModal and ModelSettingsScreen tests for NumericStepper (gpu-layers-stepper-increment) replacing slider testIDs
- TTS_IMPLEMENTATION_PLAN: rewritten to reflect Audio Mode bidirectional voice conversation, the stale closure fix, and implementation status

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sages
Two bugs causing broken Audio Mode:
1. AudioRecorder was recording at the system default rate (~44.1 kHz), producing WAV that Whisper interprets as static ("TV static" / [SOUND]). Fix: pass a preset with sampleRate:16000, BitDepth.Bit16 so the file is Whisper-compatible 16 kHz mono int16 PCM from the start.
2. buildOAIMessages was always including audio attachments as input_audio content blocks, even for models that don't support audio input (e.g. remote Qwen 3.5 2B / Gemma 42B). Those models replied "I cannot hear audio". Fix: buildOAIMessages now accepts a supportsAudio flag (default false) and only emits input_audio parts when the model declares audio support. llm.ts passes multimodalSupport.audio when calling it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
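The supportsAudio gating from fix 2 can be sketched as a pure content builder. The part shapes follow the OpenAI-style chat schema the commit references; the exact field names are assumptions, not the repo's real types:

```typescript
// Sketch: only emit input_audio parts when the model declares audio support,
// so text-only models never see audio they can't interpret.
type Attachment = { type: "audio" | "image"; base64: string };
type ContentPart =
  | { type: "text"; text: string }
  | { type: "input_audio"; input_audio: { data: string; format: string } };

export function buildUserContent(
  text: string,
  attachments: Attachment[],
  supportsAudio = false,
): ContentPart[] {
  const parts: ContentPart[] = [{ type: "text", text }];
  for (const att of attachments) {
    if (att.type === "audio" && supportsAudio) {
      parts.push({
        type: "input_audio",
        input_audio: { data: att.base64, format: "wav" },
      });
    }
    // Audio attachments are dropped for models without audio support, so they
    // no longer reply "I cannot hear audio".
  }
  return parts;
}
```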
playFromFile was treating WAV bytes as raw Float32 PCM — a path designed for OuteTTS output only. WAV files have a 44-byte RIFF header plus int16 samples; reinterpreting them as Float32 produces pure static. Fix: use AudioContext.decodeAudioData(filePath), which properly parses the WAV header and decodes the samples. The file:// prefix is added if missing.

MessageRenderer now wraps user and assistant audio bubbles in a container View with paddingHorizontal:16 and marginVertical:8, matching the ChatMessage container layout so bubbles align correctly with the chat edges instead of touching the screen borders.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Audio type attachments were falling through to the FadeInImage branch, causing Image to try to load the WAV file path — resulting in a broken image placeholder that stretched the user bubble very wide (the 'super long' bubble issue). Audio attachments now render as a compact mic icon + 'Voice message' badge (matching the document badge style), keeping the bubble compact. In Audio Mode they never reach this code — they render as AudioMessageBubble. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add isAudioModeMessage to Message type and updateMessageAudio signature. Set flag in triggerAudioModeGeneration so mode switches don't reformat old text messages. MessageRenderer now checks msg.isAudioModeMessage instead of global ttsMode for assistant audio bubbles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 2: handlePlayPause calls speak() for AI bubbles (empty audioPath) instead of playMessage with an empty string. Remove the isGenerating spinner.
Bug 3: WaveformBars gets flex:1 + overflow:hidden, WAVEFORM_BARS 40→28, bubble overflow:hidden, maxWidth 80%→88%.
Bug 4: user bubble flips the play row order (speed+duration left, play right).
Bug 5: the voice cycling chip on AI bubbles reads/writes kokoroVoiceId.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix guard: it was checking isModelLoaded (OuteTTS, always false) instead of kokoroReady — so isAudioModeMessage was never stamped and all AI messages rendered as text in audio mode
- Add sentence-level streaming TTS: Kokoro now starts speaking each sentence as soon as the LLM finishes generating it, instead of waiting for the full response
- Fix invisible idle waveform: min bar height 3→6px, and an empty waveform now renders a sine-wave placeholder instead of nearly invisible flat bars

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add react-native-executorch mock to jest.setup.ts (voice configs + useTextToSpeech)
- Fix the TTS integration test: speak() now passes the callback as the 3rd arg
- Update VoiceRecordButton tests: tap-to-toggle, download prompt, no "Transcribing..." text
- Update VoiceSettingsScreen tests: new UI with English/Multilingual sections, Active badge
- Update DownloadManagerScreen tests: conditional active section, filter bar touchables
- Update messageContent test: stripControlTokens now trims output

157 suites, 5181 tests, all passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use @react-native-community/slider (already installed) instead of custom PanResponder-based seekbar. Native component handles drag natively at 60fps — no JS thread bottleneck. Removes ~60 lines of PanResponder/measure/layout tracking code. Added slider mock to jest.setup.ts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace animated WaveformBars (VU-meter, wave bounce, 3 animation modes, Animated.Value refs) with simple static bars. Progress is now shown entirely by the native Slider component. Remove RMS amplitude calculation from KokoroTTSManager onNext callback. ~80 lines of animation code removed. No more JS thread contention from per-chunk amplitude updates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…click play

- Transcript shows karaoke-style word highlighting based on playback progress — spoken words in full color, upcoming words muted
- Stop any TTS playback when the user starts recording (mic + speaker shouldn't overlap)
- Set isSpeaking + currentMessageId immediately before the 300ms Kokoro cleanup wait, so the UI shows a loading state right away when switching clips

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- KokoroTTSManager: 500ms cooldown after isSpeaking→false before applying a voice config change, giving the native ExecuTorch thread time to fully stop
- Transcript highlight: only the currently spoken word is highlighted (primary color + subtle background), not all spoken words
- Auto-scroll: ScrollView with maxHeight 120px, scrolls to keep the active word visible as playback progresses

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove word-level transcript highlighting — Kokoro doesn't provide word timestamps, so it was always off. Keep the transcript as plain text in a scrollable container (max 120px)
- Waveform bars now visually distinguish playing vs idle: playing bars are brighter (0.6–1.0 opacity), idle bars dimmer (0.25–0.6)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Waveform bars now tint as the playhead passes: played bars are bright, unplayed bars muted — like WhatsApp voice messages
- Progress is shown directly on the bars, with the Slider below for drag-to-seek interaction
- Increase the voice change cooldown to 1500ms to prevent a native crash

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Audio bubble uses a fixed width: 88% (not maxWidth) so it doesn't resize when the transcript opens
- Thinking block wrapper matches at width: 88% (was maxWidth: 85%)
- Both bubbles now render at exactly the same width

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Slider is now positioned on top of the waveform bars (centered vertically) instead of as a separate row below
- Slider track is transparent — the waveform bar coloring shows progress
- Slider thumb (dot) sits on top of the waveform at the current position
- Seekbar visible on both user and AI audio bubbles
- Removed the separate seekbar row — cleaner layout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thumb is transparent when progress=0 and not seeking. Only becomes visible (primary color) when audio is actively playing or user is dragging the slider. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Thumb always shows (primary color) so users know they can seek
- Expand seekOverlay to left/right -16px to compensate for the Android Slider's built-in ~16px internal padding — the thumb now aligns with the waveform bar highlighting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Play button + waveform in the top row (waveform takes the full remaining width)
- "Show transcript", duration, and speed chip in a single meta row below
- Matches the WhatsApp voice message layout: play + waveform on top, info below

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bars now distribute evenly across the entire container width instead of clustering together with fixed 2px gaps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Increase to 48 bars with 1.5px gaps — fills the full width, looks denser
- Bigger speed chip (more padding, larger border radius) — easier to tap
- Voice change cooldown now uses the actual stream end timestamp instead of the isSpeaking state — waits 2 seconds from when the native stream actually stopped, not from when the JS flag flipped
- Both user and AI bubbles use the same width: 88%

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
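The timestamp-based cooldown described above can be sketched as a pure check; the 2000 ms constant matches the commit, while the function and field names are illustrative assumptions:

```typescript
// Sketch: gate voice changes on the recorded stream-end timestamp rather than
// the JS isSpeaking flag, which can flip before the native stream has stopped.
const VOICE_CHANGE_COOLDOWN_MS = 2000;

export function canChangeVoice(streamEndedAt: number | null, now: number): boolean {
  if (streamEndedAt === null) return true; // nothing has played yet
  return now - streamEndedAt >= VOICE_CHANGE_COOLDOWN_MS;
}
```

Passing `now` explicitly (instead of calling Date.now() inside) keeps the check deterministic in tests.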
Waveform bars now span edge-to-edge across the entire bubble width. Play button sits in the meta row below alongside show transcript, duration, and speed chip. No more asymmetric padding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reverted play button to left of waveform (standard layout). Reduced playRow gap from SPACING.sm to SPACING.xs so waveform extends further right. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Voice switch: key-based remount of KokoroTTSManager avoids a native SIGSEGV when executorch re-initializes with a new voice config. The outer component manages the cooldown, the inner component holds the hook. Sets kokoroReady=false during the switch so the UI shows a loader.
- Seekbar progress: playMessage's finally block now checks ownership (currentMessageId === messageId) before clearing state, preventing it from clobbering an in-flight speak() call's isSpeaking/isAudioPlaying. Added a playSessionId counter + retry loop (up to 10x 200ms) when executorch reports "model is currently generating" (code 104).
- Seekbar smoothness: timer interval 500ms→50ms, fractional seconds instead of Math.floor for continuous waveform bar progress.
- Transcript layout: split TranscriptSection into TranscriptToggle (stays in the metaRow with time/speed) and TranscriptContent (renders below), preventing text from squeezing against the duration/speed chip.
- Chat scroll: FlatList hidden (opacity:0) during initial layout, revealed after the first scrollToEnd settles. Mode switch (chat↔audio) resets scroll via extraData + scrollToEnd.
- Voice loader UI: track kokoroActiveVoiceId in the store, derive isChangingVoice in UI components from the settings vs active mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
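The retry loop mentioned for the "model is currently generating" case can be sketched generically. Note this simplified version retries on any failure, whereas the real fix presumably checks the specific error code (104); `speakFn` stands in for the actual executorch call:

```typescript
// Sketch: retry a speak call up to maxAttempts times with a short pause,
// for transient "engine still busy" failures.
export async function speakWithRetry(
  speakFn: () => Promise<void>,
  maxAttempts = 10,
  delayMs = 200,
): Promise<void> {
  for (let attempt = 1; ; attempt++) {
    try {
      await speakFn();
      return; // success
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up after the last attempt
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```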
…nto feat/tts-implementation
…tional Kokoro

- Audio mode now renders tool-call messages via ChatMessage (proper bubble + tool call UI) instead of dropping them as raw unstyled text. Plain assistant messages still render as AudioMessageBubble.
- Transcript ScrollView uses react-native-gesture-handler for reliable nested scrolling inside FlatList on Android. Moved the transcript outside the TouchableOpacity wrapper so it can capture scroll gestures.
- Action menu (long-press + 3-dot) added to both user and assistant audio bubbles: Copy + Resend for user, Copy + Regenerate for assistant.
- Kokoro TTS only loads in audio interface mode (App.tsx), saving RAM in chat mode.
- Post-stream ownership transfer: when all text was spoken by streaming chunks, transfers currentMessageId from 'streaming' to the real message ID so the AudioMessageBubble seekbar works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When retrying a message while TTS is speaking, the audio bubble disappears but Kokoro continues playing natively. Now calls ttsStore.stop() before deleting messages in the retry handler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Conditional mounting (audio mode only) caused Kokoro to not be ready during streaming — it takes ~10s to initialize, but fast models finish streaming before that. Streaming TTS chunks silently skipped because kokoroReady was false. Reverting to always-mounted so Kokoro is warm when streaming starts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Streaming TTS chunks couldn't keep up with fast cloud models — Kokoro speaks slower than tokens arrive, causing a growing backlog of unspoken chunks, word skipping at transitions, and unpredictable playback. Replaced with a simpler approach: text streams normally as a ChatMessage, then when streaming ends the full response is spoken as a single TTS call with the real message ID. Clean, predictable, no word skipping. Also includes: stop in-flight TTS when new streaming begins, TTS stop on retry/resend, and text offset fix for post-stream remaining calc. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…aming ends" This reverts commit 6861c30.
Summary
Complete TTS audio mode implementation with Kokoro text-to-speech integration:
Test plan