This file breaks the project into implementable work packages for a Rust-based Ubuntu desktop app that:
- captures system audio output,
- transcribes it in real time with ElevenLabs Scribe Realtime,
- uses OpenAI for commentary and question answering,
- stores sessions locally,
- exposes a small desktop UI.
The goal is to provide a practical build order for Codex or another coding agent.
Goal: establish workspace, standards, and baseline project shape.
- Create Rust workspace.
- Add crates:
  - `audio_capture`
  - `audio_pipeline`
  - `stt_scribe`
  - `transcript_core`
  - `context_engine`
  - `policy_engine`
  - `llm_openai`
  - `storage_sqlite`
  - `app_backend`
  - `app_ui`
  - `ipc_schema`
- Configure linting:
  - `rustfmt`
  - `clippy`
- Add logging and config dependencies.
- Add root README.
- Add `.env.example`.
- Add `config.example.toml`.
- Add license and contribution notes if desired.
- Compiling empty workspace
- Shared config loading
- Shared error type strategy
- CI check for build + clippy + fmt
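The shared error type strategy could start as a single workspace-wide enum with one variant per subsystem. The sketch below is an assumption about shape, not a fixed API; a real workspace would likely derive the impls with `thiserror` instead of hand-writing them.

```rust
use std::fmt;

/// Hypothetical shared error type covering the planned subsystems.
#[derive(Debug)]
pub enum AppError {
    Config(String),
    Audio(String),
    Stt(String),
    Llm(String),
    Storage(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Config(m) => write!(f, "config error: {m}"),
            AppError::Audio(m) => write!(f, "audio error: {m}"),
            AppError::Stt(m) => write!(f, "STT provider error: {m}"),
            AppError::Llm(m) => write!(f, "LLM provider error: {m}"),
            AppError::Storage(m) => write!(f, "storage error: {m}"),
        }
    }
}

impl std::error::Error for AppError {}
```

Keeping one enum in a shared crate lets every other crate return `Result<T, AppError>` without circular dependencies.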
Goal: reliably capture Ubuntu system audio output from the default sink monitor.
- Define `AudioSource` trait.
- Define types:
  - `AudioFrame`
  - `AudioFormat`
  - `CaptureDevice`
  - `CaptureEvent`
- Add unit tests for type conversions and timestamp handling.
- Implement backend discovery for default sink.
- Resolve matching monitor source.
- Start capture stream.
- Emit raw frames with timestamps.
- Emit device-change events.
- Detect default output device changes.
- Rebind capture stream without app restart.
- Handle disconnect/reconnect cases.
- CLI or debug log showing:
- current sink,
- current monitor source,
- frame rate,
- error states.
- Running backend that captures system output
- Logs confirm active device and frame flow
- Survives switching headphones/speakers
- Can capture speech from browser or media player without app-specific integration
- Capture continues or reconnects after output device switch
- No crash when sink disappears temporarily
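The capture abstraction above could look roughly like this. Field and method names are illustrative assumptions, not a fixed API; the point is that device changes and disconnects arrive through the same typed event stream as frames, so the rebind logic has one place to watch.

```rust
/// Hypothetical capture types for the `audio_capture` crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct AudioFormat {
    pub sample_rate_hz: u32,
    pub channels: u16,
}

#[derive(Debug, Clone)]
pub struct AudioFrame {
    pub format: AudioFormat,
    pub samples: Vec<f32>, // interleaved
    pub timestamp_us: u64, // capture-relative timestamp
}

#[derive(Debug, Clone)]
pub enum CaptureEvent {
    Frame(AudioFrame),
    DeviceChanged { sink_name: String },
    Disconnected,
}

pub trait AudioSource {
    /// Begin capturing from the default sink monitor.
    fn start(&mut self) -> Result<(), String>;
    fn stop(&mut self);
    /// Pull the next event, if any.
    fn poll_event(&mut self) -> Option<CaptureEvent>;
}
```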
Goal: transform raw captured audio into STT-friendly chunks.
- Downmix stereo to mono.
- Resample to 16 kHz.
- Standardize sample format.
- Implement rolling frame buffer.
- Group frames into chunk packets.
- Preserve start/end timestamps.
- Add simple energy gate first.
- Add pluggable VAD interface for future improvement.
- Suppress long silence segments.
- Feed prerecorded PCM/WAV fixtures.
- Verify chunk timing and ordering.
- Verify no duplicate chunk emission.
- `audio_pipeline` crate producing `AudioChunk`
- Test fixtures and pipeline tests
- Chunks are correctly timestamped
- Silence suppression works
- Pipeline latency remains low enough for realtime use
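Two of the pipeline steps above are small enough to sketch directly: stereo-to-mono downmix and the first-pass energy gate. The threshold value is an assumption and needs tuning against the PCM/WAV fixtures; resampling to 16 kHz is omitted because it needs a proper filter.

```rust
/// Downmix interleaved stereo f32 samples to mono by averaging channels.
pub fn downmix_to_mono(stereo: &[f32]) -> Vec<f32> {
    stereo
        .chunks_exact(2)
        .map(|pair| (pair[0] + pair[1]) / 2.0)
        .collect()
}

/// Simple energy gate: a chunk passes if its RMS exceeds the threshold.
/// The threshold is illustrative; real values need tuning on fixtures.
pub fn passes_energy_gate(samples: &[f32], rms_threshold: f32) -> bool {
    if samples.is_empty() {
        return false;
    }
    let energy: f32 = samples.iter().map(|s| s * s).sum();
    (energy / samples.len() as f32).sqrt() > rms_threshold
}
```

Keeping the gate behind the pluggable VAD interface means it can later be swapped for a real voice-activity detector without touching the chunking code.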
Goal: stream chunks to ElevenLabs Scribe Realtime and receive transcript events.
- Define `Transcriber` trait.
- Define:
  - `PartialTranscript`
  - `FinalTranscript`
  - `TranscriberEvent`
  - `TranscriberHealth`
- Implement authenticated connection.
- Send realtime audio chunks.
- Receive provider events.
- Parse partial and final transcript payloads.
- Add reconnect logic.
- Add heartbeat/health status if useful.
- Backoff on transient errors.
- Surface fatal auth/config errors clearly.
- Add mock or replay STT adapter for development.
- Feed canned transcript events without provider dependency.
- Working `stt_scribe` crate
- Mock adapter for tests/dev
- Provider error mapping
- Partial transcript events appear within a reasonable delay
- Final transcript segments are received and parsed
- Temporary disconnects do not require full app restart
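The trait plus mock-adapter pair above could be sketched as follows. Names are assumptions; the real crate would implement the same trait over the Scribe Realtime connection, while the mock simply replays canned events so downstream crates can be tested without credentials.

```rust
use std::collections::VecDeque;

/// Hypothetical transcriber event and abstraction for `stt_scribe`.
#[derive(Debug, Clone, PartialEq)]
pub enum TranscriberEvent {
    Partial { text: String },
    Final { text: String, start_ms: u64, end_ms: u64 },
}

pub trait Transcriber {
    fn send_audio(&mut self, pcm16k_mono: &[f32]);
    fn poll_event(&mut self) -> Option<TranscriberEvent>;
}

/// Mock adapter that replays canned events for tests and development.
pub struct MockTranscriber {
    queued: VecDeque<TranscriberEvent>,
}

impl MockTranscriber {
    pub fn new(events: Vec<TranscriberEvent>) -> Self {
        Self { queued: events.into() }
    }
}

impl Transcriber for MockTranscriber {
    fn send_audio(&mut self, _pcm: &[f32]) {}
    fn poll_event(&mut self) -> Option<TranscriberEvent> {
        self.queued.pop_front()
    }
}
```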
Goal: create stable transcript state from partial/final STT events.
- Show partial transcript separately from committed transcript.
- Commit only final transcript segments to stable history.
- Clear/update partial state appropriately.
- Add transcript segment IDs.
- Store timestamps, source, session ID.
- Support retrieval of recent transcript windows.
- Merge adjacent small final segments when appropriate.
- Avoid duplicate final lines.
- Preserve original ordering.
- Add helper APIs:
  - `last_n_seconds`
  - `last_n_segments`
  - `last_question_candidate`
- `transcript_core` crate
- Stable transcript timeline
- Query helpers for recent context
- Transcript display does not flicker excessively
- Final transcript is coherent and not duplicated
- Recent windows can be assembled reliably
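A minimal sketch of the committed-transcript state with duplicate suppression and the `last_n_seconds` window query. Segment fields and the exact-text dedup rule are assumptions; real merge logic would also handle near-duplicates and adjacent small segments.

```rust
/// Illustrative committed-transcript segment for `transcript_core`.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub id: u64,
    pub text: String,
    pub start_ms: u64,
    pub end_ms: u64,
}

#[derive(Default)]
pub struct TranscriptTimeline {
    segments: Vec<Segment>,
}

impl TranscriptTimeline {
    /// Commit a final segment, skipping exact duplicates of the last entry.
    pub fn commit(&mut self, seg: Segment) {
        if self.segments.last().map(|s| s.text == seg.text) == Some(true) {
            return;
        }
        self.segments.push(seg);
    }

    /// Segments whose end time falls within the last `n` seconds of the
    /// newest committed segment.
    pub fn last_n_seconds(&self, n: u64) -> Vec<&Segment> {
        let Some(latest) = self.segments.last().map(|s| s.end_ms) else {
            return Vec::new();
        };
        let cutoff = latest.saturating_sub(n * 1000);
        self.segments.iter().filter(|s| s.end_ms >= cutoff).collect()
    }
}
```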
Goal: persist sessions, transcript segments, assistant events, and settings.
- Create schema migrations.
- Tables:
  - `sessions`
  - `transcript_segments`
  - `assistant_events`
  - `settings`
  - `audit_events`
- Insert session start/stop.
- Insert committed transcript segments.
- Insert assistant outputs.
- Read recent sessions and transcript history.
- Add configurable retention policy.
- Add delete session operation.
- Add purge old sessions operation.
- `storage_sqlite` crate
- Migrations
- Repository tests
- New sessions and transcript segments are persisted
- Data can be queried for recent session playback/export
- Purge operations do not corrupt database
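A first migration covering the five tables could look like the constant below. Column names and types are illustrative assumptions to be refined when the repository layer is written; embedding migrations as numbered string constants keeps them versionable alongside the code.

```rust
/// Illustrative first migration for `storage_sqlite`.
pub const MIGRATION_0001: &str = "
CREATE TABLE sessions (
    id INTEGER PRIMARY KEY,
    started_at TEXT NOT NULL,
    ended_at TEXT
);
CREATE TABLE transcript_segments (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    start_ms INTEGER NOT NULL,
    end_ms INTEGER NOT NULL,
    text TEXT NOT NULL
);
CREATE TABLE assistant_events (
    id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES sessions(id),
    kind TEXT NOT NULL, -- answer | commentary | summary
    content TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE settings (key TEXT PRIMARY KEY, value TEXT NOT NULL);
CREATE TABLE audit_events (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,
    created_at TEXT NOT NULL
);
";
```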
Goal: generate answers, commentary, and summaries from recent transcript windows.
- Implement OpenAI client wrapper.
- Support configuration via API key and model name.
- Define request/response types.
- Define internal schema for:
- answer response
- commentary response
- summary response
- Parse and validate model output.
- Implement prompts for:
  - `answer_question`
  - `commentary`
  - `summarise_recent`
- Keep prompts concise and deterministic.
- Handle timeouts, auth failures, malformed output.
- Return user-safe fallback errors.
- `llm_openai` crate
- Prompt builders
- Structured output parsing
- App can answer a manually selected recent question
- App can summarise recent transcript window
- Failure to reach OpenAI does not break transcription flow
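"Concise and deterministic" prompts can be plain builder functions that take only the transcript window as input, so identical context always yields an identical prompt. The wording below is an assumption for the `summarise_recent` action.

```rust
/// Illustrative prompt builder for the `summarise_recent` action.
pub fn summarise_recent_prompt(transcript_window: &str) -> String {
    format!(
        "Summarise the following meeting transcript in at most 3 bullet \
         points. Only use information present in the transcript.\n\n\
         Transcript:\n{transcript_window}"
    )
}
```

Keeping builders like this in `llm_openai` (and out of UI code) makes prompt changes reviewable in one place and easy to snapshot-test.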
Goal: decide when the assistant should respond.
- Build helper functions to extract:
- recent transcript window,
- last likely question,
- recent assistant outputs.
- Implement:
- answer last question
- summarise last minute
- comment on current topic
- Implement simple heuristics first:
- question marks from transcript,
- interrogative openings,
- phrases like “can someone”, “how do”, “what is”.
- Keep heuristic module replaceable.
- Throttle assistant outputs.
- Suppress repeated commentary.
- Respect privacy pause and cloud pause.
- `context_engine` crate
- `policy_engine` crate
- Manual and automatic trigger logic
- Manual actions work reliably
- Auto-answer can be enabled without excessive spam
- Policy state is inspectable in logs/debug view
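The first-pass heuristics above (question marks, interrogative openings, trigger phrases) reduce to a small replaceable function. The opener list is illustrative and expected to be swapped for a better detector later.

```rust
/// First-pass question heuristic for the `policy_engine` crate.
pub fn looks_like_question(utterance: &str) -> bool {
    let t = utterance.trim().to_lowercase();
    if t.ends_with('?') {
        return true;
    }
    // Illustrative interrogative openings and trigger phrases.
    const OPENERS: [&str; 8] = [
        "can someone", "how do", "what is", "why does",
        "who can", "where is", "does anyone", "is there",
    ];
    OPENERS.iter().any(|o| t.starts_with(o))
}
```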
Goal: compose all backend services into one working runtime.
- Define shared event types in `ipc_schema`.
- Wire modules together via channels.
- Ensure backpressure is handled.
- Start session on capture begin.
- Stop session on user action/app shutdown.
- Persist audit events.
- Load config from file and env.
- Validate required provider credentials.
- Expose runtime mode settings.
- Add structured logs.
- Log provider states, policy decisions, session lifecycle.
- `app_backend` crate
- Single running process with all services connected
- End-to-end backend works without UI
- Logs clearly show capture -> STT -> transcript -> LLM flow
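Channel wiring with backpressure can be sketched with std channels: each stage owns a receiver and forwards typed events downstream, and bounded (`sync_channel`) senders block producers when a consumer falls behind. The stage body below is a stand-in, not the real transcript logic.

```rust
use std::sync::mpsc;

/// Illustrative typed event for the backend event bus.
#[derive(Debug, PartialEq)]
pub enum BackendEvent {
    TranscriptFinal(String),
    AssistantOutput(String),
}

/// Stand-in for one pipeline stage: consume input, forward typed events.
pub fn run_stage(
    input: mpsc::Receiver<String>,
    output: mpsc::SyncSender<BackendEvent>,
) {
    for line in input {
        // A real stage would dedupe/merge here before forwarding.
        if output.send(BackendEvent::TranscriptFinal(line)).is_err() {
            break; // downstream gone; shut down cleanly
        }
    }
}
```

In the real backend each crate's task would run on its own thread or async task, with the same pattern: typed events in, typed events out, shutdown on channel close.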
Goal: provide a small usable desktop interface.
- Create compact UI with:
- status indicator,
- current sink,
- live transcript pane,
- latest assistant output pane.
- Buttons:
- Start/Stop
- Pause cloud
- Answer last question
- Summarise last minute
- Show non-blocking error messages for:
- missing API keys,
- provider disconnects,
- paused capture.
- Add tray icon if framework supports it well.
- Add quick actions from tray menu.
- `app_ui` crate
- Working desktop app attached to backend
- User can see live transcript
- User can trigger answer/summary actions
- User can clearly tell when cloud processing is active
Goal: improve usability without changing core architecture.
- Add global hotkeys:
- answer last question
- summarise recent audio
- pause/resume
- Add copy transcript action.
- Add export session action.
- Add session list/history view.
- Hotkey support
- Transcript export to markdown/json
- Frequent actions can be performed without opening full UI
- Transcript history is reusable
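Markdown export can be a pure function over committed segments, which keeps it trivially testable. The layout (a timestamp prefix per line) is an assumption, not a fixed format.

```rust
/// Sketch of transcript export to markdown for the export action.
/// `lines` pairs a segment start time (ms) with its text.
pub fn export_markdown(session_name: &str, lines: &[(u64, String)]) -> String {
    let mut out = format!("# {session_name}\n\n");
    for (ms, text) in lines {
        let secs = ms / 1000;
        out.push_str(&format!("- [{:02}:{:02}] {}\n", secs / 60, secs % 60, text));
    }
    out
}
```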
Goal: make the app trustworthy to use.
- Add obvious capture indicator.
- Add explicit cloud-processing indicator.
- Add privacy pause mode.
- Add setting to disable automatic answering.
- Add setting to disable storage or shorten retention.
- Add warning text in onboarding/settings.
- Privacy controls in UI and config
- Audit logging for start/stop/pause
- User can stop or pause capture immediately
- User can tell whether data is being sent to providers
- Defaults are conservative
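Making pause behaviour first-class could mean a single privacy-state struct that every cloud call consults before sending anything. Field names are assumptions; the key property is that automatic answering defaults to off and is additionally gated on cloud access.

```rust
/// Sketch of a privacy gate consulted before any provider call.
#[derive(Default)]
pub struct PrivacyState {
    pub capture_paused: bool,
    pub cloud_paused: bool,
    pub auto_answer_enabled: bool, // defaults to false
}

impl PrivacyState {
    /// May audio or text leave the machine for STT/LLM providers right now?
    pub fn cloud_allowed(&self) -> bool {
        !self.capture_paused && !self.cloud_paused
    }

    /// May the assistant answer without an explicit user action?
    pub fn auto_answer_allowed(&self) -> bool {
        self.cloud_allowed() && self.auto_answer_enabled
    }
}
```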
Goal: let the assistant speak replies aloud.
- Create `tts_openai` crate.
- Add text-to-speech adapter.
- Add playback controls and mute.
- Add setting for speak-on-answer.
- Ensure TTS audio is not recursively re-captured where possible.
- Spoken assistant output
- Mute and off-by-default policy
- Spoken reply works when enabled
- TTS can be muted instantly
- TTS does not create runaway feedback loops
- Unit tests for:
- resampling helpers,
- transcript merging,
- policy rules,
- schema validation.
- Integration tests for:
- mock audio -> mock STT -> transcript,
- transcript -> OpenAI response parsing.
- Manual end-to-end test scripts.
- Define shared error categories:
- config,
- audio,
- STT provider,
- LLM provider,
- storage,
- UI.
- Ensure user-facing messages are clean and non-technical.
- Add root README.
- Add setup guide:
- Ubuntu packages needed,
- API key setup,
- first run instructions.
- Add troubleshooting guide.
- Build and test on Ubuntu.
- Run fmt/clippy/tests.
- Optionally build `.deb` artifact in CI.
- A1 define capture abstractions
- A2 implement PulseAudio-compatible backend
- A3 device switch handling
- A4 capture diagnostics
- B1 normalization
- B2 chunking
- B3 silence suppression/VAD
- B4 pipeline tests
- C1 transcriber trait
- C2 ElevenLabs realtime adapter
- C3 reconnect logic
- C4 mock adapter
- D1 partial/final state
- D2 segment model
- D3 merge logic
- D4 rolling context API
- E1 schema
- E2 repositories
- E3 retention/purge
- E4 export
- F1 OpenAI adapter
- F2 structured outputs
- F3 prompts
- F4 fallback handling
- G1 manual actions
- G2 question detection
- G3 rate limiting
- G4 assisted mode
- H1 base window
- H2 controls
- H3 tray
- H4 settings page
- I1 indicators
- I2 pause modes
- I3 retention controls
- I4 onboarding copy
- J1 systemd user service
- J2 desktop entry
- J3 `.deb` packaging
- J4 installation docs
- Repository scaffolding
- Audio capture
- Audio pipeline
- STT integration
- Transcript core
- Minimal terminal/debug UI
- SQLite persistence
- OpenAI manual actions
- Desktop UI
- Policy engine
- Privacy controls
- Optional TTS
- Packaging
This order minimizes wasted effort and gets to a testable transcript flow early.
The MVP is complete when all of the following are true:
- The app captures system audio from Ubuntu output independently of any single app.
- The app shows a live transcript from ElevenLabs Scribe Realtime.
- The app stores transcript sessions locally in SQLite.
- The user can press a button or hotkey to answer the last detected question using OpenAI.
- The user can request a summary of recent audio using OpenAI.
- The UI shows clear capture/cloud/privacy state.
- The app can be installed and run on Ubuntu without manual developer-only steps.
- native PipeWire backend
- local/offline STT backend
- semantic transcript search
- richer meeting mode
- configurable prompt profiles
- speaker attribution improvements
- redaction support
- session replay
- plugin architecture
Implementation guidance:
- Prefer small focused crates.
- Keep provider integrations behind traits.
- Prefer typed events over ad hoc string messages.
- Keep prompt code isolated from UI code.
- Make privacy and pause behaviour first-class, not bolted on.
- Add mock backends early so core logic can be tested without live provider calls.
- Optimize for correctness and clarity before latency micro-optimizations.
The most important first milestone is:
system audio capture -> realtime transcript visible on screen
Everything else builds on that.