AI Synth — Audio Synthesis Foundation
- **Area:** Intelligence
- **Path:** `services/ai/synth`
- **Kind:** Audio synthesis foundation (TTS + cloning + music + SFX)
- **Status:** v0.1.0, foundation landed 2026-05-24. HTTP daemon `koder-synth` + CLI `ksynth` + voice registry (5 builtins seeded) + 4 routes (`tts`/`music`/`sfx`/`clone`). Default provider is the deterministic **stub** (silent WAV generator) so the API contract is exercisable without GPU dependencies; the Piper adapter is a typed stub awaiting synth#004. The consent flow validates token shape; real token validation against the `id/engine` consent service lands in **synth#019** (new follow-up).
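To illustrate what a deterministic silent-WAV stub can look like, here is a minimal Python sketch. It is not the actual `koder-synth` implementation: the function names `silent_wav` and `stub_tts`, the 22050 Hz sample rate, and the 15 characters-per-second pacing are all assumptions made for the example.

```python
import io
import wave


def silent_wav(duration_s: float, sample_rate: int = 22050) -> bytes:
    """Return a mono 16-bit PCM WAV that contains only silence.

    The output depends only on the arguments, so identical requests
    yield byte-identical audio, which is convenient for contract tests.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    n_frames = int(round(duration_s * sample_rate))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)  # all-zero samples are silence
    return buf.getvalue()


def stub_tts(text: str, chars_per_second: float = 15.0) -> bytes:
    """Hypothetical stub provider: silence whose length scales with the text."""
    return silent_wav(len(text) / chars_per_second)
```

Because the stub needs only the standard library, a client can exercise request and response handling end to end before any GPU-backed engine is wired in.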
Role in the stack
synth is the symmetric pair of services/ai/voice (STT). Without it, Talk Mode in products/horizontal/talk is a half-loop: the user is heard, but the answer comes back as text. Narration, audio branding, accessibility (screen reading for visually impaired users), in-product tutorials, and agent loops with audio responses are all blocked.
It is the Koder analog of ElevenLabs (TTS + cloning), Suno/Udio (music), and Stability Audio (SFX). It is self-hosted via Coqui XTTS / Piper / AudioCraft on the GPU runtime, with proxy fallback to ElevenLabs/Suno through services/ai/gateway when local quality is insufficient or capability gaps remain.
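The local-first routing with proxy fallback described above could be sketched as an ordered provider chain. This is an illustrative sketch only: the `Provider` type, the `route` function, and the convention that a provider returns `None` when it cannot deliver acceptable audio are assumptions, not the gateway's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Provider:
    name: str                # illustrative label, e.g. "local" or "gateway"
    capabilities: frozenset  # subset of {"tts", "clone", "music", "sfx"}
    # (capability, prompt) -> audio bytes, or None if the result is unusable
    synthesize: Callable[[str, str], Optional[bytes]]


def route(capability: str, prompt: str,
          providers: Sequence[Provider]) -> Tuple[str, bytes]:
    """Try providers in order and return the first usable result."""
    for p in providers:
        if capability not in p.capabilities:
            continue  # capability gap: fall through to the next provider
        audio = p.synthesize(capability, prompt)
        if audio is not None:
            return p.name, audio
    raise LookupError(f"no provider could serve {capability!r}")


# Usage: the local provider is listed first, the proxied one last.
local = Provider("local", frozenset({"tts"}), lambda cap, prompt: b"local-audio")
proxy = Provider("gateway", frozenset({"tts", "music"}),
                 lambda cap, prompt: b"proxied-audio")
```

Ordering the list local-first keeps the self-hosted path as the default and makes the proxy an explicit last resort.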
Boundary vs neighbors
- `services/ai/voice` is the STT (input) sibling. A future RFC may unify both under `audio` with `audio.stt` / `audio.tts` namespaces.
- `services/ai/video` may reuse `synth` for audio-track generation in v2.
- Audio editing/mastering and live streaming are explicitly out of scope.
Features (v1 target)
- TTS: Piper (CPU baseline, fast) + Coqui XTTS (GPU, multilingual + cloning capable)
- Voice cloning: Coqui XTTS with explicit consent capture flow
- Music: AudioCraft MusicGen up to 30s
- SFX: AudioCraft AudioGen up to 10s
- Inaudible watermark on every output (deepfake mitigation)
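The duration caps and the consent requirement in the list above lend themselves to a simple request check. The sketch below is hypothetical: `validate_request`, its parameters, and the choice of exceptions are assumptions for illustration, and it only checks that a consent token is present, not that it is valid.

```python
from typing import Optional

# Caps taken from the feature list: music up to 30 s, SFX up to 10 s.
MAX_DURATION_S = {"music": 30.0, "sfx": 10.0}
KINDS = {"tts", "clone", "music", "sfx"}


def validate_request(kind: str,
                     duration_s: Optional[float] = None,
                     consent_token: Optional[str] = None) -> None:
    """Raise if the request violates a v1 limit; return None otherwise."""
    if kind not in KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    limit = MAX_DURATION_S.get(kind)
    if limit is not None:
        # Generative kinds need an explicit duration within the cap.
        if duration_s is None or not (0 < duration_s <= limit):
            raise ValueError(f"{kind} duration must be in (0, {limit}] seconds")
    if kind == "clone" and not consent_token:
        # Presence check only; real validation is out of scope for this sketch.
        raise PermissionError("voice cloning requires a consent token")
```

Rejecting out-of-range requests before they reach an engine avoids spending GPU time on work that would be discarded.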
Primary couplings
| Consumer | Relationship |
|---|---|
| `services/ai/kode` | Spoken responses for Talk Mode round-trip |
| `services/ai/agents` | TTS as agent tool (notify, narrate) |
| `products/horizontal/talk` | Bidirectional voice loop unblock |
| `products/dev/eye` | Spoken descriptions for accessibility |
| `services/ai/voice` | Symmetric STT pair |
| `services/ai/gateway` | Provider routing for ElevenLabs/Suno |
| `services/ai/runtime` | Local Piper/Coqui/AudioCraft serving |
| `services/ai/cache` | Caches synthesized audio by content hash |
| `services/ai/billing` | Per |
| `infra/data/kdb-blob` | Stores generated audio assets |
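One way to read "caches synthesized audio by content hash" is that the cache key is a hash over everything that determines the output. The sketch below assumes that reading; the `cache_key` function and its fields (`kind`, `text`, `voice`, `params`) are hypothetical and not taken from `services/ai/cache`.

```python
import hashlib
import json


def cache_key(kind: str, text: str, voice: str, params: dict) -> str:
    """Derive a stable hex key from the inputs that affect the audio."""
    payload = json.dumps(
        {"kind": kind, "text": text, "voice": voice, "params": params},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # fixed separators for canonical output
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Usage: the same logical request maps to the same key.
key = cache_key("tts", "Hello", "builtin-1", {"speed": 1.0, "lang": "en"})
```

Serializing with sorted keys makes the key independent of dictionary ordering, so equivalent requests hit the same cache entry.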
RFC and bootstrap
- RFC: `synth-RFC-001-foundations.kmd` (**accepted** 2026-05-09)
- Bootstrap ticket: `services/ai/backlog/done/119-synth-bootstrap.md`
- Implementation tickets: `services/ai/synth/backlog/pending/{001..005}`
Self-hosted-first analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | Skeleton phase; Piper + Coqui cover TTS + cloning self-hosted, music/SFX via AudioCraft |
| G2 Performance | pending | Target Piper TTS p50 < 200ms / 100 chars; Coqui p50 < 800ms / 100 chars |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | TTS + cloning + music<=30s + SFX; long music out of scope |
| G5 Critical-path readiness | pending | Pre |
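The G2 targets can be checked against measured latencies once benchmarks exist. The sketch below assumes that "p50 < 200ms / 100 chars" means the median latency, scaled linearly to 100 input characters, must stay under the target; that interpretation, the helper names, and the sample format are assumptions rather than the project's benchmark method.

```python
from statistics import median

# G2 targets from the table, in milliseconds per 100 characters.
TARGETS_MS = {"piper": 200.0, "coqui": 800.0}


def p50_ms_per_100_chars(samples):
    """samples: iterable of (latency_ms, n_chars) pairs.

    Returns the median latency after scaling each sample to 100 characters.
    """
    normalized = [
        latency_ms * 100.0 / n_chars
        for latency_ms, n_chars in samples
        if n_chars > 0  # skip empty inputs to avoid division by zero
    ]
    if not normalized:
        raise ValueError("no usable samples")
    return median(normalized)


def meets_g2(engine: str, samples) -> bool:
    """True if the normalized p50 is strictly below the engine's target."""
    return p50_ms_per_100_chars(samples) < TARGETS_MS[engine]
```

Linear scaling is a simplification; engines with fixed startup cost will look slower on short inputs than this normalization suggests.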