AI Synth — Audio Synthesis Foundation

*Area:* Intelligence
*Path:* services/ai/synth
*Kind:* Audio synthesis foundation (TTS + cloning + music + SFX)
*Status:* v0.1.0, foundation landed 2026-05-24. It ships the HTTP daemon koder-synth, the CLI ksynth, a voice registry (5 built-ins seeded), and 4 routes (tts/music/sfx/clone). The default provider is the deterministic *stub* (a silent-WAV generator), so the API contract is exercisable without GPU dependencies; the Piper adapter is a typed stub awaiting synth#004. The consent flow currently validates token shape only; real token validation against the id/engine consent service lands in *synth#019* (new follow-up).
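
To illustrate what a deterministic silent-WAV stub provider might look like, here is a minimal sketch using Python's standard `wave` module. The function names, the 22050 Hz sample rate, and the 15-characters-per-second duration heuristic are assumptions made for this example; they are not taken from koder-synth.

```python
import io
import wave


def silent_wav(duration_s: float, sample_rate: int = 22050) -> bytes:
    """Return a mono, 16-bit PCM WAV file that contains only silence."""
    n_frames = int(round(duration_s * sample_rate))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)  # zero samples = silence
    return buf.getvalue()


def stub_tts(text: str, chars_per_second: float = 15.0) -> bytes:
    """Hypothetical stub: output length depends only on the input length."""
    duration_s = max(len(text), 1) / chars_per_second
    return silent_wav(duration_s)
```

Because the output depends only on the length of the input, the same request always yields byte-identical audio, which is what makes a stub like this useful for contract tests.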

Role in the stack

synth is the symmetric pair of services/ai/voice (STT). Without it, Talk Mode in products/horizontal/talk is a half-loop: the user is heard, but the answer comes back as text. Narration, audio branding, accessibility (screen reading for visually impaired users), in-product tutorials, and agent loops with audio responses are all blocked.

It is the Koder analog of ElevenLabs (TTS + cloning), Suno/Udio (music), and Stability Audio (SFX). It is self-hosted via Coqui XTTS / Piper / AudioCraft on the GPU runtime, with proxy fallback to ElevenLabs/Suno through services/ai/gateway when local quality is insufficient or capability gaps remain.
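
The local-first routing with proxy fallback described above could be sketched as follows. This is only an illustration under stated assumptions: `Provider`, `ProviderError`, and `synthesize_with_fallback` are hypothetical names, and the sketch covers capability gaps and local failures but does not model the "insufficient quality" trigger.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple


class ProviderError(Exception):
    """Raised by a provider that cannot serve a request (hypothetical)."""


@dataclass(frozen=True)
class Provider:
    name: str
    capabilities: FrozenSet[str]             # e.g. {"tts", "clone"}
    synthesize: Callable[[str, str], bytes]  # (capability, prompt) -> audio


def synthesize_with_fallback(
    capability: str, prompt: str, local: Provider, proxy: Provider
) -> Tuple[str, bytes]:
    """Prefer the local provider; use the proxy on a gap or a failure."""
    if capability in local.capabilities:
        try:
            return local.name, local.synthesize(capability, prompt)
        except ProviderError:
            pass  # local runtime failed; fall through to the proxy
    if capability not in proxy.capabilities:
        raise ProviderError(f"no provider supports {capability!r}")
    return proxy.name, proxy.synthesize(capability, prompt)
```

Returning the provider name alongside the audio lets a caller record which path served the request.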

Boundary vs neighbors

services/ai/voice is the STT (input) sibling. A future RFC may unify both under audio, with audio.stt / audio.tts namespaces.
services/ai/video may reuse synth for audio-track generation in v2.
Audio editing/mastering and live streaming are explicitly out of scope.

Features (v1 target)

TTS: Piper (CPU baseline, fast) + Coqui XTTS (GPU, multilingual, cloning-capable)
Voice cloning: Coqui XTTS with explicit consent capture flow
Music: AudioCraft MusicGen up to 30s
SFX: AudioCraft AudioGen up to 10s
Inaudible watermark on every output (deepfake mitigation)
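
The duration caps listed above (music up to 30s, SFX up to 10s) lend themselves to a request-validation step. The sketch below is one possible shape; `MAX_DURATION_S` and `check_duration` are hypothetical names, and rejecting over-long requests (rather than clamping them) is an assumption of this example.

```python
# Caps from the v1 feature list; the keys are assumed route names.
MAX_DURATION_S = {"music": 30.0, "sfx": 10.0}


def check_duration(kind: str, duration_s: float) -> float:
    """Return duration_s if it is within the cap for kind, else raise."""
    if kind not in MAX_DURATION_S:
        raise ValueError(f"unknown generation kind: {kind!r}")
    limit = MAX_DURATION_S[kind]
    if not 0 < duration_s <= limit:
        raise ValueError(
            f"{kind} duration must be in (0, {limit}] seconds, got {duration_s}"
        )
    return duration_s
```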

Primary couplings

Consumer	Relationship
`services/ai/kode`	Spoken responses for Talk Mode round-trip
`services/ai/agents`	TTS as agent tool (notify, narrate)
`products/horizontal/talk`	Unblocks the bidirectional voice loop
`products/dev/eye`	Spoken descriptions for accessibility
`services/ai/voice`	Symmetric STT pair
`services/ai/gateway`	Provider routing for ElevenLabs/Suno
`services/ai/runtime`	Local Piper/Coqui/AudioCraft serving
`services/ai/cache`	Caches synthesized audio by content hash
`services/ai/billing`	Per-character / per-second usage events
`infra/data/kdb-blob`	Stores generated audio assets
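
Two of the couplings in the table, caching by content hash and per-character / per-second usage events, can be sketched in a few lines. Everything here is an assumption for illustration: the fields that go into the hash, the choice of SHA-256, the event shape, and the rule that the tts route bills per character while all other routes bill per second.

```python
import hashlib
import json
from typing import Optional


def cache_key(route: str, voice: str, text: str,
              params: Optional[dict] = None) -> str:
    """Derive a stable key from the request content (hypothetical fields)."""
    payload = json.dumps(
        {"route": route, "voice": voice, "text": text, "params": params or {}},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def usage_event(route: str, text: str = "", duration_s: float = 0.0) -> dict:
    """Build a usage record: characters for tts, seconds for other routes."""
    if route == "tts":
        return {"route": route, "unit": "character", "quantity": len(text)}
    return {"route": route, "unit": "second", "quantity": duration_s}
```

Serializing with sorted keys means two requests that differ only in parameter order map to the same cache entry.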

RFC and bootstrap

RFC: synth-RFC-001-foundations.kmd, *accepted* 2026-05-09
Bootstrap ticket: services/ai/backlog/done/119-synth-bootstrap.md
Implementation tickets: services/ai/synth/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	pending	Skeleton phase; Piper + Coqui cover TTS + cloning self-hosted, music/SFX via AudioCraft
G2 Performance	pending	Target Piper TTS p50 < 200 ms / 100 chars; Coqui p50 < 800 ms / 100 chars
G3 Stability	pending	Pre-MVP
G4 Capability	pending	TTS + cloning + music <= 30s + SFX; long music out of scope
G5 Critical-path readiness	pending	Pre-MVP; Talk Mode round-trip is the first concrete unblock
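
One way to evaluate the G2 targets is to normalize each measured latency to a 100-character input and take the median. The sketch below assumes that reading of "p50 / 100 chars"; the function names and the sample format are hypothetical, and the thresholds are copied from the table.

```python
from statistics import median
from typing import Iterable, Tuple

# G2 targets from the gate table, in milliseconds per 100 characters.
G2_TARGET_MS = {"piper": 200.0, "coqui": 800.0}


def p50_ms_per_100_chars(samples: Iterable[Tuple[float, int]]) -> float:
    """samples are (latency_ms, n_chars) pairs; returns the median of
    latencies scaled linearly to a 100-character input."""
    normalized = [
        latency_ms * 100.0 / n_chars
        for latency_ms, n_chars in samples
        if n_chars > 0
    ]
    if not normalized:
        raise ValueError("no usable samples")
    return median(normalized)


def meets_g2(provider: str, samples: Iterable[Tuple[float, int]]) -> bool:
    """True if the normalized p50 is strictly below the provider's target."""
    return p50_ms_per_100_chars(samples) < G2_TARGET_MS[provider]
```

Linear scaling is a simplification: fixed per-request overhead makes short inputs look slower than they would at 100 characters, so a real benchmark may prefer inputs that are close to 100 characters.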