AI Synth — Audio Synthesis Foundation

  • *Area:* Intelligence
  • *Path:* services/ai/synth
  • *Kind:* Audio synthesis foundation (TTS + cloning + music + SFX)
  • *Status:* v0.1.0. Foundation landed 2026-05-24: HTTP daemon koder-synth, CLI ksynth, voice registry (5 builtins seeded), and 4 routes (tts, music, sfx, clone). The default provider is the deterministic *stub* (a silent-WAV generator), so the API contract is exercisable without GPU dependencies; the Piper adapter is a typed stub awaiting synth#004. The consent flow validates token shape only; real token validation against the id/engine consent service lands in *synth#019* (new follow-up).
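
To illustrate what a deterministic silent-WAV stub provider could look like, here is a minimal Python sketch. It is not the koder-synth implementation: the function name `silent_wav`, the 22050 Hz sample rate, and the 50 ms-per-character duration rule are all assumptions chosen for illustration. The only property it aims to show is that the same text always yields the same valid WAV bytes, with no model or GPU involved.

```python
import io
import wave


def silent_wav(text: str, sample_rate: int = 22050, ms_per_char: int = 50) -> bytes:
    """Return mono 16-bit PCM silence whose length depends only on len(text).

    Hypothetical stand-in for a stub provider; the duration rule is an assumption.
    """
    n_frames = sample_rate * ms_per_char * len(text) // 1000
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)        # mono
        w.setsampwidth(2)        # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)  # all-zero samples are silence
    return buf.getvalue()
```

Because the output is a pure function of the input, a client or contract test can compare bytes directly across runs, which is the property that makes a stub like this useful before real providers land.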

Role in the stack

synth is the symmetric pair of services/ai/voice (STT). Without it, Talk Mode in products/horizontal/talk is a half-loop: the user is heard, but the answer comes back as text. Narration, audio branding, accessibility (screen reader for visually impaired users), in-product tutorials, and agent loops with audio responses are all blocked.

It is the Koder analog of ElevenLabs (TTS + cloning), Suno/Udio (music), and Stability Audio (SFX). It is self-hosted via Coqui XTTS / Piper / AudioCraft on the GPU runtime, with proxy fallback to ElevenLabs/Suno through services/ai/gateway when local quality is insufficient or capability gaps remain.
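
The local-first, proxy-fallback order described above can be sketched as follows. This is a hypothetical illustration, not the gateway's routing code: the `Provider` type, the `route` function, and the convention that a provider returns `None` when it cannot produce acceptable audio are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Provider:
    """Hypothetical provider record: a name, what it can do, and a synth callable."""
    name: str
    capabilities: frozenset
    synthesize: Callable[[str], Optional[bytes]]  # None = failed or quality too low


def route(capability: str, text: str,
          local: Sequence[Provider],
          proxied: Sequence[Provider]) -> Tuple[str, bytes]:
    """Try local providers first, then proxied ones; return (provider name, audio)."""
    for provider in [*local, *proxied]:
        if capability not in provider.capabilities:
            continue  # capability gap: skip to the next candidate
        audio = provider.synthesize(text)
        if audio is not None:
            return provider.name, audio
    raise LookupError(f"no provider could serve capability {capability!r}")
```

Under these assumptions, a request for a capability that no local provider offers falls through to the proxied list without any special casing, and a local provider that declines (returns `None`) is treated the same way.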

Boundary vs neighbors

  • services/ai/voice is the STT (input) sibling. A future RFC may unify both under audio, with audio.stt / audio.tts namespaces.
  • services/ai/video may reuse synth for audio-track generation in v2.
  • Audio editing/mastering and live streaming are explicitly out of scope.

Features (v1 target)

  • TTS: Piper (CPU baseline, fast) + Coqui XTTS (GPU, multilingual, cloning-capable)
  • Voice cloning: Coqui XTTS with explicit consent capture flow
  • Music: AudioCraft MusicGen up to 30s
  • SFX: AudioCraft AudioGen up to 10s
  • Inaudible watermark on every output (deepfake mitigation)
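
The duration caps in the list above (music up to 30s, SFX up to 10s) lend themselves to a simple request-time check. The sketch below is an assumption about how such a guard might look; the names `MAX_SECONDS` and `check_duration` are hypothetical, and the source does not say whether TTS or cloning have a cap, so none is applied to them here.

```python
# Caps taken from the feature list; everything else in this sketch is hypothetical.
MAX_SECONDS = {"music": 30.0, "sfx": 10.0}


def check_duration(kind: str, seconds: float) -> float:
    """Return the requested duration if it is positive and within the cap for `kind`."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    limit = MAX_SECONDS.get(kind)  # kinds without a documented cap are not limited
    if limit is not None and seconds > limit:
        raise ValueError(f"{kind} is limited to {limit:g}s, got {seconds:g}s")
    return seconds
```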

Primary couplings

Consumer                 | Relationship
services/ai/kode         | Spoken responses for Talk Mode round-trip
services/ai/agents       | TTS as agent tool (notify, narrate)
products/horizontal/talk | Bidirectional voice loop unblock
products/dev/eye         | Spoken descriptions for accessibility
services/ai/voice        | Symmetric STT pair
services/ai/gateway      | Provider routing for ElevenLabs/Suno
services/ai/runtime      | Local Piper/Coqui/AudioCraft serving
services/ai/cache        | Caches synthesized audio by content hash
services/ai/billing      | Per-character / per-second usage events
infra/data/kdb-blob      | Stores generated audio assets
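
Caching synthesized audio by content hash implies a stable key derived from everything that affects the output. The following is a minimal sketch of one way to build such a key; the field names, the choice of SHA-256, and the canonical-JSON encoding are assumptions, not the services/ai/cache contract.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def cache_key(kind: str, text: str, voice: str,
              params: Optional[Mapping[str, Any]] = None) -> str:
    """Hash the request fields into a hex digest usable as a cache key.

    Hypothetical scheme: canonical JSON (sorted keys, fixed separators) so that
    equivalent requests serialize to identical bytes before hashing.
    """
    payload = json.dumps(
        {"kind": kind, "text": text, "voice": voice, "params": dict(params or {})},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys matters here: without it, two requests that differ only in parameter order would miss each other in the cache.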

RFC and bootstrap

  • RFC: synth-RFC-001-foundations.kmd (*accepted* 2026-05-09)
  • Bootstrap ticket: services/ai/backlog/done/119-synth-bootstrap.md
  • Implementation tickets: services/ai/synth/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate                        | Status  | Notes
G1 Feature parity           | pending | Skeleton phase; Piper + Coqui cover TTS + cloning self-hosted, music/SFX via AudioCraft
G2 Performance              | pending | Target Piper TTS p50 < 200ms / 100 chars; Coqui p50 < 800ms / 100 chars
G3 Stability                | pending | Pre-MVP
G4 Capability               | pending | TTS + cloning + music <= 30s + SFX; long music out of scope
G5 Critical-path readiness  | pending | Pre-MVP; Talk Mode round-trip is the first concrete unblock
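
The G2 targets can be checked mechanically once latency samples exist. The sketch below assumes one reading of "p50 < 200ms / 100 chars": each measured latency is scaled to a 100-character input before taking the median. That normalization, the sample format, and the names `p50_ms_per_100_chars`, `TARGETS_MS`, and `meets_g2` are assumptions for illustration.

```python
from statistics import median
from typing import Iterable, Tuple

# Targets from the G2 row; the provider keys are hypothetical labels.
TARGETS_MS = {"piper": 200.0, "coqui": 800.0}


def p50_ms_per_100_chars(samples: Iterable[Tuple[float, int]]) -> float:
    """Median latency in ms, with each (latency_ms, n_chars) sample scaled to 100 chars."""
    return median(latency * 100 / n_chars for latency, n_chars in samples if n_chars > 0)


def meets_g2(provider: str, samples: Iterable[Tuple[float, int]]) -> bool:
    """True when the normalized p50 is strictly below the provider's target."""
    return p50_ms_per_100_chars(samples) < TARGETS_MS[provider]
```

If the intended meaning is instead "p50 latency for inputs of exactly 100 characters", the scaling step can be dropped and the benchmark restricted to inputs of that length.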

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-synth.md