Ai runtime

AI Runtime — Local LLM Serving

  • *rea:*Intelligence
  • *ath:*ai/runtime
  • *ind:*LLM runtime (download, serve, quantize, fine-tune)

Role in the stack

AI Runtime is the local LLM serving layer of Koder. It downloads models from ai/zoo (or upstream providers), serves them via an OpenAIcompatible API, quantizes them for the target hardware, and optionally finetunes them on tenant data. Its consumers are ai/gateway (which routes traffic to Runtime as one of its backends) and any service that needs local inference without leaving the Koder perimeter.

Deployed in production as LXC 127 on s.r1 using Ollama as the inference backend.

Features

  • Download + serve models from ai/zoo or upstream registries
  • Quantization for GPU / CPU targets
  • Fine-tuning pipeline
  • OpenAI-compat /v1/ API
  • Scales from laptop to cluster

Primary couplings

Consumer Relationship
ai/gateway Routes local-model requests to Runtime
ai/zoo Model registry source
ai/voice Calls Runtime via Gateway for transcription summarization

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-runtime.md