AI Training — Fine-tuning Service
- **Area:** Intelligence
- **Path:** services/ai/training
- **Kind:** Fine-tuning service (LoRA + QLoRA + SFT + DPO with GPU scheduling, checkpoint registry, MLflow tracking)
- **Status:** v0.0.1, sector bootstrapping (2026-05-09)
Role in the stack
training consolidates fine-tuning. Today the only ML training in the stack is a one-off ML worker spike inside products/horizontal/libras. Without a shared service, every domain-specific fine-tune becomes a project of its own. Centralizing gives one queue, one fairness policy, one checkpoint home, and one tracking dashboard.
It is the Koder analog of OpenAI fine-tuning, Together AI Fine-tune, Modal Labs, and Replicate Train, but self-hosted on an on-prem GPU pool with axolotl / unsloth / transformers as pluggable runners. Pre-training is explicitly out of scope (cost-prohibitive); multi-node distributed training is deferred to v2.
Boundary vs neighbors
- services/ai/dataset is the input side (versioned datasets).
- services/ai/modelreg is the model registry (base models in, checkpoints out).
- services/ai/runtime is the output side (deploys approved checkpoints for serving).
- infra/observe provides GPU + job telemetry.
Features (v1 target)
- 4 pipelines: LoRA, QLoRA, SFT (full + LoRA), DPO
- Unified job-spec schema (consumers don't write axolotl YAML)
- GPU pool with FIFO-per-tier scheduling (enterprise → pro → free)
- GPU-hour quotas with pre-flight cost estimate
- Checkpoint registry with draft → approved → deprecated lifecycle
- MLflow tracking sidecar
- 2 runner backends: axolotl (broad coverage) + unsloth (faster QLoRA on small models)
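
The FIFO-per-tier scheduling listed above can be illustrated with a small priority queue. This is only a sketch under assumptions: `JobSpec`, its fields, `TierFifoQueue`, and the numeric tier ranks are hypothetical names chosen for illustration, and strict tier priority is one possible reading of "enterprise → pro → free", not the service's confirmed policy.

```python
import heapq
import itertools
from dataclasses import dataclass

# Assumed ranking: a lower number is served first (enterprise -> pro -> free).
TIER_RANK = {"enterprise": 0, "pro": 1, "free": 2}


@dataclass(frozen=True)
class JobSpec:
    """Illustrative job spec; field names are assumptions, not the real schema."""
    job_id: str
    tier: str      # "enterprise", "pro", or "free"
    pipeline: str  # "lora", "qlora", "sft", or "dpo"


class TierFifoQueue:
    """Strict priority across tiers, first-in-first-out within a tier."""

    def __init__(self):
        self._heap = []
        # Monotonic counter: preserves submission order inside a tier and
        # guarantees heap entries never fall back to comparing JobSpec objects.
        self._seq = itertools.count()

    def submit(self, job):
        if job.tier not in TIER_RANK:
            raise ValueError(f"unknown tier: {job.tier}")
        heapq.heappush(self._heap, (TIER_RANK[job.tier], next(self._seq), job))

    def next_job(self):
        """Return the next job to run, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = TierFifoQueue()
    for job_id, tier in [("a", "free"), ("b", "pro"), ("c", "enterprise"), ("d", "pro")]:
        queue.submit(JobSpec(job_id, tier, "lora"))
    while (job := queue.next_job()) is not None:
        print(job.job_id, job.tier)
```

Under strict priority a steady stream of enterprise jobs could starve the free tier, so a real fairness policy would likely add aging or per-tier GPU shares on top of this ordering.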
Primary couplings
| Consumer | Relationship |
|---|---|
| products/horizontal/libras | First domain ML use case (#001 unblock) |
| engines/lang/koda | Future per |
| services/ai/recsys | Re-ranker training |
| services/ai/embed | Custom |
| services/ai/dataset | Input datasets (versioned) |
| services/ai/modelreg | Base models + checkpoint registry |
| services/ai/runtime | Promotes approved checkpoints |
| services/ai/billing | GPU-hour usage events |
| infra/data/kdb-blob | Checkpoint storage |
| infra/observe | GPU metrics |
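
The coupling with services/ai/runtime depends on the checkpoint lifecycle (draft → approved → deprecated), since only approved checkpoints are promoted. A minimal sketch of that lifecycle as a state machine follows. The names `Stage`, `transition`, and `is_deployable` are hypothetical, and the strictly linear transition table is an assumption; the actual registry may allow other moves, such as deprecating a draft directly.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


# Assumed forward-only transitions: draft -> approved -> deprecated.
ALLOWED = {
    Stage.DRAFT: {Stage.APPROVED},
    Stage.APPROVED: {Stage.DEPRECATED},
    Stage.DEPRECATED: set(),
}


def transition(current, target):
    """Return the new stage, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_deployable(stage):
    """Only approved checkpoints are eligible for promotion to serving."""
    return stage is Stage.APPROVED


if __name__ == "__main__":
    stage = Stage.DRAFT
    print(stage.value, is_deployable(stage))
    stage = transition(stage, Stage.APPROVED)
    print(stage.value, is_deployable(stage))
    stage = transition(stage, Stage.DEPRECATED)
    print(stage.value, is_deployable(stage))
```

Keeping the allowed moves in a single table makes it easy to audit or extend the lifecycle without touching the promotion check.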
RFC and bootstrap
- RFC: training-RFC-001-foundations.kmd, **Accepted** 2026-05-09
- Bootstrap ticket: services/ai/backlog/done/122-training-bootstrap.md
- Implementation tickets: services/ai/training/backlog/pending/{001..005}
Self-hosted-first analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | Skeleton phase; LoRA/QLoRA/SFT/DPO via axolotl+unsloth all self-hosted |
| G2 Performance | pending | Throughput |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Pretraining + multi-node out of scope; everything else covered |
| G5 Critical-path readiness | pending | Pre |