AI Training — Fine-tuning Service

*Area:* Intelligence
*Path:* services/ai/training
*Kind:* Fine-tuning service (LoRA + QLoRA + SFT + DPO with GPU scheduling, checkpoint registry, MLflow tracking)
*Status:* v0.0.1, sector bootstrapping (20260509)

Role in the stack

The training service consolidates fine-tuning. Today the only ML training in the stack is a one-off ML worker spike inside products/horizontal/libras; without a shared service, every domain-specific fine-tune becomes a project of its own. Centralizing gives one queue, one fairness policy, one checkpoint home, and one tracking dashboard.

It is the Koder analog of OpenAI fine-tuning, Together AI Finetune, Modal Labs and Replicate Train: self-hosted on an on-prem GPU pool, with axolotl / unsloth / transformers as pluggable runners. Pretraining is explicitly out of scope (cost-prohibitive); multi-node distributed training is deferred to v2.

Boundary vs neighbors

services/ai/dataset is the input side (versioned datasets).
services/ai/modelreg is the model registry (base models in, checkpoints out).
services/ai/runtime is the output side (deploys approved checkpoints for serving).
infra/observe provides GPU + job telemetry.
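
The boundary with services/ai/runtime depends on checkpoint status: only approved checkpoints are deployed. The following is a minimal sketch of such a gate, assuming the draft → approved → deprecated lifecycle listed under Features is strictly one-way. The names `CheckpointState`, `transition` and `is_deployable` are hypothetical and do not come from the RFC.

```python
from enum import Enum


class CheckpointState(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


# Assumption: transitions only move forward, with no skipping and no reverting.
ALLOWED_TRANSITIONS = {
    CheckpointState.DRAFT: {CheckpointState.APPROVED},
    CheckpointState.APPROVED: {CheckpointState.DEPRECATED},
    CheckpointState.DEPRECATED: set(),
}


def transition(current: CheckpointState, target: CheckpointState) -> CheckpointState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def is_deployable(state: CheckpointState) -> bool:
    """Only approved checkpoints may be handed to the serving side."""
    return state is CheckpointState.APPROVED


state = transition(CheckpointState.DRAFT, CheckpointState.APPROVED)
print(is_deployable(state))  # True
```

Whether a draft can be deprecated directly, or an approval reverted, is a policy question this sketch does not settle.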

Features (v1 target)

4 pipelines: LoRA, QLoRA, SFT (full + LoRA), DPO
Unified job-spec schema (consumers don't write axolotl YAML)
GPU pool with FIFO-per-tier scheduling (enterprise → pro → free)
GPU-hour quotas with pre-flight cost estimate
Checkpoint registry with draft → approved → deprecated lifecycle
MLflow tracking sidecar
2 runner backends: axolotl (broad coverage) + unsloth (faster QLoRA on small models)
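
FIFO-per-tier scheduling with a pre-flight quota check can be sketched as a priority queue keyed on (tier rank, arrival order). This is an illustration under stated assumptions, not the service's scheduler: the `Job` fields, the `QuotaExceeded` exception and the `TierFifoQueue` class are hypothetical, and the GPU-hour estimate is taken as given rather than computed.

```python
import heapq
import itertools
from dataclasses import dataclass

# Lower rank is served first: enterprise -> pro -> free.
TIER_RANK = {"enterprise": 0, "pro": 1, "free": 2}


@dataclass
class Job:
    job_id: str
    tier: str
    gpu_hours_estimate: float  # hypothetical pre-flight cost estimate


class QuotaExceeded(Exception):
    """The job's estimate does not fit the remaining GPU-hour quota."""


class TierFifoQueue:
    def __init__(self) -> None:
        self._heap = []
        # Monotonic arrival counter: keeps FIFO order inside each tier
        # and guarantees Job objects are never compared directly.
        self._seq = itertools.count()

    def submit(self, job: Job, remaining_gpu_hours: float) -> None:
        # Pre-flight check: reject before queueing instead of failing mid-run.
        if job.gpu_hours_estimate > remaining_gpu_hours:
            raise QuotaExceeded(job.job_id)
        heapq.heappush(self._heap, (TIER_RANK[job.tier], next(self._seq), job))

    def next_job(self):
        """Pop the next job to run, or return None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = TierFifoQueue()
queue.submit(Job("a", "free", 1.0), remaining_gpu_hours=10.0)
queue.submit(Job("b", "enterprise", 2.0), remaining_gpu_hours=10.0)
queue.submit(Job("c", "pro", 1.0), remaining_gpu_hours=10.0)
queue.submit(Job("d", "enterprise", 1.0), remaining_gpu_hours=10.0)
order = [queue.next_job().job_id for _ in range(4)]
print(order)  # ['b', 'd', 'c', 'a']
```

Strict tier priority like this can starve the free tier under sustained load; any aging or fairness rule is left out of the sketch.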

Primary couplings

Consumer	Relationship
`products/horizontal/libras`	First domain ML use case (#001 unblock)
`engines/lang/koda`	Future per-domain code-gen fine-tunes
`services/ai/recsys`	Re-ranker training
`services/ai/embed`	Custom-domain embedding fine-tunes
`services/ai/dataset`	Input datasets (versioned)
`services/ai/modelreg`	Base models + checkpoint registry
`services/ai/runtime`	Promotes approved checkpoints
`services/ai/billing`	GPU-hour usage events
`infra/data/kdb-blob`	Checkpoint storage
`infra/observe`	GPU metrics

RFC and bootstrap

RFC: training-RFC-001-foundations.kmd, *Accepted* 20260509
Bootstrap ticket: services/ai/backlog/done/122-training-bootstrap.md
Implementation tickets: services/ai/training/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	pending	Skeleton phase; LoRA/QLoRA/SFT/DPO via axolotl + unsloth, all self-hosted
G2 Performance	pending	Throughput-bound; v1 single-node, multi-node v2
G3 Stability	pending	Pre-MVP
G4 Capability	pending	Pretraining + multi-node out of scope; everything else covered
G5 Critical-path readiness	pending	Pre-MVP; libras + Koda fine-tunes are the first concrete unblocks