
AI Training — Fine-tuning Service

  • *Area:* Intelligence
  • *Path:* services/ai/training
  • *Kind:* Fine-tuning service (LoRA + QLoRA + SFT + DPO with GPU scheduling, checkpoint registry, MLflow tracking)
  • *Status:* v0.0.1, sector bootstrapping (2026-05-09)

Role in the stack

training consolidates fine-tuning. Today the only ML training in the stack is a one-off ML worker spike inside products/horizontal/libras; without a shared service, every domain-specific fine-tune becomes a project of its own. Centralizing gives one queue, one fairness policy, one checkpoint home, and one tracking dashboard.

It is the Koder analog of OpenAI fine-tuning, Together AI Fine-tune, Modal Labs and Replicate Train: self-hosted on an on-prem GPU pool with axolotl / unsloth / transformers as pluggable runners. Pre-training is explicitly out of scope (cost-prohibitive); multi-node distributed training is deferred to v2.
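As a rough illustration of what "pluggable runners" could mean, the sketch below keeps a registry of runner functions keyed by name and picks one per job. Everything in it is hypothetical: the `JobSpec` fields, the `register` decorator, the `pick_runner` rule and the placeholder command names are assumptions for illustration, not the service's schema and not the real axolotl or unsloth command lines.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical minimal job description; field names are assumptions."""
    pipeline: str    # "lora", "qlora", "sft" or "dpo"
    base_model: str  # identifier of the base model
    dataset: str     # identifier of the versioned dataset


# A runner maps a JobSpec to the command that would launch it.
Runner = Callable[[JobSpec], List[str]]

RUNNERS: Dict[str, Runner] = {}


def register(name: str) -> Callable[[Runner], Runner]:
    """Decorator that adds a runner to the registry under `name`."""
    def decorator(fn: Runner) -> Runner:
        RUNNERS[name] = fn
        return fn
    return decorator


@register("axolotl")
def axolotl_runner(spec: JobSpec) -> List[str]:
    # Placeholder command, NOT the real axolotl CLI.
    return ["run-axolotl", spec.pipeline, spec.base_model, spec.dataset]


@register("unsloth")
def unsloth_runner(spec: JobSpec) -> List[str]:
    # Placeholder command, NOT the real unsloth entry point.
    return ["run-unsloth", spec.pipeline, spec.base_model, spec.dataset]


def pick_runner(spec: JobSpec, small_model: bool) -> str:
    """Assumed rule: unsloth for QLoRA on small models, axolotl otherwise."""
    if spec.pipeline == "qlora" and small_model:
        return "unsloth"
    return "axolotl"
```

With this shape, adding a transformers-based runner would only need one more decorated function; callers keep going through `RUNNERS[pick_runner(...)]`.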

Boundary vs neighbors

  • services/ai/dataset is the input side (versioned datasets).
  • services/ai/modelreg is the model registry (base models in, checkpoints out).
  • services/ai/runtime is the output side (deploys approved checkpoints for serving).
  • infra/observe provides GPU + job telemetry.
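The handoff to services/ai/runtime depends on the checkpoint lifecycle named under Features (draft → approved → deprecated). A minimal sketch of that lifecycle as a state machine follows, assuming transitions are strictly forward; the `Stage`, `transition` and `deployable` names are hypothetical and the real registry may allow other moves.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    DEPRECATED = "deprecated"


# Assumption: only the forward chain draft -> approved -> deprecated is legal.
ALLOWED = {
    Stage.DRAFT: {Stage.APPROVED},
    Stage.APPROVED: {Stage.DEPRECATED},
    Stage.DEPRECATED: set(),
}


def transition(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


def deployable(stage: Stage) -> bool:
    """Only approved checkpoints are eligible for serving."""
    return stage is Stage.APPROVED
```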

Features (v1 target)

  • 4 pipelines: LoRA, QLoRA, SFT (full + LoRA), DPO
  • Unified job-spec schema (consumers don't write axolotl YAML)
  • GPU pool with FIFO-per-tier scheduling (enterprise → pro → free)
  • GPU-hour quotas with pre-flight cost estimate
  • Checkpoint registry with draft → approved → deprecated lifecycle
  • MLflow tracking sidecar
  • 2 runner backends: axolotl (broad coverage) + unsloth (faster QLoRA on small models)
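One way to read "FIFO-per-tier" is strict priority between tiers with first-in-first-out order inside each tier. The sketch below implements that reading with a heap keyed on (tier rank, arrival sequence). The class and method names are hypothetical, and the strict-priority interpretation is an assumption: under it a busy enterprise tier can starve free-tier jobs, so a real fairness policy might add aging or per-tier shares.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

# Lower rank is served first: enterprise, then pro, then free.
TIER_RANK = {"enterprise": 0, "pro": 1, "free": 2}


class TierFifoQueue:
    """Strict priority across tiers, FIFO within a tier."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._seq = itertools.count()  # monotonically increasing arrival counter

    def submit(self, job_id: str, tier: str) -> None:
        # The arrival counter breaks ties, so equal-tier jobs keep arrival order.
        heapq.heappush(self._heap, (TIER_RANK[tier], next(self._seq), job_id))

    def next_job(self) -> Optional[str]:
        """Pop the next job to schedule, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = TierFifoQueue()
for job_id, tier in [("a", "free"), ("b", "enterprise"), ("c", "pro"), ("d", "enterprise")]:
    queue.submit(job_id, tier)

order = [queue.next_job() for _ in range(4)]
print(order)  # ['b', 'd', 'c', 'a']
```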

Primary couplings

| Consumer | Relationship |
|---|---|
| products/horizontal/libras | First domain ML use case (#001 unblock) |
| engines/lang/koda | Future per-domain codegen fine-tunes |
| services/ai/recsys | Re-ranker training |
| services/ai/embed | Custom-domain embedding fine-tunes |
| services/ai/dataset | Input datasets (versioned) |
| services/ai/modelreg | Base models + checkpoint registry |
| services/ai/runtime | Promotes approved checkpoints |
| services/ai/billing | GPU-hour usage events |
| infra/data/kdb-blob | Checkpoint storage |
| infra/observe | GPU metrics |
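The GPU-hour quota and the billing coupling both rest on the same arithmetic: GPU-hours are the number of GPUs multiplied by wall-clock hours. A minimal sketch of a pre-flight check under that definition follows; the function name, parameters and the `Estimate` type are hypothetical, and the wall-clock estimate itself is taken as an input rather than derived.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    gpu_hours: float     # projected cost of the job in GPU-hours
    within_quota: bool   # True if the job fits in the remaining quota


def preflight(num_gpus: int, est_wall_hours: float,
              used_gpu_hours: float, quota_gpu_hours: float) -> Estimate:
    """Estimate a job's GPU-hours and check it against the consumer's quota."""
    if num_gpus <= 0 or est_wall_hours < 0:
        raise ValueError("num_gpus must be positive and est_wall_hours non-negative")
    gpu_hours = num_gpus * est_wall_hours
    return Estimate(gpu_hours, used_gpu_hours + gpu_hours <= quota_gpu_hours)


# 2 GPUs for 3 hours = 6 GPU-hours; 10 already used + 6 fits in a quota of 20.
print(preflight(2, 3.0, used_gpu_hours=10.0, quota_gpu_hours=20.0))
```

Rejecting a job at submission time, before it occupies a GPU, is the point of running the check pre-flight.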

RFC and bootstrap

  • RFC: training-RFC-001-foundations.kmd (*Accepted* 2026-05-09)
  • Bootstrap ticket: services/ai/backlog/done/122-training-bootstrap.md
  • Implementation tickets: services/ai/training/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | Skeleton phase; LoRA/QLoRA/SFT/DPO via axolotl + unsloth, all self-hosted |
| G2 Performance | pending | Throughput-bound; v1 single-node, multi-node v2 |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Pre-training + multi-node out of scope; everything else covered |
| G5 Critical-path readiness | pending | Pre-MVP; libras + Koda fine-tunes are the first concrete unblocks |

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-training.md