Training (Fine-tuning Service): foundations

accepted

Training (Fine-tuning Service) — foundations RFC

Status

*Accepted*: ratified 2026-05-09 (the same day it was opened) as part of the services/ai bootstrap pilot wave. Implementation has started in `services/ai/training/`; tickets are in `services/ai/training/backlog/pending/{001..005}`.

Summary

Foundation for fine-tuning: LoRA, QLoRA, SFT, DPO. Replaces the current ad hoc setup (libras has a one-off ML worker spike).

Motivation

libras/ has an isolated ML worker spike (#001). Without a shared service, every fine-tune becomes another one-off project. Centralizing gains a checkpoint cache, GPU scheduling, and MLflow integration.

Scope

In

  • LoRA/QLoRA/SFT pipeline
  • GPU job scheduling
  • Checkpoint registry
  • MLflow tracking
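
To make the "GPU job scheduling" item concrete, here is a minimal sketch of one possible policy: strict FIFO on a single node, with no backfilling. The RFC does not choose a policy, so the class names, the `gpus` field, and the FIFO rule itself are all assumptions for illustration. It is written in Python for brevity, although the backend is planned in Go.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    gpus: int  # GPUs requested by the job (hypothetical field)


class FifoGpuScheduler:
    """Single-node FIFO scheduler: jobs start strictly in arrival order."""

    def __init__(self, total_gpus: int) -> None:
        self.total_gpus = total_gpus
        self.free_gpus = total_gpus
        self.pending: deque[Job] = deque()
        self.running: dict[str, Job] = {}

    def submit(self, job: Job) -> None:
        # Reject jobs that could never fit on this node.
        if not 1 <= job.gpus <= self.total_gpus:
            raise ValueError(f"job {job.job_id} requests {job.gpus} GPUs")
        self.pending.append(job)

    def schedule(self) -> list[str]:
        """Start queued jobs while the head of the queue fits; return their ids."""
        started = []
        while self.pending and self.pending[0].gpus <= self.free_gpus:
            job = self.pending.popleft()
            self.free_gpus -= job.gpus
            self.running[job.job_id] = job
            started.append(job.job_id)
        return started

    def finish(self, job_id: str) -> None:
        """Release the GPUs held by a running job."""
        job = self.running.pop(job_id)
        self.free_gpus += job.gpus
```

Strict FIFO keeps large jobs from starving, at the cost of leaving GPUs idle while a large job waits at the head of the queue. A real orchestrator would likely need priorities or backfilling on top of this.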

Out (yet)

  • Pretraining (prohibitive scope)
  • Distributed multi-node (v2)

Initial design

Surfaces

  • backend/: Go API + GPU job orchestrator
  • app/: not applicable in v1

Key APIs

  • POST /v1/train/jobs: submit a job
  • GET /v1/train/jobs/{id}: status and logs
  • POST /v1/train/checkpoints: register a checkpoint
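
As a rough illustration of how a client might address these three endpoints, the sketch below builds the requests without sending them. Only the paths and HTTP verbs come from this RFC. The base URL, the JSON field names (`base_model`, `dataset`, `method`, `job_id`, `step`, `uri`), and the lowercase method identifiers are assumptions, since no payload schema has been defined yet.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Hypothetical values: the RFC names neither a host nor a payload schema.
BASE_URL = "http://localhost:8080"
METHODS = {"lora", "qlora", "sft", "dpo"}  # methods listed in the Summary


def _json_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a POST request with a JSON body."""
    return Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_job(base_model: str, dataset: str, method: str) -> Request:
    """POST /v1/train/jobs: submit a fine-tuning job."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    return _json_post(
        "/v1/train/jobs",
        {"base_model": base_model, "dataset": dataset, "method": method},
    )


def job_status(job_id: str) -> Request:
    """GET /v1/train/jobs/{id}: fetch status and logs."""
    # Percent-encode the id so it cannot introduce extra path segments.
    return Request(f"{BASE_URL}/v1/train/jobs/{quote(job_id, safe='')}", method="GET")


def register_checkpoint(job_id: str, step: int, uri: str) -> Request:
    """POST /v1/train/checkpoints: register a checkpoint."""
    return _json_post(
        "/v1/train/checkpoints",
        {"job_id": job_id, "step": step, "uri": uri},
    )
```

Once a server exists, each `Request` can be passed to `urllib.request.urlopen`. Keeping construction separate from sending makes the payload shape easy to review before the schema is fixed.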

Dependencies

  • services/ai/dataset: versioned input datasets
  • services/ai/modelreg: base models
  • services/ai/runtime: post-training serving
  • infra/data/kdb-blob: checkpoint storage
  • infra/observe: GPU metrics

Relation to existing sectors

  • Consumes dataset; produces checkpoints for modelreg
  • Implicit blocker for libras #001
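
The data flow described above (dataset in, checkpoints out to modelreg) suggests that each checkpoint record should carry its lineage. The following is a minimal in-memory sketch of that idea. The record fields, the `(job_id, step)` key, and the SHA-256 digest are assumptions made for illustration; nothing here reflects an agreed registry schema or the kdb-blob API.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    job_id: str
    step: int
    base_model: str  # reference into modelreg (format assumed)
    dataset: str     # reference into dataset (format assumed)
    sha256: str      # digest of the checkpoint bytes


class CheckpointRegistry:
    """In-memory registry keyed by (job_id, step)."""

    def __init__(self) -> None:
        self._by_key: dict[tuple[str, int], Checkpoint] = {}

    def register(self, job_id: str, step: int, base_model: str,
                 dataset: str, payload: bytes) -> Checkpoint:
        key = (job_id, step)
        if key in self._by_key:
            raise ValueError(f"checkpoint already registered: {key}")
        digest = hashlib.sha256(payload).hexdigest()
        ckpt = Checkpoint(job_id, step, base_model, dataset, digest)
        self._by_key[key] = ckpt
        return ckpt

    def latest(self, job_id: str) -> Checkpoint | None:
        """Return the highest-step checkpoint for a job, or None if there is none."""
        candidates = [c for c in self._by_key.values() if c.job_id == job_id]
        return max(candidates, key=lambda c: c.step, default=None)
```

Recording the base model and dataset reference on every checkpoint would let modelreg trace any published model back to its inputs. The digest gives a cheap integrity check when the bytes are later read back from storage.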

Self-hosted-first analysis (5 gates)

  • *1 Feature parity*: zero
  • *2 Performance*: N/A
  • *3 Stability*: N/A
  • *4 Capability*: axolotl/unsloth/transformers are viable
  • *5 Critical-path readiness*: blocks scaling of domain-specific models (libras, koda, vertical pieces)

Open questions

  • Q1: GPU on-prem (s.k.lin?) vs cloud burst?
  • Q2: Adopt MLflow directly or wrap it?

Next steps

  1. Ratify this RFC (1 round of comments).
  2. Create the sector dir services/ai/training/ with koder.toml, README.md, and a skeleton.
  3. Open implementation tickets in services/ai/training/backlog/pending/.
  4. Register in meta/docs/stack/registries/self-hosted-pairs.md if it replaces an external service.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/training-RFC-001-foundations.kmd