Training (Fine-tuning Service): foundations
Training (Fine-tuning Service) — foundations RFC
Status
*Accepted*: ratified 2026-05-09 (the same day it was opened) as part of the services/ai bootstrap pilot wave. Implementation has started in `services/ai/training`; tickets are in `services/ai/training/backlog/pending/{001..005}`.
Summary
Foundation for fine-tuning: LoRA, QLoRA, SFT, DPO. Replaces the current ad hoc setup (libras has a one-off ML worker spike).
Motivation
libras/ has an isolated ML worker spike (#001). Without a service, every fine-tune becomes a one-of-a-kind project. Centralizing gains a checkpoint cache, GPU scheduling, and MLflow integration.
Scope
In
- LoRA/QLoRA/SFT pipeline
- GPU job scheduling
- Checkpoint registry
- MLflow tracking
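As a rough illustration of the GPU job scheduling item above, here is a minimal sketch of a single-node FIFO scheduler. It is not the planned orchestrator design: the `Scheduler` type, its methods, and the one-GPU-per-job rule are all assumptions made for the example.

```go
package main

import "fmt"

// Scheduler is a hypothetical single-node FIFO scheduler: jobs wait in
// submission order and each running job holds exactly one GPU (assumption).
type Scheduler struct {
	free    []int          // IDs of idle GPUs
	queue   []string       // job IDs waiting for a GPU, oldest first
	running map[string]int // job ID -> GPU ID
}

// NewScheduler creates a scheduler with GPUs numbered 0..gpus-1, all idle.
func NewScheduler(gpus int) *Scheduler {
	s := &Scheduler{running: map[string]int{}}
	for i := 0; i < gpus; i++ {
		s.free = append(s.free, i)
	}
	return s
}

// Submit enqueues a job and starts it right away if a GPU is idle.
func (s *Scheduler) Submit(jobID string) {
	s.queue = append(s.queue, jobID)
	s.dispatch()
}

// Finish releases the GPU held by jobID and starts the next queued job.
func (s *Scheduler) Finish(jobID string) {
	gpu, ok := s.running[jobID]
	if !ok {
		return // unknown or already finished job
	}
	delete(s.running, jobID)
	s.free = append(s.free, gpu)
	s.dispatch()
}

// dispatch pairs waiting jobs with idle GPUs until one side runs out.
func (s *Scheduler) dispatch() {
	for len(s.queue) > 0 && len(s.free) > 0 {
		job, gpu := s.queue[0], s.free[0]
		s.queue, s.free = s.queue[1:], s.free[1:]
		s.running[job] = gpu
	}
}

func main() {
	s := NewScheduler(1)
	s.Submit("job-a")
	s.Submit("job-b") // waits: the only GPU is busy
	s.Finish("job-a") // job-b takes over the freed GPU
	fmt.Println(s.running, s.queue)
}
```

A real orchestrator would also need priorities, multi-GPU jobs, and failure handling; the sketch only shows the queue-and-release loop.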
Out (yet)
- Pretraining (prohibitive scope)
- Distributed multi-node (v2)
Initial design
Surfaces
- backend/: Go API + GPU job orchestrator
- app/: not applicable in v1
Key APIs
- POST /v1/train/jobs: submit job
- GET /v1/train/jobs/{id}: status/logs
- POST /v1/train/checkpoints: register checkpoint
Dependencies
- services/ai/dataset: versioned input datasets
- services/ai/modelreg: base models
- services/ai/runtime: post-training serving
- infra/data/kdb-blob: checkpoint storage
- infra/observe: GPU metrics
Relation to existing sectors
- Consumes dataset, produces checkpoints for modelreg
- Implicit blocker for libras #001
Self-hosted-first analysis (5 gates)
- *1 Feature parity*: zero
- *2 Performance*: N/A
- *3 Stability*: N/A
- *4 Capability*: axolotl/unsloth/transformers are viable
- *5 Critical path readiness*: blocks scaling of domain-specific models (libras, koda, vertical pieces)
Open questions
- Q1: GPU on-prem (s.k.lin?) vs cloud burst?
- Q2: Adopt MLflow directly or wrap it?
Next steps
- Ratify this RFC (1 round of comments).
- Create sector dir services/ai/training/ with koder.toml, README.md, skeleton.
- Open implementation tickets in services/ai/training/backlog/pending/.
- Register in meta/docs/stack/registries/self-hosted-pairs.md if it replaces an external service.