Extract (Structured Extraction): foundations
Extract (Structured Extraction) — foundations RFC
Status
*Accepted* (2026-05-09). Sector bootstrap (skeleton + 5 implementation tickets) landed as part of the /k-go services/ai audit wave (Modo C). Q1 resolved: JSON Schema (draft 2020-12) is the canonical schema language; Pydantic shorthand is accepted at the API edge but normalized internally. Q2 resolved: bounded retry-with-correction (default max_attempts=3, hard cap 5), in which the LLM sees its own validation error and corrects its output.
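The Q2 resolution can be sketched as a small loop. This is a minimal illustration, not the sector's implementation: `Generate`, `Validate`, `ExtractWithRetry`, `fakeGenerate`, and `requireTotal` are hypothetical names, the LLM call and the JSON Schema validator are stubbed out, and only the default of 3 attempts and the hard cap of 5 come from this RFC.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	defaultMaxAttempts = 3 // default stated in the Status section
	hardCapAttempts    = 5 // hard cap stated in the Status section
)

// clampAttempts applies the default when n is unset and enforces the hard cap.
func clampAttempts(n int) int {
	if n <= 0 {
		return defaultMaxAttempts
	}
	if n > hardCapAttempts {
		return hardCapAttempts
	}
	return n
}

// Generate is a hypothetical stand-in for the LLM call. feedback is empty on
// the first attempt and carries the previous validation error afterwards.
type Generate func(feedback string) (string, error)

// Validate is a hypothetical stand-in for JSON Schema validation.
type Validate func(doc map[string]any) error

// ExtractWithRetry calls gen until its output parses and validates, or the
// attempt budget runs out. It returns the document and the attempts used.
func ExtractWithRetry(gen Generate, validate Validate, maxAttempts int) (map[string]any, int, error) {
	attempts := clampAttempts(maxAttempts)
	feedback := ""
	var lastErr error
	for i := 1; i <= attempts; i++ {
		raw, err := gen(feedback)
		if err != nil {
			return nil, i, fmt.Errorf("generate: %w", err)
		}
		var doc map[string]any
		if err := json.Unmarshal([]byte(raw), &doc); err != nil {
			lastErr = fmt.Errorf("invalid JSON: %w", err)
		} else if err := validate(doc); err != nil {
			lastErr = err
		} else {
			return doc, i, nil
		}
		// Feed the validation error back so the next attempt can correct it.
		feedback = lastErr.Error()
	}
	return nil, attempts, fmt.Errorf("validation failed after %d attempts: %w", attempts, lastErr)
}

// fakeGenerate simulates a model that omits "total" until it sees feedback.
func fakeGenerate(feedback string) (string, error) {
	if feedback == "" {
		return `{"vendor":"ACME"}`, nil
	}
	return `{"vendor":"ACME","total":12.5}`, nil
}

// requireTotal simulates a schema with one required property.
func requireTotal(doc map[string]any) error {
	if _, ok := doc["total"]; !ok {
		return errors.New(`missing required property "total"`)
	}
	return nil
}

func main() {
	doc, attempts, err := ExtractWithRetry(fakeGenerate, requireTotal, 0)
	fmt.Println(doc, attempts, err)
}
```

Clamping the caller-supplied value in one place keeps the hard cap enforceable regardless of what the API edge accepts.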
Summary
Structured extraction from documents (NER, key-value pairs, schema-driven JSON), analogous to OpenAI structured outputs, Instructor, and Marvin.
Motivation
Today this logic is scattered across RAG. A dedicated service provides schema validation, retry with correction, batch processing, and caching.
Scope
In
- Schema-driven extraction (JSON schema input)
- NER (predefined entities)
- KV pair extraction
Out (yet)
- Full document understanding (its scope is the combination of extract + vision + rag)
Initial design
Surfaces
- backend/: Go API
- app/: not applicable in v1
Key APIs
- POST /v1/extract/schema: schema-driven extraction
- POST /v1/extract/entities: NER
- POST /v1/extract/keyvals: KV pairs
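As a rough sketch of how these three routes might be wired in the Go backend, the example below registers them on a standard `net/http` mux and rejects non-POST methods. Only the paths come from this RFC. The `SchemaRequest` body shape (`schema`, `text`), the handler names, and the 501 placeholder responses are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// SchemaRequest is a hypothetical body for POST /v1/extract/schema.
type SchemaRequest struct {
	Schema json.RawMessage `json:"schema"` // JSON Schema (draft 2020-12)
	Text   string          `json:"text"`   // document text to extract from
}

// postOnly rejects every method except POST with 405.
func postOnly(h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.Header().Set("Allow", http.MethodPost)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		h(w, r)
	}
}

// handleSchema decodes and checks the request; extraction itself is stubbed.
func handleSchema(w http.ResponseWriter, r *http.Request) {
	var req SchemaRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	if len(req.Schema) == 0 || req.Text == "" {
		http.Error(w, "schema and text are required", http.StatusBadRequest)
		return
	}
	http.Error(w, "extraction not implemented in this sketch", http.StatusNotImplemented)
}

// notImplemented is a placeholder for the NER and KV handlers.
func notImplemented(w http.ResponseWriter, r *http.Request) {
	http.Error(w, "not implemented in this sketch", http.StatusNotImplemented)
}

// NewMux registers the three routes listed under Key APIs.
func NewMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/extract/schema", postOnly(handleSchema))
	mux.HandleFunc("/v1/extract/entities", postOnly(notImplemented))
	mux.HandleFunc("/v1/extract/keyvals", postOnly(notImplemented))
	return mux
}

// status sends one in-memory request and returns the response status code.
func status(h http.Handler, method, path, body string) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	mux := NewMux()
	fmt.Println(status(mux, http.MethodGet, "/v1/extract/schema", ""))
	fmt.Println(status(mux, http.MethodPost, "/v1/extract/schema", `{}`))
	fmt.Println(status(mux, http.MethodPost, "/v1/extract/schema",
		`{"schema":{"type":"object"},"text":"Invoice total: 12.50"}`))
}
```

The method check is done by hand rather than with method-qualified mux patterns so that the sketch does not depend on a recent Go release.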
Dependencies
- services/ai/gateway: LLM with structured output
- services/ai/vision: PDF/image
- services/ai/prompt: extraction templates
Relation to existing sectors
- Prerequisite for ingest in rag/recsys
- Consumes vision for visual documents
Self-hosted-first analysis (5 gates)
- *1 Feature parity* zero
- *2 Performance* N/A
- *3 Stability* N/A
- *4 Capability* LLM with structured output / function calling
- *5 Critical-path readiness* unblocks ingest in rag/recsys
Open questions
- Q1: Schema language: JSON Schema vs Pydantic-like vs custom? (Resolved; see Status.)
- Q2: Retry policy on validation failure? (Resolved; see Status.)
Next steps
- Ratify this RFC (1 round of comments).
- Create the sector dir services/ai/extract/ with koder.toml, README.md, and a skeleton.
- Open implementation tickets in services/ai/extract/backlog/pending/.
- Register in meta/docs/stack/registries/self-hosted-pairs.md if it replaces an external service.