Extract (Structured Extraction): foundations

Extract (Structured Extraction) — foundations RFC

Status

*Accepted*, 2026-05-09. Sector bootstrap (skeleton + 5 impl tickets) landed as part of the /k-go services/ai audit wave (Modo C). Q1 resolved: JSON Schema (draft 2020-12) is the canonical schema language; Pydantic shorthand is accepted at the API edge but normalized internally. Q2 resolved: bounded retry-with-correction (default max_attempts=3, hard cap 5), in which the LLM sees its own validation error and corrects it.
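The Q2 resolution can be pictured as a small loop. The following is a minimal sketch, not the service's implementation: `Generate`, `Validate`, `ExtractWithRetry`, `fakeModel`, and `validateTotal` are hypothetical names, and the invoice-style payload is invented for illustration. Only the default of 3 attempts and the hard cap of 5 come from the status note above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	defaultMaxAttempts = 3 // default stated in the RFC status note
	hardCapAttempts    = 5 // hard cap stated in the RFC status note
)

// Generate is a hypothetical stand-in for an LLM call. prevErr is the
// validation error from the previous attempt ("" on the first attempt).
type Generate func(prompt, prevErr string) (string, error)

// Validate is a hypothetical stand-in for schema validation of raw output.
type Validate func(output string) error

// clampAttempts applies the default when n <= 0 and enforces the hard cap.
func clampAttempts(n int) int {
	if n <= 0 {
		return defaultMaxAttempts
	}
	if n > hardCapAttempts {
		return hardCapAttempts
	}
	return n
}

// ExtractWithRetry calls gen until validate passes or attempts run out,
// feeding each validation error back into the next call. It returns the
// accepted output and the number of attempts used.
func ExtractWithRetry(prompt string, maxAttempts int, gen Generate, validate Validate) (string, int, error) {
	attempts := clampAttempts(maxAttempts)
	prevErr := ""
	for i := 1; i <= attempts; i++ {
		out, err := gen(prompt, prevErr)
		if err != nil {
			return "", i, fmt.Errorf("generate (attempt %d): %w", i, err)
		}
		verr := validate(out)
		if verr == nil {
			return out, i, nil
		}
		prevErr = verr.Error()
	}
	return "", attempts, fmt.Errorf("validation failed after %d attempts: %s", attempts, prevErr)
}

// fakeModel emits a wrongly typed field until it has seen a validation error.
func fakeModel(prompt, prevErr string) (string, error) {
	if prevErr == "" {
		return `{"total": "12.50"}`, nil
	}
	return `{"total": 12.5}`, nil
}

// validateTotal accepts only JSON objects with a numeric "total" field.
func validateTotal(out string) error {
	var v struct {
		Total *float64 `json:"total"`
	}
	if err := json.Unmarshal([]byte(out), &v); err != nil {
		return err
	}
	if v.Total == nil {
		return errors.New("total: required field missing")
	}
	return nil
}

func main() {
	out, n, err := ExtractWithRetry("extract the invoice total", 0, fakeModel, validateTotal)
	fmt.Println(out, n, err) // {"total": 12.5} 2 <nil>
}
```

Passing the previous error as a separate argument keeps the correction step explicit; how the real service folds that error into the prompt is not specified in this RFC.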

Summary

Structured extraction from documents (NER, KV pairs, schema-driven JSON), analogous to OpenAI structured outputs, Instructor, and Marvin.

Motivation

Extraction logic is currently scattered across RAG. A dedicated service would provide schema validation, retry with correction, batch processing, and caching.

Scope

In

  • Schema-driven extraction (JSON schema input)
  • NER (predefined entities)
  • KV pair extraction
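To make "schema-driven" concrete, here is a minimal sketch of checking an extracted document against a caller-supplied schema. It is an illustration under stated assumptions, not a JSON Schema validator: it covers only top-level `required` and per-property `type`, and the `Schema`, `Property`, and `Check` names, as well as the invoice fields, are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"sort"
)

// Schema models a deliberately tiny subset of JSON Schema (assumption):
// top-level "required" and per-property "type" only.
type Schema struct {
	Required   []string            `json:"required"`
	Properties map[string]Property `json:"properties"`
}

// Property holds the declared type of one top-level field.
type Property struct {
	Type string `json:"type"`
}

// jsonType maps a value decoded by encoding/json to a JSON Schema type name.
func jsonType(v any) string {
	switch v.(type) {
	case nil:
		return "null"
	case string:
		return "string"
	case float64:
		return "number"
	case bool:
		return "boolean"
	case []any:
		return "array"
	case map[string]any:
		return "object"
	}
	return "unknown"
}

// Check returns human-readable violations (empty when the document passes).
func Check(schemaJSON, docJSON string) ([]string, error) {
	var s Schema
	if err := json.Unmarshal([]byte(schemaJSON), &s); err != nil {
		return nil, fmt.Errorf("schema: %w", err)
	}
	var doc map[string]any
	if err := json.Unmarshal([]byte(docJSON), &doc); err != nil {
		return nil, fmt.Errorf("document: %w", err)
	}
	var problems []string
	for _, name := range s.Required {
		if _, ok := doc[name]; !ok {
			problems = append(problems, name+": required field missing")
		}
	}
	// Sort property names so the report order is deterministic.
	names := make([]string, 0, len(s.Properties))
	for name := range s.Properties {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		val, ok := doc[name]
		want := s.Properties[name].Type
		if !ok || want == "" {
			continue
		}
		got := jsonType(val)
		if f, isNum := val.(float64); want == "integer" && isNum && f == math.Trunc(f) {
			continue // whole-valued numbers satisfy "integer"
		}
		if got != want {
			problems = append(problems, fmt.Sprintf("%s: expected %s, got %s", name, want, got))
		}
	}
	return problems, nil
}

func main() {
	schema := `{"required":["invoice_id","total"],
	  "properties":{"invoice_id":{"type":"string"},"total":{"type":"number"}}}`
	problems, err := Check(schema, `{"invoice_id":"INV-7","total":"12.50"}`)
	fmt.Println(problems, err) // [total: expected number, got string] <nil>
}
```

Violation strings of this kind are the sort of message a correction prompt could carry back to the model; a production service would rely on a full draft 2020-12 validator instead of this subset.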

Out (yet)

  • Full document understanding (belongs to a combined extract + vision + rag scope)

Initial design

Surfaces

  • backend/ — Go API
  • app/ — not applicable in v1

Key APIs

  • POST /v1/extract/schema — schema-driven
  • POST /v1/extract/entities — NER
  • POST /v1/extract/keyvals — KV pairs

Dependencies

  • services/ai/gateway — LLM with structured output
  • services/ai/vision — PDF/image
  • services/ai/prompt — extraction templates

Relation to existing sectors

  • Prerequisite for ingest in rag/recsys
  • Consumes vision for visual documents

Self-hosted-first analysis (5 gates)

  • *1 Feature parity*: zero
  • *2 Performance*: N/A
  • *3 Stability*: N/A
  • *4 Capability*: LLM with structured output / function calling
  • *5 Critical-path readiness*: unblocks ingest in rag/recsys

Open questions

  • Q1: Schema language (JSON Schema vs Pydantic-like vs custom)?
  • Q2: Retry policy on validation failure?

Next steps

  1. Ratify this RFC (1 round of comments).
  2. Create the sector dir services/ai/extract/ with koder.toml, README.md, and a skeleton.
  3. Open implementation tickets in services/ai/extract/backlog/pending/.
  4. Register in meta/docs/stack/registries/self-hosted-pairs.md if it replaces an external service.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/extract-RFC-001-foundations.kmd