AI Extract — Structured Extraction

  • *Area:* Intelligence
  • *Path:* services/ai/extract
  • *Kind:* Schema-driven structured extraction (JSON-out-of-content), NER, KV pairs
  • *Status:* v0.0.1, sector bootstrapping (2026-05-09)

Role in the stack

extract consolidates a capability that was duplicated across the stack: every consumer needing JSON-out-of-text (rag, recsys, half-a-dozen product backends parsing forms/invoices) had its own homegrown prompt + retry + repair loop, all subtly different and all subtly broken. This sector centralizes the capability: server-side JSON Schema validation, principled retry-with-correction (the LLM sees its own validator error and fixes it), batch processing, and cached repeat extractions.
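
A minimal sketch of the retry-with-correction idea described above, not the sector's actual implementation. It assumes a hypothetical `call_llm(content, schema, feedback)` callable standing in for services/ai/gateway and a toy `validate` function in place of a full JSON Schema validator; neither name comes from the extract codebase. The attempt bounds (default 3, hard cap 5) are taken from the v1 feature list.

```python
import json
from typing import Any, Callable

DEFAULT_ATTEMPTS = 3  # default from the v1 feature list
HARD_CAP = 5          # hard cap from the v1 feature list

# Toy subset of JSON Schema types; a real service would use a full validator.
_TYPES = {"string": str, "integer": int, "number": (int, float)}


def validate(obj: Any, schema: dict) -> list[str]:
    """Hypothetical stand-in validator: required keys and primitive types only."""
    if not isinstance(obj, dict):
        return ["top-level value must be an object"]
    errors = []
    for key in schema.get("required", []):
        if key not in obj:
            errors.append(f"missing required property '{key}'")
    for key, spec in schema.get("properties", {}).items():
        expected = _TYPES.get(spec.get("type"))
        if key in obj and expected is not None:
            value = obj[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"property '{key}' must be of type {spec['type']}")
    return errors


def extract(content: str, schema: dict,
            call_llm: Callable[[str, dict, str], str],
            attempts: int = DEFAULT_ATTEMPTS) -> dict:
    """Ask the model for JSON, feeding validator errors back on each retry."""
    attempts = max(1, min(attempts, HARD_CAP))  # clamp to the hard cap
    feedback = ""
    for _ in range(attempts):
        raw = call_llm(content, schema, feedback)
        try:
            candidate = json.loads(raw)
            errors = validate(candidate, schema)
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return candidate
        # The next attempt sees exactly what the validator rejected.
        feedback = "Previous output was rejected: " + "; ".join(errors)
    raise ValueError(f"no valid output after {attempts} attempts: {feedback}")
```

The design point is that the correction message comes from the validator rather than from a generic "try again" prompt, so each retry carries new information.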

It is the Koder analog of OpenAI Structured Outputs, Instructor (Python), and Marvin. It pairs with vision/ (visual document → text+layout pre-step) and is the canonical ingest path for rag/ and recsys/.

Boundary vs neighbors

  • services/ai/vision is the visual pre-step for PDF/image content; extract consumes its layout-aware output.
  • services/ai/rag and services/ai/recsys are the highest-volume consumers — both will migrate from bespoke loops.
  • services/ai/gateway provides the LLM with structured-output / function-calling.
  • services/ai/cache provides repeat-extraction caching.
  • services/ai/prompt owns extraction prompt templates.

Features (v1 target)

  • Schema-driven extraction (JSON Schema draft 2020-12)
  • Bounded retry-with-correction (default 3 attempts, hard cap 5)
  • NER with canonical Koder taxonomy (16 types incl. CPF/CNPJ/CEP)
  • KV pair extraction (forms, receipts) — layout-aware when input has bounding boxes
  • Async batch jobs with concurrency control + progress streaming (SSE)
  • Repeat-extraction cache keyed by (schema, content, model, instructions)
  • Pydantic-shorthand acceptor (auto-translated to JSON Schema)
  • SDK helpers (Go + Dart) for downstream consumers
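
One way the repeat-extraction cache key over (schema, content, model, instructions) could be derived is sketched below. The function name `cache_key`, the choice of SHA-256, and the length-prefixed encoding are all assumptions for illustration; the source only states which four fields make up the key.

```python
import hashlib
import json


def cache_key(schema: dict, content: str, model: str, instructions: str = "") -> str:
    """Hypothetical key derivation over (schema, content, model, instructions)."""
    # Canonicalize the schema so key order and whitespace do not change the key.
    canonical_schema = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    for part in (canonical_schema, content, model, instructions):
        data = part.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Including the model and instructions in the key means a changed prompt or a model upgrade misses the cache instead of returning a stale extraction.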

Primary couplings

| Producer | Relationship |
|---|---|
| services/ai/gateway | LLM with structured-output / function-calling |
| services/ai/vision | Visual document → text+layout |
| services/ai/prompt | Extraction prompt templates |
| services/ai/cache | Repeat-extraction cache |

| Consumer | Relationship |
|---|---|
| services/ai/rag | Ingest pipeline (chunk metadata) |
| services/ai/recsys | Catalog enrichment |
| services/foundation/forms + product backends | Invoice/receipt/form parsing |
| services/ai/dataset | Dataset annotation |

RFC and bootstrap

  • RFC: extract-RFC-001-foundations.kmd, *Accepted* 2026-05-09
  • Bootstrap ticket: services/ai/backlog/done/132-extract-bootstrap.md
  • Implementation tickets: services/ai/extract/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | LLM + JSON Schema covers OpenAI Structured Outputs surface |
| G2 Performance | pending | Target p95: < 1.5 s (schema), < 2.5 s (NER), < 3 s (KV) |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Full document understanding deferred (extract + vision + rag combo) |
| G5 Critical-path readiness | pending | Unblocks rag/recsys ingest at consistent quality |
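
As an illustration of how the G2 latency targets could be checked against measured samples, the sketch below computes a nearest-rank p95 per extraction kind and compares it with the targets from the table. The names `TARGETS_P95_S`, `p95`, and `gate_g2` are hypothetical, and the nearest-rank percentile definition is an assumption; the source does not say how p95 is measured.

```python
# G2 targets in seconds, copied from the gate table.
TARGETS_P95_S = {"schema": 1.5, "ner": 2.5, "kv": 3.0}


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile (one of several common definitions)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate_g2(latencies: dict[str, list[float]]) -> dict[str, bool]:
    """Return, per extraction kind, whether measured p95 is under its target."""
    return {kind: p95(latencies[kind]) < target
            for kind, target in TARGETS_P95_S.items()}
```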

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-extract.md