- **Area:** Intelligence
- **Path:** services/ai/extract
- **Kind:** Schema-driven structured extraction (JSON-out-of-content), NER, KV pairs
- **Status:** v0.0.1, sector bootstrapping (2026-05-09)
Role in the stack
extract consolidates a capability that was duplicated across the stack: every consumer needing JSON-out-of-text (rag, recsys, half a dozen product backends parsing forms/invoices) had its own homegrown prompt + retry + repair loop, all subtly different and all subtly broken. This sector centralizes the capability: server-side JSON Schema validation, principled retry-with-correction (the LLM sees its own validator error and fixes it), batch processing, and cached repeat extractions.
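The retry-with-correction loop can be sketched as follows. This is a minimal illustration, not the sector's implementation: `extract_with_correction`, `call_llm`, and `validate` are hypothetical names, the validator is assumed to return an error string (or `None` on success), and the correction prompt wording is invented for the example. The attempt limits follow the feature list (default 3, hard cap 5).

```python
import json
from typing import Callable, Optional

DEFAULT_ATTEMPTS = 3  # default stated in the feature list
HARD_CAP = 5          # hard cap stated in the feature list


def extract_with_correction(
    call_llm: Callable[[str], str],                # hypothetical: prompt -> raw model text
    validate: Callable[[object], Optional[str]],   # hypothetical: parsed JSON -> error or None
    prompt: str,
    max_attempts: int = DEFAULT_ATTEMPTS,
) -> object:
    """Ask the model for JSON; on failure, feed the validator error back."""
    attempts = max(1, min(max_attempts, HARD_CAP))  # never exceed the hard cap
    current_prompt = prompt
    last_error: Optional[str] = None
    for _ in range(attempts):
        raw = call_llm(current_prompt)
        try:
            candidate = json.loads(raw)
            last_error = validate(candidate)
        except json.JSONDecodeError as exc:
            candidate, last_error = None, f"invalid JSON: {exc.msg}"
        if last_error is None:
            return candidate
        # Correction turn: the model sees its previous output and the error.
        current_prompt = (
            f"{prompt}\n\nPrevious output:\n{raw}\n"
            f"Validator error: {last_error}\nReturn corrected JSON only."
        )
    raise ValueError(f"extraction failed after {attempts} attempts: {last_error}")
```

In a real deployment `validate` would presumably wrap a JSON Schema validator, and `call_llm` would go through services/ai/gateway. Both are injected here so the loop itself stays testable without either dependency.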
It is the Koder analog of OpenAI Structured Outputs, Instructor (Python), and Marvin. It pairs with vision/ (visual document → text+layout pre-step) and is the canonical ingest path for rag/ and recsys/.
Boundary vs neighbors
services/ai/vision is the visual pre-step for PDF/image content; extract consumes its layout-aware output.
services/ai/rag and services/ai/recsys are the highest-volume consumers — both will migrate from bespoke loops.
services/ai/gateway provides the LLM with structured-output / function-calling.
services/ai/cache provides repeat-extraction caching.
services/ai/prompt owns extraction prompt templates.
Features (v1 target)
- Schema-driven extraction (JSON Schema draft 2020-12)
- Bounded retry-with-correction (default 3 attempts, hard cap 5)
- NER with canonical Koder taxonomy (16 types incl. CPF/CNPJ/CEP)
- KV pair extraction (forms, receipts) — layout-aware when input has bounding boxes
- Async batch jobs with concurrency control + progress streaming (SSE)
- Repeat-extraction cache keyed by (schema, content, model, instructions)
- Pydantic-shorthand acceptor (auto-translated to JSON Schema)
- SDK helpers (Go + Dart) for downstream consumers
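As an illustration of the batch feature, the sketch below bounds concurrency with a semaphore and reports progress through a callback. It is only a sketch under assumptions: `run_batch`, `extract_one`, and `on_progress` are hypothetical names, the default concurrency of 4 is arbitrary, and the callback stands in for whatever would emit SSE progress events in the real service.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def run_batch(
    items: Iterable[str],
    extract_one: Callable[[str], Awaitable[dict]],  # hypothetical per-item extractor
    concurrency: int = 4,                           # illustrative default
    on_progress: Callable[[int, int], None] = lambda done, total: None,
) -> list:
    """Run extractions with bounded concurrency; results keep input order."""
    pending = list(items)
    total = len(pending)
    sem = asyncio.Semaphore(concurrency)
    done = 0

    async def worker(item: str) -> dict:
        nonlocal done
        async with sem:  # at most `concurrency` extractions in flight
            try:
                result = await extract_one(item)
            except Exception as exc:  # one bad item should not fail the batch
                result = {"error": str(exc)}
        done += 1
        on_progress(done, total)  # stand-in for an SSE progress event
        return result

    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(*(worker(item) for item in pending))
```

Per-item failures are returned as `{"error": ...}` entries rather than raised, so a caller can retry only the failed items. Whether the real service does this is not stated in the source.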
Primary couplings
| Producer | Relationship |
|---|---|
| services/ai/gateway | LLM with structured-output / function-calling |
| services/ai/vision | Visual document → text+layout |
| services/ai/prompt | Extraction prompt templates |
| services/ai/cache | Repeat-extraction cache |
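The repeat-extraction cache is keyed by (schema, content, model, instructions). One way such a key could be derived is sketched below. The function name `cache_key`, the choice of SHA-256, and the length-prefixed encoding are all assumptions made for illustration, not details taken from the RFC.

```python
import hashlib
import json


def cache_key(schema: dict, content: str, model: str, instructions: str = "") -> str:
    """Derive a stable hex key from the four cache dimensions."""
    # Canonical JSON so that property order in the schema does not change the key.
    canonical_schema = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    for part in (canonical_schema, content, model, instructions):
        data = part.encode("utf-8")
        # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()
```

Canonicalizing the schema means two consumers that send the same schema with differently ordered properties hit the same cache entry, while any change to the model or instructions produces a different key.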
| Consumer | Relationship |
|---|---|
| services/ai/rag | Ingest pipeline (chunk metadata) |
| services/ai/recsys | Catalog enrichment |
| services/foundation/forms + product backends | Invoice/receipt/form parsing |
| services/ai/dataset | Dataset annotation |
RFC and bootstrap
- RFC: extract-RFC-001-foundations.kmd, **Accepted** 2026-05-09
- Bootstrap ticket: services/ai/backlog/done/132-extract-bootstrap.md
- Implementation tickets: services/ai/extract/backlog/pending/{001..005}
Self-hosted-first analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | LLM + JSON Schema covers OpenAI Structured Outputs surface |
| G2 Performance | pending | Target p95 < 1.5s schema, < 2.5s NER, < 3s KV |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Full document understanding deferred (extract + vision + rag combo) |
| G5 Critical-path readiness | pending | Unblocks rag/recsys ingest at consistent quality |
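
A check against the G2 latency targets could look like the sketch below. The target values come from the table; the mode keys (`schema`, `ner`, `kv`), the function names, and the nearest-rank percentile definition are assumptions, since the source does not say how p95 is to be measured.

```python
# Targets in seconds, from the G2 row; the dictionary keys are illustrative.
P95_TARGETS_S = {"schema": 1.5, "ner": 2.5, "kv": 3.0}


def p95(samples: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_target(mode: str, samples: list) -> bool:
    """True when the measured p95 is strictly below the target for `mode`."""
    return p95(samples) < P95_TARGETS_S[mode]
```

With 20 samples the nearest rank is 19, so a single slow outlier does not fail the gate but two do.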