AI Vision — Multimodal Vision Foundation

*Area:* Intelligence
*Path:* services/ai/vision
*Kind:* Multimodal vision foundation (OCR + captioning + detection + doc layout + VLM proxy)
*Status:* v0.0.1, sector bootstrapping (2026-05-09)

Role in the stack

`vision` consolidates the server-side half of the Stack's multimodal capability. The Stack already has `services/ai/voice` for audio, but vision work was fragmented across `services/ai/kode` (chat attachments), `products/dev/eye` (on-device Android UI), and ad-hoc document analysis. Without a shared foundation, every product reinvents the pipeline.

It is the Koder analog of OpenAI Vision, Claude Vision, Gemini Vision, and Google Cloud Vision: self-hosted where viable (Tesseract, PaddleOCR, Florence-2, surya/docling), with a thin VLM proxy in front of frontier models routed via `services/ai/gateway`.

Boundary vs `products/dev/eye`

Eye stays on-device (Android UI inspection, low-latency screen capture). Vision handles server-only workloads: multi-page PDFs, batch jobs, GPU-bound models, and multi-tenant use. The handoff protocol is defined in ticket #005.
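
The boundary above can be read as a routing rule. The sketch below is one hypothetical way to encode it. The `Workload` fields and the `belongs_on_server` name are illustrative assumptions and are not taken from the handoff protocol in ticket #005.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a vision request (all fields are assumptions)."""
    page_count: int = 1
    is_batch: bool = False
    needs_gpu: bool = False
    multi_tenant: bool = False


def belongs_on_server(w: Workload) -> bool:
    """Return True when any server-only criterion from the boundary applies."""
    return w.page_count > 1 or w.is_batch or w.needs_gpu or w.multi_tenant


# A single-screen capture stays with Eye; a multi-page PDF goes to Vision.
screen_capture = Workload()
pdf_job = Workload(page_count=12)
```

Under these assumptions, `belongs_on_server(screen_capture)` is false and `belongs_on_server(pdf_job)` is true. The real contract may weigh other signals, such as device load or latency targets.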

Features (v1 target)

OCR: Tesseract (CPU) + PaddleOCR (GPU), auto-pick by workload, baseline pt+en
Doc layout: surya/docling pipeline, structured tables (JSON, not markdown)
VLM proxy: Claude/Gemini/GPT-4V via gateway with normalized response
Captioning + object detection: Florence-2 self-hosted, fallback to VLM
Eye handoff protocol: on-device delegation contract
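
Two of the features above involve a selection step: picking an OCR engine by workload and falling back from the self-hosted captioner to the VLM proxy. The following is a minimal sketch of both, not the service's implementation. The page threshold, the function names, and the `local`/`vlm` callables are assumptions standing in for Florence-2 and the gateway-routed proxy.

```python
from typing import Callable, Optional, Tuple

# Assumed threshold: the feature list only says "auto-pick by workload".
GPU_PAGE_THRESHOLD = 4


def pick_ocr_engine(page_count: int, gpu_available: bool) -> str:
    """Pick PaddleOCR (GPU) for larger jobs when a GPU is free, else Tesseract (CPU)."""
    if gpu_available and page_count >= GPU_PAGE_THRESHOLD:
        return "paddleocr"
    return "tesseract"


def caption_with_fallback(
    image: bytes,
    local: Callable[[bytes], Optional[str]],
    vlm: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Try the self-hosted captioner first; fall back to the VLM proxy.

    Returns (caption, source). `local` and `vlm` are placeholder callables,
    not real model clients.
    """
    try:
        caption = local(image)
    except Exception:
        # Treat any local failure as "no caption" and fall through to the proxy.
        caption = None
    if caption:
        return caption, "florence-2"
    return vlm(image), "vlm-proxy"
```

Returning the source alongside the caption lets a caller record which path produced the result, which may matter for usage reporting.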

Primary couplings

Consumer	Relationship
`services/ai/kode`	Document/screenshot attachment analysis
`services/ai/agents`	Vision tool calls (OCR, caption, detect)
`products/dev/eye`	Hands off heavy server-only workloads
`services/ai/gateway`	Provider routing for VLM proxy
`services/ai/runtime`	Local model serving (Florence-2, PaddleOCR)
`services/ai/cache`	Caches OCR/caption/detect results
`services/ai/billing`	Receives per-call usage events
`infra/data/kdb-blob`	Stores input images and result artifacts
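
Caching OCR, caption, and detection results needs a key that is stable for identical inputs. One possible approach, sketched below, is a content-addressed key. The `cache_key` function and the `vision:` key layout are assumptions for illustration and do not describe the actual `services/ai/cache` interface.

```python
import hashlib
import json


def cache_key(image: bytes, operation: str, params: dict) -> str:
    """Build a deterministic key from the image content, operation, and options."""
    digest = hashlib.sha256(image).hexdigest()
    # sort_keys makes the key independent of dict insertion order.
    options = json.dumps(params, sort_keys=True, separators=(",", ":"))
    options_digest = hashlib.sha256(options.encode("utf-8")).hexdigest()[:16]
    return f"vision:{operation}:{digest}:{options_digest}"
```

Hashing the image bytes rather than a filename means the same image uploaded twice maps to the same entry, while a change in operation or options yields a different key.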

RFC and bootstrap

RFC: vision-RFC-001-foundations.kmd, *Accepted* 2026-05-09
Bootstrap ticket: services/ai/backlog/done/116-vision-bootstrap.md
Implementation tickets: services/ai/vision/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	pending	Skeleton phase; OCR + caption baseline self-hosted, VLM proxied
G2 Performance	pending	OCR < 800ms (Tesseract) / 400ms (Paddle); layout 10-page < 8s; caption < 600ms
G3 Stability	pending	Pre-MVP
G4 Capability	pending	OCR + layout + caption + detect + VLM proxy; no image gen (imaging scope), no video (video scope)
G5 Critical-path readiness	pending	Pre-MVP; kode/agents/eye unification once v1 ships
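
The G2 targets can be checked mechanically once measurements exist. The sketch below copies the budgets from the gate table. The metric names and the `g2_violations` helper are assumptions, and the table does not say which percentile the budgets apply to.

```python
# G2 latency budgets in milliseconds, copied from the gate table.
# The key names are illustrative, not an established metric schema.
G2_BUDGETS_MS = {
    "ocr.tesseract": 800,
    "ocr.paddle": 400,
    "layout.10_pages": 8000,
    "caption": 600,
}


def g2_violations(measured_ms: dict) -> dict:
    """Return the measurements that miss their budget (budgets are strict '<')."""
    return {
        name: value
        for name, value in measured_ms.items()
        if name in G2_BUDGETS_MS and value >= G2_BUDGETS_MS[name]
    }
```

For example, a caption latency of exactly 600 ms counts as a violation because the target is stated as strictly under 600 ms.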