AI Vision — Multimodal Vision Foundation
- **Area:** Intelligence
- **Path:** services/ai/vision
- **Kind:** Multimodal vision foundation (OCR + captioning + detection + doc layout + VLM proxy)
- **Status:** v0.0.1, sector bootstrapping (2026-05-09)
Role in the stack
vision consolidates the server-side half of multimodal capability. The Stack already has services/ai/voice for audio; vision was fragmented across services/ai/kode (chat attachments), products/dev/eye (on-device Android UI), and ad-hoc document analysis. Without a foundation, every product reinvents the pipeline.
It is the Koder analog of OpenAI Vision, Claude Vision, Gemini Vision, and Google Cloud Vision: self-hosted where viable (Tesseract, PaddleOCR, Florence-2, surya/docling), with a thin VLM proxy in front of frontier models routed via `services/ai/gateway`.
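As a rough illustration of what a thin proxy with one response shape could look like, the sketch below maps two provider replies into a single record. Everything here is an assumption made for illustration: the `VisionResult` fields, the `provider_a`/`provider_b` names, and both payload shapes are hypothetical and do not reflect the gateway's actual contract or any real provider's format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionResult:
    """Hypothetical normalized reply; field names are illustrative only."""
    provider: str
    model: str
    text: str


# Hypothetical payload shapes. Real gateway or provider replies may differ.
def _from_provider_a(raw: dict) -> tuple[str, str]:
    return raw["model"], raw["output"]["text"]


def _from_provider_b(raw: dict) -> tuple[str, str]:
    # This stand-in provider splits its answer into several text parts.
    return raw["model_id"], " ".join(part["text"] for part in raw["parts"])


_EXTRACTORS = {
    "provider_a": _from_provider_a,
    "provider_b": _from_provider_b,
}


def normalize(provider: str, raw: dict) -> VisionResult:
    """Map a provider-specific reply to the shared VisionResult shape."""
    try:
        extract = _EXTRACTORS[provider]
    except KeyError:
        raise ValueError(f"unsupported provider: {provider}") from None
    model, text = extract(raw)
    return VisionResult(provider=provider, model=model, text=text)
```

Keeping one extractor per provider means callers such as kode or agents would only ever see `VisionResult`, so adding a provider would not touch consumer code.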
Boundary vs products/dev/eye
Eye stays on-device (Android UI inspection, low-latency screen capture). Vision handles server-only workloads: multi-page PDFs, batch jobs, GPU-bound models, and multi-tenant use. The handoff protocol is defined in ticket #005.
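The boundary above can be read as a simple routing rule. The following is a minimal sketch only: the `VisionJob` descriptor and its fields are hypothetical, and the real delegation contract is whatever ticket #005 specifies, which is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisionJob:
    """Hypothetical job descriptor; not the contract from ticket #005."""
    page_count: int = 1
    is_batch: bool = False
    needs_gpu: bool = False
    multi_tenant: bool = False


def route(job: VisionJob) -> str:
    """Return "vision" for server-only workloads, otherwise "eye"."""
    server_only = (
        job.page_count > 1      # multi-page PDFs
        or job.is_batch         # batch jobs
        or job.needs_gpu        # GPU-bound models
        or job.multi_tenant     # multi-tenant workloads
    )
    return "vision" if server_only else "eye"
```

For example, `route(VisionJob(page_count=12))` returns `"vision"`, while a default single-frame job stays with `"eye"`.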
Features (v1 target)
- OCR: Tesseract (CPU) + PaddleOCR (GPU), auto-pick by workload, baseline pt+en
- Doc layout: surya/docling pipeline, structured tables (JSON, not markdown)
- VLM proxy: Claude/Gemini/GPT-4V via gateway with normalized response
- Captioning + object detection: Florence-2 self-hosted, fallback to VLM
- Eye handoff protocol: on-device delegation contract
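The OCR bullet mentions picking an engine automatically by workload. The selection criteria are not stated in this page, so the sketch below assumes a simple rule: use the GPU engine when a GPU is available and the job is large enough to justify it, and fall back to the CPU engine otherwise. The `OcrWorkload` type and the `GPU_MIN_PAGES` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWorkload:
    """Hypothetical workload descriptor for engine selection."""
    pages: int
    gpu_available: bool


# Assumed threshold, chosen for illustration; the real cut-off is not documented here.
GPU_MIN_PAGES = 4


def pick_ocr_engine(workload: OcrWorkload) -> str:
    """Pick "paddleocr" (GPU) for larger jobs when a GPU exists, else "tesseract" (CPU)."""
    if workload.gpu_available and workload.pages >= GPU_MIN_PAGES:
        return "paddleocr"
    return "tesseract"
```

Under these assumptions, a single-page request goes to Tesseract even on a GPU host, which avoids paying GPU scheduling overhead for small jobs.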
Primary couplings
| Consumer | Relationship |
|---|---|
| services/ai/kode | Document/screenshot attachment analysis |
| services/ai/agents | Vision tool calls (OCR, caption, detect) |
| products/dev/eye | Hands off heavy server-only workloads |
| services/ai/gateway | Provider routing for VLM proxy |
| services/ai/runtime | Local model serving (Florence-2, PaddleOCR) |
| services/ai/cache | Caches OCR/caption/detect results |
| services/ai/billing | Receives per-call usage events |
| infra/data/kdb-blob | Stores input images and result artifacts |
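Caching OCR, caption, and detect results needs a key that is stable for identical inputs. One common approach, sketched here as an assumption rather than the actual services/ai/cache scheme, is to hash the operation name, its parameters, and the image bytes together. The `cache_key` function and the `vision:` key prefix are hypothetical.

```python
import hashlib
import json


def cache_key(image: bytes, op: str, params: dict) -> str:
    """Build a deterministic key from the operation, its parameters, and the image bytes."""
    digest = hashlib.sha256()
    digest.update(op.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    # Sorted keys make the key independent of dict insertion order.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest.update(canonical.encode("utf-8"))
    digest.update(b"\0")
    digest.update(image)
    return f"vision:{op}:{digest.hexdigest()}"
```

Two calls with the same image and the same parameters in a different order produce the same key, while changing the operation or any parameter value produces a different one.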
RFC and bootstrap
- RFC: vision-RFC-001-foundations.kmd (**Accepted**, 2026-05-09)
- Bootstrap ticket: services/ai/backlog/done/116-vision-bootstrap.md
- Implementation tickets: services/ai/vision/backlog/pending/{001..005}
Self-hosted-first analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | Skeleton phase; OCR + caption baseline self-hosted, VLM proxied |
| G2 Performance | pending | OCR < 800ms (Tesseract) / 400ms (Paddle); layout 10-page < 8s; caption < 600ms |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | OCR + layout + caption + detect + VLM proxy; no image gen (imaging scope), no video (video scope) |
| G5 Critical-path readiness | pending | Pre-MVP; kode/agents/eye unification once v1 ships |
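The G2 targets in the table can be expressed as a small budget check, for example inside a benchmark script. The numbers below are copied from the G2 row; the operation keys and the `within_budget` helper are hypothetical names introduced for this sketch.

```python
# Latency budgets in milliseconds, taken from the G2 row of the gates table.
G2_BUDGET_MS = {
    "ocr_tesseract": 800,
    "ocr_paddle": 400,
    "layout_10_pages": 8000,
    "caption": 600,
}


def within_budget(op: str, measured_ms: float) -> bool:
    """True when the measured latency is strictly below the G2 target for op."""
    return measured_ms < G2_BUDGET_MS[op]


def failing_ops(measurements: dict) -> list:
    """Return the sorted names of operations that miss their G2 target."""
    return sorted(op for op, ms in measurements.items() if not within_budget(op, ms))
```

The comparison is strict because the table states the targets with `<`, so a measurement exactly equal to the target counts as a miss.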