AI Vision — Multimodal Vision Foundation

  • Area: Intelligence
  • Path: services/ai/vision
  • Kind: Multimodal vision foundation (OCR + captioning + detection + doc layout + VLM proxy)
  • Status: v0.0.1, sector bootstrapping (2026-05-09)

Role in the stack

vision consolidates the server-side half of multimodal capability. The Stack already has services/ai/voice for audio; vision was fragmented across services/ai/kode (chat attachments), products/dev/eye (on-device Android UI), and ad-hoc document analysis. Without a foundation, every product reinvents the pipeline.

It is the Koder analog of OpenAI Vision, Claude Vision, Gemini Vision, and Google Cloud Vision: self-hosted where viable (Tesseract, PaddleOCR, Florence-2, surya/docling), with a thin VLM proxy in front of frontier models routed via `services/ai/gateway`.

Boundary vs products/dev/eye

Eye stays on-device (Android UI inspection, low-latency screen capture). Vision handles server-only workloads: multi-page PDFs, batch jobs, GPU-bound models, and multi-tenant use. The handoff protocol is defined in ticket #005.

Features (v1 target)

  • OCR: Tesseract (CPU) + PaddleOCR (GPU), engine auto-picked by workload, baseline languages pt + en
  • Doc layout: surya/docling pipeline, structured tables (JSON, not markdown)
  • VLM proxy: Claude/Gemini/GPT-4V via gateway with normalized response
  • Captioning + object detection: Florence-2 self-hosted, fallback to VLM
  • Eye handoff protocol: on-device delegation contract
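
The "auto-pick by workload" rule for OCR could look roughly like the sketch below. This is only an illustration: the `OcrJob` type, the `pick_ocr_engine` function, and the page threshold are hypothetical names and values, not part of the RFC or the tickets. The assumption is that small jobs stay on Tesseract (CPU) and larger jobs go to PaddleOCR when a GPU is available.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OcrJob:
    """Hypothetical description of one OCR request."""
    page_count: int
    gpu_available: bool
    languages: tuple[str, ...] = ("pt", "en")  # baseline languages from the feature list


# Assumed threshold, not a documented value: at or above this page count,
# the GPU engine is preferred when a GPU is present.
GPU_PAGE_THRESHOLD = 4


def pick_ocr_engine(job: OcrJob) -> str:
    """Return "paddleocr" for large jobs on a GPU host, otherwise "tesseract"."""
    if job.page_count < 1:
        raise ValueError("page_count must be >= 1")
    if job.gpu_available and job.page_count >= GPU_PAGE_THRESHOLD:
        return "paddleocr"
    # Small jobs, or hosts without a GPU, use the CPU engine.
    return "tesseract"
```

For example, `pick_ocr_engine(OcrJob(page_count=10, gpu_available=True))` selects PaddleOCR, while the same job on a host without a GPU falls back to Tesseract. A real policy would likely also weigh queue depth and image resolution.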

Primary couplings

Consumer              Relationship
services/ai/kode      Document/screenshot attachment analysis
services/ai/agents    Vision tool calls (OCR, caption, detect)
products/dev/eye      Hands off heavy server-only workloads
services/ai/gateway   Provider routing for VLM proxy
services/ai/runtime   Local model serving (Florence-2, PaddleOCR)
services/ai/cache     Caches OCR/caption/detect results
services/ai/billing   Receives per-call usage events
infra/data/kdb-blob   Stores input images and result artifacts
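
One way the cache coupling could work is to key results on the image content plus the operation and its parameters, so that identical requests hit the same entry. The sketch below is an assumption for illustration only: the `cache_key` function and the `vision:` key prefix are hypothetical and do not describe the actual services/ai/cache interface.

```python
import hashlib
import json

# Operations whose results the couplings list names as cached.
CACHEABLE_OPERATIONS = {"ocr", "caption", "detect"}


def cache_key(image_bytes: bytes, operation: str, params: dict) -> str:
    """Build a deterministic key from the operation, its parameters, and the image content."""
    if operation not in CACHEABLE_OPERATIONS:
        raise ValueError(f"operation is not cacheable: {operation!r}")
    digest = hashlib.sha256()
    digest.update(operation.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    # sort_keys makes the serialization independent of dict insertion order.
    canonical_params = json.dumps(params, sort_keys=True, separators=(",", ":"))
    digest.update(canonical_params.encode("utf-8"))
    digest.update(b"\0")
    digest.update(image_bytes)
    return f"vision:{operation}:{digest.hexdigest()}"
```

Because the parameters are serialized with sorted keys, `{"lang": "pt", "dpi": 300}` and `{"dpi": 300, "lang": "pt"}` produce the same key, while changing the operation or any byte of the image produces a different one.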

RFC and bootstrap

  • RFC: vision-RFC-001-foundations.kmd (Accepted, 2026-05-09)
  • Bootstrap ticket: services/ai/backlog/done/116-vision-bootstrap.md
  • Implementation tickets: services/ai/vision/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate                          Status   Notes
G1 Feature parity             pending  Skeleton phase; OCR + caption baseline self-hosted, VLM proxied
G2 Performance                pending  OCR < 800 ms (Tesseract) / 400 ms (Paddle); layout 10-page < 8 s; caption < 600 ms
G3 Stability                  pending  Pre-MVP
G4 Capability                 pending  OCR + layout + caption + detect + VLM proxy; no image gen (imaging scope), no video (video scope)
G5 Critical-path readiness    pending  Pre-MVP; kode/agents/eye unification once v1 ships
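
The G2 targets can be written down as a small budget table and checked against measured latencies. The following is a minimal sketch under stated assumptions: the operation names and the `over_budget` helper are hypothetical, and the source does not say which statistic (mean, p95, and so on) the targets apply to.

```python
# Latency targets from the G2 row, in milliseconds. The key names are
# illustrative; only the numbers come from the table above.
BUDGETS_MS = {
    "ocr_tesseract": 800,
    "ocr_paddle": 400,
    "layout_10_pages": 8000,
    "caption": 600,
}


def over_budget(measured_ms: dict[str, float]) -> dict[str, float]:
    """Return the overshoot in ms for every measured operation that misses its target.

    The targets are strict ("< 800 ms"), so a measurement equal to the budget
    counts as a miss with an overshoot of 0.
    """
    unknown = set(measured_ms) - set(BUDGETS_MS)
    if unknown:
        raise KeyError(f"no budget defined for: {sorted(unknown)}")
    return {
        name: ms - BUDGETS_MS[name]
        for name, ms in measured_ms.items()
        if ms >= BUDGETS_MS[name]
    }
```

For instance, `over_budget({"ocr_tesseract": 750, "caption": 650})` reports only the caption path, 50 ms over its 600 ms target. An empty result means every measured operation met its budget.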

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-vision.md