Ai cache

AI Cache — Inference Response Cache

  • *rea:*Intelligence
  • *ath:*services/ai/cache
  • *ind:*Inference response cache (exact + semantic match, per-tenant quotas)
  • *tatus:*v0.0.3 — Semantic cache layer shipped 20260511 (#003). Secondlayer lookup after exact MISS: embed prompt via services/ai/embed HTTP client, vector search via vstore.Memory (inprocess, cosine threshold 0.95 default), replay cached body on hit + header X-Cache-Match: exact|semantic|none. Graceful degradation when embed is down (skip semantic, exact path stays alive). Background indexing on exactmiss+upstream2xx. kdbvector adapter is a followup (#009) blocked by same upstream protobuf init-panic as #007. Previous v0.0.2: public KV HTTP API (#008).

Role in the stack

cache is a transparent middleware in front of services/ai/gateway. Lookup pre-call: HIT skips the provider, MISS calls and populates. Reduces both cost (provider tokens) and latency (no provider RTT) without changing the consumer contract. Helicone/Portkey report 30–60% savings on typical workloads.

It is the Koder analog of OpenAI's prompt cache, Helicone, and Portkey — selfhosted, tenantisolated, integrated with services/ai/embed for semantic match and infra/data/kdb-kv for storage.

Features (v1 target)

  • Exactmatch cache (prompt + params hash, key versionstamped)
  • Semantic cache (embedding cosine ≥ threshold, fallback to exact)
  • Per-tenant byte/count quotas with LRU eviction
  • Configurable TTL per route
  • Admin API + Prometheus metrics + kcache CLI

Primary couplings

Consumer Relationship
services/ai/gateway Wraps every routed request as middleware
services/ai/embed Provides embeddings for semantic match
services/ai/billing Receives saved-call events for cost reporting
infra/data/kdb-kv Backend store for cached responses
infra/data/kdb-vector Index for semantic lookup

RFC and bootstrap

  • RFC: cache-RFC-001-foundations.kmd — *ccepted*20260509
  • Bootstrap ticket: services/ai/backlog/done/125-cache-bootstrap.md
  • Key schema: cache-key.md — accepted 20260509 via #001
  • Implementation tickets:
    • done/: 001 (key schema), 002 (skeleton + middleware + JWKSValidator + memory store), 003 (semantic cache via embed — embedder HTTP client + vstore.Memory + middleware semantic hook + 23 tests, 20260511), 004 (pertenant byteentry quotas + LRU eviction sweeper + admin POSTGET v1cache/quotas), 005 (enhanced stats + bulk invalidate + Prometheus /metrics + kcache CLI + ops manual), 006 (kdbnext store adapter — code complete, gated behind -tags kdb until upstream #007 lands), 008 (public KV HTTP API — PUTGETDELETE /v1/cache/kv/{tenant}/{key}, +10 tests, 20260511)
    • pending/: 007 (resolve upstream kdbnext SDK protobuf initpanic so 006 can drop the build tag), 009 (kdbvector vstore adapter — swaps inmemory Memory; blockedby #007), 010 (semanticbench harness — validates <50ms p50 @ 100K target; blockedby #009), 011 (adversarial regression suite — realembedder integration)

Recent changes

  • *0260511 (#003)*— Semantic cache layer shipped. internal/embedder HTTP client to services/ai/embed (POST /v1/embed/text with bearer auth, ErrEmbedderUnavailable sentinel for graceful degradation). internal/vstore with Store interface + Memory inprocess backend (per(tenant, model) shards, FIFO eviction, cosine via dot product on prenormalised vectors). internal/middleware/semantic.go builds the SemanticHook consumed by middleware.Wrap via Config.Semantic: lookup path between exactmiss and upstream dispatch, indexing on a background goroutine after exactmiss+upstream2xx+Put. X-Cache-Match header (exact|semantic|none) on every response. kdbvector adapter is a followup (#009) blocked by same upstream protobuf initpanic that affects #006/#007. 23 new tests (E1E7 embedder, V1V8 vstore, SC1SC8 semanticmiddleware). 1 danglingreference cleanup loop closes the LRUevictionvs-vstore staleness gap.
  • *0260511 (#008)*— Public KV HTTP API for crosssector reuse. PUT/GET/DELETE /v1/cache/kv/{tenant}/{key} with tenant scoping (crosstenant404 per multitenant policy), ContentType round-trip via internal framing, `?ttl_secondsN (clamped to 30 days), MaxKVValueBytes=1 MiB, quota gate via existing policy.Enforcer.BeforePut. New internalidentctx package breaks the handler→auth→handler cycle; auth.WithIdentity now mirrors writes through identctx. 10 tests (K1-K10). Unblocks servicesaiembed#007` HTTP cache adapter + opens the door for memory cross-instance hits.

Selfhostedfirst analysis (5 gates)

Gate Status Notes
G1 Feature parity partial Exactmatch HITMISSBYPASS shipped (#002) + pertenant quotas + LRU eviction (#004); semantic still pending (#003)
G2 Performance pending Memory store benched OK at MVP; production targets (kdb-next < 5ms p50, semantic < 50ms p50) await #006 adapter
G3 Stability pending Pre-MVP
G4 Capability partial Exact-match wrapper + admin stats live; quotasevictionsemantic still pending
G5 Critical-path readiness pending PreMVP; gateway will adopt once #006 (kdbnext) ships and durability is real

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-cache.md