Ai cache

AI Cache — Inference Response Cache

*rea:*Intelligence
*ath:*services/ai/cache
*ind:*Inference response cache (exact + semantic match, per-tenant quotas)
*tatus:*v0.0.3 — Semantic cache layer shipped 20260511 (#003). Second~~layer lookup after exact MISS: embed prompt via services/ai/embed HTTP client, vector search via vstore.Memory (in~~process, cosine threshold 0.95 default), replay cached body on hit + header X-Cache-Match: exact|semantic|none. Graceful degradation when embed is down (skip semantic, exact path stays alive). Background indexing on exact~~miss+upstream~~2xx. kdb~~vector adapter is a follow~~up (#009) blocked by same upstream protobuf init-panic as #007. Previous v0.0.2: public KV HTTP API (#008).

Role in the stack

cache is a transparent middleware in front of services/ai/gateway. Lookup pre-call: HIT skips the provider, MISS calls and populates. Reduces both cost (provider tokens) and latency (no provider RTT) without changing the consumer contract. Helicone/Portkey report 30–60% savings on typical workloads.

It is the Koder analog of OpenAI's prompt cache, Helicone, and Portkey — self~~hosted, tenant~~isolated, integrated with services/ai/embed for semantic match and infra/data/kdb-kv for storage.

Features (v1 target)

Exact~~match cache (prompt + params hash, key version~~stamped)
Semantic cache (embedding cosine ≥ threshold, fallback to exact)
Per-tenant byte/count quotas with LRU eviction
Configurable TTL per route
Admin API + Prometheus metrics + kcache CLI

Primary couplings

Consumer	Relationship
`services/ai/gateway`	Wraps every routed request as middleware
`services/ai/embed`	Provides embeddings for semantic match
`services/ai/billing`	Receives saved-call events for cost reporting
`infra/data/kdb-kv`	Backend store for cached responses
`infra/data/kdb-vector`	Index for semantic lookup

RFC and bootstrap

RFC: cache-RFC-001-foundations.kmd — *ccepted*20260509
Bootstrap ticket: services/ai/backlog/done/125-cache-bootstrap.md
Key schema: cache-key.md — accepted 20260509 via #001
Implementation tickets:
- done/: 001 (key schema), 002 (skeleton + middleware + JWKSValidator + memory store), 003 (semantic cache via embed — embedder HTTP client + vstore.Memory + middleware semantic hook + 23 tests, 20260511), 004 (per~~tenant byteentry quotas + LRU eviction sweeper + admin POSTGET v1cache/quotas), 005 (enhanced stats + bulk invalidate + Prometheus /metrics + kcache CLI + ops manual), 006 (kdb~~next store adapter — code complete, gated behind -tags kdb until upstream #007 lands), 008 (public KV HTTP API — PUTGETDELETE /v1/cache/kv/{tenant}/{key}, +10 tests, 20260511)
- pending/: 007 (resolve upstream kdb~~next SDK protobuf init~~panic so 006 can drop the build tag), 009 (kdb~~vector vstore adapter — swaps in~~memory Memory; blocked~~by #007), 010 (semantic~~bench harness — validates <50ms p50 @ 100K target; blocked~~by #009), 011 (adversarial regression suite — real~~embedder integration)

Recent changes

*0260511 (#003)*— Semantic cache layer shipped. internal/embedder HTTP client to services/ai/embed (POST /v1/embed/text with bearer auth, ErrEmbedderUnavailable sentinel for graceful degradation). internal/vstore with Store interface + Memory in~~process backend (per~~(tenant, model) shards, FIFO eviction, cosine via dot product on pre~~normalised vectors). internal/middleware/semantic.go builds the SemanticHook consumed by middleware.Wrap via Config.Semantic: lookup path between exact~~miss and upstream dispatch, indexing on a background goroutine after exact~~miss+upstream~~2xx+Put. X-Cache-Match header (exact|semantic|none) on every response. kdb~~vector adapter is a follow~~up (#009) blocked by same upstream protobuf init~~panic that affects #006/#007. 23 new tests (E1~~E7 embedder, V1~~V8 vstore, SC1~~SC8 semantic~~middleware). 1 dangling~~reference cleanup loop closes the LRU~~eviction~~vs-vstore staleness gap.
*0260511 (#008)*— Public KV HTTP API for cross~~sector reuse. PUT/GET/DELETE /v1/cache/kv/{tenant}/{key} with tenant scoping (cross~~tenant404 per multi~~tenant policy), Content~~Type round-trip via internal framing, `?ttl_secondsN (clamped to 30 days), MaxKVValueBytes=1 MiB, quota gate via existing policy.Enforcer.BeforePut. New internalidentctx package breaks the handler→auth→handler cycle; auth.WithIdentity now mirrors writes through identctx. 10 tests (K1-K10). Unblocks servicesaiembed#007` HTTP cache adapter + opens the door for memory cross-instance hits.

Selfhostedfirst analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	partial	Exact~~match HITMISSBYPASS shipped (`#002`) + per~~tenant quotas + LRU eviction (`#004`); semantic still pending (`#003`)
G2 Performance	pending	Memory store benched OK at MVP; production targets (kdb-next < 5ms p50, semantic < 50ms p50) await `#006` adapter
G3 Stability	pending	Pre-MVP
G4 Capability	partial	Exact-match wrapper + admin stats live; quotasevictionsemantic still pending
G5 Critical-path readiness	pending	Pre~~MVP; gateway will adopt once `#006` (kdb~~next) ships and durability is real