Ai cache
AI Cache — Inference Response Cache
- *rea:*Intelligence
- *ath:*
services/ai/cache - *ind:*Inference response cache (exact + semantic match, per-tenant quotas)
- *tatus:*v0.0.3 — Semantic cache layer shipped 2026
0511 (#003). Secondlayer lookup after exact MISS: embed prompt viaprocess, cosine threshold 0.95 default), replay cached body on hit + headerservices/ai/embedHTTP client, vector search viavstore.Memory(inX-Cache-Match: exact|semantic|none. Graceful degradation when embed is down (skip semantic, exact path stays alive). Background indexing on exactmiss+upstream2xx. kdbvector adapter is a followup (#009) blocked by same upstream protobuf init-panic as#007. Previous v0.0.2: public KV HTTP API (#008).
Role in the stack
cache is a transparent middleware in front of services/ai/gateway. Lookup pre-call: HIT skips the provider, MISS calls and populates. Reduces both cost (provider tokens) and latency (no provider RTT) without changing the consumer contract. Helicone/Portkey report 30–60% savings on typical workloads.
It is the Koder analog of OpenAI's prompt cache, Helicone, and Portkey — selfhosted, tenantisolated, integrated with services/ai/embed for semantic match and infra/data/kdb-kv for storage.
Features (v1 target)
- Exact
match cache (prompt + params hash, key versionstamped) - Semantic cache (embedding cosine ≥ threshold, fallback to exact)
- Per-tenant byte/count quotas with LRU eviction
- Configurable TTL per route
- Admin API + Prometheus metrics +
kcacheCLI
Primary couplings
| Consumer | Relationship |
|---|---|
services/ai/gateway |
Wraps every routed request as middleware |
services/ai/embed |
Provides embeddings for semantic match |
services/ai/billing |
Receives saved-call events for cost reporting |
infra/data/kdb-kv |
Backend store for cached responses |
infra/data/kdb-vector |
Index for semantic lookup |
RFC and bootstrap
- RFC:
cache-RFC-001-foundations.kmd— *ccepted*20260509 - Bootstrap ticket:
services/ai/backlog/done/125-cache-bootstrap.md - Key schema:
cache-key.md— accepted 20260509 via#001 - Implementation tickets:
done/:001(key schema),002(skeleton + middleware + JWKSValidator + memory store),003(semantic cache via embed — embedder HTTP client + vstore.Memory + middleware semantic hook + 23 tests, 20260511),004(pertenant byteentry quotas + LRU eviction sweeper + admin POSTGET v1cache/quotas),next store adapter — code complete, gated behind005(enhanced stats + bulk invalidate + Prometheus /metrics + kcache CLI + ops manual),006(kdb-tags kdbuntil upstream#007lands),008(public KV HTTP API — PUTGETDELETE/v1/cache/kv/{tenant}/{key}, +10 tests, 20260511)pending/:007(resolve upstream kdbnext SDK protobuf initpanic so006can drop the build tag),009(kdbvector vstore adapter — swaps inmemory Memory; blockedby #007),bench harness — validates <50ms p50 @ 100K target; blocked010(semanticby #009),embedder integration)011(adversarial regression suite — real
Recent changes
- *026
0511 (#003)*— Semantic cache layer shipped.internal/embedderHTTP client toservices/ai/embed(POST /v1/embed/textwith bearer auth,ErrEmbedderUnavailablesentinel for graceful degradation).internal/vstorewithStoreinterface +Memoryinprocess backend (per(tenant, model) shards, FIFO eviction, cosine via dot product on prenormalised vectors).miss and upstream dispatch, indexing on a background goroutine after exactinternal/middleware/semantic.gobuilds theSemanticHookconsumed bymiddleware.WrapviaConfig.Semantic: lookup path between exactmiss+upstream2xx+Put.X-Cache-Matchheader (exact|semantic|none) on every response. kdbvector adapter is a followup (#009) blocked by same upstream protobuf initpanic that affectsE7 embedder, V1#006/#007. 23 new tests (E1V8 vstore, SC1SC8 semanticmiddleware). 1 danglingreference cleanup loop closes the LRUevictionvs-vstore staleness gap. - *026
0511 (#008)*— Public KV HTTP API for crosssector reuse.tenant404 per multiPUT/GET/DELETE /v1/cache/kv/{tenant}/{key}with tenant scoping (crosstenant policy), ContentType round-trip via internal framing, `?ttl_secondsN(clamped to 30 days),MaxKVValueBytes=1 MiB, quota gate via existingpolicy.Enforcer.BeforePut. Newinternalidentctxpackage breaks the handler→auth→handler cycle;auth.WithIdentitynow mirrors writes through identctx. 10 tests (K1-K10). Unblocksservicesaiembed#007` HTTP cache adapter + opens the door for memory cross-instance hits.
Selfhostedfirst analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | partial | Exact#002) + per#004); semantic still pending (#003) |
| G2 Performance | pending | Memory store benched OK at MVP; production targets (kdb-next < 5ms p50, semantic < 50ms p50) await #006 adapter |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | partial | Exact-match wrapper + admin stats live; quotasevictionsemantic still pending |
| G5 Critical-path readiness | pending | Pre#006 (kdb |