Ai embed

AI Embed — Unified Embeddings Foundation

*rea:*Intelligence
*ath:*services/ai/embed
*ind:*Embedding generation foundation (text + image + audio; intent-based model selection)
*tatus:*v0.0.7 — #013 binary naming alignment shipped 20260511. Daemon binary renamed kembed → koder-embed; worker kembed-worker → koder-embed-worker; CLI kembed-admin → kembed. Matches the cache precedent (koder-cache daemon + kcache CLI). systemd units updated; feature surface unchanged from v0.0.6. Previous v0.0.6: #007 HTTP cache adapter consuming services/ai/cache#008 KV API.

Role in the stack

embed consolidates embeddings server-side. Today every consumer ships its own model (kortex, search, recsys, agents) — no shared cache, inconsistent dimensions, no consolidated billing. Migrating to a better model means rewriting each consumer.

The headline design choice: *ntent~~based selection* Consumers query by intent (`text~~match~~search, text~~classify, image-match`) and the registry picks the model. Swapping a model is a registry change, not a consumer rewrite.

It is the Koder analog of OpenAI Embeddings, Voyage, Cohere Embed and Jina — self-hosted via BGE / E5 / CLIP / CLAP on GPU runtime, with proxy fallback to OpenAI / Voyage / Cohere through services/ai/gateway for capability gaps.

Boundary vs neighbors

infra/data/kdb-vector is the vector store (search, ANN); embed is the producer.
services/ai/cache is the embedding cache backend (keyed by model+input hash).
services/ai/runtime hosts the model weights.

Features (v1 target)

Text embeddings: BGE~~multilingual default (pt+en), BGE~~large~~en opt~~in, E5, Instructor
Image embeddings: CLIP~~ViT~~L/14 + SigLIP for cross-modal
Audio embeddings: CLAP for audio~~text cross~~modal
Intent registry: 6 default intents, per-tenant override
Batch coalescing: 10ms window OR 64-input cap for throughput
Cache integration: hits don't decrement quota

Primary couplings

Consumer	Relationship
`services/ai/memory`	Episodic + semantic memory backed by embeddings
`services/ai/search`	Hybrid lexical+semantic search
`services/ai/recsys`	User+item embedding for similarity
`services/ai/kortex`	Knowledge-graph node + edge embeddings
`services/ai/agents`	Tool selection + memory retrieval
`services/ai/classify`	Zero-shot classification via embedding similarity
`services/ai/extract`	Embedding-based entity disambiguation
`services/ai/cache`	Embedding-result cache
`services/ai/runtime`	Local model serving
`infra/data/kdb-vector`	Native vector store target

RFC and bootstrap

RFC: embed-RFC-001-foundations.kmd — *ccepted*20260509
Bootstrap ticket: services/ai/backlog/done/121-embed-bootstrap.md
API contract (normative): api-v1.kmd + intents.toml — shipped 20260511 via #001 (5 endpoints, intent~~over~~model design, 7 registered models across textimageaudio, error map)
Implementation tickets:
- done/: 001 (API contract + intent registry), 002 (Go server skeleton + coalescer), 003 (quotas + cache + audit + admin API), 004 (split into #009~~#012), 006 (kembed~~admin CLI), 007 (HTTP cache adapter — consumes cache#008), 008 (superseded by #010), 010 (real tokenizers — WordPiece pure~~Go), 013 (binary naming alignment) — all 2026~~05-11
- pending/: 005 (image+audio CLIP/CLAP), 009 (ONNX runtime + LocalONNXWorker — CGO), 011 (GPU LXC s.embed.gpu provisioning), 012 (weights + MTEB benchmark — blocked-by #009 + #011)

Recent changes

*0260511 (#013 ship)*— Binary naming aligned with cache precedent. Daemon binary: kembed → koder-embed (matches cmd/koder-embed/ + systemd unit). Worker: kembed-worker → koder-embed-worker. CLI: kembed-admin → kembed (now matches the binaries-and-cli/naming.kmd "binary = aliases[0] || slug" rule). systemd units' ExecStart paths updated to /usr/local/bin/koder-embed[-worker]. Zero behaviour change (pure rename); 65+ tests stay green. Breaking for any operator running pre-v0.0.7 packages on the same host — /usr/local/bin/kembed now refers to the CLI, not the daemon.
*0260511 (#007 ship)*— HTTP cache adapter shipped. internal/cache/http_adapter.HTTPCache consumes services/ai/cache#008 public KV API (PUT/GET/DELETE /v1/cache/kv/{tenant}/{key}) with circuit breaker (5 consecutive 5xx/network errors → open for 30s, single probe re~~closes on success), per~~call timeout (default 1s), TTL via ?ttl_seconds= (default 30 days, clamped server~~side), defensive empty~~BaseURL → NoOp. Wire format JSON {vector: [...], usage: {text_tokens, image_count, audio_seconds}}. 9 tests (HC1~~HC9) including bearer header propagation, URL~~encoding of |~~keys, 404=clean~~miss (no trip), trip+cooldown reopen, local Stats accumulation. [cache.backend] = "http" selects it; [cache.http] config section in TOML.
*0260511 (#006 ship + #007 blocker)*— kembed-admin CLI shipped (cmd/kembed/, bin/kembed-admin). Subcommands stats, quotas {list,set,delete}, models. Mirrors kcache pattern but binary named kembed-admin (not kembed) to avoid clash with the daemon binary kembed from #002 — naming alignment tracked as #013. *rosssector discovery* #007 (HTTP cache adapter) was blocked because services/ai/cache exposes no public KV PUTGET — only admin + middleware Wrap. New ticket `servicesai/cache#008 (public KV HTTP API) opened; embed #007` restated as blocked-by until that lands.
*0260511 (#004 split + #010 ship)*— #004 was bundling code + tokenizers + GPU LXC + weights + benchmarks; split into 4 subtickets (#009 ONNX runtime + LocalONNXWorker (CGO), #010 real tokenizers (this commit), #011 GPU LXC s.embed.gpu provisioning, #012 weights + MTEB benchmark). #010 shipped: internal/tokenizer/ with BERTcanonical WordPiece in pure Go (greedy longest~~prefix, ## continuations, CJK rune~~per~~piece, lowercase toggle, max~~chars fallback), Whitespace fallback, Registry modelID→Tokenizer index, [[tokenizers]] TOML config, handler swaps approxTokens → tokenizerFor(model.ID).Count. Quota billing now charges in target~~model tokens. SentencePiece deferred to #009/#012. #008 closed as superseded~~by #010. +13 tests (T1-T12 unit + I11 integration).
*0260511 (#003)*— Multi~~tenant quotas + per~~input cache + audit. internal/quota.Enforcer with day~~rolling UTC counters for text~~tokens / image~~count / audio~~seconds, per~~tenant override that shadows the default, AllowedModels + BatchMaxSize checks, Reserve/Refund clamp~~at~~zero. internal/cache with NoOp + Memory (LRU bound, tenant~~scoped key tuple). internal/audit slog wrapper. Admin API GET/POST/DELETE /v1/embed/quotas[/{tenant}] + GET /v1/embed/stats (admin~~gated). Cache hits skip quota by construction (handler reserves only the miss bundle). Estimated~~vs~~actual reconciliation refunds overshoots post~~dispatch. +26 tests (Q1~~Q10, C1~~C6, I1-I10). TTL deferred to #007 HTTP cache adapter.
*0260511 (#002)*— Go server skeleton shipped (backend/cmd/koder-embed + backend/cmd/koder-embed-worker). Wire format: HTTPJSON server↔worker (`POST /v1workerembed); gRPC deferred. Coalescer: per-(modality, model_id) queue, 10 ms window OR 64-input cap. Stub worker uses xxh3-seeded deterministic vectors (real ONNX in #004`#005). Auth + identity context split into internal/identctx/ so handlers can read the principal without an auth↔handler import cycle. 24 tests pass (R1-R7, C1-C7, H1-H8, S1-S2).

Selfhostedfirst analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	pending	Skeleton phase; BGEE5CLIPCLAP cover textimage/audio self-hosted
G2 Performance	pending	Target BGE-large p50 < 50ms/batch=8 short sentences on RTX 4090
G3 Stability	pending	Pre-MVP
G4 Capability	pending	Multimodal (text+image+audio) covered; specialized intents may proxy initially
G5 Critical-path readiness	pending	Pre-MVP; consolidating kortexsearchrecsys is the first concrete unblock