Ai embed

AI Embed — Unified Embeddings Foundation

  • *rea:*Intelligence
  • *ath:*services/ai/embed
  • *ind:*Embedding generation foundation (text + image + audio; intent-based model selection)
  • *tatus:*v0.0.7 — #013 binary naming alignment shipped 20260511. Daemon binary renamed kembedkoder-embed; worker kembed-workerkoder-embed-worker; CLI kembed-adminkembed. Matches the cache precedent (koder-cache daemon + kcache CLI). systemd units updated; feature surface unchanged from v0.0.6. Previous v0.0.6: #007 HTTP cache adapter consuming services/ai/cache#008 KV API.

Role in the stack

embed consolidates embeddings server-side. Today every consumer ships its own model (kortex, search, recsys, agents) — no shared cache, inconsistent dimensions, no consolidated billing. Migrating to a better model means rewriting each consumer.

The headline design choice: *ntentbased selection* Consumers query by intent (`textmatchsearch, textclassify, image-match`) and the registry picks the model. Swapping a model is a registry change, not a consumer rewrite.

It is the Koder analog of OpenAI Embeddings, Voyage, Cohere Embed and Jina — self-hosted via BGE / E5 / CLIP / CLAP on GPU runtime, with proxy fallback to OpenAI / Voyage / Cohere through services/ai/gateway for capability gaps.

Boundary vs neighbors

  • infra/data/kdb-vector is the vector store (search, ANN); embed is the producer.
  • services/ai/cache is the embedding cache backend (keyed by model+input hash).
  • services/ai/runtime hosts the model weights.

Features (v1 target)

  • Text embeddings: BGEmultilingual default (pt+en), BGElargeen optin, E5, Instructor
  • Image embeddings: CLIPViTL/14 + SigLIP for cross-modal
  • Audio embeddings: CLAP for audiotext crossmodal
  • Intent registry: 6 default intents, per-tenant override
  • Batch coalescing: 10ms window OR 64-input cap for throughput
  • Cache integration: hits don't decrement quota

Primary couplings

Consumer Relationship
services/ai/memory Episodic + semantic memory backed by embeddings
services/ai/search Hybrid lexical+semantic search
services/ai/recsys User+item embedding for similarity
services/ai/kortex Knowledge-graph node + edge embeddings
services/ai/agents Tool selection + memory retrieval
services/ai/classify Zero-shot classification via embedding similarity
services/ai/extract Embedding-based entity disambiguation
services/ai/cache Embedding-result cache
services/ai/runtime Local model serving
infra/data/kdb-vector Native vector store target

RFC and bootstrap

  • RFC: embed-RFC-001-foundations.kmd — *ccepted*20260509
  • Bootstrap ticket: services/ai/backlog/done/121-embed-bootstrap.md
  • API contract (normative): api-v1.kmd + intents.toml — shipped 20260511 via #001 (5 endpoints, intentovermodel design, 7 registered models across textimageaudio, error map)
  • Implementation tickets:
    • done/: 001 (API contract + intent registry), 002 (Go server skeleton + coalescer), 003 (quotas + cache + audit + admin API), 004 (split into #009#012), 006 (kembedadmin CLI), 007 (HTTP cache adapter — consumes cache#008), 008 (superseded by #010), 010 (real tokenizers — WordPiece pureGo), 013 (binary naming alignment) — all 202605-11
    • pending/: 005 (image+audio CLIP/CLAP), 009 (ONNX runtime + LocalONNXWorker — CGO), 011 (GPU LXC s.embed.gpu provisioning), 012 (weights + MTEB benchmark — blocked-by #009 + #011)

Recent changes

  • *0260511 (#013 ship)*— Binary naming aligned with cache precedent. Daemon binary: kembedkoder-embed (matches cmd/koder-embed/ + systemd unit). Worker: kembed-workerkoder-embed-worker. CLI: kembed-adminkembed (now matches the binaries-and-cli/naming.kmd "binary = aliases[0] || slug" rule). systemd units' ExecStart paths updated to /usr/local/bin/koder-embed[-worker]. Zero behaviour change (pure rename); 65+ tests stay green. Breaking for any operator running pre-v0.0.7 packages on the same host — /usr/local/bin/kembed now refers to the CLI, not the daemon.
  • *0260511 (#007 ship)*— HTTP cache adapter shipped. internal/cache/http_adapter.HTTPCache consumes services/ai/cache#008 public KV API (PUT/GET/DELETE /v1/cache/kv/{tenant}/{key}) with circuit breaker (5 consecutive 5xx/network errors → open for 30s, single probe recloses on success), percall timeout (default 1s), TTL via ?ttl_seconds= (default 30 days, clamped serverside), defensive emptyBaseURL → NoOp. Wire format JSON {vector: [...], usage: {text_tokens, image_count, audio_seconds}}. 9 tests (HC1HC9) including bearer header propagation, URLencoding of |keys, 404=cleanmiss (no trip), trip+cooldown reopen, local Stats accumulation. [cache.backend] = "http" selects it; [cache.http] config section in TOML.
  • *0260511 (#006 ship + #007 blocker)*— kembed-admin CLI shipped (cmd/kembed/, bin/kembed-admin). Subcommands stats, quotas {list,set,delete}, models. Mirrors kcache pattern but binary named kembed-admin (not kembed) to avoid clash with the daemon binary kembed from #002 — naming alignment tracked as #013. *rosssector discovery* #007 (HTTP cache adapter) was blocked because services/ai/cache exposes no public KV PUTGET — only admin + middleware Wrap. New ticket `servicesai/cache#008 (public KV HTTP API) opened; embed #007` restated as blocked-by until that lands.
  • *0260511 (#004 split + #010 ship)*— #004 was bundling code + tokenizers + GPU LXC + weights + benchmarks; split into 4 subtickets (#009 ONNX runtime + LocalONNXWorker (CGO), #010 real tokenizers (this commit), #011 GPU LXC s.embed.gpu provisioning, #012 weights + MTEB benchmark). #010 shipped: internal/tokenizer/ with BERTcanonical WordPiece in pure Go (greedy longestprefix, ## continuations, CJK runeperpiece, lowercase toggle, maxchars fallback), Whitespace fallback, Registry modelID→Tokenizer index, [[tokenizers]] TOML config, handler swaps approxTokenstokenizerFor(model.ID).Count. Quota billing now charges in targetmodel tokens. SentencePiece deferred to #009/#012. #008 closed as supersededby #010. +13 tests (T1-T12 unit + I11 integration).
  • *0260511 (#003)*— Multitenant quotas + perinput cache + audit. internal/quota.Enforcer with dayrolling UTC counters for texttokens / imagecount / audioseconds, pertenant override that shadows the default, AllowedModels + BatchMaxSize checks, Reserve/Refund clampatzero. internal/cache with NoOp + Memory (LRU bound, tenantscoped key tuple). internal/audit slog wrapper. Admin API GET/POST/DELETE /v1/embed/quotas[/{tenant}] + GET /v1/embed/stats (admingated). Cache hits skip quota by construction (handler reserves only the miss bundle). Estimatedvsactual reconciliation refunds overshoots postdispatch. +26 tests (Q1Q10, C1C6, I1-I10). TTL deferred to #007 HTTP cache adapter.
  • *0260511 (#002)*— Go server skeleton shipped (backend/cmd/koder-embed + backend/cmd/koder-embed-worker). Wire format: HTTPJSON server↔worker (`POST /v1workerembed); gRPC deferred. Coalescer: per-(modality, model_id) queue, 10 ms window OR 64-input cap. Stub worker uses xxh3-seeded deterministic vectors (real ONNX in #004`#005). Auth + identity context split into internal/identctx/ so handlers can read the principal without an auth↔handler import cycle. 24 tests pass (R1-R7, C1-C7, H1-H8, S1-S2).

Selfhostedfirst analysis (5 gates)

Gate Status Notes
G1 Feature parity pending Skeleton phase; BGEE5CLIPCLAP cover textimage/audio self-hosted
G2 Performance pending Target BGE-large p50 < 50ms/batch=8 short sentences on RTX 4090
G3 Stability pending Pre-MVP
G4 Capability pending Multimodal (text+image+audio) covered; specialized intents may proxy initially
G5 Critical-path readiness pending Pre-MVP; consolidating kortexsearchrecsys is the first concrete unblock

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-embed.md