RFC-011: Migrating koder-id v2 storage from kdb 1.x to kdb-next
| Field | Value |
|---|---|
| Status | **Accepted** (2026) |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Accepted | 2026 |
| Depends on | platform/kdb/docs/rfcs/RFC-001-kdb-next-hyperscale-architecture.md |
| Affects | platform/id/v2/pkg/kdb/, all 6 services' main.go, migrations |
| Suspends | RFC |
1. Summary
This RFC is the **client-side counterpart** to RFC-001 in the kdb module. It defines how koder-id v2 will migrate its persistence layer from the current kdb 1.0.6 SQL HTTP endpoints (with the opt-in `X-Org-ID` rewriter shipped in ticket 040) to **kdb-next** (the Rust hyperscale substrate defined in kdb's RFC-001).
The migration is **non-disruptive**: the existing `pkg/kdb/` client keeps working unchanged for the parallel-run environment until kdb-next reaches Phase 3 of its roadmap. At that point a new `pkg/kdbnext/` client is added side-by-side; services swap clients one at a time behind a feature flag; the cutover from kdb 1.x to kdb-next is service-by-service, never big-bang.
2. Goals
- **Zero behavior change** for the existing parallel run on `id-v2.koder.dev` until kdb-next is provably ready.
- **One service at a time** can be flipped from kdb 1.x to kdb-next via env var, without redeploying the others.
- **Tenant model rewrite**: stop pretending tenancy is a SQL column filter (current state) and embrace tenancy as a request-level primitive carried in the gateway context.
- **No textual SQL on the wire**: typed Record API calls between koder-id v2 and kdb-next, eliminating the entire class of "rewriter-edge-case" bugs that the kdb 1.0.6 rewriter has by design.
- **Audit, quotas, observability** ride along automatically because they're enforced at the kdb-next gateway, not at the koder-id v2 client.
- **A single dry-run** of the cutover, against the parallel run, is the gating event for Phase 6 of kdb-next (the real production cutover of `id.koder.dev`).
3. Non-goals
- This RFC does **not** redesign koder-id v2's domain model. The 6 services (admin, auth, identity, oauth, saml, session) and their table layouts stay as they are; only the persistence client and the encoding format on the wire change.
- This RFC does **not** speak Postgres-wire or libpq. koder-id v2 has never spoken Postgres directly; it uses HTTP to kdb. That stays.
- This RFC does **not** schedule the production cutover of `id.koder.dev`. That is Phase 6 of kdb's RFC-001, gated on Phase 5 passing.
4. Background
4.1 Where koder-id v2 stores data today

    koder-id v2 services
         │
         │  HTTP/JSON (per-request: Authorization: Bearer <static_key>)
         ▼
    kdb 1.0.6 (Go binary on s.k.lin :7900)
         │
         │  database/sql + ?N positional params
         ▼
    SQLite file at /var/lib/koder/koder-kdb/koder-kdb.db

Each of the 6 services calls `pkg/kdb/client.go`, which POSTs to `/api/v1/sql/{exec,query,queryRow}`. The body shape is:

    {
      "namespace": "koder_id_oauth",
      "query": "INSERT INTO clients (id, name) VALUES ($1, $2)",
      "params": {"$1": "abc", "$2": "Foo"}
    }

The server prepends `<namespace>__` to every unqualified table name in the SQL string and (since kdb 1.0.6 / ticket 040) optionally injects `org_id` filters when the request carries the `X-Org-ID` header. koder-id v2 currently does **not** send that header. It relies on the per-service namespace as its only isolation primitive, which is **not** per-tenant; it's per-microservice.
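To make the body shape concrete, here is a minimal Go sketch that produces exactly that JSON with `encoding/json`. The struct and function names (`sqlRequest`, `encodeSQLRequest`) are hypothetical and chosen for illustration; the real `pkg/kdb/client.go` may be organized differently. Only the JSON keys come from this RFC.

```go
package main

import "encoding/json"

// sqlRequest mirrors the JSON body shown above. The Go names are
// illustrative; only the JSON keys are taken from this RFC.
type sqlRequest struct {
	Namespace string            `json:"namespace"`
	Query     string            `json:"query"`
	Params    map[string]string `json:"params"`
}

// encodeSQLRequest serializes one request body for the kdb 1.x SQL
// endpoints. The caller is still responsible for the HTTP POST and the
// Authorization header.
func encodeSQLRequest(namespace, query string, params map[string]string) ([]byte, error) {
	return json.Marshal(sqlRequest{
		Namespace: namespace,
		Query:     query,
		Params:    params,
	})
}
```

Note how the client has to know the table name, the column list, and the placeholder positions as plain strings. That coupling is what section 4.2 criticizes.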
4.2 What's wrong with that today (qualitatively)
- **Tenancy is not real**: `koder_id_oauth__clients` contains every org's clients. Two tenants with overlapping ids collide. Today there's only one tenant (`koder`) so this is invisible.
- **Wire format is text SQL**: every change to a table requires the client to know the column list, the placeholder positions, the exact SQL syntax. Refactors break things in subtle ways.
- **Fallback to in-memory was silent**: if the kdb is unreachable the client used to fall back to a `MemoryClient` that "worked" but lost data on restart. Fixed in the previous session by ensuring `KDB_API_KEY` is always set, but the silent fallback class of bugs is still possible.
- **No transactions across statements**: each HTTP call is its own SQLite transaction. Multi-step writes are racy.
- **No per-tenant rate limits, no per-tenant audit**: because the kdb 1.x server doesn't see tenants.
- **SQLite is single-writer**: Phase 0 hits this around tens of thousands of orgs.
4.3 What's wrong with it for hyperscale (quantitatively)
See RFC-001 §5 in kdb. The short version: at 100M tenants × 10k rows × audit, the entire koder-id v2 dataset is ~1T rows, which is ~3 orders of magnitude past where SQLite stops being honest.
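Written out, the estimate is a single multiplication. The audit multiplier is left symbolic here because this RFC does not state it; the symbol $a$ is a placeholder, not a figure taken from RFC-001:

$$
10^{8}\ \text{tenants} \times 10^{4}\ \text{rows per tenant} = 10^{12}\ \text{rows}, \qquad \text{with audit: } a \cdot 10^{12},\quad a \ge 1 .
$$

Reading "about three orders of magnitude" literally, the implied point where SQLite stops being a comfortable fit is on the order of $10^{12} / 10^{3} = 10^{9}$ rows. That figure is an inference from the sentence above, not a measured limit.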
5. Target client design
5.1 New crate / package: pkg/kdbnext/
A second client lives next to the existing `pkg/kdb/` and is imported when a service is flipped to kdb-next. They never coexist in a single service binary; the swap is at compile-and-launch time, controlled by the env var `KODER_ID_V2_STORAGE_BACKEND={kdb1|kdbnext}`.

    platform/id/v2/
    ├── pkg/
    │   ├── kdb/                  # current client; frozen except bug fixes
    │   │   ├── client.go
    │   │   ├── client_test.go
    │   │   └── migration.go
    │   └── kdbnext/              # new client (added in Phase 3 of RFC-001)
    │       ├── client.go         # talks to kdb-next gateway via HTTP/JSON or gRPC
    │       ├── client_test.go
    │       ├── tx.go             # transaction handle
    │       ├── records.go        # typed table accessors generated from schemas
    │       └── migration.go      # online schema bootstrap
    └── services/
        └── <each>/cmd/main.go    # picks kdb or kdbnext based on env

5.2 API surface
The kdbnext client exposes a typed interface, **not** raw SQL:

    type Client interface {
        // Tenancy is mandatory and always carried in the context.
        Tx(ctx context.Context) (Tx, error)

        // Single-statement convenience helpers (auto-tx).
        GetByPK(ctx context.Context, table string, pk PK, dst any) error
        Put(ctx context.Context, table string, row any) error
        Delete(ctx context.Context, table string, pk PK) error
        Query(ctx context.Context, table string, filter Filter) (Cursor, error)
    }

    type Tx interface {
        Get(table string, pk PK, dst any) error
        Put(table string, row any) error
        Delete(table string, pk PK) error
        Query(table string, filter Filter) (Cursor, error)
        Commit() error
        Rollback() error
    }

Filters are constructed in Go, not as strings:

    filter := kdbnext.And(
        kdbnext.Eq("client_type", "confidential"),
        kdbnext.Lt("created_at", cutoff),
    ).OrderBy("created_at", kdbnext.Desc).Limit(50)

The compiler in kdb-next translates this to an indexed range scan on the appropriate secondary index. There is no SQL string parser involved at runtime: the wire protocol carries a Protobuf-encoded filter tree.
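One way such a builder could be structured is sketched below. Everything in it is an assumption made for illustration: the `Filter` struct layout, the `Op` constants, and the copy-on-chain methods are not taken from kdb-next, and the sketch builds a plain Go tree instead of the Protobuf message that would actually go on the wire.

```go
package main

// Op names the node kinds in this hypothetical filter tree.
type Op string

const (
	OpAnd Op = "and"
	OpEq  Op = "eq"
	OpLt  Op = "lt"
)

// Direction is the sort order used by OrderBy.
type Direction int

const (
	Asc Direction = iota
	Desc
)

// Filter is one tree node plus optional ordering and limit.
// Leaf nodes set Column and Value; And nodes set Children.
type Filter struct {
	Op       Op
	Column   string
	Value    any
	Children []Filter
	OrderCol string
	OrderDir Direction
	Max      int
}

// Eq builds a leaf that matches rows where col equals v.
func Eq(col string, v any) Filter { return Filter{Op: OpEq, Column: col, Value: v} }

// Lt builds a leaf that matches rows where col is less than v.
func Lt(col string, v any) Filter { return Filter{Op: OpLt, Column: col, Value: v} }

// And builds a node that matches rows satisfying every child.
func And(children ...Filter) Filter { return Filter{Op: OpAnd, Children: children} }

// OrderBy returns a copy of f with ordering set, so calls can be chained.
func (f Filter) OrderBy(col string, dir Direction) Filter {
	f.OrderCol, f.OrderDir = col, dir
	return f
}

// Limit returns a copy of f with a maximum row count set.
func (f Filter) Limit(n int) Filter {
	f.Max = n
	return f
}
```

Because the tree is data and not text, a typo in a column name can be caught when the filter is checked against the declared schema, and there is no string for a rewriter to misparse.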
5.3 Tenancy
The tenant id is **never** part of the table name, the namespace, or the row data. It is part of the request context:

    ctx = kdbnext.WithTenant(ctx, "tenant-koder")
    // oauthClient is the destination struct, declared elsewhere.
    client.GetByPK(ctx, "oauth_clients", PK{"abc"}, &oauthClient)

The kdb-next gateway extracts the tenant id from the JWT (which the koder-id v2 service is itself the issuer of) and uses it as the keyspace prefix. There is **no way** for a service-level bug to read another tenant's data: the keyspace doesn't allow constructing a foreign tenant prefix.
The break-glass during bootstrap is a static signing key in `/etc/koder-id-v2/env` that issues a "super-tenant" JWT scoped to admin operations only. It is removed once dogfooding stabilizes.
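On the client side, carrying the tenant in the context could look like the sketch below. `WithTenant` is the name used above; the unexported key type, `tenantFrom`, `errNoTenant`, and `demoTenant` are hypothetical helpers added for illustration. How the gateway validates the JWT is out of scope here.

```go
package main

import (
	"context"
	"errors"
)

// tenantKey is an unexported key type, so no other package can read or
// overwrite the value by accident.
type tenantKey struct{}

// errNoTenant is returned when a call is attempted without a tenant.
var errNoTenant = errors.New("kdbnext: no tenant in context")

// WithTenant returns a copy of ctx that carries tenantID.
func WithTenant(ctx context.Context, tenantID string) context.Context {
	return context.WithValue(ctx, tenantKey{}, tenantID)
}

// tenantFrom extracts the tenant id, failing closed when it is absent
// or empty so that no request can go out untenanted.
func tenantFrom(ctx context.Context) (string, error) {
	id, ok := ctx.Value(tenantKey{}).(string)
	if !ok || id == "" {
		return "", errNoTenant
	}
	return id, nil
}

// demoTenant exercises both paths: a context with a tenant and one without.
func demoTenant() (found string, missingErr error) {
	found, _ = tenantFrom(WithTenant(context.Background(), "tenant-koder"))
	_, missingErr = tenantFrom(context.Background())
	return found, missingErr
}
```

Failing closed on a missing tenant matches the statement in 5.2 that tenancy is mandatory: a call site that forgets `WithTenant` gets an error instead of an implicit default tenant.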
5.4 Schema bootstrap
The current pkg/kdb/migration.go runs raw CREATE TABLE statements through the SQL endpoint and tracks them in a _migrations table. This works because the kdb 1.x server is a thin wrapper over SQLite.
In kdb-next, schema bootstrap is **declarative**:

    var oauthClientsSchema = kdbnext.Table{
        Name:       "oauth_clients",
        PrimaryKey: []string{"id"},
        Columns: []kdbnext.Column{
            {Name: "id", Type: kdbnext.Text, NotNull: true},
            {Name: "client_name", Type: kdbnext.Text, NotNull: true},
            {Name: "client_type", Type: kdbnext.Text, NotNull: true},
            {Name: "created_at", Type: kdbnext.Timestamp, NotNull: true},
            // ...
        },
        Indexes: []kdbnext.Index{
            {Name: "by_created_at", Columns: []string{"created_at"}},
            {Name: "by_type", Columns: []string{"client_type"}},
        },
        SchemaVersion: 1,
    }

At service startup, the client calls `EnsureTable(ctx, oauthClientsSchema)`, which is idempotent: kdb-next stores the table definition in its metadata range and allocates a `table_id`. Any future migration becomes a `Migrate(oldSchema, newSchema)` call that is run online by the `kdb-migrate` runner (RFC-001 §9 Phase 4).
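The idempotency property can be illustrated with an in-memory stand-in for the metadata range. This is a sketch under stated assumptions, not kdb-next's behavior: the `registry` type is hypothetical, the `Table` struct is trimmed to two fields, and rejecting a changed `SchemaVersion` (instead of migrating implicitly) is an assumption made here to keep `EnsureTable` and `Migrate` separate.

```go
package main

import (
	"fmt"
	"sync"
)

// Table is trimmed to the two fields this sketch needs.
type Table struct {
	Name          string
	SchemaVersion int
}

// entry is what the stand-in metadata range stores per table.
type entry struct {
	id      uint32
	version int
}

// registry is an in-memory stand-in for the kdb-next metadata range.
type registry struct {
	mu     sync.Mutex
	nextID uint32
	tables map[string]entry
}

func newRegistry() *registry {
	return &registry{nextID: 1, tables: map[string]entry{}}
}

// EnsureTable returns the existing table id when the same definition was
// already registered, and allocates a new id only on first sight.
// A different SchemaVersion is rejected: that path belongs to Migrate.
func (r *registry) EnsureTable(t Table) (uint32, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if e, ok := r.tables[t.Name]; ok {
		if e.version != t.SchemaVersion {
			return 0, fmt.Errorf("table %q: stored schema version %d, got %d; run a migration",
				t.Name, e.version, t.SchemaVersion)
		}
		return e.id, nil
	}
	id := r.nextID
	r.nextID++
	r.tables[t.Name] = entry{id: id, version: t.SchemaVersion}
	return id, nil
}
```

With this shape, every service can call `EnsureTable` unconditionally at startup: restarts and multiple replicas converge on the same `table_id`.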
5.5 Transactions
Multi-statement writes that today are individual HTTP calls become real transactions:

    return client.WithTx(ctx, func(tx kdbnext.Tx) error {
        if err := tx.Put("oauth_clients", c); err != nil {
            return err
        }
        if err := tx.Put("audit_log", auditEntry); err != nil {
            return err
        }
        return nil
    })

The two writes either both happen or neither does. Audit cannot be lost without losing the user-visible row.
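`WithTx` is not listed in the `Client` interface in 5.2, so one plausible reading is that it is a helper layered over `Tx(ctx)`. The sketch below shows the commit-or-rollback core of such a helper. The `Tx` interface is trimmed to the two methods needed, and `withTx`, `fakeTx`, and `errDemo` are hypothetical names used only for this illustration. It requires Go 1.20 or later for `errors.Join`.

```go
package main

import "errors"

// Tx is trimmed to the two terminal methods this sketch needs.
type Tx interface {
	Commit() error
	Rollback() error
}

// withTx runs fn against tx. If fn succeeds the transaction is
// committed; if fn fails it is rolled back and the original error is
// returned, joined with any rollback error.
// Panics inside fn are not recovered in this sketch.
func withTx(tx Tx, fn func(Tx) error) error {
	if err := fn(tx); err != nil {
		return errors.Join(err, tx.Rollback())
	}
	return tx.Commit()
}

// fakeTx records which terminal call was made; demonstration only.
type fakeTx struct {
	committed  bool
	rolledBack bool
}

func (f *fakeTx) Commit() error   { f.committed = true; return nil }
func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

// errDemo stands in for a failed write inside the callback.
var errDemo = errors.New("demo: second write failed")
```

The design point is that call sites never call `Commit` or `Rollback` themselves, so a forgotten rollback on an early return cannot leave a half-applied write.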
6. Phased migration
Each phase below is pinned to a phase (or range of phases) of kdb's RFC-001. We do nothing on the koder-id side until kdb is ready.
Phase A — Holding pattern (= kdb RFC-001 Phases 0–2)
**Scope:** nothing changes in koder-id v2.
- The existing `pkg/kdb/` client keeps talking to kdb 1.0.6.
- The parallel run on `id-v2.koder.dev` continues to validate service correctness.
- We do **not** start sending `X-Org-ID` to the kdb 1.0.6 SQL rewriter. It's a dead end (single-tenant `koder` only).
- Bug fixes only.
**Exit:** kdb-next Phase 2 acceptance criteria met (1M test tenants, p99 read ≤ 8 ms, etc.).
Phase B — Add pkg/kdbnext/ client (= kdb RFC-001 Phase 3)
**Scope:**
- Create `platform/id/v2/pkg/kdbnext/`
- Implement the typed Client interface above
- Map every existing `pkg/kdb` call site to a `pkg/kdbnext` equivalent (mechanical refactor; no behavior change in the services)
- Add the `KODER_ID_V2_STORAGE_BACKEND` env var with default `kdb1`
- One service flips to `kdbnext` first: **oauth** (it has the most rows and the most read pressure, so it's the best canary)
- The flip is gated by feature flag in env, not by code branches
- Run the parallel run with oauth on kdbnext, the other 5 on kdb1
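The env-var switch in this scope could be parsed as sketched below. The variable name and its two values come from this RFC; the `Backend` type and `parseBackend` function are hypothetical. Treating an unset variable as `kdb1` follows the default stated above, and rejecting unknown values is an assumption made here to avoid a silent fallback of the kind described in 4.2.

```go
package main

import "fmt"

// Backend identifies which storage client a service binary should use.
type Backend string

const (
	BackendKDB1    Backend = "kdb1"
	BackendKDBNext Backend = "kdbnext"
)

// parseBackend maps the raw value of KODER_ID_V2_STORAGE_BACKEND to a
// Backend. An empty value selects the default (kdb1); anything other
// than the two known values is an error, never a silent fallback.
func parseBackend(raw string) (Backend, error) {
	switch Backend(raw) {
	case "":
		return BackendKDB1, nil
	case BackendKDB1, BackendKDBNext:
		return Backend(raw), nil
	default:
		return "", fmt.Errorf("KODER_ID_V2_STORAGE_BACKEND=%q: want kdb1 or kdbnext", raw)
	}
}
```

Each service's `main.go` would call `parseBackend(os.Getenv("KODER_ID_V2_STORAGE_BACKEND"))` once at startup and exit on error, which keeps the flip per-service and makes a typo in the env file fail loudly.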
**Exit:**
- 7 days of clean parallel run with oauth on kdbnext
- p99 oauth read latency ≤ 15 ms (matching kdb-next Phase 3 budget)
- All 64 migrated OIDC clients persist across restarts (the same decisive test the previous session ran against kdb 1.0.6, now against kdb-next)
Phase C — Flip the remaining 5 services (= still kdb Phase 3)
**Scope:**
- Flip identity, session, admin, auth, saml (in this order; auth and saml last because they touch the most secrets)
- One service per day; observe burn rate
- The Go shim in `pkg/kdbnext/` is allowed to grow as edge cases come up, but it must stay typed (no raw SQL fallbacks)
**Exit:**
- All 6 services on kdbnext
- Parallel run stable for 7 days
- The kdb 1.0.6 endpoints are no longer called by koder-id v2
- The kdb 1.0.6 binary on `s.k.lin` continues to serve metrics/alerts; it just doesn't see SQL traffic from koder-id v2 anymore
Phase D — Online migration tooling exercise (= kdb RFC-001 Phase 4)
**Scope:**
- Use `kdb-migrate` to add a column to `oauth_clients` (e.g. a new `last_used_at` timestamp) on the parallel run, with traffic flowing
- Verify no observable latency change beyond the exit budget below
- Document the playbook in `pkg/kdbnext/MIGRATION.md`
**Exit:**
- The migration completes; latency budget honored (≤ +5% read p99 during backfill)
- Playbook reviewed
Phase E — Hyperscale soak (= kdb RFC-001 Phase 5)
**Scope:**
- This is kdb-next's hyperscale validation. From the koder-id v2 side: load synthetic OIDC clients (10k tenants × 10 clients each = 100k rows) and exercise the auth flow at sustained 1k RPS for 24 hours
- Capture per-tenant p50/p99 numbers; ensure no tenant drifts past the noisy-neighbor budget
**Exit:** kdb-next Phase 5 numerical targets met (RFC-001 §5).
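The per-tenant p50/p99 check in this phase can be sketched as follows. The nearest-rank percentile definition, the function names, and the `budgetMs` parameter are all assumptions for illustration; this RFC does not say how percentiles are computed or what the noisy-neighbor budget is.

```go
package main

import (
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile p (0 < p <= 100) of
// the given latency samples in milliseconds. It returns 0 for an empty
// input and does not modify the caller's slice.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)

	// Nearest rank: the smallest rank whose share of samples is >= p%.
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// tenantsOverBudget returns, in sorted order, every tenant whose p99
// latency exceeds budgetMs.
func tenantsOverBudget(latenciesMs map[string][]float64, budgetMs float64) []string {
	var over []string
	for tenant, samples := range latenciesMs {
		if percentile(samples, 99) > budgetMs {
			over = append(over, tenant)
		}
	}
	sort.Strings(over)
	return over
}
```

Computing the percentile per tenant, and not over the pooled samples, is what makes a noisy neighbor visible: one slow tenant among 10k barely moves the global p99.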
Phase F — Production cutover of id.koder.dev (= kdb RFC-001 Phase 6)
**Scope:**
- This is the real cutover that the suspended RFC-010 used to describe. It is **rewritten from scratch** as a brand-new RFC-012 ("Production cutover runbook v2") whose rollback story is grounded in kdb-next, not in kdb 1.0.6.
- Zitadel stays online for 14 days as the rollback safety net
- Then Zitadel is decommissioned and the rebrand notes go into the ~/dev/koder/context/archive
**Exit:** `id.koder.dev` serves OIDC discovery as koder-id v2 over kdb-next; the original Zitadel instance is gone; production is the new world.
7. Test strategy
- **Unit tests** in `pkg/kdbnext/` for every typed accessor; target 90% line coverage
- **Integration tests** against a local kdb-next dev cluster (via the `KvCluster::Local` sled-backed backend in kdb-next): fast, no TiKV needed for CI
- **End-to-end test** that runs the full koder-id v2 OIDC flow (`/.well-known/openid-configuration`, `/authorize`, `/token`, `/userinfo`) against the kdb-next backend
- **Decisive persistence test** ported from the kdb 1.0.6 era: insert a known client, restart the service, re-read, assert the client survived. This is the smoke test for any kdb-next deploy.
- **Cross-tenant isolation chaos test**: spin up 100 fake tenants, insert random data per tenant, run 10k parallel reads, assert zero cross-tenant reads (this catches both client bugs and gateway bugs)
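The shape of the cross-tenant isolation test can be sketched against an in-memory stand-in. The `store` type below is hypothetical and replaces the real client plus gateway; in the actual test the reads would go through `pkg/kdbnext` to a kdb-next dev cluster. Every tenant deliberately uses the same primary keys, which is the overlapping-id case that collides today.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// row records its owning tenant so a reader can detect a leak.
type row struct {
	Owner string
	Value int
}

// store is an in-memory stand-in for client plus gateway:
// tenant -> primary key -> row.
type store struct {
	mu   sync.RWMutex
	data map[string]map[string]row
}

func (s *store) put(tenant, pk string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[tenant] == nil {
		s.data[tenant] = map[string]row{}
	}
	s.data[tenant][pk] = row{Owner: tenant, Value: v}
}

func (s *store) get(tenant, pk string) (row, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.data[tenant][pk]
	return r, ok
}

// countLeaks seeds every tenant with the same primary keys, issues the
// given number of parallel reads, and counts rows whose owner differs
// from the tenant that asked for them.
func countLeaks(tenants, rowsPerTenant, reads int) int64 {
	s := &store{data: map[string]map[string]row{}}
	for t := 0; t < tenants; t++ {
		for r := 0; r < rowsPerTenant; r++ {
			s.put(fmt.Sprintf("tenant-%d", t), fmt.Sprintf("pk-%d", r), t*rowsPerTenant+r)
		}
	}

	var leaks int64
	var wg sync.WaitGroup
	for i := 0; i < reads; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			tenant := fmt.Sprintf("tenant-%d", i%tenants)
			pk := fmt.Sprintf("pk-%d", i%rowsPerTenant)
			if got, ok := s.get(tenant, pk); ok && got.Owner != tenant {
				atomic.AddInt64(&leaks, 1)
			}
		}(i)
	}
	wg.Wait()
	return leaks
}
```

Calling `countLeaks(100, 5, 10000)` mirrors the numbers in the bullet above (100 tenants, 10k parallel reads; the 5 rows per tenant is an arbitrary choice here). The test passes only when the returned count is zero.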
8. Backwards compatibility & rollback
At every phase, rollback is **flip the env var back to `kdb1`**. The kdb 1.0.6 instance and the SQLite file on `s.k.lin` remain intact and writable for 30 days after Phase C completes; the parallel run is reversible until that grace period expires.
After Phase F (production cutover), rollback is the existing Zitadel-rollback story (Zitadel stays online for 14 days). After that, no rollback to the old world; we live in kdb-next.
9. Open questions
- **Service-level transaction granularity**: today many writes in koder-id v2 are isolated single-statement HTTP calls. When we move to kdb-next we get real transactions, but identifying which existing call sites should be wrapped in a tx requires per-service review. Do we do it per-service in Phase B/C, or leave it as a follow-up "consistency hardening" pass?
- **Schema fingerprint algorithm choice**: deferred to kdb-next Phase 2, but koder-id v2 needs to declare which version it targets. Likely outcome: koder-id v2 just imports a constant from `pkg/kdbnext`.
- **Cutover of the parallel-run DNS**: do we keep `id-v2.koder.dev` alive after Phase F as a rollback canary, or remove it? Decision in Phase F.
- **The 59 rotated client_secrets in `/tmp/migrated.csv`** on `s.k.lin` are valid for the kdb 1.0.6 parallel run but the actual production cutover (Phase F) will need a fresh rotation anyway. The current CSV is for the bootstrap; we re-rotate at cutover.
10. References
- RFC-001 in kdb: target architecture (this RFC's prerequisite)
- RFC-009 in koder-id v2: original migration strategy (still conceptually valid; the substrate changes)
- RFC-010 in koder-id v2: SUSPENDED; will be replaced by a new RFC-012 written on top of kdb-next after Phase E
- ticket 040 in kdb backlog: the SQL rewriter bridge that gets us through Phase A
- `~/dev/koder/platform/id/v2/pkg/kdb/client.go`: the current client to be replaced
- `~/dev/koder/platform/id/v2/services/*/cmd/main.go`: the 6 entrypoints where the env-var-driven flip happens