RFC-011: Migrating koder-id v2 storage from kdb 1.x to kdb-next

| Field | Value |
|---|---|
| Status | Accepted (2026-04-09) |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-09 |
| Accepted | 2026-04-09 by Rodrigo |
| Depends on | `platform/kdb/docs/rfcs/RFC-001-kdb-next-hyperscale-architecture.md` |
| Affects | `platform/id/v2/pkg/kdb/`, all 6 services' main.go, migrations |
| Suspends | RFC-010 cutover until Phase 5 of RFC-001 is reached |

1. Summary

This RFC is the **client-side counterpart** to RFC-001 in the kdb module. It defines how koder-id v2 will migrate its persistence layer from the current kdb 1.0.6 SQL HTTP endpoints (with the opt-in `X-Org-ID` rewriter shipped in ticket 040) to **kdb-next** (the Rust hyperscale substrate defined in kdb's RFC-001).

The migration is **non-disruptive**: the existing pkg/kdb/ client keeps working unchanged for the parallel-run environment until kdb-next reaches Phase 3 of its roadmap. At that point a new pkg/kdbnext/ client is added side by side; services swap clients one at a time behind a feature flag; the cutover from kdb 1.x to kdb-next is service-by-service, never big-bang.

2. Goals

- **Zero behavior change** for the existing parallel run on id-v2.koder.dev until kdb-next is provably ready.
- **One service at a time** can be flipped from kdb 1.x to kdb-next via env var, without redeploying the others.
- **Tenant model rewrite**: stop pretending tenancy is a SQL column filter (current state) and embrace tenancy as a request-level primitive carried in the gateway context.
- **No textual SQL on the wire**: typed Record API calls between koder-id v2 and kdb-next, eliminating the entire class of "rewriter-edge-case" bugs that the kdb 1.0.6 rewriter has by design.
- **Audit, quotas, observability** ride along automatically because they're enforced at the kdb-next gateway, not at the koder-id v2 client.
- **A single dry-run** of the cutover, against the parallel run, is the gating event for Phase 6 of kdb-next (the real production cutover of id.koder.dev).

3. Non-goals

- This RFC does **not** redesign koder-id v2's domain model. The 6 services (admin, auth, identity, oauth, saml, session) and their table layouts stay as they are; only the persistence client and the encoding format on the wire change.
- This RFC does **not** speak Postgres-wire or libpq. koder-id v2 has never spoken Postgres directly; it uses HTTP to kdb. That stays.
- This RFC does **not** schedule the production cutover of id.koder.dev. That is Phase 6 of kdb's RFC-001, gated on Phase 5 passing.

4. Background

4.1 Where koder-id v2 stores data today

koder-id v2 services
   │
   │  HTTP/JSON  (per-request: Authorization: Bearer <static_key>)
   ▼
kdb 1.0.6 (Go binary on s.k.lin :7900)
   │
   │  database/sql + ?N positional params
   ▼
SQLite file at /var/lib/koder/koder-kdb/koder-kdb.db

Each of the 6 services calls pkg/kdb/client.go which POSTs to /api/v1/sql/{exec,query,queryRow}. The body shape is:

{
  "namespace": "koder_id_oauth",
  "query":     "INSERT INTO clients (id, name) VALUES ($1, $2)",
  "params":    {"$1": "abc", "$2": "Foo"}
}

The server prepends <namespace>__ to every unqualified table name in the SQL string and (since kdb 1.0.6 / ticket 040) optionally injects org_id filters when the request carries the X-Org-ID header. koder-id v2 currently does **not** send that header. It relies on the per-service namespace as its only isolation primitive, which is **not** per-tenant; it's per-microservice.

4.2 What's wrong with that today (qualitatively)

- **Tenancy is not real**: koder_id_oauth__clients contains every org's clients. Two tenants with overlapping ids collide. Today there's only one tenant (koder) so this is invisible.
- **Wire format is text SQL**: every change to a table requires the client to know the column list, the placeholder positions, the exact SQL syntax. Refactors break things in subtle ways.
- **Fallback to in-memory was silent**: if kdb is unreachable the client used to fall back to a MemoryClient that "worked" but lost data on restart. Fixed in the previous session by ensuring KDB_API_KEY is always set, but the silent fallback class of bugs is still possible.
- **No transactions across statements**: each HTTP call is its own SQLite transaction. Multi-step writes are racy.
- **No per-tenant rate limits, no per-tenant audit**, because the kdb 1.x server doesn't see tenants.
- **SQLite is a single-writer**: Phase 0 hits this around tens of thousands of orgs.

4.3 What's wrong with it for hyperscale (quantitatively)

See RFC-001 §5 in kdb. The short version: at 100M tenants × 10k rows × audit, the entire koder-id v2 dataset is ~1T rows, which is ~3 orders of magnitude past where SQLite stops being honest.

5. Target client design

5.1 New crate / package: `pkg/kdbnext/`

A second client lives next to the existing pkg/kdb/ and is imported when a service is flipped to kdb-next. They never coexist in a single service binary; the swap is at compile-and-launch time, controlled by the env var KODER_ID_V2_STORAGE_BACKEND={kdb1|kdbnext}.

platform/id/v2/
├── pkg/
│   ├── kdb/                  # current client; frozen except bug fixes
│   │   ├── client.go
│   │   ├── client_test.go
│   │   └── migration.go
│   └── kdbnext/              # new client (added in Phase 3 of RFC-001)
│       ├── client.go         # talks to kdb-next gateway via HTTP/JSON or gRPC
│       ├── client_test.go
│       ├── tx.go             # transaction handle
│       ├── records.go        # typed table accessors generated from schemas
│       └── migration.go      # online schema bootstrap
└── services/
    └── <each>/cmd/main.go    # picks kdb or kdbnext based on env
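
A minimal sketch of how a service's main.go might read KODER_ID_V2_STORAGE_BACKEND, assuming an unset variable means kdb1 (the Phase B default) and an unknown value should stop the process. The function name selectBackend and the constants are hypothetical, not part of either client package.

```go
package main

import (
	"fmt"
	"os"
)

// Backend names accepted by KODER_ID_V2_STORAGE_BACKEND (values from this RFC).
const (
	BackendKDB1    = "kdb1"
	BackendKDBNext = "kdbnext"
)

// selectBackend validates the raw env value. Empty falls back to kdb1;
// any other unknown value is an error, so a typo cannot silently pick
// the wrong store. (Hypothetical helper, for illustration only.)
func selectBackend(raw string) (string, error) {
	switch raw {
	case "", BackendKDB1:
		return BackendKDB1, nil
	case BackendKDBNext:
		return BackendKDBNext, nil
	default:
		return "", fmt.Errorf("unknown KODER_ID_V2_STORAGE_BACKEND %q", raw)
	}
}

func main() {
	backend, err := selectBackend(os.Getenv("KODER_ID_V2_STORAGE_BACKEND"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// A real main.go would construct the pkg/kdb or pkg/kdbnext client here.
	fmt.Println("storage backend:", backend)
}
```

Failing fast on an unknown value is a design choice in this sketch; it keeps the failure visible at launch instead of at the first write.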

5.2 API surface

The kdbnext client exposes a typed interface, **not** raw SQL:

type Client interface {
    // Tenancy is mandatory and always carried in the context.
    Tx(ctx context.Context) (Tx, error)

    // WithTx runs fn inside a single transaction (used in §5.5).
    WithTx(ctx context.Context, fn func(Tx) error) error

    // Single-statement convenience helpers (auto-tx).
    GetByPK(ctx context.Context, table string, pk PK, dst any) error
    Put(ctx context.Context, table string, row any) error
    Delete(ctx context.Context, table string, pk PK) error
    Query(ctx context.Context, table string, filter Filter) (Cursor, error)
}

type Tx interface {
    Get(table string, pk PK, dst any) error
    Put(table string, row any) error
    Delete(table string, pk PK) error
    Query(table string, filter Filter) (Cursor, error)
    Commit() error
    Rollback() error
}

Filters are constructed in Go, not as strings:

filter := kdbnext.And(
    kdbnext.Eq("client_type", "confidential"),
    kdbnext.Lt("created_at", cutoff),
).OrderBy("created_at", kdbnext.Desc).Limit(50)

The compiler in kdb-next translates this to an indexed range scan on the appropriate secondary index. There is no SQL string parser involved at runtime — the wire protocol carries a Protobuf-encoded filter tree.
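
To make the "filter tree" idea concrete, here is a rough sketch of what the builder could produce in memory before encoding. All types here (Node, Filter, Direction) are assumptions for illustration; the real kdbnext types and the Protobuf schema are defined by kdb-next, not by this RFC.

```go
package main

import "fmt"

// Direction is the sort order for OrderBy.
type Direction int

const (
	Asc Direction = iota
	Desc
)

// Node is one node of the filter tree: a comparison leaf ("eq", "lt")
// carrying Field and Value, or an "and" node carrying Children.
type Node struct {
	Op       string
	Field    string
	Value    any
	Children []Node
}

// Filter wraps the tree with optional ordering and a row limit.
type Filter struct {
	root       Node
	orderField string
	dir        Direction
	limit      int
}

func Eq(field string, v any) Node { return Node{Op: "eq", Field: field, Value: v} }
func Lt(field string, v any) Node { return Node{Op: "lt", Field: field, Value: v} }

// And combines leaves into a single conjunction at the root of a Filter.
func And(children ...Node) Filter {
	return Filter{root: Node{Op: "and", Children: children}}
}

// OrderBy and Limit return modified copies so calls can be chained.
func (f Filter) OrderBy(field string, dir Direction) Filter {
	f.orderField, f.dir = field, dir
	return f
}

func (f Filter) Limit(n int) Filter {
	f.limit = n
	return f
}

func main() {
	f := And(
		Eq("client_type", "confidential"),
		Lt("created_at", 1700000000), // placeholder cutoff value
	).OrderBy("created_at", Desc).Limit(50)
	fmt.Printf("%+v\n", f)
}
```

Because the tree is plain data, there is nothing to escape or parse: a value can never be mistaken for syntax, which is the property the typed API is after.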

5.3 Tenancy

The tenant id is **never** part of the table name, the namespace, or the row data. It is part of the request context:

ctx = kdbnext.WithTenant(ctx, "tenant-koder")
// c is the destination row; it must not reuse the name of the kdbnext client.
err := client.GetByPK(ctx, "oauth_clients", PK{"abc"}, &c)

The kdb-next gateway extracts the tenant id from the JWT (which the koder-id v2 service is itself the issuer of) and uses it as the keyspace prefix. There is **no way** for a service-level bug to read another tenant's data: the keyspace doesn't allow constructing a foreign tenant prefix.

The break-glass during bootstrap is a static signing key in /etc/koder-id-v2/env that issues a "super-tenant" JWT scoped to admin operations only. Removed once dogfooding stabilizes.
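
On the client side, carrying the tenant in the context could look roughly like the sketch below. It uses only the standard library context package; TenantFrom, ErrNoTenant and lookup are hypothetical names, and the real client would also have to turn the tenant into the JWT that the gateway checks.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// tenantKey is an unexported type, so no other package can collide with it.
type tenantKey struct{}

// ErrNoTenant is returned when a request context carries no tenant id.
var ErrNoTenant = errors.New("no tenant in context")

// WithTenant returns a child context that carries the tenant id.
func WithTenant(ctx context.Context, tenant string) context.Context {
	return context.WithValue(ctx, tenantKey{}, tenant)
}

// TenantFrom extracts the tenant id, failing closed when it is absent or empty.
func TenantFrom(ctx context.Context) (string, error) {
	t, ok := ctx.Value(tenantKey{}).(string)
	if !ok || t == "" {
		return "", ErrNoTenant
	}
	return t, nil
}

// lookup mimics the first step of a client call: resolve the tenant or refuse.
// An empty argument stands for "the caller forgot to attach a tenant".
func lookup(tenant string) (string, error) {
	ctx := context.Background()
	if tenant != "" {
		ctx = WithTenant(ctx, tenant)
	}
	return TenantFrom(ctx)
}

func main() {
	t, err := lookup("tenant-koder")
	fmt.Println(t, err)
	_, err = lookup("")
	fmt.Println(err)
}
```

The point of failing closed is that a call site which forgets the tenant gets an error, never a default tenant.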

5.4 Schema bootstrap

The current pkg/kdb/migration.go runs raw CREATE TABLE statements through the SQL endpoint and tracks them in a _migrations table. This works because the kdb 1.x server is a thin wrapper over SQLite.

In kdbnext, schema bootstrap is **declarative**:

var oauthClientsSchema = kdbnext.Table{
    Name:       "oauth_clients",
    PrimaryKey: []string{"id"},
    Columns: []kdbnext.Column{
        {Name: "id", Type: kdbnext.Text, NotNull: true},
        {Name: "client_name", Type: kdbnext.Text, NotNull: true},
        {Name: "client_type", Type: kdbnext.Text, NotNull: true},
        {Name: "created_at", Type: kdbnext.Timestamp, NotNull: true},
        // ...
    },
    Indexes: []kdbnext.Index{
        {Name: "by_created_at", Columns: []string{"created_at"}},
        {Name: "by_type", Columns: []string{"client_type"}},
    },
    SchemaVersion: 1,
}

At service startup, the client calls EnsureTable(ctx, oauthClientsSchema) which is idempotent: kdb-next stores the table definition in its metadata range, allocates a table_id, and any future migration becomes a Migrate(oldSchema, newSchema) call that is run online by the kdb-migrate runner (RFC-001 §9 Phase 4).
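
The idempotency contract can be illustrated with a toy in-memory catalog. This is a sketch under stated assumptions: Catalog stands in for the metadata range, the Table type is cut down to two fields, and returning an error on a changed SchemaVersion is our guess at how "needs a Migrate call" would surface, not confirmed kdb-next behavior.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Table is a cut-down stand-in for the declarative schema above.
type Table struct {
	Name          string
	SchemaVersion int
}

// ErrNeedsMigration signals a version change that EnsureTable must not apply itself.
var ErrNeedsMigration = errors.New("schema version changed; run a migration")

type entry struct {
	id      int
	version int
}

// Catalog plays the role of the metadata range: table name -> (table_id, version).
type Catalog struct {
	mu     sync.Mutex
	nextID int
	tables map[string]entry
}

func NewCatalog() *Catalog {
	return &Catalog{nextID: 1, tables: make(map[string]entry)}
}

// EnsureTable allocates a table id on first sight and returns the same id
// on every repeat call with an unchanged schema version.
func (c *Catalog) EnsureTable(t Table) (int, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.tables[t.Name]; ok {
		if e.version != t.SchemaVersion {
			return 0, ErrNeedsMigration
		}
		return e.id, nil
	}
	e := entry{id: c.nextID, version: t.SchemaVersion}
	c.nextID++
	c.tables[t.Name] = e
	return e.id, nil
}

func main() {
	cat := NewCatalog()
	schema := Table{Name: "oauth_clients", SchemaVersion: 1}
	first, _ := cat.EnsureTable(schema)
	second, _ := cat.EnsureTable(schema) // e.g. after a service restart
	fmt.Println(first == second)

	schema.SchemaVersion = 2
	_, err := cat.EnsureTable(schema)
	fmt.Println(err)
}
```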

5.5 Transactions

Multi-statement writes that today are individual HTTP calls become real transactions:

return client.WithTx(ctx, func(tx kdbnext.Tx) error {
    if err := tx.Put("oauth_clients", c); err != nil {
        return err
    }
    if err := tx.Put("audit_log", auditEntry); err != nil {
        return err
    }
    return nil
})

The two writes either both happen or neither does. Audit cannot be lost without losing the user-visible row.
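
One way a WithTx-style helper could deliver that guarantee is sketched below: commit when the callback returns nil, roll back otherwise. The memTx type is a hypothetical in-memory stand-in that buffers writes until Commit; the real transaction handle lives in the kdb-next gateway.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is trimmed to the methods this sketch needs.
type Tx interface {
	Put(table string, row any) error
	Commit() error
	Rollback() error
}

// withTx commits when fn returns nil and rolls back otherwise.
func withTx(tx Tx, fn func(Tx) error) error {
	if err := fn(tx); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rbErr)
		}
		return err
	}
	return tx.Commit()
}

// memTx buffers writes and only publishes them to store on Commit.
type memTx struct {
	store   map[string][]any
	pending map[string][]any
}

func newMemTx(store map[string][]any) *memTx {
	return &memTx{store: store, pending: make(map[string][]any)}
}

func (t *memTx) Put(table string, row any) error {
	t.pending[table] = append(t.pending[table], row)
	return nil
}

func (t *memTx) Commit() error {
	for table, rows := range t.pending {
		t.store[table] = append(t.store[table], rows...)
	}
	t.pending = make(map[string][]any)
	return nil
}

func (t *memTx) Rollback() error {
	t.pending = make(map[string][]any)
	return nil
}

// errAudit simulates the audit write failing.
var errAudit = errors.New("audit write failed")

func main() {
	store := make(map[string][]any)
	err := withTx(newMemTx(store), func(tx Tx) error {
		if err := tx.Put("oauth_clients", "client-abc"); err != nil {
			return err
		}
		return errAudit // the second write fails, so the first must not land
	})
	fmt.Println(err, len(store["oauth_clients"]))
}
```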

6. Phased migration

This RFC's phases align 1:1 with kdb's RFC-001 phases. We do nothing on the koder-id side until kdb is ready.

Phase A — Holding pattern (= kdb RFC-001 Phases 0–2)

**Scope**: nothing changes in koder-id v2.

- The existing pkg/kdb/ client keeps talking to kdb 1.0.6.
- The parallel run on id-v2.koder.dev continues to validate service correctness.
- We do **not** start sending X-Org-ID to the kdb 1.0.6 SQL rewriter. It's a dead end (single-tenant koder only).
- Bug fixes only.

**Exit**: kdb-next Phase 2 acceptance criteria met (1M test tenants, p99 read ≤ 8 ms, etc.).

Phase B — Add `pkg/kdbnext/` client (= kdb RFC-001 Phase 3)

**Scope**:

- Create platform/id/v2/pkg/kdbnext/
- Implement the typed Client interface above
- Map every existing pkg/kdb call site to a pkg/kdbnext equivalent (mechanical refactor; no behavior change in the services)
- Add the KODER_ID_V2_STORAGE_BACKEND env var with default kdb1
- One service flips to kdbnext first: **oauth** (it has the most rows and the most read pressure, so it's the best canary)
- The flip is gated by feature flag in env, not by code branches
- Run the parallel run with oauth on kdbnext, the other 5 on kdb1

**Exit**:

- 7 days of clean parallel run with oauth on kdbnext
- p99 oauth read latency ≤ 15 ms (matching kdb-next Phase 3 budget)
- All 64 migrated OIDC clients persistent across restarts (the same decisive test the previous session ran against kdb 1.0.6, now against kdbnext)

Phase C — Flip the remaining 5 services (= still kdb Phase 3)

**Scope**:

- Flip identity, session, admin, auth, saml (in this order; auth and saml last because they touch the most secrets)
- One service per day; observe burn rate
- The Go shim in pkg/kdbnext/ is allowed to grow as edge cases come up, but it must stay typed (no raw SQL fallbacks)

**Exit**:

- All 6 services on kdbnext
- Parallel run stable for 7 days
- The kdb 1.0.6 endpoints are no longer called by koder-id v2
- The kdb 1.0.6 binary on s.k.lin continues to serve metrics/alerts; it just doesn't see SQL traffic from koder-id v2 anymore

Phase D — Online migration tooling exercise (= kdb RFC-001 Phase 4)

**Scope**:

- Use kdb-migrate to add a column to oauth_clients (e.g. a new last_used_at timestamp) on the parallel run, with traffic flowing
- Verify zero observable latency change
- Document the playbook in pkg/kdbnext/MIGRATION.md

**Exit**:

- The migration completes; latency budget honored (≤ +5% read p99 during backfill)
- Playbook reviewed
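
The "≤ +5% read p99 during backfill" criterion can be checked mechanically. The sketch below assumes latency samples are collected in milliseconds and uses the nearest-rank definition of p99; both the sampling source and the percentile method are assumptions, since this RFC does not fix them.

```go
package main

import (
	"fmt"
	"sort"
)

// p99 returns the 99th percentile of samples (milliseconds) using the
// nearest-rank method. It returns 0 for an empty input.
func p99(samplesMs []float64) float64 {
	n := len(samplesMs)
	if n == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...) // do not reorder the caller's slice
	sort.Float64s(sorted)
	rank := (99*n + 99) / 100 // integer form of ceil(0.99 * n)
	return sorted[rank-1]
}

// withinBudget reports whether the p99 during backfill is at most
// maxIncrease (0.05 means +5%) above the baseline p99.
func withinBudget(baselineMs, duringMs, maxIncrease float64) bool {
	return duringMs <= baselineMs*(1+maxIncrease)
}

func main() {
	baseline := make([]float64, 0, 100)
	during := make([]float64, 0, 100)
	for i := 1; i <= 100; i++ {
		baseline = append(baseline, float64(i)/10)  // synthetic samples
		during = append(during, float64(i)/10*1.03) // 3% slower across the board
	}
	fmt.Println(withinBudget(p99(baseline), p99(during), 0.05))
}
```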

Phase E — Hyperscale soak (= kdb RFC-001 Phase 5)

**Scope**:

- This is kdb-next's hyperscale validation. From the koder-id v2 side: load synthetic OIDC clients (10k tenants × 10 clients each = 100k rows) and exercise the auth flow at sustained 1k RPS for 24 hours
- Capture per-tenant p50/p99 numbers; ensure no tenant drifts past the noisy-neighbor budget

**Exit**: kdb-next Phase 5 numerical targets met (RFC-001 §5).

Phase F — Production cutover of `id.koder.dev` (= kdb RFC-001 Phase 6)

**Scope**:

- This is the real cutover that the suspended RFC-010 used to describe. It is **rewritten from scratch** as a brand-new RFC-012 ("Production cutover runbook v2") whose rollback story is grounded in kdb-next, not in kdb 1.0.6.
- Zitadel stays online for 14 days as the rollback safety net
- Then Zitadel is decommissioned and the rebrand notes go into the ~/dev/koder/context/ archive

**Exit**: id.koder.dev serves OIDC discovery as koder-id v2 over kdb-next; the original Zitadel instance is gone; production is the new world.

7. Test strategy

- **Unit tests** in pkg/kdbnext/ for every typed accessor; target 90% line coverage
- **Integration tests** against a local kdb-next dev cluster (via the KvCluster::Local sled-backed backend in kdb-next): fast, no TiKV needed for CI
- **End-to-end test** that runs the full koder-id v2 OIDC flow (/.well-known/openid-configuration, /authorize, /token, /userinfo) against the kdbnext backend
- **Decisive persistence test** ported from the kdb 1.0.6 era: insert a known client, restart the service, re-read, assert the client survived. This is the smoke test for any kdbnext deploy.
- **Cross-tenant isolation chaos test**: spin up 100 fake tenants, insert random data per tenant, run 10k parallel reads, assert zero cross-tenant reads (this catches both client bugs and gateway bugs)
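
The shape of the cross-tenant isolation test could look like the sketch below. It runs against a toy in-memory store whose keys are prefixed with the tenant id, so it only demonstrates the harness (seed, parallel reads, leak counter); the store type and the key layout are assumptions, and the real test would drive the kdbnext client against a dev cluster.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// row records which tenant wrote it, so a reader can detect a leak.
type row struct{ Owner, Value string }

// store is a toy keyspace: keys are "<tenant>/<pk>", mirroring the idea
// of the tenant id as keyspace prefix.
type store struct {
	mu   sync.RWMutex
	data map[string]row
}

func (s *store) put(tenant, pk, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[tenant+"/"+pk] = row{Owner: tenant, Value: value}
}

func (s *store) get(tenant, pk string) (row, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.data[tenant+"/"+pk]
	return r, ok
}

// crossTenantReads seeds every tenant with the same primary key, runs
// `reads` parallel reads, and counts reads that missed or returned a row
// owned by a different tenant.
func crossTenantReads(tenants, reads int) int64 {
	s := &store{data: make(map[string]row)}
	for t := 0; t < tenants; t++ {
		name := fmt.Sprintf("tenant-%d", t)
		s.put(name, "client-1", "secret-of-"+name)
	}
	var leaks int64
	var wg sync.WaitGroup
	for i := 0; i < reads; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			reader := fmt.Sprintf("tenant-%d", i%tenants)
			if r, ok := s.get(reader, "client-1"); !ok || r.Owner != reader {
				atomic.AddInt64(&leaks, 1)
			}
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&leaks)
}

func main() {
	fmt.Println("cross-tenant reads:", crossTenantReads(100, 10000))
}
```

Using the same primary key for every tenant is deliberate: it is the overlapping-id case from §4.2, where a missing tenant prefix would show up immediately as a wrong owner.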

8. Backwards compatibility & rollback

At every phase, rollback is **flip the env var back to kdb1**. The kdb 1.0.6 instance and the SQLite file on s.k.lin remain intact and writable for 30 days after Phase C completes; the parallel run is reversible until that grace period expires.

After Phase F (production cutover), rollback is the existing Zitadel-rollback story (Zitadel stays online for 14 days). After that, no rollback to the old world; we live in kdb-next.

9. Open questions

- **Service-level transaction granularity**: today many writes in koder-id v2 are isolated single-statement HTTP calls. When we move to kdbnext we get real transactions, but identifying which existing call sites should be wrapped in a tx requires per-service review. Do we do it per-service in Phase B/C, or leave it as a follow-up "consistency hardening" pass?
- **Schema fingerprint algorithm choice**: deferred to kdb-next Phase 2, but koder-id v2 needs to declare which version it targets. Likely outcome: koder-id v2 just imports a constant from pkg/kdbnext.
- **Cutover of the parallel-run DNS**: do we keep `id-v2.koder.dev` alive after Phase F as a rollback canary, or remove it? Decision in Phase F.
- **The 59 rotated client_secrets in /tmp/migrated.csv** on s.k.lin are valid for the kdb 1.0.6 parallel run but the actual production cutover (Phase F) will need a fresh rotation anyway. The current CSV is for the bootstrap; we re-rotate at cutover.

10. References

- RFC-001 in kdb: target architecture (this RFC's prerequisite)
- RFC-009 in koder-id v2: original migration strategy (still conceptually valid; the substrate changes)
- RFC-010 in koder-id v2: SUSPENDED; will be replaced by a new RFC-012 written on top of kdb-next after Phase E
- ticket 040 in kdb backlog: the SQL rewriter bridge that gets us through Phase A
- ~/dev/koder/platform/id/v2/pkg/kdb/client.go: the current client to be replaced
- ~/dev/koder/platform/id/v2/services/<each>/cmd/main.go: the 6 entry points where the env-var-driven flip happens
