Id RFC 011 storage on kdb next

RFC-011: Migrating koder-id v2 storage from kdb 1.x to kdb-next

Field        Value
Status       *Accepted* (2026-04-09)
Author(s)    Rodrigo (with Claude as scribe)
Date         2026-04-09
Accepted     2026-04-09 by Rodrigo
Depends on   platform/kdb/docs/rfcs/RFC-001-kdb-next-hyperscale-architecture.md
Affects      platform/id/v2/pkg/kdb/, all 6 services' main.go, migrations
Suspends     RFC-010 cutover until Phase 5 of RFC-001 is reached

1. Summary

This RFC is the *client-side counterpart* to RFC-001 in the kdb module. It defines how koder-id v2 will migrate its persistence layer from the current kdb 1.0.6 SQL HTTP endpoints (with the opt-in `X-Org-ID` rewriter shipped in ticket 040) to *kdb-next* (the Rust hyperscale substrate defined in kdb's RFC-001).

The migration is *non-disruptive*: the existing pkg/kdb/ client keeps working unchanged for the parallel-run environment until kdb-next reaches Phase 3 of its roadmap. At that point a new pkg/kdbnext/ client is added side by side; services swap clients one at a time behind a feature flag; the cutover from kdb 1.x to kdb-next is service by service, never big-bang.

2. Goals

  1. *Zero behavior change* for the existing parallel run on id-v2.koder.dev until kdb-next is provably ready.
  2. *One service at a time* can be flipped from kdb 1.x to kdb-next via env var, without redeploying the others.
  3. *Tenant model rewrite*: stop pretending tenancy is a SQL column filter (current state) and embrace tenancy as a request-level primitive carried in the gateway context.
  4. *No textual SQL on the wire*: typed Record API calls between koder-id v2 and kdb-next, eliminating the entire class of "rewriter-edge-case" bugs that the kdb 1.0.6 rewriter has by design.
  5. *Audit, quotas, observability* ride along automatically because they're enforced at the kdb-next gateway, not at the koder-id v2 client.
  6. *A single dry-run* of the cutover, against the parallel run, is the gating event for Phase 6 of kdb-next (the real production cutover of id.koder.dev).

3. Non-goals

  • This RFC does *not* redesign koder-id v2's domain model. The 6 services (admin, auth, identity, oauth, saml, session) and their table layouts stay as they are; only the persistence client and the encoding format on the wire change.
  • This RFC does *not* speak Postgres-wire or libpq. koder-id v2 has never spoken Postgres directly; it uses HTTP to kdb. That stays.
  • This RFC does *not* schedule the production cutover of id.koder.dev. That is Phase 6 of kdb's RFC-001, gated on Phase 5 passing.

4. Background

4.1 Where koder-id v2 stores data today

koder-id v2 services
   │
   │  HTTP/JSON  (per-request: Authorization: Bearer <static_key>)
   ▼
kdb 1.0.6 (Go binary on s.k.lin :7900)
   │
   │  database/sql + ?N positional params
   ▼
SQLite file at /var/lib/koder/koder-kdb/koder-kdb.db

Each of the 6 services calls pkg/kdb/client.go which POSTs to /api/v1/sql/{exec,query,queryRow}. The body shape is:

{
  "namespace": "koder_id_oauth",
  "query":     "INSERT INTO clients (id, name) VALUES ($1, $2)",
  "params":    {"$1": "abc", "$2": "Foo"}
}

The server prepends <namespace>__ to every unqualified table name in the SQL string and (since kdb 1.0.6 / ticket 040) optionally injects org_id filters when the request carries the X-Org-ID header. koder-id v2 currently does *not* send that header. It relies on the per-service namespace as its only isolation primitive, which is per-microservice, *not* per-tenant.
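
To make the current wire format concrete, here is a minimal Go sketch of how a client could assemble that body. The `sqlRequest` type and `buildExecBody` helper are hypothetical names used for illustration, not the actual contents of pkg/kdb/client.go; only the JSON keys and the `$N` parameter naming are taken from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sqlRequest mirrors the body shape shown above. The Go type and field
// names are illustrative; only the JSON keys come from the RFC.
type sqlRequest struct {
	Namespace string         `json:"namespace"`
	Query     string         `json:"query"`
	Params    map[string]any `json:"params"`
}

// buildExecBody encodes one statement for POST /api/v1/sql/exec.
// Positional arguments are keyed "$1", "$2", ... as in the example body.
func buildExecBody(namespace, query string, args ...any) ([]byte, error) {
	params := make(map[string]any, len(args))
	for i, a := range args {
		params[fmt.Sprintf("$%d", i+1)] = a
	}
	return json.Marshal(sqlRequest{Namespace: namespace, Query: query, Params: params})
}

func main() {
	body, err := buildExecBody("koder_id_oauth",
		"INSERT INTO clients (id, name) VALUES ($1, $2)", "abc", "Foo")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Note that nothing in this request identifies a tenant: the namespace is the only scoping input, which is the limitation section 4.2 describes.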

4.2 What's wrong with that today (qualitatively)

  • *Tenancy is not real*: koder_id_oauth__clients contains every org's clients. Two tenants with overlapping ids collide. Today there's only one tenant (koder) so this is invisible.
  • *Wire format is text SQL*: every change to a table requires the client to know the column list, the placeholder positions, the exact SQL syntax. Refactors break things in subtle ways.
  • *Fallback to in-memory was silent*: if the kdb is unreachable the client used to fall back to a MemoryClient that "worked" but lost data on restart. Fixed in the previous session by ensuring KDB_API_KEY is always set, but the silent fallback class of bugs is still possible.
  • *No transactions across statements*: each HTTP call is its own SQLite transaction. Multi-step writes are racy.
  • *No per-tenant rate limits, no per-tenant audit*, because the kdb 1.x server doesn't see tenants.
  • *SQLite is single-writer*: Phase 0 hits this around tens of thousands of orgs.

4.3 What's wrong with it for hyperscale (quantitatively)

See RFC-001 §5 in kdb. The short version: at 100M tenants × 10k rows × audit, the entire koder-id v2 dataset is ~1T rows, which is ~3 orders of magnitude past where SQLite stops being honest.

5. Target client design

5.1 New crate / package: pkg/kdbnext/

A second client lives next to the existing pkg/kdb/ and is imported when a service is flipped to kdb-next. They never coexist in a single service binary; the swap is at compile-and-launch time, controlled by the env var KODER_ID_V2_STORAGE_BACKEND={kdb1|kdbnext}.

platform/id/v2/
├── pkg/
│   ├── kdb/                  # current client; frozen except bug fixes
│   │   ├── client.go
│   │   ├── client_test.go
│   │   └── migration.go
│   └── kdbnext/              # new client (added in Phase 3 of RFC-001)
│       ├── client.go         # talks to kdb-next gateway via HTTP/JSON or gRPC
│       ├── client_test.go
│       ├── tx.go             # transaction handle
│       ├── records.go        # typed table accessors generated from schemas
│       └── migration.go      # online schema bootstrap
└── services/
    └── <each>/cmd/main.go    # picks kdb or kdbnext based on env
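
A minimal sketch of how each main.go might resolve the backend from the environment. The `Backend` type and `parseBackend` function are hypothetical names; the default of kdb1 for an unset variable follows Phase B of this RFC, while rejecting unknown values (rather than silently falling back) is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// Backend identifies which storage client a service should construct.
type Backend string

const (
	BackendKDB1    Backend = "kdb1"
	BackendKDBNext Backend = "kdbnext"
)

const backendEnvVar = "KODER_ID_V2_STORAGE_BACKEND"

// parseBackend maps the raw env value to a Backend.
// An empty value defaults to kdb1; anything unrecognized is an error
// (assumption: fail loudly instead of guessing a backend).
func parseBackend(raw string) (Backend, error) {
	switch raw {
	case "", string(BackendKDB1):
		return BackendKDB1, nil
	case string(BackendKDBNext):
		return BackendKDBNext, nil
	default:
		return "", fmt.Errorf("%s: unknown backend %q", backendEnvVar, raw)
	}
}

func main() {
	backend, err := parseBackend(os.Getenv(backendEnvVar))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("storage backend:", backend)
}
```

Failing at startup on a typo keeps the flip explicit, which matters because rollback in section 8 is the same env var flipped back.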

5.2 API surface

The kdbnext client exposes a typed interface, *not* raw SQL:

type Client interface {
    // Tenancy is mandatory and always carried in the context.
    Tx(ctx context.Context) (Tx, error)

    // WithTx runs fn inside a transaction (see §5.5 for a call site).
    WithTx(ctx context.Context, fn func(Tx) error) error

    // Single-statement convenience helpers (auto-tx).
    GetByPK(ctx context.Context, table string, pk PK, dst any) error
    Put(ctx context.Context, table string, row any) error
    Delete(ctx context.Context, table string, pk PK) error
    Query(ctx context.Context, table string, filter Filter) (Cursor, error)
}

type Tx interface {
    Get(table string, pk PK, dst any) error
    Put(table string, row any) error
    Delete(table string, pk PK) error
    Query(table string, filter Filter) (Cursor, error)
    Commit() error
    Rollback() error
}

Filters are constructed in Go, not as strings:

filter := kdbnext.And(
    kdbnext.Eq("client_type", "confidential"),
    kdbnext.Lt("created_at", cutoff),
).OrderBy("created_at", kdbnext.Desc).Limit(50)

The compiler in kdb-next translates this to an indexed range scan on the appropriate secondary index. There is no SQL string parser involved at runtime — the wire protocol carries a Protobuf-encoded filter tree.
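
As an illustration of what "constructed in Go, not as strings" can look like, the following sketch builds a plain filter tree with the same call shape as the example above. The `Node` and `Filter` types and their fields are assumptions made for this sketch; the real kdbnext types and the Protobuf encoding are not defined in this RFC.

```go
package main

import (
	"fmt"
	"time"
)

// Op names a node kind in the filter tree.
type Op string

const (
	OpAnd Op = "and"
	OpEq  Op = "eq"
	OpLt  Op = "lt"
)

// Direction is the sort direction for OrderBy.
type Direction int

const (
	Asc Direction = iota
	Desc
)

// Node is one node of the tree: a comparison leaf or an And over children.
type Node struct {
	Op       Op
	Field    string
	Value    any
	Children []Node
}

// Filter is the tree plus ordering and limit (hypothetical layout).
type Filter struct {
	Root      Node
	SortField string
	SortDir   Direction
	Max       int
}

func Eq(field string, v any) Node { return Node{Op: OpEq, Field: field, Value: v} }
func Lt(field string, v any) Node { return Node{Op: OpLt, Field: field, Value: v} }

// And combines leaves into a Filter rooted at an "and" node.
func And(children ...Node) Filter {
	return Filter{Root: Node{Op: OpAnd, Children: children}}
}

// OrderBy and Limit return modified copies so calls can be chained.
func (f Filter) OrderBy(field string, dir Direction) Filter {
	f.SortField, f.SortDir = field, dir
	return f
}

func (f Filter) Limit(n int) Filter {
	f.Max = n
	return f
}

func main() {
	cutoff := time.Now().Add(-30 * 24 * time.Hour)
	f := And(
		Eq("client_type", "confidential"),
		Lt("created_at", cutoff),
	).OrderBy("created_at", Desc).Limit(50)
	fmt.Printf("%+v\n", f)
}
```

Because the filter is data rather than text, there is nothing for a server-side rewriter to parse or mis-parse.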

5.3 Tenancy

The tenant id is *never* part of the table name, the namespace, or the row data. It is part of the request context:

ctx = kdbnext.WithTenant(ctx, "tenant-koder")
err := client.GetByPK(ctx, "oauth_clients", kdbnext.PK{"abc"}, &c) // c is the destination row struct

The kdb-next gateway extracts the tenant id from the JWT (which the koder-id v2 service is itself the issuer of) and uses it as the keyspace prefix. There is *no way* for a service-level bug to read another tenant's data: the keyspace doesn't allow constructing a foreign tenant prefix.

The break-glass mechanism during bootstrap is a static signing key in /etc/koder-id-v2/env that issues a "super-tenant" JWT scoped to admin operations only. It will be removed once dogfooding stabilizes.
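
The sketch below shows one way the client side of this could look, assuming the tenant travels via `context.WithValue`. `TenantFrom`, `recordKey`, `requestContext`, and the `tenant/table/pk` key layout are hypothetical and only illustrate the idea that a key cannot be built without a tenant; in the design above the authoritative tenant id comes from the JWT at the gateway.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// tenantKey is an unexported context key type, so other packages
// cannot overwrite the tenant value by accident.
type tenantKey struct{}

// ErrNoTenant is returned when a request carries no tenant id.
var ErrNoTenant = errors.New("no tenant in context")

// WithTenant returns a context carrying the tenant id.
func WithTenant(ctx context.Context, tenantID string) context.Context {
	return context.WithValue(ctx, tenantKey{}, tenantID)
}

// TenantFrom extracts the tenant id; a missing or empty id is an error.
func TenantFrom(ctx context.Context) (string, error) {
	id, ok := ctx.Value(tenantKey{}).(string)
	if !ok || id == "" {
		return "", ErrNoTenant
	}
	return id, nil
}

// recordKey builds a tenant-prefixed key (hypothetical layout).
// There is no code path that yields a key without a tenant prefix.
func recordKey(ctx context.Context, table, pk string) (string, error) {
	tenant, err := TenantFrom(ctx)
	if err != nil {
		return "", err
	}
	return tenant + "/" + table + "/" + pk, nil
}

// requestContext stands in for the per-request context a handler receives.
func requestContext(tenantID string) context.Context {
	return WithTenant(context.Background(), tenantID)
}

func main() {
	key, err := recordKey(requestContext("tenant-koder"), "oauth_clients", "abc")
	fmt.Println(key, err)

	_, err = recordKey(requestContext(""), "oauth_clients", "abc")
	fmt.Println(err)
}
```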

5.4 Schema bootstrap

The current pkg/kdb/migration.go runs raw CREATE TABLE statements through the SQL endpoint and tracks them in a _migrations table. This works because the kdb 1.x server is a thin wrapper over SQLite.

In kdb-next, schema bootstrap is *declarative*:

var oauthClientsSchema = kdbnext.Table{
    Name:       "oauth_clients",
    PrimaryKey: []string{"id"},
    Columns: []kdbnext.Column{
        {Name: "id", Type: kdbnext.Text, NotNull: true},
        {Name: "client_name", Type: kdbnext.Text, NotNull: true},
        {Name: "client_type", Type: kdbnext.Text, NotNull: true},
        {Name: "created_at", Type: kdbnext.Timestamp, NotNull: true},
        // ...
    },
    Indexes: []kdbnext.Index{
        {Name: "by_created_at", Columns: []string{"created_at"}},
        {Name: "by_type", Columns: []string{"client_type"}},
    },
    SchemaVersion: 1,
}

At service startup, the client calls EnsureTable(ctx, oauthClientsSchema) which is idempotent: kdb-next stores the table definition in its metadata range, allocates a table_id, and any future migration becomes a Migrate(oldSchema, newSchema) call that is run online by the kdb-migrate runner (RFC-001 §9 Phase 4).
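
To illustrate the idempotence property, here is a small in-memory stand-in for the metadata range. The `registry` type is hypothetical, the `Table` struct is trimmed to the fields the sketch needs, and rejecting a changed SchemaVersion is an assumption about how EnsureTable and Migrate might divide responsibilities.

```go
package main

import (
	"fmt"
	"sync"
)

// Table is a trimmed version of the declarative schema shown above.
type Table struct {
	Name          string
	PrimaryKey    []string
	SchemaVersion int
}

// registry is an in-memory stand-in for kdb-next's metadata range.
type registry struct {
	mu       sync.Mutex
	nextID   uint32
	ids      map[string]uint32
	versions map[string]int
}

func newRegistry() *registry {
	return &registry{nextID: 1, ids: map[string]uint32{}, versions: map[string]int{}}
}

// EnsureTable returns the table_id for t, allocating one on first sight.
// Calling it again with the same definition returns the same id.
// A different SchemaVersion is rejected here, on the assumption that
// schema changes go through Migrate rather than EnsureTable.
func (r *registry) EnsureTable(t Table) (uint32, error) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if id, ok := r.ids[t.Name]; ok {
		if have := r.versions[t.Name]; have != t.SchemaVersion {
			return 0, fmt.Errorf("table %q: registered version %d, got %d; use Migrate",
				t.Name, have, t.SchemaVersion)
		}
		return id, nil
	}
	id := r.nextID
	r.nextID++
	r.ids[t.Name] = id
	r.versions[t.Name] = t.SchemaVersion
	return id, nil
}

func main() {
	r := newRegistry()
	schema := Table{Name: "oauth_clients", PrimaryKey: []string{"id"}, SchemaVersion: 1}

	first, _ := r.EnsureTable(schema)
	second, _ := r.EnsureTable(schema) // simulates a service restart
	fmt.Println("same table_id across calls:", first == second)
}
```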

5.5 Transactions

Multi-statement writes that today are individual HTTP calls become real transactions:

return client.WithTx(ctx, func(tx kdbnext.Tx) error {
    if err := tx.Put("oauth_clients", c); err != nil {
        return err
    }
    if err := tx.Put("audit_log", auditEntry); err != nil {
        return err
    }
    return nil
})

The two writes either both happen or neither does. Audit cannot be lost without losing the user-visible row.
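
One plausible shape for a WithTx-style helper is sketched below: commit when the callback returns nil, roll back otherwise. The `withTx` function, the `begin` parameter, and the in-memory `memTx` are assumptions for illustration (the context argument is omitted for brevity); the RFC does not specify how kdbnext implements this.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is the subset of the RFC's Tx interface this sketch needs.
type Tx interface {
	Put(table string, row any) error
	Commit() error
	Rollback() error
}

// withTx runs fn in a transaction obtained from begin.
// It commits if fn returns nil and rolls back if fn returns an error.
func withTx(begin func() (Tx, error), fn func(Tx) error) error {
	tx, err := begin()
	if err != nil {
		return err
	}
	if err := fn(tx); err != nil {
		if rbErr := tx.Rollback(); rbErr != nil {
			return fmt.Errorf("%w (rollback also failed: %v)", err, rbErr)
		}
		return err
	}
	return tx.Commit()
}

var errAuditFailed = errors.New("audit write failed")

// memStore and memTx are an in-memory stand-in for kdb-next:
// writes are buffered and only become visible on Commit.
type memStore struct{ rows map[string][]any }

type memTx struct {
	store   *memStore
	pending map[string][]any
}

func (s *memStore) begin() (Tx, error) {
	return &memTx{store: s, pending: map[string][]any{}}, nil
}

func (t *memTx) Put(table string, row any) error {
	t.pending[table] = append(t.pending[table], row)
	return nil
}

func (t *memTx) Commit() error {
	for table, rows := range t.pending {
		t.store.rows[table] = append(t.store.rows[table], rows...)
	}
	return nil
}

func (t *memTx) Rollback() error {
	t.pending = map[string][]any{}
	return nil
}

func main() {
	s := &memStore{rows: map[string][]any{}}
	err := withTx(s.begin, func(tx Tx) error {
		if err := tx.Put("oauth_clients", "client-abc"); err != nil {
			return err
		}
		return errAuditFailed // simulate the audit write failing
	})
	fmt.Println("error:", err, "rows written:", len(s.rows["oauth_clients"]))
}
```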

6. Phased migration

This RFC's phases align 1:1 with kdb's RFC-001 phases. We do nothing on the koder-id side until kdb is ready.

Phase A — Holding pattern (= kdb RFC-001 Phases 0–2)

*Scope*: nothing changes in koder-id v2.

  • The existing pkg/kdb/ client keeps talking to kdb 1.0.6.
  • The parallel run on id-v2.koder.dev continues to validate service correctness.
  • We do *not* start sending X-Org-ID to the kdb 1.0.6 SQL rewriter. It's a dead end (single-tenant koder only).
  • Bug fixes only.

*Exit*: kdb-next Phase 2 acceptance criteria met (1M test tenants, p99 read ≤ 8 ms, etc.).

Phase B — Add pkg/kdbnext/ client (= kdb RFC-001 Phase 3)

*Scope*:

  • Create platform/id/v2/pkg/kdbnext/
  • Implement the typed Client interface above
  • Map every existing pkg/kdb call site to a pkg/kdbnext equivalent (mechanical refactor; no behavior change in the services)
  • Add the KODER_ID_V2_STORAGE_BACKEND env var with default kdb1
  • One service flips to kdbnext first: *oauth* (it has the most rows and the most read pressure, so it's the best canary)
  • The flip is gated by feature flag in env, not by code branches
  • Run the parallel run with oauth on kdbnext, the other 5 on kdb1

*Exit*:

  • 7 days of clean parallel run with oauth on kdbnext
  • p99 oauth read latency ≤ 15 ms (matching kdb-next Phase 3 budget)
  • All 64 migrated OIDC clients persistent across restarts (the same decisive test the previous session ran against kdb 1.0.6, now against kdb-next)

Phase C — Flip the remaining 5 services (= still kdb Phase 3)

*Scope*:

  • Flip identity, session, admin, auth, saml (in this order; auth and saml last because they touch the most secrets)
  • One service per day; observe burn rate
  • The Go shim in pkg/kdbnext/ is allowed to grow as edge cases come up, but it must stay typed (no raw SQL fallbacks)

*Exit*:

  • All 6 services on kdbnext
  • Parallel run stable for 7 days
  • The kdb 1.0.6 endpoints are no longer called by koder-id v2
  • The kdb 1.0.6 binary on s.k.lin continues to serve metrics/alerts; it just doesn't see SQL traffic from koder-id v2 anymore

Phase D — Online migration tooling exercise (= kdb RFC-001 Phase 4)

*Scope*:

  • Use kdb-migrate to add a column to oauth_clients (e.g. a new last_used_at timestamp) on the parallel run, with traffic flowing
  • Verify zero observable latency change
  • Document the playbook in pkg/kdbnext/MIGRATION.md

*Exit*:

  • The migration completes; latency budget honored (≤ +5% read p99 during backfill)
  • Playbook reviewed
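
The "≤ +5% read p99 during backfill" criterion can be checked mechanically. The sketch below assumes latencies are collected as millisecond samples and uses the nearest-rank percentile definition; `percentile` and `withinBudget` are hypothetical helpers, and how samples are actually gathered is outside this RFC.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100)
// of samples, or 0 for an empty slice. The input is not modified.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// withinBudget reports whether the p99 measured during backfill stays
// within maxIncrease (0.05 means +5%) of the baseline p99.
func withinBudget(baselineMs, duringMs []float64, maxIncrease float64) bool {
	base := percentile(baselineMs, 99)
	cur := percentile(duringMs, 99)
	return cur <= base*(1+maxIncrease)
}

func main() {
	// Synthetic samples for illustration only.
	baseline := make([]float64, 100)
	backfill := make([]float64, 100)
	for i := range baseline {
		baseline[i] = float64(i + 1)
		backfill[i] = baseline[i] * 1.04
	}
	fmt.Println("baseline p99 (ms):", percentile(baseline, 99))
	fmt.Println("within +5% budget:", withinBudget(baseline, backfill, 0.05))
}
```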

Phase E — Hyperscale soak (= kdb RFC-001 Phase 5)

*Scope*:

  • This is kdb-next's hyperscale validation. From the koder-id v2 side: load synthetic OIDC clients (10k tenants × 10 clients each = 100k rows) and exercise the auth flow at sustained 1k RPS for 24 hours
  • Capture per-tenant p50/p99 numbers; ensure no tenant drifts past the noisy-neighbor budget

*Exit*: kdb-next Phase 5 numerical targets met (RFC-001 §5).

Phase F — Production cutover of id.koder.dev (= kdb RFC-001 Phase 6)

*Scope*:

  • This is the real cutover that the suspended RFC-010 used to describe. It is *rewritten from scratch* as a brand-new RFC-012 ("Production cutover runbook v2") whose rollback story is grounded in kdb-next, not in kdb 1.0.6.
  • Zitadel stays online for 14 days as the rollback safety net
  • Then Zitadel is decommissioned and the rebrand notes go into the ~/dev/koder/context/ archive

*Exit*: id.koder.dev serves OIDC discovery as koder-id v2 over kdb-next; the original Zitadel instance is gone; production is the new world.

7. Test strategy

  • *Unit tests* in pkg/kdbnext/ for every typed accessor; target 90% line coverage
  • *Integration tests* against a local kdb-next dev cluster (via the KvCluster::Local sled-backed backend in kdb-next): fast, no TiKV needed for CI
  • *End-to-end test* that runs the full koder-id v2 OIDC flow (/.well-known/openid-configuration, /authorize, /token, /userinfo) against the kdbnext backend
  • *Decisive persistence test* ported from the kdb 1.0.6 era: insert a known client, restart the service, re-read, assert the client survived. This is the smoke test for any kdb-next deploy.
  • *Cross-tenant isolation chaos test*: spin up 100 fake tenants, insert random data per tenant, run 10k parallel reads, assert zero cross-tenant reads (this catches both client bugs and gateway bugs)
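
The cross-tenant isolation test in the list above could be structured roughly as follows. This is a sketch against an in-memory stand-in, not the real gateway: the `store` type, its `leaky` switch (which deliberately drops the tenant prefix to show that the check detects a bug), and the deterministic data are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// row records which tenant wrote it, so a reader can detect a leak.
type row struct {
	Owner string
	Value int
}

// store is an in-memory stand-in for the gateway's keyspace.
type store struct {
	mu    sync.RWMutex
	data  map[string]row
	leaky bool // when true, keys omit the tenant prefix (deliberate bug)
}

func newStore(leaky bool) *store {
	return &store{data: map[string]row{}, leaky: leaky}
}

func (s *store) key(tenant, pk string) string {
	if s.leaky {
		return pk
	}
	return tenant + "/" + pk
}

func (s *store) put(tenant, pk string, v int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[s.key(tenant, pk)] = row{Owner: tenant, Value: v}
}

func (s *store) get(tenant, pk string) (row, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.data[s.key(tenant, pk)]
	return r, ok
}

// countLeaks seeds every tenant with the same primary keys, runs the
// reads concurrently, and counts reads that returned another tenant's row.
func countLeaks(s *store, tenants, rowsPerTenant, reads int) int64 {
	name := func(i int) string { return fmt.Sprintf("tenant-%d", i) }
	for t := 0; t < tenants; t++ {
		for r := 0; r < rowsPerTenant; r++ {
			s.put(name(t), fmt.Sprintf("row-%d", r), t*rowsPerTenant+r)
		}
	}

	var leaks int64
	var wg sync.WaitGroup
	for i := 0; i < reads; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			tenant := name(i % tenants)
			got, ok := s.get(tenant, fmt.Sprintf("row-%d", i%rowsPerTenant))
			if ok && got.Owner != tenant {
				atomic.AddInt64(&leaks, 1)
			}
		}(i)
	}
	wg.Wait()
	return leaks
}

func main() {
	fmt.Println("leaks with tenant prefix:", countLeaks(newStore(false), 100, 5, 10000))
	fmt.Println("leaks without prefix:", countLeaks(newStore(true), 100, 5, 10000))
}
```

Giving every tenant identical primary keys is the design choice that makes the test sensitive: any path that loses the tenant prefix immediately returns a foreign row.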

8. Backwards compatibility & rollback

At every phase, rollback is *flip the env var back to kdb1*. The kdb 1.0.6 instance and the SQLite file on s.k.lin remain intact and writable for 30 days after Phase C completes; the parallel run is reversible until that grace period expires.

After Phase F (production cutover), rollback is the existing Zitadel-rollback story (Zitadel stays online for 14 days). After that, no rollback to the old world; we live in kdb-next.

9. Open questions

  1. *Service-level transaction granularity*: today many writes in koder-id v2 are isolated single-statement HTTP calls. When we move to kdb-next we get real transactions, but identifying which existing call sites should be wrapped in a tx requires per-service review. Do we do it per-service in Phase B/C, or leave it as a follow-up "consistency hardening" pass?
  2. *Schema fingerprint algorithm choice*: deferred to kdb-next Phase 2, but koder-id v2 needs to declare which version it targets. Likely outcome: koder-id v2 just imports a constant from pkg/kdbnext.
  3. *Cutover of the parallel-run DNS*: do we keep `id-v2.koder.dev` alive after Phase F as a rollback canary, or remove it? Decision in Phase F.
  4. *The 59 rotated client_secrets in /tmp/migrated.csv* on s.k.lin are valid for the kdb 1.0.6 parallel run but the actual production cutover (Phase F) will need a fresh rotation anyway. The current CSV is for the bootstrap; we re-rotate at cutover.

10. References

  • RFC-001 in kdb: target architecture (this RFC's prerequisite)
  • RFC-009 in koder-id v2: original migration strategy (still conceptually valid; the substrate changes)
  • RFC-010 in koder-id v2: SUSPENDED; will be replaced by a new RFC-012 written on top of kdb-next after Phase E
  • ticket 040 in kdb backlog: the SQL rewriter bridge that gets us through Phase A
  • ~/dev/koder/platform/id/v2/pkg/kdb/client.go: the current client to be replaced
  • ~/dev/koder/platform/id/v2/services/<each>/cmd/main.go: the 6 entry points where the env-var-driven flip happens

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/id-RFC-011-storage-on-kdb-next.md