Kdb RFC 009 active active multi primary
RFC009 — kdbnext: ActiveActive MultiPrimary Replication
| Field | Value |
|---|---|
| Status | *raft*— gated behind RFC |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Audit | infra/data/kdb/docs/RFC-009-AUDIT.md; tickets #519–#524 |
| Related | RFC |
| Target module | infra/data/kdb/ |
1. Summary
RFC008 established *ingleprimary multi-region replication* one designated primary region per tenant, with async secondary replicas. This RFC extends kdbnext to support *ctiveactive multi-primary* writes are accepted in any region simultaneously, without forcing clients to cross the WAN for their home region.
Activeactive is not a single mode. kdbnext's data model spans namespaces with different consistency requirements. This RFC proposes a *er-namespace consistency tier*model:
| Tier | Key prefix | Writes accepted in | Conflict resolution |
|---|---|---|---|
crdt: |
CRDT-typed KV | Any region | CRDT merge (already implemented, RFC-008 §5.1) |
lww: |
Mutable KV, eventual consistency | Any region | HLC last |
strong: |
Catalog, auth tokens, financial | Primary region only | Raft quorum (same as today) |
relational: |
SQL tables | Primary region only | Not in v1 active-active |
Active-active v1 covers the crdt: and lww: tiers. The strong: and relational: tiers remain single-primary until a v2 RFC addresses distributed transactions (2PC/Paxos across WAN).
2. Motivation
RFC008 deferred activeactive explicitly:
*Active
active multiprimary ... requires distributed transactions across regions (2PC or Paxos across WAN), which introduces 50–250 ms extra latency per write.*
This is true for strong: and relational: tiers. However, a large fraction of kdb-next's data does not require strong consistency:
- *oder
talk messages*(only, CRDT-mergeable.crdt:namespace): appendChat messages sent from São Paulo while a European user simultaneously sends from Amsterdam should both commit locally and merge without conflict.
- *oder-kortex embeddings*(
lww:namespace): AI context records. Lastupsert wins is acceptable; a stale embedding that's overwritten 200 ms later causes no user-visible error.
- *oder-flow CI status*(
lww:namespace): build/pipeline state. Concurrentstatus updates from different regions are resolved by HLC ordering.
Blocking these writes on the WAN roundtrip (RFC008 topology) wastes ~250 ms per write for European clients writing to Brazilian-primary tenants. For koder-talk in particular, this visibly degrades the "feel" of the product.
3. Architecture
3.1 Write path per tier
Client → local gateway (any region)
│
├─ key prefix = "crdt:"
│ Write to local TiKV Raft group.
│ Async replication to other regions via RFC-008 Raft learner.
│ At merge time: CRDT merge (commutative, idempotent).
│ Result: converges everywhere, no conflicts possible.
│
├─ key prefix = "lww:"
│ Write to local TiKV Raft group.
│ HLC timestamp attached at commit time.
│ Async replication to other regions.
│ At merge time: higher HLC wins (deterministic, idempotent).
│ Result: eventual consistency, stale reads possible within lag budget.
│
├─ key prefix = "strong:"
│ Forwarded to primary-region gateway (same as RFC-008 today).
│ Raft commit in primary region. Async replication to secondaries.
│ Reads from secondary require "kdb-read-consistency: strong" or
│ may be stale within lag budget.
│
└─ key prefix = "relational:"
Forwarded to primary-region gateway (no change from RFC-008).3.2 Conflict detection for lww: tier
When a lww: write is replicated to a secondary that already has a write for the same key at a higher HLC (i.e., the local copy is "newer"), the incoming write is silently dropped. This is deterministic: both replicas converge to the same value (the one with the highest HLC).
Pathological case: two clients simultaneously write the same key from different regions with clocks drifting more than ±500 ms. Mitigation: HLC node_id tie-break ensures the same winner on all replicas even with equal physical timestamps. Clock health check in §4.1 enforces the drift bound.
3.3 Anti-entropy (convergence after partition)
After a network partition heals, diverged replicas must reconcile. Anti-entropy runs as a background job per TiKV region (shard):
- *erkle digest* each replica computes a Merkle tree over its key space
(keyed by HLC timestamp). The digest is cheap: O(1) network, O(log N) CPU.
- *igest exchange* leader and follower exchange digests every
anti_entropy_interval(default: 60 s). - *epair scan* mismatched sub-trees trigger a targeted key scan +
CRDT merge / HLC LWW resolution.
This is the same anti-entropy pattern used in data/crdt (see koder-crdt/src/anti_entropy.rs). RFC-009 extends it to the lww: tier.
4. Consistency Model
4.1 Hybrid Logical Clock requirements
Each kdb-gateway node MUST maintain HLC drift within ±500 ms of UTC (per RFC008 §5.2). For activeactive, this bound is critical for lww: correctness.
New requirement for active-active: all gateways MUST exchange HLC watermarks with peer gateways in other regions on every write batch (piggyback on the replication stream). A gateway that observes a peer HLC more than 1 second ahead MUST:
- Advance its own HLC to
peer_hlc + 1 logical tick. - Log a
WARN hlc_drift_correctionevent. - If drift exceeds 2 seconds, enter *ead-only mode*(refuse new writes to
lww:tier) until the clock is corrected.
4.2 Isolation guarantees per tier
| Tier | Read isolation | Write isolation |
|---|---|---|
crdt: |
Causal consistency (HLC-ordered) | Concurrent writes always merge |
lww: |
Eventual consistency | Concurrent writes: HLC LWW |
strong: |
Linearizable (Raft primary) | Serializable (Raft primary) |
relational: |
Read Committed (per RFC-004) | Serializable (Raft primary) |
The crdt: and lww: tiers do *ot*provide serializability. Applications using these tiers must be designed for eventual consistency (optimistic UI updates, conflict-tolerant merges).
4.3 Anomalies acknowledged
- *ead
yourwrites* NOT guaranteed forlww:tier across regions.A client that writes in São Paulo then reads from Amsterdam may see a stale value for up to one lag budget period. Clients that need read
yourwrites MUST usekdb-read-consistency: strong. - *onotonic reads* NOT guaranteed for
lww:tier. Sticky sessions (routeclient reads to the same gateway) mitigate this.
- *ausality* guaranteed for
crdt:tier (vector clock ordering). Forlww:tier, HLC provides probabilistic causality but not guaranteed.
5. Data Model Changes
5.1 Tier prefix convention
kdbnext keys already use structured prefixes (RFC001):
{region}:tenant:{id}:{key}Activeactive adds a mandatory secondlevel prefix indicating the consistency tier:
{region}:tenant:{id}:crdt:{type}:{entity_id} ← already used
{region}:tenant:{id}:lww:{table}:{row_id} ← new (active-active v1)
{region}:tenant:{id}:strong:{table}:{row_id} ← maps to current non-prefixed
{region}:tenant:{id}:relational:{sql_row_key} ← unchangedExisting keys without a tier prefix are treated as strong: for backward compatibility.
5.2 Catalog entry per tenant
The tenant catalog (RFC-001 §3) gains a new field:
{
"tenant_id": 42,
"primary_region": "sa-east-1",
"active_active_tiers": ["crdt", "lww"],
"active_active_enabled_at": "2026-05-01T00:00:00Z"
}Activeactive is optin per tenant. Tenants not in active_active_tiers have all writes forwarded to the primary region (RFC-008 behavior).
5.3 Migration: strong: → lww:
Products migrating table data from strong: to lww: tier (to gain active-active writes) MUST:
- Open an RFC addendum or engineering note describing the consistency
downgrade rationale for each table.
- Run a
tenant-shard-migratestyle migration to rekey existing recordsfrom
strong:prefix tolww:prefix. - Update all read/write paths in the product to use the new prefix.
6. Failure Scenarios
6.1 Region partition
network partition
São Paulo ─────────//──────── Amsterdam
(primary) (secondary, now isolated)During partition:
- Both regions continue accepting writes to
crdt:andlww:tiers. strong:andrelational:tiers: São Paulo accepts writes; Amsterdamblocks writes (no quorum) and serves stale reads.
- After healing: anti-entropy reconciles
crdt:andlww:tiers.strong:lag is bounded by the Raft replay time (expected < 5 minutes for typical partition durations).
6.2 Splitbrain (network partition + primary reelection)
A split-brain occurs if Amsterdam elects a new primary for a strong: shard while São Paulo still considers itself primary. This is prevented by:
- Raft quorum requiring a strict majority of voters. With 3 voters (2 SP + 1
AMS), AMS cannot form a quorum alone.
- For 2-node setups (1 SP + 1 AMS), a *itness node*(stateless, runs Raft
voting only) in a third datacenter (or cloud) prevents split-brain.
6.3 Clock skew attack / misconfiguration
A clock-skewed lww: write at t+10s causes all writes at t+0 to t+10s to be silently discarded on that key, even from correct nodes. Mitigation:
- 2
second readonly threshold (§4.1) catches runaway clock drift. - HLC watermark gossip limits the window of opportunity.
- Per
key write timestamps are exposed via `debug kv get -with-meta`, allowingforensic investigation.
7. Implementation Plan
Phase 1 — CRDT tier active-active (low effort, high value)
Enable writes to crdt: keys from any region gateway. The CRDT merge function already handles concurrent writes correctly. Only change needed: remove the "forward to primary" check for crdt: prefixes.
Estimated effort: 1–2 days. Unblocks: kodertalk presence, koderkortex embedding upserts.
Phase 2 — LWW tier activeactive + antientropy
- Implement HLC watermark gossip between gateway peers.
- Extend anti-entropy to cover
lww:keys (Merkle tree over HLC timestamps). - Update key routing to accept
lww:writes locally. - Add catalog opt-in field + migration tooling.
Estimated effort: 2–3 weeks.
Phase 3 — Observability + per-tenant rollout
- Prometheus metrics: replication lag per tier, HLC drift, anti-entropy repair
events.
- Alert thresholds: lag > 2× budget → warn; lag > 5× budget → page.
- Per-tenant rollout: enable
crdt:tier first, thenlww:after 2 weeks ofstability.
Estimated effort: 1 week.
Phase 4 (future RFC) — Strongconsistent activeactive
Distributed 2PC or Paxos across regions for strong: tier. This is the hard part (see RFC008 §3.2). Will be addressed when Koder has multiregion presence with > 100 k ops/s cross-region write traffic. Out of scope for this RFC.
8. Non-Goals (v1)
- Active-active for SQL (
relational:tier). - Active-active for schema catalog mutations.
- Active-active for authentication tokens (Koder ID
strong:data). - Sub-millisecond WAN replication (physical law).
- More than 2 primary regions simultaneously in Phase 1–2 (2-region only).
9. Open Questions
- *itness node infrastructure* where does the Raft witness live? s.r1
already hosts the primary. Possible options: a small VPS in São Paulo, a Cloud VM in us
east1, or a dedicated low-cost node in Europe. Decision needed before Phase 2 can be production-safe.
- *LC gossip transport* piggyback on TiKV's existing replication stream,
or add a separate gRPC channel between gateways? Piggybacking is simpler but couples the two concerns.
- *lww:` tier rollout sequence* which products move first? Proposed order:
koder-kortex (AI embeddings, highest write rate, LWW acceptable) → koder
talk (messages, but needs strong readyour-writes semantics — may need to staystrong:for now) → koder-flow (CI state).
- *onflict tombstones* should
lww:deletes be tombstoned with HLCtimestamps to prevent resurrection by a replicated write arriving after the delete? Current LWW implementation uses writes
winover-deletes. This may need to be reversed for some use cases.