Kdb RFC 009 active active multi primary

RFC009 — kdbnext: ActiveActive MultiPrimary Replication

Field	Value
Status	raft— gated behind RFC~~002 multi~~region. Audit 20260426 shows P0–P5 of §7 unimplemented.
Author(s)	Rodrigo (with Claude as scribe)
Date	20260416 (drafted) · 20260426 (audit)
Audit	`infra/data/kdb/docs/RFC-009-AUDIT.md`; tickets #519–#524
Related	RFC~~008 (single~~primary multi~~region), RFC~~001 §7, Ticket #041, #141
Target module	`infra/data/kdb/`

1. Summary

RFC~~008 established *ingle~~primary multi-region replication* one designated primary region per tenant, with async secondary replicas. This RFC extends kdb~~next to support *ctive~~active multi-primary* writes are accepted in any region simultaneously, without forcing clients to cross the WAN for their home region.

Active~~active is not a single mode. kdb~~next's data model spans namespaces with different consistency requirements. This RFC proposes a *er-namespace consistency tier*model:

Tier	Key prefix	Writes accepted in	Conflict resolution
`crdt:`	CRDT-typed KV	Any region	CRDT merge (already implemented, RFC-008 §5.1)
`lww:`	Mutable KV, eventual consistency	Any region	HLC last~~write~~wins
`strong:`	Catalog, auth tokens, financial	Primary region only	Raft quorum (same as today)
`relational:`	SQL tables	Primary region only	Not in v1 active-active

Active-active v1 covers the crdt: and lww: tiers. The strong: and relational: tiers remain single-primary until a v2 RFC addresses distributed transactions (2PC/Paxos across WAN).

2. Motivation

RFC~~008 deferred active~~active explicitly:

*Active~~active multi~~primary ... requires distributed transactions across regions (2PC or Paxos across WAN), which introduces 50–250 ms extra latency per write.*

This is true for strong: and relational: tiers. However, a large fraction of kdb-next's data does not require strong consistency:

*oder~~talk messages*(crdt: namespace): append~~only, CRDT-mergeable.
Chat messages sent from São Paulo while a European user simultaneously sends from Amsterdam should both commit locally and merge without conflict.
*oder-kortex embeddings*(lww: namespace): AI context records. Last
upsert wins is acceptable; a stale embedding that's overwritten 200 ms later causes no user-visible error.
*oder-flow CI status*(lww: namespace): build/pipeline state. Concurrent
status updates from different regions are resolved by HLC ordering.

Blocking these writes on the WAN round~~trip (RFC~~008 topology) wastes ~250 ms per write for European clients writing to Brazilian-primary tenants. For koder-talk in particular, this visibly degrades the "feel" of the product.

3. Architecture

3.1 Write path per tier

Client → local gateway (any region)
         │
         ├─ key prefix = "crdt:"
         │   Write to local TiKV Raft group.
         │   Async replication to other regions via RFC-008 Raft learner.
         │   At merge time: CRDT merge (commutative, idempotent).
         │   Result: converges everywhere, no conflicts possible.
         │
         ├─ key prefix = "lww:"
         │   Write to local TiKV Raft group.
         │   HLC timestamp attached at commit time.
         │   Async replication to other regions.
         │   At merge time: higher HLC wins (deterministic, idempotent).
         │   Result: eventual consistency, stale reads possible within lag budget.
         │
         ├─ key prefix = "strong:"
         │   Forwarded to primary-region gateway (same as RFC-008 today).
         │   Raft commit in primary region. Async replication to secondaries.
         │   Reads from secondary require "kdb-read-consistency: strong" or
         │   may be stale within lag budget.
         │
         └─ key prefix = "relational:"
             Forwarded to primary-region gateway (no change from RFC-008).

3.2 Conflict detection for `lww:` tier

When a lww: write is replicated to a secondary that already has a write for the same key at a higher HLC (i.e., the local copy is "newer"), the incoming write is silently dropped. This is deterministic: both replicas converge to the same value (the one with the highest HLC).

Pathological case: two clients simultaneously write the same key from different regions with clocks drifting more than ±500 ms. Mitigation: HLC node_id tie-break ensures the same winner on all replicas even with equal physical timestamps. Clock health check in §4.1 enforces the drift bound.

3.3 Anti-entropy (convergence after partition)

After a network partition heals, diverged replicas must reconcile. Anti-entropy runs as a background job per TiKV region (shard):

*erkle digest* each replica computes a Merkle tree over its key space
(keyed by HLC timestamp). The digest is cheap: O(1) network, O(log N) CPU.
*igest exchange* leader and follower exchange digests every
anti_entropy_interval (default: 60 s).
*epair scan* mismatched sub-trees trigger a targeted key scan +
CRDT merge / HLC LWW resolution.

This is the same anti-entropy pattern used in data/crdt (see koder-crdt/src/anti_entropy.rs). RFC-009 extends it to the lww: tier.

4. Consistency Model

4.1 Hybrid Logical Clock requirements

Each kdb-gateway node MUST maintain HLC drift within ±500 ms of UTC (per RFC~~008 §5.2). For active~~active, this bound is critical for lww: correctness.

New requirement for active-active: all gateways MUST exchange HLC watermarks with peer gateways in other regions on every write batch (piggyback on the replication stream). A gateway that observes a peer HLC more than 1 second ahead MUST:

Advance its own HLC to peer_hlc + 1 logical tick.
Log a WARN hlc_drift_correction event.
If drift exceeds 2 seconds, enter *ead-only mode*(refuse new writes to
lww: tier) until the clock is corrected.

4.2 Isolation guarantees per tier

Tier	Read isolation	Write isolation
`crdt:`	Causal consistency (HLC-ordered)	Concurrent writes always merge
`lww:`	Eventual consistency	Concurrent writes: HLC LWW
`strong:`	Linearizable (Raft primary)	Serializable (Raft primary)
`relational:`	Read Committed (per RFC-004)	Serializable (Raft primary)

The crdt: and lww: tiers do *ot*provide serializability. Applications using these tiers must be designed for eventual consistency (optimistic UI updates, conflict-tolerant merges).

4.3 Anomalies acknowledged

*ead~~your~~writes* NOT guaranteed for lww: tier across regions.
A client that writes in São Paulo then reads from Amsterdam may see a stale value for up to one lag budget period. Clients that need read~~your~~writes MUST use kdb-read-consistency: strong.
*onotonic reads* NOT guaranteed for lww: tier. Sticky sessions (route
client reads to the same gateway) mitigate this.
*ausality* guaranteed for crdt: tier (vector clock ordering). For
lww: tier, HLC provides probabilistic causality but not guaranteed.

5. Data Model Changes

5.1 Tier prefix convention

kdb~~next keys already use structured prefixes (RFC~~001):

{region}:tenant:{id}:{key}

Active~~active adds a mandatory second~~level prefix indicating the consistency tier:

{region}:tenant:{id}:crdt:{type}:{entity_id}    ← already used
{region}:tenant:{id}:lww:{table}:{row_id}        ← new (active-active v1)
{region}:tenant:{id}:strong:{table}:{row_id}     ← maps to current non-prefixed
{region}:tenant:{id}:relational:{sql_row_key}    ← unchanged

Existing keys without a tier prefix are treated as strong: for backward compatibility.

5.2 Catalog entry per tenant

The tenant catalog (RFC-001 §3) gains a new field:

{
  "tenant_id": 42,
  "primary_region": "sa-east-1",
  "active_active_tiers": ["crdt", "lww"],
  "active_active_enabled_at": "2026-05-01T00:00:00Z"
}

Active~~active is opt~~in per tenant. Tenants not in active_active_tiers have all writes forwarded to the primary region (RFC-008 behavior).

5.3 Migration: `strong:` → `lww:`

Products migrating table data from strong: to lww: tier (to gain active-active writes) MUST:

Open an RFC addendum or engineering note describing the consistency
downgrade rationale for each table.
Run a tenant-shard-migrate style migration to rekey existing records
from strong: prefix to lww: prefix.
Update all read/write paths in the product to use the new prefix.

6. Failure Scenarios

6.1 Region partition

                     network partition
São Paulo ─────────//──────── Amsterdam
 (primary)                   (secondary, now isolated)

During partition:

Both regions continue accepting writes to crdt: and lww: tiers.
strong: and relational: tiers: São Paulo accepts writes; Amsterdam
blocks writes (no quorum) and serves stale reads.
After healing: anti-entropy reconciles crdt: and lww: tiers.
strong: lag is bounded by the Raft replay time (expected < 5 minutes for typical partition durations).

6.2 Splitbrain (network partition + primary reelection)

A split-brain occurs if Amsterdam elects a new primary for a strong: shard while São Paulo still considers itself primary. This is prevented by:

Raft quorum requiring a strict majority of voters. With 3 voters (2 SP + 1
AMS), AMS cannot form a quorum alone.
For 2-node setups (1 SP + 1 AMS), a *itness node*(stateless, runs Raft
voting only) in a third datacenter (or cloud) prevents split-brain.

6.3 Clock skew attack / misconfiguration

A clock-skewed lww: write at t+10s causes all writes at t+0 to t+10s to be silently discarded on that key, even from correct nodes. Mitigation:

2~~second read~~only threshold (§4.1) catches runaway clock drift.
HLC watermark gossip limits the window of opportunity.
Per~~key write timestamps are exposed via `debug kv get -~~with-meta`, allowing
forensic investigation.

7. Implementation Plan

Phase 1 — CRDT tier active-active (low effort, high value)

Enable writes to crdt: keys from any region gateway. The CRDT merge function already handles concurrent writes correctly. Only change needed: remove the "forward to primary" check for crdt: prefixes.

Estimated effort: 1–2 days. Unblocks: koder~~talk presence, koder~~kortex embedding upserts.

Phase 2 — LWW tier activeactive + antientropy

Implement HLC watermark gossip between gateway peers.
Extend anti-entropy to cover lww: keys (Merkle tree over HLC timestamps).
Update key routing to accept lww: writes locally.
Add catalog opt-in field + migration tooling.

Estimated effort: 2–3 weeks.

Phase 3 — Observability + per-tenant rollout

Prometheus metrics: replication lag per tier, HLC drift, anti-entropy repair
events.
Alert thresholds: lag > 2× budget → warn; lag > 5× budget → page.
Per-tenant rollout: enable crdt: tier first, then lww: after 2 weeks of
stability.

Estimated effort: 1 week.

Phase 4 (future RFC) — Strongconsistent activeactive

Distributed 2PC or Paxos across regions for strong: tier. This is the hard part (see RFC~~008 §3.2). Will be addressed when Koder has multi~~region presence with > 100 k ops/s cross-region write traffic. Out of scope for this RFC.

8. Non-Goals (v1)

Active-active for SQL (relational: tier).
Active-active for schema catalog mutations.
Active-active for authentication tokens (Koder ID strong: data).
Sub-millisecond WAN replication (physical law).
More than 2 primary regions simultaneously in Phase 1–2 (2-region only).

9. Open Questions

*itness node infrastructure* where does the Raft witness live? s.r1
already hosts the primary. Possible options: a small VPS in São Paulo, a Cloud VM in us~~east~~1, or a dedicated low-cost node in Europe. Decision needed before Phase 2 can be production-safe.

*LC gossip transport* piggyback on TiKV's existing replication stream,
or add a separate gRPC channel between gateways? Piggybacking is simpler but couples the two concerns.

*lww:` tier rollout sequence* which products move first? Proposed order:
koder-kortex (AI embeddings, highest write rate, LWW acceptable) → koder~~talk (messages, but needs strong read~~your-writes semantics — may need to stay strong: for now) → koder-flow (CI state).

*onflict tombstones* should lww: deletes be tombstoned with HLC
timestamps to prevent resurrection by a replicated write arriving after the delete? Current LWW implementation uses writes~~win~~over-deletes. This may need to be reversed for some use cases.