
RFC-009 — kdb-next: Active-Active Multi-Primary Replication

| Field | Value |
|---|---|
| Status | *Draft*, gated behind RFC-002 multi-region. Audit 2026-04-26 shows P0–P5 of §7 unimplemented. |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-16 (drafted) · 2026-04-26 (audit) |
| Audit | infra/data/kdb/docs/RFC-009-AUDIT.md; tickets #519–#524 |
| Related | RFC-008 (single-primary multi-region), RFC-001 §7, tickets #041, #141 |
| Target module | infra/data/kdb/ |

1. Summary

RFC-008 established *single-primary multi-region replication*: one designated primary region per tenant, with async secondary replicas. This RFC extends kdb-next to support *active-active multi-primary*: writes are accepted in any region simultaneously, without forcing clients to cross the WAN to reach the tenant's primary region.

Active-active is not a single mode. kdb-next's data model spans namespaces with different consistency requirements. This RFC proposes a *per-namespace consistency tier* model:

| Tier (key prefix) | Data | Writes accepted in | Conflict resolution |
|---|---|---|---|
| crdt: | CRDT-typed KV | Any region | CRDT merge (already implemented, RFC-008 §5.1) |
| lww: | Mutable KV, eventual consistency | Any region | HLC last-write-wins |
| strong: | Catalog, auth tokens, financial | Primary region only | Raft quorum (same as today) |
| relational: | SQL tables | Primary region only | Not in v1 active-active |

Active-active v1 covers the crdt: and lww: tiers. The strong: and relational: tiers remain single-primary until a v2 RFC addresses distributed transactions (2PC/Paxos across WAN).


2. Motivation

RFC-008 deferred active-active explicitly:

*Active-active multi-primary ... requires distributed transactions across regions (2PC or Paxos across WAN), which introduces 50–250 ms extra latency per write.*

This is true for strong: and relational: tiers. However, a large fraction of kdb-next's data does not require strong consistency:

  • *koder-talk messages* (crdt: namespace): append-only, CRDT-mergeable. Chat messages sent from São Paulo while a European user simultaneously sends from Amsterdam should both commit locally and merge without conflict.
  • *koder-kortex embeddings* (lww: namespace): AI context records. Last-upsert-wins is acceptable; a stale embedding that is overwritten 200 ms later causes no user-visible error.
  • *koder-flow CI status* (lww: namespace): build/pipeline state. Concurrent status updates from different regions are resolved by HLC ordering.

Blocking these writes on the WAN round trip (RFC-008 topology) wastes ~250 ms per write for European clients writing to Brazilian-primary tenants. For koder-talk in particular, this visibly degrades the "feel" of the product.


3. Architecture

3.1 Write path per tier

Client → local gateway (any region)
         │
         ├─ key prefix = "crdt:"
         │   Write to local TiKV Raft group.
         │   Async replication to other regions via RFC-008 Raft learner.
         │   At merge time: CRDT merge (commutative, idempotent).
         │   Result: converges everywhere, no conflicts possible.
         │
         ├─ key prefix = "lww:"
         │   Write to local TiKV Raft group.
         │   HLC timestamp attached at commit time.
         │   Async replication to other regions.
         │   At merge time: higher HLC wins (deterministic, idempotent).
         │   Result: eventual consistency, stale reads possible within lag budget.
         │
         ├─ key prefix = "strong:"
         │   Forwarded to primary-region gateway (same as RFC-008 today).
         │   Raft commit in primary region. Async replication to secondaries.
         │   Reads from secondary require "kdb-read-consistency: strong" or
         │   may be stale within lag budget.
         │
         └─ key prefix = "relational:"
             Forwarded to primary-region gateway (no change from RFC-008).
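
The write path above amounts to a dispatch on the tier prefix. The following is a minimal sketch of that dispatch, not the gateway's actual code: `Route`, `route_write`, and the idea that the region and tenant segments have already been stripped from the key are assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    LOCAL = "commit in local region, replicate asynchronously"
    FORWARD = "forward to primary-region gateway"


# Tiers whose writes commit locally under active-active v1 (section 3.1).
LOCAL_WRITE_PREFIXES = ("crdt:", "lww:")


def route_write(tier_key: str) -> Route:
    """Decide where a write goes.

    `tier_key` is the key with the region and tenant segments already
    removed (an assumption of this sketch), e.g. "lww:builds:77".
    """
    if tier_key.startswith(LOCAL_WRITE_PREFIXES):
        return Route.LOCAL
    # strong:, relational:, and anything else go to the primary region.
    return Route.FORWARD
```

Defaulting every unrecognized prefix to `FORWARD` keeps the conservative RFC-008 behavior for any key the gateway does not positively identify as crdt: or lww:.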

3.2 Conflict detection for lww: tier

When an lww: write is replicated to a secondary that already has a write for the same key at a higher HLC (i.e., the local copy is "newer"), the incoming write is silently dropped. This is deterministic: both replicas converge to the same value (the one with the highest HLC).

Pathological case: two clients simultaneously write the same key from different regions with clocks drifting more than ±500 ms. Mitigation: HLC node_id tie-break ensures the same winner on all replicas even with equal physical timestamps. Clock health check in §4.1 enforces the drift bound.
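
A minimal sketch of the merge rule described in this section, under stated assumptions: the HLC is modeled as a `(physical_ms, logical, node_id)` triple compared lexicographically, and the higher `node_id` wins a tie. The RFC only says that `node_id` breaks ties, so the direction and the type names `HLC`, `Versioned`, and `lww_merge` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class HLC:
    # Field order defines comparison order: physical time first,
    # then the logical counter, then node_id as the final tie-break.
    physical_ms: int
    logical: int
    node_id: int


@dataclass(frozen=True)
class Versioned:
    hlc: HLC
    value: bytes


def lww_merge(local: Optional[Versioned], incoming: Versioned) -> Versioned:
    """Keep whichever version carries the higher HLC.

    An incoming write that is not strictly newer is dropped, which also
    makes re-delivery of the same write a no-op (idempotent).
    """
    if local is None or incoming.hlc > local.hlc:
        return incoming
    return local
```

Because the comparison is a total order over the triple, applying the same two writes in either order yields the same winner on every replica.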

3.3 Anti-entropy (convergence after partition)

After a network partition heals, diverged replicas must reconcile. Anti-entropy runs as a background job per TiKV region (shard):

  1. *Merkle digest*: each replica computes a Merkle tree over its key space (keyed by HLC timestamp). The digest is cheap: O(1) network to exchange the root hash, O(log N) CPU per incremental update.
  2. *Digest exchange*: leader and follower exchange digests every anti_entropy_interval (default: 60 s).
  3. *Repair scan*: mismatched sub-trees trigger a targeted key scan + CRDT merge / HLC LWW resolution.

This is the same anti-entropy pattern used in data/crdt (see koder-crdt/src/anti_entropy.rs). RFC-009 extends it to the lww: tier.
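
To illustrate the digest-and-repair idea, here is a simplified sketch that hashes keys into a fixed number of leaf buckets and builds a binary hash tree over them. It is not the code in koder-crdt/src/anti_entropy.rs; the bucket layout, `DEPTH`, and all function names are assumptions, and HLC timestamps are reduced to plain integers.

```python
import hashlib
from typing import Dict, List

DEPTH = 4  # 2**4 = 16 leaf buckets; a real shard would size this differently


def _h(*parts: bytes) -> bytes:
    """Hash a sequence of byte strings, length-prefixed to avoid ambiguity."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.digest()


def bucket_of(key: str) -> int:
    return int.from_bytes(_h(key.encode())[:2], "big") % (1 << DEPTH)


def build_tree(entries: Dict[str, int]) -> List[List[bytes]]:
    """entries maps key -> HLC timestamp. Returns levels, root level first."""
    buckets: List[list] = [[] for _ in range(1 << DEPTH)]
    for key, hlc in entries.items():
        buckets[bucket_of(key)].append((key, hlc))
    level = [_h(*(f"{k}@{t}".encode() for k, t in sorted(b))) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    levels.reverse()  # levels[0] holds the single root hash
    return levels


def mismatched_buckets(a: List[List[bytes]], b: List[List[bytes]]) -> List[int]:
    """Walk both trees top-down; return the leaf buckets whose hashes differ."""
    suspects = [0]
    for depth in range(len(a)):
        suspects = [i for i in suspects if a[depth][i] != b[depth][i]]
        if depth + 1 < len(a):
            suspects = [child for i in suspects for child in (2 * i, 2 * i + 1)]
    return suspects
```

When the roots match, the walk stops after a single comparison; otherwise only the differing sub-trees are descended, and the returned buckets are the ones a repair scan would need to visit.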


4. Consistency Model

4.1 Hybrid Logical Clock requirements

Each kdb-gateway node MUST maintain HLC drift within ±500 ms of UTC (per RFC-008 §5.2). For active-active, this bound is critical for lww: correctness.

New requirement for active-active: all gateways MUST exchange HLC watermarks with peer gateways in other regions on every write batch (piggyback on the replication stream). A gateway that observes a peer HLC more than 1 second ahead MUST:

  1. Advance its own HLC to peer_hlc + 1 logical tick.
  2. Log a WARN hlc_drift_correction event.
  3. If drift exceeds 2 seconds, enter *read-only mode* (refuse new writes to lww: tier) until the clock is corrected.
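
The three rules above can be sketched as a small state update. This is an illustration under assumptions, not the gateway implementation: drift is taken to be the peer's physical component minus the local one, and `GatewayClock` and `observe_peer` are hypothetical names. Leaving read-only mode once the clock is corrected is not modeled.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("kdb.gateway.hlc")

CORRECTION_THRESHOLD_MS = 1_000  # peer more than 1 s ahead: correct and warn
READ_ONLY_THRESHOLD_MS = 2_000   # drift above 2 s: stop accepting lww: writes


@dataclass
class GatewayClock:
    physical_ms: int
    logical: int = 0
    lww_read_only: bool = False

    def observe_peer(self, peer_physical_ms: int, peer_logical: int) -> None:
        """Apply the drift rules to a watermark received from a peer gateway."""
        drift_ms = peer_physical_ms - self.physical_ms
        if drift_ms <= CORRECTION_THRESHOLD_MS:
            return
        # Rule 1: advance to the peer's HLC plus one logical tick.
        self.physical_ms = peer_physical_ms
        self.logical = peer_logical + 1
        # Rule 2: emit the warning event.
        log.warning("hlc_drift_correction drift_ms=%d", drift_ms)
        # Rule 3: beyond 2 s, refuse new lww: writes.
        if drift_ms > READ_ONLY_THRESHOLD_MS:
            self.lww_read_only = True
```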

4.2 Isolation guarantees per tier

| Tier | Read isolation | Write isolation |
|---|---|---|
| crdt: | Causal consistency (HLC-ordered) | Concurrent writes always merge |
| lww: | Eventual consistency | Concurrent writes: HLC LWW |
| strong: | Linearizable (Raft primary) | Serializable (Raft primary) |
| relational: | Read Committed (per RFC-004) | Serializable (Raft primary) |

The crdt: and lww: tiers do *not* provide serializability. Applications using these tiers must be designed for eventual consistency (optimistic UI updates, conflict-tolerant merges).

4.3 Anomalies acknowledged

  1. *Read-your-writes*: NOT guaranteed for lww: tier across regions. A client that writes in São Paulo then reads from Amsterdam may see a stale value for up to one lag budget period. Clients that need read-your-writes MUST use kdb-read-consistency: strong.
  2. *Monotonic reads*: NOT guaranteed for lww: tier. Sticky sessions (routing a client's reads to the same gateway) mitigate this.
  3. *Causality*: guaranteed for crdt: tier (vector clock ordering). For lww: tier, HLC provides probabilistic, but not guaranteed, causality.


5. Data Model Changes

5.1 Tier prefix convention

kdb-next keys already use structured prefixes (RFC-001):

{region}:tenant:{id}:{key}

Active-active adds a mandatory second-level prefix indicating the consistency tier:

{region}:tenant:{id}:crdt:{type}:{entity_id}    ← already used
{region}:tenant:{id}:lww:{table}:{row_id}        ← new (active-active v1)
{region}:tenant:{id}:strong:{table}:{row_id}     ← maps to current non-prefixed
{region}:tenant:{id}:relational:{sql_row_key}    ← unchanged

Existing keys without a tier prefix are treated as strong: for backward compatibility.
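
A minimal sketch of how a key could be split under this convention, including the backward-compatible default. `ParsedKey` and `parse_key` are hypothetical names, and the sketch assumes that `:` never appears inside the region or tenant id segments.

```python
from typing import NamedTuple

TIERS = ("crdt", "lww", "strong", "relational")


class ParsedKey(NamedTuple):
    region: str
    tenant_id: str
    tier: str
    rest: str


def parse_key(key: str) -> ParsedKey:
    """Split {region}:tenant:{id}:{tier}:{rest}, defaulting the tier to strong."""
    parts = key.split(":", 3)
    if len(parts) != 4 or parts[1] != "tenant":
        raise ValueError(f"not a tenant key: {key!r}")
    region, _, tenant_id, remainder = parts
    head, sep, tail = remainder.partition(":")
    if sep and head in TIERS:
        return ParsedKey(region, tenant_id, head, tail)
    # Legacy key without a tier prefix: treated as strong: for compatibility.
    return ParsedKey(region, tenant_id, "strong", remainder)
```

One caveat with this approach: a legacy key whose first segment happens to be literally `crdt`, `lww`, `strong`, or `relational` would be classified by that segment, so existing key spaces would need to be checked for such collisions.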

5.2 Catalog entry per tenant

The tenant catalog (RFC-001 §3) gains new active-active fields (active_active_tiers and active_active_enabled_at):

{
  "tenant_id": 42,
  "primary_region": "sa-east-1",
  "active_active_tiers": ["crdt", "lww"],
  "active_active_enabled_at": "2026-05-01T00:00:00Z"
}

Active-active is opt-in per tenant. Tenants with no active_active_tiers configured have all writes forwarded to the primary region (RFC-008 behavior).
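
The opt-in check could look roughly like the sketch below. The function name `accepts_local_write` and the region name `eu-west-1` are assumptions for illustration; the sketch also ignores `active_active_enabled_at`, since the RFC does not say whether activation is time-gated.

```python
import json

# Tiers eligible for active-active in v1 (crdt: and lww: only).
ACTIVE_ACTIVE_V1_TIERS = frozenset({"crdt", "lww"})

CATALOG_ENTRY = """
{
  "tenant_id": 42,
  "primary_region": "sa-east-1",
  "active_active_tiers": ["crdt", "lww"],
  "active_active_enabled_at": "2026-05-01T00:00:00Z"
}
"""


def accepts_local_write(entry: dict, tier: str, gateway_region: str) -> bool:
    """True if a gateway in `gateway_region` may commit this tier locally."""
    if gateway_region == entry["primary_region"]:
        return True  # the primary region always accepts writes
    enabled = set(entry.get("active_active_tiers", []))
    # A tier must be both opted in and eligible for v1 active-active.
    return tier in (enabled & ACTIVE_ACTIVE_V1_TIERS)


entry = json.loads(CATALOG_ENTRY)
```

Intersecting with `ACTIVE_ACTIVE_V1_TIERS` means a catalog entry that mistakenly lists `strong` still falls back to forwarding rather than accepting a local write.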

5.3 Migration: strong: → lww:

Products migrating table data from strong: to lww: tier (to gain active-active writes) MUST:

  1. Open an RFC addendum or engineering note describing the consistency downgrade rationale for each table.
  2. Run a tenant-shard-migrate style migration to rekey existing records from strong: prefix to lww: prefix.
  3. Update all read/write paths in the product to use the new prefix.
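
As an illustration of the rekey in step 2, the sketch below computes a dry-run plan that maps each strong: key of one table to its lww: counterpart. It does not reflect the actual tenant-shard-migrate tooling; `lww_key_for` and `plan_rekey` are hypothetical, the store is modeled as a plain dict, and legacy keys without a tier prefix are deliberately left out of scope.

```python
from typing import Dict, Optional


def lww_key_for(key: str, table: str) -> Optional[str]:
    """Return the lww: key for a strong: record of `table`, else None."""
    parts = key.split(":", 3)
    if len(parts) != 4 or parts[1] != "tenant":
        return None
    region, _, tenant_id, remainder = parts
    prefix = f"strong:{table}:"
    if not remainder.startswith(prefix):
        return None
    row_id = remainder[len(prefix):]
    return f"{region}:tenant:{tenant_id}:lww:{table}:{row_id}"


def plan_rekey(store: Dict[str, bytes], table: str) -> Dict[str, str]:
    """Dry run: map affected old keys to new keys without modifying the store."""
    plan = {}
    for old_key in sorted(store):
        new_key = lww_key_for(old_key, table)
        if new_key is not None:
            plan[old_key] = new_key
    return plan
```

Producing a plan first makes it possible to review the affected keys against the engineering note from step 1 before any record is moved.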

6. Failure Scenarios

6.1 Region partition

                     network partition
São Paulo ─────────//──────── Amsterdam
 (primary)                   (secondary, now isolated)

During partition:

  • Both regions continue accepting writes to crdt: and lww: tiers.
  • strong: and relational: tiers: São Paulo accepts writes; Amsterdam blocks writes (no quorum) and serves stale reads.
  • After healing: anti-entropy reconciles crdt: and lww: tiers. strong: lag is bounded by the Raft replay time (expected < 5 minutes for typical partition durations).

6.2 Split-brain (network partition + primary re-election)

A split-brain occurs if Amsterdam elects a new primary for a strong: shard while São Paulo still considers itself primary. This is prevented by:

  • Raft quorum requiring a strict majority of voters. With 3 voters (2 SP + 1 AMS), AMS cannot form a quorum alone.
  • For 2-node setups (1 SP + 1 AMS), a *witness node* (stateless, runs Raft voting only) in a third datacenter (or cloud) prevents split-brain.
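
The quorum arithmetic behind both bullets is small enough to check directly. This is a worked illustration of strict-majority voting, with hypothetical helper names.

```python
def quorum_size(total_voters: int) -> int:
    """Strict majority of the voter set."""
    return total_voters // 2 + 1


def can_elect(reachable_voters: int, total_voters: int) -> bool:
    """A partition side can elect a leader only if it reaches a majority."""
    return reachable_voters >= quorum_size(total_voters)


# 3 voters (2 SP + 1 AMS): the isolated AMS voter is a minority.
ams_alone = can_elect(reachable_voters=1, total_voters=3)
sp_side = can_elect(reachable_voters=2, total_voters=3)

# 1 SP + 1 AMS + 1 witness: only the side that reaches the witness has 2 of 3.
with_witness = can_elect(reachable_voters=2, total_voters=3)
without_witness = can_elect(reachable_voters=1, total_voters=3)
```

Since two disjoint sides cannot both hold a strict majority of the same voter set, at most one of them can elect a primary.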

6.3 Clock skew attack / misconfiguration

A clock-skewed lww: write at t+10s causes all writes at t+0 to t+10s to be silently discarded on that key, even from correct nodes. Mitigation:

  • 2-second read-only threshold (§4.1) catches runaway clock drift.
  • HLC watermark gossip limits the window of opportunity.
  • Per-key write timestamps are exposed via `debug kv get -with-meta`, allowing forensic investigation.


7. Implementation Plan

Phase 1 — CRDT tier active-active (low effort, high value)

Enable writes to crdt: keys from any region gateway. The CRDT merge function already handles concurrent writes correctly. Only change needed: remove the "forward to primary" check for crdt: prefixes.

Estimated effort: 1–2 days. Unblocks: koder-talk presence, koder-kortex embedding upserts.

Phase 2 — LWW tier active-active + anti-entropy

  1. Implement HLC watermark gossip between gateway peers.
  2. Extend anti-entropy to cover lww: keys (Merkle tree over HLC timestamps).
  3. Update key routing to accept lww: writes locally.
  4. Add catalog opt-in field + migration tooling.

Estimated effort: 2–3 weeks.

Phase 3 — Observability + per-tenant rollout

  1. Prometheus metrics: replication lag per tier, HLC drift, anti-entropy repair events.
  2. Alert thresholds: lag > 2× budget → warn; lag > 5× budget → page.
  3. Per-tenant rollout: enable crdt: tier first, then lww: after 2 weeks of stability.

Estimated effort: 1 week.
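
The Phase 3 alert thresholds reduce to a two-step comparison against the lag budget. The sketch below shows only that classification logic; in practice it would presumably live in alerting rules rather than application code, and `lag_alert` is a hypothetical name.

```python
def lag_alert(lag_ms: float, budget_ms: float) -> str:
    """Classify replication lag against the lag budget.

    lag > 5x budget -> "page"; lag > 2x budget -> "warn"; otherwise "ok".
    """
    if budget_ms <= 0:
        raise ValueError("budget_ms must be positive")
    if lag_ms > 5 * budget_ms:
        return "page"
    if lag_ms > 2 * budget_ms:
        return "warn"
    return "ok"
```

Both comparisons are strict, so a lag of exactly 2× or 5× the budget stays in the lower severity band, matching the ">" in the thresholds.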

Phase 4 (future RFC) — Strong-consistent active-active

Distributed 2PC or Paxos across regions for strong: tier. This is the hard part (see RFC-008 §3.2). Will be addressed when Koder has multi-region presence with > 100 k ops/s cross-region write traffic. Out of scope for this RFC.


8. Non-Goals (v1)

  • Active-active for SQL (relational: tier).
  • Active-active for schema catalog mutations.
  • Active-active for authentication tokens (Koder ID strong: data).
  • Sub-millisecond WAN replication (physical law).
  • More than 2 primary regions simultaneously in Phase 1–2 (2-region only).

9. Open Questions

  1. *Witness node infrastructure*: where does the Raft witness live? s.r1 already hosts the primary. Possible options: a small VPS in São Paulo, a cloud VM in us-east-1, or a dedicated low-cost node in Europe. Decision needed before Phase 2 can be production-safe.
  2. *HLC gossip transport*: piggyback on TiKV's existing replication stream, or add a separate gRPC channel between gateways? Piggybacking is simpler but couples the two concerns.
  3. *lww: tier rollout sequence*: which products move first? Proposed order: koder-kortex (AI embeddings, highest write rate, LWW acceptable) → koder-talk (messages, but needs strong read-your-writes semantics, so it may need to stay strong: for now) → koder-flow (CI state).
  4. *Conflict tombstones*: should lww: deletes be tombstoned with HLC timestamps to prevent resurrection by a replicated write arriving after the delete? Current LWW implementation uses writes-win-over-deletes. This may need to be reversed for some use cases.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kdb-RFC-009-active-active-multi-primary.md