Kdb RFC 008 multi region replication

RFC008 — kdbnext: Multi-Region Replication

Field Value
Status *raft*(20260416) — awaiting engineering review
Author(s) Rodrigo (with Claude as scribe)
Date 20260416
Related RFC001 §7 (deferred multiregion), Ticket #141 (CRDT integration)
Target module infra/data/kdb/

1. Summary

This RFC specifies multiregion replication for kdbnext. The goal is to allow the Koder data layer to operate across two or more geographic data centers (e.g., São Paulo + Amsterdam) with:

  • *synchronous cross-region replication* writes committed in the primary

    region are propagated to secondaries within a defined lag budget.

  • *ocal reads* clients in a given region are served from the nearest

    replica without round-tripping to another continent.

  • *utomatic failover* if the primary region becomes unavailable, a

    secondary is promoted with a target RTO of < 30 seconds.

  • *onflict resolution* concurrent writes from different regions are

    reconciled via CRDT merge (for CRDTtyped values, RFC001 §7 + Ticket #141) or lastwritewins (for non-CRDT payloads).

Scope: *ingle-primary*(one designated primary region per shard at any time), with a roadmap to activeactive multiprimary in a later RFC.


2. Motivation

The Koder platform currently runs entirely in São Paulo (s.r1). Three pressures drive multi-region:

  1. *GPD data residency (Brazil)* Brazilian-resident user data must stay in

    Brazil. As international customers join, their data may need to stay in their home jurisdiction. Per-tenant region assignment solves this.

  2. *atency* European Koder customers would observe > 250 ms RTT to São Paulo.

    A Amsterdam replica cuts that to < 20 ms for reads.

  3. *vailability SLA* RPO < 15 minutes and RTO < 30 seconds requires the

    ability to failover across data centers. A single-region cluster cannot meet this SLA for outage scenarios beyond a single rack.


3. Topology: Single-Primary with Async Secondaries

3.1 Region roles

At any given time, each key range (TiKV region / shard) has:

  • * primary region* accepts writes and is the authoritative source of truth

    for that shard. Internally the TiKV Raft group leader lives here.

  • * secondary regions*(N ≥ 0): receive async replicated writes from the

    primary. Serve reads at potentially stale data (within the lag budget).

Each tenant has an assigned primary region recorded in the Catalog:

catalog.tenants.primary_region = "sa-east-1"  -- São Paulo default

Writes for a tenant always go to that tenant's primary region gateway. The gateway proxy in other regions forwards tenant writes to the correct primary gateway, adding one extra roundtrip for crossregion writers. This is acceptable: Koder's products are primarily data-resident in the user's home region.

3.2 Why not activeactive multiprimary?

Activeactive multiprimary (accepting writes in every region simultaneously) requires distributed transactions across regions (2PC or Paxos across WAN), which introduces:

  • 50–250 ms extra latency per write (cross-region RTT for consensus).
  • Complex conflict resolution for non-CRDT types.
  • Significantly harder correctness proofs.

Given Koder's current scale (< 1M tenants as of this RFC), single-primary is sufficient and operationally simpler. The CRDT work (Ticket #141, data/crdt) positions us to move to activeactive for CRDTtyped data later without disruption, since CRDT merge is the same operation whether conflicts arise from concurrent clients or concurrent region writes.


4. Replication Mechanism

4.1 TiKV cross-region replication

kdbnext uses TiKV as its KV substrate. TiKV's Raftbased replication runs within a single region today. For cross-region replication, we add a *iKV Follower Replication*tier: each shard's Raft group gains follower replicas in secondary regions. These learner replicas do not participate in quorum (writes complete once the intra-region quorum is satisfied) but they receive the Raft log asynchronously.

Configuration: each TiKV store in secondary regions is labeled with datacenter-replica: true, and TiFlash-style async replication sends the Raft log entries to secondaries after the primary quorum commits.

4.2 Replication log entry format

Each committed Raft entry arriving at a secondary replica is replayed via the kdb-record merge engine:

  1. If the key matches a CRDT prefix (key prefix crdt:), the entry is handed

    to the KdbCrdtStore::cas_merge path (see infra/data/crdt/crates/koder-crdt).

  2. Otherwise, the entry is applied as a lastwritewins PUT (timestamp =

    Raft index, globally monotonic within a shard).

This guarantees that CRDT values converge under concurrent writes from multiple regions, and non-CRDT values use LWW (most recently committed write in the Raft log wins).

4.3 Replication lag budget per product

Acceptable replication lag is a product-level SLA, not a global one:

Product Lag Budget Rationale
koder-talk ≤ 500 ms Chat messages must appear quickly on secondary reads
koder-flow ≤ 5 s CI/CD state, repos — eventual consistency ok
koder-id ≤ 200 ms Auth tokens used globally; stale reads dangerous
koder-mail ≤ 2 s Email threading; moderate staleness acceptable
kodersign ≤ 30 s Legal docs infrequently accessed; low urgency
koder-kortex ≤ 10 s AI context store — near-realtime preferred

These budgets are enforced by the backpressure policy (§7) and monitored via Prometheus metrics (§8).


5. Conflict Resolution

5.1 CRDT-typed values

For keys in the crdt: namespace ("{ns}:crdt:{type}:{entity_id}"), conflict resolution is handled by the data/crdt library:

  • *Counter* pairwise max — both writes advance the counter.
  • *wwRegister* HLC ordering — the write with the higher HLC timestamp wins.
  • *rSet* union of adds/removes — concurrent adds win over removes.
  • *ectorClock* pairwise max — causally safe.

These operations are idempotent and order-independent (commutative, associative, idempotent). Crossregion concurrent writes never produce splitbrain for CRDT values, because the CRDT merge function can be applied at any point to reconcile diverged replicas.

5.2 NonCRDT values: LastWrite-Wins

For non-CRDT key prefixes (record data, schema catalog, etc.), conflict resolution uses a *ybrid Logical Clock (HLC) lastwritewins*policy:

  • Each write carries an 8-byte HLC timestamp prefix (same format as

    kdb-record's MergeStrategy::LastWriteWins).

  • At merge time, the write with the higher HLC wins.
  • The HLC includes: physical_ms (NTPsynced) + logical (tiebreak counter) +

    node_id (tiebreak within a millisecond). See `infradatacrdtcrateskodercrdtsrchlc.rs`.

*LC accuracy requirement* all kdb-gateway nodes must keep their clocks within *± 500 ms*of UTC. Clock drift beyond this threshold causes NTP to flag the node, and the gateway should refuse writes until the clock is re-synchronized. This is enforced by a clock health check in the gateway startup probe.

5.3 Schema catalog conflicts

The schema catalog (tenanttableschema definitions) is control-plane data and requires *trong consistency*

  • Catalog mutations are always routed to the primary region, never applied

    concurrently in two regions.

  • The Raft quorum for the catalog shard spans all regions (Raft voters, not

    learners). This means catalog writes have WAN latency, but catalog mutations are rare (< 1/min) and not latency-sensitive for end users.


6. Read Routing

6.1 Replica reads (default: eventual consistency)

By default, kdb-gateway in a secondary region serves reads from the local TiKV follower replicas. Clients get low-latency responses at the cost of potentially stale data (within the lag budget).

6.2 Strong reads (explicit opt-in)

Clients can request strong consistency via a gRPC metadata header:

kdb-read-consistency: strong

The gateway then forwards the read to the primary region gateway, which reads from the Raft leader. Latency increases by one WAN round-trip, but the client is guaranteed to see all committed writes.

6.3 Client API

The kdb-next pgwire layer exposes consistency preference via a session parameter:

SET kdb.read_consistency = 'strong';   -- guarantees freshness
SET kdb.read_consistency = 'eventual'; -- default; uses local replica

The kdbcli will expose this as `readconsistency strong|eventual`.


7. Backpressure and Circuit Breaker

When a secondary replica falls behind the lag budget, the backpressure policy kicks in:

  1. *arn* at 50% of the lag budget, the gateway emits a WARN log entry and

    the Prometheus metric kdb_replication_lag_seconds{region} rises.

  2. *edirect* at 100% of the lag budget, the gateway redirects read requests

    for affected tenants to the primary region gateway (adding WAN latency but ensuring freshness). The redirect is transparent to the client (HTTP 307 for HTTP clients; connection re-routing for pgwire clients).

  3. *ircuit breaker* if the replica is > 5× the lag budget behind (extreme

    lag), the replica is marked degraded and removed from the read pool until it catches up. An alert fires to the on-call team.

The redirect threshold is configurable per product via the tenant catalog:

{ "replication_lag_budget_ms": 500, "backpressure_redirect": true }

8. Failover Policy

8.1 Failure detection

Each region's kdb-gateway cluster sends *eartbeats*to every other region every * second*via a dedicated health endpoint (GET /api/v1/region-health). A region is considered unavailable when:

  • 3 consecutive heartbeats are missed (3 seconds), AND
  • A quorum of at least ceil(N/2) other regions agree the region is unreachable.

The quorum requirement prevents split-brain from network partitions where only some regions can reach the primary.

8.2 Automatic failover

On primary region failure detection:

  1. *D (TiKV Placement Driver) failover*(< 5 s): TiKV PD detects leader

    loss and elects a new Raft leader within the surviving replica set.

  2. *enant catalog update*(< 10 s): the surviving region with the most

    uptodate Raft log is promoted to primary. The catalog's tenants.primary_region is updated via an emergency Raft proposal.

  3. *NS / load balancer update*(< 15 s): the Koder infrastructure updates

    the kdb-gateway DNS records to route new connections to the promoted region.

  4. *lient reconnect*(< 5 s): kdb-gateway clients (talkd, flowd, etc.) use

    exponential backoff with jitter (max 5 s) and reconnect automatically.

*arget RTO* < 30 seconds endtoend for application clients. *arget RPO* < 15 minutes (maximum potential data loss = max replication lag + timetodetect).

8.3 Manual failover

The kdbctl region CLI (Phase 8.1 implementation ticket) will support:

kdbctl region failover --from sa-east-1 --to eu-west-1 --tenant acme
kdbctl region promote eu-west-1  # promotes entire region for all tenants
kdbctl region demote sa-east-1   # marks a region as secondary-only

8.4 Rollback after failover

After the primary region recovers, it rejoins as a *econdary*(not automatically repromoted) to avoid doublefailover instability. An operator must explicitly run kdbctl region promote sa-east-1 to restore the original topology.


9. Cross-Region TLS (mTLS)

All inter-region communication (Raft replication, health heartbeats, read redirects, write forwarding) uses *utual TLS (mTLS)*

  • Each region's gateway cluster has a dedicated TLS keypair issued by the Koder

    internal CA (infra/certs/koder-internal-ca).

  • Certificates rotate every *0 days*via cert-manager.
  • TLS 1.3 minimum; cipher suite restricted to ECDHE+AES256GCM.
  • CRL / OCSP stapling is mandatory for all inter-region endpoints.

The region identity (SANs in the TLS cert) encodes the region label: kdb-gateway.sa-east-1.internal.koder.dev

The internal CA is separate from the public CA (Let's Encrypt) used for customer-facing endpoints.


10. Observability

10.1 Prometheus metrics

The kdbgateway exports the following perregion metrics:

Metric Type Labels
kdb_replication_lag_seconds Gauge region, tenant
kdb_replication_entries_queued Gauge region
kdb_replication_bytes_per_second Gauge region
kdb_failover_total Counter from_region, to_region
kdb_read_redirect_total Counter region, reason
kdb_cross_region_rtt_seconds Histogram from, to
kdb_hlc_drift_milliseconds Gauge node

10.2 Alerting rules

The following Prometheus alerting rules are required:

groups:
  - name: kdb-multi-region
    rules:
      - alert: KdbReplicationLagHigh
        expr: kdb_replication_lag_seconds > 5
        for: 30s
        labels: { severity: warning }
        annotations:
          summary: "kdb replication lag > 5s in {{ $labels.region }}"

      - alert: KdbReplicationLagCritical
        expr: kdb_replication_lag_seconds > 30
        for: 10s
        labels: { severity: critical }
        annotations:
          summary: "kdb replication lag > 30s — circuit breaker likely"

      - alert: KdbRegionUnreachable
        expr: kdb_cross_region_rtt_seconds == 0
        for: 15s
        labels: { severity: critical }
        annotations:
          summary: "kdb region {{ $labels.to }} unreachable from {{ $labels.from }}"

      - alert: KdbHLCDriftHigh
        expr: kdb_hlc_drift_milliseconds > 500
        for: 60s
        labels: { severity: warning }
        annotations:
          summary: "kdb node {{ $labels.node }} clock drift > 500ms — NTP issue?"

11. Implementation Roadmap

Multi-region support is designed to be incremental. Each phase is gated by the previous phase's acceptance criteria.

Phase 8.1 — Region metadata (4 weeks)

  • Add primary_region field to tenant catalog records.
  • Implement kdbctl region list|assign|promote|demote CLI commands.
  • Deploy region health heartbeat endpoints.
  • Deploy cross-region mTLS infrastructure.
  • *ate* heartbeat latency < 5 ms p99 within each region pair; mTLS cert rotation works endtoend.

Phase 8.2 — Async replication (6 weeks)

  • Add TiKV learner replicas in the secondary region (Amsterdam).
  • Implement Raft log shipping and apply loop in kdb-gateway.
  • Export replication lag metrics to Prometheus + Grafana dashboard.
  • Implement backpressure redirect for tenants exceeding lag budget.
  • *ate* replication lag < 1 s p99 under normal load; redirect tested.

Phase 8.3 — Read routing (3 weeks)

  • Implement kdb.read_consistency session parameter in pgwire layer.
  • Route eventual-consistency reads to local replica; strong reads to primary.
  • kdbcli: add `readconsistency` flag.
  • *ate* eventual reads served locally (RTT < 5 ms intra-region); strong reads correctly forwarded.

Phase 8.4 — Automatic failover (6 weeks)

  • Implement failure detection (quorumbased, 3heartbeat threshold).
  • Implement PD leader re-election integration.
  • Implement DNS/LB update automation on failover.
  • Implement kdbctl region failover and rollback commands.
  • Full failover drill: inject primary failure, measure RTO.
  • *ate* RTO < 30 s, RPO < 15 min, verified by chaos test.

Phase 8.5 — CRDT activeactive (future, RFC009)

  • Extend CRDT store to accept writes from any region simultaneously.
  • Remove the singleprimary restriction for CRDTtyped keys.
  • This phase requires RFC009 (activeactive CRDT replication).

12. Open Questions

  1. *iKV learner vs. async follower* should we use TiKV's learner replicas

    (official async replication path) or a custom WAL-shipping mechanism? Recommendation: TiKV learners — they receive first-party support and have known performance characteristics. Custom WAL shipping is only justified if TiKV learners prove insufficient after benchmarks.

  1. *msterdam provider* OVH Amsterdam vs. Hetzner Falkenstein? Infrastructure

    team to decide based on cost, latency to São Paulo, and SLA. Preliminary data: Hetzner FalkensteinSAOPAULO RTT ≈ 170–190 ms.

  1. *ertenant primary region vs. pershard primary region* tenant-level

    assignment is simpler to reason about but causes hot shards if many tenants share a shard. Consider hybrid: pertenant default with shardlevel override for hot tenants.

  1. *ctive-active timeline* the CRDT groundwork (Ticket #141) is ready.

    The question is whether activeactive for nonCRDT keys is needed before the Koder platform exceeds 10M tenants. Current growth rate: 0 to target of 100M over 10+ years. Recommendation: defer to RFC-009.


13. Security Considerations

  • *o plaintext inter-region traffic* mTLS is mandatory (§9). Any

    misconfigured node that attempts to connect without a valid cert is rejected.

  • *enant data residency* the Catalog's primary_region field is the

    authoritative record of where a tenant's data lives. LGPD compliance requires Brazilianresident tenants to have `primary_region = "saeast-1"`. Cross-region replication copies data to secondary regions; if a tenant has a data residency restriction, secondary replication to non-qualifying regions must be disabled. This is tracked per-tenant in the catalog.

  • *udit log replication* audit log entries must be replicated to all regions

    where the tenant's data is held (for local compliance querying), but must also be retained in the primary region for the jurisdiction's statutory period.


14. References

  • RFC001 §7 — Multiregion deferred rationale
  • infra/data/crdt/crates/koder-crdt — CRDT runtime (GCounter, OrSet, VectorClock, LwwRegister)
  • Ticket #141 — CRDT integration with kdb-next
  • Ticket #132 — Cluster sizing baseline
  • TiKV documentation: Follower Read
  • PingCAP: TiKV Placement Driver failure handling
  • Hybrid Logical Clocks: Kulkarni et al., 2014

RFC-008 status: *Draft* Pending review by at least one Koder engineer before promotion to Proposed. Implementation tickets to be opened from the Phase 8.1–8.4 roadmap after acceptance.*

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kdb-RFC-008-multi-region-replication.md