Kdb RFC 008 multi region replication

RFC008 — kdbnext: Multi-Region Replication

Field	Value
Status	raft(20260416) — awaiting engineering review
Author(s)	Rodrigo (with Claude as scribe)
Date	20260416
Related	RFC~~001 §7 (deferred multi~~region), Ticket #141 (CRDT integration)
Target module	`infra/data/kdb/`

1. Summary

This RFC specifies multi~~region replication for kdb~~next. The goal is to allow the Koder data layer to operate across two or more geographic data centers (e.g., São Paulo + Amsterdam) with:

*synchronous cross-region replication* writes committed in the primary
region are propagated to secondaries within a defined lag budget.
*ocal reads* clients in a given region are served from the nearest
replica without round-tripping to another continent.
*utomatic failover* if the primary region becomes unavailable, a
secondary is promoted with a target RTO of < 30 seconds.
*onflict resolution* concurrent writes from different regions are
reconciled via CRDT merge (for CRDT~~typed values, RFC~~001 §7 + Ticket #141) or last~~write~~wins (for non-CRDT payloads).

Scope: *ingle-primary*(one designated primary region per shard at any time), with a roadmap to active~~active multi~~primary in a later RFC.

2. Motivation

The Koder platform currently runs entirely in São Paulo (s.r1). Three pressures drive multi-region:

*GPD data residency (Brazil)* Brazilian-resident user data must stay in
Brazil. As international customers join, their data may need to stay in their home jurisdiction. Per-tenant region assignment solves this.
*atency* European Koder customers would observe > 250 ms RTT to São Paulo.
A Amsterdam replica cuts that to < 20 ms for reads.
*vailability SLA* RPO < 15 minutes and RTO < 30 seconds requires the
ability to failover across data centers. A single-region cluster cannot meet this SLA for outage scenarios beyond a single rack.

3. Topology: Single-Primary with Async Secondaries

3.1 Region roles

At any given time, each key range (TiKV region / shard) has:

* primary region* accepts writes and is the authoritative source of truth
for that shard. Internally the TiKV Raft group leader lives here.
* secondary regions*(N ≥ 0): receive async replicated writes from the
primary. Serve reads at potentially stale data (within the lag budget).

Each tenant has an assigned primary region recorded in the Catalog:

catalog.tenants.primary_region = "sa-east-1"  -- São Paulo default

Writes for a tenant always go to that tenant's primary region gateway. The gateway proxy in other regions forwards tenant writes to the correct primary gateway, adding one extra round~~trip for cross~~region writers. This is acceptable: Koder's products are primarily data-resident in the user's home region.

3.2 Why not activeactive multiprimary?

Active~~active multi~~primary (accepting writes in every region simultaneously) requires distributed transactions across regions (2PC or Paxos across WAN), which introduces:

50–250 ms extra latency per write (cross-region RTT for consensus).
Complex conflict resolution for non-CRDT types.
Significantly harder correctness proofs.

Given Koder's current scale (< 1M tenants as of this RFC), single-primary is sufficient and operationally simpler. The CRDT work (Ticket #141, data/crdt) positions us to move to active~~active for CRDT~~typed data later without disruption, since CRDT merge is the same operation whether conflicts arise from concurrent clients or concurrent region writes.

4. Replication Mechanism

4.1 TiKV cross-region replication

kdb~~next uses TiKV as its KV substrate. TiKV's Raft~~based replication runs within a single region today. For cross-region replication, we add a *iKV Follower Replication*tier: each shard's Raft group gains follower replicas in secondary regions. These learner replicas do not participate in quorum (writes complete once the intra-region quorum is satisfied) but they receive the Raft log asynchronously.

Configuration: each TiKV store in secondary regions is labeled with datacenter-replica: true, and TiFlash-style async replication sends the Raft log entries to secondaries after the primary quorum commits.

4.2 Replication log entry format

Each committed Raft entry arriving at a secondary replica is replayed via the kdb-record merge engine:

If the key matches a CRDT prefix (key prefix crdt:), the entry is handed
to the KdbCrdtStore::cas_merge path (see infra/data/crdt/crates/koder-crdt).
Otherwise, the entry is applied as a last~~write~~wins PUT (timestamp =
Raft index, globally monotonic within a shard).

This guarantees that CRDT values converge under concurrent writes from multiple regions, and non-CRDT values use LWW (most recently committed write in the Raft log wins).

4.3 Replication lag budget per product

Acceptable replication lag is a product-level SLA, not a global one:

Product	Lag Budget	Rationale
koder-talk	≤ 500 ms	Chat messages must appear quickly on secondary reads
koder-flow	≤ 5 s	CI/CD state, repos — eventual consistency ok
koder-id	≤ 200 ms	Auth tokens used globally; stale reads dangerous
koder-mail	≤ 2 s	Email threading; moderate staleness acceptable
kodersign	≤ 30 s	Legal docs infrequently accessed; low urgency
koder-kortex	≤ 10 s	AI context store — near-realtime preferred

These budgets are enforced by the backpressure policy (§7) and monitored via Prometheus metrics (§8).

5. Conflict Resolution

5.1 CRDT-typed values

For keys in the crdt: namespace ("{ns}:crdt:{type}:{entity_id}"), conflict resolution is handled by the data/crdt library:

*Counter* pairwise max — both writes advance the counter.
*wwRegister* HLC ordering — the write with the higher HLC timestamp wins.
*rSet* union of adds/removes — concurrent adds win over removes.
*ectorClock* pairwise max — causally safe.

These operations are idempotent and order-independent (commutative, associative, idempotent). Cross~~region concurrent writes never produce split~~brain for CRDT values, because the CRDT merge function can be applied at any point to reconcile diverged replicas.

5.2 NonCRDT values: LastWrite-Wins

For non-CRDT key prefixes (record data, schema catalog, etc.), conflict resolution uses a *ybrid Logical Clock (HLC) last~~write~~wins*policy:

Each write carries an 8-byte HLC timestamp prefix (same format as
kdb-record's MergeStrategy::LastWriteWins).
At merge time, the write with the higher HLC wins.
The HLC includes: physical_ms (NTP~~synced) + logical (tie~~break counter) +
node_id (tie~~break within a millisecond). See `infradatacrdtcrateskoder~~crdtsrchlc.rs`.

*LC accuracy requirement* all kdb-gateway nodes must keep their clocks within *� 500 ms*of UTC. Clock drift beyond this threshold causes NTP to flag the node, and the gateway should refuse writes until the clock is re-synchronized. This is enforced by a clock health check in the gateway startup probe.

5.3 Schema catalog conflicts

The schema catalog (tenanttableschema definitions) is control-plane data and requires *trong consistency*

Catalog mutations are always routed to the primary region, never applied
concurrently in two regions.
The Raft quorum for the catalog shard spans all regions (Raft voters, not
learners). This means catalog writes have WAN latency, but catalog mutations are rare (< 1/min) and not latency-sensitive for end users.

6. Read Routing

6.1 Replica reads (default: eventual consistency)

By default, kdb-gateway in a secondary region serves reads from the local TiKV follower replicas. Clients get low-latency responses at the cost of potentially stale data (within the lag budget).

6.2 Strong reads (explicit opt-in)

Clients can request strong consistency via a gRPC metadata header:

kdb-read-consistency: strong

The gateway then forwards the read to the primary region gateway, which reads from the Raft leader. Latency increases by one WAN round-trip, but the client is guaranteed to see all committed writes.

6.3 Client API

The kdb-next pgwire layer exposes consistency preference via a session parameter:

SET kdb.read_consistency = 'strong';   -- guarantees freshness
SET kdb.read_consistency = 'eventual'; -- default; uses local replica

The kdb~~cli will expose this as `read~~consistency strong|eventual`.

7. Backpressure and Circuit Breaker

When a secondary replica falls behind the lag budget, the backpressure policy kicks in:

*arn* at 50% of the lag budget, the gateway emits a WARN log entry and
the Prometheus metric kdb_replication_lag_seconds{region} rises.
*edirect* at 100% of the lag budget, the gateway redirects read requests
for affected tenants to the primary region gateway (adding WAN latency but ensuring freshness). The redirect is transparent to the client (HTTP 307 for HTTP clients; connection re-routing for pgwire clients).
*ircuit breaker* if the replica is > 5× the lag budget behind (extreme
lag), the replica is marked degraded and removed from the read pool until it catches up. An alert fires to the on-call team.

The redirect threshold is configurable per product via the tenant catalog:

{ "replication_lag_budget_ms": 500, "backpressure_redirect": true }

8. Failover Policy

8.1 Failure detection

Each region's kdb-gateway cluster sends *eartbeats*to every other region every * second*via a dedicated health endpoint (GET /api/v1/region-health). A region is considered unavailable when:

3 consecutive heartbeats are missed (3 seconds), AND
A quorum of at least ceil(N/2) other regions agree the region is unreachable.

The quorum requirement prevents split-brain from network partitions where only some regions can reach the primary.

8.2 Automatic failover

On primary region failure detection:

*D (TiKV Placement Driver) failover*(< 5 s): TiKV PD detects leader
loss and elects a new Raft leader within the surviving replica set.
*enant catalog update*(< 10 s): the surviving region with the most
uptodate Raft log is promoted to primary. The catalog's tenants.primary_region is updated via an emergency Raft proposal.
*NS / load balancer update*(< 15 s): the Koder infrastructure updates
the kdb-gateway DNS records to route new connections to the promoted region.
*lient reconnect*(< 5 s): kdb-gateway clients (talkd, flowd, etc.) use
exponential backoff with jitter (max 5 s) and reconnect automatically.

*arget RTO* < 30 seconds endtoend for application clients. *arget RPO* < 15 minutes (maximum potential data loss = max replication lag + timetodetect).

8.3 Manual failover

The kdbctl region CLI (Phase 8.1 implementation ticket) will support:

kdbctl region failover --from sa-east-1 --to eu-west-1 --tenant acme
kdbctl region promote eu-west-1  # promotes entire region for all tenants
kdbctl region demote sa-east-1   # marks a region as secondary-only

8.4 Rollback after failover

After the primary region recovers, it rejoins as a *econdary*(not automatically re~~promoted) to avoid double~~failover instability. An operator must explicitly run kdbctl region promote sa-east-1 to restore the original topology.

9. Cross-Region TLS (mTLS)

All inter-region communication (Raft replication, health heartbeats, read redirects, write forwarding) uses *utual TLS (mTLS)*

Each region's gateway cluster has a dedicated TLS keypair issued by the Koder
internal CA (infra/certs/koder-internal-ca).
Certificates rotate every *0 days*via cert-manager.
TLS 1.3 minimum; cipher suite restricted to ECDHE+AES~~256~~GCM.
CRL / OCSP stapling is mandatory for all inter-region endpoints.

The region identity (SANs in the TLS cert) encodes the region label: kdb-gateway.sa-east-1.internal.koder.dev

The internal CA is separate from the public CA (Let's Encrypt) used for customer-facing endpoints.

10. Observability

10.1 Prometheus metrics

The kdb~~gateway exports the following per~~region metrics:

Metric	Type	Labels
`kdb_replication_lag_seconds`	Gauge	`region`, `tenant`
`kdb_replication_entries_queued`	Gauge	`region`
`kdb_replication_bytes_per_second`	Gauge	`region`
`kdb_failover_total`	Counter	`from_region`, `to_region`
`kdb_read_redirect_total`	Counter	`region`, `reason`
`kdb_cross_region_rtt_seconds`	Histogram	`from`, `to`
`kdb_hlc_drift_milliseconds`	Gauge	`node`

10.2 Alerting rules

The following Prometheus alerting rules are required:

groups:
  - name: kdb-multi-region
    rules:
      - alert: KdbReplicationLagHigh
        expr: kdb_replication_lag_seconds > 5
        for: 30s
        labels: { severity: warning }
        annotations:
          summary: "kdb replication lag > 5s in {{ $labels.region }}"

      - alert: KdbReplicationLagCritical
        expr: kdb_replication_lag_seconds > 30
        for: 10s
        labels: { severity: critical }
        annotations:
          summary: "kdb replication lag > 30s — circuit breaker likely"

      - alert: KdbRegionUnreachable
        expr: kdb_cross_region_rtt_seconds == 0
        for: 15s
        labels: { severity: critical }
        annotations:
          summary: "kdb region {{ $labels.to }} unreachable from {{ $labels.from }}"

      - alert: KdbHLCDriftHigh
        expr: kdb_hlc_drift_milliseconds > 500
        for: 60s
        labels: { severity: warning }
        annotations:
          summary: "kdb node {{ $labels.node }} clock drift > 500ms — NTP issue?"

11. Implementation Roadmap

Multi-region support is designed to be incremental. Each phase is gated by the previous phase's acceptance criteria.

Phase 8.1 — Region metadata (4 weeks)

Add primary_region field to tenant catalog records.
Implement kdbctl region list|assign|promote|demote CLI commands.
Deploy region health heartbeat endpoints.
Deploy cross-region mTLS infrastructure.
*ate* heartbeat latency < 5 ms p99 within each region pair; mTLS cert rotation works endtoend.

Phase 8.2 — Async replication (6 weeks)

Add TiKV learner replicas in the secondary region (Amsterdam).
Implement Raft log shipping and apply loop in kdb-gateway.
Export replication lag metrics to Prometheus + Grafana dashboard.
Implement backpressure redirect for tenants exceeding lag budget.
*ate* replication lag < 1 s p99 under normal load; redirect tested.

Phase 8.3 — Read routing (3 weeks)

Implement kdb.read_consistency session parameter in pgwire layer.
Route eventual-consistency reads to local replica; strong reads to primary.
kdb~~cli: add `read~~consistency` flag.
*ate* eventual reads served locally (RTT < 5 ms intra-region); strong reads correctly forwarded.

Phase 8.4 — Automatic failover (6 weeks)

Implement failure detection (quorum~~based, 3~~heartbeat threshold).
Implement PD leader re-election integration.
Implement DNS/LB update automation on failover.
Implement kdbctl region failover and rollback commands.
Full failover drill: inject primary failure, measure RTO.
*ate* RTO < 30 s, RPO < 15 min, verified by chaos test.

Phase 8.5 — CRDT activeactive (future, RFC009)

Extend CRDT store to accept writes from any region simultaneously.
Remove the single~~primary restriction for CRDT~~typed keys.
This phase requires RFC~~009 (active~~active CRDT replication).

12. Open Questions

*iKV learner vs. async follower* should we use TiKV's learner replicas
(official async replication path) or a custom WAL-shipping mechanism? Recommendation: TiKV learners — they receive first-party support and have known performance characteristics. Custom WAL shipping is only justified if TiKV learners prove insufficient after benchmarks.

*msterdam provider* OVH Amsterdam vs. Hetzner Falkenstein? Infrastructure
team to decide based on cost, latency to São Paulo, and SLA. Preliminary data: Hetzner Falkenstein~~SAO~~PAULO RTT ≈ 170–190 ms.

*er~~tenant primary region vs. per~~shard primary region* tenant-level
assignment is simpler to reason about but causes hot shards if many tenants share a shard. Consider hybrid: per~~tenant default with shard~~level override for hot tenants.

*ctive-active timeline* the CRDT groundwork (Ticket #141) is ready.
The question is whether active~~active for non~~CRDT keys is needed before the Koder platform exceeds 10M tenants. Current growth rate: 0 to target of 100M over 10+ years. Recommendation: defer to RFC-009.

13. Security Considerations

*o plaintext inter-region traffic* mTLS is mandatory (§9). Any
misconfigured node that attempts to connect without a valid cert is rejected.
*enant data residency* the Catalog's primary_region field is the
authoritative record of where a tenant's data lives. LGPD compliance requires Brazilian~~resident tenants to have `primary_region = "sa~~east-1"`. Cross-region replication copies data to secondary regions; if a tenant has a data residency restriction, secondary replication to non-qualifying regions must be disabled. This is tracked per-tenant in the catalog.
*udit log replication* audit log entries must be replicated to all regions
where the tenant's data is held (for local compliance querying), but must also be retained in the primary region for the jurisdiction's statutory period.

14. References

RFC~~001 §7 — Multi~~region deferred rationale
infra/data/crdt/crates/koder-crdt — CRDT runtime (GCounter, OrSet, VectorClock, LwwRegister)
Ticket #141 — CRDT integration with kdb-next
Ticket #132 — Cluster sizing baseline
TiKV documentation: Follower Read
PingCAP: TiKV Placement Driver failure handling
Hybrid Logical Clocks: Kulkarni et al., 2014

RFC-008 status: *Draft* Pending review by at least one Koder engineer before promotion to Proposed. Implementation tickets to be opened from the Phase 8.1–8.4 roadmap after acceptance.*