Kdb RFC 008 multi region replication
RFC008 — kdbnext: Multi-Region Replication
| Field | Value |
|---|---|
| Status | *raft*(2026 |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Related | RFC |
| Target module | infra/data/kdb/ |
1. Summary
This RFC specifies multiregion replication for kdbnext. The goal is to allow the Koder data layer to operate across two or more geographic data centers (e.g., São Paulo + Amsterdam) with:
- *synchronous cross-region replication* writes committed in the primary
region are propagated to secondaries within a defined lag budget.
- *ocal reads* clients in a given region are served from the nearest
replica without round-tripping to another continent.
- *utomatic failover* if the primary region becomes unavailable, a
secondary is promoted with a target RTO of < 30 seconds.
- *onflict resolution* concurrent writes from different regions are
reconciled via CRDT merge (for CRDT
typed values, RFC001 §7 + Ticket #141) or lastwritewins (for non-CRDT payloads).
Scope: *ingle-primary*(one designated primary region per shard at any time), with a roadmap to activeactive multiprimary in a later RFC.
2. Motivation
The Koder platform currently runs entirely in São Paulo (s.r1). Three pressures drive multi-region:
- *GPD data residency (Brazil)* Brazilian-resident user data must stay in
Brazil. As international customers join, their data may need to stay in their home jurisdiction. Per-tenant region assignment solves this.
- *atency* European Koder customers would observe > 250 ms RTT to São Paulo.
A Amsterdam replica cuts that to < 20 ms for reads.
- *vailability SLA* RPO < 15 minutes and RTO < 30 seconds requires the
ability to failover across data centers. A single-region cluster cannot meet this SLA for outage scenarios beyond a single rack.
3. Topology: Single-Primary with Async Secondaries
3.1 Region roles
At any given time, each key range (TiKV region / shard) has:
- * primary region* accepts writes and is the authoritative source of truth
for that shard. Internally the TiKV Raft group leader lives here.
- * secondary regions*(N ≥ 0): receive async replicated writes from the
primary. Serve reads at potentially stale data (within the lag budget).
Each tenant has an assigned primary region recorded in the Catalog:
catalog.tenants.primary_region = "sa-east-1" -- São Paulo defaultWrites for a tenant always go to that tenant's primary region gateway. The gateway proxy in other regions forwards tenant writes to the correct primary gateway, adding one extra roundtrip for crossregion writers. This is acceptable: Koder's products are primarily data-resident in the user's home region.
3.2 Why not activeactive multiprimary?
Activeactive multiprimary (accepting writes in every region simultaneously) requires distributed transactions across regions (2PC or Paxos across WAN), which introduces:
- 50–250 ms extra latency per write (cross-region RTT for consensus).
- Complex conflict resolution for non-CRDT types.
- Significantly harder correctness proofs.
Given Koder's current scale (< 1M tenants as of this RFC), single-primary is sufficient and operationally simpler. The CRDT work (Ticket #141, data/crdt) positions us to move to activeactive for CRDTtyped data later without disruption, since CRDT merge is the same operation whether conflicts arise from concurrent clients or concurrent region writes.
4. Replication Mechanism
4.1 TiKV cross-region replication
kdbnext uses TiKV as its KV substrate. TiKV's Raftbased replication runs within a single region today. For cross-region replication, we add a *iKV Follower Replication*tier: each shard's Raft group gains follower replicas in secondary regions. These learner replicas do not participate in quorum (writes complete once the intra-region quorum is satisfied) but they receive the Raft log asynchronously.
Configuration: each TiKV store in secondary regions is labeled with datacenter-replica: true, and TiFlash-style async replication sends the Raft log entries to secondaries after the primary quorum commits.
4.2 Replication log entry format
Each committed Raft entry arriving at a secondary replica is replayed via the kdb-record merge engine:
- If the key matches a CRDT prefix (key prefix
crdt:), the entry is handedto the
KdbCrdtStore::cas_mergepath (seeinfra/data/crdt/crates/koder-crdt). - Otherwise, the entry is applied as a last
writewins PUT (timestamp =Raft index, globally monotonic within a shard).
This guarantees that CRDT values converge under concurrent writes from multiple regions, and non-CRDT values use LWW (most recently committed write in the Raft log wins).
4.3 Replication lag budget per product
Acceptable replication lag is a product-level SLA, not a global one:
| Product | Lag Budget | Rationale |
|---|---|---|
| koder-talk | ≤ 500 ms | Chat messages must appear quickly on secondary reads |
| koder-flow | ≤ 5 s | CI/CD state, repos — eventual consistency ok |
| koder-id | ≤ 200 ms | Auth tokens used globally; stale reads dangerous |
| koder-mail | ≤ 2 s | Email threading; moderate staleness acceptable |
| kodersign | ≤ 30 s | Legal docs infrequently accessed; low urgency |
| koder-kortex | ≤ 10 s | AI context store — near-realtime preferred |
These budgets are enforced by the backpressure policy (§7) and monitored via Prometheus metrics (§8).
5. Conflict Resolution
5.1 CRDT-typed values
For keys in the crdt: namespace ("{ns}:crdt:{type}:{entity_id}"), conflict resolution is handled by the data/crdt library:
- *Counter* pairwise max — both writes advance the counter.
- *wwRegister* HLC ordering — the write with the higher HLC timestamp wins.
- *rSet* union of adds/removes — concurrent adds win over removes.
- *ectorClock* pairwise max — causally safe.
These operations are idempotent and order-independent (commutative, associative, idempotent). Crossregion concurrent writes never produce splitbrain for CRDT values, because the CRDT merge function can be applied at any point to reconcile diverged replicas.
5.2 NonCRDT values: LastWrite-Wins
For non-CRDT key prefixes (record data, schema catalog, etc.), conflict resolution uses a *ybrid Logical Clock (HLC) lastwritewins*policy:
- Each write carries an 8-byte HLC timestamp prefix (same format as
kdb-record'sMergeStrategy::LastWriteWins). - At merge time, the write with the higher HLC wins.
- The HLC includes: physical_ms (NTP
synced) + logical (tiebreak counter) +node_id (tie
break within a millisecond). See `infradatacrdtcrateskodercrdtsrchlc.rs`.
*LC accuracy requirement* all kdb-gateway nodes must keep their clocks within *± 500 ms*of UTC. Clock drift beyond this threshold causes NTP to flag the node, and the gateway should refuse writes until the clock is re-synchronized. This is enforced by a clock health check in the gateway startup probe.
5.3 Schema catalog conflicts
The schema catalog (tenanttableschema definitions) is control-plane data and requires *trong consistency*
- Catalog mutations are always routed to the primary region, never applied
concurrently in two regions.
- The Raft quorum for the catalog shard spans all regions (Raft voters, not
learners). This means catalog writes have WAN latency, but catalog mutations are rare (< 1/min) and not latency-sensitive for end users.
6. Read Routing
6.1 Replica reads (default: eventual consistency)
By default, kdb-gateway in a secondary region serves reads from the local TiKV follower replicas. Clients get low-latency responses at the cost of potentially stale data (within the lag budget).
6.2 Strong reads (explicit opt-in)
Clients can request strong consistency via a gRPC metadata header:
kdb-read-consistency: strongThe gateway then forwards the read to the primary region gateway, which reads from the Raft leader. Latency increases by one WAN round-trip, but the client is guaranteed to see all committed writes.
6.3 Client API
The kdb-next pgwire layer exposes consistency preference via a session parameter:
SET kdb.read_consistency = 'strong'; -- guarantees freshness
SET kdb.read_consistency = 'eventual'; -- default; uses local replicaThe kdbcli will expose this as `readconsistency strong|eventual`.
7. Backpressure and Circuit Breaker
When a secondary replica falls behind the lag budget, the backpressure policy kicks in:
- *arn* at 50% of the lag budget, the gateway emits a
WARNlog entry andthe Prometheus metric
kdb_replication_lag_seconds{region}rises. - *edirect* at 100% of the lag budget, the gateway redirects read requests
for affected tenants to the primary region gateway (adding WAN latency but ensuring freshness). The redirect is transparent to the client (HTTP 307 for HTTP clients; connection re-routing for pgwire clients).
- *ircuit breaker* if the replica is > 5× the lag budget behind (extreme
lag), the replica is marked
degradedand removed from the read pool until it catches up. An alert fires to the on-call team.
The redirect threshold is configurable per product via the tenant catalog:
{ "replication_lag_budget_ms": 500, "backpressure_redirect": true }8. Failover Policy
8.1 Failure detection
Each region's kdb-gateway cluster sends *eartbeats*to every other region every * second*via a dedicated health endpoint (GET /api/v1/region-health). A region is considered unavailable when:
- 3 consecutive heartbeats are missed (3 seconds), AND
- A quorum of at least
ceil(N/2)other regions agree the region is unreachable.
The quorum requirement prevents split-brain from network partitions where only some regions can reach the primary.
8.2 Automatic failover
On primary region failure detection:
- *D (TiKV Placement Driver) failover*(< 5 s): TiKV PD detects leader
loss and elects a new Raft leader within the surviving replica set.
- *enant catalog update*(< 10 s): the surviving region with the most
up
todate Raft log is promoted to primary. The catalog'stenants.primary_regionis updated via an emergency Raft proposal. - *NS / load balancer update*(< 15 s): the Koder infrastructure updates
the kdb-gateway DNS records to route new connections to the promoted region.
- *lient reconnect*(< 5 s): kdb-gateway clients (talkd, flowd, etc.) use
exponential backoff with jitter (max 5 s) and reconnect automatically.
*arget RTO* < 30 seconds endtoend for application clients. *arget RPO* < 15 minutes (maximum potential data loss = max replication lag + timetodetect).
8.3 Manual failover
The kdbctl region CLI (Phase 8.1 implementation ticket) will support:
kdbctl region failover --from sa-east-1 --to eu-west-1 --tenant acme
kdbctl region promote eu-west-1 # promotes entire region for all tenants
kdbctl region demote sa-east-1 # marks a region as secondary-only8.4 Rollback after failover
After the primary region recovers, it rejoins as a *econdary*(not automatically repromoted) to avoid doublefailover instability. An operator must explicitly run kdbctl region promote sa-east-1 to restore the original topology.
9. Cross-Region TLS (mTLS)
All inter-region communication (Raft replication, health heartbeats, read redirects, write forwarding) uses *utual TLS (mTLS)*
- Each region's gateway cluster has a dedicated TLS keypair issued by the Koder
internal CA (
infra/certs/koder-internal-ca). - Certificates rotate every *0 days*via cert-manager.
- TLS 1.3 minimum; cipher suite restricted to ECDHE+AES
256GCM. - CRL / OCSP stapling is mandatory for all inter-region endpoints.
The region identity (SANs in the TLS cert) encodes the region label: kdb-gateway.sa-east-1.internal.koder.dev
The internal CA is separate from the public CA (Let's Encrypt) used for customer-facing endpoints.
10. Observability
10.1 Prometheus metrics
The kdbgateway exports the following perregion metrics:
| Metric | Type | Labels |
|---|---|---|
kdb_replication_lag_seconds |
Gauge | region, tenant |
kdb_replication_entries_queued |
Gauge | region |
kdb_replication_bytes_per_second |
Gauge | region |
kdb_failover_total |
Counter | from_region, to_region |
kdb_read_redirect_total |
Counter | region, reason |
kdb_cross_region_rtt_seconds |
Histogram | from, to |
kdb_hlc_drift_milliseconds |
Gauge | node |
10.2 Alerting rules
The following Prometheus alerting rules are required:
groups:
- name: kdb-multi-region
rules:
- alert: KdbReplicationLagHigh
expr: kdb_replication_lag_seconds > 5
for: 30s
labels: { severity: warning }
annotations:
summary: "kdb replication lag > 5s in {{ $labels.region }}"
- alert: KdbReplicationLagCritical
expr: kdb_replication_lag_seconds > 30
for: 10s
labels: { severity: critical }
annotations:
summary: "kdb replication lag > 30s — circuit breaker likely"
- alert: KdbRegionUnreachable
expr: kdb_cross_region_rtt_seconds == 0
for: 15s
labels: { severity: critical }
annotations:
summary: "kdb region {{ $labels.to }} unreachable from {{ $labels.from }}"
- alert: KdbHLCDriftHigh
expr: kdb_hlc_drift_milliseconds > 500
for: 60s
labels: { severity: warning }
annotations:
summary: "kdb node {{ $labels.node }} clock drift > 500ms — NTP issue?"11. Implementation Roadmap
Multi-region support is designed to be incremental. Each phase is gated by the previous phase's acceptance criteria.
Phase 8.1 — Region metadata (4 weeks)
- Add
primary_regionfield to tenant catalog records. - Implement
kdbctl region list|assign|promote|demoteCLI commands. - Deploy region health heartbeat endpoints.
- Deploy cross-region mTLS infrastructure.
- *ate* heartbeat latency < 5 ms p99 within each region pair; mTLS cert rotation works end
toend.
Phase 8.2 — Async replication (6 weeks)
- Add TiKV learner replicas in the secondary region (Amsterdam).
- Implement Raft log shipping and apply loop in kdb-gateway.
- Export replication lag metrics to Prometheus + Grafana dashboard.
- Implement backpressure redirect for tenants exceeding lag budget.
- *ate* replication lag < 1 s p99 under normal load; redirect tested.
Phase 8.3 — Read routing (3 weeks)
- Implement
kdb.read_consistencysession parameter in pgwire layer. - Route eventual-consistency reads to local replica; strong reads to primary.
- kdb
cli: add `readconsistency` flag. - *ate* eventual reads served locally (RTT < 5 ms intra-region); strong reads correctly forwarded.
Phase 8.4 — Automatic failover (6 weeks)
- Implement failure detection (quorum
based, 3heartbeat threshold). - Implement PD leader re-election integration.
- Implement DNS/LB update automation on failover.
- Implement
kdbctl region failoverand rollback commands. - Full failover drill: inject primary failure, measure RTO.
- *ate* RTO < 30 s, RPO < 15 min, verified by chaos test.
Phase 8.5 — CRDT activeactive (future, RFC009)
- Extend CRDT store to accept writes from any region simultaneously.
- Remove the single
primary restriction for CRDTtyped keys. - This phase requires RFC
009 (activeactive CRDT replication).
12. Open Questions
- *iKV learner vs. async follower* should we use TiKV's learner replicas
(official async replication path) or a custom WAL-shipping mechanism? Recommendation: TiKV learners — they receive first-party support and have known performance characteristics. Custom WAL shipping is only justified if TiKV learners prove insufficient after benchmarks.
- *msterdam provider* OVH Amsterdam vs. Hetzner Falkenstein? Infrastructure
team to decide based on cost, latency to São Paulo, and SLA. Preliminary data: Hetzner Falkenstein
SAOPAULO RTT ≈ 170–190 ms.
- *er
tenant primary region vs. pershard primary region* tenant-levelassignment is simpler to reason about but causes hot shards if many tenants share a shard. Consider hybrid: per
tenant default with shardlevel override for hot tenants.
- *ctive-active timeline* the CRDT groundwork (Ticket #141) is ready.
The question is whether active
active for nonCRDT keys is needed before the Koder platform exceeds 10M tenants. Current growth rate: 0 to target of 100M over 10+ years. Recommendation: defer to RFC-009.
13. Security Considerations
- *o plaintext inter-region traffic* mTLS is mandatory (§9). Any
misconfigured node that attempts to connect without a valid cert is rejected.
- *enant data residency* the Catalog's
primary_regionfield is theauthoritative record of where a tenant's data lives. LGPD compliance requires Brazilian
resident tenants to have `primary_region = "saeast-1"`. Cross-region replication copies data to secondary regions; if a tenant has a data residency restriction, secondary replication to non-qualifying regions must be disabled. This is tracked per-tenant in the catalog. - *udit log replication* audit log entries must be replicated to all regions
where the tenant's data is held (for local compliance querying), but must also be retained in the primary region for the jurisdiction's statutory period.
14. References
- RFC
001 §7 — Multiregion deferred rationale infra/data/crdt/crates/koder-crdt— CRDT runtime (GCounter, OrSet, VectorClock, LwwRegister)- Ticket #141 — CRDT integration with kdb-next
- Ticket #132 — Cluster sizing baseline
- TiKV documentation: Follower Read
- PingCAP: TiKV Placement Driver failure handling
- Hybrid Logical Clocks: Kulkarni et al., 2014
RFC-008 status: *Draft* Pending review by at least one Koder engineer before promotion to Proposed. Implementation tickets to be opened from the Phase 8.1–8.4 roadmap after acceptance.*