RFC-011: kdb-next Standby Read Replicas
| Field | Value |
|---|---|
| Status | *Draft* - single |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Audit | infra/data/kdb/docs/RFC-011-AUDIT.md |
| Related | RFC |
| Target module | infra/data/kdb/ |
1. Summary
kdb-next currently routes all reads through the primary TiKV region, which creates contention between OLTP workloads and heavy analytical / batch queries. This RFC defines *standby read replicas*: dedicated kdb-gateway nodes that serve reads from TiKV learner replicas, offloading the primary Raft quorum.
Design goals:
- *Zero write amplification*: replicas receive data via TiKV Raft (learner peers) and do not consume the WAL stream (RFC-010) unless used as PITR targets.
- *Tunable consistency*: clients choose between stale reads, snapshot reads, and strong (linearizable) reads per request.
- *Transparent routing*: the primary kdb-gateway proxies reads to the closest replica automatically, with fallback to primary on lag threshold breach.
- *Promotable standby*: a replica can be promoted to primary via a Raft learner-to-voter transition with sub-30 s RTO.
- *Observable lag*: per-replica lag metrics in LSN and wall-clock time, integrated with the Prometheus stack (§6).
2. Background: TiKV Raft Learners and Follower Reads
2.1 Raft learners
TiKV supports adding *learner* nodes to a Raft group. Learners receive the full Raft log from the leader and persist it locally but do not participate in the quorum vote. This means:
- Writes are never delayed waiting for learner acknowledgement.
- Learners replicate asynchronously and may lag behind the leader.
- Promotion to voter is a one-step PD command (`pd-ctl config ...`).
kdb-next will designate one learner peer per TiKV region for each read replica deployment. The PD scheduler places these learner stores in the read replica's datacenter zone.
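
To make the quorum point concrete, here is a minimal sketch of a commit check that counts voter acknowledgements only. The `Peer` type and `Committed` function are illustrative names, not TiKV or kdb-next APIs; real Raft tracks a match index per peer rather than a boolean.

```go
package main

import "fmt"

// Peer is one member of a Raft group; Learner peers replicate but do not vote.
type Peer struct {
	ID      string
	Learner bool
}

// Committed reports whether an entry acknowledged by the peers in acked has
// reached a majority of voters. Learner acknowledgements are ignored.
func Committed(peers []Peer, acked map[string]bool) bool {
	voters, votes := 0, 0
	for _, p := range peers {
		if p.Learner {
			continue
		}
		voters++
		if acked[p.ID] {
			votes++
		}
	}
	return voters > 0 && votes >= voters/2+1
}

func main() {
	group := []Peer{{ID: "v1"}, {ID: "v2"}, {ID: "v3"}, {ID: "l1", Learner: true}}

	// Two of three voters acknowledged; the learner has not caught up yet.
	fmt.Println(Committed(group, map[string]bool{"v1": true, "v2": true})) // true

	// One voter plus the learner is not a quorum.
	fmt.Println(Committed(group, map[string]bool{"v1": true, "l1": true})) // false
}
```

Adding or removing learners leaves the voter count unchanged, which is why a slow learner store cannot hold back write latency.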
2.2 TiKV Follower Read (ReadIndex)
TiKV implements Follower Read using the *ReadIndex* protocol:
- The follower sends a `ReadIndex` RPC to the leader.
- The leader responds with its current commit index (`read_index`).
- The follower waits until its applied index ≥ `read_index`.
- The follower serves the read from its local state.

This guarantees *linearizable* reads from a follower with a single RPC round trip to the leader (vs. two round trips for a full Raft read). The kdb-gateway uses the tikv_client_go API flag `kv.WithReplicaRead()` to opt into this mode per request.
For *stale reads* (allowed to be up to N milliseconds behind), TiKV offers a timestamp-based Stale Read feature that requires no leader round trip at all.
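
The four ReadIndex steps can be illustrated with a toy in-memory model. This is a sketch under simplifying assumptions: `Leader`, `Follower` and `FollowerRead` are hypothetical types, there is no network, and the follower returns an error where a real one would block until it has applied enough entries.

```go
package main

import (
	"errors"
	"fmt"
)

// Leader is a toy stand-in for a Raft leader; only the commit index matters here.
type Leader struct {
	CommitIndex uint64
}

// ReadIndex models the single RPC a follower sends before serving a read.
func (l *Leader) ReadIndex() uint64 { return l.CommitIndex }

// Follower holds local key-value state and the index it has applied so far.
type Follower struct {
	AppliedIndex uint64
	State        map[string]string
}

// ErrNotCaughtUp signals that the follower must apply more log entries
// before the read can be served.
var ErrNotCaughtUp = errors.New("follower applied index is behind read_index")

// FollowerRead asks the leader for its commit index, compares it with the
// local applied index, and only then reads local state.
func FollowerRead(l *Leader, f *Follower, key string) (string, error) {
	readIndex := l.ReadIndex()
	if f.AppliedIndex < readIndex {
		return "", ErrNotCaughtUp
	}
	return f.State[key], nil
}

func main() {
	leader := &Leader{CommitIndex: 10}
	follower := &Follower{AppliedIndex: 8, State: map[string]string{"k": "old"}}

	if _, err := FollowerRead(leader, follower, "k"); err != nil {
		fmt.Println("read deferred:", err)
	}

	// Simulate the follower applying entries 9 and 10.
	follower.State["k"] = "new"
	follower.AppliedIndex = 10
	v, _ := FollowerRead(leader, follower, "k")
	fmt.Println("read served:", v)
}
```

A stale read would skip the `ReadIndex` call entirely and accept whatever the follower has applied, which is where the latency saving comes from.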
3. Architecture

                       ┌─────────────────────────────────────────────────┐
                       │            Primary Datacenter (GRU)             │
                       │                                                 │
                       │  ┌──────────────┐      ┌──────────────────────┐ │
    Client ──────► │  │ kdb-gateway  │─────►│  TiKV cluster        │ │
                       │  │  (primary)   │      │  (leader + voters)   │ │
                       │  └──────┬───────┘      └─────────┬────────────┘ │
                       │         │ read proxy             │ Raft log     │
                       └─────────┼────────────────────────┼──────────────┘
                                 │                        │
                                 │                        │ Raft learner replication
                                 │                        ▼
                       ┌─────────┼───────────────────────────────────────┐
                       │         │      Read Replica Zone (GRU-2)        │
                       │         │                                       │
                       │  ┌──────▼───────┐      ┌──────────────────────┐ │
                       │  │ kdb-gateway  │─────►│ TiKV learner stores  │ │
                       │  │  (replica)   │      │ (read-only peers)    │ │
                       │  └──────────────┘      └──────────────────────┘ │
                       └─────────────────────────────────────────────────┘

The *read replica gateway* is a separate kdb-gateway binary configured in replica mode (`role = "replica"` in `gateway.toml`). It connects exclusively to the TiKV learner stores in its zone and does not accept write requests.
The *primary gateway* maintains a replica registry (§4.2) and proxies reads to replica gateways when the client's consistency requirement allows it.
4. Read Routing Protocol
4.1 Consistency levels per request
Clients express consistency requirements via a gRPC header or per-RPC option:
| Level | Behaviour | Latency impact |
|---|---|---|
| `STALE` | Serve from learner without ReadIndex check. | Minimal (no extra RPC) |
| `SNAPSHOT` | Serve from learner after ReadIndex confirmation. | +1 leader round-trip |
| `STRONG` | Route to primary Raft leader. | Primary only |
Default level: SNAPSHOT (balances freshness vs. offload).
The STALE level requires clients to specify max_staleness_ms; reads that cannot be served within this window fall back to SNAPSHOT.
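
As a rough illustration of how a gateway might decode these levels, the sketch below maps a level string and a staleness bound to an option struct. `ReadOptions` and `ParseReadOptions` are hypothetical names; the RFC does not yet define the header name or wire format, so only the defaulting and validation rules are taken from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// Consistency mirrors the three levels in the table above.
type Consistency int

const (
	Stale Consistency = iota
	Snapshot
	Strong
)

// ReadOptions is a hypothetical per-request option set; the field names are
// assumptions, not the kdb-gateway API.
type ReadOptions struct {
	Level          Consistency
	MaxStalenessMs int64 // only meaningful for Stale
}

// ParseReadOptions maps a level string to ReadOptions. An empty level falls
// back to SNAPSHOT (the documented default). STALE without a positive
// max_staleness_ms is rejected because the bound is mandatory.
func ParseReadOptions(level string, maxStalenessMs int64) (ReadOptions, error) {
	switch strings.ToUpper(strings.TrimSpace(level)) {
	case "", "SNAPSHOT":
		return ReadOptions{Level: Snapshot}, nil
	case "STRONG":
		return ReadOptions{Level: Strong}, nil
	case "STALE":
		if maxStalenessMs <= 0 {
			return ReadOptions{}, fmt.Errorf("STALE requires max_staleness_ms > 0")
		}
		return ReadOptions{Level: Stale, MaxStalenessMs: maxStalenessMs}, nil
	default:
		return ReadOptions{}, fmt.Errorf("unknown consistency level %q", level)
	}
}

func main() {
	opts, err := ParseReadOptions("stale", 250)
	fmt.Println(opts, err)

	_, err = ParseReadOptions("STALE", 0)
	fmt.Println(err)
}
```

Rejecting an unbounded `STALE` request at parse time keeps the routing code free of a special case for a missing bound.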
4.2 Replica registry
The primary gateway maintains an in-memory ReplicaRegistry:

    import (
        "sync"
        "time"
    )

    // ReplicaEntry is the primary's view of one replica gateway.
    type ReplicaEntry struct {
        ID       string    // replica gateway ID
        Addr     string    // gRPC endpoint
        HeadLSN  string    // last applied LSN (from heartbeat)
        LagMs    int64     // estimated lag in milliseconds
        Healthy  bool
        LastSeen time.Time
    }

    // ReplicaRegistry holds all known replicas, guarded by mu.
    type ReplicaRegistry struct {
        mu       sync.RWMutex
        replicas map[string]*ReplicaEntry
    }

Replica gateways push heartbeat probes to the primary every 1 second over a dedicated gRPC stream (`ReplicaHeartbeat`). The primary marks a replica unhealthy after 5 missed heartbeats (5 s).
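
The heartbeat bookkeeping described above could look roughly like this. It is a sketch, not the planned implementation: `RecordHeartbeat`, `Sweep` and `NewReplicaRegistry` are assumed method names, the gRPC stream is left out, and time is passed in explicitly so the expiry logic is easy to test.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type ReplicaEntry struct {
	ID       string
	LagMs    int64
	Healthy  bool
	LastSeen time.Time
}

type ReplicaRegistry struct {
	mu       sync.RWMutex
	replicas map[string]*ReplicaEntry
}

// unhealthyAfter is 5 missed heartbeats at the 1 s interval stated above.
const unhealthyAfter = 5 * time.Second

func NewReplicaRegistry() *ReplicaRegistry {
	return &ReplicaRegistry{replicas: make(map[string]*ReplicaEntry)}
}

// RecordHeartbeat upserts a replica and marks it healthy as of now.
func (r *ReplicaRegistry) RecordHeartbeat(id string, lagMs int64, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.replicas[id]
	if !ok {
		e = &ReplicaEntry{ID: id}
		r.replicas[id] = e
	}
	e.LagMs, e.Healthy, e.LastSeen = lagMs, true, now
}

// Sweep marks replicas unhealthy once no heartbeat arrived for unhealthyAfter.
func (r *ReplicaRegistry) Sweep(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, e := range r.replicas {
		if now.Sub(e.LastSeen) >= unhealthyAfter {
			e.Healthy = false
		}
	}
}

// HealthyReplicas returns copies so callers cannot mutate registry state.
func (r *ReplicaRegistry) HealthyReplicas() []ReplicaEntry {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []ReplicaEntry
	for _, e := range r.replicas {
		if e.Healthy {
			out = append(out, *e)
		}
	}
	return out
}

func main() {
	reg := NewReplicaRegistry()
	t0 := time.Now()
	reg.RecordHeartbeat("replica-a", 40, t0)

	reg.Sweep(t0.Add(3 * time.Second))
	fmt.Println(len(reg.HealthyReplicas())) // 1: only 3 s without a heartbeat

	reg.Sweep(t0.Add(6 * time.Second))
	fmt.Println(len(reg.HealthyReplicas())) // 0: more than 5 s of silence
}
```

A later heartbeat from the same replica flips it back to healthy, so a transient network blip does not need operator action.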
4.3 Read routing algorithm

    func RouteRead(req ReadRequest, registry ReplicaRegistry) GatewayTarget:
        if req.Consistency == STRONG:
            return Primary
        candidates = registry.HealthyReplicas()
        if len(candidates) == 0:
            return Primary  // no replicas available: fall back
        best = min(candidates, key=LagMs)
        if req.Consistency == STALE:
            if best.LagMs <= req.MaxStalenessMs:
                return best
            return Primary  // lag exceeds tolerance
        // SNAPSHOT: any healthy replica qualifies (ReadIndex ensures freshness)
        return best  // round-robin across equal-lag replicas

The primary gateway forwards the request over gRPC to the selected replica gateway. If the replica call fails (deadline, connection error), the primary retries once on a different replica, then falls back to serving locally.
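
The routing pseudocode translates to Go fairly directly. The version below is a self-contained sketch with simplified, hypothetical types (`Replica`, `Router`, a string target), and it makes the round-robin over equal-lag replicas explicit. It follows the pseudocode in returning the primary when a `STALE` bound cannot be met; retry-on-failure is not modelled.

```go
package main

import "fmt"

type Consistency int

const (
	Stale Consistency = iota
	Snapshot
	Strong
)

type ReadRequest struct {
	Consistency    Consistency
	MaxStalenessMs int64
}

type Replica struct {
	ID    string
	LagMs int64
}

// Primary is the sentinel target meaning "serve on the primary gateway".
const Primary = "primary"

// Router keeps a counter so replicas tied for the lowest lag are used in turn.
// It is not safe for concurrent use; a real gateway would need an atomic counter.
type Router struct{ next int }

// RouteRead returns the ID of the gateway that should serve req, given the
// currently healthy replicas.
func (r *Router) RouteRead(req ReadRequest, healthy []Replica) string {
	if req.Consistency == Strong || len(healthy) == 0 {
		return Primary
	}
	// Find the minimum lag, then collect every replica that shares it.
	minLag := healthy[0].LagMs
	for _, c := range healthy[1:] {
		if c.LagMs < minLag {
			minLag = c.LagMs
		}
	}
	var best []Replica
	for _, c := range healthy {
		if c.LagMs == minLag {
			best = append(best, c)
		}
	}
	if req.Consistency == Stale && minLag > req.MaxStalenessMs {
		return Primary // lag exceeds the client's tolerance
	}
	// SNAPSHOT, or STALE within tolerance: rotate across the tied replicas.
	pick := best[r.next%len(best)]
	r.next++
	return pick.ID
}

func main() {
	r := &Router{}
	healthy := []Replica{{ID: "a", LagMs: 30}, {ID: "b", LagMs: 30}, {ID: "c", LagMs: 900}}

	fmt.Println(r.RouteRead(ReadRequest{Consistency: Snapshot}, healthy))                  // a
	fmt.Println(r.RouteRead(ReadRequest{Consistency: Snapshot}, healthy))                  // b
	fmt.Println(r.RouteRead(ReadRequest{Consistency: Stale, MaxStalenessMs: 10}, healthy)) // primary
	fmt.Println(r.RouteRead(ReadRequest{Consistency: Strong}, healthy))                    // primary
}
```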
5. Replica Lifecycle
5.1 Cold start
- *Add learner peer* via PD: `pd-ctl operator add add-learner <region-id> <store-id>`
- TiKV streams the Raft log snapshot + delta to the learner store.
- The replica gateway starts in `CATCHING_UP` state; it rejects read requests until its applied index is within `replica_lag_threshold_lsn` of the leader's commit index (configurable, default: 50 000 entries ≈ 500 ms at 100 k writes/s).
- Once caught up, the replica transitions to `READY` and begins accepting reads.
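
The read gate implied by the cold-start steps can be sketched as a two-state machine. `Gate` and its methods are hypothetical names. One assumption goes beyond the text: the sketch moves a `READY` replica back to `CATCHING_UP` when lag exceeds the threshold again, whereas the RFC only states that such a replica rejects reads.

```go
package main

import "fmt"

// ReplicaState uses the same numeric encoding as kdb_replica_state (§6.1).
type ReplicaState int

const (
	CatchingUp ReplicaState = 0
	Ready      ReplicaState = 1
)

// Gate decides whether a replica gateway may serve reads.
type Gate struct {
	ThresholdEntries uint64 // replica_lag_threshold_lsn; the RFC default is 50 000
	state            ReplicaState
}

// Observe updates the state from the leader's commit index and the local
// applied index, and returns the new state.
// Assumption: a READY replica drops back to CATCHING_UP on a later lag breach.
func (g *Gate) Observe(leaderCommit, applied uint64) ReplicaState {
	var lag uint64
	if leaderCommit > applied {
		lag = leaderCommit - applied
	}
	if lag <= g.ThresholdEntries {
		g.state = Ready
	} else {
		g.state = CatchingUp
	}
	return g.state
}

// AcceptsReads is what the read path would check before serving a request.
func (g *Gate) AcceptsReads() bool { return g.state == Ready }

func main() {
	g := &Gate{ThresholdEntries: 50_000}
	fmt.Println(g.AcceptsReads()) // false: a new replica starts in CATCHING_UP

	g.Observe(2_000_000, 1_200_000) // 800 000 entries behind
	fmt.Println(g.AcceptsReads())   // false

	g.Observe(2_000_000, 1_960_000) // 40 000 entries behind
	fmt.Println(g.AcceptsReads())   // true
}
```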
5.2 Warm standby
The replica gateway polls its TiKV learner stores for the current applied index every 100 ms and computes lag:

    lag_entries = leader.commit_index - learner.applied_index
    lag_ms      = lag_entries / avg_write_rate_per_ms

Both are published as Prometheus metrics (§6).
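
A small worked version of the two lag formulas follows. The function names are illustrative. The sketch clamps negative lag to zero (the two indexes are sampled at different moments) and reports "unknown" when the write rate is zero, since the division is undefined on an idle cluster; both behaviours are assumptions rather than stated design.

```go
package main

import "fmt"

// LagEntries returns how many Raft entries the learner is behind, clamped at
// zero in case the learner's index was sampled after the leader's.
func LagEntries(leaderCommit, learnerApplied uint64) uint64 {
	if learnerApplied >= leaderCommit {
		return 0
	}
	return leaderCommit - learnerApplied
}

// LagMs converts an entry lag to milliseconds given the average write rate in
// entries per millisecond. The boolean is false when the rate is not positive
// and no estimate can be made.
func LagMs(lagEntries uint64, writesPerMs float64) (float64, bool) {
	if writesPerMs <= 0 {
		return 0, false
	}
	return float64(lagEntries) / writesPerMs, true
}

func main() {
	// 100 k writes/s is 100 entries per ms, so the default 50 000-entry
	// threshold corresponds to 500 ms.
	lag := LagEntries(1_050_000, 1_000_000)
	ms, ok := LagMs(lag, 100)
	fmt.Println(lag, ms, ok) // 50000 500 true
}
```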
5.3 Promotion to primary (failover)
Trigger: the primary gateway fails its health check for more than `failover_timeout_s` seconds (default: 15 s).
Promotion steps:
1. The operator (or automated watchdog) issues a PD command to convert the learner store to a voter: `pd-ctl operator add transfer-learner-to-voter <store-id>`
2. Once the learner becomes a voter, Raft elects a new leader from the voter set (sub-second in normal conditions).
3. The replica gateway reconfigures itself to `role = "primary"`, registers with the service-discovery layer, and begins accepting writes.
4. The old primary, if it recovers, rejoins as a learner (or is added back as voter after data reconciliation via WAL stream comparison).

Target RTO: *< 30 s* (steps 1–3 combined; PD learner-to-voter promotion is typically < 5 s; DNS/service-discovery TTL is the main variable).
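
The failover trigger can be sketched as a timer that starts on the first failed health check and resets on any success. `Watchdog` and `DefaultFailoverTimeout` are hypothetical names; the PD call and the distributed lock mentioned in the non-goals are out of scope for this sketch, which only decides when promotion should start.

```go
package main

import (
	"fmt"
	"time"
)

// DefaultFailoverTimeout matches the documented default of 15 s.
const DefaultFailoverTimeout = 15 * time.Second

// Watchdog tracks how long the primary has been failing its health check.
type Watchdog struct {
	FailoverTimeout time.Duration // failover_timeout_s; 0 means manual promotion only
	firstFailure    time.Time
	failing         bool
}

// Observe records one health-check result and reports whether promotion
// should be triggered now.
func (w *Watchdog) Observe(healthy bool, now time.Time) bool {
	if healthy {
		w.failing = false
		return false
	}
	if !w.failing {
		w.failing = true
		w.firstFailure = now
	}
	if w.FailoverTimeout == 0 {
		return false // automation disabled
	}
	return now.Sub(w.firstFailure) > w.FailoverTimeout
}

func main() {
	w := &Watchdog{FailoverTimeout: DefaultFailoverTimeout}
	start := time.Now()
	fmt.Println(w.Observe(false, start))                     // false: timer starts
	fmt.Println(w.Observe(false, start.Add(10*time.Second))) // false: 10 s < 15 s
	fmt.Println(w.Observe(false, start.Add(16*time.Second))) // true: promote
}
```

With the 15 s default, detection alone consumes half of the 30 s RTO budget, which leaves roughly 15 s for the PD promotion and service-discovery propagation.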
6. Observability
6.1 Prometheus metrics
| Metric | Type | Description |
|---|---|---|
| `kdb_replica_lag_entries{replica_id}` | gauge | Raft index entries behind leader |
| `kdb_replica_lag_ms{replica_id}` | gauge | Estimated lag in milliseconds |
| `kdb_replica_state{replica_id}` | gauge | 0=catching_up, 1=ready, 2=unhealthy |
| `kdb_read_routed_replica_total{replica_id}` | counter | Reads forwarded to this replica |
| `kdb_read_fallback_total{reason}` | counter | Reads falling back to primary (`lag`, `error`, `noRep`) |
| `kdb_replica_heartbeat_latency_ms` | histogram | Heartbeat round-trip latency |
6.2 Alerting rules

    - alert: KdbReplicaHighLag
      expr: kdb_replica_lag_ms > 2000
      for: 2m
      annotations:
        summary: "Read replica {{ $labels.replica_id }} lag > 2s"
    - alert: KdbReplicaUnhealthy
      expr: kdb_replica_state == 2
      for: 30s
      annotations:
        summary: "Read replica {{ $labels.replica_id }} is unhealthy"
    - alert: KdbReadFallbackSpike
      expr: rate(kdb_read_fallback_total[5m]) > 100
      annotations:
        summary: "High primary fallback rate: check replica health"

7. Configuration
New fields in gateway.toml:

    [replica]
    # "primary" | "replica": role of this gateway instance.
    role = "primary"
    # replica_id is a stable identifier for this replica (used in metrics).
    # Required when role = "replica".
    replica_id = ""
    # primary_addr is the gRPC endpoint of the primary gateway.
    # Required when role = "replica".
    primary_addr = ""
    # lag_threshold_lsn is the maximum lag (in Raft entries) before the
    # replica rejects reads and the primary falls back to local serving.
    lag_threshold_lsn = 50000
    # failover_timeout_s is the number of seconds of primary unavailability
    # before automated promotion is triggered (0 = manual only).
    failover_timeout_s = 15
    # heartbeat_interval_ms is how often the replica pushes a heartbeat to the
    # primary. Must be < replica health check timeout (5 s).
    heartbeat_interval_ms = 1000

8. Implementation Plan
Phase 1 — TiKV learner provisioning + follower read (1 week)
- Add `pd-ctl` helper commands to the `kdb-admin` CLI:
  - `kdb-admin replica add --store <id>`: add learner peer for all regions.
  - `kdb-admin replica status`: show lag per region per learner.
- Enable TiKV `kv.WithReplicaRead(kv.ReplicaReadFollower)` in the Go client for `SNAPSHOT` reads.
- Benchmark: measure p99 read latency improvement vs. primary for a mixed OLTP + analytics workload.
Phase 2 — Replica gateway mode + routing (2 weeks)
- Add `role = "replica"` configuration; replica gateway rejects writes.
- Implement `ReplicaRegistry` with heartbeat gRPC stream.
- Implement `RouteRead` algorithm; integrate into the kdb-gateway read path.
- Integration test: primary gateway correctly routes to replica and falls back on lag breach.
Phase 3 — Lag tracking + metrics (1 week)
- Expose Prometheus metrics (§6).
- Add alerting rules to the kdb Grafana dashboard.
- Implement `STALE` consistency level with `max_staleness_ms` enforcement.
Phase 4 — Promotion and failover (1–2 weeks)
- Implement automated watchdog in the replica gateway (monitors primary heartbeat; triggers PD learner-to-voter after `failover_timeout_s`).
- Write runbook for manual promotion (§5.3) and add to `docs/ops/`.
- Chaos test: kill primary gateway, verify RTO < 30 s, verify no data loss.
9. Non-Goals (v1)
- *Cross-region read replicas*: covered by RFC-008 (multi-region). This RFC focuses on intra-region replicas (same datacenter, different zone/rack).
- *Write forwarding from replica to primary*: clients must connect to the primary for writes. No transparent write proxy.
- *Multiple simultaneous promotions*: at most one promotion in flight at a time; the watchdog uses a distributed lock (PD KV store) to avoid split-brain.
- *Logical replication for heterogeneous targets* (e.g., PostgreSQL read replica): out of scope; see CDC RFC (future).
10. Open Questions
- *Learner count per region*: how many learner peers per TiKV region? One learner per replica deployment is simplest, but read replicas can share a single learner store to avoid excessive Raft message fan-out. Trade-off: failure isolation vs. resource efficiency.
- *Cross-consistency within a transaction*: if a client mixes SNAPSHOT reads and STRONG writes in one logical transaction, the kdb-gateway must ensure the SNAPSHOT reads observe at least the write timestamp from the same session. Current proposal: use a session-level `min_read_ts` watermark. Needs prototyping.
- *Learner snapshot size*: a new learner must receive a full snapshot of all TiKV regions before it can apply the delta log. At 1 TB of data, snapshot transfer alone may take hours. Should kdb-next support a "warm learner", that is, a learner pre-seeded from a PITR base snapshot (RFC-010 §5) to reduce catch-up time?
- *Service discovery*: the replica registry currently uses a static `primary_addr` in config. For dynamic environments (Kubernetes, LXC with floating IPs), the registry should be backed by a service-discovery system (e.g., PD KV store used as a lightweight etcd). What is the preferred mechanism for kdb-next's target infra?
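
For the cross-consistency question above, a first prototype of the session-level `min_read_ts` watermark might look like the following. This is only a sketch of the idea: `Session`, `ObserveWrite` and `CanServe` are hypothetical names, and it assumes the gateway can learn a "safe" timestamp up to which a replica has applied data, which this RFC does not yet specify.

```go
package main

import (
	"fmt"
	"sync"
)

// Session tracks the highest commit timestamp this session has written.
type Session struct {
	mu        sync.Mutex
	minReadTS uint64
}

// ObserveWrite raises the watermark after a STRONG write commits at commitTS.
// The watermark never moves backwards.
func (s *Session) ObserveWrite(commitTS uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if commitTS > s.minReadTS {
		s.minReadTS = commitTS
	}
}

// CanServe reports whether a replica whose applied state covers
// replicaSafeTS may serve a SNAPSHOT read for this session.
func (s *Session) CanServe(replicaSafeTS uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return replicaSafeTS >= s.minReadTS
}

func main() {
	s := &Session{}
	s.ObserveWrite(120)
	fmt.Println(s.CanServe(100)) // false: replica has not applied ts 120 yet
	fmt.Println(s.CanServe(120)) // true
}
```

When `CanServe` returns false, the router could either wait for the replica or send that read to the primary; which of the two is preferable is part of what the prototype would need to measure.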