RFC-011 kdb-next: Standby Read Replicas

| Field | Value |
|---|---|
| Status | Draft: single-node standby flag shipped in pgwire; multi-node routing, registry, and heartbeat unimplemented. Tickets #529–#534 cover P0–P5. |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-16 (drafted) · 2026-04-26 (audit) |
| Audit | `infra/data/kdb/docs/RFC-011-AUDIT.md` |
| Related | RFC-001 (architecture), RFC-008 (multi-region), RFC-010 (WAL streaming), #044 |
| Target module | `infra/data/kdb/` |

1. Summary

kdb-next currently routes all reads through the primary TiKV region, which creates contention between OLTP workloads and heavy analytical / batch queries. This RFC defines **standby read replicas**: dedicated kdb-gateway nodes that serve reads from TiKV learner replicas, offloading the primary Raft quorum.

Design goals:

- **Zero write amplification:** replicas receive data via TiKV Raft (learner peers) and do not consume the WAL stream (RFC-010) unless used as PITR targets.
- **Tunable consistency:** clients choose between stale reads, snapshot reads, and strong (linearizable) reads per request.
- **Transparent routing:** the primary kdb-gateway proxies reads to the closest replica automatically, with fallback to primary on lag threshold breach.
- **Promotable standby:** a replica can be promoted to primary via a Raft learner-to-voter transition with sub-30 s RTO.
- **Observable lag:** per-replica lag metrics in LSN and wall-clock time, integrated with the Prometheus stack (§6).

2. Background: TiKV Raft Learners and Follower Reads

2.1 Raft learners

TiKV supports adding **learner** nodes to a Raft group. Learners receive the full Raft log from leaders and persist it locally but do not participate in the quorum vote. This means:

- Writes are never delayed waiting for learner acknowledgement.
- Learners replicate asynchronously and may lag behind the leader.
- Promotion to voter is a one-step PD command (`pd-ctl config ...`).

kdb-next will designate one learner peer per TiKV region for each read replica deployment. The PD scheduler places these learner stores in the read replica's datacenter zone.

2.2 TiKV Follower Read (ReadIndex)

TiKV implements Follower Read using the **ReadIndex** protocol:

1. The follower sends a ReadIndex RPC to the leader.
2. The leader responds with its current commit index (read_index).
3. The follower waits until its applied index ≥ read_index.
4. The follower serves the read from its local state.

This guarantees **linearizable** reads from a follower with a single RPC round-trip to the leader (vs. two round-trips for a full Raft read). The kdb-gateway uses the tikv_client_go API flag kv.WithReplicaRead() to opt into this mode per request.
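The four ReadIndex steps can be illustrated with a small in-memory model. This is a sketch only: `Leader`, `Follower`, `FollowerRead`, and the `applyNext` hook are hypothetical names used for illustration and are not part of TiKV or the Go client.

```go
package main

import "errors"

// Leader exposes only what a follower needs for ReadIndex.
type Leader struct{ CommitIndex uint64 }

// ReadIndex returns the leader's current commit index (step 2).
func (l *Leader) ReadIndex() uint64 { return l.CommitIndex }

// Follower holds a local copy of the state plus its applied index.
type Follower struct {
	AppliedIndex uint64
	State        map[string]string
}

// ErrNotCaughtUp is returned when the follower gives up waiting.
var ErrNotCaughtUp = errors.New("follower did not reach read_index")

// FollowerRead walks through steps 1-4. applyNext is a hypothetical hook
// that applies one more Raft entry to the follower; maxSteps bounds the wait.
func FollowerRead(l *Leader, f *Follower, key string, maxSteps int, applyNext func(*Follower)) (string, error) {
	readIndex := l.ReadIndex() // steps 1-2: a single round-trip to the leader
	for step := 0; f.AppliedIndex < readIndex; step++ {
		if step >= maxSteps {
			return "", ErrNotCaughtUp
		}
		applyNext(f) // step 3: wait until applied index >= read_index
	}
	return f.State[key], nil // step 4: serve from local state
}
```

A real gateway would bound the wait with a deadline rather than a step count; the counter only keeps the sketch deterministic.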

For **stale reads** (allowed to be up to N milliseconds behind), TiKV offers a timestamp-based Stale Read feature that requires no leader round-trip at all.

3. Architecture

                ┌─────────────────────────────────────────────────┐
                │              Primary Datacenter (GRU)           │
                │                                                 │
                │  ┌──────────────┐      ┌──────────────────────┐ │
  Client ──────►│  │ kdb-gateway  │─────►│  TiKV cluster        │ │
                │  │  (primary)   │      │  (leader + voters)   │ │
                │  └──────┬───────┘      └─────────┬────────────┘ │
                │         │ read proxy               │ Raft log    │
                └─────────┼───────────────────────── ┼────────────┘
                           │                          │
                           │                          │ Raft learner replication
                           │                          ▼
                ┌──────────┼──────────────────────────────────────┐
                │          │    Read Replica Zone (GRU-2)         │
                │          │                                      │
                │  ┌───────▼──────┐      ┌──────────────────────┐ │
                │  │ kdb-gateway  │─────►│  TiKV learner stores │ │
                │  │  (replica)   │      │  (read-only peers)   │ │
                │  └──────────────┘      └──────────────────────┘ │
                └─────────────────────────────────────────────────┘

The **read replica gateway** is a separate kdb-gateway binary configured in replica mode (role = "replica" in gateway.toml). It connects exclusively to the TiKV learner stores in its zone and does not accept write requests.

The **primary gateway** maintains a replica registry (§4.2) and proxies reads to replica gateways when the client's consistency requirement allows it.

4. Read Routing Protocol

4.1 Consistency levels per request

Clients express consistency requirements via a gRPC header or per-RPC option:

| Level | Behaviour | Latency impact |
|---|---|---|
| `STALE` | Serve from learner without ReadIndex check. | Minimal (no extra RPC) |
| `SNAPSHOT` | Serve from learner after ReadIndex confirmation. | +1 leader round-trip |
| `STRONG` | Route to primary Raft leader. | Primary only |

Default level: SNAPSHOT (balances freshness vs. offload).

The STALE level requires clients to specify max_staleness_ms; reads that cannot be served within this window fall back to SNAPSHOT.
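As an illustration of how a gateway might turn the per-request option into a validated value, the sketch below parses a level string and an optional staleness bound. The function name `ParseReadOptions` and the idea of passing both values as strings (for example from two gRPC headers) are assumptions; the RFC does not fix the header names or wire format.

```go
package main

import (
	"errors"
	"strconv"
	"strings"
)

// Consistency mirrors the three levels in the table above.
type Consistency int

const (
	Stale Consistency = iota
	Snapshot
	Strong
)

// ReadOptions is the validated per-request setting (hypothetical type).
type ReadOptions struct {
	Level          Consistency
	MaxStalenessMs int64 // only meaningful for Stale
}

// ErrMissingStaleness is returned when STALE lacks a usable bound.
var ErrMissingStaleness = errors.New("STALE requires a positive max_staleness_ms")

// ParseReadOptions applies the documented default (SNAPSHOT) when no level
// is given and enforces that STALE carries max_staleness_ms.
func ParseReadOptions(level, maxStalenessMs string) (ReadOptions, error) {
	switch strings.ToUpper(strings.TrimSpace(level)) {
	case "", "SNAPSHOT":
		return ReadOptions{Level: Snapshot}, nil
	case "STRONG":
		return ReadOptions{Level: Strong}, nil
	case "STALE":
		ms, err := strconv.ParseInt(strings.TrimSpace(maxStalenessMs), 10, 64)
		if err != nil || ms <= 0 {
			return ReadOptions{}, ErrMissingStaleness
		}
		return ReadOptions{Level: Stale, MaxStalenessMs: ms}, nil
	default:
		return ReadOptions{}, errors.New("unknown consistency level: " + level)
	}
}
```

Rejecting an unknown level, rather than silently defaulting to SNAPSHOT, is a design choice of this sketch and not something the RFC specifies.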

4.2 Replica registry

The primary gateway maintains an in-memory ReplicaRegistry:

```go
type ReplicaEntry struct {
	ID       string    // replica gateway ID
	Addr     string    // gRPC endpoint
	HeadLSN  string    // last applied LSN (from heartbeat)
	LagMs    int64     // estimated lag in milliseconds
	Healthy  bool
	LastSeen time.Time
}

type ReplicaRegistry struct {
	mu       sync.RWMutex
	replicas map[string]*ReplicaEntry
}
```

Replica gateways push heartbeat probes to the primary every 1 second over a dedicated gRPC stream (ReplicaHeartbeat). The primary marks a replica unhealthy after 5 missed heartbeats (5 s).
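A minimal sketch of the registry's heartbeat bookkeeping follows, assuming the struct layout above. The method names `RecordHeartbeat`, `Sweep`, and `HealthyReplicas`, and the idea of a periodic sweep, are illustrative assumptions; the gRPC stream itself is not modelled.

```go
package main

import (
	"sync"
	"time"
)

type ReplicaEntry struct {
	ID       string
	Addr     string
	HeadLSN  string
	LagMs    int64
	Healthy  bool
	LastSeen time.Time
}

type ReplicaRegistry struct {
	mu       sync.RWMutex
	replicas map[string]*ReplicaEntry
}

// unhealthyAfter mirrors "5 missed heartbeats (5 s)" at a 1 s interval.
const unhealthyAfter = 5 * time.Second

func NewReplicaRegistry() *ReplicaRegistry {
	return &ReplicaRegistry{replicas: make(map[string]*ReplicaEntry)}
}

// RecordHeartbeat upserts the entry and marks the replica healthy again.
func (r *ReplicaRegistry) RecordHeartbeat(id, addr, headLSN string, lagMs int64, now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.replicas[id] = &ReplicaEntry{
		ID: id, Addr: addr, HeadLSN: headLSN,
		LagMs: lagMs, Healthy: true, LastSeen: now,
	}
}

// Sweep marks every replica whose last heartbeat is too old as unhealthy.
// It is assumed to be called periodically by the primary gateway.
func (r *ReplicaRegistry) Sweep(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, e := range r.replicas {
		if now.Sub(e.LastSeen) >= unhealthyAfter {
			e.Healthy = false
		}
	}
}

// HealthyReplicas returns copies so callers cannot mutate registry state.
func (r *ReplicaRegistry) HealthyReplicas() []ReplicaEntry {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []ReplicaEntry
	for _, e := range r.replicas {
		if e.Healthy {
			out = append(out, *e)
		}
	}
	return out
}
```

Passing `now` explicitly instead of calling `time.Now()` inside the methods keeps the staleness check easy to test.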

4.3 Read routing algorithm

```
func RouteRead(req ReadRequest, registry ReplicaRegistry) GatewayTarget:
  if req.Consistency == STRONG:
    return Primary

  candidates = registry.HealthyReplicas()
  if len(candidates) == 0:
    return Primary          // no replicas available: fall back

  best = min(candidates, key=LagMs)
  if req.Consistency == STALE:
    if best.LagMs <= req.MaxStalenessMs:
      return best
    return Primary          // lag exceeds tolerance

  // SNAPSHOT: any healthy replica qualifies (ReadIndex ensures freshness)
  return best               // ties between equal-lag replicas are broken round-robin
```
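One possible Go rendering of the pseudocode above is sketched here. The `Router`, `Replica`, and `Target` types are hypothetical stand-ins (the RFC does not define `GatewayTarget`), and the round-robin counter is one assumed way of breaking ties between equal-lag replicas.

```go
package main

import "sync/atomic"

type Consistency int

const (
	STALE Consistency = iota
	SNAPSHOT
	STRONG
)

// Replica is the subset of ReplicaEntry that routing needs.
type Replica struct {
	ID    string
	LagMs int64
}

type ReadRequest struct {
	Consistency    Consistency
	MaxStalenessMs int64
}

// Target names the chosen gateway; an empty ReplicaID means the primary.
type Target struct{ ReplicaID string }

var Primary = Target{}

// Router carries the round-robin counter used to break lag ties.
type Router struct{ rr uint64 }

// RouteRead picks the lowest-lag healthy replica, or the primary.
func (r *Router) RouteRead(req ReadRequest, healthy []Replica) Target {
	if req.Consistency == STRONG || len(healthy) == 0 {
		return Primary
	}
	// Collect every replica that shares the minimum lag.
	best := []Replica{healthy[0]}
	for _, c := range healthy[1:] {
		switch {
		case c.LagMs < best[0].LagMs:
			best = []Replica{c}
		case c.LagMs == best[0].LagMs:
			best = append(best, c)
		}
	}
	if req.Consistency == STALE && best[0].LagMs > req.MaxStalenessMs {
		return Primary // lag exceeds the client's tolerance
	}
	n := atomic.AddUint64(&r.rr, 1)
	return Target{ReplicaID: best[(n-1)%uint64(len(best))].ID}
}
```

With replicas at 100 ms, 40 ms, and 40 ms of lag, successive SNAPSHOT reads alternate between the two 40 ms replicas, while a STALE read with `MaxStalenessMs: 10` goes to the primary.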

The primary gateway forwards the request over gRPC to the selected replica gateway. If the replica call fails (deadline, connection error), the primary retries once on a different replica, then falls back to serving locally.
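The retry-then-fallback behaviour can be sketched as below, assuming the caller supplies replica addresses in preference order. `ForwardRead`, `ReadFunc`, and `ErrReplicaUnavailable` are hypothetical names, and the gRPC call is abstracted as a function value.

```go
package main

import "errors"

// ErrReplicaUnavailable stands in for a deadline or connection error.
var ErrReplicaUnavailable = errors.New("replica unavailable")

// ReadFunc abstracts the gRPC call to one replica gateway.
type ReadFunc func(addr string) (string, error)

// ForwardRead tries the preferred replica, retries once on a different
// replica, and then falls back to serving the read locally on the primary.
func ForwardRead(replicas []string, callReplica ReadFunc, serveLocal func() (string, error)) (string, error) {
	const maxReplicaAttempts = 2 // first attempt plus one retry
	for i := 0; i < len(replicas) && i < maxReplicaAttempts; i++ {
		if v, err := callReplica(replicas[i]); err == nil {
			return v, nil
		}
	}
	return serveLocal()
}
```

Capping attempts at two keeps worst-case latency at two replica timeouts plus one local read, regardless of how many replicas are registered.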

5. Replica Lifecycle

5.1 Cold start

1. **Add learner peer** via PD:

   ```
   pd-ctl operator add add-learner <region-id> <store-id>
   ```

2. TiKV streams the Raft log snapshot + delta to the learner store.
3. The replica gateway starts in CATCHING_UP state; it rejects read requests until its applied index is within lag_threshold_lsn (§7) of the leader's commit index (configurable, default: 50 000 entries ≈ 500 ms at 100 k writes/s).
4. Once caught up, the replica transitions to READY and begins accepting reads.
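The CATCHING_UP to READY gate in the last two steps could look roughly like this. `ReplicaGate` and its methods are hypothetical names, and the sketch deliberately models only the one-way transition described above, not the later unhealthy state.

```go
package main

type ReplicaState int

const (
	CatchingUp ReplicaState = iota
	Ready
)

// ReplicaGate decides whether the replica gateway may serve reads.
type ReplicaGate struct {
	State           ReplicaState
	LagThresholdLSN uint64 // default in this RFC: 50000 entries
}

// Observe feeds one (leader commit index, learner applied index) sample and
// moves the gate to Ready once the lag is within the threshold.
func (g *ReplicaGate) Observe(leaderCommit, learnerApplied uint64) ReplicaState {
	var lag uint64
	if leaderCommit > learnerApplied {
		lag = leaderCommit - learnerApplied
	}
	if g.State == CatchingUp && lag <= g.LagThresholdLSN {
		g.State = Ready
	}
	return g.State
}

// AcceptsReads reports whether read requests should be served or rejected.
func (g *ReplicaGate) AcceptsReads() bool { return g.State == Ready }
```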

5.2 Warm standby

The replica gateway polls its TiKV learner stores for the current applied index every 100 ms and computes lag:

```
lag_entries = leader.commit_index - learner.applied_index
lag_ms      = lag_entries / avg_write_rate_per_ms
```

Both are published as Prometheus metrics (§6).
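The two lag formulas translate directly into code. In this sketch the function names are illustrative, and the clamping to zero (for a learner sampled ahead of a stale commit index, or for a non-positive write rate) is an assumption rather than something the RFC specifies.

```go
package main

// LagEntries computes leader.commit_index - learner.applied_index,
// clamped at zero in case the commit index sample is older than the
// applied index sample.
func LagEntries(leaderCommit, learnerApplied uint64) uint64 {
	if learnerApplied >= leaderCommit {
		return 0
	}
	return leaderCommit - learnerApplied
}

// LagMs computes lag_entries / avg_write_rate_per_ms. It returns 0 when the
// write rate is not positive, since the estimate is undefined in that case.
func LagMs(lagEntries uint64, avgWritesPerMs float64) float64 {
	if avgWritesPerMs <= 0 {
		return 0
	}
	return float64(lagEntries) / avgWritesPerMs
}
```

At 100 k writes/s (100 writes per millisecond), a lag of 50 000 entries gives `LagMs(50000, 100) == 500`, matching the 500 ms figure quoted for the default threshold in §5.1.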

5.3 Promotion to primary (failover)

Trigger: the primary gateway fails its health check for more than failover_timeout_s seconds (default: 15 s).

Promotion steps:

1. The operator (or automated watchdog) issues a PD command to convert the learner store to a voter:
   ```
   pd-ctl operator add transfer-learner-to-voter <store-id>
   ```
2. Once the learner becomes a voter, Raft elects a new leader from the voter set (sub-second in normal conditions).
3. The replica gateway reconfigures itself to role = "primary", registers with the service-discovery layer, and begins accepting writes.
4. The old primary, if it recovers, rejoins as a learner (or is added back as a voter after data reconciliation via WAL stream comparison).

Target RTO: **< 30 s** (steps 1–3 combined; PD learner-to-voter promotion is typically < 5 s; DNS/service-discovery TTL is the main variable).
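The failover trigger can be sketched as a small timer over health-check results. `Watchdog` and `Observe` are hypothetical names; timestamps are plain Unix seconds to keep the example deterministic, and the distributed lock mentioned in §9 is not modelled.

```go
package main

// Watchdog tracks how long the primary has been failing its health check.
type Watchdog struct {
	FailoverTimeoutS int64 // 0 = manual promotion only
	failing          bool
	firstFailureS    int64
}

// Observe records one health-check result taken at nowS (Unix seconds) and
// reports whether automated promotion should be triggered.
func (w *Watchdog) Observe(healthy bool, nowS int64) bool {
	if healthy {
		w.failing = false // any success resets the failure window
		return false
	}
	if !w.failing {
		w.failing = true
		w.firstFailureS = nowS
	}
	if w.FailoverTimeoutS == 0 {
		return false // automated promotion disabled
	}
	return nowS-w.firstFailureS > w.FailoverTimeoutS
}
```

With the default of 15 s, a primary that first fails at t = 100 triggers promotion at t = 116, leaving roughly 14 s of the 30 s RTO budget for the promotion steps themselves.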

6. Observability

6.1 Prometheus metrics

| Metric | Type | Description |
|---|---|---|
| `kdb_replica_lag_entries{replica_id}` | gauge | Raft index entries behind leader |
| `kdb_replica_lag_ms{replica_id}` | gauge | Estimated lag in milliseconds |
| `kdb_replica_state{replica_id}` | gauge | 0 = catching_up, 1 = ready, 2 = unhealthy |
| `kdb_read_routed_replica_total{replica_id}` | counter | Reads forwarded to this replica |
| `kdb_read_fallback_total{reason}` | counter | Reads falling back to primary (reason: lag, error, noRep) |
| `kdb_replica_heartbeat_latency_ms` | histogram | Heartbeat round-trip latency |

6.2 Alerting rules

```yaml
- alert: KdbReplicaHighLag
  expr: kdb_replica_lag_ms > 2000
  for: 2m
  annotations:
    summary: "Read replica {{ $labels.replica_id }} lag > 2s"

- alert: KdbReplicaUnhealthy
  expr: kdb_replica_state == 2
  for: 30s
  annotations:
    summary: "Read replica {{ $labels.replica_id }} is unhealthy"

- alert: KdbReadFallbackSpike
  expr: rate(kdb_read_fallback_total[5m]) > 100
  annotations:
    summary: "High primary fallback rate; check replica health"
```

7. Configuration

New fields in gateway.toml:

```toml
[replica]
# "primary" | "replica": role of this gateway instance.
role = "primary"

# replica_id is a stable identifier for this replica (used in metrics).
# Required when role = "replica".
replica_id = ""

# primary_addr is the gRPC endpoint of the primary gateway.
# Required when role = "replica".
primary_addr = ""

# lag_threshold_lsn is the maximum lag (in Raft entries) before the
# replica rejects reads and the primary falls back to local serving.
lag_threshold_lsn = 50000

# failover_timeout_s is the number of seconds of primary unavailability
# before automated promotion is triggered (0 = manual only).
failover_timeout_s = 15

# heartbeat_interval_ms is how often the replica pushes a heartbeat to the
# primary. Must be < replica health check timeout (5 s).
heartbeat_interval_ms = 1000
```
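The constraints stated in the comments above lend themselves to a startup check. The sketch below assumes the `[replica]` table has already been decoded into a struct by whatever TOML library the gateway uses; `ReplicaConfig`, `DefaultReplicaConfig`, and `Validate` are hypothetical names.

```go
package main

import "errors"

// ReplicaConfig mirrors the [replica] table in gateway.toml.
type ReplicaConfig struct {
	Role                string
	ReplicaID           string
	PrimaryAddr         string
	LagThresholdLSN     uint64
	FailoverTimeoutS    int
	HeartbeatIntervalMs int
}

// healthCheckTimeoutMs is the 5 s replica health check timeout.
const healthCheckTimeoutMs = 5000

// DefaultReplicaConfig returns the defaults listed in this section.
func DefaultReplicaConfig() ReplicaConfig {
	return ReplicaConfig{
		Role:                "primary",
		LagThresholdLSN:     50000,
		FailoverTimeoutS:    15,
		HeartbeatIntervalMs: 1000,
	}
}

// Validate enforces the rules stated in the config comments.
func (c ReplicaConfig) Validate() error {
	switch c.Role {
	case "primary":
	case "replica":
		if c.ReplicaID == "" {
			return errors.New(`replica_id is required when role = "replica"`)
		}
		if c.PrimaryAddr == "" {
			return errors.New(`primary_addr is required when role = "replica"`)
		}
	default:
		return errors.New(`role must be "primary" or "replica"`)
	}
	if c.HeartbeatIntervalMs <= 0 || c.HeartbeatIntervalMs >= healthCheckTimeoutMs {
		return errors.New("heartbeat_interval_ms must be > 0 and < 5000")
	}
	if c.FailoverTimeoutS < 0 {
		return errors.New("failover_timeout_s must be >= 0 (0 = manual only)")
	}
	return nil
}
```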

8. Implementation Plan

Phase 1 — TiKV learner provisioning + follower read (1 week)

- Add pd-ctl helper commands to the kdb-admin CLI:
  - `kdb-admin replica add --store <id>`: add a learner peer for all regions.
  - `kdb-admin replica status`: show lag per region per learner.
- Enable TiKV kv.WithReplicaRead(kv.ReplicaReadFollower) in the Go client for SNAPSHOT reads.
- Benchmark: measure p99 read latency improvement vs. primary for a mixed OLTP + analytics workload.

Phase 2 — Replica gateway mode + routing (2 weeks)

- Add role = "replica" configuration; replica gateway rejects writes.
- Implement ReplicaRegistry with heartbeat gRPC stream.
- Implement RouteRead algorithm; integrate into the kdb-gateway read path.
- Integration test: primary gateway correctly routes to replica and falls back on lag breach.

Phase 3 — Lag tracking + metrics (1 week)

- Expose Prometheus metrics (§6).
- Add alerting rules to the kdb Grafana dashboard.
- Implement STALE consistency level with max_staleness_ms enforcement.

Phase 4 — Promotion and failover (1–2 weeks)

- Implement automated watchdog in the replica gateway (monitors primary heartbeat; triggers PD learner-to-voter after failover_timeout_s).
- Write runbook for manual promotion (§5.3) and add to docs/ops/.
- Chaos test: kill primary gateway, verify RTO < 30 s, verify no data loss.

9. Non-Goals (v1)

- **Cross-region read replicas:** covered by RFC-008 (multi-region). This RFC focuses on intra-region replicas (same datacenter, different zone/rack).
- **Write forwarding from replica to primary:** clients must connect to the primary for writes. No transparent write proxy.
- **Multiple simultaneous promotions:** at most one promotion in flight at a time; the watchdog uses a distributed lock (PD KV store) to avoid split-brain.
- **Logical replication for heterogeneous targets** (e.g., PostgreSQL read replica): out of scope; see CDC RFC (future).

10. Open Questions

1. **Learner count per region:** how many learner peers per TiKV region? One learner per replica deployment is simplest, but read replicas can share a single learner store to avoid excessive Raft message fan-out. Trade-off: failure isolation vs. resource efficiency.

2. **Cross-consistency within a transaction:** if a client mixes SNAPSHOT reads and STRONG writes in one logical transaction, the kdb-gateway must ensure the SNAPSHOT reads observe at least the write timestamp from the same session. Current proposal: use session-level min_read_ts watermark. Needs prototyping.

3. **Learner snapshot size:** a new learner must receive a full snapshot of all TiKV regions before it can apply the delta log. At 1 TB of data, snapshot transfer alone may take hours. Should kdb-next support a "warm learner", i.e. a learner pre-seeded from a PITR base snapshot (RFC-010 §5), to reduce catch-up time?

4. **Service discovery:** the replica registry currently uses a static primary_addr in config. For dynamic environments (Kubernetes, LXC with floating IPs), the registry should be backed by a service-discovery system (e.g., PD KV store used as a lightweight etcd). What is the preferred mechanism for kdb-next's target infra?
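As a starting point for prototyping the session-level min_read_ts watermark from the cross-consistency question, a session could track the highest commit timestamp it has written and refuse replicas that have not yet applied it. Everything here is a hypothetical sketch: the `Session` type, its methods, and the notion of comparing against a replica's applied timestamp are assumptions, not a settled design.

```go
package main

import "sync"

// Session tracks the read-your-writes watermark for one client session.
type Session struct {
	mu        sync.Mutex
	minReadTS uint64
}

// ObserveWrite raises the watermark to the commit timestamp of a STRONG
// write. The watermark never moves backwards.
func (s *Session) ObserveWrite(commitTS uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if commitTS > s.minReadTS {
		s.minReadTS = commitTS
	}
}

// MinReadTS returns the current watermark.
func (s *Session) MinReadTS() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.minReadTS
}

// CanServe reports whether a replica that has applied data up to
// replicaAppliedTS may serve this session's SNAPSHOT reads. When it
// returns false, the gateway would wait or route to the primary.
func (s *Session) CanServe(replicaAppliedTS uint64) bool {
	return replicaAppliedTS >= s.MinReadTS()
}
```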