
RFC-011 (kdb-next): Standby Read Replicas

| Field | Value |
|---|---|
| Status | *Draft*: single-node standby flag shipped in pgwire; multi-node routing/registry/heartbeat unimplemented. Tickets #529–#534 cover P0–P5. |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-16 (drafted) · 2026-04-26 (audit) |
| Audit | infra/data/kdb/docs/RFC-011-AUDIT.md |
| Related | RFC-001 (architecture), RFC-008 (multi-region), RFC-010 (WAL streaming), #044 |
| Target module | infra/data/kdb/ |

1. Summary

kdb-next currently routes all reads through the primary TiKV region, which creates contention between OLTP workloads and heavy analytical / batch queries. This RFC defines *standby read replicas*: dedicated kdb-gateway nodes that serve reads from TiKV learner replicas, offloading the primary Raft quorum.

Design goals:

  • *Zero write amplification*: replicas receive data via TiKV Raft (learner
    peers) and do not consume the WAL stream (RFC-010) unless used as PITR targets.
  • *Tunable consistency*: clients choose between stale reads, snapshot reads,
    and strong (linearizable) reads per request.
  • *Transparent routing*: the primary kdb-gateway proxies reads to the closest
    replica automatically, with fallback to primary on lag threshold breach.
  • *Promotable standby*: a replica can be promoted to primary via a Raft
    learner-to-voter transition with sub-30s RTO.
  • *Observable lag*: per-replica lag metrics in LSN and wall-clock time,
    integrated with the Prometheus stack (§6).


2. Background: TiKV Raft Learners and Follower Reads

2.1 Raft learners

TiKV supports adding *learner* nodes to a Raft group. Learners receive the full Raft log from the leader and persist it locally but do not participate in the quorum vote. This means:

  • Writes are never delayed waiting for learner acknowledgement.
  • Learners replicate asynchronously and may lag behind the leader.
  • Promotion to voter is a one-step PD command (`pdctl config ...`).

kdb-next will designate one learner peer per TiKV region for each read replica deployment. The PD scheduler places these learner stores in the read replica's datacenter zone.

2.2 TiKV Follower Read (ReadIndex)

TiKV implements Follower Read using the *ReadIndex* protocol:

  1. The follower sends a ReadIndex RPC to the leader.
  2. The leader responds with its current commit index (read_index).
  3. The follower waits until its applied index ≥ read_index.
  4. The follower serves the read from its local state.

This guarantees *linearizable* reads from a follower with a single RPC round-trip to the leader (vs. two round-trips for a full Raft read). The kdb-gateway uses the TiKV Go client option kv.WithReplicaRead() to opt into this mode per request.

For *stale reads* (allowed to be up to N milliseconds behind), TiKV offers a timestamp-based Stale Read feature that requires no leader round-trip at all.
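
The four ReadIndex steps in §2.2 can be illustrated with a small in-memory simulation. This is only a sketch of the ordering guarantee, not TiKV's implementation: the `Leader` and `Follower` types, `FollowerRead`, and the `applyNext` callback are hypothetical names introduced here, and the RPC is replaced by a direct method call.

```go
package main

import (
	"errors"
	"fmt"
)

// Leader is a hypothetical stand-in for a Raft leader; only its commit index matters here.
type Leader struct{ CommitIndex uint64 }

// ReadIndex models steps 1 and 2: the leader reports its current commit index.
func (l *Leader) ReadIndex() uint64 { return l.CommitIndex }

// Follower is a hypothetical stand-in for a follower or learner peer.
type Follower struct {
	AppliedIndex uint64
	Data         map[string]string
}

// ErrNotCaughtUp is returned when the follower fails to reach read_index in time.
var ErrNotCaughtUp = errors.New("follower did not reach read_index")

// FollowerRead models steps 1 to 4. applyNext applies one more log entry on the
// follower; maxApplies bounds the wait so the simulation always terminates.
func FollowerRead(l *Leader, f *Follower, key string, maxApplies int, applyNext func()) (string, error) {
	readIndex := l.ReadIndex() // steps 1-2: ask the leader for its commit index
	for i := 0; f.AppliedIndex < readIndex; i++ { // step 3: wait for applied >= read_index
		if i >= maxApplies {
			return "", ErrNotCaughtUp
		}
		applyNext()
	}
	return f.Data[key], nil // step 4: serve from local state
}

func main() {
	leader := &Leader{CommitIndex: 3}
	follower := &Follower{AppliedIndex: 1, Data: map[string]string{"k": "v1"}}
	log := []string{"v1", "v2", "v3"} // log[i] is the value written by entry i+1
	applyNext := func() {
		follower.Data["k"] = log[follower.AppliedIndex]
		follower.AppliedIndex++
	}
	v, err := FollowerRead(leader, follower, "k", 10, applyNext)
	fmt.Println(v, err)
}
```

The follower starts one write behind what the leader has committed, so the read blocks until entries 2 and 3 are applied and only then returns the value of `k`. A stale read would skip the `ReadIndex` call and the wait entirely.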


3. Architecture

                ┌─────────────────────────────────────────────────┐
                │              Primary Datacenter (GRU)           │
                │                                                 │
                │  ┌──────────────┐      ┌──────────────────────┐ │
  Client ──────►│  │ kdb-gateway  │─────►│  TiKV cluster        │ │
                │  │  (primary)   │      │  (leader + voters)   │ │
                │  └──────┬───────┘      └─────────┬────────────┘ │
                │         │ read proxy               │ Raft log    │
                └─────────┼───────────────────────── ┼────────────┘
                           │                          │
                           │                          │ Raft learner replication
                           │                          ▼
                ┌──────────┼──────────────────────────────────────┐
                │          │    Read Replica Zone (GRU-2)         │
                │          │                                      │
                │  ┌───────▼──────┐      ┌──────────────────────┐ │
                │  │ kdb-gateway  │─────►│  TiKV learner stores │ │
                │  │  (replica)   │      │  (read-only peers)   │ │
                │  └──────────────┘      └──────────────────────┘ │
                └─────────────────────────────────────────────────┘

The *read replica gateway* is a separate kdb-gateway binary configured in replica mode (role = "replica" in gateway.toml). It connects exclusively to the TiKV learner stores in its zone and does not accept write requests.

The *primary gateway* maintains a replica registry (§4.2) and proxies reads to replica gateways when the client's consistency requirement allows it.


4. Read Routing Protocol

4.1 Consistency levels per request

Clients express consistency requirements via a gRPC header or per-RPC option:

| Level | Behaviour | Latency impact |
|---|---|---|
| STALE | Serve from learner without ReadIndex check. | Minimal (no extra RPC) |
| SNAPSHOT | Serve from learner after ReadIndex confirmation. | +1 leader round-trip |
| STRONG | Route to primary Raft leader. | Primary only |

Default level: SNAPSHOT (balances freshness vs. offload).

The STALE level requires clients to specify max_staleness_ms; reads that cannot be served within this window fall back to SNAPSHOT.
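
As a rough illustration of how a gateway might turn the per-request option into a typed value, the sketch below parses two raw header values. The RFC does not name the headers or the Go types, so `ParseReadOptions`, `ReadOptions`, and the idea of passing the level and max_staleness_ms as two strings are assumptions made for this example. It applies the SNAPSHOT default and rejects STALE requests that omit max_staleness_ms.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Consistency mirrors the three levels in the table above.
type Consistency int

const (
	Stale Consistency = iota
	Snapshot
	Strong
)

// ReadOptions is a hypothetical parsed form of the per-request options.
type ReadOptions struct {
	Level          Consistency
	MaxStalenessMs int64 // only meaningful for Stale
}

// ParseReadOptions takes the raw level and max_staleness_ms values (for example
// from gRPC metadata; the header names are not defined by the RFC).
func ParseReadOptions(level, maxStaleness string) (ReadOptions, error) {
	switch strings.ToUpper(strings.TrimSpace(level)) {
	case "", "SNAPSHOT": // an absent level means the default, SNAPSHOT
		return ReadOptions{Level: Snapshot}, nil
	case "STRONG":
		return ReadOptions{Level: Strong}, nil
	case "STALE":
		ms, err := strconv.ParseInt(strings.TrimSpace(maxStaleness), 10, 64)
		if err != nil || ms <= 0 {
			return ReadOptions{}, errors.New("STALE requires a positive max_staleness_ms")
		}
		return ReadOptions{Level: Stale, MaxStalenessMs: ms}, nil
	default:
		return ReadOptions{}, fmt.Errorf("unknown consistency level %q", level)
	}
}

func main() {
	opts, err := ParseReadOptions("stale", "250")
	fmt.Println(opts.Level == Stale, opts.MaxStalenessMs, err)
}
```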

4.2 Replica registry

The primary gateway maintains an in-memory ReplicaRegistry:

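import (
    "sync"
    "time"
)

// ReplicaEntry is the primary's view of a single replica gateway.
type ReplicaEntry struct {
    ID         string        // replica gateway ID
    Addr       string        // gRPC endpoint
    HeadLSN    string        // last applied LSN (from heartbeat)
    LagMs      int64         // estimated lag in milliseconds
    Healthy    bool
    LastSeen   time.Time     // arrival time of the last heartbeat
}

// ReplicaRegistry holds all known replicas; mu guards the map.
type ReplicaRegistry struct {
    mu       sync.RWMutex
    replicas map[string]*ReplicaEntry
}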

Replica gateways push heartbeat probes to the primary every 1 second over a dedicated gRPC stream (ReplicaHeartbeat). The primary marks a replica unhealthy after 5 missed heartbeats (5 s).
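
The heartbeat bookkeeping described above might look roughly like the following. The method names (`RecordHeartbeat`, `Sweep`, `HealthyReplicas`), the injectable clock, and the `fakeClock` helper are assumptions added for this sketch; the gRPC stream itself is omitted, and the address and LSN in `main` are made-up sample values.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type ReplicaEntry struct {
	ID       string
	Addr     string
	HeadLSN  string
	LagMs    int64
	Healthy  bool
	LastSeen time.Time
}

type ReplicaRegistry struct {
	mu       sync.RWMutex
	replicas map[string]*ReplicaEntry
	now      func() time.Time // injectable clock (an addition for testability)
}

// unhealthyAfter corresponds to 5 missed heartbeats at the 1 s interval.
const unhealthyAfter = 5 * time.Second

func NewReplicaRegistry(now func() time.Time) *ReplicaRegistry {
	return &ReplicaRegistry{replicas: make(map[string]*ReplicaEntry), now: now}
}

// RecordHeartbeat upserts the entry for id and marks it healthy.
func (r *ReplicaRegistry) RecordHeartbeat(id, addr, headLSN string, lagMs int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.replicas[id]
	if !ok {
		e = &ReplicaEntry{ID: id}
		r.replicas[id] = e
	}
	e.Addr, e.HeadLSN, e.LagMs = addr, headLSN, lagMs
	e.Healthy, e.LastSeen = true, r.now()
}

// Sweep marks replicas unhealthy once more than unhealthyAfter has passed
// since their last heartbeat. The primary would call it periodically.
func (r *ReplicaRegistry) Sweep() {
	r.mu.Lock()
	defer r.mu.Unlock()
	now := r.now()
	for _, e := range r.replicas {
		if now.Sub(e.LastSeen) > unhealthyAfter {
			e.Healthy = false
		}
	}
}

// HealthyReplicas returns copies of all entries currently marked healthy.
func (r *ReplicaRegistry) HealthyReplicas() []ReplicaEntry {
	r.mu.RLock()
	defer r.mu.RUnlock()
	var out []ReplicaEntry
	for _, e := range r.replicas {
		if e.Healthy {
			out = append(out, *e)
		}
	}
	return out
}

// fakeClock is a manually advanced clock used by the example.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time     { return c.t }
func (c *fakeClock) AdvanceMs(ms int64) { c.t = c.t.Add(time.Duration(ms) * time.Millisecond) }

func main() {
	clock := &fakeClock{}
	reg := NewReplicaRegistry(clock.Now)
	reg.RecordHeartbeat("r1", "10.0.0.2:7000", "0/1A2B", 12)
	clock.AdvanceMs(6000) // six seconds of silence
	reg.Sweep()
	fmt.Println("healthy replicas:", len(reg.HealthyReplicas()))
}
```

Returning copies from `HealthyReplicas` keeps callers from mutating registry state without holding the lock.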

4.3 Read routing algorithm

func RouteRead(req ReadRequest, registry ReplicaRegistry) GatewayTarget:
  if req.Consistency == STRONG:
    return Primary

  candidates = registry.HealthyReplicas()
  if len(candidates) == 0:
    return Primary          // no replicas available — fallback

  best = min(candidates, key=LagMs)
  if req.Consistency == STALE:
    if best.LagMs <= req.MaxStalenessMs:
      return best
    return Primary          // lag exceeds tolerance
  
  // SNAPSHOT: any healthy replica qualifies (ReadIndex ensures freshness)
  return best               // if several replicas tie on lag, rotate among them round-robin

The primary gateway forwards the request over gRPC to the selected replica gateway. If the replica call fails (deadline, connection error), the primary retries once on a different replica, then falls back to serving locally.
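
A minimal Go rendering of the routing pseudocode and the retry-then-fallback behaviour is sketched below. It is written against simplified stand-in types (`Replica`, `ReadRequest`, a string target) rather than the real registry, and `Execute` with its `call` callback is a hypothetical name for the forwarding step. Round-robin tie-breaking between equal-lag replicas is left out.

```go
package main

import (
	"errors"
	"fmt"
)

type Consistency int

const (
	Stale Consistency = iota
	Snapshot
	Strong
)

// Replica and ReadRequest are simplified stand-ins for the registry entry and request types.
type Replica struct {
	ID    string
	LagMs int64
}

type ReadRequest struct {
	Consistency    Consistency
	MaxStalenessMs int64
}

// Primary is the target name used when the read is served locally.
const Primary = "primary"

// ErrUnavailable simulates a failed replica call (deadline or connection error).
var ErrUnavailable = errors.New("replica unavailable")

// RouteRead picks the lowest-lag healthy replica, or Primary when none qualifies.
func RouteRead(req ReadRequest, healthy []Replica) string {
	if req.Consistency == Strong || len(healthy) == 0 {
		return Primary
	}
	best := healthy[0]
	for _, r := range healthy[1:] {
		if r.LagMs < best.LagMs {
			best = r
		}
	}
	if req.Consistency == Stale && best.LagMs > req.MaxStalenessMs {
		return Primary // lag exceeds the client's tolerance
	}
	return best.ID
}

// Execute forwards the read, retries once on a different replica, then falls
// back to the primary. It returns the target that finally served the read.
func Execute(req ReadRequest, healthy []Replica, call func(target string) error) (string, error) {
	first := RouteRead(req, healthy)
	if first == Primary {
		return Primary, call(Primary)
	}
	if call(first) == nil {
		return first, nil
	}
	var rest []Replica
	for _, r := range healthy {
		if r.ID != first {
			rest = append(rest, r)
		}
	}
	if second := RouteRead(req, rest); second != Primary && call(second) == nil {
		return second, nil
	}
	return Primary, call(Primary)
}

func main() {
	healthy := []Replica{{ID: "a", LagMs: 40}, {ID: "b", LagMs: 10}}
	flaky := func(target string) error {
		if target == "b" {
			return ErrUnavailable
		}
		return nil
	}
	served, err := Execute(ReadRequest{Consistency: Snapshot}, healthy, flaky)
	fmt.Println(served, err)
}
```

Re-running `RouteRead` on the remaining replicas for the retry means the second attempt still honours max_staleness_ms for STALE reads.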


5. Replica Lifecycle

5.1 Cold start

  1. *Add a learner peer* via PD (pd-ctl takes the region ID first, then the target store ID):
    pd-ctl operator add add-learner <region-id> <store-id>
  2. TiKV streams the Raft log snapshot + delta to the learner store.
  3. The replica gateway starts in CATCHING_UP state; it rejects read requests
    until its applied index is within replica_lag_threshold_lsn of the leader's commit index (configurable, default: 50 000 entries ≈ 500 ms at 100 k writes/s).
  4. Once caught up, the replica transitions to READY and begins accepting reads.
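
The CATCHING_UP to READY gate in steps 3 and 4 can be sketched as a tiny state holder. The `ReplicaGate` type and its methods are hypothetical names; only the transition described in this section is modelled, so a replica that later falls behind is not moved back to CATCHING_UP here.

```go
package main

import "fmt"

type ReplicaState int

const (
	CatchingUp ReplicaState = iota
	Ready
)

// ReplicaGate decides whether the replica gateway may serve reads.
type ReplicaGate struct {
	State        ReplicaState
	LagThreshold uint64 // maximum lag in Raft entries, default 50 000 per the text above
}

// Observe is fed the leader's commit index and the learner's applied index.
// It moves the gate to Ready the first time the lag is within the threshold.
func (g *ReplicaGate) Observe(leaderCommit, learnerApplied uint64) {
	var lag uint64
	if leaderCommit > learnerApplied {
		lag = leaderCommit - learnerApplied
	}
	if g.State == CatchingUp && lag <= g.LagThreshold {
		g.State = Ready
	}
}

// AcceptsReads reports whether read requests should be served or rejected.
func (g *ReplicaGate) AcceptsReads() bool { return g.State == Ready }

func main() {
	gate := &ReplicaGate{State: CatchingUp, LagThreshold: 50000}
	gate.Observe(2000000, 1000000) // still 1 000 000 entries behind
	fmt.Println(gate.AcceptsReads())
	gate.Observe(2000000, 1960000) // now 40 000 entries behind
	fmt.Println(gate.AcceptsReads())
}
```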

5.2 Warm standby

The replica gateway polls its TiKV learner stores for the current applied index every 100 ms and computes lag:

lag_entries = leader.commit_index - learner.applied_index
lag_ms      = lag_entries / avg_write_rate_per_ms

Both are published as Prometheus metrics (§6).
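
The two lag formulas translate directly into code. In this sketch `EstimateLag` is a hypothetical helper name and the write rate is passed in as a parameter; how avg_write_rate_per_ms is measured is not specified by the RFC. The example in `main` reproduces the arithmetic from §5.1: 50 000 entries at 100 writes per millisecond (100 k writes/s) is 500 ms.

```go
package main

import "fmt"

// EstimateLag computes lag_entries and lag_ms from the formulas above.
// ok is false when entries are outstanding but the write rate is unknown
// (zero or negative), in which case no time estimate can be given.
func EstimateLag(leaderCommit, learnerApplied uint64, writesPerMs float64) (lagEntries uint64, lagMs float64, ok bool) {
	if leaderCommit > learnerApplied {
		lagEntries = leaderCommit - learnerApplied
	}
	if lagEntries == 0 {
		return 0, 0, true
	}
	if writesPerMs <= 0 {
		return lagEntries, 0, false
	}
	return lagEntries, float64(lagEntries) / writesPerMs, true
}

func main() {
	entries, ms, ok := EstimateLag(1050000, 1000000, 100)
	fmt.Printf("%d entries, %.0f ms, ok=%v\n", entries, ms, ok)
}
```

Guarding the subtraction matters because the two indices are sampled at slightly different moments, so the learner's applied index can briefly appear ahead of the sampled commit index.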

5.3 Promotion to primary (failover)

Trigger: the primary gateway fails its health check for more than failover_timeout_s seconds (default: 15 s).

Promotion steps:

  1. The operator (or automated watchdog) issues a PD command to convert the
    learner store to a voter:

    pd-ctl operator add transfer-learner-to-voter <store-id>
  2. Once the learner becomes a voter, Raft elects a new leader from the voter
    set (sub-second in normal conditions).
  3. The replica gateway reconfigures itself to role = "primary", registers
    with the service-discovery layer, and begins accepting writes.
  4. The old primary, if it recovers, rejoins as a learner (or is added back as
    voter after data reconciliation via WAL stream comparison).

Target RTO: *< 30 s* (steps 1–3 combined; PD learner-to-voter promotion is typically < 5 s; DNS/service-discovery TTL is the main variable).
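
The trigger condition can be sketched as a small watchdog that is fed health-check results. `Watchdog` and `Observe` are hypothetical names, timestamps are plain seconds to keep the example deterministic, and the actual PD call and the distributed lock are out of scope here. The sketch only decides when promotion should fire, and fires at most once.

```go
package main

import "fmt"

// Watchdog tracks how long the primary has been failing its health check.
type Watchdog struct {
	FailoverTimeoutS int64 // 0 means manual promotion only
	failing          bool
	failingSinceS    int64
	triggered        bool
}

// Observe records one health-check result taken at nowS (seconds) and reports
// whether automated promotion should be triggered as a result of it.
func (w *Watchdog) Observe(nowS int64, primaryHealthy bool) bool {
	if primaryHealthy {
		w.failing = false // any success resets the failure window
		return false
	}
	if !w.failing {
		w.failing = true
		w.failingSinceS = nowS
	}
	if w.FailoverTimeoutS == 0 || w.triggered {
		return false
	}
	if nowS-w.failingSinceS > w.FailoverTimeoutS {
		w.triggered = true
		return true
	}
	return false
}

func main() {
	w := &Watchdog{FailoverTimeoutS: 15}
	for _, t := range []int64{0, 5, 10, 15, 16} {
		fmt.Printf("t=%ds promote=%v\n", t, w.Observe(t, false))
	}
}
```

With the default of 15 s, promotion fires on the first failed check taken more than 15 s after the failure window opened, which leaves the rest of the 30 s RTO budget for the PD transition and service discovery.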


6. Observability

6.1 Prometheus metrics

| Metric | Type | Description |
|---|---|---|
| kdb_replica_lag_entries{replica_id} | gauge | Raft index entries behind leader |
| kdb_replica_lag_ms{replica_id} | gauge | Estimated lag in milliseconds |
| kdb_replica_state{replica_id} | gauge | 0=catching_up, 1=ready, 2=unhealthy |
| kdb_read_routed_replica_total{replica_id} | counter | Reads forwarded to this replica |
| kdb_read_fallback_total{reason} | counter | Reads falling back to primary (reason: lag, error, noRep) |
| kdb_replica_heartbeat_latency_ms | histogram | Heartbeat round-trip latency |
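
To show what the three per-replica gauges look like on the wire, the sketch below renders them in the Prometheus text exposition format using only the standard library. A real gateway would more likely use a Prometheus client library; `ReplicaSample` and `RenderMetrics` are hypothetical names, and label values are quoted with Go's `%q`, which matches Prometheus escaping for simple IDs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// ReplicaSample is one replica's current values for the three gauges.
type ReplicaSample struct {
	LagEntries uint64
	LagMs      uint64
	State      uint64 // 0=catching_up, 1=ready, 2=unhealthy
}

// RenderMetrics writes the gauges in text exposition format, sorted by
// replica ID so the output is stable between scrapes.
func RenderMetrics(samples map[string]ReplicaSample) string {
	ids := make([]string, 0, len(samples))
	for id := range samples {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var b strings.Builder
	gauge := func(name string, value func(ReplicaSample) uint64) {
		fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
		for _, id := range ids {
			fmt.Fprintf(&b, "%s{replica_id=%q} %d\n", name, id, value(samples[id]))
		}
	}
	gauge("kdb_replica_lag_entries", func(s ReplicaSample) uint64 { return s.LagEntries })
	gauge("kdb_replica_lag_ms", func(s ReplicaSample) uint64 { return s.LagMs })
	gauge("kdb_replica_state", func(s ReplicaSample) uint64 { return s.State })
	return b.String()
}

func main() {
	fmt.Print(RenderMetrics(map[string]ReplicaSample{
		"r1": {LagEntries: 1200, LagMs: 12, State: 1},
	}))
}
```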

6.2 Alerting rules

- alert: KdbReplicaHighLag
  expr: kdb_replica_lag_ms > 2000
  for: 2m
  annotations:
    summary: "Read replica {{ $labels.replica_id }} lag > 2s"

- alert: KdbReplicaUnhealthy
  expr: kdb_replica_state == 2
  for: 30s
  annotations:
    summary: "Read replica {{ $labels.replica_id }} is unhealthy"

- alert: KdbReadFallbackSpike
  expr: rate(kdb_read_fallback_total[5m]) > 100
  annotations:
    summary: "High primary fallback rate — check replica health"

7. Configuration

New fields in gateway.toml:

[replica]
# "primary" | "replica" — role of this gateway instance.
role = "primary"

# replica_id is a stable identifier for this replica (used in metrics).
# Required when role = "replica".
replica_id = ""

# primary_addr is the gRPC endpoint of the primary gateway.
# Required when role = "replica".
primary_addr = ""

# lag_threshold_lsn is the maximum lag (in Raft entries) before the
# replica rejects reads and the primary falls back to local serving.
lag_threshold_lsn = 50000

# failover_timeout_s is the number of seconds of primary unavailability
# before automated promotion is triggered (0 = manual only).
failover_timeout_s = 15

# heartbeat_interval_ms is how often the replica pushes a heartbeat to the
# primary. Must be < replica health check timeout (5 s).
heartbeat_interval_ms = 1000
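
The constraints stated in the comments above (fields required for replicas, heartbeat interval below the 5 s health timeout) could be enforced at startup along these lines. This is a sketch only: `ReplicaConfig`, `DefaultReplicaConfig`, and `Validate` are hypothetical names, TOML decoding is omitted, and the sample ID and address in `main` are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// ReplicaConfig mirrors the [replica] section of gateway.toml.
type ReplicaConfig struct {
	Role                string
	ReplicaID           string
	PrimaryAddr         string
	LagThresholdLSN     uint64
	FailoverTimeoutS    int
	HeartbeatIntervalMs int
}

// healthTimeoutMs is the 5 s replica health check timeout from §4.2.
const healthTimeoutMs = 5000

// DefaultReplicaConfig returns the defaults listed above.
func DefaultReplicaConfig() ReplicaConfig {
	return ReplicaConfig{
		Role:                "primary",
		LagThresholdLSN:     50000,
		FailoverTimeoutS:    15,
		HeartbeatIntervalMs: 1000,
	}
}

// Validate returns the first constraint violation it finds, or nil.
func (c ReplicaConfig) Validate() error {
	switch c.Role {
	case "primary":
		// replica_id and primary_addr are not required for the primary.
	case "replica":
		if c.ReplicaID == "" {
			return errors.New(`replica_id is required when role = "replica"`)
		}
		if c.PrimaryAddr == "" {
			return errors.New(`primary_addr is required when role = "replica"`)
		}
	default:
		return fmt.Errorf(`role must be "primary" or "replica", got %q`, c.Role)
	}
	if c.HeartbeatIntervalMs <= 0 || c.HeartbeatIntervalMs >= healthTimeoutMs {
		return fmt.Errorf("heartbeat_interval_ms must be between 1 and %d, got %d",
			healthTimeoutMs-1, c.HeartbeatIntervalMs)
	}
	if c.FailoverTimeoutS < 0 {
		return errors.New("failover_timeout_s must be >= 0 (0 = manual only)")
	}
	return nil
}

func main() {
	cfg := DefaultReplicaConfig()
	cfg.Role = "replica"
	fmt.Println(cfg.Validate()) // replica_id is still empty
	cfg.ReplicaID, cfg.PrimaryAddr = "gru2-r1", "10.0.0.1:7000"
	fmt.Println(cfg.Validate())
}
```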

8. Implementation Plan

Phase 1 — TiKV learner provisioning + follower read (1 week)

  1. Add pd-ctl helper commands to kdb-admin CLI:
    • kdb-admin replica add --store <id>: add learner peer for all regions.
    • kdb-admin replica status: show lag per region per learner.
  2. Enable TiKV kv.WithReplicaRead(kv.ReplicaReadFollower) in the Go client
    for SNAPSHOT reads.
  3. Benchmark: measure p99 read latency improvement vs. primary for a
    mixed OLTP + analytics workload.

Phase 2 — Replica gateway mode + routing (2 weeks)

  1. Add role = "replica" configuration; replica gateway rejects writes.
  2. Implement ReplicaRegistry with heartbeat gRPC stream.
  3. Implement RouteRead algorithm; integrate into the kdb-gateway read path.
  4. Integration test: primary gateway correctly routes to replica and falls
    back on lag breach.

Phase 3 — Lag tracking + metrics (1 week)

  1. Expose Prometheus metrics (§6).
  2. Add alerting rules to the kdb Grafana dashboard.
  3. Implement STALE consistency level with max_staleness_ms enforcement.

Phase 4 — Promotion and failover (1–2 weeks)

  1. Implement automated watchdog in the replica gateway (monitors primary
    heartbeat; triggers PD learner-to-voter after failover_timeout_s).

  2. Write runbook for manual promotion (§5.3) and add to docs/ops/.
  3. Chaos test: kill primary gateway, verify RTO < 30 s, verify no data loss.

9. Non-Goals (v1)

  • *Cross-region read replicas*: covered by RFC-008 (multi-region). This RFC
    focuses on intra-region replicas (same datacenter, different zone/rack).
  • *Write forwarding from replica to primary*: clients must connect to the
    primary for writes. No transparent write proxy.
  • *Multiple simultaneous promotions*: at most one promotion in flight at a
    time; the watchdog uses a distributed lock (PD KV store) to avoid split-brain.
  • *Logical replication for heterogeneous targets* (e.g., PostgreSQL read
    replica): out of scope; see CDC RFC (future).


10. Open Questions

  1. *Learner count per region*: how many learner peers per TiKV region?
    One learner per replica deployment is simplest, but read replicas can share a single learner store to avoid excessive Raft message fan-out. Trade-off: failure isolation vs. resource efficiency.

  2. *Cross-consistency within a transaction*: if a client mixes SNAPSHOT reads
    and STRONG writes in one logical transaction, the kdb-gateway must ensure the SNAPSHOT reads observe at least the write timestamp from the same session. Current proposal: use session-level min_read_ts watermark. Needs prototyping.

  3. *Learner snapshot size*: a new learner must receive a full snapshot of all
    TiKV regions before it can apply the delta log. At 1 TB of data, snapshot transfer alone may take hours. Should kdb-next support "warm learner", a learner pre-seeded from a PITR base snapshot (RFC-010 §5) to reduce catch-up time?

  4. *Service discovery*: the replica registry currently uses a static
    primary_addr in config. For dynamic environments (Kubernetes, LXC with floating IPs), the registry should be backed by a service-discovery system (e.g., PD KV store used as a lightweight etcd). What is the preferred mechanism for kdb-next's target infra?
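
For the question on cross-consistency within a transaction, the proposed session-level min_read_ts watermark could be prototyped roughly as follows. This is an exploratory sketch, not a settled design: `Session`, `ObserveWrite`, and `CanServeFrom` are hypothetical names, and it assumes the gateway can obtain a timestamp up to which a replica's applied state is complete.

```go
package main

import (
	"fmt"
	"sync"
)

// Session tracks the highest commit timestamp written through this session.
type Session struct {
	mu        sync.Mutex
	minReadTS uint64
}

// ObserveWrite raises the watermark after a write commits at commitTS.
// The watermark never moves backwards.
func (s *Session) ObserveWrite(commitTS uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if commitTS > s.minReadTS {
		s.minReadTS = commitTS
	}
}

// CanServeFrom reports whether a replica whose applied state is complete up to
// replicaTS may serve a read for this session without hiding its own writes.
func (s *Session) CanServeFrom(replicaTS uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return replicaTS >= s.minReadTS
}

func main() {
	var s Session
	s.ObserveWrite(120)
	fmt.Println(s.CanServeFrom(100)) // replica has not yet applied the session's write
	fmt.Println(s.CanServeFrom(120)) // replica has caught up to the write
}
```

When `CanServeFrom` is false the gateway would have to wait for the replica or route the read to the primary; which of the two is preferable is part of what needs prototyping.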

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kdb-RFC-011-standby-read-replicas.md