Kdb RFC 010 wal streaming replication

RFC010 — kdbnext: WAL Streaming Replication

Field	Value
Status	artial— Postgres-compatible streaming shipped in pgwire (START_REPLICATION + slots + XLogData); gRPC WalStream + PITR CLI tracked as #525–#528.
Author(s)	Rodrigo (with Claude as scribe)
Date	20260416 (drafted) · 20260426 (audit)
Audit	`infra/data/kdb/docs/RFC-010-AUDIT.md`
Related	RFC~~008 (multi~~region), RFC~~009 (active~~active), Ticket #043, #044
Target module	`infra/data/kdb/`

1. Summary

kdb~~next uses TiKV's Raft log for intra~~shard replication, but has no public-facing continuous WAL stream. This RFC defines a *AL streaming interface*that external consumers can subscribe to: replica nodes, incremental backup agents, CDC pipelines, and audit systems.

Design goals:

*esumable* a client can reconnect at any previously seen LSN without data loss.
*ow latency* entries are delivered within one RTT of Raft commit.
*ackpressure-safe* clients that fall behind don't cause OOM on the server.
*uthenticated* all subscriptions require a valid KDB Auth bearer token.
*ITR~~ready* the stream feeds into incremental backups for Point~~in-Time Recovery.

2. Log Sequence Number (LSN)

*esolution per ADR~~001 (2026~~04~~26):*kdb~~next keeps local_lsn = u64 for PostgreSQL pgwire compatibility, and exposes a separate *28-bit GlobalLsn**only on the gRPC WalStream surface and PITR backup segments* for cross-region ordering. The original draft below described a 128-bit replacement for the u64 LSN; that variant was rejected because it broke pgwire compat. See infra/data/kdb/docs/adr/ADR-001-lsn-format.md for the full rationale.

2.1 Structure

A kdb~~next GlobalLsn is a 128~~bit value composed of:

┌──────────────────────────┬──────────────┬──────────────────┐
│  global_ts (64 bits)     │ region_id    │ raft_index       │
│  (PD timestamp, ms + logi│ (32 bits)    │ (32 bits)        │
└──────────────────────────┴──────────────┴──────────────────┘

global_ts: PD~~issued timestamp (the same 64~~bit HLC used for MVCC),
guarantees global ordering across regions.
region_id: TiKV region (shard) ID.
raft_index: monotonically increasing index within the Raft log of this region.

String representation: {global_ts_hex}.{region_id}.{raft_index} — lexicographically sortable.

The pgwire replication path (START_REPLICATION, replication slots, pg_current_wal_lsn()) continues to use the local_lsn: u64 unchanged. The gRPC WalStream service (ticket #525) carries both.

2.2 LSN ordering

LSNs are totally ordered by (global_ts, region_id, raft_index). A WAL subscription with start_lsn = X receives all entries with LSN ≥ X.

The *lobal watermark*is the minimum LSN across all active subscriptions on a gateway node. The server retains WAL entries until the global watermark advances past them (bounded retention, see §6).

3. WAL Entry Format

Each WAL entry is a Protobuf message:

syntax = "proto3";
package kdb.wal.v1;

message WalEntry {
  string   lsn             = 1;   // "{global_ts_hex}.{region_id}.{raft_index}"
  uint64   committed_at_ms = 2;   // wall-clock commit time (UTC milliseconds)
  string   tenant_id       = 3;   // tenant identifier
  string   key_prefix      = 4;   // e.g. "sa-east-1:tenant:42:"
  OpType   op_type         = 5;
  bytes    key             = 6;   // full key (without region prefix)
  bytes    value           = 7;   // absent for Delete operations
  uint64   hlc_ts          = 8;   // Hybrid Logical Clock timestamp
  bytes    checksum        = 9;   // CRC32C of (key + value) for integrity
}

enum OpType {
  OP_PUT    = 0;
  OP_DELETE = 1;
  OP_BATCH  = 2;   // value contains a serialized BatchWrite proto
}

message WalBatch {
  repeated WalEntry entries = 1;
  string            first_lsn = 2;
  string            last_lsn  = 3;
  Compression       compression = 4;
}

enum Compression {
  NONE   = 0;
  SNAPPY = 1;   // default for batches > 4 KB
  ZSTD   = 2;   // preferred for large batches (> 64 KB)
}

The server may coalesce multiple entries into a WalBatch when:

Subscriber is behind (catch-up mode): batches of up to 1000 entries.
High write rate: entries committed within the same 10 ms window are batched.
Compression kicks in automatically for batches > 4 KB.

4. Subscription Protocol

4.1 gRPC service definition

service WalStream {
  // Subscribe streams WAL entries from start_lsn.
  // The server streams entries until the client cancels or disconnects.
  rpc Subscribe(SubscribeRequest) returns (stream WalResponse);

  // Ack confirms the client has processed up to ack_lsn.
  // The server uses this to advance the watermark and reclaim WAL storage.
  // Called as a unary RPC after processing a batch (not per-entry).
  rpc Ack(AckRequest) returns (AckResponse);

  // GetLSN returns the current head LSN and the oldest retained LSN.
  // Clients use this to determine whether their start_lsn is still available.
  rpc GetLSN(GetLSNRequest) returns (GetLSNResponse);
}

message SubscribeRequest {
  string start_lsn        = 1;  // start streaming from this LSN (inclusive)
  string tenant_filter    = 2;  // empty = all tenants (requires admin token)
  string key_prefix_filter = 3; // empty = all keys
  uint32 batch_size_hint  = 4;  // suggested entries per batch (server may ignore)
}

message WalResponse {
  oneof payload {
    WalBatch batch       = 1;
    Heartbeat heartbeat  = 2;  // sent every 5s when idle
    string    error      = 3;  // terminal error (server closes stream after)
  }
}

message Heartbeat {
  string current_lsn = 1;  // current head LSN (replica can verify lag)
}

message AckRequest  { string ack_lsn = 1; }
message AckResponse { string watermark_lsn = 1; }

message GetLSNRequest  {}
message GetLSNResponse {
  string head_lsn    = 1;  // latest committed LSN
  string oldest_lsn  = 2;  // oldest retained LSN (= global watermark - 1)
}

4.2 Authentication

All gRPC calls require:

Authorization: Bearer <kdb-auth-jwt>

The JWT must include one of:

role: "replication" — subscribe to a specific tenant's WAL.
role: "admin" — subscribe to all tenants (used by backup agents).

Tokens are validated against the kdb~~auth JWKS endpoint (RFC~~001 §5).

4.3 Resumption after disconnect

On reconnect, the client replays SubscribeRequest with the last acked LSN as start_lsn. The server resumes from that LSN if it is within the retention window. If start_lsn is older than oldest_lsn, the server returns:

error: "lsn_not_available: start_lsn=<X> older than oldest_lsn=<Y>; "
       "perform a base snapshot and restart from head_lsn=<Z>"

4.4 Backpressure

The server uses a per-subscription send queue (default: 10,000 entries). If the queue is full for more than backpressure_timeout (default: 30 s), the server terminates the stream with:

error: "backpressure_timeout: subscriber too slow"

The client should reconnect (potentially with a larger LSN skip if it doesn't need historical data) or increase its processing throughput.

5. PointinTime Recovery (PITR) Integration

5.1 Architecture

PITR uses two components:

*ase snapshot* a consistent full backup of all TiKV key ranges at a
specific LSN. Created by the existing tenant-shard-migrate snapshot mechanism (or a dedicated kdb snapshot command).
*AL segment files* rolling WAL stream written to disk by a backup agent
(subscribes via WalStream.Subscribe with role: admin). Each segment covers a LSN range.

Recovery to time T:

Find the most recent base snapshot before T.
Replay WAL segments from snapshot_lsn to T.
All entries with committed_at_ms > T are discarded.

5.2 WAL segment file format

Each segment file: kdb-wal-{first_lsn}-{last_lsn}.bin

[magic: 4 bytes "KWAL"]
[version: 2 bytes]
[first_lsn: varint]
[last_lsn: varint]
[entry_count: varint]
[entries: repeated length-prefixed WalEntry proto]
[footer_crc32: 4 bytes]

Segments are rotated when:

Size exceeds 128 MB.
Time since first entry exceeds 1 hour.
The backup agent receives a HeartbeatWatermark event from the server.

5.3 Segment retention policy

Segment retention is configured per backup policy:

pitr_retention_days: 7 — keep WAL segments for 7 days.
Older segments are pruned by the backup agent (runs GC every hour).
Base snapshots older than pitr_retention_days + buffer_days are also pruned.

6. Server-Side WAL Retention

The server retains WAL entries for a minimum of wal_retention_minutes (default: 60 minutes) plus the lag of the slowest active subscriber. This bounds the memory footprint while allowing slow subscribers to catch up.

TiKV's compaction is gated: the kdb-gateway sets a safe_point in PD that prevents compaction of MVCC versions that are still within the WAL retention window. This is the same mechanism TiDB uses for its CDC/TiFlash streaming.

6.1 Memory bound estimate

At 100,000 writes/s with an average entry size of 256 bytes and a 60-minute window:

100,000 × 60 × 60 × 256 bytes ≈ 92 GB

This is too large for in~~memory retention. The gateway maintains only an in~~memory *ndex*(LSN → TiKV MVCC version), and re-reads the actual KV data from TiKV on demand. The index size:

100,000 × 3600 × 32 bytes (LSN + version) ≈ 11.5 GB

For 1 million writes/s, the index reaches ~115 GB — acceptable for a 256 GB gateway node. Above this scale, the index is sharded per Raft region.

7. Implementation Plan

Phase 1 — Protobuf schema + GetLSN (1 week)

Add kdb/wal/v1/wal.proto to infra/data/kdb/proto/.
Implement GetLSN RPC: reads current PD timestamp + min Raft index across
live regions. No streaming yet.
Unit tests for LSN encoding/ordering.

Implement WalStream.Subscribe gRPC server-side stream in Rust.
Per-subscription queue with backpressure enforcement.
Catch-up mode: replay from TiKV MVCC scan for start_lsn < current.
Live mode: hook into TiKV CDC (Change Data Capture) observer.
Integration tests: subscriber receives all committed writes.

Phase 3 — Ack + watermark + retention (1 week)

Ack RPC updates per-subscriber watermark.
Global watermark = min(all subscriber watermarks).
GC job: evict in-memory index entries behind global watermark.
Update safe_point in PD (gate TiKV MVCC compaction).

Phase 4 — PITR backup agent (2 weeks)

CLI kdb wal backup --pitr-dir /backups --segment-size 128m — connects as
admin subscriber and writes segment files.
CLI kdb wal restore --snapshot <path> --wal-dir /backups --to-time <RFC3339>.
Segment rotation, CRC verification, retention GC.

Phase 5 — Observability (1 week)

Prometheus metrics:

kdb_wal_subscriber_lag_lsn{subscriber_id} — entries behind head.
kdb_wal_queue_depth{subscriber_id} — pending entries in send queue.
kdb_wal_bytes_sent_total — total bytes sent to all subscribers.
kdb_wal_retention_index_bytes — in-memory index size.

Alerting:

kdb_wal_subscriber_lag_lsn > 1_000_000 for > 5 min → warn (subscriber
falling behind, may miss retention window).

8. Non-Goals (v1)

Logical replication (PostgreSQL~~style row~~level change events for SQL tables).
The WAL stream is physical (raw KV operations). Logical CDC is a separate RFC.
Encryption of WAL stream in transit beyond TLS (mTLS is enforced via gRPC).
Fan-out to > 100 simultaneous subscribers per gateway (Phase 1–3 target: 10).
Cross~~version WAL compatibility (format can break between major kdb~~next versions).

9. Open Questions

*iKV CDC API* TiKV's CDC observer API (ChangeDataEvent) exposes region-level
change events. Mapping these to per-key WAL entries requires filtering by key prefix (tenant boundary). The current TiKV CDC API may not support per-prefix subscriptions — needs investigation.

*chema for OP_BATCH* should batch writes be flattened into individual
OP_PUT/OP_DELETE entries by the server, or should the subscriber receive the batch as a single atomic unit? Atomic batches are important for relational table row updates that span multiple KV keys.

*ompaction interaction* when TiKV compacts MVCC versions, the original
write LSNs disappear. For catch-up from very old start_lsn, the gateway may need to synthesize WAL entries from the compacted snapshot + delta. Alternatively, base snapshot is mandatory for any start_lsn older than wal_retention_minutes. Document which model kdb-next will use.

*egment encryption for PITR files* WAL segments may contain sensitive
user data (messages, tokens, documents). Segments at rest should be encrypted. What key management approach? Options: envelope encryption via Koder Keys (RFC~~014), or a per~~backup-job symmetric key stored in the segment header.