Kdb RFC 010 wal streaming replication

RFC-010: kdb-next WAL Streaming Replication

Status: Partial. Postgres-compatible streaming shipped in pgwire (START_REPLICATION + slots + XLogData); gRPC WalStream + PITR CLI tracked as #525–#528.
Author(s): Rodrigo (with Claude as scribe)
Date: 2026-04-16 (drafted) · 2026-04-26 (audit)
Audit: infra/data/kdb/docs/RFC-010-AUDIT.md
Related: RFC-008 (multi-region), RFC-009 (active-active), Ticket #043, #044
Target module: infra/data/kdb/

1. Summary

kdb-next uses TiKV's Raft log for intra-shard replication, but has no public-facing continuous WAL stream. This RFC defines a WAL streaming interface that external consumers can subscribe to: replica nodes, incremental backup agents, CDC pipelines, and audit systems.

Design goals:

  • Resumable: a client can reconnect at any previously seen LSN without data loss.
  • Low latency: entries are delivered within one RTT of Raft commit.
  • Backpressure-safe: clients that fall behind don't cause OOM on the server.
  • Authenticated: all subscriptions require a valid KDB Auth bearer token.
  • PITR-ready: the stream feeds into incremental backups for Point-in-Time Recovery.

2. Log Sequence Number (LSN)

Resolution per ADR-001 (2026-04-26): kdb-next keeps local_lsn = u64 for PostgreSQL pgwire compatibility, and exposes a separate 128-bit GlobalLsn only on the gRPC WalStream surface and PITR backup segments, for cross-region ordering. The original draft below described a 128-bit replacement for the u64 LSN; that variant was rejected because it broke pgwire compat. See infra/data/kdb/docs/adr/ADR-001-lsn-format.md for the full rationale.

2.1 Structure

A kdb-next GlobalLsn is a 128-bit value composed of:

┌──────────────────────────────┬──────────────┬──────────────┐
│ global_ts (64 bits)          │ region_id    │ raft_index   │
│ (PD timestamp, ms + logical) │ (32 bits)    │ (32 bits)    │
└──────────────────────────────┴──────────────┴──────────────┘
  • global_ts: PD-issued timestamp (the same 64-bit HLC used for MVCC); guarantees global ordering across regions.

  • region_id: TiKV region (shard) ID.
  • raft_index: monotonically increasing index within the Raft log of this region.

String representation: {global_ts_hex}.{region_id}.{raft_index}. The string sorts lexicographically in LSN order only when every component is written at a fixed width (zero-padded).

The pgwire replication path (START_REPLICATION, replication slots, pg_current_wal_lsn()) continues to use the local_lsn: u64 unchanged. The gRPC WalStream service (ticket #525) carries both.
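A minimal sketch of the GlobalLsn layout and its string form, using only the Rust standard library. The type and method names are illustrative, not taken from the kdb-next codebase. One assumption is made loudly: all three components are printed as fixed-width hex so that string order matches numeric order; the RFC only states hex for global_ts.

```rust
use std::fmt;
use std::str::FromStr;

/// Sketch of the 128-bit GlobalLsn from §2.1. Field order matters: the
/// derived `Ord` compares (global_ts, region_id, raft_index) in that order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct GlobalLsn {
    pub global_ts: u64,
    pub region_id: u32,
    pub raft_index: u32,
}

impl GlobalLsn {
    /// Packs the fields into one u128, most significant field first.
    pub fn to_u128(self) -> u128 {
        ((self.global_ts as u128) << 64)
            | ((self.region_id as u128) << 32)
            | (self.raft_index as u128)
    }

    /// Inverse of `to_u128`; the casts deliberately truncate to each field.
    pub fn from_u128(raw: u128) -> Self {
        GlobalLsn {
            global_ts: (raw >> 64) as u64,
            region_id: (raw >> 32) as u32,
            raft_index: raw as u32,
        }
    }
}

/// ASSUMPTION: fixed-width, zero-padded hex for every component.
impl fmt::Display for GlobalLsn {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{:016x}.{:08x}.{:08x}",
            self.global_ts, self.region_id, self.raft_index
        )
    }
}

impl FromStr for GlobalLsn {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let parts: Vec<&str> = s.split('.').collect();
        if parts.len() != 3 {
            return Err(format!("expected 3 fields, got {}", parts.len()));
        }
        let global_ts = u64::from_str_radix(parts[0], 16).map_err(|e| e.to_string())?;
        let region_id = u32::from_str_radix(parts[1], 16).map_err(|e| e.to_string())?;
        let raft_index = u32::from_str_radix(parts[2], 16).map_err(|e| e.to_string())?;
        Ok(GlobalLsn { global_ts, region_id, raft_index })
    }
}
```

With this encoding, comparing two LSNs as structs, as packed u128 values, or as strings gives the same answer, which is what lets the string form appear in file names and filters.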

2.2 LSN ordering

LSNs are totally ordered by (global_ts, region_id, raft_index). A WAL subscription with start_lsn = X receives all entries with LSN ≥ X.

The global watermark is the minimum LSN across all active subscriptions on a gateway node. The server retains WAL entries until the global watermark advances past them (bounded retention, see §6).
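The watermark rule can be sketched as a small tracker that remembers the last acked LSN per subscription and takes the minimum. This is an illustration only: the WatermarkTracker type, its method names, and the use of a packed u128 as the LSN are assumptions, not part of the RFC.

```rust
use std::collections::HashMap;

/// ASSUMPTION: LSNs are compared as packed 128-bit integers.
type Lsn = u128;

/// Tracks the last acked LSN per subscription (hypothetical helper).
#[derive(Default)]
pub struct WatermarkTracker {
    acked: HashMap<String, Lsn>,
}

impl WatermarkTracker {
    /// Records an ack. A stale ack never moves a subscriber backwards.
    pub fn ack(&mut self, subscriber_id: &str, lsn: Lsn) {
        let slot = self.acked.entry(subscriber_id.to_string()).or_insert(lsn);
        if lsn > *slot {
            *slot = lsn;
        }
    }

    /// Drops a subscription so it no longer holds the watermark back.
    pub fn remove(&mut self, subscriber_id: &str) {
        self.acked.remove(subscriber_id);
    }

    /// Minimum acked LSN across active subscriptions, or None if there are none.
    pub fn global_watermark(&self) -> Option<Lsn> {
        self.acked.values().copied().min()
    }
}
```

Returning None when no subscription is active leaves the caller free to fall back to the time-based retention window from §6 instead of treating "no subscribers" as "retain nothing".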


3. WAL Entry Format

Each WAL entry is a Protobuf message:

syntax = "proto3";
package kdb.wal.v1;

message WalEntry {
  string   lsn             = 1;   // "{global_ts_hex}.{region_id}.{raft_index}"
  uint64   committed_at_ms = 2;   // wall-clock commit time (UTC milliseconds)
  string   tenant_id       = 3;   // tenant identifier
  string   key_prefix      = 4;   // e.g. "sa-east-1:tenant:42:"
  OpType   op_type         = 5;
  bytes    key             = 6;   // full key (without region prefix)
  bytes    value           = 7;   // absent for Delete operations
  uint64   hlc_ts          = 8;   // Hybrid Logical Clock timestamp
  bytes    checksum        = 9;   // CRC32C of (key + value) for integrity
}

enum OpType {
  OP_PUT    = 0;
  OP_DELETE = 1;
  OP_BATCH  = 2;   // value contains a serialized BatchWrite proto
}

message WalBatch {
  repeated WalEntry entries = 1;
  string            first_lsn = 2;
  string            last_lsn  = 3;
  Compression       compression = 4;
}

enum Compression {
  NONE   = 0;
  SNAPPY = 1;   // default for batches > 4 KB
  ZSTD   = 2;   // preferred for large batches (> 64 KB)
}
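The checksum field is described as CRC32C over key and value. The sketch below computes CRC32C (Castagnoli) bit by bit with the standard library only; a production build would more likely use a hardware-accelerated crate. Two details are assumptions, since the schema comment does not fix them: key and value are concatenated with no separator, and the 4 checksum bytes are stored big-endian.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Checksum for a WAL entry.
/// ASSUMPTION: plain concatenation key || value, big-endian output.
pub fn entry_checksum(key: &[u8], value: &[u8]) -> [u8; 4] {
    let mut buf = Vec::with_capacity(key.len() + value.len());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
    crc32c(&buf).to_be_bytes()
}
```

Under plain concatenation, the pairs ("ab", "c") and ("a", "bc") produce the same checksum. If that matters for integrity checking, the schema would need to say that a length prefix is included in the checksummed bytes.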

The server may coalesce multiple entries into a WalBatch when:

  1. Subscriber is behind (catch-up mode): batches of up to 1000 entries.
  2. High write rate: entries committed within the same 10 ms window are batched.
  3. Compression kicks in automatically for batches > 4 KB.
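The batching and compression thresholds above can be sketched as two pure functions. This is an illustration under stated assumptions: 1 KB is taken as 1024 bytes, the 10 ms window is measured from the first entry of each batch, and the function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    None,
    Snappy,
    Zstd,
}

/// ASSUMPTION: KB means 1024 bytes.
const SNAPPY_THRESHOLD: usize = 4 * 1024;
const ZSTD_THRESHOLD: usize = 64 * 1024;

/// Picks a codec from the uncompressed batch size, per the Compression enum comments.
pub fn choose_compression(batch_bytes: usize) -> Compression {
    if batch_bytes > ZSTD_THRESHOLD {
        Compression::Zstd
    } else if batch_bytes > SNAPPY_THRESHOLD {
        Compression::Snappy
    } else {
        Compression::None
    }
}

/// Groups commit timestamps (ms, ascending) into batches. An entry joins the
/// current batch if it is within `window_ms` of the batch's first entry and
/// the batch holds fewer than `max_entries` entries.
pub fn coalesce(commit_ms: &[u64], window_ms: u64, max_entries: usize) -> Vec<Vec<u64>> {
    let mut batches: Vec<Vec<u64>> = Vec::new();
    for &t in commit_ms {
        let fits = match batches.last() {
            Some(b) => b.len() < max_entries && t.saturating_sub(b[0]) < window_ms,
            None => false,
        };
        if fits {
            batches.last_mut().expect("checked above").push(t);
        } else {
            batches.push(vec![t]);
        }
    }
    batches
}
```

In catch-up mode the same function would be called with max_entries = 1000 and a window large enough to be irrelevant, so only the entry cap applies.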

4. Subscription Protocol

4.1 gRPC service definition

service WalStream {
  // Subscribe streams WAL entries from start_lsn.
  // The server streams entries until the client cancels or disconnects.
  rpc Subscribe(SubscribeRequest) returns (stream WalResponse);

  // Ack confirms the client has processed up to ack_lsn.
  // The server uses this to advance the watermark and reclaim WAL storage.
  // Called as a unary RPC after processing a batch (not per-entry).
  rpc Ack(AckRequest) returns (AckResponse);

  // GetLSN returns the current head LSN and the oldest retained LSN.
  // Clients use this to determine whether their start_lsn is still available.
  rpc GetLSN(GetLSNRequest) returns (GetLSNResponse);
}

message SubscribeRequest {
  string start_lsn        = 1;  // start streaming from this LSN (inclusive)
  string tenant_filter    = 2;  // empty = all tenants (requires admin token)
  string key_prefix_filter = 3; // empty = all keys
  uint32 batch_size_hint  = 4;  // suggested entries per batch (server may ignore)
}

message WalResponse {
  oneof payload {
    WalBatch batch       = 1;
    Heartbeat heartbeat  = 2;  // sent every 5s when idle
    string    error      = 3;  // terminal error (server closes stream after)
  }
}

message Heartbeat {
  string current_lsn = 1;  // current head LSN (replica can verify lag)
}

message AckRequest  { string ack_lsn = 1; }
message AckResponse { string watermark_lsn = 1; }

message GetLSNRequest  {}
message GetLSNResponse {
  string head_lsn    = 1;  // latest committed LSN
  string oldest_lsn  = 2;  // oldest retained LSN (= global watermark - 1)
}

4.2 Authentication

All gRPC calls require:

Authorization: Bearer <kdb-auth-jwt>

The JWT must include one of:

  • role: "replication" — subscribe to a specific tenant's WAL.
  • role: "admin" — subscribe to all tenants (used by backup agents).

Tokens are validated against the kdb-auth JWKS endpoint (RFC-001 §5).
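Once the JWT signature has been verified, the role check reduces to a small decision function. The sketch below shows one possible reading; it assumes a replication token carries the tenant it is scoped to, which the RFC does not spell out, and the error strings are hypothetical.

```rust
/// Role claim extracted from an already-validated JWT.
/// ASSUMPTION: a replication token is scoped to exactly one tenant_id.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Role {
    Replication { tenant_id: String },
    Admin,
}

/// Decides whether a Subscribe call with the given tenant_filter is allowed.
/// An empty filter means "all tenants" and needs an admin token (§4.1).
pub fn authorize_subscribe(role: &Role, tenant_filter: &str) -> Result<(), String> {
    match role {
        Role::Admin => Ok(()),
        Role::Replication { .. } if tenant_filter.is_empty() => {
            Err("permission_denied: empty tenant_filter requires an admin token".to_string())
        }
        Role::Replication { tenant_id } if tenant_id.as_str() == tenant_filter => Ok(()),
        Role::Replication { .. } => {
            Err("permission_denied: token is not scoped to the requested tenant".to_string())
        }
    }
}
```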

4.3 Resumption after disconnect

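On reconnect, the client resends SubscribeRequest with the last acked LSN as start_lsn. Because start_lsn is inclusive (§4.1), the entry at that LSN is delivered again, so consumers must apply entries idempotently or skip the first one. The server resumes from that LSN if it is within the retention window. If start_lsn is older than oldest_lsn, the server returns: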

error: "lsn_not_available: start_lsn=<X> older than oldest_lsn=<Y>; "
       "perform a base snapshot and restart from head_lsn=<Z>"

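The resumption check in §4.3 can be sketched as a pure function over the three LSNs involved. The ResumeDecision type is hypothetical, and LSNs are shown as packed u128 values printed in decimal for brevity; the real stream carries them as strings.

```rust
/// ASSUMPTION: LSNs are compared as packed 128-bit integers.
type Lsn = u128;

#[derive(Debug, PartialEq, Eq)]
pub enum ResumeDecision {
    /// Stream from this LSN (inclusive).
    Resume { from: Lsn },
    /// start_lsn fell out of retention; the client needs a base snapshot.
    LsnNotAvailable { message: String },
}

/// Applies the §4.3 rule: resume if start_lsn is still retained, otherwise
/// return the terminal error text that points the client at head_lsn.
pub fn decide_resume(start_lsn: Lsn, oldest_lsn: Lsn, head_lsn: Lsn) -> ResumeDecision {
    if start_lsn < oldest_lsn {
        ResumeDecision::LsnNotAvailable {
            message: format!(
                "lsn_not_available: start_lsn={} older than oldest_lsn={}; \
                 perform a base snapshot and restart from head_lsn={}",
                start_lsn, oldest_lsn, head_lsn
            ),
        }
    } else {
        ResumeDecision::Resume { from: start_lsn }
    }
}
```

A client can run the same comparison itself after calling GetLSN, which avoids opening a stream that is certain to be rejected.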
4.4 Backpressure

The server uses a per-subscription send queue (default: 10,000 entries). If the queue is full for more than backpressure_timeout (default: 30 s), the server terminates the stream with:

error: "backpressure_timeout: subscriber too slow"

The client should reconnect (from a later start_lsn if it does not need the skipped history) or increase its processing throughput.
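A minimal sketch of the per-subscription queue, assuming the timeout clock starts when the queue becomes full and resets as soon as one entry is drained. The SendQueue type is hypothetical, and time is passed in as monotonic milliseconds so the logic stays deterministic and easy to test.

```rust
use std::collections::VecDeque;

/// Bounded send queue with a "full for too long" detector (hypothetical type).
pub struct SendQueue<T> {
    items: VecDeque<T>,
    capacity: usize,
    timeout_ms: u64,
    /// Time at which the queue last became full, if it is still full.
    full_since_ms: Option<u64>,
}

impl<T> SendQueue<T> {
    pub fn new(capacity: usize, timeout_ms: u64) -> Self {
        SendQueue {
            items: VecDeque::with_capacity(capacity),
            capacity,
            timeout_ms,
            full_since_ms: None,
        }
    }

    /// Enqueues an item, or hands it back if the queue is already full.
    pub fn push(&mut self, item: T, now_ms: u64) -> Result<(), T> {
        if self.items.len() >= self.capacity {
            return Err(item);
        }
        self.items.push_back(item);
        if self.items.len() == self.capacity {
            self.full_since_ms = Some(now_ms);
        }
        Ok(())
    }

    /// Dequeues the next item for the network writer; draining clears the timer.
    pub fn pop(&mut self) -> Option<T> {
        let item = self.items.pop_front();
        if item.is_some() {
            self.full_since_ms = None;
        }
        item
    }

    /// True once the queue has been continuously full for more than timeout_ms.
    pub fn timed_out(&self, now_ms: u64) -> bool {
        match self.full_since_ms {
            Some(since) => now_ms.saturating_sub(since) > self.timeout_ms,
            None => false,
        }
    }
}
```

With the defaults from the text this would be constructed as SendQueue::new(10_000, 30_000). When timed_out returns true, the server would send the backpressure_timeout error and close the stream.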


5. Point-in-Time Recovery (PITR) Integration

5.1 Architecture

PITR uses two components:

  1. Base snapshot: a consistent full backup of all TiKV key ranges at a specific LSN. Created by the existing tenant-shard-migrate snapshot mechanism (or a dedicated kdb snapshot command).
  2. WAL segment files: rolling WAL stream written to disk by a backup agent (subscribes via WalStream.Subscribe with role: admin). Each segment covers an LSN range.

Recovery to time T:

  1. Find the most recent base snapshot before T.
  2. Replay WAL segments from snapshot_lsn to T.
  3. All entries with committed_at_ms > T are discarded.
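The three recovery steps can be sketched as a snapshot selection followed by a filter over WAL entries. The struct and function names are hypothetical. Two assumptions are made: each snapshot records the wall-clock time it was taken, and the snapshot already contains the entry at snapshot_lsn, so replay starts strictly after it.

```rust
/// Base snapshot metadata (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Snapshot {
    pub lsn: u128,
    pub taken_at_ms: u64,
}

/// The subset of WalEntry fields that recovery needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub lsn: u128,
    pub committed_at_ms: u64,
}

/// Step 1: the most recent snapshot taken before the target time.
pub fn pick_snapshot(snapshots: &[Snapshot], target_ms: u64) -> Option<&Snapshot> {
    snapshots
        .iter()
        .filter(|s| s.taken_at_ms < target_ms)
        .max_by_key(|s| s.taken_at_ms)
}

/// Steps 2 and 3: entries after the snapshot, dropping any with
/// committed_at_ms greater than the target time.
/// ASSUMPTION: the snapshot includes the entry at snapshot.lsn itself.
pub fn entries_to_replay<'a>(
    entries: &'a [Entry],
    snapshot: &Snapshot,
    target_ms: u64,
) -> Vec<&'a Entry> {
    entries
        .iter()
        .filter(|e| e.lsn > snapshot.lsn && e.committed_at_ms <= target_ms)
        .collect()
}
```

Filtering on committed_at_ms rather than stopping at the first late entry matters when wall-clock commit times are not strictly increasing in LSN order across regions.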

5.2 WAL segment file format

Each segment file: kdb-wal-{first_lsn}-{last_lsn}.bin

[magic: 4 bytes "KWAL"]
[version: 2 bytes]
[first_lsn: varint]
[last_lsn: varint]
[entry_count: varint]
[entries: repeated length-prefixed WalEntry proto]
[footer_crc32: 4 bytes]
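A sketch of how the fixed part of the segment header could be written. The layout leaves several details open, so the following are assumptions: the version is stored big-endian, varints use the same little-endian base-128 scheme as Protobuf, and an LSN is written as its packed 128-bit integer value. The entries and footer are left out because the footer checksum variant is not specified.

```rust
/// Appends `v` as a base-128 varint (7 payload bits per byte, low bits first).
pub fn put_varint(out: &mut Vec<u8>, mut v: u128) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

/// Reads one varint from the front of `buf`.
/// Returns the value and the number of bytes consumed, or None if the
/// input is truncated or longer than a u128 can hold.
pub fn get_varint(buf: &[u8]) -> Option<(u128, usize)> {
    let mut v: u128 = 0;
    for (i, &b) in buf.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 128 {
            return None;
        }
        v |= ((b & 0x7f) as u128) << shift;
        if b & 0x80 == 0 {
            return Some((v, i + 1));
        }
    }
    None
}

/// Encodes magic, version, first_lsn, last_lsn and entry_count.
/// ASSUMPTION: big-endian version, LSNs as packed 128-bit integers.
pub fn encode_header(version: u16, first_lsn: u128, last_lsn: u128, entry_count: u64) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(b"KWAL");
    out.extend_from_slice(&version.to_be_bytes());
    put_varint(&mut out, first_lsn);
    put_varint(&mut out, last_lsn);
    put_varint(&mut out, entry_count as u128);
    out
}
```

A full 128-bit LSN takes up to 19 bytes as a varint, more than the 16 bytes of a fixed-width field, so a fixed-width encoding may be the better fit if LSNs routinely use the high bits.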

Segments are rotated when:

  • Size exceeds 128 MB.
  • Time since first entry exceeds 1 hour.
  • The backup agent receives a HeartbeatWatermark event from the server.
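The three rotation triggers combine into a single predicate. The sketch below assumes 1 MB is 1024 × 1024 bytes and that segment age is measured on the backup agent's clock; the SegmentState type is hypothetical. The HeartbeatWatermark event is passed in as a plain flag because §4.1 does not define its message shape.

```rust
/// State of the segment currently being written (hypothetical shape).
pub struct SegmentState {
    pub bytes_written: u64,
    /// Agent-clock time of the first entry, None while the segment is empty.
    pub first_entry_ms: Option<u64>,
}

/// ASSUMPTION: MB means 1024 * 1024 bytes.
const MAX_SEGMENT_BYTES: u64 = 128 * 1024 * 1024;
const MAX_SEGMENT_AGE_MS: u64 = 60 * 60 * 1000;

/// True if any of the §5.2 rotation conditions holds.
pub fn should_rotate(seg: &SegmentState, now_ms: u64, watermark_event: bool) -> bool {
    let too_big = seg.bytes_written > MAX_SEGMENT_BYTES;
    let too_old = seg
        .first_entry_ms
        .map_or(false, |t| now_ms.saturating_sub(t) > MAX_SEGMENT_AGE_MS);
    too_big || too_old || watermark_event
}
```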

5.3 Segment retention policy

Segment retention is configured per backup policy:

  • pitr_retention_days: 7 — keep WAL segments for 7 days.
  • Older segments are pruned by the backup agent (runs GC every hour).
  • Base snapshots older than pitr_retention_days + buffer_days are also pruned.

6. Server-Side WAL Retention

The server retains WAL entries for a minimum of wal_retention_minutes (default: 60 minutes) plus the lag of the slowest active subscriber. This bounds the memory footprint while allowing slow subscribers to catch up.

TiKV's compaction is gated: the kdb-gateway sets a safe_point in PD that prevents compaction of MVCC versions that are still within the WAL retention window. This is the same mechanism TiDB uses for its CDC/TiFlash streaming.

6.1 Memory bound estimate

At 100,000 writes/s with an average entry size of 256 bytes and a 60-minute window:

100,000 × 60 × 60 × 256 bytes ≈ 92 GB

This is too large for in-memory retention. The gateway maintains only an in-memory index (LSN → TiKV MVCC version), and re-reads the actual KV data from TiKV on demand. The index size:

100,000 × 3600 × 32 bytes (LSN + version) ≈ 11.5 GB

For 1 million writes/s, the index reaches ~115 GB — acceptable for a 256 GB gateway node. Above this scale, the index is sharded per Raft region.
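The estimates in this section are products of write rate, window length, and bytes per entry. The helper below reproduces them so other rates or window sizes can be checked quickly; GB here is decimal (10^9 bytes), which matches the rounded figures quoted above. The function names are illustrative.

```rust
/// Bytes retained for a given write rate, window and per-entry size.
pub fn retained_bytes(writes_per_sec: u64, window_secs: u64, bytes_per_entry: u64) -> u64 {
    writes_per_sec * window_secs * bytes_per_entry
}

/// Converts bytes to decimal gigabytes (10^9 bytes).
pub fn as_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For the figures in the text: retained_bytes(100_000, 3600, 256) is 92,160,000,000 bytes (about 92 GB of raw entries), retained_bytes(100_000, 3600, 32) is 11,520,000,000 bytes (about 11.5 GB of index), and retained_bytes(1_000_000, 3600, 32) is 115,200,000,000 bytes (about 115 GB).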


7. Implementation Plan

Phase 1 — Protobuf schema + GetLSN (1 week)

  1. Add kdb/wal/v1/wal.proto to infra/data/kdb/proto/.
  2. Implement GetLSN RPC: reads current PD timestamp + min Raft index across live regions. No streaming yet.

  3. Unit tests for LSN encoding/ordering.

Phase 2 — Subscribe stream (2–3 weeks)

  1. Implement WalStream.Subscribe gRPC server-side stream in Rust.
  2. Per-subscription queue with backpressure enforcement.
  3. Catch-up mode: replay from TiKV MVCC scan for start_lsn < current.
  4. Live mode: hook into TiKV CDC (Change Data Capture) observer.
  5. Integration tests: subscriber receives all committed writes.

Phase 3 — Ack + watermark + retention (1 week)

  1. Ack RPC updates per-subscriber watermark.
  2. Global watermark = min(all subscriber watermarks).
  3. GC job: evict in-memory index entries behind global watermark.
  4. Update safe_point in PD (gate TiKV MVCC compaction).
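The GC step can be sketched with an ordered map from LSN to MVCC version. Following §6, an index entry is evicted only when it is both behind the global watermark and older than the retention window. The RetentionIndex type is hypothetical, and storing the commit time next to the MVCC version is an assumption made so the time check stays local to the index.

```rust
use std::collections::BTreeMap;

/// In-memory index: LSN -> (MVCC version, commit time in ms).
/// ASSUMPTION: commit time is kept alongside the version for the time check.
pub struct RetentionIndex {
    entries: BTreeMap<u128, (u64, u64)>,
}

impl RetentionIndex {
    pub fn new() -> Self {
        RetentionIndex { entries: BTreeMap::new() }
    }

    pub fn insert(&mut self, lsn: u128, mvcc_version: u64, committed_at_ms: u64) {
        self.entries.insert(lsn, (mvcc_version, committed_at_ms));
    }

    /// Oldest LSN still indexed; a candidate input for the PD safe_point update.
    pub fn oldest(&self) -> Option<u128> {
        self.entries.keys().next().copied()
    }

    /// Evicts from the oldest end while entries are behind the watermark
    /// and older than the retention window. Returns the number evicted.
    pub fn evict(&mut self, watermark: u128, now_ms: u64, retention_ms: u64) -> usize {
        let cutoff_ms = now_ms.saturating_sub(retention_ms);
        let mut evicted = 0;
        loop {
            let (lsn, committed_ms) = match self.entries.iter().next() {
                Some((&lsn, &(_, ms))) => (lsn, ms),
                None => break,
            };
            if lsn >= watermark || committed_ms >= cutoff_ms {
                break;
            }
            self.entries.remove(&lsn);
            evicted += 1;
        }
        evicted
    }
}
```

After an eviction pass, oldest() gives the lowest LSN the gateway can still serve, which is the value the PD safe_point must not move past.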

Phase 4 — PITR backup agent (2 weeks)

  1. CLI kdb wal backup --pitr-dir /backups --segment-size 128m: connects as admin subscriber and writes segment files.
  2. CLI kdb wal restore --snapshot <path> --wal-dir /backups --to-time <RFC3339>.
  3. Segment rotation, CRC verification, retention GC.

Phase 5 — Observability (1 week)

Prometheus metrics:

  • kdb_wal_subscriber_lag_lsn{subscriber_id} — entries behind head.
  • kdb_wal_queue_depth{subscriber_id} — pending entries in send queue.
  • kdb_wal_bytes_sent_total — total bytes sent to all subscribers.
  • kdb_wal_retention_index_bytes — in-memory index size.

Alerting:

  • kdb_wal_subscriber_lag_lsn > 1_000_000 for > 5 min → warn (subscriber falling behind, may miss retention window).


8. Non-Goals (v1)

  • Logical replication (PostgreSQL-style row-level change events for SQL tables). The WAL stream is physical (raw KV operations). Logical CDC is a separate RFC.

  • Encryption of WAL stream in transit beyond TLS (mTLS is enforced via gRPC).
  • Fan-out to > 100 simultaneous subscribers per gateway (Phase 1–3 target: 10).
  • Cross-version WAL compatibility (format can break between major kdb-next versions).

9. Open Questions

  1. TiKV CDC API: TiKV's CDC observer API (ChangeDataEvent) exposes region-level change events. Mapping these to per-key WAL entries requires filtering by key prefix (tenant boundary). The current TiKV CDC API may not support per-prefix subscriptions; this needs investigation.

  2. Schema for OP_BATCH: should batch writes be flattened into individual OP_PUT/OP_DELETE entries by the server, or should the subscriber receive the batch as a single atomic unit? Atomic batches are important for relational table row updates that span multiple KV keys.

  3. Compaction interaction: when TiKV compacts MVCC versions, the original write LSNs disappear. For catch-up from a very old start_lsn, the gateway may need to synthesize WAL entries from the compacted snapshot + delta. Alternatively, a base snapshot is mandatory for any start_lsn older than wal_retention_minutes. Document which model kdb-next will use.

  4. Segment encryption for PITR files: WAL segments may contain sensitive user data (messages, tokens, documents). Segments at rest should be encrypted. What key management approach? Options: envelope encryption via Koder Keys (RFC-014), or a per-backup-job symmetric key stored in the segment header.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kdb-RFC-010-wal-streaming-replication.md