Kdb RFC 010 wal streaming replication
RFC010 — kdbnext: WAL Streaming Replication
| Field | Value |
|---|---|
| Status | *artial*— Postgres-compatible streaming shipped in pgwire (START_REPLICATION + slots + XLogData); gRPC WalStream + PITR CLI tracked as #525–#528. |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Audit | infra/data/kdb/docs/RFC-010-AUDIT.md |
| Related | RFC |
| Target module | infra/data/kdb/ |
1. Summary
kdbnext uses TiKV's Raft log for intrashard replication, but has no public-facing continuous WAL stream. This RFC defines a *AL streaming interface*that external consumers can subscribe to: replica nodes, incremental backup agents, CDC pipelines, and audit systems.
Design goals:
- *esumable* a client can reconnect at any previously seen LSN without data loss.
- *ow latency* entries are delivered within one RTT of Raft commit.
- *ackpressure-safe* clients that fall behind don't cause OOM on the server.
- *uthenticated* all subscriptions require a valid KDB Auth bearer token.
- *ITR
ready* the stream feeds into incremental backups for Pointin-Time Recovery.
2. Log Sequence Number (LSN)
*esolution per ADR
001 (20260426):*kdbnext keepslocal_lsn = u64for PostgreSQL pgwire compatibility, and exposes a separate *28-bitGlobalLsn**only on the gRPCWalStreamsurface and PITR backup segments* for cross-region ordering. The original draft below described a 128-bit replacement for the u64 LSN; that variant was rejected because it broke pgwire compat. Seeinfra/data/kdb/docs/adr/ADR-001-lsn-format.mdfor the full rationale.
2.1 Structure
A kdbnext bit value composed of:GlobalLsn is a 128
┌──────────────────────────┬──────────────┬──────────────────┐
│ global_ts (64 bits) │ region_id │ raft_index │
│ (PD timestamp, ms + logi│ (32 bits) │ (32 bits) │
└──────────────────────────┴──────────────┴──────────────────┘global_ts: PDissued timestamp (the same 64bit HLC used for MVCC),guarantees global ordering across regions.
region_id: TiKV region (shard) ID.raft_index: monotonically increasing index within the Raft log of this region.
String representation: {global_ts_hex}.{region_id}.{raft_index} — lexicographically sortable.
The pgwire replication path (START_REPLICATION, replication slots, pg_current_wal_lsn()) continues to use the local_lsn: u64 unchanged. The gRPC WalStream service (ticket #525) carries both.
2.2 LSN ordering
LSNs are totally ordered by (global_ts, region_id, raft_index). A WAL subscription with start_lsn = X receives all entries with LSN ≥ X.
The *lobal watermark*is the minimum LSN across all active subscriptions on a gateway node. The server retains WAL entries until the global watermark advances past them (bounded retention, see §6).
3. WAL Entry Format
Each WAL entry is a Protobuf message:
syntax = "proto3";
package kdb.wal.v1;
message WalEntry {
string lsn = 1; // "{global_ts_hex}.{region_id}.{raft_index}"
uint64 committed_at_ms = 2; // wall-clock commit time (UTC milliseconds)
string tenant_id = 3; // tenant identifier
string key_prefix = 4; // e.g. "sa-east-1:tenant:42:"
OpType op_type = 5;
bytes key = 6; // full key (without region prefix)
bytes value = 7; // absent for Delete operations
uint64 hlc_ts = 8; // Hybrid Logical Clock timestamp
bytes checksum = 9; // CRC32C of (key + value) for integrity
}
enum OpType {
OP_PUT = 0;
OP_DELETE = 1;
OP_BATCH = 2; // value contains a serialized BatchWrite proto
}
message WalBatch {
repeated WalEntry entries = 1;
string first_lsn = 2;
string last_lsn = 3;
Compression compression = 4;
}
enum Compression {
NONE = 0;
SNAPPY = 1; // default for batches > 4 KB
ZSTD = 2; // preferred for large batches (> 64 KB)
}The server may coalesce multiple entries into a WalBatch when:
- Subscriber is behind (catch-up mode): batches of up to 1000 entries.
- High write rate: entries committed within the same 10 ms window are batched.
- Compression kicks in automatically for batches > 4 KB.
4. Subscription Protocol
4.1 gRPC service definition
service WalStream {
// Subscribe streams WAL entries from start_lsn.
// The server streams entries until the client cancels or disconnects.
rpc Subscribe(SubscribeRequest) returns (stream WalResponse);
// Ack confirms the client has processed up to ack_lsn.
// The server uses this to advance the watermark and reclaim WAL storage.
// Called as a unary RPC after processing a batch (not per-entry).
rpc Ack(AckRequest) returns (AckResponse);
// GetLSN returns the current head LSN and the oldest retained LSN.
// Clients use this to determine whether their start_lsn is still available.
rpc GetLSN(GetLSNRequest) returns (GetLSNResponse);
}
message SubscribeRequest {
string start_lsn = 1; // start streaming from this LSN (inclusive)
string tenant_filter = 2; // empty = all tenants (requires admin token)
string key_prefix_filter = 3; // empty = all keys
uint32 batch_size_hint = 4; // suggested entries per batch (server may ignore)
}
message WalResponse {
oneof payload {
WalBatch batch = 1;
Heartbeat heartbeat = 2; // sent every 5s when idle
string error = 3; // terminal error (server closes stream after)
}
}
message Heartbeat {
string current_lsn = 1; // current head LSN (replica can verify lag)
}
message AckRequest { string ack_lsn = 1; }
message AckResponse { string watermark_lsn = 1; }
message GetLSNRequest {}
message GetLSNResponse {
string head_lsn = 1; // latest committed LSN
string oldest_lsn = 2; // oldest retained LSN (= global watermark - 1)
}4.2 Authentication
All gRPC calls require:
Authorization: Bearer <kdb-auth-jwt>The JWT must include one of:
role: "replication"— subscribe to a specific tenant's WAL.role: "admin"— subscribe to all tenants (used by backup agents).
Tokens are validated against the kdbauth JWKS endpoint (RFC001 §5).
4.3 Resumption after disconnect
On reconnect, the client replays SubscribeRequest with the last acked LSN as start_lsn. The server resumes from that LSN if it is within the retention window. If start_lsn is older than oldest_lsn, the server returns:
error: "lsn_not_available: start_lsn=<X> older than oldest_lsn=<Y>; "
"perform a base snapshot and restart from head_lsn=<Z>"4.4 Backpressure
The server uses a per-subscription send queue (default: 10,000 entries). If the queue is full for more than backpressure_timeout (default: 30 s), the server terminates the stream with:
error: "backpressure_timeout: subscriber too slow"The client should reconnect (potentially with a larger LSN skip if it doesn't need historical data) or increase its processing throughput.
5. PointinTime Recovery (PITR) Integration
5.1 Architecture
PITR uses two components:
- *ase snapshot* a consistent full backup of all TiKV key ranges at a
specific LSN. Created by the existing
tenant-shard-migratesnapshot mechanism (or a dedicatedkdb snapshotcommand). - *AL segment files* rolling WAL stream written to disk by a backup agent
(subscribes via
WalStream.Subscribewithrole: admin). Each segment covers a LSN range.
Recovery to time T:
- Find the most recent base snapshot before T.
- Replay WAL segments from
snapshot_lsnto T. - All entries with
committed_at_ms > Tare discarded.
5.2 WAL segment file format
Each segment file: kdb-wal-{first_lsn}-{last_lsn}.bin
[magic: 4 bytes "KWAL"]
[version: 2 bytes]
[first_lsn: varint]
[last_lsn: varint]
[entry_count: varint]
[entries: repeated length-prefixed WalEntry proto]
[footer_crc32: 4 bytes]Segments are rotated when:
- Size exceeds 128 MB.
- Time since first entry exceeds 1 hour.
- The backup agent receives a
HeartbeatWatermarkevent from the server.
5.3 Segment retention policy
Segment retention is configured per backup policy:
pitr_retention_days: 7— keep WAL segments for 7 days.- Older segments are pruned by the backup agent (runs GC every hour).
- Base snapshots older than
pitr_retention_days + buffer_daysare also pruned.
6. Server-Side WAL Retention
The server retains WAL entries for a minimum of wal_retention_minutes (default: 60 minutes) plus the lag of the slowest active subscriber. This bounds the memory footprint while allowing slow subscribers to catch up.
TiKV's compaction is gated: the kdb-gateway sets a safe_point in PD that prevents compaction of MVCC versions that are still within the WAL retention window. This is the same mechanism TiDB uses for its CDC/TiFlash streaming.
6.1 Memory bound estimate
At 100,000 writes/s with an average entry size of 256 bytes and a 60-minute window:
100,000 × 60 × 60 × 256 bytes ≈ 92 GBThis is too large for inmemory retention. The gateway maintains only an inmemory *ndex*(LSN → TiKV MVCC version), and re-reads the actual KV data from TiKV on demand. The index size:
100,000 × 3600 × 32 bytes (LSN + version) ≈ 11.5 GBFor 1 million writes/s, the index reaches ~115 GB — acceptable for a 256 GB gateway node. Above this scale, the index is sharded per Raft region.
7. Implementation Plan
Phase 1 — Protobuf schema + GetLSN (1 week)
- Add
kdb/wal/v1/wal.prototoinfra/data/kdb/proto/. - Implement
GetLSNRPC: reads current PD timestamp + min Raft index acrosslive regions. No streaming yet.
- Unit tests for LSN encoding/ordering.
Phase 2 — Subscribe stream (2–3 weeks)
- Implement
WalStream.SubscribegRPC server-side stream in Rust. - Per-subscription queue with backpressure enforcement.
- Catch-up mode: replay from TiKV MVCC scan for
start_lsn< current. - Live mode: hook into TiKV CDC (Change Data Capture) observer.
- Integration tests: subscriber receives all committed writes.
Phase 3 — Ack + watermark + retention (1 week)
AckRPC updates per-subscriber watermark.- Global watermark = min(all subscriber watermarks).
- GC job: evict in-memory index entries behind global watermark.
- Update safe_point in PD (gate TiKV MVCC compaction).
Phase 4 — PITR backup agent (2 weeks)
- CLI
kdb wal backup --pitr-dir /backups --segment-size 128m— connects asadmin subscriber and writes segment files.
- CLI
kdb wal restore --snapshot <path> --wal-dir /backups --to-time <RFC3339>. - Segment rotation, CRC verification, retention GC.
Phase 5 — Observability (1 week)
Prometheus metrics:
kdb_wal_subscriber_lag_lsn{subscriber_id}— entries behind head.kdb_wal_queue_depth{subscriber_id}— pending entries in send queue.kdb_wal_bytes_sent_total— total bytes sent to all subscribers.kdb_wal_retention_index_bytes— in-memory index size.
Alerting:
kdb_wal_subscriber_lag_lsn > 1_000_000for > 5 min → warn (subscriberfalling behind, may miss retention window).
8. Non-Goals (v1)
- Logical replication (PostgreSQL
style rowlevel change events for SQL tables).The WAL stream is physical (raw KV operations). Logical CDC is a separate RFC.
- Encryption of WAL stream in transit beyond TLS (mTLS is enforced via gRPC).
- Fan-out to > 100 simultaneous subscribers per gateway (Phase 1–3 target: 10).
- Cross
version WAL compatibility (format can break between major kdbnext versions).
9. Open Questions
- *iKV CDC API* TiKV's CDC observer API (
ChangeDataEvent) exposes region-levelchange events. Mapping these to per-key WAL entries requires filtering by key prefix (tenant boundary). The current TiKV CDC API may not support per-prefix subscriptions — needs investigation.
- *chema for
OP_BATCH* should batch writes be flattened into individualOP_PUT/OP_DELETEentries by the server, or should the subscriber receive the batch as a single atomic unit? Atomic batches are important for relational table row updates that span multiple KV keys.
- *ompaction interaction* when TiKV compacts MVCC versions, the original
write LSNs disappear. For catch-up from very old
start_lsn, the gateway may need to synthesize WAL entries from the compacted snapshot + delta. Alternatively, base snapshot is mandatory for anystart_lsnolder thanwal_retention_minutes. Document which model kdb-next will use.
- *egment encryption for PITR files* WAL segments may contain sensitive
user data (messages, tokens, documents). Segments at rest should be encrypted. What key management approach? Options: envelope encryption via Koder Keys (RFC
014), or a perbackup-job symmetric key stored in the segment header.