RFC-007: Timeseries-optimized storage
| Field | Value |
|---|---|
| Status | **Implemented (v1.1)**; v2.0 downsampling tiers outstanding (#518) |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026 |
| Target module | infra/data/kdb/crates/kdb-timeseries |
| Audit | infra/data/kdb/docs/RFC-007-AUDIT.md |
| Related | RFC-001 §9 (record layer); backlog #117; Koder Observe product |
1. Summary
kdb-next stores relational rows. Timeseries workloads (metrics, events, traces) have radically different access patterns: append-only high-frequency writes, range reads over time windows, and aggregation queries over millions of samples. This RFC defines a **chunk-based columnar storage model** on top of the existing KvCluster substrate that serves those patterns without a separate TSDB engine.
2. Motivation
Koder Observe (koderapm, koderwire) currently writes to InfluxDB and Prometheus remote storage. As kdb-next stabilises, consolidating onto a single storage engine:
- Eliminates an operational dependency (no separate InfluxDB cluster).
- Gives Observe the same multi-tenancy and auth story as the relational layer.
- Enables cross-metric JOIN against relational tables (e.g. join metrics on `user_id` to the relational `users` table) in the SQL layer.
3. Non-goals
- Full PromQL support: we expose equivalent scalar functions (`rate`, `time_bucket`, …) via the SQL layer; PromQL can be translated by the query rewriter.
- Column-oriented OLAP (Arrow, Parquet): the TSDB model here is time-primary, not column-primary.
- Multi-region replication of timeseries data: inherits whatever the substrate provides (TiKV replication).
4. Data model
4.1 Chunk
The unit of storage is a **chunk**: a compressed block covering a fixed-width time window for one (tenant, metric_name) series.
Default chunk width: **1 hour** (CHUNK_MS = 3_600_000). Configurable per-series up to 24h for low-frequency metrics.
A chunk stores two parallel columns:
- **timestamps**: `Vec<i64>` (unix milliseconds), delta-encoded.
- **values**: `Vec<f64>`, XOR-encoded (Gorilla-style).
Typical compressed size: 1–12 bytes per sample (Gorilla: ~1.37 bytes for real-world metrics; PoC uses prost varints: ~3–8 bytes/sample).
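The two column encodings can be illustrated with a minimal sketch. It assumes that "delta-encoded" means each timestamp is stored as the gap to its predecessor (the first relative to the chunk start), and it shows only the XOR step of the Gorilla-style value encoding, without the bit-packing that produces the actual size savings. The function names are hypothetical and are not taken from the kdb-timeseries crate.

```rust
/// Delta-encode timestamps: the first entry is the offset from
/// `chunk_start_ms`, each later entry is the gap to the previous sample.
/// (Assumed interpretation of "delta-encoded".)
fn delta_encode(chunk_start_ms: i64, timestamps: &[i64]) -> Vec<i64> {
    let mut prev = chunk_start_ms;
    timestamps
        .iter()
        .map(|&ts| {
            let delta = ts - prev;
            prev = ts;
            delta
        })
        .collect()
}

/// Inverse of `delta_encode`: a running sum starting at the chunk start.
fn delta_decode(chunk_start_ms: i64, deltas: &[i64]) -> Vec<i64> {
    let mut prev = chunk_start_ms;
    deltas
        .iter()
        .map(|&d| {
            prev += d;
            prev
        })
        .collect()
}

/// XOR each value's IEEE-754 bit pattern with its predecessor's.
/// Repeated values become 0, and slowly changing values share many
/// leading zero bits, which is what a bit-packer would then exploit.
fn xor_encode(values: &[f64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|v| {
            let bits = v.to_bits();
            let x = bits ^ prev;
            prev = bits;
            x
        })
        .collect()
}

/// Inverse of `xor_encode`: a running XOR restores the bit patterns.
fn xor_decode(xored: &[u64]) -> Vec<f64> {
    let mut prev = 0u64;
    xored
        .iter()
        .map(|&x| {
            prev ^= x;
            f64::from_bits(prev)
        })
        .collect()
}
```

For samples scraped at a steady interval, the deltas are small repeated integers and the XOR of equal consecutive values is zero, which is why both columns compress well.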
4.2 Keyspace layout
<tenant_id: u64 BE>
<table_id: u32 BE> ← the TIMESERIES table for this tenant
<index_id: u32 BE = 0> ← primary index
<metric_name: bytes>
<NUL: 0x00> ← separator (metric names must not contain NUL)
<chunk_start_ms: i64 BE> ← aligned to CHUNK_MS boundary

This layout allows:
- **Single-series range scan** (`prefix = tenant + table + metric_name + NUL` with `range = [chunk_start_1, chunk_start_2)`) in O(n_chunks) KV reads; this is the common query pattern.
- **All-series scan** (for cardinality queries) via a broader scan of the tenant + table prefix.
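As a rough illustration of the layout above, the following sketch assembles a chunk key byte by byte. `build_chunk_key` is the name used in the write path of this RFC, but the signature, the `Option` return, the `series_prefix` helper, and the validation rules are assumptions made for this example; it also assumes the default chunk width.

```rust
const CHUNK_MS: i64 = 3_600_000; // default chunk width from section 4.1

/// Prefix shared by every chunk of one series:
/// tenant_id | table_id | index_id (= 0) | metric_name | NUL
fn series_prefix(tenant_id: u64, table_id: u32, metric_name: &[u8]) -> Vec<u8> {
    let mut prefix = Vec::with_capacity(8 + 4 + 4 + metric_name.len() + 1);
    prefix.extend_from_slice(&tenant_id.to_be_bytes());
    prefix.extend_from_slice(&table_id.to_be_bytes());
    prefix.extend_from_slice(&0u32.to_be_bytes()); // primary index
    prefix.extend_from_slice(metric_name);
    prefix.push(0x00); // separator
    prefix
}

/// Full chunk key. Returns None if the metric name contains NUL or the
/// chunk start is not aligned to CHUNK_MS (hypothetical validation).
fn build_chunk_key(
    tenant_id: u64,
    table_id: u32,
    metric_name: &[u8],
    chunk_start_ms: i64,
) -> Option<Vec<u8>> {
    if metric_name.contains(&0) || chunk_start_ms.rem_euclid(CHUNK_MS) != 0 {
        return None;
    }
    let mut key = series_prefix(tenant_id, table_id, metric_name);
    key.extend_from_slice(&chunk_start_ms.to_be_bytes());
    Some(key)
}
```

Big-endian bytes of a non-negative i64 sort in numeric order, so the chunks of one series are adjacent and time-ordered in the keyspace. Negative timestamps would sort after positive ones under plain byte comparison; the layout above does not say how they are handled.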
4.3 Value encoding (PoC — prost varint)
message TimeChunkWire {
repeated int64 timestamps_delta_ms = 1; // deltas from chunk_start_ms
repeated double values = 2; // raw f64
int64 chunk_start_ms = 3;
uint32 sample_count = 4;
}

PoC uses prost varints. v1.0 replaces with Gorilla (delta-of-delta + bit-packing for timestamps; XOR run-length for values), a transparent wire-format upgrade behind the same key layout.
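To show why delta-of-delta suits regularly scraped metrics, here is a minimal sketch of that transform on its own. It is an illustration under assumptions, not the v1.0 encoder: the bit-packing stage is omitted, the function names are hypothetical, and the first delta is taken relative to the chunk start.

```rust
/// Delta-of-delta transform: store how much each gap differs from the
/// previous gap. The first gap is measured from `chunk_start_ms` and the
/// "previous gap" starts at 0 (assumed convention).
fn delta_of_delta_encode(chunk_start_ms: i64, timestamps: &[i64]) -> Vec<i64> {
    let mut prev_ts = chunk_start_ms;
    let mut prev_delta = 0i64;
    let mut out = Vec::with_capacity(timestamps.len());
    for &ts in timestamps {
        let delta = ts - prev_ts;
        out.push(delta - prev_delta);
        prev_delta = delta;
        prev_ts = ts;
    }
    out
}

/// Inverse transform: rebuild each gap, then each timestamp.
fn delta_of_delta_decode(chunk_start_ms: i64, encoded: &[i64]) -> Vec<i64> {
    let mut prev_ts = chunk_start_ms;
    let mut prev_delta = 0i64;
    let mut out = Vec::with_capacity(encoded.len());
    for &dod in encoded {
        let delta = prev_delta + dod;
        let ts = prev_ts + delta;
        out.push(ts);
        prev_delta = delta;
        prev_ts = ts;
    }
    out
}
```

With a steady 15 s scrape interval, every entry after the second is 0 or a few milliseconds of jitter, so a bit-packer can spend very few bits on most timestamps.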
5. Write path
append_sample(tenant, table_id, metric_name, ts_ms, value)
1. chunk_start = floor(ts_ms, CHUNK_MS)
2. key = build_chunk_key(tenant, table_id, metric_name, chunk_start)
3. existing = kv.get(key)
4. chunk = if existing { decode(existing) } else { empty }
5. chunk.push(ts_ms, value) ← keeps sorted order
6. kv.put(key, encode(chunk))

Step 6 is a non-transactional single-key put. Concurrent writers to the same chunk key use an optimistic retry loop (read-modify-write with TiKV optimistic transactions, gated by the same begin_tx API the record layer already uses).
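The six steps above can be sketched against an in-memory map. This is a single-threaded illustration under several assumptions: a `BTreeMap` stands in for KvCluster, the key is a tuple instead of the byte layout of section 4.2, and decoded chunks are stored directly so the encode/decode steps are no-ops. The small `scan_range` follows the read-path steps of section 6 and is included only to check the round trip.

```rust
use std::collections::BTreeMap;

const CHUNK_MS: i64 = 3_600_000;

#[derive(Clone, Debug, Default, PartialEq)]
struct Chunk {
    timestamps: Vec<i64>,
    values: Vec<f64>,
}

impl Chunk {
    /// Step 5: insert at the sorted position so out-of-order writes are fine.
    fn push(&mut self, ts_ms: i64, value: f64) {
        let at = self.timestamps.partition_point(|&t| t <= ts_ms);
        self.timestamps.insert(at, ts_ms);
        self.values.insert(at, value);
    }
}

/// (tenant, table_id, metric_name, chunk_start_ms): stand-in for the byte key.
type Key = (u64, u32, String, i64);
/// In-memory stand-in for the KV substrate (hypothetical).
type Kv = BTreeMap<Key, Chunk>;

/// Align a timestamp down to its chunk boundary.
fn chunk_floor(ts_ms: i64) -> i64 {
    ts_ms.div_euclid(CHUNK_MS) * CHUNK_MS
}

fn append_sample(kv: &mut Kv, tenant: u64, table_id: u32, metric_name: &str, ts_ms: i64, value: f64) {
    let key: Key = (tenant, table_id, metric_name.to_string(), chunk_floor(ts_ms)); // steps 1-2
    let mut chunk = kv.get(&key).cloned().unwrap_or_default(); // steps 3-4
    chunk.push(ts_ms, value); // step 5
    kv.insert(key, chunk); // step 6
}

fn scan_range(kv: &Kv, tenant: u64, table_id: u32, metric_name: &str, from_ms: i64, to_ms: i64) -> Vec<(i64, f64)> {
    if from_ms > to_ms {
        return Vec::new();
    }
    let lo: Key = (tenant, table_id, metric_name.to_string(), chunk_floor(from_ms));
    let hi: Key = (tenant, table_id, metric_name.to_string(), chunk_floor(to_ms));
    let mut out = Vec::new();
    for chunk in kv.range(lo..=hi).map(|(_, c)| c) {
        for (&ts, &v) in chunk.timestamps.iter().zip(&chunk.values) {
            // Keep only samples inside the inclusive window [from_ms, to_ms].
            if (from_ms..=to_ms).contains(&ts) {
                out.push((ts, v));
            }
        }
    }
    out
}
```

A real implementation would wrap steps 3 to 6 in the optimistic transaction described above and retry on conflict; the sketch leaves that out because a `&mut` borrow already rules out concurrent writers.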
6. Read path
scan_range(tenant, table_id, metric_name, from_ms, to_ms) → Vec<Sample>
1. chunk_start_0 = floor(from_ms, CHUNK_MS)
2. chunk_start_N = floor(to_ms, CHUNK_MS)
3. keys = KvCluster::scan(prefix_range(metric, chunk_start_0, chunk_start_N+1))
4. for each chunk: decode → filter samples in [from_ms, to_ms]
5. return flat, sorted Vec<(ts_ms, value)>

7. Query primitives
Implemented as pure Rust functions over Vec<Sample> — no query planner dependency in this crate; the SQL planner calls these after a scan_range.
| Function | Input | Output |
|---|---|---|
| `time_bucket(samples, bucket_ms)` | sorted samples | Vec<(bucket_ts, avg_value)> |
| `rate(samples, window_ms)` | sorted samples (counter) | Vec<(ts, rate/sec)> |
| `increase(samples, from_ms, to_ms)` | sorted samples | f64 (total increase) |
| `irate(samples)` | sorted samples | Vec<(ts, instant_rate/sec)> |
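A minimal sketch of three of these primitives follows, written from the signatures in the table alone. The semantics are assumptions: `time_bucket` averages per aligned bucket, `irate` emits one per-second rate for each consecutive pair of samples, `increase` is the last value minus the first inside the inclusive window, and counter resets are not handled. `rate` is omitted because its windowing behaviour is not specified here.

```rust
type Sample = (i64, f64); // (ts_ms, value)

/// Average value per bucket; `samples` must be sorted by timestamp.
fn time_bucket(samples: &[Sample], bucket_ms: i64) -> Vec<(i64, f64)> {
    assert!(bucket_ms > 0, "bucket width must be positive");
    let mut sums: Vec<(i64, f64, u32)> = Vec::new(); // (bucket_ts, sum, count)
    for &(ts, v) in samples {
        let bucket_ts = ts.div_euclid(bucket_ms) * bucket_ms;
        let same_bucket = sums.last().map_or(false, |&(bt, _, _)| bt == bucket_ts);
        if same_bucket {
            let last = sums.last_mut().unwrap();
            last.1 += v;
            last.2 += 1;
        } else {
            sums.push((bucket_ts, v, 1));
        }
    }
    sums.into_iter()
        .map(|(bt, sum, n)| (bt, sum / n as f64))
        .collect()
}

/// Per-second rate between each pair of consecutive samples.
/// Pairs with a non-increasing timestamp are skipped.
fn irate(samples: &[Sample]) -> Vec<(i64, f64)> {
    samples
        .windows(2)
        .filter(|w| w[1].0 > w[0].0)
        .map(|w| {
            let dt_s = (w[1].0 - w[0].0) as f64 / 1000.0;
            (w[1].0, (w[1].1 - w[0].1) / dt_s)
        })
        .collect()
}

/// Last minus first value inside [from_ms, to_ms]; 0.0 if the window is empty.
fn increase(samples: &[Sample], from_ms: i64, to_ms: i64) -> f64 {
    let in_window: Vec<&Sample> = samples
        .iter()
        .filter(|s| s.0 >= from_ms && s.0 <= to_ms)
        .collect();
    match (in_window.first(), in_window.last()) {
        (Some(first), Some(last)) => last.1 - first.1,
        _ => 0.0,
    }
}
```

Because the functions take a plain slice of samples, they can be applied directly to the output of a `scan_range` call without any planner involvement.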
8. Retention / TTL
v1: manual TTL via kdbctl ts compact --older-than <duration>. Background compaction from raw → downsampled tiers is planned for v2 (out of scope for the PoC).
9. Implementation plan
| Phase | Scope | Ticket |
|---|---|---|
| PoC (this RFC) | Chunk key layout, prost encoding, append_sample, scan_range, 4 query functions, tests | #117 |
| v1.0 | Gorilla compression, retention TTL, cardinality index, kdbctl ts subcommand | #128 |
| v1.1 | Prometheus remote_write HTTP endpoint, OTLP metrics ingest | #129 |
| v2.0 | Downsampling background compaction (raw → 5m → 1h tiers) | #130 |
10. Decision log
| Date | Decision | Notes |
|---|---|---|
| 2026 | Drafted | Claude scribe; awaiting Rodrigo review |