Kdb RFC 007 timeseries storage

RFC-007: Time-series optimized storage

Field	Value
Status	Implemented (v1.1); v2.0 downsampling tiers outstanding (#518)
Author(s)	Rodrigo (with Claude as scribe)
Date	2026-04-15 (drafted) · 2026-04-26 (audit)
Target module	`infra/data/kdb/crates/kdb-timeseries`
Audit	`infra/data/kdb/docs/RFC-007-AUDIT.md`
Related	RFC-001 §9 (record layer); backlog #117; Koder Observe product

1. Summary

kdb-next stores relational rows. Time-series workloads (metrics, events, traces) have radically different access patterns: append-only high-frequency writes, range reads over time windows, and aggregation queries over millions of samples. This RFC defines a *chunk-based columnar storage model* on top of the existing KvCluster substrate that serves those patterns without a separate TSDB engine.

2. Motivation

Koder Observe (koder-apm, koder-wire) currently writes to InfluxDB and Prometheus remote storage. As kdb-next stabilises, consolidating onto a single storage engine:

Eliminates an operational dependency (no separate InfluxDB cluster).
Gives Observe the same multi-tenancy and auth story as the relational layer.
Enables cross-metric JOINs against relational tables (e.g. joining metrics on user_id to a relational users table) in the SQL layer.

3. Non-goals

Full PromQL support: we expose equivalent scalar functions (rate, time_bucket, …) via the SQL layer; PromQL can be translated by the query rewriter.
Column-oriented OLAP (Arrow, Parquet): the TSDB model here is time-primary, not column-primary.
Multi-region replication of time-series data: inherits whatever the substrate provides (TiKV replication).

4. Data model

4.1 Chunk

The unit of storage is a *chunk*: a compressed block covering a fixed-width time window for one (tenant, metric_name) series.

Default chunk width: *1 hour* (CHUNK_MS = 3_600_000). Configurable per-series up to 24h for low-frequency metrics.

A chunk stores two parallel columns:

*Timestamps*: Vec<i64> (unix milliseconds), delta-encoded.
*Values*: Vec<f64>, XOR-encoded (Gorilla-style).

Typical compressed size: 1–12 bytes per sample (Gorilla: ~1.37 bytes for real-world metrics; PoC uses prost varints: ~3–8 bytes/sample).
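The fixed-width window can be illustrated with a small sketch. The helper names (`chunk_start`, `chunk_window`, `valid_chunk_width`) and the `MAX_CHUNK_MS` constant are hypothetical, and the rule that any positive width up to 24h is acceptable is an assumption; the RFC only states the default and the upper bound.

```rust
/// Default chunk width from section 4.1: one hour in milliseconds.
pub const CHUNK_MS: i64 = 3_600_000;

/// Assumed upper bound on a per-series chunk width (24h).
pub const MAX_CHUNK_MS: i64 = 24 * CHUNK_MS;

/// Align a timestamp down to the start of its chunk.
/// `div_euclid` rounds toward negative infinity, so pre-epoch
/// timestamps also land on the boundary at or before them.
pub fn chunk_start(ts_ms: i64, chunk_ms: i64) -> i64 {
    ts_ms.div_euclid(chunk_ms) * chunk_ms
}

/// Half-open window [start, end) covered by the chunk holding `ts_ms`.
pub fn chunk_window(ts_ms: i64, chunk_ms: i64) -> (i64, i64) {
    let start = chunk_start(ts_ms, chunk_ms);
    (start, start + chunk_ms)
}

/// Assumption: any positive width up to 24h is a valid per-series setting.
pub fn valid_chunk_width(chunk_ms: i64) -> bool {
    chunk_ms > 0 && chunk_ms <= MAX_CHUNK_MS
}
```

With the default width, a sample at 7_199_999 ms falls in the window [3_600_000, 7_200_000), so all samples from the same hour share one chunk.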

4.2 Keyspace layout

<tenant_id: u64 BE>
<table_id: u32 BE>           ← the TIMESERIES table for this tenant
<index_id: u32 BE = 0>       ← primary index
<metric_name: bytes>
<NUL: 0x00>                  ← separator (metric names must not contain NUL)
<chunk_start_ms: i64 BE>     ← aligned to CHUNK_MS boundary

This layout allows:

*Single-series range scan* (prefix = tenant + table + metric_name + NUL with range = [chunk_start_1, chunk_start_2)) in O(n_chunks) KV reads. This is the common query pattern.
*All-series scan* (for cardinality queries) via a broader scan of the tenant + table prefix.
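The layout above can be sketched as a key builder. `build_chunk_key` is named in the write path; its signature here and the `series_prefix` helper are assumptions made for illustration, not the crate's actual API.

```rust
/// Bytes shared by every chunk of one series:
/// tenant_id (u64 BE) + table_id (u32 BE) + index_id (u32 BE, 0) + metric_name + NUL.
/// Returns None if the metric name contains NUL, which the layout forbids.
pub fn series_prefix(tenant_id: u64, table_id: u32, metric_name: &[u8]) -> Option<Vec<u8>> {
    if metric_name.contains(&0x00) {
        return None;
    }
    let mut key = Vec::with_capacity(8 + 4 + 4 + metric_name.len() + 1 + 8);
    key.extend_from_slice(&tenant_id.to_be_bytes());
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(&0u32.to_be_bytes()); // index_id = 0 (primary index)
    key.extend_from_slice(metric_name);
    key.push(0x00); // separator
    Some(key)
}

/// Full chunk key: series prefix followed by chunk_start_ms as i64 BE.
pub fn build_chunk_key(
    tenant_id: u64,
    table_id: u32,
    metric_name: &[u8],
    chunk_start_ms: i64,
) -> Option<Vec<u8>> {
    let mut key = series_prefix(tenant_id, table_id, metric_name)?;
    key.extend_from_slice(&chunk_start_ms.to_be_bytes());
    Some(key)
}
```

Big-endian encoding makes byte order match time order for non-negative chunk starts, which is what lets a range scan return chunks chronologically. Negative (pre-epoch) values in two's complement would sort after positive ones; the RFC does not say how those are handled, so this sketch leaves them as-is.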

4.3 Value encoding (PoC — prost varint)

message TimeChunkWire {
  repeated int64  timestamps_delta_ms = 1;  // deltas from chunk_start_ms
  repeated double values              = 2;  // raw f64
  int64           chunk_start_ms      = 3;
  uint32          sample_count        = 4;
}

PoC uses prost varints. v1.0 replaces with Gorilla (delta-of-delta + bit-packing for timestamps; XOR run-length for values), a transparent wire-format upgrade behind the same key layout.
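To show why the Gorilla-style transforms compress well, here is a minimal sketch of the two arithmetic steps only: delta-of-delta on timestamps and XOR on the bit patterns of consecutive values. It omits the bit-packing stage entirely, and all function names are hypothetical rather than taken from the crate.

```rust
/// Delta-of-delta transform: the first entry is the raw timestamp,
/// the second is the first delta, later entries are changes in the delta.
pub fn delta_of_delta(timestamps: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(timestamps.len());
    let mut prev_delta = 0i64;
    for (i, &ts) in timestamps.iter().enumerate() {
        if i == 0 {
            out.push(ts);
        } else {
            let delta = ts - timestamps[i - 1];
            out.push(delta - prev_delta);
            prev_delta = delta;
        }
    }
    out
}

/// Inverse of `delta_of_delta`.
pub fn undo_delta_of_delta(encoded: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(encoded.len());
    let mut prev_ts = 0i64;
    let mut prev_delta = 0i64;
    for (i, &e) in encoded.iter().enumerate() {
        if i == 0 {
            prev_ts = e;
        } else {
            prev_delta += e;
            prev_ts += prev_delta;
        }
        out.push(prev_ts);
    }
    out
}

/// XOR each value's IEEE 754 bit pattern with the previous one.
/// Repeated values become 0, which a bit-packer can store very cheaply.
pub fn xor_encode(values: &[f64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|v| {
            let bits = v.to_bits();
            let x = bits ^ prev;
            prev = bits;
            x
        })
        .collect()
}

/// Inverse of `xor_encode`.
pub fn xor_decode(xors: &[u64]) -> Vec<f64> {
    let mut prev = 0u64;
    xors.iter()
        .map(|&x| {
            prev ^= x;
            f64::from_bits(prev)
        })
        .collect()
}
```

For samples scraped at a steady interval, almost every delta-of-delta is 0, and slowly changing values produce XOR words with long runs of zero bits. Those zeros are what the bit-packing stage would exploit.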

5. Write path

append_sample(tenant, table_id, metric_name, ts_ms, value)
  1. chunk_start = floor(ts_ms, CHUNK_MS)
  2. key = build_chunk_key(tenant, table_id, metric_name, chunk_start)
  3. existing = kv.get(key)
  4. chunk = if existing { decode(existing) } else { empty }
  5. chunk.push(ts_ms, value)          ← keeps sorted order
  6. kv.put(key, encode(chunk))

Step 6 is a non-transactional single-key put. Concurrent writers to the same chunk key use an optimistic retry loop (read-modify-write with TiKV optimistic transactions, gated by the same begin_tx API the record layer already uses).
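The read-modify-write loop can be sketched against an in-memory stand-in. `MemKv`, its per-key version counter, and `put_if_version` are hypothetical and only imitate the conflict check an optimistic transaction would perform; they are not the KvCluster or begin_tx API. Chunks are kept decoded and keyed by (metric, chunk_start) to keep the example short.

```rust
use std::collections::BTreeMap;

pub const CHUNK_MS: i64 = 3_600_000;

/// One decoded chunk; samples stay sorted by timestamp.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct Chunk {
    pub samples: Vec<(i64, f64)>,
}

impl Chunk {
    /// Step 5: insert while preserving timestamp order.
    pub fn push(&mut self, ts_ms: i64, value: f64) {
        let idx = self.samples.partition_point(|&(t, _)| t <= ts_ms);
        self.samples.insert(idx, (ts_ms, value));
    }
}

/// Hypothetical KV stand-in: every key carries a version number.
#[derive(Default)]
pub struct MemKv {
    data: BTreeMap<(String, i64), (u64, Chunk)>,
}

impl MemKv {
    pub fn get(&self, key: &(String, i64)) -> Option<(u64, Chunk)> {
        self.data.get(key).cloned()
    }

    /// Write only if the stored version still equals `expected`
    /// (0 means "key must not exist yet"). Returns false on conflict.
    pub fn put_if_version(&mut self, key: (String, i64), expected: u64, chunk: Chunk) -> bool {
        let current = self.data.get(&key).map(|(v, _)| *v).unwrap_or(0);
        if current != expected {
            return false;
        }
        self.data.insert(key, (expected + 1, chunk));
        true
    }
}

/// Steps 1 to 6 of the write path, retried when the version check fails.
pub fn append_sample(
    kv: &mut MemKv,
    metric: &str,
    ts_ms: i64,
    value: f64,
    max_retries: usize,
) -> Result<(), &'static str> {
    let chunk_start = ts_ms.div_euclid(CHUNK_MS) * CHUNK_MS;
    let key = (metric.to_string(), chunk_start);
    for _ in 0..=max_retries {
        let (version, mut chunk) = kv.get(&key).unwrap_or((0, Chunk::default()));
        chunk.push(ts_ms, value);
        if kv.put_if_version(key.clone(), version, chunk) {
            return Ok(());
        }
    }
    Err("too many write conflicts")
}
```

Because this stand-in is single-threaded, the retry branch never actually fires; it is included only to show the shape of the loop. Against a real cluster, the version check would be replaced by the transaction's commit result.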

6. Read path

scan_range(tenant, table_id, metric_name, from_ms, to_ms) → Vec<Sample>
  1. chunk_start_0 = floor(from_ms, CHUNK_MS)
  2. chunk_start_N = floor(to_ms,   CHUNK_MS)
  3. keys = KvCluster::scan(prefix_range(metric, chunk_start_0, chunk_start_N+1))
  4. for each chunk: decode → filter samples in [from_ms, to_ms]
  5. return flat, sorted Vec<(ts_ms, value)>
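The read path can be sketched the same way. The `ChunkStore` alias is a hypothetical in-memory substitute for `KvCluster::scan`, holding decoded chunks keyed by (metric, chunk_start); both bounds of the query are treated as inclusive, matching step 4.

```rust
use std::collections::BTreeMap;

pub const CHUNK_MS: i64 = 3_600_000;

pub type Sample = (i64, f64);

/// Hypothetical stand-in for the KV substrate: decoded chunks in key order.
pub type ChunkStore = BTreeMap<(String, i64), Vec<Sample>>;

/// Return all samples of `metric` with from_ms <= ts <= to_ms, sorted by time.
pub fn scan_range(store: &ChunkStore, metric: &str, from_ms: i64, to_ms: i64) -> Vec<Sample> {
    if from_ms > to_ms {
        return Vec::new();
    }
    // Steps 1 and 2: align both bounds down to chunk boundaries.
    let first = from_ms.div_euclid(CHUNK_MS) * CHUNK_MS;
    let last = to_ms.div_euclid(CHUNK_MS) * CHUNK_MS;
    let lo = (metric.to_string(), first);
    let hi = (metric.to_string(), last);
    // Step 3: ordered scan over the chunks of this one series.
    // Steps 4 and 5: flatten, then drop samples outside the requested window.
    store
        .range(lo..=hi)
        .flat_map(|(_, samples)| samples.iter().copied())
        .filter(|&(ts, _)| ts >= from_ms && ts <= to_ms)
        .collect()
}
```

Only the first and last chunks can contain samples outside the window, so a tuned implementation could skip the filter for interior chunks. The sketch filters everything for simplicity.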

7. Query primitives

Implemented as pure Rust functions over Vec<Sample> — no query planner dependency in this crate; the SQL planner calls these after a scan_range.

Function	Input	Output
`time_bucket(samples, bucket_ms)`	sorted samples	`Vec<(bucket_ts, avg_value)>`
`rate(samples, window_ms)`	sorted samples (counter)	`Vec<(ts, rate/sec)>`
`increase(samples, from_ms, to_ms)`	sorted samples	`f64` (total increase)
`irate(samples)`	sorted samples	`Vec<(ts, instant_rate/sec)>`
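A minimal sketch of three of these primitives follows. The signatures use slices instead of Vec<Sample>, and two behaviours are assumptions not stated in this RFC: buckets are aligned to multiples of bucket_ms, and a drop in a counter is treated as a reset to zero (in the style of Prometheus). `rate` is left out because its window semantics are not defined here.

```rust
pub type Sample = (i64, f64);

/// Average samples into fixed-width buckets aligned to multiples of `bucket_ms`.
/// Input must be sorted by timestamp; empty buckets are not emitted.
pub fn time_bucket(samples: &[Sample], bucket_ms: i64) -> Vec<(i64, f64)> {
    let mut out = Vec::new();
    if bucket_ms <= 0 {
        return out;
    }
    let mut current: Option<i64> = None;
    let mut sum = 0.0;
    let mut count = 0usize;
    for &(ts, v) in samples {
        let bucket = ts.div_euclid(bucket_ms) * bucket_ms;
        if current != Some(bucket) {
            // Flush the finished bucket before starting a new one.
            if let Some(b) = current {
                out.push((b, sum / count as f64));
            }
            current = Some(bucket);
            sum = 0.0;
            count = 0;
        }
        sum += v;
        count += 1;
    }
    if let Some(b) = current {
        out.push((b, sum / count as f64));
    }
    out
}

/// Assumed reset rule: if the counter went down, it restarted from zero.
fn counter_delta(prev: f64, cur: f64) -> f64 {
    if cur >= prev { cur - prev } else { cur }
}

/// Total counter increase over samples with from_ms <= ts <= to_ms.
pub fn increase(samples: &[Sample], from_ms: i64, to_ms: i64) -> f64 {
    let in_window: Vec<f64> = samples
        .iter()
        .filter(|&&(ts, _)| ts >= from_ms && ts <= to_ms)
        .map(|&(_, v)| v)
        .collect();
    in_window.windows(2).map(|w| counter_delta(w[0], w[1])).sum()
}

/// Per-second rate between each pair of adjacent samples,
/// reported at the timestamp of the later sample.
pub fn irate(samples: &[Sample]) -> Vec<(i64, f64)> {
    samples
        .windows(2)
        .filter_map(|w| {
            let (t0, v0) = w[0];
            let (t1, v1) = w[1];
            let dt_ms = t1 - t0;
            if dt_ms <= 0 {
                return None; // skip duplicate or out-of-order timestamps
            }
            Some((t1, counter_delta(v0, v1) / (dt_ms as f64 / 1000.0)))
        })
        .collect()
}
```

Keeping these as pure functions over already-scanned samples means they can be unit-tested without a cluster, which fits the stated goal of no planner dependency in this crate.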

8. Retention / TTL

v1: manual TTL via kdbctl ts compact --older-than <duration>. Background compaction from raw → downsampled tiers is planned for v2 (out of scope for the PoC).

9. Implementation plan

Phase	Scope	Ticket
PoC (this RFC)	Chunk key layout, prost encoding, `append_sample`, `scan_range`, 4 query functions, tests	#117
v1.0	Gorilla compression, retention TTL, cardinality index, `kdbctl ts` subcommand	#128
v1.1	Prometheus `remote_write` HTTP endpoint, OTLP metrics ingest	#129
v2.0	Downsampling background compaction (raw → 5m → 1h tiers)	#130

10. Decision log

Date	Decision	Notes
2026-04-15	Drafted	Claude scribe; awaiting Rodrigo review