Kdb RFC 007 timeseries storage

RFC-007: Timeseries-optimized storage

| Field | Value |
|---|---|
| Status | Implemented (v1.1); v2.0 downsampling tiers outstanding (#518) |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-15 (drafted) · 2026-04-26 (audit) |
| Target module | infra/data/kdb/crates/kdb-timeseries |
| Audit | infra/data/kdb/docs/RFC-007-AUDIT.md |
| Related | RFC-001 §9 (record layer); backlog #117; Koder Observe product |

1. Summary

kdb-next stores relational rows. Timeseries workloads (metrics, events, traces) have radically different access patterns: append-only high-frequency writes, range reads over time windows, and aggregation queries over millions of samples. This RFC defines a chunk-based columnar storage model on top of the existing KvCluster substrate that serves those patterns without a separate TSDB engine.

2. Motivation

Koder Observe (koderapm, koderwire) currently writes to InfluxDB and Prometheus remote storage. As kdb-next stabilises, consolidating onto a single storage engine:

  • Eliminates an operational dependency (no separate InfluxDB cluster).
  • Gives Observe the same multi-tenancy and auth story as the relational layer.
  • Enables cross-metric JOIN against relational tables (e.g. join metrics on user_id to the relational users table) in the SQL layer.

3. Non-goals

  • Full PromQL support: we expose equivalent scalar functions (rate, time_bucket, …) via the SQL layer; PromQL can be translated by the query rewriter.
  • Column-oriented OLAP (Arrow, Parquet): the TSDB model here is time-primary, not column-primary.
  • Multi-region replication of timeseries data: this inherits whatever the substrate provides (TiKV replication).

4. Data model

4.1 Chunk

The unit of storage is a chunk: a compressed block covering a fixed-width time window for one (tenant, metric_name) series.

Default chunk width: 1 hour (CHUNK_MS = 3_600_000). Configurable per series up to 24 h for low-frequency metrics.

A chunk stores two parallel columns:

  • Timestamps: Vec<i64> (unix milliseconds), delta-encoded.
  • Values: Vec<f64>, XOR-encoded (Gorilla-style).

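Typical encoded size: 1–12 bytes per sample. Gorilla averages ~1.37 bytes per sample on real-world metrics. The PoC's prost encoding stores each value as a fixed 8-byte double plus a 1–4 byte varint offset, so it lands at roughly 9–12 bytes per sample.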
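To make the two column encodings concrete, here is a minimal sketch. It stores timestamps as offsets from the chunk start (as the wire format in 4.3 does) and XORs each value's bit pattern with its predecessor. The function names are hypothetical and not taken from kdb-timeseries. The bit-packing stage that gives Gorilla its compactness is left out.

```rust
/// Encode timestamps as offsets from the chunk start.
/// Hypothetical helper illustrating the delta-encoded timestamp column.
fn delta_encode(chunk_start_ms: i64, timestamps: &[i64]) -> Vec<i64> {
    timestamps.iter().map(|ts| ts - chunk_start_ms).collect()
}

/// Inverse of `delta_encode`.
fn delta_decode(chunk_start_ms: i64, deltas: &[i64]) -> Vec<i64> {
    deltas.iter().map(|d| chunk_start_ms + d).collect()
}

/// XOR each value's IEEE-754 bit pattern with the previous one.
/// Repeated or slowly changing values produce words that are zero or
/// mostly zero bits, which a later bit-packing stage can exploit.
fn xor_encode(values: &[f64]) -> Vec<u64> {
    let mut prev = 0u64;
    values
        .iter()
        .map(|v| {
            let bits = v.to_bits();
            let word = bits ^ prev;
            prev = bits;
            word
        })
        .collect()
}

/// Inverse of `xor_encode`: a running XOR restores the original bits.
fn xor_decode(words: &[u64]) -> Vec<f64> {
    let mut prev = 0u64;
    words
        .iter()
        .map(|w| {
            prev ^= w;
            f64::from_bits(prev)
        })
        .collect()
}
```

A value that repeats its predecessor encodes to the word 0, which is the case Gorilla-style encoders represent with a single bit.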

4.2 Keyspace layout

<tenant_id: u64 BE>
<table_id: u32 BE>           ← the TIMESERIES table for this tenant
<index_id: u32 BE = 0>       ← primary index
<metric_name: bytes>
<NUL: 0x00>                  ← separator (metric names must not contain NUL)
<chunk_start_ms: i64 BE>     ← aligned to CHUNK_MS boundary

This layout allows:

  • Single-series range scan (prefix = tenant + table + metric_name + NUL with range = [chunk_start_1, chunk_start_2)) in O(n_chunks) KV reads; this is the common query pattern.
  • All-series scan (for cardinality queries) via a broader scan of the tenant + table prefix.
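The key layout can be sketched as a pair of byte-building helpers. `build_chunk_key` takes the arguments used in the write path. `series_prefix` is a hypothetical helper added here for illustration. Neither signature is confirmed against the crate.

```rust
/// Build the per-series prefix: tenant, table, index_id = 0, metric name, NUL.
/// Returns None when the metric name contains a NUL byte, which the layout forbids.
/// Hypothetical helper, not taken from kdb-timeseries.
fn series_prefix(tenant_id: u64, table_id: u32, metric_name: &[u8]) -> Option<Vec<u8>> {
    if metric_name.contains(&0) {
        return None;
    }
    let mut key = Vec::with_capacity(8 + 4 + 4 + metric_name.len() + 1);
    key.extend_from_slice(&tenant_id.to_be_bytes());
    key.extend_from_slice(&table_id.to_be_bytes());
    key.extend_from_slice(&0u32.to_be_bytes()); // index_id = 0 (primary index)
    key.extend_from_slice(metric_name);
    key.push(0x00); // separator
    Some(key)
}

/// Append the big-endian chunk start to the series prefix.
/// Big-endian bytes sort in numeric order only for non-negative values;
/// a negative i64 would sort after positive ones, so this sketch assumes
/// chunk_start_ms >= 0.
fn build_chunk_key(
    tenant_id: u64,
    table_id: u32,
    metric_name: &[u8],
    chunk_start_ms: i64,
) -> Option<Vec<u8>> {
    let mut key = series_prefix(tenant_id, table_id, metric_name)?;
    key.extend_from_slice(&chunk_start_ms.to_be_bytes());
    Some(key)
}
```

Because all chunk keys of one series share the same prefix and end in a big-endian timestamp, a lexicographic range scan over the keys visits chunks in time order.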

4.3 Value encoding (PoC — prost varint)

message TimeChunkWire {
  repeated int64  timestamps_delta_ms = 1;  // deltas from chunk_start_ms
  repeated double values              = 2;  // raw f64
  int64           chunk_start_ms      = 3;
  uint32          sample_count        = 4;
}

The PoC uses prost varints. v1.0 replaces this with Gorilla encoding (delta-of-delta plus bit-packing for timestamps; XOR run-length for values), a transparent wire-format upgrade behind the same key layout.
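The delta-of-delta step can be illustrated in isolation. The sketch below covers only the integer transform, with hypothetical function names. The variable-length bit-packing that Gorilla applies to the result is omitted.

```rust
/// Delta-of-delta transform over sorted timestamps.
/// Output: [first timestamp, first delta, then the change in delta per sample].
/// Hypothetical helper; the bit-packing stage is not shown.
fn dod_encode(ts: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(ts.len());
    for i in 0..ts.len() {
        let v = match i {
            0 => ts[0],
            1 => ts[1] - ts[0],
            _ => (ts[i] - ts[i - 1]) - (ts[i - 1] - ts[i - 2]),
        };
        out.push(v);
    }
    out
}

/// Inverse transform: accumulate the delta, then the timestamp.
fn dod_decode(enc: &[i64]) -> Vec<i64> {
    let mut out: Vec<i64> = Vec::with_capacity(enc.len());
    let mut delta = 0i64;
    for (i, &v) in enc.iter().enumerate() {
        let next = match i {
            0 => v,
            1 => {
                delta = v;
                out[0] + delta
            }
            _ => {
                delta += v;
                out[i - 1] + delta
            }
        };
        out.push(next);
    }
    out
}
```

For a series scraped at a fixed interval, every entry after the second is zero or close to it, which is why the packed form is so small.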

5. Write path

append_sample(tenant, table_id, metric_name, ts_ms, value)
  1. chunk_start = floor(ts_ms, CHUNK_MS)
  2. key = build_chunk_key(tenant, table_id, metric_name, chunk_start)
  3. existing = kv.get(key)
  4. chunk = if existing { decode(existing) } else { empty }
  5. chunk.push(ts_ms, value)          ← keeps sorted order
  6. kv.put(key, encode(chunk))

Taken alone, step 6 is a non-transactional single-key put, so two writers appending to the same chunk key could overwrite each other's samples. Concurrent writers therefore run steps 3–6 as a read-modify-write inside an optimistic retry loop, using TiKV optimistic transactions gated by the same begin_tx API the record layer already uses.
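The read-modify-write loop can be sketched against a toy in-memory store. `VersionedKv` and its version check are assumptions standing in for the KV substrate. They are not the KvCluster or begin_tx API. Chunks are kept decoded here to keep the example short.

```rust
use std::collections::HashMap;

const CHUNK_MS: i64 = 3_600_000;

/// Toy versioned store (assumption): each key holds (version, decoded chunk).
#[derive(Default)]
struct VersionedKv {
    data: HashMap<Vec<u8>, (u64, Vec<(i64, f64)>)>,
}

impl VersionedKv {
    /// Read the current version and chunk; a missing key reads as version 0, empty.
    fn get(&self, key: &[u8]) -> (u64, Vec<(i64, f64)>) {
        self.data.get(key).cloned().unwrap_or((0, Vec::new()))
    }

    /// Commit only if no other writer has bumped the version since the read.
    fn put_if_version(&mut self, key: &[u8], read_version: u64, chunk: Vec<(i64, f64)>) -> bool {
        let current = self.data.get(key).map(|(v, _)| *v).unwrap_or(0);
        if current != read_version {
            return false;
        }
        self.data.insert(key.to_vec(), (current + 1, chunk));
        true
    }
}

/// Align a timestamp down to its chunk boundary (floor, also for negatives).
fn chunk_start(ts_ms: i64) -> i64 {
    ts_ms.div_euclid(CHUNK_MS) * CHUNK_MS
}

/// Append one sample, retrying the read-modify-write on version conflicts.
/// `series` is the already-built key prefix for the series.
fn append_sample(kv: &mut VersionedKv, series: &[u8], ts_ms: i64, value: f64, max_retries: u32) -> bool {
    let mut key = series.to_vec();
    key.extend_from_slice(&chunk_start(ts_ms).to_be_bytes());
    for _ in 0..=max_retries {
        let (version, mut chunk) = kv.get(&key);
        // Insert after any samples with a timestamp <= ts_ms to keep sorted order.
        let pos = chunk.partition_point(|(t, _)| *t <= ts_ms);
        chunk.insert(pos, (ts_ms, value));
        if kv.put_if_version(&key, version, chunk) {
            return true;
        }
    }
    false
}
```

In this single-threaded toy the retry branch never fires. With a real transactional store, the commit would fail in the same place when another writer touched the key first.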

6. Read path

scan_range(tenant, table_id, metric_name, from_ms, to_ms) → Vec<Sample>
  1. chunk_start_0 = floor(from_ms, CHUNK_MS)
  2. chunk_start_N = floor(to_ms,   CHUNK_MS)
  3. keys = KvCluster::scan(prefix_range(metric, chunk_start_0, chunk_start_N+1))
  4. for each chunk: decode → filter samples in [from_ms, to_ms]
  5. return flat, sorted Vec<(ts_ms, value)>
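A minimal sketch of these steps, assuming the chunks of one series are already decoded and held in a BTreeMap keyed by chunk_start_ms. The map stands in for the KvCluster prefix scan, and the signature is simplified for illustration.

```rust
use std::collections::BTreeMap;

const CHUNK_MS: i64 = 3_600_000;

type Sample = (i64, f64);

/// Return all samples with from_ms <= ts <= to_ms, in timestamp order.
/// Assumes each chunk's samples are already sorted, as the write path keeps them.
fn scan_range(chunks: &BTreeMap<i64, Vec<Sample>>, from_ms: i64, to_ms: i64) -> Vec<Sample> {
    if from_ms > to_ms {
        return Vec::new();
    }
    // Steps 1 and 2: align both bounds down to chunk boundaries.
    let first = from_ms.div_euclid(CHUNK_MS) * CHUNK_MS;
    let last = to_ms.div_euclid(CHUNK_MS) * CHUNK_MS;
    chunks
        .range(first..=last) // step 3: ordered scan over the covering chunks
        .flat_map(|(_, samples)| samples.iter().copied())
        .filter(|(ts, _)| *ts >= from_ms && *ts <= to_ms) // step 4: trim the edges
        .collect() // step 5: flat and sorted, because chunks arrive in key order
}
```

Only the first and last chunk can contain samples outside the window, so the filter does real work on at most two chunks per query.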

7. Query primitives

Implemented as pure Rust functions over Vec<Sample> — no query planner dependency in this crate; the SQL planner calls these after a scan_range.

| Function | Input | Output |
|---|---|---|
| time_bucket(samples, bucket_ms) | sorted samples | Vec<(bucket_ts, avg_value)> |
| rate(samples, window_ms) | sorted samples (counter) | Vec<(ts, rate/sec)> |
| increase(samples, from_ms, to_ms) | sorted samples | f64 (total increase) |
| irate(samples) | sorted samples | Vec<(ts, instant_rate/sec)> |
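Two of these primitives can be sketched as plain functions over sorted samples. The bodies below are illustrative and are not the crate's implementation. In particular, skipping sample pairs where the counter decreased is an assumption about reset handling, which this RFC does not specify.

```rust
type Sample = (i64, f64);

/// Average the values falling in each bucket of width `bucket_ms`.
/// Input must be sorted by timestamp; output is one (bucket_ts, avg) per non-empty bucket.
fn time_bucket(samples: &[Sample], bucket_ms: i64) -> Vec<(i64, f64)> {
    assert!(bucket_ms > 0, "bucket width must be positive");
    // Accumulators: (bucket_ts, sum, count).
    let mut acc: Vec<(i64, f64, u32)> = Vec::new();
    for &(ts, v) in samples {
        let bucket = ts.div_euclid(bucket_ms) * bucket_ms;
        let same_bucket = acc.last().map_or(false, |a| a.0 == bucket);
        if same_bucket {
            let a = acc.last_mut().unwrap();
            a.1 += v;
            a.2 += 1;
        } else {
            acc.push((bucket, v, 1));
        }
    }
    acc.into_iter().map(|(b, sum, n)| (b, sum / n as f64)).collect()
}

/// Per-second rate between each pair of consecutive counter samples.
/// Assumption: pairs with a non-positive time step or a decreasing value
/// (a counter reset) are skipped rather than corrected.
fn irate(samples: &[Sample]) -> Vec<(i64, f64)> {
    samples
        .windows(2)
        .filter_map(|w| {
            let (t0, v0) = w[0];
            let (t1, v1) = w[1];
            let dt_ms = t1 - t0;
            if dt_ms <= 0 || v1 < v0 {
                return None;
            }
            Some((t1, (v1 - v0) / (dt_ms as f64 / 1000.0)))
        })
        .collect()
}
```

Both functions take a slice and return a fresh Vec, which matches the stated design of keeping this crate free of any planner dependency.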

8. Retention / TTL

v1: manual TTL via kdbctl ts compact --older-than <duration>. Background compaction from raw to downsampled tiers is planned for v2.0 (out of scope for the PoC).
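One way to read the manual TTL is as whole-chunk deletion: a chunk is dropped once its entire time window lies before the cutoff. The sketch below assumes that interpretation and uses a BTreeMap in place of the KV store. It does not describe what kdbctl actually does internally.

```rust
use std::collections::BTreeMap;

const CHUNK_MS: i64 = 3_600_000;

/// Remove every chunk whose window [start, start + CHUNK_MS) ends at or
/// before `now_ms - older_than_ms`. Returns the number of chunks removed.
/// Hypothetical helper; whole-chunk granularity is an assumption.
fn drop_expired(
    chunks: &mut BTreeMap<i64, Vec<(i64, f64)>>,
    now_ms: i64,
    older_than_ms: i64,
) -> usize {
    let cutoff = now_ms - older_than_ms;
    let before = chunks.len();
    chunks.retain(|start, _| start + CHUNK_MS > cutoff);
    before - chunks.len()
}
```

Working at chunk granularity means expiry never has to decode or rewrite a chunk, at the cost of keeping up to one chunk width of data past the cutoff.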

9. Implementation plan

| Phase | Scope | Ticket |
|---|---|---|
| PoC (this RFC) | Chunk key layout, prost encoding, append_sample, scan_range, 4 query functions, tests | #117 |
| v1.0 | Gorilla compression, retention TTL, cardinality index, kdbctl ts subcommand | #128 |
| v1.1 | Prometheus remote_write HTTP endpoint, OTLP metrics ingest | #129 |
| v2.0 | Downsampling background compaction (raw → 5m → 1h tiers) | #130 |

10. Decision log

| Date | Decision | Notes |
|---|---|---|
| 2026-04-15 | Drafted | Claude scribe; awaiting Rodrigo review |

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kdb-RFC-007-timeseries-storage.md