Kdb RFC 002 opentelemetry instrumentation

RFC-002 — OpenTelemetry instrumentation for kdb and kdb-next

Field	Value
Status	Draft
Author(s)	Rodrigo (with Claude as scribe)
Date	2026-04-10
Target module	`platform/kdb/` (Go) and `platform/kdb/next/` (Rust)
Related	RFC-001 §10 (cross-cutting observability), backlog #046
Supersedes	Ad-hoc Prometheus-only metrics in the current Go kdb

1. Summary

This RFC defines how the Koder database ("kdb") exposes *traces, metrics and logs* to operators, following the *OpenTelemetry* specification (OTel) end-to-end. The goal is that an operator running a Koder platform on top of kdb can plug in any OTel-compatible collector (Grafana Tempo/Mimir/Loki, Honeycomb, Jaeger, Datadog, New Relic, Chronosphere, …) and get:

request-level distributed traces across the full query path,
per-tenant RED (rate/errors/duration) metrics,
structured audit and error logs correlated by trace/span ids,

without touching code, only configuration.

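Today the Go kdb exposes Prometheus-compatible metrics and very limited internal tracing, and emits no OTLP. kdb-next (Rust) exports traces over OTLP from crates/kdb-obs/, but its metrics are Prometheus-only and its logs are not wired through OTel (see the Phase 2 status in §10). Neither implementation emits all three signals over OTLP. That blocks every downstream Koder product (id v2, flow, talk, bull, kortex, …) from using a unified observability backend, a requirement called out in RFC-001 §2.5 and promised in the product pitch.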

2. Goals

*Single spec, two implementations.* The Go kdb and Rust kdb-next both emit data that conforms to the OTel semantic conventions for databases, so a dashboard written against one works against the other.
*OTLP-first.* gRPC OTLP is the primary export protocol. HTTP/JSON OTLP is supported for constrained environments. Prometheus scrape is kept as a secondary compatibility surface — not removed.
*Zero-config baseline.* Running kdb with the default config on a host where no collector is reachable must still work: instrumentation is cheap enough to keep on by default, and exporters degrade to no-op when the collector is unreachable.
*Tenancy is a label, not a free-for-all.* Every span and log record carries tenant_id as an attribute; metrics carry only the bounded tenant_bucket derived from it (§6.1), so cardinality stays within the budget defined in RFC-001 §6.8 (metrics cardinality budget). High-cardinality attributes (query text, row ids) are never used as metric labels.
*Context propagation across the whole request.* A query that enters via gRPC must carry the W3C traceparent through parser → planner → executor → storage → replication RPC → follower. Async work (background compaction, vacuum, WAL ship) gets its own root span, tagged with the originating tenant when applicable.
*Privacy by default.* Query text, parameter values, row contents and credential material are *never* attached to spans or logs by default. An explicit opt-in (otel.capture_query_text = true) is required, and turning it on emits a warning in the audit log.

3. Non-goals

*Bundling a collector.* kdb emits OTLP; it does not run otel-collector in-process. Operators are expected to run a collector (sidecar or daemonset) in their environment.
*Replacing the access log / audit log.* Structured logs continue to live in their current place; they are also emitted via OTel logs for correlation, but the file-based access log stays as the source of truth for compliance audits.
*SDK vendoring gymnastics.* We use the upstream OTel SDKs as-is (go.opentelemetry.io/otel for Go; the opentelemetry crate family for Rust). We do not fork them.

4. Semantic conventions

We follow the OTel *database semantic conventions* (v1.27+):

Attribute	Meaning
`db.system`	`"kdb"` (new value, to be registered upstream; until then, custom)
`db.name`	logical database name within the tenant
`db.operation`	`SELECT` / `INSERT` / `UPDATE` / `DELETE` / `DDL` / `KV_GET` / …
`db.statement`	the query text — only if `otel.capture_query_text = true`
`db.kdb.tenant_id`	tenant (org) id — Koder-specific
`db.kdb.shard_id`	shard serving the request
`db.kdb.model`	`relational` / `document` / `object` / `keyvalue` / `graph`
`db.kdb.isolation_level`	`read_committed` / `repeatable_read` / `serializable` / `snapshot`
`db.kdb.plan_hash`	hash of the chosen query plan (for plan stability analysis)
`db.kdb.rows_examined`	rows read from storage
`db.kdb.rows_returned`	rows returned to client
`db.kdb.cache_hit`	boolean — plan cache or result cache hit

Attributes with the db.kdb. prefix are Koder-specific extensions; the rest are standard OTel. If/when OTel registers a kdb system value, we'll drop the custom prefix.
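To make the privacy rule concrete, here is a minimal Go sketch of assembling the §4 attributes for a kdb.query span. It uses a plain `map[string]any` instead of the OTel `attribute` package so it runs on the standard library alone; the `QueryInfo` struct and `queryAttributes` function are hypothetical names, and isolation level and plan hash are left out for brevity.

```go
package main

// QueryInfo holds facts the executor knows about one query. The struct and
// its field names are hypothetical; they only mirror the §4 attribute table.
type QueryInfo struct {
	Database     string
	Operation    string // SELECT, INSERT, KV_GET, ...
	Statement    string // raw query text: sensitive
	TenantID     string
	ShardID      string
	Model        string // relational, document, object, keyvalue, graph
	RowsExamined int64
	RowsReturned int64
	CacheHit     bool
}

// queryAttributes returns the attributes for a kdb.query span. db.statement
// is added only when captureQueryText is true (otel.capture_query_text).
func queryAttributes(q QueryInfo, captureQueryText bool) map[string]any {
	attrs := map[string]any{
		"db.system":            "kdb",
		"db.name":              q.Database,
		"db.operation":         q.Operation,
		"db.kdb.tenant_id":     q.TenantID,
		"db.kdb.shard_id":      q.ShardID,
		"db.kdb.model":         q.Model,
		"db.kdb.rows_examined": q.RowsExamined,
		"db.kdb.rows_returned": q.RowsReturned,
		"db.kdb.cache_hit":     q.CacheHit,
	}
	if captureQueryText {
		attrs["db.statement"] = q.Statement
	}
	return attrs
}
```

Gating inside the builder, rather than at each call site, means db.statement cannot reach a span unless otel.capture_query_text was switched on.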

5. Spans

5.1 Required spans (every query)

kdb.query                        — root span for each incoming request
├── kdb.parse                    — tokenize + parse (KQL/SQL)
├── kdb.plan                     — query planner
├── kdb.execute                  — top-level executor
│   ├── kdb.storage.read         — per storage access
│   ├── kdb.storage.write        — per storage write
│   └── kdb.replication.append   — Raft log append (writes only)
└── kdb.serialize                — result marshaling

5.2 Required spans (background work)

kdb.bg.compaction                — storage compaction batch
kdb.bg.vacuum                    — per-tenant vacuum pass
kdb.bg.wal_ship                  — WAL streaming (see RFC-043)
kdb.bg.backup                    — backup job

Background spans are tagged with db.kdb.tenant_id when the work is scoped to a single tenant; otherwise they carry db.kdb.tenant_id = "<shared>".
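A minimal Go sketch of this tagging rule, assuming the background job knows the set of tenants it touches (both that input shape and the `bgTenantAttr` helper are hypothetical):

```go
package main

// sharedTenant is the §5.2 value for work not scoped to one tenant.
const sharedTenant = "<shared>"

// bgTenantAttr returns db.kdb.tenant_id for a background span, given the
// tenants the job touches (a hypothetical input shape).
func bgTenantAttr(tenants []string) string {
	distinct := make(map[string]struct{}, len(tenants))
	for _, t := range tenants {
		distinct[t] = struct{}{}
	}
	if len(distinct) == 1 {
		return tenants[0] // every entry is the same tenant
	}
	return sharedTenant // none (node-wide work) or several tenants
}
```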

5.3 Context propagation

*Inbound (client → kdb):* W3C traceparent / tracestate headers are read from gRPC metadata and HTTP headers.
*Intra-kdb (node → node):* context is propagated via gRPC metadata on every internal RPC (Raft, replication, Record API — see RFC-047).
*Async handoff:* when the executor enqueues async work, it stores the parent span context in the job descriptor; the worker starts a new span with that context as parent.
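The inbound and async-handoff rules can be sketched with the standard library alone. `SpanContext`, `Job` and `childOf` are hypothetical stand-ins; in kdb itself the propagator and span context types from go.opentelemetry.io/otel would do this work. The parser accepts only version `00` of the W3C format (`version-traceid-parentid-flags`) and rejects all-zero IDs, as the W3C Trace Context spec requires.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"strings"
)

// SpanContext is a hypothetical stand-in for the OTel span context.
type SpanContext struct {
	TraceID string // 32 lowercase hex chars
	SpanID  string // 16 lowercase hex chars
	Sampled bool
}

// validID reports whether s is exactly n lowercase hex chars and not all zeros.
func validID(s string, n int) bool {
	if len(s) != n {
		return false
	}
	nonZero := false
	for _, c := range s {
		switch {
		case c >= '1' && c <= '9', c >= 'a' && c <= 'f':
			nonZero = true
		case c == '0':
		default:
			return false
		}
	}
	return nonZero
}

// ParseTraceparent parses a version-00 W3C traceparent header value.
func ParseTraceparent(h string) (SpanContext, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return SpanContext{}, errors.New("traceparent: unsupported version or shape")
	}
	if !validID(parts[1], 32) || !validID(parts[2], 16) || len(parts[3]) != 2 {
		return SpanContext{}, errors.New("traceparent: malformed IDs or flags")
	}
	flags, err := hex.DecodeString(parts[3])
	if err != nil {
		return SpanContext{}, errors.New("traceparent: malformed flags")
	}
	return SpanContext{TraceID: parts[1], SpanID: parts[2], Sampled: flags[0]&0x01 == 0x01}, nil
}

// Header renders the context as a traceparent value for outbound RPCs.
func (sc SpanContext) Header() string {
	flags := "00"
	if sc.Sampled {
		flags = "01"
	}
	return "00-" + sc.TraceID + "-" + sc.SpanID + "-" + flags
}

// Job is a hypothetical async job descriptor; Parent is the enqueuing span.
type Job struct {
	Kind   string
	Parent SpanContext
}

// childOf returns the context for a new span under parent: same trace ID and
// sampled flag, fresh non-zero random span ID. The new span records
// parent.SpanID as its parent.
func childOf(parent SpanContext) SpanContext {
	var id [8]byte
	for id == ([8]byte{}) {
		_, _ = rand.Read(id[:])
	}
	return SpanContext{TraceID: parent.TraceID, SpanID: hex.EncodeToString(id[:]), Sampled: parent.Sampled}
}
```

The worker's span keeps the trace ID and sampled flag and only mints a new span ID, which is what makes it appear under the enqueuing query in the trace view instead of as a separate trace.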

6. Metrics

All metrics are emitted via the OTel metrics SDK and, in parallel, exposed on the existing Prometheus scrape endpoint (same numbers, same labels, so existing dashboards keep working during the migration).

6.1 RED metrics (per operation)

Name	Type	Unit	Labels
`kdb.query.count`	Counter	1	`operation`, `model`, `status`, `tenant_bucket`
`kdb.query.duration`	Histogram	s	`operation`, `model`, `tenant_bucket`
`kdb.query.rows_returned`	Histogram	1	`operation`, `model`
`kdb.query.errors`	Counter	1	`operation`, `error_kind`

tenant_bucket is a bounded-cardinality label: tenants are hashed into N buckets (default N=64). This keeps per-tenant visibility for p99/p999 analysis without blowing up label cardinality. Raw tenant_id is only carried on spans, where cardinality doesn't compound.
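One way to implement the bucketing, sketched in Go under assumptions: FNV-1a over the tenant ID, followed by the MurmurHash3 64-bit finalizer so that the reduction modulo N sees well-mixed low bits (plain FNV-1a taken modulo 64 depends only on the low 6 bits of each input byte). The hash choice and the `tenantBucket` name are assumptions; what matters is that Go kdb and kdb-next use the identical function, or the same tenant lands in different buckets on the two implementations.

```go
package main

import "hash/fnv"

// defaultTenantBuckets is N from §6.1.
const defaultTenantBuckets = 64

// mix64 is the MurmurHash3 64-bit finalizer: it spreads every input bit
// across the output, so the modulo below sees well-mixed low bits.
func mix64(k uint64) uint64 {
	k ^= k >> 33
	k *= 0xff51afd7ed558ccd
	k ^= k >> 33
	k *= 0xc4ceb9fe1a85ec53
	k ^= k >> 33
	return k
}

// tenantBucket maps a tenant ID to a stable bucket in [0, n).
func tenantBucket(tenantID string, n uint64) uint64 {
	h := fnv.New64a()
	h.Write([]byte(tenantID)) // Write on an FNV hash never returns an error
	return mix64(h.Sum64()) % n
}
```

Because the mapping is stable, an operator can recompute a tenant's bucket during an investigation to find the series it contributes to.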

6.2 Resource metrics

Name	Type	Unit
`kdb.storage.bytes_used`	Gauge	bytes
`kdb.storage.rows`	Gauge	1
`kdb.connections.active`	Gauge	1
`kdb.connections.idle`	Gauge	1
`kdb.txn.active`	Gauge	1
`kdb.txn.aborted`	Counter	1
`kdb.replication.lag`	Gauge	s
`kdb.cache.hit_ratio`	Gauge	1

6.3 Cardinality budget

The cardinality budget from RFC-001 §6.8 applies: the total number of unique (metric × label set) tuples per node must not exceed 500k. Adding a new label to an existing metric requires updating this budget in this RFC and benchmarking the worst case.
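Checking a label change against the budget is multiplication: each metric contributes the product of its labels' distinct-value counts, and the node total is the sum over metrics. The Go sketch below does this for the §6.1 metrics. Only model (5 values, from §4) and tenant_bucket (64) come from this RFC; the counts for operation, status and error_kind are placeholders to be replaced by measured values.

```go
package main

// labelValues holds assumed worst-case distinct-value counts per label.
// model (5, §4) and tenant_bucket (64, §6.1) come from this RFC; the rest
// are placeholders.
var labelValues = map[string]int{
	"operation":     8,
	"model":         5,
	"status":        2,
	"tenant_bucket": 64,
	"error_kind":    10,
}

// redLabels lists the labels of each §6.1 metric.
var redLabels = map[string][]string{
	"kdb.query.count":         {"operation", "model", "status", "tenant_bucket"},
	"kdb.query.duration":      {"operation", "model", "tenant_bucket"},
	"kdb.query.rows_returned": {"operation", "model"},
	"kdb.query.errors":        {"operation", "error_kind"},
}

// seriesBudget is the per-node limit from RFC-001 §6.8.
const seriesBudget = 500_000

// worstCaseSeries sums, over metrics, the product of each metric's label
// cardinalities: the upper bound on unique (metric × label set) tuples.
func worstCaseSeries(metrics map[string][]string, values map[string]int) int {
	total := 0
	for _, labels := range metrics {
		series := 1
		for _, l := range labels {
			series *= values[l]
		}
		total += series
	}
	return total
}
```

With these placeholders the §6.1 set comes to 5,120 + 2,560 + 40 + 80 = 7,800 tuples. Swapping tenant_bucket for raw tenant_id on kdb.query.count alone, at 1M tenants, would give 8 × 5 × 2 × 1,000,000 = 80,000,000, which is 160 times the 500k budget. On the Prometheus endpoint, each histogram label set additionally fans out into one `_bucket` series per boundary plus `_sum` and `_count`.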

7. Logs

Every log record is emitted via the OTel logs SDK and includes the active trace_id and span_id, so operators can jump from a span in the trace view to its log records.
Severity levels follow OTel (TRACE, DEBUG, INFO, WARN, ERROR, FATAL).
The existing file-based access log (src/log/access.kmd) remains the source of truth for audit; OTel logs are for operator observability, not compliance.
Sensitive fields (credentials, query parameter values, row data) are redacted at the SDK layer by an attribute processor, rather than by relying on callers to be careful.
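A minimal Go sketch of such an attribute processor, assuming string-valued attributes. The attribute name `db.kdb.params` and the prefixes `db.kdb.row.` and `auth.` are hypothetical placeholders for wherever kdb actually records parameters, row data and credentials; db.statement is the §4 attribute.

```go
package main

import "strings"

const redacted = "[REDACTED]"

// alwaysRedactPrefixes covers credentials and row data; the prefixes are
// hypothetical placeholders for kdb's real attribute names.
var alwaysRedactPrefixes = []string{"auth.", "db.kdb.row."}

// Redactor rewrites attributes before export. Its two flags mirror
// otel.capture_query_text and otel.capture_params.
type Redactor struct {
	CaptureQueryText bool
	CaptureParams    bool
}

func (r Redactor) sensitive(key string) bool {
	switch key {
	case "db.statement": // query text, §4
		return !r.CaptureQueryText
	case "db.kdb.params": // hypothetical: bound parameter values
		return !r.CaptureParams
	}
	for _, p := range alwaysRedactPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// Process returns a copy of attrs with sensitive values replaced.
func (r Redactor) Process(attrs map[string]string) map[string]string {
	out := make(map[string]string, len(attrs))
	for k, v := range attrs {
		if r.sensitive(k) {
			v = redacted
		}
		out[k] = v
	}
	return out
}
```

Replacing values with a marker instead of dropping the keys keeps it visible in the trace view that a field existed but was withheld.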

8. Configuration

New section in kdb.toml:

[otel]
enabled         = true
service_name    = "kdb"
service_version = "0.9.0"       # filled by build
resource_attrs  = { deployment_environment = "prod" }

# Exporter: "otlp-grpc" (default) | "otlp-http" | "none"
exporter        = "otlp-grpc"
endpoint        = "http://otel-collector:4317"
insecure        = false
headers         = { }           # e.g. auth tokens

# Sampling
traces_sampler  = "parentbased_traceidratio"
traces_ratio    = 0.01          # 1% of root spans
metrics_interval = "10s"
logs_enabled    = true

# Privacy knobs — off by default
capture_query_text = false
capture_params     = false

Environment variables follow the standard OTel names (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, …) and override the TOML.
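The precedence rule (environment over TOML) can be sketched as an overlay applied after kdb.toml is parsed. This Go sketch assumes a `Config` struct holding a few [otel] fields; the variable names are the standard OTel ones, with OTEL_TRACES_SAMPLER and OTEL_TRACES_SAMPLER_ARG being the spec's names for the sampler and its ratio. Ignoring an out-of-range ratio and keeping the TOML value is a design assumption.

```go
package main

import (
	"os"
	"strconv"
)

// Config holds a few [otel] fields after kdb.toml has been parsed.
type Config struct {
	ServiceName   string
	Endpoint      string
	TracesSampler string
	TracesRatio   float64
}

// applyEnv overlays standard OTel environment variables on the TOML values.
// lookup has os.LookupEnv's signature so tests can pass a fake environment.
func applyEnv(c Config, lookup func(string) (string, bool)) Config {
	if v, ok := lookup("OTEL_SERVICE_NAME"); ok && v != "" {
		c.ServiceName = v
	}
	if v, ok := lookup("OTEL_EXPORTER_OTLP_ENDPOINT"); ok && v != "" {
		c.Endpoint = v
	}
	if v, ok := lookup("OTEL_TRACES_SAMPLER"); ok && v != "" {
		c.TracesSampler = v
	}
	if v, ok := lookup("OTEL_TRACES_SAMPLER_ARG"); ok {
		// Assumption: an unparsable or out-of-range ratio keeps the TOML value.
		if r, err := strconv.ParseFloat(v, 64); err == nil && r >= 0 && r <= 1 {
			c.TracesRatio = r
		}
	}
	return c
}

// applyProcessEnv is the production entry point, reading the real environment.
func applyProcessEnv(c Config) Config {
	return applyEnv(c, os.LookupEnv)
}
```

Injecting `lookup` keeps the precedence rule testable without mutating the process environment.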

9. Sampling strategy

*Default:* parent-based trace ID ratio at 1%.
*Forced sampling:* any request whose inbound traceparent is already marked sampled by an upstream caller is always sampled (we respect the parent).
*Error boost:* if a query returns an error, the span is upgraded to "sampled" regardless of the ratio (tail-based, applied at export time via a custom processor).
*Slow query boost:* same as error boost, but for queries exceeding the per-operation p99 target in RFC-001 §5.
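A Go sketch of the two decision points, under assumptions. Head sampling follows the parent when there is one, and otherwise compares the low 63 bits of the trace ID's last 8 bytes against ratio × 2^63, similar to the Go SDK's TraceIDRatioBased sampler, so every node reaches the same verdict for the same trace. The export-time boost is a plain predicate; `headSample` and `keepAtExport` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"time"
)

// headSample is the decision at span start. A parent's flag wins; root spans
// are sampled from the trace ID so every node agrees on the same trace.
func headSample(traceID [16]byte, hasParent, parentSampled bool, ratio float64) bool {
	if hasParent {
		return parentSampled
	}
	if ratio >= 1 {
		return true
	}
	if ratio <= 0 {
		return false
	}
	bound := uint64(ratio * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

// keepAtExport is the tail-based boost: a recorded span is exported if it was
// head-sampled, failed, or ran longer than its operation's p99 target.
func keepAtExport(headSampled, failed bool, elapsed, p99Target time.Duration) bool {
	return headSampled || failed || elapsed > p99Target
}
```

Two caveats follow from this design. The boost can only apply to spans that were still recorded when head sampling said no (the SDK's RecordOnly decision); dropped spans leave nothing to upgrade. And the sampled flag has already propagated downstream with the original verdict, so a boosted span may arrive without its children from other nodes; complete traces for errors would need tail sampling in the collector.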

10. Implementation plan

Phase 1 — Go kdb baseline (2 weeks)

Add go.opentelemetry.io/otel + otlpgrpc exporter deps.
Wire TracerProvider, MeterProvider, LoggerProvider from kdb.toml.
Instrument the gRPC entry point to create the kdb.query root span and extract inbound context.
Instrument parser, planner, executor with child spans.
Port existing Prometheus metrics to OTel metrics and serve them through the OTel Prometheus exporter (go.opentelemetry.io/otel/exporters/prometheus), so the scrape endpoint keeps working and OTLP also ships.
Add the RED metrics from §6.1 where missing.
Redaction processor for sensitive attributes.

Phase 2 — kdb-next (Rust) parity (1 week, can run in parallel)

*Status on 2026-04-10:* crates/kdb-obs/ already has a working traces pipeline: tracing_setup.rs wires an OTLP gRPC exporter with parent-based 1% sampling and a 100 ms slow-request threshold, and bridges the existing tracing spans via tracing-opentelemetry. Metrics are still Prometheus-only (metrics.rs uses the prometheus crate directly, with tiered tenant bucketing: top-100 → `tier-a`, next-1000 → `tier-b`, rest → `other`). Logs are not wired through OTel yet.

Remaining work:

*Metrics via OTLP.* Add the opentelemetry-otlp metrics exporter and mirror the existing Prometheus metrics into it. Keep the Prometheus scrape endpoint on for compatibility.
*Align metric names and label sets with §6.* Today the crate uses its own ad-hoc names like rpc_duration; they need to become kdb.query.duration etc. Keep the Prometheus names as a translation layer to avoid breaking current dashboards.
*Logs via OTel.* Wire opentelemetry-appender-tracing (or equivalent) so log records carry trace_id / span_id and are exported via OTLP. Keep the file sink for the access log.
*Propagate context beyond the gateway.* Today only the gateway RPC entry point creates spans. Extend instrumentation into kdb-record, kdb-kv-trait and replication RPCs so the full query tree from §5.1 shows up in a trace, not just the root rpc span.
*Adopt the §4 semantic conventions* as attributes on existing spans (db.kdb.tenant_id, db.kdb.model, db.kdb.shard_id, …). The current crate only sets tenant.id and rpc.service / rpc.method.
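The translation layer in the name-alignment item can be as small as a lookup table consulted when publishing to the scrape endpoint. It is sketched in Go for consistency with the other examples (kdb-next would do the same in Rust); only rpc_duration → kdb.query.duration comes from this RFC, and the underscore fallback is an assumption.

```go
package main

import "strings"

// legacyNames maps §6 OTel names to names current dashboards query. Only
// rpc_duration is known from kdb-obs; the rest still need auditing.
var legacyNames = map[string]string{
	"kdb.query.duration": "rpc_duration",
}

// promName picks the name published on the Prometheus scrape endpoint.
func promName(otelName string) string {
	if legacy, ok := legacyNames[otelName]; ok {
		return legacy
	}
	// No legacy name: fall back to the OTel name with dots replaced.
	return strings.ReplaceAll(otelName, ".", "_")
}
```

Prometheus metric names traditionally cannot contain dots, hence the underscore fallback; OTel-to-Prometheus exporters may also append unit or `_total` suffixes, so check the names the chosen exporter actually emits before pinning dashboards to them.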

Phase 3 — Replication and background (1 week)

Propagate context through Raft RPCs.
Background span roots for compaction, vacuum, WAL ship, backup.
Add kdb.replication.lag gauge.

Phase 4 — Dashboards and smoke tests (1 week)

Ship a reference Grafana dashboard JSON under observe/dashboards/kdb.json.
Integration test: run kdb against an in-process OTLP collector, fire N queries, and assert that spans, metrics and logs are received with the expected attributes.
Cardinality test: generate 1M distinct tenants and assert that metric cardinality stays under budget.

11. Migration and compatibility

The existing Prometheus scrape endpoint stays on for at least two minor versions after Phase 1 ships. Only after downstream Koder products have moved their dashboards to OTLP will we consider removing it, and that removal is a separate RFC.
The existing access log file format is untouched.
No breaking changes to kdb.toml: the new [otel] section is additive, and the defaults are backwards compatible (enabled = true and exporter = "otlp-grpc", but endpoint is empty by default and the exporter degrades to no-op when it is empty).
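The configuration half of the no-op rule, sketched in Go (`exporterKind` is a hypothetical helper). Treating an unknown exporter value as a startup error while an empty endpoint silently resolves to `none` is an assumed split: a typo should be loud, an absent collector should not. A collector that becomes unreachable at runtime is a separate case, left to the exporter's retry and drop behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// exporterKind resolves the effective exporter from the [otel] section.
func exporterKind(enabled bool, exporter, endpoint string) (string, error) {
	if exporter == "" {
		exporter = "otlp-grpc" // §8 default
	}
	switch exporter {
	case "otlp-grpc", "otlp-http", "none":
	default:
		return "", fmt.Errorf("otel.exporter: unknown value %q", exporter)
	}
	// Disabled, explicitly off, or no endpoint configured: degrade to no-op.
	if !enabled || exporter == "none" || strings.TrimSpace(endpoint) == "" {
		return "none", nil
	}
	return exporter, nil
}
```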

12. Open questions

*Do we register db.system = "kdb" upstream with OTel?* Doing so makes kdb a recognized database system for OTel-aware APM tools. It's a ~3-month process. Proposed: yes, start in parallel with Phase 1.
*How do we bound the tenant_bucket label at 100M tenants?* With 64 buckets, each bucket averages ~1.5M tenants (100M / 64 ≈ 1.56M), which is fine for p99 analysis but cannot answer "which tenant is slow?". Proposed: spans carry raw tenant_id, metrics carry the bucketed value; operators doing single-tenant investigation use the trace view.
*Sampling in control plane vs data plane.* Control-plane calls (auth, schema lookups) are cheap but very frequent. Should they have a lower sample ratio than data-plane queries? Proposed: yes, via a separate traces_ratio_control_plane = 0.001 knob in Phase 3.
*Log volume.* Emitting every log record via OTel could increase log volume tenfold for operators who don't want it. Proposed: logs_enabled = true by default but level-gated at INFO; debug logs are only emitted to the file sink.
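The proposed gate can be sketched in Go using the OTel log data model's SeverityNumber ranges (TRACE 1-4, DEBUG 5-8, INFO 9-12, WARN 13-16, ERROR 17-20, FATAL 21-24). That every record still reaches the file sink is an assumption based on §7 keeping the existing logs in place; `route` and `Sinks` are hypothetical names.

```go
package main

// Severity uses the OTel log data model's SeverityNumber scale; each level's
// range starts at the value below.
type Severity int

const (
	SevTrace Severity = 1
	SevDebug Severity = 5
	SevInfo  Severity = 9
	SevWarn  Severity = 13
	SevError Severity = 17
	SevFatal Severity = 21
)

// Sinks says where one record goes.
type Sinks struct {
	File bool
	OTel bool
}

// route applies the proposed gate: every record goes to the file sink; only
// INFO and above go to OTel, and only when logs_enabled is true.
func route(sev Severity, logsEnabled bool) Sinks {
	return Sinks{File: true, OTel: logsEnabled && sev >= SevInfo}
}
```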

13. References

OpenTelemetry Semantic Conventions for Database Client Calls —
https://opentelemetry.io/docs/specs/semconv/database/
OpenTelemetry Protocol (OTLP) —
https://opentelemetry.io/docs/specs/otlp/
RFC-001 §2.5, §6.8, §10 (this repo)
backlog/pending/046-opentelemetry-instrumentation.kmd

14. Decision log

Date	Decision	Notes
2026-04-10	Drafted	Claude scribe, awaiting Rodrigo review
2026-04-10	Corrected	Phase 2 rewritten after auditing `crates/kdb-obs/`: traces pipeline already exists; remaining work is metrics+logs OTLP and spec alignment