AI Dataset — Versioning + Splits + Schema + Dedup
- **Area:** Intelligence
- **Path:** `services/ai/dataset`
- **Kind:** Dataset foundation (versioning + splits + schema validation + dedup; analog of HF Datasets / W&B Tables / DVC)
- **Status:** v0.0.1, sector bootstrapping (2026-05-09)
Role in the stack
`dataset` consolidates ML data plumbing. ML reproducibility requires versioned datasets: without them, "trained on dataset X" is meaningless. `products/horizontal/libras` is currently designing its ingest pipeline in isolation; if a second project starts the same way, it will reinvent splits, schema validation, dedup and lineage from scratch.
It is the Koder analog of HuggingFace Datasets, Weights & Biases Tables, DVC, lakeFS and Pachyderm, built thin on top of `kdb-blob` (content-addressable) + `kdb-meta` (manifests + lineage), with a `kdataset` CLI for the human path.
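To illustrate what "content-addressable blobs plus manifests" can mean in practice, here is a minimal sketch. It is not the `kdb-blob` or `kdb-meta` API: `BlobStore`, `make_manifest` and the manifest fields (`parent`, `files`) are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
from __future__ import annotations

import hashlib
import json


def blob_key(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the bytes (assumed hash)."""
    return hashlib.sha256(data).hexdigest()


class BlobStore:
    """Hypothetical in-memory stand-in for a content-addressable store."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = blob_key(data)
        # Identical content maps to the same key, so it is stored only once.
        self._blobs.setdefault(key, data)
        return key

    def __len__(self) -> int:
        return len(self._blobs)


def make_manifest(store: BlobStore, files: dict[str, bytes],
                  parent: str | None = None) -> dict:
    """Build a manifest that maps logical file names to blob keys."""
    entries = {name: store.put(data) for name, data in sorted(files.items())}
    body = {"parent": parent, "files": entries}
    # The manifest id is a hash of its canonical JSON form.
    manifest_id = blob_key(json.dumps(body, sort_keys=True).encode("utf-8"))
    return {"id": manifest_id, **body}


store = BlobStore()
v1 = make_manifest(store, {"train.jsonl": b"a\nb\n", "test.jsonl": b"c\n"})
v2 = make_manifest(
    store,
    {"train.jsonl": b"a\nb\n", "test.jsonl": b"c\nd\n"},
    parent=v1["id"],
)
```

In this sketch `v2` changes only `test.jsonl`, so the unchanged `train.jsonl` blob is shared between both versions and the store holds three blobs rather than four.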
Boundary vs neighbors
- `services/ai/training` is the primary downstream consumer (mandatory input).
- `services/ai/embed` is used internally for semantic dedup.
- `services/ai/extract` is a curated source upstream.
- Generic ETL is out of scope (separate problem space).
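As a rough illustration of the embed boundary, the sketch below separates exact dedup (content hash) from semantic dedup (cosine similarity over embedding vectors). The vectors are assumed to come from `services/ai/embed`; the function names, the greedy keep-first strategy and the `0.95` threshold are assumptions made for this example, not the service's design.

```python
import hashlib
import math


def exact_dedup(records):
    """Keep the first occurrence of each distinct text, compared by SHA-256."""
    seen = set()
    kept = []
    for text in records:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(vectors, threshold=0.95):
    """Return indices to keep: greedy O(n^2) pass that drops any vector whose
    similarity to an already kept vector reaches the (assumed) threshold."""
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


unique_texts = exact_dedup(["hello", "world", "hello"])
kept_indices = semantic_dedup([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
```

The quadratic comparison is only reasonable for small inputs; a real implementation would likely need an approximate nearest-neighbour index, which is outside this sketch.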
Features (v1 target)
- Content-addressable manifests (versions share blobs)
- Schema validation (strict / permissive / repair modes)
- Deterministic splits (hashed partitioning; stratified + group-aware)
- Dedup (exact-hash + semantic-cosine via embed)
- Lineage tracking (every transform records inputs + params)
- Cross-tenant explicit grant
- Parquet + JSONL + Arrow IO
- `kdataset` CLI as first-class human path
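The deterministic, group-aware split listed above can be sketched as hashed partitioning on a group key. This is an illustration under stated assumptions: the `assign_split` function, the `seed` parameter, the 80/10/10 ratios and the `group` field are all hypothetical, and stratification is not covered here.

```python
import hashlib

# Assumed default ratios; the shares must sum to 1.0.
DEFAULT_RATIOS = (("train", 0.8), ("val", 0.1), ("test", 0.1))


def assign_split(group_key, seed="v1", ratios=DEFAULT_RATIOS):
    """Map a group key to a split name using a stable hash.

    The same (seed, group_key) pair always yields the same split,
    independent of record order or of which other records exist.
    """
    digest = hashlib.sha256(f"{seed}:{group_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if bucket < cumulative:
            return name
    return ratios[-1][0]  # guard against floating-point rounding


records = [
    {"id": 1, "group": "source-a"},
    {"id": 2, "group": "source-a"},
    {"id": 3, "group": "source-b"},
]
# Hashing the group (not the record id) keeps a whole group in one split.
splits = {r["id"]: assign_split(r["group"]) for r in records}
```

Because the assignment depends only on the seed and the group key, adding new records never moves existing ones between splits, and records that share a group cannot leak across train and test.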
Primary couplings
| Consumer | Relationship |
|---|---|
| `services/ai/training` | Primary consumer (mandatory) |
| `products/horizontal/libras` | First domain dataset |
| `services/ai/embed` | Semantic dedup engine |
| `services/ai/extract` | Curated entity source |
| `services/ai/recsys` | User+item interaction datasets |
| `services/ai/agents` | Eval datasets for agent benchmarks |
| `infra/data/kdb-blob` | File storage (content-addressable) |
| `infra/data/kdb-meta` | Manifests + lineage + tags |
RFC and bootstrap
- RFC: `dataset-RFC-001-foundations.kmd` (**Accepted** 2026-05-09)
- Bootstrap ticket: `services/ai/backlog/done/123-dataset-bootstrap.md`
- Implementation tickets: `services/ai/dataset/backlog/pending/{001..005}`
Self-hosted-first analysis (5 gates)
| Gate | Status | Notes |
|---|---|---|
| G1 Feature parity | pending | Skeleton phase; covers HF Datasets + DVC essentials |
| G2 Performance | pending | Streaming required for multi-GB datasets; bound by storage IO |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Generic ETL out of scope; everything else covered |
| G5 Critical-path readiness | pending | Pretraining |