AI Dataset — Versioning + Splits + Schema + Dedup

*Area:* Intelligence
*Path:* services/ai/dataset
*Kind:* Dataset foundation (versioning + splits + schema validation + dedup; analog of HF Datasets / W&B Tables / DVC)
*Status:* v0.0.1, sector bootstrapping (20260509)

Role in the stack

`dataset` consolidates ML data plumbing. ML reproducibility requires versioned datasets: without them, "trained on dataset X" is meaningless. products/horizontal/libras is currently designing its ingest pipeline in isolation; if a second project starts the same way, it will reinvent splits, schema validation, dedup and lineage from scratch.

It is the Koder analog of HuggingFace Datasets, Weights & Biases Tables, DVC, lakeFS and Pachyderm, built thin on top of `kdb-blob` (content-addressable) + `kdb-meta` (manifests + lineage), with a `kdataset` CLI for the human path.
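
To make the layering concrete, here is a minimal sketch of how a manifest over a content-addressable store lets two dataset versions share blobs. Everything in it is an assumption for illustration: `BlobStore`, `blob_key` and `make_manifest` are hypothetical names, the store is an in-memory dict, and none of this reflects the actual `kdb-blob` or `kdb-meta` APIs.

```python
import hashlib
import json


def blob_key(data: bytes) -> str:
    """Content address of a blob: the SHA-256 hex digest of its bytes."""
    return hashlib.sha256(data).hexdigest()


class BlobStore:
    """Hypothetical in-memory stand-in for a content-addressable store."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = blob_key(data)
        # Identical bytes map to the same key, so they are stored only once.
        self._blobs.setdefault(key, data)
        return key

    def __len__(self) -> int:
        return len(self._blobs)


def make_manifest(store: BlobStore, files: dict) -> dict:
    """Build a manifest mapping file names to blob keys.

    The version id is the hash of the canonical manifest JSON, so identical
    content always yields the same version id.
    """
    entries = {name: store.put(data) for name, data in sorted(files.items())}
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"version": blob_key(canonical), "files": entries}


store = BlobStore()
v1 = make_manifest(store, {"train.jsonl": b"a\nb\n", "test.jsonl": b"c\n"})
v2 = make_manifest(store, {"train.jsonl": b"a\nb\n", "test.jsonl": b"c\nd\n"})

# Only test.jsonl changed, so the store holds three blobs rather than four.
print(len(store), v1["version"] != v2["version"])
```

In this sketch a new version costs only the storage of the blobs that changed, and a version id doubles as an integrity check on the manifest.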

Boundary vs neighbors

services/ai/training is the primary downstream consumer (mandatory input).
services/ai/embed is used internally for semantic dedup.
services/ai/extract is a curated source upstream.
Generic ETL is out of scope (separate problem space).

Features (v1 target)

Content-addressable manifests (versions share blobs)
Schema validation (strict / permissive / repair modes)
Deterministic splits (hashed partitioning; stratified + group-aware)
Dedup (exact-hash + semantic-cosine via embed)
Lineage tracking (every transform records inputs + params)
Cross-tenant explicit grant
Parquet + JSONL + Arrow IO
kdataset CLI as first-class human path
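
Two of the features above, exact-hash dedup and deterministic group-aware splits, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the service's implementation: `exact_dedup`, `assign_split`, the `salt` parameter, the 80/10/10 ratios and the `group`/`id` record fields are all hypothetical. Stratification and semantic dedup are not covered here.

```python
import hashlib
import json


def exact_dedup(records):
    """Drop records whose canonical JSON encoding has already been seen."""
    seen = set()
    unique = []
    for rec in records:
        # sort_keys makes the encoding independent of key order.
        payload = json.dumps(rec, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


def assign_split(group_key, salt="v1", train=0.8, val=0.1):
    """Map a group key to a split by hashing it, with no random state."""
    digest = hashlib.sha256(f"{salt}:{group_key}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to roughly [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + val:
        return "val"
    return "test"


records = [
    {"group": "g1", "id": "a"},
    {"group": "g1", "id": "b"},
    {"group": "g2", "id": "c"},
    {"id": "a", "group": "g1"},  # same content as the first record
]
unique = exact_dedup(records)

# Hashing the group key (not the record id) keeps a group in a single split.
splits = {rec["id"]: assign_split(rec["group"]) for rec in unique}
```

Because the assignment depends only on the salt and the group key, re-running the split on a grown dataset leaves existing records where they were, and records from one group never leak across train and test.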

Primary couplings

Consumer	Relationship
`services/ai/training`	Primary consumer (mandatory)
`products/horizontal/libras`	First domain dataset
`services/ai/embed`	Semantic dedup engine
`services/ai/extract`	Curated entity source
`services/ai/recsys`	User+item interaction datasets
`services/ai/agents`	Eval datasets for agent benchmarks
`infra/data/kdb-blob`	File storage (content-addressable)
`infra/data/kdb-meta`	Manifests + lineage + tags

RFC and bootstrap

RFC: dataset-RFC-001-foundations.kmd, *Accepted* 20260509
Bootstrap ticket: services/ai/backlog/done/123-dataset-bootstrap.md
Implementation tickets: services/ai/dataset/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

Gate	Status	Notes
G1 Feature parity	pending	Skeleton phase; covers HF Datasets + DVC essentials
G2 Performance	pending	Streaming required for multi-GB datasets; bound by storage IO
G3 Stability	pending	Pre-MVP
G4 Capability	pending	Generic ETL out of scope; everything else covered
G5 Critical-path readiness	pending	Pre-MVP; prerequisite for `training`