
AI Dataset — Versioning + Splits + Schema + Dedup

  • Area: Intelligence
  • Path: services/ai/dataset
  • Kind: Dataset foundation (versioning + splits + schema validation + dedup; analog of HF Datasets / W&B Tables / DVC)
  • Status: v0.0.1, sector bootstrapping (2026-05-09)

Role in the stack

dataset consolidates ML data plumbing. ML reproducibility requires versioned datasets: without them, "trained on dataset X" is meaningless. products/horizontal/libras is currently designing its ingest pipeline in isolation; if a second project starts the same way, each project will reinvent splits, schema validation, dedup and lineage from scratch.

It is the Koder analog of Hugging Face Datasets, Weights & Biases Tables, DVC, lakeFS and Pachyderm, built thin on top of kdb-blob (content-addressable) + kdb-meta (manifests + lineage), with a kdataset CLI for the human path.
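
To make the "content-addressable blobs plus manifests" idea concrete, here is a minimal Python sketch. It is not the kdb-blob or kdb-meta API: the in-memory dict and the names put_blob and make_manifest are hypothetical stand-ins chosen for illustration. It assumes blobs are keyed by their SHA-256 digest and that a manifest is a mapping from logical path to digest.

```python
import hashlib
import json


def put_blob(store: dict, data: bytes) -> str:
    """Store bytes under their SHA-256 digest; identical content is stored once."""
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)
    return digest


def make_manifest(store: dict, files: dict) -> dict:
    """Build a manifest mapping logical paths to blob digests.

    The manifest id is the hash of the canonical JSON form of its entries,
    so the same set of files always yields the same version id.
    """
    entries = {path: put_blob(store, data) for path, data in sorted(files.items())}
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode()
    return {"id": hashlib.sha256(canonical).hexdigest(), "entries": entries}


blob_store = {}  # hypothetical in-memory stand-in for a blob store
v1 = make_manifest(blob_store, {"train.jsonl": b"a\nb\n", "test.jsonl": b"c\n"})
v2 = make_manifest(blob_store, {"train.jsonl": b"a\nb\nd\n", "test.jsonl": b"c\n"})
```

In this sketch v1 and v2 get different ids because train.jsonl changed, yet both point at the same blob for test.jsonl, so the store holds three blobs rather than four.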

Boundary vs neighbors

  • services/ai/training is the primary downstream consumer (mandatory input).
  • services/ai/embed is used internally for semantic dedup.
  • services/ai/extract is a curated source upstream.
  • Generic ETL is out of scope (separate problem space).
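
As an illustration of what semantic dedup on top of embeddings could look like, the Python sketch below keeps a row only if its vector is not too close to any row already kept. It assumes the embedding vectors have already been produced elsewhere (for example by services/ai/embed, whose interface is not shown here). The function names and the 0.95 threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(vectors, threshold=0.95):
    """Return indices of rows to keep.

    A row is dropped when its cosine similarity to an already kept row
    reaches the threshold. Greedy and O(n^2); fine for a sketch only.
    """
    kept = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: the second is nearly parallel to the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
keep = semantic_dedup(vectors)
```

Here keep is [0, 2]: the second vector is dropped as a near-duplicate of the first. A real implementation over large datasets would likely need an approximate nearest-neighbour index instead of the pairwise loop.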

Features (v1 target)

  • Content-addressable manifests (versions share blobs)
  • Schema validation (strict / permissive / repair modes)
  • Deterministic splits (hashed partitioning; stratified + group-aware)
  • Dedup (exact-hash + semantic-cosine via embed)
  • Lineage tracking (every transform records inputs + params)
  • Cross-tenant explicit grant
  • Parquet + JSONL + Arrow IO
  • kdataset CLI as first-class human path
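
The "deterministic splits" item can be sketched as hashed partitioning on a group key. This is a minimal Python illustration, not the module's implementation: the function name assign_split, the seed string and the 80/10/10 ratios are assumptions. Hashing the group key rather than the row id makes the split group-aware, since every row sharing a key lands in the same partition. Stratification is not covered here.

```python
import hashlib

DEFAULT_RATIOS = (("train", 0.8), ("val", 0.1), ("test", 0.1))


def assign_split(group_key: str, seed: str = "v1", ratios=DEFAULT_RATIOS) -> str:
    """Map a group key to a split name, deterministically.

    The key is hashed together with a seed, the first 8 bytes are scaled
    to a number in [0, 1), and that number is placed on the cumulative
    ratio line. No random state is involved, so reruns agree.
    """
    digest = hashlib.sha256(f"{seed}:{group_key}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in ratios:
        cumulative += share
        if bucket < cumulative:
            return name
    return ratios[-1][0]  # guards against floating-point rounding in the shares


# Hypothetical rows: group_id could be a speaker, user or document id.
rows = [{"id": i, "group_id": f"g{i % 5}"} for i in range(20)]
for row in rows:
    row["split"] = assign_split(row["group_id"])
```

Because the assignment depends only on the seed and the key, adding new rows never moves existing groups between splits. Changing the seed yields a different but equally reproducible partition.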

Primary couplings

| Consumer | Relationship |
| --- | --- |
| services/ai/training | Primary consumer (mandatory) |
| products/horizontal/libras | First domain dataset |
| services/ai/embed | Semantic dedup engine |
| services/ai/extract | Curated entity source |
| services/ai/recsys | User+item interaction datasets |
| services/ai/agents | Eval datasets for agent benchmarks |
| infra/data/kdb-blob | File storage (content-addressable) |
| infra/data/kdb-meta | Manifests + lineage + tags |

RFC and bootstrap

  • RFC: dataset-RFC-001-foundations.kmd (Accepted, 2026-05-09)
  • Bootstrap ticket: services/ai/backlog/done/123-dataset-bootstrap.md
  • Implementation tickets: services/ai/dataset/backlog/pending/{001..005}

Self-hosted-first analysis (5 gates)

| Gate | Status | Notes |
| --- | --- | --- |
| G1 Feature parity | pending | Skeleton phase; covers HF Datasets + DVC essentials |
| G2 Performance | pending | Streaming required for multi-GB datasets; bound by storage IO |
| G3 Stability | pending | Pre-MVP |
| G4 Capability | pending | Generic ETL out of scope; everything else covered |
| G5 Critical-path readiness | pending | Pre-MVP; prerequisite for training |
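
The streaming requirement noted under G2 can be illustrated with a generator that performs exact-hash dedup over JSONL one line at a time. This Python sketch is only an assumption about how such a pass might look. The name iter_unique_rows is hypothetical, and hashing a canonical JSON form (sorted keys) is one possible definition of "exact" duplicate.

```python
import hashlib
import io
import json


def iter_unique_rows(lines):
    """Yield parsed JSONL rows one at a time, skipping exact duplicates.

    Only a 32-byte digest per distinct row is kept in memory, never the
    rows themselves, so the input can be larger than RAM.
    """
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        # Canonical form: key order in the source does not affect the hash.
        canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode()).digest()
        if digest not in seen:
            seen.add(digest)
            yield row


# Any iterable of lines works; an open file handle streams from disk.
source = io.StringIO('{"a": 1, "b": 2}\n{"b": 2, "a": 1}\n\n{"a": 3}\n')
unique = list(iter_unique_rows(source))
```

In this example the second line is dropped because it differs from the first only in key order, leaving two rows.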

Source: ../home/koder/dev/koder/meta/docs/stack/modules/ai-dataset.md