kdedup — dev/kdedup

Fast file deduplicator. It combines a three-pass algorithm (size group → 4 KB head xxhash → full xxhash), concurrent workers, and structured JSON output. It revives the lost rmdup prototype from 2026-03-13, now properly homed in the monorepo.
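The three passes can be sketched in Go as follows. This is a minimal illustration, not the kdedup source: it takes an explicit list of paths instead of walking a directory, it swaps in FNV-1a from the standard library where kdedup uses xxhash (so the sketch runs without third-party modules), and it treats a matching 64-bit full hash as "duplicate". The names sizeKey, hashPrefix, regroup, findDuplicates, and demo are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"io"
	"os"
	"path/filepath"
)

const headSize = 4096     // pass 2 reads only the first 4 KB of each file
const wholeFile = 1 << 62 // pass 3 limit: large enough to mean "read everything"

// sizeKey is the pass-1 key: the file size, which costs one stat and no reads.
func sizeKey(path string) (uint64, error) {
	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return uint64(info.Size()), nil
}

// hashPrefix hashes at most limit bytes from the start of the file.
// FNV-1a stands in for xxhash so the sketch needs no third-party module.
func hashPrefix(path string, limit int64) (uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	h := fnv.New64a()
	_, err = io.Copy(h, io.LimitReader(f, limit))
	return h.Sum64(), err
}

// regroup splits every candidate group by key and drops groups of one.
func regroup(groups [][]string, key func(string) (uint64, error)) ([][]string, error) {
	var out [][]string
	for _, g := range groups {
		buckets := map[uint64][]string{}
		for _, p := range g {
			k, err := key(p)
			if err != nil {
				return nil, err
			}
			buckets[k] = append(buckets[k], p)
		}
		for _, b := range buckets {
			if len(b) > 1 {
				out = append(out, b)
			}
		}
	}
	return out, nil
}

// findDuplicates narrows paths by size, then head hash, then full hash.
func findDuplicates(paths []string) ([][]string, error) {
	keys := []func(string) (uint64, error){
		sizeKey,
		func(p string) (uint64, error) { return hashPrefix(p, headSize) },
		func(p string) (uint64, error) { return hashPrefix(p, wholeFile) },
	}
	groups := [][]string{paths}
	for _, key := range keys {
		var err error
		if groups, err = regroup(groups, key); err != nil {
			return nil, err
		}
	}
	return groups, nil
}

// demo writes five small files to a temp dir and returns the duplicate groups.
func demo() ([][]string, error) {
	dir, err := os.MkdirTemp("", "kdedup-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	tail := make([]byte, 5000)
	tail[4999] = 1 // e.bin and f.bin share a 4 KB head but differ in the last byte
	names := []string{"a.txt", "b.txt", "c.txt", "e.bin", "f.bin"}
	data := [][]byte{[]byte("hello world"), []byte("hello world"), []byte("hello there"), make([]byte, 5000), tail}
	var paths []string
	for i, n := range names {
		paths = append(paths, filepath.Join(dir, n))
		if err := os.WriteFile(paths[i], data[i], 0o644); err != nil {
			return nil, err
		}
	}
	return findDuplicates(paths)
}

func main() {
	fmt.Println(demo())
}
```

In the demo, c.txt has the same size as a.txt and b.txt and is eliminated by the head hash, while e.bin and f.bin survive the head hash and are only separated by the full hash. Each pass costs more per file than the previous one, so ordering them this way keeps the expensive full read for the few files that still look identical.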

Role in the stack

| Area | Sector | Consumers |
|---|---|---|
| Foundation | Linux Tools | Every Koder developer workstation; CI runners; bulk-data pipelines |

Sibling utility to kicon (build-time icons) and kosh (Linux Shell): single-purpose Go CLI binary, shipped via Koder Hub, conformant to specs/binaries-and-cli/naming.kmd (k<slug> form).

Primary couplings

| Module | Nature |
|---|---|
| dev/koder-tools | Same structured-output contract (applied[] / deferred[] / errors[] per RFC-001 §4) |
| specs/binaries-and-cli/naming.kmd | Binary kdedup follows the k<slug> convention |
| github.com/cespare/xxhash/v2 | Hash function: fast, non-cryptographic, ~10× faster than MD5/SHA-256 |
| dev/kpkg (planned) | One-line install: kpkg install kdedup |

Public surface

kdedup [flags] [<dir>]      Scan <dir> (default: current dir)
kdedup version

Flags:
  --apply                actually remove duplicates (off by default)
  --keep <strategy>      first|newest|oldest (default: first)
  --min-size <bytes>     skip files smaller than this (default: 1)
  --workers <n>          concurrent workers (default: NumCPU)
  --format <fmt>         json|text (default: json)
  --purge                with --apply: unlink instead of XDG trash
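
One way the --keep strategies could pick the survivor of a duplicate group is sketched below. The candidate type, the pickKeeper function, and the reading of "first" as "first in scan order" are assumptions made for illustration; the actual kdedup behavior may differ.

```go
package main

import "fmt"

// candidate is a hypothetical record for one member of a duplicate group.
type candidate struct {
	Path    string
	ModUnix int64 // modification time as Unix seconds
}

// pickKeeper returns the index of the one file to keep for the given
// --keep value; every other member of the group is a removal candidate.
func pickKeeper(group []candidate, strategy string) (int, error) {
	if len(group) == 0 {
		return -1, fmt.Errorf("empty duplicate group")
	}
	switch strategy {
	case "first":
		return 0, nil // assumed to mean first in scan order
	case "newest", "oldest":
		keep := 0
		for i, c := range group {
			best := group[keep].ModUnix
			if (strategy == "newest" && c.ModUnix > best) ||
				(strategy == "oldest" && c.ModUnix < best) {
				keep = i // strict comparison: ties keep the earlier entry
			}
		}
		return keep, nil
	}
	return -1, fmt.Errorf("unknown keep strategy %q", strategy)
}

func main() {
	group := []candidate{{"a.txt", 20}, {"copy/a.txt", 30}, {"old/a.txt", 10}}
	for _, s := range []string{"first", "newest", "oldest"} {
		i, _ := pickKeeper(group, s)
		fmt.Printf("%-6s keeps %s\n", s, group[i].Path)
	}
}
```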

Output: structured JSON envelope per dev/koder-tools/docs/rfcs/RFC-001-koder-tools-architecture.md §4.
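
A rough Go sketch of such an envelope follows. Only the three array names (applied, deferred, errors) come from this document; the item fields path and reason, the type names, and the choice to report a dry-run duplicate under deferred are assumptions, and the authoritative schema is the one in RFC-001 §4.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// action is a hypothetical item shape; the real schema lives in RFC-001 §4.
type action struct {
	Path   string `json:"path"`
	Reason string `json:"reason,omitempty"`
}

// envelope carries the three arrays named by the koder-tools contract.
type envelope struct {
	Applied  []action `json:"applied"`
	Deferred []action `json:"deferred"`
	Errors   []action `json:"errors"`
}

// newEnvelope starts with empty, non-nil slices so they encode as [] and not null.
func newEnvelope() envelope {
	return envelope{Applied: []action{}, Deferred: []action{}, Errors: []action{}}
}

// render serializes the envelope as a single JSON object.
func render(e envelope) (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

func main() {
	e := newEnvelope()
	// Assumption: without --apply, a duplicate is reported as deferred.
	e.Deferred = append(e.Deferred, action{Path: "b.txt", Reason: "duplicate of a.txt"})
	out, err := render(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Initializing the slices explicitly matters for consumers: a nil Go slice marshals to null, which a strict JSON reader expecting an array could reject.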

Performance

Reference benchmark (2026-04-23, this commit, on ~/dev/koder/dev/eye, 2 927 files, 674 MB):

| Tool | Wall time | Speedup vs fdupes |
|---|---|---|
| fdupes | 2.273 s | 1.0× |
| kdedup | 0.121 s | 19× |

Beats the 10× target in RFC-001 §6 by about 2× on a small-to-medium tree; larger trees with more duplicate candidates should see proportionally better wins (the head-hash filter prunes more aggressively at scale).

Status

0.1.0 (2026-04-23): released.

| Component | Status |
|---|---|
| 3-pass algorithm (size → head-hash → full-xxhash) | ✅ |
| Concurrent workers (default = NumCPU) | ✅ |
| --apply with --keep first / newest / oldest | ✅ |
| Structured JSON output (koder-tools contract) | ✅ |
| 13 unit tests passing (7 hasher + 6 scanner) | ✅ |
| Single ~3 MB Go binary | ✅ |
| --purge flag (parsed; XDG trash impl pending) | ⏳ v0.2 |
| Hardlink mode (rmlint-style) | ⏳ v0.3 (RFC-001 §9) |
| Bench script vs fdupes/jdupes/rmlint on synthetic 50K-file tree | ⏳ v0.2 |
| kpkg install kdedup distribution | ⏳ rides on dev/koder-tools ticket KDEDUP-006 pattern |
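
The "Concurrent workers (default = NumCPU)" row above could be realized with a fixed pool of goroutines fed from a channel. The sketch below is one possible shape under stated assumptions: it hashes in-memory byte slices instead of files, uses stdlib FNV-1a in place of xxhash, and the function name hashAll is hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
	"sync"
)

// hashAll hashes every blob using a fixed pool of workers and returns the
// sums in input order. workers <= 0 falls back to runtime.NumCPU().
func hashAll(blobs [][]byte, workers int) []uint64 {
	if workers <= 0 {
		workers = runtime.NumCPU()
	}
	out := make([]uint64, len(blobs))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				h := fnv.New64a()
				h.Write(blobs[i])
				out[i] = h.Sum64() // each index is written by exactly one worker
			}
		}()
	}
	for i := range blobs {
		jobs <- i
	}
	close(jobs) // lets the workers' range loops end
	wg.Wait()
	return out
}

func main() {
	blobs := [][]byte{[]byte("alpha"), []byte("beta"), []byte("alpha")}
	sums := hashAll(blobs, 0)
	fmt.Println(sums[0] == sums[2], sums[0] == sums[1])
}
```

Because every worker writes to a distinct index of the result slice, no mutex is needed; the WaitGroup alone guarantees all writes are visible before the function returns.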

Dependencies

  • Go 1.22+
  • github.com/cespare/xxhash/v2 (only runtime dep)
  • No system deps; portable across Linux / macOS / Windows

Design references

  • RFC-001 — Fast file dedup algorithm
  • dev/koder-tools/docs/rfcs/RFC-001-koder-tools-architecture.md (output contract sibling RFC)
  • specs/binaries-and-cli/naming.kmd (binary naming convention)

Source: ../home/koder/dev/koder/meta/docs/stack/modules/dev-kdedup.md