kdedup — dev/kdedup
Fast file deduplicator. Three-pass algorithm (size group → 4 KB head xxhash → full xxhash), concurrent workers, structured JSON output. Reincarnates the lost rmdup prototype from 2026-03-13, now properly homed in the monorepo.
Role in the stack
| Area | Sector | Consumers |
|---|---|---|
| Foundation | Linux Tools | every Koder developer workstation; CI runners; bulk-data pipelines |
Sibling utility to kicon (build-time icons) and kosh (Linux Shell): single-purpose Go CLI binary, shipped via Koder Hub, conformant to specs/binaries-and-cli/naming.kmd (k<slug> form).
Primary couplings
| Module | Nature |
|---|---|
| `dev/koder-tools` | Same structured `applied[]` / `deferred[]` / `errors[]` output per RFC |
| `specs/binaries-and-cli/naming.kmd` | Binary `kdedup` follows `k<slug>` convention |
| `github.com/cespare/xxhash/v2` | Hash function: fast, non-cryptographic |
| `dev/kpkg` (planned) | One-line install: `kpkg install kdedup` |
Public surface
kdedup [flags] [<dir>] Scan <dir> (default: current dir)
kdedup version
Flags:
--apply actually remove duplicates (off by default)
--keep <strategy> first|newest|oldest (default: first)
--min-size <bytes> skip files smaller than this (default: 1)
--workers <n> concurrent workers (default: NumCPU)
--format <fmt> json|text (default: json)
--purge with --apply: unlink instead of XDG trash

Output: structured JSON envelope per dev/koder-tools/docs/rfcs/RFC-001-koder-tools-architecture.md §4.
Performance
Reference benchmark (2026-04-23, this commit, on ~/dev/koder/dev/eye, 2 927 files, 674 MB):
| Tool | Wall time | Speedup vs fdupes |
|---|---|---|
| `fdupes` | 2.273 s | 1.0× |
| **`kdedup`** | **0.121 s** | **19×** |
Beats the 10× target in RFC-001 §6 by roughly 2× on a small-to-medium tree; larger trees with more duplicate candidates should see proportionally better wins (the head-hash filter prunes more aggressively at scale).
Status
**0.1.0 (2026-04-23)**: released.
| Component | Status |
|---|---|
| 3-pass algorithm | ✅ |
| Concurrent workers (default = NumCPU) | ✅ |
| `--apply` with `--keep first\|newest\|oldest` | ✅ |
| Structured JSON output (koder-tools contract) | ✅ |
| 13 unit tests passing (7 hasher + 6 scanner) | ✅ |
| Single ~3 MB Go binary | ✅ |
| `--purge` flag (parsed; XDG trash impl pending) | ⏳ v0.2 |
| Hardlink mode (rmlint-style) | ⏳ v0.3 (RFC-001 §9) |
| Bench script vs fdupes/jdupes/rmlint on synthetic 50K-file tree | ⏳ v0.2 |
| `kpkg install kdedup` distribution | ⏳ rides on dev/koder-tools ticket KDEDUP-006 pattern |
Dependencies
- Go 1.22+
- `github.com/cespare/xxhash/v2` (only runtime dep)
- No system deps; portable across Linux / macOS / Windows
Design references
- RFC-001 — Fast file dedup algorithm
- `dev/koder-tools/docs/rfcs/RFC-001-koder-tools-architecture.md` (output contract sibling RFC)
- `specs/binaries-and-cli/naming.kmd` (binary naming convention)