Checkpoint Blob Retention
The gateway's checkpoint subsystem (AICORE-134.2 family) persists filesystem snapshots between steps so /v1/agent/runs/{id}/rewind can restore prior state. Snapshots compress well (tar+zstd, ~10:1 on source trees) but stack up across runs: a one-week-old burst of 50 runs at 10 checkpoints each leaves 500 blobs the gateway never re-reads.
This policy bounds the cost without compromising the active-run rewind story.
R1 — Per-run cap
The gateway keeps the **N most recent** checkpoints per run; older checkpoints and their backing blobs are deleted on capture. N = 10 by default; the gateway flag --checkpoint-retention-per-run overrides it.
**Why N=10:** an operator who rewinds typically goes back 1-3 steps because the agent's last action was visibly wrong. Deeper rewinds exist (debugging a multi-step failure mode), but they are rare enough that 10 covers more than 99% of the usage observed during the AICORE-134.2 foundation slice's dogfooding.
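The per-run cap can be sketched as a pure function that splits a run's checkpoints into "keep" and "delete" sets. This is a minimal sketch, not the gateway's code: the Checkpoint type, its Seq field, and splitByPerRunCap are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Checkpoint is a hypothetical record; the gateway's real type is not shown in this document.
type Checkpoint struct {
	ID  string
	Seq int // capture order within the run; higher means newer (assumption)
}

// splitByPerRunCap returns the n newest checkpoints to keep and the older ones to delete.
func splitByPerRunCap(cps []Checkpoint, n int) (keep, drop []Checkpoint) {
	if n < 0 {
		n = 0
	}
	sorted := append([]Checkpoint(nil), cps...) // copy so the caller's slice is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Seq > sorted[j].Seq })
	if len(sorted) <= n {
		return sorted, nil
	}
	return sorted[:n], sorted[n:]
}

func main() {
	var cps []Checkpoint
	for i := 1; i <= 11; i++ {
		cps = append(cps, Checkpoint{ID: fmt.Sprintf("ckpt-%02d", i), Seq: i})
	}
	keep, drop := splitByPerRunCap(cps, 10) // default N = 10
	fmt.Println("kept:", len(keep))
	for _, c := range drop {
		fmt.Println("delete:", c.ID)
	}
}
```

Running the split at capture time, as R1 describes, means at most one checkpoint is dropped per capture once the run has reached N.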
R2 — Per-tenant cap
When the total size of all blobs owned by a koder_user_id exceeds the cap, the oldest checkpoints, across runs, are deleted until the tenant fits. Default cap: **500 MiB**. Override per-tenant via the admin-only column tenants.checkpoint_quota_mib.
**Why 500 MiB:** Postgres-stored runtime state for a typical agent session is < 10 KiB; the blobs dominate the footprint. 500 MiB buys ~50 sizeable checkpoints (roughly 10 MiB each) per tenant before quota pressure. Tenants on the Pro/Team plans get higher caps configured at signup.
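The oldest-first eviction described in R2 can be sketched as follows. The Blob type, the CapturedAtUnix field, and evictForTenantCap are hypothetical names for illustration; the sketch assumes every blob passed in already belongs to the same tenant.

```go
package main

import (
	"fmt"
	"sort"
)

const mib = 1 << 20

// Blob is a hypothetical view of one checkpoint blob owned by a single tenant.
type Blob struct {
	RunID          string
	CapturedAtUnix int64 // capture time in Unix seconds (assumption)
	SizeBytes      int64
}

// evictForTenantCap returns the blobs to delete, oldest first and regardless of run,
// until the remaining total fits within capBytes.
func evictForTenantCap(blobs []Blob, capBytes int64) []Blob {
	sorted := append([]Blob(nil), blobs...) // copy so the caller's slice is not reordered
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CapturedAtUnix < sorted[j].CapturedAtUnix
	})
	var total int64
	for _, b := range sorted {
		total += b.SizeBytes
	}
	var evicted []Blob
	for _, b := range sorted {
		if total <= capBytes {
			break
		}
		evicted = append(evicted, b)
		total -= b.SizeBytes
	}
	return evicted
}

func main() {
	blobs := []Blob{
		{RunID: "run-a", CapturedAtUnix: 100, SizeBytes: 200 * mib},
		{RunID: "run-b", CapturedAtUnix: 200, SizeBytes: 200 * mib},
		{RunID: "run-a", CapturedAtUnix: 300, SizeBytes: 200 * mib},
	}
	// 600 MiB stored against the 500 MiB default cap.
	for _, b := range evictForTenantCap(blobs, 500*mib) {
		fmt.Printf("evict run=%s captured_at=%d size=%d\n", b.RunID, b.CapturedAtUnix, b.SizeBytes)
	}
}
```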
R3 — Grace period on run completion
When a run reaches done (success or failure), its checkpoints survive for **7 days** before the retention GC removes them. Rationale: an operator might rewind to investigate "what did the agent see at step N" even after the run finished. After 7 days the cost outweighs the option value.
**Override:** the per-run flag keep_checkpoints_for (in the rewind endpoint request body) extends the grace period for a specific run when an incident investigation needs more time. Max extension: 90 days.
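The default-and-clamp behaviour of the grace period can be sketched like this. The document does not state the unit or wire format of keep_checkpoints_for, so the sketch assumes a whole number of days with 0 meaning "no override"; effectiveGraceDays and expiresAt are hypothetical helpers. Because the flag is described as extending the grace period, requests below the default are treated as the default (also an assumption).

```go
package main

import (
	"fmt"
	"time"
)

const (
	defaultGraceDays = 7  // R3 default
	maxGraceDays     = 90 // R3 maximum extension
)

// effectiveGraceDays applies the default and the 90-day clamp to a requested override.
func effectiveGraceDays(requested int) int {
	switch {
	case requested <= defaultGraceDays:
		return defaultGraceDays
	case requested > maxGraceDays:
		return maxGraceDays
	default:
		return requested
	}
}

// expiresAt returns the instant after which the retention GC may remove a done run's checkpoints.
func expiresAt(doneAt time.Time, requestedDays int) time.Time {
	return doneAt.Add(time.Duration(effectiveGraceDays(requestedDays)) * 24 * time.Hour)
}

func main() {
	doneAt := time.Date(2026, 5, 19, 12, 0, 0, 0, time.UTC)
	for _, d := range []int{0, 30, 120} {
		fmt.Printf("requested=%d days -> expires %s\n", d, expiresAt(doneAt, d).Format(time.RFC3339))
	}
}
```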
R4 — Multi-tenant path isolation
The retention worker MUST iterate checkpoints scoped by koder_user_id and MUST NOT delete blobs whose path prefix doesn't match the tenant being collected. Per multi-tenant-by-default.kmd, cross-tenant deletes are a privilege escalation surface — refuse loud.
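One way to express the prefix rule is a guard that runs before any blob delete. This is a sketch under assumptions: the tenants/<koder_user_id>/ path layout, ErrCrossTenant, and guardDelete are hypothetical, and the sketch follows the "refuse" wording of R4 rather than any particular landed behaviour.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// ErrCrossTenant is a hypothetical sentinel for refused cross-tenant deletes.
var ErrCrossTenant = errors.New("checkpoint blob outside tenant prefix")

// tenantPrefix assumes blobs live under tenants/<koder_user_id>/.
// The trailing slash keeps tenant "u1" from matching "u10".
func tenantPrefix(userID string) string {
	return "tenants/" + userID + "/"
}

// guardDelete refuses any blob path that does not sit under the collected tenant's prefix.
// path.Clean collapses ".." segments so traversal cannot escape the prefix.
func guardDelete(userID, blobPath string) error {
	cleaned := path.Clean(blobPath)
	if userID == "" || !strings.HasPrefix(cleaned, tenantPrefix(userID)) {
		return fmt.Errorf("%w: tenant=%q path=%q", ErrCrossTenant, userID, blobPath)
	}
	return nil
}

func main() {
	paths := []string{
		"tenants/u1/run-7/ckpt-3.tar.zst",
		"tenants/u10/run-2/ckpt-1.tar.zst",
		"tenants/u1/../u2/run-1/ckpt-1.tar.zst",
	}
	for _, p := range paths {
		err := guardDelete("u1", p)
		fmt.Printf("%s refused=%v\n", p, errors.Is(err, ErrCrossTenant))
	}
}
```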
R5 — Right-to-erasure cascade
When a koder_user_id is hard-deleted (LGPD DELETE /v1/me landing in id/engine), every checkpoint blob belonging to that tenant is purged within the same 24-hour grace window as the rest of their data. The gateway subscribes to the identity.user.erased Redis Stream event (canonical event type per cross-service-events.kmd § R2; koder:events:id stream key per § R1) and enqueues the per-tenant sweep via agent.EnforceErasure with audit reason erasure.
R6 — Audit log
Every retention deletion emits a structured log line:
    event=checkpoint.deleted run_id=… checkpoint_id=… koder_user_id=…
    size_bytes=… reason=per_run_cap|per_tenant_cap|grace_expired|erasure
Operators investigating "where did my rewind point go" can grep the log; investigators can correlate by reason code.
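A per-line shape check along the lines of T6 could look like the sketch below. It assumes the key=value rendering shown in R6, emitted as a single line with space-separated fields; the Deletion type, formatAuditLine, and auditLineRE are hypothetical names, and a JSONL encoding would need a different matcher.

```go
package main

import (
	"fmt"
	"regexp"
)

// Deletion is a hypothetical carrier for the fields listed in R6.
type Deletion struct {
	RunID        string
	CheckpointID string
	KoderUserID  string
	SizeBytes    int64
	Reason       string
}

// formatAuditLine renders one deletion as a single key=value log line (assumed layout).
func formatAuditLine(d Deletion) string {
	return fmt.Sprintf(
		"event=checkpoint.deleted run_id=%s checkpoint_id=%s koder_user_id=%s size_bytes=%d reason=%s",
		d.RunID, d.CheckpointID, d.KoderUserID, d.SizeBytes, d.Reason,
	)
}

// auditLineRE accepts only the four reason codes named in R6.
var auditLineRE = regexp.MustCompile(
	`^event=checkpoint\.deleted run_id=\S+ checkpoint_id=\S+ koder_user_id=\S+ ` +
		`size_bytes=\d+ reason=(per_run_cap|per_tenant_cap|grace_expired|erasure)$`,
)

func main() {
	line := formatAuditLine(Deletion{
		RunID:        "run-7",
		CheckpointID: "ckpt-3",
		KoderUserID:  "u1",
		SizeBytes:    10485760,
		Reason:       "per_run_cap",
	})
	fmt.Println(line)
	fmt.Println("shape ok:", auditLineRE.MatchString(line))
}
```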
Test contract
- T1. Capturing the 11th checkpoint in a run deletes the oldest.
- T2. The 500 MiB cap deletes oldest-across-runs first, never newest.
- T3. Done-run grace defaults to 7 days; flag override respected up to 90 days; > 90 days clamps.
- T4. Erasure event removes every blob for the tenant within 24 h.
- T5. Cross-tenant delete attempt refuses with a structured error.
- T6. Audit log shape matches R6 (per-line regex test on the JSONL).
Out of scope
- Where the blobs live (filesystem vs S3 vs kdrive) — that's the
BlobStore impl choice, not retention policy.
- Compression policy (zstd level) — capture-side concern.
- Cold storage tiering — future ticket if observed cost demands it.
Implementation status (2026-05-23)
- BlobStore + RewindRestorer interfaces + foundation slice landed with AICORE-134.2c on 2026-05-18.
- TarZstdCheckpointCapture (134.2a) + TarZstdRewindRestorer (134.2c carry-over) landed 2026-05-19.
- Retention enforcers R1, R2, R3 + RetentionTicker periodic worker landed 2026-05-19 (AICORE-134.2c batches 15-18).
- R4 multi-tenant isolation enforced via DeleteCheckpoint cross-tenant masking (returns nil silently per multi-tenant-by-default.kmd § 5).
- R5 erasure cascade subscriber landed 2026-05-23 (AICORE-138): gateway/internal/erasure/Subscriber consumes identity.user.erased from koder:events:id and drives agent.EnforceErasure(ctx, store, blobs, tenant, logger), which walks gatherTenantCheckpoints → deleteOne with ReasonErasure. Wired in main.go when REDIS_URL is set; an empty REDIS_URL disables R5 gracefully without affecting R1-R3.
- R6 audit shape ratified: all four ReasonXxx codes (per_run_cap, per_tenant_cap, grace_expired, erasure) surface via the shared deleteOne helper.
- OverlayfsCheckpointCapture (#134.2b) + OverlayfsRewindRestorer: still pending, blocked on AICORE-106 sandbox-model RFC.