Checkpoint Blob Retention

draft

The gateway's checkpoint subsystem (AICORE-134.2 family) persists filesystem snapshots between steps so /v1/agent/runs/{id}/rewind can restore prior state. Snapshots compress quickly (tar+zstd, ~10:1 on source trees) but stack up across runs: a one-week-old burst of 50 runs at 10 checkpoints each leaves 500 blobs the gateway never re-reads.

This policy bounds the cost without compromising the active-run rewind story.

R1 — Per-run cap

The gateway keeps the N most recent checkpoints per run; older checkpoints and their backing blobs are deleted on capture. N = 10 by default; the gateway flag --checkpoint-retention-per-run overrides it.

Why N=10: an operator who rewinds typically goes back 1-3 steps because the agent's last action was visibly wrong. Deeper rewinds exist (debugging a multi-step failure mode), but they are rare enough that 10 covers more than 99% of the usage observed while dogfooding the AICORE-134.2 foundation slice.
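The capture-time prune in R1 can be sketched as follows. This is a minimal illustration that assumes a run's checkpoints are available as a slice ordered oldest first; the `Checkpoint` type and `PruneRun` function are hypothetical names, not the gateway's actual API.

```go
package main

// Checkpoint is a hypothetical stand-in for one captured snapshot.
type Checkpoint struct {
	ID   string
	Step int // capture order within the run; higher is newer
}

// PruneRun splits a run's checkpoints into the ones to keep and the ones to
// delete so that at most n of the newest remain.
// Assumption: cps is ordered oldest first.
func PruneRun(cps []Checkpoint, n int) (keep, drop []Checkpoint) {
	if n < 0 {
		n = 0
	}
	if len(cps) <= n {
		return cps, nil
	}
	cut := len(cps) - n
	return cps[cut:], cps[:cut]
}
```

With n = 10, capturing an 11th checkpoint yields exactly one entry in `drop` (the oldest), which matches T1 in the test contract.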

R2 — Per-tenant cap

When the total size of all blobs owned by a koder_user_id exceeds the cap, the oldest checkpoints across all of that tenant's runs are deleted until the tenant fits. Default cap: 500 MiB. Override per tenant via the admin-only column tenants.checkpoint_quota_mib.

Why 500 MiB: Postgres-stored runtime state for a typical agent session is < 10 KiB; the blobs dominate the footprint. 500 MiB buys ~50 sizeable checkpoints (roughly 10 MiB each) per tenant before quota pressure. Tenants on the Pro/Team plans get higher caps configured at signup.
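One way to picture the R2 eviction order is the sketch below. It assumes each blob records its capture time and size; the `Blob` type, its fields, and `EvictForQuota` are hypothetical names chosen for illustration, and the real enforcer may differ.

```go
package main

import "sort"

// Blob is a hypothetical record for one checkpoint blob owned by a tenant.
type Blob struct {
	RunID     string
	ID        string
	CreatedAt int64 // capture time, Unix seconds
	SizeBytes int64
}

// MiB is the number of bytes in one mebibyte.
const MiB = int64(1) << 20

// EvictForQuota returns the blobs to delete, oldest first across all runs,
// until the tenant's remaining total fits within quotaMiB.
func EvictForQuota(blobs []Blob, quotaMiB int64) []Blob {
	var total int64
	for _, b := range blobs {
		total += b.SizeBytes
	}
	limit := quotaMiB * MiB

	// Sort a copy so the caller's slice order is left untouched.
	sorted := append([]Blob(nil), blobs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].CreatedAt < sorted[j].CreatedAt
	})

	var evict []Blob
	for _, b := range sorted {
		if total <= limit {
			break
		}
		evict = append(evict, b)
		total -= b.SizeBytes
	}
	return evict
}
```

Sorting by capture time rather than by run is what makes the cap apply across runs: the newest checkpoint of any run is only evicted after everything older is gone.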

R3 — Grace period on run completion

When a run reaches done (success or failure), its checkpoints survive for 7 days before the retention GC removes them. Rationale: an operator might rewind to investigate "what did the agent see at step N" even after the run finished. After 7 days the cost outweighs the option value.

Override: the per-run flag keep_checkpoints_for (in the rewind endpoint request body) extends the grace period for a specific run when an incident investigation needs more time. Max extension: 90 days.
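The grace-period rule and its clamp can be sketched like this. The sketch assumes keep_checkpoints_for is interpreted as the total grace period for the run, and that values at or below the 7-day default are ignored because the flag only extends the period; both readings are assumptions. `Run`, `EffectiveGrace`, and `GraceExpired` are hypothetical names.

```go
package main

import "time"

const (
	DefaultGrace = 7 * 24 * time.Hour  // R3 default
	MaxGrace     = 90 * 24 * time.Hour // R3 maximum
)

// Run is a hypothetical view of a finished run.
type Run struct {
	DoneAt  time.Time     // when the run reached done
	KeepFor time.Duration // mirrors keep_checkpoints_for; zero means unset
}

// EffectiveGrace applies the default and clamps overrides to MaxGrace.
// Assumption: an override can only lengthen the grace period.
func EffectiveGrace(keepFor time.Duration) time.Duration {
	switch {
	case keepFor <= DefaultGrace:
		return DefaultGrace
	case keepFor > MaxGrace:
		return MaxGrace
	default:
		return keepFor
	}
}

// GraceExpired reports whether the run's checkpoints are eligible for GC at now.
func GraceExpired(r Run, now time.Time) bool {
	return !now.Before(r.DoneAt.Add(EffectiveGrace(r.KeepFor)))
}
```

Clamping rather than rejecting an oversized override follows T3, which expects values above 90 days to clamp.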

R4 — Multi-tenant path isolation

The retention worker MUST iterate checkpoints scoped by koder_user_id and MUST NOT delete blobs whose path prefix doesn't match the tenant being collected. Per multi-tenant-by-default.kmd, cross-tenant deletes are a privilege escalation surface, so the worker must refuse loudly.
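A guard for the R4 prefix rule might look like the sketch below. The blob layout is not specified in this policy, so the sketch assumes paths of the form "<koder_user_id>/..."; `ErrCrossTenant` and `CheckOwnership` are hypothetical names.

```go
package main

import (
	"errors"
	"path"
	"strings"
)

// ErrCrossTenant is a hypothetical sentinel for a refused cross-tenant delete.
var ErrCrossTenant = errors.New("checkpoint retention: blob path outside tenant prefix")

// CheckOwnership returns ErrCrossTenant unless blobPath sits under the
// tenant's prefix. Assumption: blobs are stored as "<tenant>/<rest>".
func CheckOwnership(tenant, blobPath string) error {
	if tenant == "" {
		return ErrCrossTenant
	}
	// Clean first so "u1/../u2/x" cannot pass as a "u1/" path.
	clean := path.Clean(blobPath)
	// The trailing slash keeps tenant "u1" from matching "u10/...".
	if !strings.HasPrefix(clean, tenant+"/") {
		return ErrCrossTenant
	}
	return nil
}
```

A delete routine would call this before touching the blob store and surface the error to the caller instead of skipping the blob.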

R5 — Right-to-erasure cascade

When a koder_user_id is hard-deleted (LGPD DELETE /v1/me landing in id/engine), every checkpoint blob belonging to that tenant is purged within the same 24-hour grace window as the rest of their data. The gateway subscribes to the identity.user.erased Redis Stream event (canonical event type per cross-service-events.kmd § R2; koder:events:id stream key per § R1) and enqueues the per-tenant sweep via agent.EnforceErasure with audit reason erasure.
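The per-tenant sweep that the erasure event triggers can be sketched in memory as follows. This leaves out the Redis Stream consumer and the blob store entirely; `Checkpoint` and `SweepErased` are hypothetical names and do not reflect the signature of agent.EnforceErasure.

```go
package main

// EventUserErased is the event type named in R5.
const EventUserErased = "identity.user.erased"

// Checkpoint is a hypothetical record linking a blob to its owner.
type Checkpoint struct {
	ID     string
	Tenant string // koder_user_id of the owner
}

// SweepErased splits checkpoints into those owned by the erased tenant
// (to purge) and everything else. Any other event type purges nothing.
func SweepErased(eventType, tenant string, all []Checkpoint) (purge, keep []Checkpoint) {
	if eventType != EventUserErased || tenant == "" {
		return nil, all
	}
	for _, c := range all {
		if c.Tenant == tenant {
			purge = append(purge, c)
		} else {
			keep = append(keep, c)
		}
	}
	return purge, keep
}
```

Each entry in `purge` would then be deleted with the audit reason erasure, so the deletions show up in the R6 log like any other retention delete.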

R6 — Audit log

Every retention deletion emits a structured log line:

event=checkpoint.deleted run_id=… checkpoint_id=… koder_user_id=…
size_bytes=… reason=per_run_cap|per_tenant_cap|grace_expired|erasure

Operators investigating "where did my rewind point go" can grep the log; investigators can correlate by reason code.
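A formatter for the R6 line could be sketched as below. It renders the fields as a single key=value line in the order shown above and rejects unknown reason codes. Only ReasonErasure is named in this document; the other three constant names, the `Reason` type, and `AuditLine` are assumptions made for illustration.

```go
package main

import "fmt"

// Reason is a hypothetical type for the four deletion reason codes in R6.
type Reason string

const (
	ReasonPerRunCap    Reason = "per_run_cap"    // assumed constant name
	ReasonPerTenantCap Reason = "per_tenant_cap" // assumed constant name
	ReasonGraceExpired Reason = "grace_expired"  // assumed constant name
	ReasonErasure      Reason = "erasure"
)

// AuditLine renders one deletion as a single key=value log line.
// It returns an error for any reason outside the four R6 codes.
func AuditLine(runID, checkpointID, userID string, sizeBytes int64, reason Reason) (string, error) {
	switch reason {
	case ReasonPerRunCap, ReasonPerTenantCap, ReasonGraceExpired, ReasonErasure:
	default:
		return "", fmt.Errorf("unknown deletion reason %q", reason)
	}
	return fmt.Sprintf(
		"event=checkpoint.deleted run_id=%s checkpoint_id=%s koder_user_id=%s size_bytes=%d reason=%s",
		runID, checkpointID, userID, sizeBytes, reason,
	), nil
}
```

Routing every deletion path through one formatter keeps the line shape identical across reasons, which is what makes the log greppable by reason code.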

Test contract

T1. Capturing the 11th checkpoint in a run deletes the oldest.
T2. The 500 MiB cap deletes oldest-across-runs first, never newest.
T3. Done-run grace defaults to 7 days; flag override respected up to 90 days; > 90 days clamps.
T4. Erasure event removes every blob for the tenant within 24 h.
T5. Cross-tenant delete attempt refuses with a structured error.
T6. Audit log shape matches R6 (per-line regex test on the JSONL).

Out of scope

  • Where the blobs live (filesystem vs S3 vs kdrive) — that's the BlobStore impl choice, not retention policy.
  • Compression policy (zstd level) — capture-side concern.
  • Cold storage tiering — future ticket if observed cost demands it.

Implementation status (2026-05-23)

  • BlobStore + RewindRestorer interfaces + foundation slice landed with AICORE-134.2c on 2026-05-18.

  • TarZstdCheckpointCapture (134.2a) + TarZstdRewindRestorer (134.2c carryover) landed 2026-05-19.

  • Retention enforcers R1, R2, R3 + RetentionTicker periodic worker landed 2026-05-19 (AICORE-134.2c batches 15-18).

  • R4 multi-tenant isolation enforced via DeleteCheckpoint cross-tenant masking (returns nil silently per multi-tenant-by-default.kmd § 5).

  • R5 erasure cascade subscriber landed 2026-05-23 (AICORE-138): gateway/internal/erasure/Subscriber consumes identity.user.erased from koder:events:id and drives agent.EnforceErasure(ctx, store, blobs, tenant, logger), which walks gatherTenantCheckpoints, then deleteOne, with ReasonErasure. Wired in main.go when REDIS_URL is set; an empty REDIS_URL disables R5 gracefully without affecting R1-R3.

  • R6 audit shape ratified: all four ReasonXxx codes (per_run_cap, per_tenant_cap, grace_expired, erasure) surface via the shared deleteOne helper.

  • OverlayfsCheckpointCapture (#134.2b) + OverlayfsRewindRestorer: still pending, blocked on the AICORE-106 sandbox-model RFC.

Source: ../home/koder/dev/koder/meta/docs/stack/policies/checkpoint-retention.kmd