Agent-guardrails

mandatory

Three-layer guardrail contract for any Koder-built AI agent harness (kode, kortex, agents, voice). Defines the minimum input/output/tool guards every agent must enforce, independent of the active permission mode. Operationalised by the harness tool router (kode #047 `PreToolUse` hook) and audited via `koder-spec-audit policies agent-guardrails`. Sibling of `reuse-first.kmd` and `hyperscale-first.kmd`: where reuse is about shared code and hyperscale about shared cost, guardrails are about shared safety.

Policy — Agent-guardrails

Every Koder AI agent runs three layers of guards: *Input* (what enters the model), *Output* (what leaves the model), and *Tool* (what the model is allowed to do). The three are independent: a tool guard must fire even if input and output passed; an output guard must fire even if the prompt was clean. Permission modes set defaults for tool guards; they do *not* disable the guards themselves.

Why this exists

An LLM call wrapped in a loop is not an agent. The harness around it provides the durability: persistent state, formal tool routing, input/output validation, reasoning loops, and *guardrails*. Without all five, the model cannot survive contact with production. Of those five, guardrails are the one that has to survive every compromise: if input/output validation is wrong, the user sees garbage; if tool guards are wrong, the user loses data, money, or trust.

Koder is shipping multiple agents (kode, kortex, agents, voice). Each inevitably grows tool surface as it matures. Without a shared contract, divergence is guaranteed: kode allows what kortex blocks; voice trusts what agents validates; one tenant gets isolation that another silently skips. This policy is the contract.

The three layers

Layer 1 — Input

Runs *before* any token of the user's message reaches the model.

| Rule | Requirement | Audit query |
|---|---|---|
| AG-IN-1 | Prompt injection heuristics (instruction-override patterns, system-role spoofing, embedded tool-call syntax) flagged with confidence score. High confidence → block; medium → annotate for output guard. | grep agent code for PromptInjectionDetector or equivalent on the input path |
| AG-IN-2 | PII detector (CPF, RG, email, phone, credit card, OAuth tokens) runs before send; matches are redacted with sentinel `<redacted:TYPE>` or rejected per agent config. Default: redact for user-pasted text, reject for screenshots/files. | grep for PIIRedactor.scan (or equivalent) on every code path that builds the model prompt |
| AG-IN-3 | Jailbreak attempts logged with full context to audit.log regardless of outcome. The model may answer the request; the agent must record that the attempt happened. | check audit.log schema includes event_type=jailbreak_attempt |
| AG-IN-4 | File/image uploads pass through Layer 1 the same as text: extracted text is scanned, OCR output is scanned, base64 payloads are decoded first. | grep upload handlers for scan call |
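
The input rules can be pictured as a single function that redacts first and scores second. The following is a minimal sketch only: the regular expressions, the weights, the `high`/`medium` thresholds, and the names `guard_input` and `InputDecision` are all assumptions made for illustration, not part of the policy or of any Koder SDK.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; a production detector would need
# checksum validation and far broader coverage (RG, phone, cards, tokens).
PII_PATTERNS = {
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

# Hypothetical injection heuristics with hand-picked weights.
INJECTION_HINTS = [
    (re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.6),
    (re.compile(r"^\s*system\s*:", re.I | re.M), 0.3),
    (re.compile(r"<tool_call>", re.I), 0.3),
]


@dataclass
class InputDecision:
    action: str        # "pass", "annotate", or "block"
    text: str          # prompt text after redaction
    confidence: float  # injection confidence in [0, 1]


def guard_input(text: str, high: float = 0.8, medium: float = 0.4) -> InputDecision:
    # AG-IN-2: replace each PII match with the <redacted:TYPE> sentinel.
    for kind, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<redacted:{kind}>", text)
    # AG-IN-1: sum the weights of the matching heuristics, capped at 1.0.
    score = min(1.0, sum(w for rx, w in INJECTION_HINTS if rx.search(text)))
    if score >= high:
        action = "block"
    elif score >= medium:
        action = "annotate"  # passed on so the output guard can take it into account
    else:
        action = "pass"
    return InputDecision(action, text, score)
```

Redaction runs before scoring so that the text the model would see is the same text the injection heuristics inspect.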

Layer 2 — Output

Runs *after* the model emits its response, *before* the user sees it and *before* any tool call dispatches.

| Rule | Requirement | Audit query |
|---|---|---|
| AG-OUT-1 | Secret detector (high-entropy strings, known token prefixes `sk-`, `ghp_`, `xoxb-`, PEM headers, AWS/GCP key patterns) blocks the response; the model is asked to regenerate with the secret elided. After 2 attempts, return error to user. | grep for SecretLeakDetector in the output path |
| AG-OUT-2 | Refusal guidelines violation triggers regeneration with stricter system prompt. After 2 attempts, return refusal message. | confirm policy_violation_classifier is invoked before user-visible streaming |
| AG-OUT-3 | Tool calls embedded in output must be parsed and routed to Layer 3, never executed inline by interpreting model text. The model proposes; the tool router decides. | grep for direct eval / exec of model strings outside the registry |
| AG-OUT-4 | Streaming responses still pass through Layer 2; secret detector runs on each chunk before forwarding. A leak found mid-stream cancels the stream and returns error. | confirm streaming chunk pipeline includes the detector |
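
AG-OUT-4 has one practical subtlety: a secret can be split across two chunks. One way to handle that, sketched below under stated assumptions, is to hold back a short tail of the stream until the next chunk arrives. The regular expression, its length bounds, the `holdback` size, and the names `guarded_stream` and `SecretLeak` are illustrative assumptions; entropy checks and AWS/GCP patterns are omitted.

```python
import re
from typing import Iterable, Iterator

# Token prefixes and PEM headers from AG-OUT-1. The minimum lengths are
# illustrative assumptions, not vendor-documented formats.
SECRET_RE = re.compile(
    r"(sk-[A-Za-z0-9]{16,}|ghp_[A-Za-z0-9]{16,}|xoxb-[A-Za-z0-9-]{16,}"
    r"|-----BEGIN [A-Z ]*PRIVATE KEY-----)"
)


class SecretLeak(Exception):
    """Raised when forwarding would expose a secret; the caller cancels the stream."""


def guarded_stream(chunks: Iterable[str], holdback: int = 64) -> Iterator[str]:
    """Forward chunks, holding back a tail so secrets split across chunks are caught.

    `holdback` must be at least as long as the shortest string SECRET_RE can match.
    """
    if holdback <= 0:
        raise ValueError("holdback must be positive")
    pending = ""
    for chunk in chunks:
        pending += chunk
        if SECRET_RE.search(pending):
            raise SecretLeak("secret detected mid-stream")
        # Release everything except the last `holdback` characters.
        safe, pending = pending[:-holdback], pending[-holdback:]
        if safe:
            yield safe
    if pending:
        # The tail was already scanned in the final loop iteration.
        yield pending
```

The trade-off is latency: the user sees text up to `holdback` characters late, in exchange for never receiving the first half of a secret.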

Layer 3 — Tool

Runs *before* every tool call, regardless of permission mode. Even auto mode honours Layer 3.

| Rule | Requirement | Audit query |
|---|---|---|
| AG-TOOL-1 | Each tool call is validated against the registry: tool exists, arguments match schema, caller has permission for the combination of (tool, mode, context). Out-of-schema args are not coerced; they are rejected. | grep for ToolRegistry.validate in the dispatcher |
| AG-TOOL-2 | Pre-edit snapshot: any tool that mutates the user's workspace (filesystem write, git operation, config change) captures a snapshot first and exposes it via Esc / undo / kode rollback. Snapshots are local, atomic, and survive the agent crashing mid-edit. | grep for WorkspaceSnapshot.capture before write tools |
| AG-TOOL-3 | Non-checkpointable actions require explicit approval *independent of mode*, even auto. The list is closed: database writes, deployments, git push, external messages (email/SMS/Slack/Telegram), billing API calls, IPv4/IPv6 routes outside the local LAN. The permission mode, including auto, only sets the default for *checkpointable* tools. | confirm the approval prompt fires in auto mode for the closed list |
| AG-TOOL-4 | Destructive actions (rm -rf, SQL DROP/TRUNCATE, git push --force, container delete, account delete) require explicit approval *and* an audit trail entry with the model's stated justification. Approval is one-time per call site, not per session. | grep for requiresExplicitApproval enumeration; confirm audit.log records justification |
| AG-TOOL-5 | Tenant isolation: tool calls that read or write user data carry koder_user_id from the agent's auth context; cross-tenant queries return 404 per `multi-tenant-by-default.kmd`. The tool router never trusts the model's claim of which user it is acting for. | grep for tool_context.koder_user_id propagation; confirm tools that touch user data reject calls without it |
| AG-TOOL-6 | Network egress allowlist: tools that issue outbound HTTP are restricted to a per-agent allowlist (e.g. kode may call api.anthropic.com, koderid, koderflow; kortex may add cloud-provider endpoints). New destinations require a ticket and policy review. | grep for egress_allowlist config; confirm DNS resolutions outside list are blocked |
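
Two of these rules, AG-TOOL-1 and AG-TOOL-6, reduce to simple membership checks. The sketch below is one possible shape, not the harness implementation: the registry contents, the tool names `read_file` and `http_get`, the allowlist layout, and `validate_call` are hypothetical; only the host `api.anthropic.com` comes from the policy text.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Hypothetical registry: tool name -> {argument name: required type}.
REGISTRY = {
    "read_file": {"path": str, "max_lines": int},
    "http_get": {"url": str},
}

# Hypothetical per-agent egress allowlist.
EGRESS_ALLOWLIST = {"kode": {"api.anthropic.com"}}


def validate_call(agent: str, tool: str, args: dict) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    schema = REGISTRY.get(tool)
    if schema is None:
        return False, "unknown tool"
    if set(args) != set(schema):
        return False, "argument names do not match schema"
    for name, expected in schema.items():
        # Exact type match: "5" is never coerced to 5, and True does not
        # pass as an int (which isinstance() would allow).
        if type(args[name]) is not expected:
            return False, f"argument {name!r} has wrong type"
    if tool == "http_get":
        # AG-TOOL-6: compare the parsed hostname, not a substring of the URL.
        host = urlsplit(args["url"]).hostname
        if host not in EGRESS_ALLOWLIST.get(agent, set()):
            return False, f"egress to {host!r} is not allowlisted"
    return True, "ok"
```

For example, `validate_call("kode", "read_file", {"path": "a.txt", "max_lines": "5"})` is rejected rather than having the string converted to an integer.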

Permission modes are not guards

A common confusion: permission modes (default, acceptEdits, plan, auto) decide whether to ask before a checkpointable tool runs. They do *not* decide whether Layer 3 fires. The relationship:

        ┌─ Layer 1 (input)  ── always fires, before model
        ├─ Layer 2 (output) ── always fires, before user sees / tool dispatches
model ──┤
        └─ Layer 3 (tool)   ── always fires, before tool executes
                              ├─ permission_mode decides "ask or auto" for checkpointable tools
                              └─ non-checkpointable list always asks, even in `auto`

If a future mode (yolo, unrestricted) tries to skip Layer 3, that mode does not ship. The mode space is closed at four: default, acceptEdits, plan, auto.
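
The split between modes and guards can be expressed as a small decision function. This is a sketch under assumptions: the category names are hypothetical labels for the closed list in AG-TOOL-3, and treating `auto` as the only mode that skips the prompt for checkpointable tools is a simplification, since the policy does not spell out per-mode defaults.

```python
from enum import Enum


class Mode(Enum):
    """The closed mode space; constructing any other value raises ValueError."""
    DEFAULT = "default"
    ACCEPT_EDITS = "acceptEdits"
    PLAN = "plan"
    AUTO = "auto"


# Hypothetical category labels for the closed non-checkpointable list.
NON_CHECKPOINTABLE = frozenset({
    "database_write", "deployment", "git_push",
    "external_message", "billing_api", "non_lan_route",
})


def needs_approval(category: str, mode: Mode) -> bool:
    if category in NON_CHECKPOINTABLE:
        return True  # the mode is never consulted for these
    # Assumption: only auto skips the prompt for checkpointable tools.
    return mode is not Mode.AUTO
```

Because `Mode` is an enum, a configuration value such as `"yolo"` fails at parse time instead of silently bypassing the check.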

Tenant scope (companion to multi-tenant-by-default)

AG-TOOL-5 is the agent expression of policies/multi-tenant-by-default.kmd. Where the data plane policy says "every schema carries koder_user_id", this policy says "every tool call carries it too, and the model cannot forge it". The harness derives koder_user_id from the authenticated session, not from the conversation. A model-proposed call such as read_doc(user_id=999) must be rewritten by the router to read_doc(user_id=ctx.user_id) before dispatch.
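
The rewrite step can be sketched in a few lines. The types `AuthContext` and `ToolCall` and the function `bind_tenant` are hypothetical names for illustration; the only behaviour taken from the policy is that the session's user id overwrites whatever the model proposed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthContext:
    user_id: int  # derived from the authenticated session, never from model text


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


def bind_tenant(call: ToolCall, ctx: AuthContext) -> ToolCall:
    """Return a copy of the call whose user_id comes from the session."""
    # Whatever user_id the model proposed (or omitted) is overwritten.
    return ToolCall(call.name, {**call.args, "user_id": ctx.user_id})


proposed = ToolCall("read_doc", {"user_id": 999, "doc": "notes"})
dispatched = bind_tenant(proposed, AuthContext(user_id=42))
# dispatched.args["user_id"] is now 42; the proposed call is left untouched.
```

Returning a new object instead of mutating the proposal keeps the model's original claim available for the audit log.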

What audit catches

koder-spec-audit policies agent-guardrails <agent-path> runs the audit queries from the rule tables above. Failing rules report:

  • Rule ID
  • File path + line where the missing check should be
  • Example fix (link to reference implementation in engines/sdk/koder_agent once services/ai/kode #047 ships the shared harness)

CI integration: a new agent (or new tool in an existing agent) does not merge until koder-spec-audit policies agent-guardrails returns clean for that path.

Programmatic invocation

The harness tool router (services/ai/kode #047, future shared SDK engines/sdk/koder_agent) exposes Layer 3 as a callable:

final result = await Guardrails.tool.validate(
  toolCall: parsedCall,
  context: AgentContext(
    userId: session.userId,
    permissionMode: session.mode,
  ),
);
if (result.blocked) return result.toUserError();
if (result.needsApproval) await ui.requestApproval(result);

The Dart contract is the reference; the Go and Python SDKs will mirror it 1:1 once the agent surface expands beyond Flutter.

Test fixtures

Each agent that implements this policy maintains fixtures under tests/integration/guardrails/:

tests/integration/guardrails/
├── input/
│   ├── prompt_injection_basic.json
│   ├── pii_cpf_in_user_message.json
│   └── jailbreak_instruction_override.json
├── output/
│   ├── secret_in_response.json
│   ├── policy_violation.json
│   └── streaming_secret_mid_chunk.json
└── tool/
    ├── destructive_rm_rf.json
    ├── cross_tenant_user_id.json
    └── non_checkpointable_in_auto_mode.json

Each fixture is a JSON record: input → expected guard decision (pass, redact, block, approve). The same fixture set runs against every agent — divergence in outcome is a bug in the agent, not in the fixture.
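
A fixture check could look like the sketch below. The field names `input` and `expected`, the fixture body, and the toy guard are assumptions for illustration; the policy fixes only the four decision values, not the record schema.

```python
import json

# The four decisions named in the policy.
DECISIONS = {"pass", "redact", "block", "approve"}


def check_fixture(raw_json: str, guard) -> bool:
    """Return True when the guard's decision matches the fixture's expectation."""
    record = json.loads(raw_json)
    expected = record["expected"]
    if expected not in DECISIONS:
        raise ValueError(f"unknown decision in fixture: {expected!r}")
    return guard(record["input"]) == expected


# Hypothetical fixture body and a toy guard, for illustration only.
FIXTURE = '{"input": "rm -rf /", "expected": "approve"}'


def toy_tool_guard(command: str) -> str:
    return "approve" if command.startswith("rm -rf") else "pass"
```

Passing the guard in as a parameter is what lets the same fixture set run unchanged against every agent.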

Out-of-scope

  • *Adversarial robustness research.* This policy enforces baseline hygiene; it does not claim to defeat dedicated adversaries.
  • *Model alignment.* This policy treats the model as untrusted and guards around it; it does not try to make the model safer.
  • *Rate limiting.* Quota and abuse are the gateway's job (services/ai/gateway), not the agent's. Layers 1–3 guard every call; the gateway decides whether the call gets made at all.

Source: ../home/koder/dev/koder/meta/docs/stack/policies/agent-guardrails.kmd