Agent-guardrails

mandatory

Three-layer guardrail contract for any Koder-built AI agent harness (kode, kortex, agents, voice). Defines the minimum input/output/tool guards every agent must enforce, independent of the active permission mode. Operationalised by the harness tool router (kode #047 `PreToolUse` hook) and audited via `koder-spec-audit policies agent-guardrails`. Sibling of `reuse-first.kmd` and `hyperscale-first.kmd`: where reuse is about shared code and hyperscale about shared cost, guardrails are about shared safety.

Policy — Agent-guardrails

Every Koder AI agent runs three layers of guards: *input* (what enters the model), *output* (what leaves the model), and *tool* (what the model is allowed to do). The three are independent: a tool guard must fire even if input and output passed; an output guard must fire even if the prompt was clean. Permission modes set defaults for tool guards; they do *not* disable the guards themselves.

Why this exists

An LLM call wrapped in a loop is not an agent. The harness around it provides the durability: persistent state, formal tool routing, input/output validation, reasoning loops, and *guardrails*. Without the five, the model cannot survive contact with production. Of those five, guardrails leave the least room for compromise: if input/output are wrong, the user sees garbage; if tool guards are wrong, the user loses data, money, or trust.

Koder is shipping multiple agents (kode, kortex, agents, voice). Each inevitably grows tool surface as it matures. Without a shared contract, divergence is guaranteed: kode allows what kortex blocks; voice trusts what agents validates; one tenant gets isolation that another silently skips. This policy is the contract.

The three layers

Layer 1 — Input

Runs *before* any token of the user's message reaches the model.

| Rule | Requirement | Audit query |
| --- | --- | --- |
| `AG-IN-1` | Prompt injection heuristics (instruction-override patterns, system-role spoofing, embedded tool-call syntax) flagged with confidence score. High confidence → block; medium → annotate for output guard. | grep agent code for `PromptInjectionDetector` or equivalent on the input path |
| `AG-IN-2` | PII detector (CPF, RG, email, phone, credit card, OAuth tokens) runs before send; matches are redacted with sentinel `<redacted:TYPE>` or rejected per agent config. Default: redact for user-pasted text, reject for screenshots/files. | grep for `PIIRedactor.scan` (or equivalent) on every code path that builds the model prompt |
| `AG-IN-3` | Jailbreak attempts logged with full context to `audit.log` regardless of outcome. The model may answer the request; the agent must record that the attempt happened. | check `audit.log` schema includes `event_type=jailbreak_attempt` |
| `AG-IN-4` | File/image uploads pass through Layer 1 the same as text: extracted text is scanned, OCR output is scanned, base64 payloads are decoded first. | grep upload handlers for scan call |
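
The redaction step in `AG-IN-2` can be sketched as follows. This is a minimal illustration, assuming a regex-based detector: the function `redact_pii` and the two patterns are hypothetical and cover only email addresses and formatted CPF numbers. Only the sentinel format `<redacted:TYPE>` comes from the rule itself.

```python
import re

# Hypothetical, simplified patterns; a real detector would cover every type in AG-IN-2.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace each match with the <redacted:TYPE> sentinel and report the types found."""
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<redacted:{pii_type}>", text)
        if count:
            found.append(pii_type)
    return text, found


clean, types = redact_pii("Reach me at ana@example.com, CPF 123.456.789-09.")
print(clean)  # Reach me at <redacted:EMAIL>, CPF <redacted:CPF>.
```

Returning the list of matched types alongside the text lets the caller choose between the redact and reject behaviours per agent config without scanning twice.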

Layer 2 — Output

Runs *after* the model emits its response, *before* the user sees it and *before* any tool call dispatches.

| Rule | Requirement | Audit query |
| --- | --- | --- |
| `AG-OUT-1` | Secret detector (high-entropy strings, known token prefixes `sk-`, `ghp_`, `xoxb-`, PEM headers, AWS/GCP key patterns) blocks the response; the model is asked to regenerate with the secret elided. After 2 attempts, return error to user. | grep for `SecretLeakDetector` in the output path |
| `AG-OUT-2` | Refusal guidelines violation triggers regeneration with stricter system prompt. After 2 attempts, return refusal message. | confirm `policy_violation_classifier` is invoked before user-visible streaming |
| `AG-OUT-3` | Tool calls embedded in output must be parsed and routed to Layer 3, never executed inline by interpreting model text. The model proposes; the tool router decides. | grep for direct `eval` / `exec` of model strings outside the registry |
| `AG-OUT-4` | Streaming responses still pass through Layer 2; secret detector runs on each chunk before forwarding. A leak found mid-stream cancels the stream and returns error. | confirm streaming chunk pipeline includes the detector |
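
The block-and-regenerate behaviour of `AG-OUT-1` could look roughly like this. The sketch assumes a regex covering only the token prefixes and PEM headers named in the rule (no entropy check, no cloud key patterns), and it reads "after 2 attempts" as two generations in total; both are assumptions. `guarded_response` and `generate` are hypothetical names.

```python
import re
from typing import Callable

# Hypothetical subset of AG-OUT-1 patterns: known token prefixes and PEM headers.
SECRET_PATTERN = re.compile(
    r"\b(?:sk-|ghp_|xoxb-)[A-Za-z0-9_-]{8,}|-----BEGIN [A-Z ]*PRIVATE KEY-----"
)
MAX_ATTEMPTS = 2  # assumed reading of "after 2 attempts"


def guarded_response(generate: Callable[[int], str]) -> str:
    """Call generate(attempt) until the output is free of secrets or attempts run out."""
    for attempt in range(MAX_ATTEMPTS):
        response = generate(attempt)
        if not SECRET_PATTERN.search(response):
            return response
    raise RuntimeError("response blocked: secret detected after 2 attempts")


# A fake model that leaks on the first draft and elides on the second.
drafts = ["token is sk-abcdef1234567890", "token is [elided]"]
print(guarded_response(lambda attempt: drafts[attempt]))  # token is [elided]
```

For `AG-OUT-4`, a per-chunk scan like this would miss a secret split across two chunks, so a streaming implementation would likely need to scan a small overlap window as well.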

Layer 3 — Tool

Runs *before* every tool call, regardless of permission mode. Even auto mode honours Layer 3.

| Rule | Requirement | Audit query |
| --- | --- | --- |
| `AG-TOOL-1` | Each tool call is validated against the registry: tool exists, arguments match schema, caller has permission for the combination of (tool, mode, context). Out-of-schema args are not coerced; they are rejected. | grep for `ToolRegistry.validate` in the dispatcher |
| `AG-TOOL-2` | Pre-edit snapshot: any tool that mutates the user's workspace (filesystem write, git operation, config change) captures a snapshot first and exposes it via `Esc` / undo / `kode rollback`. Snapshots are local, atomic, and survive the agent crashing mid-edit. | grep for `WorkspaceSnapshot.capture` before write tools |
| `AG-TOOL-3` | Non-checkpointable actions require explicit approval *independent of mode*, even in `auto`. The list is closed: database writes, deployments, `git push`, external messages (email/SMS/Slack/Telegram), billing API calls, IPv4/IPv6 routes outside the local LAN. The mode, `auto` included, only sets the default for *checkpointable* tools. | confirm the approval prompt fires in `auto` mode for the closed list |
| `AG-TOOL-4` | Destructive actions (`rm -rf`, SQL `DROP`/`TRUNCATE`, `git push --force`, container delete, account delete) require explicit approval *and* an audit trail entry with the model's stated justification. Approval is one-time per call site, not per session. | grep for `requiresExplicitApproval` enumeration; confirm `audit.log` records justification |
| `AG-TOOL-5` | Tenant isolation: tool calls that read or write user data carry `koder_user_id` from the agent's auth context; cross-tenant queries return 404 per `multi-tenant-by-default.kmd`. The tool router never trusts the model's claim of which user it is acting for. | grep for `tool_context.koder_user_id` propagation; confirm tools that touch user data reject calls without it |
| `AG-TOOL-6` | Network egress allowlist: tools that issue outbound HTTP are restricted to a per-agent allowlist (e.g. kode may call api.anthropic.com, koder-id, koder-flow; kortex may add cloud-provider endpoints). New destinations require a ticket and policy review. | grep for `egress_allowlist` config; confirm DNS resolutions outside list are blocked |
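
The "rejected, not coerced" requirement of `AG-TOOL-1` is illustrated below. This is a sketch under stated assumptions: the registry format (argument name mapped to a Python type), the `read_file` tool, and `validate_call` are all hypothetical, and the permission check on (tool, mode, context) is left out.

```python
# Hypothetical registry: tool name -> {argument name: required type}.
REGISTRY = {"read_file": {"path": str, "max_bytes": int}}


def validate_call(tool: str, args: dict) -> tuple[bool, str]:
    """Return (ok, reason). Out-of-schema arguments are rejected, never coerced."""
    schema = REGISTRY.get(tool)
    if schema is None:
        return False, f"unknown tool: {tool}"
    if set(args) != set(schema):
        return False, "argument names do not match schema"
    for name, expected in schema.items():
        # type(...) is compared directly so that bool is not accepted as int.
        if type(args[name]) is not expected:
            return False, f"argument {name!r} must be {expected.__name__}"
    return True, "ok"


# The string "100" is not silently converted to the integer 100.
print(validate_call("read_file", {"path": "a.txt", "max_bytes": "100"}))
```

Rejecting instead of coercing keeps the model's proposal and the executed call identical, which makes the audit trail easier to reason about.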

Permission modes are not guards

A common confusion: permission modes (`default`, `acceptEdits`, `plan`, `auto`) decide whether to ask before a checkpointable tool runs. They do *not* decide whether Layer 3 fires. The relationship:

        ┌─ Layer 1 (input)  ── always fires, before model
        ├─ Layer 2 (output) ── always fires, before user sees / tool dispatches
model ──┤
        └─ Layer 3 (tool)   ── always fires, before tool executes
                              ├─ permission_mode decides "ask or auto" for checkpointable tools
                              └─ non-checkpointable list always asks, even in `auto`

If a future mode (yolo, unrestricted) tries to skip Layer 3, that mode does not ship. The mode space is closed at 4.
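
The split between mode and guard can be expressed as a single decision function. In this sketch the category labels and `needs_approval` are hypothetical, and treating `auto` as the only mode that skips the prompt for checkpointable tools is a simplifying assumption; the four mode names and the closed list come from this policy.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"
    ACCEPT_EDITS = "acceptEdits"
    PLAN = "plan"
    AUTO = "auto"


# Closed list from AG-TOOL-3; the category labels are hypothetical names.
NON_CHECKPOINTABLE = frozenset({
    "database_write", "deployment", "git_push",
    "external_message", "billing_api", "non_local_route",
})


def needs_approval(category: str, mode: Mode) -> bool:
    """Layer 3 always runs; the mode only picks the default for checkpointable tools."""
    if category in NON_CHECKPOINTABLE:
        return True  # always asks, even in auto
    return mode is not Mode.AUTO  # simplification: only auto skips the prompt


print(needs_approval("git_push", Mode.AUTO))    # True
print(needs_approval("file_write", Mode.AUTO))  # False
```

Because the mode is only consulted after the closed-list check, no mode value can bypass the non-checkpointable branch.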

Tenant scope (companion to multi-tenant-by-default)

`AG-TOOL-5` is the agent expression of `policies/multi-tenant-by-default.kmd`. Where the data plane policy says "every schema carries `koder_user_id`", this policy says "every tool call carries it too, and the model cannot forge it". The harness derives `koder_user_id` from the authenticated session, not from the conversation. If the model proposes `read_doc(user_id=999)`, the router must rewrite the call to `read_doc(user_id=ctx.user_id)` before dispatch.
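
The rewrite step might be sketched like this. `bind_tenant` and the Python `AgentContext` shape are hypothetical; only the behaviour (the session's user ID overwrites whatever the model supplied) comes from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentContext:
    user_id: int  # derived from the authenticated session, never from the conversation


def bind_tenant(args: dict, ctx: AgentContext) -> dict:
    """Return a copy of args with user_id forced to the session's user."""
    bound = dict(args)
    bound["user_id"] = ctx.user_id  # overwrite whatever the model proposed
    return bound


proposed = {"doc": "notes.md", "user_id": 999}  # model-supplied, untrusted
print(bind_tenant(proposed, AgentContext(user_id=42)))
# {'doc': 'notes.md', 'user_id': 42}
```

Overwriting unconditionally, instead of comparing and rejecting on mismatch, means the router never has to decide whether the model's claim was plausible.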

What audit catches

`koder-spec-audit policies agent-guardrails <agent-path>` runs the audit queries from the rule tables above. Each failing rule reports:

- Rule ID
- File path + line where the missing check should be
- Example fix (link to reference implementation in `engines/sdk/koder_agent` once `services/ai/kode` #047 ships the shared harness)

CI integration: a new agent (or new tool in an existing agent) does not merge until `koder-spec-audit policies agent-guardrails` returns clean for that path.

Programmatic invocation

The harness tool router (services/ai/kode #047, future shared SDK engines/sdk/koder_agent) exposes Layer 3 as a callable:

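// Run Layer 3 on the parsed tool call before anything is dispatched.
final result = await Guardrails.tool.validate(
  toolCall: parsedCall,
  context: AgentContext(
    // Taken from the authenticated session, never from model output.
    userId: session.userId,
    permissionMode: session.mode,
  ),
);
// A blocked call never reaches the tool; surface the reason to the user.
if (result.blocked) return result.toUserError();
if (result.needsApproval) await ui.requestApproval(result);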

The Dart contract is the reference; the Go and Python SDKs mirror it 1:1 once the agent surface expands beyond Flutter.

Test fixtures

Each agent that implements this policy maintains fixtures under tests/integration/guardrails/:

tests/integration/guardrails/
├── input/
│   ├── prompt_injection_basic.json
│   ├── pii_cpf_in_user_message.json
│   └── jailbreak_instruction_override.json
├── output/
│   ├── secret_in_response.json
│   ├── policy_violation.json
│   └── streaming_secret_mid_chunk.json
└── tool/
    ├── destructive_rm_rf.json
    ├── cross_tenant_user_id.json
    └── non_checkpointable_in_auto_mode.json

Each fixture is a JSON record: input → expected guard decision (pass, redact, block, approve). The same fixture set runs against every agent — divergence in outcome is a bug in the agent, not in the fixture.
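
A fixture runner along these lines would let the same records run against every agent. The field names `input` and `expected`, the `run_fixture` helper, and the toy guard are assumptions made for illustration; the policy fixes only the four decision values.

```python
import json

DECISIONS = {"pass", "redact", "block", "approve"}


def run_fixture(raw: str, guard) -> bool:
    """Return True when the guard's decision matches the fixture's expectation."""
    fixture = json.loads(raw)
    if fixture["expected"] not in DECISIONS:
        raise ValueError(f"unknown decision: {fixture['expected']}")
    return guard(fixture["input"]) == fixture["expected"]


# Toy guard standing in for an agent's Layer 3.
def toy_tool_guard(call: dict) -> str:
    return "approve" if call.get("command", "").startswith("rm -rf") else "pass"


fixture = '{"input": {"tool": "shell", "command": "rm -rf build/"}, "expected": "approve"}'
print(run_fixture(fixture, toy_tool_guard))  # True
```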

Out of scope

- *Adversarial robustness research.* This policy enforces baseline hygiene; it does not claim to defeat dedicated adversaries.
- *Model alignment.* This policy treats the model as untrusted and guards around it; it does not try to make the model safer.
- *Rate limiting.* Quota and abuse are the gateway's job (`services/ai/gateway`), not the agent's. Layers 1–3 guard every call; the gateway decides whether the call gets made at all.