Agent-guardrails
Three-layer guardrail contract for any Koder-built AI agent harness (kode, kortex, agents, voice). Defines the minimum input/output/tool guards every agent must enforce, independent of the active permission mode. Operationalised by the harness tool router (kode #047 `PreToolUse` hook) and audited via `koder-spec-audit policies agent-guardrails`. Sibling of `reuse-first.kmd` and `hyperscale-first.kmd`: where reuse is about shared code and hyperscale about shared cost, guardrails are about shared safety.
Policy — Agent-guardrails
Every Koder AI agent runs three layers of guards: **Input** (what enters the model), **Output** (what leaves the model), and **Tool** (what the model is allowed to do). The three are independent: a tool guard must fire even if input and output passed; an output guard must fire even if the prompt was clean. Permission modes set defaults for tool guards; they do **not** disable the guards themselves.
Why this exists
An LLM call wrapped in a loop is not an agent. The harness around it provides the durability: persistent state, formal tool routing, input/output validation, reasoning loops, and **guardrails**. Without the five, the model cannot survive contact with production. Of those five, guardrails are the one that survives compromises: if input/output are wrong, the user sees garbage; if tool guards are wrong, the user loses data, money, or trust.
Koder is shipping multiple agents (kode, kortex, agents, voice). Each inevitably grows tool surface as it matures. Without a shared contract, divergence is guaranteed: kode allows what kortex blocks; voice trusts what agents validates; one tenant gets isolation that another silently skips. This policy is the contract.
The three layers
Layer 1 — Input
Runs **before** any token of the user's message reaches the model.
| Rule | Requirement | Audit query |
|---|---|---|
| AG-IN-1 | Prompt injection heuristics (instruction | grep agent code for `PromptInjectionDetector` or equivalent on the input path |
| AG-IN-2 | PII detector (CPF, RG, email, phone, credit card, OAuth tokens) runs before send; matches are redacted with sentinel `<redacted:TYPE>` or rejected per agent config. Default: redact for user-pasted text, reject for screenshots/files. | grep for `PIIRedactor.scan` (or equivalent) on every code path that builds the model prompt |
| AG-IN-3 | Jailbreak attempts logged with full context to `audit.log` regardless of outcome. The model may answer the request; the agent must record that the attempt happened. | check `audit.log` schema includes `event_type=jailbreak_attempt` |
| AG-IN-4 | File/image uploads pass through Layer 1 the same as text: extracted text is scanned, OCR output is scanned, base64 payloads are decoded first. | grep upload handlers for scan call |
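The redact-or-reject default in AG-IN-2 could look roughly like the Python sketch below. The patterns, the `guard_input` name, and the `source` parameter are assumptions made for illustration; they are not the `PIIRedactor` API, and a real detector would also need to cover RG, phone numbers, credit cards, and OAuth tokens.

```python
import re

# Hypothetical patterns, for illustration only. Only email addresses and
# formatted CPF numbers are covered here.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace each match with <redacted:TYPE>; return the text and the types found."""
    found = []
    for pii_type, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"<redacted:{pii_type}>", text)
        if count:
            found.append(pii_type)
    return text, found


def guard_input(text: str, source: str = "pasted") -> str:
    """Redact PII in pasted text; reject any other source that contains PII."""
    redacted, found = redact_pii(text)
    if found and source != "pasted":
        raise ValueError(f"input rejected: PII detected ({', '.join(found)})")
    return redacted
```

Under these assumptions, `guard_input("CPF 123.456.789-09")` returns the text with the sentinel in place, while the same text passed with `source="file"` raises instead of reaching the model.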
Layer 2 — Output
Runs **after** the model emits its response, **before** the user sees it and **before** any tool call dispatches.
| Rule | Requirement | Audit query |
|---|---|---|
| AG-OUT-1 | Secret detector (high, `ghp_`, `xoxb-`, PEM headers, AWS/GCP key patterns) blocks the response; the model is asked to regenerate with the secret elided. After 2 attempts, return error to user. | grep for `SecretLeakDetector` in the output path |
| AG-OUT-2 | Refusal guidelines violation triggers regeneration with stricter system prompt. After 2 attempts, return refusal message. | confirm `policy_violation_classifier` is invoked before user-visible streaming |
| AG-OUT-3 | Tool calls embedded in output must be parsed and routed to Layer 3, never executed inline by interpreting model text. The model proposes; the tool router decides. | grep for direct `eval` / `exec` of model strings outside the registry |
| AG-OUT-4 | Streaming responses still pass through Layer 2; secret detector runs on each chunk before forwarding. A leak found mid-stream cancels the stream and returns error. | confirm streaming chunk pipeline includes the detector |
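The block-and-regenerate behaviour of AG-OUT-1 can be sketched as a bounded retry loop. This is a Python illustration under stated assumptions: the pattern list is a small hypothetical subset, `guarded_response` and its `generate` callback are invented names, and "2 attempts" is read as two generations in total, which the policy text does not pin down.

```python
import re
from typing import Callable

# Hypothetical subset of secret patterns: GitHub and Slack token prefixes
# plus PEM private-key headers.
SECRET_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),
    re.compile(r"xoxb-[A-Za-z0-9-]{10,}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

MAX_ATTEMPTS = 2  # assumption: the count includes the first generation


def contains_secret(text: str) -> bool:
    """True if any known secret pattern appears in the text."""
    return any(p.search(text) for p in SECRET_PATTERNS)


def guarded_response(generate: Callable[[bool], str]) -> str:
    """Call generate(elide_secrets) until the output is clean or attempts run out."""
    for attempt in range(MAX_ATTEMPTS):
        # First call is unconstrained; the retry asks for the secret to be elided.
        response = generate(attempt > 0)
        if not contains_secret(response):
            return response
    # The caller is expected to turn this into a user-facing error.
    raise RuntimeError("response blocked: secret detected after regeneration")
```

The guard never forwards a response it has not scanned, so a model that keeps leaking ends in an error rather than a partially filtered answer.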
Layer 3 — Tool
Runs **before** every tool call, regardless of permission mode. Even `auto` mode honours Layer 3.
| Rule | Requirement | Audit query |
|---|---|---|
| AG-TOOL-1 | Each tool call is validated against the registry: tool exists, arguments match schema, caller has permission for the combination of (tool, mode, context). Out | grep for `ToolRegistry.validate` in the dispatcher |
| AG-TOOL-2 | PreEsc / undo / kode rollback. Snapshots are local, atomic, and survive the agent crashing mid | grep for `WorkspaceSnapshot.capture` before write tools |
| AG-TOOL-3 | Non-checkpointable tools always ask for approval, even in `auto`. The list is closed: database writes, deployments, git push, external messages (email/SMS/Slack/Telegram), billing API calls, IPv4/IPv6 routes outside the local LAN. The mode `auto` only sets the default for **checkpointable** tools. | confirm the approval prompt fires in `auto` mode for the closed list |
| AG-TOOL-4 | Destructive actions (`rm -rf`, SQL `DROP`/`TRUNCATE`, `git push --force`, container delete, account delete) require explicit approval **and** an audit trail entry with the model's stated justification. Approval is one-time per call site, not per session. | grep for `requiresExplicitApproval` enumeration; confirm `audit.log` records justification |
| AG-TOOL-5 | Tenant isolation: tool calls that read or write user data carry `koder_user_id` from the agent's auth context; cross | grep for `tool_context.koder_user_id` propagation; confirm tools that touch user data reject calls without it |
| AG-TOOL-6 | Network egress allowlist: tools that issue outbound HTTP are restricted to a per | grep for `egress_allowlist` config; confirm DNS resolutions outside list are blocked |
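The three checks named in AG-TOOL-1 (tool exists, arguments match schema, caller has permission) can be sketched as below. This is a Python illustration, not the real `ToolRegistry` interface: `ToolSpec`, `Decision`, and the flat name-to-type schema are assumptions, and the `context` part of the permission tuple is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    required_args: dict[str, type]  # argument name -> expected Python type
    allowed_modes: set[str] = field(default_factory=set)


@dataclass
class Decision:
    allowed: bool
    reason: str = ""


class ToolRegistry:
    """Hypothetical registry; validation fails closed on anything unknown."""

    def __init__(self, tools: dict[str, ToolSpec]):
        self._tools = tools

    def validate(self, tool: str, args: dict, mode: str) -> Decision:
        spec = self._tools.get(tool)
        if spec is None:
            return Decision(False, f"unknown tool: {tool}")
        # Exact match on argument names: no missing and no extra keys.
        if set(args) != set(spec.required_args):
            return Decision(False, "arguments do not match schema")
        for name, expected in spec.required_args.items():
            if not isinstance(args[name], expected):
                return Decision(False, f"argument {name!r} has wrong type")
        if mode not in spec.allowed_modes:
            return Decision(False, f"tool not permitted in mode {mode!r}")
        return Decision(True)
```

Returning a `Decision` with a reason, rather than raising, keeps the dispatcher in control of how a rejection is logged and shown to the user.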
Permission modes are not guards
A common confusion: permission modes (`default`, `acceptEdits`, `plan`, `auto`) decide whether to ask before a checkpointable tool runs. They do **not** decide whether Layer 3 fires. The relationship:

            ┌─ Layer 1 (input)  ── always fires, before model
            ├─ Layer 2 (output) ── always fires, before user sees / tool dispatches
    model ──┤
            └─ Layer 3 (tool)   ── always fires, before tool executes
                  ├─ permission_mode decides "ask or auto" for checkpointable tools
                  └─ non-checkpointable list always asks, even in `auto`

If a future mode (yolo, unrestricted) tries to skip Layer 3, that mode does not ship. The mode space is closed at 4.
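The split between "Layer 3 always fires" and "the mode picks a default" can be expressed as a single decision function. The Python sketch below uses hypothetical tool names for the closed list, and it simplifies by treating every mode other than `auto` as "ask"; the real per-mode defaults for `acceptEdits` and `plan` are not specified here.

```python
from enum import Enum


class Mode(Enum):
    # The mode space is closed at these four values.
    DEFAULT = "default"
    ACCEPT_EDITS = "acceptEdits"
    PLAN = "plan"
    AUTO = "auto"


# Hypothetical tool names standing in for the closed non-checkpointable list.
NON_CHECKPOINTABLE = {"db_write", "deploy", "git_push", "send_message", "billing_call"}


def needs_approval(tool: str, mode: Mode) -> bool:
    """Decide whether to ask the user; this runs for every call in every mode."""
    if tool in NON_CHECKPOINTABLE:
        return True  # always asks, even in auto
    # Simplifying assumption: only auto skips the prompt for checkpointable tools.
    return mode is not Mode.AUTO
```

Because the mode is an argument to the guard rather than a switch around it, there is no code path where a mode can bypass the check.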
Tenant scope (companion to multi-tenant-by-default)
AG-TOOL-5 is the agent expression of policies/multi-tenant-by-default.kmd. Where the data plane policy says "every schema carries koder_user_id", this policy says "every tool call carries it too — and the model cannot forge it". The harness derives koder_user_id from the authenticated session, not from the conversation. The model proposing read_doc(user_id=999) must be rewritten by the router to read_doc(user_id=ctx.user_id) before dispatch.
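A minimal Python sketch of that rewrite step follows. `AuthContext` and `bind_tenant` are hypothetical names, and the argument key `user_id` simply follows the `read_doc(user_id=...)` example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthContext:
    user_id: int  # derived from the authenticated session, never from model output


def bind_tenant(tool_args: dict, ctx: AuthContext) -> dict:
    """Return a copy of the arguments with user_id forced to the session's value."""
    bound = dict(tool_args)  # copy, so the model's proposal stays available for audit
    bound["user_id"] = ctx.user_id  # overwrite whatever the model proposed
    return bound
```

Overwriting unconditionally, instead of comparing and rejecting, means a forged or missing `user_id` produces the same dispatch as a correct one.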
What audit catches
koder-spec-audit policies agent-guardrails <agent-path> runs the audit queries from the rule tables above. Failing rules report:
- Rule ID
- File path + line where the missing check should be
- Example fix (link to reference implementation in `engines/sdk/koder_agent` once `services/ai/kode` #047 ships the shared harness)
CI integration: a new agent (or new tool in an existing agent) does not merge until koder-spec-audit policies agent-guardrails returns clean for that path.
Programmatic invocation
The harness tool router (services/ai/kode #047, future shared SDK engines/sdk/koder_agent) exposes Layer 3 as a callable:
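
    // Layer 3 check: runs before every tool call, whatever the permission mode.
    final result = await Guardrails.tool.validate(
      toolCall: parsedCall,
      context: AgentContext(
        userId: session.userId,
        permissionMode: session.mode,
      ),
    );
    // A blocked call never reaches the tool; the user gets an error instead.
    if (result.blocked) return result.toUserError();
    // A call that needs approval pauses here until the user responds.
    if (result.needsApproval) await ui.requestApproval(result);

The Dart contract is the reference; the Go and Python SDKs mirror it 1:1 once the agent surface expands beyond Flutter.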
Test fixtures
Each agent that implements this policy maintains fixtures under tests/integration/guardrails/:

    tests/integration/guardrails/
    ├── input/
    │   ├── prompt_injection_basic.json
    │   ├── pii_cpf_in_user_message.json
    │   └── jailbreak_instruction_override.json
    ├── output/
    │   ├── secret_in_response.json
    │   ├── policy_violation.json
    │   └── streaming_secret_mid_chunk.json
    └── tool/
        ├── destructive_rm_rf.json
        ├── cross_tenant_user_id.json
        └── non_checkpointable_in_auto_mode.json

Each fixture is a JSON record: input → expected guard decision (pass, redact, block, approve). The same fixture set runs against every agent; divergence in outcome is a bug in the agent, not in the fixture.
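A fixture check of this kind could be as small as the Python sketch below. The fixture schema is not defined here, so the field names `input` and `expected` are assumptions, as are the `run_fixture` helper and the shape of the `guard` callable.

```python
import json

# The four decisions a fixture may expect.
VALID_DECISIONS = {"pass", "redact", "block", "approve"}


def run_fixture(raw: str, guard) -> bool:
    """Parse one fixture record and compare the guard's decision with the expected one."""
    record = json.loads(raw)
    expected = record["expected"]  # hypothetical field name
    if expected not in VALID_DECISIONS:
        raise ValueError(f"unknown decision in fixture: {expected}")
    # "input" is also a hypothetical field name; guard returns a decision string.
    return guard(record["input"]) == expected
```

Since the fixture carries only the input and the expected decision, the same file can be replayed against any agent's guard without modification.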
Out-of-scope
- **Adversarial robustness research.** This policy enforces baseline hygiene; it does not claim to defeat dedicated adversaries.
- **Model alignment.** This policy treats the model as untrusted and guards around it; it does not try to make the model safer.
- **Rate limiting.** Quota and abuse are the gateway's job (`services/ai/gateway`), not the agent's. Layers 1–3 guard every call; the gateway decides whether the call gets made at all.