RFC 007 — ChatOps Pipeline
- Tracking ticket: backlogdone007
- Depends on:
  - RFC 002 — Kortex Architecture (Approved, 2026-04-07): defines the five subsystems (Senses, Brain, Reflexes, Memory, Coordination)
  - RFC 005 — LLM Provider Abstraction (Approved, 2026-04-08): defines the Kode agent interface and tool use
  - RFC 006 — Auto-Remediation Rules Engine (Approved, 2026-04-08): defines the rules DSL and action catalog
- Status: Approved (2026-04-14)
1. Summary
This RFC defines the ChatOps Pipeline, an end-to-end autonomous remediation flow that starts when a user sends a screenshot of an error in a chat group and ends when the fix is deployed to production. The pipeline connects five Koder products (Talk, Kortex, Kode/AI, Flow, Jet) into a single coordinated workflow.
The key insight is that most bug reports in real-world operations arrive as screenshots in chat groups, not as structured alerts. The PoC deployment at Pouso Alegre — MG proved that an AI agent monitoring logs, detecting errors, fixing code, committing, and deploying autonomously can reduce incident resolution time from hours to minutes. This RFC formalizes that pattern and extends it to the chat channel.
1.1 The pipeline at a glance
User sends screenshot with error in chat group
|
[1. Koder Talk] bot monitors group, detects screenshot
| extracts image, sends ChatOpsEvent via OTLP push
v
[2. Kortex Senses] ingests ChatOpsEvent into hot cache
|
v
[3. Kortex Brain] Vision/LLM analyzes screenshot, extracts error
| correlates with recent deploys, logs, known incidents
v
[4. Kortex Reflexes] evaluates chatops rules
| decides: invoke Kode (autonomous) or ask approval (supervised)
v
[5. Kode (AI Agent)] clones repo, locates bug, writes fix + regression test
| commits, pushes to branch, creates PR or merges directly
v
[6. Koder Flow] receives push, creates release (if configured)
| fires webhook to Kortex
v
[7. Kortex Coordination] receives release event
| instructs Jet to deploy new version
v
[8. koder-jet] pulls new release, swaps service, health check
|
v
[9. Talk bot] posts status updates back to chat group
"Error detected → Fix committed → Deployed → Verified"
1.2 Design principles
- Chat is a first-class event source. Screenshots in chat groups are as valid as structured OTLP alerts. Talk is an upstream to Kortex Senses, just like observe/log or observe/mon.
- Autonomous by default, supervised by choice. The pipeline supports two modes: autonomous (detect-fix-deploy-notify) and supervised (detect-fix-ask-deploy-notify). The mode is configurable per environment, per repository, and per rule.
- Human-in-the-loop is optional, not mandatory. The PoC experience proved that requiring human approval for every fix introduces unacceptable latency in critical environments. The gate is configurable.
- Feedback loop in chat. The bot posts every pipeline stage back to the originating chat group, so users always know what happened. Silence is never acceptable.
- Safety through the rules engine. The pipeline does not bypass RFC 006 safety guards. Rate limits, blast radius, circuit breakers, and dry-run all apply. The Kode agent is just another action in the rules engine catalog.
2. Context and motivation
2.1 The screenshot problem
In practice, most production errors are first reported by end users — not by monitoring systems. The typical flow is:
1. User encounters an error in the application
2. User takes a screenshot
3. User sends the screenshot to a support chat group (WhatsApp, Telegram, Google Chat)
4. A developer sees the screenshot (minutes to hours later)
5. Developer tries to reproduce the error
6. Developer locates the bug, writes a fix
7. Developer commits, pushes, deploys
8. Developer confirms the fix in the chat group
Steps 4–8 take anywhere from 30 minutes to several hours. During this time, every user hitting the same bug is blocked.
2.2 The PoC precedent
During the Pouso Alegre — MG PoC deployment (March–April 2026), the Kode AI agent was configured to:
- Monitor aggregated logs from poc.vivver.com
- Detect new errors automatically
- Analyze root cause
- Write fixes with regression tests
- Commit and push to the release branch
- Deploy to production via container swap
This autonomous loop resolved dozens of production issues — many at night or on weekends — with zero human intervention. Users arrived in the morning to find their reported bugs already fixed.
The ChatOps pipeline generalizes this pattern: instead of monitoring server logs, the trigger is a screenshot in a chat group. Instead of a single monolithic agent, the pipeline is decomposed into Koder products, each doing what it does best.
2.3 Why not just use log monitoring?
Log monitoring (the PoC approach) catches errors that produce logs. But:
- Some errors are client-side only (JavaScript exceptions, UI glitches, rendering bugs)
- Some errors are transient and don't leave persistent log traces
- Some errors are configuration issues that don't throw exceptions
- Users often report behavioral bugs ("this button does nothing") that have no error log
Screenshots capture the user's experience, which is the ground truth. The ChatOps pipeline complements log monitoring — it doesn't replace it.
3. Koder Talk — Bot Mode
3.1 Multi-platform adapters
Koder Talk gains a bot subsystem — a set of platform adapters that allow Talk to monitor external chat groups. Each adapter implements the ChatAdapter interface:
// ChatAdapter connects to an external messaging platform and streams
// incoming messages to the bot engine.
type ChatAdapter interface {
// Name returns the adapter identifier (e.g., "whatsapp", "telegram").
Name() string
// Connect establishes a connection to the platform using the
// provided credentials. It blocks until the connection is ready
// or the context is cancelled.
Connect(ctx context.Context, creds AdapterCredentials) error
// Listen returns a channel of incoming messages from monitored groups.
// The channel is closed when the context is cancelled.
Listen(ctx context.Context) (<-chan IncomingMessage, error)
// Send posts a message to a specific group/channel.
Send(ctx context.Context, target ChatTarget, msg OutgoingMessage) error
// Close gracefully disconnects from the platform.
Close() error
}
3.2 Supported platforms (Phase 1)
| Platform | Adapter | Library / API | Auth method |
|---|---|---|---|
| Telegram | telegram | Telegram Bot API (HTTP long poll) | Bot token |
| WhatsApp | whatsapp | whatsmeow (Multi-Device) | QR code pairing |
| Google Chat | googlechat | Google Chat API (Pub/Sub) | Service account |
3.3 Phase 2 platforms
| Platform | Adapter | Notes |
|---|---|---|
| Slack | slack | Socket Mode API |
| Discord | discord | Gateway WebSocket |
| Koder Talk native | talk-native | Direct WebSocket (Noise_XX), no bridge needed |
| Microsoft Teams | teams | Bot Framework |
3.4 Message model
// IncomingMessage represents a message received from any platform.
type IncomingMessage struct {
// Platform identifies the source adapter (e.g., "telegram").
Platform string
// GroupID is the platform-specific group/channel identifier.
GroupID string
// GroupName is the human-readable group name.
GroupName string
// SenderID is the platform-specific user identifier.
SenderID string
// SenderName is the display name of the sender.
SenderName string
// Text is the message text (may be empty for image-only messages).
Text string
// Images contains attached images (screenshots, photos).
Images []ImageAttachment
// Timestamp is when the message was sent (platform time).
Timestamp time.Time
// ReplyTo is the message ID this is a reply to (if any).
ReplyTo string
// Raw is the platform-specific raw message (for debugging).
Raw any
}
// ImageAttachment represents an image attached to a message.
type ImageAttachment struct {
// URL is the download URL for the image (platform-specific).
URL string
// MimeType is the MIME type (e.g., "image/png", "image/jpeg").
MimeType string
// Data is the raw image bytes (populated after download).
Data []byte
// Width and Height in pixels (if known).
Width int
Height int
}
3.5 Screenshot detection heuristics
The bot engine applies heuristics to decide whether an incoming message likely contains an error report:
1. Image present: Message contains at least one image attachment.
2. Error keywords in text: Message text (or reply context) contains keywords like "erro", "error", "bug", "quebrou", "não funciona", "travou", "caiu", "crash", "falha", "problema", "help", "broken". Keyword list is configurable and supports regex.
3. Image content analysis: If heuristics 1 and 2 are inconclusive, submit the image to Vision/LLM for classification: "Does this screenshot show an application error, exception, or malfunction?" (cheap and fast: a single yes/no question).
4. Group configuration: Some groups may be configured as "error-only" (every image is treated as an error report) or "ignore" (bot never analyzes images from this group).
A message that passes detection is promoted to a ChatOpsEvent and pushed to Kortex.
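The heuristics above could be combined roughly as follows. This is a minimal sketch, not the bot engine's actual code: the `GroupMode` constants, the `Detection` struct, and the confidence values 1.0 and 0.8 are all hypothetical, and plain substring matching stands in for the configurable regex support.

```go
package main

import (
	"fmt"
	"strings"
)

// GroupMode mirrors the per-group setting (hypothetical names).
type GroupMode string

const (
	ModeMonitor   GroupMode = "monitor"
	ModeErrorOnly GroupMode = "error-only"
	ModeIgnore    GroupMode = "ignore"
)

// Detection is the outcome of the heuristic pass (hypothetical type).
type Detection struct {
	Promote     bool    // emit a ChatOpsEvent
	NeedsVision bool    // inconclusive: ask Vision/LLM to classify the image
	Confidence  float64 // illustrative values only
	Method      string  // "keywords", "vision", or "group_config"
}

// Detect applies group configuration, image presence, and keyword matching.
func Detect(text string, imageCount int, mode GroupMode, keywords []string) Detection {
	if mode == ModeIgnore || imageCount == 0 {
		return Detection{}
	}
	if mode == ModeErrorOnly {
		return Detection{Promote: true, Confidence: 1.0, Method: "group_config"}
	}
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return Detection{Promote: true, Confidence: 0.8, Method: "keywords"}
		}
	}
	// An image without matching keywords is deferred to Vision/LLM.
	return Detection{NeedsVision: true, Method: "vision"}
}

func main() {
	kws := []string{"erro", "bug", "não funciona"}
	fmt.Printf("%+v\n", Detect("Deu ERRO ao salvar", 1, ModeMonitor, kws))
	fmt.Printf("%+v\n", Detect("", 1, ModeMonitor, kws))
}
```

Returning `NeedsVision` instead of calling the model inline keeps the cheap checks synchronous and lets the caller decide whether the Vision/LLM fallback is enabled.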
3.6 ChatOpsEvent (OTLP extension)
The bot emits a custom OTLP event when it detects a potential error report:
// ChatOpsEvent is sent by Koder Talk bot to Kortex Senses
// when a screenshot of an error is detected in a chat group.
message ChatOpsEvent {
// Source platform (telegram, whatsapp, googlechat, etc.)
string platform = 1;
// Group/channel where the screenshot was posted
string group_id = 2;
string group_name = 3;
// User who posted the screenshot
string sender_id = 4;
string sender_name = 5;
// The screenshot image (raw bytes, typically PNG/JPEG)
bytes screenshot = 6;
string screenshot_mime = 7;
// Accompanying text message (if any)
string text = 8;
// Detection confidence (0.0 to 1.0)
float confidence = 9;
// Detection method: "keywords", "vision", "group_config"
string detection_method = 10;
// Timestamp of the original message
google.protobuf.Timestamp message_time = 11;
// Reply-to context (surrounding messages for context)
repeated ContextMessage context_messages = 12;
}
message ContextMessage {
string sender_name = 1;
string text = 2;
google.protobuf.Timestamp timestamp = 3;
}
This event is sent via OTLP HTTP push to Kortex Senses (:4327), using the standard Koder event envelope with:
- koder.product.name = "koder-talk"
- koder.event.type = "ChatOpsEvent"
- koder.event.severity = "warn"
4. Kortex — ChatOps processing
4.1 Senses: Talk as upstream
Talk is added as a push-only upstream in the Senses subsystem. Unlike observe/log or observe/mon which use both push and pull, Talk only pushes — there is no meaningful "pull" operation for chat messages.
Configuration:
senses:
push:
http_addr: ":4327"
grpc_addr: ":4328"
channel_size: 8192
# Talk events accepted on the same OTLP endpoints
# No special configuration needed: Talk sends standard OTLP
The Senses subsystem recognizes ChatOpsEvent by the koder.event.type attribute and routes it to the Brain's ChatOps analyzer.
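Routing by the koder.event.type attribute could look like the sketch below. The `Event`, `Handler`, and `Router` types are hypothetical simplifications of the Koder event envelope, which this RFC does not spell out.

```go
package main

import "fmt"

// Event is a simplified stand-in for the Koder event envelope (assumption).
type Event struct {
	Attributes map[string]string
	Body       []byte
}

// Handler processes one event.
type Handler func(Event) error

// Router dispatches events by their koder.event.type attribute.
type Router struct {
	handlers map[string]Handler
	fallback Handler
}

// NewRouter creates a router that sends unknown event types to fallback.
func NewRouter(fallback Handler) *Router {
	return &Router{handlers: make(map[string]Handler), fallback: fallback}
}

// Register binds a handler to one event type.
func (r *Router) Register(eventType string, h Handler) {
	r.handlers[eventType] = h
}

// Route looks up the handler for the event's type and invokes it.
func (r *Router) Route(ev Event) error {
	if h, ok := r.handlers[ev.Attributes["koder.event.type"]]; ok {
		return h(ev)
	}
	return r.fallback(ev)
}

func main() {
	r := NewRouter(func(Event) error { return nil })
	r.Register("ChatOpsEvent", func(ev Event) error {
		fmt.Println("routing to ChatOps analyzer:", len(ev.Body), "bytes")
		return nil
	})
	_ = r.Route(Event{
		Attributes: map[string]string{"koder.event.type": "ChatOpsEvent"},
		Body:       []byte("png"),
	})
}
```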
4.2 Brain: Screenshot analysis
When the Brain receives a ChatOpsEvent, it executes the ChatOps analysis pipeline:
Step 1 — Vision extraction
Submit the screenshot to the LLM provider (RFC 005) with a Vision prompt:
Analyze this screenshot from a production application. Extract:
1. The error message (exact text)
2. The error type (exception, HTTP error, UI bug, crash, etc.)
3. The affected component (URL, page name, feature)
4. Stack trace (if visible)
5. Any other diagnostic information visible
Respond in structured JSON.
Output: ScreenshotAnalysis struct with extracted error details.
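The RFC names `ScreenshotAnalysis` but does not define its fields. The sketch below assumes one field per item in the prompt; the JSON keys are illustrative and would need to match whatever the prompt actually requests from the model.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ScreenshotAnalysis holds the Vision output. Field names and JSON keys
// are assumptions derived from the five items in the prompt.
type ScreenshotAnalysis struct {
	ErrorMessage string `json:"error_message"`
	ErrorType    string `json:"error_type"`
	Component    string `json:"component"`
	StackTrace   string `json:"stack_trace,omitempty"`
	Diagnostics  string `json:"diagnostics,omitempty"`
}

// ParseAnalysis decodes the model response and rejects replies that do
// not contain an error message, since nothing downstream can act on them.
func ParseAnalysis(raw string) (ScreenshotAnalysis, error) {
	var a ScreenshotAnalysis
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		return a, fmt.Errorf("decode analysis: %w", err)
	}
	if a.ErrorMessage == "" {
		return a, errors.New("analysis has no error_message")
	}
	return a, nil
}

func main() {
	raw := `{"error_message":"NullPointerException","error_type":"exception","component":"/agendamento"}`
	a, err := ParseAnalysis(raw)
	fmt.Println(a.ErrorType, a.Component, err)
}
```

Validating the reply at this boundary means a malformed or empty model response fails fast instead of reaching correlation with blank fields.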
Step 2 — Correlation
Cross-reference the extracted error with:
- Recent deploy events from infra/jet (did a deploy happen in the last N minutes?)
- Recent errors in observe/log matching the same error signature
- Known incidents in Memory (has this error been seen before? what fixed it?)
- Active rules in Reflexes (is there already a rule handling this?)
Output: CorrelationResult with probable root cause, related events, and confidence score.
Step 3 — Triage decision
Based on the analysis:
- If error matches a known incident with a known fix → use the playbook from Memory
- If error correlates with a recent deploy → likely regression, rollback candidate
- If error is novel → invoke Kode for investigation and fix
- If confidence is too low → escalate to human (post in chat asking for more context)
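The triage rules above can be expressed as a small decision function. This is a sketch under assumptions: the `CorrelationResult` fields and the `TriageAction` names are hypothetical, and checking confidence first (so a low-confidence match never triggers automation) is one possible ordering, not something the RFC mandates.

```go
package main

import "fmt"

// TriageAction names the four outcomes listed above (hypothetical names).
type TriageAction string

const (
	UsePlaybook       TriageAction = "use_playbook"
	RollbackCandidate TriageAction = "rollback_candidate"
	InvokeKode        TriageAction = "invoke_kode"
	EscalateHuman     TriageAction = "escalate_human"
)

// CorrelationResult carries only the fields triage needs (assumed shape).
type CorrelationResult struct {
	KnownFix     bool    // a known incident with a known fix matched
	RecentDeploy bool    // the error correlates with a recent deploy
	Confidence   float64 // 0.0 to 1.0
}

// Triage maps a correlation result to the next pipeline step.
func Triage(c CorrelationResult, minConfidence float64) TriageAction {
	switch {
	case c.Confidence < minConfidence:
		return EscalateHuman
	case c.KnownFix:
		return UsePlaybook
	case c.RecentDeploy:
		return RollbackCandidate
	default:
		return InvokeKode
	}
}

func main() {
	fmt.Println(Triage(CorrelationResult{RecentDeploy: true, Confidence: 0.9}, 0.7))
}
```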
4.3 Reflexes: ChatOps rules
A new rule type chatops is added to the rules engine (RFC 006):
# rules/chatops-auto-fix.yaml
- name: chatops-auto-fix
description: "Invoke Kode AI to fix bugs detected from chat screenshots"
trigger:
type: pattern
source: koder-talk
event_type: ChatOpsEvent
conditions:
- field: confidence
op: gte
value: 0.7
actions:
- invoke_kode:
task: fix_bug
context_from: event # pass screenshot analysis to Kode
repo: auto # detect repo from error context
branch_strategy: auto # direct_push (autonomous) or pr (supervised)
guards:
mode: autonomous # or "supervised"
rate_limit: 10/hour
blast_radius: single_service
circuit_breaker:
threshold: 3
window: 1h
dry_run: false
notify:
channel: origin # reply in the same chat group
on_start: true
on_success: true
on_failure: true
4.4 New action: invoke_kode
A new action is added to the Reflexes action catalog (RFC 006 §6):
// InvokeKodeAction spawns a Kode AI agent to investigate and fix a bug.
type InvokeKodeAction struct {
// Task is the type of work for Kode: "fix_bug", "investigate",
// "rollback", "write_test".
Task string
// Repo is the target repository. "auto" means Kode determines
// the repo from the error context (stack trace, URL, component name).
Repo string
// BranchStrategy controls how Kode commits:
// "direct_push" — push directly to the release/main branch (autonomous mode)
// "pr" — create a feature branch and open a PR (supervised mode)
// "auto" — use the mode configured in the rule's guards
BranchStrategy string
// ContextFrom specifies where Kode gets its context:
// "event" — the ChatOpsEvent (screenshot analysis, correlation)
// "logs" — recent error logs from observe/log
// "both" — merge event context with recent logs
ContextFrom string
// MaxDuration is the maximum time Kode is allowed to work on this fix.
MaxDuration time.Duration
// RegressionTest controls whether Kode must write a regression test.
// Default: true.
RegressionTest bool
}
The action executor:
- Spawns a Kode agent session via the Kode API
- Passes the ScreenshotAnalysis + CorrelationResult as context
- Waits for Kode to complete (with timeout)
- Captures the result: commit hash, branch, PR URL, files changed
- If BranchStrategy == "direct_push" and the rule mode is autonomous, proceeds to deployment
- If BranchStrategy == "pr", posts the PR link in the chat group and waits for merge
4.5 Coordination: Release → Deploy
When Kode pushes a fix and a release is created (either manually or via automation), the Coordination subsystem handles the deploy:
Flow webhook → Kortex
Koder Flow fires a webhook on release creation:
{
"action": "published",
"release": {
"tag_name": "v2.3.1",
"target_commitish": "release",
"name": "v2.3.1 — ChatOps auto-fix",
"body": "Automated fix for: NullPointerException in agendamento.go:142\nTriggered by: ChatOps screenshot from grupo-poc (Telegram)\nKode session: ks-20260408-143022",
"assets": [...]
},
"repository": {
"full_name": "vivver/saude-publica"
}
}
Kortex receives this webhook at POST /api/v1/webhooks/flow and creates a ReleaseEvent in the Senses cache.
Deploy instruction to Jet
The Coordination subsystem calls the koder-jet admin API (RFC 004):
POST /admin/deploys
Content-Type: application/json
{
"service": "saude-publica",
"version": "v2.3.1",
"source": "chatops-pipeline",
"strategy": "rolling",
"health_check": {
"url": "/api/health",
"interval": "5s",
"timeout": "30s"
}
}
Jet pulls the new release, performs a rolling deploy (or container swap), runs health checks, and reports back to Kortex.
5. Operational modes
5.1 Autonomous mode
The full pipeline runs without human intervention:
screenshot → detect → analyze → fix → commit → release → deploy → notify
Best for:
- PoC / pilot deployments where speed matters
- Nighttime / weekend incidents with no staff online
- Known-stable codebases with good test coverage
- Environments with easy rollback (containers, blue-green)
5.2 Supervised mode
The pipeline pauses before deployment and asks for human approval:
screenshot → detect → analyze → fix → commit → PR → [ask approval] → merge → release → deploy → notify
Best for:
- Production environments serving many users
- Codebases without comprehensive test coverage
- Regulated environments requiring change approval
- New deployments where trust in AI fixes is still being established
5.3 Configuration
Mode is configured at three levels (most specific wins):
# kortex.yaml — global default
chatops:
default_mode: supervised
timeout_escalation: 30m # if no human responds in 30min, escalate to autonomous
# Per-environment override
environments:
poc.vivver.com:
mode: autonomous
notify: grupo-poc
auto_release: true
auto_deploy: true
app.vivver.com:
mode: supervised
notify: grupo-dev
require_approval_from: ["rodrigo", "francisco"]
timeout_escalation: 60m
# Per-repository override
repositories:
vivver/saude-publica:
mode: autonomous # override for this specific repo
branch: release
5.4 Timeout escalation
When supervised mode is active but no human responds within the timeout_escalation window:
- Bot posts a warning: "Ninguém respondeu em 30 minutos. Deployando automaticamente em 5 minutos. Responda 'cancelar' para impedir." (In English: "No one has responded in 30 minutes. Deploying automatically in 5 minutes. Reply 'cancelar' to stop it.")
- If no "cancelar" response arrives within 5 minutes, the pipeline escalates to autonomous mode
- The escalation is logged in Kortex Memory as a TimeoutEscalation incident
This prevents fixes from being stuck in limbo when no one is available.
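The "most specific wins" rule from section 5.3 could be resolved as in the sketch below. It assumes the precedence order is repository, then environment, then global default; the RFC lists the three levels but does not state their order explicitly. The `Config` type, `ResolveMode`, and the names `staging` and `vivver/other` are hypothetical.

```go
package main

import "fmt"

// Config holds only the mode settings from kortex.yaml (simplified).
type Config struct {
	DefaultMode  string
	Environments map[string]string // environment -> mode
	Repositories map[string]string // repository -> mode
}

// ResolveMode returns the effective mode, most specific level first.
// Assumed precedence: repository > environment > global default.
func (c Config) ResolveMode(env, repo string) string {
	if m := c.Repositories[repo]; m != "" {
		return m
	}
	if m := c.Environments[env]; m != "" {
		return m
	}
	return c.DefaultMode
}

func main() {
	cfg := Config{
		DefaultMode:  "supervised",
		Environments: map[string]string{"poc.vivver.com": "autonomous", "app.vivver.com": "supervised"},
		Repositories: map[string]string{"vivver/saude-publica": "autonomous"},
	}
	fmt.Println(cfg.ResolveMode("app.vivver.com", "vivver/other"))
	fmt.Println(cfg.ResolveMode("app.vivver.com", "vivver/saude-publica"))
}
```

Under this assumed order, a repository override to autonomous also applies in a supervised environment, which is worth confirming before relying on it in production.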
6. Chat feedback loop
6.1 Status messages
The bot posts progress updates back to the originating chat group at every pipeline stage:
| Stage | Message template |
|---|---|
| Detection | "Erro detectado na screenshot de @{sender}. Analisando..." |
| Analysis | "Erro identificado: {error_summary} em {component}. Buscando correção..." |
| Fix started | "Kode iniciou correção no repositório {repo}. Sessão: {session_id}" |
| Fix committed | "Correção commitada: {commit_url}\nArquivos alterados: {files}\nTeste de regressão: incluído" |
| PR created | "PR aberta: {pr_url}\nAguardando aprovação para deploy. Responda 'aprovar' ou 'rejeitar'." |
| Deploying | "Deployando v{version} em {environment}..." |
| Deployed | "v{version} deployada com sucesso em {environment}. Verifique se o erro foi corrigido." |
| Failed | "Não consegui corrigir este erro automaticamente. Detalhes: {reason}\nEscalando para o time." |
| Rollback | "Fix causou regressão. Rollback para v{prev_version} executado. Investigação manual necessária." |
6.2 Interactive commands
Users can interact with the bot in the chat group:
| Command | Action |
|---|---|
| aprovar / approve | Approve pending deployment (supervised mode) |
| rejeitar / reject | Reject pending fix, discard branch |
| cancelar / cancel | Cancel in-progress pipeline |
| status | Show current pipeline status |
| rollback | Rollback to previous version |
| detalhes / details | Show full analysis details |
| ignorar / ignore | Mark this error as known/expected, don't fix |
| silenciar / mute | Temporarily mute bot notifications (1h default) |
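Because each command has a Portuguese and an English form, an alias table keeps parsing simple. The sketch below is hypothetical: the `Command` type and `ParseCommand` are illustrative names, and matching only the first word of the message is an assumption about how the bot recognizes commands.

```go
package main

import (
	"fmt"
	"strings"
)

// Command is the canonical command name (hypothetical type).
type Command string

// aliases maps every accepted word from the table to its command.
var aliases = map[string]Command{
	"aprovar": "approve", "approve": "approve",
	"rejeitar": "reject", "reject": "reject",
	"cancelar": "cancel", "cancel": "cancel",
	"status":   "status",
	"rollback": "rollback",
	"detalhes": "details", "details": "details",
	"ignorar": "ignore", "ignore": "ignore",
	"silenciar": "mute", "mute": "mute",
}

// ParseCommand inspects the first word of a chat message, ignoring case
// and surrounding punctuation, and reports whether it is a known command.
func ParseCommand(text string) (Command, bool) {
	fields := strings.Fields(strings.ToLower(text))
	if len(fields) == 0 {
		return "", false
	}
	word := strings.Trim(fields[0], "'\"!.,")
	cmd, ok := aliases[word]
	return cmd, ok
}

func main() {
	fmt.Println(ParseCommand("Aprovar"))
	fmt.Println(ParseCommand("obrigado"))
}
```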
6.3 Locale
Status messages are emitted in the language configured per group. Default: pt-BR for Vivver/Crescer groups, en-US for Koder groups.
7. Kode agent integration
7.1 Kode session lifecycle
When Kortex Reflexes triggers invoke_kode, it creates a Kode session:
type KodeSession struct {
// SessionID is a unique identifier (e.g., "ks-20260408-143022")
SessionID string
// Trigger contains the original ChatOpsEvent + analysis
Trigger ChatOpsTrigger
// Repo is the resolved repository (owner/name)
Repo string
// Branch is the target branch for the fix
Branch string
// Mode is "autonomous" or "supervised"
Mode string
// Status tracks the session state
Status KodeSessionStatus // pending, running, succeeded, failed
// Result contains the outcome (populated on completion)
Result *KodeResult
// StartedAt, CompletedAt track timing
StartedAt time.Time
CompletedAt time.Time
// MaxDuration is the hard timeout
MaxDuration time.Duration
}
type KodeResult struct {
CommitHash string
Branch string
PRURL string // empty if direct push
FilesChanged []string
TestsAdded []string
Summary string // human-readable summary of the fix
RollbackCommit string // commit to revert to if fix is bad
}
7.2 Context handoff
Kode receives a structured context package:
type ChatOpsTrigger struct {
// Screenshot analysis from Brain
Analysis ScreenshotAnalysis
// Correlation with other events
Correlation CorrelationResult
// Recent error logs from observe/log (if available)
RecentLogs []LogEntry
// Known incidents from Memory (if similar error seen before)
SimilarIncidents []IncidentSummary
// Playbook from Memory (if a fix is already known)
Playbook *Playbook
}
7.3 Kode behavior constraints
When invoked by the ChatOps pipeline, Kode operates under specific constraints:
- Single-repo scope: Kode only modifies the target repository. No cross-repo changes.
- Regression test mandatory: Every fix must include a regression test (per CLAUDE.md rules).
- No destructive operations: Kode cannot delete files, drop tables, or remove features. It can only add/modify code.
- Time-boxed: Hard timeout of MaxDuration (default: 15 minutes). If Kode can't fix it in time, the session fails and escalates to human.
- Commit message format: Includes [chatops] prefix and reference to the originating chat message.
8. Flow webhook integration
8.1 Webhook configuration
Koder Flow is configured to send webhooks to Kortex on release events:
URL: https://{kortex_host}/api/v1/webhooks/flow
Events: release.published
Secret: {shared_secret}
Content-Type: application/json
8.2 Webhook receiver
Kortex exposes a webhook endpoint:
// POST /api/v1/webhooks/flow
// Receives Gitea/Forgejo webhook payloads for release events.
func (s *Server) handleFlowWebhook(w http.ResponseWriter, r *http.Request) {
// 1. Validate HMAC signature
// 2. Parse release payload
// 3. Check if release was created by ChatOps pipeline (commit message contains [chatops])
// 4. If yes, create ReleaseEvent and push to Senses
// 5. Coordination picks up ReleaseEvent and triggers deploy via Jet
}
8.3 Non-ChatOps releases
The webhook receiver also handles releases created by humans or other automation. If a release is not tagged as [chatops], Kortex logs it in Memory as a deploy event but does not auto-deploy (unless a separate deploy rule is configured).
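Steps 1 and 3 of the receiver in section 8.2 (signature validation and the [chatops] check) can be sketched with the standard library. The sketch assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body, as Gitea-style webhooks commonly use; the RFC does not specify the encoding or the header name, so both should be checked against Koder Flow. `ValidSignature`, `IsChatOpsRelease`, and `sign` are hypothetical helpers.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// ValidSignature reports whether sigHex is the hex-encoded HMAC-SHA256
// of body under secret. The comparison is constant-time.
func ValidSignature(secret, body []byte, sigHex string) bool {
	got, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(mac.Sum(nil), got)
}

// IsChatOpsRelease checks the commit message for the [chatops] marker.
func IsChatOpsRelease(commitMessage string) bool {
	return strings.Contains(commitMessage, "[chatops]")
}

// sign produces a signature the same way the sender is assumed to.
func sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	secret := []byte("shared-secret")
	body := []byte(`{"action":"published"}`)
	fmt.Println(ValidSignature(secret, body, sign(secret, body)))
	fmt.Println(IsChatOpsRelease("[chatops] fix nil dereference"))
}
```

The signature must be computed over the raw bytes as received, before any JSON parsing, because re-serializing the payload can change the byte sequence.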
9. Safety and rollback
9.1 Pre-deploy health check
Before instructing Jet to deploy, Kortex Coordination verifies:
- Tests passed: If the repository has CI configured, wait for CI to pass (or run tests via Kode).
- No conflicting deploys: Check the deploy lock (per feedback_poc_session_coordination.md protocol).
- Service is healthy: Current service health is OK (don't deploy on top of an existing outage caused by something else).
9.2 Post-deploy verification
After Jet reports successful deployment:
- Wait 60 seconds for the service to stabilize
- Run a health check against the deployed service
- If the original error was reproducible, attempt to reproduce it (if possible)
- Monitor error rate for 5 minutes — if error rate increases, auto-rollback
9.3 Automatic rollback
If post-deploy verification fails:
- Kortex instructs Jet to rollback to the previous version
- Bot posts rollback notification in the chat group
- The failed fix is recorded in Kortex Memory as a FailedFix incident
- Kode session is marked as failed_post_deploy
- The issue is escalated to human intervention
9.4 Circuit breaker
The ChatOps pipeline has its own circuit breaker (independent of individual rule circuit breakers):
- If 3 consecutive ChatOps-triggered fixes fail (fix doesn't compile, tests fail, post-deploy regression), the pipeline enters cooldown mode for 1 hour
- In cooldown mode, the bot still detects and analyzes errors, but does not attempt to fix them; instead, it posts the analysis in the chat group for human action
- The circuit breaker can be manually reset via the status command in chat
10. Observability
10.1 Metrics
The ChatOps pipeline emits Prometheus metrics:
# Detection
kortex_chatops_screenshot_detected_total{platform, group, method}
kortex_chatops_screenshot_analyzed_total{platform, group, result}
# Pipeline
kortex_chatops_pipeline_started_total{mode, repo}
kortex_chatops_pipeline_completed_total{mode, repo, result}
kortex_chatops_pipeline_duration_seconds{mode, repo, result}
# Kode sessions
kortex_chatops_kode_session_started_total{repo, task}
kortex_chatops_kode_session_completed_total{repo, task, result}
kortex_chatops_kode_session_duration_seconds{repo, task}
# Deployments
kortex_chatops_deploy_total{env, repo, result}
kortex_chatops_rollback_total{env, repo, reason}
# Circuit breaker
kortex_chatops_circuit_breaker_state{state} # closed, open, half_open
10.2 Audit trail
Every pipeline execution is recorded in Kortex Memory (Postgres) with:
- Full event chain (detection → analysis → fix → deploy)
- All chat messages sent/received
- Kode session ID and result
- Commit hashes, PR URLs, deploy versions
- Timing for each stage
- Mode (autonomous/supervised) and whether escalation happened
11. Configuration reference
11.1 Talk bot configuration (talkd.toml)
[bot]
enabled = true
# Platform adapters
[[bot.adapters]]
name = "telegram"
enabled = true
token = "${TELEGRAM_BOT_TOKEN}" # env var reference
[[bot.adapters]]
name = "whatsapp"
enabled = true
device_store = "/var/lib/koder-talk/whatsapp-device.db"
[[bot.adapters]]
name = "googlechat"
enabled = false
service_account = "/etc/koder-talk/google-chat-sa.json"
# Groups to monitor
[[bot.groups]]
platform = "telegram"
group_id = "-1001234567890"
name = "grupo-poc"
mode = "monitor" # "monitor" (detect errors) or "ignore"
locale = "pt-BR"
[[bot.groups]]
platform = "whatsapp"
group_id = "120363123456789@g.us"
name = "suporte-vivver"
mode = "monitor"
locale = "pt-BR"
# Detection settings
[bot.detection]
keywords = ["erro", "error", "bug", "quebrou", "não funciona", "travou", "caiu", "crash", "falha", "problema", "broken", "failed"]
keyword_regex = [] # additional regex patterns
vision_fallback = true # use Vision/LLM when keywords are inconclusive
min_confidence = 0.5 # minimum confidence to emit ChatOpsEvent
# Kortex integration
[bot.kortex]
endpoint = "http://127.0.0.1:4327" # OTLP push endpoint
11.2 Kortex ChatOps configuration (kortex.yaml)
chatops:
enabled: true
default_mode: supervised # "autonomous" or "supervised"
timeout_escalation: 30m # escalate to autonomous if no human response
# Kode agent settings
kode:
max_duration: 15m
regression_test: true
branch_prefix: "chatops-fix/"
# Per-environment overrides
environments:
poc.vivver.com:
mode: autonomous
auto_release: true
auto_deploy: true
notify_group: grupo-poc
app.vivver.com:
mode: supervised
require_approval: true
timeout_escalation: 60m
notify_group: grupo-dev
# Per-repository overrides
repositories:
vivver/saude-publica:
mode: autonomous
branch: release
# Circuit breaker
circuit_breaker:
threshold: 3 # failures before opening
window: 1h
cooldown: 1h
# Flow webhook
flow_webhook:
secret: "${FLOW_WEBHOOK_SECRET}"
path: /api/v1/webhooks/flow
12. Implementation phases
Phase 1 — Foundation (tickets 008–010)
- Talk bot mode with Telegram adapter (simplest to implement)
- Kortex ChatOpsEvent ingestion in Senses
- Brain screenshot analysis (Vision/LLM)
- Basic invoke_kode action in Reflexes
- Chat feedback loop (status messages)
Phase 2 — Full pipeline (tickets 011–012)
- WhatsApp and Google Chat adapters
- Flow webhook → Kortex integration
- Kortex → Jet deploy coordination
- Autonomous/supervised mode switching
- Timeout escalation
- Post-deploy verification and auto-rollback
Phase 3 — Intelligence (tickets 013–014)
- Memory integration (incident history, playbooks)
- LLM-proposed ChatOps rules (learning loop from RFC 006)
- Interactive commands in chat (approve, reject, rollback)
- Circuit breaker dashboard in Kortex UI
- Metrics and observability
13. Open questions
| # | Question | Proposed answer | Status |
|---|---|---|---|
| 1 | Should Kode sessions be visible in the Kortex UI? | Yes. Dedicated "ChatOps Sessions" page with status, logs, and replay. | Open |
| 2 | Should the bot support audio messages (voice notes describing errors)? | Yes. Delivered in koder-talk v0.2.0 (ticket 005 voice, ticket 006 video) + Kortex v0.2.0 (ticket 018 Brain voice/video handlers). Voice notes are common in Brazilian WhatsApp chats; deferring this lost too many real bug reports. | Resolved |
| 3 | Should the pipeline support multi-repo fixes? | No for Phase 1. Single-repo constraint keeps blast radius contained. | Resolved |
| 4 | Rate limiting per chat group? | Yes. Configurable, default 10 events/hour per group to prevent spam floods. | Resolved |
| 5 | Should Kode be able to ask clarifying questions in chat? | Phase 2: bot asks, user responds, Kode continues with additional context. | Open |
| 6 | Should the pipeline accept any implementation request, not just bug reports? | Yes. Delivered in koder-talk v0.3.0 (ticket 008 feature, AnalyzeFeatureRequest handler). The bot detects both error keywords and feature keywords and sets body.intent to error_report or feature_request; feature requests are hard-pinned to supervised mode (FeatureRequestTriageResult.Supervised = true). A new koder.chatops.text event type covers text-only requests without media. | Resolved |
| 7 | Should the Brain expose per-analyzer metrics for Koder Mon to scrape? | Yes. Delivered in Kortex v0.4.0 (ticket 018 follow-up). Every analyzer path (screenshot, voice, video, feature-request, browser-error) has a counter + histogram at :9190/metrics, using prometheus/client_golang directly, the same convention as the 12 other Koder products already instrumented. Registered as a scrape target in observe/mon (koder-kortex in the live koder-mon-server.yaml). Labels are minimal on purpose: high-cardinality context (sender, group, raw error) stays in structured logs. | Resolved |
14. Relationship to existing RFCs
| RFC | Relationship |
|---|---|
| 001 — Ecosystem Map | Talk is a new upstream in the ecosystem. No schema ownership conflict — Talk emits standard OTLP events. |
| 002 — Architecture | ChatOps is a concrete use case exercising all five subsystems end-to-end. |
| 003 — Common Event Schema | ChatOpsEvent extends the schema with a new event type. Must be registered in observe/observability. |
| 004 — Common Control Plane | Deploy action uses the standard /admin/deploys endpoint added in RFC 004 §5.6. |
| 005 — LLM Provider | Vision analysis uses the LLM provider abstraction. Kode sessions are a new tool use pattern. |
| 006 — Rules Engine | invoke_kode is a new action in the action catalog. chatops is a new rule type. |
15. Resolved decisions
- Talk as event source, not Kortex scraping chat: Talk is the right module to monitor chats. It already has the transport layer and will have platform adapters. Kortex should not embed chat platform SDKs.
- OTLP for Talk → Kortex communication: Standard event protocol, no custom RPC needed. Talk pushes, Kortex receives on existing endpoints.
- Autonomous mode is safe because of existing guards: The rules engine (RFC 006) already provides rate limits, blast radius, circuit breakers, and dry-run. Autonomous mode just means require_approval: false; all other guards remain active.
- Single-repo constraint for Phase 1: Multi-repo fixes are complex and error-prone. Phase 1 keeps it simple. Cross-repo orchestration is a Coordination concern for later.
- Regression test is mandatory: Every ChatOps fix must include a regression test. This is non-negotiable: it is the safety net that prevents the same bug from recurring.
- Intent-based dispatch instead of event-type proliferation: When Phase 3 added feature requests, the natural shape would have been new event types (koder.chatops.feature_request.*). Instead, all existing koder.chatops.* event types gained a body.intent field (error_report | feature_request) and the Brain dispatches by intent. This keeps the wildcard Reflexes rule (chatops-auto-fix) matching without modification and lets talkd and kortex evolve independently. The only new event type is koder.chatops.text for text-only messages without any media.
- Feature requests are hard-pinned to supervised mode: Unlike bug fixes, which can run autonomously in trusted environments, feature requests always run in supervised mode (PR against target branch, never direct push). FeatureRequestTriageResult.Supervised is hardcoded true and the Reason string carries an explicit supervised-only instruction that downstream Kode reads at runtime. Rationale: implementing a feature without review has a larger blast radius than fixing a regression, because the test suite does not catch "feature was wrong" the way it catches "bug came back".
- Prometheus metrics via the default registry: The Brain exposes per-analyzer counters + histograms at :9190/metrics using promauto against the default Prometheus registry. This is the same convention used by the 12 other Koder products instrumented with prometheus/client_golang. Koder Mon scrapes the endpoint. Labels are minimal: cardinality is bounded on purpose, and rich context (sender name, group name, raw error message) stays in structured logs, not in metrics.
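The intent-based dispatch and the supervised pin for feature requests could be sketched as below. The `Intent`, `Decision`, and `Dispatch` names are hypothetical, and treating an empty intent as an error report (for events emitted before the field existed) is an assumption, not something this RFC states.

```go
package main

import "fmt"

// Intent mirrors the body.intent field values.
type Intent string

const (
	IntentErrorReport    Intent = "error_report"
	IntentFeatureRequest Intent = "feature_request"
)

// Decision is the dispatch outcome (hypothetical type).
type Decision struct {
	Handler    string // which Brain analyzer path handles the event
	Supervised bool   // whether a human must approve before deploy
}

// Dispatch picks a handler by intent. Bug fixes follow the rule mode;
// feature requests are always supervised regardless of the rule mode.
func Dispatch(intent Intent, ruleMode string) (Decision, error) {
	switch intent {
	case IntentErrorReport, "": // empty intent treated as a bug report (assumption)
		return Decision{Handler: "bug_fix", Supervised: ruleMode != "autonomous"}, nil
	case IntentFeatureRequest:
		return Decision{Handler: "feature_request", Supervised: true}, nil
	default:
		return Decision{}, fmt.Errorf("unknown intent %q", intent)
	}
}

func main() {
	d, _ := Dispatch(IntentFeatureRequest, "autonomous")
	fmt.Printf("%+v\n", d)
}
```

Keeping the pin inside the dispatcher means no rule or environment setting can switch a feature request to direct push.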