RFC 007 — ChatOps Pipeline

Approved

  • Tracking ticket: backlogdone007
  • Depends on:
    • RFC 002 — Kortex Architecture (Approved, 2026-04-07) — defines the five subsystems (Senses, Brain, Reflexes, Memory, Coordination)
    • RFC 005 — LLM Provider Abstraction (Approved, 2026-04-08) — defines the Kode agent interface and tool use
    • RFC 006 — Auto-Remediation Rules Engine (Approved, 2026-04-08) — defines the rules DSL and action catalog
  • Status: Approved (2026-04-14)

1. Summary

This RFC defines the ChatOps Pipeline — an end-to-end autonomous remediation flow that starts when a user sends a screenshot of an error in a chat group and ends when the fix is deployed to production. The pipeline connects five Koder products (Talk, Kortex, Kode/AI, Flow, Jet) into a single coordinated workflow.

The key insight is that most bug reports in real-world operations arrive as screenshots in chat groups, not as structured alerts. The PoC deployment at Pouso Alegre — MG proved that an AI agent monitoring logs, detecting errors, fixing code, committing, and deploying autonomously can reduce incident resolution time from hours to minutes. This RFC formalizes that pattern and extends it to the chat channel.

1.1 The pipeline at a glance

User sends screenshot with error in chat group
        |
   [1. Koder Talk]  bot monitors group, detects screenshot
        |  extracts image, sends ChatOpsEvent via OTLP push
        v
   [2. Kortex Senses]  ingests ChatOpsEvent into hot cache
        |
        v
   [3. Kortex Brain]  Vision/LLM analyzes screenshot, extracts error
        |  correlates with recent deploys, logs, known incidents
        v
   [4. Kortex Reflexes]  evaluates chatops rules
        |  decides: invoke Kode (autonomous) or ask approval (supervised)
        v
   [5. Kode (AI Agent)]  clones repo, locates bug, writes fix + regression test
        |  commits, pushes to branch, creates PR or merges directly
        v
   [6. Koder Flow]  receives push, creates release (if configured)
        |  fires webhook to Kortex
        v
   [7. Kortex Coordination]  receives release event
        |  instructs Jet to deploy new version
        v
   [8. koder-jet]  pulls new release, swaps service, health check
        |
        v
   [9. Talk bot]  posts status updates back to chat group
        "Error detected → Fix committed → Deployed → Verified"

1.2 Design principles

  1. Chat is a first-class event source. Screenshots in chat groups are as valid as structured OTLP alerts. Talk is an upstream to Kortex Senses, just like observe/log or observe/mon.
  2. Autonomous by default, supervised by choice. The pipeline supports two modes — autonomous (detect-fix-deploy-notify) and supervised (detect-fix-ask-deploy-notify). The mode is configurable per environment, per repository, and per rule.
  3. Human-in-the-loop is optional, not mandatory. The PoC experience proved that requiring human approval for every fix introduces unacceptable latency in critical environments. The gate is configurable.
  4. Feedback loop in chat. The bot posts every pipeline stage back to the originating chat group, so users always know what happened. Silence is never acceptable.
  5. Safety through the rules engine. The pipeline does not bypass RFC 006 safety guards. Rate limits, blast radius, circuit breakers, and dry-run all apply. The Kode agent is just another action in the rules engine catalog.

2. Context and motivation

2.1 The screenshot problem

In practice, most production errors are first reported by end users — not by monitoring systems. The typical flow is:

  1. User encounters an error in the application
  2. User takes a screenshot
  3. User sends the screenshot to a support chat group (WhatsApp, Telegram, Google Chat)
  4. A developer sees the screenshot (minutes to hours later)
  5. Developer tries to reproduce the error
  6. Developer locates the bug, writes a fix
  7. Developer commits, pushes, deploys
  8. Developer confirms the fix in the chat group

Steps 4–8 take anywhere from 30 minutes to several hours. During this time, every user hitting the same bug is blocked.

2.2 The PoC precedent

During the Pouso Alegre — MG PoC deployment (March–April 2026), the Kode AI agent was configured to:

  • Monitor aggregated logs from poc.vivver.com
  • Detect new errors automatically
  • Analyze root cause
  • Write fixes with regression tests
  • Commit and push to the release branch
  • Deploy to production via container swap

This autonomous loop resolved dozens of production issues — many at night or on weekends — with zero human intervention. Users arrived in the morning to find their reported bugs already fixed.

The ChatOps pipeline generalizes this pattern: instead of monitoring server logs, the trigger is a screenshot in a chat group. Instead of a single monolithic agent, the pipeline is decomposed into Koder products, each doing what it does best.

2.3 Why not just use log monitoring?

Log monitoring (the PoC approach) catches errors that produce logs. But:

  • Some errors are client-side only (JavaScript exceptions, UI glitches, rendering bugs)
  • Some errors are transient and don't leave persistent log traces
  • Some errors are configuration issues that don't throw exceptions
  • Users often report behavioral bugs ("this button does nothing") that have no error log

Screenshots capture the user's experience, which is the ground truth. The ChatOps pipeline complements log monitoring — it doesn't replace it.


3. Koder Talk — Bot Mode

3.1 Multi-platform adapters

Koder Talk gains a bot subsystem — a set of platform adapters that allow Talk to monitor external chat groups. Each adapter implements the ChatAdapter interface:

// ChatAdapter connects to an external messaging platform and streams
// incoming messages to the bot engine.
type ChatAdapter interface {
    // Name returns the adapter identifier (e.g., "whatsapp", "telegram").
    Name() string

    // Connect establishes a connection to the platform using the
    // provided credentials. It blocks until the connection is ready
    // or the context is cancelled.
    Connect(ctx context.Context, creds AdapterCredentials) error

    // Listen returns a channel of incoming messages from monitored groups.
    // The channel is closed when the context is cancelled.
    Listen(ctx context.Context) (<-chan IncomingMessage, error)

    // Send posts a message to a specific group/channel.
    Send(ctx context.Context, target ChatTarget, msg OutgoingMessage) error

    // Close gracefully disconnects from the platform.
    Close() error
}

3.2 Supported platforms (Phase 1)

Platform    | Adapter    | Library / API                     | Auth method
------------+------------+-----------------------------------+-----------------
Telegram    | telegram   | Telegram Bot API (HTTP long poll) | Bot token
WhatsApp    | whatsapp   | whatsmeow (Multi-Device)          | QR code pairing
Google Chat | googlechat | Google Chat API (Pub/Sub)         | Service account

3.3 Phase 2 platforms

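Platform          | Adapter     | Notes
------------------+-------------+----------------------------------------------
Slack             | slack       | Socket Mode API
Discord           | discord     | Gateway WebSocket
Koder Talk native | talk-native | Direct WebSocket (Noise_XX), no bridge needed
Microsoft Teams   | teams       | Bot Framework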

3.4 Message model

// IncomingMessage represents a message received from any platform.
type IncomingMessage struct {
    // Platform identifies the source adapter (e.g., "telegram").
    Platform    string

    // GroupID is the platform-specific group/channel identifier.
    GroupID     string

    // GroupName is the human-readable group name.
    GroupName   string

    // SenderID is the platform-specific user identifier.
    SenderID    string

    // SenderName is the display name of the sender.
    SenderName  string

    // Text is the message text (may be empty for image-only messages).
    Text        string

    // Images contains attached images (screenshots, photos).
    Images      []ImageAttachment

    // Timestamp is when the message was sent (platform time).
    Timestamp   time.Time

    // ReplyTo is the message ID this is a reply to (if any).
    ReplyTo     string

    // Raw is the platform-specific raw message (for debugging).
    Raw         any
}

// ImageAttachment represents an image attached to a message.
type ImageAttachment struct {
    // URL is the download URL for the image (platform-specific).
    URL         string

    // MimeType is the MIME type (e.g., "image/png", "image/jpeg").
    MimeType    string

    // Data is the raw image bytes (populated after download).
    Data        []byte

    // Width and Height in pixels (if known).
    Width       int
    Height      int
}

3.5 Screenshot detection heuristics

The bot engine applies heuristics to decide whether an incoming message likely contains an error report:

  1. Image present: Message contains at least one image attachment.
  2. Error keywords in text: Message text (or reply context) contains keywords like "erro", "error", "bug", "quebrou", "não funciona", "travou", "caiu", "crash", "falha", "problema", "help", "broken". Keyword list is configurable and supports regex.
  3. Image content analysis: If heuristics 1+2 are inconclusive, submit the image to Vision/LLM for classification: "Does this screenshot show an application error, exception, or malfunction?" (cheap, fast — single yes/no question).
  4. Group configuration: Some groups may be configured as "error-only" (every image is treated as an error report) or "ignore" (bot never analyzes images from this group).

A message that passes detection is promoted to a ChatOpsEvent and pushed to Kortex.
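
The heuristics above could be combined roughly as follows. This is a minimal sketch, not the Talk implementation: the names GroupMode, Detection, VisionClassifier, and Detect are hypothetical, and the fixed confidence values (1.0 for group configuration, 0.8 for a keyword hit) are illustrative assumptions that the RFC does not specify.

```go
package main

import (
	"fmt"
	"strings"
)

// GroupMode mirrors the per-group setting from heuristic 4 (names are assumptions).
type GroupMode string

const (
	GroupMonitor   GroupMode = "monitor"
	GroupErrorOnly GroupMode = "error-only"
	GroupIgnore    GroupMode = "ignore"
)

// Detection is the outcome of the heuristics for one message.
type Detection struct {
	IsReport   bool
	Method     string  // "keywords", "vision", or "group_config"
	Confidence float64 // illustrative values, not taken from the RFC
}

// VisionClassifier stands in for the single yes/no Vision/LLM question.
type VisionClassifier func(image []byte) (isError bool, confidence float64)

// Detect applies group configuration, then keywords, then the vision fallback.
func Detect(mode GroupMode, text string, images [][]byte, keywords []string, vision VisionClassifier) Detection {
	// Heuristic 1: no image (or an ignored group) means nothing to analyze.
	if mode == GroupIgnore || len(images) == 0 {
		return Detection{}
	}
	// Heuristic 4: in "error-only" groups every image is an error report.
	if mode == GroupErrorOnly {
		return Detection{IsReport: true, Method: "group_config", Confidence: 1.0}
	}
	// Heuristic 2: case-insensitive keyword match on the message text.
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			return Detection{IsReport: true, Method: "keywords", Confidence: 0.8}
		}
	}
	// Heuristic 3: fall back to the vision classifier when one is configured.
	if vision != nil {
		if ok, conf := vision(images[0]); ok {
			return Detection{IsReport: true, Method: "vision", Confidence: conf}
		}
	}
	return Detection{}
}

func main() {
	keywords := []string{"erro", "bug", "crash"}
	d := Detect(GroupMonitor, "Deu erro ao salvar", [][]byte{{0x89}}, keywords, nil)
	fmt.Printf("%+v\n", d)
}
```

Checking the cheap signals first keeps the Vision/LLM call as a fallback, which matches the ordering of the list above.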

3.6 ChatOpsEvent (OTLP extension)

The bot emits a custom OTLP event when it detects a potential error report:

// ChatOpsEvent is sent by Koder Talk bot to Kortex Senses
// when a screenshot of an error is detected in a chat group.
message ChatOpsEvent {
    // Source platform (telegram, whatsapp, googlechat, etc.)
    string platform = 1;

    // Group/channel where the screenshot was posted
    string group_id = 2;
    string group_name = 3;

    // User who posted the screenshot
    string sender_id = 4;
    string sender_name = 5;

    // The screenshot image (raw bytes, typically PNG/JPEG)
    bytes screenshot = 6;
    string screenshot_mime = 7;

    // Accompanying text message (if any)
    string text = 8;

    // Detection confidence (0.0 to 1.0)
    float confidence = 9;

    // Detection method: "keywords", "vision", "group_config"
    string detection_method = 10;

    // Timestamp of the original message
    google.protobuf.Timestamp message_time = 11;

    // Reply-to context (surrounding messages for context)
    repeated ContextMessage context_messages = 12;
}

message ContextMessage {
    string sender_name = 1;
    string text = 2;
    google.protobuf.Timestamp timestamp = 3;
}

This event is sent via OTLP HTTP push to Kortex Senses (:4327), using the standard Koder event envelope with:

  • koder.product.name = "koder-talk"
  • koder.event.type = "ChatOpsEvent"
  • koder.event.severity = "warn"

4. Kortex — ChatOps processing

4.1 Senses: Talk as upstream

Talk is added as a push-only upstream in the Senses subsystem. Unlike observe/log or observe/mon which use both push and pull, Talk only pushes — there is no meaningful "pull" operation for chat messages.

Configuration:

senses:
  push:
    http_addr: ":4327"
    grpc_addr: ":4328"
    channel_size: 8192
    # Talk events accepted on the same OTLP endpoints
    # No special configuration needed — Talk sends standard OTLP

The Senses subsystem recognizes ChatOpsEvent by the koder.event.type attribute and routes it to the Brain's ChatOps analyzer.

4.2 Brain: Screenshot analysis

When the Brain receives a ChatOpsEvent, it executes the ChatOps analysis pipeline:

Step 1 — Vision extraction

Submit the screenshot to the LLM provider (RFC 005) with a Vision prompt:

Analyze this screenshot from a production application. Extract:
1. The error message (exact text)
2. The error type (exception, HTTP error, UI bug, crash, etc.)
3. The affected component (URL, page name, feature)
4. Stack trace (if visible)
5. Any other diagnostic information visible

Respond in structured JSON.

Output: ScreenshotAnalysis struct with extracted error details.
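
As an illustration of the structured output, the Vision response might be decoded like this. The RFC names the ScreenshotAnalysis struct but does not list its fields, so the field names and JSON keys below are assumptions derived from the five items in the prompt, and ParseAnalysis is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ScreenshotAnalysis holds the fields the Vision prompt asks for.
// Field names and JSON keys are assumptions, not defined by the RFC.
type ScreenshotAnalysis struct {
	ErrorMessage string   `json:"error_message"`
	ErrorType    string   `json:"error_type"`
	Component    string   `json:"component"`
	StackTrace   string   `json:"stack_trace,omitempty"`
	Diagnostics  []string `json:"diagnostics,omitempty"`
}

// ParseAnalysis decodes the model's JSON reply and rejects replies
// that do not contain an error message.
func ParseAnalysis(raw []byte) (ScreenshotAnalysis, error) {
	var a ScreenshotAnalysis
	if err := json.Unmarshal(raw, &a); err != nil {
		return a, fmt.Errorf("decode vision response: %w", err)
	}
	if a.ErrorMessage == "" {
		return a, errors.New("vision response has no error_message")
	}
	return a, nil
}

func main() {
	raw := []byte(`{"error_message":"NullPointerException","error_type":"exception","component":"agendamento"}`)
	a, err := ParseAnalysis(raw)
	if err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(a.ErrorType, "in", a.Component)
}
```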

Step 2 — Correlation

Cross-reference the extracted error with:

  • Recent deploy events from infra/jet (did a deploy happen in the last N minutes?)
  • Recent errors in observe/log matching the same error signature
  • Known incidents in Memory (has this error been seen before? what fixed it?)
  • Active rules in Reflexes (is there already a rule handling this?)

Output: CorrelationResult with probable root cause, related events, and confidence score.

Step 3 — Triage decision

Based on the analysis:

  • If error matches a known incident with a known fix → use the playbook from Memory
  • If error correlates with a recent deploy → likely regression, rollback candidate
  • If error is novel → invoke Kode for investigation and fix
  • If confidence is too low → escalate to human (post in chat asking for more context)
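
The triage branches above can be sketched as a small decision function. The RFC does not state the precedence between branches, so this sketch assumes the confidence check runs first and that a known fix wins over a recent-deploy correlation. TriageInput, Decision, and Triage are hypothetical names.

```go
package main

import "fmt"

// TriageInput summarizes what the analysis and correlation steps found.
type TriageInput struct {
	KnownFix     bool    // Memory holds a matching incident with a known fix
	RecentDeploy bool    // error correlates with a recent deploy
	Confidence   float64 // confidence score from the correlation step
}

// Decision is the outcome of triage.
type Decision string

const (
	UsePlaybook       Decision = "use_playbook"
	RollbackCandidate Decision = "rollback_candidate"
	InvokeKode        Decision = "invoke_kode"
	EscalateHuman     Decision = "escalate_human"
)

// Triage maps the analysis result to one of the four outcomes.
// Ordering is an assumption: low confidence is checked before anything else.
func Triage(in TriageInput, minConfidence float64) Decision {
	switch {
	case in.Confidence < minConfidence:
		return EscalateHuman
	case in.KnownFix:
		return UsePlaybook
	case in.RecentDeploy:
		return RollbackCandidate
	default:
		return InvokeKode
	}
}

func main() {
	fmt.Println(Triage(TriageInput{RecentDeploy: true, Confidence: 0.9}, 0.5))
}
```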

4.3 Reflexes: ChatOps rules

A new rule type chatops is added to the rules engine (RFC 006):

# rules/chatops-auto-fix.yaml
- name: chatops-auto-fix
  description: "Invoke Kode AI to fix bugs detected from chat screenshots"
  trigger:
    type: pattern
    source: koder-talk
    event_type: ChatOpsEvent
    conditions:
      - field: confidence
        op: gte
        value: 0.7
  actions:
    - invoke_kode:
        task: fix_bug
        context_from: event  # pass screenshot analysis to Kode
        repo: auto           # detect repo from error context
        branch_strategy: auto  # direct_push (autonomous) or pr (supervised)
  guards:
    mode: autonomous         # or "supervised"
    rate_limit: 10/hour
    blast_radius: single_service
    circuit_breaker:
      threshold: 3
      window: 1h
    dry_run: false
  notify:
    channel: origin          # reply in the same chat group
    on_start: true
    on_success: true
    on_failure: true

4.4 New action: invoke_kode

A new action is added to the Reflexes action catalog (RFC 006 §6):

// InvokeKodeAction spawns a Kode AI agent to investigate and fix a bug.
type InvokeKodeAction struct {
    // Task is the type of work for Kode: "fix_bug", "investigate",
    // "rollback", "write_test".
    Task            string

    // Repo is the target repository. "auto" means Kode determines
    // the repo from the error context (stack trace, URL, component name).
    Repo            string

    // BranchStrategy controls how Kode commits:
    //   "direct_push" — push directly to the release/main branch (autonomous mode)
    //   "pr" — create a feature branch and open a PR (supervised mode)
    //   "auto" — use the mode configured in the rule's guards
    BranchStrategy  string

    // ContextFrom specifies where Kode gets its context:
    //   "event" — the ChatOpsEvent (screenshot analysis, correlation)
    //   "logs" — recent error logs from observe/log
    //   "both" — merge event context with recent logs
    ContextFrom     string

    // MaxDuration is the maximum time Kode is allowed to work on this fix.
    MaxDuration     time.Duration

    // RegressionTest controls whether Kode must write a regression test.
    // Default: true.
    RegressionTest  bool
}

The action executor:

  1. Spawns a Kode agent session via the Kode API
  2. Passes the ScreenshotAnalysis + CorrelationResult as context
  3. Waits for Kode to complete (with timeout)
  4. Captures the result: commit hash, branch, PR URL, files changed
  5. If BranchStrategy == "direct_push" and the rule mode is autonomous, proceeds to deployment
  6. If BranchStrategy == "pr", posts the PR link in the chat group and waits for merge

4.5 Coordination: Release → Deploy

When Kode pushes a fix and a release is created (either manually or via automation), the Coordination subsystem handles the deploy:

Flow webhook → Kortex

Koder Flow fires a webhook on release creation:

{
  "action": "published",
  "release": {
    "tag_name": "v2.3.1",
    "target_commitish": "release",
    "name": "v2.3.1 — ChatOps auto-fix",
    "body": "Automated fix for: NullPointerException in agendamento.go:142\nTriggered by: ChatOps screenshot from grupo-poc (Telegram)\nKode session: ks-20260408-143022",
    "assets": [...]
  },
  "repository": {
    "full_name": "vivver/saude-publica"
  }
}

Kortex receives this webhook at POST /api/v1/webhooks/flow and creates a ReleaseEvent in the Senses cache.

Deploy instruction to Jet

The Coordination subsystem calls the koder-jet admin API (RFC 004):

POST /admin/deploys
Content-Type: application/json

{
  "service": "saude-publica",
  "version": "v2.3.1",
  "source": "chatops-pipeline",
  "strategy": "rolling",
  "health_check": {
    "url": "/api/health",
    "interval": "5s",
    "timeout": "30s"
  }
}

Jet pulls the new release, performs a rolling deploy (or container swap), runs health checks, and reports back to Kortex.


5. Operational modes

5.1 Autonomous mode

The full pipeline runs without human intervention:

screenshot → detect → analyze → fix → commit → release → deploy → notify

Best for:

  • PoC / pilot deployments where speed matters
  • Nighttime / weekend incidents with no staff online
  • Known-stable codebases with good test coverage
  • Environments with easy rollback (containers, blue-green)

5.2 Supervised mode

The pipeline pauses before deployment and asks for human approval:

screenshot → detect → analyze → fix → commit → PR → [ask approval] → merge → release → deploy → notify

Best for:

  • Production environments serving many users
  • Codebases without comprehensive test coverage
  • Regulated environments requiring change approval
  • New deployments where trust in AI fixes is still being established

5.3 Configuration

Mode is configured at three levels (most specific wins):

# kortex.yaml — global default
chatops:
  default_mode: supervised
  timeout_escalation: 30m  # if no human responds in 30min, escalate to autonomous

# Per-environment override
environments:
  poc.vivver.com:
    mode: autonomous
    notify: grupo-poc
    auto_release: true
    auto_deploy: true

  app.vivver.com:
    mode: supervised
    notify: grupo-dev
    require_approval_from: ["rodrigo", "francisco"]
    timeout_escalation: 60m

# Per-repository override
repositories:
  vivver/saude-publica:
    mode: autonomous  # override for this specific repo
    branch: release
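
A minimal sketch of how "most specific wins" could be resolved, assuming the repository override is more specific than the environment override, which in turn is more specific than the global default (the order in which the configuration above lists them). ModeConfig and ResolveMode are hypothetical names.

```go
package main

import "fmt"

// ModeConfig holds the three configuration levels from kortex.yaml.
type ModeConfig struct {
	DefaultMode  string
	Environments map[string]string // environment -> mode
	Repositories map[string]string // owner/name -> mode
}

// ResolveMode returns the effective mode. Later levels override earlier
// ones: global default, then environment, then repository (assumed order).
func (c ModeConfig) ResolveMode(env, repo string) string {
	mode := c.DefaultMode
	if m, ok := c.Environments[env]; ok && m != "" {
		mode = m
	}
	if m, ok := c.Repositories[repo]; ok && m != "" {
		mode = m
	}
	return mode
}

func main() {
	cfg := ModeConfig{
		DefaultMode: "supervised",
		Environments: map[string]string{
			"poc.vivver.com": "autonomous",
			"app.vivver.com": "supervised",
		},
		Repositories: map[string]string{
			"vivver/saude-publica": "autonomous",
		},
	}
	fmt.Println(cfg.ResolveMode("app.vivver.com", "vivver/saude-publica"))
}
```

Under this assumed ordering, a repository set to autonomous overrides a supervised environment, so the ordering is worth confirming before relying on it in production environments.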

5.4 Timeout escalation

When supervised mode is active but no human responds within the timeout_escalation window:

  1. Bot posts a warning: "Ninguém respondeu em 30 minutos. Deployando automaticamente em 5 minutos. Responda 'cancelar' para impedir."
  2. If no "cancelar" response in 5 minutes, the pipeline escalates to autonomous mode
  3. The escalation is logged in Kortex Memory as a TimeoutEscalation incident

This prevents fixes from being stuck in limbo when no one is available.


6. Chat feedback loop

6.1 Status messages

The bot posts progress updates back to the originating chat group at every pipeline stage:

Stage         | Message template
--------------+---------------------------------------------------------------
Detection     | "Erro detectado na screenshot de @{sender}. Analisando..."
Analysis      | "Erro identificado: {error_summary} em {component}. Buscando correção..."
Fix started   | "Kode iniciou correção no repositório {repo}. Sessão: {session_id}"
Fix committed | "Correção commitada: {commit_url}\nArquivos alterados: {files}\nTeste de regressão: incluído"
PR created    | "PR aberta: {pr_url}\nAguardando aprovação para deploy. Responda 'aprovar' ou 'rejeitar'."
Deploying     | "Deployando v{version} em {environment}..."
Deployed      | "v{version} deployada com sucesso em {environment}. Verifique se o erro foi corrigido."
Failed        | "Não consegui corrigir este erro automaticamente. Detalhes: {reason}\nEscalando para o time."
Rollback      | "Fix causou regressão. Rollback para v{prev_version} executado. Investigação manual necessária."

6.2 Interactive commands

Users can interact with the bot in the chat group:

Command            | Action
-------------------+-------------------------------------------------
aprovar / approve  | Approve pending deployment (supervised mode)
rejeitar / reject  | Reject pending fix, discard branch
cancelar / cancel  | Cancel in-progress pipeline
status             | Show current pipeline status
rollback           | Rollback to previous version
detalhes / details | Show full analysis details
ignorar / ignore   | Mark this error as known/expected, don't fix
silenciar / mute   | Temporarily mute bot notifications (1h default)
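
One way to handle the bilingual aliases is a lookup table that maps each accepted word to a canonical command. The sketch below assumes commands arrive as a single word and match case-insensitively; the Command type and ParseCommand are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// Command is the canonical form of an interactive chat command.
type Command string

const (
	CmdApprove  Command = "approve"
	CmdReject   Command = "reject"
	CmdCancel   Command = "cancel"
	CmdStatus   Command = "status"
	CmdRollback Command = "rollback"
	CmdDetails  Command = "details"
	CmdIgnore   Command = "ignore"
	CmdMute     Command = "mute"
)

// aliases maps every accepted word (pt-BR and English) to its command.
var aliases = map[string]Command{
	"aprovar": CmdApprove, "approve": CmdApprove,
	"rejeitar": CmdReject, "reject": CmdReject,
	"cancelar": CmdCancel, "cancel": CmdCancel,
	"status":   CmdStatus,
	"rollback": CmdRollback,
	"detalhes": CmdDetails, "details": CmdDetails,
	"ignorar": CmdIgnore, "ignore": CmdIgnore,
	"silenciar": CmdMute, "mute": CmdMute,
}

// ParseCommand returns the command for a chat message, if it is one.
func ParseCommand(text string) (Command, bool) {
	cmd, ok := aliases[strings.ToLower(strings.TrimSpace(text))]
	return cmd, ok
}

func main() {
	for _, msg := range []string{" Aprovar ", "mute", "bom dia"} {
		cmd, ok := ParseCommand(msg)
		fmt.Printf("%q -> %q (command: %v)\n", msg, cmd, ok)
	}
}
```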

6.3 Locale

Status messages are emitted in the language configured per group. Default: pt-BR for Vivver/Crescer groups, en-US for Koder groups.


7. Kode agent integration

7.1 Kode session lifecycle

When Kortex Reflexes triggers invoke_kode, it creates a Kode session:

type KodeSession struct {
    // SessionID is a unique identifier (e.g., "ks-20260408-143022")
    SessionID       string

    // Trigger contains the original ChatOpsEvent + analysis
    Trigger         ChatOpsTrigger

    // Repo is the resolved repository (owner/name)
    Repo            string

    // Branch is the target branch for the fix
    Branch          string

    // Mode is "autonomous" or "supervised"
    Mode            string

    // Status tracks the session state
    Status          KodeSessionStatus  // pending, running, succeeded, failed

    // Result contains the outcome (populated on completion)
    Result          *KodeResult

    // StartedAt, CompletedAt track timing
    StartedAt       time.Time
    CompletedAt     time.Time

    // MaxDuration is the hard timeout
    MaxDuration     time.Duration
}

type KodeResult struct {
    CommitHash      string
    Branch          string
    PRURL           string      // empty if direct push
    FilesChanged    []string
    TestsAdded      []string
    Summary         string      // human-readable summary of the fix
    RollbackCommit  string      // commit to revert to if fix is bad
}

7.2 Context handoff

Kode receives a structured context package:

type ChatOpsTrigger struct {
    // Screenshot analysis from Brain
    Analysis        ScreenshotAnalysis

    // Correlation with other events
    Correlation     CorrelationResult

    // Recent error logs from observe/log (if available)
    RecentLogs      []LogEntry

    // Known incidents from Memory (if similar error seen before)
    SimilarIncidents []IncidentSummary

    // Playbook from Memory (if a fix is already known)
    Playbook        *Playbook
}

7.3 Kode behavior constraints

When invoked by the ChatOps pipeline, Kode operates under specific constraints:

  1. Single-repo scope: Kode only modifies the target repository. No cross-repo changes.
  2. Regression test mandatory: Every fix must include a regression test (per CLAUDE.md rules).
  3. No destructive operations: Kode cannot delete files, drop tables, or remove features. It can only add/modify code.
  4. Time-boxed: Hard timeout of MaxDuration (default: 15 minutes). If Kode can't fix it in time, the session fails and escalates to human.
  5. Commit message format: Includes [chatops] prefix and reference to the originating chat message.

8. Flow webhook integration

8.1 Webhook configuration

Koder Flow is configured to send webhooks to Kortex on release events:

URL: https://{kortex_host}/api/v1/webhooks/flow
Events: release.published
Secret: {shared_secret}
Content-Type: application/json

8.2 Webhook receiver

Kortex exposes a webhook endpoint:

// POST /api/v1/webhooks/flow
// Receives Gitea/Forgejo webhook payloads for release events.
func (s *Server) handleFlowWebhook(w http.ResponseWriter, r *http.Request) {
    // 1. Validate HMAC signature
    // 2. Parse release payload
    // 3. Check if release was created by ChatOps pipeline (commit message contains [chatops])
    // 4. If yes, create ReleaseEvent and push to Senses
    // 5. Coordination picks up ReleaseEvent and triggers deploy via Jet
}
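
Step 1 of the handler (signature validation) might look like the sketch below. It assumes Flow signs the raw request body with HMAC-SHA256 and sends the digest hex-encoded, as Gitea-style webhooks do; which header carries the signature is not specified in this RFC. Sign and ValidSignature are hypothetical helpers.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Sign computes the hex-encoded HMAC-SHA256 of body using the shared secret.
func Sign(secret, body []byte) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hex.EncodeToString(mac.Sum(nil))
}

// ValidSignature reports whether sigHex is the expected signature for body.
// hmac.Equal compares in constant time to avoid leaking timing information.
func ValidSignature(secret, body []byte, sigHex string) bool {
	got, err := hex.DecodeString(sigHex)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(body)
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("example-shared-secret") // placeholder value
	body := []byte(`{"action":"published"}`)
	sig := Sign(secret, body)
	fmt.Println("valid:", ValidSignature(secret, body, sig))
	fmt.Println("tampered:", ValidSignature(secret, []byte(`{"action":"deleted"}`), sig))
}
```

The signature has to be checked against the raw bytes before the payload is parsed, since re-encoding the JSON would change the digest.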

8.3 Non-ChatOps releases

The webhook receiver also handles releases created by humans or other automation. If a release is not tagged as [chatops], Kortex logs it in Memory as a deploy event but does not auto-deploy (unless a separate deploy rule is configured).


9. Safety and rollback

9.1 Pre-deploy health check

Before instructing Jet to deploy, Kortex Coordination verifies:

  1. Tests passed: If the repository has CI configured, wait for CI to pass (or run tests via Kode).
  2. No conflicting deploys: Check the deploy lock (per feedback_poc_session_coordination.md protocol).
  3. Service is healthy: Current service health is OK (don't deploy on top of an existing outage caused by something else).

9.2 Post-deploy verification

After Jet reports successful deployment:

  1. Wait 60 seconds for the service to stabilize
  2. Run a health check against the deployed service
  3. If the original error was reproducible, attempt to reproduce it and confirm it no longer occurs
  4. Monitor error rate for 5 minutes — if error rate increases, auto-rollback
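
Step 4 leaves open what counts as an increase in error rate. A minimal sketch, assuming the pre-deploy error rate is used as the baseline and a configurable tolerance absorbs normal noise; ShouldRollback and both parameters are hypothetical.

```go
package main

import "fmt"

// ShouldRollback reports whether any error-rate sample taken during the
// monitoring window exceeds the pre-deploy baseline by more than tolerance.
// Rates are fractions of failed requests (0.02 means 2%).
func ShouldRollback(baseline float64, samples []float64, tolerance float64) bool {
	for _, s := range samples {
		if s > baseline+tolerance {
			return true
		}
	}
	return false
}

func main() {
	baseline := 0.02
	steady := []float64{0.02, 0.015, 0.025}
	spiking := []float64{0.02, 0.09, 0.12}
	fmt.Println("steady:", ShouldRollback(baseline, steady, 0.01))
	fmt.Println("spiking:", ShouldRollback(baseline, spiking, 0.01))
}
```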

9.3 Automatic rollback

If post-deploy verification fails:

  1. Kortex instructs Jet to rollback to the previous version
  2. Bot posts rollback notification in the chat group
  3. The failed fix is recorded in Kortex Memory as a FailedFix incident
  4. Kode session is marked as failed_post_deploy
  5. The issue is escalated to human intervention

9.4 Circuit breaker

The ChatOps pipeline has its own circuit breaker (independent of individual rule circuit breakers):

  • If 3 consecutive ChatOps-triggered fixes fail (fix doesn't compile, tests fail, post-deploy regression), the pipeline enters cooldown mode for 1 hour
  • In cooldown mode, the bot still detects and analyzes errors, but does not attempt to fix them — instead, it posts the analysis in the chat group for human action
  • The circuit breaker can be manually reset via the status command in chat

10. Observability

10.1 Metrics

The ChatOps pipeline emits Prometheus metrics:

# Detection
kortex_chatops_screenshot_detected_total{platform, group, method}
kortex_chatops_screenshot_analyzed_total{platform, group, result}

# Pipeline
kortex_chatops_pipeline_started_total{mode, repo}
kortex_chatops_pipeline_completed_total{mode, repo, result}
kortex_chatops_pipeline_duration_seconds{mode, repo, result}

# Kode sessions
kortex_chatops_kode_session_started_total{repo, task}
kortex_chatops_kode_session_completed_total{repo, task, result}
kortex_chatops_kode_session_duration_seconds{repo, task}

# Deployments
kortex_chatops_deploy_total{env, repo, result}
kortex_chatops_rollback_total{env, repo, reason}

# Circuit breaker
kortex_chatops_circuit_breaker_state{state}  # closed, open, half_open

10.2 Audit trail

Every pipeline execution is recorded in Kortex Memory (Postgres) with:

  • Full event chain (detection → analysis → fix → deploy)
  • All chat messages sent/received
  • Kode session ID and result
  • Commit hashes, PR URLs, deploy versions
  • Timing for each stage
  • Mode (autonomous/supervised) and whether escalation happened

11. Configuration reference

11.1 Talk bot configuration (talkd.toml)

[bot]
enabled = true

# Platform adapters
[[bot.adapters]]
name = "telegram"
enabled = true
token = "${TELEGRAM_BOT_TOKEN}"  # env var reference

[[bot.adapters]]
name = "whatsapp"
enabled = true
device_store = "/var/lib/koder-talk/whatsapp-device.db"

[[bot.adapters]]
name = "googlechat"
enabled = false
service_account = "/etc/koder-talk/google-chat-sa.json"

# Groups to monitor
[[bot.groups]]
platform = "telegram"
group_id = "-1001234567890"
name = "grupo-poc"
mode = "monitor"          # "monitor" (detect errors) or "ignore"
locale = "pt-BR"

[[bot.groups]]
platform = "whatsapp"
group_id = "120363123456789@g.us"
name = "suporte-vivver"
mode = "monitor"
locale = "pt-BR"

# Detection settings
[bot.detection]
keywords = ["erro", "error", "bug", "quebrou", "não funciona", "travou", "caiu", "crash", "falha", "problema", "broken", "failed"]
keyword_regex = []        # additional regex patterns
vision_fallback = true    # use Vision/LLM when keywords are inconclusive
min_confidence = 0.5      # minimum confidence to emit ChatOpsEvent

# Kortex integration
[bot.kortex]
endpoint = "http://127.0.0.1:4327"  # OTLP push endpoint

11.2 Kortex ChatOps configuration (kortex.yaml)

chatops:
  enabled: true
  default_mode: supervised   # "autonomous" or "supervised"
  timeout_escalation: 30m    # escalate to autonomous if no human response

  # Kode agent settings
  kode:
    max_duration: 15m
    regression_test: true
    branch_prefix: "chatops-fix/"

  # Per-environment overrides
  environments:
    poc.vivver.com:
      mode: autonomous
      auto_release: true
      auto_deploy: true
      notify_group: grupo-poc

    app.vivver.com:
      mode: supervised
      require_approval: true
      timeout_escalation: 60m
      notify_group: grupo-dev

  # Per-repository overrides
  repositories:
    vivver/saude-publica:
      mode: autonomous
      branch: release

  # Circuit breaker
  circuit_breaker:
    threshold: 3             # failures before opening
    window: 1h
    cooldown: 1h

  # Flow webhook
  flow_webhook:
    secret: "${FLOW_WEBHOOK_SECRET}"
    path: /api/v1/webhooks/flow

12. Implementation phases

Phase 1 — Foundation (tickets 008–010)

  • Talk bot mode with Telegram adapter (simplest to implement)
  • Kortex ChatOpsEvent ingestion in Senses
  • Brain screenshot analysis (Vision/LLM)
  • Basic invoke_kode action in Reflexes
  • Chat feedback loop (status messages)

Phase 2 — Full pipeline (tickets 011–012)

  • WhatsApp and Google Chat adapters
  • Flow webhook → Kortex integration
  • Kortex → Jet deploy coordination
  • Autonomous/supervised mode switching
  • Timeout escalation
  • Post-deploy verification and auto-rollback

Phase 3 — Intelligence (tickets 013–014)

  • Memory integration (incident history, playbooks)
  • LLM-proposed ChatOps rules (learning loop from RFC 006)
  • Interactive commands in chat (approve, reject, rollback)
  • Circuit breaker dashboard in Kortex UI
  • Metrics and observability

13. Open questions

# | Question | Proposed answer | Status
1 | Should Kode sessions be visible in the Kortex UI? | Yes — dedicated "ChatOps Sessions" page with status, logs, and replay | Open
2 | Should the bot support audio messages (voice notes describing errors)? | Yes — delivered in koder-talk v0.2.0 (ticket 005 voice, ticket 006 video) + Kortex v0.2.0 (ticket 018 Brain voice/video handlers). Voice notes are common in Brazilian WhatsApp chats; deferring this lost too many real bug reports. | Resolved
3 | Should the pipeline support multi-repo fixes? | No for Phase 1. Single-repo constraint keeps blast radius contained. | Resolved
4 | Rate limiting per chat group? | Yes — configurable, default 10 events/hour per group to prevent spam floods | Resolved
5 | Should Kode be able to ask clarifying questions in chat? | Phase 2 — bot asks, user responds, Kode continues with additional context | Open
6 | Should the pipeline accept any implementation request, not just bug reports? | Yes — delivered in koder-talk v0.3.0 (ticket 008 feature-request detection) + Kortex v0.3.0 (ticket 023 AnalyzeFeatureRequest handler). The bot detects both error keywords and feature keywords (pt-BR + en-US), the Brain routes by body.intent = error_report | feature_request, and feature requests are hard-pinned to supervised mode (PR, never direct push) via FeatureRequestTriageResult.Supervised = true. New koder.chatops.text event type covers text-only requests without media. | Resolved
7 | Should the Brain expose per-analyzer metrics for Koder Mon to scrape? | Yes — delivered in Kortex v0.4.0 (ticket 018 follow-up). Every analyzer path (screenshot, voice, video, feature-request, browser-error) has a counter + histogram at :9190/metrics, using prometheus/client_golang directly — same convention as the 12 other Koder products already instrumented. Registered as a scrape target in observe/mon (koder-kortex in the live koder-mon-server.yaml). Labels are minimal on purpose — high-cardinality context (sender, group, raw error) stays in structured logs. | Resolved

14. Relationship to existing RFCs

RFC | Relationship
001 — Ecosystem Map | Talk is a new upstream in the ecosystem. No schema ownership conflict — Talk emits standard OTLP events.
002 — Architecture | ChatOps is a concrete use case exercising all five subsystems end-to-end.
003 — Common Event Schema | ChatOpsEvent extends the schema with a new event type. Must be registered in observe/observability.
004 — Common Control Plane | Deploy action uses the standard /admin/deploys endpoint added in RFC 004 §5.6.
005 — LLM Provider | Vision analysis uses the LLM provider abstraction. Kode sessions are a new tool use pattern.
006 — Rules Engine | invoke_kode is a new action in the action catalog. chatops is a new rule type.

15. Resolved decisions

  1. Talk as event source, not Kortex scraping chat: Talk is the right module to monitor chats — it already has the transport layer and will have platform adapters. Kortex should not embed chat platform SDKs.
  2. OTLP for Talk → Kortex communication: Standard event protocol, no custom RPC needed. Talk pushes, Kortex receives on existing endpoints.
  3. Autonomous mode is safe because of existing guards: The rules engine (RFC 006) already provides rate limits, blast radius, circuit breakers, and dry-run. Autonomous mode just means require_approval: false — all other guards remain active.
  4. Single-repo constraint for Phase 1: Multi-repo fixes are complex and error-prone. Phase 1 keeps it simple. Cross-repo orchestration is a Coordination concern for later.
  5. Regression test is mandatory: Every ChatOps fix must include a regression test. This is non-negotiable — it's the safety net that prevents the same bug from recurring.
  6. Intent-based dispatch instead of event-type proliferation: When Phase 3 added feature requests, the natural shape would have been new event types (koder.chatops.feature_request.*). Instead, all existing koder.chatops.* event types gained a body.intent field (error_report | feature_request) and the Brain dispatches by intent. This keeps the wildcard Reflexes rule (chatops-auto-fix) matching without modification and lets talkd and kortex evolve independently. The only new event type is koder.chatops.text for text-only messages without any media.
  7. Feature requests are hard-pinned to supervised mode: Unlike bug fixes, which can run autonomously in trusted environments, feature requests always run in supervised mode (PR against target branch, never direct push). FeatureRequestTriageResult.Supervised is hardcoded true and the Reason string carries an explicit supervised-only instruction that downstream Kode reads at runtime. Rationale: implementing a feature without review has larger blast radius than fixing a regression — the test suite does not catch "feature was wrong" the way it catches "bug came back".
  8. Prometheus metrics via the default registry: The Brain exposes per-analyzer counters + histograms at :9190/metrics using promauto against the default Prometheus registry. This is the same convention used by the 12 other Koder products instrumented with prometheus/client_golang. Koder Mon scrapes the endpoint. Labels are minimal — cardinality is bounded on purpose, and rich context (sender name, group name, raw error message) stays in structured logs, not in metrics.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/kortex-007-chatops-pipeline.kmd