RFC 015 — Browser Event Source for ChatOps
Approved ChatOps extension
- Tracking ticket: backlogpending015-rfcbrowsereventsource.md
- Depends on:
  - RFC 003, Common Event Schema (Approved, 2026-04-08): extended here with `koder.browser.*` event types
  - RFC 004, Common Control Plane (Approved): defines the koder-jet admin API used to enable/disable browser tap per site
  - RFC 007, ChatOps Pipeline (Approved, 2026-04-14): this RFC adds a third event source to the pipeline defined there
- Implementation home: `infra/jet` (injection + ingestion + enrichment) and `platform/kortex` (event ingestion + new chatops rules)
- Status: Approved (2026-04-14)
1. Summary
This RFC defines the Browser Event Source, a new upstream for the Kortex ChatOps pipeline that captures client-side errors automatically and proactively from every web application served by `koder-jet`. It is the third event source after `observe/log` (server-side logs) and `koder-talk` (chat screenshots), and it closes the "client-side blind spot" explicitly identified in RFC 007 §2.3.
The mechanism is straightforward: koder-jet, the reverse proxy that fronts every Koder web property, injects a small JavaScript payload (`tap.js`, ~3-5 KiB minified + gzipped) into every HTML response. The payload installs error listeners (`window.onerror`, `unhandledrejection`, console wrapping, ResourceTiming, CSP violations, Web Vitals) and beacons batched events back to a jet-served endpoint. Jet enriches each event with deployment context (service name, version, deploy id, environment), scrubs PII, deduplicates floods, and forwards to Kortex Senses as standard OTLP events with the new `koder.browser.*` event type family.
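The capture-and-beacon flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `tap.js`: the `BeaconQueue` class, the `TapEvent` shape, the `maxBatch` parameter, and the injected `send` callback are all hypothetical names. In a browser the callback would typically wrap `navigator.sendBeacon`.

```typescript
// Hypothetical sketch of tap.js batching; every name here is illustrative.
type TapEvent = { type: string; message: string; timestampMs: number };

// `send` is injected so the queue can be exercised outside a browser.
// In a page it might wrap: navigator.sendBeacon("/_jet/beacon", payload)
type SendFn = (payload: string) => boolean;

class BeaconQueue {
  private pending: TapEvent[] = [];
  private send: SendFn;
  private maxBatch: number;

  constructor(send: SendFn, maxBatch: number = 20) {
    this.send = send;
    this.maxBatch = maxBatch;
  }

  // Buffer an event; flush automatically once the batch is full.
  push(event: TapEvent): void {
    this.pending.push(event);
    if (this.pending.length >= this.maxBatch) this.flush();
  }

  // Send everything buffered as one JSON payload; returns the number of
  // events handed to `send`. Events are dropped if `send` fails (no retry).
  flush(): number {
    if (this.pending.length === 0) return 0;
    const batch = this.pending;
    this.pending = [];
    this.send(JSON.stringify({ events: batch }));
    return batch.length;
  }
}

// Browser wiring, shown as comments because it needs DOM globals:
// const queue = new BeaconQueue((p) => navigator.sendBeacon("/_jet/beacon", p));
// window.addEventListener("error", (e) =>
//   queue.push({ type: "koder.browser.error.js", message: e.message, timestampMs: Date.now() }));
// window.addEventListener("unhandledrejection", (e) =>
//   queue.push({ type: "koder.browser.error.unhandled_rejection", message: String(e.reason), timestampMs: Date.now() }));
// document.addEventListener("visibilitychange", () => {
//   if (document.visibilityState === "hidden") queue.flush();
// });
```

Injecting the transport keeps the batching logic independent of the page, which makes it easy to exercise in isolation; the real payload would also gzip the batch, which is omitted here.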
The key design principles are:
- Proactive over reactive. Browser events are captured automatically, without the user having to remember to send a screenshot. Most users hit a bug and abandon the page; the bug is gone unless we caught it in the browser.
- Zero application code. The mechanism is reverse-proxy-level. Application teams do not need to install an SDK, link a library, or change their build. Opt-in is per-site in `sites.toml` and applies immediately to the next request.
- OTLP all the way. The wire format follows RFC 003 (Common Event Schema). No bespoke protocol. Events flow into Kortex Senses on the same OTLP endpoints already in production.
- Privacy by default. No URL query strings, no input values, no cookies, no localStorage. Stack traces are resolved server-side via private source maps so the maps never leave the jet host. PII scrubbing happens before the event leaves the user's network in the case of self-hosted deployments.
- Bounded cost. Per-site sampling, per-error rate limiting, and per-user circuit breaking ensure that an application bug that triggers thousands of console errors per second cannot fill the Kortex queue or run up an LLM bill.
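To make the bounded-cost principle concrete, here is one possible shape for the limiting logic. It is a sketch only: the fixed-window strategy, the `WindowLimiter` and `sampled` names, and the limits used are assumptions, not the mechanism this RFC specifies. The same limiter could be keyed by error hash (per-error rate limiting) or by session id (per-user circuit breaking).

```typescript
// Hypothetical fixed-window limiter; strategy, names and limits are assumptions.
class WindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if an event for `key` may be forwarded at time `nowMs`.
  // Eviction of idle keys is omitted to keep the sketch short.
  allow(key: string, nowMs: number): boolean {
    const entry = this.counts.get(key);
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      // First event for this key, or the previous window has expired.
      this.counts.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Per-site sampling: keep an event when the random draw falls below the rate.
// `draw` is a parameter so the decision can be made deterministic in tests.
function sampled(sampleRate: number, draw: number = Math.random()): boolean {
  return draw < sampleRate;
}
```

With a limit of 2 per second, a flood of identical errors forwards two events per window and drops the rest, which is the property that keeps a runaway console loop from reaching the queue.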
The output is a new, fully-supported source feeding RFC 007's existing chatops rules engine. The same autonomous/supervised mode switching, the same Kode invocation, the same chat feedback loop. From the rules engine's perspective, a koder.browser.error event is just another input — the same downstream pipeline that handles screenshot reports handles JS exceptions.
1.1 The pipeline at a glance

    [ Browser ]                                        [ koder-jet ]                [ Kortex Senses ]
    │                                                        │                              │
    │ GET /index.html ─────────────────────────────────────▶ │                              │
    │ ◀── HTML + injected <script src="/_jet/tap.js"> ────── │                              │
    │                                                        │                              │
    │ GET /_jet/tap.js ────────────────────────────────────▶ │                              │
    │ ◀── tap.js (~3-5 KiB minified, gzipped, cached) ────── │                              │
    │                                                        │                              │
    │ [ tap.js installs listeners ]                          │                              │
    │   • window.onerror                                     │                              │
    │   • window.onunhandledrejection                        │                              │
    │   • console.error wrapper                              │                              │
    │   • PerformanceObserver (longtask, paint, lcp)         │                              │
    │   • ResourceTiming (failed CSS/JS/img/fetch)           │                              │
    │   • CSP violation reports                              │                              │
    │   • fetch / XHR monkey-patch (5xx, network errors)     │                              │
    │   • visibilitychange / beforeunload (final flush)      │                              │
    │                                                        │                              │
    │ POST /_jet/beacon ─[batched JSON, gzipped, sendBeacon]▶│                              │
    │                                                        │                              │
    │                                                        │ [ jet enrichment ]           │
    │                                                        │   • add resource attrs:      │
    │                                                        │     service, version, env,   │
    │                                                        │     deploy id, build sha     │
    │                                                        │   • resolve stack traces     │
    │                                                        │     via private source maps  │
    │                                                        │   • PII scrub                │
    │                                                        │   • dedupe (error_hash + ttl)│
    │                                                        │   • per-user circuit break   │
    │                                                        │                              │
    │                                                        │ POST /v1/events ────────────▶│ ChatOps rules engine
    │                                                        │   event_type:                │ (RFC 007)
    │                                                        │   "koder.browser.error"      │

2. Context and motivation
2.1 The client-side blind spot
RFC 007 §2.3 enumerated the gap explicitly:
> - Some errors are client-side only (JavaScript exceptions, UI glitches, rendering bugs)
> - Some errors are transient and don't leave persistent log traces
> - Some errors are configuration issues that don't throw exceptions
> - Users often report behavioral bugs ("this button does nothing") that have no error log
Today the ChatOps pipeline has two event sources:
- `observe/log`: server-side logs. Catches anything that produces a log line on the backend. Misses everything that happens after the HTML is served: JavaScript exceptions, fetch failures the backend never saw, render bugs, asset 404s served from CDN, CSP violations, third-party widget breakage.
- `koder-talk` (RFC 007 §3): screenshots posted in chat groups. Reactive: depends on a user noticing, taking a screenshot, opening the right group, and sending it. Even motivated users do this maybe one time in twenty.
The result is that the pipeline is blind to a significant fraction of real production bugs. A JavaScript exception that breaks a form submit will never reach Kortex unless the user manually reports it.
2.2 The PoC precedent (revisited)
The Pouso Alegre, MG PoC (March-April 2026) proved that autonomous monitoring + AI fix + auto-deploy works for server-side, log-based detection. The same loop should work for browser-side detection, and arguably better, because client-side errors usually carry richer context (URL, user agent, full stack trace, the exact page state that triggered the bug) than a server log line typically does.
The PoC also established the trust model: users want fixes to happen automatically. The friction of reporting bugs is the bottleneck, not the willingness to receive fixes. Removing the reporting step entirely is the next obvious move.
2.3 Why koder-jet is the right home
koder-jet is the reverse proxy that fronts every Koder web property. Three properties make it the natural injection point:
- It already touches every HTML response. Adding a `<script>` tag to the response body is trivial: the streaming HTML rewriter already exists for the benchmark and observability injections.
- It already knows the deployment context. The site config in `sites.toml` knows which service the request is going to, what version is currently deployed, what environment, and which release artifact. Every browser event can be enriched with this metadata at the edge, with zero application participation.
- It already talks to Kortex. Jet sends deploy events to Kortex on every release. The OTLP push channel is established. Adding a new event type is a config change.
2.4 Why not Sentry / Bugsnag / Datadog RUM
Three reasons:
- Self-hosting requirement. Koder products are deployed in environments (PoC servers, on-prem hospital systems, regulated tenants) where sending raw browser data to a SaaS RUM provider is not acceptable on privacy or compliance grounds.
- Pipeline integration. Off-the-shelf RUM products produce dashboards. The Koder ChatOps pipeline produces fixes. The output of the RUM tool is the input of the autonomous fix loop; the connector between them is the interesting work, and we may as well own the whole stack.
- Cost. SaaS RUM providers charge per session or per event. At Koder's planned scale (every customer site fronted by jet), the per-event cost of a SaaS solution dominates. A homegrown jet-based capture is a fixed cost.
This RFC does not preclude also shipping events to a third-party RUM tool when a customer wants it. The jet endpoint can fan out. But the canonical destination is Kortex.
3. Event categories
The browser tap captures events in the following categories. Each category maps to a specific koder.event.type value under the koder.browser.* namespace.
| Category | Event type | Source API | Triggered by |
|---|---|---|---|
| **JavaScript exception** | `koder.browser.error.js` | `window.onerror`, `addEventListener("error")` | Uncaught throw, syntax error, runtime error in script |
| **Promise rejection** | `koder.browser.error.unhandled_rejection` | `window.onunhandledrejection` | Async/await error not caught, `.then` chain without `.catch` |
| **Console error** | `koder.browser.error.console` | `console.error` monkey-patch | Application code that logs an error without throwing |
| **Network failure** | `koder.browser.error.network` | `fetch` / `XMLHttpRequest` monkey-patch + `PerformanceResourceTiming` | 5xx response, CORS error, network unreachable, certificate error, abort |
| **Asset load failure** | `koder.browser.error.asset` | `error` event on `<img>`, `<script>`, `<link>`, `<iframe>` | 404 / 5xx on CSS, JS, image, font, video, iframe |
| **CSP violation** | `koder.browser.error.csp` | `securitypolicyviolation` event | Inline script blocked, eval blocked, foreign origin blocked |
| **Long task** | `koder.browser.perf.longtask` | `PerformanceObserver({type:"longtask"})` | Main thread blocked > 50ms (default; configurable) |
| **Layout shift (CLS)** | `koder.browser.perf.layout_shift` | `PerformanceObserver({type:"layout-shift"})` | Cumulative Layout Shift > 0.25 over the session |
| **Slow LCP** | `koder.browser.perf.lcp_slow` | `PerformanceObserver({type:"largest-contentful-paint"})` | LCP > 4s (poor per Web Vitals) |
| **Page abandon after error** | `koder.browser.behavior.abandon` | `visibilitychange` + recent error in window | User closes tab within N seconds of an error event firing |
| **Rage click** (Phase 2) | `koder.browser.behavior.rage_click` | Click handler heuristic | ≥3 clicks on the same element within 1 second with no resulting state change |
| **Dead click** (Phase 2) | `koder.browser.behavior.dead_click` | Click handler + mutation observer | Click on element that has no observable effect within 500ms |
The first 6 categories (errors) are the high-value events that flow into the ChatOps fix pipeline. Categories 7-9 (perf) are signals that feed dashboards and trend analysis but do not normally trigger autonomous fixes. Categories 10-12 (behavior) are user experience events that complement the technical signals: a rage click is the strongest possible signal that something is broken even when no exception fires.
3.1 Phasing
| Phase | Categories included | Notes |
|---|---|---|
| Phase 1 | js, unhandled_rejection, console, network, asset, csp | The technical errors. Highest signal, lowest privacy risk, simplest to implement. |
| Phase 2 | longtask, layout_shift, lcp_slow, abandon | Performance and abandonment. Useful for trend analysis and as secondary signals for the fix pipeline. |
| Phase 3 | rage_click, dead_click | Behavioral. Highest privacy risk (requires DOM inspection). Optional even after Phase 3 ships. |
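The category and phasing tables can be expressed as a small lookup structure. The sketch below is illustrative: the `CATEGORIES` map and both helper functions are hypothetical names, the event type strings come from the category table, and the phase numbers follow the phasing table in §3.1.

```typescript
// Illustrative catalog derived from the tables in §3 and §3.1.
type Group = "error" | "perf" | "behavior";

interface CategoryInfo {
  eventType: string;
  group: Group;
  phase: 1 | 2 | 3;
}

const CATEGORIES: Record<string, CategoryInfo> = {
  js: { eventType: "koder.browser.error.js", group: "error", phase: 1 },
  unhandled_rejection: { eventType: "koder.browser.error.unhandled_rejection", group: "error", phase: 1 },
  console: { eventType: "koder.browser.error.console", group: "error", phase: 1 },
  network: { eventType: "koder.browser.error.network", group: "error", phase: 1 },
  asset: { eventType: "koder.browser.error.asset", group: "error", phase: 1 },
  csp: { eventType: "koder.browser.error.csp", group: "error", phase: 1 },
  longtask: { eventType: "koder.browser.perf.longtask", group: "perf", phase: 2 },
  layout_shift: { eventType: "koder.browser.perf.layout_shift", group: "perf", phase: 2 },
  lcp_slow: { eventType: "koder.browser.perf.lcp_slow", group: "perf", phase: 2 },
  abandon: { eventType: "koder.browser.behavior.abandon", group: "behavior", phase: 2 },
  rage_click: { eventType: "koder.browser.behavior.rage_click", group: "behavior", phase: 3 },
  dead_click: { eventType: "koder.browser.behavior.dead_click", group: "behavior", phase: 3 },
};

// Only the error group normally flows into the ChatOps fix pipeline.
function feedsFixPipeline(category: string): boolean {
  const info = CATEGORIES[category];
  return info !== undefined && info.group === "error";
}

// Names of all categories available once the given phase has shipped.
function categoriesUpToPhase(phase: 1 | 2 | 3): string[] {
  return Object.entries(CATEGORIES)
    .filter(([, info]) => info.phase <= phase)
    .map(([name]) => name);
}
```

For example, `categoriesUpToPhase(1)` yields the six technical error categories, and `feedsFixPipeline("longtask")` is false because perf signals feed dashboards rather than autonomous fixes.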
4. The BrowserEvent schema
The schema extends RFC 003 — Common Event Schema. Browser events ride on the standard OTLP Logs records with a structured body and the type discriminator under koder.event.type. This RFC defines the body shape and the resource attributes added by jet at the edge.
4.1 Resource attributes (added by jet)
These are attached to every event by koder-jet after it receives the beacon, before forwarding to Kortex. They are the same attributes already present on every other Koder OTLP event (RFC 003 §4) — reused without extension.
| Attribute | Source | Example |
|---|---|---|
| `koder.product.name` | `sites.toml` site definition | `"saude-publica"` |
| `koder.product.version` | jet deploy state | `"v2.3.0"` |
| `koder.deployment.env` | `sites.toml` | `"prd"`, `"stg"`, `"dev"` (3-letter canonical per policies/environments.kmd; legacy `"production"`/`"staging"` accepted by jet validator with deprecation warning; jet#139) |
| `koder.deployment.deploy_id` | jet deploy state | `"dpl-20260408-153022"` |
| `koder.deployment.build_sha` | jet deploy state | `"a8f7c3d"` |
| `koder.tenant.id` | request header (if present) | `"tenant-vivver-pousoalegre"` |
| `koder.event.type` | constant per category | `"koder.browser.error.js"` |
| `koder.event.severity` | per category default | `"error"`, `"warn"`, `"info"` |
Note that koder.product.version, deploy_id, and build_sha come from jet's own deploy state, not from the browser. The browser cannot lie about which version is deployed — jet knows because it served the assets.
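The "browser cannot lie" property amounts to jet overwriting every `koder.*` attribute from its own state. A minimal sketch, assuming a hypothetical `DeployState` record and `enrich` function; jet itself vendors a Go SDK, so TypeScript is used here only to keep the examples in one language, and whether beacons carry any attributes at all is an assumption of this sketch.

```typescript
// Hypothetical shape of the per-site deploy state held by jet.
interface DeployState {
  productName: string;
  version: string;
  env: "prd" | "stg" | "dev";
  deployId: string;
  buildSha: string;
}

// Build the resource attributes for one event. Any koder.* key supplied by
// the browser is discarded; jet's own values are always authoritative.
function enrich(
  beaconAttrs: Record<string, string>,
  deploy: DeployState,
  eventType: string,
  severity: "error" | "warn" | "info",
  tenantId?: string,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(beaconAttrs)) {
    if (!key.startsWith("koder.")) out[key] = value;
  }
  out["koder.product.name"] = deploy.productName;
  out["koder.product.version"] = deploy.version;
  out["koder.deployment.env"] = deploy.env;
  out["koder.deployment.deploy_id"] = deploy.deployId;
  out["koder.deployment.build_sha"] = deploy.buildSha;
  out["koder.event.type"] = eventType;
  out["koder.event.severity"] = severity;
  // koder.tenant.id is only set when the request header was present.
  if (tenantId !== undefined) out["koder.tenant.id"] = tenantId;
  return out;
}
```

Dropping browser-supplied `koder.*` keys before writing jet's values means a tampered beacon cannot misattribute an error to a different version or deploy.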
4.2 Body fields (browser-supplied)
The event body is a JSON object. Fields are conditionally present based on event category. The following table is the union; per-category required fields are noted.
| Field | Type | Required for | Description |
|---|---|---|---|
| `event_id` | string (uuid v7) | all | Unique per event |
| `session_id` | string | all | Per-tab session id (lifetime: page load to unload) |
| `page_url` | string (sanitized) | all | URL with query string and fragment stripped |
| `page_route` | string | all | Logical route, if the app uses a router (set via tap.js API) |
| `referrer` | string (sanitized) | all | Same sanitization as `page_url` |
| `user_agent` | string | all | Raw UA string |
| `viewport` | object `{w, h}` | all | Viewport pixels |
| `timestamp_ms` | int64 | all | Browser timestamp, in milliseconds |
| `error_message` | string | js, unhandled_rejection, console | The thrown message / rejection reason / first console arg |
| `error_name` | string | js, unhandled_rejection | `Error.name` (e.g., `"TypeError"`) |
| `stack_trace` | string | js, unhandled_rejection | Raw client-side stack trace |
| `source_file` | string | js | The file where the error originated (post-resolution) |
| `source_line` | int | js | Source line (post-resolution) |
| `source_column` | int | js | Source column (post-resolution) |
| `script_url` | string | asset | URL of the failing asset |
| `network_url` | string (sanitized) | network | URL of the failing fetch |
| `network_status` | int | network | HTTP status code (0 if no response) |
| `network_method` | string | network | HTTP method |
| `csp_directive` | string | csp | Violated directive |
| `csp_blocked_uri` | string | csp | URI that was blocked |
| `duration_ms` | int | longtask, lcp_slow | Performance metric value |
| `cls_value` | float | layout_shift | CLS score |
| `error_hash` | string (sha256[:16]) | all | Hash of the deduplication key (see §6.2) |
| `breadcrumbs` | array of objects | optional | Last N user actions / navigations / network requests preceding the error (Phase 2) |
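The sanitization applied to `page_url`, `referrer`, and `network_url` (query string and fragment stripped) can be sketched with plain string handling. The function name `sanitizeUrl` is hypothetical, and the real implementation may apply further scrubbing that this RFC section does not describe.

```typescript
// Strip the query string and fragment from a URL, keeping scheme, host and path.
// Works on absolute and relative URLs because it never parses the authority.
function sanitizeUrl(raw: string): string {
  const queryStart = raw.indexOf("?");
  const fragmentStart = raw.indexOf("#");
  // Cut at whichever delimiter appears first; -1 means "not present".
  const cuts = [queryStart, fragmentStart].filter((index) => index >= 0);
  return cuts.length === 0 ? raw : raw.slice(0, Math.min(...cuts));
}
```

For instance, `sanitizeUrl("https://saude.poc.vivver.com/agendamento?cpf=123#top")` returns `"https://saude.poc.vivver.com/agendamento"`, so identifiers carried in query parameters never reach the beacon.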
4.3 Concrete example — JavaScript exception

    {
      "resource_attributes": {
        "koder.product.name": "saude-publica",
        "koder.product.version": "v2.3.0",
        "koder.deployment.env": "prd",
        "koder.deployment.deploy_id": "dpl-20260408-153022",
        "koder.deployment.build_sha": "a8f7c3d",
        "koder.event.type": "koder.browser.error.js",
        "koder.event.severity": "error"
      },
      "timestamp_unix_nano": 1712592623123456000,
      "body": {
        "event_id": "01HX4M2AWPY8Z5Q7K4F8R3N0Z6",
        "session_id": "sess-7f3a2b",
        "page_url": "https://saude.poc.vivver.com/agendamento",
        "page_route": "/agendamento",
        "referrer": "https://saude.poc.vivver.com/dashboard",
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
        "viewport": {"w": 1920, "h": 1080},
        "timestamp_ms": 1712592623123,
        "error_message": "Cannot read properties of null (reading 'patientId')",
        "error_name": "TypeError",
        "stack_trace": "TypeError: Cannot read properties of null (reading 'patientId')\n at AgendamentoForm.handleSubmit (agendamento.js:142:21)\n at HTMLFormElement.<anonymous> (agendamento.js:89:12)",
        "source_file": "src/components/AgendamentoForm.tsx",
        "source_line": 142,
        "source_column": 21,
        "error_hash": "a8f7c3d5b2e91047"
      }
    }

4.4 Schema registration
This RFC requires registering the new event types in infra/observe/observability/schemas/browser/v1/. The schema lives there per RFC 003 §10. The Protobuf source becomes the canonical definition; jet vendors the generated Go SDK; tap.js uses a hand-authored TypeScript type that mirrors the schema (validated in CI against the JSON Schema generated from the Protobuf).
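As an illustration of what the hand-authored TypeScript mirror might look like, the sketch below transcribes the §4.2 table into an interface plus a required-field check. The names `BrowserEventBody`, `REQUIRED_BY_CATEGORY`, and `missingFields` are hypothetical; the canonical definition remains the Protobuf source, and `breadcrumbs` is omitted for brevity.

```typescript
// Hypothetical mirror of the §4.2 body schema; not the generated type.
interface BrowserEventBody {
  event_id: string;
  session_id: string;
  page_url: string;
  page_route: string;
  referrer: string;
  user_agent: string;
  viewport: { w: number; h: number };
  timestamp_ms: number;
  error_hash: string;
  error_message?: string;
  error_name?: string;
  stack_trace?: string;
  source_file?: string;
  source_line?: number;
  source_column?: number;
  script_url?: string;
  network_url?: string;
  network_status?: number;
  network_method?: string;
  csp_directive?: string;
  csp_blocked_uri?: string;
  duration_ms?: number;
  cls_value?: number;
}

type Field = keyof BrowserEventBody;

// Fields marked "all" in the "Required for" column.
const COMMON_FIELDS: Field[] = [
  "event_id", "session_id", "page_url", "page_route", "referrer",
  "user_agent", "viewport", "timestamp_ms", "error_hash",
];

// Additional required fields per category, transcribed from the table.
const REQUIRED_BY_CATEGORY: Record<string, Field[]> = {
  js: ["error_message", "error_name", "stack_trace", "source_file", "source_line", "source_column"],
  unhandled_rejection: ["error_message", "error_name", "stack_trace"],
  console: ["error_message"],
  network: ["network_url", "network_status", "network_method"],
  asset: ["script_url"],
  csp: ["csp_directive", "csp_blocked_uri"],
  longtask: ["duration_ms"],
  lcp_slow: ["duration_ms"],
  layout_shift: ["cls_value"],
};

// Returns the required fields that are absent from `body` for a category.
function missingFields(category: string, body: Partial<BrowserEventBody>): Field[] {
  const required = [...COMMON_FIELDS, ...(REQUIRED_BY_CATEGORY[category] ?? [])];
  return required.filter((field) => body[field] === undefined);
}
```

A check like this could run in CI alongside the JSON Schema validation, catching drift between the hand-authored type and the table before it reaches production.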
5. koder-jet — the injection and ingestion side
This section is the implementation contract for the infra/jet side of the work. It is intentionally specific so that ticket 071 (and follow-ups) can be implemented without ambiguity.
5.1 Per-site opt-in
Browser tap is opt-in per site. Sites that already have Sentry/Datadog/etc., or sites that handle highly sensitive content (banking, healthcare under strict compliance), can leave it off. The site config in `sites.toml` gains a new section:
[sites.routes."/"]
upstream = "http://127.0.0.1:8001"
[sites.browser_tap]
enabled = true
sample_rate = 1.0 # 0.0 to 1.0
exclude_routes = ["/admin