Self-hosted CI runner pool — move CI off the dev laptop

draft

infra-RFC-002 — Self-hosted CI runner pool

Status

*Draft.* Opened 2026-05-20 after a k-ship wave exposed that the single CI runner lives on the developer's laptop and rsyncs the dev worktree directly. Of 8 release tags shipped during the wave, ≥5 release pipelines failed with the same root cause (rsync hitting root-owned files inside `infra/linux/distro/chroot/`). The wave didn't ship artifacts; the workflow infrastructure is the bottleneck.

Summary

Move Koder Flow Actions execution *off the developer laptop* onto a dedicated container pool in s.khost1. Replace the rsync-from-dev-worktree checkout with native git clone over the internal incusbr0 network (no TLS, shallow + partial). Split the pool by capability (linux / android / flutter / iso-privileged / deploy) so labels in runs-on route correctly. Decommission the laptop runner once the pool covers all surfaces.

The shipping bug fixed in commit `ci(workflows): exclude infra/linux/distro/chroot/ from LOCAL_REPO rsync` is the *patch*. This RFC is the *structural* fix.

Motivation

What broke

The laptop runner (act_runner v0.2.11, registered system-wide) runs as the laptop user koder in `/home/koder/dev/act_runner/`. Every release workflow contains this checkout block:

- name: Checkout (local rsync)
  env:
    LOCAL_REPO: /home/koder/dev/koder
  run: |
    rsync -a --delete \
      --exclude='.git' \
      --exclude='*/build/' \
      ...
      "${LOCAL_REPO}/" "${GITHUB_WORKSPACE}/"

The comment in jet-release.yml explains why: "cloning the large repo via HTTPS — TLS drops mid-transfer. Instead, rsync from the local clone to the workspace." This was a deliberate workaround for a past HTTPS-checkout problem.

The workaround compounds three structural issues:

*Hardcoded path.* LOCAL_REPO=/home/koder/dev/koder only works on the dev's machine.
*Worktree contention.* The runner reads the same files the developer is editing, so edits race with builds. Index corruption was seen during the 2026-05-20 wave (fatal: .git/index: index file smaller than expected).
*Permission cross-talk.* sudo lb build creates root-owned files inside infra/linux/distro/chroot/ (Koder Linux ISO build). The runner runs as koder and can't read them, so rsync exits with code 23 and the job fails.

All 54 release workflows + kdb-jepsen.yml are affected.

Why a runner on the laptop is wrong long-term

problem	impact
Laptop off / asleep / no network → no CI	shipping blocked
Dev edits during build → race + index corruption	flaky CI, looks like product bug
Heavy build (Flutter desktop) saturates laptop CPU	dev work blocked during builds
Single runner = serial queue	wave of 8 releases = 30-60min queue time
Workflows assume dev's filesystem layout	not portable, not testable
`chroot/` is a permanent landmine	every infra/linux iteration risks re-breaking

The laptop runner was a bootstrap convenience that hardened into the only working path. Time to extract.

Design

Architecture

flow.koder.dev  (Gitea Actions, lives in s.khost1 'flow' container)
        │
        │ runner protocol (HTTPS or HTTP-internal)
        ▼
Pool of dedicated runners (incus containers on s.khost1):
   ┌──────────────────────────────────────────────────────┐
   │ ci-runner-linux     labels: linux, ubuntu-latest     │
   │   - Go 1.25, Node 24, Rust stable, Flutter Linux     │
   │ ci-runner-android   labels: android                  │
   │   - Android SDK + NDK, Java, Gradle, ABIs            │
   │ ci-runner-flutter   labels: flutter-web              │
   │   - Flutter + Chromium for web/desktop bundles       │
   │ ci-runner-iso       labels: iso, privileged          │
   │   - lb (live-build) + sudo policy, isolated chroot   │
   │ ci-runner-deploy    labels: deploy                   │
   │   - SSH keys mounted, target hosts in known_hosts    │
   └──────────────────────────────────────────────────────┘

Workflows declare runs-on: [self-hosted, <capability>]. Each runner picks up jobs matching its label set.
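
As a rough illustration of label routing, a workflow with two jobs could target two different runners in the pool. This is a sketch only: the job names and the `make build` and `./deploy.sh` commands are hypothetical placeholders, not taken from an existing workflow.

```yaml
# Hypothetical workflow fragment: each job is routed by its runs-on labels.
jobs:
  build:
    # Intended for ci-runner-linux (labels: linux, ubuntu-latest)
    runs-on: [self-hosted, linux]
    steps:
      - name: Build
        run: make build        # placeholder build command
  deploy:
    needs: build               # start only after the build job succeeds
    # Intended for ci-runner-deploy (labels: deploy)
    runs-on: [self-hosted, deploy]
    steps:
      - name: Deploy
        run: ./deploy.sh       # placeholder deploy script
```

One point worth confirming in Phase 2: a runner likely has to carry every label listed in runs-on, so each pool runner would presumably need `self-hosted` registered next to its capability label.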

Checkout: native git, no rsync hack

Replace the Checkout (local rsync) block with actions/checkout@v4 (or equivalent for Gitea Actions) configured to use internal network:

- name: Checkout
  uses: actions/checkout@v4
  with:
    repository: Koder/koder
    ref: ${{ github.ref }}
    fetch-depth: 1                # shallow — no history needed
    filter: blob:none             # partial — fetch blobs lazily
    submodules: false             # we have one broken submodule (koder-hardware); skip

For internal network: configure the runner's git config --global url.<internal>.insteadOf <public> so clones go via http://flow.koder.dev/ (resolved internally to 10.0.1.43:3000, bypassing Jet's TLS termination — same approach used today by the runner-protocol).
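
A minimal sketch of that rewrite rule, written as a workflow step so it also applies inside a job container (a git config set on the runner host may not be visible to jobs in Docker mode). It assumes the public base URL is https://flow.koder.dev/ and uses the internal address quoted in this RFC. Whether the checkout action honors the global rule should be verified rather than assumed.

```yaml
# Sketch: rewrite the public clone URL to the internal endpoint before checkout.
# Assumptions: public base URL https://flow.koder.dev/, internal endpoint
# http://10.0.1.43:3000/ (the address quoted in this RFC).
- name: Route git traffic over the internal network
  run: |
    git config --global \
      url."http://10.0.1.43:3000/".insteadOf "https://flow.koder.dev/"
    # Print the rewrite rule that is now in effect
    git config --global --get-regexp '^url\..*\.insteadof$'
```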

Why this solves the TLS-drop problem:

solution	why it works
HTTP internal (no TLS)	no TLS layer to drop
`--depth=1`	200MB instead of 5GB transfer
`--filter=blob:none`	blobs fetched only when checked out, lazily
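`protocol.version=2`	resilient transport: the server sends only the refs the client asks for, so less data crosses the wire before the fetch starts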

Combination: git clone http://10.0.1.43:3000/Koder/koder.git --depth=1 --filter=blob:none -c protocol.version=2 . — tested locally to complete in <30s.
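
If the checkout action turns out not to be usable on Gitea Actions, the same flags could be applied in a plain `run:` step. The sketch below fetches the ref being built instead of cloning the default branch. It assumes the job is triggered by a tag or branch push, so that GITHUB_REF holds a full ref name, and that the step runs in an empty workspace.

```yaml
# Sketch of a manual checkout step using the flags from the command above.
# Assumption: GITHUB_REF is a full ref name (refs/tags/... or refs/heads/...).
- name: Checkout (shallow + partial, internal HTTP)
  run: |
    git init .
    git remote add origin http://10.0.1.43:3000/Koder/koder.git
    # Fetch only the ref being built: depth 1, blobs on demand, protocol v2
    git -c protocol.version=2 fetch --depth=1 --filter=blob:none \
      origin "${GITHUB_REF}"
    # Check out the fetched commit without creating a local branch
    git checkout --detach FETCH_HEAD
```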

Capability labels (initial set)

label	purpose	examples of workflows that use it
`linux`	Linux builds (Go, Rust, Node, Flutter Linux)	`koder-web-kit-release`, `probe-release`, `jet-release`, `koder-tools-release`
`android`	Android APK/AAB	`eye-release`, `koru-release`, `ktermux-app-release`
`flutter-web`	Flutter web + desktop bundles	`kruze-release`, `kterm-release`, `home-release`
`iso`	Koder Linux ISO build (privileged container, `lb build`)	`distro-release` (future)
`deploy`	SSH outbound to s.khost1 services + Hub	`kpc-landing-release`, `domains-release` deploy step
`audit`	lightweight audits (`audit-hub-coverage`, naming, paths)	`audit-*.yml`

ubuntu-latest should be retired as a label (it's a GitHub-ism that doesn't describe anything actionable on self-hosted infra) and replaced by linux for clarity.

Runner mode: docker vs host

Two viable modes for act_runner:

*Docker mode* (default): each job runs in a fresh container started from the image mapped to the job's runs-on label. Strong isolation. Requires nested containers (an Incus container with security.nesting=true running Docker). *Recommended for the pool:* every job starts clean, no cross-job state.
*Host mode* (:host suffix on labels): jobs run on the runner host filesystem directly. Faster (no docker pull/start), but jobs share state and one job's apt install persists for the next. Today's laptop runner uses host mode.

Pool runners use *Docker mode* for linux/android/flutter-web. The iso runner stays in *host mode, privileged* (because lb build needs raw mounts that nested Docker can't deliver).
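
The two modes would show up in each runner's act_runner config.yaml as different label suffixes. A sketch, shown as two YAML documents for two separate runners. The image path under registry.koder.dev is a hypothetical placeholder, and the extra `self-hosted` label is an assumption about how runs-on matching is set up.

```yaml
# ci-runner-linux: Docker mode. Jobs for these labels run in a fresh
# container started from the image after docker://.
runner:
  labels:
    - "linux:docker://registry.koder.dev/ci/linux:latest"        # hypothetical image
    - "self-hosted:docker://registry.koder.dev/ci/linux:latest"  # assumed extra label
---
# ci-runner-iso: host mode. Jobs run directly on the runner's filesystem.
runner:
  labels:
    - "iso:host"
    - "privileged:host"
    - "self-hosted:host"    # assumed extra label
```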

Where does the pool live

Sector: infra/ci/ (new — does not exist in the monorepo today). Following RFC-003 §8, this RFC declares the sector and its internal layout:

infra/ci/
├── README.kmd              ← what the sector is for
├── koder.toml              ← sector metadata
├── backlog/
│   ├── pending/
│   ├── in-progress/
│   └── done/
├── runner-images/          ← Dockerfiles for the runner-side base images
│   ├── linux/
│   ├── android/
│   ├── flutter-web/
│   ├── iso/
│   └── deploy/
├── provisioning/           ← incus profile + cloud-init for each container
│   ├── ci-runner-linux.toml
│   ├── ci-runner-android.toml
│   └── ...
└── runbooks/
    ├── runner-add.kmd
    ├── runner-rotate-token.kmd
    └── runner-decommission.kmd

The container ci-runner already created during the wave (10.0.1.49 on s.khost1, Debian Trixie + Docker 29.5 + act_runner v0.2.11) becomes the first instance of ci-runner-linux. Sized 4 CPU / 8 GB. Future runners cloned from this template.

Phasing

Phase 1 — Bandaid landed (2026-05-20)

Done in commit `ci(workflows): exclude infra/linux/distro/chroot/...`. All 54 release workflows now skip the chroot dir during rsync. Failing release CI runs unblock immediately. *Does not solve any structural issue*; it just keeps shipping working while the rest of this RFC is implemented.

Phase 2 — One runner, one workflow, prove the design (next 1-2 weeks)

Create sector infra/ci/ with the layout above.
Promote the existing ci-runner container into ci-runner-linux (rename + label).
Pick *one* simple workflow (recommendation: probe-release.yml) and write a parallel version probe-release-v2.yml that:
- Uses runs-on: [self-hosted, linux] (the new runner)
- Uses actions/checkout@v4 with the internal-network + shallow + partial clone
- No LOCAL_REPO rsync
Push a new probe tag and verify both old and new pipelines run. Compare wall-time + artifact equality.
If green: promote v2 → main probe-release.yml, archive the old version.
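
A possible shape for probe-release-v2.yml, as a starting point rather than the final file. The tag pattern and the build step are assumptions; both should be copied from the existing probe-release.yml.

```yaml
# Hypothetical sketch of .gitea/workflows/probe-release-v2.yml
name: probe-release-v2
on:
  push:
    tags:
      - 'probe-v*'                  # assumed tag pattern; match the v1 workflow
jobs:
  release:
    runs-on: [self-hosted, linux]   # the new pool runner, not the laptop
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 1            # shallow
          filter: blob:none         # partial clone
          submodules: false
      - name: Build and publish
        # Placeholder: reuse the build steps from probe-release.yml unchanged
        run: echo "replace with the steps from probe-release.yml"
```

Keeping the build steps identical between v1 and v2 means the wall-time and artifact comparison measures only the runner and checkout change.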

Phase 3 — Expand to all "linux" workflows (2-3 weeks)

Convert all workflows whose builds only need Go/Rust/Node/Flutter-Linux. Drive the migration by `runs-on` label change + checkout block rewrite. One PR per workflow to keep the blast radius small. Maintain the `LOCAL_REPO` fallback path until each workflow is verified.

Phase 4 — Android + Flutter-web runners (1 sprint)

Provision ci-runner-android and ci-runner-flutter containers. Install respective toolchains. Migrate Android + Flutter-desktop workflows. Each toolchain install gets a `runbooks/runner-add.kmd` checklist (reproducible).

Phase 5 — ISO + deploy runners (1 sprint)

Hardest two:

*ISO runner:* needs root inside the container and a real mount for lb build. Use an Incus container with security.privileged=true (the only one in the pool). Quarantined.
*Deploy runner:* needs SSH keys for s.khost1 service containers + Hub publishing creds. Strict role separation from build runners.

Phase 6 — Decommission laptop runner (final cleanup)

Stop act_runner daemon on the laptop.
Optionally unregister from Flow (or leave as inactive — historical).
Remove all LOCAL_REPO-based blocks from any leftover workflow.

Risks + mitigations

risk	mitigation
Pool runner toolchain drifts from laptop's	Toolchain pinned in `runner-images/<role>/Dockerfile`; CI updates batch-tested
Nested Docker performance	Benchmark in Phase 2; fall back to host mode if measurable degradation > 30%
New TLS-drop problem at scale	Internal HTTP + shallow + partial clone removes the original cause; monitor
ISO runner is privileged	Strict allowlist of branches/tags that can dispatch to it; signed
Single-host SPOF (s.khost1)	Future RFC: multi-host pool across s.khost1 + s.r1. Not in scope here
Migration breaks active workflows	Each workflow's v2 runs in parallel with v1 until green for 3+ consecutive releases

Non-goals

Replacing act_runner with something else (e.g., GitHub Actions runner directly). act_runner is the Gitea-blessed solution and it works.
Multi-host runner pool (HA across machines). Future RFC after Phase 6.
Self-hosted Docker registry for runner base images. Use the existing Hub or registry.koder.dev as available; not in this RFC.
Caching layer (cargo, npm, pub.dev caches). Big win, separate RFC.

Decision

*Adopt.* Phase 1 already shipped. Phase 2 starts on the next available sprint slot.

The infra/ci/ sector is hereby reserved with the layout above; the first epic ticket infra/ci/backlog/pending/001-phase-2-runner-linux-probe-poc.md opens immediately to track Phase 2.

Open questions

*Internal HTTP Host header:* when the runner clones via http://10.0.1.43:3000/Koder/koder.git, does Gitea reject the request because the Host header doesn't match flow.koder.dev? Test in Phase 2 before designing around it.
*Submodule koder-hardware:* currently has no .gitmodules entry but the warning fires on every checkout. Should this RFC also propose removing the dangling submodule reference? Out of scope, but file a follow-up ticket.
*Workflow path:* workflows live in .gitea/workflows/ at monorepo root. Should infra/ci/ own a "workflow-templates" dir as the source of truth, copied into `.gitea` at build time? Aesthetic, not load-bearing.

References

meta/docs/stack/rfcs/infra-RFC-001-koder-runtime-strategy.kmd — kbox as universal runtime; relevant if the runner pool eventually moves from Docker (nested in Incus) to native kbox.
meta/docs/stack/policies/test-host-isolation.kmd — adjacent concern (heavy tests on a dedicated VM). The principle is the same: dev environment ≠ CI/test environment.
meta/context/infrastructure/servers.md — s.khost1 capacity.
meta/context/infrastructure/services.md — Koder Flow + internal IPs.