Self-hosted CI runner pool — move CI off the dev laptop
infra-RFC-002 — Self-hosted CI runner pool
Status
**Draft**, opened 2026-05-20 after a kship wave exposed that the single CI runner lives on the developer's laptop and rsyncs the dev worktree directly. Of 8 release tags shipped during the wave, ≥5 release pipelines failed with the same root cause (rsync hitting root-owned files inside `infra/linux/distro/chroot/`). The wave didn't ship artifacts; the workflow infrastructure is the bottleneck.
Summary
Move Koder Flow Actions execution **off the developer laptop** onto a dedicated container pool in s.khost1. Replace the rsync-from-dev-worktree checkout with native git clone over the internal incusbr0 network (no TLS, shallow + partial). Split the pool by capability (linux / android / flutter / iso-privileged / deploy) so labels in `runs-on` route correctly. Decommission the laptop runner once the pool covers all surfaces.
The shipping bug fixed in commit `ci(workflows): exclude infra/linux/distro/chroot/ from LOCAL_REPO rsync` is the **patch**. This RFC is the **structural** fix.
Motivation
What broke
The note runner (act_runner v0.2.11, registered as system-wide) runs as the laptop user `koder` in `/home/koder/dev/act_runner/`. Every release workflow contains this checkout block:
    - name: Checkout (local rsync)
      env:
        LOCAL_REPO: /home/koder/dev/koder
      run: |
        rsync -a --delete \
          --exclude='.git' \
          --exclude='*/build/' \
          ...
          "${LOCAL_REPO}/" "${GITHUB_WORKSPACE}/"

The comment in `jet-release.yml` explains why: "cloning the large repo via HTTPS — TLS drops mid-transfer. Instead, rsync from the local clone to the workspace." This was a deliberate workaround for a past HTTPS-checkout problem.
The workaround compounds three structural issues:
- **Hardcoded path.** `LOCAL_REPO=/home/koder/dev/koder` only works on the dev's machine.
- **Worktree contention.** The runner reads the same files the developer is editing, so edits race with builds. Index corruption was seen during the 2026-05-20 wave (`fatal: .git/index: index file smaller than expected`).
- **Permission cross-talk.** `sudo lb build` creates root-owned files inside `infra/linux/distro/chroot/` (Koder Linux ISO build). The runner runs as `koder` and can't read them. rsync exits 23. Job fails.
All 54 release workflows + kdb-jepsen.yml are affected.
Why a runner on the laptop is wrong long-term
| problem | impact |
|---|---|
| Laptop off / asleep / no network → no CI | shipping blocked |
| Dev edits during build → race + index corruption | flaky CI, looks like product bug |
| Heavy build (Flutter desktop) saturates laptop CPU | dev work blocked during builds |
| Single runner = serial queue | wave of 8 releases = 30-60min queue time |
| Workflows assume dev's filesystem layout | not portable, not testable |
| `chroot/` is a permanent landmine | every `infra/linux` iteration risks re-breaking |
The laptop runner was a bootstrap convenience that hardened into the only working path. Time to extract.
Design
Architecture
    flow.koder.dev (Gitea Actions, lives in s.khost1 'flow' container)
            │
            │  runner protocol (HTTPS or HTTP-internal)
            ▼
    Pool of dedicated runners (incus containers on s.khost1):
    ┌──────────────────────────────────────────────────────┐
    │ ci-runner-linux     labels: linux, ubuntu-latest     │
    │   - Go 1.25, Node 24, Rust stable, Flutter Linux     │
    │ ci-runner-android   labels: android                  │
    │   - Android SDK + NDK, Java, Gradle, ABIs            │
    │ ci-runner-flutter   labels: flutter-web              │
    │   - Flutter + Chromium for web/desktop bundles       │
    │ ci-runner-iso       labels: iso, privileged          │
    │   - lb (live-build) + sudo policy, isolated chroot   │
    │ ci-runner-deploy    labels: deploy                   │
    │   - SSH keys mounted, target hosts in known_hosts    │
    └──────────────────────────────────────────────────────┘

Workflows declare `runs-on: [self-hosted, <capability>]`. Each runner picks up jobs matching its label set.
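The routing rule can be sketched in a few lines of shell. This is an illustration, not act_runner code: it assumes the matching rule is "a runner may take a job only if it carries every label the job lists in `runs-on`", and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch of label routing. ASSUMPTION: a runner is eligible for a job only
# when every runs-on label of the job is present in the runner's label set.

# has_label LABEL "space separated runner labels"
has_label() {
    for have in $2; do
        [ "$have" = "$1" ] && return 0
    done
    return 1
}

# runner_can_take "runner labels" "job runs-on labels"
runner_can_take() {
    for want in $2; do
        has_label "$want" "$1" || return 1
    done
    return 0
}
```

For example, `runner_can_take "self-hosted linux ubuntu-latest" "self-hosted linux"` succeeds, while `runner_can_take "linux" "self-hosted linux"` fails. If the real matcher behaves this way, each pool runner would also need `self-hosted` in its own label set; that is worth confirming in Phase 2.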
Checkout: native git, no rsync hack
Replace the `Checkout (local rsync)` block with `actions/checkout@v4` (or the equivalent for Gitea Actions), configured to use the internal network:

    - name: Checkout
      uses: actions/checkout@v4
      with:
        repository: Koder/koder
        ref: ${{ github.ref }}
        fetch-depth: 1        # shallow: no history needed
        filter: blob:none     # partial: fetch blobs lazily
        submodules: false     # we have one broken submodule (koder-hardware); skip

For the internal network: configure the runner's `git config --global url.<internal>.insteadOf <public>` so clones go via `http://flow.koder.dev/` (resolved internally to 10.0.1.43:3000, bypassing Jet's TLS termination; the same approach the runner protocol uses today).
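A minimal sketch of the `insteadOf` rewrite, to make the `<internal>` and `<public>` placeholders concrete. The two base URLs used in the usage note are assumptions (the public HTTPS name and the internal address quoted in this RFC), and both function names are hypothetical.

```shell
#!/bin/sh
# set_internal_rewrite PUBLIC_BASE INTERNAL_BASE
# Adds a url.<internal>.insteadOf <public> rule to the current user's
# global git config, so any URL starting with PUBLIC_BASE is fetched
# from INTERNAL_BASE instead.
set_internal_rewrite() {
    git config --global "url.$2.insteadOf" "$1"
}

# resolved_url URL
# Prints the URL git would actually contact after insteadOf rewriting,
# without talking to the remote.
resolved_url() {
    git ls-remote --get-url "$1"
}
```

On a runner this might be called as `set_internal_rewrite "https://flow.koder.dev/" "http://10.0.1.43:3000/"`, after which `resolved_url https://flow.koder.dev/Koder/koder.git` should print the internal address. Whether Gitea accepts requests addressed by IP is still an open question in this RFC.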
Why this solves the TLS-drop problem:
| solution | why it works |
|---|---|
| HTTP internal (no TLS) | no TLS layer to drop |
| `--depth=1` | 200MB instead of 5GB transfer |
| `--filter=blob:none` | blobs fetched only when checked out, lazily |
| `protocol.version=2` | resilient transport |
Combination: `git clone http://10.0.1.43:3000/Koder/koder.git --depth=1 --filter=blob:none -c protocol.version=2 .` (tested locally to complete in <30s).
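The same combination can be wrapped so a job can verify it really got a shallow clone. This is a sketch under the flags listed above; the function names are hypothetical, and a server that does not allow filters will ignore `--filter` with a warning rather than fail.

```shell
#!/bin/sh
# shallow_partial_clone URL DIR
# Clone with protocol v2, depth 1, and a blobless partial-clone filter.
shallow_partial_clone() {
    git -c protocol.version=2 clone --quiet \
        --depth=1 --filter=blob:none \
        "$1" "$2"
}

# is_shallow DIR
# Succeeds when DIR is a shallow clone (history was truncated).
is_shallow() {
    [ "$(git -C "$1" rev-parse --is-shallow-repository)" = "true" ]
}
```

Usage: `shallow_partial_clone http://10.0.1.43:3000/Koder/koder.git "$GITHUB_WORKSPACE" && is_shallow "$GITHUB_WORKSPACE"`.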
Capability labels (initial set)
| label | purpose | examples of workflows that use it |
|---|---|---|
| `linux` | Linux builds (Go, Rust, Node, Flutter Linux) | koder-web-kit-release, probe-release, jet-release, koder-tools-release |
| `android` | Android APK/AAB | eye-release, koru-release, ktermux-app-release |
| `flutter-web` | Flutter web + desktop bundles | kruze-release, kterm-release, home-release |
| `iso` | Koder Linux ISO build (privileged container, `lb build`) | distro-release (future) |
| `deploy` | SSH outbound to s.khost1 services + Hub | kpc-landing-release, domains-release deploy step |
| `audit` | lightweight audits (audit-hub-coverage, naming, paths) | `audit-*.yml` |
`ubuntu-latest` should be retired as a label (it's a GitHubism that doesn't describe anything actionable on self-hosted infra) and replaced by `linux` for clarity.
Runner mode: docker vs host
Two viable modes for act_runner:
- **Docker mode** (default): each job runs in a fresh container started from the image mapped to the job's `runs-on` label. Strong isolation. Requires nested containers (Incus container with `security.nesting=true` running Docker). **Recommended for the pool**: every job starts clean, no cross-job state.
- **Host mode** (`:host` suffix on labels): jobs run on the runner host filesystem directly. Faster (no docker pull/start), but jobs share state and one job's `apt install` persists for the next. Today's laptop runner uses host mode.

Pool runners use **Docker mode** for `linux`, `android`, and `flutter-web`. The `iso` runner stays **host mode, privileged** (because `lb build` needs raw mounts that nested Docker can't deliver).
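The two modes differ only in how each label is written when the runner registers. A hedged sketch: `runner_label` and `register_runner` are hypothetical helpers, the image name in the usage note is a placeholder, and the registration flags are the non-interactive ones documented for act_runner.

```shell
#!/bin/sh
# runner_label NAME MODE [IMAGE]
# Builds one act_runner label: "name:docker://image" for Docker mode,
# "name:host" for host mode.
runner_label() {
    case "$2" in
        docker) printf '%s:docker://%s\n' "$1" "$3" ;;
        host)   printf '%s:host\n' "$1" ;;
        *)      echo "unknown mode: $2" >&2; return 1 ;;
    esac
}

# register_runner INSTANCE_URL TOKEN NAME LABELS
# Registers a runner non-interactively. Defined here but not invoked.
register_runner() {
    act_runner register --no-interactive \
        --instance "$1" --token "$2" --name "$3" --labels "$4"
}
```

For instance, `runner_label linux docker registry.koder.dev/ci/linux:latest` yields `linux:docker://registry.koder.dev/ci/linux:latest` (the image path is made up for illustration), and `runner_label iso host` yields `iso:host`.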
Where does the pool live
Sector: infra/ci/ (new — does not exist in the monorepo today). Following RFC-003 §8, this RFC declares the sector and its internal layout:
    infra/ci/
    ├── README.kmd            ← what the sector is for
    ├── koder.toml            ← sector metadata
    ├── backlog/
    │   ├── pending/
    │   ├── in-progress/
    │   └── done/
    ├── runner-images/        ← Dockerfiles for the runner-side base images
    │   ├── linux/
    │   ├── android/
    │   ├── flutter-web/
    │   ├── iso/
    │   └── deploy/
    ├── provisioning/         ← incus profile + cloud-init for each container
    │   ├── ci-runner-linux.toml
    │   ├── ci-runner-android.toml
    │   └── ...
    └── runbooks/
        ├── runner-add.kmd
        ├── runner-rotate-token.kmd
        └── runner-decommission.kmd

The container `ci-runner` already created during the wave (10.0.1.49 on s.khost1, Debian Trixie + Docker 29.5 + act_runner v0.2.11) becomes the first instance of `ci-runner-linux`. Sized 4 CPU / 8 GB. Future runners are cloned from this template.
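As a rough illustration of what provisioning such a container involves, the helper below prints (but does not run) an `incus launch` command with the sizing and nesting settings mentioned in this RFC. It is a sketch, not the planned provisioning format: `launch_cmd` is a hypothetical name and the default image alias `images:debian/13` is an assumption for "Debian Trixie".

```shell
#!/bin/sh
# launch_cmd NAME [IMAGE]
# Prints the incus command for a 4 CPU / 8 GiB container that allows
# nested containers (needed for Docker inside Incus). Dry run only.
launch_cmd() {
    printf 'incus launch %s %s -c security.nesting=true -c limits.cpu=4 -c limits.memory=8GiB\n' \
        "${2:-images:debian/13}" "$1"
}
```

`launch_cmd ci-runner-android` prints the command for review; once it looks right it can be executed with `sh -c "$(launch_cmd ci-runner-android)"`.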
Phasing
Phase 1 — Bandaid landed (2026-05-20)
Done in commit `ci(workflows): exclude infra/linux/distro/chroot/...`. All 54 release workflows now skip the chroot dir during rsync. Failing release CI runs unblock immediately. **Does not solve any structural issue**; it just keeps shipping working while the rest of this RFC is implemented.
Phase 2 — One runner, one workflow, prove the design (next 1-2 weeks)
- Create sector `infra/ci/` with the layout above.
- Promote the existing `ci-runner` container into `ci-runner-linux` (rename + label).
- Pick **one** simple workflow (recommendation: `probe-release.yml`) and write a parallel version `probe-release-v2.yml` that:
  - Uses `runs-on: [self-hosted, linux]` (the new runner)
  - Uses `actions/checkout@v4` with the internal-network + shallow + partial clone
  - No `LOCAL_REPO` rsync
- Push a new probe tag and verify both old and new pipelines run. Compare wall-time + artifact equality.
- If green: promote v2 → main `probe-release.yml`, archive the old version.
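The artifact-equality check in Phase 2 could be as simple as a byte comparison plus a digest for the log. A minimal sketch with hypothetical function names; it assumes the probe build is reproducible enough that v1 and v2 artifacts are expected to be byte-identical, which may not hold if timestamps are embedded.

```shell
#!/bin/sh
# same_artifact FILE_A FILE_B
# Succeeds when both files are byte-identical.
same_artifact() {
    cmp -s "$1" "$2"
}

# artifact_digest FILE
# Prints the SHA-256 digest of FILE (uses coreutils sha256sum).
artifact_digest() {
    sha256sum "$1" | cut -d' ' -f1
}
```

Usage: `same_artifact v1/probe v2/probe || echo "artifacts differ: $(artifact_digest v1/probe) vs $(artifact_digest v2/probe)"` (the paths are placeholders).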
Phase 3 — Expand to all "linux" workflows (2-3 weeks)
Convert all workflows whose builds only need Go/Rust/Node/Flutter-Linux. Drive the migration by `runs-on` label change + checkout block rewrite. One PR per workflow to keep blast radius small. Maintain the `LOCAL_REPO` fallback path until each workflow is verified.
Phase 4 — Android + Flutter-web runners (1 sprint)
Provision `ci-runner-android` and `ci-runner-flutter` containers. Install respective toolchains. Migrate Android + Flutter-desktop workflows. Each toolchain install gets a `runbooks/runner-add.kmd` checklist (reproducible).
Phase 5 — ISO + deploy runners (1 sprint)
Hardest two:
- **ISO runner**: needs root inside the container and a real `mount` for `lb build`. Use an Incus container with `security.privileged=true` (the only one in the pool). Quarantined.
- **Deploy runner**: needs SSH keys for s.khost1 service containers + Hub publishing creds. Strict role separation from build runners.
Phase 6 — Decommission laptop runner (final cleanup)
- Stop `act_runner daemon` on the laptop.
- Optionally unregister from Flow (or leave it registered as inactive, for historical reference).
- Remove all `LOCAL_REPO`-based blocks from any leftover workflow.
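Finding the leftover workflows for that last step can be scripted. A small sketch, assuming workflows are YAML files under a directory such as `.gitea/workflows/`; `leftover_local_repo` is a hypothetical helper name.

```shell
#!/bin/sh
# leftover_local_repo DIR
# Lists (sorted) every .yml/.yaml file under DIR that still mentions
# LOCAL_REPO. Prints nothing when the migration is complete.
leftover_local_repo() {
    find "$1" -type f \( -name '*.yml' -o -name '*.yaml' \) \
        -exec grep -l 'LOCAL_REPO' {} + | sort
}
```

Running `leftover_local_repo .gitea/workflows` from the monorepo root gives the list of files that still need their checkout block rewritten.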
Risks + mitigations
| risk | mitigation |
|---|---|
| Pool runner toolchain drifts from laptop's | Toolchain pinned in runner-images/<role>/Dockerfile; CI updates batch-tested |
| Nested Docker performance | Benchmark in Phase 2; fall back to host mode if measurable degradation > 30% |
| New TLS-drop problem at scale | Internal HTTP + shallow + partial clone removes the original cause; monitor |
| ISO runner is privileged | Strict allowlist of branches/tags that can dispatch to it; signed |
| Single-host SPOF (s.khost1) | Future RFC: multi-host pool across s.khost1 + s.r1. Not in scope here |
| Migration breaks active workflows | Each workflow's v2 runs in parallel with v1 until green for 3+ consecutive releases |
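
The 30% threshold in the nested-Docker row is simple arithmetic on two wall-clock timings. A sketch with hypothetical function names, assuming the baseline is the same job timed in host mode on the same runner and that both timings are whole seconds.

```shell
#!/bin/sh
# slowdown_pct BASELINE_SECONDS NESTED_SECONDS
# Prints the integer percentage by which the nested run is slower
# (truncated toward zero; negative when the nested run is faster).
slowdown_pct() {
    echo $(( ($2 - $1) * 100 / $1 ))
}

# should_fall_back BASELINE_SECONDS NESTED_SECONDS [THRESHOLD_PCT]
# Succeeds when the slowdown is strictly above the threshold (default 30).
should_fall_back() {
    [ "$(slowdown_pct "$1" "$2")" -gt "${3:-30}" ]
}
```

With a 200 s baseline, a 270 s nested run is a 35% slowdown and would trigger the fallback; a 260 s run is exactly 30% and would not.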
Non-goals
- Replacing act_runner with something else (e.g., GitHub Actions runner directly). act_runner is the Gitea-blessed solution and it works.
- Multi-host runner pool (HA across machines). Future RFC after Phase 6.
- Self-hosted Docker registry for runner base images. Use the existing Hub or `registry.koder.dev` as available; not in this RFC.
- Caching layer (cargo, npm, pub.dev caches). Big win, separate RFC.
Decision
**Adopt.** Phase 1 already shipped. Phase 2 starts in the next available sprint slot.
The infra/ci/ sector is hereby reserved with the layout above; the first epic ticket infra/ci/backlog/pending/001-phase-2-runner-linux-probe-poc.md opens immediately to track Phase 2.
Open questions
- **Internal HTTP host header**: when the runner clones via `http://10.0.1.43:3000/Koder/koder.git`, does Gitea reject the request because the Host header doesn't match `flow.koder.dev`? Test in Phase 2 before designing around it.
- **Submodule `koder-hardware`**: currently has no `.gitmodules` entry but the warning fires on every checkout. Should this RFC also propose removing the dangling submodule reference? Out of scope, but file a follow-up ticket.
- **Workflow path**: workflows live in `.gitea/workflows/` at monorepo root. Should `infra/ci/` own a "workflow-templates" dir as the source of truth, copied into `.gitea` at build time? Aesthetic, not load-bearing.
References
- `meta/docs/stack/rfcs/infra-RFC-001-koder-runtime-strategy.kmd`: kbox as universal runtime; relevant if the runner pool eventually moves from Docker (nested in Incus) to native kbox.
- `meta/docs/stack/policies/test-host-isolation.kmd`: adjacent concern (heavy tests on a dedicated VM). The principle is the same: dev environment ≠ CI/test environment.
- `meta/context/infrastructure/servers.md`: s.khost1 capacity.
- `meta/context/infrastructure/services.md`: Koder Flow + internal IPs.