Self-hosted CI runner pool — move CI off the dev laptop

draft

infra-RFC-002 — Self-hosted CI runner pool

Status

*Draft*, opened 2026-05-20 after a kship wave exposed that the single CI runner lives on the developer's laptop and rsyncs the dev worktree directly. Of 8 release tags shipped during the wave, ≥5 release pipelines failed with the same root cause (rsync hitting root-owned files inside `infra/linux/distro/chroot/`). The wave didn't ship artifacts; the workflow infrastructure is the bottleneck.

Summary

Move Koder Flow Actions execution *off the developer laptop* onto a dedicated container pool in s.khost1. Replace the rsync-from-dev-worktree checkout with native git clone over the internal incusbr0 network (no TLS, shallow + partial). Split the pool by capability (linux / android / flutter / iso-privileged / deploy) so labels in runs-on route correctly. Decommission the laptop runner once the pool covers all surfaces.

The shipping bug fixed in commit ci(workflows): exclude infra/linux/distro/chroot/ from LOCAL_REPO rsync is the *patch*. This RFC is the *structural* fix.

Motivation

What broke

The note runner (act_runner v0.2.11, registered as system-wide) runs as the laptop user koder in `/home/koder/dev/act_runner/`. Every release workflow contains this checkout block:

- name: Checkout (local rsync)
  env:
    LOCAL_REPO: /home/koder/dev/koder
  run: |
    rsync -a --delete \
      --exclude='.git' \
      --exclude='*/build/' \
      ...
      "${LOCAL_REPO}/" "${GITHUB_WORKSPACE}/"

The comment in jet-release.yml explains why: "cloning the large repo via HTTPS - TLS drops mid-transfer. Instead, rsync from the local clone to the workspace." This was a deliberate workaround for a past HTTPS-checkout problem.

The workaround compounds three structural issues:

  1. *Hardcoded path.* LOCAL_REPO=/home/koder/dev/koder only works on the dev's machine.
  2. *Worktree contention.* The runner reads the same files the developer is editing, so edits race with builds. Index corruption was seen during the 2026-05-20 wave (fatal: .git/index: index file smaller than expected).
  3. *Permission cross-talk.* sudo lb build creates root-owned files inside infra/linux/distro/chroot/ (Koder Linux ISO build). The runner runs as koder and can't read them. rsync exits 23. Job fails.

All 54 release workflows + kdb-jepsen.yml are affected.
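The permission failure in item 3 can be detected before rsync runs. A minimal bash sketch, assuming a standard `find` is available; `list_foreign_owned` is an illustrative helper name, not something the workflows define:

```shell
# List paths under a tree that are not owned by the current user.
# rsync running as that user may be unable to read such files, which
# surfaces as exit code 23 (partial transfer due to error).
# list_foreign_owned is a hypothetical helper name.
list_foreign_owned() {
  local tree="$1"
  # Errors from directories we cannot descend into are suppressed;
  # the directory itself is still reported because its owner differs.
  find "$tree" ! -user "$(id -u)" -print 2>/dev/null
}
```

Run against the worktree, for example `list_foreign_owned /home/koder/dev/koder/infra/linux/distro/chroot | head`, any output points at files the runner user is likely to trip over.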

Why a runner on the laptop is wrong long-term

| problem | impact |
|---|---|
| Laptop off / asleep / no network → no CI | shipping blocked |
| Dev edits during build → race + index corruption | flaky CI, looks like product bug |
| Heavy build (Flutter desktop) saturates laptop CPU | dev work blocked during builds |
| Single runner = serial queue | wave of 8 releases = 30-60min queue time |
| Workflows assume dev's filesystem layout | not portable, not testable |
| chroot/ is a permanent landmine | every infra/linux iteration risks re-breaking |

The laptop runner was a bootstrap convenience that hardened into the only working path. Time to extract.

Design

Architecture

flow.koder.dev  (Gitea Actions, lives in s.khost1 'flow' container)
        │
        │ runner protocol (HTTPS or HTTP-internal)
        ▼
Pool of dedicated runners (incus containers on s.khost1):
   ┌──────────────────────────────────────────────────────┐
   │ ci-runner-linux     labels: linux, ubuntu-latest     │
   │   - Go 1.25, Node 24, Rust stable, Flutter Linux     │
   │ ci-runner-android   labels: android                  │
   │   - Android SDK + NDK, Java, Gradle, ABIs            │
   │ ci-runner-flutter   labels: flutter-web              │
   │   - Flutter + Chromium for web/desktop bundles       │
   │ ci-runner-iso       labels: iso, privileged          │
   │   - lb (live-build) + sudo policy, isolated chroot   │
   │ ci-runner-deploy    labels: deploy                   │
   │   - SSH keys mounted, target hosts in known_hosts    │
   └──────────────────────────────────────────────────────┘

Workflows declare runs-on: [self-hosted, <capability>]. Each runner picks up jobs matching its label set.
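The routing rule can be sketched as a subset check. This assumes a runner is eligible only when it offers every label listed in runs-on, which should be confirmed against the deployed Gitea version. It also assumes each pool runner carries the self-hosted label in addition to the labels shown in the diagram. `runner_matches` is an illustrative name.

```shell
# Return 0 when every label requested by the job is offered by the runner.
# Both arguments are space-separated label lists.
runner_matches() {
  local runner_labels=" $1 " wanted
  for wanted in $2; do
    case "$runner_labels" in
      *" $wanted "*) ;;      # label offered, keep checking
      *) return 1 ;;         # a requested label is missing
    esac
  done
  return 0
}
```

For example, `runner_matches "self-hosted linux ubuntu-latest" "self-hosted linux"` succeeds, while the same job against a runner labelled `self-hosted android` does not.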

Checkout: native git, no rsync hack

Replace the Checkout (local rsync) block with actions/checkout@v4 (or equivalent for Gitea Actions) configured to use the internal network:

- name: Checkout
  uses: actions/checkout@v4
  with:
    repository: Koder/koder
    ref: ${{ github.ref }}
    fetch-depth: 1                # shallow — no history needed
    filter: blob:none             # partial — fetch blobs lazily
    submodules: false             # we have one broken submodule (koder-hardware); skip

For the internal network, configure the runner's git config --global url.<internal>.insteadOf <public> so clones go via http://flow.koder.dev/ (resolved internally to 10.0.1.43:3000, bypassing Jet's TLS termination; the runner protocol uses the same approach today).
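A minimal bash sketch of that rewrite rule. The public base URL `https://flow.koder.dev/` is an assumption, since the RFC only names the internal address; the function names are illustrative. `GIT_CONFIG_GLOBAL` needs git 2.32 or newer.

```shell
PUBLIC_URL="https://flow.koder.dev/"    # assumed public base URL
INTERNAL_URL="http://10.0.1.43:3000/"   # internal address named in the RFC

# Write the insteadOf rule into the given git config file.
configure_internal_clone() {
  local config_file="$1"
  git config --file "$config_file" "url.${INTERNAL_URL}.insteadOf" "$PUBLIC_URL"
}

# Print the URL git would actually contact for a given URL, without
# talking to the remote.
resolve_url() {
  local config_file="$1" url="$2"
  GIT_CONFIG_GLOBAL="$config_file" git ls-remote --get-url "$url"
}
```

On a runner this would be called as `configure_internal_clone "$HOME/.gitconfig"`, after which `resolve_url "$HOME/.gitconfig" https://flow.koder.dev/Koder/koder.git` should print the internal form of the URL.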

Why this solves the TLS-drop problem:

| solution | why it works |
|---|---|
| HTTP internal (no TLS) | no TLS layer to drop |
| --depth=1 | 200MB instead of 5GB transfer |
| --filter=blob:none | blobs fetched only when checked out, lazily |
| protocol.version=2 | resilient transport |

Combination: git clone http://10.0.1.43:3000/Koder/koder.git --depth=1 --filter=blob:none -c protocol.version=2 . — tested locally to complete in <30s.
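The combined flags can be wrapped in one helper so every runner clones the same way. A minimal bash sketch; `shallow_partial_clone` is an illustrative name, and it assumes the server permits partial-clone filters (git's `uploadpack.allowFilter`), which has not been verified here for the Flow instance.

```shell
# Clone only the tip commit and defer blob download where the server allows it.
#   $1  repository URL
#   $2  destination directory
shallow_partial_clone() {
  local url="$1" dest="$2"
  git clone --depth=1 --filter=blob:none -c protocol.version=2 "$url" "$dest"
}
```

After a clone, `git -C <dest> rev-list --count HEAD` reporting 1 confirms the history really is shallow.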

Capability labels (initial set)

| label | purpose | examples of workflows that use it |
|---|---|---|
| linux | Linux builds (Go, Rust, Node, Flutter Linux) | koder-web-kit-release, probe-release, jet-release, koder-tools-release |
| android | Android APK/AAB | eye-release, koru-release, ktermux-app-release |
| flutter-web | Flutter web + desktop bundles | kruze-release, kterm-release, home-release |
| iso | Koder Linux ISO build (privileged container, lb build) | distro-release (future) |
| deploy | SSH outbound to s.khost1 services + Hub | kpc-landing-release, domains-release deploy step |
| audit | lightweight audits (audit-hub-coverage, naming, paths) | audit-*.yml |

ubuntu-latest should be retired as a label and replaced by linux for clarity: it's a GitHubism that doesn't describe anything actionable on self-hosted infra.

Runner mode: docker vs host

Two viable modes for act_runner:

  • *Docker mode* (default): each job runs in a fresh container started from the image mapped to its runs-on label. Strong isolation. Requires nested containers (Incus container with security.nesting=true running Docker). *Recommended for the pool:* every job starts clean, no cross-job state.
  • *Host mode* (:host suffix on labels): jobs run on the runner host filesystem directly. Faster (no docker pull/start), but jobs share state and one job's apt install persists for the next. Today's laptop runner uses host mode.

Pool runners use *Docker mode* for linux/android/flutter-web. The iso runner stays *host mode, privileged* (because lb build needs raw mounts that nested Docker can't deliver).
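The two modes show up in how a runner's labels are written. A minimal bash sketch, assuming the act_runner label syntax `<label>:docker://<image>` and `<label>:host`; the helper name and the image reference in the usage line are hypothetical.

```shell
# Build an act_runner label spec for one capability.
#   docker mode: <label>:docker://<image>
#   host mode:   <label>:host
label_spec() {
  local label="$1" mode="$2" image="${3:-}"
  case "$mode" in
    docker) printf '%s:docker://%s\n' "$label" "$image" ;;
    host)   printf '%s:host\n' "$label" ;;
    *)      echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}
```

For instance, `label_spec linux docker example/ci-linux:latest` yields a Docker-mode label, while `label_spec iso host` yields the host-mode label the iso runner would register with.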

Where does the pool live

Sector: infra/ci/ (new — does not exist in the monorepo today). Following RFC-003 §8, this RFC declares the sector and its internal layout:

infra/ci/
├── README.kmd              ← what the sector is for
├── koder.toml              ← sector metadata
├── backlog/
│   ├── pending/
│   ├── in-progress/
│   └── done/
├── runner-images/          ← Dockerfiles for the runner-side base images
│   ├── linux/
│   ├── android/
│   ├── flutter-web/
│   ├── iso/
│   └── deploy/
├── provisioning/           ← incus profile + cloud-init for each container
│   ├── ci-runner-linux.toml
│   ├── ci-runner-android.toml
│   └── ...
└── runbooks/
    ├── runner-add.kmd
    ├── runner-rotate-token.kmd
    └── runner-decommission.kmd

The container ci-runner already created during the wave (10.0.1.49 on s.khost1, Debian Trixie + Docker 29.5 + act_runner v0.2.11) becomes the first instance of ci-runner-linux. Sized 4 CPU / 8 GB. Future runners cloned from this template.
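For orientation, the shape of such a container can be expressed as a single Incus command. This is a dry-run sketch that only prints the command. The image alias `images:debian/13`, the `8GiB` memory value, and the helper name are assumptions; the real provisioning is meant to live in the provisioning/ files, whose format this sketch does not model.

```shell
# Print (do not run) an incus launch command for a Docker-mode runner.
# Image alias and limit values are assumptions to verify on s.khost1.
runner_launch_cmd() {
  local name="$1" cpus="$2" memory="$3"
  printf 'incus launch images:debian/13 %s -c limits.cpu=%s -c limits.memory=%s -c security.nesting=true\n' \
    "$name" "$cpus" "$memory"
}
```

`runner_launch_cmd ci-runner-linux 4 8GiB` prints the command matching the sizing above, which can be reviewed before anyone runs it.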

Phasing

Phase 1 — Bandaid landed (2026-05-20)

Done in commit ci(workflows): exclude infra/linux/distro/chroot/.... All 54 release workflows now skip the chroot dir during rsync. Failing release CI runs unblock immediately. *Does not solve any structural issue*; it just keeps shipping working while the rest of this RFC is implemented.

Phase 2 — One runner, one workflow, prove the design (next 1-2 weeks)

  • Create sector infra/ci/ with the layout above.
  • Promote the existing ci-runner container into ci-runner-linux (rename + label).
  • Pick *one* simple workflow (recommendation: probe-release.yml) and write a parallel version probe-release-v2.yml that:
    • Uses runs-on: [self-hosted, linux] (the new runner)
    • Uses actions/checkout@v4 with the internal-network + shallow + partial clone
    • No LOCAL_REPO rsync
  • Push a new probe tag and verify both old and new pipelines run. Compare wall-time + artifact equality.
  • If green: promote v2 → main probe-release.yml, archive the old version.
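The artifact-equality check in the comparison step can be as simple as a byte-for-byte compare. A minimal bash sketch with an illustrative helper name; it assumes the build is reproducible, and a build that embeds timestamps would need a looser comparison.

```shell
# Succeed when the artifact from the old pipeline and the one from the
# v2 pipeline are byte-for-byte identical.
artifacts_equal() {
  local old_artifact="$1" new_artifact="$2"
  cmp -s "$old_artifact" "$new_artifact"
}
```

A typical call would be `artifacts_equal old/probe.tar.gz v2/probe.tar.gz && echo identical`, where both paths are placeholders for wherever the two pipelines publish their output.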

Phase 3 — Expand to all "linux" workflows (2-3 weeks)

Convert all workflows whose builds only need Go/Rust/Node/Flutter Linux. Drive the migration by `runs-on` label change + checkout block rewrite. One PR per workflow to keep blast radius small. Maintain the `LOCAL_REPO` fallback path until each workflow is verified.

Phase 4 — Android + Flutter-web runners (1 sprint)

Provision ci-runner-android and ci-runner-flutter containers. Install respective toolchains. Migrate Android + Flutter-desktop workflows. Each toolchain install gets a `runbooks/runner-add.kmd` checklist (reproducible).

Phase 5 — ISO + deploy runners (1 sprint)

Hardest two:

  • *ISO runner:* needs root inside the container, real mount for lb build. Use Incus container with security.privileged=true (the only one of the pool). Quarantined.
  • *Deploy runner:* needs SSH keys for s.khost1 service containers + Hub publishing creds. Strict role separation from build runners.

Phase 6 — Decommission laptop runner (final cleanup)

  • Stop act_runner daemon on the laptop.
  • Optionally unregister from Flow (or leave as inactive — historical).
  • Remove all LOCAL_REPO-based blocks from any leftover workflow.
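Leftover rsync checkouts can be found with a plain text search. A minimal bash sketch; `legacy_checkouts` is an illustrative name, and the default directory is the `.gitea/workflows/` location at the monorepo root.

```shell
# List workflow files that still reference LOCAL_REPO.
# Prints nothing when the migration is complete.
legacy_checkouts() {
  local workflows_dir="${1:-.gitea/workflows}"
  grep -rl 'LOCAL_REPO' "$workflows_dir" 2>/dev/null | sort
}
```

Run from the monorepo root as `legacy_checkouts`; an empty result is the signal that the laptop runner's checkout path is no longer referenced.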

Risks + mitigations

| risk | mitigation |
|---|---|
| Pool runner toolchain drifts from laptop's | Toolchain pinned in runner-images/<role>/Dockerfile; CI updates batch-tested |
| Nested Docker performance | Benchmark in Phase 2; fall back to host mode if measurable degradation > 30% |
| New TLS-drop problem at scale | Internal HTTP + shallow + partial clone removes the original cause; monitor |
| ISO runner is privileged | Strict allowlist of branches/tags that can dispatch to it; signed |
| Single-host SPOF (s.khost1) | Future RFC: multi-host pool across s.khost1 + s.r1. Not in scope here |
| Migration breaks active workflows | Each workflow's v2 runs in parallel with v1 until green for 3+ consecutive releases |
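The 30% fallback rule for nested Docker can be made mechanical. A minimal bash sketch, assuming wall times are measured in whole seconds for the same job in both modes; the helper name is illustrative.

```shell
# Succeed when Docker-mode wall time exceeds host-mode wall time by more
# than the threshold percentage (default 30). Integer seconds only.
degraded_beyond_threshold() {
  local host_secs="$1" docker_secs="$2" threshold_pct="${3:-30}"
  # docker > host * (1 + threshold/100), rearranged to stay in integers
  [ $(( docker_secs * 100 )) -gt $(( host_secs * (100 + threshold_pct) )) ]
}
```

With a 100 s host-mode baseline, a 131 s Docker-mode run trips the check and a 130 s run does not, because the rule is strictly "more than 30%".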

Non-goals

  • Replacing act_runner with something else (e.g., GitHub Actions runner directly). act_runner is the Gitea-blessed solution; works.
  • Multi-host runner pool (HA across machines). Future RFC after Phase 6.
  • Self-hosted Docker registry for runner base images. Use the existing Hub or registry.koder.dev as available; not in this RFC.
  • Caching layer (cargo, npm, pub.dev caches). Big win, separate RFC.

Decision

*Adopt.* Phase 1 already shipped. Phase 2 starts in the next available sprint slot.

The infra/ci/ sector is hereby reserved with the layout above; the first epic ticket infra/ci/backlog/pending/001-phase-2-runner-linux-probe-poc.md opens immediately to track Phase 2.

Open questions

  1. *Internal HTTP Host header:* when the runner clones via http://10.0.1.43:3000/Koder/koder.git, does Gitea reject the request because the Host header doesn't match flow.koder.dev? Test in Phase 2 before designing around it.
  2. *Submodule koder-hardware:* currently has no .gitmodules entry but the warning fires on every checkout. Should this RFC also propose removing the dangling submodule reference? Out of scope, but file a follow-up ticket.
  3. *Workflow path:* workflows live in .gitea/workflows/ at monorepo root. Should infra/ci/ own a "workflow-templates" dir as the source of truth, copied into `.gitea` at build time? Aesthetic, not load-bearing.
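Question 1 can be answered with two HTTP requests against git's smart-HTTP discovery endpoint, one with the default Host header and one overriding it. A minimal bash sketch, assuming curl is installed on the runner; both helper names are illustrative.

```shell
# Build the smart-HTTP ref advertisement URL for a repository URL.
refs_url() {
  printf '%s/info/refs?service=git-upload-pack\n' "${1%/}"
}

# Print the HTTP status returned for the ref advertisement.
#   $1  repository URL
#   $2  optional Host header value to send instead of the default
probe_clone_endpoint() {
  local url host_header="${2:-}"
  url="$(refs_url "$1")"
  if [ -n "$host_header" ]; then
    curl -sS -o /dev/null -w '%{http_code}\n' -H "Host: ${host_header}" "$url"
  else
    curl -sS -o /dev/null -w '%{http_code}\n' "$url"
  fi
}
```

Comparing `probe_clone_endpoint http://10.0.1.43:3000/Koder/koder.git` with `probe_clone_endpoint http://10.0.1.43:3000/Koder/koder.git flow.koder.dev` from inside the runner container shows whether the Host header changes the response.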

References

  • meta/docs/stack/rfcs/infra-RFC-001-koder-runtime-strategy.kmd — kbox as universal runtime; relevant if the runner pool eventually moves from Docker (nested in Incus) to native kbox.
  • meta/docs/stack/policies/test-host-isolation.kmd — adjacent concern (heavy tests on a dedicated VM). The principle is the same: dev environment ≠ CI/test environment.
  • meta/context/infrastructure/servers.md — s.khost1 capacity.
  • meta/context/infrastructure/services.md — Koder Flow + internal IPs.

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/infra-RFC-002-self-hosted-runner-pool.kmd