RFC-002 — Koder Herald Phase 1: First PoP Network

| Field | Value |
|---|---|
| Status | *Draft* (2026-04-15) |
| Author(s) | Rodrigo (with Claude as scribe) |
| Date | 2026-04-15 |
| Target module | platform/dns/ |
| Depends on | RFC-001 (accepted), Phase 0 complete |

1. Summary

Phase 1 moves Koder Herald from the Phase 0 software-only bridge (ClouDNS/Porkbun as execution layer) to operating its own authoritative DNS infrastructure at the first PoP. The PoP uses managed anycast (provided by the hosting partner; Koder runs no BGP of its own yet) backed by Knot DNS and koder-dns-sync. This phase proves the product on real infrastructure before investing in own BGP (Phase 2).

Phase 1 exit criteria:

  • At least one PoP is live, reachable via anycast IP, and serving authoritative DNS
  • koder-herald, koder-dns-sync, and Knot DNS are running in production
  • Koder's own infrastructure (koder.dev and related domains) is migrated off ClouDNS to Koder Herald
  • At least 3 paying external customers are using the platform
  • p95 query latency < 10 ms from Brazil; < 30 ms from Miami and Amsterdam
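
The latency criterion can be checked mechanically once probe samples are available. The following is a minimal sketch, assuming the nearest-rank definition of a percentile and latencies collected in milliseconds per region; the names `percentile`, `P95_LIMITS_MS`, and `meets_latency_target` are hypothetical and not part of Herald.

```python
# Thresholds from the exit criteria above, in milliseconds.
P95_LIMITS_MS = {"brazil": 10.0, "miami": 30.0, "amsterdam": 30.0}


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100 gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_latency_target(region, latencies_ms):
    """True when the region's p95 latency is strictly below its limit."""
    return percentile(latencies_ms, 95) < P95_LIMITS_MS[region]
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples, so the measurement tool's definition should be the one that counts.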

2. Architecture Recap (from RFC-001)

  ┌──────────────────────────────────────┐
  │  koder-herald (engine)               │
  │  - Zone/record store (kdb)           │
  │  - GeoDNS rules                      │
  │  - Failover monitors                 │
  │  - DDNS service                      │
  │  - Analytics collector               │
  └────────────────┬─────────────────────┘
                   │ HTTP API (zone serial polling)
       ┌───────────┴──────────┐
       │                      │
  ┌────▼──────────────────┐   │  (future PoPs)
  │  PoP — São Paulo      │   │
  │  ┌──────────────────┐ │   │
  │  │  Knot DNS        │ │   │
  │  │  (authoritative) │ │   │
  │  └────────┬─────────┘ │   │
  │           │ knotc     │   │
  │  ┌────────▼─────────┐ │   │
  │  │  koder-dns-sync  │ │   │
  │  │  (zone agent)    │ │   │
  │  └──────────────────┘ │   │
  │  Anycast IP: Vultr    │   │
  └───────────────────────┘   │

3. Technical Decisions

Decision 1 — First PoP location: São Paulo, Brazil

*Chosen:* São Paulo (BR).

*Rationale:* Koder is a Brazilian company, and all current customers are in Brazil. São Paulo has several high-quality data centres (Equinix SP, Ascenty) that support LGPD compliance. PTT.br (São Paulo) is Brazil's primary IXP, which will be useful for Phase 2 peering. Starting in Brazil minimises latency for the existing customer base before expanding globally.

Decision 2 — Managed anycast provider: Vultr BGP

*Chosen:* Vultr with BGP sessions (managed anycast).

*Rationale:* Vultr offers BGP on dedicated servers in São Paulo starting from ~$60/month. Vultr announces the IP block, so Koder does not need its own ASN or PI block at this stage. This removes the operational burden of running BIRD 2 and managing a BGP session before there is customer demand to justify it. Own BGP is Phase 2 (RFC-003).

Alternatives evaluated:

  • *Hivelocity:* also viable, slightly more expensive in BR
  • *Hetzner:* no São Paulo PoP yet (2026)
  • *Hurricane Electric:* requires own ASN upfront

Decision 3 — DNS software at PoP: Knot DNS 3.x

*Chosen:* Knot DNS 3.x, installed from official packages (knot on Debian/Ubuntu).

*Rationale:* Already decided in RFC-001 §12 (decision #3). Knot DNS provides mod-geoip (Phase 1 ticket #015 and RFC-001 ticket #010), built-in DNSSEC with auto-signing, and rate limiting, all in a single package. The koder-dns-sync agent writes zone files and calls `knotc zone-reload`; no deeper integration is needed.
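
The write-then-reload step described above could look roughly like the sketch below. It is an illustration, not the agent's actual code: the function name `install_zone`, the `<zone>.zone` file naming, the `0o644` file mode, and the injectable `run` parameter are all assumptions. The only external command used is `knotc zone-reload <zone>`.

```python
import os
import subprocess
import tempfile


def install_zone(zone, zone_text, zone_dir, run=subprocess.run):
    """Atomically write the zone file, then ask Knot DNS to reload that zone."""
    path = os.path.join(zone_dir, f"{zone}.zone")  # assumed naming scheme
    fd, tmp_path = tempfile.mkstemp(dir=zone_dir, prefix=f".{zone}.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(zone_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        # mkstemp creates the file as 0600; assume Knot needs read access.
        os.chmod(tmp_path, 0o644)
        # Rename within the same directory so Knot never sees a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # check=True raises CalledProcessError if the reload fails.
    run(["knotc", "zone-reload", zone], check=True)
    return path
```

Writing to a temporary file and renaming it means a crash in the middle of a sync leaves the previous zone file in place rather than a truncated one.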

Decision 4 — DNSSEC: Knot DNS keymgr, auto-signing

*Chosen:* Knot DNS native DNSSEC auto-signing via keymgr. Herald does not sign zones; it delegates signing entirely to Knot DNS at the PoP.

*Rationale:* Knot DNS has a mature, production-grade DNSSEC implementation used by ccTLD operators. Implementing DNSSEC in Herald would duplicate this work and introduce security risk. The zone file exported by Herald is unsigned; Knot DNS signs it locally and serves the signed responses. Key material stays on the PoP (never leaves the server).

DNSSEC keys are backed up via keymgr backup to encrypted storage daily.

Decision 5 — Authentication: Koder ID JWT (replace X-Tenant-ID header)

*Chosen:* Bearer JWT issued by Koder ID. The `sub` claim is the tenant ID.

*Rationale:* Phase 0 uses a plain X-Tenant-ID header, which provides identification only and no authentication. This is acceptable in a closed environment but not for a public API. Phase 1 wires Herald into Koder ID (already in production as of 2026-04-09) as an API client. Herald validates the JWT signature using Koder ID's public key, which it fetches at startup and caches with a 24 h TTL.

Migration: Phase 0 clients using X-Tenant-ID continue to work until Phase 2 (grace period). A feature flag LEGACY_TENANT_HEADER=true in Herald enables the fallback.
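
The two moving parts of this decision, the 24-hour key cache and the legacy-header fallback, can be sketched as follows. This is a simplified illustration under stated assumptions: `CachedKey` and `resolve_tenant` are hypothetical names, the key fetch and the JWT signature check are injected as callables (`fetch`, `verify_jwt`) rather than implemented here, and `legacy_tenant_header` stands in for the `LEGACY_TENANT_HEADER` flag.

```python
import time


class CachedKey:
    """Caches a fetched public key and refetches it once the TTL has elapsed."""

    def __init__(self, fetch, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._key = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._key is None or now >= self._expires_at:
            self._key = self._fetch()
            self._expires_at = now + self._ttl
        return self._key


def resolve_tenant(headers, verify_jwt, legacy_tenant_header=False):
    """Return the tenant ID for a request, or None if it cannot be established.

    verify_jwt(token) is assumed to return the validated claims as a dict
    or raise ValueError when the token is invalid.
    """
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        try:
            claims = verify_jwt(auth[len("Bearer "):])
        except ValueError:
            return None  # a bad token never falls back to the legacy header
        return claims.get("sub")
    if legacy_tenant_header:
        return headers.get("X-Tenant-ID")
    return None
```

One design choice in this sketch is that a request carrying an invalid token is rejected even when the legacy flag is on, so the fallback only applies to clients that send no token at all.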

Decision 6 — Storage: keep kdb (SQLite-backed) for Phase 1

*Chosen:* kdb remains the storage layer for Phase 1.

*Rationale:* kdb-next (Rust/TiKV, RFC-001 in platform/kdb/next/) is still in design. The current kdb handles the Herald workload well for Phase 1 volumes (thousands of zones, millions of records). Migrating to kdb-next mid-phase introduces unnecessary risk. Herald will keep all storage access behind the existing repository interface, so migration to kdb-next in Phase 2 should be straightforward.
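
To illustrate why a repository interface makes the later swap cheap, here is a minimal sketch. The interface shown (`ZoneRepository`, its three methods, and the `bump_serial` caller) is hypothetical and is not Herald's actual repository API; the point is only that callers depend on the protocol, not on kdb.

```python
from typing import Optional, Protocol


class ZoneRepository(Protocol):
    """Hypothetical storage contract that both kdb and kdb-next could satisfy."""

    def get_serial(self, zone: str) -> Optional[int]: ...
    def put_zone(self, zone: str, serial: int, records: list[str]) -> None: ...
    def get_records(self, zone: str) -> list[str]: ...


class InMemoryZoneRepository:
    """Stand-in backend used here instead of a real kdb connection."""

    def __init__(self) -> None:
        self._zones: dict[str, tuple[int, list[str]]] = {}

    def get_serial(self, zone: str) -> Optional[int]:
        entry = self._zones.get(zone)
        return None if entry is None else entry[0]

    def put_zone(self, zone: str, serial: int, records: list[str]) -> None:
        self._zones[zone] = (serial, list(records))

    def get_records(self, zone: str) -> list[str]:
        return list(self._zones[zone][1])


def bump_serial(repo: ZoneRepository, zone: str, records: list[str]) -> int:
    """Store new records under the next serial; works with any backend."""
    current = repo.get_serial(zone)
    new_serial = 1 if current is None else current + 1
    repo.put_zone(zone, new_serial, records)
    return new_serial
```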

Decision 7 — Zone transfer: zone file export (no AXFR protocol)

*Chosen:* koder-dns-sync fetches zone files via GET /api/v1/zones/{zone}/export (HTTP, BIND format), not via the DNS AXFR protocol.

*Rationale:* Implementing AXFR (RFC 5936) in Herald is significant work for Phase 1. The HTTP-based export achieves the same result (full zone sync on serial change) and is simpler to implement, debug, and monitor. AXFR will be added in Phase 2 to support external secondaries (ticket #043).
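
The "full zone sync on serial change" behaviour can be sketched as a single polling step. Only the export path comes from this RFC; the function names, the base URL handling, and the injected callables (`get_serial`, `fetch_export`, `install`) are assumptions, since the serial-polling endpoint is not specified here.

```python
from urllib.parse import quote


def export_url(base_url, zone):
    """Build the export URL for a zone from the path given in Decision 7."""
    return f"{base_url.rstrip('/')}/api/v1/zones/{quote(zone, safe='')}/export"


def sync_zone(zone, last_serial, get_serial, fetch_export, install):
    """Fetch and install the zone only when the engine's serial has changed.

    Returns (serial, changed). The callables are injected so that the HTTP
    client and the Knot DNS reload stay outside this function.
    """
    serial = get_serial(zone)
    if serial == last_serial:
        return last_serial, False
    zone_text = fetch_export(zone)
    install(zone, zone_text)
    return serial, True
```

A caller would run `sync_zone` on a timer for each zone and keep the returned serial for the next round.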

Decision 8 — PoP health monitoring: Herald polls koder-dns-sync /healthz

*Chosen:* Each koder-dns-sync agent exposes GET /healthz returning last-sync serial, uptime, and zone count. Herald polls this endpoint every 60 seconds. If a PoP misses 3 consecutive polls, Herald raises an alert.

*Rationale:* Simple and decoupled. The sync agent already has all the information needed (last synced serial, whether the knotc reload succeeded). Herald does not need to query the DNS port (UDP 53) for health; that is the job of external monitoring (e.g. DNScheck, uptime monitoring).
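
The "3 consecutive missed polls" rule amounts to a small per-PoP counter. A minimal sketch follows, with hypothetical names (`PopHealthTracker`, `record_poll`) and the assumption that an alert should fire once, at the moment the threshold is reached, rather than on every subsequent miss.

```python
POLL_INTERVAL_SECONDS = 60  # polling period from Decision 8


class PopHealthTracker:
    """Counts consecutive failed /healthz polls per PoP."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self._missed = {}

    def record_poll(self, pop, ok):
        """Record one poll result; return True exactly when an alert should fire."""
        if ok:
            self._missed[pop] = 0  # any success resets the streak
            return False
        self._missed[pop] = self._missed.get(pop, 0) + 1
        return self._missed[pop] == self.max_missed
```

With a 60-second interval, this raises the alert roughly three minutes after a PoP stops answering.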


4. Implementation Sequence

Tickets to implement in order:

| # | Ticket | Depends on | Phase 1 milestone |
|---|---|---|---|
| #011 | Production deployment of koder-herald | Phase 0 complete | Engine live |
| #012 | Koder ID JWT authentication | #011, Koder ID live | Secure API |
| #013 | DNSSEC key management (Knot DNS keymgr) | #014 | Signed zones |
| #014 | First PoP deployment: São Paulo (Vultr) | #011 | First PoP live |
| #015 | Knot DNS + mod-geoip at PoP | #014 | GeoDNS serving |
| #016 | Secondary DNS slave zone support | #014 | External secondaries |
| #017 | apps/domains integration | #012 | Customer onboarding |
| #018 | Customer billing model (zones + queries) | #017 | Revenue-ready |
| #019 | PoP heartbeat + multi-PoP failover | #014 | Operational monitoring |
| #020 | Terraform/OpenTofu provider | #012 | Infra-as-code |
| #021 | SOA and NS customisation | #014 | Custom nameservers |
| #022 | Rate limiting per tenant | #011 | Abuse protection |
| #023 | Zone locking and change review | #012 | Ops safety |
| #024 | DNSSEC key rollover automation | #013 | Key hygiene |
| #025 | Weighted round-robin routing | #015 | Traffic distribution |
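
A dependency-respecting order can be derived from the "Depends on" column rather than from ticket numbers alone. The sketch below uses Python's standard `graphlib` and is only an illustration: the mapping is transcribed from the ticket table above, non-ticket prerequisites ("Phase 0 complete", "Koder ID live") are left out, and the helper names are hypothetical. Note that, as listed, #013 depends on #014 and therefore has to come after it.

```python
from graphlib import TopologicalSorter

# Ticket -> tickets it depends on, transcribed from the table above.
DEPENDS_ON = {
    "#011": [],
    "#012": ["#011"],
    "#013": ["#014"],
    "#014": ["#011"],
    "#015": ["#014"],
    "#016": ["#014"],
    "#017": ["#012"],
    "#018": ["#017"],
    "#019": ["#014"],
    "#020": ["#012"],
    "#021": ["#014"],
    "#022": ["#011"],
    "#023": ["#012"],
    "#024": ["#013"],
    "#025": ["#015"],
}


def implementation_order(depends_on):
    """Return one valid order in which every dependency precedes its dependants."""
    return list(TopologicalSorter(depends_on).static_order())


def out_of_order(listed, depends_on):
    """Return (ticket, dependency) pairs where the dependency is listed later."""
    position = {ticket: i for i, ticket in enumerate(listed)}
    return [
        (ticket, dep)
        for ticket in listed
        for dep in depends_on[ticket]
        if position[dep] > position[ticket]
    ]
```

Running `out_of_order(list(DEPENDS_ON), DEPENDS_ON)` against the table order reports the single pair `("#013", "#014")`.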

5. Open Questions

  1. *Vultr BGP or own ASN?* Confirmed: Vultr managed anycast for Phase 1. Own ASN/BGP is Phase 2 (RFC-003).
  2. *How many PoPs in Phase 1?* Start with 1 (São Paulo). Add a 2nd PoP (Miami or Amsterdam) once the first is stable and there is demand from outside Brazil. Target: 2 PoPs by end of Phase 1.
  3. *kdb-next timeline:* Phase 1 stays on current kdb. If kdb-next reaches production before Phase 1 ends, evaluate migration opportunistically.

6. Success Criteria

  • Koder Herald engine running in production on s.k.lin (or dedicated VM)
  • São Paulo PoP live with Knot DNS serving authoritative DNS via managed anycast IP
  • koder.dev and all Koder-operated domains migrated off ClouDNS to Koder Herald
  • JWT-authenticated public API with Swagger/OpenAPI spec published
  • Minimum 3 paying external customers onboarded via apps/domains
  • p95 query latency < 10 ms from São Paulo (measured via DNSperf or Catchpoint)
  • Zero DNSSEC validation failures in production monitoring (90 days)
  • Terraform provider published on Terraform Registry

Source: ../home/koder/dev/koder/meta/docs/stack/rfcs/dns-RFC-002-phase1-first-pop-network.md