Signs RFC 001 architecture overview

signs-RFC-001: Sign-language avatar + bilingual translation across the Koder Stack

Status: *Ready for ratification*. All technical questions resolved 2026-04-29 (Q1-Q4 + Q8-Q10). Q5-Q7 are human-process work running in parallel.
Created: 2026-04-29
Renamed: 2026-05-23 (libras-RFC-001 → signs-RFC-001 per owner-ratified rename services/ai/libras → services/ai/signs; the engine is generic, Libras is the first corpus)
Last revised: 2026-05-23
Author: Koder team
Modules: services/ai/signs (proposed; first corpus: Libras), engines/sdk/koder_hand_kit (proposed), every Koder UI (consumer)
Related specs: policies/sdk-first.kmd, policies/hyperscale-first.kmd, specs/koder-app/behaviors.kmd, policies/language.kmd
Related projects: Koru (products/horizontal/koru), which explicitly avoided "Hand" naming so this RFC could claim it
Supersedes: Draft v1 (proposed Path A integration with VLibras as fallback), withdrawn 2026-04-29 in favor of a fully proprietary stack

1. Goal

Make every Koder UI capable of presenting its content in *Brazilian Sign Language (Libras)* through a friendly, animated 3D *avatar (Hand)* that signs in real time, activated by a single Settings toggle and persisted across the user's apps via Koder ID.

The capability must:

  • Accept *Brazilian Portuguese (pt-BR) and American English (en-US)* as source languages.
  • Output *Libras* (only); ASL is reserved for a future version.
  • Work in *mobile, desktop, TV, and web* Koder UIs, i.e., a cross-cutting feature per `policies/sdk-first.kmd`.
  • Be *fully proprietary and self-hosted*: no dependency on VLibras or any third-party service (Path C).
  • Be *architected for hyperscale from day one*: no throw-away v0; the architecture that ships first is the architecture that scales to v∞.

2. Why now

  • Accessibility-first products attract a meaningful user base (10M+ deaf/HoH Brazilians, ~5% of population per IBGE) and align with Koder's positioning as the platform that "speaks Brazil while shipping globally".
  • Public-sector buyers (vital to Edictus, ERP track) require Libras compliance; a single SDK widget makes every Koder app eligible at zero per-product effort.
  • Koder products are en-US first in UI source / marketing per policies/language.kmd, but operate primarily in Brazil where deaf users sign Libras. *Hand serves the same UI to two source-language audiences in one architecture.*
  • Koru just shipped (2026-04-29) without using "Hand" as its brand precisely so this RFC could claim it.

3. Requirements

Functional

  • *F1* Single Settings toggle in any Koder UI activates Hand; persists across apps via Koder ID.
  • *F2* Translate visible UI text and on-demand passages to Libras in real time.
  • *F3* Render an animated 3D avatar overlay (draggable, resizable, dismissible).
  • *F4* Work on mobile (Flutter Android/iOS), desktop (Flutter Linux/macOS/Windows), TV (TizenOS + WebOS), and web (Flutter Web + landing-page widget).
  • *F5* Optional speech input (hold-to-speak); defers to services/ai/voice for ASR.
  • *F6* Accept both *pt-BR* and *en-US* as source languages, with an explicit lang tag on the API call (auto-detection as fallback).

Non-functional

  • *NF1* Latency: gloss sequence starts ≤300ms after the request; first gesture rendered ≤500ms after the request.
  • *NF2* Bundle size: SDK widget ≤8MB on mobile (gzipped); avatar assets streamable from CDN if larger.
  • *NF3* Offline graceful degradation: SDK has a small cached glossary (high-frequency phrases) so common UI labels keep signing without network.
  • *NF4* Privacy: translation requests are stateless and not logged with content; only opt-in usage telemetry per specs/errors/reporting.kmd.
  • *NF5* Compliance: Lei nº 10.436/2002 + Decreto 5.626/2005 cited in landing/legal copy; WCAG 2.2 AA where applicable.
  • *NF6* *Self-hosted only.* Like every other Koder backend (Flow, AI Voice, Foundation), Koder AI Signs runs entirely on Koder infrastructure. Calls to any external endpoint are forbidden, for sovereignty (no third party can take the service down) and privacy (translation requests never leave Koder infra).
  • *NF7* *Hyperscale-first* per `policies/hyperscale-first.kmd`: every architectural decision optimizes the long-term system over short-term ship speed; no throw-away v0.

4. Architecture

4.1 The pipeline (4 stages, swappable, instrumented)

[ source text + lang tag ]
    │
    ▼
┌────────────────────────────────────┐
│ §1 Normalization                   │  bilingual: pt-BR | en-US
│   - spell correction               │  rules per language
│   - abbreviation expansion         │  ("vc"→"você", "u"→"you")
│   - sentence segmentation          │
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ §2 Semantic parsing                │  ML — multilingual foundation model
│   - intent / entity / sentiment    │  (Llama-3.x · Gemma-3 · Qwen)
│   - cross-lingual representation   │  one model serves both src langs
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ §3 Glosa generation (HYBRID)       │  ┌──────────────────────────────┐
│   ┌─────────────────────────────┐  │  │ Confidence router            │
│   │ Rule-based dictionary       │◀─┼──│ · high conf → use rules      │
│   │ (curated, deterministic)    │  │  │ · low conf  → use ML         │
│   └─────────────────────────────┘  │  │ · ML hallucinates → fallback │
│   ┌─────────────────────────────┐  │  │   to fingerspelling          │
│   │ ML translator               │  │  └──────────────────────────────┘
│   │ (foundation + LoRA fine-tune)│  │
│   └─────────────────────────────┘  │
└──────────────┬─────────────────────┘
               │  glosa sequence (intermediate language, src-agnostic)
               ▼
┌────────────────────────────────────┐
│ §4 Animation generation            │  glosa → 3D timeline
│   - sign animation lookup          │  (Hand renders client-side)
│   - inflection / classifier rules  │  Libras spatial grammar
│   - timing / rhythm                │
└──────────────┬─────────────────────┘
               │
               ▼
[ Hand avatar signs ]
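
The "swappable, instrumented" property of the four stages can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Pipeline` class, the stage function names, the toy stage bodies, and the 400ms-per-sign spacing are stand-ins chosen for illustration, not the actual services/ai/signs implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stage signature: each stage maps one payload to the next.
Stage = Callable[[object], object]


@dataclass
class Pipeline:
    """Runs named stages in order and records per-stage wall-clock time."""
    stages: List[Tuple[str, Stage]]
    timings_ms: Dict[str, float] = field(default_factory=dict)

    def run(self, payload):
        for name, stage in self.stages:
            start = time.perf_counter()
            payload = stage(payload)
            self.timings_ms[name] = (time.perf_counter() - start) * 1000.0
        return payload

    def swap(self, name: str, new_stage: Stage) -> None:
        """Replace one stage without touching the others."""
        self.stages = [(n, new_stage if n == name else s) for n, s in self.stages]


# Toy stand-ins for the four stages; real stages would be far richer.
def normalize(request):
    text, lang = request
    abbreviations = {"pt-BR": {"vc": "você"}, "en-US": {"u": "you"}}
    words = [abbreviations[lang].get(w, w) for w in text.lower().split()]
    return words, lang


def semantic(parsed):
    words, _lang = parsed
    return {"tokens": words}


def glosa(meaning):
    # Placeholder only: real glosa is not just upper-cased source words.
    return [token.upper() for token in meaning["tokens"]]


def animate(glosses):
    # Assumed fixed spacing of 400 ms per sign, purely for illustration.
    return [{"gloss": g, "start_ms": i * 400} for i, g in enumerate(glosses)]


pipeline = Pipeline([("normalize", normalize), ("semantic", semantic),
                     ("glosa", glosa), ("animate", animate)])
timeline = pipeline.run(("oi vc", "pt-BR"))
```

Because each stage is addressed by name, a call such as `pipeline.swap("glosa", better_glosa)` replaces one stage while the other three, and the timing instrumentation, stay untouched.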

*Why hybrid neural-symbolic, not pure ML or pure rules:*

| Need | Best handled by | Why |
|---|---|---|
| UI labels, brand names, technical terms | Rules | Zero hallucination; auditable; ships fast |
| Free-form text, novel phrasings, semantics | ML | Generalizes; handles inputs we never enumerated |
| Hard cases where ML is uncertain | Fingerspelling fallback | Honest "I don't know"; users can still read it |

Production-grade NLP in specialized domains (medical, legal, sign-language) *all use hybrid*. Single-paradigm systems either don't scale (rules-only) or hallucinate dangerously (ML-only).
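
A minimal Python sketch of the confidence router described above, under stated assumptions: the `route` and `fingerspell` functions, the `FS:` and `GLOSS_` placeholder labels, and the 0.6 threshold are all hypothetical, since this RFC does not fix numeric thresholds or a gloss naming scheme.

```python
from typing import Callable, Dict, List, Tuple

# Assumed threshold; the RFC does not specify a numeric value.
ML_MIN_CONFIDENCE = 0.6


def fingerspell(term: str) -> List[str]:
    """Spell a term letter by letter (the fallback of last resort)."""
    return [f"FS:{ch.upper()}" for ch in term if ch.isalnum()]


def route(term: str,
          rules: Dict[str, List[str]],
          ml_translate: Callable[[str], Tuple[List[str], float]],
          ) -> Tuple[str, List[str]]:
    """Return (source, glosses) for one term."""
    # 1. Curated dictionary hit: deterministic, treated as high confidence.
    if term in rules:
        return "rules", rules[term]
    # 2. Otherwise ask the ML translator, which also reports a confidence.
    glosses, confidence = ml_translate(term)
    if glosses and confidence >= ML_MIN_CONFIDENCE:
        return "ml", glosses
    # 3. ML is unsure or returned nothing: fingerspell instead of guessing.
    return "fingerspelling", fingerspell(term)


# Toy inputs; the gloss labels are placeholders, not real Libras glosses.
rules = {"settings": ["GLOSS_SETTINGS"]}


def fake_ml(term: str) -> Tuple[List[str], float]:
    confidence = 0.9 if term == "hello" else 0.2
    return [f"GLOSS_{term.upper()}"], confidence


known = route("settings", rules, fake_ml)
novel = route("hello", rules, fake_ml)
unsure = route("Koru", rules, fake_ml)
```

In this sketch the rule layer is consulted first, so an ML regression can never change how a curated term is signed; only terms outside the dictionary ever reach the model.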

4.2 Foundation model strategy (Q3, RESOLVED 2026-04-29)

We do *not* train from scratch; we *fine-tune* a multilingual foundation model that already understands pt-BR and en-US.

  • *Why not from scratch:* training a 7B-param model on PT+EN+Libras from zero would cost ~$500k-2M and 6+ months. Fine-tuning gives us 95% of the result for ~$5-20k and weeks.

Decision: Gemma 4 + LoRA via Unsloth

*Selection rationale:*

| Model | License | PT-BR | EN-US | Size for LoRA | Decision |
|---|---|---|---|---|---|
| *Gemma 4* | *Apache 2.0* | Strong (100+ langs trained) | Strong | E2B 8–10GB VRAM · E4B 17GB | *CHOSEN* |
| Llama 4 | Meta license (commercial OK <700M MAU) | Strong (PT one of 8 official) | Strong | varies | Fallback if Gemma 4 underperforms PT-BR↔Libras in v1 spike |
| Qwen 3.5 | Apache 2.0 | Decent (Asia-focused training) | Strong | varies | 201 langs supported but PT-BR quality less proven |
| ~Sabiá-7b (Maritaca)~ | Research-only | Excellent | Weak | | *REJECTED*: not commercial-friendly |
| ~Sabiá-2 / Sabiá-3~ | Commercial API only | Excellent | n/a | | *REJECTED*: violates NF6 (self-hosted only) |

*Why Gemma 4:*

  • *Apache 2.0* in 2026 (changed from the Gemma license): cleanest legal posture, no clauses to track, no MAU thresholds.
  • *100+ training languages, 30+ first-class*: both pt-BR and en-US natively strong.
  • *E4B size (~4B effective params)*: sweet spot for a specialized translation task; fits a commodity GPU; inference latency aligned with NF1 (≤300ms).
  • *Active Unsloth pipeline*: official LoRA documentation; ~$10-16 + 8-12h on a single H100 produces a 50-200MB LoRA adapter that merges with the base model.
  • *Cross-lingual transfer*: a single fine-tune handles both pt-BR and en-US inputs because the base model understands both natively. A pt-BR training example partially teaches the en-US equivalent, which acts as a corpus efficiency multiplier.

*Architectural commitment vs. specific model:* the architectural decision (foundation model + LoRA) is stable; the specific base (Gemma 4 today) is swappable. Re-evaluate at each major release of Gemma / Llama / Qwen; the fine-tune effort to migrate is days, not months.

*v1 spike before locking:* train a small LoRA on a 5k-pair pt-BR↔glosa subset and evaluate human-rated quality vs. baseline. If Gemma 4 underperforms Llama 4 by >15% on the eval set, switch to Llama 4 (license trade-off acceptable given we're <700M MAU by orders of magnitude).
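
The switch rule for the spike can be written down as a small Python sketch. It assumes the 15% gap is measured relative to Llama 4's score on the same human-rated eval set; the RFC does not say whether the gap is relative or absolute, so that reading, the function name, and the example scores are all hypothetical.

```python
def choose_base_model(gemma_score: float, llama_score: float,
                      max_relative_gap: float = 0.15) -> str:
    """Pick the base model from two human-rated eval scores (higher is better).

    Assumption: "underperforms by >15%" means the relative shortfall of
    Gemma 4 with respect to Llama 4 exceeds max_relative_gap.
    """
    if llama_score <= 0:
        raise ValueError("llama_score must be positive")
    relative_gap = (llama_score - gemma_score) / llama_score
    return "Llama 4" if relative_gap > max_relative_gap else "Gemma 4"


# Made-up scores on a 1-5 rating scale, for illustration only.
close_call = choose_base_model(gemma_score=4.0, llama_score=4.2)
clear_loss = choose_base_model(gemma_score=3.0, llama_score=4.0)
```

With these made-up scores, a shortfall of about 5% keeps Gemma 4, while a 25% shortfall triggers the switch to Llama 4.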

4.3 Boundaries — modules to create

services/ai/signs/                       ← backend, owns the 4-stage pipeline
  ├── backend/         (Go HTTP/gRPC server: /v1/translate, /v1/speech)
  ├── stages/
  │   ├── normalize/   (Go — language-specific rules, deterministic)
  │   ├── semantic/    (Python — calls foundation model via gRPC)
  │   ├── glosa/       (Python — hybrid: rule lookup + ML inference)
  │   └── animate/     (Go — glosa → keyframes; run server-side OR client)
  ├── corpus/          (versioned PT+EN+glosa parallel corpus)
  ├── models/          (model registry: weights + version + eval metrics)
  ├── eval/            (held-out test set + regression suite)
  ├── annotation/      (web tool for Libras consultants to label/dispute)
  ├── docker-compose.yaml
  └── README.md

engines/sdk/koder_hand_kit/                ← Flutter SDK package
  ├── lib/
  │   ├── koder_hand_kit.dart              (public API)
  │   └── src/
  │       ├── hand_avatar.dart             (3D avatar widget — picks backend by platform)
  │       ├── hand_overlay.dart            (draggable floating window)
  │       ├── hand_button.dart             ("Ver em Libras" / "Watch in Libras")
  │       ├── hand_gate.dart               (auto-listen to UI text via accessibility tree)
  │       ├── gloss_player.dart            (consumes gloss sequence → animation timeline)
  │       ├── offline_dictionary.dart      (cached high-frequency entries for NF3)
  │       ├── face_overlay.dart            (Rive-based facial NMM blendshapes)
  │       └── backends/
  │           ├── filament_backend.dart    (mobile + desktop via platform channel)
  │           └── modelviewer_backend.dart (web + TV via <model-viewer>)
  ├── android/                             (Kotlin glue: Filament SurfaceView + glTF loader)
  ├── ios/                                 (Swift glue: Filament + Metal)
  ├── linux/macos/windows/                 (C++ glue: Filament desktop)
  ├── assets/                              (Hand glTF master + SMPL-X rig + Rive face)
  └── docs/

products/horizontal/hand/                  ← brand surface
  ├── README.md, koder.toml, icon.svg
  ├── landing/index.html                   (https://hand.koder.dev)
  └── (no app/ — Hand is embedded inside other apps via koder_hand_kit)

4.4 Separation of translation and rendering

Hard architectural boundary: the *translator* produces *glosa* (intermediate language), the *renderer* (Hand SDK) consumes glosa.

Why: gives us four levers we can pull independently:

| Change | Affects | Doesn't affect |
|---|---|---|
| Better translation model | `services/ai/signs` backend | Hand SDK, animations, apps |
| Better avatar 3D model | Hand SDK assets | Backend, translation quality |
| New input modality (voice) | Stage 1 normalization | Stages 2-4, SDK |
| New output modality (recorded video for non-Flutter integrations) | Renderer parallel to Hand | Backend, glosa |

This is the same pattern that lets Whisper improvements not affect Spotify clients.
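
One way to picture the boundary is as a small serializable contract: the translator emits it, the renderer reads it, and neither needs to know how the other works. The Python sketch below is an assumption-laden illustration; the `GlossSequence` and `GlossToken` types, their field names, and the JSON wire format are hypothetical and are not defined by this RFC.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class GlossToken:
    gloss: str                    # hypothetical identifier of a sign-bank entry
    fingerspelled: bool = False   # True when the term is spelled out


@dataclass(frozen=True)
class GlossSequence:
    source_lang: str              # "pt-BR" or "en-US"; informational for the renderer
    tokens: List[GlossToken]

    def to_json(self) -> str:
        """Serialize on the translator side."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @staticmethod
    def from_json(raw: str) -> "GlossSequence":
        """Deserialize on the renderer side."""
        data = json.loads(raw)
        tokens = [GlossToken(**token) for token in data["tokens"]]
        return GlossSequence(data["source_lang"], tokens)


# Placeholder gloss labels, not real Libras glosses.
sent = GlossSequence("en-US", [GlossToken("GLOSS_OPEN"),
                               GlossToken("KORU", fingerspelled=True)])
received = GlossSequence.from_json(sent.to_json())
```

Since the renderer depends only on this payload, a new translation model or a new avatar can ship independently as long as both sides keep honoring the same structure.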

4.5 Naming

  • *`services/ai/signs`*: descriptive, follows AI service convention (voice, recsys, runtime, bot). Technical reference name.
  • *`engines/sdk/koder_hand_kit`*: SDK package name; mirrors the `koder_kit` family.
  • *Hand*: public brand of the avatar (visible label "Avatar Hand" / "Hand — sign-language avatar"). Brand score 83 (Great), confirmed in earlier scoping. Visual identity is hands signing; the semantics exactly match the function.
  • Settings labels:
    • en-US: "Sign-language avatar (Hand)"
    • pt-BR: "Avatar de Libras (Hand)"

The two-name pattern (engine + brand) follows the established engines/kodec (technical) ↔ Play/Tune (consumer-facing) pattern.

4.6 Scope ladder

The architecture exists in full from v0. What changes between versions is *the maturity and coverage of each stage*, not the architecture itself. *No throw-aways.*

| Version | Stage 1 (norm) | Stage 2 (semantic) | Stage 3 (glosa) | Stage 4 (animate) | Inputs | Surfaces |
|---|---|---|---|---|---|---|
| *v0* | pt-BR + en-US rules, basic spell correction | Foundation model loaded but used only as a "confidence checker" | *Rule-based dictionary* (~500-1000 concepts, bilingual triggers) + fingerspelling fallback | ~1000 pre-animated signs (motion-capture) + fingerspelling (datilologia) animations | Text only | Mobile (Android first) |
| *v1* | Full normalization (regional spelling, idioms) | *ML active*: fine-tuned foundation model | Hybrid: rules for known concepts, ML for the rest, confidence-routed | Sign bank expanded to ~5000 + classifier rules + spatial grammar | Text + speech (services/ai/voice integration) | Mobile + Desktop + Web |
| *v2* | Multi-source disambiguation (mixed-lang text) | Larger model, RLHF from production feedback | ML dominant + ensemble + active learning loop in production | Bank of 10k+ signs + regional variants (Northeast / Southeast Libras) | Text + speech + camera (sign → text reverse) | All Koder UIs (incl. TV) |

*vN (reserved)*: *ASL* (American Sign Language) as a separate engine; Koder would internationalize for the US deaf market. Same pipeline, different glosa target language, different sign animation bank.

Critically: the rule-based dictionary is *not* a v0-only crutch. It remains in production forever as the high-confidence layer. The same applies to fingerspelling: never deprecated, always available for proper nouns and unknown terms.

4.7 Avatar tech stack (Q2, RESOLVED 2026-04-29)

Hand is rendered by a *per-platform best-of-class engine*, all consuming the *same glTF asset with SMPL-X rig*. No single engine is forced to be sub-optimal everywhere; one asset workflow serves all surfaces.

Animator workflow                    Rendering per platform
─────────────────                    ──────────────────────
Blender / Maya
    │
    ├─ skeletal animation            ┌──────────────────────────┐
    │   on SMPL-X rig                │ Mobile (Android, iOS)    │
    │                                │   → Filament (Google)    │
    │                                │      via platform channel│
    └─ export glTF + GLB             │      ~6MB native lib     │
        (single source of truth)     ├──────────────────────────┤
              │                      │ Desktop (Lin/Mac/Win)    │
              │                      │   → Filament native      │
              │                      ├──────────────────────────┤
              ├─→ runtime ─→         │ Web (Flutter Web)        │
              │                      │   → <model-viewer> (Google)│
              │                      │      built on three.js   │
              │                      ├──────────────────────────┤
              │                      │ TV (TizenOS, WebOS)      │
              │                      │   → <model-viewer>       │
              │                      │      (TVs are web-first) │
              │                      └──────────────────────────┘
              │
              └─→ Face overlay
                  → Rive (excellent at facial blendshapes,
                          essential for Libras non-manual markers
                          which are part of the grammar, not
                          decoration)

*Why this stack and not a single engine:*

| Need | Filament | flutter_scene | Unity | Rive 3D | model-viewer |
|---|---|---|---|---|---|
| glTF + SMPL-X support | | | | | |
| Production-mature | | ⚠️ preview | | ⚠️ 3D new | ✅ web only |
| Bundle ≤ NF2 (8MB mobile) | ✅ ~6MB | ✅ ~2MB | ❌ +25MB | ✅ ~1MB | n/a (web) |
| Hand-finger precision | | | | ❌ stylized | |
| Cross-platform | needs glue | yes (preview) | yes (heavy) | yes | web only |
| Used in Google production | | ⚠️ preview | n/a | n/a | |

Filament is the production-mature backend (Sceneform, Google Earth, etc.). flutter_scene will likely match it once it leaves preview; at that point we re-evaluate and possibly migrate mobile/desktop. The glTF asset format and SMPL-X rig are forward-compatible with that migration; only the rendering glue changes.

*Why SMPL-X specifically:*

SMPL-X is the de facto rig for sign-language research, used by Hand Talk's pipeline, SignAvatars (ECCV 2024), and most academic papers in the space. Adopting SMPL-X aligns with the format ecosystem (Blender/Maya export, glTF compatibility, MediaPipe pose retargeting). Note: the *SignAvatars dataset itself is non-commercial* and excluded from our training corpus (see §4.8 Corpus); we use SMPL-X as a *format/rig* independent of any particular dataset.

*Why Rive for face overlay (and only the face):*

Libras non-manual markers (eyebrow position, mouth shape, eye gaze direction, head tilt) are *grammatical*, not decorative. Filament can do facial blendshapes, but Rive's facial-animation tooling and editor are markedly better and the runtime is tiny (~700KB). The compromise: the face is rendered as a Rive surface composited over the Filament 3D body, sharing timing and parameters via the Hand SDK. Single avatar visually, two engines internally.

*Future migration target:* when flutter_scene ships stable, re-evaluate replacing Filament with it. The same glTF assets work directly; only the platform-channel glue is replaced.
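
The per-platform choice in the diagram above reduces to a lookup table. The Python sketch below mirrors that mapping for illustration only: the SDK itself is a Flutter package, and the `Backend` enum, the `pick_backend` function, and the platform strings are hypothetical names, not the API of `hand_avatar.dart`.

```python
from enum import Enum


class Backend(Enum):
    FILAMENT = "filament"
    MODEL_VIEWER = "model-viewer"


# Mapping taken from the rendering diagram: native targets use Filament,
# web-first targets (Flutter Web, TizenOS, WebOS) use <model-viewer>.
_BACKEND_BY_PLATFORM = {
    "android": Backend.FILAMENT,
    "ios": Backend.FILAMENT,
    "linux": Backend.FILAMENT,
    "macos": Backend.FILAMENT,
    "windows": Backend.FILAMENT,
    "web": Backend.MODEL_VIEWER,
    "tizen": Backend.MODEL_VIEWER,
    "webos": Backend.MODEL_VIEWER,
}


def pick_backend(platform: str) -> Backend:
    """Return the rendering backend for a platform name (case-insensitive)."""
    try:
        return _BACKEND_BY_PLATFORM[platform.lower()]
    except KeyError:
        raise ValueError(f"unsupported platform: {platform!r}") from None


mobile_backend = pick_backend("Android")
tv_backend = pick_backend("webos")
```

If flutter_scene later replaces Filament, only the values in this table (and the glue behind them) would change; callers and assets stay the same.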

4.8 Corpus strategy (Q4, RESOLVED 2026-04-29)

The Brazilian Libras corpus landscape is rich (UFPB, UFSC, UFPel, UFPE all have datasets). Our strategy *builds a primary corpus from VLibrasBD (CC BY-SA 4.0, commercial-permitted)* and supplements it with bootstrap MT for en-US plus paid annotation for v2 quality refinement.

*Primary corpus, VLibrasBD (Mendeley Data, CC BY-SA 4.0):*

  • *127,349 aligned pt-BR ↔ Libras-glosa sentence pairs*, built by 10 Libras interpreters
  • ~72k general-purpose sentences + ~55k from Brazilian federal government content (gov.br services)
  • DOI: 10.17632/ryj88ckjww (open download from Mendeley Data)
  • *License: CC BY-SA 4.0*: commercial use permitted; the *share-alike clause* applies to derivative works (interpreted conservatively, our trained models inherit CC BY-SA)

*Acceptance of share-alike:* open-weight models align with Koder's self-hosted ethos. Our value proposition is the integrated stack (Hand SDK + apps + Koder ID + UX), not the model weights themselves. The trade is acceptable vs. the ~$50-100k cost of annotating 127k pairs from scratch.

*Supplementary corpora (license verification pending before training):*

| Dataset | Pairs / Records | Use | License status |
|---|---|---|---|
| Libras-UFPel Corpus | 4,800 audiovisual recordings (2,400 sentences + 2,400 isolated signs) | Refinement / regional Pelotense variant | TBD |
| V-Librasil (UFPE) | 4,089 sign videos / 1,364 terms / 3 articulators | *Pose extraction* for animation bank (we extract poses; we don't redistribute videos) | TBD (IEEE Dataport) |
| LSWH100 | 144k synthetic handshape images / 100 classes | v2+ sign→text reverse (handshape recognition) | TBD |
| LIBRAS-UFOP | 56 signs, Kinect RGB-D + skeleton | Supplementary pose | TBD |
| ~SignAvatars~ | 70k videos / 8.34M frames, SMPL-X | | *Non-commercial only: EXCLUDED from training* |

*Bilingual strategy (en-US bootstrap):*

VLibrasBD is pt-BR only. To get en-US → glosa coverage:

  1. *MT bootstrap*: pass the PT side of VLibrasBD through DeepL/Google Translate to produce ~127k en-US ↔ glosa pairs. Cost: ~$200 in DeepL API.
  2. *Cross-lingual transfer*: Gemma 4's existing multilingual capacity means a pt-BR → glosa fine-tune partially teaches en-US → glosa for free.
  3. *Paid annotation for v2 refinement*: commission ~10k high-quality en-US → glosa pairs from professional Libras interpreters fluent in English. Estimate: ~$5k (10k pairs ÷ 45 sentences/h × R$100/h ≈ R$22k).
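
The paid-annotation estimate in step 3 can be reproduced with a few lines of Python. The function name is hypothetical; the inputs (10k pairs, 45 sentences/h, R$100/h, ~$5k) are the figures quoted in the list, and the exchange rate is only what those figures imply, not a quoted rate.

```python
def annotation_cost_brl(pairs: int, sentences_per_hour: float,
                        rate_brl_per_hour: float) -> float:
    """Interpreter cost in BRL: hours needed times the hourly rate."""
    hours = pairs / sentences_per_hour
    return hours * rate_brl_per_hour


cost_brl = annotation_cost_brl(pairs=10_000, sentences_per_hour=45,
                               rate_brl_per_hour=100)
# About R$22.2k, which the estimate rounds to R$22k.

# Exchange rate implied by quoting the same work at ~$5k (derived, not sourced).
implied_brl_per_usd = cost_brl / 5_000   # about 4.44
```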

*Animation bank strategy:*

Since SignAvatars is non-commercial, the avatar's signing animations are produced through:

  1. *V-Librasil pose extraction*: process the dataset's videos with MediaPipe Holistic + retarget to our SMPL-X rig (we extract poses, we don't redistribute the videos). License-permitting.
  2. *Own motion-capture session*: commission a mocap studio session with a deaf native signer for the initial 1,000 signs (~R$30-60k including studio time, signer fee, post-production).
  3. *Hand-keyframed animation*: supplementary for rare signs, glossary expansion, edge cases.

*Active learning loop (production):*

  • Model flags high-uncertainty inputs in production
  • Flagged inputs queued in annotation pipeline
  • Libras consultants label / dispute / approve
  • Retrained periodically (monthly initially → weekly at scale)
  • Output: continuous quality improvement; corpus grows organically without large upfront annotation push
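
The loop above can be sketched as a small queue in Python. This is an illustration under assumptions: the `ReviewQueue` and `Flagged` names, the 0.6 uncertainty threshold, and the in-memory storage are hypothetical, and the "dispute" path and the periodic retraining job are left out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class Flagged:
    text: str
    glosses: List[str]
    confidence: float


class ReviewQueue:
    """Collects low-confidence translations for consultant review."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold          # assumed value, not from the RFC
        self.pending: Deque[Flagged] = deque()
        self.corpus_additions: List[Flagged] = []

    def observe(self, text: str, glosses: List[str], confidence: float) -> bool:
        """Queue the example if the model was unsure; return True if queued."""
        if confidence >= self.threshold:
            return False
        self.pending.append(Flagged(text, glosses, confidence))
        return True

    def review(self, corrected_glosses: Optional[List[str]] = None) -> Flagged:
        """Approve the oldest pending item (None) or replace its glosses."""
        item = self.pending.popleft()
        if corrected_glosses is not None:
            item = Flagged(item.text, corrected_glosses, 1.0)
        self.corpus_additions.append(item)
        return item


queue = ReviewQueue()
queue.observe("known phrase", ["GLOSS_A"], confidence=0.95)    # not queued
queue.observe("novel phrase", ["GLOSS_B?"], confidence=0.30)   # queued
labelled = queue.review(corrected_glosses=["GLOSS_B"])
```

Each reviewed item lands in `corpus_additions`, which stands in for the versioned corpus that the periodic retraining would read from.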

*Cost estimate, corpus year 1: ~$50-80k.*

| Item | Cost |
|---|---|
| VLibrasBD (CC BY-SA) | $0 |
| MT bootstrap PT→EN | ~$200 |
| Animation bank (mocap, 1,000 signs) | $5-12k |
| Initial dictionary curation (Libras consultant, 3 wks) | $3-5k |
| LoRA fine-tune (1 H100, ~10h) | $10-50 |
| Cultural review board honoraria (Q5) | $2-5k/yr |
| Paid annotation v2 (en-US 10k pairs) | $5k |
| Active learning corpus growth (year 2+) | $10-20k/yr |

4.9 Data infrastructure (built from day 1, not day 1000)

These exist starting v0 even if at small scale, so v1's ML training plugs in cleanly:

  • *Corpus*: versioned object storage with metadata (source, dialect, quality score, lang tag). pt-BR and en-US examples co-located, both targeting the same glosa.
  • *Annotation pipeline*: web tool (services/ai/signs/annotation/) for Libras consultants (deaf native signers we contract) to label, verify, dispute. Permissive of regional variants; tracks reviewer identity for academic citation.
  • *Active learning loop*: model flags high-uncertainty inputs in production; humans label; retrained periodically (initially monthly, later weekly).
  • *Eval suite*: held-out test set + regression suite + production feedback. BLEU-glosa as primary metric; human review (deaf native signer panel) as quality gate.
  • *Model registry*: every trained model versioned with corpus snapshot + hyperparameters + eval results. Rollback automatic if production metrics regress.
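
A minimal Python sketch of the model registry with automatic rollback. The rollback rule used here (roll back when the production score drops more than a tolerance below the previous version's recorded eval score) is an assumption made for the example, as are the `ModelRegistry` and `ModelVersion` names, the `lora_rank` hyperparameter, and the scores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    corpus_snapshot: str                 # identifier of the corpus version used
    hyperparameters: Dict[str, float]
    eval_score: float                    # e.g. BLEU-glosa on the held-out set


@dataclass
class ModelRegistry:
    history: List[ModelVersion] = field(default_factory=list)

    def promote(self, model: ModelVersion) -> None:
        """Make a newly trained model the active one."""
        self.history.append(model)

    @property
    def active(self) -> Optional[ModelVersion]:
        return self.history[-1] if self.history else None

    def report_production_score(self, score: float,
                                tolerance: float = 0.02) -> bool:
        """Roll back if production regressed; return True when a rollback happened."""
        if len(self.history) < 2:
            return False                 # nothing to roll back to
        previous = self.history[-2]
        if score < previous.eval_score - tolerance:
            self.history.pop()           # previous version becomes active again
            return True
        return False


registry = ModelRegistry()
registry.promote(ModelVersion("v1", "corpus-001", {"lora_rank": 16}, 0.60))
registry.promote(ModelVersion("v2", "corpus-002", {"lora_rank": 16}, 0.65))
kept = registry.report_production_score(0.64)        # within tolerance, v2 stays
rolled_back = registry.report_production_score(0.50)  # regression, back to v1
```

Storing the corpus snapshot and hyperparameters next to each score is what makes a rollback reproducible: the previous version can be retrained or audited from the recorded inputs.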

4.10 Serving infrastructure: gRPC (Q8, RESOLVED 2026-04-29)

*Decision: follow existing Koder gRPC convention.*

gRPC is already the established backend transport in the Koder Stack. Audit found:

  • `servicesfoundationidengineprotokoderid{identity,keys,session,admin}v1
