Signs RFC 001 architecture overview
signs-RFC-001: Sign-language avatar + bilingual translation across the Koder Stack
| Status | **Ready for ratification**: all technical questions resolved 2026 |
|---|---|
| Created | 2026 |
| Renamed | 2026 (libras-RFC-001 → signs-RFC-001 per owner-ratified rename services/ai/libras → services/ai/signs; the engine is generic, Libras is the first corpus) |
| Last revised | 2026 |
| Author | Koder team |
| Modules | services/ai/signs (proposed; first corpus: Libras), engines/sdk/koder_hand_kit (proposed), every Koder UI (consumer) |
| Related specs | policies/sdk-first.kmd, policies/hyperscale-first.kmd, specs/koder-app/behaviors.kmd, policies/language.kmd |
| Related projects | Koru (products/horizontal/koru) — explicitly avoided "Hand" naming so this RFC could claim it |
| Supersedes | Draft v1 (proposed Path A integration with VLibras as fallback) — withdrawn 2026 |
1. Goal
Make every Koder UI capable of presenting its content in **Brazilian Sign Language (Libras)** through a friendly, animated 3D **avatar (Hand)** that signs in real time, activated by a single Settings toggle and persisted across the user's apps via Koder ID.
The capability must:
- Accept **Brazilian Portuguese (pt-BR) and American English (en-US)** as source languages.
- Output **Libras** (only); ASL is reserved for a future version.
- Work in **mobile, desktop, TV, and web** Koder UIs, i.e., a cross-cutting feature per `policies/sdk-first.kmd`.
- Be **fully proprietary and self-hosted**: no dependency on VLibras or any third-party service (Path C).
- Be **architected for hyperscale from day one**: no throw-away v0; the architecture that ships first is the architecture that scales to v∞.
2. Why now
- Accessibility-first products attract a meaningful user base (10M+ deaf/HoH Brazilians, ~5% of population per IBGE) and align with Koder's positioning as the platform that "speaks Brazil while shipping globally".
- Public-sector buyers (vital to Edictus, ERP track) require Libras compliance; a single SDK widget makes every Koder app eligible at zero per-product effort.
- Koder products are en-US first in UI source / marketing per `policies/language.kmd`, but operate primarily in Brazil where deaf users sign Libras. **Hand serves the same UI to two source-language audiences in one architecture.**
- Koru just shipped (2026-04-29) without using "Hand" as its brand precisely so this RFC could claim it.
3. Requirements
Functional
- **F1** Single Settings toggle in any Koder UI activates Hand; persists across apps via Koder ID.
- **F2** Translate visible UI text and on-demand passages to Libras in real time.
- **F3** Render an animated 3D avatar overlay (draggable, resizable, dismissible).
- **F4** Work on mobile (Flutter Android/iOS), desktop (Flutter Linux/macOS/Windows), TV (TizenOS + WebOS), and web (Flutter Web + landing-page widget).
- **F5** Optional speech input (hold-to-speak); defers to `services/ai/voice` for ASR.
- **F6** Accept both **pt-BR** and **en-US** as source languages, with an explicit `lang` tag on the API call (auto-detection as fallback).
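A minimal sketch of how the language resolution in F6 could behave. The helper name `resolve_source_lang` and the injected `detect` callable are hypothetical; the RFC does not define either. The explicit tag always wins, and detection runs only when no tag is supplied.

```python
from typing import Callable, Optional

SUPPORTED_LANGS = ("pt-BR", "en-US")  # the two source languages named in F6


def resolve_source_lang(
    text: str,
    lang: Optional[str],
    detect: Callable[[str], str],
) -> str:
    """Return the source language: explicit tag first, detection as fallback."""
    if lang is not None:
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported source language: {lang}")
        return lang
    guessed = detect(text)  # hypothetical detector, supplied by the caller
    if guessed not in SUPPORTED_LANGS:
        raise ValueError(f"could not resolve a supported language: {guessed}")
    return guessed
```

Rejecting unsupported tags, rather than silently defaulting to one language, is an assumption of this sketch; it keeps a mislabelled request from being translated with the wrong normalization rules.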
Non-functional
- **NF1** Latency: gloss sequence starts within 300ms of the request; first gesture rendered within 500ms.
- **NF2** Bundle size: SDK widget ≤8MB on mobile (gzipped); avatar assets streamable from CDN if larger.
- **NF3** Offline graceful degradation: SDK has a small cached glossary (high-frequency phrases) so common UI labels keep signing without network.
- **NF4** Privacy: translation requests are stateless and not logged with content; only opt-in usage telemetry per `specs/errors/reporting.kmd`.
- **NF5** Compliance: Lei nº 10.436/2002 + Decreto 5.626/2005 cited in landing/legal copy; WCAG 2.2 AA where applicable.
- **NF6** **Self-hosted only.** Like every other Koder backend (Flow, AI Voice, Foundation), Koder AI Signs runs entirely on Koder infrastructure. Calls to any external endpoint are forbidden: sovereignty (no third party can take the service down) + privacy (translation requests never leave Koder infra).
- **NF7** **Hyperscale-first** per `policies/hyperscale-first.kmd`: every architectural decision optimizes the long-term system over short-term ship speed; no throw-away v0.
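The offline degradation in NF3 can be sketched as a cache-first lookup. Everything named here (`OfflineGlossary`, `translate_label`, the `GLOSS_*` placeholder tokens) is hypothetical and only illustrates the behaviour; the real SDK component would be `offline_dictionary.dart`.

```python
from typing import Callable, Optional


def normalize_key(label: str) -> str:
    """Case-fold and collapse whitespace so UI labels match cached entries."""
    return " ".join(label.casefold().split())


class OfflineGlossary:
    """Small on-device cache of high-frequency phrases (assumed shape)."""

    def __init__(self, entries: dict[str, list[str]]):
        self._entries = {normalize_key(k): v for k, v in entries.items()}

    def lookup(self, label: str) -> Optional[list[str]]:
        return self._entries.get(normalize_key(label))


def translate_label(
    label: str,
    glossary: OfflineGlossary,
    remote: Optional[Callable[[str], list[str]]] = None,
) -> Optional[list[str]]:
    """Try the cache first; call the backend only if it is reachable."""
    cached = glossary.lookup(label)
    if cached is not None:
        return cached
    if remote is None:
        return None  # offline and unknown: the caller keeps showing plain text
    return remote(label)
```

Checking the cache before the network is a design assumption: it keeps common labels within the NF1 latency budget and makes the offline path the same code path as the online one.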
4. Architecture
4.1 The pipeline (4 stages, swappable, instrumented)
[ source text + lang tag ]
│
▼
┌────────────────────────────────────┐
│ §1 Normalization │ bilingual: pt-BR | en-US
│ - spell correction │ rules per language
│ - abbreviation expansion │ ("vc"→"você", "u"→"you")
│ - sentence segmentation │
└──────────────┬─────────────────────┘
│
▼
┌────────────────────────────────────┐
│ §2 Semantic parsing │ ML — multilingual foundation model
│ - intent / entity / sentiment │ (Llama-3.x · Gemma-3 · Qwen)
│ - cross-lingual representation │ one model serves both src langs
└──────────────┬─────────────────────┘
│
▼
┌────────────────────────────────────┐
│ §3 Glosa generation (HYBRID) │ ┌──────────────────────────────┐
│ ┌─────────────────────────────┐ │ │ Confidence router │
│ │ Rule-based dictionary │◀─┼──│ · high conf → use rules │
│ │ (curated, deterministic) │ │ │ · low conf → use ML │
│ └─────────────────────────────┘ │ │ · ML hallucinates → fallback │
│ ┌─────────────────────────────┐ │ │ to fingerspelling │
│ │ ML translator │ │ └──────────────────────────────┘
│ │ (foundation + LoRA fine-tune)│ │
│ └─────────────────────────────┘ │
└──────────────┬─────────────────────┘
│ glosa sequence (intermediate language, src-agnostic)
▼
┌────────────────────────────────────┐
│ §4 Animation generation │ glosa → 3D timeline
│ - sign animation lookup │ (Hand renders client-side)
│ - inflection / classifier rules │ Libras spatial grammar
│ - timing / rhythm │
└──────────────┬─────────────────────┘
│
▼
[ Hand avatar signs ]
**Why hybrid neural-symbolic, not pure ML or pure rules:**
| Need | Best handled by | Why |
|---|---|---|
| UI labels, brand names, technical terms | Rules | Zero hallucination; auditable; ships fast |
| Free-form text, novel phrasings, semantics | ML | Generalizes; handles inputs we never enumerated |
| Hard cases where ML is uncertain | Fingerspelling fallback | Honest "I don't know"; users can still read it |
Production-grade NLP systems in specialized domains (medical, legal, sign-language) **all use hybrid**. Single-paradigm systems either don't scale (rules-only) or hallucinate dangerously (ML-only).
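The routing described in the table above can be sketched as follows. This is an illustration under stated assumptions, not the service's implementation: the `RULES` entries, the `GLOSS_*` and `FS:` token formats, the `route` function, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical curated dictionary: normalized phrase -> placeholder glosa tokens.
RULES: dict[str, list[str]] = {
    "settings": ["GLOSS_SETTINGS"],
    "save": ["GLOSS_SAVE"],
}

# The ML translator is injected; it returns (tokens, confidence in [0, 1]).
MlTranslator = Callable[[str], tuple[list[str], float]]


@dataclass(frozen=True)
class GlosaResult:
    tokens: list[str]
    route: str  # "rules", "ml", or "fingerspell"


def fingerspell(text: str) -> list[str]:
    """Spell the text out, one token per alphanumeric character."""
    return [f"FS:{ch.upper()}" for ch in text if ch.isalnum()]


def route(text: str, ml: MlTranslator, min_confidence: float = 0.8) -> GlosaResult:
    key = text.strip().lower()
    if key in RULES:  # curated entry: deterministic, no hallucination risk
        return GlosaResult(RULES[key], "rules")
    tokens, confidence = ml(key)
    if tokens and confidence >= min_confidence:
        return GlosaResult(tokens, "ml")
    # The model is unsure: fall back to an honest letter-by-letter rendering.
    return GlosaResult(fingerspell(text), "fingerspell")
```

A brand name that neither layer knows, such as "Koru", ends up fingerspelled rather than guessed, which is the "honest I don't know" behaviour from the table.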
4.2 Foundation model strategy (Q3: RESOLVED 2026-04-29)
We do **not** train from scratch; we **fine-tune** a multilingual foundation model that already understands pt-BR and en-US.
- **Why not from scratch:** training a 7B-param model on PT+EN+Libras from zero would cost ~$500k-2M and 6+ months. Fine-tuning gives us 95% of the result for ~$5-20k and weeks.
Decision: Gemma 4 + LoRA via Unsloth
**Selection rationale:**
| Model | License | PT-BR | EN-US | Size for LoRA | Decision |
|---|---|---|---|---|---|
| **Gemma 4** | **Apache 2.0** | Strong (100+ langs trained) | Strong | E2B 8–10GB VRAM · E4B 17GB | **CHOSEN** |
| Llama 4 | Meta license (commercial OK <700M MAU) | Strong (PT one of 8 official) | Strong | varies | Fallback if Gemma 4 underperforms PT-BR↔Libras in v1 spike |
| Qwen 3.5 | Apache 2.0 | Decent (Asia-focused training) | Strong | varies | 201 langs supported but PT-BR quality less proven |
| ~~Sabiá-7b (Maritaca)~~ | Research-only | Excellent | Weak | - | **REJECTED**: not commercial-friendly |
| ~~Sabiá~~ | Commercial API only | Excellent | n/a | - | **REJECTED**: violates NF6 (self-hosted only) |
**Why Gemma 4:**
- **Apache 2.0** in 2026 (changed from Gemma-license): cleanest legal posture, no clauses to track, no MAU thresholds.
- **100+ training languages, 30+ first-class**: both pt-BR and en-US natively strong.
- **E4B size (~4B effective params)**: sweet spot for a specialized translation task; fits commodity GPU; inference latency aligned with NF1 (<300ms).
- **Active Unsloth pipeline**: official LoRA documentation; ~$10-16 + 8-12h on a single H100 produces a 50-200MB LoRA adapter that merges with the base model.
- **Cross-lingual transfer**: a single fine-tune handles both pt-BR and en-US inputs because the base model understands both natively. A pt-BR training example partially teaches the en-US equivalent, which acts as a corpus-efficiency multiplier.
**Architectural commitment vs. specific model:** the architectural decision (foundation model + LoRA) is stable; the specific base (Gemma 4 today) is swappable. Re-evaluate at each major release of Gemma / Llama / Qwen; fine-tune effort to migrate is days, not months.
**v1 spike before locking:** train a small LoRA on a 5k-pair pt-BR↔glosa subset, evaluate human-rated quality vs. baseline. If Gemma 4 underperforms Llama 4 by >15% on the eval set, switch to Llama 4 (license trade-off acceptable given we're <700M MAU by orders of magnitude).
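The spike's switch criterion can be written down as a small decision function. This sketch assumes the 15% is a relative gap measured against Llama 4's score on the same human-rated scale; the RFC does not say whether the gap is relative or absolute, and the function name and return labels are hypothetical.

```python
def choose_base_model(
    gemma_score: float,
    llama_score: float,
    max_relative_gap: float = 0.15,
) -> str:
    """Pick the base model from two eval scores on the same scale.

    Assumption: "underperforms by >15%" means a relative gap measured
    against the Llama 4 score.
    """
    if llama_score <= 0:
        return "gemma-4"  # no usable Llama baseline: keep the default choice
    relative_gap = (llama_score - gemma_score) / llama_score
    return "llama-4" if relative_gap > max_relative_gap else "gemma-4"
```

For example, scores of 4.0 (Gemma) against 4.2 (Llama) keep Gemma 4, because the gap is under 5%; scores of 3.0 against 4.0 trigger the switch, because the gap is 25%.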
4.3 Boundaries — modules to create
services/ai/signs/ ← backend, owns the 4-stage pipeline
├── backend/ (Go HTTP/gRPC server: /v1/translate, /v1/speech)
├── stages/
│ ├── normalize/ (Go — language-specific rules, deterministic)
│ ├── semantic/ (Python — calls foundation model via gRPC)
│ ├── glosa/ (Python — hybrid: rule lookup + ML inference)
│ └── animate/ (Go — glosa → keyframes; run server-side OR client)
├── corpus/ (versioned PT+EN+glosa parallel corpus)
├── models/ (model registry: weights + version + eval metrics)
├── eval/ (held-out test set + regression suite)
├── annotation/ (web tool for Libras consultants to label/dispute)
├── docker-compose.yaml
└── README.md
engines/sdk/koder_hand_kit/ ← Flutter SDK package
├── lib/
│ ├── koder_hand_kit.dart (public API)
│ └── src/
│ ├── hand_avatar.dart (3D avatar widget — picks backend by platform)
│ ├── hand_overlay.dart (draggable floating window)
│ ├── hand_button.dart ("Ver em Libras" / "Watch in Libras")
│ ├── hand_gate.dart (auto-listen to UI text via accessibility tree)
│ ├── gloss_player.dart (consumes gloss sequence → animation timeline)
│ ├── offline_dictionary.dart (cached high-frequency entries for NF3)
│ ├── face_overlay.dart (Rive-based facial NMM blendshapes)
│ └── backends/
│ ├── filament_backend.dart (mobile + desktop via platform channel)
│ └── modelviewer_backend.dart (web + TV via <model-viewer>)
├── android/ (Kotlin glue: Filament SurfaceView + glTF loader)
├── ios/ (Swift glue: Filament + Metal)
├── linux/macos/windows/ (C++ glue: Filament desktop)
├── assets/ (Hand glTF master + SMPL-X rig + Rive face)
└── docs/
products/horizontal/hand/ ← brand surface
├── README.md, koder.toml, icon.svg
├── landing/index.html (https://hand.koder.dev)
└── (no app/; Hand is embedded inside other apps via koder_hand_kit)
4.4 Separation of translation and rendering
Hard architectural boundary: the **translator** produces **glosa** (intermediate language), the **renderer** (Hand SDK) consumes glosa.
Why: gives us four levers we can pull independently:
| Change | Affects | Doesn't affect |
|---|---|---|
| Better translation model | `services/ai/signs` backend | Hand SDK, animations, apps |
| Better avatar 3D model | Hand SDK assets | Backend, translation quality |
| New input modality (voice) | Stage 1 normalization | Stages 2-4, SDK |
| New output modality (recorded video for non-Flutter integrations) | Renderer parallel to Hand | Backend, glosa |
This is the same pattern that lets Whisper improvements not affect Spotify clients.
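One way to picture the boundary is as a small, versioned wire format that is the only thing both sides share. The field names below (`schema_version`, `gloss`, `fingerspelled`) are hypothetical; the RFC does not specify the glosa schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GlosaToken:
    gloss: str                   # placeholder identifier for one sign
    fingerspelled: bool = False  # True when the token must be spelled out


@dataclass(frozen=True)
class GlosaSequence:
    schema_version: int
    tokens: tuple[GlosaToken, ...]

    def to_json(self) -> str:
        """Translator side: serialize the sequence for the renderer."""
        return json.dumps(asdict(self), ensure_ascii=False)

    @staticmethod
    def from_json(payload: str) -> "GlosaSequence":
        """Renderer side: rebuild the sequence without any backend code."""
        data = json.loads(payload)
        tokens = tuple(GlosaToken(**t) for t in data["tokens"])
        return GlosaSequence(data["schema_version"], tokens)
```

Because the payload carries no source-language or model detail, a new translation model or a new avatar asset can ship without the other side changing, as long as the schema version is respected.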
4.5 Naming
- **`services/ai/signs`**: descriptive, follows AI service convention (voice, recsys, runtime, bot). Technical reference name.
- **`engines/sdk/koder_hand_kit`**: SDK package name; mirrors the `koder_kit` family.
- **Hand**: public brand of the avatar (visible label "Avatar Hand" / "Hand — sign-language avatar"). Brand score 83 (Great), confirmed in earlier scoping. Visual identity is hands signing; the semantics exactly match the function.
- Settings labels:
  - en-US: "Sign-language avatar (Hand)"
  - pt-BR: "Avatar de Libras (Hand)"
Two-name pattern (engine + brand) follows the established `engines/kodec` (technical) ↔ Play/Tune (consumer-facing) pattern.
4.6 Scope ladder
The architecture exists in full from v0. What changes between versions is **the maturity and coverage of each stage**, not the architecture itself. **No throw-aways.**
| Version | Stage 1 (norm) | Stage 2 (semantic) | Stage 3 (glosa) | Stage 4 (animate) | Inputs | Surfaces |
|---|---|---|---|---|---|---|
| **v0** | pt-… | Foundation model loaded but used only as a "confidence checker" | **Rule-…** | ~1000 pre-… | Text only | Mobile (Android first) |
| **v1** | Full normalization (regional spelling, idioms) | **ML active**: fine-tuned foundation model | Hybrid: rules for known concepts, ML for the rest, confidence-routed | Sign bank expanded to ~5000 + classifier rules + spatial grammar | Text + speech (`services/ai/voice` integration) | Mobile + Desktop + Web |
| **v2** | Multi-… | Larger model, RLHF from production feedback | ML dominant + ensemble + active learning loop in production | Bank of 10k+ signs + regional variants (Northeast / Southeast Libras) | Text + speech + camera (sign → text reverse) | All Koder UIs (incl. TV) |
| **vN (reserved)** | - | - | - | - | - | **ASL** (American Sign Language) as a separate engine; Koder would internationalize for the US deaf market. Same pipeline, different glosa target language, different sign animation bank. |
Critically: the rule-based dictionary is **not** a v0-only crutch. It remains in production forever as the high-confidence layer. The same applies to fingerspelling: never deprecated, always available for proper nouns and unknown terms.
4.7 Avatar tech stack (Q2: RESOLVED 2026-04-29)
Hand is rendered by a **per-platform best-of-class engine**, all consuming the **same glTF asset with SMPL-X rig**. No single engine is forced to be sub-optimal everywhere; one asset workflow serves all surfaces.
Animator workflow Rendering per platform
───────────────── ──────────────────────
Blender / Maya
│
├─ skeletal animation ┌──────────────────────────┐
│ on SMPL-X rig │ Mobile (Android, iOS) │
│ │ → Filament (Google) │
│ │ via platform channel│
└─ export glTF + GLB │ ~6MB native lib │
(single source of truth) ├──────────────────────────┤
│ │ Desktop (Lin/Mac/Win) │
│ │ → Filament native │
│ ├──────────────────────────┤
├─→ runtime ─→ │ Web (Flutter Web) │
│ │ → <model-viewer> (Google)│
│ │ built on three.js │
│ ├──────────────────────────┤
│ │ TV (TizenOS, WebOS) │
│ │ → <model-viewer> │
│ │ (TVs are web-first) │
│ └──────────────────────────┘
│
└─→ Face overlay
→ Rive (excellent at facial blendshapes,
essential for Libras non-manual markers
which are part of the grammar, not
decoration)
**Why this stack and not a single engine:**
| Need | Filament | flutter_scene | Unity | Rive 3D | model-viewer |
|---|---|---|---|---|---|
| glTF + SMPL-X support | ✅ | ✅ | ✅ | ❌ | ✅ |
| Production-mature | ✅ | ⚠️ preview | ✅ | ⚠️ 3D new | ✅ web only |
| Bundle ≤ NF2 (8MB mobile) | ✅ ~6MB | ✅ ~2MB | ❌ +25MB | ✅ ~1MB | n/a (web) |
| Hand-finger precision | ✅ | ✅ | ✅ | ❌ stylized | ✅ |
| Cross-platform | needs glue | yes (preview) | yes (heavy) | yes | web only |
| Used in Google production | ✅ | ⚠️ preview | n/a | n/a | ✅ |
Filament is the production-mature backend (Sceneform, Google Earth, etc.). flutter_scene will likely match it once it leaves preview; at that point we re-evaluate and possibly migrate mobile/desktop. The glTF asset format and SMPL-X rig are forward-compatible with that migration; only the rendering glue changes.
**Why SMPL-X specifically:**
SMPL-X is the de facto rig for sign-language research, used by Hand Talk's pipeline, SignAvatars (ECCV 2024), and most academic papers in the space. Adopting SMPL-X aligns with the format ecosystem (Blender/Maya export, glTF compatibility, MediaPipe pose retargeting). Note: the **SignAvatars dataset itself is non-commercial** and excluded from our training corpus (see §4.8 Corpus); we use SMPL-X as a **format/rig** independent of any particular dataset.
**Why Rive for face overlay (and only the face):**
Libras non-manual markers (eyebrow position, mouth shape, eye gaze direction, head tilt) are **grammatical**, not decorative. Filament can do facial blendshapes, but Rive's facial-animation tooling and editor are markedly better and the runtime is tiny (~700KB). The compromise: the face is rendered as a Rive surface composited over the Filament 3D body, sharing timing and parameters via the Hand SDK. Single avatar visually, two engines internally.
**Future migration target:** when flutter_scene ships stable, re-evaluate replacing Filament with it. The same glTF assets work directly; only the platform-channel glue is replaced.
4.8 Corpus strategy (Q4: RESOLVED 2026-04-29)
The Brazilian Libras corpus landscape is rich (UFPB, UFSC, UFPel, UFPE all have datasets). Our strategy **builds a primary corpus from VLibrasBD (CC BY-SA 4.0, commercial-permitted)** and supplements it with bootstrap MT for en-US plus paid annotation for v2 quality refinement.
**Primary corpus: VLibrasBD (Mendeley Data, CC BY-SA 4.0):**
- **127,349 aligned pt-BR ↔ Libras-glosa sentence pairs**, built by 10 Libras interpreters
- ~72k general-purpose sentences + ~55k from Brazilian federal government content (gov.br services)
- DOI: 10.17632/ryj88ckjww — open-download from Mendeley Data
- **License: CC BY-SA 4.0**: commercial use permitted; the **share-alike clause** applies to derivative works (interpreted conservatively, our trained models inherit CC BY-SA)
**Acceptance of share-alike:** open-weight models align with Koder's self-hosted ethos. Our value proposition is the integrated stack (Hand SDK + apps + Koder ID + UX), not the model weights themselves. The trade is acceptable vs. the ~$50-100k cost of annotating 127k pairs from scratch.
**Supplementary corpora (license verification pending before training):**
| Dataset | Pairs / Records | Use | License status |
|---|---|---|---|
| Libras-UFPel Corpus | 4,800 audiovisual recordings (2,400 sentences + 2,400 isolated signs) | Refinement / regional Pelotense variant | TBD |
| V-Librasil (UFPE) | 4,089 sign videos / 1,364 terms / 3 articulators | **Pose extraction** for animation bank (we extract poses; we don't redistribute videos) | TBD (IEEE Dataport) |
| LSWH100 | 144k synthetic handshape images / 100 classes | v2+ sign→text reverse (handshape recognition) | TBD |
| LIBRAS-UFOP | 56 signs Kinect RGB-D + skeleton | Supplementary pose | TBD |
| ~~SignAvatars~~ | 70k videos / 8.34M frames SMPL-X | - | **Non-commercial only: EXCLUDED from training** |
**Bilingual strategy (en-US bootstrap):**
VLibrasBD is pt-BR only. To get en-US → glosa coverage:
- **MT bootstrap**: pass the PT side of VLibrasBD through DeepL/Google Translate to produce ~127k en-US ↔ glosa pairs. Cost: ~$200 in DeepL API.
- **Cross-lingual transfer**: Gemma 4's existing multilingual capacity means a pt-BR → glosa fine-tune partially teaches en-US → glosa for free.
- **Paid annotation for v2 refinement**: commission ~10k high-quality en-US → glosa pairs from professional Libras interpreters fluent in English. Estimate: ~$5k (10k pairs ÷ 45 sentences/h × R$100/h ≈ R$22k).
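The paid-annotation estimate above is simple enough to check directly. The function name is hypothetical; the three inputs are the figures quoted in the bullet.

```python
def annotation_cost_brl(
    pairs: int,
    sentences_per_hour: float,
    rate_brl_per_hour: float,
) -> float:
    """Interpreter cost in BRL: hours of work times the hourly rate."""
    hours = pairs / sentences_per_hour
    return hours * rate_brl_per_hour


cost = annotation_cost_brl(pairs=10_000, sentences_per_hour=45, rate_brl_per_hour=100)
print(round(cost))  # 22222, i.e. the "≈ R$22k" quoted above
```

The figure scales linearly with throughput: if interpreters average 30 sentences per hour instead of 45, the same 10k pairs cost about R$33k, so the sentences-per-hour assumption is the one worth validating first.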
**Animation bank strategy:**
Since SignAvatars is non-commercial, the avatar's signing animations are produced through:
- **V-Librasil pose extraction**: process the dataset's videos with MediaPipe Holistic + retarget to our SMPL-X rig (we extract poses, we don't redistribute the videos). License-permitting.
- **Own motion-capture session**: commission a mocap studio session with a deaf native signer for the initial 1,000 signs (~R$30-60k including studio time, signer fee, post-production).
- **Hand-keyframed animation**: supplementary for rare signs, glossary expansion, edge cases.
**Active learning loop (production):**
- Model flags high-uncertainty inputs in production
- Flagged inputs queued in annotation pipeline
- Libras consultants label / dispute / approve
- Retrained periodically (monthly initially → weekly at scale)
- Output: continuous quality improvement; corpus grows organically without large upfront annotation push
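The first two steps of the loop above (flag, then queue) might look like the sketch below. The class names, the 0.6 threshold, and the `opted_in` flag are assumptions; the opt-in gate is included because NF4 forbids retaining request content by default.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedInput:
    text: str
    lang: str
    confidence: float


class AnnotationQueue:
    """Holds low-confidence inputs until a Libras consultant reviews them."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self._pending: deque[FlaggedInput] = deque()

    def observe(self, text: str, lang: str, confidence: float, opted_in: bool) -> bool:
        """Queue the input only if the model was unsure AND the user opted in."""
        if not opted_in or confidence >= self.threshold:
            return False
        self._pending.append(FlaggedInput(text, lang, confidence))
        return True

    def next_batch(self, size: int) -> list[FlaggedInput]:
        """Hand the oldest flagged inputs to the annotation tool."""
        batch: list[FlaggedInput] = []
        while self._pending and len(batch) < size:
            batch.append(self._pending.popleft())
        return batch
```

Oldest-first batching is a placeholder policy; ordering by lowest confidence or by frequency would be equally plausible and is left open here.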
**Cost estimate, corpus year 1: ~$50-80k.**
| Item | Cost |
|---|---|
| VLibrasBD (CC BY-SA) | $0 |
| MT bootstrap PT→EN | ~$200 |
| Animation bank (mocap, 1,000 signs) | $5-12k |
| Initial dictionary curation (Libras consultant, 3 wks) | $3-5k |
| LoRA fine-tune (1 H100, ~10h) | $10-50 |
| Cultural review board honoraria (Q5) | $2-5k/yr |
| Paid annotation v2 (en-US 10k pairs) | $5k |
| Active learning corpus growth (year 2+) | $10-20k/yr |
4.9 Data infrastructure (built from day 1, not day 1000)
These exist starting v0 even if at small scale, so v1's ML training plugs in cleanly:
- **Corpus**: versioned object storage with metadata (source, dialect, quality score, lang tag). pt-BR and en-US examples co-located, both targeting the same glosa.
- **Annotation pipeline**: web tool (`services/ai/signs/annotation/`) for Libras consultants (deaf native signers we contract) to label, verify, dispute. Permissive of regional variants; tracks reviewer identity for academic citation.
- **Active learning loop**: model flags high-uncertainty inputs in production; humans label; retrained periodically (initially monthly, later weekly).
- **Eval suite**: held-out test set + regression suite + production feedback. BLEU-glosa as primary metric; human review (deaf native signer panel) as quality gate.
- **Model registry**: every trained model versioned with corpus snapshot + hyperparameters + eval results. Rollback automatic if production metrics regress.
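The registry's automatic rollback could be sketched like this. The class and field names and the 0.02 regression tolerance are hypothetical; the RFC states only that every model is versioned with its corpus snapshot and eval results, and that rollback is automatic on regression.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    corpus_snapshot: str  # identifier of the corpus the model was trained on
    bleu_glosa: float     # held-out eval score recorded at registration time


class ModelRegistry:
    def __init__(self, max_regression: float = 0.02):
        self.max_regression = max_regression
        self._history: list[ModelVersion] = []

    @property
    def active(self) -> Optional[ModelVersion]:
        return self._history[-1] if self._history else None

    def promote(self, model: ModelVersion) -> None:
        """Register a newly trained model and make it the serving version."""
        self._history.append(model)

    def report_production_score(self, score: float) -> ModelVersion:
        """Roll back one version if the live score regresses past tolerance."""
        current = self.active
        if current is None:
            raise RuntimeError("no active model")
        regressed = current.bleu_glosa - score > self.max_regression
        if regressed and len(self._history) > 1:
            self._history.pop()  # drop the regressing version
        return self._history[-1]
```

Comparing the live score against the score recorded at registration is one possible trigger; comparing against the previous version's production score would be another, and the RFC leaves the choice open.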
4.10 Serving infrastructure: gRPC (Q8: RESOLVED 2026-04-29)
**Decision: follow existing Koder gRPC convention.**
gRPC is already the established backend transport in the Koder Stack. Audit found:
- `servicesfoundationidengineprotokoderid{identity,keys,session,admin}v1