Signs RFC 001 architecture overview

signs-RFC-001: Sign-language avatar + bilingual translation across the Koder Stack

Status	Ready for ratification: all technical questions resolved 20260429 (Q1-Q4 + Q8-Q10). Q5-Q7 are human-process work running in parallel.
Created	20260429
Renamed	20260523 (`libras-RFC-001` → `signs-RFC-001` per owner-ratified rename `services/ai/libras` → `services/ai/signs`; the engine is generic, Libras is the first corpus)
Last revised	20260523
Author	Koder team
Modules	`services/ai/signs` (proposed; first corpus: Libras), `engines/sdk/koder_hand_kit` (proposed), every Koder UI (consumer)
Related specs	`policies/sdk-first.kmd`, `policies/hyperscale-first.kmd`, `specs/koder-app/behaviors.kmd`, `policies/language.kmd`
Related projects	Koru (`products/horizontal/koru`) — explicitly avoided "Hand" naming so this RFC could claim it
Supersedes	Draft v1 (proposed Path A integration with VLibras as fallback) — withdrawn 20260429 in favor of fully proprietary stack

1. Goal

Make every Koder UI capable of presenting its content in *Brazilian Sign Language (Libras)* through a friendly, animated 3D *avatar (Hand)* that signs in real time, activated by a single Settings toggle and persisted across the user's apps via Koder ID.

The capability must:

Accept *Brazilian Portuguese (pt-BR) and American English (en-US)* as source languages.
Output *Libras* (only); ASL is reserved for a future version.
Work in *mobile, desktop, TV, and web* Koder UIs, i.e., a cross-cutting feature per `policies/sdk-first.kmd`.
Be *fully proprietary and self-hosted*: no dependency on VLibras or any third-party service (Path C).
Be *architected for hyperscale from day one*: no throw-away v0; the architecture that ships first is the architecture that scales to v∞.

2. Why now

Accessibility-first products attract a meaningful user base (10M+ deaf/HoH Brazilians, ~5% of population per IBGE) and align with Koder's positioning as the platform that "speaks Brazil while shipping globally".
Public-sector buyers (vital to Edictus, ERP track) require Libras compliance; a single SDK widget makes every Koder app eligible at zero per-product effort.
Koder products are en-US first in UI source / marketing per policies/language.kmd, but operate primarily in Brazil where deaf users sign Libras. *Hand serves the same UI to two source-language audiences in one architecture.*
Koru just shipped (20260429) without using "Hand" as its brand precisely so this RFC could claim it.

3. Requirements

Functional

*F1* Single Settings toggle in any Koder UI activates Hand; persists across apps via Koder ID.
*F2* Translate visible UI text and on-demand passages to Libras in real time.
*F3* Render an animated 3D avatar overlay (draggable, resizable, dismissible).
*F4* Work on mobile (Flutter Android/iOS), desktop (Flutter Linux/macOS/Windows), TV (TizenOS+WebOS), and web (Flutter Web + landing-page widget).
*F5* Optional speech input (hold-to-speak); defers to services/ai/voice for ASR.
*F6* Accept both *pt-BR* and *en-US* as source languages, with explicit lang tag on the API call (auto-detection as fallback).
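The language-tag requirement (explicit tag, auto-detection as fallback) can be sketched as a small resolver. This is only an illustration under stated assumptions: `resolve_source_lang`, the `detect` callback, and the request field names are hypothetical; only the `/v1/translate` route and the two language tags come from this RFC.

```python
from typing import Callable, Optional

# The two source languages this RFC accepts.
SUPPORTED_SOURCE_LANGS = ("pt-BR", "en-US")


def resolve_source_lang(
    text: str,
    lang_tag: Optional[str] = None,
    detect: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the source language: explicit tag first, detection as fallback."""
    if lang_tag is not None:
        if lang_tag not in SUPPORTED_SOURCE_LANGS:
            raise ValueError(f"unsupported source language: {lang_tag}")
        return lang_tag
    if detect is not None:
        guess = detect(text)  # hypothetical detector supplied by the caller
        if guess in SUPPORTED_SOURCE_LANGS:
            return guess
    raise ValueError("source language missing and could not be detected")


def build_translate_request(
    text: str,
    lang_tag: Optional[str] = None,
    detect: Optional[Callable[[str], str]] = None,
) -> dict:
    """Hypothetical body for POST /v1/translate; field names are assumptions."""
    return {
        "text": text,
        "source_lang": resolve_source_lang(text, lang_tag, detect),
        "target": "libras",
    }
```

Rejecting an unsupported explicit tag, rather than silently re-detecting, keeps the caller's intent auditable.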

Non-functional

*NF1* Latency: gloss-sequence start within 300ms after request; first gesture rendered within 500ms.
*NF2* Bundle size: SDK widget ≤8MB on mobile (gzipped); avatar assets streamable from CDN if larger.
*NF3* Offline graceful degradation: SDK has a small cached glossary (high-frequency phrases) so common UI labels keep signing without network.
*NF4* Privacy: translation requests are stateless and not logged with content; only opt-in usage telemetry per specs/errors/reporting.kmd.
*NF5* Compliance: Lei nº 10.436/2002 + Decreto 5.626/2005 cited in landing/legal copy; WCAG 2.2 AA where applicable.
*NF6* *Self-hosted only.* Like every other Koder backend (Flow, AI Voice, Foundation), Koder AI Signs runs entirely on Koder infrastructure. Calls to any external endpoint are forbidden, for sovereignty (no third party can take the service down) and privacy (translation requests never leave Koder infra).
*NF7* *Hyperscale-first* per `policies/hyperscale-first.kmd`: every architectural decision optimizes the long-term system over short-term ship speed; no throw-away v0.
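The offline requirement (NF3) amounts to a local lookup that must tolerate small differences in how a UI label is written. A minimal sketch, assuming a hypothetical `OfflineGlossary` class and made-up glosa entries; the real SDK component (`offline_dictionary.dart`) would be written in Dart and may look quite different.

```python
import unicodedata
from typing import Optional


def normalize_key(label: str) -> str:
    """Case-fold, Unicode-normalize, and collapse whitespace in a UI label."""
    text = unicodedata.normalize("NFKC", label).casefold()
    return " ".join(text.split())


class OfflineGlossary:
    """Cached high-frequency entries, keyed by (lang tag, normalized label)."""

    def __init__(self, entries: dict[tuple[str, str], list[str]]) -> None:
        self._entries = {
            (lang, normalize_key(label)): glosas
            for (lang, label), glosas in entries.items()
        }

    def lookup(self, lang: str, label: str) -> Optional[list[str]]:
        """Return the cached glosa sequence, or None if the label is unknown."""
        return self._entries.get((lang, normalize_key(label)))


# Illustrative entries only; the glosa values are placeholders.
glossary = OfflineGlossary({
    ("pt-BR", "Salvar"): ["SALVAR"],
    ("en-US", "Save"): ["SALVAR"],
})
```

Both source languages map to the same glosa, so the cached table stays source-agnostic on the output side. A `None` result is the signal to call the backend when online.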

4. Architecture

4.1 The pipeline (4 stages, swappable, instrumented)

[ source text + lang tag ]
    │
    ▼
┌────────────────────────────────────┐
│ §1 Normalization                   │  bilingual: pt-BR | en-US
│   - spell correction               │  rules per language
│   - abbreviation expansion         │  ("vc"→"você", "u"→"you")
│   - sentence segmentation          │
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ §2 Semantic parsing                │  ML — multilingual foundation model
│   - intent / entity / sentiment    │  (Llama-3.x · Gemma-3 · Qwen)
│   - cross-lingual representation   │  one model serves both src langs
└──────────────┬─────────────────────┘
               │
               ▼
┌────────────────────────────────────┐
│ §3 Glosa generation (HYBRID)       │  ┌──────────────────────────────┐
│   ┌─────────────────────────────┐  │  │ Confidence router            │
│   │ Rule-based dictionary       │◀─┼──│ · high conf → use rules      │
│   │ (curated, deterministic)    │  │  │ · low conf  → use ML         │
│   └─────────────────────────────┘  │  │ · ML hallucinates → fallback │
│   ┌─────────────────────────────┐  │  │   to fingerspelling          │
│   │ ML translator               │  │  └──────────────────────────────┘
│   │ (foundation + LoRA fine-tune)│  │
│   └─────────────────────────────┘  │
└──────────────┬─────────────────────┘
               │  glosa sequence (intermediate language, src-agnostic)
               ▼
┌────────────────────────────────────┐
│ §4 Animation generation            │  glosa → 3D timeline
│   - sign animation lookup          │  (Hand renders client-side)
│   - inflection / classifier rules  │  Libras spatial grammar
│   - timing / rhythm                │
└──────────────┬─────────────────────┘
               │
               ▼
[ Hand avatar signs ]

*Why hybrid neural-symbolic, not pure ML or pure rules:*

Need	Best handled by	Why
UI labels, brand names, technical terms	Rules	Zero hallucination; auditable; ships fast
Free-form text, novel phrasings, semantics	ML	Generalizes; handles inputs we never enumerated
Hard cases where ML is uncertain	Fingerspelling fallback	Honest "I don't know"; users can still read it

Production-grade NLP in specialized domains (medical, legal, sign-language) *all use hybrid*. Single-paradigm systems either don't scale (rules-only) or hallucinate dangerously (ML-only).
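The confidence router in the stage 3 diagram can be sketched as below. The thresholds, the function names, and the hyphenated fingerspelling notation are assumptions made for illustration; the RFC fixes only the routing order (rules, then ML, then fingerspelling).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical thresholds; this RFC does not specify numeric values.
RULE_CONFIDENCE_MIN = 0.9
ML_CONFIDENCE_MIN = 0.6

Scored = tuple[list[str], float]  # (glosa sequence, confidence in [0, 1])


@dataclass
class GlosaResult:
    glosas: list[str]
    source: str  # "rules", "ml", or "fingerspelling"


def fingerspell(text: str) -> list[str]:
    """Spell each word letter by letter, e.g. 'Koder' -> ['K-O-D-E-R']."""
    return ["-".join(word.upper()) for word in text.split()]


def route(
    text: str,
    rule_lookup: Callable[[str], Optional[Scored]],
    ml_translate: Callable[[str], Scored],
) -> GlosaResult:
    """Prefer curated rules, then ML, then fall back to fingerspelling."""
    hit = rule_lookup(text)
    if hit is not None and hit[1] >= RULE_CONFIDENCE_MIN:
        return GlosaResult(hit[0], "rules")
    glosas, confidence = ml_translate(text)
    if confidence >= ML_CONFIDENCE_MIN:
        return GlosaResult(glosas, "ml")
    # Low ML confidence is treated as a possible hallucination.
    return GlosaResult(fingerspell(text), "fingerspelling")
```

Because the fallback never depends on the model, the router always returns something a user can read, even when both upstream paths are uncertain.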

4.2 Foundation model strategy (Q3 — RESOLVED 20260429)

We do *not* train from scratch; we *fine-tune* a multilingual foundation model that already understands pt-BR and en-US.

*Why not from scratch:* training a 7B-param model on PT+EN+Libras from zero would cost ~$500k-2M and 6+ months. Fine-tuning gives us 95% of the result for ~$5-20k and weeks.

Decision: Gemma 4 + LoRA via Unsloth

*Selection rationale:*

Model	License	PT-BR	EN-US	Size for LoRA	Decision
Gemma 4	Apache 2.0	Strong (100+ langs trained)	Strong	E2B 8–10GB VRAM · E4B 17GB	CHOSEN
Llama 4	Meta license (commercial OK <700M MAU)	Strong (PT one of 8 official)	Strong	varies	Fallback if Gemma 4 underperforms PT-BR↔Libras in v1 spike
Qwen 3.5	Apache 2.0	Decent (Asia-focused training)	Strong	varies	201 langs supported but PT-BR quality less proven
~Sabiá-7b (Maritaca)~	Research-only	Excellent	Weak	-	REJECTED: not commercial-friendly
~Sabiá-2 / Sabiá-3~	Commercial API only	Excellent	n/a	-	REJECTED: violates NF6 (self-hosted only)

*Why Gemma 4:*

*Apache 2.0* in 2026 (changed from the Gemma license): cleanest legal posture, no clauses to track, no MAU thresholds.
*100+ training languages, 30+ first-class*: both pt-BR and en-US natively strong.
*E4B size (~4B effective params)*: sweet spot for a specialized translation task; fits commodity GPU; inference latency aligned with NF1 (≤300ms).
*Active Unsloth pipeline*: official LoRA documentation; ~$10-16 + 8-12h on a single H100 produces a 50-200MB LoRA adapter that merges with the base model.
*Cross-lingual transfer*: a single fine-tune handles both pt-BR and en-US inputs because the base model understands both natively. A pt-BR training example partially teaches the en-US equivalent, which acts as a corpus efficiency multiplier.

*Architectural commitment vs. specific model:* the architectural decision (foundation model + LoRA) is stable; the specific base (Gemma 4 today) is swappable. Re-evaluate at each major release of Gemma / Llama / Qwen; fine-tune effort to migrate is days, not months.

*v1 spike before locking:* train a small LoRA on a 5k-pair pt-BR↔glosa subset, evaluate human-rated quality vs. baseline. If Gemma 4 underperforms Llama 4 by >15% on the eval set, switch to Llama 4 (license tradeoff acceptable given we're <700M MAU by orders of magnitude).
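The spike's switch rule can be written down as a single comparison. The sketch below assumes the 15% gap is relative to Llama 4's score and that higher scores are better; the RFC does not say whether the gap is relative or absolute, and `choose_base_model` is a hypothetical helper.

```python
def choose_base_model(
    gemma_score: float,
    llama_score: float,
    max_relative_gap: float = 0.15,
) -> str:
    """Pick the base model from human-rated eval scores (higher is better).

    Assumption: "underperforms by >15%" means a relative gap measured
    against the Llama 4 score.
    """
    if llama_score <= 0:
        raise ValueError("llama_score must be positive")
    relative_gap = (llama_score - gemma_score) / llama_score
    return "llama-4" if relative_gap > max_relative_gap else "gemma-4"
```

For example, scores of 4.0 (Gemma 4) and 4.2 (Llama 4) give a gap of about 4.8%, so Gemma 4 stays. Scores of 3.0 and 4.0 give a 25% gap and trigger the switch.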

4.3 Boundaries — modules to create

services/ai/signs/                       ← backend, owns the 4-stage pipeline
  ├── backend/         (Go HTTP/gRPC server: /v1/translate, /v1/speech)
  ├── stages/
  │   ├── normalize/   (Go — language-specific rules, deterministic)
  │   ├── semantic/    (Python — calls foundation model via gRPC)
  │   ├── glosa/       (Python — hybrid: rule lookup + ML inference)
  │   └── animate/     (Go — glosa → keyframes; run server-side OR client)
  ├── corpus/          (versioned PT+EN+glosa parallel corpus)
  ├── models/          (model registry: weights + version + eval metrics)
  ├── eval/            (held-out test set + regression suite)
  ├── annotation/      (web tool for Libras consultants to label/dispute)
  ├── docker-compose.yaml
  └── README.md

engines/sdk/koder_hand_kit/                ← Flutter SDK package
  ├── lib/
  │   ├── koder_hand_kit.dart              (public API)
  │   └── src/
  │       ├── hand_avatar.dart             (3D avatar widget — picks backend by platform)
  │       ├── hand_overlay.dart            (draggable floating window)
  │       ├── hand_button.dart             ("Ver em Libras" / "Watch in Libras")
  │       ├── hand_gate.dart               (auto-listen to UI text via accessibility tree)
  │       ├── gloss_player.dart            (consumes gloss sequence → animation timeline)
  │       ├── offline_dictionary.dart      (cached high-frequency entries for NF3)
  │       ├── face_overlay.dart            (Rive-based facial NMM blendshapes)
  │       └── backends/
  │           ├── filament_backend.dart    (mobile + desktop via platform channel)
  │           └── modelviewer_backend.dart (web + TV via <model-viewer>)
  ├── android/                             (Kotlin glue: Filament SurfaceView + glTF loader)
  ├── ios/                                 (Swift glue: Filament + Metal)
  ├── linux/macos/windows/                 (C++ glue: Filament desktop)
  ├── assets/                              (Hand glTF master + SMPL-X rig + Rive face)
  └── docs/

products/horizontal/hand/                  ← brand surface
  ├── README.md, koder.toml, icon.svg
  ├── landing/index.html                   (https://hand.koder.dev)
  └── (no app/ — Hand is embedded inside other apps via koder_hand_kit)

4.4 Separation of translation and rendering

Hard architectural boundary: the *translator* produces *glosa* (intermediate language), the *renderer* (Hand SDK) consumes glosa.

Why: gives us four levers we can pull independently:

Change	Affects	Doesn't affect
Better translation model	`services/ai/signs` backend	Hand SDK, animations, apps
Better avatar 3D model	Hand SDK assets	Backend, translation quality
New input modality (voice)	Stage 1 normalization	Stages 2-4, SDK
New output modality (recorded video for non-Flutter integrations)	Renderer parallel to Hand	Backend, glosa

This is the same pattern that lets Whisper improvements not affect Spotify clients.
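One way to make the translator/renderer boundary concrete is a small, versioned glosa payload that both sides agree on. The field names below (`gloss`, `fingerspelled`, `non_manual`, `schema_version`) are assumptions for illustration; the RFC does not define the wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GlosaToken:
    gloss: str                         # e.g. "CASA"
    fingerspelled: bool = False        # True when spelled letter by letter
    non_manual: tuple[str, ...] = ()   # hypothetical marker names


@dataclass(frozen=True)
class GlosaSequence:
    tokens: tuple[GlosaToken, ...]
    schema_version: int = 1

    def to_json(self) -> str:
        """Serialize for transport from the backend to any renderer."""
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(payload: str) -> "GlosaSequence":
        """Rebuild the sequence on the renderer side."""
        data = json.loads(payload)
        tokens = tuple(
            GlosaToken(t["gloss"], t["fingerspelled"], tuple(t["non_manual"]))
            for t in data["tokens"]
        )
        return GlosaSequence(tokens, data["schema_version"])


sequence = GlosaSequence((
    GlosaToken("EU"),
    GlosaToken("K-O-D-E-R", fingerspelled=True),
))
restored = GlosaSequence.from_json(sequence.to_json())
```

Nothing in the payload refers to the source language or to a rendering engine, which is what lets either side change without touching the other.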

4.5 Naming

*`services/ai/signs`*: descriptive, follows AI service convention (voice, recsys, runtime, bot). Technical reference name.
*`engines/sdk/koder_hand_kit`*: SDK package name; mirrors the `koder_kit` family.
*Hand*: public brand of the avatar (visible label "Avatar Hand" / "Hand — sign-language avatar"). Brand score 83 (Great), confirmed in earlier scoping. Visual identity is hands signing, so the semantics exactly match the function.
Settings labels:
- en-US: "Sign-language avatar (Hand)"
- pt-BR: "Avatar de Libras (Hand)"

Two-name pattern (engine + brand) follows the established engines/kodec (technical) ↔ Play/Tune (consumer-facing) pattern.

4.6 Scope ladder

The architecture exists in full from v0. What changes between versions is *the maturity and coverage of each stage*, not the architecture itself. *No throw-aways.*

Version	Stage 1 (norm)	Stage 2 (semantic)	Stage 3 (glosa)	Stage 4 (animate)	Inputs	Surfaces
v0	pt-BR + en-US rules, basic spell correction	Foundation model loaded but used only as a "confidence checker"	Rule-based dictionary (~500-1000 concepts, bilingual triggers) + fingerspelling fallback	~1000 pre-animated signs (motion-capture) + fingerspelling (datilologia) animations	Text only	Mobile (Android first)
v1	Full normalization (regional spelling, idioms)	ML active: fine-tuned foundation model	Hybrid: rules for known concepts, ML for the rest, confidence-routed	Sign bank expanded to ~5000 + classifier rules + spatial grammar	Text + speech (`services/ai/voice` integration)	Mobile + Desktop + Web
v2	Multi-source disambiguation (mixed-lang text)	Larger model, RLHF from production feedback	ML dominant + ensemble + active learning loop in production	Bank of 10k+ signs + regional variants (Northeast / Southeast Libras)	Text + speech + camera (sign → text reverse)	All Koder UIs (incl. TV)
vN (reserved)	-	-	-	-	-	ASL (American Sign Language) as a separate engine; Koder would internationalize for the US deaf market. Same pipeline, different glosa target language, different sign animation bank.

Critically: the rule-based dictionary is *not* a v0-only crutch. It remains in production forever as the high-confidence layer. The same applies to fingerspelling: never deprecated, always available for proper nouns and unknown terms.

4.7 Avatar tech stack (Q2 — RESOLVED 20260429)

Hand is rendered by a *per-platform best-of-class engine*, all consuming the *same glTF asset with SMPL-X rig*. No single engine is forced to be sub-optimal everywhere; one asset workflow serves all surfaces.

Animator workflow                    Rendering per platform
─────────────────                    ──────────────────────
Blender / Maya
    │
    ├─ skeletal animation            ┌──────────────────────────┐
    │   on SMPL-X rig                │ Mobile (Android, iOS)    │
    │                                │   → Filament (Google)    │
    │                                │      via platform channel│
    └─ export glTF + GLB             │      ~6MB native lib     │
        (single source of truth)     ├──────────────────────────┤
              │                      │ Desktop (Lin/Mac/Win)    │
              │                      │   → Filament native      │
              │                      ├──────────────────────────┤
              ├─→ runtime ─→         │ Web (Flutter Web)        │
              │                      │   → <model-viewer> (Google)│
              │                      │      built on three.js   │
              │                      ├──────────────────────────┤
              │                      │ TV (TizenOS, WebOS)      │
              │                      │   → <model-viewer>       │
              │                      │      (TVs are web-first) │
              │                      └──────────────────────────┘
              │
              └─→ Face overlay
                  → Rive (excellent at facial blendshapes,
                          essential for Libras non-manual markers
                          which are part of the grammar, not
                          decoration)

*Why this stack and not a single engine:*

Need	Filament	flutter_scene	Unity	Rive 3D	model-viewer
glTF + SMPL-X support	✅	✅	✅	❌	✅
Production-mature	✅	⚠️ preview	✅	⚠️ 3D new	✅ web only
Bundle ≤ NF2 (8MB mobile)	✅ ~6MB	✅ ~2MB	❌ +25MB	✅ ~1MB	n/a (web)
Hand-finger precision	✅	✅	✅	❌ stylized	✅
Cross-platform	needs glue	yes (preview)	yes (heavy)	yes	web only
Used in Google production	✅	⚠️ preview	n/a	n/a	✅

Filament is the production-mature backend (Sceneform, Google Earth, etc.). flutter_scene will likely match it once it leaves preview; at that point we re-evaluate and possibly migrate mobile/desktop. Only the rendering glue would change, since the glTF asset format and SMPL-X rig are forward-compatible with that migration.

*Why SMPL-X specifically:*

SMPL-X is the de facto rig for sign-language research: it is used by Hand Talk's pipeline, SignAvatars (ECCV 2024), and most academic papers in the space. Adopting SMPL-X aligns with the format ecosystem (Blender/Maya export, glTF compatibility, MediaPipe pose retargeting). Note: the *SignAvatars dataset itself is non-commercial* and excluded from our training corpus (see §4.8 Corpus); we use SMPL-X as a *format/rig* independent of any particular dataset.

*Why Rive for face overlay (and only the face):*

Libras non-manual markers (eyebrow position, mouth shape, eye gaze direction, head tilt) are *grammatical*, not decorative. Filament can do facial blendshapes, but Rive's facial-animation tooling and editor are markedly better and the runtime is tiny (~700KB). The compromise: the face is rendered as a Rive surface composited over the Filament 3D body, sharing timing and parameters via the Hand SDK. Single avatar visually, two engines internally.

*Future migration target:* when flutter_scene ships stable, re-evaluate replacing Filament with it. The same glTF assets work directly; only the platform-channel glue is replaced.

4.8 Corpus strategy (Q4 — RESOLVED 20260429)

The Brazilian Libras corpus landscape is rich (UFPB, UFSC, UFPel, UFPE all have datasets). Our strategy *builds a primary corpus from VLibrasBD (CC BY-SA 4.0, commercial-permitted)* and supplements it with bootstrap MT for en-US plus paid annotation for v2 quality refinement.

*Primary corpus: VLibrasBD (Mendeley Data, CC BY-SA 4.0):*

*127,349 aligned pt-BR ↔ Libras-glosa sentence pairs*, built by 10 Libras interpreters
~72k general-purpose sentences + ~55k from Brazilian federal government content (gov.br services)
DOI: 10.17632/ryj88ckjww — open-download from Mendeley Data
*License: CC BY-SA 4.0*: commercial use permitted; the *share-alike clause* applies to derivative works (interpreted conservatively, our trained models inherit CC BY-SA)

*Acceptance of share-alike:* open-weight models align with Koder's self-hosted ethos. Our value proposition is the integrated stack (Hand SDK + apps + Koder ID + UX), not the model weights themselves. The trade is acceptable vs. the ~$50-100k cost of annotating 127k pairs from scratch.

*Supplementary corpora (license verification pending before training):*

Dataset	Pairs / Records	Use	License status
Libras-UFPel Corpus	4,800 audiovisual recordings (2,400 sentences + 2,400 isolated signs)	Refinement / regional Pelotense variant	TBD
V-Librasil (UFPE)	4,089 sign videos / 1,364 terms / 3 articulators	Pose extraction for animation bank (we extract poses; we don't redistribute videos)	TBD (IEEE Dataport)
LSWH100	144k synthetic handshape images / 100 classes	v2+ sign→text reverse (handshape recognition)	TBD
LIBRAS-UFOP	56 signs Kinect RGB-D + skeleton	Supplementary pose	TBD
~SignAvatars~	70k videos / 8.34M frames SMPL-X	-	Non-commercial only: EXCLUDED from training

*Bilingual strategy (en-US bootstrap):*

VLibrasBD is pt-BR only. To get en-US → glosa coverage:

*MT bootstrap*: pass the PT side of VLibrasBD through DeepL/Google Translate to produce ~127k en-US ↔ glosa pairs. Cost: ~$200 in DeepL API.
*Cross-lingual transfer*: Gemma 4's existing multilingual capacity means a pt-BR → glosa fine-tune partially teaches en-US → glosa for free.
*Paid annotation for v2 refinement*: commission ~10k high-quality en-US → glosa pairs from professional Libras interpreters fluent in English. Estimate: ~$5k (10k pairs ÷ 45 sentences/h × R$100/h ≈ R$22k).
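The paid-annotation estimate above can be checked with a few lines of arithmetic. The inputs are the figures quoted in the text; the implied exchange rate is derived from them and is not a quoted rate.

```python
def annotation_cost_brl(pairs: int, pairs_per_hour: float, rate_brl_per_hour: float) -> float:
    """Interpreter cost in BRL: hours needed times the hourly rate."""
    hours = pairs / pairs_per_hour
    return hours * rate_brl_per_hour


cost_brl = annotation_cost_brl(pairs=10_000, pairs_per_hour=45, rate_brl_per_hour=100)
hours_needed = 10_000 / 45          # about 222 interpreter-hours
implied_brl_per_usd = cost_brl / 5_000   # rate implied by the "~$5k" figure

print(f"hours: {hours_needed:.0f}, cost: R${cost_brl:,.0f}, implied rate: {implied_brl_per_usd:.2f} BRL/USD")
```

The result is about R$22.2k for roughly 222 interpreter-hours. Equating that with ~$5k implies an exchange rate near 4.4 BRL per USD, so the dollar figure moves with the rate used.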

*Animation bank strategy:*

Since SignAvatars is non-commercial, the avatar's signing animations are produced through:

*V-Librasil pose extraction*: process the dataset's videos with MediaPipe Holistic and retarget to our SMPL-X rig (we extract poses, we don't redistribute the videos), license permitting.
*Own motion-capture session*: commission a mocap studio session with a deaf native signer for the initial 1,000 signs (~R$30-60k including studio time, signer fee, post-production).
*Hand-keyframed animation*: supplementary for rare signs, glossary expansion, edge cases.

*Active learning loop (production):*

Model flags high-uncertainty inputs in production
Flagged inputs queued in annotation pipeline
Libras consultants label / dispute / approve
Retrained periodically (monthly initially → weekly at scale)
Output: continuous quality improvement; corpus grows organically without large upfront annotation push
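The loop above can be sketched as a queue between the model and the consultants. The confidence floor, class names, and example glosas are hypothetical, and the sketch assumes flagged content is only retained for users who opted in, since translation requests are otherwise not logged with content.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_FLOOR = 0.6  # hypothetical; below this an input is flagged


@dataclass
class FlaggedInput:
    text: str
    lang: str
    model_glosas: list[str]
    approved_glosas: Optional[list[str]] = None


class AnnotationQueue:
    """Step 1 flags, step 2 queues, step 3 records the consultant's label."""

    def __init__(self) -> None:
        self._pending: deque[FlaggedInput] = deque()
        self.labelled: list[FlaggedInput] = []

    def maybe_flag(self, text: str, lang: str, model_glosas: list[str], confidence: float) -> bool:
        """Queue the input only when the model was uncertain."""
        if confidence >= CONFIDENCE_FLOOR:
            return False
        self._pending.append(FlaggedInput(text, lang, model_glosas))
        return True

    def review_next(self, approved_glosas: list[str]) -> FlaggedInput:
        """A consultant approves or corrects the oldest pending item."""
        item = self._pending.popleft()
        item.approved_glosas = approved_glosas
        self.labelled.append(item)
        return item

    def training_pairs(self) -> list[tuple[str, str, list[str]]]:
        """Step 4 input: (lang, source text, approved glosa) for retraining."""
        return [(i.lang, i.text, i.approved_glosas) for i in self.labelled]
```

Each retraining run would consume `training_pairs()` alongside the existing corpus, which is how the corpus grows without a large upfront annotation push.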

*Cost estimate, corpus year 1: ~$50-80k.*

Item	Cost
VLibrasBD (CC BY-SA)	$0
MT bootstrap PT→EN	~$200
Animation bank (mocap, 1,000 signs)	$5-12k
Initial dictionary curation (Libras consultant, 3 wks)	$3-5k
LoRA fine-tune (1 H100, ~10h)	$10-50
Cultural review board honoraria (Q5)	$2-5k/yr
Paid annotation v2 (en-US 10k pairs)	$5k
Active learning corpus growth (year 2+)	$10-20k/yr
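As a quick cross-check, the year-1 rows of the table can be summed. The low/high bounds below are read directly from the table, and the "year 2+" row is left out because it is not a year-1 cost.

```python
# (low, high) bounds in USD, taken from the table rows above.
YEAR1_ITEMS_USD = {
    "VLibrasBD (CC BY-SA)": (0, 0),
    "MT bootstrap PT->EN": (200, 200),
    "Animation bank (mocap, 1,000 signs)": (5_000, 12_000),
    "Initial dictionary curation": (3_000, 5_000),
    "LoRA fine-tune (1 H100, ~10h)": (10, 50),
    "Cultural review board honoraria": (2_000, 5_000),
    "Paid annotation v2 (en-US 10k pairs)": (5_000, 5_000),
}

low_total = sum(low for low, _ in YEAR1_ITEMS_USD.values())
high_total = sum(high for _, high in YEAR1_ITEMS_USD.values())

print(f"itemized year-1 total: ${low_total:,} to ${high_total:,}")
```

The itemized rows sum to roughly $15k-27k, which is lower than the ~$50-80k headline; the table does not itemize the difference.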

4.9 Data infrastructure (built from day 1, not day 1000)

These exist starting v0 even if at small scale, so v1's ML training plugs in cleanly:

*Corpus*: versioned object storage with metadata (source, dialect, quality score, lang tag). pt-BR and en-US examples co-located, both targeting the same glosa.
*Annotation pipeline*: web tool (services/ai/signs/annotation/) for Libras consultants (deaf native signers we contract) to label, verify, dispute. Permissive of regional variants; tracks reviewer identity for academic citation.
*Active learning loop*: model flags high-uncertainty inputs in production; humans label; retrained periodically (initially monthly, later weekly).
*Eval suite*: held-out test set + regression suite + production feedback. BLEU-glosa as primary metric; human review (deaf native signer panel) as quality gate.
*Model registry*: every trained model versioned with corpus snapshot + hyperparameters + eval results. Rollback automatic if production metrics regress.
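The registry's automatic rollback can be illustrated with a minimal in-memory sketch. The class names, the single BLEU-glosa metric, and the 0.02 regression tolerance are assumptions; a real registry would also store weights, hyperparameters, and the full eval report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    version: str
    corpus_snapshot: str
    eval_bleu_glosa: float  # score on the held-out test set


class ModelRegistry:
    def __init__(self, max_regression: float = 0.02) -> None:
        self._history: list[ModelVersion] = []
        self._max_regression = max_regression  # hypothetical tolerance

    def promote(self, model: ModelVersion) -> None:
        """Make a newly trained model the active one."""
        self._history.append(model)

    @property
    def active(self) -> Optional[ModelVersion]:
        return self._history[-1] if self._history else None

    def report_production_metric(self, bleu_glosa: float) -> Optional[ModelVersion]:
        """Roll back if production quality falls below the previous version's eval."""
        if len(self._history) >= 2:
            previous = self._history[-2]
            if bleu_glosa < previous.eval_bleu_glosa - self._max_regression:
                self._history.pop()  # automatic rollback
        return self.active
```

Keeping the corpus snapshot on each version means a rollback also identifies which data the restored model was trained on.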

4.10 Serving infrastructure — gRPC (Q8 — RESOLVED 20260429)

*Decision: follow existing Koder gRPC convention.*

gRPC is already the established backend transport in the Koder Stack. Audit found:

`servicesfoundationidengineprotokoderid{identity,keys,session,admin}v1