Development Stages and Classification of ASL Avatars and Recognition Models
Written by Travis Dougherty, CEO of GoSign.AI
Originally posted on LinkedIn: https://www.linkedin.com/pulse/development-stages-classification-asl-avatars-models-travis-dougherty-qgn2e/

Introduction
American Sign Language (ASL) communication is complex and multimodal – it involves not just manual hand signs, but also facial expressions, head movements, body posture, and other cues synchronized together. These layers convey phonological, grammatical, and pragmatic information in parallel. Sign language avatars are computer-animated characters that produce signing (often from text input), and sign language recognition (SLR) models attempt to interpret sign language from video or sensor data. Both technologies have advanced rapidly, but they are still measured against the richness of human signing. Researchers emphasize that full sign language understanding or generation requires integrating all these channels in a coordinated way. This report reviews how the development of signing avatars and SLR models can be classified into stages mirroring linguistic acquisition, and how we evaluate their skills at each stage. We also compare these frameworks to human sign language learning (and spoken language benchmarks like CEFR), and note any standardized or proposed evaluation rubrics, datasets, and open challenges.
Linguistic Levels and Developmental Stages in Sign Language Technology
Sign language linguistics defines multiple levels of structure – from phonology (basic parameters like handshape or movement) to lexicon (individual signs), syntax (sentence grammar), and discourse/pragmatics (context and conversation). Similarly, we can categorize the maturity of sign language technologies by the complexity of language units they handle:
Phoneme/Fingerspelling Level
At the most basic level, models focus on the “phonemes” of sign language – the fundamental components of signs – and on fingerspelling (the manual alphabet). Early sign recognition research often tackled fingerspelling as a simpler sub-problem: recognizing static handshapes for letters or numbers. SLR models in this stage classify hand configurations and simple gestures. In parallel, sign synthesis models can operate at a phonemic level by stringing together basic parameters. For example, some avatar frameworks break each sign into formational features (handshape, location, movement, palm orientation, etc.) and drive animations from these parameters. This parametric approach is analogous to generating speech from phonemes. It allows coverage of many signs if the system can produce all necessary components. Phoneme-level evaluation involves testing whether the system correctly renders or recognizes basic units: e.g. distinguishing handshape differences or correctly producing a fingerspelled word. Sign linguists have identified five key parameters (four manual and one non-manual) that distinguish signs. An understanding of these is critical at this stage.
Minimal pairs in ASL illustrating the five phonemic parameters of signs: handshape (e.g. CANDY vs APPLE), location (e.g. SUMMER vs UGLY), movement (e.g. COFFEE vs YEAR), palm orientation (e.g. BALANCE vs MAYBE), and a non-manual signal (e.g. LATE vs NOT-YET, which differ by facial expression). At the phoneme level, avatars must capture and combine these elemental differences, while SLR models must detect them.
Avatars at this level: A basic signing avatar might be limited to fingerspelling or a small set of signs rendered one-by-one without nuanced motion. Its movements may appear stiff or “dictionary-like” – one sign after another – because coarticulation (transitions) and facial grammar are not yet handled. However, using a phonemic representation (such as HamNoSys or SiGML notation) can give the avatar linguistic awareness of sign structure, aiding future scalability. For instance, researchers have developed notation-based models where each sign is encoded as a bundle of features, analogous to phonetic symbols, and then animated. This stage establishes a foundation, ensuring the avatar can form the building blocks of signs correctly (e.g. the handshapes for the ASL alphabet).
SLR models at this level: Early or limited SLR models may recognize a set of handshapes or track specific parts of the body. Many projects started with isolated static gestures (like letter spelling or numeral signs) because they involve a single posture per symbol. A common approach is component-level classification: the system first identifies elements such as the handshape, orientation, or location, then combines those to infer which sign is present. This decoupling mirrors phoneme recognition in speech. It simplifies the task because there are fewer primitive classes (e.g. dozens of handshapes instead of thousands of whole signs). Notably, if an SLR system covers all necessary phonemic categories, it could recognize new signs by composition without retraining. At this stage, accuracy on detecting these basic components is the key metric. For example, a system might be tested on how well it recognizes each letter in a fingerspelled word or differentiates minimal pairs that only differ in one parameter. Studies have shown that incorporating non-manual features even at this low level can boost recognition accuracy (e.g. adding facial features improved sentence recognition from 88% to 92% in one experiment).
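To make the component-level idea concrete, here is a minimal sketch in Python (a toy parameter bundle and two illustrative lexicon entries, not any particular published system) showing how per-parameter predictions could be composed into a lexical decision:

```python
from dataclasses import dataclass

# Toy phonemic "bundle": each sign is a combination of the four manual
# parameters (a real system would add non-manual features as a fifth channel).
@dataclass(frozen=True)
class SignParameters:
    handshape: str
    location: str
    movement: str
    orientation: str

# Hypothetical mini-lexicon keyed by parameter bundles (illustrative values only:
# a minimal pair differing in location).
LEXICON = {
    SignParameters("open-5", "forehead", "tap", "palm-left"): "FATHER",
    SignParameters("open-5", "chin", "tap", "palm-left"): "MOTHER",
}

def recognize_by_composition(predicted: SignParameters) -> str:
    """Compose per-parameter predictions into a lexical decision.
    Adding a new sign only requires a new lexicon entry; the component
    classifiers themselves do not need retraining."""
    return LEXICON.get(predicted, "<unknown sign>")

# In a full pipeline, four small classifiers (handshape, location, movement,
# orientation) would each produce one field of SignParameters from video.
print(recognize_by_composition(SignParameters("open-5", "chin", "tap", "palm-left")))
# -> MOTHER
```

The composition step is what gives this stage its scalability: covering a new sign only means adding a lexicon entry, while the small set of parameter classifiers stays fixed.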
Lexical/Sign Level
The next developmental tier involves whole signs as lexical items. Here the goal is to produce or recognize individual signs (glosses) in the vocabulary. For avatars, this means the system has a sign dictionary and can animate each sign, often by stringing together prerecorded or parameterized motions. Many sign synthesis models indeed operate at the gloss level: they take input text, map each word to an equivalent sign (gloss), then play those sign animations in sequence. In this stage, the avatar might still lack fluent transitions, but it can communicate basic ideas sign-by-sign. The prototype described by Xu et al. (as summarized in a review) is an example: it converts an input sentence into lexical units and grammatical markers, encodes the signs via a dictionary, and then plays the sequence on the avatar. The emphasis is on isolated sign accuracy – each sign should be correct and clearly rendered. Evaluation criteria include whether the avatar’s sign matches the target meaning and is understandable in isolation. Native signers might be asked to identify the sign the avatar produced, to measure if it’s recognizable.
On the SLR side, this stage corresponds to isolated sign recognition. The system can identify signs one at a time when they are not running together in a sentence. This is a widely studied task, often framed as classifying video clips each containing a single sign from a known vocabulary. Research has produced many public benchmarks for isolated SLR – for instance, the WLASL dataset (a large ASL lexicon in video) and others for various sign languages. Typical metrics are classification accuracy or top-5 accuracy on a test set of isolated signs. Achieving high accuracy at the lexical level is necessary before tackling full sentences. Indeed, a survey notes that much of automatic sign language analysis to date focused on recognizing signs in their citation (dictionary) form. That is analogous to recognizing spoken words without yet understanding how they change in context. A strong lexical-level SLR model might correctly recognize, say, 95% of a set of 1,000 isolated ASL signs in lab settings. However, it might still falter on signs produced in running discourse or by new signers if it’s overfitted to specific conditions. At this level, there is little understanding of grammar – each sign is treated independently.
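As a concrete illustration of how isolated-sign benchmarks are typically scored, the sketch below computes top-1 and top-5 accuracy from per-clip class scores (toy data only; real evaluations run over a held-out test split of a corpus such as WLASL, ideally with signers unseen in training):

```python
from typing import Dict, Sequence

def top_k_accuracy(scores: Sequence[Dict[str, float]],
                   labels: Sequence[str],
                   k: int) -> float:
    """Fraction of clips whose true gloss is among the k highest-scoring classes."""
    hits = 0
    for class_scores, true_gloss in zip(scores, labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if true_gloss in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy example: three test clips, model scores over a tiny vocabulary.
scores = [
    {"BOOK": 0.7, "SCHOOL": 0.2, "PAPER": 0.1},
    {"BOOK": 0.4, "SCHOOL": 0.5, "PAPER": 0.1},   # top-1 miss, but true gloss ranks 2nd
    {"BOOK": 0.1, "SCHOOL": 0.3, "PAPER": 0.6},
]
labels = ["BOOK", "BOOK", "PAPER"]

print("top-1:", top_k_accuracy(scores, labels, k=1))   # 0.67
print("top-5:", top_k_accuracy(scores, labels, k=5))   # 1.0
```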
Sentence/Grammatical Level
Beyond isolated signs, the next major leap is handling sequences of signs arranged in grammatical order – i.e. continuous signing at the sentence level. This stage introduces sign language grammar: use of space for grammatical roles, facial expressions as syntactic markers (e.g. eyebrow raises for questions), inflection of signs (modulating movement to indicate aspect or agreement), and coarticulation effects between neighboring signs. For signing avatars, reaching this level means the avatar is no longer just playing one sign after another from a dictionary; it must modify signs and insert transitions so that the sentence is fluent and linguistically correct. Researchers developing sign avatars have worked on tools to blend signs smoothly and apply grammatical non-manual signals. For example, an avatar might need to drop a hand at the end of one sign in preparation for the next, or change a movement to indicate plural or temporal aspect. There are prototype models using rule-based grammars (e.g. HPSG-based generation of ASL sentences) that ensure the avatar’s output follows grammatical rules. One reported approach divides input text into a lexical layer and a grammatical layer: at the grammatical layer, rules handle word order changes or function words, then the avatar signs the resulting sentence. Achieving proper coarticulation – the blending between signs – and grammatical accuracy is a major benchmark at this level. Avatars start to include non-manual markers in their signing; for instance, raising the eyebrows during a yes/no question, or puffing cheeks for intensity. A notable example is the SiMAX avatar system, which explicitly included a “grammar function” to handle facial expressions like raised eyebrows for questions. An avatar at this stage is evaluated on sentence-level intelligibility: can Deaf viewers correctly understand entire sentences, not just individual signs? Studies like Kipp et al. (2011) introduced comprehensibility testing where avatar-signed sentences are compared against human signers.

In these tests, participants try to understand sentences signed by the avatar; the results indicate how close the avatar is to human clarity. For instance, one method (“delta” testing) assessed comprehension by measuring the difference in understanding between an avatar and a live signer for the same utterances.
For SLR models, the sentence level is addressed by continuous sign language recognition (CSLR) or sign language translation (SLT) systems. Continuous SLR means the system takes a video with a stream of signs (without clear boundaries) and must segment and recognize the sequence. This is substantially harder than isolated SLR because signs influence each other’s appearance and there are no obvious breaks. Indeed, continuous sign sentences usually involve a mix of fingerspelled words, short pauses, and modified signs that are context-dependent. Over the past decades, continuous SLR has seen progress with sequence modeling techniques (e.g. Hidden Markov Models, and more recently RNNs and Transformers with gloss or text outputs). A critical benchmark here is the RWTH-PHOENIX-Weather corpus (German Sign Language), a popular dataset with weather report sentences and corresponding annotations. Models are evaluated with sequence accuracy metrics analogous to speech recognition, such as Word Error Rate or Sign Error Rate (how often the recognized sequence deviates from the reference). For example, San-Segundo et al. report a Sign Error Rate of 31.6% on one machine-translation avatar system, meaning roughly a third of signs were incorrect. Some works also use BLEU score to evaluate if the sequence of signs (or the translated text) matches reference translations. Beyond just recognizing the gloss sequence, an advanced SLR at this level tries to capture grammatical markers: e.g. detecting if a question is being signed (which might require noticing a brow raise or a specific question sign). Researchers note that simply stringing together recognized lexical signs is not enough for full understanding – the system must also interpret grammatical processes that alter sign appearance (like a verb being directed towards a location to indicate its object). Handling these requires integrating the manual recognition with non-manual cues (for example, decoding a head tilt as part of grammar). Progress is being made: including facial features has been shown to improve continuous SLR performance. Still, grammar understanding is an area of active research and few models truly parse sign language grammar; most focus on getting the correct gloss sequence or translating to spoken language equivalents.
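For readers unfamiliar with these metrics, the sketch below shows how a Sign Error Rate is commonly computed as an edit distance over gloss sequences (the glosses here are invented for illustration); BLEU and other translation metrics are usually taken from standard MT toolkits rather than reimplemented:

```python
def sign_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Edit distance (substitutions + deletions + insertions) between the
    recognized gloss sequence and the reference, divided by reference length."""
    R, H = len(reference), len(hypothesis)
    # d[i][j] = minimum edits to turn the first i reference glosses
    # into the first j hypothesis glosses.
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution / match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[R][H] / max(R, 1)

ref = "TOMORROW RAIN MAYBE".split()
hyp = "TOMORROW SNOW MAYBE MORE".split()          # one substitution, one insertion
print(f"SER = {sign_error_rate(ref, hyp):.2f}")   # 2 edits / 3 glosses = 0.67
```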
Discourse/Pragmatic Level
The highest stage of skill involves discourse and pragmatic competence – the ability to handle multiple sentences in context, conversational turn-taking, and the subtleties of meaning beyond literal signs. For an avatar, this means functioning as an embodied conversational agent in sign language. The avatar would need to maintain continuity across sentences (e.g. keeping track of referents established in space), use pragmatic cues (like eye gaze shifts to manage turn-taking or role-shifting when telling a story), and possibly adjust signing style to the context or user (more formal vs casual signing, etc.). This level is largely aspirational with current technology. Some research projects have begun treating sign avatars as virtual humans that can interact – for example, an Irish Sign Language avatar has been envisioned as an interactive conversational agent. Achieving discourse-level proficiency requires integration of natural language understanding, not just translation. The avatar would need to choose when to emphasize a point with larger signing, how to ask for clarification, how to show backchannel feedback (like nodding while “listening”), and so on. Evaluation at this level might involve user studies in realistic interactive scenarios: e.g. can Deaf users carry on a dialogue with the avatar on a given topic and feel that it follows the conversation appropriately? So far, no standard test suite exists for pragmatic abilities, since very few models reach this level in a general way. However, one could imagine tests for contextual responsiveness – for instance, whether an avatar correctly uses pointing signs to refer back to entities introduced earlier, or appropriately uses facial expressions to signal a change in topic or a conversational intent.
For SLR models, discourse understanding means the system goes beyond transcribing signs to actually interpreting meaning in context. This could involve linking pronouns or indexical signs to their antecedents across sentence boundaries, recognizing the topic of a conversation, or understanding idiomatic and culturally nuanced expressions. Virtually all current SLR models operate at the sentence level or below – true discourse-level comprehension (especially in open-domain conversation) remains unsolved. A “pragmatic” SLR system might, for example, observe a multiparty signed conversation and determine who is addressing whom, or detect the intent behind a signed utterance (question, command, joke, etc.) in context. Achieving this would align SLR with broader AI language understanding. The true test of sign recognition, as Ong and Ranganath put it, is dealing with natural signing by native signers in all its richness. That includes the rapid, context-dependent signing that occurs in real dialogues, complete with interruptions, repairs, and shifts in footing. Progress towards this is limited by data – large multi-sentence sign language dialogue corpora are scarce – and by the complexity of modeling so many linguistic layers together. Some comparative frameworks look to spoken language for inspiration, noting that spoken language tech has metrics for discourse (like coherence or dialog success rates). For sign, an analogous evaluation might be whether a recognized dialogue can be correctly answered by a system, indicating it grasped the context. This area is rife with open questions: how to incorporate world knowledge and context into sign recognition? How to handle the fact that signers often omit information that is understood from context (pro-drop and context-heavy communication)? These challenges point to the need for holistic evaluation beyond per-sign accuracy.
In summary, as we move from phoneme-level to discourse-level, both avatars and SLR models must incorporate more linguistic knowledge and context:
Phoneme/fingerspelling stage: basics of hand configurations and isolated gestures.
Lexical stage: a repertoire of signs produced or recognized individually.
Sentence/grammatical stage: fluent sequences with correct grammar and transitions.
Discourse stage: maintaining coherence and appropriate use across interactions.
Each stage builds on the previous. Notably, human signers (e.g. Deaf children) acquire these competencies in phases as well – first babbling and basic signs, then simple combinations, then complex grammar by early childhood. Current technology mirrors this trajectory, with many models now between the lexical and sentence level, and research pushing into grammatical and discourse capabilities.
Evaluation Frameworks for Sign Language Avatars
How do we measure the “skill” of a sign language avatar? Researchers and practitioners have proposed a variety of criteria, often evaluated through a mix of expert review, user studies with Deaf participants, and objective metrics. Important dimensions for avatar evaluation include:
Linguistic Accuracy: Does the avatar sign the correct signs and follow the intended meaning? This includes choosing the right sign equivalents for words (particularly important for sign language translation models) and using correct sign order when forming sentences. For example, in one study with an early avatar (TESSA), Deaf focus group participants were asked to identify if each signed phrase was accurate; they found only ~61% of the avatar’s phrases were correctly understood, with most errors caused by unclear signing. Accuracy can be quantified by comparing the avatar’s output to reference translations (using metrics like BLEU for overall translation quality or calculating a Sign Error Rate – the percentage of signs that were wrong). High linguistic accuracy is fundamental; an avatar might have a pleasant look and fluid motion, but if it signs the wrong words, it fails its main purpose.
Expressiveness and Non-Manual Markers: This criterion evaluates how well the avatar employs facial expressions, mouth movements, head tilts, shoulder raises, and other non-manual signals that are integral to sign language grammar and emotion. A skilled avatar should raise its eyebrows for a yes/no question, frown or squint for a wh-question, use appropriate mouth shapes (mouthing or mouth gestures) for certain signs, and convey affect (like surprise or happiness) when needed. Expressiveness greatly affects comprehensibility and naturalness. Avatars are often criticized when they are “stone-faced” or monotonous. One competition report noted that an avatar’s expression richness and non-manual element depiction were areas needing improvement even when its hand movements were correct. There are currently no simple automatic metrics for expressiveness, so evaluation relies on native signers’ judgments: e.g. rating how well an avatar’s facial expression matched the sentence’s intent. Recent research has also started to quantify emotional expressiveness – for instance, assessing if adding emotion to an avatar’s face yields higher comprehension (some studies found only a small improvement). Overall, inclusion of non-manuals is considered essential for an avatar to reach advanced levels of signing proficiency.
Coarticulation and Fluency: This measures how smoothly the avatar transitions between signs and whether it avoids unrealistic stop-and-start motion. Human signers naturally blend signs, adjusting hand trajectories so the end of one sign leads into the start of the next. If an avatar lacks coarticulation handling, its signing will look disjointed or robotic (each sign starting from a default position). Fluent motion also involves timing – slight holds or faster movements as appropriate. Avatars are evaluated on whether their signing flows like a human’s. Techniques like keyframe interpolation or motion blending are used to improve this. For example, researchers showed that using key frame animations from a sign library can produce more natural-looking sequences than playing back unedited motion-capture clips. In evaluations, general users and Deaf viewers often rate “naturalness” on a Likert scale. In one contest, judges gave separate scores for accuracy vs naturalness, and some teams scored higher in naturalness due to smoother animation despite similar accuracy. Thus, an avatar should be not only correct but also fluid in delivery.
Grammatical Fidelity: This overlaps with accuracy and expressiveness but focuses on syntax and morphology. Does the avatar correctly handle changes required by sign language structure? For instance, ASL uses spatial agreement – an avatar should modify verb movement to indicate subject/object if required (pointing toward established locations in space). It should also handle plural forms (maybe by slight repetition or circular movements) and use classifiers appropriately (handshapes representing classes of objects, placed and moved according to what they depict). Evaluating grammar often involves expert analysis of specific phenomena. Some formal proposals exist: for example, quality frameworks from machine translation applied to sign avatars include checking gloss translation accuracy and grammar search correctness. Patel et al. measured their avatar’s grammatical output by the correctness of an intermediate translation step and reported about 77% translation accuracy with their rule-based grammar component. Another approach is having linguists or Deaf evaluators review avatar output for known grammatical features (negation, questions, role shift, etc.) and note errors. A lack of correct grammatical rendering can make an avatar’s signing confusing or ungrammatical to viewers even if individual signs are right. As a case in point, in a Swiss German Sign Language avatar evaluation, testers used relatively simple sentences (CEFR A1–A2 level) precisely because more complex grammar might have been beyond the avatar’s capabilities at that time.
Comprehensibility and Intelligibility: Ultimately, can people understand the avatar? Comprehensibility is often treated as the gold-standard metric, especially by end-users. It is tested by methods such as asking Deaf participants to watch avatar-signed utterances and then demonstrate or explain their understanding. Kipp et al. introduced delta testing – comparing understanding of an avatar’s signing to understanding of equivalent human signing – as a way to objectively score how far the avatar lags behind a human signer. If the human signer gets 95% of viewers to understand a sentence and the avatar only 70%, the 25-point “delta” quantifies the gap (a simple computation of this kind is sketched after this list). In practice, many studies report subjective comprehension rates. For instance, one experiment found no significant difference in viewers’ comprehension whether an avatar had added emotional facial expressions or not, implying that basic intelligibility came more from clear signing than from emotional nuance. Comprehensibility is influenced by all the above factors (accuracy, fluency, etc.), so it serves as a composite outcome measure.
User Acceptance and Preference: Even if an avatar is understandable, do users like it and trust it? This is a somewhat less tangible metric but important for real-world deployment. It includes the avatar’s visual appearance (cartoonish vs realistic) and how engaging or relatable it is. A 2016 Purdue University study, Toward the Ideal Signing Avatar, compared a highly realistic avatar model to a more stylized (simplified/cartoon) avatar signing the same sentences. They found that although recognition and legibility of signs were rated similarly for both, users significantly preferred the stylized character’s appearance. The realistic avatar did not confer any intelligibility benefit and was actually less appealing. This indicates that uncanny valley issues can affect acceptance. Avatars can be customized (e.g., different avatars for different target audiences, such as a friendly child avatar for kids’ content). Evaluating preference often uses surveys or the System Usability Scale (SUS). For example, an Arabic signing avatar app was given to Deaf users who then scored it ~79.8 on SUS, indicating acceptable usability. Feedback from focus groups is also crucial; a 2022 review highlights that involving Deaf users in focus groups helps identify why some reject avatar technology (historical mistrust, poor quality, etc.). So, beyond technical measures, a successful avatar is one that the Deaf community is willing to adopt.
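To illustrate the delta-testing idea mentioned under comprehensibility, here is a minimal sketch with made-up viewer responses; it simply compares per-utterance comprehension rates for a human signer and an avatar:

```python
from statistics import mean

def comprehension_rate(responses: list[bool]) -> float:
    """Share of viewers who correctly understood an utterance."""
    return sum(responses) / len(responses)

def comprehension_delta(human: dict[str, list[bool]],
                        avatar: dict[str, list[bool]]) -> float:
    """Average gap (human minus avatar) in comprehension across the same
    test utterances, in the spirit of Kipp et al.'s delta testing."""
    deltas = [comprehension_rate(human[utt]) - comprehension_rate(avatar[utt])
              for utt in human]
    return mean(deltas)

# Made-up responses: True means the viewer understood the utterance.
human = {"weather-1": [True] * 19 + [False],         # 95%
         "weather-2": [True] * 18 + [False] * 2}     # 90%
avatar = {"weather-1": [True] * 14 + [False] * 6,    # 70%
          "weather-2": [True] * 13 + [False] * 7}    # 65%

print(f"delta = {comprehension_delta(human, avatar):.2f}")  # 0.25 (25 points)
```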
In practice, evaluations of signing avatars combine qualitative methods (user focus groups, interviews, expert analysis) and quantitative metrics (BLEU scores, error rates, comprehension percentages, SUS scores). A 2022 article by Bennbaia surveyed many such methods and concluded that no single metric suffices – comprehensive testing should cover accurate translation, non-manual signals, spatial referencing, avatar appearance, and naturalness. One challenge noted is the lack of a unified, widely adopted evaluation standard for avatars. Projects like ViSiCAST/eSIGN in the 2000s did conduct comprehensibility tests, but often with small samples and varying protocols. Today, efforts like the annual Sign Language Translation & Avatar Technology (SLTAT) workshops aim to develop better evaluation practices. Until a standardized rubric emerges, the best approach is to triangulate using multiple measures as we’ve outlined.
Evaluation Frameworks for Sign Language Recognition Models
For SLR models, evaluation frameworks depend on the specific task (isolated sign classification vs continuous recognition vs translation) and on how much linguistic understanding the model purportedly has. Key performance indicators include recognition accuracy at various levels, but also extend to how well the system deals with different signers, noise, and linguistic generalization. Here are common evaluation aspects for SLR models:
Isolated Sign Recognition Accuracy: If the model is designed for recognizing single signs (or short phrases) from a fixed vocabulary, the straightforward metric is accuracy (% of signs correctly recognized). Datasets like ASL-LEX or WLASL (which covers 2,000+ ASL signs) are used to benchmark this. Researchers report top-1 and top-5 accuracy, confusion matrices of which signs get mixed up, etc. A state-of-the-art isolated SLR model might exceed 95% accuracy on a controlled dataset, but performance can drop with larger vocabularies or more varied inputs. Evaluation should also consider signer independence – i.e., testing on signers not seen during training. A robust model handles different body sizes, speeds, and styles. Some works explicitly test on a held-out signer to measure generalization. Additionally, since isolated SLR is a closed-set classification problem, false positives (misclassifying a sign as another in the set) are tracked. Precision and recall might be calculated if dealing with detection from continuous input. Overall, isolated sign accuracy is the analogue of “word recognition accuracy” in speech. It’s necessary but not sufficient for higher-level tasks.
Continuous Recognition and Segmentation Performance: For models tackling full sentences, evaluation becomes more complex. Metrics borrowed from speech recognition and translation are common. Sign Error Rate (SER) is one, defined similarly to word error rate – it counts substitutions, deletions, and insertions of sign glosses compared to a reference, divided by the reference length. A SER of 0% would mean a perfect match to the ground truth gloss sequence. Another metric is BLEU score (and related MT metrics like ROUGE or METEOR) if the output is in a spoken language (for sign-to-text translation). These give a sense of how well the system conveys the content of the sign stream. For example, in one machine translation evaluation, a BLEU score of ~0.58 was achieved along with a 31.6% SER, as mentioned earlier. Aside from overall sequence accuracy, researchers examine segmentation quality – whether the model correctly finds the boundaries between signs in continuous input. This can be evaluated by IoU (intersection-over-union) between predicted segments and true segments on a timeline, or by measuring alignment quality if using time-aligned gloss annotations. Continuous SLR is often evaluated on specialized corpora (like PHOENIX for German Sign or CSL for Chinese Sign Language) with predefined train/test splits. Competitions and benchmarks (e.g., the ChaLearn LAP challenges, or more recently the AUTSL challenge for Turkish SL) provide leaderboards for sequence recognition tasks. One important aspect is grammar and language modeling: does the SLR output respect the grammar of the target representation? If the system outputs sign gloss sequences, one can check grammatical plausibility (though glosses often don’t capture all inflections). If it outputs spoken language sentences (doing translation), then standard grammaticality of the text can be assessed. Some advanced evaluations have looked at how well an SLR system handles specific sign linguistics in context – for instance, if a test set has many inflected verb forms or classifier constructions, does the system still recognize the base sign or interpret the inflection? These are not yet common in benchmarking, but are of research interest as models begin to incorporate linguistic tokens beyond simple glosses.
Sign Language Understanding and Grammar Awareness: This refers to evaluating whether the model truly captures the meaning and grammatical nuances, not just producing correct labels. For example, an SLR system might correctly output the gloss “BUY JOHN BOOK” for an ASL sentence meaning “John buys a book,” but does it understand who is the buyer vs seller? (In ASL, roles might be indicated by spatial placement or orientation of the verb “buy”.) Evaluating understanding might involve downstream tasks: e.g., question answering (the system watches a signed sentence and answers a question about it), or coherence tests (given a signed statement and a follow-up, does the system detect if the follow-up is a logical response?). Such evaluations are still quite rare. One proxy is assessing if the system can handle linguistic generalization: for instance, can it recognize a sign in a novel inflection it never saw in training? A specific example is numeral incorporation (signs that include numbers, like “3-WEEKS”); a model trained only on “1-WEEK” and “2-WEEKS” might be tested on “3-WEEKS” to see if it generalizes the pattern. Similarly, role shift (where a signer assumes a character’s role in a narrative by a slight body shift and facial expression) is a pragmatic marker – evaluating recognition of that could be done by seeing if the system outputs a role-shift marker in its annotation. Because formal benchmarks for these fine-grained linguistic features are lacking, researchers often perform error analysis on system outputs to identify if grammar-related errors occur. Ong & Ranganath (2005) argued that non-manual signals and grammatical processes had received comparatively little attention and that future models needed to integrate these for “full understanding”. An increase in papers on this topic (e.g. recognizing negation via headshake, or questions via brow raise detection) suggests a developing evaluation focus: measuring how well models detect such non-manual cues in context. In one study on syntactic facial expressions, integrating those features boosted recognition of sentence types.
Robustness, Signer Independence, and Adaptability: An important practical evaluation of SLR models is how they perform under varied conditions. This includes different signers (demographics, signing styles), different lighting or camera setups, signing speeds, and whether the model can handle vocabulary it wasn’t explicitly trained on. Signer-independent evaluation is standard: test sets usually contain signers not in the training set. If a system is intended for real-world use, it might also be tested on spontaneous signing (from dialogues or unscripted content) versus the typically more formal, isolated signing found in training corpora. Error rates often climb with spontaneous input. Another facet is adaptability: can the model adapt to a new signer quickly? Some evaluations allow model adaptation on a small set of samples from a new signer and then test on more from that signer, to simulate personalization. Also, since sign languages differ regionally and individually, robustness includes handling synonym signs or variants. One framework for evaluation is cross-dataset testing: train on one dataset, test on another, to gauge generalization. For example, a model trained on lab-recorded signs might be tested on a video of an interpreter signing in a noisy background. Measuring the drop in accuracy quantifies robustness to domain shift. With the rise of deep learning, researchers have noted issues like overfitting to specific signing backgrounds or clothing; thus some benchmarks explicitly introduce variations. The recent SignAvatars dataset with 3D motion capture aims to provide a broader basis for testing algorithms on multiple prompts and multiple signers in a unified way. Competitions sometimes include a challenge set with adversarial conditions (e.g. motion blur or partial occlusion of the signer) to see how gracefully models degrade. In summary, beyond raw accuracy, a mature SLR model is judged by its stability across realistic conditions – a crucial consideration for any product intended to interpret sign language in the wild.
Speed and Real-Time Performance: While not a linguistic metric, speed is often evaluated for SLR models intended for live interpretation. Can the model process video in real-time (e.g. 30 frames per second) with minimal latency? This is important for interactive applications like video conferencing with automatic sign recognition. If we view development stages, earlier research often ignored real-time aspects and processed offline, whereas newer models attempt streaming recognition. Evaluation might involve measuring inference time per frame and end-to-end latency from a sign being performed to the output being produced. Some academic challenges set an upper bound on model runtime or include efficiency as a tiebreaker. For avatars, there is a parallel in evaluating how quickly text can be translated to sign animations (some report processing X words per second and end-to-end delay), but generally for avatars speed is less critical than for recognition (since an avatar can buffer text). For SLR, a model that is 99% accurate but takes 10 seconds to recognize a simple sentence would be impractical in conversation, so there’s a trade-off where slightly lower accuracy might be acceptable for a big gain in speed.
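As a rough illustration of the speed criterion, the sketch below times a stand-in recognizer frame by frame and checks whether its 95th-percentile latency fits within a 30 fps budget (the recognizer callable and frame objects are placeholders, not a specific model or API):

```python
import time
from statistics import mean, quantiles

def profile_streaming(recognizer, frames, target_fps: float = 30.0):
    """Measure per-frame inference latency for a streaming recognizer.
    `recognizer` is any callable taking one frame; `frames` is an iterable
    of decoded video frames (placeholders here)."""
    latencies = []
    for frame in frames:
        start = time.perf_counter()
        recognizer(frame)
        latencies.append(time.perf_counter() - start)
    p95 = quantiles(latencies, n=20)[18]            # 95th-percentile latency
    budget = 1.0 / target_fps                       # ~33 ms per frame at 30 fps
    return {
        "mean_ms": mean(latencies) * 1000,
        "p95_ms": p95 * 1000,
        "real_time": p95 <= budget,                 # can it keep up with live video?
    }

def dummy_model(frame):
    return "GLOSS"   # stand-in for real inference

print(profile_streaming(dummy_model, frames=[None] * 300))
```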
It’s worth noting that SLR evaluation frameworks draw from both computer vision metrics (for gesture recognition) and language technology metrics (for translation and understanding). An emerging direction is evaluating SLR in an end-to-end sense: for example, how well does an SLR model assist in a real task like searching a sign language video archive or aiding communication? This is analogous to evaluating speech recognition by task success (did the voice assistant do what the user wanted?). While formal “sign language understanding” benchmarks are still rudimentary, the field is gradually incorporating multi-factor evaluations. A recent survey presented a taxonomy (see Figure 1 in Sarhan & Frintrop 2023) highlighting that SLR models are characterized by input modalities (RGB video, depth sensors, skeleton/keypoint input), by which sign parameters they model (manual vs non-manual features), by fusion methods for those features, etc., and that multiple complementary channels (hand, face, body) are needed for robust recognition.

This reinforces the importance of multimodal evaluation: a model should ideally be tested on its ability to utilize both manual and non-manual information. For instance, one could evaluate a model’s accuracy with and without facial keypoints provided, to quantify how much facial data helps – Caridakis et al. did something similar (88% → 92% with facial cues added). Such experiments help establish the value of multimodal integration in the recognition process.
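A with/without-face ablation of this kind can be scripted in a few lines; the sketch below uses a deliberately toy model and hand-made samples purely to show the comparison structure (all names and values are illustrative):

```python
def evaluate(model, samples) -> float:
    """Accuracy of `model` (any callable: features -> gloss) on labeled samples."""
    correct = sum(model(feats) == gloss for feats, gloss in samples)
    return correct / len(samples)

def mask_face(features: dict) -> dict:
    """Zero out the facial channel so only manual features remain."""
    masked = dict(features)
    masked["face"] = None
    return masked

def facial_ablation(model, samples) -> tuple[float, float]:
    """Compare accuracy on manual-only input vs. full multimodal input,
    in the spirit of the 88% -> 92% comparison cited above."""
    full = evaluate(model, samples)
    manual_only = evaluate(model, [(mask_face(f), g) for f, g in samples])
    return manual_only, full

# Toy model: uses the facial channel to disambiguate sentence type when present.
def toy_model(feats):
    if feats.get("face") == "brow-raise":
        return "QUESTION"
    return "STATEMENT"

samples = [({"hands": "...", "face": "brow-raise"}, "QUESTION"),
           ({"hands": "...", "face": "neutral"}, "STATEMENT")]
print(facial_ablation(toy_model, samples))   # (0.5, 1.0): facial cues help
```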
In summary, evaluating SLR models involves a mixture of accuracy metrics (per sign or per sequence), linguistic evaluation (correctness of grammar/meaning), and robustness checks. Unlike avatars, which ultimately face the judgment of human viewers directly, SLR models are often evaluated indirectly via these metrics since the “consumer” of their output might be another system or a hearing person relying on the translation. However, an interesting parallel evaluation could be done: if an SLR model outputs text, one can measure how well a Deaf person’s message is conveyed by seeing if a hearing person (reading the text) understands what the Deaf signer meant, thereby closing the communication loop as a real-world test. Few studies have done full human-in-the-loop evaluations, but it’s a valuable direction to truly assess comprehension quality, not just recognition accuracy.
Comparisons to Human Acquisition and Proficiency Scales
It is instructive to compare these technological stages to how humans learn sign language and to frameworks used for spoken/written language proficiency. Deaf children acquiring ASL progress through milestones that closely mirror spoken language development. In the first year of life, babies babble with their hands and produce simple signs or points; by 18-24 months, toddlers can use a vocabulary of signs (often approximated and without complex inflections) and begin combining them. By age 3-4, children handle basic syntax (sign order, use of pronouns and question facial expressions) and start using morphological aspects like classifiers and verb inflections. By 5-6, most Deaf children (exposed early) command fairly complex grammar including role-shifting in storytelling, use of space for reference, and fingerspelling for proper nouns. This natural timeline from simple to complex sign usage is analogous to how our sign tech evolves: initial models tackle the equivalent of “single words,” then “simple sentences,” and so forth. Just as a child might produce telegraphic sign sentences at first (omitting some inflections) and later become fluent, early avatars produce somewhat telegraphic signing and are now inching toward more fluent output.
One can also draw analogies with spoken language technologies and proficiency scales. The Common European Framework of Reference for Languages (CEFR) is widely used to rate human language proficiency (A1 for beginner, up to C2 for mastery). While CEFR was designed for spoken/written languages, the concept has been applied to sign languages in some educational contexts. For instance, courses in Europe for learning national sign languages sometimes align their curriculum with CEFR levels (A1, A2, etc. for basic user, B1, B2 for independent user, etc.). There’s mention in literature of sign language data or assessments categorized by CEFR levels – one paper collected signing data from A1 up to C1 level content. In evaluating a Swiss German Sign Language avatar, researchers focused on simple (A1/A2) sentences presumably because those were within the avatar’s capabilities. By analogy, we might say a “CEFR A1” signing avatar can handle very basic phrases (introductions, simple statements) but not complex discourse, whereas a “C1” level avatar (if it existed) would be one that can convey nuanced information, argumentation, and abstract topics in sign language with native-like fluency. Although these are informal extrapolations, they provide a useful lens: we expect avatars to eventually progress in fluency much like a human second-language learner of ASL would. Similarly, an SLR system could be rated by the complexity of language it can reliably understand: one might currently equate existing models to somewhere in the A-levels (understands basic everyday signs in constrained contexts), while no system is yet at “C2” (understands anything said in any context, with implicit meanings, humor, etc.).
Another comparison is to speech recognition history. Early speech recognizers in the 1950s-60s started with digits and single-word commands (comparable to isolated signs), then progressed to connected word recognition and eventually continuous speech dictation by the 1990s. Today’s speech recognition handles natural discourse fairly well in many cases. Sign recognition is on a similar trajectory but lags behind due to the modality’s complexity and less data. In speech, phoneme recognition was a sub-problem, just as handshape recognition is in sign. Speech tech also had to incorporate language models to resolve ambiguity, analogous to how sign tech now uses language models to improve continuous sign recognition (e.g., to prefer sign sequences that form a valid sentence over random sequences). We also see parallels in evaluation: word error rate for speech vs sign error rate for sign; intelligibility tests for text-to-speech vs comprehensibility tests for sign avatars. And just as spoken language systems are now being evaluated on higher-order understanding (like voice assistants being able to handle context in a dialogue), sign models will need to be evaluated on conversational abilities in the future.
It’s worth mentioning the ASL Proficiency Interview (ASLPI) and similar assessments used to rate human signers’ skills. ASLPI yields a score (0-5) for a person’s signing ability. Criteria involve vocabulary range, grammatical accuracy, fluency, and sociolinguistic/cultural appropriateness. One could imagine repurposing such criteria to rate a signing avatar. For example, an avatar that only signs prepared weather reports with some errors might be “Intermediate” whereas an avatar that can discuss a range of topics smoothly with near-native grammar might be “Advanced.” However, currently no one applies ASLPI to avatars – the gap is too large. But the conceptual overlap is that both humans and avatars can be assessed on similar axes: lexicon, grammar, fluency, comprehension elicitation. The same axes apply to SLR models in terms of what inputs they can handle.
In NLP (natural language processing) and AI more broadly, we often define capability tiers. For instance, language understanding can be evaluated at the lexical level (word sense disambiguation), syntactic level (parsing accuracy), semantic level (can it answer questions about a text), and pragmatic level (does it understand implied meaning or speaker intentions). For sign language AI, we see the need for analogous tiered evaluation. Are we evaluating just the identification of signs (lexical)? Or the interpretation of meaning (semantic)? A human-like sign language understanding would include all these layers. So far, much evaluation sticks to lexical and some syntax (e.g., was the gloss sequence right?). A few translation tasks incorporate semantics (did the system convey the message correctly in English, for example). Pragmatics (like distinguishing formal vs informal signing, or detecting irony in signing) is essentially untouched in evaluation – an open research frontier.
To summarize, human signers provide both a model and a benchmark for sign language technology. Children learning sign show us which aspects are acquired earliest (pointing, basic signs) and which come later (inflection, storytelling), hinting where technology might naturally find easier vs harder problems. Formal scales like CEFR or ACTFL for languages give a framework that could inspire how we classify system competence. And parallels with spoken language tech evolution remind us that progress often comes stepwise: handle small vocabularies, then short sentences, then open conversation. While we haven’t yet labeled sign models as “Level 1” or “Level 5” explicitly in the field, the discussion is leaning toward such classification, especially as more commercial products emerge and need to be described in terms familiar to users (e.g., “this device can recognize basic everyday signs, but not technical jargon”). In academic proposals, one might encounter phrases like “toy examples” vs “in the wild signing” – essentially distinguishing an early-stage system from a mature one. Bridging this gap is part of current research agendas.
Datasets, Benchmarks, and Test Suites by Development Stage
Over the years, a variety of datasets and benchmarks have been created to evaluate sign language technologies at different levels. Each corresponds loosely to a stage of complexity:
Phonology and Fingerspelling Datasets: To evaluate handshape recognition and fingerspelling, researchers have compiled datasets like the ASL Fingerspelling Dataset (videos of people spelling words letter by letter) and one-handed alphabet datasets. These allow testing models on all 26 handshapes in ASL under various conditions. Metrics are often letter accuracy or word accuracy (for a sequence of letters). Another type of resource is sign language phonetic/phonemic datasets – e.g., collections of minimal pairs or isolated parameters. The ASL Lexicon Video Dataset (ASLLVD) includes video of thousands of isolated signs along with annotations of handshape, location, etc., enabling evaluation of component recognition algorithms. There are also smaller research datasets focusing on specific non-manuals (for instance, a set of videos with and without headshakes for testing a head movement detector). While not “benchmarks” in the competitive sense, these resources let researchers quantify performance on single aspects (like “the system detects head nods with X% precision”).
Isolated Sign Benchmarks: For lexical-level evaluation, we have many labeled corpora of isolated signs. WLASL (Word-Level ASL) is a prominent benchmark introduced in 2020 with over 21,000 video instances of 2,000 distinct ASL signs, collected from online video sources. There’s also MS-ASL (another large-scale ASL dataset), and others like CSL Isolated (Chinese Sign Language isolated words), LSA64 (Argentinian Sign Language, 64 words), and so on. These benchmarks are used in competitions and papers to report classification accuracy. They are analogous to image classification datasets in vision (each video belongs to one class = one sign). Performance on them is well-tracked and has improved with deep learning; e.g., recent models might reach 90%+ on moderate-size lexicons, though still lower for very large lexicons or when signers vary. The evaluation is straightforward: the dataset is split into training and test sets, and accuracy on the test set is the score. For sign language avatars, there isn’t a direct analog (since avatars produce signs rather than classify them), but an avatar’s lexical coverage can be tested by seeing if it can produce a list of known signs understandably. In fact, some avatar evaluations involve a receptive test: show users a set of isolated signs generated by the avatar and ask them what sign it was – essentially measuring if each avatar sign is recognizable. This is effectively an isolated sign comprehension test for the avatar output.
Continuous Signing Corpora: To evaluate sentence-level recognition or avatar output, researchers rely on corpora of full sentences with annotations. A flagship example is RWTH-PHOENIX-Weather 2014 (and its 2018 extension) – a corpus of German Sign Language interpreted weather forecast videos with gloss and German translations. It has become the standard benchmark for continuous SLR and sign language translation (SLT) tasks. Models are ranked by their word error rate or BLEU score on this corpus. For ASL, there is not yet an equivalently large open continuous dataset with gloss annotations, but smaller ones exist (e.g., ASL sentence datasets from ASLLVD, or the SMILE dataset for Italian, or corpora from the Linguistic Data Consortium in limited domains). Another continuous benchmark is the CSL (Chinese Sign Language) corpus used in the recent sign translation competitions in China. In 2023, a Chinese national standard “T/CADHOH 0004-2023” defined test specifications for intelligent sign language translation, and an evaluation event provided a corpus of 1,074 sentences for teams to train and test their avatars or recognition models. This shows an effort to standardize evaluation with specific datasets. For avatars, continuous signing evaluation often uses comprehension tests: e.g., give users several sentences signed by the avatar and by a human, and ask questions to see if they understood the content. The ViSiCAST project in the 2000s did this with a small set of test sentences (like weather announcements) to compare comprehension between avatar and human videos. Nowadays, if a company releases a sign translation system, they might demonstrate it on a standard set of sentences (like news clips) and possibly measure comprehension in a user study, but there isn’t a single agreed test suite. We might envision something like a “Sign Language Turing Test” where an avatar’s signing in various scenarios is judged against human signing for indistinguishability in meaning – however, that remains a conceptual goal more than a reality.
Linguistic Phenomena Test Suites: As sign language technology matures, there’s interest in creating targeted test suites for specific linguistic features. For spoken NLP, one might create challenge sets for particular constructions (say, anaphora resolution or idioms). For sign, one could similarly curate a set of sentences focusing on, say, role shift in narratives, classifier usage, numeral incorporation, non-manual grammatical negation, etc. A system (avatar or recognizer) could then be evaluated on each subset. For instance, does the avatar correctly sign all the plural forms in a set of plural sentences? Or does the recognizer correctly identify questions as questions in a set of various question forms? Currently, such fine-grained test suites exist mostly in research prototypes. One example: some sign linguists and computer scientists collaborated on evaluating an avatar’s ability to perform classifier predicates (signs that represent objects through handshape and movement). They had Deaf participants rate how understandable and correct the avatar’s classifier depictions were in various clips. Another example: testing a recognition system on sentences with different sentence types (declarative vs. questions vs. topicalized sentences) to see if it picks up the difference – this requires the dataset to have those labels, which some annotated corpora do. As datasets grow (like the ongoing efforts to annotate large sign corpora for linguistic research), we expect more of these targeted evaluations to emerge.
Interactive Dialogue Datasets: Looking to the discourse level, a few datasets of multi-turn signed conversations are being recorded (for example, the SMILE corpus includes dialogues in Italian Sign Language). There’s also the BSL Corpus in the UK which has conversations and narratives. These could become testbeds for dialogue-based evaluations – e.g., can an avatar take part in a scripted dialogue or can a recognizer transcribe a dialogue accurately and identify who is signing when. Another relevant kind of dataset is those used for translation in context. Recently, at the WMT 2022 translation competition, one task was sign language to text translation for Swiss German Sign Language, which implicitly tests understanding of multiple sentences since the content was captioned for news segments. Still, by and large, evaluation beyond single sentences is very limited.
Quality of Service Benchmarks: In industry, beyond accuracy, factors such as uptime, response latency, and user feedback are tracked. For example, a company deploying a sign recognition kiosk might track how often it fails to recognize a user’s signing (e.g., the percentage of interactions that require fallback to a human) – not a standard academic metric, but a practical KPI. Similarly, if an avatar is used on a public website, the maintainers might gather feedback from Deaf users: e.g., a rating of translation quality over time, or reports of errors. These feed into an iterative improvement process analogous to beta testing software. While not formalized in literature, it’s worth noting that real-world deployment brings these “soft” evaluations.
Finally, it’s important to acknowledge gaps and open debates in evaluation:
There is no single authoritative benchmark for sign language generation (avatars) akin to, say, ImageNet in computer vision. Instead, evaluations are often custom to a project. This fragmentation makes it hard to compare models. One group has an avatar that scores 80% on their comprehension test, another claims 90% on a different test – but if the tests aren’t the same, the numbers mean little. An open debate is how to standardize these tests without ignoring the fact that sign languages are diverse and that Deaf communities should be involved in setting the criteria of “good signing.”
For SLR, most benchmarks focus on surface accuracy (gloss or word correctness), but not on deeper understanding. Some researchers argue we need to move toward evaluating whether the meaning was conveyed, not just the form. This parallels debates in machine translation (adequacy vs fluency metrics) and speech recognition (a perfectly transcribed voicemail is only useful if you can act on it correctly). In sign language translation, a BLEU score doesn’t tell you if a Deaf person’s nuance was captured. So there is discussion about creating evaluation metrics that include semantic equivalence or even using human assessment where raters judge if a translation “faithfully and fully” conveys the signers’ message.
The role of Deaf community acceptance is a critical evaluative aspect for avatars. There is an ongoing conversation (sometimes heated) between technologists and Deaf activists about signing avatars. Some Deaf advocates have opposed use of avatars in certain domains (like artistic performances or important public communications), arguing that they lack the cultural and expressive qualities of human interpreters. Technologists often counter that avatars can improve accessibility if done well. This debate implicitly sets a bar: an avatar will truly be successful when Deaf users themselves consider it an acceptable equivalent in some contexts. Right now, evaluation studies show mixed acceptance – Deaf participants often say the avatar is useful for certain things (like getting the gist of written content they otherwise couldn’t access) but would not prefer it over a human for rich communication. Thus, evaluation has to include questions like “Would you use this avatar service? In what situations?” Progress in that sense might be measured by increasing acceptance over time as quality improves and as Deaf users have more input in avatar design.
Tracking Progress and Milestones in Research and Industry
Researchers and companies often describe the progress of sign language models in terms of incremental achievements or versions. While not always formalized as “alpha, beta, release” in academic papers, one can identify milestones analogous to those in software development and language proficiency:
Prototype/Alpha Stage: At this stage, a system is typically limited in scope – for example, an early sign recognition prototype might only work for one signer in a lab on 20 signs, or an avatar might only be able to fingerspell and produce a dozen scripted sentences. These are often proof-of-concept demonstrations. Researchers might refer to them in papers as working on a “toy lexicon” or “restricted grammar” to validate an approach. An alpha-stage industry product (not widely released) might be described in the press as “in development, currently supporting basic greetings and emergency phrases” (for instance). The emphasis is on core functionality, not breadth. Many academic systems remain in this stage due to limited data – they showcase a technique on a small set. The “alpha” milestone is crossed once the system can handle a minimum viable set of functions end-to-end (e.g., translate a simple sentence and display it in sign, or recognize an isolated sign from live camera input).
Beta Stage: Here the system is more robust and covers more vocabulary or grammar, and is often tested by a wider user group. Companies like SignAll (which works on sign recognition for kiosks) have done beta deployments in public spaces to get feedback. Beta sign language avatars might be integrated into trial websites or apps to see how users interact. During this stage, developers track metrics like failure rate in real use, user satisfaction, and edge cases discovered. For example, a beta avatar might handle weather forecasts well but users start asking it to sign daily news; the team then learns what new vocabulary or grammar constructs are needed. Some organizations do label their versions: e.g., “ASL Translator 1.0” might have manual signs only, then “2.0” adds facial expressions, etc., though such numbering is not standardized across the field.
Increasing Language Complexity Tiers: Many projects implicitly define tiers of language complexity they aim to conquer. One might start with fixed phrases or announcements (where the sentence structures are predictable). Next tier could be simple domain-specific sentences (like questions and answers about banking services, using a controlled vocabulary). A further tier is open domain sentences on everyday topics. And the highest would be conversational dialogue. We see evidence of this in how companies roll out features. For instance, the Chinese evaluation described earlier provided data in specific domains (transportation, medical, legal) with increasing complexity. Teams likely treated each domain or sentence type as a step. Another example: The European project “EASIER” (2021-2024) is working on sign language translation and plans milestones from translating individual sentences to handling interactive questions. Although not called “Tier 1, Tier 2,” the concept is there. In academia, one might propose linguistic fidelity levels for avatars: Level 1 – only manual signs, no facial grammar; Level 2 – adds basic facial expressions and timing; Level 3 – full range of non-manuals and complex inflections. These could serve as checklists for development. Indeed, Wolfe et al. 2022 emphasize that adding each layer (facial, spatial referencing, etc.) brings an avatar closer to human-like signing.
Linguistic Fidelity Milestones: Some concrete milestones noted in the literature include:
Adding facial grammar: e.g., the moment an avatar gained the ability to raise its eyebrows for questions was celebrated as a milestone in the SiMAX system. It meant the avatar could distinguish questions from statements – a significant linguistic step.
Achieving a certain corpus BLEU score or error rate: in research, surpassing a baseline (such as a BLEU score above 0.5 in sign translation, or a sign error rate (SER) below 20% on a standard test) is considered progress, and each year’s competitions note these improvements; a minimal sketch of how such an error rate is computed appears after this list.
Multi-lingual support: Some models initially work for one sign language; extending to another (like making an avatar that can sign in both ASL and British Sign Language by switching its database) is a milestone that demonstrates generality. For example, one competition winner used a pre-trained multilingual model and fine-tuned it to improve their avatar’s performance – a novel approach that was credited for their win.
Real-world pilot: When a system moves from lab to real environment (e.g., an avatar interpreting announcements in a train station, or a signing robot deployed in a museum), it’s often reported in case studies. These pilots often reveal the difference between a system that works in controlled settings vs one that robustly handles interruptions, variability, and user engagement.
Passing user acceptance thresholds: A subtle but important milestone might be when a majority of a test group of Deaf users rate the avatar’s signing as satisfactory or when an SLR system can be reliably used by Deaf signers without frustration. Such qualitative milestones are sometimes mentioned in studies (for instance, “X% of participants said they would use this system regularly”).
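As a concrete reference for the error-rate milestone above, here is a minimal sketch of how a sign error rate is commonly computed: the Levenshtein edit distance between the recognized gloss sequence and a reference gloss sequence, normalized by reference length (the same formula as word error rate in speech recognition). The gloss sequences below are invented for illustration.

```python
# Sign error rate (SER) as Levenshtein distance over gloss sequences,
# normalized by reference length (analogous to WER in speech recognition).
from typing import Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of substitutions, insertions, and deletions."""
    dp = list(range(len(hyp) + 1))            # distances against an empty reference
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                              # deletion
                dp[j - 1] + 1,                          # insertion
                prev + (ref[i - 1] != hyp[j - 1]),      # substitution or match
            )
            prev = cur
    return dp[-1]

def sign_error_rate(ref_gloss: Sequence[str], hyp_gloss: Sequence[str]) -> float:
    return edit_distance(ref_gloss, hyp_gloss) / max(len(ref_gloss), 1)

# Invented gloss sequences for illustration only.
reference  = ["TOMORROW", "IX-1", "GO", "STORE"]
hypothesis = ["TOMORROW", "GO", "STORE", "WANT"]   # IX-1 missed, WANT spurious

print(f"SER = {sign_error_rate(reference, hypothesis):.0%}")  # 50% on this toy pair
```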
Researchers also hold workshops and challenges specifically to track progress. The Workshop on Sign Language Translation and Avatar Technology (SLTAT) and the CVPR/ICCV challenges on SLR create forums for comparing models on shared tasks. Over the past 25 years a clear trend has emerged: isolated sign recognition is considered largely “solved” for small vocabularies, so the focus shifted to continuous signing; now that continuous recognition is gradually improving, translation and higher-level understanding have become the new frontiers. Each frontier becomes a milestone once enough groups demonstrate viable solutions.
Another way to think of skill tiers is to borrow from CEFR or human curriculum:
Basic (A1): System can handle the alphabet and a set of common signs or phrases (e.g., greetings, simple statements). Avatar at this level may sign in a very robotic way but understandably; SLR at this level may recognize single words in ideal conditions.
Beginner (A2): Can handle simple sentences in a familiar domain. Avatar can sign short sentences with some correct facial cues; SLR can recognize short phrases or questions, maybe with constraints (like a weather report domain).
Intermediate (B1/B2): Can handle everyday language. Avatar can sign longer sentences with moderate fluency and handle more vocabulary, including some idioms or classifiers in context; SLR can transcribe or translate a conversation about everyday topics with some errors but overall coherence.
Advanced (C1): Can handle complex, technical, or abstract language with good fluency. Avatar signs nearly as expressively as a native signer on a wide range of topics (though might still lack some subtlety or have occasional unnatural movements); SLR can interpret lectures or stories, preserving most content and nuance.
Mastery (C2): Near-native performance. The avatar could potentially pass as a human in certain contexts (aside from visual appearance), handling humor, poetry, rapid dialogue; the SLR system understands and translates anything a native signer can say, including implied meanings and cultural references.
Currently, no model is at C2. But framing progress in these terms can help set targets. Indeed, some research roadmaps explicitly cite achieving “native signer level” as the ultimate goal. Open questions remain on how to quantify when that goal is reached – likely through extensive testing with native signers in real situations.
Where formal frameworks are lacking, as is often the case in this interdisciplinary domain, we can synthesize a proposed classification model to guide evaluation. Based on the state of the art and the discussions above, here is a potential multi-level classification for both sign language avatars and SLR models (a sketch of how such a rubric might be encoded for project tracking follows the list):

Level 1: Manual Alphabet and Numbers – The system handles fingerspelling A–Z and basic numeric signs. Avatar: Can perform the manual alphabet clearly and produce number signs. No facial expressions or context – effectively a signing dictionary that spells or shows isolated signs. SLR: Can recognize handshapes for letters and digits from still images or slow-moving video. Vocabulary on the order of <40 symbols. (This level corresponds to the phoneme/fingerspelling stage discussed.)
Level 2: Isolated Lexicon (Basic Words) – The system handles a fixed vocabulary of isolated signs (tens to hundreds of everyday signs). Avatar: Can perform individual signs (words) understandably, possibly with some default mouthing, but signs are not yet combined into sentences. Good for generating one-word responses or labeling objects. SLR: Can classify isolated signs from a single signer or a small set of signers with high accuracy. Doesn’t handle continuous input. This is comparable to a very rudimentary signer who knows some words but not how to string them together.
Level 3: Phrase/Simple Sentence – The system can manage short sentences or phrases, especially in a constrained context (e.g., common phrases like “thank you,” “nice to meet you” or domain-specific prompts like “next, please”). Avatar: Strings a few signs together with basic transitions. May have limited facial expressions (perhaps can do a yes/no question facial expression if explicitly scripted). Still might appear mechanical but conveys short messages. SLR: Can recognize sequences of 2–3 signs or a simple sentence in a given domain (like weather or greetings). Might rely on pause detection between signs or known sentence templates. Struggles with longer or novel sentences. This level might be akin to a beginner user of sign language who can get simple points across.
Level 4: Complex Sentences with Grammar – The system handles full sentences with proper grammatical structure and a broader vocabulary (hundreds to thousands of signs). Avatar: Produces multi-clause sentences with moderate fluency. Integrates non-manual markers for basic grammatical functions (questions, negation) and uses spatial referencing for pronouns or indexing. Transitions between signs are smoother (less jerky). Can incorporate classifier handshapes in predictable ways (perhaps showing a vehicle moving by using a classifier sign). The avatar’s signing is understandable on familiar topics and doesn’t violate major grammatical rules of the sign language, though it may still lack the expressiveness of a human. SLR: Recognizes continuous signing in sentences, possibly outputting gloss or translating to spoken language. It can handle normal signing speed for familiar topics, with errors mostly on infrequent signs or heavy inflections. It starts to capture context – for example, recognizing when a sign is modified by motion to indicate repeated aspects. Performance might be measured by a sign error rate in the 20-40% range on everyday sentences, and it may use language models to ensure outputs make sense. This corresponds to an intermediate proficiency where the system “knows the language” enough to communicate but will falter with very idiomatic or unexpected inputs.
Level 5: Discourse and Interactive Dialogue – The system manages multi-sentence discourse and basic interactive communication. Avatar: Maintains context across sentences (e.g., remembers that “she” refers to a person whose sign name was introduced earlier in the dialogue space), uses turn-taking signals (gaze or pause to yield the floor, subtle hand motions to hold the floor), and adapts to the conversational partner’s cues. It can express a range of pragmatic functions – making polite requests versus giving commands with appropriate demeanor, showing emotions like surprise or skepticism in its signing when telling a story, etc. An avatar at this level could feasibly host a simple live conversation or narrate a story with different characters by using role shift. SLR: Can transcribe or translate conversations in real-time, correctly attributing utterances to different speakers (signer diarization), handling back-and-forth exchanges. It understands when one signer is asking a question and another is answering. Perhaps it even captures some non-explicit content (like noting sarcasm if extremely advanced). In terms of proficiency, this is near-native territory – something not yet achieved, but the evaluation here would involve integrative tasks (e.g., use the SLR system in a live chat scenario and see if two people can communicate through it without misunderstandings).
Level 6: Native-Like Performance – This hypothetical top level means the system is essentially as effective as a fluent human in the domain of use. Avatar: Indistinguishable from a human interpreter in conveying not just the literal content but the subtleties of meaning and culture. It could interpret poetry or jokes, use creative signing, and adjust style (signing more formally for a news broadcast, more colloquially for a chat). It would pass rigorous comprehensibility tests where Deaf users get the same understanding from the avatar as they would from a skilled human signer. SLR: Would likewise capture signed content in full detail – including those subtleties – enabling translations that preserve tone and intent. It could serve as an automatic interpreter in meetings, capturing everything from technical content to humor accurately. Evaluation at this level would likely require deep qualitative assessment, as pure numerical metrics might all be near perfect. This is the aspirational goal that drives much research, but as of 2025 no system is at this stage.
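As noted before the list, here is a minimal sketch of how a team might encode this proposed classification as a checklist data structure for internal progress tracking. The field names and criteria paraphrase the level descriptions above; they are illustrative choices, not a standard.

```python
# Illustrative encoding of the proposed six-level classification as a checklist.
# Field names and criteria paraphrase the rubric above; they are not a standard.
from dataclasses import dataclass, field

@dataclass
class CapabilityLevel:
    level: int
    name: str
    avatar_criteria: list[str] = field(default_factory=list)
    slr_criteria: list[str] = field(default_factory=list)

RUBRIC = [
    CapabilityLevel(1, "Manual alphabet and numbers",
        ["clear fingerspelling A-Z", "number signs"],
        ["classify letter/digit handshapes", "vocabulary under ~40 symbols"]),
    CapabilityLevel(2, "Isolated lexicon (basic words)",
        ["individual signs understandable", "optional default mouthing"],
        ["isolated-sign classification over tens to hundreds of signs"]),
    CapabilityLevel(3, "Phrase / simple sentence",
        ["short sign sequences with basic transitions", "scripted yes/no question face"],
        ["short domain-specific phrases", "pause- or template-based segmentation"]),
    CapabilityLevel(4, "Complex sentences with grammar",
        ["non-manual markers for questions and negation", "spatial referencing",
         "smoother coarticulation"],
        ["continuous recognition", "sign error rate roughly 20-40% on everyday sentences"]),
    CapabilityLevel(5, "Discourse and interactive dialogue",
        ["referent tracking across sentences", "turn-taking signals", "role shift"],
        ["real-time conversation", "signer diarization"]),
    CapabilityLevel(6, "Native-like performance",
        ["register shifting", "humor, poetry, creative signing"],
        ["preserves tone and intent across open domains"]),
]

def describe(level: int) -> None:
    """Print the checklist for one level, e.g. for a project status report."""
    entry = next(e for e in RUBRIC if e.level == level)
    print(f"Level {entry.level}: {entry.name}")
    print("  Avatar:", "; ".join(entry.avatar_criteria))
    print("  SLR:   ", "; ".join(entry.slr_criteria))

describe(4)
```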
This proposed classification is a synthesis of current capabilities and linguistic benchmarks, meant to provide a structured way to discuss progress. It also exposes the open debates:
How do we measure when an avatar or SLR model moves from one level to the next? (We likely need standard tests, as argued above.)
Are there intermediate sub-levels we should explicitly mark (e.g., between 3 and 4 might be a system that does sentences well but not true dialogue – perhaps call it 3.5)?
Do all sign languages present the same difficulty at each level? (Some languages have more non-manual grammar, which might mean an avatar must meet a higher bar for expressiveness in those languages to be considered “Level 3”.)
And importantly, do we evaluate these models by comparing them to human benchmarks? For instance, one could use Deaf children’s acquisition as a rough yardstick: if an avatar is at “Level 3,” is it roughly as understandable as a 4-year-old Deaf child’s signing? Such comparisons might make sense qualitatively, but in practice systems are measured for comprehension against the gold standard of adult fluent signers.
In conclusion, the field of sign language technology is moving toward increasingly comprehensive evaluations that reflect true linguistic skill. We have identified parallel tracks for avatars (production quality) and SLR models (recognition/understanding quality), with a clear need to integrate multimodal aspects (manual and non-manual) at every stage for success. There is an emerging consensus that purely focusing on isolated sign accuracy or raw translation scores is insufficient – both research and industry are pivoting to judge models by how well they handle the richness of real signed language. By defining stages of development and skill tiers, we can better chart our progress and identify what the next milestones should be (for example, improving an avatar from intelligible but unemotional signing to expressive, engaging storytelling, or pushing an SLR model from literal gloss recognition to genuine understanding of context and intent). With standard frameworks and community-driven benchmarks, the hope is that evaluation will keep pace with development, ensuring that as models improve, we can trust they are moving closer to the goal of effective, natural sign language communication.
Sources:
Adamo-Villani, N., & Anasingaraju, S. (2016). Toward the Ideal Signing Avatar. Purdue University. https://pdfs.semanticscholar.org/5b21/cdde8732213c3e15dd26fad88897a37f266e.pdf
Bennbaia, S. (2022). Toward an evaluation model for signing avatars. Nafath (Mada). https://nafath.mada.org.qa/nafath-article/mcn-20-05/
Dimou, A.-L., Papavassiliou, V., Goulas, T., et al. (2022). What about synthetic signing? A methodology for signer involvement in the development of avatar technology with generative capacity. Frontiers in Communication. https://www.frontiersin.org/journals/communication/articles/10.3389/fcomm.2022.798644/full
Yu, Z., Huang, S., Cheng, Y., & Birdal, T. (2024). SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark. ECCV 2024 (Imperial College London and Tencent AI Lab). https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/00653.pdf
Aziz, M., & Othman, A. (2023). Evolution and Trends in Sign Language Avatar Systems: Unveiling a 40-Year Journey via Systematic Review. Multimodal Technologies and Interaction. https://www.mdpi.com/2414-4088/7/10/97
Kipp, M., et al. (2011). Sign Language Avatars: Animation and Comprehensibility. https://www.michaelkipp.de/publication/Kippetal11.pdf
Facebook AI (2020). How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language. https://research.facebook.com/publications/how2sign-a-large-scale-multimodal-dataset-for-continuous-american-sign-language/
Li, D., et al. (2020). Word-level Deep Sign Language Recognition: A New Large-scale Dataset and Methods Comparison. WACV 2020. https://arxiv.org/abs/1910.11006
Quandt, L. C., et al. (2022). Attitudes toward Signing Avatars Vary Depending on Hearing Status, Age of Signed Language Acquisition, and Avatar Type. Frontiers in Psychology. https://www.frontiersin.org/journals/psychology/articles/10.3389/fpsyg.2022.730917/full
Johnson, R. (2022). Improved Facial Realism through an Enhanced Representation of Anatomical Behavior in Sign Language Avatars. DePaul University. https://aclanthology.org/2022.sltat-1.8.pdf
Zhang, Annan, et al. (2024). Translation Quality Evaluation of Sign Language Avatar. Proceedings of CCL 2024 Evaluation Workshop. https://aclanthology.org/2024.ccl-3.45/
Common European Framework of Reference for Languages: learning, teaching, assessment (CEFR) – Adapted for Sign Languages (2018). European Centre for Modern Languages.
American Sign Language Proficiency Interview (ASLPI) – Proficiency Levels. Gallaudet University (n.d.).
Florida Atlantic University (2025). Engineers bring sign language to ‘life’ using AI to translate in real-time. TechXplore. https://techxplore.com/news/2025-04-language-life-ai-real.html