A mouth that misses a consonant by a few millimeters can undo an entire humanoid encounter.

Columbia University researchers built a facial robot, EMO, that learned to move its silicone lips in sync with speech by first using its own face as a test subject. The work addresses a real challenge in social robotics: humans scrutinize faces intently, and the mouth region carries so much timing and shape information that small errors read as “almost human,” the perceptual trap commonly known as the uncanny valley. EMO's platform is mechanically expressive, with 26 facial actuators capable of a wide range of motions, but the system still needs a reliable way to map sound to coordinated deformation.
The training pipeline avoided hand-written rules. EMO performed thousands of random facial motions while watching itself in a mirror, learning an internal “self model” that linked motor commands to the observed results. That self-knowledge was then paired with hours of human speech and singing clips, so the robot could match audio patterns to mouth shapes without any semantic understanding of what was being said. In a human preference study with 1,300 volunteers, the learned vision-to-action approach held up better than two baselines, one driven by audio loudness and the other imitating the nearest “landmark” example, receiving 62.46% of selections as the closest match to ideal lip movement.
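To make the two-stage idea concrete, here is a minimal sketch in plain numpy: a “self model” fit from random motor babbling observed in a mirror, an audio-to-mouth-shape mapping fit from paired clips, and a run-time step that inverts the self model to pick motor commands. The shapes, features, and least-squares stand-ins for learned networks are illustrative assumptions, not EMO's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MOTORS = 26        # actuator count reported for the platform
N_LANDMARKS = 10     # hypothetical number of tracked mouth landmarks (x, y each)
N_AUDIO_FEATS = 13   # assumed audio feature dimension (e.g., MFCC-like)

# --- Stage 1: "self model" from motor babbling in front of a mirror ----------
# The robot issues random motor commands and records the mouth landmarks it
# observes on its own face. Here the "true" face is a fixed random linear map.
true_face = rng.normal(size=(N_MOTORS, 2 * N_LANDMARKS))
babble_cmds = rng.uniform(-1, 1, size=(5000, N_MOTORS))
observed = babble_cmds @ true_face + 0.01 * rng.normal(size=(5000, 2 * N_LANDMARKS))

# Fit the self model: motor command -> predicted landmark positions.
self_model, *_ = np.linalg.lstsq(babble_cmds, observed, rcond=None)

# --- Stage 2: audio features -> target mouth landmarks -----------------------
# Pair audio features from speech/singing clips with the mouth shapes a speaker
# produces (both are synthetic stand-ins here).
audio_feats = rng.normal(size=(5000, N_AUDIO_FEATS))
target_landmarks = audio_feats @ rng.normal(size=(N_AUDIO_FEATS, 2 * N_LANDMARKS))
audio_to_lip, *_ = np.linalg.lstsq(audio_feats, target_landmarks, rcond=None)

# --- Run time: invert the self model to choose motor commands ----------------
def motors_for_audio(frame_feats: np.ndarray) -> np.ndarray:
    """Pick motor commands whose predicted landmarks best match the audio target."""
    target = frame_feats @ audio_to_lip
    # Least-squares inverse of the (linear) self model; a learned model would
    # instead be inverted by gradient search or a trained inverse network.
    cmds, *_ = np.linalg.lstsq(self_model.T, target, rcond=None)
    return np.clip(cmds, -1, 1)

demo_cmds = motors_for_audio(rng.normal(size=N_AUDIO_FEATS))
print("motor command vector:", demo_cmds.round(2))
```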
Some of the failures were unglamorous but telling. Hod Lipson explained that the system had trouble with hard sounds like ‘B’ and sounds that require puckering the lips, like ‘W’. These phonemes demand short, sharply curved contacts and releases at the lips, where compliance, friction, and actuator bandwidth all show up as visible lag. It is a reminder that speech animation is not only a perception problem but also a materials-and-mechanisms problem: whether a bilabial closure looks crisp or mushy comes down to the thickness of the silicone, how it is mounted, and how forces are distributed under the skin.
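A rough back-of-the-envelope model shows why bandwidth matters for a ‘B’. Treating a lip actuator as a first-order (low-pass) system and commanding a brief closure, a slow actuator never fully brings the lips together before the release arrives. The bandwidths, the 60 ms closure window, and the first-order model itself are assumptions for illustration, not measurements from the EMO work.

```python
import numpy as np

def first_order_response(command, dt, bandwidth_hz):
    """Simulate a first-order (low-pass) actuator tracking a commanded position."""
    tau = 1.0 / (2.0 * np.pi * bandwidth_hz)   # time constant from bandwidth
    pos = np.zeros_like(command)
    for i in range(1, len(command)):
        pos[i] = pos[i - 1] + (dt / tau) * (command[i - 1] - pos[i - 1])
    return pos

dt = 0.001                                   # 1 ms simulation step
t = np.arange(0.0, 0.20, dt)                 # 200 ms window
# Commanded lip closure for a "B": fully closed for ~60 ms, then released.
command = ((t > 0.05) & (t < 0.11)).astype(float)

for bw in (3.0, 10.0, 30.0):                 # assumed actuator bandwidths in Hz
    pos = first_order_response(command, dt, bw)
    print(f"{bw:5.1f} Hz actuator: peak closure reached = {pos.max():.2f} "
          f"(1.0 means the lips fully touch)")
```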
This mouth-first approach is consistent with what a long line of perception research has suggested: the visual channel can materially alter what a listener believes they heard, a phenomenon best established as the McGurk effect. For robot designers, the implication is operational rather than academic. When a face's timing is off, people do not just notice it; they may misjudge the speech signal itself, especially in noisy environments where visual cues carry more of the load.
The physical side of “believable” faces is moving, too. Researchers at the University of Tokyo demonstrated robot faces covered with custom-grown living skin, anchored by collagen-based structures that mimic biological tethering so the tissue can move without tearing and even repair itself. Meanwhile, an Osaka University team described expression synthesis built from decaying waves, layering breathing, blinking, and other micro-gestures as superposed waveforms to avoid brittle transitions between prebuilt facial scenes.
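The waveform-superposition idea is easy to sketch: each micro-gesture is a simple, optionally decaying wave, and the actuator target is their sum, so expressions blend continuously instead of snapping between preset scenes. The frequencies, amplitudes, and the decaying blink pulse below are assumed values for illustration, not the Osaka group's published parameters.

```python
import numpy as np

def breathing(t, freq_hz=0.25, amp=0.2):
    """Slow sinusoidal motion layered under everything else."""
    return amp * np.sin(2 * np.pi * freq_hz * t)

def blink(t, onset_s, decay_s=0.08, amp=1.0):
    """Short decaying pulse that drives an eyelid actuator once."""
    dt = np.clip(t - onset_s, 0.0, None)
    return amp * np.exp(-dt / decay_s) * (t >= onset_s)

def expression_wave(t, freq_hz=0.8, decay_s=1.5, amp=0.5):
    """Decaying oscillation on top of a base expression, e.g. a fading smile."""
    return amp * np.exp(-t / decay_s) * np.sin(2 * np.pi * freq_hz * t)

t = np.arange(0.0, 5.0, 0.02)   # 5 s of actuator targets at 50 Hz
target = breathing(t) + blink(t, onset_s=1.0) + blink(t, onset_s=3.2) + expression_wave(t)

print("actuator target range:", round(target.min(), 2), "to", round(target.max(), 2))
```

Because every component is a continuous function of time, adding or removing a gesture never produces the abrupt jump that switching between prebuilt facial scenes would.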
Put side by side, these threads point in the same direction: the face is no longer a decorative shell for humanoids. It is becoming a control surface where learning-based mappings, soft-material adhesion, and real-time expression generation intersect, because it is the mouth, more than anything else, that decides whether the machine in front of us is speaking or merely performing.
