Your voice will unlock the magic of XR

illustration of an XR experience where a man uses VR goggles to experience himself as an avatar playing a game in a virtual realm

Should you ever be in any doubt over the truth of Arthur C. Clarke’s Third Law (the one that states that any sufficiently advanced technology is indistinguishable from magic), just try a HoloLens 2 experience where you can tell holograms what to do. 

Anyone who manages to walk away from that without feeling a little bit like a wizard or witch casting a successful spell is likely leading a fairly jaded and joyless existence. There is something fundamentally satisfying, almost primal, about watching the world change as you command it to. Like lucid dreaming.

Yet even the naysayers can probably agree that voice-based technologies driven by artificial intelligence represent a huge market opportunity. Millions of us now own some sort of room-based smart home device such as Google Home or Amazon Echo, and the global voice and speech recognition market is estimated to reach nearly US$32 billion by 2025. We are witnessing a convergence of various emerging technologies in that space – such as voice recognition, natural language processing, and machine learning – powered by 5G connectivity and the AR cloud. 


Immersive experiences like those afforded by the HoloLens offer a tantalizing glimpse of where this is all headed: a screenless world where digital interfaces become a part of natural human interactions, creating an entirely new form of hybrid – or extended – reality. In fact, Gartner predicts that this year, 30 percent of web browsing will be done without a screen. 

The next technology revolution will usher in the era of spatial computing, where multisensory experiences allow us to interact with both the real and digital worlds through natural, intuitive interfaces such as haptics, limb and eye tracking, and even elements such as taste and scent.

In this screenless world of spatial computing, interfaces will need to become more intuitive, efficient, and empathic. Let’s take a look at three ways in which voice technologies are already enabling this. 

Intuitive UX

A woman examining graphical, numerical and written data on a virtual screen that appears like a 2D hologram

Spatial audio and AI-driven voice technologies are crucial elements for creating compelling immersive experiences. As Kai Havukainen, Head of Product at Nokia Technologies explained in an interview for Scientific American, “Building a dynamic soundscape is essential for virtual experiences to really engender a sense of presence.” Humans, he added, are simply hardwired to pay attention to sound and instinctively use it to map their surroundings, find points of interest and assess potential danger.

There are, however, design considerations that must be taken into account when tackling the challenges of an entirely new medium together with fast-evolving technologies. 

Tim Stutts, Interaction Design Lead at Magic Leap, highlights the sheer scale of these UX challenges: “A level of complexity is added with voice commands, as the notion of directionality becomes abstract—the cursor for voice is effectively the underlying AI used to determine the intent of a statement, then relate it back to objects, apps and system functions.” 

“For voice experiences, you need to have a natural language interface that performs well enough to understand different accents, dialects, and languages,” adds Mark Asher, Director of Corporate Strategy at Adobe, who believes the advancement of voice technologies will serve to “bring the humanity back to computing.”

There are still many hurdles to overcome before we reach that utopian vision of Star Trek’s universal translator, however. As we move towards more pervasive and complex experiences where users have multiple applications open at the same time, they will need to avoid problems such as unintentionally commanding a hologram when they are actually talking to the person next to them. 

Yet looking at the exponential way AI technologies have developed over the past decade, it isn’t unreasonable to expect that the next few years will usher in real-time contextual applications that identify and act on commands based on accurate assessments of your surroundings (both real and virtual), your personal preferences, and even your biofeedback. 

Voice biofeedback

graphic depiction of a soundwave

Extended reality (XR) technologies already deploy a multitude of sensors that enable the collection of biofeedback, yet voice provides a rich vein of data that can be collected without the need for cumbersome wearables.

Apart from deliberately using commands to interact with the world around us, our voices provide the scope for AI to contextualize our XR experiences based on subconscious factors such as our mood and physical health. Cymatics – the name given to the process of visualizing soundwaves – gives us some insight into the depth and complexity of the unique patterns projected by our voice.
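The frequency structure that cymatics makes visible can also be pulled out computationally. As a minimal sketch (the 220 Hz synthetic tone, frame size, and function name here are illustrative assumptions, not any production voice pipeline), a short-time Fourier transform exposes the dominant pitch pattern in a signal frame by frame:

```python
import numpy as np

def dominant_frequencies(signal, sample_rate, frame_size=1024):
    """Split a mono signal into frames and return each frame's peak frequency."""
    peaks = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        # Window the frame to reduce spectral leakage, then take its magnitude spectrum.
        frame = signal[start:start + frame_size] * np.hanning(frame_size)
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_size, d=1.0 / sample_rate)
        peaks.append(freqs[np.argmax(spectrum)])
    return peaks

# Synthesize a 220 Hz "voice" with one harmonic as a stand-in for a recording.
sr = 16000
t = np.arange(sr) / sr
voice = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

# Every frame's peak sits near the 220 Hz fundamental.
print(dominant_frequencies(voice, sr)[:3])
```

Real voices are far messier than this two-component tone, but the same decomposition underlies the "unique patterns" described above.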

To produce speech, the brain communicates via the vagus nerve, sending a signal to the larynx that sets the vocal cords vibrating. Since vocalization is integrated within both our central (CNS) and autonomic (ANS) nervous systems, there is an established correlation between voice output and the impact of stress. 

Researchers have been developing methods for voice stress analysis (VSA) and computerized stress detection and body scanning devices for many years. Companies such as Insight Health Apps already leverage this rich data to feed corrective waveforms and patterns back into the body in the form of “quantum biofeedback”. 
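The acoustic correlates such systems look for can be measured with surprisingly little machinery. As an illustrative sketch only (the zero-crossing pitch tracker and synthetic signals are simplifying assumptions, not how commercial VSA products work), one classic feature is jitter – cycle-to-cycle variation in the pitch period, which tends to rise under stress:

```python
import numpy as np

def jitter_percent(signal, sample_rate):
    """Estimate jitter: mean cycle-to-cycle change in the pitch period,
    expressed as a percentage of the mean period."""
    # Rising zero crossings mark the start of each vocal-fold cycle.
    crossings = np.where((signal[:-1] < 0) & (signal[1:] >= 0))[0]
    periods = np.diff(crossings) / sample_rate
    if len(periods) < 2:
        return 0.0
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def synth_voice(periods_in_samples):
    """Concatenate one sine cycle per requested period length."""
    return np.concatenate(
        [np.sin(2 * np.pi * np.arange(p) / p) for p in periods_in_samples])

sr = 16000
rng = np.random.default_rng(0)
steady = synth_voice([133] * 50)                     # ~120 Hz, perfectly regular
shaky = synth_voice(133 + rng.integers(-6, 7, 50))   # same pitch, perturbed cycles
```

Here `jitter_percent(steady, sr)` comes out at zero, while the perturbed signal scores several percent – the kind of gap a stress-detection model would feed on.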

Bridging the Uncanny Valley

When I was first invited to test the social VR platform Sansar, I was shown around some of its virtual worlds by Linden Lab’s CEO Ebbe Altberg. To this day, my lasting impression of that demo was how our interaction felt very natural in spite of us being 5,000 miles and several time zones apart (I was in London and he in San Francisco) and the fact that his avatar looked nothing like his real-world persona.

Sansar’s avatar editor offers hundreds of options to customize your virtual proxy.

Not only did Ebbe’s digital self have the face of a woman, but that face was attached to a cartoony blimp-like dinosaur bodysuit. Still, when he spoke, the avatar’s lips, teeth and facial muscles synchronized to the sounds in a way that my brain registered as true. It demonstrated one of the peculiar things about designing virtual experiences: the malleability of “reality” and the fact that we are more willing to suspend our disbelief for some aspects of it than others. Hence an avatar’s appearance doesn’t matter nearly as much as these subconscious prompts that form the core of human interaction. 

It was an interesting way of avoiding the pervasive problem known as the “uncanny valley” which describes that awkward sense of unease you feel when a character or avatar appears human-like yet “not quite there.” Speech Graphics developed the technology that creates this notoriously difficult-to-achieve illusion that an animated face is the source of the sound you hear. Their pipeline merges powerful speech analysis with procedural animation techniques. To achieve this, the algorithm replicates not only the movement of the lips but also decodes from that speech the energy and emotion of the speaker.


“In the sound of speech, there is a wealth of information about what the speaker was doing when he or she made the sound—including the movements of the mouth as it produced the sound, and the energetic state of the speaker, from which we can deduce facial expression. From syllables to scowls,” its website reads. And because it is a universal physical model, it works for any language and any type of character, from realistic humans to cartoon-like avatars. 
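Speech Graphics’ production pipeline is proprietary, but the core intuition – that the audio itself carries enough information to drive a face – can be sketched in a few lines. This toy version (mapping per-frame loudness to a single jaw-open parameter is an assumption for illustration, far short of real viseme and emotion analysis) hints at why the effect reads as natural: the mouth moves exactly when, and as strongly as, the sound does.

```python
import numpy as np

def mouth_openness(signal, sample_rate, fps=30):
    """Map per-animation-frame loudness (RMS) of speech audio to a
    normalized 0..1 jaw-open parameter for an avatar rig."""
    spf = sample_rate // fps                  # audio samples per video frame
    n = len(signal) // spf
    frames = signal[:n * spf].reshape(n, spf)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    peak = rms.max()
    return rms / peak if peak > 0 else rms

sr = 16000
t = np.arange(sr // 2) / sr
clip = np.concatenate([np.zeros(sr // 2),              # half a second of silence...
                       np.sin(2 * np.pi * 220 * t)])   # ...then half a second of voiced tone
curve = mouth_openness(clip, sr)  # jaw stays shut through the silence, then opens
```

A real system layers phoneme-to-viseme mapping, co-articulation, and the emotional decoding the quote describes on top of this kind of signal analysis.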

The future of voice

As digital experiences move beyond the familiar constraints of screens, our modes of interaction with the digital world are also evolving. Paradoxically, that evolution is taking us back to basic and instinctual forms of natural human interaction, hence the enduring relevance of Clarke’s Law. Technology will soon be sufficiently advanced so that it will become an invisible layer of our reality rather than a separate realm requiring special skills to access. And in that hybrid reality, we will experience an entirely new type of magic.

About Futurithmic

It is our mission to explore the implications of emerging technologies, seeking answers to next-level questions about how they will affect society, business, politics and the environment of tomorrow.

We aim to inform and inspire through thoughtful research, responsible reporting, and clear, unbiased writing, and to create a platform for a diverse group of innovators to bring multiple perspectives.

Futurithmic is building the media that connects the conversation.
