Conversational voice UIs have tantalized and disappointed us for decades. By considering technical capabilities and context of use, we can utilize voice to support natural human interactions.

VOICE USER INTERFACES have been tantalizing and disappointing us for decades. The promise is simple — to allow people to communicate with machines naturally using voice. It is arguably the most natural human-computer input method available, reflecting the efficiency of speech in communicating complex and emotionally rich thoughts with the people around us.

While other computer interaction methods are artificial tools that force people to adapt to the computer, voice interaction reverses this by putting the onus on the computer to understand the human. No keyboard, mouse, or remote is necessary for input, just the user’s voice. Most people can speak significantly faster than they can type, but there is a deeper reason that voice, when done well, wins over other input methods: it is highly effective at flattening cumbersome navigation structures. Because users draw on a shared vocabulary of thousands of words, voice input is like exposing a huge menu of options at the surface of the UI. Similarly, voice can improve communication back to the user by reducing the reliance on overcrowded information displays, cumbersome visual metaphors, and cryptic iconography.

However, voice recognition has suffered from poor public reception in the past. The biggest contributors to its lack of popularity have been poor system performance, unrealistic user expectations, and inappropriate use cases. As with any other technology, these factors need to be weighed carefully in design so that speech technologies are applied where they are technically strongest and in the moments that are most beneficial to users.

Whereas many speech systems in the past had to be trained to their user’s voice, most modern systems do just fine from day one and keep improving with more usage. Additionally, as traditional computers give way to smartphones, tablets, connected TVs, and automotive UIs, the need for quick interactions that are hands-free and non-visual is greater than ever.

Because voice is a natural, integral part of how people interact with the world at large, we may make assumptions about how to address a consumer device through voice. However, as interface designers, we must identify and utilize the appropriate role for voice in a user experience, and that largely depends on the relationship between our user and the machine, and what they are trying to achieve.


The Four Types of Voice Interaction

We propose a simple framework for understanding and designing voice interfaces. With these primary voice interface considerations in mind, designers can determine which types of speech interaction are appropriate for their product. The four main interaction roles for voice are command, dictation, agent, and identification.



Command-style interaction is the most basic form of speech recognition and was one of the first offerings to enable speech input. Technically, command-based interactions are some of the simplest for a speech recognition engine to process, since the engine only has to match against a relatively small set of commands (e.g. “Pause music”). More advanced commands can be given if the user speaks in a specific, pre-defined syntax (e.g. “Play artist Arcade Fire”). This verb-noun construct is common in earlier generations of speech technology and can be found in products like Apple’s original Voice Control for iPhone, Ford Sync (developed with Microsoft), and the automated phone trees that are almost universally loathed: “What account balance are you looking for? You can say Checking, Savings or Credit Card.”
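
To make the difference between a fixed command and a verb-noun command concrete, here is a minimal sketch in Python. It assumes the speech engine has already turned audio into text. The command list, the action names, and the parse_command function are all hypothetical illustrations, not the grammar of any shipping product.

```python
import re

# Hypothetical fixed commands: the whole utterance must match exactly.
SIMPLE_COMMANDS = {
    "pause music": "PAUSE",
    "resume music": "RESUME",
}

# Hypothetical verb-noun patterns: a verb, a noun type, then a free-form name.
PATTERNS = [
    (re.compile(r"^play (artist|album|song) (.+)$"), "PLAY"),
]


def parse_command(utterance):
    """Return (action, noun_type, value), or None if nothing matches."""
    # Normalize case and whitespace so "Pause  Music" still matches.
    text = " ".join(utterance.lower().split())

    if text in SIMPLE_COMMANDS:
        return (SIMPLE_COMMANDS[text], None, None)

    for pattern, action in PATTERNS:
        match = pattern.match(text)
        if match:
            return (action, match.group(1), match.group(2))

    # Anything outside the small grammar is rejected.
    return None
```

In this sketch, “Pause music” and “Play artist Arcade Fire” both parse, while a free-form question such as “do I need an umbrella today?” returns None. That rigidity is what keeps command grammars easy to recognize accurately, and also what makes them feel stilted to users.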

In time, most command-style interaction will give way to more advanced interactions (see agent-style interactions below) where natural language becomes a better way to communicate with machines. However, there still will be instances in which simple, structured commands are appropriate and valuable. Situations that require a high degree of accuracy and where the physical and visual modes are limited, or even non-existent, can still benefit from direct and concise speech.



Speech dictation has historically been one of the few drivers pushing consumers to actually pay, sometimes hundreds of dollars, for a speech recognition program. These systems typically required exhausting training periods, cumbersome headsets, and saintly amounts of patience to be effective. Advancements in microphones and mobile computing, however, have made dictation both more palatable and available on many more platforms, like smartphones and tablets. Apple and Google recently embraced dictation as an alternative to mobile touchscreen keyboards. Additionally, Apple has added system-wide dictation to its latest Mac operating system update, Mountain Lion, at no additional cost.

Dictation will continue to appear in many products because even at 80% accuracy, it’s still a better, and much faster, experience than typing on tablets, smartphones, or TVs. Along with speed, context of use for dictation is another incredibly important factor. With our increasingly mobile lifestyles, the need to input data while on the go is becoming more frequent. Walking down the sidewalk while texting is miserable (and also dangerous). Driving and texting is often illegal (and even more dangerous). Already, dictation is a tool that can help millions of mobile users in everyday life.



An agent is a system that can accept natural language input, intelligently process the information, and appropriately respond either onscreen, through audio, or both. Agent-style speech recognition has become the celebrity of product launches of late, with both Apple’s Siri and Google’s Voice Search playing prominent roles in their recent conference keynotes. Agents often take on a persona, like Apple’s Siri or Nuance’s recently announced Nina. The persona’s purpose is to help the user, often conversationally, receive the information requested. A great example of this type of query is asking Siri “do I need an umbrella today?” The system is able to understand the question and correctly respond. Agent-style interactions, which now star in multi-million dollar television spots, are implicitly teaching the public what they can expect from speech, and setting the bar for would-be competitors.

Although still in their infancy, agent-style interactions will continue to develop and improve. We will see this interaction style become increasingly popular and grace more devices and customer support telephone lines. Additionally, the agent persona will be one of the next big opportunities for companies to brand their experience. Soon, agents will become a core part of automotive and TV experiences where physical controls are clumsy or even dangerous. Imagine being able to ask your TV for “next episode of Breaking Bad” or “what’s a funny TV show like 30 Rock?” or ask your car to “start the playlist Mike shared with me yesterday.”



Although still not widely used beyond Hollywood movies, biometric voice identification is another role that voice recognition can handily perform. Nuance’s promotional video for Nina highlights the user’s ability to say a non-secret passphrase that verifies the user through biometric identification.

In addition to the obvious security and privacy uses, voice identification can also improve experiences where user profiles are important. In a multi-user context, for example, when mom asks the family tablet to “add CNN to my favorites,” it will know whose favorites to update because it recognizes her voice.
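
The family tablet scenario can be sketched roughly as follows. This is an illustration under loud assumptions, not how any particular product works: it supposes that some upstream component has already reduced each voice to a short numeric "voiceprint", and it compares voiceprints with cosine similarity. The enrolled values, the threshold, and the function names are all invented for the example.

```python
import math

# Hypothetical enrolled voiceprints (made-up numbers for illustration only).
ENROLLED = {
    "mom": [0.9, 0.1, 0.2],
    "dad": [0.1, 0.8, 0.3],
}

# One favorites list per enrolled user.
favorites = {name: [] for name in ENROLLED}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(voiceprint, threshold=0.8):
    """Return the best-matching enrolled user, or None if no match is close enough."""
    best = max(ENROLLED, key=lambda name: cosine(voiceprint, ENROLLED[name]))
    return best if cosine(voiceprint, ENROLLED[best]) >= threshold else None


def add_favorite(voiceprint, channel):
    """Resolve "my favorites" to the speaker's own list and add the channel."""
    speaker = identify(voiceprint)
    if speaker is None:
        return None  # Unknown voice: do not guess whose favorites to change.
    favorites[speaker].append(channel)
    return speaker
```

The design point is that the word “my” is resolved by the voice itself, so the user never has to sign in or pick a profile. The threshold is there so that an unrecognized voice changes nothing rather than landing in the wrong person’s list.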

The Future of Voice Is Bright

Each new day yields technological advances that allow people to address devices in increasingly intuitive, natural ways. We are becoming more comfortable addressing machines and new interaction patterns are emerging. However, the trick is knowing that voice won’t work for everyone in every situation. Never lose sight of designing with technical limitations in mind and be diligent in using voice at the right moments and for the right reasons. When designed appropriately, voice interactions feel magical and powerful and can be a strong product differentiator. But more importantly, good voice UI lets us feel more human as we interact with our devices.


Punchcut is a human interface design company specializing in mobile, connected products and services. Punchcut works with the world’s top companies to envision, design and realize next generation connected experiences across devices and platforms that engage customers and transform businesses in a connected world.
A Punchcut Perspective | Contributors: Mike Sparandara, Lonny Chu and Jared Benson
© Punchcut LLC, All rights reserved.