Monday, January 25, 2010

Voice-operated vs. voice-augmented interfaces

Mention the term "voice control" and one of two scenarios usually comes to mind.  The first scenario is of stilted, error-prone interactions, often involving the user repeating himself as the computer dumbly responds with "I don't understand."  Worse, it can lead to the user shouting, "No!" as the phone asks if he would like to call the ex-girlfriend he hasn't gotten around to deleting from his contacts yet.  This would be the "real world" scenario.  The second scenario that may come to mind is of magical understanding, where the computer has all the language comprehension of a fluent speaker.  It readily parses natural language and produces instantaneous results to any query or command.  This is the "Star Trek" scenario.


"Computer, calculate warp trajectory."
"Did you say, 'Call Charlotte Padalecki?'"


There's a pretty wide gap between the two scenarios, and a tremendous amount of time and money has been invested in pursuing the second one.  The idea of controlling a computer with voice commands is an enticing one, to the point that it's often referenced as an archetype of "futuristic" interaction.  And the label is apt: it's not going to happen anytime soon.

The list of technologies that would have to be perfected to reach this level of control is daunting.  For the ideal scenario of standing in the middle of a room and candidly addressing a ubiquitous computer listener to do your bidding, we would need the capability to (a rough sketch of the whole pipeline follows the list):
  • Identify the speaker (voice identification)
  • Filter out ambient noise as well as speech not directed at the computer (signal processing)
  • Recognize when the computer is being addressed (speech recognition, and possibly even tone recognition, unless listening is triggered by a completely unique phrase)
  • Transcribe the command into words (speech recognition)
  • Interpret and act upon the command (natural language processing)
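
To make the scope concrete, here's a rough Python sketch of what that pipeline would have to chain together.  Every function name here is something I made up, and each one stands in for a hard research problem rather than anything we can actually implement today:

    # Hypothetical pipeline for the ubiquitous-listener scenario.  Each stage
    # below is a placeholder for an unsolved (or barely solved) problem; the
    # structure is the point, not the implementations.

    def identify_speaker(audio):
        """Voice identification: whose voice is this?"""
        raise NotImplementedError("speaker identification")

    def filter_noise(audio):
        """Signal processing: strip ambient noise and off-mic chatter."""
        raise NotImplementedError("noise filtering")

    def is_addressed_to_computer(audio):
        """Decide whether the computer is being spoken to at all."""
        raise NotImplementedError("addressee detection")

    def transcribe(audio):
        """Speech recognition: audio in, words out."""
        raise NotImplementedError("speech-to-text")

    def interpret_and_act(text):
        """Natural language processing: figure out what was meant, then do it."""
        raise NotImplementedError("command understanding")

    def handle_utterance(audio):
        identify_speaker(audio)                 # is this someone we obey?
        clean = filter_noise(audio)
        if not is_addressed_to_computer(clean):
            return                              # ignore speech not meant for us
        interpret_and_act(transcribe(clean))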

So far, very few, if any, of these technologies are mature enough to meet the requirements of the Star Trek scenario.  It is not yet the future, to our collective chagrin, and we're going to have to continue using our hands to control our devices, like cavemen.

I just killed a buffalo with my universal remote.

So we're currently stuck without the ability to interact naturally with our computers using only our voices, but we've sunk tons of effort into developing computers that can recognize and respond, imperfectly, to voice commands.  Until we finally reach that holy grail of Star Trek interaction, do we just suffer through operating our computers manually while academia and industry each take a million baby steps toward the goal?

Well, yes.  That's how progress gets made.  But in the meantime, I propose putting some consideration into an offshoot to this particular pursuit: multimodal voice control.

Multimodal voice control isn't an entirely unheard-of concept.  W3C has a working draft of technical requirements for multimodal interactions at http://www.w3.org/TR/multimodal-reqs, and there are other investigations out there into ways of combining modalities for richer interactions.  In most cases, though, voice control is treated as an alternative, fully functional, redundant layer of interaction, much like how Windows Speech Recognition lets you operate the computer entirely through rather awkward and complicated voice commands.  It exists to support users who cannot operate the interface by normal means and must instead rely on speech for navigation.  What does not seem to have been considered, however, is using voice control as a supplementary mode of interaction.

If you've ever waved someone over and then begun to speak to them, or if you've used your hands to better articulate a spatial concept, you've interacted multimodally with someone, with one of those modes being speech.  Humans naturally interact using whatever modes are available to us.  We talk, wave, point, wink, nod, poke, shove, and engage in any number of attempts to convey information as we see fit.  Sometimes we even attempt to engage our computers in completely unsupported ways.

It's part of human nature to throw everything we've got at it.  Just watch a golfer try to coax his ball back onto the fairway mid-flight.  He may shout at it, wave at it, or lean in the direction he wants it to travel.  Unfortunately, the golf ball won't listen.  Fortunately, even though computers don't listen either, they could if we wanted them to.

Consider a computer that listens to what you say while you work through the usual modes of interaction (i.e. mouse and keyboard in most cases).  Just as you might provide the bulk of your input via keyboard and supplement it with the mouse, we could create a system where voice input is accepted optionally at any point in time.  Allow me to illustrate with an example.

I sit down at my computer.  Much like interacting with a person, I have to get the computer's attention first.  I do so quite naturally by addressing it by name.  My computer's name is Eddie, so I say, "Hey, Eddie."  At this point, the computer begins listening to me.  It's only polite, as in conversation with a person, that you manage attention by addressing the person to whom you wish to talk, and excuse yourself from conversation should you have to speak to someone else.

I begin working.  Let's say I want to write an email.  Because I'm pretty quick with the mouse and keyboard, I go ahead and launch my email client in the usual fashion.  At this point, the computer and I both know that I'm working on an email, so everyone's on the same page.  The benefit of this is that Eddie can limit his working vocabulary to words and phrases related to email.  He doesn't need to know about tabs, history, song and album titles, or anything else that isn't appropriate.  Now that I've got the program open, I can just say, "let's start an email to Mark."  Since the key terms are "start," "email," and "Mark," most of the decorative and linking words can be parsed out relatively easily.  Eddie dutifully starts a new email message to my friend Mark.
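
Here's a hedged sketch of what that context-limited parsing could look like.  The grammar, the action names, and the filler list are all invented purely for illustration; the idea is just that the foreground application decides which small vocabulary is live, and the parser hunts for the key terms while discarding the rest:

    # Illustrative only: the foreground application (email, here) narrows the
    # working vocabulary, so "let's start an email to Mark" only has to be
    # matched against a handful of email-related commands.

    EMAIL_GRAMMAR = {
        # key terms that must all appear -> action name (all hypothetical)
        ("start", "email"): "compose_message",
        ("reply",): "reply_to_message",
        ("send",): "send_message",
    }

    FILLER = {"let's", "lets", "please", "a", "an", "to", "the", "go", "ahead", "and"}

    def parse_command(utterance, grammar):
        """Return (action, leftover words) for the first entry whose key terms
        all appear in the utterance, or (None, []) if nothing matches."""
        words = [w.strip(",.?!").lower() for w in utterance.split()]
        content = [w for w in words if w not in FILLER]
        for key_terms, action in grammar.items():
            if all(term in content for term in key_terms):
                return action, [w for w in content if w not in key_terms]
        return None, []

    print(parse_command("Let's start an email to Mark", EMAIL_GRAMMAR))
    # -> ('compose_message', ['mark'])   "mark" becomes the recipient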

Now I begin typing.  Somewhere in my email, I type out a sentence and realize that it's redundant.  I don't need it in there, so I say, "scratch that last sentence."  Eddie then erases the redundant sentence.  Since we're in a state of composing text, Eddie can respond to a number of natural language constructs involving sentences, paragraphs, words, and typography.  I'm not writing the email with my voice, because then I'd spend half of my time dictating punctuation and saying things like, "Things have been good here in Durham period.  Argh.  ...Scratch that.  Correct 'during'.  'Durham.'"  Instead, I can type the email but choose between two (or more) different modes of interaction for certain actions.  When chunks of text are easily identified and manipulated in spoken terms, I have the option of doing so whenever that's faster than selecting the text with the mouse or keyboard.
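
Here's a deliberately naive sketch of what "scratch that last sentence" could do once the voice layer knows I'm composing text.  The sentence splitting is crude and none of this is a real API, but it shows how small the command's job actually is:

    import re

    def scratch_last_sentence(text):
        """Drop the final sentence-shaped chunk (naively split on ., !, ?)."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        if len(sentences) <= 1:
            return ""
        return " ".join(sentences[:-1]) + " "

    draft = ("Things have been good here in Durham. "
             "I already said that in my last email.")
    print(scratch_last_sentence(draft))
    # -> "Things have been good here in Durham. "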

Continuing my missive, I'm typing merrily along, and then my phone rings.  I don't want Eddie to think I'm talking to him while I'm on the phone, but we aren't at the point where Eddie can watch me via webcam to determine that I'm on the phone (see my blog post on multimodal visual interaction, coming in 2019).  I politely excuse myself from the ongoing "conversation" with Eddie by saying, "Hang on."  Eddie waits while I talk, and then snaps back to attention when I return to the interaction with, "Okay, Eddie, I'm back."  I finish typing my email, and I say, "Okay, send it."  I'm done emailing, so I say, "Minimize window," and get back to drooling over $2000 camera lenses at Amazon.
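
Pulling the attention management together, here's a rough sketch of the listening state Eddie would be tracking through that whole exchange: "Hey, Eddie" and "I'm back" wake him up, "hang on" puts him on hold, and everything said while he's on hold is ignored.  Every name and phrase here is just my invention:

    # Sketch of the attention state Eddie keeps through a session.
    # Nothing here is a real API; it just shows the listening toggle.

    class Listener:
        def __init__(self, name="eddie"):
            self.name = name
            self.listening = False   # not paying attention until addressed

        def hear(self, phrase):
            p = phrase.lower().strip(" .!,")
            if not self.listening:
                # Only a wake phrase gets through while we're ignoring speech.
                if self.name in p and ("hey" in p or "i'm back" in p):
                    self.listening = True
                return None
            if p in ("hang on", "hold on"):
                self.listening = False   # phone call, etc.: stop interpreting
                return None
            return p                     # a live command, handed to the parser

    eddie = Listener()
    for said in ["Hey, Eddie.", "Let's start an email to Mark.",
                 "Hang on.", "Sorry, I can't talk right now.",  # the phone call
                 "Okay, Eddie, I'm back.", "Okay, send it."]:
        command = eddie.hear(said)
        if command:
            print("command:", command)
    # -> command: let's start an email to mark
    # -> command: okay, send it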

Hopefully this example makes some sense.  It's not a completely fluid speech interaction, but it's natural.  In fact, it's arguably more natural than almost any unimodal interaction.  We are multimodal creatures.  We talk with our hands and we push with our words.  As a rule of thumb, if it's listening, we'll talk to it.  If it's not, we'll talk to it anyway.  Since computers have microphones and some modicum of smarts these days, I figure we might as well put them to use.

Given that we often find ourselves operating computers on a focused, one-on-one basis, there are a lot of ways voice could be used to augment and accelerate the standard physical modes of interaction.  Just consider the utility of some of the following example commands in common scenarios (a rough sketch of this kind of phrase-to-action mapping follows the list):
  • "Yes/yeah/sure/okay/uh-huh/yup/yep/yarp/whatever", "No/nope/nah/narp/nuh-uh" - Confirm/deny
  • "Go back/forward", "Close this tab/window", "Email this to..." - General navigation
  • "What was that?", "Do/Show that again", "Huh?" - Repeat last indication or alert
  • "Whoa", "hang on", "hold on", "hey" - Induce a pause in animation or process
  • "Never mind", "Oops", "Whoops", "Scratch that" - Undo
  • "Shh", "Shut up", "Be quiet" - Mute volume
  • "Goodbye/Good night", "Later", "See ya" - Close or log out
  • "Hey [computer name]", "Hang on", "I'm back" - Toggle listening on/off
  • "Oh crap, it's the Feds" - Repeatedly write random data to hard drive
Yelling at your computer has never really accomplished anything except maybe making you feel better, but one day it could serve an actual purpose.  It'll still make you feel better too.
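
To show how modest the machinery for that list could be, here's a rough phrase-to-action mapping.  The action names are placeholders I made up; the point is just that this supplementary vocabulary is tiny, forgiving, and layered on top of normal mouse-and-keyboard work rather than replacing it:

    # Illustrative mapping from loose spoken phrases to supplementary actions.
    # All of the action names are placeholders.

    VOICE_ACTIONS = {
        "confirm":      {"yes", "yeah", "sure", "okay", "uh-huh", "yup", "yep"},
        "deny":         {"no", "nope", "nah", "nuh-uh"},
        "repeat_alert": {"what was that", "show that again", "huh"},
        "pause":        {"whoa", "hang on", "hold on", "hey"},
        "undo":         {"never mind", "oops", "whoops", "scratch that"},
        "mute":         {"shh", "shut up", "be quiet"},
        "log_out":      {"goodbye", "good night", "later", "see ya"},
    }

    def dispatch(phrase):
        """Return the action for a spoken phrase, or None if it's not for us."""
        p = phrase.lower().strip(" .!?,")
        for action, phrases in VOICE_ACTIONS.items():
            if p in phrases:
                return action
        return None

    print(dispatch("Scratch that!"))   # -> undo
    print(dispatch("Shh."))            # -> mute
    print(dispatch("pass the salt"))   # -> None (not something we understand)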
