How to Prototype Voice Interfaces with Wizard-of-Oz
Quickly gain insights in a high-fidelity context
While there are many tools and platforms for prototyping the interaction between user and machine in conversational interfaces, they only cover basic features. As soon as your digital assistant needs to use your app or third-party functionality, you’ll hit a wall as a designer.
Developing such an integration requires a lot of effort, which makes thorough prototyping and testing all the more crucial. But how can a designer prototype these more complex features?
I’ll show how to prototype digital assistant apps using the Wizard-of-Oz technique, based on my personal experience with testing early-stage AI systems and developing prototAIpe.
What is the Wizard-of-Oz method?
Wizard-of-Oz is a long-established technique in user testing that lets you emulate complex functionality at an early stage. The feature that would normally be realised with a piece of software is handled instead by a human, ideally an expert. So instead of writing an algorithm to find the best route from New York to Philadelphia, you let another person, e.g. an experienced commuter, do this (along with the current traffic status pulled from a webpage, to do the task properly).
The user does not know this and believes they are interacting with a computer, which makes for the perfect test environment to get valid results. This means you need a way of communicating with the user that makes it easy for them to believe they’re working with a machine.
This method got its name from the story The Wonderful Wizard of Oz, where an ordinary man hides behind a curtain and pretends, through the use of technology, to be a powerful wizard.
Why Wizard of Oz?
Some of you might be asking: Why should I put in the effort of simulating the machine with another person? There are plenty of great tools that let you plan the interaction with every phrase to be said or written by the system and the user. Why not use these?
- It’s a lot of work to map out every possible scenario required to reach the same level of “fidelity” as the Wizard-of-Oz method.
- You’re static. If you predetermine everything the system answers, you only allow certain variables in your commands. So every user will be required to ask “What’s the weather in New York?”, even if they live somewhere else. This constrains the interaction and makes it unnatural, affecting the validity of the results.
- You can only hold a conversation. The application’s value cannot go beyond providing information, because an unintelligent bot is running the conversation. A task like “Alexa, turn off the light” could not be fulfilled by such an application. A human, in contrast, can handle tasks that require external services.
The last point is especially important if you want to test larger voice apps early on, because their biggest feature is integration with existing systems and applications.
How to make it work for digital assistants
In order to apply this method to testing digital assistants, you need a special setup.
First of all, a physical device is required that is capable of producing audio output and, ideally, of recording and sending what the user says. An original device like an Amazon Echo or Google Home is the best choice to replicate a realistic scenario, but a Bluetooth speaker does the job as well if you don’t have one.
Additionally, you need some way of turning the wizard’s written answers into spoken content, so the user hears a synthetic voice.
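If you go the Amazon Polly route, a minimal sketch of that step could look like the following Python snippet (assuming the AWS SDK for Python, boto3, with configured credentials; the voice id “Joanna”, the file name, and the helper names are just illustrative choices):

```python
def polly_request(text, voice_id="Joanna"):
    """Build the keyword arguments for Polly's synthesize_speech call."""
    return {"Text": text, "OutputFormat": "mp3", "VoiceId": voice_id}


def speak(text, out_path="answer.mp3", voice_id="Joanna"):
    """Send the wizard's typed answer to Amazon Polly and save the audio,
    ready to be played over the speaker the participant hears."""
    import boto3  # requires the AWS SDK for Python and configured credentials
    polly = boto3.client("polly")
    response = polly.synthesize_speech(**polly_request(text, voice_id))
    with open(out_path, "wb") as f:
        f.write(response["AudioStream"].read())
    return out_path
```

During the session the wizard then only types something like `speak("OK, the light is off.")` and plays the resulting file.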
To be well prepared during testing, scripting the dialogue beforehand is necessary, and a dedicated tool makes this easier. For some, a normal note-taking app might be a better fit; in that case I recommend a sophisticated one with good filtering, so the wizard can keep up with the conversation.
Therefore I recommend the following setup:
- Amazon Echo / Google Home / Amazon Fire TV Stick / A Bluetooth speaker
- (Two smartphones)
- Amazon Polly / Alexa Voice Snippets
- A note-taking app with fine-grained filtering / Twine
- A stopwatch / additional smartphone
It works like this:
You connect the Bluetooth speaker to the wizard’s computer and present it to the user as the digital assistant device if you’re not using a “real” one (here’s a tutorial on how to use your Echo as a Bluetooth speaker). If you can place the wizard behind a thin partition wall or something similar, this is ideal, as the wizard can hear directly what the participant is saying. Alternatively, you can use a voice call between two phones and tell the user that the one they see is for recording purposes.
The wizard prepares by opening the scripting notes, where answers to the most common interaction phrases are pre-written, so they can copy and paste them. Additionally, they have to set up Polly in the right language, so the written content is turned into speech.
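The copy-and-paste step can also be sped up with a few lines of code. The sketch below (standard-library Python only; the scripted phrases are made-up examples, not part of any real app) fuzzily matches what the wizard heard against the pre-written prompts, so close variations of a phrase still find the right answer:

```python
import difflib

# Hypothetical script: expected user phrases mapped to pre-written answers.
SCRIPT = {
    "what's the weather like": "It is sunny and 25 degrees in New York.",
    "turn off the light": "OK, the light is off.",
    "when is my next meeting": "Your next meeting starts at 3 pm.",
}


def scripted_answer(heard, cutoff=0.5):
    """Return the pre-written answer whose prompt best matches what the
    participant said, or a fallback if nothing is close enough."""
    matches = difflib.get_close_matches(heard.lower(), SCRIPT, n=1, cutoff=cutoff)
    if matches:
        return SCRIPT[matches[0]]
    return "Sorry, I did not understand that."
```

This way the wizard types a rough transcription instead of scanning the notes, and anything unscripted falls back to a standard error phrase, just like a real assistant would.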
After instructing the user on their role and tasks in the experiment, you start the experiment and let them interact with “the system”.
Important: If you’re working with an Amazon Echo, you’ll need to instruct your participants in a special way or use a small tweak. Although your device is only used for Bluetooth streaming, it will still recognise its wake word, e.g. “Alexa” or “OK Google”. Saying “Alexa, tell [insert_your_prototype_here] to do XYZ” will not only be sent to your wizard but also wake the assistant in the background, which won’t be able to handle the request; this breaks the experience and the experiment setting. You can either tell your users to use a different word, e.g. your app’s name, or change the wake word. Unfortunately, the latter is only possible on Amazon devices, and you’re limited to three options: Alexa, Amazon or Echo.
If the participant agreed to the session being recorded on video, that’s a huge plus, as you can later match phrases and points of interaction with facial expressions. Either way, I highly recommend using a stopwatch and noting anything you consider important together with the time, even if it’s “just” audio. If you claimed the “microphone smartphone” was your recording device and have no video recording either, I suggest jotting down the basic interaction flow. Keeping up with the conversation shouldn’t be a problem as long as you stick to bullet points and keywords.
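If you want the stopwatch and the notes in one place, a few lines of Python are enough. This is just a sketch of the idea, not a polished tool; each note is stored with the seconds elapsed since the session started and can be dumped as readable timestamps for the write-up:

```python
import time


class SessionLog:
    """Minimal timestamped note-taker for the observer."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # session starts when the log is created
        self.notes = []

    def note(self, text):
        """Record a keyword note together with the elapsed time in seconds."""
        elapsed = self._clock() - self._start
        self.notes.append((elapsed, text))
        return elapsed

    def dump(self):
        """Return the notes as 'mm:ss  text' lines."""
        return [f"{int(t // 60):02d}:{int(t % 60):02d}  {text}"
                for t, text in self.notes]
```

Run it in a terminal next to the wizard’s tools, jot keywords during the session, and print `dump()` afterwards to line the notes up with the audio recording.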
The conversation itself should run quite fluently, provided you trained your wizard on the job and tools and they stay alert. To get the most out of this method, note all incidents where a “classic” digital assistant, whether prototype or fully developed, might have struggled. This will help you later to point out potential breaking points that deserve special attention.
Ending the experiment holds few stumbling blocks, except revealing the wizard or exiting Bluetooth streaming too early. Both will shape the user’s picture of the session after the fact, and thereby their connection to your product and their later feedback, as they destroy the illusion of interacting with a computer. If you plan to come back to the users to collect additional feedback or conduct another test, both of the points above would be fatal.
If you’ve got an Amazon Echo, Google Home or similar and a decent wizard, this setup only minimally impacts comparability with interacting with a true voice app. The only remaining concern is the voice, as it differs from the original one. An alternative to Polly might be extracting various Alexa voice snippets and stitching them into sentences, but this seems like a lot of effort and inflexibility to me.
If you use a modified form of this setup, comparability will likely decrease considerably, but it still works for validating the usefulness of a feature or application, just not for a rigorous user test.
A sneak peek into the future
As you can see, the setups I listed above are more workarounds than real tools. I have been quite frustrated with the status quo and started to work on prototAIpe, a tool to rapidly prototype digital assistants and chatbots in a flexible way. If this sounds interesting, I would be glad if you took a look at our upcoming page.
Thanks for reading! I hope you can now kickstart your conversational interface project or have the tools to go back to user testing, whatever you need right now. What are your experiences of prototyping voice interfaces? Do you recommend another solution or setup? Feel free to leave your thoughts in the comment box, I will happily join the discussion.
If you liked the article, please share the love by applauding 👏