UX Australia 2018 Presentation Notes by Tim Noonan

Tim Presenting at UX Australia Conference

Voice UX — Insights from the blind side:
designing richer voice experiences for all.

Listen to the full Recording of Tim’s Session

Presentation Description

2018 is the year in which audio and voice interfaces are finally emerging and taking prominence as viable alternatives and complements to screen-based interactions. In this experiential session you will be inspired to consider voice interface design from a fresh and expanded perspective.

As we all become more accustomed to voice interactions, our expectations for more complex activities and more personalised interfaces inevitably grow. Drawing on learnings from information-rich voice applications designed for, and with input from, blind users, Tim will share insights and his unique understanding of elegant and efficient voice experience design.

A solid grounding in speech and voice output is crucial for great voice application design. Tim will explore and demonstrate the power of the human voice and how it can be harnessed to create an increased sense of connection and inclusion with users.

Tim will highlight the fundamental differences between screen-based and voice-first application design. He will conclude the session with his top suggestions for creating intuitive, natural-sounding voice interfaces and applications that speak for themselves.

Overview

Drawing from two advanced voice application case studies which Tim headed up, and based on 25 years’ experience in designing and implementing voice applications, this presentation provides broad insights and learnings that aren’t well covered in the current voice literature.

When you are blind, listening is never optional.

One thing most blind people have in common is that they have had to become high functioning listeners, skilled in efficiently processing and retaining auditory information.

Blind users provided extensive input and feedback to both case studies covered here.

The overarching idea of this presentation is that we can learn from non-visual users’ needs, preferences and strategies as we design and enhance modern voice experiences.

The other high-level theme recurring throughout this presentation is that voice (has the potential to be) so much more than a string of words to be automatically converted into sound.

At the moment, modern voice assistants are largely single-turn call and response based. However, people are becoming more accustomed to voice interactions and as a result, their desires and expectations for more complex transactions, richer conversational sessions and expanded functionality are inevitable.

I consider this nascent field of voice assistants to be at only around a version 0.1 level, so we have immense opportunities for progress in the coming years.

A key challenge for advanced voice applications is the transformation, navigation and presentation of complex or voluminous data for the user. In differing ways, the voice services behind both of the following case studies devised new approaches, and refined existing ones, to address this challenge.

This session is all about voice and sound

The session actually uses no visual slides, but the recorded session contains various audio samples (sound slides).

So for the next 40 minutes I invite you to:

  • close your eyes;
  • relax your ears
  • and come with me on a journey into my rich – invisible – interface.

In addition to giving your visual centres a rest, closing your eyes also helps to bring you into a more open and expansive listening state (listening position).

Just a little about me, so you know where I’m coming from:

I’ve worked in voice UX design since the early 90s.

I describe myself as a ‘Professional Listener’.

This presentation draws on most of My Core Interests:
Voice & Sound,
Listening & Speaking,
People, Technology & Design

Some Voice UX Basics

  • Voice UI design is often broader service design

  • Voice UI Design, or Voice Experience Design as I prefer to think of it, is a totally different paradigm to screen UX.

  • Voice UX is different to systems that render screen-based information as sound – those are called screen readers.

Whereas voice UIs optimise the sound output to maximise the naturalness, clarity and intuitiveness of the service, the job of screen-readers is to convey — through speech or braille — all relevant visual elements of the application and operating system.

A screen-reader doesn’t have knowledge of the content; it is a translator into a different modality.

  • Visual Interfaces are naturally spatial – our eyes do most of the scanning, navigating, focusing etc.

  • Sound interfaces – in contrast – work in the dimension of time. There is no cursor or moveable pointer. How we sequence the words and the messages, completely determines how the listener experiences the service.

Our main focus today is on the voice output side of voice user interfaces and how voices are perceived by users.

Case Study 1 – iVote by Phone and iVote by Web

iVote had two interfaces:

  • iVote by Phone – which we are exploring today and
  • iVote by Web, an accessible web interface for any users who met the eligibility criteria.

Watch the iVote Promotional YouTube Video Here

To support users through a long and complex voting session, and to ensure no undue emphasis was given to any of hundreds of candidates, meticulous voice selection and direction were paramount in the design process for iVote by Phone. Other than for early prototyping, we consciously used no synthetic speech in this application.

We used two distinct voices (voice fonts), one (female) for the telephone service itself and another (male) for speaking all candidate names.

While visual designers obsess about typeface, font, colours and iconography in visual apps, there is an obvious gap in the literature and the collective consciousness about voice properties and their importance in good design.

Through various audio samples, I demonstrated some of the strategies we employed to enable users to independently navigate a complex ballot paper using a telephone keypad, including voting below the line.

In particular, the telephone service reflected the ballot paper layout: the keypad acted like a cursor cross for navigating to groups and candidates.
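
To make the cursor-cross idea concrete, here is a minimal sketch in Python of how keypad keys might be mapped to movement over a two-dimensional ballot. The key assignments, ballot data and wording are illustrative assumptions only, not the actual iVote mapping or code.

    # Illustrative sketch: a telephone keypad acting as a "cursor cross"
    # over a two-dimensional ballot paper (columns = groups, rows = candidates).
    BALLOT = [
        ["Group A, candidate 1", "Group B, candidate 1"],
        ["Group A, candidate 2", "Group B, candidate 2"],
    ]

    # Hypothetical layout: 2 = up, 8 = down, 4 = left, 6 = right.
    MOVES = {"2": (-1, 0), "8": (1, 0), "4": (0, -1), "6": (0, 1)}

    def handle_key(key, row, col):
        """Return the new (row, col) position and the prompt to be spoken."""
        if key not in MOVES:
            return row, col, "That key is not used for navigation."
        d_row, d_col = MOVES[key]
        row = min(max(row + d_row, 0), len(BALLOT) - 1)
        col = min(max(col + d_col, 0), len(BALLOT[0]) - 1)
        return row, col, BALLOT[row][col]

    row, col = 0, 0
    for key in "8668":                    # caller presses down, right, right, down
        row, col, prompt = handle_key(key, row, col)
        print(prompt)                     # played back as a recorded human voice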

Because users would only need to use the service once, usability and discoverability had to be central to the design.

We also provided a practice service so users could try out and learn the system as many times as they wished ahead of casting their vote. User confusion or misunderstandings about how they were completing their ballot were obviously unacceptable.

Hear a 1 minute demo of iVote by Phone

“The fully automated iVote system used in NSW is superior to any other we have seen in an Australian election so far, and voters have clearly endorsed the system by using it in greater numbers than ever before. We will be working towards encouraging this system as the gold standard for future elections.” — Marianne Diamond, Vision Australia

Additional iVote Background

My iVote involvement – in conjunction with Judy Birkenhead, an electoral expert – over an intense nine-month period included:
  • Conceptual design and comprehensive scripting of wording for iVote by Phone;

  • Preparation of functional software design specifications;

  • Development of automation systems and processes for processing audio, automating text-to-speech and maximising audio quality;

  • Researching recording studio options and recommending a studio, Twenty5Eight, with whom we worked for the extensive and highly time-critical voice recording requirements of the iVote project;

  • Preparation of a Vocal Branding profile for each of the key ‘voices’ required for iVote which we used to cast voice talent;

  • Meticulous vocal direction of voice talent for the recording of hundreds of prompts and nearly 1000 candidate names;

  • Providing in-house voiceover and audio production services for iVote promos;

  • Consulting to the Commission on promotional strategy to maximise the uptake of iVote by people with disabilities;

  • Web Accessibility and usability services including design recommendations, access consulting, conducting observational usability studies and ensuring web accessibility compliance in conjunction with Scenario Seven Pty Ltd;

  • Designing and conducting observational usability design studies for iVote by Phone;

  • Publishing an updated Standard after completion of the iVote project, to document UX recommendations.

Download The updated Version of the Australian Telephone Voting Standard in PDF

Case Study 2: Today’s News Now

Developed in-house from 1997 onwards, TNN was a sophisticated, information-rich text-to-speech voice application for browsing, reading and reviewing newspaper articles over the phone.

The voice UI approaches we formulated were based on standards, as well as real-time input enlisted from dedicated blind and vision-impaired users of the service.

We used DTMF (touch-tone) input strategies for searching, navigating, skipping, reading and reviewing rich content from the service.

Hear a Brief Audio demonstration of ‘Today’s News Now’ Phone Newspaper service.

Even today, it would be difficult to create a reliable voice input approach for power users of TNN and the iVote system.

This is an area needing more work as we move into a ‘Voice First’ era.

Whereas iVote was centred around carefully scripted and directed human voices, Today’s News Now was entirely automated and utilised the DECtalk speech synthesiser.

We devised Perl and regular-expression-based techniques to fully automate the transformation of print information into a spoken-word format that reflected spoken conventions: pattern-matching rules better rendered phone numbers, times, opening hours, the pronunciation of proper names, and so on.

A key design challenge here was not to discard the original text, as users also wanted to review content to check the spelling of names and the like.

For example, although we wanted the system to pronounce ‘Grand Prix’ as ‘Graun Pre’, it was important that users could check how the event was actually spelt in the print edition of the newspaper.
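
As an illustration of this style of transformation, here is a minimal sketch in Python (the production system used Perl). The patterns, spoken wordings and the pronunciation respelling are illustrative assumptions, not the actual rule set, which was far more extensive.

    import re

    def to_spoken_form(text):
        """Rewrite common written conventions into forms a synthesiser reads well."""
        # Read a written time such as "9.30am" in a spoken-friendly way.
        text = re.sub(r"\b(\d{1,2})\.(\d{2})\s*am\b", r"\1 \2 A M", text, flags=re.I)
        text = re.sub(r"\b(\d{1,2})\.(\d{2})\s*pm\b", r"\1 \2 P M", text, flags=re.I)
        # Space out eight-digit phone numbers so each digit is spoken clearly.
        text = re.sub(r"\b(\d{4})\s?(\d{4})\b",
                      lambda m: " ".join(m.group(1) + m.group(2)), text)
        # Hypothetical pronunciation respelling for a proper name.
        text = text.replace("Grand Prix", "Graun Pre")
        return text

    article = "The Grand Prix starts at 9.30am. Call 9999 1234 for tickets."
    print(to_spoken_form(article))  # normalised text sent to the synthesiser
    print(article)                  # original retained so callers can check spelling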

Because there were few developer tools available for TTS and IVRs in the 90s, we developed a high-level scripting language called PhoneScript, optimised for the rapid prototyping and development of powerful telephone applications able to present a range of rich information sources to callers through synthetic speech.

The three applications we developed in the PhoneScript environment were:

  • JobPhone for presenting structured access to job vacancy advertisements from the mycareer.com.au website;

  • LibTel for browsing Royal Blind Society’s braille and talking book catalogue and allowing online ordering; and

  • Today’s News Now, providing structured access to the full text of Fairfax and News Ltd newspapers.

Some of the unique features of the PhoneScript environment which were requested by users and service managers include:

  • A development environment optimised for text-to-speech IVR services (most existing platforms were recorded-message based);

  • Automatic processing of text through extensive Perl regular expressions, so as to dramatically improve pronunciations and to convert written conventions into their spoken equivalents. Examples are Australian place names, names of politicians, British pronunciations, reading out complex currency values, intelligent reading of dates, times and date ranges, identification and clear rendition of telephone numbers, identification of compound words and more;

  • Sophisticated acronym processing module which is able to identify (based on a large number of context rules) whether upper case words should be spoken or spelt out. Most speech synthesisers don’t employ enough context information to do this job very well;

  • Full ‘review’ mode, allowing a caller to navigate a document by paragraph, sentence or word. Any word can be spelt out (see the sketch after this list);

  • A set of menus for adjustment and personalisation of speech parameters including speed, volume, pitch and personality;

  • Three separate sets of voice parameters, one for menus/system messages, one for article/document reading and one for help messages. This provides increased navigation context and can increase comprehension of reading (listening);

  • Based around a very high-level scripting environment which hides the complexity of preparing text-to-speech buffers, telephony controls and so on. This means that only limited programming skills are required to tailor or fine-tune application user interface elements. Examples of some scripting commands are “hangup”, “say”, “spell”, “saysubst”, “title”, “SayArticle”, etc.;

  • The scripting language facilitates automatic compliance with the Australian and New Zealand standard, with respect to standard key assignments and timeouts, but these can be easily overridden as required;

  • An intuitive ‘talking keypad’ approach to alphabetic entry, which complies with Appendix B of the standard;

  • Centred around a database-driven approach to data access, allowing a clean separation of back-end processing of source information, and front-end presentation of information to callers.
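
As a concrete illustration of the ‘review’ mode above, here is a skeletal sketch in Python, assuming a plain-text document with paragraphs separated by blank lines. It is an outline of the idea only; the actual PhoneScript implementation was considerably richer.

    import re

    class Reviewer:
        """Navigate a document by paragraph or sentence, spelling words on request."""
        def __init__(self, text):
            self.paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
            self.p = 0  # current paragraph index

        def sentences(self):
            return re.split(r"(?<=[.!?])\s+", self.paragraphs[self.p])

        def next_paragraph(self):
            self.p = min(self.p + 1, len(self.paragraphs) - 1)
            return self.paragraphs[self.p]

        def spell(self, word):
            # Spell a word out letter by letter, e.g. to check a name.
            return ", ".join(word.upper())

    doc = "The Grand Prix was run on Sunday.\n\nFull results are in the sport section."
    r = Reviewer(doc)
    print(r.sentences()[0])    # read the current sentence
    print(r.spell("Prix"))     # "P, R, I, X"
    print(r.next_paragraph())  # move to, and read, the next paragraph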

Read the full Article on TNN, LibTel and JobPhone Features and Capabilities

Features raised by blind users of voice output systems and services include:

  • Designs that make efficient use of the user’s time
  • Two or more verbosity levels (see the sketch after this list)
  • Put the key information near the front of spoken messages, but not as the very first syllables
  • Personalisation of settings including speed of voice
  • If the service plays long-form audio such as podcasts or audio books, allow that audio to be played by the user at various speeds
  • Remember last listening point when resuming play in a subsequent session
  • Sync my play position across other platforms such as in my smart phone podcasting app
  • Include instructions and help within the voice app; don’t send the user elsewhere – to an instruction booklet or to an app
  • Provide brief answers and allow for more detail to be requested
  • Allow content to be navigated and sections to be skipped through
  • Allow information, a link, phone number or email address to be repeated or expanded for clarity on request
  • And … for heaven’s sake, provide an undo command for all those times you mishear what I actually said! We’re tired of meaningless items turning up on our shopping lists!
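
To illustrate the verbosity and ‘brief answer first’ points in the list above, here is a tiny sketch in Python. The content, parameter names and wording are invented for illustration only.

    # Illustrative only: one piece of content rendered at two verbosity levels,
    # with the brief answer given first and detail available on request.
    WEATHER = {
        "brief": "Sunny, 24 degrees.",
        "detailed": "Sunny, 24 degrees, light northerly winds, "
                    "UV index high, no rain expected until Thursday.",
    }

    def answer(verbosity="brief"):
        return WEATHER.get(verbosity, WEATHER["brief"])

    print(answer())            # the default, brief response
    print(answer("detailed"))  # spoken only when the user asks for more detail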

Voice Personality, persona, humour, empathy and psychology

  • Human Voices are intrinsically linked with issues of identity and personality. The etymology of the word “Persona” directly translates as “Through Sound”.
  • Humans are wired to listen to more than the words spoken: we hear tone, pauses, volume and timbre too. These come from us actually understanding meaning and how it can be expressed through voice and the spoken word.
  • We hear vocal (tonal) language from our mother’s voice for three months while in her womb, well before we clearly hear verbal words.
  • All of this means that automated speech can trigger conscious (or unconscious) cognitive dissonance when the words sound human but lack nuanced meaning, or when the vocal messaging appears contrary to the words being spoken.
  • As an example, how is a computer supposed to meaningfully say “I didn’t say he stole the money”? Which word or words should be emphasised depends entirely on the surrounding context and the story being told (see the sketch after this list).
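
One way designers can express emphasis explicitly is SSML, the W3C Speech Synthesis Markup Language. The short sketch below (in Python) simply generates the seven possible readings of that sentence; whether any given synthesiser renders the <emphasis> element convincingly is another matter.

    # Generate SSML for each possible emphasis of the same seven words.
    WORDS = ["I", "didn't", "say", "he", "stole", "the", "money"]

    def with_emphasis(index):
        """Return an SSML string emphasising the word at the given index."""
        parts = [
            f'<emphasis level="strong">{w}</emphasis>' if i == index else w
            for i, w in enumerate(WORDS)
        ]
        return "<speak>" + " ".join(parts) + "</speak>"

    # Seven renderings, each implying a different meaning.
    for i in range(len(WORDS)):
        print(with_emphasis(i))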

Language understanding coupled with human-like expression is starting to blur the line between human speech and computer generated speech.

Siri voices and Google’s work are the most obvious examples, but this is clearly the direction of future text-to-speech research and application.

For example, IBM’s Watson recently took part in a debate, seeking to influence and persuade listeners of its case against a human speaker.

Recent research by Pablo Arias, a final-year PhD student in perception and cognitive science at the audio research lab IRCAM in Paris, has identified the main articulatory factors that are audible when a person does or does not smile. He has developed an algorithm that can desmile or ensmile any human utterance.

Pablo also discusses research finding that as we listen to a voice, our brain waves adjust in response to what we hear.

Close

We are just starting out on the journey of truly intelligent assistants, and the next 5-10 years will be very interesting indeed!

Today, the Voice Assistant race is mostly about features and functionality, but the personality, trustability and relatability of voice assistants and interfaces will be just as important in the longer term.

As technology continues to better understand voice, language and emotion, let’s work together to ensure that future use of voice technology is always respectful of users and their emotions, and that it serves to constructively assist and support all walks of humanity.

Today’s session draws on my own experiences and those of blind beta testers and users. I hope our insights, learnings and experiences can inform and improve voice services now and into the future.

Other Concepts and themes in the Presentation

The main focus of my presentation was the two case studies I’ve described here. I mentioned some of the following concepts in passing, but they are listed here as additional background information.

Observations on problematic real-life Interactions between Voice assistants and Accessibility Features

  • When VoiceOver is enabled, Siri shouts on the Apple Watch, even if VoiceOver volume is set to quiet.
  • In some iOS releases, Siri and VoiceOver on the iPhone often collided while trying to access sound system resources, causing hangs.
  • Voice apps often hear screen reader output and try to act on it.
  • Siri on iOS devices added typing input in place of speech recognition to accommodate people with speech issues, but it’s not possible to invoke Siri from an external keyboard.

Voice First

Voice First is the Amazon catch-cry, but everything beyond the low-hanging fruit is shunted to a screen-based app. Accessibility and usability of the Amazon Alexa app or the Google Home app for installation or configuration is arduous and in no way mirrors the voice simplicity of the device itself.

  • Complex set-up is a barrier to uptake and contrasts with the near effortless set-up of HomePod.
  • When it arrived, I spent one minute setting up my HomePod – have you tried setting up and tailoring an Echo product or a Google Home? It’s slow, tedious and rather complex and technical.
  • How will older people and people with mild cognitive issues ever do it?

Siri is multi-modal, and mainly offered on a touch-first platform.

  • Siri often assumes you will look at a screen – which doesn’t always work. If I ask my Apple Watch what time it is, it cheerfully responds with a quip about time such as “fruit flies like a banana, time flies like an arrow” – only 13 unnecessary syllables – but then doesn’t actually speak the time.
  • Misguided personality programming doesn’t properly prioritise the most relevant information. The example I played in the presentation was “It’s crunch time…”. This issue has finally been resolved in the late betas for watchOS 5.

Hear “What time is it?” Siri response

  • Now that Siri has launched on the HomePod, which has no screen, Apple’s multi-modal, touch-first model starts to show its biases. Calendars weren’t available at launch because they had been designed on the assumption that calendar details were presented on screen, with only minimal voice feedback.

  • The take-away is that even if you are multi-modal, there are many situations where users will be constrained to audio only, so this needs to be factored into designs.

Some inclusive Voice Experience Design considerations

Note that depending on the platform, these are usually issues beyond the control of a voice application designer, but longer-term these biases and factors need to be considered and addressed.

Is your choice of output voice aligned with your user base? Gender, ethnicity, accent, age, and personality? Do they feel included or separate?

Is your app’s voice suited to its purpose and audience? As a hypothetical, is a female voice (the default on all the leading assistants) going to work for a voice-oriented gay male dating app akin to Grindr?

  • Speech recognition has been found to be biased towards Anglo adults.
  • Voice models for computer-generated text-to-speech are biased towards US and UK white speakers;
  • African Americans have no easy way to use their screen readers with a voice that matches their linguistic community. Apparently, there are no African American speakers in Nuance’s voice portfolio.

Is your speech recognition engine able to handle speech impediments, stutters, nervous speech, shaky and broken speech? The Mozilla Speech Recognition Corpus Project may be an opportunity to include folks with different speech profiles, speech impediments etc.

Are your timeouts sufficient for people who speak slowly or take more time to formulate their requests/responses?

Does your service understand and respond to colloquial, informal terms and phrases from your users? This also has a bearing on how comfortable and accepted your users feel.

Touch Controls or buttons on Smart Speakers

Though not immediately apparent to everyone, eye-hand coordination should not be a design requirement for voice assistants, as they principally work in the auditory (non-visual) domain.

Google Home Mini and HomePod both employ touch-sensitive controls for volume adjustment, pause/play etc.

  • In the dark, when you are not awake, when the device is above your line of sight or when you are reaching past the device, touch controls can be wrongly triggered.
  • I am forever readjusting the volume of my Google Home Mini devices, or unintentionally resuming music playback on my HomePod when reaching for something else nearby.

The Amazon Echo, in contrast, has physical buttons with nominally differentiated tactile surfaces, so it can be operated by feel, in the dark. Though better design overall, physical buttons could be more problematic for people with physical disabilities to operate.

Voice apps and assistants currently perform somewhere around the level of a child or office junior

  • Don’t trust it with confidential information;
  • Don’t expect consistently good responses;
  • Expect an over-confident manner that contrasts with its actual capabilities;
  • It may have its mind elsewhere when you call on it – network or Alexa outages;
  • Expect unexpected or insensitive responses when you are busy and on task – Siri quips and Alexa laughter;
  • Expect frustration and the need to rephrase your request several times, and sometimes still be unsure whether you were accurately understood and whether what you requested was actually done;
  • Expect to often have to shout and call it by name (wake word) to attract its attention.

About Tim

Blind from birth, Tim Noonan is a voice experience designer, inclusive design consultant and an expert in voice & spoken communication.

Building on his formal background in cognitive psychology, linguistics and education, Tim has been designing and crafting advanced voice interfaces since the early 90s and was one of the principal authors of the Australian and New Zealand standard on interactive voice response systems, AS/NZS 4263.

Tim is the principal author of several other standards relating to automated voice systems, including automated telephone-based voting, telephone-based speech recognition and four industry standards on the accessibility of electronic banking channels and inclusive authentication solutions.

Tim has also been a pioneer in the accessibility field for more than three decades. He particularly loves working with emerging and future technologies to come up with innovative ways to make them more inclusive, effective and comfortable to use.

A career highlight for Tim was working as the lead Inclusive User Experience designer for iVote – a fully automated telephone-based and web-based voting system for the NSW Electoral Commission. iVote received Vision Australia’s Making A Difference Award and was recommended as the ‘Gold Standard’ for accessible voting.

For the last 25 years Tim has been leading the way in teaching, conceptualising and designing technologies that communicate with users through voice and sound – both for accessibility and mainstream users.
