UX Australia 2018 Presentation Notes by Tim Noonan

Tim Presenting at UX Australia Conference

Voice UX — Insights from the blind side:
designing richer voice experiences for all.

Listen to the full Recording of Tim’s Session

Presentation Description

2018 is the year in which audio and voice interfaces are finally emerging and taking prominence as viable alternatives and complements to screen-based interactions. In this experiential session you will be inspired to consider voice interface design from a fresh and expanded perspective.

As we all become more accustomed to voice interactions, our expectations for more complex activities and more personalised interfaces inevitably grow. Drawing on learnings from information-rich voice applications designed for, and with input from, blind users, Tim will share insights and his unique understanding of elegant and efficient voice experience design.

A solid grounding in speech and voice output is crucial for great voice application design. Tim will explore and demonstrate the power of the human voice and how it can be harnessed to create an increased sense of connection and inclusion with users.

Tim will highlight the fundamental differences between screen-based and voice-first application design. He will conclude the session with his top suggestions for creating intuitive, natural-sounding voice interfaces and applications that speak for themselves.

Overview

Drawing from two advanced voice application case studies which Tim headed up, and based on 25 years’ experience in designing and implementing voice applications, this presentation provides broad insights and learnings that aren’t well covered in the current voice literature.

When you are blind, listening is never optional.

One thing most blind people have in common is that they have had to become high functioning listeners, skilled in efficiently processing and retaining auditory information.

Blind users provided extensive input and feedback to both case studies covered here.

The overarching idea of this presentation is that we can learn from non-visual users’ needs, preferences and strategies as we design and enhance modern voice experiences.

The other high-level theme recurring throughout this presentation is that voice (has the potential to be) so much more than a string of words to be automatically converted into sound.

At the moment, modern voice assistants are largely single-turn call and response based. However, people are becoming more accustomed to voice interactions and as a result, their desires and expectations for more complex transactions, richer conversational sessions and expanded functionality are inevitable.

I consider this nascent field of voice assistants to be at only around a version 0.1 level, so we have immense opportunities for progress in the coming years.

A key challenge for advanced voice applications is the transformation, navigation and presentation of complex or voluminous data for the user. In differing ways, the voice services behind both of the following case studies devised new approaches, and refined existing ones, to address this challenge.

This session is all about voice and sound

The session actually uses no visual slides, but the recorded session contains various audio samples (sound slides).

So for the next 40 minutes I invite you to:

  • close your eyes;
  • relax your ears
  • and come with me on a journey into my rich – invisible – interface.

In addition to giving your visual centres a rest, closing your eyes also helps to bring you into a more open and expansive listening state (listening position).

Just a little about me, so you know where I’m coming from:

I’ve worked in voice UX design since the early 90s.

I describe myself as a ‘Professional Listener’.

This presentation draws on most of My Core Interests:
Voice & Sound,
Listening & Speaking,
People, Technology & Design

Some Voice UX Basics

  • Voice UI design is often broader service design

  • Voice UI Design, or Voice Experience Design as I prefer to think of it, is a totally different paradigm to screen UX.

  • Voice UX is different to systems that render screen-based information as sound – those are called screen readers.

Whereas voice UIs optimise the sound output to maximise the naturalness, clarity and intuitiveness of the service, the job of screen-readers is to convey — through speech or braille — all relevant visual elements of the application and operating system.

A screen-reader doesn’t have knowledge of the content; it is a translator into a different modality.

  • Visual Interfaces are naturally spatial – our eyes do most of the scanning, navigating, focusing etc.

  • Sound interfaces – in contrast – work in the dimension of time. There is no cursor or moveable pointer. How we sequence the words and the messages, completely determines how the listener experiences the service.

Our main focus today is on the voice output side of voice user interfaces and how voices are perceived by users.

Case Study 1 – iVote by Phone and iVote by Web

iVote had two interfaces:

  • iVote by Phone – which we are exploring today and
  • iVote by Web, an accessible web interface for any users who met the eligibility criteria.

Watch the iVote Promotional YouTube Video Here

To support users through a long and complex voting session, and to ensure no undue emphasis was given to any of hundreds of candidates, meticulous voice selection and direction were paramount in the design process for iVote by Phone. Other than for early prototyping, we consciously used no synthetic speech in this application.

We used two distinct voices (voice fonts), one (female) for the telephone service itself and another (male) for speaking all candidate names.

While visual designers obsess about typeface, font, colours and iconography in visual apps, there is an obvious gap in the literature and the collective consciousness about voice properties and their importance in good design.

Through various audio samples, I demonstrated some of the strategies we employed to enable users to independently navigate a complex ballot paper using a telephone keypad, including voting below the line.

In particular, the telephone service reflected the ballot paper layout: the keypad acted like a cursor cross for navigating to groups and candidates.
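
To make the cursor-cross idea concrete, here is a minimal sketch in Python of how keypad keys might be mapped to movement over a two-dimensional ballot. The key assignments, ballot data and wording are illustrative assumptions only, not the actual iVote mapping or code.

    # Illustrative sketch: a telephone keypad acting as a "cursor cross"
    # over a two-dimensional ballot paper (columns = groups, rows = candidates).
    BALLOT = [
        ["Group A, candidate 1", "Group B, candidate 1"],
        ["Group A, candidate 2", "Group B, candidate 2"],
    ]

    # Hypothetical layout: 2 = up, 8 = down, 4 = left, 6 = right.
    MOVES = {"2": (-1, 0), "8": (1, 0), "4": (0, -1), "6": (0, 1)}

    def handle_key(key, row, col):
        """Return the new (row, col) position and the prompt to be spoken."""
        if key not in MOVES:
            return row, col, "That key is not used for navigation."
        d_row, d_col = MOVES[key]
        row = min(max(row + d_row, 0), len(BALLOT) - 1)
        col = min(max(col + d_col, 0), len(BALLOT[0]) - 1)
        return row, col, BALLOT[row][col]

    row, col = 0, 0
    for key in "8668":                    # caller presses down, right, right, down
        row, col, prompt = handle_key(key, row, col)
        print(prompt)                     # played back as a recorded human voice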

Because users would only need to use the service once, usability and discoverability had to be central to the design.

We also provided a practice service so users could try out and learn the system as many times as they wished ahead of casting their vote. User confusion or misunderstandings about how they were completing their ballot were obviously unacceptable.

Hear a 1 minute demo of iVote by Phone

“The fully automated iVote system used in NSW is superior to any other we have seen in an Australian election so far, and voters have clearly endorsed the system by using it in greater numbers than ever before. We will be working towards encouraging this system as the gold standard for future elections.” — Marianne Diamond, Vision Australia

Additional iVote Background

My iVote involvement – in conjunction with Judy Birkenhead, an electoral expert – over an intense nine-month period included:
  • Conceptual design and comprehensive scripting of wording for iVote by Phone;

  • Preparation of functional software design specifications;

  • Development of automation systems and processes for processing audio, automating text-to-speech and maximising audio quality;

  • Researching recording studio options and recommending a studio, Twenty5Eight, with whom we worked for the extensive and highly time-critical voice recording requirements of the iVote project;

  • Preparation of a Vocal Branding profile for each of the key ‘voices’ required for iVote which we used to cast voice talent;

  • Meticulous vocal direction of voice talent for the recording of hundreds of prompts and nearly 1000 candidate names;

  • Providing in-house voiceover and audio production services for iVote promos;

  • Consulting to the Commission on promotional strategy to maximise the uptake of iVote by people with disabilities;

  • Web Accessibility and usability services including design recommendations, access consulting, conducting observational usability studies and ensuring web accessibility compliance in conjunction with Scenario Seven Pty Ltd;

  • Designing and conducting observational usability design studies for iVote by Phone;

  • Publishing an updated Standard after completion of the iVote project, to document UX recommendations.

Download The updated Version of the Australian Telephone Voting Standard in PDF

Case Study 2: Today’s News Now

Developed in-house from 1997 onwards, TNN was a sophisticated, information-rich text-to-speech voice application for browsing, reading and reviewing newspaper articles over the phone.

The voice UI approaches we formulated were based on standards, as well as real-time input enlisted from dedicated blind and vision-impaired users of the service.

We used DTMF (touch-tone) input strategies for searching, navigating, skipping, reading and reviewing rich content from the service.

Hear a Brief Audio demonstration of ‘Today’s News Now’ Phone Newspaper service.

Even today, it would be difficult to create a reliable voice input approach for power users of TNN and the iVote system.

This is an area needing more work as we move into a ‘Voice First’ era.

Whereas iVote was centred around carefully scripted and directed human voices, Today’s News Now was entirely automated and utilised the DECtalk speech synthesiser.

We devised Perl and regular-expression-based techniques to fully automate the transformation of print information into a spoken-word format that reflected spoken conventions: pattern-matching rules better rendered phone numbers, times, opening hours, the pronunciation of proper names, and so on.

A key design challenge here was not to discard the original text, as users also wanted to review content to check the spelling of names and the like.

For example, although we wanted the system to pronounce ‘Grand Prix’ as ‘Graun Pre’, it was important that users could check how the event was actually spelt in the print edition of the newspaper.
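
As an illustration of this style of transformation, here is a minimal sketch in Python (the production system used Perl). The patterns, spoken wordings and the pronunciation respelling are illustrative assumptions, not the actual rule set, which was far more extensive.

    import re

    def to_spoken_form(text):
        """Rewrite common written conventions into forms a synthesiser reads well."""
        # Read a written time such as "9.30am" in a spoken-friendly way.
        text = re.sub(r"\b(\d{1,2})\.(\d{2})\s*am\b", r"\1 \2 A M", text, flags=re.I)
        text = re.sub(r"\b(\d{1,2})\.(\d{2})\s*pm\b", r"\1 \2 P M", text, flags=re.I)
        # Space out eight-digit phone numbers so each digit is spoken clearly.
        text = re.sub(r"\b(\d{4})\s?(\d{4})\b",
                      lambda m: " ".join(m.group(1) + m.group(2)), text)
        # Hypothetical pronunciation respelling for a proper name.
        text = text.replace("Grand Prix", "Graun Pre")
        return text

    article = "The Grand Prix starts at 9.30am. Call 9999 1234 for tickets."
    print(to_spoken_form(article))  # normalised text sent to the synthesiser
    print(article)                  # original retained so callers can check spelling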

Because there were few developer tools available for TTS and IVRs in the 90s, we developed a high-level scripting language called PhoneScript, optimised for the rapid prototyping and development of powerful telephone applications able to present a range of rich information sources to callers through synthetic speech.

The three applications we developed in the PhoneScript environment were:

  • JobPhone for presenting structured access to job vacancy advertisements from the mycareer.com.au website;

  • LibTel for browsing Royal Blind Society’s braille and talking book catalogue and allowing online ordering; and

  • Today’s News Now, providing structured access to the full text of Fairfax and News Ltd newspapers.

Some of the unique features of the PhoneScript environment which were requested by users and service managers include:

  • A development environment optimised for text-to-speech IVR services (most existing platforms were recorded-message based);

  • Automatic processing of text through extensive Perl regular expressions, so as to dramatically improve pronunciations and to convert written conventions into their spoken equivalents. Examples are Australian place names, names of politicians, British pronunciations, reading out complex currency values, intelligent reading of dates, times and date ranges, identification and clear rendition of telephone numbers, identification of compound words and more;

  • Sophisticated acronym processing module which is able to identify (based on a large number of context rules) whether upper case words should be spoken or spelt out. Most speech synthesisers don’t employ enough context information to do this job very well;

  • Full ‘review’ mode, allowing a caller to navigate a document by paragraph, sentence or word. Any word can be spelt out (see the sketch after this list);

  • A set of menus for adjustment and personalisation of speech parameters including speed, volume, pitch and personality;

  • Three separate sets of voice parameters, one for menus/system messages, one for article/document reading and one for help messages. This provides increased navigation context and can increase comprehension of reading (listening);

  • Based around a very high-level scripting environment which hides the complexity of preparing text-to-speech buffers, telephony controls and so on. This means that only limited programming skills are required to tailor or fine-tune application user interface elements. Examples of some scripting commands are “hangup”, “say”, “spell”, “saysubst”, “title”, “SayArticle”, etc.;

  • The scripting language facilitates automatic compliance with the Australian and New Zealand standard, with respect to standard key assignments and timeouts, but these can be easily overridden as required;

  • An intuitive ‘talking keypad’ approach to alphabetic entry, which complies with Appendix B of the standard;

  • Centred around a database-driven approach to data access, allowing a clean separation of back-end processing of source information, and front-end presentation of information to callers.
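
As a concrete illustration of the ‘review’ mode above, here is a skeletal sketch in Python, assuming a plain-text document with paragraphs separated by blank lines. It is an outline of the idea only; the actual PhoneScript implementation was considerably richer.

    import re

    class Reviewer:
        """Navigate a document by paragraph or sentence, spelling words on request."""
        def __init__(self, text):
            self.paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
            self.p = 0  # current paragraph index

        def sentences(self):
            return re.split(r"(?<=[.!?])\s+", self.paragraphs[self.p])

        def next_paragraph(self):
            self.p = min(self.p + 1, len(self.paragraphs) - 1)
            return self.paragraphs[self.p]

        def spell(self, word):
            # Spell a word out letter by letter, e.g. to check a name.
            return ", ".join(word.upper())

    doc = "The Grand Prix was run on Sunday.\n\nFull results are in the sport section."
    r = Reviewer(doc)
    print(r.sentences()[0])    # read the current sentence
    print(r.spell("Prix"))     # "P, R, I, X"
    print(r.next_paragraph())  # move to, and read, the next paragraph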

Read the full Article on TNN, LibTel and JobPhone Features and Capabilities

Features raised by blind users of voice output systems and services include:

  • Designs that make efficient use of the user’s time
  • Two or more verbosity levels (see the sketch after this list)
  • Put the key information near the front of spoken messages, but not as the very first syllables
  • Personalisation of settings including speed of voice
  • If the service plays long-form audio such as podcasts or audio books, allow that audio to be played by the user at various speeds
  • Remember last listening point when resuming play in a subsequent session
  • Sync my play position across other platforms such as in my smart phone podcasting app
  • Include instructions and help within the voice app; don’t send the user elsewhere – to an instruction booklet or to an app
  • Provide brief answers and allow for more detail to be requested
  • Allow content to be navigated and sections to be skipped through
  • Allow information, a link, phone number or email address to be repeated or expanded for clarity on request
  • And … for heaven’s sake, provide an undo command for all those times you mishear what I actually said! We’re tired of meaningless items turning up on our shopping lists!
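
To illustrate the verbosity and ‘brief answer first’ points in the list above, here is a tiny sketch in Python. The content, parameter names and wording are invented for illustration only.

    # Illustrative only: one piece of content rendered at two verbosity levels,
    # with the brief answer given first and detail available on request.
    WEATHER = {
        "brief": "Sunny, 24 degrees.",
        "detailed": "Sunny, 24 degrees, light northerly winds, "
                    "UV index high, no rain expected until Thursday.",
    }

    def answer(verbosity="brief"):
        return WEATHER.get(verbosity, WEATHER["brief"])

    print(answer())            # the default, brief response
    print(answer("detailed"))  # spoken only when the user asks for more detail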

Voice Personality, persona, humour, empathy and psychology

  • Human Voices are intrinsically linked with issues of identity and personality. The etymology of the word “Persona” directly translates as “Through Sound”.
  • Humans are wired to listen to more than the words spoken: we hear tone, pauses, volume and timbre too. These come from us actually understanding meaning and how it can be expressed through voice and the spoken word.
  • We hear vocal (tonal) language from our mother’s voice for three months while in her womb, well before we clearly hear verbal words.
  • All of this means that automated speech can trigger conscious (or unconscious) cognitive dissonance when the words sound human but lack nuanced meaning, or when the vocal messaging appears contrary to the words being spoken.
  • As an example, how is a computer supposed to meaningfully say “I didn’t say he stole the money”? Which word or words should be emphasised depends entirely on the surrounding context and the story being told (see the sketch after this list).
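
One way designers can express emphasis explicitly is SSML, the W3C Speech Synthesis Markup Language. The short sketch below (in Python) simply generates the seven possible readings of that sentence; whether any given synthesiser renders the <emphasis> element convincingly is another matter.

    # Generate SSML for each possible emphasis of the same seven words.
    WORDS = ["I", "didn't", "say", "he", "stole", "the", "money"]

    def with_emphasis(index):
        """Return an SSML string emphasising the word at the given index."""
        parts = [
            f'<emphasis level="strong">{w}</emphasis>' if i == index else w
            for i, w in enumerate(WORDS)
        ]
        return "<speak>" + " ".join(parts) + "</speak>"

    # Seven renderings, each implying a different meaning.
    for i in range(len(WORDS)):
        print(with_emphasis(i))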

Language understanding coupled with human-like expression is starting to blur the line between human speech and computer generated speech.

Siri voices and Google’s work are the most obvious examples, but this is clearly the direction of future text-to-speech research and application.

For example, IBM’s Watson recently took part in a debate, seeking to influence and persuade listeners of its case against a human speaker.

Recent research by Pablo Arias, a final-year PhD student in perception and cognitive science at the audio research lab IRCAM in Paris, has identified the main articulatory factors that are audible when a person does or does not smile. He has developed an algorithm that can desmile or ensmile any human utterance.

Pablo also discusses research finding that as we listen to a voice, our brain waves adjust in response to what we hear.

Close

We are just starting out on the journey of truly intelligent assistants, and the next 5-10 years will be very interesting indeed!

Today, the Voice Assistant race is mostly about features and functionality, but the personality, trustability and relatability of voice assistants and interfaces will be just as important in the longer term.

As technology continues to better understand voice, language and emotion, let’s work together to ensure that future use of voice technology is always respectful of users and their emotions, and that it serves to constructively assist and support all walks of humanity.

Today’s session draws on my own experiences and those of blind beta testers and users. I hope our insights, learnings and experiences can inform and improve voice services now and into the future.

Other Concepts and themes in the Presentation

The main focus of my presentation was the two case studies I’ve described here. I mentioned some of the following concepts in passing, but they are listed here as additional background information.

Observations on problematic real-life Interactions between Voice assistants and Accessibility Features

  • When VoiceOver is enabled, Siri shouts on the Apple Watch, even if VoiceOver volume is set to quiet.
  • In some iOS releases, Siri and VoiceOver on the iPhone often collided while trying to access sound system resources, causing hangs.
  • Voice apps often hear screen reader output and try to act on it.
  • Siri on iOS devices added typing input in place of speech recognition to accommodate people with speech issues, but it’s not possible to invoke Siri from an external keyboard.

Voice First

Voice First is the Amazon catch-cry, but everything beyond the low-hanging fruit is shunted to a screen-based app. Accessibility and usability of the Amazon Alexa app or the Google Home app for installation or configuration is arduous and in no way mirrors the voice simplicity of the device itself.

  • Complex set-up is a barrier to uptake and contrasts with the near effortless set-up of HomePod.
  • When it arrived, I spent one minute setting up my HomePod – have you tried setting up and tailoring an Echo product or a Google Home? It’s slow, tedious and rather complex and technical.
  • How will older people and people with mild cognitive issues ever do it?

Siri is multi-modal, and mainly offered on a touch-first platform.

  • Siri often assumes you will look at a screen – which doesn’t always work. If I ask my Apple Watch what time it is, it cheerfully responds with a quip about time such as “fruit flies like a banana, time flies like an arrow” – only 13 unnecessary syllables – but then doesn’t actually speak the time.
  • Misguided personality programming doesn’t properly prioritise the most relevant information. The example I played in the presentation was “It’s crunch time…”. This issue has finally been resolved in the late betas for watchOS 5.

Hear “What time is it?” Siri response

  • Now that Siri has launched on the HomePod, which has no screen, Apple’s multi-modal, touch-first model starts to show its biases. Calendars weren’t available at launch because they had been designed on the assumption that calendar details were presented on screen, with only minimal voice feedback.

  • The take-away is that even if you are multi-modal, there are many situations where users will be constrained to audio only, so this needs to be factored into designs.

Some inclusive Voice Experience Design considerations

Note that depending on the platform, these are usually issues beyond the control of a voice application designer, but longer-term these biases and factors need to be considered and addressed.

Is your choice of output voice aligned with your user base? Gender, ethnicity, accent, age, and personality? Do they feel included or separate?

Is your app’s voice suited to its purpose and audience? As a hypothetical, is a female voice (the default on all the leading assistants) going to work for a voice-oriented gay male dating app akin to Grindr?

  • Speech recognition has been found to be biased towards Anglo adults.
  • Voice models for computer-generated text-to-speech are biased towards US and UK white speakers;
  • African Americans have no easy way to use their screen readers with a voice that matches their linguistic community. Apparently, there are no African American speakers in Nuance’s voice portfolio.

Is your speech recognition engine able to handle speech impediments, stutters, nervous speech, shaky and broken speech? The Mozilla Speech Recognition Corpus Project may be an opportunity to include folks with different speech profiles, speech impediments etc.

Are your timeouts sufficient for people who speak slowly or take more time to formulate their requests/responses?

Does your service understand and respond to colloquial, informal terms and phrases from your users? This also has a bearing on how comfortable and accepted your users feel.

Touch Controls or buttons on Smart Speakers

Though not immediately apparent to everyone, eye-hand coordination should not be a design requirement for voice assistants, as they principally work in the auditory (non-visual) domain.

Google Home Mini and HomePod both employ touch-sensitive controls for volume adjustment, pause/play etc.

  • In the dark, when you are not awake, when the device is above your line of sight or when you are reaching past the device, touch controls can be wrongly triggered.
  • I am forever readjusting the volume of my Google Home Mini devices, or unintentionally resuming music playback on my HomePod when reaching for something else nearby.

The Amazon Echo, in contrast, has physical buttons with nominally differentiated tactile surfaces, so it can be operated by feel, in the dark. Though better design overall, physical buttons could be more problematic for people with physical disabilities to operate.

Voice apps and assistants currently perform somewhere around the level of a child or office junior

  • Don’t trust it with confidential information;
  • Don’t expect consistently good responses;
  • Expect an over-confident manner that contrasts with its actual capabilities;
  • It may have its mind elsewhere when you call on it – network or Alexa outages;
  • Expect unexpected or insensitive responses when you are busy and on task – Siri quips and Alexa laughter;
  • Expect frustration and the need to rephrase your request several times, and sometimes still be unsure whether you were accurately understood and whether what you requested was actually done;
  • Expect to often have to shout and call it by name (wake word) to attract its attention.

About Tim

Blind from birth, Tim Noonan is a voice experience designer, inclusive design consultant and an expert in voice & spoken communication.

Building on his formal background in cognitive psychology, linguistics and education, Tim has been designing and crafting advanced voice interfaces since the early 90s and was one of the principal authors of the Australian and New Zealand standard on interactive voice response systems, AS/NZS 4263.

Tim is the principal author of several other standards relating to automated voice systems, including automated telephone-based voting, telephone-based speech recognition and four industry standards on the accessibility of electronic banking channels and inclusive authentication solutions.

Tim has also been a pioneer in the accessibility field for more than three decades. He particularly loves working with emerging and future technologies to come up with innovative ways to make them more inclusive, effective and comfortable to use.

A career highlight for Tim was working as the lead Inclusive User Experience designer for iVote – a fully automated telephone-based and web-based voting system for the NSW Electoral Commission. iVote received Vision Australia’s Making A Difference Award and was recommended as the ‘Gold Standard’ for accessible voting.

For the last 25 years Tim has been leading the way in teaching, conceptualising and designing technologies that communicate with users through voice and sound – both for accessibility and mainstream users.
