Tuesday 20 October 2015

Reviews of Human-Computer Interaction of Automatic Speech Recognition Papers

This semester, one of my subjects is under "Special Topics" (meaning it's seasonal, and there's a high chance the topic won't ever be offered again in coming semesters): Human-Computer Interaction.

My childhood friend, who also happens to be our professor for this subject, Sir Teej (still getting used to calling him with an honourific, ROFLMAO), required us to write reports on five HCI papers from the ACM Digital Library published between 2012 and 2015.




Since I've decided to take the research route of studying, improving, and innovating on acoustics / sound signal analysis for practical uses (hence this blog), I started with the topic I'm strongly considering for my SP: automatic speech recognition. (I'm hopeful I'll delve into and concentrate more on bioacoustics in the future, focusing on uses for agriculture - because FOOOOOD~)

The abstracts of the papers, along with my reviews, are after the cut.

Note: This post is "live" and I'll edit most of it when I'm done with my reports and reviews. I just thought I should write things down, make a template, and publish it ASAP instead of ending up never posting it at all.



<Insert Sozi presentation when done reporting all five papers>

Title: Improving Automatic Speech Recognition Through Head Pose Driven Visual Grounding
Author(s): Soroush Vosoughi
Conference, Year, Session: CHI 2014 - Session: Applications of Body Sensing

Abstract: In this paper, we present a multimodal speech recognition system for real world scene description tasks. Given a visual scene, the system dynamically biases its language model based on the content of the visual scene and visual attention of the speaker. Visual attention is used to focus on likely objects within the scene. Given a spoken description the system then uses the visually biased language model to process the speech. The system uses head pose as a proxy for the visual attention of the speaker. Readily available standard computer vision algorithms are used to recognize the objects in the scene and automatic real-time head pose estimation is done using depth data captured via a Microsoft Kinect. The system was evaluated on multiple participants. Overall, incorporating visual information into the speech recognizer greatly improved speech recognition accuracy. The rapidly decreasing cost of 3D sensing technologies such as the Kinect allows systems with similar underlying principles to be used for many speech recognition tasks where there is visual information.
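
To wrap my head around the "visually biased language model" part, here's a rough Python sketch of how I picture it (my own illustration, not the paper's code; the object names, probabilities, and mixing weight are all made up): words for the objects the head pose says the speaker is attending to get extra probability mass, and the ASR n-best list is re-ranked with that biased model.

# My own rough sketch, NOT the paper's code: bias a unigram language model
# toward objects the head pose suggests the speaker is looking at, then
# re-rank ASR hypotheses with it. All names and numbers are made up.
import math

def bias_language_model(base_lm, attended_objects, boost=0.5):
    """Mix extra probability mass onto visually attended object words,
    then renormalize so the result is still a distribution."""
    biased = dict(base_lm)
    share = boost / len(attended_objects)
    for obj in attended_objects:
        biased[obj] = biased.get(obj, 0.0) + share
    total = sum(biased.values())
    return {word: p / total for word, p in biased.items()}

def pick_best_hypothesis(hypotheses, biased_lm):
    """Re-rank an ASR n-best list by acoustic score plus the log
    probability of its words under the biased language model."""
    def score(hyp):
        lm = sum(math.log(biased_lm.get(w, 1e-6)) for w in hyp["words"])
        return hyp["acoustic_score"] + lm
    return max(hypotheses, key=score)

# Toy usage: the speaker is looking at the mug and the laptop.
base_lm = {"the": 0.20, "red": 0.05, "cup": 0.02, "mug": 0.01, "laptop": 0.01}
biased = bias_language_model(base_lm, ["mug", "laptop"])
best = pick_best_hypothesis(
    [{"words": ["the", "red", "cup"], "acoustic_score": -4.0},
     {"words": ["the", "red", "mug"], "acoustic_score": -4.2}],
    biased)
print(best["words"])

In this toy run the visual boost is enough to flip the winner from "the red cup" to "the red mug", which is the kind of effect the paper is going for.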

Review & comments:

------------

Title: Warping Time for More Effective Real-Time Crowdsourcing
Author(s): Walter S. Lasecki, Christopher D. Miller, and Jeffrey P. Bigham
Conference, Year, Session: CHI 2013 - Session: Collaborative Creation

Abstract: In this paper, we introduce the idea of “warping time” to improve crowd performance on the difficult task of captioning speech in real-time. Prior work has shown that the crowd can collectively caption speech in real-time by merging the partial results of multiple workers. Because non-expert workers can not keep up with natural speaking rates, the task is frustrating and prone to errors as workers buffer what they hear to type later. The TimeWarp approach automatically increases and decreases the speed of speech playback systematically across individual workers who caption only the periods played at reduced speed. Studies with 139 remote crowd workers and 24 local participants show that this approach improves median coverage (14.8%), precision (11.2%), and per-word latency (19.1%). Warping time may also help crowds outperform individuals on other difficult real-time performance tasks.
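
The scheduling trick is what interested me: each worker only captions the stretches played to them at reduced speed, while the rest is sped up so they stay roughly in sync with real time. A toy Python sketch of that round-robin assignment (mine, not the authors' implementation; the segment length and speed factors are invented):

# Toy illustration, not the TimeWarp implementation: rotate which worker
# hears each segment at reduced speed. Numbers are invented.

def assign_playback_plan(total_seconds, num_workers,
                         segment_seconds=10, slow_factor=0.5, fast_factor=2.0):
    """Return, for each worker, a list of (start, end, playback_speed) so
    that every segment is played slowly to exactly one worker."""
    plans = {w: [] for w in range(num_workers)}
    start, segment_index = 0.0, 0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        slow_worker = segment_index % num_workers   # rotate the slow pass
        for w in range(num_workers):
            speed = slow_factor if w == slow_worker else fast_factor
            plans[w].append((start, end, speed))
        start, segment_index = end, segment_index + 1
    return plans

# Example: a 60-second clip shared across 3 workers.
for worker, plan in assign_playback_plan(60, 3).items():
    print("worker", worker, plan)

With three workers and these made-up numbers, one slow pass at 0.5x plus two fast passes at 2x takes the same wall-clock time as the original audio, so each worker keeps pace overall while still getting a slowed-down share to caption.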

Review & comments:

------------

Title: Effects of Public vs. Private Automated Transcripts on Multiparty Communication between Native and Non-Native English Speakers
Author(s): Ge Gao, Naomi Yamashita, Ari Hautasaari, Andy Echenique, Susan R. Fussell
Conference, Year, Session: CHI 2014 - Session: Multilingual Communication

Abstract: Real-time transcripts generated by automated speech recognition (ASR) technologies have the potential to facilitate communication between native speakers (NS) and non-native speakers (NNS). Previous studies of ASR have focused on how transcripts aid NNS speech comprehension. In this study, we examine whether transcripts benefit multiparty real-time conversation between NS and NNS. We hypothesized that ASR transcripts would be more beneficial when the transcripts were publicly shared by all group members as opposed to when they were seen only by the NNS. To test our hypothesis, we conducted a lab experiment in which 14 groups of native and non-native speakers engaged in a story-telling task. Half of the groups received private transcripts that were available only to the NNS; the other half received publicly shared transcripts that were available to all group members. NS spoke more clearly, and both NS and NNS rated the quality of communication higher, when transcripts were publicly shared. These findings inform the design of future tools to support multilingual group communication.
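
This one is more of a study than an algorithm paper, but the manipulation itself is easy to picture in code. A tiny sketch of the two transcript conditions (my own toy illustration, nothing from the paper's actual system):

# My own toy illustration, not the paper's system: "private" routes ASR
# transcript lines only to the non-native speakers, "public" to everyone.

def transcript_recipients(participants, condition):
    """participants: list of dicts like {"name": str, "is_nns": bool}."""
    if condition == "public":
        return [p["name"] for p in participants]
    if condition == "private":
        return [p["name"] for p in participants if p["is_nns"]]
    raise ValueError("condition must be 'public' or 'private'")

group = [{"name": "NS1", "is_nns": False},
         {"name": "NNS1", "is_nns": True},
         {"name": "NNS2", "is_nns": True}]
print(transcript_recipients(group, "private"))  # ['NNS1', 'NNS2']
print(transcript_recipients(group, "public"))   # ['NS1', 'NNS1', 'NNS2']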

Review & comments:

------------

Title: Improving Literacy in Developing Countries Using Speech Recognition-Supported Games on Mobile Devices
Author(s): Anuj Kumar, Pooja Reddy, Anuj Tewari, Rajat Agrawal, Matthew Kam
Conference, Year, Session: CHI 2012

Abstract: Learning to read in a second language is challenging, but highly rewarding. For low-income children in developing countries, this task can be significantly more challenging because of lack of access to high-quality schooling, but can potentially improve economic prospects at the same time. A synthesis of research findings suggests that practicing recalling and vocalizing words for expressing an intended meaning could improve word reading skills – including reading in a second language – more than silent recognition of what the given words mean. Unfortunately, many language learning software do not support this instructional approach, owing to the technical challenges of incorporating speech recognition support to check that the learner is vocalizing the correct word. In this paper, we present results from a usability test and two subsequent experiments that explore the use of two speech recognition-enabled mobile games to help rural children in India read words with understanding. Through a working speech recognition prototype, we discuss two major contributions of this work: first, we give empirical evidence that shows the extent to which productive training (i.e. vocalizing words) is superior to receptive vocabulary training, and discuss the use of scaffolding hints to “unpack” factors in the learner’s linguistic knowledge that may impact reading. Second, we discuss what our results suggest for future research in HCI.
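
The mechanic I care about for my own SP is the check-and-scaffold loop: prompt a word, listen, compare the ASR result against the target, and hint when it misses. A rough Python sketch (mine, not the prototype from the paper; the recognizer call is a placeholder and the hint scheme is made up):

# Rough sketch, not the paper's prototype: prompt a word, check the ASR
# result against it, and give a scaffolding hint after each miss.
# `recognize_speech` stands in for a real ASR call.

def practice_word(target_word, recognize_speech, max_attempts=3):
    """Return True once the learner's recognized speech matches the target."""
    for attempt in range(1, max_attempts + 1):
        heard = recognize_speech().strip().lower()
        if heard == target_word.lower():
            print("Correct! You said:", heard)
            return True
        hint = target_word[:attempt]      # reveal one more letter per miss
        print(f"Not quite (heard '{heard}'). Hint: it starts with '{hint}'.")
    print("Let's read it together:", target_word)
    return False

# Toy usage with a fake recognizer that mishears once, then gets it right.
answers = iter(["cart", "cat"])
practice_word("cat", recognize_speech=lambda: next(answers))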

Review & comments:

------------

Title: Voice Typing: A New Speech Interaction Model for Dictation on Touchscreen Devices
Author(s): Anuj Kumar, Tim Paek, Bongshin Lee
Conference, Year, Session: CHI 2012 - Session: Check This Out: Recommender Systems

Abstract: Dictation using speech recognition could potentially serve as an efficient input method for touchscreen devices. However, dictation systems today follow a mentally disruptive speech interaction model: users must first formulate utterances and then produce them, as they would with a voice recorder. Because utterances do not get transcribed until users have finished speaking, the entire output appears and users must break their train of thought to verify and correct it. In this paper, we introduce Voice Typing, a new speech interaction model where users’ utterances are transcribed as they produce them to enable real-time error identification. For fast correction, users leverage a marking menu using touch gestures. Voice Typing aspires to create an experience akin to having a secretary type for you, while you monitor and correct the text. In a user study where participants composed emails using both Voice Typing and traditional dictation, they not only reported lower cognitive demand for Voice Typing but also exhibited 29% relative reduction of user corrections. Overall, they also preferred Voice Typing.
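
The difference from traditional dictation is essentially streaming versus batch: surface every partial hypothesis while the user is still speaking, instead of dumping the whole transcript at the end. A minimal sketch of that contrast (my own illustration, assuming a hypothetical stream of partial ASR results):

# My own minimal sketch, not the Voice Typing system: the difference
# between batch dictation and showing partial hypotheses as they arrive.
# `stream` stands in for a hypothetical streaming recognizer's output.

def batch_dictation(partial_results):
    """Traditional model: nothing is shown until the utterance is done."""
    final = ""
    for text in partial_results:
        final = text                     # silently overwrite
    print("[all at once]", final)

def incremental_dictation(partial_results):
    """Voice-Typing-style model: surface every partial hypothesis so the
    user can spot and fix errors while still speaking."""
    for text in partial_results:
        print("[so far]", text)

stream = ["send the", "send the report", "send the report by Friday"]
batch_dictation(stream)
incremental_dictation(stream)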

Review & comments:
