EUROGRAPHICS Workshop on Sketch-Based Interfaces and Modeling (2007)
M. van de Panne, E. Saund (Editors)

Speech and Sketching: An Empirical Study of Multimodal Interaction

A. Adler¹ and R. Davis¹
¹ MIT CSAIL, 32 Vassar St, Cambridge, MA 02139, USA

Abstract

Sketch recognition can capture the sketching component of a multimodal conversation about design, but it does not capture information conveyed in the other modalities. The informal speech that accompanies a sketch often has a considerable amount of additional information. We want to develop a digital whiteboard capable of understanding both sketching and speech, and capable of participating in a conversation similar to one that the user would have with a human design partner. We conducted a user study to help us understand what kinds of conversations users would have with a whiteboard capable of recognizing a sketch. We report results that we believe will help guide the design of an effective multimodal interface and discuss implications for system architectures.

Categories and Subject Descriptors (according to ACM CCS): H.5.2 [Information Interfaces and Presentation]: User Interfaces - Natural language, Graphical user interfaces, Evaluation/methodology, Input devices and strategies, Interaction styles, User-centered design, Voice I/O

1. Introduction

Sketching is widely used in the early stages of design [Dav02, UWC90]. However, the sketch alone may not tell the whole story, because some information is notoriously difficult to express graphically. Sketching is often accompanied by speech that, although informal, still conveys a considerable amount of information. Interaction about the design with another person helps work out details and uncover mistakes. To illustrate the importance of speech, consider the sketch of a robot provided by one of our subjects and compare it to the photograph of the robot (Figure 1).
It's hard to make any sense of the sketch alone, but the accompanying speech identified the parts of the sketch and how they fit together.

Figure 1: Left: a sketch of a robot; right: the robot.

Our goal is to make the computer a collaborative partner for early design, moving beyond a system that interprets sketches alone to a multimodal system that incorporates speech and dialogue capabilities. Although there are systems that allow the user to utter simple spoken commands to a sketch system [CJM*97, DKD05, Kai05], our long-term goal is to move beyond simple commands to create a multimodal digital whiteboard capable of a more natural conversation with the user. Instead of simple spoken commands (like uttering "block" while pointing), we want the user to provide a narrative while sketching. A complete understanding of unrestricted narrative is, of course, unduly difficult; our goal is to have the system understand enough of the sketch and enough of the speech to engage the user in a sensible conversation.

So far we have focused on the sketching interface and the user study described in this paper. We are working on the modeling and interaction parts of the system.

© The Eurographics Association 2007.
Appeared in SBIM '07: Proceedings of the 4th Eurographics Workshop on Sketch-Based Interfaces and Modeling, pp. 83-90.
A. Adler & R. Davis / Speech and Sketching: An Empirical Study of Multimodal Interaction

Traditional dialogue and command-driven systems make many assumptions about what computer-human interaction should be like, and the dialogues are typically quite structured. Although such approaches are tractable, well understood, and sometimes quite useful, they might not be well suited for open-ended domains such as design. To understand this issue better, we conducted a study of human-human interaction aimed at eliciting requirements for a multimodal conversational design assistant.

In the next section we describe the user study and explain how data was annotated.
The following sections describe the qualitative and quantitative results of the study, including how sketching, language, multimodal input, questions, and comments were used by the study participants. We then discuss the implications for the system architecture and conclude with a discussion of related work.

Figure 2: Schematic views of the full adder and the AC/DC transformer that the participants could choose to view. (a) Full Schematic Adder (b) AC/DC Transformer Schematic

2. User Study

Other systems that let users sketch and speak are typically limited in one or more of the following dimensions:

  • Command-based speech: The user talks to the system differently than they would talk to a person, issuing short commands, not natural speech (e.g. [OD01]).
  • Unidirectional communication: The system cannot ask questions or add things to the sketch (e.g. [Kai05]).
  • Annotation instead of drawing: The user can only annotate an existing representation, not use free-form drawing (e.g. [JEW*02]).
  • Fixed set of graphical symbols: The user has to know a fixed symbol vocabulary (e.g. [OCW*00]).

The goal of our study was to relax these constraints and look at a bidirectional conversation with more narrative speech and unrestricted sketching.

textual description of the circuit and a list of suggested components. They had the option of viewing a schematic of the transformer and adder circuits (Figure 2) before they began drawing, but the schematic was not visible while they were drawing.

The experimenter and participant sat across a table from each other, each with a Tablet PC. We considered having a physical barrier between the experimenter and the participant, but didn't because a barrier would have created an unnatural environment and obstructed the video recording. In order to encourage all communication to be done by interacting with the drawing surface, the experimenter looked at his tablet and avoided eye contact with the participant.
The Tablet PCs were equipped with software we designed that replicates on each tablet, in real time, whatever is drawn on the other tablet, in effect producing a single drawing surface usable by two people at once. It is possible that interactions with the Tablet PCs differ from the interactions with a whiteboard-sized device, but for this study we used Tablet PCs.

The software allowed the users to sketch and annotate the sketch using a pen and a highlighter. Buttons above the sketching area allowed users to switch between five pen colors and five highlighter colors. Another button allowed users to switch into or out of a pixel-based erase mode, allowing either user to erase parts of any stroke. Finally, there was a button that allowed either user to create a new blank page. The software recorded the (x, y) position, time, and pressure data for each point drawn by either user. To enhance the feeling of naturalness, strokes were rendered so that they were thicker when the user applied more pressure.

Two video cameras and headset microphones were used to record the study. The audio, video, and sketching inputs were synchronized. This enabled playback and analysis of the timing of the speech and sketching events. At various points in the study the experimenter added to the sketch and asked questions about different components.

2.2. Data Annotation

At the conclusion of the study we had two movie files (one for the participant and one for the experimenter) for each of

2.1. Study Setup

Ideally we would have conducted a Wizard-of-Oz study in which responses to the participant would appear to be coming from the computer. Given the open-ended nature of the speech and sketching we wanted to capture, we determined that it would be too difficult to obtain a responsive and natural feeling in a Wizard-of-Oz study.
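The per-point recording described above ((x, y) position, time, and pressure) and the pressure-dependent stroke thickness can be pictured with a small data-structure sketch. This is not the study software: the class names, the millisecond time unit, the normalized pressure range, and the width limits are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InkPoint:
    x: float
    y: float
    t_ms: int        # timestamp; milliseconds is an assumed unit
    pressure: float  # assumed to be normalized to the range [0, 1]


@dataclass
class Stroke:
    author: str      # "participant" or "experimenter"
    color: str
    points: List[InkPoint] = field(default_factory=list)


def render_width(pressure: float, min_width: float = 1.0, max_width: float = 4.0) -> float:
    """Map pen pressure to a line width so harder presses draw thicker ink.

    The linear mapping and the width limits are hypothetical choices.
    """
    p = min(max(pressure, 0.0), 1.0)  # clamp out-of-range pressure values
    return min_width + p * (max_width - min_width)


stroke = Stroke(author="participant", color="blue")
stroke.points.append(InkPoint(x=10.0, y=20.0, t_ms=0, pressure=0.5))
widths = [render_width(pt.pressure) for pt in stroke.points]
```

Keeping the author on each stroke mirrors the shared drawing surface, where either person's ink must be distinguishable during replay.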
We view the study as one step in developing a system, which will help determine whether conversations with the computer are similar to conversations with another person.

Eighteen subjects participated in the study, all of them students in the Introductory Digital System Laboratory class at MIT. Participants were instructed to sketch and talk about four different items: a floor plan for a room with which they were familiar, the design for an AC/DC transformer, the design for a full adder, and the final project they built for their digital circuit design class. In addition, there were instructions and a warm-up condition to familiarize the participants with the system and the interface. For the AC/DC transformer and the full adder, the participants were given a

Figure 3: A sketch of a participant's project.

Experimenter: so then what's what's um this piece what's that
Participant: that would be the mux for the data input actually
Participant: that was a uh uh yeah a memory bank with five hundred and twelve um yep five hundred and twelve bits this ah i could that i had read and write access to

the four items the users were asked to draw, along with one XML file [Ske] for each page of sketching. The XML files contained a full record of the sketching by both the participant and the experimenter. We created software that replayed the study by using these data streams. The software also allowed us to select parts of the audio tracks for playback and transcription. The transcript and audio segments were passed to the Sphinx speech recognizer [HAH*93] forced-alignment function, which produced precise timestamps for each word. The transcripts were verified by playing the segment of the audio file and confirming that it contained the correct word.

3. Study Analysis

Our analysis of the study has focused on how speech and sketching work together when people are interacting with each other.
Figure 3 shows a sketch, and Figure 4 illustrates the type of speech that accompanied it. In general, the sketches contained the circuit itself and additional strokes related to its function or identification of its components. In Figure 5, the sketch contains the AC/DC converter and arrows indicating the flow of current in each of two operating conditions. Highlighter strokes are used to identify components in the circuit.

Data from 6 of the 18 participants have been processed as described above. Each of the six datasets has data from each task (i.e. the warm-up and four sketching tasks). The total length of the data is approximately 105 minutes, about 17.5 minutes of data for each participant. The participants drew 2,704 strokes and 74 erase strokes, and spoke 10,848 instances of 1,177 words. The experimenter drew 155 strokes and 3 erase strokes, and uttered 2,282 instances of 334 words. The rest of the data has not yet been transcribed because of the time-consuming nature of the transcription and annotation process.

Our ongoing qualitative analysis of the recorded and transcribed data has led to a series of initial observations. We have divided the observations into five categories: sketching, language, multimodal interaction, questions, and comments. Although these categories aren't mutually exclusive, they help frame the observations and our discussion.

Figure 4: Fragments of the conversation accompanying Figure 3. Notice the disfluencies and repeated words.

Figure 5: A sketch from the user study of an AC/DC transformer.

3.1. Observations about Sketching

Our observations about sketching can be divided into two categories: stroke statistics and the use of color. As part of our analysis, we labeled each stroke as one of four types: creation, modification, selection, and writing. Creation strokes accounted for 52% of the strokes, writing strokes accounted for 40%, selection strokes accounted for 5%, and modification strokes accounted for 3%.
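A breakdown of this kind can be computed from labeled strokes in two ways: by stroke count and by amount of ink. The sketch below is a minimal illustration, not the analysis code used in the study. It assumes each stroke is a (label, points) pair and that "amount of ink" means the polyline length of the stroke, which is our assumption; the sample strokes are made up.

```python
import math
from collections import defaultdict


def polyline_length(points):
    """Sum of straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def type_shares(strokes):
    """Return (count_share, ink_share): fraction of strokes and of ink per label.

    strokes: list of (label, points) pairs, where points is a list of (x, y).
    """
    counts = defaultdict(int)
    ink = defaultdict(float)
    for label, points in strokes:
        counts[label] += 1
        ink[label] += polyline_length(points)
    n = sum(counts.values())
    if n == 0:
        return {}, {}
    total_ink = sum(ink.values())
    count_share = {k: v / n for k, v in counts.items()}
    ink_share = {k: (v / total_ink if total_ink else 0.0) for k, v in ink.items()}
    return count_share, ink_share


# Hypothetical data: one long creation stroke and two short writing strokes.
sample = [
    ("creation", [(0, 0), (6, 0)]),
    ("writing", [(0, 0), (1, 0)]),
    ("writing", [(0, 0), (0, 1)]),
]
count_share, ink_share = type_shares(sample)
```

In the made-up sample, writing is two thirds of the strokes but only a quarter of the ink, which is the same kind of gap between the two measures that the study reports.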
Looking at the total amount of ink in the sketches, 63% was from creation strokes, 21% was from writing strokes, 8% was from selection strokes, and 8% was from modification strokes. The percentage of ink that is writing more accurately reflects the composition of the sketches than stroke count, because multiple strokes are frequently used for a single letter. A low percentage of strokes are selections, but these are very important strokes to understand because they are key to understanding the user's action.

The average number of colors used in a sketch was 3.0 (with a standard deviation of 1.8). The number of times the color was changed in a sketch was much more variable: the average number of color changes was 4.7, but the standard deviation was 6.5. There were a few sketches where the participant changed the color more than 10 times; once, during a long sketch, a participant changed the color 28 times.

Figure 6: Color was used to indicate corresponding areas of the sketch.

"the result will be R whereas... if so let's let's eh the result will be R... is that if the carry in is carry eh if the carry in is one then the result here will be R this is in case the carry in is one."

The speech here is ungrammatical, disfluent, and repetitive, clearly making it more difficult for a speech recognition system. However, the repetition of the key words "result," "carry in," and "R" should allow us to identify them as the key concepts being discussed. The repetition could also provide evidence that the user is thinking about what to say. This evidence about user uncertainty could help a system determine that the user is interruptible.

Participants' responses to questions tended to reuse vocabulary from the question. For example, when asked "so is this the is that the diode?" the participant replied: "this is the diode yeah."
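The vocabulary reuse in this example can be quantified with a simple word-overlap measure. The function below is a hypothetical illustration, not something the study computed: it reports the fraction of words in the reply that already appeared in the question, using plain whitespace tokenization.

```python
def reuse_fraction(question: str, reply: str) -> float:
    """Fraction of reply words that also occur in the question (case-insensitive)."""
    question_words = set(question.lower().split())
    reply_words = reply.lower().split()
    if not reply_words:
        return 0.0
    reused = sum(1 for word in reply_words if word in question_words)
    return reused / len(reply_words)


question = "so is this the is that the diode"
reply = "this is the diode yeah"
print(reuse_fraction(question, reply))  # prints 0.8
```

Four of the five reply words ("this", "is", "the", "diode") come from the question. A recognizer could, under this assumption, bias its language model toward the question's vocabulary when decoding the answer.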
A system could expect a response to questions to have phrasing similar to the question, facilitating the speech recognition task.

Not unexpectedly, we found that the participants' speech relates to what they are currently sketching. For example, in one sketch the participant is drawing a box and while drawing it says "so let's see we got the power converter over here"; the box is the representation of the power converter he is talking about. This may facilitate matching the sketching and speech events, as they are roughly concurrent.

3.3. Multimodal

We encountered three varieties of multimodal interaction between the speech and sketching inputs: referencing lists of items, referencing written words, and coordination between input modalities.

Participants in the study would often verbally list several objects and sketch the same objects in the same order. For example, when sketching a floor plan one participant said "eh so here I got a computer desk here I got another desk and here I got my sink," while sketching the objects in the same order. In another sketch, a participant drew a data table and spoke the column labels aloud in the same order that he sketched them. The consistent ordering of objects in both modalities provides another method for associating sketched objects with the corresponding speech.

Participants who wrote out words such as "codec" or "FPGA" referenced these words in their speech, using phrases such as "so the the codec is pretty much built in into the like uh standard um eh standard uh FPGA interface." If the handwriting can be recognized, this information can help identify the words in the speech input, as has been done in [Kai05]. Participants also wrote abbreviations for spoken words, for example "Cell." for "Cellular." Recognizing these textual abbreviations will also help find correspondences between the sketch and the speech.

As noted, we found that speech often roughly matches whatever is currently being sketched.
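Because speech and sketching are roughly concurrent, one simple way to associate them is by time overlap. The sketch below is our illustration and not the authors' system: it assumes word-level timestamps have been grouped into phrases, that each stroke has a start and end time in seconds, and that a small slack window (the `slack` parameter, a made-up value) tolerates the two modalities being slightly out of step.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the time interval shared by [a_start, a_end] and [b_start, b_end]."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def align(phrases, strokes, slack=0.5):
    """Pair each phrase with the ids of strokes drawn at about the same time.

    phrases: list of (text, start, end); strokes: list of (stroke_id, start, end).
    Times are in seconds. Returns a list of (text, [stroke_id, ...]).
    """
    result = []
    for text, p_start, p_end in phrases:
        matched = [
            stroke_id
            for stroke_id, s_start, s_end in strokes
            if overlap(p_start - slack, p_end + slack, s_start, s_end) > 0
        ]
        result.append((text, matched))
    return result


# Hypothetical timings, not data from the study.
phrases = [("power converter", 1.0, 2.0), ("sink", 10.0, 11.0)]
strokes = [("s1", 0.8, 1.9), ("s2", 5.0, 6.0), ("s3", 11.2, 12.0)]
pairs = align(phrases, strokes)
```

When several objects are listed aloud and drawn in the same order, the order of phrases and stroke groups could serve as a second cue wherever the time windows are ambiguous.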
Subjects indicated a

We found that color was used in several different ways: to identify regions that were already drawn, to differentiate objects, and to add an "artistic" character. We consider each of these in turn.

Color was frequently used to refer back to existing parts of the sketch and/or to link different parts together. In Figure 6, for example, three different colors were used to indicate the correspondence between different parts of the sketch, a crucial step in understanding it.

We found that a color change is an excellent indication that the user is starting a new object. Segmenting strokes into objects is difficult because there are numerous ways to group the strokes. Color can aid segmentation by providing a good clue about which strokes should be grouped together.

Color was also used to reflect the real-world colors of objects, including, for example, bodies of water. These colors can aid in segmenting the input, but also have deeper meaning because they relate to real-world objects and associations. This would allow color references in the speech to be matched with the colors and objects in the sketch.

Others have explored the use of color in sketching, e.g. Classroom Presenter [AHWA04]. They found that the frequency of color changes varied by presenter, but that color changes were used for contrast and to visually distinguish objects. Our more complicated sketches seemed to use more colors, which concurs with the findings of [AHWA04].

3.2. Language

The language chosen by participants provided several valuable insights. The most readily apparent observation is that the speech tended to be highly disfluent, with frequent word and phrase repetition. This phenomenon appears to occur more frequently when participants are thinking about what to say. Second, participants' responses to questions posed to them tended to reuse words from the question. Third, not unexpectedly, the speech utterances are related to the current sketching.
We address each of these observations in turn.

The repetition of words or phrases occurred more frequently when participants were thinking about what to say. One participant, describing the output "R" of a circuit, said:

same. A participant's reply might be that both gates are AND gates, or that one is an AND gate and one is an OR gate.

These elaborated answers to questions were an unexpected result of the study. Asking questions keeps the participant engaged and encourages them to continue talking. The resulting additional speech and sketching data would give a system a better chance of understanding the sketch. The interaction also appears to encourage the participants to provide more information about the sketch, and it appears to cause the participants to think more critically about the sketch so that they spot and correct errors or ambiguities. Even simple questions like "Are these the same?" seem to be enough to spark an extended response from the participant, especially if there is a subtle aspect of the objects that was not previously revealed.

3.5. Participant Comments

tendency to enforce this coordination: if a subject's speech got too far ahead of their sketching, they typically slowed down or paused their speech to compensate. There were many examples in the study where the participant paused their speech to finish drawing an object, then continued. For example, one participant said "and that's also a data out line" and then finished writing "Data out" before continuing the speech. In another case, a participant said "um you come in and" and then paused while he finished drawing an arrow to indicate the entrance to the room. These observations provide additional data that the two modalities are closely coordinated. We can use this relationship in a system to help match speech utterances with sketching.
Participants made several comments during the study that did not relate directly to the sketch but still provided valuable information. Uncertainty was indicated through the use of phrases such as "I believe" or "I don't remember." Some comments related to the user interface, for example "I'll try to use a different color." Other comments referenced the appearance of the sketch. Two examples of this type of comment are: "it's all getting a little messy" and "I'll draw openings like this I don't know... I draw li... I drew like a switch before." These comments still provide insight into the participant's actions and could help a system understand what the users are doing, but don't relate directly to the sketching.

Another observation from the study is that the participant and the experimenter are expected to be able to fill in words that their partner forgot. For example, one participant expected the experimenter to help with forgotten vocabulary, and another participant filled in a word the experimenter forgot. This might be another way that a system could interact with the user, saying something like "And this is ah..." and pausing, prompting the user to identify the object.

4. Quantitative Analysis

Work in [ODK97] reports on a series of user studies in which users interacted multimodally with a simulated map system. They examined the types of overlap that occurred between the speech and sketching, finding that the sketch input preceded the speech input a large percentage of the time. The nature of the overlap is important to properly align the speech and sketching inputs. More recently, [KBEC07] examined the timing of speech and handwriting.

We performed a similar analysis of our data, matching corresponding speech phrases and sketching events. For example, we matched the speech utterance "so we have an arrangement of four diodes" with the strokes making up the diodes that were sketched concurrently.
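Once phrases and sketching events have been matched, the timing relationship between them can be summarized by comparing start times. The sketch below is a hypothetical illustration of that kind of measurement, not the analysis performed in the paper: each pair holds the start time of a speech phrase and of its matched sketch group, in seconds, and all values are made up.

```python
def onset_lags(pairs):
    """Speech start minus sketch start; positive means the sketch began first."""
    return [speech_start - sketch_start for speech_start, sketch_start in pairs]


def fraction_sketch_first(pairs):
    """Share of matched pairs in which sketching started before the speech."""
    lags = onset_lags(pairs)
    if not lags:
        return 0.0
    return sum(1 for lag in lags if lag > 0) / len(lags)


# Hypothetical (speech_start, sketch_start) pairs in seconds.
matched = [(2.0, 1.5), (4.0, 4.5), (7.0, 6.0), (9.0, 8.0)]
lags = onset_lags(matched)
share = fraction_sketch_first(matched)
```

A distribution of such lags, rather than a single fraction, would show how wide a time window an alignment step needs to search.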
We segmented

Figure 7: Left: the original sketch; right: after revision. One data output line in the original image has been replaced by three in the revised image. (a) Original (b) Revised

3.4. Questions

When the experimenter asked the participants questions, the participants often made revisions or explained their design in more depth than the questions required.

Some questions caused the participant to make the sketch more accurate, as in Figure 7. When the experimenter asked if the three outputs highlighted in green (Figure 7(a)) were the same, the participant realized that the original sketch was inaccurate, then revised it by replacing one data output with three separate lines (Figure 7(b)).

Questions about one part of the sketch also spurred explanations about other parts, as participants apparently decided, based on the question asked, that other parts of the sketch might be confusing as well. When one participant was asked about a label for a column in a data table, he not only clarified that label, he also explained the other four labels.

Comparison questions also encouraged participants to explain the sketch in more detail by explaining how the parts were or were not similar. For example, participants were asked if several different gates in the full adder were the