
Int. J. Human-Computer Studies 65 (2007) 404–418 www.elsevier.com/locate/ijhcs

Experimental evaluation of five methods for collecting emotions in field settings with mobile applications

Minna Isomursu, Marika Tähti, Soili Väinämö, Kari Kuutti

University of Oulu, P.O. Box 3000, FI-90014 University of Oulu, Finland

E-mail addresses: minna.isomursu@oulu.fi (M. Isomursu), marika.tahti@oulu.fi (M. Tähti), [email protected] (S. Väinämö), kari.kuutti@oulu.fi (K. Kuutti).

Available online 9 January 2007

Abstract

This paper presents experiences of using five different self-report methods, two adopted from the literature and three self-created, for collecting information about emotional responses to mobile applications. These methods were used in nine separate field experiments conducted in naturalistic settings. Based on our experiments, we can argue that all of these methods can be successfully used for collecting emotional responses to evaluate mobile applications in mobile settings. However, differences can be identified in the suitability of the methods for different research setups. Even though self-report instruments provide a feasible alternative for evaluating emotions evoked by mobile applications, several challenges were identified, for example, in capturing the dynamic nature of mobile interaction, usage situations and contexts. To summarise our results, we propose a framework for selecting and comparing these methods for different usage purposes.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Emotions; User experience; Mobile applications

1. Introduction

When human–computer interaction (HCI) research moved from desktop applications towards mobile devices and services, the practical problems related to carrying out evaluations in mobile environments rapidly became apparent. It is commonly accepted that data collection for the evaluation of mobile services and devices is a central challenge and that novel methods must be found for it (Jordan, 2000; Norman, 2004). This paper reports our experience of using self-report methods for evaluating mobile applications in mobile settings from the viewpoint of the emotional responses evoked in a usage situation.

This paper explores the role of affect as an essential component of the user experience (Forlizzi and Battarbee, 2004) in human–technology interaction. The information about the affect aroused
is not used by the computer, as in affective computing (Hudlicka, 2003; Picard, 1997); rather, it is used by designers as a basis for design decisions aiming at the best possible user experience. Emotions are at the heart of user experience: they affect how we plan to interact with products, how we actually interact with them, and the perceptions and outcomes that surround those interactions (Forlizzi and Battarbee, 2004). We interpret user experience and related emotions here through interaction (see Forlizzi and Battarbee, 2004). As Battarbee (2004) summarises, the difference between seeing emotions as responses to designed products and seeing them as part of interaction is significant. When we try to understand user experience and related emotions through use and interaction, we have to examine the user and the context in a broad sense.

As our goal is to find out about the affects aroused during the use of technology, we need instruments for collecting information about emotions. In this paper, we examine the use of self-report instruments (discussed, for example, by Csikszentmihalyi and Larson, 1987; Desmet, 2002; Lang, 1980; Reijnveld et al., 2003), which are especially suited for assessing subjective feelings
(Desmet, 2002). Self-report instruments often utilise verbal scales or protocols, but non-verbal instruments have also been developed (e.g. Lang, 1980; Desmet, 2002). In this paper, we adopt two non-verbal instruments, SAM (Lang, 1980) and Emocards (Desmet et al., 2001). We also present three new instruments which use verbal and non-verbal scales or languages, some combining both, for collecting information about emotions. These new methods have been developed especially for mobile evaluation settings.

Other types of instruments which can be used for collecting information about emotions, but which are not examined in this paper, include instruments measuring physiological reactions (Ark et al., 1999; Partala et al., 2000) and instruments measuring expressions (Ekman and Friesen, 1978; Kaiser and Wehrle, 1994; Litman and Forbes, 2003). The measurement of physiological reactions relies on instruments which measure one or more bodily reactions (heart rate, pupil dilatation, etc.) associated with emotions. The methods which measure expressions concentrate on facial or vocal expressions. The instruments relying on self-reporting utilise rating scales, verbal protocols (verbal methods) or pictograms (non-verbal methods) which the user employs to describe his or her emotions.

Research indicates (Desmet, 2002; Reijnveld et al., 2003) that the emotions elicited by products are difficult to verbalise. To overcome this problem, self-report instruments which use pictograms have been developed for non-verbal emotion description. Instruments like the Self-Assessment Manikin (SAM) (Lang, 1980) and Emocards (Desmet et al., 2001) use pictograms for measuring generalised emotional states. Other instruments, such as Product Emotion Measurement (PrEmo) (Desmet, 2002), have been developed to measure distinct emotions. According to Dormann (2001), the SAM instrument has been successfully used in marketing and for the emotional evaluation of company home pages. Emocards have been applied to identify emotions elicited by mobile phones (Desmet et al., 2001) and office chairs (Reijnveld et al., 2003). PrEmo has been used to evaluate the emotions which different car types elicit (Desmet, 2002).

Capturing emotions, especially in a changing environment, is not a simple task. In an evaluation situation, challenges in capturing emotions occur on several levels and in several phases. First, the evaluation setting must allow users to experience emotions which are as close as possible to the emotions they would experience in real usage situations. Secondly, as emotions are subjective experiences, the evaluation method should somehow capture expressions of emotional responses which can be recorded and used by designers for analysis. Thirdly, as user experience is dynamic (Forlizzi and Ford, 2000), a user feels several different emotions during the process of interaction. For example, at first he or she can be happy and excited about a new product, but later become disappointed, sad or even angry if and when problems occur. Thus, emotions must be sampled several times during the use of the
product, which often means organising long-term tests. Fourthly, interpreting captured expressions of emotions is difficult. For example, interpreting emotions from facial expressions captured on video has been an active and debated area of research for decades (see discussion in Schiano et al., 2004).

Mobile devices and applications are becoming increasingly common. Evaluating these applications can be a challenging task even without emotion collection. The difficulty of evaluation stems from the nature of mobility: standard laboratory testing methods do not fit as such, because they are not designed for mobile use or real-life contexts (Palen and Salzman, 2002). Mobile applications may be used in unpredictable situations, and the usage may involve other interwoven actions too. Therefore, for evaluating mobile applications, we need to complement laboratory evaluations (Beck et al., 2003) with real-life field evaluations.

In the following sections, we report experiences from five different self-report methods we have used for collecting emotions and user experiences from mobile users in real environments. SAM (Lang, 1980) and Emocards (Desmet et al., 2001) were adopted from the literature. Because of shortcomings we discovered with SAM (Tähti et al., 2004a) and Emocards (Tähti et al., 2004b), we were motivated to develop new emotion collection methods designed especially for field tests where the user is using a mobile terminal (e.g. a smartphone). The new methods are Experience Clip (Isomursu et al., 2004), Expressing Emotions and Experiences (3E) (Tähti and Arhippainen, 2004) and the Mobile Feedback Application (Arhippainen et al., 2004).

2. Field experiments

This section gives an overview of the field experiment settings in which the selected methods were used. All field experiments were performed in Oulu, Finland during the years 2001–2005. Self-report methods were used in conjunction with other evaluation methods for collecting user experience information in real-environment field experiments. All nine field experiments evaluated demonstrations of context-aware mobile applications. They were related to research projects done at the University of Oulu, and the main resources for evaluation were graduates working in the research projects, collecting material for their PhD theses. The field experiments were conducted in several separate research projects, all aiming to explore new context-sensitive mobile applications from different viewpoints, including technical, economic and usability factors. The work reported here is a retrospective analysis of the research on the collection of emotions done during these nine field experiments. Table 1 presents a summary of the field experiments.

The CAPNET field tests involved a context-sensitive office application. The users were professionals working in a wide range of domains, both technical and non-technical. The applications included context-sensitive reminders, map services, office resource management, etc. (Tähti et al., 2004a; Perttunen et al., 2005).


Table 1
Information about field experiments

Method           Data type     Field test name    Device       Number of accepted test users   Total number of participants in each method
SAM              Quantitative  CAPNET proto 2     PDA           6                              15
                               CAPNET IM          PDA           9
Emocards         Quantitative  SmartRotuaari 1    PDA           5                              61
                               SmartRotuaari 1    PDA           8
                               Rotuaari           Smartphone   48
3E (in diary)    Qualitative   SmartLibrary       PDA          20                              20
Feedback app     Quantitative  Adamos Menu        Smartphone   12                              12
Experience Clip  Qualitative   SmartRotuaari 1    Smartphone   36                              50
                               SmartRotuaari 2    Smartphone   14

The SmartRotuaari and Rotuaari field tests evaluated context-sensitive applications within a city setting. The applications included personalised context-sensitive advertising, tourist information services, etc. (Ojala et al., 2003). The users were recruited on city centre streets, i.e. they were people spending time in central Oulu. The Adamos field test evaluated a personalised and location-sensitive service menu for smartphones. The users were IT professionals (Arhippainen et al., 2004). The SmartLibrary field test evaluated a map-based book location service at the University of Oulu library. The users were students visiting the library (Aittola et al., 2003).

3. Experiences from selected methods

This section both introduces the methods and summarises the experiences we had when applying them. To begin with, we explain the framework used for analysing the experiences related to the selected methods. Next, we present the principles of each method used in our field trials and summarise our experiences. First, we analyse experiences from adopting two methods introduced in the literature, namely the SAM and Emocards methods. Then, we proceed by introducing three new methods which we have developed and applied to overcome the problems encountered in applying the existing methods in mobile settings.

3.1. Analysis framework

Based on our experiences of evaluating mobile applications, our goal was to find methods which can be used in a test setting that is as realistic as possible. In order to capture emotions which users would experience in normal use, we wanted to avoid laboratory settings and artificial test setups. To accomplish this, we identified the following needs:

- The users should be as "real" as possible. They should be users who would use the product in their normal life.
- The product should be used in real-life situations. We avoided assigning test tasks to the users. Instead, the test setting should encourage the users to use the product in their everyday life, in tasks which they would be performing even when not evaluating a product.
- Mobility and the physical context of use should not be restricted. The users should be able to move as they like, in the way they would normally move, and in the places they would be if they were not performing an evaluation.
- The analysis method should not require physical instruments which need to be attached to the user, or visibly attached to the product under evaluation.
- The researcher should not be present during use. Our experience shows that the presence of a researcher affects both how the user uses the product and how he or she reports emotions during use. We wanted to eliminate the physical presence of a researcher, even though we realise that the users were aware that the results would be analysed by the researchers afterwards.

To verify whether this goal was achieved, we analysed the experiences from each selected method from two main viewpoints: that of the user and that of the designer. From the user's point of view, we analysed the following:

- Fit: how well does the user feel he or she is able to express the emotions related to the use?
- Usability: does the method feel intuitive and easy to use? Are the method and its instruments easy to use, learn and understand? What effort does the method require before, during and after use?
- User experience: what kind of user experiences did the method and related instruments evoke?
- Disturbance: does the method render the usage situation unnatural?

From the designer's viewpoint, we analysed the following:

- Input for design: can the results provide valuable data for design? Does the method provide information for tracking which design decisions affect the emotions expressed? How can the results of the evaluation be presented to those who make decisions (both managerial and design decisions)?
- Interpretation: how problematic is it to interpret the captured expressions of emotion and the aspects causing the emotions?
- Validity: is the data provided by the method valid? Does it reflect real usage situations?

For exploring the last issue, i.e. evaluating how valid the data collected with each method is, we primarily used methodological triangulation. This was done by using more than one emotion collection method in each experiment and then comparing the collected results. The methods used for the triangulation were usually interviews and questionnaires, as these are well-known tools for qualitative research and provide descriptive data which can be used for finding reasons for deviations or possible faults in the methods. The data collected with the triangulation methods was used for validating the data collected with the analysed method. If inconsistencies were found, their reasons were explored in more detail.

Naturally, the weakness of triangulation as a validation method is that we cannot say whether an inconsistency is caused by a fault in the method we are evaluating or introduced by a problem in a method we are using for the triangulation. For example, using interviews as the triangulation method in evaluating the validity of expressions of emotions is challenging, as it is very difficult to verbalise emotional responses (which is the starting argument of this research). However, triangulation can point out issues which need further investigation, even if not all of those issues are caused by faults in the methods.

We also used the subjective opinion of the users for validating the data provided by the methods under evaluation. This means that we asked our research subjects whether they were satisfied with how they were able to express their emotions with the method as their aid. To complement this, the subjective opinion of the designers was used: we asked how confident the designers were that they were able to interpret and understand the expressions of emotions correctly.

In addition, we performed controlled experiments for measuring the reliability and validity of the new methods proposed in this paper. The experiments were performed by comparing the results obtained with a new method against results produced with existing emotion collection methods. However, as these experiments were done in artificial and strictly controlled research settings, we leave them out of this paper, as our goal here is to present experiences of using these methods in real-world settings.
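As an illustration of the triangulation step just described, the comparison between two instruments can be sketched as follows. This is a hypothetical sketch, not part of the original study's tooling: the function name, the rating scales and the flagging threshold are all assumptions.

```python
# Hypothetical triangulation check: valence ratings of the same sessions,
# collected with two methods, are compared; sessions where the methods
# disagree strongly are flagged for follow-up interviews.
from scipy.stats import spearmanr

def flag_inconsistencies(method_a, method_b, threshold=2):
    """method_a, method_b: dicts mapping session id -> valence rating (e.g. 1-9)."""
    shared = sorted(set(method_a) & set(method_b))
    a = [method_a[s] for s in shared]
    b = [method_b[s] for s in shared]
    rho, p = spearmanr(a, b)  # overall agreement between the two instruments
    flagged = [s for s in shared if abs(method_a[s] - method_b[s]) >= threshold]
    return rho, p, flagged

sam_valence = {"u1": 7, "u2": 3, "u3": 8, "u4": 2}        # illustrative values
interview_valence = {"u1": 6, "u2": 2, "u3": 4, "u4": 2}  # illustrative values
rho, p, follow_up = flag_inconsistencies(sam_valence, interview_valence)
print(f"agreement rho={rho:.2f} (p={p:.3f}), sessions to re-examine: {follow_up}")
```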

3.2. Experiences from adopting existing methods

We started by adopting two well-known methods, SAM and Emocards, which utilise self-report instruments. Descriptions of both methods and summaries of our experiences are given in the following sections.

3.2.1. Self-Assessment Manikin

The SAM method is based on a series of pictures of puppets (Fig. 1) which measure three dimensions of emotion along three axes: pleasure–displeasure (top figure series), degree of arousal (middle) and dominance–submissiveness (bottom) (Lang, 1980). It was originally implemented as an interactive computer application, but a paper-and-pencil version (used here) was later developed to tackle scalability issues. Bradley and Lang (1994) claim that these three dimensions are primary in organising human experience, both semantic and affective. The method is designed for collecting information about subjective feelings.

The SAM instrument is based on the assumption that emotions vary continuously along certain dimensions. Instead of selecting one picture to represent his or her emotional state, the user describes the position of the emotion on each continuum separately. The challenge is to define the dimensions so that they are independent and do not impact each other (Mandryk et al., 2006). The research setting situates the user within a specific interaction as it actually takes place, thus supposedly reducing recall errors and hypothetical answers (Iachello et al., 2006). However, providing the researcher with knowledge about this specific situation to support the interpretation of the collected data requires complementing SAM with other data collection methods.

The SAM pictures are usually presented to the user on paper, and the user selects the picture matching the emotional response by ticking the corresponding picture on each dimension with a pen. The method is most feasible when used for measuring emotions before and/or after use. The problem with after-use evaluation is that the longer the time lapse between the experienced emotion and the evaluation, the more the results are distorted (Scherer, 1989), as the user needs to rely on his or her memory.

Fig. 1. The SAM self-report method (Desmet, 2002).

We (Tähti et al., 2004a) collected user experiences of a context-adaptive mobile application by applying the SAM method in two field experiments with a total of 15 participants. Using the picture-based instrument is very simple, as it requires only pen and paper. This is easy as long as the instrument is used before and after the actual use. If the method were used to reflect the dynamic nature of the user experience, i.e. to collect information about emotions aroused during use, the situation would be different: in mobile situations, a pen-and-paper instrument is inconvenient, sometimes impossible, to use. When the method is used only before and after use, it cannot capture fleeting emotions experienced during interaction.

It also became obvious that, as mobile applications are almost always used in situations where the context plays a very important role, it is difficult to separate the emotions caused by the application from the emotions evoked by other context parameters. Therefore, it is crucial to complement the method with other methods which help the designer understand the context of use.

Analysing the results is very straightforward, and statistics can be used to present them, as the collected information is highly structured. However, analysing the reasoning behind the selections is impossible if no additional methods are used to collect information about context and interaction. In both cases, the results collected by means of the SAM method correlated well with results collected with other methods, such as diaries and interviews.

However, we noted that the SAM scales were not easy for test users to interpret. For example, in the case of the arousal scale, the users asked whether they were supposed to evaluate the state of the application or themselves. The dominance scale was not clear either: users were not sure whether the small figure was supposed to mean that the application did all the work, or whether the correct interpretation was that the user was not able to control the application at all. Similar problems in interpreting the dominance scale have been reported by Bradley and Lang (1994), who state that dominance judgements need to specify which member of the interaction is being judged. Nevertheless, test users said that they liked this method because it was easy to use and gave them the possibility to express their emotions on three different scales.

Even though we found that the scales were not clear to all users, our results indicate that this method gives information on how well the users managed to use the system, whether they liked it or not, and whether they were excited about the tested application. However, in order to get information on what actually happened during interaction and which product features evoked the emotions in test users, we need to complement the method with other methods providing data about the interaction between the user and the application and about the context where the interaction took place. Our experiences with this method are summarised in Table 2.
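Because each SAM response is a structured triple, the straightforward statistical analysis mentioned above reduces to a few lines. This is a minimal sketch: the 9-point scale width and the sample values are illustrative assumptions, not data from our experiments.

```python
# Minimal sketch of analysing SAM data, assuming each response is a
# (pleasure, arousal, dominance) triple on an illustrative 9-point scale.
from statistics import mean, stdev

responses = [  # one triple per participant, collected before or after use
    (7, 5, 6), (6, 4, 5), (8, 6, 6), (5, 3, 4), (7, 5, 5),
]

for name, values in zip(("pleasure", "arousal", "dominance"), zip(*responses)):
    print(f"{name}: mean={mean(values):.2f}, sd={stdev(values):.2f}")
```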

Table 2
Summary of experiences from the SAM method

Positive experiences:
- Easy to use
- Simple equipment requirements
- Results in numerical form
- Easy to analyse
- Gives information on pleasure, arousal and dominance

Negative experiences:
- Scales sometimes difficult for subjects to interpret
- Requires additional data collection if explanations are sought
- Difficult to perform during use; more suitable for capturing emotions before and after use


3.2.2. Emocards

The Emocards method consists of 16 cartoon faces (see Fig. 2; eight male and eight female) which depict eight distinct emotional responses. The method is based on the assumption that emotions can be classified into a set of emotions, each of which can be associated with a specific recognisable facial expression. This view has lately been challenged, for example, by Schiano et al. (2004), who found no evidence for categorical perception and questioned the research evidence supporting emotion categories. Research indicates that humans use facial cues as a main source of emotional information even when other sources would provide more accurate information (Ekman and Friesen, 1969). This supports the assumption that an emotion can be represented most efficiently with facial features. The set of emotions presented by the cartoon faces in the Emocards method has been selected to include the emotions most frequently elicited by product appearance (Desmet, 2002). In practice, the test user selects one of the eight faces, each representing a distinct emotion (Desmet et al., 2001). The method is usually used to summarise emotions towards a product, for example, before or after use. The explanations of the emotions represented by the faces (visible in Fig. 2) are for analysis purposes only and are therefore not shown to the users in the test situation.

We applied the Emocards method (Tähti et al., 2004b) in field experiments with 61 participants. Before and after testing, the test users of the services were asked to select the Emocard which best depicted their emotions and to give a free-format verbal explanation for their selection. The results show correlation between the selected picture and the verbal explanation the users gave. For example, when the subjects were not satisfied with the service, they selected an Emocard from the unpleasant axis and, when they were satisfied with the service, they selected an Emocard from the pleasant axis.

However, the users often complained that it was difficult to find a picture which would represent their emotional state. There are two possible explanations for this. First, it is possible that the set of emotions selected for this instrument does not include a picture representing the emotion the user experienced. This might be because the emotions chosen for this method represent the emotions most often elicited by product appearance, while our main focus of analysis was on interaction. Secondly, even if the emotion the user experienced was included in the pictures, the user might not have been able to recognise it and therefore could not find it in the evaluation situation. One aspect which might affect how recognisable the faces are is that they are static representations of facial expressions. Research shows that dynamic facial expressions are recognised better than static ones (Collier, 1985), and as we were using a pen-and-paper interface, our facial expressions were naturally static. Desmet (2002) has later developed this method further to include animated puppets for expressing emotions; these have proved to be easier to recognise and better at displaying small detailed differences in emotions than the static faces used here.

However, in the triangulation-based validation, some differences were seen between the selected Emocard and the verbal explanation given by the subject. Four out of 61 subjects selected "excited neutral", but their explanations show that they were unable to use the services and therefore could not have been satisfied with them. In addition, two users also selected "excited neutral" even though their verbal explanations showed that they were quite satisfied with the service. Desmet (2002) also discussed the difficulty of neutral emotions: he observed that people had difficulties in interpreting facial pictures portraying neutral emotions and had a very strong tendency to interpret a facial picture as either pleasant or unpleasant.

Our analysis shows that the emotions on male and female faces were interpreted differently. Even though they are supposed to represent the same emotional response, many users interpreted the emotion on the female face to be different from the emotion shown on the male face. For example, the two faces at the top of Fig. 2, both depicting "excited neutral", were experienced as indicating two different emotions depending on which face was looked at. We also found that users sometimes wanted to mark more than one picture to express a more complex mixture of emotional responses, whereas the idea of the method is to select just the one picture which best depicts the emotion experienced.

This method is easy to use, as pen and paper suffice. However, using pen and paper always creates a time lapse between use and evaluation, thus distorting the results by relying on the users' memory. The results are easy to analyse, as they can be converted into numerical form (from one to eight) and statistics can be used in the analysis. The experiences of this method are summarised in Table 3.

Fig. 2. Cartoon faces on Emocards (Desmet et al., 2001).
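Since the analysis converts Emocard selections into numerical form (from one to eight), the coding step can be sketched as below. The card numbering and the (pleasantness, arousal) coordinates assigned to each card follow the circumplex idea but are illustrative assumptions, not the calibrated layout of Desmet et al. (2001).

```python
# Hedged sketch of coding Emocard selections numerically; the coordinate
# assignments are illustrative, not the instrument's published values.
EMOCARD_CODES = {
    1: ("excited pleasant",   +1, +1),
    2: ("average pleasant",   +1,  0),
    3: ("calm pleasant",      +1, -1),
    4: ("calm neutral",        0, -1),
    5: ("calm unpleasant",    -1, -1),
    6: ("average unpleasant", -1,  0),
    7: ("excited unpleasant", -1, +1),
    8: ("excited neutral",     0, +1),
}

selections = [1, 8, 6, 2, 8]  # card numbers ticked by five hypothetical subjects
pleasant = sum(1 for s in selections if EMOCARD_CODES[s][1] > 0)
unpleasant = sum(1 for s in selections if EMOCARD_CODES[s][1] < 0)
print(f"pleasant axis: {pleasant}, unpleasant axis: {unpleasant}")
```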

Table 3
Summary of experiences from the Emocards method

Positive experiences:
- Easy to use
- Simple equipment requirements
- Results in numerical form
- Easy to analyse

Negative experiences:
- Does not address the dynamic side of the experience
- Pictures sometimes difficult for subjects to interpret
- Requires additional data collection if explanations are sought
- Difficult to perform during use; more suitable for capturing emotions before and after use
- Requires the user to summarise his/her emotions in one selection

3.2.3. Summary of the problems in the adopted methods

As our experiences show, we had some problems in using the selected methods in our research setting. The problems encountered can be categorised into three broad categories:

1. Problems related to the theory or model behind the method, i.e. how "accurate" the descriptions provided for expressing emotion are. Because both of these methods have been used successfully, it is obvious that they measure something related to emotional experience, but it is difficult to tell what exactly this something is. The frequency of problems the users had in finding a category corresponding to their feelings might indicate that the categories are illustrated badly, but it may also mean that our emotional experience is just not
structured in the way these methods suggest. This also casts some doubt on the quantification of results: for example, how comparable are the numbers between different users, different applications and different situations?

2. Problems related to the act of interpretation and classification. Although pictorial methods have been developed to ease the act of interpretation and categorisation, users still must project their state of mind onto pre-selected emotional categories. This is not what people usually do and, even if the categories themselves were correct, users could still have difficulties in interpreting their feelings and connecting the result with the right category.

3. Problems related to mobility. Emotions are not easily remembered, and they should be caught as close to their emergence as possible, "on the fly". Neither of the methods is suitable for that; the data collected comes from pre- or post-mobility situations. Also, when the focus is on interaction, context parameters are needed for interpreting the collected data and for understanding how it relates to application features.

3.3. Experiences from developing new methods for emotion collection

To overcome the problems we encountered with existing self-report based emotion collection methods, we developed and trialled new methods. These methods have not been developed systematically or incrementally, i.e. we do not claim that one method is an improvement over another. Each of the new methods presented here aims to solve some specific problems related to evaluating emotions experienced during the use of a mobile application. The first method, called 3E, aims at solving emotion categorisation problems by providing a structured and instructed language for expressing emotions. However, it does not solve the problems related to mobility, and it only partially supports examining emotion as a dynamic element influenced by the interaction. The second method, which uses a feedback application in a mobile phone, aims at solving the problems related to mobility and dynamic interaction. The third method, called Experience Clip, was originally developed for capturing user experiences in mobile situations, but is analysed here from the viewpoint of emotion evaluation. The Experience Clip is especially designed to utilise the social dimension of co-experience (Battarbee, 2004) by encouraging the sharing of, and reflection on, emotions related to user experiences.

3.3.1. 3E

The inspiration for the new method called Expressing Emotions and Experiences (3E) was the idea that emotional expressions could be used not only as a direct representation of the user's emotional status, but also as a
social language which the user can use to communicate with the researcher. There is research evidence that, for example, facial language should be interpreted more as a social language used in social interaction than as a direct representation of the emotional status of the user (Kraut and Johnston, 1979; Fernandez-Dols and Ruiz-Belda, 1995). For example, people may smile when they want to communicate that they are friendly and loving even though they feel sad inside.

The 3E method provides the user with a structured way of expressing emotions by drawing and writing (Tähti and Arhippainen, 2004). The method uses a template which the users employ to express their experiences and emotions during the use of the evaluated application (Fig. 3).

Fig. 3. The drawing template used in the 3E method (Tähti and Arhippainen, 2004).

With 3E, the users need to invest time and effort to construct a drawing which represents their feelings. Compared with the SAM and Emocards methods, the process takes more time, involvement and effort from the user. The goal of the SAM and Emocards instruments is to provide a quick and easy way of collecting emotional responses, because it is assumed that the less cognitive effort the person needs to invest in selecting the picture, the better the selection corresponds with the real inner emotional status of the user (Desmet, 2002). However, with a method which requires more time and cognitive effort, the users can also use the emotional gestures to communicate other issues, such as their attitudes, opinions, user experience, etc. In the example presented in Fig. 4, a user described the positive attitude he had towards the application and our trial with a smiling face and sunny weather.

We have usually used the 3E method as part of an experience diary in different test settings with different kinds of users. For example, in a field test where 3E was integrated with an experience diary, we had 20 participants who used the diary to record experiences every time they used the tested application. Diary studies have been used commonly by HCI researchers (Palen and Salzman, 2002) for studying natural settings while retaining some research control. Users were asked to use the application at least five times during the testing period and to use the experience diary after each usage. The diary consisted of questions concerning each usage situation and how the user experienced the dominance of the application. After the
test period, the test users were interviewed. The interviews were semi-structured and designed to support the structure of the diary. In order to validate the results, the goal was to get deeper and more detailed information about the data the user had written and drawn in the diary.

Our results indicated that experience diaries using 3E can provide detailed and versatile material about user experiences and the emotions related to them. Even though the drawing process was structured and guided, there was great variation between the figures drawn by different users. Some users preferred not to draw facial gestures at all, leaving the face blank and only using text or context to express their emotions.

This method does not require complicated equipment, as a paper-format diary suffices. However, separating the information collection instrument (a paper diary) from the evaluated application (operating on a smartphone) almost inevitably leads to situations where the two are not in the same physical place, and the user cannot record his or her experiences with the application because the instrument is not available. The method allows users to express their emotions with a combination of a picture and an explanatory text. Even though the pictures are free-format, the drawing is instructed. However, the analysis takes a lot of time, and special knowledge of psychology is needed in interpreting the results.

Our findings indicate that if the users experienced negative feelings towards the application, they tended to express their emotions with a more negative message in the thought balloon and a less negative message in the speech balloon of the drawing. This might be explained by politeness: in a normal discussion situation, expressing negative feelings very directly can be interpreted as impolite or rude.

In the SAM and Emocards methods, test users were required to interpret the given pictures. In 3E, the difficulty of interpretation is transferred to the researchers. Our material shows that the content and format of the pictures vary so much that it is very difficult to create analysis schemes and classifications. Therefore, getting quantitative results is difficult and, in practice, we had to rely mostly on qualitative analysis of the pictures. Thus, the analysis of the results is more time-consuming and complicated, and requires special knowledge which is not needed with SAM or Emocards. We also noted that analysing the figures and balloons as separate entities was not enough; useful information could also be obtained by analysing the relationships between the different entities of the picture. We planned the drawing instructions and templates to minimise the planning and structuring of drawings. However, some subjects stated that they did not like this method because their drawing skills were poor. The experiences of the 3E method are presented in Table 4.

Fig. 4. An example of a 3E drawing.

Table 4
Summary of experiences from the 3E method

Positive experiences:
- Easy to use
- No complicated equipment needed
- Free-format pictures give users freedom of expression
- Combines written and visual expression
- Reveals emotions and their explanations

Negative experiences:
- Analysis is challenging
- Difficult to arrange during use; better suited for use before and after use
- Some people do not like to draw

3.3.2. Feedback application in a mobile phone

The SAM, Emocards and 3E methods are best suited for collecting emotions before and/or after the use of the tested application or service. However, as experiences and the emotions related to them are often fleeting moments (Battarbee, 2004), trying to recall them after use is difficult for users. To be able to capture expressions of emotions as they are experienced, we created a feedback application for Symbian smartphones inspired by the experience sampling method (ESM; Larson and Csikszentmihalyi, 1983) (Arhippainen et al., 2004). A similar instrument, the context-aware experience sampling tool, has been developed at the Massachusetts Institute of Technology (MIT) (Intille et al., 2003).

SAM and Emocards represent methods primarily designed for evaluating the emotions evoked by static features of the product, mainly its physical appearance or look-and-feel. Therefore, they assume that the emotions experienced by the user do not change rapidly: there can be changes over time, but they are not very rapid, as the physical appearance of the product does not change. However, when our focus is on evaluating the emotional experiences evoked by dynamic interaction with a mobile application, the emotions can change rapidly, even between extremes. Therefore, collecting information about emotions during use becomes crucial, not only for minimising the time lapse between the experienced emotion and data collection, but also for capturing fleeting emotions which are followed by new emotional experiences
during interaction. Also, as our focus is on dynamic interaction rather than static appearance, we need to be able to link the fleeting emotions with knowledge about the status of the application and the interaction sequences at the moment of the evaluation, as well as with information about the other context variables.

This approach combines the application to be evaluated and the emotion collection instrument in one and the same device. Therefore, we can be sure that the data collection instrument is available while the user is using the application, and that it can be utilised during or right after use. The device used for the actual application is also employed for answering the questions dealing with emotional responses to the application. The questions are presented based on time or application events. As answering questions with textual input using the small numeric keypad is not very convenient in a mobile context, we decided to use emoticons instead: the users answer questions simply by selecting an appropriate emoticon on the screen. Like Emocards and SAM, this method also forces the user to select the best possible alternative from a set of figures. The method also involves the problem of choosing only one figure, but now the selection is made rather often, and usually just after an interaction has happened, which should effectively reduce the need to select several emoticons at once.

The software allows both system- and user-initiated experience recording. A predefined question may pop up on the screen in response to a timer or a programme event, such as the user completing a certain task. Alternatively, the user may start the software from the menu of the smartphone and give feedback spontaneously. In the first software version, the pop-up method supports two kinds of predefined questions: emotion capturing via choosing one of nine emoticons, and Boolean-type questions which can be answered either "yes" or "no". These question types were chosen to keep the typing required from the user to a minimum, as feasibility studies showed that the small keyboard of the smartphone is not well suited for free text entry. Based on the feasibility studies, it was also decided not to use voice capture for recording explanations of experiences. Voice capture would be quite a natural way of using a smartphone, and could potentially yield more in-depth answers in mobile situations. However, the feasibility study indicated that using voice input for explaining emotions created privacy problems. A typical sequence of answering is illustrated in Fig. 5.

Fig. 5. Answering a feedback question with emoticons (Arhippainen et al., 2004).

The software implementation consists of a background application, which handles the timing of questions and stores questions with their replies, and a simple feedback application, which is visible in the phone menu. The background application is not visible to the user. It starts automatically when the phone is turned on, reads the predefined time- or action-dependent questions from an initialisation file, sets up a timer and starts the feedback application when needed. Twelve test users used the ESM-inspired feedback application for a period of five days. The experiences are summarised in Table 5.
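The two-part design described above can be sketched in a platform-neutral way (the original ran on Symbian smartphones, whose implementation is not reproduced here). The question format, trigger names and the UI stand-ins (present_emoticons, present_yes_no) are illustrative assumptions.

```python
# Hedged sketch of an ESM-style feedback scheduler: questions are normally
# read from an initialisation file and fired on timers or application events;
# replies are stored with timestamps for later matching with interaction logs.
import time

QUESTIONS = [  # in the real system these were read from an initialisation file
    {"trigger": "event", "event": "task_completed", "type": "emoticon",
     "text": "How did completing the task feel?"},
    {"trigger": "timer", "interval_s": 3600, "type": "boolean",
     "text": "Has the application been useful in the last hour?"},
]

def present_emoticons(text):        # stand-in for the emoticon selection screen
    return int(input(f"{text} [1-9]: "))

def present_yes_no(text):           # stand-in for the Boolean pop-up
    return input(f"{text} [y/n]: ").strip().lower() == "y"

class FeedbackScheduler:
    """Background part: fires questions and stores timestamped replies."""

    def __init__(self, questions):
        self.questions = questions
        self.last_fired = {id(q): time.time() for q in questions}
        self.log = []

    def on_event(self, event_name):
        # Application events (e.g. "task_completed") can trigger a question.
        for q in self.questions:
            if q["trigger"] == "event" and q.get("event") == event_name:
                self.ask(q)

    def check_timers(self):
        # Called periodically; fires timer-based questions when due.
        now = time.time()
        for q in self.questions:
            if q["trigger"] == "timer" and now - self.last_fired[id(q)] >= q["interval_s"]:
                self.last_fired[id(q)] = now
                self.ask(q)

    def ask(self, q):
        ui = present_emoticons if q["type"] == "emoticon" else present_yes_no
        self.log.append({"time": time.time(), "question": q["text"], "answer": ui(q["text"])})
```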

Table 5
Summary of experiences from the feedback application

Positive:
- Easy to use
- Data collection is integrated into the same device as the application under evaluation, i.e. it is always available when the application is used
- Emotion expressions can be collected during use
- Results are in numerical form
- Easy to analyse

Negative:
- Requires a feedback application which preferably runs on the same platform as the evaluated application
- Questions need to be carefully planned beforehand
- Questions need to be brief because they are presented on a small screen and in mobile usage situations
- As answers requiring free-format verbal explanations are avoided, clarifications for the answers cannot be asked

The field experiment experiences show that one of the most obvious strengths of a feedback application integrated into the same device as the evaluated software is that it is always present when the evaluated application is used. With paper-based self-report instruments, there is always a risk that, in a mobile situation, the self-report instrument is physically in a different place when the evaluated application is used. Integration reduces the stress and disturbance caused by the evaluation, as the users do not need to carry extra items and there is no pressure to remember to take them along. It also makes the results more reliable and consistent, as valuable feedback is not missed because the instrument is absent.

Moreover, integrating the feedback application into the same device makes it possible to initiate feedback actions while the user is interacting with the application. This allows us to capture fleeting emotions which emerge during or right after use. In the experiments, information about emotions directly connected to a specific interaction event proved to be the most valuable, as this provides the designers with information about the exact action which evoked the emotion. In order to use the collected information for iterative design, the designer needs to know which design feature evoked the emotional response. When feedback collection can be synchronised with the actual interaction events, the designer has access to information which helps in creating this link. For example, by combining the time stamps and activity logs of the feedback application and the evaluated application, the designer knows exactly what the user did and how the action succeeded immediately before the emotional feedback was collected.

In this method, the evaluation application uses the user interface of the mobile device, which is designed for mobile use. This proved to be an advantage compared, for example, with pen-and-paper instruments, which were clumsy or impossible to use in certain mobile situations, such as when shopping or exercising. Certain mobile situations were challenging from the user interface point of view even with devices designed for mobile use. For example, some users reported that when they were driving a car and the feedback application demanded their attention, they might just randomly press some key without watching the screen so that they could direct their full attention to driving. Also, in some cases it was suspected that the device had been, for example, in a pocket or a purse with the keypad unlocked, and false feedback input was given because keys were pressed randomly. Furthermore, as the mobile phone was used for much more than feedback collection, in some situations the other functionalities could interfere with the feedback application. For example, if the feedback application was waiting for a response when the user received a phone call, the user might quickly press something, accidentally or knowingly, to get rid of the feedback application and answer the call. Providing the users with an option to abort the feedback application was only a partial solution, as the users often did not even look at the screen to make a selection in these situations, but just pressed some key at random. From the evaluation point of view, these situations were problematic, as the responses did not give reliable information about the emotions, and it was impossible to separate this type of unreliable data from real data.
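The timestamp linking described above amounts to a simple log join: for each emotional response, find the interaction event that immediately preceded it. The log formats below are illustrative assumptions, not the project's actual formats.

```python
# Sketch of matching timestamped feedback against an application activity log.
import bisect

def link_feedback_to_events(feedback_log, activity_log):
    """Both logs are lists of (unix_timestamp, payload), sorted by time."""
    event_times = [t for t, _ in activity_log]
    linked = []
    for t, emotion in feedback_log:
        i = bisect.bisect_right(event_times, t) - 1  # last event at or before t
        preceding = activity_log[i][1] if i >= 0 else None
        linked.append((emotion, preceding))
    return linked

activity = [(100.0, "open map"), (130.0, "search failed"), (150.0, "retry ok")]
feedback = [(135.0, "angry"), (160.0, "pleased")]
print(link_feedback_to_events(feedback, activity))
# [('angry', 'search failed'), ('pleased', 'retry ok')]
```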

The experiments raised the question of whether combining the evaluated application and the application used for evaluation in the same device would distort the results. Two possible reasons were identified: the first related to the Media Equation (Reeves and Nass, 1996), and the second to confusing the two applications with each other. The Media Equation suggests that when the same computer is used for running the application under evaluation and the application used for evaluation, users tend to be "polite" towards the computer and give more positive answers than they would if the evaluation were done by other means (e.g. pen and paper or another computer). However, the Media Equation has been questioned; for example, Goldstein et al. (2002) found no evidence of it with mobile devices.

Another possible problem was that the users could confuse the evaluated application with the application used for evaluation. For example, when asked about emotions related to the use of the evaluated application, they might actually have been reporting emotions related to the application used for evaluation. We saw signs of this even when observing discussions between researchers: when they talked about "the application", we often noted that some meant the application under evaluation while others, in the same discussion, were talking about the application used for evaluation. In our experiment, this problem might have been further emphasised because the application we were evaluating did not really require any action from the user; it was an automatic adaptive feature which required no user input. Therefore, all active interaction the user was involved in happened with the application used for evaluation. Thus, the users easily associated the trial with the evaluation application.

As this method invades the everyday life of the test users, the issue of user acceptance was found to be important during the experiment. Some users found it irritating that the evaluation application demanded their attention from time to time. As it was running on their own phone, they could not get rid of it without reducing their ability to be connected to the outside world via the phone. Five out of twelve test users mentioned that the sound used for announcing a question was disturbing in some situations. However, the users also commented that they knew it was just a test and, as the test would only last for a certain
period of time, they were able to tolerate the disturbance. However, it remains unclear whether the irritation caused by our evaluation application had any effect on the emotional responses the users reported.

This method proved to be useful and easy to use for collecting emotions almost in real time in unsupervised field test conditions. The developed feedback tool provides researchers with the possibility of collecting emotions frequently during the use of a product. Moreover, the users are able to use the application being tested in their normal daily life without researchers being present and without any extra equipment, provided the feedback application can be integrated with the device which runs the tested application. To help the analysis of the collected data, we developed a framework based on the circumplex of affect (Russell, 1980). The framework helps in placing the emoticons into three categories: negative, positive and neutral.

As we chose not to use text or voice input, the information content provided by the answers alone is probably the weakest part of the method: additional comments and explanations could not be provided to clarify the answers. This can be considerably alleviated by applying complementary methods such as experience diaries or interviews. In addition, test users were able to select only one picture. In some cases, the users would have liked the possibility of selecting several pictures; this seemed to be the case especially when they felt that none of the predefined pictures depicted the exact emotion they were feeling. Moreover, planning short and informative questions was difficult. It was challenging to formulate questions in a format brief enough for a small screen in a mobile situation without the test users misinterpreting them.
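A minimal sketch of the circumplex-based categorisation mentioned above follows; it assumes nine emoticons and illustrative valence values, not Russell's (1980) published coordinates or the framework's actual thresholds.

```python
# Hedged sketch of collapsing emoticon answers into negative/positive/neutral.
EMOTICON_VALENCE = {  # emoticon id -> assumed valence in [-1, +1]
    1: 1.0, 2: 0.6, 3: 0.3, 4: 0.0, 5: -0.3, 6: -0.6, 7: -1.0, 8: 0.1, 9: -0.1,
}

def categorise(emoticon_id, neutral_band=0.2):
    v = EMOTICON_VALENCE[emoticon_id]
    if v > neutral_band:
        return "positive"
    if v < -neutral_band:
        return "negative"
    return "neutral"

answers = [1, 4, 7, 8, 2]
print([categorise(a) for a in answers])
# ['positive', 'neutral', 'negative', 'neutral', 'positive']
```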

describe her or his inner feelings, and makes the discussions flowing and natural. As the users also often switched roles during use, they were able to share and compare experiences with each other. The Experience Clips emphasise sharing experiences, and related emotions, with people. First, the user shares the experience with the friend, who is capturing the situation on video. Second, the user and the friend together share the experience which can now be called co-experience (Battarbee, 2004), with the researchers. The process of sharing experience has its effects on how experience is understood not only by others, but also by the one sharing the experience. The discussions between the user and the friend shooting the video would be captured at the time the user was using the device. Simultaneously, the video captures context information, facial and bodily gestures, tone of voice and other emotional expression. Therefore, we can argue that even though we would primarily categorise the Experience Clip as a self-report instrument, it also has elements of expression evaluation instrument (as video-based instruments often are). As the video was shot with a camera phone, we could use the time stamps of the videos and the interaction log to match the interaction events with the expressions of emotions captured by videos. Therefore, we were able to capture rich variety of contextual data to support the designers in interpreting the data collected about the emotions of the user. We soon found out that when users were observing each other, we got information about versatile usage situations and rich material about user responses to them. Without a researcher being present in the usage situation, the users seemed to feel free to use the evaluated application in innovative ways, they used time for exploring the possibilities of the application and they seemed to enjoy finding new features and even operational failures from the applications. We compared the data collected with the Experience Clips with data collected by a researcher shadowing the user with a video camera, and found striking differences. When the researcher was present in a usage situation, the users seemed to be in a hurry. They did not explore the possibilities of the application, but proceeded in rather straightforward manner. This is probably because they did not want to ‘‘waste’’ the time of the researcher and wanted to be time-effective in the evaluation situation. Secondly, when the users were shadowed by the researcher, they seemed to follow expected or typical behaviour patterns, while without researcher being present, they seemed to look for unusual usage situations and even situations when the application would not operate correctly. For example, they were searching places where the positioning software would not operate correctly and the application would claim that they would be somewhere else than they actually were. With researcher following the user, the users seemed to avoid situations where the application would not operate correctly. If they noticed that the positioning software was not operating correctly, they would try to return to a place


The unusual and innovative usage situations often evoked the strongest emotional responses from the users, and therefore provided the designers with the most material for iterative design. Also, as the users (i.e. the video shooter and the user of the application together) could choose for themselves whether to capture the usage or not, and even whether to delete a video after recording, they felt free to explore with the application. We noted that some users did not want their failures to be recorded, and they forbade the friend from shooting video in those situations. However, not all friends obeyed but continued shooting. Most users were happy to be filmed when they failed to use the application as expected, simultaneously expressing their negative emotions. Users also expressed wishes about how they would have liked the applications to behave and operate. Naturally, if the application did not operate as they expected, they reacted to that by explaining their feelings. Also, we received new application ideas which emerged from the context of use. For example, one user walked by a restaurant and thought of a context-sensitive mobile application which could be useful for him in that exact situation. He shot a short video explaining his idea. These kinds of videos show that even though the researcher is not physically present, users were aware that researchers would be evaluating the video clips they produced. Users did not only state their opinions about the applications, they also created mini-plays on video to make their point clear. We noted that especially the frustration caused by experienced technical problems inspired the users to shoot funny clips. For example, one video shows the user throwing the device into the Baltic Sea (actually, he throws a stone, but the illusion is perfect on the video) because he cannot get accurate positioning information and gets frustrated.

This technique was used in two different field trials with 50 test users in total. A summary of experiences from the Experience Clip technique is presented in Table 6. The weakness of the method is that the interpretation and analysis of the emotions which users show is difficult. This was especially difficult with the technology we used some years ago, as the quality of mobile video in terms of picture and audio was not yet very high. Also, as the length of a video clip was restricted by the devices we used at the time of the field trial, the length of an individual Experience Clip was not decided by the users, but by technology. Technical restrictions are much less of a constraint with current technology, as the video capabilities of mobile phones advance.

Analysing the Experience Clips could be more efficient if users could participate not only in collecting data, but also in analysing it. That would allow interpretation of the collected material together with the original users. However, this would require more involvement and commitment from the research subjects than we could have expected or required in our field trial settings. Another problem with analysing very short Experience Clips is integrating them. Even though we can use activity log files and the time stamps of videos for integration, there were always gaps where we saw that something possibly important was happening, but could not tell what, because no video material was available. Therefore, creating a coherent story out of short clips is a challenge.
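The following minimal sketch makes the timestamp-based integration concrete: it matches logged interaction events to the clips that cover them and flags the gaps without video coverage. The clip and log formats, and all the data, are assumptions for illustration, not the formats produced by our actual devices.

    # Sketch: matching interaction log events to Experience Clips by
    # timestamp, and flagging gaps with no video coverage. All data
    # below is hypothetical.

    from datetime import datetime, timedelta

    def parse(ts: str) -> datetime:
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

    # Each clip: (start time, duration in seconds), from video metadata.
    clips = [(parse("2004-06-15 14:02:10"), 25),
             (parse("2004-06-15 14:10:40"), 18)]

    # Interaction log: (timestamp, event) pairs from the application.
    log = [(parse("2004-06-15 14:02:15"), "map opened"),
           (parse("2004-06-15 14:06:30"), "positioning failed"),
           (parse("2004-06-15 14:10:45"), "service selected")]

    def clip_for(event_time: datetime):
        """Return the index of the clip covering the event, or None."""
        for i, (start, secs) in enumerate(clips):
            if start <= event_time <= start + timedelta(seconds=secs):
                return i
        return None

    for when, event in log:
        i = clip_for(when)
        if i is None:
            print(f"GAP: '{event}' at {when} has no video coverage")
        else:
            print(f"'{event}' at {when} -> clip {i}")

Here the "positioning failed" event falls into a gap: the log shows that something potentially interesting happened, but no clip is available to tell what the user experienced.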


Having the users decide which usage situations to record and which not seemed to spur versatile and innovative usage, but it also had a downside. We cannot be sure whether the range of usage situations captured represents the usage situations which would take place during real use. Therefore, combining the Experience Clips with other non-intrusive methods, such as monitoring real use with system logs, provides valuable support for the data collected with the Experience Clips.

4. Selecting the method

As our discussion shows, even within self-reporting methods, there are several possibilities for capturing expressions of emotions in mobile situations. We have identified four distinct factors which affect the selection of a self-reporting method: (1) the type of data sought, (2) the resources available, (3) the test situation and (4) the users. Each of the factors can be further divided into questions, as has been done in Table 7. Our comparison framework both summarises our experiences and provides other researchers and practitioners with a resource for deciding which methods can be used in a given evaluation situation. Table 7 summarises the methods within the given framework.

Table 6
Summary of experiences from the Experience Clip

Positive:
- Easy to use
- Emotion expressions can be collected during use
- Provides expressions of emotions in verbal form as well as with physical cues (facial expressions, gestures, etc.)
- Provides information about versatile usage situations
- Uses technology which the users were eager and comfortable to use

Negative:
- Requires special equipment (video-capable mobile phone)
- Not suitable for long testing periods
- Interpreting emotions from video and analysing the interpreted material is challenging
- Users may choose not to record all kinds of usage situations
- External environmental conditions (such as lighting or background noise) may have a big impact on the quality of the video
- Requires small groups of users (at least two)



Table 7
The framework for selecting the best-suiting method

Type of data sought:
- I need qualitative data
- I need quantitative data

Resources:
- I can use smartphones
- I can use feedback software
- I need simple results quickly or I don't have much time for analysis
- I cannot plan exact questions before the evaluation

Test situation:
- I want to gather the data during the use
- I want to gather the data after the use
- I want to collect different scales of emotions
- The duration of the evaluation is short
- I want to conduct a long-term test
- I can't personally assist users
- I don't want to get involved in the test situation and/or affect the results

Users:
- I will have lots of users
- I want to explore social situations or groups of users
- I need an easy method for users

The figures in the table depict the different methods (SAM, Emocards, 3E, Feedback Application and Experience Clip) as icons, which are not reproduced here.

Table 8
An example of how to use the comparison framework

Type of data sought:
- I need qualitative data: TRUE (two options)
- I need quantitative data: TRUE (include these three)

Resources:
- I can use smartphones: TRUE
- I can use feedback software: FALSE (exclude the Feedback Application)

Test situation:
- I want to conduct a long-term test: TRUE (exclude the Experience Clip)
- I can't personally assist users: TRUE

Remaining options: SAM, 3E and Emocards.

For ease of comparison of the methods, we have assigned each method an illustrative icon. To make the table easier to read, each method is always presented in the same column of the table. Naturally, in each evaluation situation some factors and questions are more important than others. When using the comparison framework in practice, the most important questions need to be identified; some of the questions may be irrelevant and can be excluded. Also, the requirements may call for combining several methods if no single method can provide the range of information needed in the evaluation. Our experience shows that combining several methods is often the best solution. It provides the possibility for methodological triangulation, and also reduces the risk of problems which may occur in applying a single method (e.g. technical failure or poorly planned questions).

Table 8 presents an example of a situation where the framework is used for selecting methods in a real-world setting. The case is the following: the aim is to evaluate emotions elicited by a mobile phone which automatically changes its ringing tone to match the context of the user. The application recognises the context of the user and automatically uses a loud ringing tone for calls which are important in the given context, and non-disturbing ringing tones for calls which are not. The application is implemented on smartphones and is used continuously by 15 users for one week. Both qualitative and quantitative material is required.


First, the questions which are most important for the study are identified, and irrelevant questions are excluded (see Table 8). If a question excludes the use of some method, the method is crossed out in the table. Both quantitative and qualitative material is wanted; thus, at the beginning, all five methods are selected. Smartphones are available during the test, but there is no possibility to use feedback software; thus, the Feedback Application is excluded. A long-term test is required; thus, the Experience Clip is not feasible. The remaining methods are the SAM, 3E and Emocards, as none of these requires the researcher to assist the users.
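The selection logic of the framework can be expressed as a simple filter over the set of methods. The sketch below is a minimal illustration covering only the exclusion rules of this worked example; it is not a complete, machine-readable encoding of Table 7.

    # Sketch: the comparison framework as a filter. Only the exclusion
    # rules of the worked example above are encoded; a fuller version
    # would cover every row of Table 7 in the same way.

    METHODS = {"SAM", "Emocards", "3E", "Feedback Application", "Experience Clip"}

    # For each requirement that holds, the methods it rules out.
    EXCLUSIONS = {
        "no feedback software available": {"Feedback Application"},
        "long-term test": {"Experience Clip"},
    }

    def select(requirements):
        """Start from all methods and drop those excluded by each requirement."""
        remaining = set(METHODS)
        for req in requirements:
            remaining -= EXCLUSIONS.get(req, set())
        return remaining

    # The case of Table 8: no feedback software, one-week continuous test.
    print(sorted(select(["no feedback software available", "long-term test"])))
    # Leaves 3E, Emocards and SAM, matching the outcome described above.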

5. Conclusions

Even though emotions have been studied extensively in psychology for a long time, there is still controversy over how human emotions could be represented and described. For example, there is controversy over whether facial expressions are perceived categorically (e.g. Schiano et al., 2004), or whether a set of emotions which can be called "basic emotions" exists (Ortony and Turner, 1990). When our understanding of emotions and their representation is still incomplete, it is difficult to define methods which would succeed in capturing information about them. However, as emotion is such a powerful aspect of human-computer interaction, it cannot be discarded in the design process. Therefore, we need to use our current understanding, even if incomplete and controversial, in developing methods which could support designers in taking emotions into account in the design process.

As emotions are at the heart of user experience, we believe that interest in evaluating the emotional responses evoked by technology will grow in the future. Self-report instruments provide feasible and lightweight tools for collecting such information, as other methods still often require special equipment and knowledge not always accessible to HCI practitioners. Our analysis shows that even with emotion collection methods based on self-report instruments, there are many possibilities for variation. Some methods give meagre data, but the results are easy to interpret, while others give rich and detailed data, but require extensive and difficult interpretation and analysis work. Both groups are useful, but for different purposes. For example, methods which provide quantitative, easy-to-interpret results are very useful when the results are needed to support management decisions which require hard facts and compact presentation. On the other hand, when the results are needed for directing the design and for creating new design solutions, more descriptive and detailed data is needed on the relationships between the evoked emotions and the design properties of the product.


When the results get more detailed and richer, the analysis and interpretation become more difficult. In this paper, we have presented some solutions and ideas for the analysis and interpretation of data on emotional responses, but further research on this issue is needed. Especially when data about emotions are used for guiding design, it is essential that the designer is able to build the link between the design features of the product and the results of emotion collection and analysis. Building this link requires that the designer understands the dynamic features of the interaction, and is able to use context information to interpret the information about emotions.

Self-report instruments which are designed for evaluating product appearance, assuming that the evaluation setting can be fixed and controlled, have their limitations in mobile settings. As our experiences show, there are several challenges in capturing context data to support interpretation, and in integrating information collected with separate methods in dynamic mobile usage situations. When applied to a mobile context, there is definitely room and need for new methods which better address the variability of the usage context and the usability challenges introduced by mobility. Especially the variability of the physical and social environment poses special challenges, as these can change rapidly and their variations are difficult to control and predict. As we give the user more freedom regarding the research setup, controlling the related variables becomes more challenging. In this paper, we have presented novel methods which can direct research towards better and more refined emotion collection in mobile settings.

This paper has presented our experiences of five different self-reporting methods for collecting expressions of emotions in the context of evaluating mobile applications in mobile settings. Over the past years, we have applied these methods in different field-test situations for evaluating mobile applications with mobile users. This paper pulls together the experiences we have gained using these methods, and provides a comparison framework to summarise the results. The comparison framework can be used by researchers and practitioners who need to choose a self-report method for evaluating mobile applications.

References

Aittola, M., Ryhänen, T., Ojala, T., 2003. SmartLibrary: location-aware mobile library service. In: Proceedings of the Fifth International Symposium on Human Computer Interaction with Mobile Devices and Services, Udine, Italy, pp. 411-416.
Arhippainen, L., Rantakokko, T., Tähti, M., 2004. Mobile feedback application for emotion and user experience collection. In: Proceedings of PROW 2004. Helsinki University Press, pp. 77-81.
Ark, W., Dryer, D.C., Lu, D.J., 1999. The emotion mouse. In: Proceedings of HCI International 1999.
Battarbee, K., 2004. Co-experience: understanding user experiences in interaction. Ph.D. Thesis, University of Art and Design Helsinki.
Beck, E., Christiansen, M., Kjeldskov, J., Kolve, N., Stage, J., 2003. Experimental evaluation of techniques for usability testing of mobile systems in a laboratory setting. In: Proceedings of OZCHI 2003.



Bradley, M., Lang, P., 1994. Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behaviour Therapy and Experimental Psychiatry 25 (1), 49-59.
Collier, G., 1985. Emotional Expression. Lawrence Erlbaum Associates.
Csikszentmihalyi, M., Larson, R., 1987. Validity and reliability of the experience-sampling method. The Journal of Nervous and Mental Disease 175 (9), 526-536.
Desmet, P., 2002. Designing Emotions. Doctoral dissertation, Delft University of Technology.
Desmet, P.M.A., Overbeeke, C.J., Tax, S.J.E.T., 2001. Designing products with added emotional value: development and application of an approach for research through design. The Design Journal 4 (1), 32-47.
Dormann, C., 2001. Seducing consumers, evaluating emotions. In: People and Computers XV, Joint Proceedings of IHM-HCI 2001.
Ekman, P., Friesen, W.V., 1969. The repertoire of nonverbal behavior: categories, origins, usages and coding. Semiotica 1, 49-98.
Ekman, P., Friesen, W.V., 1978. Facial Action Coding System. Consulting Psychologists Press, Palo Alto, CA.
Fernandez-Dols, J., Ruiz-Belda, M.-A., 1995. Are smiles a sign of happiness? Gold medal winners at the Olympic Games. Journal of Personality and Social Psychology 69 (6), 1113-1119.
Forlizzi, J., Battarbee, K., 2004. Understanding experience in interactive systems. In: Proceedings of DIS 2004. ACM, pp. 261-268.
Forlizzi, J., Ford, S., 2000. The building blocks of experience: an early framework for interaction designers. In: Proceedings of DIS 2000. ACM, New York, pp. 419-423.
Goldstein, M., Alsiö, G., Werdenhoff, J., 2002. The media equation does not always apply: people are not polite towards small computers. Personal and Ubiquitous Computing 6, 87-96.
Hudlicka, E., 2003. To feel or not to feel: the role of affect in human-computer interaction. International Journal of Human-Computer Studies 59, 1-32.
Iachello, G., Truong, K., Abowd, G., Hayes, G., Stevens, M., 2006. Prototyping and sampling experience to evaluate ubiquitous computing privacy in the real world. In: Proceedings of CHI 2006. ACM, New York, pp. 1009-1018.
Intille, S., Rondoni, J., Kukla, C., Iacono, I., Bao, L., 2003. A context-aware experience sampling tool. In: Proceedings of CHI 2003. ACM.
Isomursu, M., Kuutti, K., Väinämö, S., 2004. Experience Clip: method for user participation and evaluation of mobile concepts. In: Proceedings of the Participatory Design Conference, pp. 83-92.
Jordan, P.W., 2000. Designing Pleasurable Products: An Introduction to the New Human Factors. Taylor & Francis, London.
Kaiser, S., Wehrle, T., 1994. Emotion research and AI: some theoretical and technical issues. Geneva Studies in Emotion and Communication 8 (2), 1-16.
Kraut, R., Johnston, R., 1979. Social and emotional messages of smiling: an ethological approach. Journal of Personality and Social Psychology 37 (9), 1539-1553.
Lang, P.J., 1980. Behavioral treatment and bio-behavioral assessment: computer applications. In: Sidowski, J.B., Johnson, J.H., Williams, T.A. (Eds.), Technology in Mental Health Care Delivery Systems. Ablex, Norwood, NJ, pp. 119-139.

Larson, R., Csikszentmihalyi, M., 1983. The experience sampling method. New Directions for Methodology of Social and Behavioral Science 15, 41-56.
Litman, D., Forbes, K., 2003. Recognizing emotions from student speech in tutoring dialogues. In: Proceedings of ASRU 2003.
Mandryk, R., Atkins, S., Inkpen, K., 2006. A continuous and objective evaluation of emotional experience with interactive play environments. In: Proceedings of CHI 2006. ACM, New York, pp. 1027-1036.
Norman, D., 2004. Emotional Design: Why We Love (or Hate) Everyday Things. Basic Books, New York.
Ojala, T., Korhonen, J., Aittola, M., Ollila, M., Koivumäki, T., Tähtinen, J., Karjaluoto, H., 2003. SmartRotuaari: context-aware mobile multimedia services. In: Proceedings of the Second International Conference on Mobile and Ubiquitous Multimedia, Norrköping, Sweden, pp. 9-18.
Ortony, A., Turner, T.J., 1990. What's basic about basic emotions? Psychological Review 97 (3), 315-331.
Palen, L., Salzman, M., 2002. Voice-mail diary studies for naturalistic data capture under mobile conditions. In: Proceedings of ACM CSCW'02, pp. 87-95.
Partala, T., Jokiniemi, M., Surakka, V., 2000. Pupillary responses to emotionally provocative stimuli. In: Proceedings of the Eye Tracking Research & Applications Symposium, pp. 123-129.
Perttunen, M., Riekki, J., Koskinen, K., Tähti, M., 2005. Experiments on mobile context-aware instant messaging. In: The 2005 International Symposium on Collaborative Technologies and Systems (CTS 2005), May 15-20, 2005, Missouri, USA.
Picard, R., 1997. Affective Computing. The MIT Press, Cambridge, MA.
Reeves, B., Nass, C., 1996. The Media Equation: How People Treat Computers, Television and New Media Like Real People and Places. Cambridge University Press.
Reijnveld, K., de Looze, M., Krause, F., Desmet, P., 2003. Measuring the emotions elicited by office chairs. In: Proceedings of DPPI 2003. ACM Press, pp. 6-10.
Russell, J.A., 1980. A circumplex model of affect. Journal of Personality and Social Psychology 39, 1161-1178.
Scherer, K., 1989. Vocal correlates of emotion. In: Papousek, H., Jurgens, U., Papousek, M. (Eds.), Nonverbal Vocal Communication. Cambridge University Press.
Schiano, D., Ehrlich, S., Sheridan, K., 2004. Categorical imperative NOT: facial affect is perceived continuously. In: Proceedings of CHI 2004. ACM Press, pp. 49-56.
Tähti, M., Arhippainen, L., 2004. A proposal for collecting emotions and experiences. In: Interactive Experiences in HCI, vol. 2, pp. 195-198.
Tähti, M., Rautio, V.-M., Arhippainen, L., 2004a. Utilizing context-awareness in office-type working life. In: Proceedings of MUM 2004.
Tähti, M., Väinämö, S., Vanninen, V., Isomursu, M., 2004b. Catching emotions elicited by mobile services. In: Proceedings of OZCHI 2004.