Evaluation and Program Planning 35 (2012) 236–245
Quality and rigor of the concept mapping methodology: A pooled study analysis

Scott R. Rosas*, Mary Kane
Concept Systems, Inc., 136 East State Street, Ithaca, NY 14850, United States
Article history: Received 8 June 2011; received in revised form 28 September 2011; accepted 5 October 2011; available online 12 October 2011.

Abstract
The use of concept mapping in research and evaluation has expanded dramatically over the past 20 years. Researchers in academic, organizational, and community-based settings have applied concept mapping successfully without the benefit of systematic analyses across studies to identify the features of a methodologically sound study. Quantitative characteristics and estimates of quality and rigor that may guide future studies are lacking. To address this gap, we conducted a pooled analysis of 69 concept mapping studies to describe characteristics across study phases, generate specific indicators of validity and reliability, and examine the relationship between select study characteristics and quality indicators. Individual study characteristics and estimates were pooled and quantitatively summarized, describing the distribution, variation, and parameters for each. In addition, variation in concept mapping data collection in relation to these characteristics and estimates was examined. Overall, results suggest concept mapping yields strong internal representational validity and very strong sorting and rating reliability estimates. Validity and reliability were consistently high despite variation in participation and task-completion percentages across data collection modes. The implications of these findings as a practical reference for assessing the quality and rigor of future concept mapping studies are discussed.
© 2011 Elsevier Ltd. All rights reserved.
Keywords: Concept mapping; Pooled analysis; Quality; Benchmarking; Validity; Reliability
1. Introduction

More than 20 years ago, Trochim and colleagues published a series of papers on concept mapping in a special issue of Evaluation and Program Planning (Trochim, 1989a). In this seminal work, the theoretical and practical features of concept mapping were outlined, making a case for its utility in planning, evaluation, and research. Since then, concept mapping has been applied in a number of fields and contexts, including public and community health (Rao et al., 2005; Risisky et al., 2008; Trochim, Cabrera, Milstein, Gallagher, & Leischow, 2006; Trochim, Milstein, Wood, Jackson, & Pressler, 2004), social work (Petrucci & Quinlan, 2007; Ridings et al., 2008), health care (Trochim & Kane, 2005), human services (Pammer et al., 2001; Paulson & Worth, 2002), and biomedical research and evaluation (Kagan, Kane, Quinlan, Rosas, & Trochim, 2009; Robinson & Trochim, 2007; Trochim, Marcus, Masse, Moser, & Weld, 2008). The publication of the book Concept Mapping for Planning and Evaluation (Kane & Trochim, 2007) provided concept mapping practitioners with a comprehensive methodological resource. Over the course of two decades, concept mapping has demonstrated value in addressing a variety of practical and
* Corresponding author. Tel.: +1 607 272 1206. E-mail addresses: [email protected], [email protected] (S.R. Rosas), [email protected] (M. Kane).
doi:10.1016/j.evalprogplan.2011.10.003
theoretical questions. As concept mapping has gained in popularity, so too has the need to define and examine the method's methodological quality. No published research has systematically assessed the degree to which concept mapping produces valid and reliable results across an array of different studies. The absence of such information limits researchers' ability to articulate, assess, and improve the methodological quality of concept mapping studies. To address this need, we accessed a large sample of concept mapping studies to: (a) quantitatively describe study characteristics across different phases of the process; (b) quantitatively describe specific indicators of validity and reliability; and (c) examine the relationship between select study characteristics and quality indicators. As context for this study, we provide a succinct overview of concept mapping, followed by a rationale for examining the quality of concept mapping as a mixed-method approach. Finally, we briefly explain validity and reliability as they pertain to concept mapping.

1.1. Concept mapping overview

Concept mapping is a type of structured conceptualization method designed to organize and represent ideas from an identified group. A participatory mixed-methods approach, concept mapping integrates qualitative individual and group processes with multivariate statistical analyses to help a group of individuals describe ideas on any topic of interest and represent
these ideas visually through a series of related two-dimensional maps (Kane & Trochim, 2007; Trochim, 1989a). Concept mapping is used frequently in evaluation as a practical means of addressing stakeholder participation in ways that enhance the relevance, ownership, and utilization of evaluation (Cousins & Whitmore, 1998).

The multi-phase concept mapping process typically requires participants to first brainstorm a large set of statements relevant to the topic of interest. Second, each participant sorts these statements into piles based on perceived similarity, and rates each statement on one or more scales. Third, multivariate analyses are conducted that include two-dimensional multidimensional scaling (MDS) of the unstructured sort data, a hierarchical cluster analysis of the MDS coordinates, and the computation of average ratings for each statement and cluster of statements. The resulting maps show the individual statements in two-dimensional (x, y) space, with more similar statements located nearer each other, and show how the statements are grouped into clusters that partition the space on the map. Finally, the group interprets the maps through a structured interpretation process designed to help them understand the maps and label them in a substantively meaningful way. The quantitative maps reveal how a group discerns the interrelationships between and among items and assigns values to ideas and concepts, thus constructing a basis for further discussion, interpretation, and action. We refer readers to Kane and Trochim (2007) for a more detailed description of the entire concept mapping process.

1.2. Quality in mixed-method applications

Consistent with the arguments for combining qualitative and quantitative methods in a single study (Creswell & Plano Clark, 2007; Sale, Lohfeld, & Brazil, 2002; Tashakkori & Teddlie, 1998), concept mapping blends the two in a complementary and additive manner.
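The analytic sequence described in Section 1.1 — aggregate individual sorts into a group similarity matrix, scale that matrix into two dimensions, then cluster the map coordinates — can be sketched as follows. This is an illustrative reconstruction on synthetic sort data: the studies analyzed here used the Concept System software with nonmetric MDS, for which classical (Torgerson) scaling serves only as a simple stand-in, and the statement, sorter, and cluster counts below are arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
n_statements, n_sorters = 30, 12

# Each sorter assigns every statement to one of a few self-chosen piles.
sorts = [rng.integers(0, rng.integers(4, 9), n_statements) for _ in range(n_sorters)]

# Group similarity matrix: entry (i, j) counts the sorters who placed
# statements i and j in the same pile.
similarity = np.zeros((n_statements, n_statements))
for piles in sorts:
    similarity += (piles[:, None] == piles[None, :]).astype(float)

# Classical (Torgerson) scaling as a simple stand-in for nonmetric
# two-dimensional MDS: double-center the squared dissimilarities and
# keep the two leading eigenvectors as (x, y) map coordinates.
d2 = (n_sorters - similarity) ** 2
np.fill_diagonal(d2, 0)
j = np.eye(n_statements) - np.ones((n_statements, n_statements)) / n_statements
b = -0.5 * j @ d2 @ j
vals, vecs = np.linalg.eigh(b)
xy = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0))

# Hierarchical (Ward) clustering of the map coordinates into a cluster solution.
clusters = fcluster(linkage(xy, method="ward"), t=8, criterion="maxclust")
```

In practice the cluster solution is not taken mechanically from the dendrogram; as the paper notes, it is selected through a combination of statistical analysis, expert judgment, and participant feedback.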
Rather than data remaining distinct but connected, as in some mixed-method applications, concept mapping integrates data at multiple points of the process. Qualitative and quantitative methods are combined in ways that challenge the distinction between the two, suggesting they may be more deeply intertwined (Kane & Trochim, 2007). Given the presence of several design typologies that emphasize a range of sequencing and mixing decisions (e.g., Creswell & Plano Clark, 2007; Tashakkori & Teddlie, 1998), addressing the quality of concept mapping is pertinent. However, the absence of a comprehensive set of criteria for critically appraising mixed-method studies (Tashakkori & Teddlie, 1998; Sale & Brazil, 2004) and the conceptual variation of mixed-method quality among evaluators and researchers (Caracelli & Riggin, 1994) further complicate how concept mapping quality should be operationalized. Although generic criteria have been used to assess the quality of mixed-method studies, more specific evaluation criteria, tailored to the design and approach, are warranted (Sale & Brazil, 2004). This perspective supports the need to address the methodological quality of concept mapping in ways unique to the approach.

1.3. Validity in concept mapping

The traditional notions of external and internal validity are challenging to operationalize for concept mapping, and are frequently overlooked. As defined by Cook and Campbell (1979), validity is the best available approximation of the truth or falsity, both external and internal, of a given inference, proposition, or conclusion. Because validity can be operationalized differently, we posit that external representational validity and internal representational validity may be analogues for concept mapping. External representational validity is concerned with the extent to which a conceptualized model mirrors the reality it is purported to
represent. Analytical strategies to assess the degree to which the conceptual model is recognized as the modal representation for a group have been suggested (Cacy, 1996). These techniques, however, are exploratory and not yet practical. Typically, the assessment of external representational validity is managed as a function of each concept mapping study: by seeking verification that the brainstormed statement set represents the topic under inquiry, by using multiple data collection and analysis methods, and by including independent participants with diverse perspectives. Because uniform data relevant to external validity are not routinely available for individual studies to include in a pooled analysis, external representational validity is not considered in this study.

Internal representational validity, however, is particularly germane to this study. Internal representational validity refers to the degree to which the conceptualized model reflects the judgments made by participants in organizing information to produce the model. In that sense, the question of whether the conceptualized model reveals the same distinctions among groupings made by the average participant is of particular importance. A case has been made that the analytic approach that anchors concept mapping results represents the best fit of the various cognitive structures of participants (Forgas, 1979). Questions have been raised, however, as to whether the final model may obscure some of the finer details, perhaps due to variations in how participants approach the structuring task (Keith, 1989). Thus, determining the overall match between the participant-structured input and the mathematically generated output is central to assessing internal representational validity. Several data elements common to all concept mapping studies can be used to evaluate the correspondence of the represented model to the original participant structures.
First, early work by Dumont (1989) and Trochim (1989b) suggests that the degree of configural similarity between input and output matrices can be measured by computing a Pearson product–moment correlation coefficient. A second measure, the stress value, is a goodness-of-fit indicator between a given set of dissimilarities used as input and the resultant distances in a configuration (Kruskal & Wish, 1978). Finally, the individual sorting input (i.e., the number of sorted piles) can be examined relative to the number of clusters, to understand the relationship between the groupings from each participant and the final partitioning of the content represented in the map. Collectively, these measures, computed from data routinely produced for each concept mapping study, can be used to estimate the internal representational validity of the conceptualized model.

1.4. Reliability in concept mapping

For concept mapping, the consistency of participant input can be assessed using the sorting and rating data. The reliability of participant ratings on a chosen scale for each of the final statements can be assessed by computing conventional item and rater reliability estimates. The reliability of participant input on the perceived relationships between statements can be assessed by computing a set of estimates specific to concept mapping sort data. As suggested by Trochim (1993), the traditional theory of reliability, as typically applied in social science research, does not readily conform to sort data in the concept mapping model. Conventional means for assessing reliability focus on estimating the repeatability of test items or total scores, based on some known or assumed correct response. Sort data in concept mapping are different.
Instead of estimating the reliability of items or overall scores of a measure, sorting reliability assessment is more appropriately focused on determining the extent to which the structural arrangements, both individually and collectively, reflect an assumed normatively typical arrangement. Thus, the individual
and aggregated sort configurations (the similarity matrices used as input to multidimensional scaling), as well as the resulting distance matrices (the between-item Euclidean distances output by multidimensional scaling), provide the information needed to calculate reliability estimates. We follow Trochim's (1993) recommended procedures for calculating a set of reliability statistics specific to concept mapping input and output, estimating the consistency of sorting within and across specific studies.
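As an illustration of the split-half logic behind these sort reliability statistics (a hedged sketch, not Trochim's exact published formulas), the sorters can be divided at random into two halves, each half's binary co-occurrence matrices summed into an aggregate similarity matrix, the two aggregates correlated over their off-diagonal entries, and the result stepped up with the Spearman–Brown correction. The synthetic sorters below are assumed to roughly agree on five underlying groups.

```python
import numpy as np

def cooccurrence(piles):
    """Binary same-pile matrix for one sorter's pile assignments."""
    piles = np.asarray(piles)
    return (piles[:, None] == piles[None, :]).astype(float)

def split_half_sort_reliability(sorts, rng):
    """Illustrative split-half reliability of aggregated sort matrices:
    correlate the two half-sample similarity matrices (off-diagonal
    entries only) and apply the Spearman-Brown step-up correction."""
    idx = rng.permutation(len(sorts))
    half1, half2 = idx[: len(idx) // 2], idx[len(idx) // 2 :]
    s1 = sum(cooccurrence(sorts[i]) for i in half1)
    s2 = sum(cooccurrence(sorts[i]) for i in half2)
    mask = ~np.eye(s1.shape[0], dtype=bool)      # ignore the diagonal
    r = np.corrcoef(s1[mask], s2[mask])[0, 1]    # half-to-half correlation
    return 2 * r / (1 + r)                       # Spearman-Brown step-up

rng = np.random.default_rng(1)
# Synthetic data: 20 sorters who mostly agree on 5 groups of 20 statements,
# with roughly 20% of assignments perturbed at random.
truth = np.repeat(np.arange(5), 20)
sorts = [np.where(rng.random(100) < 0.2, rng.integers(0, 5, 100), truth)
         for _ in range(20)]
rel = split_half_sort_reliability(sorts, rng)
```

With sorters this consistent the estimate lands near 1.0, mirroring the very strong sorting reliability the pooled analysis reports.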
2. Method

2.1. Overall study approach

The standardized procedures for data collection, data organization, analysis, and representation in the concept mapping process yield a set of common quantitative data, which can be configured in a comparable statistical form. This uniformity allows the same quality constructs and relationships to be examined across studies, and thus produces results that are more objective and exact than a narrative review. Our first step in conducting this pooled study analysis was to generate quantitative characteristics and estimates of the common data elements for each concept mapping study in a sample of studies meeting specific inclusion criteria. These characteristics and estimates were then aggregated across the sample of studies, describing the distribution, variation, and parameters for each characteristic and set of estimates. Finally, we conducted quantitative analyses to examine further the relationships between project characteristics.

2.2. Sample

The sample for this quantitative pooled study analysis was sixty-nine (69) individual concept mapping studies conducted within the past 10 years. This set is part of an archive of completed concept mapping studies conducted over that period by Concept Systems, Inc., the sole proprietor and licensor of the Concept System.[1] The criteria for sample selection were general and meant to include a wide range of studies. For inclusion in this pooled study analysis, each concept mapping study needed to have a final computed map (i.e., a final multidimensional scaling analysis result and cluster solution), and each study needed to have at least one rating completed by participants.

A majority of the concept mapping studies in the sample were classified as public health oriented (59.4%). Others were in the fields of human services (20.3%), biomedical research (5.8%), social science research (2.9%), and business or human resources (2.9%). Twenty-eight studies were supported through Federal sources (41.5%), with several receiving support from foundations or not-for-profit organizations (20.3%), universities or colleges (17.4%), and state government sources (11.6%). The stated purpose of each study varied considerably, confirming the broad use of concept mapping. Twenty-eight (40.6%) were used for strategic planning purposes, defined as an initiative of an organization to establish a specific strategy or direction and make decisions regarding resources to pursue this strategy. Twelve (17.4%) were conducted for the purpose of developing an action or research agenda, defined as a collaborative effort to outline a strategic direction in a field that extends across organizational boundaries. Project purposes also included: evaluation (14.5%), primarily for framework and design development; research (7.2%), primarily for conceptualization and theory development; needs assessment (8.7%); and program or intervention development (8.7%).

Three main types of data collection modes were identified in the sample. Face-to-face or traditional means, whereby the researcher interacts directly with the participants for brainstorming, sorting, and rating tasks, accounted for 14.5% of the studies. Web-based means, whereby information is gathered exclusively through the Internet without direct interaction between researchers and participants, accounted for 34.8% of the studies. Multi-method means, whereby information is collected through a variety of means including paper forms, face-to-face interaction, and web-based platforms, accounted for 50.7% of the studies.

2.3. Procedure

Once the list of acceptable studies was generated, we extracted data from each specific concept mapping study database. Descriptive characteristics for each study included: the number of participants (overall and by concept mapping task), completion rates by task, and the number of statements resulting from brainstorming. Categorical information for each study included: the general field of study, related organizational or institutional support, the general purpose of the study, and the primary data collection mode. For each study, we obtained the data needed to assess reliability and validity as they pertain to concept mapping and computed estimates. After generating each concept mapping study's specific characteristics and estimates, the results were configured, aggregated, and analyzed for the entire sample of studies. Measures of dispersion, central tendency, and interval estimates were computed for each study element for the pooled studies in the sample. We analyzed these estimates further to examine patterns and relationships between different concept mapping study characteristics. Finally, we computed several correlation analyses between concept mapping study characteristics that were expected to be related. Specifically, configural similarity and stress values were correlated with other concept mapping data, including the number of statements, the final number of clusters, and the sort reliability estimates.

3. Results

All pooled analysis results for the sample of concept mapping studies by study characteristic, described in subsequent sections, are presented in Table 1.

3.1. Participants

The total number of participants is the unduplicated count of individuals who provided input during the brainstorming, sorting, or rating tasks. The average total number of participants was 155.78, with a range of 20–649 across studies. The total number of participants for each study varied depending upon the concept mapping data collection mode, and a significant group effect was detected, F(2, 66) = 13.25, p < .001. The average total number of study participants for the web-based data collection mode was significantly larger (M = 243.42, SD = 172.14) than for both the face-to-face (M = 62.10, SD = 49.14), p < .001, and multi-method (M = 122.46, SD = 45.76), p < .001, modes. For specific concept mapping tasks, numbers are captured for both sorting and rating participation. Overall, the number of sorters averaged 24.62 (SD = 15.29), well over the recommended number of 15 (Jackson & Trochim, 2002), and nearly 1.7 times larger than the average found in Trochim (1993) (M = 14.62). The smallest study in the sample had 6 sorters and the largest had 90. For the ratings task, the number of participants averaged 81.77 (SD = 69.83) on the first rating (rating 1) and 65.82 (SD = 43.32) on the second (rating 2). The second rating typically has fewer participants, due to attrition,
[1] Details and availability of the technology can be found at http://www.conceptsystems.com.
Table 2
Participants and completion percentages by concept mapping data collection mode.

                       Average number of participants    Average percent completing task
Data collection mode   Sorting   Rating 1   Rating 2     Sorting   Rating 1   Rating 2
Face-to-face           25.7      44.7       33.6         74.1      80.3       72.8
Web-based              27.9      112.8      75.6         52.4      68.7       48.0
Multi-method           22.1      71.1       63.9         43.7      61.1       54.0
Overall                24.6      81.8       65.8         50.1      65.9       51.6
level of participant knowledge, or fatigue among those completing the first rating. The average number of participants for the first rating is nearly 5.8 times the average found in Trochim's (1993) study (M = 13.94). Differences in the number of participants by data collection mode were observed in this sample and are displayed in Table 2. No meaningful difference was found in the average number of sorters by data collection mode. However, a significant group effect was observed for rating 1, F(2, 66) = 4.62, p < .05, with significant mean differences in the number of participants between the web-based mode (M = 164.04, SD = 116.70) and both the face-to-face (M = 116.31, SD = 43.17), p < .01, and multi-method (M = 55.70, SD = 45.31), p < .05, modes. Interestingly, no group differences were detected in the average number of participants for rating 2 by data collection mode. On close inspection, the largest decrease in the average number of participants between ratings 1 and 2 was observed for the web-based mode. These findings suggest that the use of the web for concept mapping facilitates greater participation in the rating task. However, attrition from rating 1 to rating 2 is greater when the web is used exclusively for data collection. Despite the availability of the Internet to increase access to the concept mapping process, sorting remains a fairly intensive activity, intended to capture participant judgments about the
relationships between all items in a typically large set. Sorting participation was observed to be fairly consistent across the various data collection modes, and may be limited more by the demanding nature of the task than by the means of participation.

3.2. Completion rates for sorting and rating

As shown in Table 2, the average percent completion for the sorting and rating tasks indicates that, overall, more than half of those who initially agreed to complete a task did so in a manner that produced usable data. The percent completion for sorting differed by data collection mode, and a significant group effect was detected, χ²(2) = 111.36, p < .001. As expected, the average percent completion for sorting was highest when done face-to-face (M = 74.06, SD = 19.93) compared to the web-based (M = 52.38, SD = 23.88), z = 7.17, p < .001, and multi-method (M = 43.69, SD = 19.86), z = 10.28, p < .001, modes. The average percent completion for the web-based mode was also significantly higher than for the multi-method mode, z = 4.70, p < .001. The average percent completion of the rating tasks followed a similar pattern, with a significant group effect found for rating 1, χ²(2) = 107.50, p < .001, and rating 2, χ²(2) = 66.75, p < .001. As with sorting, the average percent completion for rating 1 was significantly higher for
Table 1
Concept mapping study characteristics and estimates.

Common study elements            M       SE     SD      Mdn     Min    Max    95% CI for mean
Number of statements             96.32   2.07   17.23   98.00   45     132    [92.18, 100.46]
Number of sorters                24.62   1.84   15.30   20.00   6      90     [20.95, 28.30]
Number of raters 1               81.77   8.04   69.83   62.00   18     485    [64.99, 98.54]
Number of raters 2               65.82   5.84   43.32   57.00   5      247    [54.11, 77.53]
Total number of participants     155.78  15.21  126.34  118.00  20     649    [125.43, 186.13]
Percent completing sorting       50.07   2.84   23.59   56.86   10.58  100    [48.39, 51.75]
Percent completing rating 1      65.87   2.43   20.24   70.27   12.79  100    [64.87, 66.87]
Percent completing rating 2      51.64   2.83   20.84   56.00   10.50  100    [50.47, 52.81]
Stress value                     .28     .00    .04     .29     .17    .34    [.27, .29]
r (configural similarity)*       .66     .01    .07     .66     .53    .83    [.64, .68]
r²                               .44     .01    .09     .43     .28    .68    [.42, .46]
Stress value split-half 1        .30     .00    .03     .30     .20    .36    [.29, .31]
Stress value split-half 2        .30     .00    .04     .31     .19    .35    [.29, .31]
r_II                             .87     .01    .06     .88     .69    .96    [.85, .88]
r_IT                             .96     .00    .02     .96     .90    .99    [.95, .96]
r_IM                             .91     .00    .04     .92     .80    .98    [.90, .92]
r_SHT                            .86     .01    .07     .87     .65    .97    [.85, .88]
r_SHM                            .63     .02    .17     .61     .26    .95    [.59, .67]
α for rating 1                   .97     .00    .02     .97     .91    .99    [.96, .97]
α for rating 2                   .97     .00    .02     .97     .91    .99    [.96, .97]
AICC for rating 1                .89     .01    .07     .92     .69    .99    [.88, .91]
AICC for rating 2                .87     .01    .10     .90     .42    .97    [.84, .90]
Number of map clusters           8.93    .19    1.55    9       6      14     [8.56, 9.30]
Average statements per cluster   11.10   .31    2.58    11.11   5.63   20.67  [10.43, 11.67]
Statements in largest cluster    18.64   .59    4.94    18.00   9      32     [17.45, 19.82]
Statements in smallest cluster   5.49    .23    1.94    5.00    1      10     [5.03, 5.96]
Average number of sorted piles   10.93   .23    1.88    10.90   6.55   15.76  [10.47, 11.38]
Median number of sorted piles    9.93    .27    2.22    10.00   6.00   16.00  [9.40, 10.46]
Largest pile of statements       23.25   1.09   9.04    21.00   11     61     [21.00, 25.42]
Smallest pile of statements      4.68    .15    1.28    5.00    2      8      [4.37, 4.99]

* Absolute values are reported for this characteristic.
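The SE and 95% CI columns in the characteristics table above follow the standard formulas for a pooled mean (SE = SD/√n; CI = M ± t(.975, n−1) × SE, with n = 69 studies). A minimal sketch, checked against the number-of-statements row:

```python
import math

def pooled_summary(mean, sd, n, t_crit=1.995):
    """Standard error and 95% confidence interval for a pooled mean.
    t_crit defaults to roughly t(.975, df = 68) for the n = 69 sample."""
    se = sd / math.sqrt(n)
    return se, (mean - t_crit * se, mean + t_crit * se)

# Number-of-statements row: M = 96.32, SD = 17.23, n = 69 studies.
se, (lo, hi) = pooled_summary(96.32, 17.23, 69)
print(round(se, 2), round(lo, 2), round(hi, 2))  # → 2.07 92.18 100.46
```

The reproduced values match the printed row (SE = 2.07, 95% CI [92.18, 100.46]). A few of the percent-completion rows print intervals narrower than this formula yields, which may reflect a different interval estimator for those characteristics.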
face-to-face (M = 80.25, SD = 21.61) than for the web-based (M = 68.73, SD = 24.61), z = 5.51, p < .001, and multi-method (M = 61.13, SD = 14.17), z = 8.74, p < .001, modes. Similarly, the average percent completion for rating 2 was significantly higher for face-to-face (M = 72.72, SD = 30.02) than for the web-based (M = 48.02, SD = 24.34), z = 7.21, p < .001, and multi-method (M = 53.95, SD = 12.58), z = 5.48, p < .001, modes. Interestingly, in both cases the average percent completion for web-based ratings was significantly higher than for the multi-method mode (z = 7.09, p < .001 for rating 1 and z = 4.86, p < .001 for rating 2). It is not surprising that the face-to-face mode of data collection yielded higher completion percentages, as the researcher manages the process directly. It may also be that the variety of forms and options to manage in the multi-method mode contributed to its consistently lowest percent completion. Nevertheless, the percent completion for data collected through the web is well above that found in other on-line activities, such as internet-based surveys, where completion rates of 20% to 30% are common (Cook, Heath, & Thompson, 2000; Kaplowitz, Hadlock, & Levine, 2004).

3.3. Statements

The sample of studies averaged 96.32 statements (SD = 17.23) with a range of 45 to 132. This average represents approximately a 20% increase over that initially found by Trochim (1993). No difference was found in the number of statements by data collection mode. Kane and Trochim (2007) report that with the availability of the web-based platform, concept mapping studies commonly yield brainstormed statement sets well over 100 items. Guidance on selecting an appropriate statement set size has emphasized the need to consider participant burden while working to ensure saturation of the topic (Kane & Trochim, 2007; Trochim, 1989a).
These authors recommend a structured process for synthesizing and reducing the set to a manageable size that minimizes burden and maximizes breadth. Thus, despite the propensity for statement sets to be very large when collected via the Internet, consistency in the size of the final statement set used for sorting and rating was evident.

3.4. Sorting and clusters

The sorting task asks each participant to arrange the set of statements into piles or groups based on perceived similarity. It is an unstructured sort procedure; there is no predetermined number of piles into which participants are expected to sort the statements. Procedurally, concept mapping participants receive a set of instructions and minimum expectations to guide the sorting task. Each participant is directed to sort the statements into an arrangement that makes sense to her or him (Kane & Trochim, 2007). The mean and median number of piles for each concept mapping study were identified and then summarized for the entire sample. The mean of the study-level average number of piles was 10.93 (SD = 1.88), and the mean of the study-level median number of piles was 9.93 (SD = 2.22). No difference was found in the number of individual participant sorted piles by data collection mode.

The number of clusters for each concept map is selected through a combination of statistical analysis, expert judgment, and participant feedback. There is no single correct number of clusters or a set of mathematical decision criteria for determining the final cluster solution (Kane & Trochim, 2007). The average number of clusters selected for the final concept map in the sample was 8.93 (SD = 1.55), with a range of 6–14. Again, no difference was found in the number of clusters selected for the final concept map in relation to the data collection mode. The distribution of statements across the final number of clusters for each study was also calculated. For each study, an average distribution was computed
by dividing the number of statements by the final number of clusters found in the map. The mean number of statements per cluster was 11.10 (SD = 2.58) for the sample. The number of statements in the largest and smallest clusters was identified for each study and averaged to identify the upper and lower levels across the sample. The average number of statements in the largest cluster was 18.64 (SD = 4.94, range: 9–32) and the average number in the smallest cluster was 5.49 (SD = 1.94, range: 1–10), suggesting considerable variation in cluster density across the sample.

Because sorting reflects the judgments made by each participant about the relationships among the statements in a set, and the number of clusters selected reflects the aggregated representation of those relationships, it is useful to determine the degree of correspondence. A significant Pearson product–moment correlation of r = .43, p < .001, was found, indicating a moderate relationship between the median number of piles and the final number of clusters into which the map is partitioned. Fig. 1 presents a bivariate plot of the relationship between the mean number of piles and the final number of clusters for each study. It portrays the correspondence between the structural arrangement of the sort data from each participant, on average, and the final partitioning of the multidimensional scaling structure across the sample. Collectively, these findings suggest a positive relationship and conceptual consistency between the aggregated groupings from participants and the final groupings found in the map, despite the multiple ways the final number of clusters may be selected.

3.5. Stress, fit, and similarity

Stress is a statistic routinely generated and reported in multidimensional scaling (MDS) analyses, reflecting the goodness of fit of the final representation to the original similarity matrix used as input.
Stress is the normalized residual variance of a monotone regression of the distances upon the dissimilarities or similarities (Kruskal, 1964). Thus, for any given configuration, the stress indicates how well that configuration matches the data. The average stress value for the sample was .28 (SD = .04, range: .17–.34, 95% CI [.27, .29]). The literature on multidimensional scaling suggests lower stress values are preferred and reflect better congruence between the raw data and the processed data (Davison, 1983; Kruskal, 1964). Stress values found in concept mapping analyses are typically higher than those recommended in
[Fig. 1. Plot of average piles by final number of clusters for 69 concept mapping studies.]
S.R. Rosas, M. Kane / Evaluation and Program Planning 35 (2012) 236–245
0.4 0.35
Stress Values
the literature on MDS. Several reasons for the discrepancy have been presented by Trochim (1993) and Kane and Trochim (2007). Comparatively, the stress values across the sample were very similar to those found by Trochim (1993). In fact, nearly the entire set of concept mapping studies in this sample (96%) had a stress value that fell within the 95% CI [.21, .37] originally estimated by Trochim (1993) and reported in Kane and Trochim (2007). Hence, in two pooled analyses using independent samples, nearly identical patterns of stress were observed. It should be noted that a group effect was found, F(2, 66) = 3.62, p < .05, when examining the average stress values by data collection mode. However, the mean difference between the web-based (M = .27, SD = .04) and multi-method (M = .29, SD = .03) modes, p < .05, was very small, and no difference was seen between face-to-face and web-based sorting. In terms of judging the acceptability of the stress values found across studies in this sample, a previous simulation study by Sturrock and Rocha (2000) can serve as a guide. Based on the distributions of over half a million randomly created and scaled matrices, these authors found that for two-dimensional MDS solutions in which 100 objects have been scaled, there is a 1% chance the arrangement of the objects in the matrix is random if the stress value is below an upper limit of .39. Thus, multidimensional maps with a stress statistic below this threshold have less than a 1% probability of having either no structure or a random configuration. Since none of the studies produced a stress value above .39 (even those with over 100 objects in the input matrix), it is likely that none of the two-dimensional configurations included in this study were random or without structure. As a second measure of validity, the configural similarity was calculated for each concept mapping study, reflecting the congruence between the data used as input and the final represented form.
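For readers who wish to compute or check the stress statistic directly, a minimal sketch follows. The function name and data handling are our own; note that a full nonmetric MDS compares distances to monotone-regressed disparities, whereas this simplified version compares them to the raw dissimilarities.

```python
import numpy as np

def stress1(dissimilarities: np.ndarray, distances: np.ndarray) -> float:
    """Simplified Kruskal stress-1: sqrt(sum (d - dhat)^2 / sum d^2),
    computed over the upper triangle of square symmetric matrices.
    `distances` holds Euclidean distances from the MDS configuration;
    `dissimilarities` holds the input (dis)similarity values."""
    iu = np.triu_indices_from(distances, k=1)
    d = distances[iu]
    dhat = dissimilarities[iu]
    return float(np.sqrt(np.sum((d - dhat) ** 2) / np.sum(d ** 2)))

# A configuration that reproduces its input perfectly has stress 0; per
# Sturrock and Rocha (2000), two-dimensional solutions with stress below
# about .39 are unlikely (<1%) to reflect a purely random arrangement.
```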
Configural similarity was estimated by computing the Pearson's Product–Moment correlation between the original aggregated similarity matrix from participants and the final matrix of Euclidean distances between points on the map. A squared correlation coefficient, r2, was also calculated for each study to assess the proportion of shared variance of the input and output data. The average squared correlation of the input similarities and the scaled distances from the MDS coordinates was .44 and statistically significant, t(68) = 5.38, p < .05. This estimate indicates that on average 44% of the variation in the aggregated participant sorts was accounted for by the conceptualized model. No difference was found in the proportion of shared variance of the input and output by data collection mode. Fig. 2 presents a bivariate plot that illustrates the relationship between the stress values and the r2 values for each study in the sample. As observed in the figure, the better the fit of the sort data with the statistically represented model (i.e., lower stress values), the greater the proportion of the input data accounted for in the model. Taken together, these results suggest that concept mapping performs well in representing a complex set of multivariate data in two-dimensional space. An examination of the correlation between the stress value and the number of sorters for each study revealed no linear relationship between the two variables. Nonetheless, it is important to understand further the relationship between sorting participation and stress. To model how stress is affected by the number of sorters, a subset of five studies with the most sorters was identified. Within this subset, sort data from an individual participant was randomly selected from the list of participants who completed the sort task in that study. A second randomly selected sort was then aggregated with the previous, and the stress value calculated.
The process continued, and resulted in the inclusion of sort data from all study participants. At every addition
Fig. 2. Plot of stress values by coefficient of determination values for 69 concept mapping studies.
of one randomly selected participant, the fit of the map was calculated and assessed. The results for each of the five studies are displayed in Fig. 3, with the number of sorters plotted on the x-axis and stress values plotted on the y-axis. Trends across the five studies reveal a consistent pattern of stress as the number of sorters increases. The variability in the stress values for each study was dramatic when about 15 or fewer sorters were included. As the number of sorters for each of the five studies reached about 35, substantial improvements in stress (i.e., lower stress values) were observed. However, beyond 40 sorters, only marginal improvements in stress were detected. These findings suggest that a group of between 20 and 30 sorters is warranted to maximize the consistency of fit in the concept mapping representation by minimizing the variability in the stress value found with smaller groups of sorters. This range is about twice the number recommended by Trochim (1989a) and Jackson and Trochim (2002). Our observation comports with previous card sorting studies examining the sample sizes needed to produce high quality representations (Tullis & Wood, 2004; Wood & Wood, 2008). While smaller numbers of sort participants may still yield acceptable stress values, the likelihood of generating a higher stress value is greater with smaller groups. Thus, consideration of the appropriate sorting sample in designing concept mapping studies is critical.

3.6. Sorting reliability estimates

For each study in the sample, five unique reliability estimates for the sort data were generated using the procedures outlined in Trochim (1993), then averaged. First, the set of sort data from each study was randomly divided into two halves (for odd-numbered groups, one group was randomly assigned one more person than the other). Separate concept maps were computed for each group.
The total sort matrices for each split-half group were correlated and the Spearman–Brown correction applied to obtain the split-half reliability of the sorts (rSHT). The Euclidean distances between all pairs of points on the two maps from the split-half samples were also correlated and the Spearman–Brown correction applied to obtain the split-half reliability of the map (rSHM). Second, the sort matrices for each individual were correlated and averaged. The Spearman–Brown correction was applied to yield the Individual-to-Individual Sort Reliability (rII). Third, the sort matrix for each individual was correlated with the total similarity matrix. These correlations were averaged and the Spearman–Brown correction applied to produce the Individual-to-Total Matrix Reliability (rIT). Finally, the sort matrix for each individual was correlated with the
Fig. 3. Increase in sorters by stress values: five largest concept mapping studies in the pooled sample.
Euclidean distances from the overall, final map. These correlations were averaged and the Spearman–Brown correction applied to produce the individual-to-map reliability (rIM). Of the five reliability estimates calculated, three rely on analysis of the sort data used as input. Overall, the average reliability estimates for the sort data for the sample were high. The average individual-to-individual sort reliability value (rII) was .87, the average individual-to-total matrix value (rIT) was .96, and the average split-half total matrix reliability (rSHT) was .87. No differences were found in the reliability estimates of the sort data in relation to the data collection mode. Two of the reliability estimates include information from the final map for each study. The average reliability between individuals' sort matrices and the final map configuration (rIM) was .91. The average split-half reliability of the final map configuration (rSHM) was .67. The lower value for rSHM was expected, as this reliability estimate is calculated on the split-half sample of analyzed or processed data, rather than the raw data used to generate rSHT (Trochim, 1993). Nevertheless, the average reliability estimates found here were high and slightly above those found in Trochim's (1993) previous study. The interrelationships between the sorting reliability estimates were assessed by calculating the Pearson's Product–Moment correlations for the pairs of estimates. All correlations were significant at the .001 level. The correlations between the sort data estimates (rII, rIT, rIM, rSHT) were strongly positive (range: .94–.99), while correlations between the split-half map reliability and all other reliabilities were lower (range: .62–.72). This suggests strong associations between all reliability estimates, even among those assessing the reliability of the output data.
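The split-half procedure with the Spearman–Brown correction described above can be sketched as follows. This is a simplified illustration, assuming each participant's sort is already coded as a binary statement-by-statement co-occurrence matrix; the function names are our own, not the authors'.

```python
import numpy as np

def spearman_brown(r: float, factor: float = 2.0) -> float:
    """Step a half-length reliability up to the full-length estimate."""
    return factor * r / (1.0 + (factor - 1.0) * r)

def split_half_reliability(sorts: list, seed: int = 0) -> float:
    """Randomly split individual co-occurrence matrices into two halves,
    correlate the two aggregated similarity matrices (upper triangles),
    and apply the Spearman-Brown correction (analogous to rSHT)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(sorts))
    half = len(sorts) // 2
    a = sum(sorts[i] for i in order[:half])   # aggregated matrix, group A
    b = sum(sorts[i] for i in order[half:])   # aggregated matrix, group B
    iu = np.triu_indices_from(a, k=1)
    r = np.corrcoef(a[iu], b[iu])[0, 1]
    return spearman_brown(r)
```

If every participant sorted identically, the two halves correlate perfectly and the corrected reliability is 1.0; disagreement among sorters pulls the estimate down.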
The five sort data reliability estimates were also correlated with the number of statements and the number of sorters, to further examine relationships between sort reliability and other study characteristics that presumably affect the estimates. The number of statements was marginally correlated with the reliability estimates, although the correlations were negative. Conversely, the number of sorters was significantly correlated with
the reliability estimates, ranging from .44 to .71 (all significant at p < .001). This finding suggests that more sorters may yield more reliable sorting results, although the differences may be minor.

3.7. Rating reliability estimates

Reliability estimates of the ratings data were calculated for each of the concept mapping studies in the sample (54 studies had two ratings). For each rating, the internal consistency was assessed by computing the average correlation among items using Cronbach's alpha. We were also interested in the reliability of different raters averaged together. The reliability of averaged ratings is a more useful statistic in practice due to the considerable variation in reliability among raters, which cannot otherwise be assessed (MacLennan, 1993). The average measure intraclass correlation (AICC) was calculated to produce an inter-rater reliability coefficient, which is equivalent to the average correlation between all pairs of raters with the Spearman–Brown correction for the number of raters. The average Cronbach's alpha coefficients for both ratings 1 and 2 were above .96, suggesting the items on the scale are highly intercorrelated and internally consistent (DeVellis, 1991), even for studies where two ratings were administered. A significant group effect was found for rating 1, F(2, 66) = 4.29, p < .05, and rating 2, F(2, 51) = 3.20, p < .05, when examining the average alphas by data collection mode. For rating 1, mean differences were found between the web-based (M = .97, SD = .02) and face-to-face (M = .95, SD = .02) modes, p < .05. Similarly, for rating 2, mean differences were found between the web-based (M = .97, SD = .02) and face-to-face (M = .95, SD = .02) modes, p < .05. However, these differences were very slight, and the alphas by group were high enough that the differences are not meaningful.
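The internal consistency estimate above uses the standard Cronbach's alpha formula, which can be sketched as follows for a raters-by-statements rating matrix (this is the textbook formula, not the authors' own code):

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a raters-by-statements matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores),
    where k is the number of statements (items)."""
    k = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1).sum()     # per-statement variances
    total_var = ratings.sum(axis=1).var(ddof=1)       # variance of rater totals
    return (k / (k - 1)) * (1.0 - item_vars / total_var)
```

When the statements covary strongly across raters, the total-score variance dominates the summed item variances and alpha approaches 1.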
The average inter-rater reliability coefficients (AICC) were also high, suggesting that across raters the mean ratings are stable, although the inter-rater reliability coefficients for the second rating were slightly lower. No differences were found in the AICC reliability estimates for rating 1 or rating 2, relative to the data collection mode.
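The average-measure intraclass correlation can be sketched as the two-way, consistency-form ICC(C,k), which for averaged measures is algebraically equivalent to Cronbach's alpha computed across raters. The specific ICC variant is our assumption, since the text does not name one.

```python
import numpy as np

def icc_average(scores: np.ndarray) -> float:
    """Average-measure intraclass correlation, two-way consistency form:
    ICC(C,k) = (MS_rows - MS_error) / MS_rows,
    for a statements-by-raters score matrix (rows = statements)."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    col_means = scores.mean(axis=0)
    ss_rows = k * ((row_means - grand) ** 2).sum()    # between-statement SS
    ss_cols = n * ((col_means - grand) ** 2).sum()    # between-rater SS
    ss_total = ((scores - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows
```

Because the consistency form removes the rater main effect, raters who agree on the ordering of statements but differ by a constant offset still yield an ICC of 1.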
These reliability estimates were correlated with the number of statements and the number of raters to further examine relationships between rating reliability and other concept mapping characteristics that presumably affect the estimates. Moderately strong correlations between the number of statements and Cronbach’s alphas for rating 1 (r = .49, p < .001) and rating 2 (r = .55, p < .001) were detected. This suggests that larger statement sets yield higher internal consistency estimates. Because ratings used in concept mapping are uni-dimensional in nature; that is, they measure a single construct across a large number of items (e.g. importance, feasibility, readiness, etc.), it is not surprising that higher alphas were found. Moreover, there is a tendency for large numbers of items to produce higher alpha coefficients (DeVellis, 1991). Similarly, the number of raters was moderately correlated with the inter-rater reliability coefficients for both rating 1 (r = .53, p < .001) and rating 2 (r = .51, p < .001), suggesting larger numbers of raters yield higher inter-rater reliability estimates.
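The finding that larger numbers of raters yield higher inter-rater reliability is what the Spearman–Brown prophecy formula predicts for averaged parallel raters. A small sketch follows; the single-rater reliability of .30 is illustrative only, not an estimate from the pooled sample.

```python
def prophecy(single_rater_r: float, k: int) -> float:
    """Spearman-Brown prophecy: reliability of the mean of k parallel raters."""
    return k * single_rater_r / (1.0 + (k - 1) * single_rater_r)

# Reliability of averaged ratings climbs quickly with additional raters.
for k in (1, 5, 10, 20):
    print(k, round(prophecy(0.30, k), 2))
```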
4. Discussion

As with similar mixed-method applications, the concept mapping studies that are the basis of this pooled analysis were conducted to understand complex realities, using data from multiple perspectives to combine and present practical information. This quantitative analysis generated useful baseline information to address questions regarding the methodological quality of concept mapping. The study approach suggested several means for determining the quality of concept mapping that are appropriate, defensible, and relevant. In particular, this work considers the significance of validity and reliability as they relate to concept mapping, and reports on these critical aspects. With this emphasis in mind, concept mapping as an integrated mixed-method approach for planning, evaluation, and research appears to generate valid and reliable results. Although the representation of a complex set of input data was limited to two dimensions, the internal representational validity across the set of studies was found to be good, supported by multiple measures of fit and similarity. While better fit and greater similarity between the input data and output representation might be observed using more than two dimensions, the current approach appears appropriate for generating the most parsimonious and interpretable results. The reliability of the sort data was observed to be high, both between sorters and among sets of aggregated sorters. Likewise, the consistency of the rating data, between individuals and among items, was very high. These findings comport with an earlier, more limited, study of the reliability of concept mapping by Trochim (1993), where similar patterns were found using the same reliability calculations. As observed in this sample, the advent of a web-based platform for conducting concept mapping asynchronously in a virtual environment has expanded the level of participation across all phases of the process.
This presents concept mapping users with unique benefits for expanded participation, and notable challenges to establishing and maintaining quality. For example, the use of the web affects the percent of completion across different tasks based on the ability to invite a greater number to take part in the study. However, despite the utilization of web technology and increased access to concept mapping for a broader set of participants, estimates of reliability and validity appear consistent. Indeed, no meaningful differences were found between multiple data collection modes and the estimates of reliability and validity calculated in this study. While questions persist as to how data collected through group processes like brainstorming are affected when generated individually (Dugosh, Paulus, Roland, & Yang,
2000), it appears that quality and rigor can be maintained for concept mapping data collected via the Internet. Although this sample does not represent the totality of concept mapping studies, this pooled study analysis is the largest and most comprehensive conducted to date. The studies combined for this analysis were diverse in scope and topic, and typical of those found in the published research and evaluation literature. In fact, the results of several concept mapping studies included in this analysis were published previously across a variety of content areas and disciplines. Each study in this sample followed the same procedural steps and was subject to similar constraints during implementation. This process consistency, coupled with data collection and computation standardization at the study level, enabled an analysis that produced findings configured in a comparable statistical form. This study's systematic approach enabled the identification and analysis of patterns that might otherwise be obscured in a case-by-case assessment. Notwithstanding the strengths of the analytical strategy, several limitations to the study are important to note. First, as with any analysis that pools the results of multiple studies, the variability in random and non-random error is of concern. To minimize error, and subsequent over- or underestimation, we employed several strategies. A detailed protocol for calculating the indicators was consistently applied for each study in the sample. In addition, strict inclusion criteria were used to ensure methodological homogeneity in the pooled sample. Nonetheless, for this analysis, advanced statistical methods were not employed when analyzing the sample, to mitigate the potential compounding of error found at the study level. Second, the analysis does not capture qualitative distinctions across studies.
Each of the individual studies included in this study sample contains extensive detail that offers insight into a particular topic considered in context. The qualitative variation in participant experiences, settings, content, interpretation, and uses of the results was not considered in this pooled study analysis. Furthermore, the assessment of the quality of elements that are directly influenced by the qualitative judgments of the researcher was not undertaken. Cluster selection and labeling, for example, is one area that would benefit from further study to better delineate some notion of intra-observer agreement in the determination of clusters and names of concepts represented by each cluster. The typical approach is for the researcher to conduct the analysis and arrive at options for cluster arrays, and then discuss and confirm or revise them with key study stakeholders. Given that the concept mapping process calls for cooperation and negotiation in the final structure and labeling, capturing the degree to which two reasonable people with knowledge in the field agree or disagree has important implications for quality. Third, relative to the entire set of concept mapping projects completed in the past 20 years, it is possible that this study includes potentially problematic studies. However, this risk is somewhat mitigated by results that were consistent with those found in previous pooled analyses (cf. Trochim, 1993). In addition, the sample consisted of studies completed by the originators and providers of the concept mapping technique, and as such is an exclusive set. Implemented by experts in the conduct of the concept mapping process, these studies may not be representative of studies conducted by those with less experience with the method. A case could be made that studies in the sample represent a high level of quality due to the adherence to the concept mapping process, as outlined by the developers.
Finally, this pooled study analysis is fundamentally correlational. This study examined only the linear relationships between select characteristics found in concept mapping. Questions as to whether varying certain features of the concept mapping process, such as participant engagement, may result in changes in other characteristics remain unanswered.
Despite these limitations, three implications of the findings warrant attention. First, the results of this analysis offer a practical reference for researchers and evaluators to judge the quality of concept mapping studies and support their choices related to data collection, analysis, and representation. In establishing a basis for comparison, this study generated empirical data for several characteristics of the concept mapping process, providing realistic estimates of what one might expect in typical field applications of the method. This study establishes a set of benchmarks and ranges that can be used for individual concept mapping studies to gauge the reliability and validity of the results, and can provide concept mapping users a basis for confirming the practical issues of fidelity and integrity related to their work (Bradbury & Reason, 2001). Second, the results provide critical information that helps to establish expectations for the quality of concept mapping as a social science research method, against which other researchers, journal reviewers, editors, and dissertation committees can evaluate the utility of the methodology's processes and outcomes in future studies. Except for basic counts of the number of statements and participants, few published concept mapping studies report critical elements included in this study, such as stress values. This is likely due, in part, to the lack of information about standards for reporting and how the data are generated. By extension, the absence of clear expectations as to what constitutes quality for concept mapping hinders peer reviewers in their appraisal of submissions. Thus, the results of this study provide both researchers, and those who review research, information to help evaluate and ensure the rigor of concept mapping studies across a broad base of literature. Third, the results provide a set of empirically grounded recommendations for different activities within the concept mapping process.
Most of the current recommendations for concept mapping found in the literature are general heuristics based on professional experience or logic. The results of this study provide systematically derived information that practitioners can use to make key decisions in the concept mapping process, which will affect quality. Recommendations regarding the appropriate number of sorters and suggestions for interpreting stress values are based on examining the variation of these elements in relation to other indicators. Furthermore, the criteria that emerged from this study for judging the integrity of concept mapping are more consistent with the assumptions and realities of the approach. For example, present criteria for determining the acceptability of fit for multidimensional scaling (MDS) applications are based on experimental and synthetic data (Kruskal, 1964). For applied field studies like concept mapping where MDS is used, it seems more appropriate and reasonable to assess the acceptability of fit in relation to results from the study of similar practical applications. Thus, using the results of this analysis as a reference, concept mapping practitioners should routinely report on important indices, such as stress values, to allow others to judge the relative quality of their studies. This work represents a foundational step in building a base of evidence to support the methodological quality and expectations of the concept mapping approach. However, several areas of inquiry remain incomplete and suggest opportunities for future pooled study investigation, including: content analysis in areas where multiple concept maps have been produced; examination of concept mapping processes and procedures; and inquiry into participant characteristics across different studies. Future analyses might also include other studies completed by a broader pool of concept mapping practitioners.
Pooled study analyses of studies with smaller numbers of participants or those with relatively smaller statement sets might provide useful information not observed in this study. Moreover, the development and standardization of methods for assessing external representational validity
and consistency in subjective decisions are needed to further the dialogue on validity in concept mapping. Ultimately, defining appropriate expectations of quality and rigor for concept mapping, established through systematic study, has value for users of the approach. This study attempted to address academic questions of validity, reliability, and quality while at the same time accounting for the practical and participatory concerns of concept mapping. This study supports those who use concept mapping in their work, helping to establish value in participation, measurement, data, and conclusions within a mixed-method, participatory application.
Acknowledgements

The authors would like to acknowledge the support of the staff at Concept Systems, Inc. Specifically, we thank Brenda Pepe for information related to the concept mapping studies included in the analysis, Perry Slack for study level data extraction, and Marie Cope for review and comments on the manuscript. We reference and acknowledge the foundational thinking and work of William Trochim in the methodology's early applications and summary reliability estimates of almost 20 years ago.

References

Bradbury, H., & Reason, P. (2001). Broadening the bandwidth of validity: Issues and choice-points for improving the quality of action research. In P. Reason & H. Bradbury (Eds.), Handbook of action research: Participative inquiry and practice (pp. 447–456). Thousand Oaks, CA: Sage Publications. Cacy, J. R. (1996). The reality of stakeholder groups: A study of the validity and reliability of concept maps. Ph.D. dissertation, University of Oklahoma. Caracelli, V. J., & Riggin, L. J. C. (1994). Mixed-method evaluation: Developing quality criteria through concept mapping. Evaluation Practice, 15(2), 139–152. Cook, C., Heath, F., & Thompson, R. L. (2000). A meta-analysis of response rates in web- or Internet-based surveys. Educational and Psychological Measurement, 60(6), 821–836. Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin. Cousins, J. B., & Whitmore, E. (1998). Framing participatory evaluation. New Directions for Evaluation, 80, 5–23. Creswell, J. W., & Plano Clark, V. L. (2007). Designing and conducting mixed methods research. Thousand Oaks, CA: Sage Publications. Davison, M. L. (1983). Multidimensional scaling. New York, NY: John Wiley and Sons. DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage. Dugosh, K. L., Paulus, P. B., Roland, E. J., & Yang, C.-H. (2000). Cognitive stimulation in brainstorming.
Journal of Personality and Social Psychology, 79(5), 722–735. Dumont, J. (1989). Validity of multidimensional scaling in the context of structured conceptualization. Evaluation and Program Planning, 12, 81–86. Forgas, J. P. (1979). Multidimensional scaling: A discovery method in social psychology. In G. P. Ginsburg (Ed.), Emerging strategies in social psychological research (pp. 253–288). New York: Wiley. Jackson, K., & Trochim, W. (2002). Concept mapping as an alternative approach for the analysis of open-ended survey responses. Organizational Research Methods, 5(4), 307–336. Kagan, J. M., Kane, M., Quinlan, K. M., Rosas, S., & Trochim, W. M. K. (2009). Developing a conceptual framework for an evaluation system for the NIAID HIV/AIDS clinical trials networks. Health Research Policy and Systems, 7(12). Kane, M., & Trochim, W. M. K. (2007). Concept mapping for planning and evaluation. Thousand Oaks, CA: Sage Publications. Kaplowitz, M. D., Hadlock, T. D., & Levine, R. (2004). A comparison of web and mail survey response rates. Public Opinion Quarterly, 68(1), 94–101. Keith, D. (1989). Refining concept maps: Methodological issues and an example. Evaluation and Program Planning, 12(1), 75–80. Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1), 1–27. Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage. MacLennan, R. N. (1993). Interrater reliability with SPSS for Windows 5.0. The American Statistician, 47(4), 292–296. Pammer, W., Haney, M., Wood, B. M., Brooks, R. G., Morse, K., Hicks, P., et al. (2001). Use of telehealth technology to extend child protection team services. Pediatrics, 108(3), 584–590. Paulson, B. L., & Worth, M. (2002). Counseling for suicide: Client perspectives. Journal of Counseling & Development, 80, 86–93. Petrucci, C. J., & Quinlan, K. M. (2007).
Bridging the research-practice gap: Concept mapping as a mixed methods strategy in practice-based research and evaluation. Journal of Social Services Research, 34(2), 25–42.
Rao, J. K., Alongi, J., Anderson, L. A., Jenkins, L., Stokes, G. A., & Kane, M. (2005). Development of public health priorities for end-of-life initiatives. American Journal of Preventive Medicine, 29(5), 453–460. Ridings, J. W., Powell, D. M., Johnson, J. E., Pullie, C. J., Jones, C. M., Jones, R. L., et al. (2008). Using concept mapping to promote community building: The African American initiative at Roseland. Journal of Community Practice, 16(1), 39–63. Risisky, D., Hogan, V. K., Kane, M., Burt, B., Dove, C., & Payton, M. (2008). Concept mapping as a tool to engage a community in health disparity identification. Ethnicity & Disease, 18, 77–83. Robinson, J. M., & Trochim, W. M. K. (2007). An examination of community members', researchers' and health professionals' perceptions of barriers to minority participation in medical research: An application of concept mapping. Ethnicity and Health, 12(5), 521–539. Sale, J. E., & Brazil, K. (2004). A strategy to identify critical appraisal criteria for primary mixed-method studies. Quality and Quantity, 38, 352–365. Sale, J. E., Lohfeld, L. H., & Brazil, K. (2002). Revisiting the quantitative–qualitative debate: Implications for mixed methods research. Quality and Quantity, 36, 43–53. Sturrock, K., & Rocha, J. (2000). A multidimensional scaling stress evaluation table. Field Methods, 12(1), 49–60. Tashakkori, A., & Teddlie, C. (1998). Mixed methodology: Combining qualitative and quantitative approaches. Applied Social Research Methods Series, Vol. 46. Thousand Oaks, CA: Sage Publications. Trochim, W. M. K. (1989a). An introduction to concept mapping for planning and evaluation. Evaluation and Program Planning, 12(1), 1–16. Trochim, W. M. K. (1989b). Concept mapping: Soft science or hard art? Evaluation and Program Planning, 12(1), 87–110. Trochim, W. M. K. (1993, November). The reliability of concept mapping.
Paper presented at the Annual Conference of the American Evaluation Association. Trochim, W. M. K., Cabrera, D. A., Milstein, B., Gallagher, R. S., & Leischow, S. J. (2006). Practical challenges of systems thinking and modeling in public health. American Journal of Public Health, 96(3), 538–546.
Trochim, W., & Kane, M. (2005). Concept mapping: An introduction to structured conceptualization in health care. International Journal for Quality in Health Care, 17(3), 187–191. Trochim, W. M. K., Marcus, S. E., Masse, L. C., Moser, R. P., & Weld, P. C. (2008). The evaluation of large research initiatives: A participatory integrative mixed-methods approach. American Journal of Evaluation, 29, 8–28. Trochim, W. M. K., Milstein, B., Wood, B. J., Jackson, S., & Pressler, V. (2004). Setting objectives for community and systems change: An application of concept mapping for planning a statewide health improvement initiative. Health Promotion Practice, 5(1), 8–19. Tullis, T., & Wood, L. (2004, June). How many users are enough for a card-sorting study? Proceedings of the Usability Professionals Association Conference. Wood, J., & Wood, L. (2008). Card sorting: Current practices and beyond. Journal of Usability Studies, 4(1), 1–6.

Scott R. Rosas, PhD is a Senior Consultant at Concept Systems, Inc. where he specializes in the design and use of the concept mapping methodology. His work has focused on conceptualization and measurement in evaluation using concept mapping, with attention to the validity and reliability of the approach. He received his PhD in Human Development and Family Studies from the University of Delaware, with an emphasis on program evaluation. He previously served as Associate Faculty at the Bloomberg School of Public Health at Johns Hopkins University and is currently Adjunct Faculty in the Department of Health at SUNY-Cortland.

Mary Kane, MSLIS is the Chief Executive and Principal Consultant at Concept Systems, Inc. Her consulting experience includes strategic and operational planning, product and program development, education and training design, and program needs assessment and evaluation. She has coauthored several articles on the application of concept mapping across several content areas, including public and community health. Ms.
Kane is co-author of the definitive volume on concept mapping: Concept Mapping for Planning and Evaluation. She holds a Masters degree in Library and Information Sciences from Columbia University.