International Journal of Industrial Ergonomics 44 (2014) 551–560
Contents lists available at ScienceDirect
International Journal of Industrial Ergonomics journal homepage: www.elsevier.com/locate/ergon
Increasing X-ray image interpretation competency of cargo security screeners
Stefan Michel a,b, Marcia Mendes a,b,*, Jaap C. de Ruiter c, Ger C.M. Koomen d, Adrian Schwaninger a,b
a School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland (FHNW), Riggenbachstrasse 16, 4600 Olten, Switzerland
b Center for Adaptive Security Research and Applications (CASRA), Thurgauerstrasse 39, 8050 Zürich, Switzerland
c Netherlands Organization for Applied Scientific Research (TNO), Lange Kleiweg 137, Rijswijk, The Netherlands
d Dutch Customs, Kingsfordweg 1, 1043 GN, Amsterdam, The Netherlands
Article info
Article history: Received 23 August 2012; Received in revised form 10 October 2013; Accepted 31 March 2014; Available online
Abstract
X-ray screening of containers and unit load devices in the area of cargo shipping is becoming an essential and common feature at ports and airports all over the world. The detection of prohibited items in X-ray images is a challenging task for screening officers as they need to know which items are prohibited and what they look like in X-ray images. The main aim of this study was to investigate whether X-ray image interpretation competency of cargo security screeners can be increased by computer-based training. More specifically, effects of training were investigated by conducting tests before training started and after approximately three months of training. Moreover, it was examined whether viewing X-ray images in pseudo color would lead to better detection performance than viewing them in greyscale. Recurrent computer-based training resulted in large performance increases after three months. No significant difference in detection performance was found between tests using X-ray images in greyscale and tests using pseudo color.
Relevance to industry: Cargo X-ray screening is becoming a common feature at ports and airports. The identification and detection of prohibited items in X-ray images depends heavily on human operators and their competence in X-ray image interpretation. Thus, research on appropriate training methods and enhancements of the human factor is essential to achieve and maintain high levels of security.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Airport security; Cargo security screening; X-ray screening; Human factors; Object recognition; Detection performance
1. Introduction
In the last decade, airport security has been substantially enhanced in the areas of passenger, cabin baggage and hold baggage screening. However, air cargo, whether transported on passenger planes or commercial cargo aircraft, has not yet received a similar level of attention. Nevertheless, the explosion of a cargo plane could severely affect the world’s economy, commerce or the global supply chain. Since at least 2010 it has been clear that cargo is an attractive target for terrorists (TIME, 2010). Consequently, X-ray cargo screening for the inspection of unit load devices (ULDs) and containers is emerging quickly and becoming a common feature at ports and airports. A big advantage of X-ray screening is that images
* Corresponding author. School of Applied Psychology, University of Applied Sciences and Arts Northwestern Switzerland (FHNW), Riggenbachstrasse 16, 4600 Olten, Switzerland. Tel.: +41 62 957 23 53. E-mail address: [email protected] (M. Mendes).
http://dx.doi.org/10.1016/j.ergon.2014.03.007
0169-8141/© 2014 Elsevier B.V. All rights reserved.
of the contents of a ULD or container can be created quickly, without causing any physical changes within the containers. Concerning X-ray screening technology, remarkable improvements have been achieved over the last decades. For the field of cargo X-ray screening, modern X-ray inspection systems especially tailored to the examination of containers, trucks and rail cars exist, which produce images similar to those obtained through traditional X-ray baggage screening machines at airports (e.g., Reed, 2008, 2009). Despite the advances in technology, the actual decision whether an X-ray image of a container or ULD contains a prohibited item or not still needs to be taken by a human operator, i.e. a cargo security screener. Cargo X-ray screening, in particular, must be considered an extremely challenging task. The large sizes of containers make it very difficult for the cargo security screeners to detect the proportionally small threat items (e.g. an improvised explosive device (IED)) and contraband goods. Moreover, it can be assumed that perpetrators would intentionally disguise or hide their materials, or break contraband down into component parts and distribute these within the container. Still, screening officers are expected to assess the
contents of large containers or ULDs within a few minutes only. Therefore, it is essential to consider human factors in security screening and design systems to support the decision-making of human operators (e.g., Harris, 2002; Kraemer et al., 2009). Selection and training of screening personnel is a highly relevant factor for improving man-machine system performance. The most sophisticated machines become worthless if the people who operate them and visually inspect the X-ray images are not qualified to do so (Al-Fandi et al., 2009; Hardmeier et al., 2005; Schwaninger et al., 2005). Research on X-ray security screening of passenger baggage at airports has shown that performance of human operators (i.e. airport security screeners) is determined by several factors (e.g., Michel and Schwaninger, 2009; Schwaninger, 2006). First of all, not every person has the potential to become a good screening officer, as certain aptitudes and abilities are required for this task (Hardmeier et al., 2005; Hardmeier and Schwaninger, 2008; Schwaninger et al., 2005). For that reason it is advisable to apply scientifically proven selection tests, such as the X-Ray Object Recognition Test (X-Ray ORT), as part of pre-employment assessment procedures, so that right from the beginning only people well suited for the screening job are chosen for this task (Hardmeier et al., 2005; Schwaninger et al., 2005). Furthermore, in order to detect prohibited items reliably, a screening officer must know which items are prohibited and what they look like in X-ray images. Many prohibited items are rarely seen in everyday life and might look quite different from reality when displayed as an X-ray image (e.g. IEDs). According to object recognition theories and visual cognition, object shapes not similar to the ones stored in visual memory are difficult to recognize (Graf et al., 2002; Schwaninger, 2004a, 2004b). Schwaninger et al. 
(2005) found that the detection of prohibited items is dependent on knowledge-based and image-based factors. Knowledge-based factors refer to knowledge required for the detection of prohibited items, i.e. knowledge on which items are allowed vs. prohibited, what objects look like in reality, and what they look like in X-ray images of passenger bags or containers. This underlines the importance of classroom, computer-based (CBT) and on-the-job training. Image-based factors relate to the difficulty of inspecting an X-ray image. According to Schwaninger (2003) and Schwaninger et al. (2005) three image-based factors are relevant in X-ray screening: viewpoint of an object, superposition of an object by other objects and the complexity of a bag/container. Depending on perceptual experience and the ability to mentally rotate objects, rotated items are more difficult to recognize (effect of viewpoint). Moreover, when prohibited objects are superimposed by other objects (effect of superposition), or when bags are closely packed and contain many items which could attract attention (effect of complexity), the identification of prohibited items becomes more challenging. Several studies have shown that coping with image difficulty resulting from these factors can be improved through training (e.g., Hardmeier et al., 2006b; Koller et al., 2008). Yet, a person’s learning process is greatly dependent on visual abilities and cannot merely be attributed to training (Hardmeier et al., 2006b). Regarding knowledge-based and image-based factors, individually adaptive computer-based training (CBT), e.g. with X-Ray Tutor (XRT), is considered to be a very powerful tool for achieving and maintaining a good X-ray image interpretation competency in cabin baggage and hold baggage screening (Halbherr et al., 2013; Hardmeier et al., 2006b; Koller et al., 2008; Schwaninger and Hofer, 2004). 
First, it exposes screening officers to objects they usually do not encounter in everyday life (e.g., IEDs). Second, it trains screeners to identify all sorts of objects in different views (rotations), when superimposed by other objects, and in bags (containers) of different complexities.
Considering the objective of increasing container security at ports and airports by improving X-ray screening, the aim of this project was to find out whether the screening officers’ performance in detecting prohibited items in cargo X-ray images can be enhanced by implementing recurrent CBT. Moreover, another essential aspect concerning screening performance is the speed at which an operator performs bag/container searches while maintaining an optimal performance level. Results of previous studies with XRT revealed that training not only increased detection performance of screening officers but also reduced reaction time (Koller et al., 2009; Michel et al., 2007b; Schwaninger and Hofer, 2004; Schwaninger and Wales, 2009; Wales et al., 2009). This is relevant for (cargo) X-ray screening applications where throughput matters. State-of-the-art X-ray screening machines, especially for cabin and hold baggage screening, are able to produce high-quality, material-coded color images. Additionally, they offer a variety of so-called “image-enhancement functions” (IEFs) such as color inversion, edge-enhancement, organic only, metal only etc. (for illustrations, see e.g. Michel et al., 2007a). IEFs are applied to bring out detail which is obscured or to highlight certain features (e.g., organic content). However, previous studies (Klock, 2005; Michel, Koller et al., 2007) investigating the usefulness of IEFs for cabin baggage (CBS) and hold baggage screening (HBS) have cast doubt on whether IEFs are useful. Results varied for the different threat types (guns, knives, IEDs and other threat items). The findings of these studies highlight the importance of systematically investigating the usefulness of IEFs in order to optimize human–machine interaction. In order to obtain colored X-ray images, dual energy technology (high and low energy beams) is required (for more details see Fainberg, 1992; Ogorodnikov and Petruin, 2002). 
For cargo X-ray screening machines, the application of dual energy technology is still difficult and costly. In this study, images were recorded with single energy technology, providing greyscale images. As humans can only discern a few dozen grey level values while they can distinguish thousands of colors (Gonzales and Woods, 2002), it could be assumed that the application of color would help to better distinguish between objects. Moreover, color adds vivacity to images and can thus decrease boredom or fatigue and increase attention (Abidi et al., 2005). Color coding could therefore significantly improve detection performance in X-ray screening. In our study, we investigated whether showing the original greyscale image in pseudo colors could improve the detection of prohibited items in cargo X-ray images, and, if so, for which kind of items or images this application was of particular help. Previous studies focusing on cabin baggage screening have demonstrated the effectiveness of CBT in increasing the X-ray image interpretation competency of screening officers (Halbherr et al., 2013; Koller et al., 2008; Michel et al., 2007b; Schwaninger, 2004b; Schwaninger et al., 2008, 2007; Schwaninger and Wales, 2009). Recurrent CBT is considered to be an indispensable prerequisite for achieving and maintaining good operational performance at security checkpoints. For all airports within the EU, recurrent training consisting of image recognition training and testing, either in the form of classroom or computer-based training, is obligatory for all persons operating X-ray equipment (European Commission, 2010). For this study, a new CBT system for the area of cargo X-ray screening was created (Cargo-XRT, C-XRT) and tested with cargo security screening officers at an international European airport. The main aim of this study was to investigate whether X-ray image interpretation competency of cargo security screeners could be increased by introducing CBT with the C-XRT. 
Moreover, it was examined whether detection performance is better when pseudo colored images are displayed compared to when the original greyscale images are shown. Further analyses were conducted in order to find out which type of threat items
(security vs. customs related items) were more difficult for screening officers to detect in a ULD or container, as well as how much time screening officers needed to judge an image, and whether reaction times could be decreased through training.

2. Method and procedure

2.1. Participants and procedure
Altogether, 40 employees (9 working in administration (4 males) and 31 experienced cargo security screeners (25 males)) participated in this study. The experimental design is shown in Fig. 1. Three groups were formed (two experimental groups and one control group). Administrative personnel working in customs served as a control group. In order to ensure that the groups were balanced with regard to the visual abilities needed for X-ray image interpretation, all participants completed the X-Ray Object Recognition Test (X-Ray ORT) in advance (see 2.2.1). Based on the X-Ray ORT results, the cargo security screeners were distributed into the two experimental groups. A univariate ANOVA using the detection performance scores A′ (Pollack and Norman, 1964) of the X-Ray ORT was conducted to test the equality of all three groups in terms of their visual abilities. No significant difference between the groups (greyscale, pseudo color, control group) was revealed, F(2, 36) = .032, p = .969, η² = .002. Both experimental groups underwent weekly recurrent training of at least 20 min per week with the C-XRT for approximately three months (resulting in a total of M = 8.95 h of training, SD = 5.68), while the control group did not receive any training. Detection performance was measured using the Cargo X-Ray CAT (C-CAT) (see 2.2.3). Measurements were taken before training started and again after three months of training. The two experimental groups completed different versions of the C-CAT (greyscale X-ray images (n = 15) vs. pseudo color X-ray images (n = 16)). The control group (n = 9) only completed the greyscale test version (see Fig. 1).

2.2. Materials

2.2.1. X-Ray Object Recognition Test (X-Ray ORT)
As mentioned above, not every person has the potential to become a good X-ray screening officer. Certain aptitudes and abilities are required to cope with the three image-based factors described above (viewpoint, superposition and image complexity)
relevant in X-ray screening, which can be measured using the X-Ray ORT (Hardmeier et al., 2005; Schwaninger et al., 2005). In the X-Ray ORT, X-ray images of passenger bags are displayed. As novices would not know how to interpret color information (the coding of materials), images are only shown in greyscale. Furthermore, only objects with common shapes (guns and knives), which would be identifiable without experience or special knowledge, are used. Threat items are combined with bags of different complexities (low and high) and with different levels of superposition (low and high). Every threat item is shown from two different viewpoints (easy vs. difficult). Each bag image is used twice, once without (non-threat image) and once containing a prohibited item (threat image). Altogether, the test consists of 256 trials. The task of the candidate taking the X-Ray ORT is to visually inspect the X-ray images and decide whether a gun or knife is present. Each image is presented for four seconds. For more detailed information on the X-Ray ORT see Schwaninger et al. (2005), Schwaninger et al. (2004) and Hardmeier et al. (2005, 2006a).

2.2.2. Cargo X-Ray Tutor (C-XRT) training system
X-Ray Tutor (XRT) is a computer-based training system developed to enhance X-ray image interpretation competency of X-ray screening officers (Schwaninger, 2004a). During training with XRT, X-ray images of passenger bags are presented to the screening officers which they have to visually inspect in order to decide whether a bag can be regarded as harmless (OK) or not (NOT OK). In case a bag is judged as NOT OK the screener has to identify the threat item in the bag by clicking on it. After each response, feedback is provided, informing the screener whether the response was correct or not. If the bag contains a prohibited item, detailed information about this item (what kind of threat item and what it looks like when X-rayed) as well as its location in the bag is displayed (Schwaninger, 2004a). 
One core advantage of XRT is that it is individually adaptive, i.e. training sessions are created based on each trainee’s individual performance and learning progress. The XRT training starts by showing threat items depicted in easy (canonical) views, with low superposition and low bag complexities. The difficulty of each of these factors is then increased depending on each individual’s learning progress, using sophisticated image processing algorithms (see Bolfing et al., 2008). For this study, a new version of XRT was created and adapted to the field of cargo X-ray screening. The C-XRT was based on the same scientific background as earlier XRT versions, also accounting for
Fig. 1. Experimental design of the study.
the three relevant image-based factors and also being individually adaptive and level-based. The XRT training systems for CBS and HBS use a scientifically validated image merging algorithm (Mendes et al., 2011), where an X-ray image of a fictional threat item (FTI) is automatically inserted into an X-ray image of a passenger bag during training. As ULDs or containers are much more complex with regard to size and filling, automatic merging could not yet be applied for the C-XRT. Instead, combined threat images (CTIs) were shown in the C-XRT. For this, X-ray images of ULDs/containers and FTIs were recorded separately. The FTIs were then merged into the ULD/container images by screening experts using X-ray image merging software based on the machine-specific X-ray image format. The screening experts made sure that each CTI was realistic with regard to the placement of the FTI and that the FTIs were technically visible, i.e. they were potentially detectable for the human operator. All CTIs were double-checked by an additional expert group to ensure that they were realistic. The C-XRT version used in this study only included the most important features of XRT. Besides a zoom function, this version also offered the application of different IEFs. Images in the C-XRT could be viewed using the following options (similar to the ones the screening officers could apply at the X-ray machine during work): Normal (Greyscale), black & white square root (BW SQRT), black & white logarithmic (BW LOG), black & white histogram (BW HIST) and pseudo color (PC).

2.2.3. Cargo X-Ray Competency Assessment Test (C-CAT)
The X-Ray Competency Assessment Test (X-Ray CAT) is a reliable and valid instrument developed to measure X-ray image interpretation competency of security screeners, i.e. how well they can detect different types of prohibited items in X-ray images of passenger bags (Koller and Schwaninger, 2006). As in XRT, X-ray images of passenger bags are displayed on a computer screen. The security screeners have to visually inspect these for prohibited items and decide whether a bag can be considered as harmless (OK) or whether it contains a prohibited object (NOT OK). For the field of cargo X-ray screening, a new version of the X-Ray CAT (Cargo X-Ray CAT, C-CAT) had to be created. Two versions were designed, one with the original greyscale images, and one where these images were color coded (pseudo color images). For the color coding, an X-ray machine-specific algorithm was used which derived the colors from the different levels of grey in the greyscale images (not through the differentiation of materials). Apart from the coloring, both tests were identical, containing the same images of containers/ULDs (see Fig. 2 for an example). The C-CAT consisted of 240 trials based on 120 different X-ray images of cargo containers and ULDs. All images were displayed once without prohibited items and once containing prohibited goods or threat items. The C-CAT contained 36 customs related and 24 security related objects, which were visually similar to objects included in the training system, but not identical. The number and types of prohibited items were defined and selected in collaboration with customs experts. The customs related objects were 24 drug items, eight weapons and four “miscellaneous” (e.g., Chinese medicine, watches etc.) items. The security related objects were inert IEDs or explosives, as defined and prepared by TNO.¹ The comparisons between the 24 security and 24 drug items were of special interest. The items from these two categories could be further divided into three different weight categories (light/medium/heavy), resulting in eight items per weight class for each of these two groups. All objects were shown once with low superposition and once with high superposition. This added up to 120 threat images altogether. The applied merging software displayed the superposition value. This value was computed from the difference between the pixel intensity values of the ULD/container image with the FTI and those of the same image without the FTI, using the following formula (see Koller et al., 2008):

\[ SP = \frac{\sqrt{\sum_{x,y} \left[ I_{SN}(x, y) - I_{N}(x, y) \right]^{2}}}{\text{Object size}} \]

SP: superposition; I_SN: greyscale intensity of the SN (signal plus noise) image (contains a prohibited item); I_N: greyscale intensity of the N (noise) image (contains no prohibited item); x, y: image coordinates; object size: number of pixels of the FTI where R, G and B are <253. Using this equation, the superposition value is independent of the size of the prohibited item. In addition to detection performance, the C-CAT allowed reaction times to be measured, i.e. the time a screener needed to search a ULD/container before taking a decision. The screening officers were instructed to analyze the images as fast and accurately as possible. Each image disappeared from the screen after a maximum of five minutes.

¹ Netherlands Organization for Applied Scientific Research (TNO).

3. Results and discussion
According to signal detection theory (SDT), four possible types of responses exist (Green and Swets, 1966): hit, false alarm, correct rejection and miss. In this study, A′ (Pollack and Norman, 1964) was applied as a measure of detection performance. It considers the hit rate as well as the false-alarm rate and can be calculated with the following formulae (Grier, 1971):

\[ A' = .5 + \frac{(H - F)(1 + H - F)}{4H(1 - F)} \quad \text{when } H > F \tag{1} \]

\[ A' = .5 - \frac{(F - H)(1 + F - H)}{4F(1 - H)} \quad \text{when } H < F \tag{2} \]

H is the hit rate and F the false alarm rate. If performance is below chance, i.e. when H < F, Equation (2) must be used (Aaronson and Watts, 1987). Due to the security confidential nature of performance values, these are not displayed in this paper. In order to provide meaningful results, relative differences between the two measurements and effect sizes are reported. The reported effect sizes are interpreted based on Cohen (1988). For t-tests, d between .20 and .49 represents a small effect size; d between .50 and .79 represents a medium effect size; d ≥ .80 represents a large effect size. For analysis of variance (ANOVA) statistics, η² between .01 and .05 represents a small effect size; η² between .06 and .13 represents a medium effect size; η² ≥ .14 represents a large effect size. Fig. 3 shows the detection performance of the two experimental groups and the control group for both C-CAT measurement times, i.e. at T1 (before training started) and at T2 (after approximately three months of training). As can be seen in the figure, there were large improvements as a result of training for both experimental groups (greyscale and pseudo color). Separate pairwise t-tests revealed significant differences between the two measurement times with large effect sizes for the greyscale group, t(14) = 5.38, p < .001, d = 1.39, and the pseudo color group, t(15) = 3.96, p < .01, d = 1.16. No significant difference was found for the control group, t(8) = .12, p = .909, d = .04. This suggests that improvements found in the experimental groups are solely due to training and cannot be explained by repeated exposure to the C-CAT.
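As a minimal sketch of how A′ can be computed from a hit rate and a false alarm rate, the following Python function implements the Grier (1971) formula for H > F and the below-chance form of Aaronson and Watts (1987), in which the correction term is subtracted from .5. The function name and the example rates are hypothetical; actual performance values from the study are confidential and are not used here.

```python
def a_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Non-parametric sensitivity A' from hit rate H and false alarm rate F."""
    h, f = hit_rate, false_alarm_rate
    if not (0.0 <= h <= 1.0 and 0.0 <= f <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    if h == f:
        # Chance performance: both branches reduce to .5 (and avoid 0/0).
        return 0.5
    if h > f:
        # Above-chance branch (Grier, 1971).
        return 0.5 + ((h - f) * (1 + h - f)) / (4 * h * (1 - f))
    # Below-chance branch (Aaronson and Watts, 1987).
    return 0.5 - ((f - h) * (1 + f - h)) / (4 * f * (1 - h))


# Hypothetical rates, for illustration only.
print(a_prime(0.8, 0.2))  # above chance
print(a_prime(0.2, 0.8))  # below chance, mirrored around .5
```

With this definition, swapping H and F mirrors the score around .5, so a screener performing below chance receives a value below .5.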
Fig. 2. Example of the same X-ray image of a ULD, once in greyscale and once in pseudo colors.
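For the pseudo color test version, colors were derived from the grey levels alone rather than from material information. As a rough illustration of what such a mapping involves, the sketch below assigns each 8-bit grey level a fixed RGB triple. The blue-to-green-to-red ramp and the names pseudo_color and colorize are assumptions made for this example; the machine-specific algorithm used in the study is not described in this paper.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def pseudo_color(grey: int) -> RGB:
    """Map an 8-bit grey level (0-255) to an RGB triple.

    Hypothetical blue -> green -> red ramp, NOT the machine-specific algorithm.
    """
    if not 0 <= grey <= 255:
        raise ValueError("grey level must be in 0..255")
    t = grey / 255.0
    if t < 0.5:
        s = t / 0.5            # lower half: fade from blue to green
        r, g, b = 0.0, s, 1.0 - s
    else:
        s = (t - 0.5) / 0.5    # upper half: fade from green to red
        r, g, b = s, 1.0 - s, 0.0
    return (round(r * 255), round(g * 255), round(b * 255))


def colorize(image: List[List[int]]) -> List[List[RGB]]:
    """Apply the lookup to every pixel of a greyscale image (list of rows)."""
    return [[pseudo_color(p) for p in row] for row in image]


# Tiny hypothetical 1 x 2 image: darkest and brightest grey level.
print(colorize([[0, 255]]))
```

Because every grey level maps to exactly one color, a pseudo color image of this kind carries the same information as the greyscale original and only renders it differently.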
Additionally, a repeated measures ANOVA using A′ scores was conducted with test date (first, second) as a within-participants factor and group (greyscale vs. pseudo color) as a between-participants factor. There was a large effect for test date, F(1, 29) = 40.457, p < .001, η² = .582, and no effect for group, F(1, 29) = .039, p = .845, η² = .001. The interaction between test date and group was not significant, F(1, 29) = .021, p = .887, η² = .001. Hence, improvements were similar for both groups and it cannot be concluded that either one of the image versions at test (greyscale vs. pseudo color) led to a better detection performance. In summary, a highly significant increase in detection performance measured with the C-CAT could be observed after three months of training. Altogether, the experimental groups (greyscale and pseudo color groups combined) increased their detection performance significantly by 14.9% (t(30) = 6.47, p < .001, d = 1.28). Thus, the results imply that training with the C-XRT did help to improve X-ray image interpretation competency of the cargo security screeners. Compared to earlier studies on cabin baggage screening for which the authors have access to the data (e.g., Halbherr et al., 2013; Koller et al., 2008; Michel, de
Ruiter et al., 2007; Schwaninger and Wales, 2009), the absolute detection performance scores achieved in the C-CAT at both measurement times of this study are lower. However, it must be taken into account that the cargo security screeners had only been training for a very short period of time (M = 8.95 h, SD = 5.68). Further increases could be expected with a continuation of training. Moreover, the lower detection performance scores of the cargo security screeners at the baseline measurement underline the relevance of training. Another repeated measures ANOVA was conducted including the factor superposition. A′ scores were analyzed with test date (first, second) and superposition (low vs. high) as within-participants factors and group (greyscale vs. pseudo color) as a between-participants factor. Large main effects of test date and superposition were revealed, while again no significant effect for group was found (see Table 1, a, for all results). Furthermore, the interaction of superposition and group was significant. As Fig. 4 indicates, the effect of superposition was as expected at T2 for both groups (greyscale and pseudo color), i.e. higher detection performance when superposition is low. This was not so clear at T1 which might be
Fig. 3. Mean detection performance A′ with standard errors of the mean for the greyscale group, pseudo color group and control group at the first (T1) and second (T2) measurement. For security reasons, actual A′ scores are not displayed in the figures. The standard error of the mean represents the standard deviation of the sample mean’s estimate of a population mean (Everitt, 2003).
explained by the relatively low detection performance of both groups before training started. As described earlier, the threat items contained in the C-CAT were made up of 36 customs and 24 security related items. The security related items contained inert explosives and IEDs, the customs related items were 24 drug items, eight weapons (guns and knives) and four “miscellaneous” (e.g., Chinese medicine, watches etc.) items. Fig. 5 suggests that detection performance increased after training for every category. Both experimental groups achieved the highest scores for the detection of IEDs and the lowest for the category miscellaneous, guns, knives.
Table 1. Results of the ANOVAs conducted with detection performance (A′) as the dependent variable.

Analysis | Factor | df | F | η² | p value
a | Test date (T) (first, second) | 1, 29 | 38.19 | .568 | <.001
a | Superposition (S) (low vs. high) | 1, 29 | 9.70 | .251 | <.01
a | Group (G) (greyscale vs. pseudo) | 1, 29 | .05 | .002 | =.831
a | T × G | 1, 29 | .06 | .002 | =.804
a | S × G | 1, 29 | 8.40 | .225 | <.01
a | T × S | 1, 29 | 15.41 | .347 | <.001
a | T × S × G | 1, 29 | 1.01 | .034 | =.322
b | Test date (T) (first, second) | 1, 29 | 37.90 | .566 | <.001
b | Category (C) (drugs; miscellaneous, guns, knives; explosives; IEDs) | 2.31, 67 | 33.45 | .536 | <.001
b | Group (G) (greyscale vs. pseudo) | 1, 29 | .24 | .008 | =.626
b | T × G | 1, 29 | .01 | .000 | =.915
b | T × C | 2.18, 63 | 4.40 | .132 | <.05
b | C × G | 3, 67 | .09 | .003 | =.965
b | T × C × G | 3, 63 | .06 | .002 | =.982
c | Test date (T) (first, second) | 1, 29 | 36.97 | .560 | <.001
c | Category (C) (customs 24 vs. security 24) | 1, 29 | 41.94 | .591 | <.001
c | Group (G) (greyscale vs. pseudo) | 1, 29 | .12 | .004 | =.733
c | T × G | 1, 29 | .01 | .000 | =.921
c | C × G | 1, 29 | .71 | .024 | =.407
c | T × C | 1, 29 | 9.78 | .252 | <.01
c | T × C × G | 1, 29 | .21 | .007 | =.653
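The η² values in Table 1 appear to be consistent with partial eta squared recovered from each F value and its degrees of freedom, η² = (F · df_effect) / (F · df_effect + df_error). The sketch below illustrates this relation and attaches the Cohen (1988) labels stated earlier. The function names are illustrative, that the authors report partial eta squared is an assumption, and the label "negligible" for values below .01 is an addition made for this example.

```python
def partial_eta_squared(f_value: float, df_effect: float, df_error: float) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    if f_value < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and df must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def effect_label(eta_sq: float) -> str:
    """Label an eta squared value using the thresholds given in the text."""
    if eta_sq >= 0.14:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "negligible"  # assumption: the text gives no label below .01


# Main effect of test date in Table 1, a: F(1, 29) = 38.19.
eta = partial_eta_squared(38.19, 1, 29)
print(round(eta, 3), effect_label(eta))
```

Applied to the main effect of category in Table 1, b (F(2.31, 67) = 33.45), the same function gives a value that rounds to the reported .536.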
Fig. 4. Mean detection performance A′ with standard errors of the mean for the greyscale group and pseudo color group at the first (T1) and second (T2) measurement, broken up by superposition.
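The low vs. high superposition split shown in Fig. 4 relies on the superposition value defined in Section 2.2.3, i.e. the square root of the summed squared intensity differences between the image with and without the FTI, divided by the object size. The following sketch illustrates that computation on plain nested lists. The function name, the representation of the FTI as a single-channel image and the tiny example images are assumptions made for illustration; the merging software used in the study is not reproduced here.

```python
import math
from typing import List

Image = List[List[int]]  # rows of 8-bit greyscale intensities


def superposition(img_sn: Image, img_n: Image, fti: Image) -> float:
    """Superposition value SP = sqrt(sum of squared differences) / object size.

    img_sn: container image with the FTI merged in (signal plus noise).
    img_n:  the same container image without the FTI (noise).
    fti:    the FTI on its own; pixels below 253 count towards the object size
            (single-channel stand-in for "R, G and B < 253", an assumption).
    """
    squared_sum = 0.0
    for row_sn, row_n in zip(img_sn, img_n):
        for i_sn, i_n in zip(row_sn, row_n):
            squared_sum += (i_sn - i_n) ** 2
    object_size = sum(1 for row in fti for pixel in row if pixel < 253)
    if object_size == 0:
        raise ValueError("FTI contains no object pixels")
    return math.sqrt(squared_sum) / object_size


# Hypothetical 2 x 2 example: the FTI darkens two container pixels.
container = [[100, 100], [100, 100]]
merged = [[100, 97], [96, 100]]
threat = [[255, 120], [110, 255]]
print(superposition(merged, container, threat))
```

In this reading of the formula, a small value means the merged FTI changes the container image only slightly relative to its size, i.e. the item is heavily overlaid by container content.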
Separate pairwise t-tests were conducted to compare detection performance at the first and second measurement for both groups and each threat category (Table 2). For both groups, increases in detection performance were revealed for drugs, explosives and IEDs. The comparisons of the effect sizes d between the t-tests of the four categories confirm that the training effect was particularly large for IEDs. Both experimental groups hardly differed with respect to the scores achieved for each category at each measurement time. A repeated measures ANOVA with test date (first, second) and category (drugs; miscellaneous, guns, knives; explosives; IEDs) as within-participants factors and group (greyscale vs. pseudo color) as between-participants factor confirmed this assumption (results can be seen in Table 1, b). There was no main effect of group (greyscale vs. pseudo color) and no interaction involving group reached statistical significance. Thus, with either kind of image (greyscale vs. pseudo color), prohibited items were detected equally well. As Fig. 6 suggests, when comparing the 24 security related items with the 24 drug items (customs related items) a considerably larger increase for the security related items could be found. Separate pairwise t-tests for both experimental groups and both measurement times were conducted for these two categories. Effect sizes d were calculated to examine this difference (see Table 3) and indeed much larger effect sizes were found for security related items. Moreover, Fig. 6 illustrates the similarity in detection performance of the two experimental groups (greyscale and pseudo color). The results of a repeated measures ANOVA with the within-participants factors test date (first, second) and category (drug items vs. security items) and the between-participants factor group (greyscale vs. pseudo color) are reported in Table 1, c. 
Large main effects were revealed for test date and category, whereas again no main effect was found for group and no interaction involving group reached statistical significance. The significant interaction between test date and category could indicate that the identification of the security related items was easier to learn than the identification of the drug related items, and therefore increases in detection performance were larger for this category. Figs. 5 and 6 also imply that the customs related items (including the 24 drug items, eight weapons and four “miscellaneous”), in general, seemed to be more difficult to recognize than the security related items.

Fig. 5. Mean detection performance A′ with standard errors of the mean for the greyscale group and pseudo color group at the first (T1) and second (T2) measurement, broken up by category (drugs; miscellaneous, guns, knives; explosives; IEDs).

In order to optimize throughput where relevant, it is of special interest to investigate how much time cargo security screeners need to search a container/ULD reliably and whether reaction time can be reduced through training. Reaction time (RT) was measured (in milliseconds) by the C-CAT software as the time from X-ray image onset until a response button was pressed (i.e. the OK or NOT OK button). For all three groups, a decrease in reaction time could be observed between the first and second measurement (see Fig. 7). A repeated measures ANOVA with test date (first, second) as within-participants factor and group (greyscale, pseudo color, control) as between-participants factor revealed a large main effect for test date, and no significant effects for group and the interaction between test date and group (see Table 4).

At first sight, the results of the ANOVA could be interpreted as a general effect resulting from repeated exposure to the C-CAT. However, it must be pointed out that such a general effect applies only to RTs. While they were lower at T2 for all groups, the improvement in detection performance was only observed for the two groups that received training, but not for the control group (see Fig. 3 and statistical results reported above). Finally, it must be taken into consideration that the RTs observed in this study were measured in controlled test situations. In reality, cargo security screeners often take more time to inspect a container/ULD.
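The effect sizes in Table 4 can be cross-checked against the F statistics. The sketch below assumes that the reported η² values are partial eta squared, which this section does not state explicitly, and uses the identity η²p = (F · df_effect) / (F · df_effect + df_error). Table 4 prints 1, 37 for the group factor; since that factor has three levels, the sketch assumes 2 numerator degrees of freedom for it. The function and variable names are illustrative.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    if f_value < 0 or df_effect < 1 or df_error < 1:
        raise ValueError("F must be non-negative and both df must be positive")
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values taken from Table 4; the df for the group factor is an assumption (see above).
rt_effects = {
    "Test date (T)": (10.1, 1, 37),
    "Group (G)": (1.86, 2, 37),
    "T x G": (0.09, 2, 37),
}

eta_by_effect = {
    name: partial_eta_squared(f_value, df_effect, df_error)
    for name, (f_value, df_effect, df_error) in rt_effects.items()
}
for name, eta in eta_by_effect.items():
    print(f"{name}: partial eta squared = {eta:.3f}")
```

Under these assumptions the three values round to .214, .091 and .005, which agrees with the η² column of Table 4.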
Further analyses of the training data investigated the relationship between test performance and training behavior (number of images seen during training and number of hours trained) and revealed significant correlations. Both linear and logarithmic correlations were computed to test which trend fits better. A linear correlation of r = .379, p < .05 and a logarithmic correlation of r = .414, p < .05 were found between the number of images seen during training by the cargo security screeners and performance in the C-CAT at the second measurement (Fig. 8a). Moreover, when correlating the difference in detection performance of all cargo security screeners between the two measurements with the number of hours trained in the meantime, a linear correlation of r = .464, p < .05 and a logarithmic correlation of r = .464, p < .01 were found (Fig. 8b). All correlations indicate that training with C-XRT did have an effect on the performance in the C-CAT and, thus, led to an increase in detection performance. For the existing data, a slightly better fit was found with the logarithmic functions. Usually, benefits from practice follow nonlinear functions, as improvement is rapid at first but eventually levels off as the practitioner becomes more skilled (e.g., Ebbinghaus, 1971; Heathcote and Brown, 2000; Ritter and Schooler, 2001; Thorndike, 1913). Hence, it is likely that logarithmic functions explain these distributions better, which is consistent with our results.

Table 2
Results of the t-tests comparing detection performances between the first (T1) and second (T2) measurement for each category.

Group / category               t      df   p value   d
Greyscale group
  Drugs T1–T2                  3.54   14   <.01      .95
  Misc., guns, knives T1–T2    1.03   44   =.307     .21
  Explosives T1–T2             3.67   14   <.01      1.06
  IEDs T1–T2                   4.64   14   <.001     1.29
Pseudo color group
  Drugs T1–T2                  2.41   15   <.05      .71
  Misc., guns, knives T1–T2    1.47   47   =.147     .29
  Explosives T1–T2             3.29   15   <.01      1.16
  IEDs T1–T2                   4.18   15   <.001     1.36

Fig. 6. Mean detection performance A′ with standard errors of the mean for the greyscale group and pseudo color group at the first (T1) and second (T2) measurement, broken up by category (customs vs. security).

Table 3
Results of the t-tests comparing detection performances between the first (T1) and second (T2) measurement for the 24 customs related (drugs) and 24 security related items.

Group / category               t      df   p value   d
Greyscale group
  Customs 24 (drugs) T1–T2     3.54   14   <.01      .95
  Security 24 T1–T2            5.74   14   <.001     1.51
Pseudo color group
  Customs 24 (drugs) T1–T2     2.41   15   <.05      .70
  Security 24 T1–T2            4.29   15   <.001     1.47

Table 4
Results of the ANOVA conducted with reaction time (RT) as the dependent variable.

Factor                                          df      F      η²     p value
Test date (T) (first, second)                   1, 37   10.1   .214   <.01
Group (G) (greyscale, pseudo color, control)    1, 37   1.86   .091   =.17
T × G                                           2, 37   .09    .005   =.914

4. Summary and conclusion

The main aim of this study was to investigate whether X-ray image interpretation competency of cargo security screeners can be increased by computer-based training (CBT). Consistent with results of previous studies in cabin baggage screening (Koller et al., 2008; Michel, de Ruiter et al., 2007; Schwaninger, 2004b; Schwaninger et al., 2008, 2007; Schwaninger and Wales, 2009), a highly significant increase in detection performance was found for cargo security screeners after approximately three months of training. Yet, it should be noted that even after training, detection performance was still in need of improvement. In this study, the cargo security screeners had only trained for a comparatively short period of time. In earlier studies, approximately 20 min of training per week were conducted for at least 6–12 months between measurements (Koller et al., 2008; Michel, de Ruiter et al., 2007). Hence, even larger improvements in X-ray image interpretation competency of cargo screeners could be expected with more training.
Fig. 7. Reaction times (in seconds) with standard errors of the mean for the greyscale, pseudo color and control groups at the first (T1) and second (T2) measurement.
Fig. 8. (a) Correlation between detection performance A′ at the second measurement (T2) and the number of images seen during training. (b) Correlation between the differences in detection performance A′ between the two measurements and the number of hours trained.
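The comparison of linear and logarithmic trends reported for Fig. 8 can be sketched as follows. The sketch assumes that the logarithmic correlation is obtained by correlating performance with the natural logarithm of the training amount; this section does not spell out the procedure, so that is an assumption. The data are invented for illustration and are not the study's data.

```python
import math
from typing import Sequence, Tuple


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def linear_and_log_r(amount: Sequence[float], performance: Sequence[float]) -> Tuple[float, float]:
    """Correlate performance with the raw and with the log-transformed training amount."""
    r_linear = pearson_r(amount, performance)
    r_log = pearson_r([math.log(a) for a in amount], performance)
    return r_linear, r_log


# Hypothetical data: images seen during training and A' at the second measurement.
images_seen = [200, 400, 800, 1600, 3200, 6400]
a_prime_t2 = [0.60, 0.66, 0.71, 0.74, 0.78, 0.80]
r_lin, r_log = linear_and_log_r(images_seen, a_prime_t2)
print(f"linear r = {r_lin:.3f}, logarithmic r = {r_log:.3f}")
```

When gains flatten with increasing practice, as in this invented example, the correlation with the log-transformed training amount exceeds the linear one, which is the pattern described as a better logarithmic fit.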
Results further revealed that viewing X-ray images in pseudo color at test did not lead to better detection performance than viewing them in greyscale. Thus, our findings imply that the application of pseudo color on its own as an image enhancement function (IEF) does not necessarily improve detection performance. Although this is consistent with earlier research on IEFs (Klock, 2005; Michel, Koller et al., 2007), it does not mean that IEFs such as pseudo color are useless. It cannot be ruled out that a serial application of different IEFs may result in better detection performance compared to a condition where only greyscale images are available. Further research is needed to investigate this possibility. Last but not least, it should be noted that the prohibited items used in this study were all technically visible. In reality, prohibited items could be hidden and distributed in the ULDs/containers in ways that could make them even more difficult to spot. Thus, further improvements of man-machine system performance in cargo security screening must rely not only on training, but also on technological enhancements such as automatic threat detection and screener assist technologies (e.g., Eilbert, 2009; Singh and Singh, 2003).
Acknowledgments

We are thankful to the Dutch Customs and TNO (Netherlands Organization for Applied Scientific Research) for their good collaboration and support in conducting the study.
References

Aaronson, D., Watts, B., 1987. Extension of Grier’s computational formulas for A′ and B″ to below-chance performance. Psychol. Bull. 102, 439–442.
Abidi, B., Zheng, A., Gribok, A., Abidi, M., June 2005. Screener evaluation of pseudocolored single energy X-ray luggage images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Workshops, San Diego, CA, USA.
Al-Fandi, L., Mabe, N., Khasawneh, M.T., 2009. The impact of threat type and image color on baggage screening performance: A preliminary investigation. In: Proceedings of the 10th Asia Pacific Industrial Engineering & Management Systems (APIEMS) Conference. Dec 14–16, Kitakyushu, Japan, pp. 2432–2437.
Bolfing, A., Halbherr, T., Schwaninger, A., 2008. How image based factors and human factors contribute to threat detection performance in X-ray aviation security screening. In: HCI and Usability for Education and Work, Lecture Notes in Computer Science, vol. 5298, pp. 419–438.
Cohen, J., 1988. Statistical Power Analysis for the Behavioral Sciences. Erlbaum, Hillsdale, New York.
Ebbinghaus, H., 1971. Über das Gedächtnis. Untersuchung zur experimentellen Psychologie (In German). Wissenschaftliche Buchgesellschaft, Darmstadt.
Eilbert, R.F., 2009. Chapter 6 – X-ray technologies. In: Marshall, M.M., Oxley, J.C. (Eds.), Aspects of Explosives Detection. Elsevier, Oxford, pp. 89–130.
European Commission, 2010. Commission regulation (EU) No 185/2010 of March 2010 laying down detailed measures for the implementation of the common basic standards on aviation security. Off. J. Eur. Union 53, 1–55.
Everitt, B.S., 2003. The Cambridge Dictionary of Statistics. Cambridge University Press, Cambridge.
Fainberg, A., 1992. Explosive detection for aviation security. Science 255 (5051), 1531–1537.
Gonzales, R.C., Woods, R.E., 2002. Digital Image Processing, second ed. Prentice Hall.
Graf, M., Schwaninger, A., Wallraven, C., Bülthoff, H.H., 2002. Psychophysical Results from Experiments on Recognition and Categorization. Information Society Technologies (IST) Programme. Cognitive Vision Systems – CogVis (IST-2000-29375).
Grier, J.B., 1971. Nonparametric indexes for sensitivity and bias: computing formulas. Psychol. Bull. 75, 424–429.
Green, D.M., Swets, J.A., 1966. Signal Detection Theory and Psychophysics. Wiley, New York.
Halbherr, T., Schwaninger, A., Budgell, G., Wales, A., 2013. Airport security screener competency: a cross-sectional and longitudinal analysis. Int. J. Aviat. Psychol. 23 (2), 113–129.
Hardmeier, D., Hofer, F., Schwaninger, A., 2005. The X-ray object recognition test (X-Ray ORT) – a reliable and valid instrument for measuring visual abilities needed in X-ray screening. In: IEEE ICCST Proceedings, vol. 39, pp. 189–192.
Hardmeier, D., Hofer, F., Schwaninger, A., 2006a. Increased detection performance in airport security screening using the X-Ray ORT as pre-employment assessment tool. In: Proceedings of the 2nd International Conference on Research in Air Transportation, ICRAT 2006. Belgrade, Serbia and Montenegro, June 24–28, pp. 393–397.
Hardmeier, D., Hofer, F., Schwaninger, A., 2006b. The role of recurrent CBT for increasing aviation security screeners’ visual knowledge and abilities needed in X-ray screening. In: Proceedings of the 4th International Aviation Security Technology Symposium. Washington, D.C., USA, November 27 – December 1, 2006, pp. 338–342.
Hardmeier, D., Schwaninger, A., 2008. Visual cognition abilities in X-ray screening. In: Proceedings of the 3rd International Conference on Research in Air Transportation, ICRAT 2008. Fairfax, Virginia, USA, June 1–4, pp. 311–316.
Harris, D.H., 2002. How to really improve airport security. Ergon. Des. 10 (1), 17–22.
Heathcote, A., Brown, S., 2000. The power law repealed: the case for an exponential law of practice. Psychonomic Bulletin & Review 7 (2), 185–207.
Klock, B.A., 2005. Test and evaluation report for X-ray detection of threats using different X-ray functions. In: IEEE ICCST Proceedings, pp. 182–184.
Koller, S.M., Drury, C.G., Schwaninger, A., 2009. Change of search and nonsearch time in X-ray baggage screening due to training. Ergonomics 52 (6), 644–656.
Koller, S.M., Hardmeier, D., Michel, S., Schwaninger, A., 2008. Investigating training, transfer, and viewpoint effects resulting from recurrent CBT of X-ray image interpretation. J. Transport. Secur. 1 (2), 81–106.
Koller, S.M., Schwaninger, A., 2006. Assessing X-ray image interpretation competency of airport security screeners. In: Proceedings of the 2nd International Conference on Research in Air Transportation, ICRAT 2006. Belgrade, Serbia and Montenegro, June 24–28, 2006, pp. 399–402.
Kraemer, S., Carayon, P., Sanquist, T.F., 2009. Human and organizational factors in security screening and inspection systems: conceptual framework and key research needs. Cogn. Tech. Work 11, 29–41.
Mendes, M., Schwaninger, A., Michel, S., October 18–21, 2011. Does the application of virtually merged images influence the effectiveness of computer based training in X-ray screening? In: Proceedings of the 45th IEEE International Carnahan Conference on Security Technology, Mataro Spain, pp. 1–8.
Michel, S., Koller, S.M., Ruh, M., Schwaninger, A., 2007a. Do “image enhancement” functions really enhance X-ray image interpretation? In: Proceedings of the 29th Meeting of the Cognitive Science Society, pp. 1301–1306.
Michel, S., de Ruiter, J.C., Hogervorst, M., Koller, S.M., Moerland, R., Schwaninger, A., 2007b. Computer-based training increases efficiency in X-ray image interpretation by aviation security screeners. In: Proceedings of the 41st Carnahan Conference on Security Technology. Ottawa, October 8–11.
Michel, S., Schwaninger, A., 2009. Human-machine interaction in X-ray screening. In: IEEE ICCST Proceedings, vol. 42, pp. 13–19.
Ogorodnikov, S., Petruin, V., 2002. Processing of interlaced images in 4–10 MeV dual energy customs system for material recognition. Phy. Rev. Acceler. Beams 5 (10), 104701.
Pollack, I., Norman, D.A., 1964. A non-parametric analysis of recognition experiments. Psychonom. Sci. 75, 125–126.
Reed, W.A., 2008. Implementing X-ray Cargo Screening. Retrieved June 14, 2010, from http://www.varian.com/media/security_and_inspection/resources/articles/pdf/Implementing_spring08.pdf.
Reed, W.A., 2009. Throughput Performance Factors in X-ray Cargo Screening Systems. Retrieved June 14, 2010, from http://www.varian.com/media/security_and_inspection/resources/technical_information/pdf/X_ray_cargo_screening.pdf.
Ritter, F.E., Schooler, L.J., 2001. The learning curve. In: International Encyclopedia of the Social and Behavioral Sciences. Pergamon, Amsterdam, pp. 8602–8605.
Schwaninger, A., 2003. Evaluation and selection of airport security screeners. Airport, 14–15, 02/2003.
Schwaninger, A., 2006. Airport security human factors: from the weakest to the strongest link in airport security screening. In: Proceedings of the 4th International Aviation Security Technology Symposium. Washington, D.C., USA, November 27 – December 1, 2006, pp. 265–270.
Schwaninger, A., 2004a. Computer-based training: a powerful tool to the enhancement of human factors. Avia. Secur. Int., 31–36.
Schwaninger, A., 2004b. Objekterkennung und Signaldetektion: Anwendung in der Praxis (In German). In: Kersten, B., Groner, M. (Eds.), Praxisfelder der Wahrnehmungspsychologie. Huber, Bern, pp. 106–130.
Schwaninger, A., Bolfing, A., Halbherr, T., Helman, S., Belyavin, A., Hay, L., 2008. The impact of image based factors and training on threat detection performance in X-ray screening. In: Proceedings of the 3rd International Conference on Research in Air Transportation, ICRAT 2008. Fairfax, Virginia, USA, June 1–4, pp. 317–324.
Schwaninger, A., Hardmeier, D., Hofer, F., 2005. Aviation security screeners’ visual abilities and visual knowledge measurement. IEEE Aerospace Elect. Sys. 20 (6), 29–35.
Schwaninger, A., Hardmeier, D., Hofer, F., 2004. Measuring visual abilities and visual knowledge of aviation security screeners. In: IEEE ICCST Proceedings, vol. 38, pp. 258–264.
Schwaninger, A., Hofer, F., 2004. Evaluation of CBT for increasing threat detection performance in X-ray screening. In: Morgan, K., Spector, M.J. (Eds.), The Internet Society 2004, Advances in Learning, Commerce and Security. WIT Press, Wessex, pp. 147–156.
Schwaninger, A., Hofer, F., Wetter, O., 2007. Adaptive computer-based training increases on the job performance of X-ray screeners. In: Proceedings of the 41st Carnahan Conference on Security Technology. Ottawa, October 8–11, pp. 117–124.
Schwaninger, A., Wales, A., 2009. One year later: how screener performance improves in X-ray luggage search with computer-based training. In: Proceedings of the Ergonomics Society Annual Conference 2009, pp. 381–389.
Singh, S., Singh, M., 2003. Explosives detection systems (EDS) for aviation security. Sign. Proc. 83 (1), 31–55.
TIME, 2010, Nov. 02. The Weak Spot in Air Security: Cargo Screening. Retrieved May, 2012, from http://www.time.com/time/nation/article/0,8599,2028872,00.html.
Thorndike, E.L., 1913. Educational Psychology: the Psychology of Learning, vol. 2. Teachers College Press, New York.
Wales, A., Anderson, C., Jones, K., Schwaninger, A., Horne, J., 2009. Evaluating the two-component inspection model in a simplified luggage search task. Behav. Res. Meth. 41 (3), 937–943.