Int. J. Pres. Ves. & Piping 27 (1987) 49-89
The Presentation of Results from Round-Robin Test-Block Trials
J. A. G. Temple* and M. W. Stringfellow†

* Theoretical Physics Division, † Materials Physics and Metallurgy Division, AERE Harwell, Didcot, Oxon OX11 0RA, Great Britain

(Received: 23 June, 1986; accepted: 8 July, 1986)
ABSTRACT

Test-block trials are used to assess the capability of teams and of nondestructive testing techniques to detect and size defects which are of concern for structural integrity of specific components. In particular there have been several test-block exercises involving the ultrasonic inspection of thick-section steel as part of the continuing development of inspection techniques for reactor pressure vessels. This paper examines the presentation of results of test-block trials in general. Although defects cannot be sized unless they have been detected, we concentrate on the sizing aspect of test-block trials rather than on detection of defects alone, because detection capability in recent trials has been very good. Various parameters for presenting the results of teams' abilities to correctly size defects are examined on the basis of a model and synthesized test-block results. The model is based on ASME acceptance and rejection standards for crack-like defects in pressure vessels and these are coupled with a bivariate normal distribution for the errors that a team may make in measuring the through-wall size and length of a defect. Seven hypothetical teams are used with a spread of team abilities representative of recent test-block trials. Results are synthesized for two different test-block trials each containing 40 cracks. Assuming that all teams detect all defects allows us to concentrate on the sizing ability of the teams. Teams are ranked according to different parameters in order to demonstrate which are the better parameters to use. The mean and associated standard deviations of the errors in measured sizes are found to be the most useful ways of presenting the results. Of the single parameters considered for ranking, the best ones are those that include some combination of all four
types of defect classification: correct or incorrect acceptance and correct or incorrect rejection of defects. Simple parameters such as the ratio of defects correctly rejected to the total number of rejectable defects are found to be unsatisfactory. Attempts to improve such simple parameters by weighting defects according to some measure of their distance from the accept-reject boundary are found to be unreliable, yielding arbitrary rankings dependent on the particular choice of weighting. We conclude that uniform distributions of defect parameters provide a satisfactory basis for designing test-block trials to discriminate between relatively good teams and relatively poor teams where a range of abilities is expected.
SUMMARY OF NOTATION

In a round-robin trial there may be several teams and several blocks containing acceptable or unacceptable defects. These defects may be detected or not by each of the teams, and those defects detected may be classified correctly or incorrectly as rejectable or acceptable defects. There may be cases of defects being reported even though there is no corresponding defect in the test-block. Clearly there are many possibilities and the notation describing all these possibilities will necessarily be complex. This list defines some of the notation we shall employ throughout the paper. The list is not comprehensive since such a list would be unmanageable, so we have arbitrarily restricted it to the terms of most importance or most frequent use.

A       Total number of acceptable defects
Aa      Number of acceptable defects detected and classified as acceptable = da
Ar      Number of acceptable defects detected but classified as rejectable
Ax      Number of acceptable defects not detected
CAF     Correct acceptance rate on all acceptable defects by one team = da/A
CRF     Correct rejection rate on all rejectable defects by one team = df/R
CRP     Correct rejection rate by teams = r/N
d       Number of defects detected by a particular team, excluding false calls
∂       Somers' d: a measure of team success
da      Number of acceptable defects detected and correctly classified as acceptable by one team
df      Number of rejectable defects detected and correctly classified as rejectable by one team
D       Total number of defects in the test-blocks
Dd      Number of defects present in plate which are actually detected
Dx      Number of defects present in plate which are missed
DDF     Defect detection frequency for one team on all intended defects
DDP     Defect detection frequency for one defect by all teams
ea      Number of defects incorrectly classified as acceptable by one team
ef      Number of defects incorrectly classified as rejectable by one team
fa      A measure of team effectiveness in correct acceptance of defects
fd      A measure of team effectiveness in defect detection
fr      A measure of team effectiveness in correct rejection of defects
far     A measure of team effectiveness in correct overall classification
n       Number of teams who detect a particular defect
N       Number of teams participating in the round-robin
r       Number of teams who correctly detect and classify as rejectable a defect which would be classed as rejectable according to some criteria such as those in ASME
R       Total number of rejectable defects
Ra      Number of rejectable defects detected but classified as acceptable
Rr      Number of rejectable defects detected and classified as rejectable = df
Rx      Number of rejectable defects not detected
WCAF    Weighted correct acceptance rate of all acceptable defects by one team
WCRF    Weighted correct rejection rate of all rejectable defects by one team, with weights dependent on the distance from the ASME boundary between acceptable and rejectable defects
Xa      Number of false calls declared acceptable
X       Total number of false calls
Xr      Number of false calls declared rejectable
φ       Mean square contingency derived from the χ² statistic indicating team performance; can be used for intercomparison of test-block trials
χ²      Value of approximate chi-square statistic with one degree of freedom, indicating the confidence level of team performance
1. INTRODUCTION
There have now been several test-block exercises aimed at determining the capability of different ultrasonic inspection techniques and operating teams to detect and size a variety of representative defects in thick-section steel
blocks. In particular, three European round-robin exercises of interest are PISC I, the Defect Detection Trials organized by the United Kingdom Atomic Energy Authority (UKAEA), and most recently, PISC II. For a description of the PISC I results see the reports by the Plate Inspection Steering Committee 1 and for the background to the current PISC II programme see Oliver. 2 For the background to the Defect Detection Trials see Watkins et al., 3,4 with summaries of the results in Watkins et al. 5 and Lock et al. 6 These test-block exercises are expensive and in order to obtain the maximum benefit from them it is important to carry out a thorough analysis of the results. Test-block trials measure capability rather than reliability. Capability means the intrinsic mixture of resolution and sensitivity which characterizes the physical ability of the technique. Reliability is the reproducibility with which the technique is applied on the job, including the effects of the human operator and the operating procedures in manual inspections, and incorporating the effects of setting up of automated equipment. Once the excitement of the experimental investigation is over, it is easy for the results of the nondestructive tests to be compared quickly with the early destructive results and bland comparisons to be drawn between the teams and techniques. In fairness to the teams and their techniques, it is important to produce results which yield accurate and meaningful comparisons in a form which can nevertheless be readily assimilated. This will usually involve destructive examination which continues over a period of time as discrepancies between these results and those from the nondestructive tests are ironed out. The purpose of this paper is to compare various measures of team success in the correct classification of defects using results from a synthesized test-block trial as examples. The emphasis has been placed on classification, because the results coming from recent round-robin trials such as the Defect Detection Trials organized by the UKAEA and PISC II suggest that detection of defects can be very good. 7-10 In principle, it would be possible to go through a similar exercise to this considering detection and classification, but this would tend to obscure the points which we consider important concerning the presentation of results on capabilities of correct classification of defects. A range of parameters which could be used to measure performance are described. These parameters are PISC-like values of DDF, described in section 2.2.1, and CRF and CAF in section 2.2.2. The PISC-like parameters are related to the underlying distribution-based quantities da, df, ea and ef in sections 2.3.1 and 2.3.2. Various weighted parameters based on the PISC-like ones are defined in sections 2.4, 2.4.1 and 2.4.4. A single figure of merit is described in sections 2.3.3 and 2.4.5.
In section 3, the different approaches are applied to data obtained from a synthesized test-block trial to indicate the usefulness of various ways of presenting comparative team success in correct classification of defects.

2. THEORETICAL BASIS OF ANALYSIS
In a round-robin exercise designed to determine the capability of teams, techniques and procedures to detect and correctly classify defects which would be of concern for the structural integrity of some structure, such as a reactor pressure vessel, there will be a number of possible outcomes. In general the test-blocks will contain defects which would be classed as acceptable or rejectable according to some structural mechanics criterion. Defects of both these types will generally be present. The teams apply a variety of test procedures using variations of a nondestructive test, here taken to be an ultrasonic test of some sort. The teams may detect all the defects or they may miss some, and they may report defects where there is no corresponding defect in the test-block (a false call). The defects detected are then classified according to some set of criteria. This step is not necessarily carried out by the team itself but may involve an additional body. The aim of round-robin test-block trials should be to elicit the best scientific measurements of capability. Thus teams should be encouraged to report their results honestly, without thought for fracture mechanics assessments. Fracture mechanics assessments should be regarded as a separate exercise. They should be reported separately and clearly labelled as fracture mechanics assessments. Rules of the test-block exercise should clearly state whether a best estimate of defect size and location is required or whether a fracture mechanics interpretation of the NDT measurements is required. It is not helpful to have some teams reporting best estimates and others reporting assessments made with fracture assessments in mind. The classifications we are interested in are based on fracture mechanics and use the dimensions of the defect to determine whether or not the defect would pose a threat to the structure containing it (an unacceptable or rejectable defect) or whether the defect is small enough to pose no threat to the structure under design basis accidents (an acceptable defect). The different possibilities are illustrated in the Venn diagrams of Fig. 1.
2.1. Quantification of detection performance

The Venn diagrams of Fig. 1 illustrate the parameters required to describe performance in detection and classification of defects in a test-block exercise.
Fig. 1. Venn diagrams of possible outcomes of a round-robin test-block trial of defect detection: (a) ignoring, and (b) taking into account, acceptance/rejection of defects.

It may be seen from Fig. 1(a) that the quality of overlap of the two circles gives a measure of a particular team's performance in defect detection. In Fig. 1(a) the light stippling indicates those defects which are successfully detected. In this diagram D stands for the defects present in the test-block and X for defects reported which are actually non-existent, that is false calls, while the subscripts d and x stand for defects detected and missed respectively. Thus, for example, Dx is the number of real defects missed in the examination. Complications arising from the description of a real defect as more than one detected defect, or the description of several defects as a single detected defect, will be ignored. Figure 1(b) shows the possible outcomes of a classification exercise. The only outcomes which can be accepted with complete equanimity are those left white on the figure and represent acceptable defects not detected or false calls which are also classed as acceptable. In either case there is no action required. The light stippling shows areas where correct decisions have been taken and acceptable and rejectable defects have been detected and correctly classified. Action is required about the rejectable defects but, provided this is carried out satisfactorily, there need be no concern over
the integrity of the structure. The dark stippling represents areas in which incorrect decisions have been made. The rejectable defects missed and the rejectable defects which are detected but which are then classified as acceptable are of obvious structural concern. However, the acceptable defects which are detected and then incorrectly classified as rejectable, and the false calls which are classified as rejectable defects, both lead to unnecessary repair of the structure with the consequent risk of introducing defects into otherwise sound material. We can quantify a given team's performance in defect detection as follows. Referring to Fig. 1(a) we see that the ratio of the area of overlap to the area of the left-hand circle is a measure of the team's ability to detect the defects actually present, while the ratio of the area of overlap to the area of the right-hand circle is a measure of its ability to distinguish real from spurious defects. A figure of merit for detection fd could therefore be defined by combining these ratios to give

fd = [Dd/(Dx + Dd)]^α [Dd/(Dd + Xd)]^β        (1)
The weighting parameters α and β can be adjusted to give any desired weight to the two aspects of performance. Values of these two parameters would, in practice, be determined by the relative cost of missing defects compared with the possible hazard of false calls with the subsequent chance of repair-induced defects. For a perfect performance Dx = Xd = 0 and fd = 1 whatever the weighting. In a round-robin trial it is capability that is generally being tested and not reliability. This means that teams should be trying to obtain the smallest errors in location and sizing. However, as the mean and standard deviations of the errors in sizing are reduced, so smaller errors can lead to incorrect classification of defects. This is because the defects are classified according to go/no-go criteria for which small errors in defect size lead to incorrect classification if the actual defect size is near the boundary between acceptable and rejectable defects. This is a complication which does not arise, in principle, with the detection of defects, although, in practice, some tolerances have to be applied in calculating whether or not a defect was detected. Whilst defects cannot be classified unless they are detected, it is this complication concerning the ranking of teams according to their capability for classifying defects which is of interest in this paper. For this reason we shall not discuss detection any more.
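As a concrete illustration of eqn (1) as reconstructed above, the short sketch below evaluates fd from the four detection counts of Fig. 1(a). It is our own illustration rather than part of the original analysis; the function name and example numbers are invented, and equal weights α = β = 1 are assumed.

    def detection_figure_of_merit(Dd, Dx, Xd, alpha=1.0, beta=1.0):
        """Figure of merit fd of eqn (1): Dd real defects detected, Dx missed,
        Xd false calls; alpha and beta weight the two aspects of performance."""
        found_fraction = Dd / (Dx + Dd)      # ability to detect defects actually present
        genuine_fraction = Dd / (Dd + Xd)    # ability to distinguish real from spurious defects
        return found_fraction ** alpha * genuine_fraction ** beta

    # Example: 38 of 40 real defects found, with 3 false calls.
    print(detection_figure_of_merit(Dd=38, Dx=2, Xd=3))   # about 0.88

For a perfect performance (Dx = Xd = 0) the function returns 1 whatever the weighting, in line with the remark above.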
2.2. PISC-type parameters

In the analysis of PISC II, Haines 9 has suggested using parameters which are simple to interpret, and which can be easily compared with those used in the analysis of PISC I. The parameters suggested are considered in the following sections.
2.2.1. Defect detection probabilities

DDP = n/N        (2)

for a defect detection probability defined by n teams out of N detecting a particular defect. This could be averaged over sets of defects to yield ⟨DDP⟩. Results for the performance of particular teams are contained in a parameter DDF given by

DDF = d/D        (3)

for d defects out of D detected by a particular team. The d defects reported are those which correspond to the defects actually present and do not include false calls.
2.2.2. Sizing and location parameters

The measures suggested for the sizing and location of defects were

CRP = r/N        (4)

for the correct rejection probability defined in terms of overall team performance, where r teams out of N detect and reject a defect which would be considered rejectable according to ASME XI rules. Individual team performance is contained in the parameters CRF and CAF defined by

CRF = df/R        (5)

CAF = da/A        (6)

where a single team inspecting R rejectable defects, so defined according to ASME XI rules, obtains sizes and defect locations which would lead to correct rejection of df defects. Note, however, that teams were not asked to make this judgement in either the Defect Detection Trials or PISC II but set out to demonstrate as accurately as possible the capability of their techniques and expertise. It is very likely that results would have led to higher correct rejection probabilities and a higher incidence of false calls and misclassification of acceptable defects had the teams been asked to make ASME XI decisions about the defects. The parameter CAF is the measure of a single team's ability to inspect A acceptable defects, so defined according to ASME XI rules, and to obtain sizes and defect locations which would lead to correct acceptance of da defects.
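The PISC-like parameters of eqns (2)-(6) are simple ratios of counts, so they can be tabulated directly from a trial's scoring sheets. The sketch below is a minimal illustration only; the function and argument names are ours, not taken from the PISC analyses, and the example counts are invented.

    def pisc_like_parameters(n, N_teams, d, D, df, R, da, A, r=None):
        """PISC-like ratios of eqns (2)-(6).
        n : teams detecting a given defect;      N_teams : teams in the trial
        d : defects detected by one team;        D       : defects in the blocks
        df: rejectable defects correctly rejected by the team; R : rejectable defects
        da: acceptable defects correctly accepted by the team; A : acceptable defects
        r : teams correctly rejecting a given rejectable defect (optional)."""
        parameters = {
            "DDP": n / N_teams,    # eqn (2)
            "DDF": d / D,          # eqn (3)
            "CRF": df / R,         # eqn (5)
            "CAF": da / A,         # eqn (6)
        }
        if r is not None:
            parameters["CRP"] = r / N_teams   # eqn (4)
        return parameters

    # Example: a team detects all 40 defects, correctly rejects 30 of 34 rejectable
    # defects and correctly accepts 4 of 6 acceptable ones; 5 of 7 teams reject a
    # particular rejectable defect.
    print(pisc_like_parameters(n=6, N_teams=7, d=40, D=40, df=30, R=34, da=4, A=6, r=5))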
2.2.3. Destructive examination and comparison with nondestructive results

Consider first DDF, in which d defects out of the D actually in the blocks are detected. The first point is that both the nondestructive and destructive examinations must be exhaustive and unbiased if this parameter is to have any value. In other words, the rules governing the reporting of defects by the two examinations must be identical and clearly stated. Defects which are not intended do occur and it is necessary to decide whether these should be included in the analysis or not. A well designed round-robin will have been guided by fracture mechanics to consider defects with sizes in a certain range, the range of concern for the integrity of the structure. Broadly speaking, small defects will occur in large numbers in a component but be of no structural concern until they are bigger than some specified size determined by the extremes of loading envisaged. Large defects should occur very infrequently and may be easier to detect than smaller ones. Thus the main contribution to the failure rate of a population of similar components will generally stem from defects in a limited size range. It will be this size range that the well designed round-robin will
concentrate on. This suggests that unintentional defects lying in the size range around which the test-blocks were designed should be included in the analysis. After all, they will probably exhibit a distribution in size which is more like the naturally occurring one. Establishing the size of these additional defects will often require more destructive examination than was originally planned and time should be allowed for this before comparisons between the destructive and nondestructive results are made. Then the values of DDF can be constructed. However, if there are so many of these unintentional defects that either there are too many defects in total, or the distribution is very different from that designed into the test, then it may not be sensible to include them all in the analysis. In either of these last two cases it is likely that the round-robin will be unsatisfactory because whatever was intended as the defect population was not achieved. The rules under which NDT teams report defects should be identical with those used in the destructive examination, especially with regard to the incorporation of satellite defects. If there are unintended defects which would be classed as rejectable if they were detected, but which also go undetected, then there are problems. This might happen with poorly developed inspection techniques or badly organized round-robin exercises, but it is not considered to be a problem in the state-of-the-art round-robin exercises devised to test current ultrasonic inspection capabilities on thick-section pressure vessel steels. Any defects which are reported by the nondestructive examination but which do not correspond to intended defects should be particularly carefully examined destructively to discover whether or not they do, in fact, exist. This is important from another aspect of round-robin trials which is concerned with learning about the response of different types of defect. Often the unintentional defects will be associated with intended defects, making a complex defect with more than one part. It is very useful to be able to correlate the ultrasonic response from such defects with destructive information. It is important not to publish comparisons between intended defects and the results of nondestructive examinations, since the comparison can only be a very rough guide to the exact results and may even lead to some destructive examinations being omitted altogether. These early comparisons are often latched onto by outside observers as representing the true state of affairs and such early impressions can be difficult to dislodge if they subsequently turn out to be in error. The time spent on carrying out the nondestructive examinations should not be squandered by inadequate or hasty presentation of the results. Methods of scoring the detection of defects other than the comparison of bounding rectangles should be investigated. Suggestions are as follows: comparisons of the centres of
gravity of the reported defects with those of the defects found destructively, applying some tolerance level; a measure based on the projected overlap of the two defect areas on the three coordinate axes; or a measure based on the overlap of defect volumes. Returning to the parameters DDF, CRF and CAF, provided the destructive and nondestructive examinations have been thorough and the comparison is a fair one, then we expect these three parameters to be useful ways of comparing the present PISC results with those of PISC I. However, these parameters themselves are inadequate to do justice to the results of such expensive trials. Haines 9 suggests also the use of statistical measures such as mean and standard deviations of the error in size and location measurements. As has been argued elsewhere, 11,12 these parameters are still not a complete representation of the results but they are the essential minimum. With the modest number of defects available in test-block trials, it may be difficult to justify more sophisticated measures of success.

2.3. Distribution-based measure of individual team success
Returning to DDF, there are D defects of which R are rejectable and A are acceptable, so trivially

D = R + A        (7)

where, as noted above, D may be obtained from exhaustive destructive examination and not necessarily by comparison with intended defect sizes and locations alone. A particular team correctly rejects df and incorrectly rejects ef whilst correctly accepting da and incorrectly accepting ea. Hence

d = df + ef + da + ea        (8)

where d excludes false calls. The model used to synthesize results from a hypothetical round-robin exercise, as discussed in section 3, has zero false calls. The PISC parameter for the detection frequency for a particular team, DDF, is then given by

DDF = (df + ef + da + ea)/D        (9)

If false calls are included in the analysis then DDF could, in fact, become larger than 1. An alternative form for DDF can be written as

DDF = (1/D)(R × CRF + A × CAF + ef + ea)        (10)
2.3.1. Actual distribution of defects

In a test-block, or in an actual vessel, there will be a population of defects given by N(x) dx defects with parameters lying in the range x to x + dx, where the vector x represents all factors affecting structural integrity. Examples of parameters which are currently incorporated into x for cracks are the through-wall dimension, the length and the distance of the crack edges from other cracks or from the pressure retaining boundaries of the vessel. Of this population N(x) a proportion Na(x) will be acceptable and Nf(x) will be rejectable based on some criterion such as linear elastic or elastic-plastic fracture mechanics. In terms of our parameters A and R we have

A = ∫_0^∞ Na(x) dx        (11)

R = ∫_0^∞ Nf(x) dx        (12)

where

Na(x) = H(xa − x)N(x)        (13)

Nf(x) = H(x − xa)N(x)        (14)

In these expressions we have defined a vector xa and a function H which we explain below. Note that x is a vector representing an arbitrary point in an n-dimensional parameter space and xa is a vector in the same space and in the same direction as x whose end point represents the boundary between acceptable and unacceptable defects. It is assumed that this boundary surface is smooth. Then H is simply an indicator function, such that H(xa − x) takes the value 1 when the set of defect parameters lies inside the acceptable region and 0 otherwise, i.e.

H(x − xa) = 0    if xi < xai for all i

H(x − xa) = 1    if any xi > xai

Thus, for only one parameter of concern x, which could be the through-wall dimension of a defect, xa is a scalar quantity and the meaning is clear. For two parameters, say defect through-wall dimension x and length l, then xa represents those pairs of points lying on a curve in (x, l) space which separates acceptable defects from unacceptable ones. For three parameters xa would represent points lying on a surface, and for more than three dimensions it would represent sets of n-tuples lying on an n-dimensional hypersurface. This notation only really makes sense in terms of integrations over regions of this hyperspace.
The function H is simply notation to specify which defects are acceptable and which are unacceptable and accomplishes this by simplifying the limits of integration in a symbolic way. Thus we specify that H(xa − x) = 1 for an acceptable defect and 0 for an unacceptable defect. For one parameter H is the Heaviside step function defined through the usual integral relationship

∫_{−∞}^{∞} H(x − xa) f(x) dx = ∫_{xa}^{∞} f(x) dx

We can extend this concept to points lying inside or outside a boundary in n-dimensional parameter space through

∫_{−∞}^{∞} H(x − xa) f(x) dx = ∫_{xa}^{∞} f(x) dx        (15)

To be specific, in our two-parameter case of defect through-wall dimension x and length l we mean

∫ H(x − xa) f(x) dx = ∫_{xa}^{∞} dx ∫_{la(x)}^{∞} f(x, l) dl

where the limits on the integrals are functionally related. The above notation is a succinct way of representing rather complicated limits on integrals and nothing more. The actual evaluation of such an integral requires that each point be compared with the criterion for unacceptable defects and treated accordingly.

2.3.2. Measured distribution of defects

When measurements are made it is not the distribution N(x) which is measured but a different one M(q), and this is shown schematically, again for a single parameter, in Fig. 3. The size q which is measured is not necessarily the same as the actual size x and, in the mathematical description, this is highlighted by the use of two different variables. In Fig. 3, the single hatched area indicates the defects which are measured as acceptable and includes both those correctly accepted da and those erroneously accepted ea. The double-hatched area gives the defects rejected, including both those correctly rejected df and those rejected in error ef. To write down mathematical descriptions of these we note that those defects rejected are those for which the measured parameters q exceed the decision boundary xa, but this includes those defects which are, in fact, rejectable and those which are rejected in error. Hence, for defects which are accepted we have

ea + da = ∫_{−∞}^{∞} M(q) H(xa − q) dq        (16)
Fig. 2. Schematic view of actual defect distribution based solely on through-wall size x.
which can be written as

ea + da = ∫_{−∞}^{xa} M(q) dq        (17)

For defects which are rejected we have

ef + df = ∫_{−∞}^{∞} M(q) H(q − xa) dq        (18)

which is written as

ef + df = ∫_{xa}^{∞} M(q) dq        (19)

The mapping from N(x) to M(q) can be interpreted in terms of a conditional probability density P(q|x) as

M(q) = ∫_{−∞}^{∞} P(q|x) N(x) dx        (20)
Fig. 3. Measurement of the actual defect population produces a different population.

In Figs 2 and 3 histograms have been drawn to represent the defect populations and these are assumed to be the experimental representation of underlying smooth continuous distributions. If the test-block exercise has been correctly devised so that the populations of acceptable and
unacceptable defects will not be totally confused by the nondestructive inspection (see discussion in section 3.1.1 below), then the continuous picture for M(q) will look more like that shown in Fig. 4 with two discernible humps. It is constructive to think of this almost bimodal distribution as composed of two separate unimodal distributions added together as shown in the figure. The two parts are then Ma(q) and Mr(q) with the contributions from da, ea and df, ef shown in different stippling.

Fig. 4. Incorrect acceptance of unacceptable defects and false calls.

The defects which are correctly accepted, da, are those for which the
measured size is acceptable and which are, in fact, acceptable. This is denoted mathematically by

da = ∫_{−∞}^{∞} dq ∫_{−∞}^{∞} P(q|x) N(x) H(xa − q) H(xa − x) dx        (21)

which can be written as

da = ∫_{−∞}^{xa} dq ∫_{−∞}^{xa} P(q|x) N(x) dx        (22)

The expression (22) is in the form in which we shall make use of it since we shall know which defects are actually acceptable and which are rejectable. The other terms for defects correctly rejected, df, and the error terms ea and ef are defined similarly by

df = ∫_{xa}^{∞} dq ∫_{xa}^{∞} P(q|x) N(x) dx        (23)

ea = ∫_{−∞}^{xa} dq ∫_{xa}^{∞} P(q|x) N(x) dx        (24)

ef = ∫_{xa}^{∞} dq ∫_{−∞}^{xa} P(q|x) N(x) dx        (25)

In the above integrals, some extend from −∞, but in the model discussed in section 3.1, with results in section 3, these integrals are truncated at 0. This leads to errors when the defects are actually small and with small mean measurement errors but large standard deviations. In the calculations, the probability that a defect will be regarded as rejectable, Pr, and the probability that it will be regarded as acceptable, Pa, are given by eqns (18) and (16) respectively, but with the lower limit of integration set to 0. The probabilities are then renormalized so that their sum is unity by calculating E = 1 − Pr − Pa and then allowing Pa → Pa/(1 − E) and Pr → Pr/(1 − E). This is one way of proceeding and is appropriate in this model because it overemphasizes the incorrect acceptance rate at the expense of the correct rejection rate. Another way of proceeding would be to use Pa → Pa + E. In any case the error term E is only large for very small actual defects and all of our results are unaffected by the correction.
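To make the truncation and renormalization just described concrete, the sketch below works through the one-parameter analogue of eqns (16)-(25): each defect has a known actual through-wall size x, the measured size q is assumed to be normally distributed about x + μ with standard deviation σ (the assumption introduced formally in section 3.1), and any probability mass below q = 0 is removed and renormalized as in the text. This is our own illustrative code; the function names, the boundary value and the example numbers are invented.

    from math import erf, sqrt

    def norm_cdf(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def accept_reject_probabilities(x, xa, mu, sigma):
        """Pa and Pr for a defect of actual size x, measured as q ~ N(x + mu, sigma**2),
        with the accept/reject boundary at xa; negative measured sizes are truncated
        and the probabilities renormalized, as described in the text."""
        Pa = norm_cdf((xa - (x + mu)) / sigma) - norm_cdf((0.0 - (x + mu)) / sigma)
        Pr = 1.0 - norm_cdf((xa - (x + mu)) / sigma)
        E = 1.0 - Pa - Pr                      # probability of a (meaningless) negative size
        return Pa / (1.0 - E), Pr / (1.0 - E)

    def expected_classification(defect_sizes, xa, mu, sigma):
        """Most probable values of da, df, ea and ef (eqns (22)-(25)) for one team."""
        da = df = ea = ef = 0.0
        for x in defect_sizes:
            Pa, Pr = accept_reject_probabilities(x, xa, mu, sigma)
            if x < xa:        # defect is actually acceptable
                da += Pa
                ef += Pr
            else:             # defect is actually rejectable
                df += Pr
                ea += Pa
        return da, df, ea, ef

    # Example: boundary at 15 mm, a team that oversizes by 2 mm on average with a
    # 4 mm standard deviation, applied to ten defects of known actual size.
    sizes = [5, 8, 10, 12, 14, 16, 18, 20, 25, 30]
    print(expected_classification(sizes, xa=15.0, mu=2.0, sigma=4.0))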
2.3.3. Figure of merit for individual teams

If a single figure of merit fm is required, in order to compare teams and techniques, a value related to the relative probabilities of correct rejection,
correct acceptance and misclassifications would be useful. A possible combination of these volumes is given by the geometric mean

fm = [da df / ((ea + 1)(ef + 1))]^{1/2}        (26)

which would be very much larger than 1 for a good technique in a particular test-block trial. The denominators contain extra factors of +1 to avoid infinite values if either ea or ef are zero. It would not be possible to compare two different test-block trials because the value depends on the distribution of acceptable and rejectable defects in the test-block.
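Taking eqn (26) as reconstructed above, a short numerical check (our own example, with invented counts) shows the intended behaviour: the value is large when most defects are classified correctly and falls as the misclassification terms grow.

    from math import sqrt

    def figure_of_merit(da, df, ea, ef):
        """Single figure of merit fm of eqn (26): the geometric mean of
        da/(ea + 1) and df/(ef + 1)."""
        return sqrt((da / (ea + 1.0)) * (df / (ef + 1.0)))

    print(figure_of_merit(da=5, df=32, ea=2, ef=1))   # about 5.2 for a good team
    print(figure_of_merit(da=3, df=25, ea=3, ef=9))   # about 1.4 for a poorer team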
2.4. Weighting of results

We are of the opinion that absolute statistical quantities such as the mean and standard deviations of the errors in defect size and location are the most useful for an adequate description of the results of test-block trials. As well as these, there may be some advantage in measures which correspond to those used in other test-block trials such as DDP, DDF and so on. A possible criticism of DDF as defined previously would be that it weights all defects equally. In the next section we consider some weighted PISC-like parameters.
2.4.1. Weighted PISC-like parameter DDF

Now the aim of test-block trials must be to find the most successful technique or combination of techniques for inspecting real vessels in which structural integrity is the foremost consideration. It therefore appears to make sense to direct reporting of the test-block results towards measures which include some fracture mechanics weighting of the importance of defects. With this in mind, a modified form of DDF has been given by J. M. Wrigley and J. M. Coffey (in a private communication) as

DDF = [Σ_{i=1}^{d} w_i^actual − α Σ_{j=1}^{e} w_j^false] / Σ_{k=1}^{D} w_k^actual        (27)

where we have used the previous notation of the team reporting d of the D actual defects but with e false calls of any defects which did not exist, whether or not classified as acceptable or rejectable. The weights wk are chosen according to the severity of the defect according to some fracture mechanics criterion. Wrigley and Coffey have indicated that one possible set of criteria which could be used to determine this weighting is the ASME XI set of criteria. The parameter α was an arbitrary penalty parameter to penalize those teams with large numbers of false calls, and it was assigned the value α ≈ 1/3. This parameter is not calculated here because it involves detection of defects rather than sizing or classification ability.
2.4.2. ASME XI criteria for acceptable defects

The ASME XI criteria for reactor pressure vessels are shown in Fig. 5 for deeply buried cracks or cracks near the surface. Acceptance or rejection in the ASME XI rules is determined by a combination of the through-wall dimension of the crack, its length, and the distance of the crack edge from the pressure-retaining boundary (taken to be the clad/base metal interface if the defect is nearer the clad side). By plotting the crack through-wall extent x divided by the plate thickness h against the aspect ratio x/l of the crack, and using the solid curve for surface cracks or the dashed curve for deeply buried cracks, one can ascertain whether or not a given crack is acceptable or rejectable. Acceptable cracks lie below the curves whilst unacceptable ones lie on or above the curve. A crack is considered to be a surface crack (solid line) if its least distance from the pressure-retaining boundary is less than or equal to 0.4 times the crack through-wall extent. A crack is considered to be a deeply buried crack (dashed line) if its least distance from the pressure-retaining boundary is greater than or equal to the crack through-wall extent. Cases between these will yield accept/reject curves between the two shown in Fig. 5.

Fig. 5. ASME XI acceptance criteria for cracks in reactor pressure vessels (IWB 3510-1).
2.4.3. Use of ASME XI rules as basis of weighting

It is clear that rejectable defects become of increasing structural concern as the distance from the accept/reject boundary increases. Also, the seriousness of false calls that report very small defects as requiring rejection or repair increases with distance from the accept/reject boundary. Thus the weights wk in expression (27) could be chosen realistically to be some power of this distance. In the examples discussed later a linear power is used but others may be more appropriate. For example, the stress intensity factor, which determines failure in the linear elastic fracture mechanics approach, varies as the square root of the crack dimensions so that a fractional power of 1/2 might be appropriate. There may be other reasons why different powers would be appropriate. A limitation is imposed, however, by the fact that this weighting must not be so great as to render the outcome of the trial dependent solely on the values obtained at a small number of defects. If this were the case then the benefit of having about 40 defects in a test-block would be annulled. This limitation rules out high powers of the distance from the accept/reject boundary.

2.4.4. Other weighted parameters

In section 2.4.1 we considered a weighted value of the PISC parameter DDF. Why restrict attention to weighted DDF values: why not calculate weighted CRF and CAF values too? In order to penalize errors according to their structural significance, weights can be associated with the parameters of each defect and also with the measured estimates of these parameters. In fact, in section 3 we shall calculate weighted values defined by sums over the defects which are actually acceptable or rejectable respectively to give WCAF and WCRF defined by

WCAF = [Σ_{i=1}^{A} w_i^actual − α Σ_{i=1}^{A} w_i^measured] / Σ_{i=1}^{A} w_i^actual        (28)

WCRF = [Σ_{i=1}^{R} w_i^actual − α Σ_{i=1}^{R} w_i^measured] / Σ_{i=1}^{R} w_i^actual        (29)
The weights wi are defined by eqn (30) in terms of two quantities ai and bi, where ai is the distance of the point representing the defect parameters from the crack aspect ratio coordinate axis, as shown in Fig. 6 (based on a private communication from Wrigley and Coffey).

Fig. 6. Derivation of weights from ASME XI rules.

Taking the given defect parameters and drawing the normal to the accept/reject boundary gives the distance of the defect parameters from this boundary. The parameter bi is the value of the ratio of the defect through-wall size to the plate thickness evaluated at the foot of the perpendicular from the defect parameters to the accept/reject boundary. In most cases it is sufficient to evaluate ai and bi as though they lay along the same line defined by ai in Fig. 6. Note that in the synthesized trials used to evaluate these measures all defects are detected every time and only classification of defects is taken to be important, in keeping with our emphasis throughout this paper. The weights defined in this way are greater the greater the structural significance of the defect. The constant α is an arbitrary penalty factor and β is a constant exponent. Tables 4 and 6 show the ideal values, i.e. those for a perfect team, for all the ranking parameters considered. The ideal value for WCAF and WCRF is (1 − α) because of the penalty factor α, which in the examples is 1/3, so that the ideal value in the tables is 0.67; this makes it difficult to assess whether a team scoring 0.6, say, is better, worse or the same as one scoring 0.74. As we shall see in section 3.3.1, the present arbitrariness of the parameters and the possibility of reordering results merely by different choices of weighting makes these weighted measures unsuitable for presenting the results of classification exercises.
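Because the original definition of the weights (eqn (30)) is tied to the geometry of Fig. 6, the sketch below simply assumes, following section 2.4.3, that each weight is a power β of the defect's distance from the accept/reject boundary. Everything here (the names, the assumed weight form and the example numbers) is our own illustration, not the Wrigley and Coffey formulation.

    def weight(distance_from_boundary, beta=1.0):
        """Assumed weight: a power of the distance from the accept/reject boundary
        (section 2.4.3 suggests a linear power, i.e. beta = 1)."""
        return abs(distance_from_boundary) ** beta

    def weighted_rates(actual_distances_A, measured_distances_A,
                       actual_distances_R, measured_distances_R,
                       alpha=1.0 / 3.0, beta=1.0):
        """WCAF and WCRF in the spirit of eqns (28)-(29): sums of weights over the
        acceptable (A) and rejectable (R) defects, using actual and measured defect
        parameters, with penalty factor alpha."""
        def total(distances):
            return sum(weight(d, beta) for d in distances)
        wcaf = (total(actual_distances_A) - alpha * total(measured_distances_A)) / total(actual_distances_A)
        wcrf = (total(actual_distances_R) - alpha * total(measured_distances_R)) / total(actual_distances_R)
        return wcaf, wcrf

    # A perfect team reports the actual parameters, so measured distances equal the
    # actual ones and both rates take the ideal value 1 - alpha = 0.67.
    dA = [0.5, 1.2, 0.8]            # invented distances for the acceptable defects
    dR = [0.4, 1.5, 2.3, 0.9]       # invented distances for the rejectable defects
    print(weighted_rates(dA, dA, dR, dR))   # (0.666..., 0.666...)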
2.4.5. Ranking of individual team performance

In the previous sections we have considered a range of parameters which
could be used to measure performance. Any of these measures could be used as a ranking tool for comparing the performance of teams in test-block trials. Those mentioned above are not exhaustive and this section introduces some others which have been advocated by Sharp and Sproat 13 for dealing with the results of United States Air Force round-robin trials. The interesting thing about these additional parameters is that they are statistically based and so can be associated with degrees of belief in practice. We emphasize that the absolute statistical parameters which measure accuracy, such as the mean and standard deviation of the sizing errors, should always be reported. Ranking can only conveniently be carried out with a single parameter and subtle differences between the mean errors and associated standard deviations may make it difficult to rank teams in a test-block trial without a single ranking parameter. The parameters introduced below can be used, as can any of those already described, to rank individual team performance. In sections 2.3 and 2.3.2, we introduced the parameters da, df, ea and ef which can be written as a 2 × 2 contingency matrix as

( df   ef )
( ea   da )        (31)

Sharp and Sproat 13 consider the value of

χ² = D Δ² / [(df + ef)(da + ea)(df + ea)(da + ef)]        (32)

where Δ is the determinant of the matrix (31). Under the assumption of independence, this value has an approximate chi-square distribution with one degree of freedom. Thus a value of χ² greater than 2.71 indicates that the inspection team is not scoring acceptable and unacceptable defects in a random way; in fact the value of 2.71 indicates 90% confidence. Larger values indicate better performance. Within a particular test-block trial, this statistic could be used for ranking purposes, but its value depends on the number of defects D and so it cannot be used to compare results from different trials. A parameter related to χ², called the mean square contingency φ² and defined by

φ = (χ²/D)^{1/2}        (33)

can be used in such cases. Finally, we can define a measure of team performance in the classification of defects with the help of the extended Venn diagram of Fig. 1(b). Here
A stands for the acceptable defects present in the test-block, R for the rejectable defects and X for those defects reported which do not, in fact, exist, that is, the false calls. The subscripts a, r and x stand for defects classified as acceptable, classified as rejectable, and missed, respectively. Thus Ax is the number of acceptable defects which were missed, while Xr is the number of cases in which a defect was reported to be rejectable when in fact no defect was present. Then the quality of overlap of the segments (Ax + Ar + Aa) and (Aa + Ra + Xa), representing respectively the number of acceptable defects actually present in the test-block and the number of detected defects classified as acceptable, is an indicator of a team's performance on acceptable defects. Similarly, the quality of overlap of the segments (Rx + Rr + Ra) and (Ar + Rr + Xr) is an indicator of a team's performance on rejectable defects. We can quantify these indications of performance as follows. Referring to Fig. 1(b), we see that the ratios Aa/(Ax + Ar + Aa) and Aa/(Aa + Ra + Xa) are respectively measures of a team's ability to classify defects as acceptable when they are so, and of its ability to avoid classifying them as acceptable when they are not so. By analogy with eqn (1) we define a figure of merit for defect acceptance by the expression

fa = [Aa/(Ax + Ar + Aa)] [Aa/(Aa + Ra + Xa)]        (34)

Here we have arbitrarily given equal weight to the two factors. A figure of merit for defect rejection may be defined in exactly the same way. Referring again to Fig. 1(b) we can write

fr = [Rr/(Rx + Rr + Ra)] [Rr/(Ar + Rr + Xr)]        (35)

The quantities fa and fr are clearly not independent since changing the classification of a particular defect will in general change both fa and fr. We can obtain a convenient symmetrical expression for a combined measure of performance far by writing

far = fa fr = [Aa Rr / ((Ax + Ar + Aa)(Rx + Rr + Ra))] [Aa Rr / ((Aa + Ra + Xa)(Ar + Rr + Xr))]        (36)
This expression includes both false calls and missed defects, but in the application to be made to synthetic test-block trials in the next section
there are none of these. For application to these results we can therefore put

Xa = Xr = Ax = Rx = 0

We can now go over to the notation used elsewhere in this paper by writing

Aa = da,  Ar = ef,  Ra = ea,  Rr = df

so that

far = da² df² / [(da + ef)(df + ea)(da + ea)(df + ef)]        (37)

By combining eqns (33), (34) and (35) we find that the mean square contingency φ² has the form

φ² = (da df − ea ef)² / [(df + ea)(df + ef)(da + ef)(da + ea)]        (38)
Since we would expect to find da df ≫ ea ef in a reasonably successful trial, far and φ² are effectively equivalent. If we consider Fig. 4, and imagine the two component distributions to be drawn apart so that the combined distribution, shown dotted, becomes distinctly bimodal, then we can imagine a number of interesting limits. Suppose the two humps become widely separated and the accept/reject boundary xa lies between them; then ef = ea = 0 and the maximum value of χ² = D is obtained. With the two humps still well separated, for ease of visualization, we can now move xa along to the right until both humps are to the left of the accept/reject boundary. This corresponds to setting df = ef = 0 and the value of χ² becomes zero. If we now move the line marked xa past the humps to the extreme left-hand side of the diagram then we have the case where ea = da = 0 and χ² again becomes zero. Whilst this is the smallest value of χ² that can be achieved, we note that it can be obtained in two ways which may not correspond to cases of equally bad inspections in practice. For example it may be considered worse to have the case where df = ef = 0 than to have da = ea = 0 because the structural integrity would be more likely to be impaired in the former case than in the latter. The correct way of judging these two cases would involve detailed cost-benefit analysis. Sharp and Sproat 13 considered weighted sums like eqns (27), (28) and (29) and found them inappropriate as ranking tools since multiplication
of both df and ea by a constant will leave the ranking unaltered whereas the team performances would normally be considered to have changed. To overcome this problem, a quantity called Somers' ∂, defined by

∂ = Δ / [(df + ea)(da + ef)]        (39)

can be used as a ranking tool and was preferred by Sharp and Sproat to φ. Note that, in principle, ∂ has a range of values −1 ≤ ∂ ≤ +1, since one possible, extreme, limit would be if the defects were specified completely wrongly such that da = df = 0, giving the lower, negative limit. If values of ∂ near −1 were recorded in a test-block exercise the first suspicion would be that the team had made a reporting error. Such an error might occur with confusion over bench marks, as was the case in PISC I, for example. If we were to select df + ef flaws at random from the total number of reports made, i.e. from df + ef + da + ea, of which we know that df + ea are actually rejectable, then the penultimate entry in the table is the mean number of defects μr which would be rejected. This value is obtained from the hypergeometric distribution as

μr = (df + ef)(df + ea) / (df + da + ef + ea)        (40)

About this mean there will be a variance σr² given by

σr² = μr(da + ef) / (df + da + ef + ea)        (41)
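All of the statistically based ranking quantities introduced in this section derive from the same 2 × 2 contingency table, so they can be computed together. The sketch below implements the formulas as reconstructed above for the case of no false calls and no missed defects; it is our own illustration, with invented names and example counts.

    from math import sqrt

    def ranking_statistics(df, ef, ea, da):
        """Ranking statistics for one team from the contingency table of eqn (31):
        df, ef are correct and incorrect rejections; ea, da are incorrect and
        correct acceptances (no false calls or missed defects assumed)."""
        D = df + ef + ea + da
        delta = df * da - ef * ea                          # determinant of eqn (31)
        chi2 = D * delta ** 2 / ((df + ef) * (da + ea) * (df + ea) * (da + ef))   # eqn (32)
        phi = sqrt(chi2 / D)                               # eqn (33)
        far = (da * df) ** 2 / ((da + ef) * (df + ea) * (da + ea) * (df + ef))    # eqn (37)
        somers_d = delta / ((df + ea) * (da + ef))         # eqn (39)
        mu_r = (df + ef) * (df + ea) / D                   # eqn (40)
        var_r = mu_r * (da + ef) / D                       # eqn (41)
        return {"chi2": chi2, "phi": phi, "far": far,
                "somers_d": somers_d, "mu_r": mu_r, "var_r": var_r}

    # Example team: 30 correct rejections, 2 incorrect rejections,
    # 3 incorrect acceptances, 5 correct acceptances.
    print(ranking_statistics(df=30, ef=2, ea=3, da=5))
    # chi2 is about 14, well above 2.71, so this team is classifying defects
    # far better than chance.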
3. PRESENTATION OF TEST-BLOCK RESULTS
In order to compare the ranking of team results according to the various parameters discussed earlier, some data are required. Not unnaturally, teams participating in round-robin test trials such as PISC II take a fierce pride in their results and this may hinder an objective judgement of the best measures to be used when presenting the results. To overcome this we have used synthetic data generated for two hypothetical round-robin trials. We assume that all defects are detected and that only classification is of concern, in order to make the points about ranking based on classification ability more clearly. We make assumptions concerning the errors involved in sizing. As discussed in section 3.1, measurements taken in practice will
exhibit a distribution of values about a mean with some standard deviation. Whilst these errors are not necessarily normally distributed, for the purposes of these calculations we have assumed that they are. Thus if mean errors in sizing the through-wall extent and the length of cracks are known, together with the correlation coefficient between these two errors, then expression (42) can be used to assign a probability of measuring any defect size given the actual defect parameters. In this, we are assuming that all the cracks in the hypothetical test block behave in a rather similar way and that there are no rogue defects which are either much easier or much more difficult to size than average. This is a simplifying assumption which should not invalidate the averaged results which we shall present. Once the actual defect parameters are known, then the defect is known to be either acceptable or unacceptable, and the probability density distribution allows the values of terms such as da, df, ea and ef to be evaluated (eqns (22)-(25) and (31) and Figs 4 and 6). This then makes possible a calculation of the most probable values for all the ranking measures of capability for defect classification by that team. By varying the values of the mean sizing errors in through-wall dimension or length or their associated standard deviations, or the correlation between them, sets of results corresponding to a set of hypothetical teams can be described. These results can be ranked on each trial according to the different measures and then examined to see how well they accord with intuitive feelings about the absolute statistical values of mean errors and standard deviations.

3.1. A model to evaluate different presentations of results of defect classification round-robins

In carrying out the nondestructive measurements of crack through-wall dimension and length, some errors will occur. Over all the defects in the test-blocks these errors can be averaged to yield mean errors, and about these mean errors there will be a distribution with some standard deviation in both through-wall extent and crack length. This distribution might be a normal one but is not necessarily so, as was found in an analysis of the Defect Detection Trials. 12 In particular the distribution might not be symmetrical about the mean, and so the higher moments such as skewness and kurtosis would be important. However, for illustrative purposes we will use as a model a normal distribution in the two variables: crack through-wall extent x and crack length denoted by y. Even when this distribution is not strictly applicable, it might still be an acceptable approximation to the true distribution of errors.
The bivariate normal distribution has a probability density f(x, y; μx, μy, σx, σy, ρ) given by

f(x, y; μx, μy, σx, σy, ρ) = [4π²σx²σy²(1 − ρ²)]^{−1/2} exp{ −[1/(2(1 − ρ²))] [ (x − [x0 + μx])²/σx² − 2ρ(x − [x0 + μx])(y − [y0 + μy])/(σx σy) + (y − [y0 + μy])²/σy² ] }        (42)

where x0 is the actual crack through-wall dimension and y0 its length. The terms μx and μy represent the mean errors in crack through-wall dimension and crack length respectively, with associated standard deviations σx and σy respectively, and ρ is the correlation coefficient between these two errors with a value lying between ±1. This probability density is shown schematically in Fig. 7 superimposed on a plot of the ASME XI criteria for defect acceptability. For a particular defect, the volume under this function corresponding to the stippled area below the accept/reject boundary is the probability of accepting this defect and is denoted by Pa. This will be the probability of correctly accepting it if the defect is actually acceptable, and is therefore a contribution to da, or it will represent an error contributing to ea if the defect is rejectable.
Fig. 7. Use of bivariate normal distribution to determine the probability that a defect will be classified as acceptable.
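One convenient way of evaluating Pa for a given defect is Monte Carlo sampling from the bivariate normal of eqn (42), scoring each sample against the accept/reject curve. The sketch below is our own illustration only: the boundary function used is a crude placeholder, not the ASME XI curve of Fig. 5, which would have to be substituted for any real calculation, and the names and numbers are invented.

    import numpy as np

    def acceptance_probability(x0, y0, mu, sigma, rho, is_acceptable,
                               n_samples=100_000, seed=0):
        """Monte Carlo estimate of Pa for a defect of actual through-wall size x0 and
        length y0, with measured sizes drawn from the bivariate normal of eqn (42);
        is_acceptable(x, l) encodes the accept/reject boundary."""
        rng = np.random.default_rng(seed)
        mean = [x0 + mu[0], y0 + mu[1]]
        cov = [[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
               [rho * sigma[0] * sigma[1], sigma[1] ** 2]]
        q = rng.multivariate_normal(mean, cov, size=n_samples)
        q = q[(q > 0.0).all(axis=1)]   # drop negative sizes (truncation of section 2.3.2)
        return float(np.mean(is_acceptable(q[:, 0], q[:, 1])))

    def toy_boundary(x, l, h=250.0):
        """Placeholder accept/reject curve (NOT the ASME XI curve of Fig. 5)."""
        return x / h < 0.02 + 0.05 * x / np.maximum(l, 1e-9)

    # Defect of 10 mm through-wall size and 80 mm length, measured by a team with
    # mean oversizing of (2, 10) mm, standard deviations (4, 20) mm and rho = 0.3.
    print(acceptance_probability(10.0, 80.0, mu=(2.0, 10.0), sigma=(4.0, 20.0),
                                 rho=0.3, is_acceptable=toy_boundary))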
3.1.1. Design of test-block trials - a comment

An important point for the design of test-block trials is revealed by this diagram: namely that defects should not be placed too close to the accept/reject boundary. Suppose a team has a very good technique with zero mean errors in both length and through-wall dimension measurement. If the actual defects lay on the accept/reject boundary, then no matter how small were the team's standard deviations in the measurements, they would have only a random chance of classifying each defect correctly. The results of the test-block exercise could thus be obtained by tossing a coin! This is because there would be a 50% chance of classifying each defect correctly or incorrectly. Thus it is clearly important in the design of test-blocks that the defects are spread over a region extending away from the accept/reject boundary with possibly a minimum distance on either side of the line. This minimum distance might be specified by the accuracy required in the fracture mechanics assessment, for example. In the ASME XI code for the examination of reactor pressure vessel welds, for example, all dimensions over 25 mm are rounded to the nearest 2 mm whilst those under 25 mm are rounded to the nearest 1 mm. This might suggest that defects should be at least 2 mm from the accept/reject boundary in test-block trials designed with ASME rules in mind.
3.2. Synthesized test-block trials

A number of defects are chosen with parameters obtained as random samples from certain statistical distributions. In the two cases presented here the defects were taken to have through-wall dimension and either length or aspect ratios chosen from uniform distributions. Uniform distributions were chosen for the defect parameters so that no special emphasis was given to any particular defect size or length within some range. To separate teams with a range of classification abilities, there must be defects at various distances from the boundary between acceptable and rejectable defects. This will enable good teams to be distinguished from poor teams but, with only a few tens of defects overall, will make discrimination between two teams with similar abilities difficult, if not impossible. Other distributions of defect parameters could be used; distributions with peaks would lead to a non-linear discrimination between teams. This could be useful for distinguishing between teams with very similar abilities but would be difficult to use in a test-block trial aimed at a set of teams with, possibly, a wide range of abilities (which is normally the case since the teams' abilities are not usually known in advance of a round-robin). In the first trial there are 40 cracks with through-wall dimension limited
to the range 0-60 mm, and lengths up to about 500 mm, but with the majority of the cracks having lengths under 100 mm. The aspect ratio is uniformly distributed and determined after the through-wall dimension has been selected. This leads to the exponential-like distribution of crack lengths shown in Fig. 8. Such long defects would tend to enhance the defect detection probability and, indeed, it is one of our assumptions that all defects have been successfully detected. This is not an essential limitation of the work because a probability of detection could be included. The present results could then be viewed in the sense of conditional probabilities, given successful detection. However, this would obscure the main points concerning the presentation of results on defect classification according to fracture mechanics rules. Results from recent test-block exercises, such as the DDT 3-6
[Fig. 8. Distribution of crack sizes in first synthesized test-block exercise: histograms of through-wall size (mm) and crack length (mm).]
Results from recent test-block exercises, such as DDT 3-7 and PISC II, 9,10,14 suggest that classification is a more important issue than detection.
Defects are not only distributed in through-wall size and defect length but are also distributed uniformly throughout the first 80 mm of the plate thickness. As can be seen from Fig. 5 the criteria for acceptance or rejection of defects according to ASME rules will vary with the distance of the crack edge from the nearest pressure-retaining surface. This is fully taken into account in the model and each defect is tested for acceptance or rejection according to ASME XI rules depending on the depth of the defect from the surface. Histograms of crack through-wall size and length for the first trial are shown in Fig. 8. There are 40 defects in all, with 34 unacceptable ones according to ASME rules and 6 acceptable ones. The plate into which these defects are imagined to be inserted is taken to be 250 mm thick. The actual crack parameters are given in Table 1.
A second trial is also constructed to show the effects of defect population on the presentation of results. This second trial consists of 40 smaller cracks. Again they are distributed relatively uniformly in through-wall size, although the maximum dimension is less than 13 mm whilst the lengths span the range 0-55 mm. In this case the lengths are chosen at random from a uniform distribution, rather than the aspect ratio as in the first trial. This results in a flatter distribution in defect length than in the first trial. The cracks are again distributed in depth from the pressure-retaining surface. Histograms of the through-wall size and crack length are given in Fig. 9 and the actual crack parameters are presented in Table 2.
For the first trial, for a team with sizing errors around the average values over the seven teams, about a quarter of the defects contribute most to the outcome of the trial. For teams with greater errors nearly all the defects will contribute significantly to the outcome, and for better teams the outcome will be largely decided by about six or seven defects. Similar observations hold for the second trial. This allows discrimination between teams with rather different abilities but does not provide sufficient resolution effectively to separate teams with similar abilities.

3.2.1. The hypothetical teams
In both trials a set of seven hypothetical teams was chosen with a range of values of mean and standard deviations of sizing errors. These values represent typical good values, judged by current test-block results, through to poor ones and are representative of values deduced from recent test-block trials on thick-section steel.12 The mean errors in crack through-wall size measurement and the associated standard deviations attributed
[Fig. 9. Distribution of crack sizes in second synthesized test-block exercise: histograms of through-wall size (mm) and crack length (mm).]
to the teams are shown as solid circles in Fig. 10. These are plotted as points in a two-dimensional map together with the mean length error and standard deviation for the teams plotted as open circles. The solid circle and the open circle for each team are joined by faint lines in Fig. 10. Whilst the line joining the two points is of no significance, it serves to draw attention to the two characteristic values for each team. The two points and the line joining them define a dumb-bell in the plot, and the distribution of these dumb-bells around the plane of the plot indicates the span of team capabilities. Good teams are expected to be those with short dumb-bells lying near the bottom left-hand corner of Fig. 10. The actual values used are given in Table 3. The teams have been given names to aid comparisons between them in the next sections.
[Fig. 10. Mean and standard deviation of sizing errors of the seven teams involved in the synthesized test-block trial. Solid circles: mean through-wall size error (mm) against its standard deviation; open circles: mean crack length error (mm) against its standard deviation; each team's two points are joined to form a dumb-bell.]
These labels are the names of the planets: Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto. The planets are at different distances from the Sun and this fact has been used to put their names into one-to-one correspondence with the teams based on the size of the mean sizing errors and associated standard deviations. In other words the Sun is taken to be at the point with zero mean error in the through-wall size and zero standard deviation in this dimension.

TABLE 3
Team Parameters for Synthetic Test-block Trials
Name of team   Mean error in      Standard deviation   Mean error in     Standard deviation   Correlation coefficient
               through-wall       of through-wall      crack length      of crack length      between errors in crack
               dimension of       measurement error    dimension (mm)    measurement error    length and through-wall
               cracks (mm)        (mm)                                   (mm)                 measurements

Earth              -1.3                 2.3                 -3.8               6.7                  0.174
Mars               -1.4                 2.5                 -7.2              12.0                  0.371
Jupiter             1.6                 8.6                 -0.1              18.3                  0.655
Saturn              2.3                13.7                  1.4              18.5                  0.460
Uranus              4.8                11.3                  2.4              13.6                 -0.130
Neptune             5.1                14.4                 21.2              20.5                  0.512
Pluto              16.9                22.2                  5.9              12.7                 -0.225
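The error model attributed to each team can be sketched as follows, assuming (as a sketch only) that the through-wall and length sizing errors are jointly normal with the means, standard deviations and correlation coefficient of Table 3; the function and variable names are ours, and the numbers quoted for Earth are simply copied from the table.

```python
import math
import random

def sizing_errors(mean_a, sd_a, mean_l, sd_l, rho, rng=random):
    """Draw one (through-wall error, length error) pair in mm, assuming the
    two errors are jointly normal with the given means, standard
    deviations and correlation coefficient.  The construction is the
    usual one from two independent standard normal deviates."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    err_a = mean_a + sd_a * z1
    err_l = mean_l + sd_l * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return err_a, err_l

# Earth's entries from Table 3: through-wall error -1.3 +/- 2.3 mm,
# length error -3.8 +/- 6.7 mm, correlation 0.174.
print(sizing_errors(-1.3, 2.3, -3.8, 6.7, 0.174))
```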
From Fig. 10, or Table 3, there are three teams from the seven taken to have small mean errors in sizing the through-wall dimension of cracks. These three teams, Earth, Mars and Jupiter, have mean through-wall sizing errors of absolute value less than 2 mm. Mars and Earth tend to slightly undersize whereas Jupiter oversizes. The standard deviation is much the same for Earth and Mars but is larger for Jupiter. The mean error and standard deviation in crack length measurement, though, are better for Earth than for Mars. Whilst Jupiter has a very small mean error in crack length measurement, they exhibit a marked standard deviation. The next two teams, Saturn and Uranus, have slightly bigger mean errors of sizing the through-wall dimension, but still less than 5 mm, and have quite small mean errors of sizing the crack length dimension, less than 3 mm, but with rather large associated standard deviations, some six times the mean length error or greater. The last two hypothetical teams chosen exhibit rather large mean errors and large standard deviations. Neptune has a mean through-wall size error which is not too poor whereas Pluto has a mean crack length size error which is not too poor. The other parameters of these two teams, though, are not very impressive. We shall see that this intuitive feeling, based on the positions of the teams in Fig. 10, is largely borne out in the rankings achieved over the two synthesized trials as discussed in sections 3.3, 3.4 and 3.5.

3.3. Results of first trial
Values of d_a and d_r and of e_a and e_r obtained by the various teams form the first four entries in Table 4, which lists all the values calculated for each team. The next two entries are the PISC parameters CRF and CAF obtained from eqns (5) and (6) respectively. Then come the two entries for WCRF and WCAF calculated from eqns (28) and (29) with α = 1/3 and β = 1.

TABLE 4
Results of First Synthetic Test-block Trial
Name      d_r  e_r  e_a  d_a   CRF   CAF   WCRF   WCAF    f_m    χ²     φ     d     μ_r   σ_r

Earth      33   0    1    6    0.97  1.0   0.92   0.57    9.95  33.3   0.91  0.97   28.1  2.1
Mars       33   0    1    6    0.97  1.0   0.92   0.24    9.95  33.3   0.91  0.97   28.1  2.1
Jupiter    32   2    2    4    0.94  0.67  0.91   0.62    3.8   14.8   0.61  0.61   28.9  2.1
Saturn     32   3    2    3    0.94  0.50  0.91   0.31    2.8    9.1   0.48  0.44   29.8  2.1
Uranus     33   3    1    3    0.97  0.50  0.92   0.35    3.5   12.5   0.56  0.47   30.6  2.1
Neptune    33   4    1    2    0.97  0.33  0.91  -0.02    2.6    6.8   0.41  0.30   31.5  2.2
Pluto      33   4    1    2    0.97  0.33  0.91   0.02    2.6    6.8   0.41  0.30   31.5  2.2

Ideal      34   0    0    6    1.0   1.0   0.67   0.67   14.3   40.0   1.0   1.0    28.9  2.1
Next comes the figure of merit defined by eqn (26) and this is followed by the statistical parameters χ², from eqn (34), φ from eqn (41), and Somers's d from eqn (36). The last two entries in the table give two other pieces of statistical information, μ_r and σ_r, as given in section 2.4.5.
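Equations (5), (6), (26), (34), (36) and (41) are not reproduced here, but the entries in Table 4 are consistent with the standard 2 x 2 contingency-table forms sketched below; the formulae and names in the sketch are therefore reconstructions that reproduce the tabulated values rather than quotations of the paper's own equations.

```python
import math

def classification_measures(d_r, e_r, e_a, d_a):
    """Measures of classification performance from the four counts:
    d_r defects correctly rejected, e_r defects erroneously rejected,
    e_a defects erroneously accepted, d_a defects correctly accepted.
    The expressions are standard 2 x 2 contingency-table forms that
    reproduce the tabulated values; they are reconstructions, not
    quotations of eqns (5), (6), (26), (34), (36) and (41)."""
    n = d_r + e_r + e_a + d_a            # total number of defects
    rejectable = d_r + e_a               # truly rejectable defects
    acceptable = e_r + d_a               # truly acceptable defects
    rejected = d_r + e_r                 # defects the team rejected
    crf = d_r / rejectable               # correct rejection fraction
    caf = d_a / acceptable               # correct acceptance fraction
    f_m = math.sqrt(d_r * d_a / ((1 + e_r) * (1 + e_a)))        # figure of merit
    phi = (d_r * d_a - e_r * e_a) / math.sqrt(
        rejectable * acceptable * rejected * (e_a + d_a))       # phi coefficient
    chi2 = n * phi * phi                 # chi-squared for the 2 x 2 table
    somers_d = d_r / rejectable - e_r / acceptable              # Somers's d
    # Mean and standard deviation of the number correctly rejected if the
    # same number of defects had been rejected purely at random.
    mu_r = rejected * rejectable / n
    sigma_r = math.sqrt(rejected * (rejectable / n) * (acceptable / n))
    f_ar = d_r * d_a / math.sqrt(
        rejectable * acceptable * rejected * (e_a + d_a))       # cf. eqn (43)
    return dict(CRF=crf, CAF=caf, f_m=f_m, chi2=chi2, phi=phi,
                d=somers_d, mu_r=mu_r, sigma_r=sigma_r, f_ar=f_ar)

# Earth in the first trial (Table 4): d_r = 33, e_r = 0, e_a = 1, d_a = 6.
print(classification_measures(33, 0, 1, 6))
```

For Earth in the first trial this returns CRF ≈ 0.97, CAF = 1.0, f_m ≈ 9.95, χ² ≈ 33.3, φ ≈ 0.91, Somers's d ≈ 0.97, μ_r ≈ 28.1 and σ_r ≈ 2.1, in agreement with the first row of Table 4; WCRF and WCAF are omitted because their weights (eqns (28) and (29)) depend on the arbitrary choices of α, β and w_i discussed below.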
3.3.1. Ranking of teams
Having calculated all these results, the next stage is to present the results as a ranking of the teams. This is carried out in Table 5, where the columns correspond to those in Table 4 but each column now carries the rank, as an integer from 1 to 7, indicating the team performance.

TABLE 5
Ranking of Team Performance According to the Different Measures: Results for the First Test
Name      d_r  e_r  e_a  d_a  CRF  CAF  WCRF  WCAF  f_m  χ²   φ   d

Earth      1    1    1    1    1    1    5     2     1    1   1   1
Mars       1    1    1    1    1    1    5     5     1    1   1   1
Jupiter    6    3    6    3    6    3    1     1     3    3   3   3
Saturn     6    4    6    4    6    4    1     4     5    5   5   5
Uranus     1    4    1    4    1    4    5     3     4    4   4   4
Neptune    1    6    1    6    1    6    1     7     6    6   6   6
Pluto      1    6    1    6    1    6    1     6     6    6   6   6
When tied positions occur, as for example with CAF, then all the tied teams are given the same rank and the rank of the next team(s) is the number of teams ranked ahead of them plus one. Note that all the measures except WCRF and WCAF increase as team performance increases, whereas the values for WCRF and WCAF have the ideal value of 1 - α, or 0.67 in our case. Best rank is given the value 1, worst is 7. On the basis of minimum errors in accepting or rejecting defects, that is e_a and e_r, Mars and Earth share first place. This is reflected in the parameters CRF and CAF, but note that first place is shared with Uranus and Pluto on the basis of CRF alone. The parameters WCRF and WCAF do not agree at all with the rankings produced by any of the other parameters, which is not surprising since the parameters α and β, and indeed the weights w_i themselves, are arbitrary. All the other parameters agree generally in ranking, which is not surprising since they are all based on relationships involving all of e_a, d_r, d_a and e_r. The statistical parameters and the figure of merit f_m all give equal first place to Mars and Earth and rank Neptune and Pluto equal last, in accord with their positions as plotted in Fig. 10.
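This tie-handling is ordinary competition ranking; a minimal sketch, with our own function name, is:

```python
def competition_ranks(values):
    """Rank scores where larger is better: tied scores share a rank and
    the next score's rank is one more than the number of strictly
    better scores, e.g. 1.0, 1.0, 0.67, 0.5, 0.5 -> 1, 1, 3, 4, 4."""
    return [1 + sum(1 for w in values if w > v) for v in values]

# CAF of the seven teams in the first trial (Table 4):
print(competition_ranks([1.0, 1.0, 0.67, 0.50, 0.50, 0.33, 0.33]))
# -> [1, 1, 3, 4, 4, 6, 6], matching the CAF column of Table 5
```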
The first four columns of Table 4 show clearly the importance of considering the error terms e_a and e_r as well as the correct acceptance and rejection terms d_a and d_r. This emphasizes why the figure of merit and the statistical parameters χ², φ and d are useful ranking instruments, since they are based on all four important variables. It is clear that most of the ranking measures agree with the intuitive and physically based belief that small mean errors and small standard deviations in the error of sizing, particularly in the through-wall dimension, are indicative of good team performance and vice versa. However, this is not true of the weighted PISC sizing parameters, whose values give very different rankings. Although the weighted sums do in one case distinguish correctly between teams tied according to the other ranking parameters, the overall impression of the weighted parameters is one of variability. Also the parameters α and β are largely arbitrary and could be chosen simply to produce any specified ranking over a given set of defects. The weighted parameters are, therefore, of little value in true comparisons between teams and trials and should not be used.

3.4. Results of second trial
Results for the second trial are presented in Table 6 in the same way as those for test 1 are presented in Table 4. The ranking for test 2 has been carried out in Table 7. On the basis of CAF alone, Mars and Earth again share first rank position, but on the basis of CRF they come last with Pluto in first position. This emphasizes the difficulty of coming to a logical choice between two or more teams when faced with the two parameters CAF and CRF individually. Economic cost (including aspects of structural integrity and safety) depends on some mixture of CRF and CAF.

TABLE 6
Defect Parameters for Second Synthetic Test-block Trial
Name      d_r  e_r  e_a  d_a   CRF   CAF   WCRF   WCAF   f_m    χ²    φ     d     μ_r  σ_r

Earth       3   1    5   31    0.38  0.97  0.73   0.54   2.8    8.4   0.46  0.34   0.8  0.8
Mars        3   1    5   31    0.38  0.97  0.72   0.44   2.8    8.4   0.46  0.34   0.8  0.8
Jupiter     5  11    3   21    0.63  0.66  0.48   0.58   1.5    2.1   0.23  0.28   3.2  1.6
Saturn      6  16    2   16    0.75  0.50  0.57   0.52   1.4    1.6   0.20  0.25   4.4  1.9
Uranus      6  14    2   18    0.75  0.56  0.64   0.41   1.5    2.5   0.25  0.31   4.0  1.8
Neptune     6  21    2   11    0.75  0.34  0.51   0.11   1.0    0.3   0.08  0.09   5.4  2.1
Pluto       7  24    1    8    0.88  0.25  0.60   0.05   1.1    0.6   0.12  0.13   6.2  2.2

Ideal       8   0    0   32    1.0   1.0   0.67   0.67  16.2   40.0   1.0   1.0    1.6  1.1
TABLE 7
Ranking of Team Performance According to the Different Measures: Results for the Second Test

Name      d_r  e_r  e_a  d_a  CRF  CAF  WCRF  WCAF  f_m  χ²   φ   d

Earth      6    1    6    1    6    1    3     2     1    1   1   1
Mars       6    1    6    1    6    1    2     4     1    1   1   1
Jupiter    5    3    5    3    5    3    7     1     4    4   3   4
Saturn     2    5    2    5    2    5    5     3     5    5   5   5
Uranus     2    4    2    4    2    4    1     5     3    3   4   3
Neptune    2    6    2    6    2    6    6     6     7    6   7   7
Pluto      1    7    1    7    1    7    4     7     6    7   6   6
Again the weighted parameters WCRF and WCAF are not in agreement with any of the other ranking tools. The four parameters f_m, χ², φ and d are very much in agreement and all rank Mars and Earth first, followed in third position variously by Uranus (by three parameters) and Jupiter (by one parameter). The χ² parameter in Table 6 indicates that Mars and Earth are, in fact, well ahead of their nearest rivals. This is also reflected in the fact that their value of d_r = 3 is much larger than the mean value μ_r = 0.8, together with a standard deviation σ_r = 0.8, of a purely random assignation of rejectable defects. For Pluto, however, we observe performance which is very little different, if at all, from a random designation of cracks as rejectable or acceptable once they have been detected. With this kind of inspection quality, detection is the only possibility and attempts at fracture-mechanics-oriented decisions are fatuous.

3.5. Combination of results from both tests
Having obtained two sets of results on disparate sets of defects with rather different distributions of cracks, it is possible to combine the results and see what, if anything, this does for the final presentation. Results for the combined trial are presented in Table 8 in the same way as those for test 1 are presented in Table 4. Having previously discovered the rankings obtained from WCRF and WCAF to be meaningless, they are not included in Table 8. Because the values in the various columns of Table 8 generally stand out in rank order, no ranking table has been included for the combined results. A striking feature of the compound results is how 80 defects instead of 40 enhances the confidence in the two teams Mars and Earth as leaders.
TABLE 8
Defect Parameters for Both Synthetic Test-block Trials

Name      d_r  e_r  e_a  d_a   CRF   CAF    f_m    χ²    φ     d     μ_r   σ_r   f_ar

Earth      36   1    6   37    0.88  0.97   9.8   55.4  0.83  0.83   19.4  3.0   0.84
Mars       36   1    6   37    0.86  0.95   9.8   55.4  0.83  0.83   19.4  3.0   0.84
Jupiter    37  13    5   25    0.88  0.66   3.3   24.7  0.56  0.54   26.2  3.5   0.60
Saturn     38  19    4   19    0.90  0.50   2.7   16.0  0.45  0.40   29.9  3.8   0.50
Uranus     39  17    3   21    0.93  0.55   3.4   22.0  0.52  0.48   29.4  3.7   0.56
Neptune    39  25    3   13    0.93  0.34   2.2    9.1  0.34  0.27   33.6  4.0   0.40
Pluto      40  28    2   10    0.95  0.26   2.1    7.3  0.30  0.22   35.7  4.1   0.35

Ideal      42   0    0   38    1.0   1.0   39.9   80.0  1.0   1.0    22.1  3.2   1.0
In fact this would be what we should deduce from Fig. 10, in which the dumb-bells of Earth and Mars are shorter than those of the other teams as well as lying closer to the origin, giving an overall performance which is better. It also demonstrates that the through-wall extent of the cracks is more important in determining test-block performance than the accurate measurement of crack length, since the two teams have quite different performances in terms of sizing accuracy of crack length but very similar accuracies of sizing the through-wall dimension of cracks. This is encouraging since it is generally accepted that the through-wall extent of the cracks is the more important parameter from a fracture mechanics point of view. The reason for this bias of the test-block results towards the crack through-wall dimension can be seen from Fig. 5. The accept/reject boundary is mostly horizontal, so that a smaller change in the through-wall dimension of a crack than in its length can change an acceptable defect into a rejectable one or vice versa. Earth and Mars are well out in the lead, yet based on CRF alone we find Mars last in the league, and even with the inclusion of CAF, Jupiter and Saturn might still look like attractive teams. The statistical parameters serve to highlight the difference between them and the field leaders, Earth and Mars. We can also compare the results of calculating f_m and f_ar for the synthetic results of the hypothetical test-block trials discussed in section 3. We note that
f_ar² = d_r² d_a² / [(d_a + e_r)(d_a + e_a)(d_r + e_a)(d_r + e_r)]        (43)
and the results are given in Table 8. Comparing the ranking of the seven hypothetical teams using these two measures of performance, we see that Earth and Mars lead by a much greater margin in f_m than in f_ar, largely because of the factor 1 + e_r in the denominator of f_m. This may be desirable. However, the progression of values of f_ar reproduces more faithfully the
progression of parameters displayed in Fig. 10. In particular, Jupiter and Uranus change position when ranked in terms of f_ar rather than in terms of f_m. Finally, it is interesting to note that the quantity φ² of Sharp and Sproat, 13 given by eqn (33), is very similar to the expression for f_ar² given by eqn (43) above.
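As a numerical check, Earth's combined counts in Table 8 are d_r = 36, e_r = 1, e_a = 6 and d_a = 37, so that eqn (43) gives f_ar² = (36 x 37)²/[(37 + 1)(37 + 6)(36 + 6)(36 + 1)] = 1 774 224/2 539 236 ≈ 0.70, and hence f_ar ≈ 0.84, in agreement with the tabulated value.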
4. CONCLUSIONS
The distributions of defects used in the two hypothetical blocks considered in this paper are quite different, yet the teams' results are reasonably consistent as judged by the ranking achieved. This suggests either that the distribution of defects in test-block trials is not crucial to the outcome or that, at least, broad, almost uniform, distributions of defect sizes produce sensible results. We have developed and presented in this paper the necessary tools to investigate this conjecture further. However, if the test-block trial is being devised to test some specific feature, such as the capability to correctly size small near-surface defects, then there would be no point in having some cracks very large or deeply buried.
It is clear that most of the ranking measures agree with the intuitive and physically based belief that small mean errors and small standard deviations in the error of sizing, particularly in the through-wall dimension, are indicative of good team performance and vice versa. Hence the best presentation of results is one in which the absolute mean errors and their standard deviations are displayed and the teams participating in a round-robin trial are ranked with one of the parameters f_m, χ², φ or d. Parameters such as correct detection rate (DDF) as currently used in PISC are of value in that they facilitate rapid, simple comparisons between the present results and those of PISC I. However, they must be supplemented by mean and standard deviations in the errors of sizing and location. More sophisticated measures of team success are well worth considering. Such parameters can also incorporate information on the false call rate and the misclassification rate. As well as results averaged over all teams, individual team results are important and should always be given. They need not necessarily be identified.
Some simple parameters like CRF and CAF are considered inadequate vehicles for presenting the results of test-block trials if used alone. In an attempt to improve on them, various weighted parameters have been considered. The weights could be drawn from fracture mechanics criteria but a limitation is imposed by the fact that this weighting must not be so great as to render the outcome of the trial dependent solely on the values
obtained at a small number of defects. However, the weighted sizing parameters considered here give very different and unreliable rankings from those expected on the basis of the underlying physics. The weights and associated parameters α and β are largely arbitrary and could be chosen simply to produce any specified ranking over a given set of defects. The weighted parameters are, therefore, of little value in true comparisons between teams and trials and should not be used.
The pictorial representations of each team's results as dumb-bell plots of mean and associated standard deviations of sizing accuracy in both the through-thickness and defect length dimensions are an aid to assimilation of each team's ability and are a suitable means for comparing teams.
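As an illustration of this recommendation, a minimal matplotlib sketch of such a dumb-bell plot can be built from the values of Table 3; the plotting details (marker styles, labelling, the single shared pair of axes) are our own choices rather than those of Fig. 10.

```python
import matplotlib.pyplot as plt

# (mean error, standard deviation) pairs from Table 3, in mm:
# first the through-wall sizing errors, then the crack-length sizing errors.
teams = {
    "Earth":   ((-1.3, 2.3),  (-3.8, 6.7)),
    "Mars":    ((-1.4, 2.5),  (-7.2, 12.0)),
    "Jupiter": ((1.6, 8.6),   (-0.1, 18.3)),
    "Saturn":  ((2.3, 13.7),  (1.4, 18.5)),
    "Uranus":  ((4.8, 11.3),  (2.4, 13.6)),
    "Neptune": ((5.1, 14.4),  (21.2, 20.5)),
    "Pluto":   ((16.9, 22.2), (5.9, 12.7)),
}

fig, ax = plt.subplots()
for name, (tw, ln) in teams.items():
    # A faint line joins the through-wall point (solid circle) to the
    # length point (open circle), forming the dumb-bell for each team.
    ax.plot([tw[0], ln[0]], [tw[1], ln[1]], color="grey", linewidth=0.6)
    ax.plot(tw[0], tw[1], "ko")
    ax.plot(ln[0], ln[1], "ko", markerfacecolor="none")
    ax.annotate(name, tw, textcoords="offset points", xytext=(4, 4), fontsize=8)

ax.set_xlabel("Mean sizing error (mm)")
ax.set_ylabel("Standard deviation of sizing error (mm)")
plt.show()
```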
ACKNOWLEDGEMENT
This work was partly funded by the Central Electricity Generating Board.
REFERENCES
1. Plate Inspection Steering Committee, A description of the PISC project; Evaluation of the PISC trial results; Analysis scheme of the PISC trials results; Destructive examination of the PVRC plates nos 50/52, 51/53 and 204; Ultrasonic examination of the PVRC plates nos 50/52, 51/53 and 204, Volumes I to V respectively of EUR 6371 en, published by the Commission of the European Communities, 1979.
2. Oliver, P., The reliability of non-destructive test methods: the PISC program, NEA Newsletter, 2 (1984), 15-16.
3. Watkins, B., Ervine, R. W. and Cowburn, K. J., The UKAEA Defect Detection Trials, Brit. J. NDT, 25(7) (1983), 179-85.
4. Watkins, B., Lock, D., Cowburn, K. J. and Ervine, R. W., The UKAEA Defect Detection Trials on test pieces 3 and 4, Brit. J. NDT, 26(2) (1984), 97-105.
5. Watkins, B., Cowburn, K. J., Ervine, R. W. and Latham, F. J., Results obtained from the inspection of test plates 1 and 2 of the defects detection trials (DDT paper No. 2), Brit. J. NDT, 25(7) (1983), 186-92.
6. Lock, D. L., Cowburn, K. J. and Watkins, B., The results obtained in the UKAEA defect detection trials on test pieces 3 and 4, Nuclear Energy, 22(5) (1983), 357-63.
7. Murgatroyd, R. A. and Firth, D., A review and further analysis of the results of the Defect Detection Trials, Paper presented at Post-SMIRT Conference Seminar No. 2, Varese, Italy, 28-29 August 1985, to be published in Non-destructive Examination in Relation to Structural Integrity, ed. R. W. Nichols and G. Dau, Elsevier Applied Science Publishers, London.
8. Crutzen, S. J., PISC exercises: looking for effective and reliable inspection procedures, Nucl. Eng. Des., 86 (1985), 197-218.
9. Haines, N. F., Data analysis methods and results, Paper presented at Post-SMIRT Conference Seminar No. 2, Varese, Italy, 28-29 August 1985, to be published in Non-destructive Examination in Relation to Structural Integrity, ed. R. W. Nichols and G. Dau, Elsevier Applied Science Publishers, London.
10. Nichols, R. W., Summary and conclusions, Paper presented at Post-SMIRT Conference Seminar No. 2, Varese, Italy, 28-29 August 1985, to be published in Non-destructive Examination in Relation to Structural Integrity, ed. R. W. Nichols and G. Dau, Elsevier Applied Science Publishers, London.
11. Temple, J. A. G., Reliable ultrasonic inspection in theory and in practice: sizing capability of Time-of-Flight Diffraction, Paper presented at 3rd European Conference on NDT, Florence, Italy, 15-18 October 1984.
12. Temple, J. A. G., Sizing capability of automated ultrasonic Time-of-Flight Diffraction in thick section steel and aspects of reliable inspection in practice, AERE R-11548, HMSO, London, 1985.
13. Sharp, H. and Sproat, W. H., Treatment of false calls in evaluating nondestructive testing proficiency, J. Nondestructive Evaluation, 2(3/4) (1981), 189-94.
14. Crutzen, S. J., Results of post-test NDE and destructive examinations, Paper presented at Post-SMIRT Conference Seminar No. 2, Varese, Italy, 28-29 August 1985, to be published in Non-destructive Examination in Relation to Structural Integrity, ed. R. W. Nichols and G. Dau, Elsevier Applied Science Publishers, London.