Statistical learning methods: Basics, control and performance


J. Zimmermann
Max-Planck-Institut für Physik, Föhringer Ring 6, 80805 München, Germany

Nuclear Instruments and Methods in Physics Research A 559 (2006) 106–114. Available online 20 December 2005.

Abstract

The basics of statistical learning are reviewed with a special emphasis on general principles and problems common to all types of learning methods. Different aspects of controlling these methods in a physically adequate way are discussed. All principles and guidelines are exercised on examples of statistical learning methods in high energy and astrophysics. These examples show, in addition, that statistical learning methods very often lead to a remarkable performance gain compared to the competing classical algorithms.
© 2005 Elsevier B.V. All rights reserved.

PACS: 02.50.Sk; 02.50.Tt; 07.05.Mh

Keywords: Multivariate analysis; Inference; Neural networks; Support vector machines; Decision trees; Classification; Regression

1. Introduction

These proceedings, like the proceedings of the past ACAT conferences [1,2], again show many different applications of statistical learning methods in physics analysis. One of the most popular methods is the neural network, but other learning strategies have also found their way into the analysis of physics experiments, mostly in high energy and astrophysics.

This paper takes a step back from the direct application of a specific learning method to a specific problem towards a generalised overview of the concepts and problems of statistical learning. We want to emphasise those points which are typically neglected whenever a statistical learning method is employed to solve a special physics analysis task. Among them are the regularisation principle, also from the mathematical side, the variety of different learning methods, and the need for comparison among them. Since physicists sometimes fear the black-box behaviour of statistical learning methods, we devote a section to the physically correct control of these methods. Finally, we


present some very encouraging results which show that statistical learning methods can indeed improve the classically obtained results significantly.

Section 2 starts with a short introduction to statistical learning in a general way which covers many different types of applications and learning methods. Section 3 then summarises three points which are important to keep statistical learning methods well under control in physics analysis, including the calculation of uncertainties. Section 4 finally illustrates the discussed features and problems of statistical learning with three different examples from high energy and astrophysics, leaving no doubt that the application of statistical learning methods can be very rewarding.

2. Statistical learning basics

This section provides a short introduction to the basic concepts and properties of statistical learning.

2.1. Training of statistical learning methods

When speaking about statistical learning we always mean supervised learning throughout this paper. Supervised learning means that N pairs of input and corresponding


Fig. 1. Supervised learning methods (here for classification) that build a model are trained, resulting in a classifier which is then used to evaluate new events.

Fig. 2. (Left panel: function space too large; right panel: appropriate number of parameters.) As in function fitting, overtraining is due to too many free parameters. Regularisation reduces the searched function space by restricting the parameters. After regularisation the training examples are no longer fitted exactly, but the generalisation is much better.

target $(\vec{x}_i, y_i)$, $i = 1 \ldots N$, are given to the learning method (for unsupervised learning, by contrast, only the inputs are given, and information about the multidimensional distribution of the events, "clustering", is derived). The inference from these N pairs of input and target to some kind of output function $\mathrm{out}(\vec{x})$ will be called the training of the statistical learning method. Fig. 1 shows a scheme of the usual process of training and evaluation for a classification problem. Some learning methods do not build a model; they derive the output for a new event directly from the given examples.

The problem of overtraining is due to the statistical nature of the learning process. A potentially very complex functional dependence has to be derived from potentially very few examples. Since the complexity of the target function is not known a priori, an effect familiar from function fitting can be observed: Fig. 2 shows a fitting example in which, in the left picture, a function with too many adjustable parameters was chosen. This leads to a perfect fit, but without any generalisation. In the right picture the number of free parameters is appropriate. The number of free parameters, and thus the size of the function space which is searched, has the same influence for a statistical learning method. The adjustment of the size of the searched function space is called regularisation.
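To make the fitting picture of Fig. 2 concrete, here is a minimal Python sketch of the effect; the parabola, the noise level and the polynomial degrees are invented for illustration and are not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(-1.0, 1.0, 10)
    y = x**2 + rng.normal(0.0, 0.1, size=x.size)   # 10 noisy training examples
    x_new = np.linspace(-1.0, 1.0, 200)            # independent evaluation grid
    y_true = x_new**2

    for degree in (9, 2):  # too many free parameters vs. an appropriate number
        coeffs = np.polyfit(x, y, degree)
        train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
        true_mse = np.mean((np.polyval(coeffs, x_new) - y_true) ** 2)
        print(f"degree {degree}: training MSE {train_mse:.4f}, true MSE {true_mse:.4f}")

The degree-9 polynomial reproduces the 10 training points essentially exactly but generalises much worse than the degree-2 fit; restricting the degree plays the role of regularisation.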


2.2. Different problem settings

The type of the target quantity depends on the given problem: for a classification, background usually has to be distinguished from physics events, and the target takes the values zero and one, respectively. In a regression problem the target is a continuous quantity which has not been measured directly but may be deduced from the inputs.

The input quantities which should be used to estimate the target values of new, unclassified events are in our context naturally based on experimentally measured quantities. Different strategies of pre-processing may be applied to the raw detector values. The goal is always to use prior knowledge about the physical meaning of the raw inputs to generate a new set of inputs which describes the event on a higher abstraction level. This abstraction is usually combined with a reduction of the dimensionality of the input vector.

A special case where this kind of pre-processing has to be kept to a minimum is the online application of statistical learning methods, as in a trigger application. Here mostly raw detector values have to be used as inputs, since the tight time restrictions do not allow a full reconstruction of the event. Naturally, the decision obtained in this way cannot compete with a decision based on the fully reconstructed event. However, the motivation to apply a statistical learning method in a trigger scenario is the lack of time (hardware implementations of neural networks now exist which evaluate a large neural network within 400 ns [3]), and for the given time restrictions a statistical learning method approximates the optimal decision behaviour.

In all cases where events can be fully reconstructed (offline), it is the lack of knowledge which motivates the application of statistical learning methods. Either no algorithm can be built classically because no model for the obtained data is available, or the theory does not describe the experimental data well, which leads to suboptimal performance. In both cases statistical learning methods can overcome the missing knowledge by learning from examples. The training of a statistical learning method can only be done if the correct classification is given for the training examples. For an online application the respective offline analysis provides this information. For an offline application mostly simulations have to be used (with care!).

2.3. Statistical learning methods

The most widely used statistical learning methods can be grouped into three categories depending on the underlying idea of classification (regression is always done by generalising classification) [4–9]. The three categories come from different basic classification concepts (a short code sketch contrasting the three families follows the list):



- Decision trees: consecutive cuts in different inputs result in an input space which is divided into small hypercubes reflecting the class distributions. Classification is done by majority voting inside each hypercube.
- Local density estimators: a small environment around a new event is defined and classification is done by majority voting inside this small environment.
- Methods based on linear separation: a hyperplane is used to divide the input space into two regions. A new event is classified according to the half space it falls into.
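As a minimal illustration (assuming scikit-learn is available; the toy dataset and all parameter choices are invented, not taken from the paper), one representative of each family can be trained on the same problem:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression  # linear separation
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier   # local density estimation
    from sklearn.tree import DecisionTreeClassifier      # consecutive cuts

    X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for clf in (DecisionTreeClassifier(max_depth=5),   # hypercubes via axis-parallel cuts
                KNeighborsClassifier(n_neighbors=15),  # majority vote in a small environment
                LogisticRegression()):                 # a single separating hyperplane
        clf.fit(X_train, y_train)
        print(type(clf).__name__, f"test accuracy: {clf.score(X_test, y_test):.3f}")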

Meta learning strategies, finally, can operate on any of these basic learning methods, or even on mixtures of them. They improve the performance of the underlying learning method by combining several differently trained classifiers.

Statistical learning methods try to model a functional dependence only by looking at some examples. This requires some basic assumptions about what this function might look like, called the bias of the learning method. Each learning method implements its own bias, the preference for a specific kind of target function, by its internal representation of the learned hypothesis. In fact there are two kinds of bias in each learning method: the absolute bias restricts the hypothesis space to those functions that can be represented by the learning method, and the preference bias represents the wish to find the best hypothesis in a specific part of the hypothesis space because of regularisation.

2.4. Statistical learning theory

The purpose of a theoretical analysis of statistical learning methods is to connect the error (misclassification rate) $E_T$ measured on the training set $T$ with the true error $E_P$ which is (theoretically) measured over all possible events. Both quantities measure the disagreement between the hypothesis $h$ which is put out by the learning method and the target function $y$ which represents the true classification:

$$E_T(h) = \frac{1}{n} \sum_{x \in T:\, h(x) \neq y(x)} 1 \qquad (1)$$

$$E_P(h) = \sum_{x \in X:\, h(x) \neq y(x)} P_x(x). \qquad (2)$$

We will discuss one theorem from the VC-framework [10] which connects the training error with the true error. The important property of the learning method which allows this connection is its VC-dimension; Fig. 3 shows an example. The basic theorem from VC-theory takes a hypothesis space $H \subseteq 2^X$ with finite VC-dimension $d$, a set of $n$ training examples $T$ drawn independently according to some distribution $P_x$, and $\epsilon > 0$, $\delta > 0$. Given $n \geq (c/\epsilon)(d + \ln(1/\delta))$, the probability that $|E_P(h) - E_T(h)| > \epsilon$ is at most $\delta$ for all $h \in H$, where $c$ is a constant.

The regularisation principle "Structural Risk Minimisation" thus creates a nested sequence of hypothesis spaces $H_1 \subset H_2 \subset H_3 \subset \ldots$ with rising VC-dimension, $\mathrm{VCdim}(H_1) < \mathrm{VCdim}(H_2) < \mathrm{VCdim}(H_3) < \ldots$. The hypothesis with the lowest true error is found by searching through the hypothesis spaces in the given order and stopping when the bound for the true error is minimal:

$$E_P < E_T + c \sqrt{\frac{d + \ln\frac{1}{\delta}}{n}}. \qquad (3)$$

Fig. 4. Structural Risk Minimisation: the nested sequence of hypothesis spaces allows searching for the best hypothesis, which minimises the sum of training and "generalisation" error.

The true error is bounded by the sum of the training error and an additional term, called here the "generalisation error", which depends on the VC-dimension. Searching for the lowest true error thus means that the sum of training error and "generalisation error" has to be minimised (see Fig. 4). The parameter settings for optimal regularisation are thus determined by theoretical considerations; one does not need to select the optimal parameter set by testing different hypotheses on real data (see Section 3.1).
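As a numerical illustration of the bound (3) (not from the paper: the constant c = 1, the training errors and the VC-dimensions below are invented), one can tabulate the bound over a nested sequence of models and observe that it is minimised at an intermediate complexity:

    import math

    def vc_bound(e_train, vc_dim, n, delta=0.05, c=1.0):
        """Upper bound on the true error E_P according to Eq. (3)."""
        return e_train + c * math.sqrt((vc_dim + math.log(1.0 / delta)) / n)

    n = 1000  # hypothetical number of training examples
    for vc_dim, e_train in [(5, 0.20), (20, 0.10), (100, 0.05), (500, 0.02)]:
        print(f"VCdim {vc_dim:4d}: E_T = {e_train:.2f}, "
              f"bound on E_P = {vc_bound(e_train, vc_dim, n):.3f}")

With these invented numbers the richer hypothesis spaces reduce the training error but increase the "generalisation error" term, and the bound is smallest for the intermediate model, exactly the trade-off sketched in Fig. 4.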

3. Control of statistical learning methods

Among physicists a general fear of statistical learning methods can often be observed; more precisely, they doubt that the behaviour of these methods can be controlled in a physically correct way. The following sections therefore address the most important aspects of controlling the behaviour of a statistical learning method, with a focus on the physicist's needs.

3.1. Unbiased efficiencies

Fig. 3. The VC-dimension of a linear decision in two dimensions is at least 3 because all $2^3 = 8$ dichotomies shown on the left (small boxes) are linearly separable. It is exactly 3 because no linear decision surface is found for the example with four points on the right, where the points are in general position.

The principle of dividing the available data into a training and a test part is widely known. The training part is used to train the learning method, while the test part is kept apart and used only for the evaluation of the already trained classifier. In this way the efficiency estimate obtained with the test set is not affected by any overtraining and is therefore unbiased.


Fig. 5. Usually the available examples should be split into three sets: the training set is used to train the statistical learning method, the selection set is used to choose the optimal generalisation behaviour among different parameter settings, and the test set provides an independent estimate of the true performance. A typical splitting is 50%:25%:25%.
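A minimal sketch of such a three-way split (the 50%:25%:25% fractions follow Fig. 5; the array of event vectors is an invented placeholder):

    import numpy as np

    rng = np.random.default_rng(0)
    events = rng.normal(size=(10000, 8))  # hypothetical event vectors

    idx = rng.permutation(len(events))
    n_train = int(0.50 * len(events))
    n_select = int(0.25 * len(events))

    train = events[idx[:n_train]]                        # train the learning method
    selection = events[idx[n_train:n_train + n_select]]  # choose among trained classifiers
    test = events[idx[n_train + n_select:]]              # one unbiased efficiency estimate
    print(len(train), len(selection), len(test))         # 5000 2500 2500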

A common extension of this principle becomes important if several classifiers are trained and the best one among them should be chosen. This is a very common procedure, for example, when the parameters of a learning method are varied to study the overtraining behaviour (regularisation). In this case the best classifier cannot be chosen by the performance on the training set, since then overtraining cannot be detected. But the test set must not be used either, because this part of the data should be kept apart to determine the efficiencies of the finally chosen classifier in an unbiased way. Therefore the available data now has to be divided into three parts (Fig. 5). The selection set can then be used to choose among differently trained classifiers, and the efficiency estimate with the test set will still be unbiased.

3.2. Uncertainties

A typical basis for calculating the statistical uncertainty of the efficiency is to study the propagation of the counting errors for the numbers of events which pass or do not pass the selection. This propagation is described by

$$\sigma_\epsilon^2 = \sum_i \left(\frac{\partial \epsilon}{\partial x_i}\right)^2 \sigma_{x_i}^2 \qquad (4)$$

where the efficiency $\epsilon$ depends on quantities $x_i$ which have to be independent. Since the efficiency is the fraction of selected events,

$$\epsilon = \frac{N_{\mathrm{sel}}}{N_{\mathrm{sel}} + N_{\mathrm{cut}}} \qquad (5)$$

its uncertainty can be calculated by plugging in the counting uncertainties of the independent numbers of cut and selected (passed) events:

$$\sigma_\epsilon^2 = \left(\frac{\partial \epsilon}{\partial N_{\mathrm{sel}}}\right)^2 \sigma_{N_{\mathrm{sel}}}^2 + \left(\frac{\partial \epsilon}{\partial N_{\mathrm{cut}}}\right)^2 \sigma_{N_{\mathrm{cut}}}^2 = \frac{N_{\mathrm{sel}}\, N_{\mathrm{cut}}}{(N_{\mathrm{sel}} + N_{\mathrm{cut}})^3}. \qquad (6)$$

Systematic uncertainties are calculated by studying the propagation of the systematic uncertainties of the inputs; there is no uncertainty in the learning method itself (think of a simple, here multidimensional, cut). The same classifier is applied to test sets which have been modified according to the systematic uncertainties of the inputs. The usual way to create the modified test sets is to create exactly two sets per input, one for each $1\sigma$ variation, up and down.
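Eqs. (5) and (6) translate directly into code; the counts below are invented for illustration:

    import math

    def efficiency_with_error(n_sel, n_cut):
        """Efficiency, Eq. (5), and its statistical uncertainty from Eq. (6)."""
        eps = n_sel / (n_sel + n_cut)
        sigma = math.sqrt(n_sel * n_cut / (n_sel + n_cut) ** 3)
        return eps, sigma

    eps, sigma = efficiency_with_error(n_sel=800, n_cut=200)  # hypothetical counts
    print(f"efficiency = {eps:.3f} +- {sigma:.3f}")           # 0.800 +- 0.013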

Fig. 6. An example of how to calculate the systematic uncertainties of efficiency and rejection by processing test sets which have been modified according to the systematic uncertainties of the inputs.

In the example in Fig. 6 we would have six modified test sets: starting with a set in which in every event $x_1$ is replaced by $x_1 + 0.1$ (absolute uncertainty) and ending with a set in which in every event $x_3$ is replaced by $0.9 \cdot x_3$ (relative uncertainty). The individual inputs are assumed to be independent here, so the six differences in efficiency and rejection (modified vs. original) can be added in quadrature. This is done by adding up the squared differences in the positive direction on the one hand and all squared differences in the negative direction on the other hand. This procedure covers the case that sometimes both modification directions for the same input lead to a change (of efficiency or rejection) in the same direction. (One can understand this effect by thinking of a Gaussian input distribution for signal events from which the central part, say $[\mu - \sigma, \mu + \sigma]$, is selected by the learning method: any shift of this distribution will result in a loss of efficiency.) As can be seen in Fig. 6, the same classifier and the same cut as for the original test set are used for all the modified sets. The efficiency varies as $80^{+5}_{-6}\%$ and the rejection as $90^{+2}_{-2}\%$. These are the systematic uncertainties of efficiency and rejection which were propagated from the input quantities through the statistical learning method.
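A sketch of this procedure follows; the classifier stand-in, the test set and the uncertainty levels are all invented, and only the bookkeeping (separate quadrature sums for positive and negative differences) follows the text:

    import numpy as np

    def selected(events):
        # stand-in for "same classifier, same cut": select events with x1 + x3 > 1
        return events[:, 0] + events[:, 2] > 1.0

    rng = np.random.default_rng(2)
    test = rng.normal(loc=0.6, scale=0.4, size=(5000, 3))  # hypothetical test set
    eff0 = selected(test).mean()

    up2 = down2 = 0.0
    for modify in (lambda e: e + [0.1, 0.0, 0.0],  # x1 up/down, absolute uncertainty
                   lambda e: e - [0.1, 0.0, 0.0],
                   lambda e: e * [1.0, 1.0, 1.1],  # x3 up/down, relative uncertainty
                   lambda e: e * [1.0, 1.0, 0.9]):
        diff = selected(modify(test)).mean() - eff0
        if diff > 0.0:
            up2 += diff**2    # positive differences added in quadrature ...
        else:
            down2 += diff**2  # ... separately from the negative ones
    print(f"eff = {100*eff0:.1f} +{100*np.sqrt(up2):.1f} -{100*np.sqrt(down2):.1f} %")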

3.3. Comparison

The variety of different learning methods naturally motivates a competition. Many studies (see for example [4]) have shown that a comparison of different learning methods very often pays off. However, to derive statistically correct statements it is not enough to simply compare the efficiencies directly. The statistical uncertainties also have to be taken into account: a statistical "t-test" results in confidence intervals for the efficiency differences. A confidence interval containing zero means that no significant efficiency difference has been found.

Besides, not only one comparison is made; typically several different methods are considered.

The statistical alpha error (5% for a 95% confidence interval) is only valid if exactly one comparison is done. To limit the experiment-wise alpha error, Dunn [11] suggests reducing the comparison-wise alpha error, which makes the tests less powerful but keeps the experiment-wise alpha error at, for example, 5% in total. Table 1 lists the z-values which construct a 95% confidence interval $[\Delta m - z\sigma_{\Delta m},\, \Delta m + z\sigma_{\Delta m}]$ for a given number of means to be compared (number of learning methods). The values in the table are based on the conservative assumption that really all $n(n-1)/2$ tests are done.

Table 1. For a given number n of means (methods) which have to be compared, the shown z-value constructs 95% confidence intervals $[\Delta m - z\sigma_{\Delta m},\, \Delta m + z\sigma_{\Delta m}]$.

n   3     4     5     6     7     8
z   2.39  2.64  2.81  2.94  3.04  3.14
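The z-values of Table 1 can be reproduced, assuming the construction is a Bonferroni-style correction of the comparison-wise alpha over all n(n−1)/2 two-sided tests (an assumption on our part; the values then agree with Table 1 to about 0.01):

    from scipy.stats import norm

    def dunn_z(n_methods, alpha=0.05):
        """z-value for simultaneous 95% CLs when all pairwise tests are done."""
        n_tests = n_methods * (n_methods - 1) // 2
        return norm.ppf(1.0 - alpha / (2.0 * n_tests))  # two-sided, corrected alpha

    for n in range(3, 9):
        print(n, round(dunn_z(n), 2))  # compare with Table 1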

Fig. 7. Simulated photon event (left) vs. hadron event (right) in the MAGIC telescope camera.

4. Performance of statistical learning methods

This section serves three purposes. First, the basics of statistical learning, described above in a theoretical way, are found here in very practical examples. Second, the different aspects of controlling a statistical learning method are exercised on these examples. Finally, the presented results show very clearly that the application of statistical learning methods very often leads to extraordinary performance gains compared to competing classical algorithms.

4.1. MAGIC: γ-hadron separation

The Major Atmospheric Gamma Imaging Cherenkov Telescope [12] on La Palma observes Cherenkov light emitted by high energy cosmic particles. Valuable information about cosmic objects can be extracted from the Cherenkov light of photons, while Cherenkov light from hadrons (protons and heavier nuclei) has to be sorted out. Fig. 7 shows examples of a photon and a proton event. The inputs used to distinguish between the two cases are derived from a pre-processed shower image: a Cherenkov ellipse is calculated for each event.

Fig. 8 presents analysis results for the weak source 1ES1959. In the distributions shown, the hadrons create a flat background while the photons peak towards small values of $|\alpha|$. This corresponds to the photons pointing back to the observed source, while charged hadrons are deflected in magnetic fields and do not point to a specific location. The supercuts method is a classical parametrisation competing with the statistical learning methods random forest (a meta learning strategy called bagging, based on a simple decision tree) and neural network (based on linear separation). We see that both statistical learning methods perform significantly better than the classical algorithm, as they show a higher significance of the photon signal and a larger number of excess events below six degrees. These two quantities directly affect the uncertainties of the flux calculation.

4.2. H1: neural network trigger

The analysis of data from a high energy physics experiment such as H1 [13] at DESY is quite commonly confronted with problems of data selection, both online (trigger) and offline (event sample purification). In the case of a trigger application, the data rate coming from the experiment is quite frequently background-dominated and potentially too high for the bandwidth of data logging. By intelligently rejecting the background the rate can be reduced while keeping the efficiency high for the physics reactions of interest. Typically the rate is reduced in several steps ("multi-level trigger scheme"). In H1, e.g., the rate reduction is achieved in three levels. The second level in H1 is realised by a neural network trigger [14] which uses feed-forward neural networks to distinguish between signal and background events. For the triggering of deeply virtual Compton scattering (DVCS), for example, the level one rate of 8 Hz is reduced down to 1.6 Hz by a neural network, resulting in a loss of efficiency of only a few percent.

Fig. 9 shows the performance of various fast statistical learning methods. The statistical analysis needed to compare different hypotheses, as described in Section 3.3, is performed here with the five chosen methods. 95% confidence intervals are constructed for each difference in the performance-ordered list. The result is shown in Table 2. Exclamation marks signify a significant performance difference (95% CL) between two methods in the list, which is sorted by efficiency for a required rejection of 80%. Table 2 reveals four groups of significantly different performance: the neural network and the support vector machine perform best, then follow linear discriminant analysis and naive Bayes, and the simple cuts perform worst. It is encouraging that the neural network hypothesis, which was really implemented in the hardware and has been used to select events, is one of the best classifiers available for this problem.

Fig. 8. 1ES1959 results: $|\alpha|$ distributions for the supercuts method ($\sigma = 4.3$, $N_{\mathrm{ex}} = 143$), the random forest ($\sigma = 6.2$, $N_{\mathrm{ex}} = 168$) and the neural network ($\sigma = 7.2$, $N_{\mathrm{ex}} = 205$). Both statistical learning methods lead to a higher significance and more excess events than the supercuts method. The neural network performs better than the random forest method.

Fig. 9. Comparison of fast classification methods on the DVCS dataset (signal efficiency vs. background rejection): neural network (NN), support vector machine (SVM), linear discriminant analysis (LDA), naive Bayes (NB) and simple cuts (CUTS). The selection criterion was the best efficiency at a rejection of 80%.

Table 2. Statistical analysis of the efficiencies at 80% rejection for different methods on the DVCS dataset; statistically significant differences are marked by an exclamation mark (95% CL).

Neural network                96.5%
    Δ = 0.7% ± 0.9%
Support vector machine        95.7%
    Δ = 3.3% ± 1.5% (!)
Linear discriminant analysis  92.5%
    Δ = 2.2% ± 2.1% (!)
Naive Bayes                   90.5%
    Δ = 6.8% ± 2.5% (!)
Cuts                          83.6%

From the worst to the best classifier, from the simple cuts to the neural network, the percentage of lost events (inefficiency) is reduced by a factor of 4.7, which underlines the necessity to study and choose among different classification methods.

We now want to exercise the calculation of systematic uncertainties on the example of DVCS. As discussed in Section 3.2, the propagation of the systematic uncertainties of the inputs has to be studied. The inputs to the neural network trigger consist of energies from the liquid argon calorimeter ("LAr") and from the spaghetti calorimeter ("SpaCal"). Furthermore, three quantities are provided by the z-vertex trigger: the maximum, its position and the integral of the z-vertex histogram. This histogram is created by projecting hit combinations onto the z-axis. The uncertainties taken into account are listed in Table 3.

Table 3. Sources of systematic uncertainties for the DVCS dataset.

Variable                    Uncertainty level
LAr calibration             4%
SpaCal calibration          1%
z-vtx hist. shift (bins)    0.25
z-vtx efficiency            4%

Table 4 then shows which efficiencies and rejection rates are obtained by processing the modified test sets with the same neural network and the same cut. The variations up and down in the inputs according to the systematic uncertainties propagate to the new efficiencies. It is interesting to see that there is no degradation in the efficiency; if anything, a negative effect can be seen in a lower rejection. All variations seen in efficiency and rejection are still smaller than the statistical uncertainties induced by the low number of test events.

Table 4. Evaluation of systematic uncertainties for the DVCS network. The efficiencies and rejections are calculated with the same neural network and the same cut as for the original test set. A variation upwards and downwards is taken into account for all four sources of systematic uncertainty. The sources are assumed to be independent, so the differences in efficiency or rejection can be added in quadrature.

Variation             Eff. (%)   Rej. (%)
without               97.6       79.4
LAr +                 97.6       79.3
LAr −                 97.8       79.2
SpaCal +              97.7       79.0
SpaCal −              97.7       79.5
z-vtx shift right     97.6       79.3
z-vtx shift left      97.7       79.1
z-vtx efficiency +    97.6       79.0
z-vtx efficiency −    97.6       79.3

Table 5 summarises the total uncertainties for efficiency and rejection for the DVCS neural network. The "fault tolerance" of neural networks leads here to the effect that uncertainties of a few percent in the input quantities are transformed into uncertainties of less than one percent in the efficiency. In summary, the neural network behaves very stably and reliably, and the statistical and systematic uncertainties are well under control.

Table 5. Total uncertainties for the DVCS network.

Efficiency (%):  97.62 ± 0.51 (stat.)  −0.00 +0.30 (syst.)
Rejection (%):   79.36 ± 1.18 (stat.)  −0.59 +0.17 (syst.)
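The quadrature prescription of Section 3.2 can be checked against Tables 4 and 5; since Table 4 quotes values rounded to 0.1%, the sketch below reproduces the systematic part of Table 5 only approximately:

    import numpy as np

    eff0, rej0 = 97.6, 79.4  # "without" row of Table 4
    eff = np.array([97.6, 97.8, 97.7, 97.7, 97.6, 97.7, 97.6, 97.6])  # modified sets
    rej = np.array([79.3, 79.2, 79.0, 79.5, 79.3, 79.1, 79.0, 79.3])

    def quadrature_up_down(values, nominal):
        d = values - nominal
        return np.sqrt(np.sum(d[d > 0] ** 2)), np.sqrt(np.sum(d[d < 0] ** 2))

    print("eff. syst.: +%.2f -%.2f" % quadrature_up_down(eff, eff0))  # Table 5: +0.30 -0.00
    print("rej. syst.: +%.2f -%.2f" % quadrature_up_down(rej, rej0))  # Table 5: +0.17 -0.59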

A second physics channel for which a rate reduction of the first level trigger is desirable is charged current. There the beam electron converts into a neutrino which escapes undetected but leaves an unbalanced transverse energy distribution. Only very few of the events triggered by the missing transverse energy are real charged current events; they have been scanned by a physicist and then presented to the neural network, which should keep as many of these events at the trigger level as possible.

Fig. 10. Event from the charged current selection (R-Z and X-Y views) which would have been rejected by the neural network. It turned out to be most probably an overlay event with a high energy cosmic muon passing through the detector.

Only one of almost 400 events from the charged current selection would not have passed the trigger (Fig. 10). The energy depositions in the LAr calorimeter in both views suggest that they were not generated by particles coming from the interaction region but by a high energy cosmic ray passing the detector from top to bottom. The remaining parts of the event structure then show some kind of ep-interaction, but not a charged current event, since that presumption was based on the unbalanced LAr energy. Thus this event should not have been part of the charged current selection.

The charged current neural network has proven to be very efficient: not a single event has been rejected which undoubtedly showed a charged current interaction. In addition, the rejection of over 50% is sufficient for the needs of the trigger system. A detailed analysis of the network behaviour revealed an event which should not have been part of the charged current selection. This means that the neural network was able to detect a weak point of a physics selection by intelligent pattern recognition, even based on only the coarse information delivered to the second trigger level.

4.3. ILC: Higgs-parity measurement

As a last example of a very successful application of statistical learning methods we move to simulated events from a future linear collider experiment. The Standard Model Higgs boson H is a scalar ($J^{PC} = 0^{++}$), while extensions like the Minimal Supersymmetric Standard Model predict in addition a pseudo-scalar partner A ($J^{PC} = 0^{-+}$). It is therefore important to be able to distinguish between these two cases. The analysis on which the following discussion is based [15] uses a future experimental setup based on the TESLA proposal [16]. A linear $e^+e^-$ collider with a centre-of-mass energy of 500 GeV is assumed, with an accumulated luminosity of 500 fb$^{-1}$.

One of the most promising decay channels which allows equal sensitivity to the CP-even and CP-odd components is $H \to \tau^+\tau^-$. The propagation of the transverse spin correlations in the subsequent decays $\tau \to \rho\,\bar{\nu}_\tau(\nu_\tau)$ and $\rho \to \pi\,\pi^0$ leads to a characteristic angular distribution which depends on the parity of the Higgs boson.


The classical ansatz to distinguish between the two parity states is to fit a cosine function to the measured angular distribution, as in Fig. 11. The obtained significances reach $5.1\sigma$ (mean value over many pseudo-experiments).

A completely new ansatz to separate the two parity states is the application of a statistical learning method. Fig. 12 shows, for high statistics, that the output distributions of the two parity states do not differ much, but still enough to derive a significance from the shift of the mean output value (closer to 0 for scalar events and closer to 1 for pseudo-scalar events). The obtained significance of $6.3\sigma$ (mean value over many pseudo-experiments) shows a remarkable performance gain compared to the classical ansatz.
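The classical recipe of Fig. 11 can be sketched as follows; the pseudo-data are invented, and only the procedure (fit a cosine and quote amplitude over uncertainty as the significance) follows the text:

    import numpy as np
    from scipy.optimize import curve_fit

    rng = np.random.default_rng(3)
    edges = np.linspace(0.0, 2.0 * np.pi, 17)
    centres = 0.5 * (edges[:-1] + edges[1:])
    counts = rng.poisson(25 + 8 * np.cos(centres))  # hypothetical acoplanarity histogram

    def model(phi, offset, amplitude):
        return offset + amplitude * np.cos(phi)

    popt, pcov = curve_fit(model, centres, counts, sigma=np.sqrt(counts))
    amp, err = popt[1], np.sqrt(pcov[1, 1])
    print(f"amplitude = {amp:.2f} +- {err:.2f}, significance = {amp / err:.1f}")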

Fig. 11. Example of a histogram (N vs. acoplanarity) with a fitted cosine for one pseudo-experiment: amplitude = 7.89, uncertainty = 2.71, significance = 2.91. The 397 events in the histogram (of 500 events in total) are those that survived the preselection cuts. The amplitude and its uncertainty are derived from the fit; the significance is their quotient.

Fig. 12. Direct discrimination of scalar and pseudo-scalar events: example of the output distributions (events [%] vs. network output, shown separately for scalar and pseudo-scalar events) after training (high statistics).

5. Conclusion

The application of statistical learning methods in physics analysis leads directly or indirectly to interesting new physics results. This is made possible by the remarkable performance of these methods compared to competing classical algorithms. Despite the interesting new physics results and the remarkable increase in performance obtainable with statistical learning methods, physicists often hesitate to make use of these methods. A strong emphasis was therefore put on the understanding and control of statistical learning methods. The respective guidelines have been explained and applied in examples from high energy and astrophysics.

Acknowledgements

We thank the MAGIC group for kindly providing simulated and experimental data, and M. Worek for the preparation of the simulated data for the Higgs-parity measurement.

References


[1] Proceedings of the VIII International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Moscow, 2002.
[2] Proceedings of the IX International Workshop on Advanced Computing and Analysis Techniques in Physics Research, Tsukuba, 2003.
[3] J.-C. Prévotet, Etude des systèmes électroniques pour les réseaux connexionnistes appliqués à l'instrumentation en temps réel, Thèse de Doctorat de l'Université P. et M. Curie, 2002.
[4] J. Zimmermann, Statistical learning in high energy and astrophysics, Ph.D. Thesis, LMU Munich, June 2005.
[5] T. Hastie, et al., The Elements of Statistical Learning, Springer, New York, 2001.
[6] T. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[7] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice-Hall, Englewood Cliffs, 1995.
[8] D. Michie, et al., Machine Learning, Neural and Statistical Classification, Ellis Horwood, Chichester, 1994.
[9] N.J. Nilsson, Introduction to Machine Learning, an early draft of a proposed textbook, 1996.


[10] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[11] O.J. Dunn, J. Amer. Statist. Assoc. 56 (293) (1961) 52.
[12] D. Paneque, The MAGIC Telescope: development of new technologies and first observations, Ph.D. Thesis, Fakultät für Physik der Technischen Universität München, August 2004.
[13] I. Abt, et al., Nucl. Instr. and Meth. A 386 (1997) 310.

[14] J.H. Köhne, et al., Nucl. Instr. and Meth. A 389 (1997) 128.
[15] M. Worek, Spin correlations of the heavy flavours as signal for Higgs bosons, Ph.D. Thesis, Institute of Physics, University of Silesia, June 2003.
[16] T. Behnke, et al., TESLA Technical Design Report Part IV: A Detector for TESLA, DESY-01-011.