CHAPTER 6

Multipath predictive selection

Those who have knowledge, don't predict. Those who predict, don't have knowledge.
Lao Tzu

If the path be beautiful, let us not ask where it leads.
Anatole France

6.1 Introduction

The predictive selection pattern for meta-algorithmics was introduced in Ref. [Sims13] and comprises a statistical learning phase along with a run-time phase. During statistical learning, each possible configuration of the algorithm, analytics, or system to be deployed is considered along with each logical partition of the data space. In this way, we can determine the precision of each decision and always employ the algorithm, analytics, or system with the highest precision for the partition to which a run-time sample belongs. Predictive selection was therefore shown to reduce to the rule of selecting the option with the highest precision in a specific predictor test. The predictive selection approach is innately rational, as its most rudimentary form is simply conditional probability: maximizing the utility (cost, accuracy, performance, robustness, etc.) of the selected algorithm, analytics, or system given that a specific event has occurred.

Although wise, the founder of Taoism did not anticipate the power of predictive selection. With predictive selection, knowledge and prediction interplay as part of the process. Knowledge of the partitioning includes understanding what partitions to make as much as which downstream system to employ for each partition. And the prediction is directly related to the knowledge of the precision of each system (or analytics or algorithm) for each partition. So long as the partitions are consistent over time, knowledge and prediction are used simultaneously, making predictive selection a Tao of analytics. Nor do we stop with Lao Tzu. Anatole France quipped that "if the path be beautiful, let us not ask where it leads," but with predictive selection, we see both the beauty and the direction at the same time. The beauty lies in the ability to partition, repartition, and layer the predictive elements for the downstream algorithm, analytics, or systems, and where it leads is guided by the better estimate provided by the optimized predictive steps. We begin our discussion with the baseline predictive selection pattern.


6.2 Predictive selection

First and foremost, predictive selection is about collecting data. The predictive part of the design pattern is focused on categorization. Data that are collected need to be partitioned so that each partition can be used to optimize the downstream algorithm, analytics, or system that provides the particular machine intelligence output required of the problem. Partitioning involves assigning input data to clusters (generally partitioned by their location along ranges of specific content measurements, called features or metrics), to conditions (which can include contextual information, such as metadata or thresholds), and to expert rules (which can be internalized from domain expertise garnered from experience or can be imposed directly in the case of regulations, standards, and auditable levels of quality).

Importantly, data can be used to assign content to more than one category at a time. This affords a level of complexity for the overall system design that is analogous to the connections in an artificial neural network. Data used for prediction can assign the output to two or more categories, and based on the precision of the training data for those categories, we can choose to select the optimum system design based on as few as one of the categories and as many as all of the categories. As an example, suppose a data set is analyzed and found to belong to categories A, B, C, and D. The precisions associated with those categories are, from the training data, 0.89, 0.87, 0.84, and 0.76. Depending on the circumstances, we may find better overall behavior by considering A only (the highest precision); A and B (e.g., within 5%); A, B, and C (e.g., within 10%); or all four (throwing away no identified categories, assuming they are all accurately assigned). Which of these is the "best" strategy? There is no a priori "right" answer; instead, deciding how multiple categories will be used for selecting the algorithm, analytics, or system to employ should be the task of the validation stage of the algorithm. With this in mind, we now cover some more of the specifics of predictive selection, starting with the means of prediction.
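To make the tolerance-based choice concrete, the following minimal sketch keeps every identified category whose training precision is within a chosen fraction of the best category's precision. The helper name and the tolerance policy are illustrative assumptions, not a prescribed part of the pattern; the precision values are the ones from the example above.

```python
# Minimal sketch: keep the predicted categories whose training precision is
# within `tolerance` (a fraction, e.g., 0.05 for "within 5%") of the best.

def categories_within_tolerance(precisions, tolerance):
    best = max(precisions.values())
    return [cat for cat, p in sorted(precisions.items(),
                                     key=lambda kv: kv[1], reverse=True)
            if best - p <= tolerance * best]

precisions = {"A": 0.89, "B": 0.87, "C": 0.84, "D": 0.76}
print(categories_within_tolerance(precisions, 0.05))  # ['A', 'B']
print(categories_within_tolerance(precisions, 0.10))  # ['A', 'B', 'C']
```

Whether the 5% rule, the 10% rule, or some other policy is "best" is, as noted above, a question for the validation stage.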

6.3 Means of predicting

Prediction is a broad topic. From the simplicity of correlation, to the relative sophistication of conditional probabilities such as Markov and Bayesian approaches, to the esoteric nature of deep learning and meta-algorithmic patterns, we have several options for training a system to understand how to configure itself. In this chapter, our primary focus is on using partitions of the input data as a means of assigning differential categories to the subsequent analysis in order to improve the overall measured behavior of the system. Usually, the measurement is of accuracy, but predictive selection can be used to optimize any objective function, which could be concerned with cost, throughput performance, robustness, modularity, sensitivity, scalability, etc., in addition to or in place of accuracy. We will default to classification as the presumed subsequent analysis here unless indicated otherwise.


The core decision for prediction is how the input should be partitioned and how the data within each partition should be interpreted for the selection process that follows. Six prediction approaches are outlined here; combined, they are nonexhaustive but certainly very helpful for getting the gist of the process across (a short code sketch of several of them follows this list):

1. The simplest partition is a yes/no or pass/fail decision. For this prediction approach, if "A" is contained in the data, we use the data in one way to select the downstream process. If "A" is absent, we use another. This approach is analogous to a decision tree if we "chain" yes/no or pass/fail decisions conditionally on the output of previous decisions. For example, if "yes" to A, then we select pass/fail on "B"; otherwise, we select pass/fail on "C."
2. The next simplest partition is a threshold. This is a binary decision and so is similar in concept to yes/no or pass/fail. Data above the threshold are assigned to one partition, and data below the threshold are assigned to another partition. Suppose that we are looking at images in a database taken from within a downtown area. We may then decide that if there are more than five moving objects of a large size, we will use one particular image understanding algorithm; if five or fewer, we will use another. This selection method works well when our objective function is cost, throughput performance, choosing the degree of parallelism, etc.
3. The next partition method is subrange-based partitioning. At its simplest, this is a multiple-threshold approach, where we subdivide a range into consecutive subranges that comprise the partition. Subrange-based partitioning is the starting point for two more complicated forms of partitioning, described next.
4. Conditional partitioning is based on a Boolean expression, the simplest of which is AND. Here, the partition may be based on an appropriate subrange for each of two or more features. Even more complexity can be supported if a function of two or more features is also partitioned by subrange.
5. Sequential partitions are partitions applied to data that are collected in a time series, such as ECGs, EEGs, and temperatures, or as video, for example, surveillance information. Because the sequential events are related, we may anticipate different subranges, even for the same type of data, over time. A good example is if an object is accelerating in a video or if heart rate is changing during an exercise stress test.
6. Finally, consensus partitioning is a specialized type of conditional partitioning where the partition is defined based on the data and the metadata (or contextual) information associated with the data. For example, if two different sensors are used but have different sensitivities or calibration, then a different subrange may be used for each sensor based on the metadata identifying the manufacturer and settings of the sensor.
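A minimal sketch of three of these partitioning styles follows; the function names, cut points, and thresholds are illustrative assumptions only.

```python
# Approach 2: binary partition on a threshold (e.g., count of large moving objects).
def threshold_partition(n_moving_objects, threshold=5):
    return "many_objects" if n_moving_objects > threshold else "few_objects"

# Approach 3: assign a value to the index of the consecutive subrange it falls in.
def subrange_partition(x, boundaries):
    for i, b in enumerate(boundaries):
        if x < b:
            return i
    return len(boundaries)

# Approach 4: a Boolean (AND) combination of subrange conditions on two features.
def conditional_partition(x, y):
    return subrange_partition(x, [0.3, 0.7]) == 1 and subrange_partition(y, [0.5]) == 1

print(threshold_partition(7))                        # many_objects
print(subrange_partition(0.42, [0.25, 0.50, 0.75]))  # 1
print(conditional_partition(0.5, 0.9))               # True
```

We now continue with the broader, and in some ways more interesting, process of selection.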


6.4 Means of selecting

The means of selecting is based on maximizing the precision of the particular approach. The concept of precision is straightforward and is illustrated here (Table 6.1) for a simple three-class classification problem, which may be the behavior of the classifier selected based on a partitioning, as described in the previous section. In Table 6.1, we have a total of 100 normalized samples that belong (as ground truth training data) to one of three classes: A, B, and C. However, as is true of any nonideal classifier, there are some classification errors. For Class A, 64.2% of the samples actually coming from Class A are assigned (correctly) to Class A; however, 14.7% of the samples from Class A are (incorrectly) assigned to Class B, and 21.1% of the samples from Class A are (incorrectly) assigned to Class C. A similar interpretation can be garnered for the Class B and Class C "true class" rows in Table 6.1.

We see, therefore, that recall for each class is readily determined. Recall is the number of samples that actually belong to the class that are in fact assigned to the class, divided by the number of samples that actually belong to the class. Since the denominator is 100.0 for each class in Table 6.1, we see that the recalls for Classes A, B, and C are 0.642, 0.596, and 0.714, respectively. However, it is precision, not recall, that gives us the most confidence in an assigned class. Precision is determined in Table 6.1 from the columns, rather than the rows. Precision is the ratio of the number of samples assigned to a class correctly divided by the number of samples assigned to the class overall. So, for Class A, this ratio is (64.2)/(64.2 + 16.7 + 14.8) = 0.671. Similarly, for Class B, this is 59.6/88.1 = 0.677, and for Class C, this is 71.4/116.2 = 0.614. So, the recall is highest for Class C, but the precision is lowest. It is precision that tells us what the probability of an assignment being correct is, though, making this the basis of predictive selection. If the classifier of Table 6.1 tells us that a sample belongs to Class A or B, we are roughly 6% more likely to have the right classification than if it tells us a sample belongs to Class C. Having this understanding, we can now compare this with the confusion matrix for a different classifier. This second classifier produces the confusion matrix shown in Table 6.2, which, using the same approach as for Table 6.1, yields precisions of 0.557, 0.713, and 0.696 for Classes A, B, and C, respectively.

Table 6.1 Simple confusion matrix to illustrate the concept of precision.

Normalized confusion matrix: rows are the true class of the samples (input); columns are the classifier output (computed classification) prediction.

True class    Assigned to A    Assigned to B    Assigned to C
A                 64.2             14.7             21.1
B                 16.7             59.6             23.7
C                 14.8             13.8             71.4

Precisions of Classes A, B, and C are 0.671, 0.677, and 0.614, respectively. Please see text for details.
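Since Table 6.1 does all the work here, the precision and recall arithmetic can be checked directly. A minimal sketch using the Table 6.1 values (rows are true classes, columns are assigned classes):

```python
confusion = [
    [64.2, 14.7, 21.1],  # true Class A
    [16.7, 59.6, 23.7],  # true Class B
    [14.8, 13.8, 71.4],  # true Class C
]
classes = ["A", "B", "C"]

for j, name in enumerate(classes):
    assigned_to_j = sum(row[j] for row in confusion)  # column sum
    truly_j = sum(confusion[j])                       # row sum (100.0 here)
    print(f"Class {name}: precision = {confusion[j][j] / assigned_to_j:.3f}, "
          f"recall = {confusion[j][j] / truly_j:.3f}")

# Class A: precision = 0.671, recall = 0.642
# Class B: precision = 0.677, recall = 0.596
# Class C: precision = 0.614, recall = 0.714
```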


Table 6.2 Second confusion matrix for a classifier distinct from that of Table 6.1.

Normalized confusion matrix: rows are the true class of the samples (input); columns are the classifier output (computed classification) prediction.

True class    Assigned to A    Assigned to B    Assigned to C
A                 55.8             20.9             23.3
B                 19.8             74.6              5.6
C                 24.5              9.2             66.3

Precisions of Classes A, B, and C are 0.557, 0.713, and 0.696, respectively. Please see text for discussion.

Table 6.3 Precision values for the classifiers of Tables 6.1 and 6.2.

Class for precision               A        B        C
(1) Classifier of Table 6.1     0.671    0.677    0.614
(2) Classifier of Table 6.2     0.557    0.713    0.696

In Table 6.2, we see lower recall for Classes A and C in comparison with Table 6.1. Additionally, the precision values are different, being higher for Classes B and C and lower for Class A. These values are captured in Table 6.3. The overall accuracy of both classifiers is similar (the mean of the diagonal elements of the confusion matrices): 0.651 for Table 6.1 and 0.656 for Table 6.2. This means that either classifier used by itself will provide between 65% and 66% accuracy on the three-class problem. However, using predictive selection, we should be able to use the differential precisions of the two classifiers to our advantage. From Table 6.3, our maximum precision is the situation in which Classifier (2) reports that the sample belongs to Class B. In order, Table 6.3 suggests we adopt the following strategy for classifying samples obtained during the run-time, or deployment, phase:

1. If Classifier (2) classifies the sample as being from Class B, assign the sample to Class B.
2. Else, if Classifier (2) classifies the sample as being from Class C, assign the sample to Class C.
3. Else, if Classifier (1) classifies the sample as being from Class B, assign the sample to Class B.
4. Else, if Classifier (1) classifies the sample as being from Class A, assign the sample to Class A.
5. Else, if Classifier (1) classifies the sample as being from Class C, assign the sample to Class C.
6. Otherwise, Classifier (2) classifies the sample as being from Class A; but we cannot ever reach this step unless Classifier (1) provides no classification whatsoever, since this is the lowest precision of any of the six options.
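A minimal sketch of this ranked-precision rule at run time follows. The (classifier, class) precisions come from Table 6.3; the function name and the input format (a dict of classifier id to reported class) are assumptions for illustration.

```python
precision_table = {  # (classifier, class) -> training precision (Table 6.3)
    (1, "A"): 0.671, (1, "B"): 0.677, (1, "C"): 0.614,
    (2, "A"): 0.557, (2, "B"): 0.713, (2, "C"): 0.696,
}

def select_class(outputs):
    # Rank every (classifier, class) pair by training precision; the first
    # pair that matches what a classifier actually reported decides the sample.
    ranked = sorted(precision_table, key=precision_table.get, reverse=True)
    for clf, cls in ranked:
        if outputs.get(clf) == cls:
            return cls
    return None  # reachable only if no classifier reports anything

# Classifier (2) reports its weakest class (A), so Classifier (1) decides:
print(select_class({1: "C", 2: "A"}))  # C
```

Because the ranking is computed once from training data, the run-time cost is a single pass down the ranked list.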


Taken together, this list results in a relatively simple rule. If Classifier (2) classifies the sample as being from Class B or C, then assign the sample to the classification of Classifier (2); otherwise, assign it to the class that Classifier (1) identifies. In this example, because each classifier is used when it provides the higher precision for the class, we might expect the combined accuracy to be approximately the mean of the three higher precisions for the classes; that is, 0.671, 0.713, and 0.696 for Classes A, B, and C, respectively. If so, this would be 0.693, or an error rate reduction of 10.8% in comparison with the accuracy of 0.656 for the (slightly) better classifier (0.307 is 10.8% less than 0.344), Classifier (2), and a 12.0% reduction in error (from 0.349 to 0.307) in comparison with the accuracy of Classifier (1). Note that the error rate is 1.0 minus the accuracy.

The actual results for such a problem depend on how the precision of the classifier varies for the samples classified as being from Class A by the classifier not used, that is, Classifier (2). We see that for Classifier (2), 55.8/(55.8 + 19.8 + 24.5) = 0.5574 of the samples classified as being from Class A are actually from Class A. For these samples, the classifier we actually use is Classifier (1), for which the precision is 0.671. Similarly, 19.8/(55.8 + 19.8 + 24.5) = 0.1978 of the samples classified as being from Class A are actually from Class B. For these samples, Classifier (1) has a precision of 0.677. Finally, 24.5/(55.8 + 19.8 + 24.5) = 0.2448 of the samples classified as being from Class A are actually from Class C, and Classifier (1) has a precision of 0.614 for these. The overall precision of Classifier (1) for samples that Classifier (2) assigns to Class A is thus the sum:

(0.671)(0.5574) + (0.677)(0.1978) + (0.614)(0.2448) = 0.658    (6.1)
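The arithmetic of Eq. (6.1), and the resulting combined-system accuracy, can be reproduced in a few lines (values are read from Tables 6.2 and 6.3):

```python
# When Classifier (2) reports Class A we defer to Classifier (1), so weight
# Classifier (1)'s class precisions by the true-class mix of Classifier (2)'s
# "assigned to A" column from Table 6.2.
col_A = {"A": 55.8, "B": 19.8, "C": 24.5}
clf1_precision = {"A": 0.671, "B": 0.677, "C": 0.614}

total = sum(col_A.values())
effective = sum(clf1_precision[c] * col_A[c] / total for c in col_A)
print(round(effective, 3))                        # 0.658, as in Eq. (6.1)

print(round((effective + 0.713 + 0.696) / 3, 3))  # 0.689, the combined accuracy
```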

This means that the overall accuracy of the two-classifier system is the mean of 0.658, 0.713, and 0.696 for Classes A, B, and C, respectively, or 0.689. This is a reduction in error rate of 9.6% compared with Classifier (2) and a higher (of course) reduction in error rate of 10.9% compared with Classifier (1). This example illustrates how using the principle of "selecting the option with the highest precision" leads to better overall system performance.

But precision is not the only tool we have at our disposal. The entity selected by a predictive selection approach need not be a single category or class as in the above example; instead, it can select an entire system, with the specifics (settings, weights, etc.) of the system design selected by the predictive measurements. Selecting a different system for deployment for each partition in the predictor is associated with the regional optimization pattern [Sims13]. The so-called predictive selection with secondary engines approach [Sims13] may be more concerned with ranking the selected choices than with dictating that a specific choice be adopted. This is because in most real-world systems, the output of the system is analyzed just as carefully as the training data, if not more so. The output can be considered to see if it passes a certain criterion, and if so, there is no need to select the next choice from the ranked list. Note that the criterion can be any quality metric germane to the task, for example, mechanical strength, surface finish smoothness, electric insulation, and density. The criterion can also be analytic rather than a material or physical property. For example, if accuracy or performance requirements are not met, a new system design might be chosen.


Another important approach within the predictive selection repertoire is the "separation of task" for prediction and selection. Here, one particular variable is analyzed by subrange, and each of its subranges triggers a different configuration for the prediction. These configurations can include different system architecture patterns and different weightings for the multiple components. Related but alternative to the separation of task, we can use an effectively infinitely scalable system to use as many features as we possibly can to obtain multiple estimates for the output from this ensemble. The features themselves are defined by subranges, and for each subrange, we apply the output of the classifier with the highest precision to the ensemble's overall decision. This might sound difficult, but it is fortunately very easy to elucidate with the help of a table. So, using a table as our crutch (and who reading this hasn't used a table once in their life to rise off the floor?), we can see from Table 6.4 that each of the features can be defined by two or more subranges. In Table 6.4, we show four features with a variable number (3, 4, or 5) of subranges. In Table 6.4, each of the subranges is of the same size. Thus, for Feature 1, each subrange comprises 33.3% of the samples; for Features 2 and 3, each subrange comprises 25% of the samples; and for Feature 4, each subrange comprises 20% of the samples. Also, in Table 6.4, the precision of each of three algorithms for each subrange is given.

Table 6.4 Table used for illustrating the multirange, multifeature predictive selection approach.

Feature      Subrange      Precision,     Precision,     Precision,
                           algorithm A    algorithm B    algorithm C
Feature 1    Subrange 1      0.655          0.677          0.699*
Feature 1    Subrange 2      0.669          0.674          0.681*
Feature 1    Subrange 3      0.683*         0.669          0.658
Feature 2    Subrange 1      0.734          0.804*         0.777
Feature 2    Subrange 2      0.744          0.755          0.813*
Feature 2    Subrange 3      0.766          0.711          0.783*
Feature 2    Subrange 4      0.779*         0.667          0.753
Feature 3    Subrange 1      0.555          0.599          0.634*
Feature 3    Subrange 2      0.563          0.588          0.613*
Feature 3    Subrange 3      0.571          0.597*         0.580
Feature 3    Subrange 4      0.557          0.614*         0.603
Feature 4    Subrange 1      0.665          0.707*         0.645
Feature 4    Subrange 2      0.634          0.745*         0.662
Feature 4    Subrange 3      0.651          0.696*         0.684
Feature 4    Subrange 4      0.689          0.689          0.704*
Feature 4    Subrange 5      0.745*         0.694          0.723

Values shown are the precision values for each of Algorithms A, B, and C. In each row, the value marked with an asterisk is the one with the highest precision. Please see text for details.


Table 6.5 Overall precision of each algorithm for each feature, irrespective of subrange, for Table 6.4 values.

Feature    Precision,     Precision,     Precision,
           Algorithm A    Algorithm B    Algorithm C
1            0.669          0.673          0.679
2            0.756          0.734          0.782
3            0.562          0.600          0.608
4            0.677          0.706          0.684

Please see text for details.

The highest precision in each triplet (row) of Table 6.4 is marked with an asterisk. There is an appreciable variability in precision among the triplets: up to 0.112 difference among the three algorithms for the same subrange, and up to 0.137 difference among the subranges for the same algorithm. By subrange, the rows of Table 6.4 have a mean range of 0.050. Table 6.5 highlights the differences in the algorithms by feature, and Table 6.6 highlights the differences in the subranges (mean of all three algorithms). In Table 6.5, the mean range across the rows is 0.033, or a third lower than that of Table 6.4. This shows the efficacy of breaking the features up into their subranges, since greater variability allows us to select algorithms with differential precision.

Table 6.6 Mean precision of each subrange for the three algorithms.

Feature      Subrange      Mean precision, algorithms A, B, and C
Feature 1    Subrange 1      0.677
Feature 1    Subrange 2      0.675
Feature 1    Subrange 3      0.670
Feature 2    Subrange 1      0.772
Feature 2    Subrange 2      0.771
Feature 2    Subrange 3      0.753
Feature 2    Subrange 4      0.733
Feature 3    Subrange 1      0.596
Feature 3    Subrange 2      0.588
Feature 3    Subrange 3      0.583
Feature 3    Subrange 4      0.591
Feature 4    Subrange 1      0.672
Feature 4    Subrange 2      0.680
Feature 4    Subrange 3      0.677
Feature 4    Subrange 4      0.694
Feature 4    Subrange 5      0.721

Column 3 is the mean of columns 3, 4, and 5 in Table 6.4. Please see text for details.
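The aggregations behind Tables 6.5 and 6.6 are simple means over the Table 6.4 precisions (the subranges are equal-size, so unweighted means suffice). A sketch:

```python
table_6_4 = {  # feature -> per-subrange (precision A, precision B, precision C)
    1: [(0.655, 0.677, 0.699), (0.669, 0.674, 0.681), (0.683, 0.669, 0.658)],
    2: [(0.734, 0.804, 0.777), (0.744, 0.755, 0.813),
        (0.766, 0.711, 0.783), (0.779, 0.667, 0.753)],
    3: [(0.555, 0.599, 0.634), (0.563, 0.588, 0.613),
        (0.571, 0.597, 0.580), (0.557, 0.614, 0.603)],
    4: [(0.665, 0.707, 0.645), (0.634, 0.745, 0.662), (0.651, 0.696, 0.684),
        (0.689, 0.689, 0.704), (0.745, 0.694, 0.723)],
}

for feature, rows in table_6_4.items():
    # Table 6.5 row: mean precision of each algorithm over the feature's subranges
    per_algorithm = [round(sum(col) / len(col), 3) for col in zip(*rows)]
    # Table 6.6 rows: mean precision of each subrange over the three algorithms
    per_subrange = [round(sum(row) / 3, 3) for row in rows]
    print(f"Feature {feature}: by algorithm {per_algorithm}, by subrange {per_subrange}")
```

The printed values reproduce Tables 6.5 and 6.6 to rounding.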


The 50% relative increase in range of precision should help provide higher overall precision in the predictive selection approach eventually implemented. The mean precision values of Table 6.6 also hide a much larger subrange-algorithm interaction that is so evident in Table 6.4. For the four features, the range over the set of subranges is 0.007, 0.058, 0.013, and 0.049, respectively, for Features 1, 2, 3, and 4. The mean of these is only 0.032, virtually the same as for Table 6.5. In short, Tables 6.5 and 6.6 justify breaking the features into subranges and using multiple algorithms, as depicted in Table 6.4.

Next, we determine the impact of this subrange-algorithm interaction in Table 6.7. Selecting the highest-precision algorithm for each subrange results in between 1.8% and 6.0% lower error rate (1.0 minus precision) compared with the single best algorithm by feature in Table 6.4. The improvement is obtained at the expense of adding the subranges, which may have the unanticipated side effect of overtraining the data. If, however, the subranges are likely to continue throughout the (deployment) lifetime of the system, the added advantages of improving the overall precision of the selected algorithm are likely to remain.

In this simple example, we do not actually see how the subrange-partitioning predictive selection approach affects the overall accuracy of the classification. In order to perform that assessment, we would need to provide a full multifeature classification along the lines of the process outlined in Sections 1.10 and 1.11. Most important for improving the overall classification accuracy is how independent the features are in their behavior. Also, the features will be weighted differently based on our confidence in each of them: from Table 6.7, we expect Feature 2 to be weighted the most heavily and Feature 3 to be weighted the least heavily. For the particular data used to create these tables, the classifiers were used for text classification, and indeed, the overall accuracy of the system was improved from 0.886 to 0.902, a reduction of the error rate from 0.114 to 0.098 (14% reduction in error). Interestingly, this was almost as much as the summed improvement of all four features in Table 6.7. But, of course, "results may vary."

Table 6.7 Overall precision of each subrange-optimized selection from Table 6.4, with improvement over best single algorithm from Table 6.5 in parentheses.

Feature    Precision, subrange-optimized selection (pct improvement over Table 6.5)
1            0.688 (2.8% lower error rate)
2            0.795 (6.0% lower error rate)
3            0.615 (1.8% lower error rate)
4            0.719 (4.4% lower error rate)

Percent improvement is in error rate, not accuracy. These precisions are the means of the best for each subrange (3, 4, or 5 in number) in a feature. Please see text for details.
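A sketch of the Table 6.7 arithmetic, using the best per-subrange precisions read off from Table 6.4 and the best single-algorithm precisions from Table 6.5 (values are rounded to three decimals, as in the tables, so the percentages match those reported):

```python
best_per_subrange = {  # per feature: max of each Table 6.4 row
    1: [0.699, 0.681, 0.683],
    2: [0.804, 0.813, 0.783, 0.779],
    3: [0.634, 0.613, 0.597, 0.614],
    4: [0.707, 0.745, 0.696, 0.704, 0.745],
}
best_single = {1: 0.679, 2: 0.782, 3: 0.608, 4: 0.706}  # Table 6.5 row maxima

for feature, maxima in best_per_subrange.items():
    optimized = round(sum(maxima) / len(maxima), 3)   # equal-size subranges: plain mean
    drop = (optimized - best_single[feature]) / (1 - best_single[feature])
    print(f"Feature {feature}: {optimized:.3f} ({100 * drop:.1f}% lower error rate)")

# Feature 1: 0.688 (2.8%), Feature 2: 0.795 (6.0%),
# Feature 3: 0.615 (1.8%), Feature 4: 0.719 (4.4%)
```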


6.5 Multi-path approach

Everything up to this point in this chapter was meant to illustrate the power of using precision to guide predictive selection algorithms, and so, with just a few minor twists, this has been a hopefully interesting but certainly not revolutionary extension of the predictive selection examples in [Sims13]. In this section, however, we get to explore the application of the predictive selection process to path analytics. We begin with a consideration of how the partitioning associated with the process can be used to find an optimal decision tree.

In the previous section, and specifically in Table 6.4, the ranges of the features were broken up and assigned to subranges of equal size and then treated as different conditionals within the overall selection phase. However, the subranges were made of equal size (i.e., the same number of training samples) simply for convenience of calculation and thus illustration in the tables. In practice, we may wish to refine these subranges in order to maximize the variance in the ranges. We perform this in a manner analogous to that employed by decision trees. The first decision is to determine a subrange and then determine a figure of merit for the subrange and any additional subranges it automatically creates (above and below it). If the figure of merit for any subrange above or below it is low enough, then another subrange definition can occur. All subrange definitions in our simple example here were splits, and the decision tree formed by iterative splitting that resulted in the highest mean range in the subranges (across the three classifiers; this value was 0.050 in Table 6.4) was kept (a greedy code sketch of this search appears just before Table 6.8). This figure of merit was chosen in place of a variable such as the highest cumulative difference in precision among all the subranges, as the latter approach led to the definition of as many subranges as possible (in this case, 20 per feature, since subrange boundaries were limited to multiples of 5% of the samples).

With the subranges so formed, the rows of Table 6.8 have a mean range of 0.064, substantially higher than for the "uniform" subranges in Table 6.4. The final "multipath predictive selection" output is given in Table 6.8, which offers the largest observed improvement in precision ranges across subranges when subrange boundaries are stepped by 0.05. Each "path" is the set of splits occurring in the overall range. In Table 6.8, several results stand out in comparison with Table 6.4. First, the peak precision values are higher in the mean than those of Table 6.4. This is because the multipath approach leads to subranges with higher variability, and so half of that additional variability is expected by normal statistical laws to be on the high end; that is, the change in range should be half above and half below the mean. Second, fewer subranges were chosen. This is good news, as it means we probably had more subranges than needed for Table 6.4, so at least we don't feel that Table 6.4 "underpartitioned" the range. It also means that the changes in which classifier should be used for the analysis only occur once or twice per feature: this implies that the features chosen are generally "stable" and thus generally useful. Third, we see that the subranges are not uniform in size in Table 6.8. Instead of dividing the ranges evenly into thirds or halves, we see two ranges divided into (0.45 and 0.55) partitions, another into nearly equal thirds (0.30, 0.40, and 0.30), and the last into half and two quarters (0.50, 0.25, and 0.25).
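Before turning to the resulting partitions in Table 6.8, here is the promised greedy sketch of the iterative splitting. `precision_of(lo, hi)` stands in for re-measuring each algorithm's precision on the subrange of training samples between the `lo` and `hi` quantiles; the function names and the greedy stopping rule are illustrative assumptions, not the book's exact procedure.

```python
def mean_range(subranges, precision_of):
    """Figure of merit: mean, over subranges, of (max - min) precision
    across the ensemble of algorithms."""
    ranges = []
    for lo, hi in subranges:
        precisions = precision_of(lo, hi)  # one precision per algorithm
        ranges.append(max(precisions) - min(precisions))
    return sum(ranges) / len(ranges)

def multipath_split(precision_of, step=0.05):
    """Greedily split [0, 1] at multiples of `step` (5% of the samples),
    keeping a split only while it raises the mean precision range."""
    subranges = [(0.0, 1.0)]
    while True:
        best_score, best_candidate = mean_range(subranges, precision_of), None
        for i, (lo, hi) in enumerate(subranges):
            n_cuts = int(round((hi - lo) / step)) - 1
            for k in range(1, n_cuts + 1):
                cut = lo + k * step
                candidate = subranges[:i] + [(lo, cut), (cut, hi)] + subranges[i + 1:]
                score = mean_range(candidate, precision_of)
                if score > best_score:
                    best_score, best_candidate = score, candidate
        if best_candidate is None:
            return subranges  # no single split improves the figure of merit
        subranges = best_candidate
```

A concrete `precision_of` must be supplied from the training data; the skeleton terminates because each accepted split strictly increases the figure of merit over a finite set of boundary configurations.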


Table 6.8 Table used for illustrating the multipath approach to subrange optimization.

Feature      Subrange (pct)       Precision,     Precision,     Precision,
                                  algorithm A    algorithm B    algorithm C
Feature 1    Subrange 1 (0.55)      0.657          0.677          0.702*
Feature 1    Subrange 2 (0.45)      0.684*         0.668          0.651
Feature 2    Subrange 1 (0.30)      0.735          0.807*         0.779
Feature 2    Subrange 2 (0.40)      0.752          0.727          0.824*
Feature 2    Subrange 3 (0.30)      0.782*         0.670          0.729
Feature 3    Subrange 1 (0.45)      0.558          0.593          0.627*
Feature 3    Subrange 2 (0.55)      0.565          0.606*         0.592
Feature 4    Subrange 1 (0.50)      0.650          0.730*         0.666
Feature 4    Subrange 2 (0.25)      0.668          0.672          0.712*
Feature 4    Subrange 3 (0.25)      0.740*         0.692          0.692

Here, there are 10 subranges, rather than 16, among the 4 features; the mean range of the subranges (rows) for the 3 algorithms is 0.064. In each row, the value marked with an asterisk is the one with the highest precision.

The new set of subrange partitions does lead to better overall precision, of course. Otherwise, simply slicing the ranges into equal pieces would have had some strange magic to it. However, with full disclosure, it should be noted that the "equal" subranges in Table 6.4 were like the pigs in Orwell's "Animal Farm," that is, more equal than others. If you were paying careful attention to Table 6.4, you noticed that there were three subgroups for Feature 1, four subgroups for Features 2 and 3, and five subgroups for Feature 4. These were of course not selected randomly but instead were the output of a clustering estimate for the features, which not coincidentally were 3, 4, 4, and 5, respectively. Regardless, even with the number of partitions "intelligently" picked based on clusters for Table 6.4, the multipath predictive selection approach outlined in this section affords even higher precision, as shown by the asterisk-marked precision values in Table 6.8 and the mean precisions shown in Table 6.9. In Table 6.9, the total error rate lowering across the four features is 25.0% (mean = 6.3%), two-thirds again more than the 15.0% total (mean = 3.8%) for Table 6.7. Based on this, we may expect that when these improved subranges are deployed for text classification as above (where we use all four features together with weighting), we might see a significant additional improvement in overall accuracy (which for the classifier of Section 6.4 improved from 0.886 to 0.902), perhaps to 0.913 if the same "two-thirds again more accurate" results are obtained on the testing data.


Table 6.9 Overall precision of each subrange-optimized selection from Table 6.8, with improvement over best single algorithm from Table 6.5 in parentheses.

Feature    Precision, subrange-optimized selection (pct improvement over Table 6.5)
1            0.694 (4.7% lower error rate)
2            0.806 (11.0% lower error rate)
3            0.615 (1.8% lower error rate)
4            0.728 (7.5% lower error rate)

Percent improvement is in error rate, not accuracy. Note that the percentage of samples falling into each subrange is accounted for in the final precision. Please see text for details.
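The Table 6.9 values follow from Table 6.8 by weighting each subrange's best precision by its sample fraction. A sketch (the best single-algorithm precisions per feature are taken from Table 6.5):

```python
table_6_8 = {  # feature -> (sample fraction, (precision A, precision B, precision C))
    1: [(0.55, (0.657, 0.677, 0.702)), (0.45, (0.684, 0.668, 0.651))],
    2: [(0.30, (0.735, 0.807, 0.779)), (0.40, (0.752, 0.727, 0.824)),
        (0.30, (0.782, 0.670, 0.729))],
    3: [(0.45, (0.558, 0.593, 0.627)), (0.55, (0.565, 0.606, 0.592))],
    4: [(0.50, (0.650, 0.730, 0.666)), (0.25, (0.668, 0.672, 0.712)),
        (0.25, (0.740, 0.692, 0.692))],
}
best_single = {1: 0.679, 2: 0.782, 3: 0.608, 4: 0.706}  # Table 6.5 row maxima

for feature, rows in table_6_8.items():
    weighted = round(sum(w * max(p) for w, p in rows), 3)
    drop = (weighted - best_single[feature]) / (1 - best_single[feature])
    print(f"Feature {feature}: {weighted:.3f} ({100 * drop:.1f}% lower error rate)")

# Feature 1: 0.694 (4.7%), Feature 2: 0.806 (11.0%),
# Feature 3: 0.615 (1.8%), Feature 4: 0.728 (7.5%)
```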

However, we measure only a slight improvement in classification accuracy on testing data, from 0.886 to 0.907, in comparison with selecting the best classifier for each feature (0.886) or with selecting uniform-size subranges (0.902). This is, nevertheless, an appreciable reduction of the error rate from 0.114 to 0.093 (18.4% reduction in error). The trade-off here is that having fewer overall subrange partitions (10 instead of 16) removes some sources of overtraining in comparison with the earlier example, but the addition of optimal ranges leads to some overtraining, since any minor shift in the optimal range from training to testing data will affect accuracy. Overall, though, the combination of multipath with multiple-subrange predictive selection, in combination with a weighted voting approach to combining the four features, leads to an impressive reduction in error rate.

One more type of multipath predictive selection can be used, which is more analogous to a Markov chain than to a decision tree. Here, the predictive selection is used to select a distinct set of features for each classifier, and then each classifier (in an ensemble of classifiers) only bases its decision on those features. In the example of this section and Section 6.4, Classifiers A and B are very likely to omit Feature 3 using this approach.

6.6 Applications

An interesting application for predictive selection that is not immediately apparent from the above examples is to allow the algorithms to feed back their results and accept the output tied to the results with the highest precision. We are not "selecting" which algorithm or system to use, but rather which outcome to trust most. A good example is scene recognition, where we are trying to figure out the context of a location. One image recognition engine may report that there are three chairs, one sink, one table, and three appliances in a room and so report with precision = 0.856 that it is a kitchen. Another image recognition engine may report four stools, one sink, one table, two drink dispensers, and one appliance in a room and report with precision = 0.843 that it is a bar.


We accept the classification as kitchen, but not with high confidence. Other applications of predictive-selection-based approaches will be used in later chapters, particularly Chapters 12 and 13. However, it is readily seen from the generic nature of the discussion in this chapter that predictive selection approaches are excellent choices for any type of system involved in discriminating two or more classes, clusters, categories, or contexts. The real question to ask as a system design engineer is not whether to use predictive selection (the answer is almost invariably "yes") but rather, "where can and should I use predictive selection in this project?"

6.7 Sensitivity analysis

Many of the predictive selection approaches overviewed in this chapter are effectively "introspective" with respect to sensitivity analysis. In Section 6.5, for example, the multipath predictive selection approach innately provides a measure of sensitivity analysis, since we record the range in precision for each of the partitions as we consider all of the possible paths to the final partition configuration. Likewise, we see that nearly all predictive selection approaches provide the equivalent of sensitivity analysis during the training, or category-generating, stage. This is because we are highly interested in ensuring that the categories we use for selection are indeed the most reliable. A good part of the discussion in Section 6.5, then, is focused on how our partitioning affects the sensitivity of the predictive selection process to the features used for the analysis. In this case, the most sensitive partitions can be viewed as the "optimal" partitions, because they provide a higher discriminatory power for each partition (subrange) and feature. We must always be careful to ensure that, in our eagerness to find partitioning with higher overall system precision, we do not overtrain the system. It makes good sense to try several different predictive selection approaches and select the one that provides good overall performance and robustness on validation and testing data.

6.8 Summary

This chapter describes the broad set of ways in which the predictive selection pattern can be used to improve the performance of intelligent systems. The primary focus of the predictive selection pattern is on improving decision-making accuracy, although the approach of intelligent partitioning and differential deployment of selection tasks based on maximizing precision is broadly useful in optimizing any system objective function. After describing the assignment of data to categories, we next overview six different approaches to partitioning. These are (1) yes/no or pass/fail partitioning; (2) threshold-based partitioning; (3) subrange-based partitioning; (4) conditional partitioning, which supports any Boolean relationship between two or more features; (5) sequential partitioning, which may result in changing subrange definitions, even for the exact same type and range of data, over time; and (6) consensus partitioning, which is a particular type of conditional partitioning based on both the data and the metadata (or contextual) information associated with the data.


We then discussed how an ensemble of classifiers (or any other analytics) can be used together based on accepting the output of the one with the highest precision. Calculating the overall precision is possible after the overall strategy has been reduced to a set of rules. Generally, an appreciably more precise system is obtained through this approach. The risk is, of course, that the system may be overtrained.

Next, we considered means of selection based on the partitioning. Partitioning of the input can use a "separation of task" approach, wherein an auxiliary data feature is used to assign the sample to one of a plurality of partitions. Then, the precision of each of the algorithms, analytics, or systems for each of the subranges and for each of the features is considered. The overall improvement in system behavior will depend on how the features interact together, but this is a meta-algorithmic [Sims13] consideration downstream from the predictive selection. For this section, the precision of the individual features was highlighted.

Finally, we considered multipath predictive selection, in which we allowed a number of possible decisions to be made and assigned a figure-of-merit score to each path that not only is robust to the utility of the output but also does not quickly devolve into creating as many partitions as possible (or no additional partitions, for that matter). Using a functional measurement such as the range of precisions for the entire ensemble of algorithms, analytics, or systems in each partition was shown to work well.

Reference

[Sims13] Simske, S., 2013. Meta-Algorithmics: Patterns for Robust, Low-Cost, High-Quality Systems. IEEE Press and Wiley, Singapore.