Neurocomputing 150 (2015) 599–610
Contents lists available at ScienceDirect
Neurocomputing journal homepage: www.elsevier.com/locate/neucom
Projection inspector: Assessment and synthesis of multidimensional projections Paulo Pagliosa a, Fernando V. Paulovich b,n, Rosane Minghim b, Haim Levkowitz c, Luis Gustavo Nonato b a
Federal University of Mato Grosso do Sul, UFMS, Brazil University of São Paulo, USP, Brazil c University of Massachusetts Lowell, UML, USA b
art ic l e i nf o
a b s t r a c t
Article history: Received 21 November 2013 Received in revised form 16 April 2014 Accepted 9 July 2014 Available online 27 October 2014
As the number and complexity of visualization techniques have grown, it has become progressively more difficult to make a decision as to which technique to employ for any given situation or application. A particular case is that of multidimensional data visualization utilizing projections, which have gained much attention lately and are being utilized in a growing number of applications. With their popularity, many new variations of multidimensional projections have been proposed in the literature. Numerical evaluations are varied and are useful, but do not reflect visual properties of projections accurately. In this paper we present Projection Inspector, an approach that contributes to the problem of understanding the difference amongst projections. It is an interactive assessment method that allows a user to explore a “space” of known projection techniques and view their results, as well as to identify the differences between them. In addition, it generates “on-the-fly” new projection techniques via interpolations of existing techniques as the user explores the projection space. We present the theoretical foundations of the projection exploration space and an interactive tool that implements a view of this space. We demonstrate the approach with case studies that demonstrate the need for projection assessment and the value of combining projections into new, better suited, projection alternatives. & 2014 Elsevier B.V. All rights reserved.
Keywords: Multidimensional projection Quality metrics
1. Introduction Multidimensional projection (MP) methods have developed significantly in the last decade; they are now considered among the most relevant methods to handle, analyze, and visualize highdimensional data. Such growing interest has lead to the development of many MP methods, which vary considerably in their mathematical foundations, optimization criteria, and computational complexity. One of the reasons for the proposal of so many distinct methods is that the effectiveness of MP methods in revealing valuable information contained in a data set depends strongly on the nature of the data. Moreover, quality metrics for MP methods differ considerably in the property they measure, making it harder for the user to choose what method fits their problem. The problem of selecting an adequate multidimensional projection from a set of possibilities has been treated in the literature. However, the existing alternatives either rely on multiple views generated from a single MP method, such as Projection Pursuit [1]
n
Corresponding author. E-mail address:
[email protected] (F.V. Paulovich).
http://dx.doi.org/10.1016/j.neucom.2014.07.072 0925-2312/& 2014 Elsevier B.V. All rights reserved.
and ranked scatter plot matrices [2], or are not flexible enough to allow a free navigation throughout the possibilities, as is the case of Stress Maps [3] and CheckVis [4]. When faced with deciding what method to adopt for his or her data set and task, the analyst is often faced with various views, various numbers and in the end is not sure what alternative is offering a proper compromise in each case. In that respect, existing solutions are not quite satisfactory and there is a clear need for more flexible methods to navigate through feasible MP methods applied to a particular data set, while still examining multiple quality metrics. We propose a novel method to interact with multiple projections and quality metrics that supports the user towards selecting the most appropriate projection in terms of both quality metric and layout organization. Our method, named Projection Inspector (ProjInspector), allows not only to easily navigate through a set of different projections but also to combine them when the best solution for a data set is not given by any single projection. ProjInspector does that by generating MPs from the combination of basic MP methods and enabling visual inspection of their layouts as well as of their quality metrics. Smoothing transition between distintic layout combinations is ensured by using control points to register basic layouts, thus avoiding drastic jumps during user navigation.
600
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
We demonstrate the usefulness of ProjInspector with a set of experiments showing how the accuracy of projections change depending on the adopted metric. We also show that sometimes the quality compromise the user is looking for to express his or her data lies in a combination of projections instead of in any particular layout. Additionally, we illustrate how ProjInspector can also be used to analyze how parameters affect the quality of a projection, thus enabling the interactive selection of the best set of parameters for an MP method when dealing with a particular data set. These goals are achieved by interpolating between basic projections and giving the user support to select a method or a combination of methods for their case. In summary, the main contributions of this paper are:
A novel method for analyzing and combining of multidimensional
projection methods, enabling the construction of families of MP layouts. An interactive tool to navigate easily through a set of MP layouts, which enables visual inspection of the quality of each particular layout. A set of experiments showing that the quality of a projection changes drastically according to the metric and the data set, thus demonstrating the need for a tool like ProjInspector.
Besides the main contributions mentioned above, we also propose a variant of currently available neighborhood preservation metrics, called Smooth Neighborhood Preservation (SNP), which takes into account the number of neighbors preserved in the projection as well as the distance that misplaced points are from their correct position.
2. Related work The need and desire to compare and evaluate different multidimensional projections, both qualitatively and quantitatively, is not new. Published research has taken two distinct directions: correlating multidimensional projection quality metrics with human perception, and offering multiple projections to assist users in the analysis of high-dimensional data. In order to better contextualize our approach we focus the related work discussion on the latter class of methods, that is, methods that make use of multiple projections to analyze high-dimensional data. Readers interested in the former class of methods can refer to [5–9]. We group techniques that rely on multiple projections into three main categories: multi-projection tour techniques, projection boards, and multi-projections with distortion analysis. Multi-projection tour techniques aim at enabling interactive and animation mechanisms to generate a set of projections that will assist users to “navigate” through high-dimensional data. As early as 1985 (even before the coining of the term “visualization”) Asimov [10], introduced The Grand Tour, “a method for viewing multivariate statistical data via orthogonal projections onto a sequence of twodimensional subspaces.” However, the Grand Tour was limited to attribute driven scatter plots. As data dimensionality grows, the computational demands become prohibitive and the possible combination of axes troublesome. Dhillon et al. [11] introduced “class tours”, which are sequences of two-dimensional, class-preserving projections of multidimensional data that are displayed in a rapid and smooth sequence. Class tours’ purpose is to enable a “view” of higher-dimensional subspaces. In addition, class-similarity graphs that are overlaid on the projections provide a “skeleton of the data”, guide the user through the projections, and help estimate distance relationships in the original high-dimensional space. The authors report to have a “mechanism” that is theoretically able to view interclass relations of any subset q of k classes. In reality, however, as q grows, classes’ distinction blur. Further, the method is not interactive:
the tour is conducted through static photographs, which are preprocessed. Rolling-the-Dice [12] relies on a cube metaphor to interactively smooth out the transition between scatter plots during visualization. Dimension reordering and a sculpting mechanism is used to further improve the high-dimensional data exploration. One of the main issues with tour-based techniques is that quality metrics are not directly used to assist users during the data analysis. Moreover, those techniques strongly rely on attribute scatter plots, and cannot be easily extended to operate on other types of multidimensional projections methods. Projection board methods differ from tour-based techniques in that they generate static boards where projections are arranged according to ranking criteria. Projection Pursuit [13] and its variants [14,1], for instance, generate a family of scatter plots ranked according to their concentration of points into clusters while preserving the separability of those clusters. Rank-byFeature [2] is an interactive framework that allows users to select interesting dimensions according to distinct rank criteria, producing a set of scatter plots from user selections. Tatu et al. [15] automate the ranking process to generate scatter plots (and Parallel Coordinates) of classified and unclassified data according to data correlation and cluster separation. Their goal was to aid and potentially speed up the visual exploration process for different visualization techniques. Sips et al. [16] proposed to select “good views” of high-dimensional data by utilizing what they refer to as “class consistency.” They measure class consistency by the distance of class members from the class's center of gravity and by the entropies of the spatial distributions of classes. In addition, they asked users to choose “good views,” and report that class consistency demonstrated good precision and recall. They have evaluated their consistency measures using various data sets, and concluded that the measures appear to be efficient and robust. Schreck et al. [17] propose the use of a quantitative quality measures to filter out scatter plots so as to highlight the ones with higher precision. They also present a scheme that allows to locally analyze the quality of distinct projection methods as well as the quality of clusters. Lehmann et al. [18] present an interactive approach for generating ranked scatter plot matrices, where the elements are organized according to a reordering technique supported by quality measures. Lehmann's approach bears a set properties that are not present simultaneously in any other scatter plot-based method. Wang et al. [19] proposed a less interactive approach, a mechanism to first cluster dimensions based on their similarity and then filter out dimensions according to an importance criterion. A scatter plot matrix containing only the “important” dimensions is generated in the end of the process. A main issue with Wang's approach is the loss of context when navigating throughout the levels of the hierarchy. Similar to tour-based methods, projection board techniques strongly rely on attribute scatter plots. No significant research has attempted to extend those approaches to general multidimensional projections. Multi-projection with distortion analysis techniques enrich projection layouts with mechanisms that enable the visualization of distortions during the dimensionality reduction process. Aupetit [20] proposes the use of colored Voronoi cells to visualize quality measures defined on projected points, segments connecting projected points, and triangles formed by projected points. These techniques allow visual identification of regions where the neighborhood of each point stretches or compresses. Colored Voronoi cells are also used by CheckViz [4] to visualize false neighbors introduced by the projection mechanism, allowing to compare the distortion introduced by different projection methods. ProxiLens [21] provides an interactive scheme to highlight and filter out false neighbors when visualizing projected data, making it easier to analyze the “true” neighborhood of the data. Stress Maps [3] make use of the landscape metaphor to visualize local stress
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
information, providing navigation tools that allow the visualization of a whole map of stress errors as well as local peaks. Although the methods above are flexible enough to be used with any multidimensional projection technique toward identifying distortions, they are not designed to assist users in selecting quality projections. In other words, they do not enable any mechanism that allows users to generate alternative projections to the one under analysis. It is clear from the various studies that the problem of providing and supporting selection of quality projections, either by automatic measures or by human decision, has not been resolved satisfactorily, since existing techniques are either restricted to attribute scatter plots or do not provide assistance to generate and select from multiple projections of high-dimensional data. Projection Inspector, presented here, offers a powerful capability to combine distinct multidimensional projection schemes as well as to investigate the behavior of different quality measures on those projections, a set of traits not present in any other method. Therefore, ProjInspector advances the state-of-the-art on multidimensional projection selections, assessments, and evaluation. Additionally, it offers a mechanism to choose as the solution a projection that combines other projection, maintaining useful properties of various layouts into one. 3. The projection inspector method The proposed Projection Inspector, (and corresponding tool named ProjInspector), allows for analyzing the behavior and accuracy of a family of projections using a single dynamic view that coordinates with an interactive choice of projection interpolations. By a family of projections we mean a set of new projections generated by interpolation from a small subset of basic projections. Our approach relies on two main elements: a mechanism to combine multiple projections, and a scheme to evaluate the accuracy of the projections and their combinations. The rationale is that some methods perform better than others depending on the data and their relationship both to the projection foundation and to the evaluation metric. It is very difficult to predict which projection method will suit a set of tasks for a given data set, or which method will present the best result with respect to a specific metric. Additionally, the combination of multiple projections can produce better results than using a single projection method. In many data analysis tasks, choosing the right projection can lead to better performance finding useful patterns. An important issue when presenting various projections in a single view frame is lack of registration amongst layouts, generated in different scales, rotation patterns and translation. This impairs the ability to define a unified analysis frame of distinct projections. Various methods can be employed to register distinct layouts. In this work we opt to tackle the problem by restricting this aspect of the approach to using multidimensional projection methods that rely on choosing control points as the initial step to perform the mapping. The approach is to fix the arrangement of control points in visual space so as to ensure that all projection methods under analysis are guided by the same control points, thus naturally maintaining the reference for the output of all methods. The choice is justified by the fact that multidimensional projection methods that rely on control points have become a strong trend in visualization applications [22–24] making this realization of the method comprehensive enough to illustrate its utility and validity. Next we present the mathematical formulation used to handle multiple projections, to combine them, and to analyze the quality of the resulting layout.
601
Pk(S) corresponds to a point cloud layout in the visual space (the twodimensional space in our case). Linear combination is a straightforward mechanism to combine projections in P. In mathematical terms PðSÞ ¼ λ1 P 1 ðSÞ þ λ2 P 2 ðSÞ þ ⋯ þ λn P n ðSÞ;
where coefficients λk control the contribution of each projection to the final layout. Providing an intuitive mechanism to control the coefficients λk is rather important, as meaningless layouts can be produced from poorly chosen coefficients. For instance, it does not make much sense to allow coefficients with negative values, as it is difficult to conceive a negative contribution of a projection to the combined layout. An intuitive control of the contribution of each projection can be obtained by enforcing coefficients that give rise to convex combinations, that is, n
∑ λk ¼ 1;
k¼1
λk Z 0 8 k:
ð2Þ
Constraining coefficients as described in Eq. (2) allows for easy control of the contribution of each projection in the final layout. The closer the value of a coefficient λk is to one, the larger will be the contribution of Pk to the final layout. The problem to be faced is how to design a mechanism that will allow control over the contribution of each projection while satisfying the constraints in Eq. (2). Our approach solves the problem with the help of mean values coordinates [25], which enables us to set each point inside a polygon as a convex combination of the vertices of the polygon. Precisely, let v1 ; v2 ; …; vn be the vertices of a polygon Ω and v be a point inside Ω (see Fig. 1), the point v can be written as n
v ¼ ∑ λk vk ; k¼1
wk ¼
where
λk ¼
wk ; ∑nl¼ 1 wl
tan ðαk 1 =2Þ þ tan ðαk =2Þ ; J vk v J
ð3Þ
where αk and αk 1 , k ¼1,…,n, α0 ¼ αn , are the angles shown in Fig. 1. The λk are called the mean value coordinates of v. It can be shown that the coefficients λk given in (3) satisfy the constraint (2) [25]. Points lying on the boundary of the polygon have to be handled as special cases. In fact, there are two situations: points lying on a boundary edge and points lying on vertices of the polygon. In the former case, if a point v is on the edge vk vk þ 1 then λk ¼ J vk þ 1 v J = J vk þ 1 vk J , λk þ 1 ¼ 1 λk , λl ¼ 0; 8 l a k; k þ 1. If v coincides with a vertex vk , then λk ¼ 1 and λl ¼ 0; 8 l ak. By establishing a correspondence between each vertex of the polygon Ω (which will be assumed to be regular and convex in our case) and the projections in the set P one can intuitively control the contribution of each projection to the final layout. More specifically, the user can simply select a point inside Ω and use the mean value coordinates of that point as weights for the linear combination. For
3.1. Combining multiple projections Let P ¼ fP 1 ; P 2 ; …; P n g be a set of multidimensional projections and Pk(S) be the result of applying the projection Pk to a data set S. Thus,
ð1Þ
Fig. 1. Angles involved in the mean value coordinates computation.
602
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
the sake of illustration, in Fig. 2 we show the layout resulting from the combination of five multidimensional projection techniques, namely, LSP, LAMP, PLMP, SMS, and Hybrid Model (see [26] for a detailed description of those projection methods) when projecting the caltech data set which contains pictures [27] with features extracted using the bag-of-visual features (BoVF) [28] method. Fig. 2 (a) shows the layout produced by setting the coefficients as corresponding to the central point of the polygon, meaning that all projections contribute equally to the final layout. In Fig. 2(b) we present the layout when LSP contributes the most to the linear combination, which is obtained by choosing a point closer to the vertex corresponding to the LSP technique. 3.2. Quality metrics A sizeable amount of effort in the field of point distributions for data visualization has been addressed to developing and applying quality metrics that help identify layout properties and their capability to reflect trends and patterns. Various of these measures were developed to support evaluation and recommendation of scatterplot matrices (see [29,30]), in that a number of tools are used to characterize the trend aspects in sets of scatter plots (for instance, trends, convex hull, outliers, shape features, and so on). Most recently, various authors have been preoccupied with characterizing the loss and the properties of point layouts when reflecting projections, which adds another dimension of the problem, related with associating what is seen with what the original representation space is capable of coding. Some of this work focuses properties that are either attained or preserved by projections relating with segregation capabilities. That is the case of work such as that of Sips et al. [16], which employs class consistency as an evaluation platform, as well as the work by Sedlmair et al. [9], which tries to relate cluster separation factors with the perception of quality in the projection. Another important variant of numerical analysis of projections evaluates preservation of properties from the original space. In that class, the attention to the preservation or evaluation of neighborhoods is in the center of relevant work. For instance, Venna et al. [31] frame the specific visualization task of projecting data as an information retrieval task, and define the concepts of true positives and false positives to the task of neighborhoods in projections, relating those with the measures of precision and recall. A similar view of neighborhood quality was introduced in [4], which offers a geometric interpretation of the
misplacement of points as tears and false neighbors in visual space. Naturally, much of the development of new techniques were also worried about preservation of distances, and employed evaluation measures that would reflect that, such as stress and distance plots. The goal, in many cases, is to provide guidance for choosing or comparing projections. Graphical outputs of evaluation measures are also employed not only to help find good projections, but also to locate regions of a layout that have better properties or worse deviations of target neighborhoods [32]. Our method can associate the layout of sets of projections with the quality or error metrics that mostly reflect desired properties for the target application, provided it yields a value that summarizes the behavior of the whole projection. In our tool ProjInspector, we have implemented three quality metrics, namely, Kruskal's stress function [33], the correlation coefficient [34], and a modified, smooth variant of the neighborhood preservation metric [35]. These metrics gauge different aspects of the quality of a projection, as shown in our experiments. We show that a projection that performs well for one kind of measure may not necessarily succeed for the others. In the following we detail the three quality metrics used in our study. Stress function is given by ESt ¼
∑ij ðdij d ij Þ2 2
∑ij dij
;
ð4Þ
where d and d account for the distance between instances i and j in the original and visual space, respectively, the stress function measures how well original distances in the original highdimensional space are preserved in the visual space. This metric was proposed by Kruskal [33] and is widely used as a quality measure for multidimensional projection methods. Correlation coefficient: In contrast to the stress function ESt described above, which takes into account raw distance values, the correlation coefficient metric [34] aims at measuring how distances in the original space are correlated to those in the visual space. The correlation coefficient is related to graph scagnostics [29], by measuring the difference between points distance in the graph built in the original space and the distances in the 2D embedding of the points. Let D and D be the upper triangular distance matrices before and after projection, respectively. That is, dij and d ij are the entries
Fig. 2. Two layouts resulting from the combination of multidimensional projection methods. (a) Barycenter (all projections contribute equally). (b) LSP contributes most.
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
in row i and column j, j 4 i, of D and D, respectively. The correlation coefficient is given by ECc ¼ 1
〈D D〉 〈D〉〈D〉 ;
σDσD
ð5Þ
603
define w as the radial-based degree five polynomial
8 5 > t r 4 46 t r 3 1 t r 2 < 28 t r 14 þ þ wðr; tÞ ¼ 5 r r 5 r 5 r > : 1
if r r t r 2r; otherwise:
ð7Þ
where is the element-by-element product, 〈 〉 is the average operator over all elements in the upper triangular matrix, and σ D , σ D are the standard deviations of the elements in D and D, respectively. Smooth Neighborhood Preservation: The conventional definition of a neighborhood preservation metric [35] computes the percentage of the k-nearest neighbors of an instance i that still remain neighbors in the visual space. Trustworthiness and continuity [36] are other measures that evaluate neighborhood matching in the original and visual space. Neighborhood preservation, trustworthiness, and continuity are discrete measures in the sense that they use discrete information such as the number of matches or the rank-order of the elements to measure neighborhood preservation. We propose, however, a slightly different variant of those metrics that smoothly measures neighborhood matches, which is called Smooth Neighborhood Preservation. Intuitively, suppose that two projections Pi and Pj misplace the same percentage of neighbors in the visual space but Pi tends to map each wronglyplaced instance close to its true neighborhood while Pj sends them far away from the true neighborhood. It is clear that Pi behaves better then Pj and the metric should reflect this fact. Following the notation introduced in [4], let NTi be the set of instances in the k-nearest neighborhood of an instance i that are not mapped among the k-nearest neighbors of i in the visual space (the set of Tears of i) and NFN be the set of instances that are not i among the k-nearest neighbors of i but are mapped among the knearest neighbors of i in the visual space (the set of False Neighbors of i). Consider the local misplacing metrics 8 1 > < T ∑ wðr i ; d ij Þ if NTi a∅; T T jN Wi ¼ i j j A Ni > : 0 otherwise; 8 1 > < FN ∑ wðr i ; dij Þ if N FN i a∅; FN jN ¼ W FN ð6Þ i j j A Ni i > : 0 otherwise; where jN ni j; n ¼ T; FN is the number of elements in the set, d and d account for distances in the original and visual space, respectively, and w is a weight function that measures how far j is from its true neighborhood. The parameters r i and ri correspond to the radii of the smallest spheres centered in i that encompass the k-nearest neighbors of i in the visual and original space, respectively. We
Fig. 3 shows the weight function w. As one can see, the farther a true neighbor jA N Ti is from i in the visual space the closer its weight is to 1. Similarly, the farther a false neighbor j A NFN i is from i in the original space (recall that instances in NFN are not among i the k-nearest neighbors of i in the original space) the closer its weight is to 1. A global smooth neighborhood preservation metric can be derived from the local misplacing metrics by simply computing: ENb ¼
1 ∑ ðW T þ W FN i Þ; 2jSji A S i
ð8Þ
where S is the set of instances under analysis. 3.3. Visualizing error metrics The quality measures described above evaluate the quality of a projection. However, they are computationally costly to be calculated in real time for a large family of projections. Since one of our main goals is to enable interactive analysis of projections and their combination we remedy the computational burden by sampling the space of projections, and computing the quality metrics only for the samples. More precisely, a set of points is arranged in the interior of Ω as depicted in Fig. 4(a). Each sample point corresponds to a projection obtained from Eq. (1), where the coefficients of the convex combination are given by the mean value coordinates of the sample point (see Eq. (3)). Quality metrics are then evaluated in each sample point. By triangulating the interior of Ω using the sample points as vertices, one can linearly interpolate the metrics within each triangle, allowing for visualizing the quality of the whole family of combined projections, as
Fig. 3. Weight function w.
Fig. 4. Estimating the error of combined projections. (a) Sample points triangulation. (b) Stress.
604
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
depicted in Fig. 4(b). The algorithm to build the triangulation is quite simple: we scan the polygon top-down left-right to compute horizontal segments with vertices in the edges of the polygon. Each segment is then uniformly subdivided such that the length of each sub-segment is closer to a fixed value. The Delaunay triangulation of the subdivision points is then computed. Since Ω is convex, their edges are naturally recovered from the Delaynay triangulation. Although quality measures are not exact in the interpolated points, the provided approximation is satisfactory for analysis purposes, as we discuss in Section 4.3. In fact, as linear elements are being used to perform the interpolation, we can provide upper bounds for the approximation error [37]: 2
jE Ei j ¼ max jEðx; yÞ Ei ðx; yÞj r Ch ; ðx;yÞ A Ω
ð9Þ
where C is a constant, h is the diameter of the largest triangle in Ω, and E and Ei are the exact and interpolated errors, respectively. Therefore, the approximation error tends to zero quadratically with the diameter of the largest triangle. Finally, we present, in Fig. 5, a snapshot of ProjInspector's interface. In the main window to the right, the currently chosen projection is presented, with color reflecting either a scalar value defined in the attribute file, or the color of the selected point in the interaction polygon on the left. The projection color can be switched through the class color box on the left side of the interface. The polygon defining the projection samples is placed on top left, colored alternatively by a chosen quality measurement or by the weights given to each projection. As the user moves around this polygon and chooses a new projection, it is displayed on the main window. When that choice is made, the values for all three metrics for the chosen projection appear in the metric slides, and the coefficients for each basic projection are also presented in the bottom left part of the interface. In the figure, for instance, the polygon has five corners, representing the combination of five
projections, shown also on the left with their coefficients. The user can remove projections, thus changing the polygon. For instance, when only two projections are chosen, the polygon is actually a bar representing the linear interpolation between the two. Besides exploring the interior and borders of the polygon, users can also choose a projection by selecting to go to the projection that gives a maximum or minimum value for any of the metrics.
4. Results In order to show the usefulness of the proposed approach in combining and evaluating projections we have carried out a set of experiments taking as basis 5 distinct multidimensional projection techniques, namely, Least-Square Projection (LSP) [38], Part-Linear Multidimensional Projection (PLMP) [26], Local Affine Multidimensional Projection (LAMP) [35], Sammon's Mapping Speeding-up (SMS) [39], and the Hybrid Model (Hybrid) [40]. A common factor of all these techniques is that they all rely on control points placed in the visual space to perform the projections. As already mentioned in Section 3, the use of the same set of control points as input for all basis projections ensures their correct registration, making interpolapffiffiffiffiffi tions meaningful. In our implementation we are using m randomly chosen control points, where m is the number of instances in the data set. Control points are placed in the visual space using the Force Scheme method proposed in [41]. Fig. 6 shows the layout resulting from applying the five “pure” projections on different data sets, caltech, wdbc, segmentation and fibers. The first two datasets was already described. The segmentation is composed of instances randomly drawn from a database of outdoor images. The images were hand-segmented to create a classification for every pixel, defining 7 different classes on the data set [42]. The fibers data set is made up of 19,000 fibers tracks classified into eight different classes collected from the 2009 Brain Competition Challenge (http:// pbc.lrdc.pitt.edu). Notice that the quality measures vary considerably
Fig. 5. ProjInspector's interface. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
605
Fig. 6. Different pure projections attain different values of ESt, ENb, and ECc metrics.
from one layout to another, confirming the importance of considering multiple projections when analyzing high-dimensional data. Fig. 7 presents the optimal layout for each metric and data set. ProjInspector enables such a functionality while still enabling a set of interactive tools towards enriching user experience. Most of the interactive tools implemented in ProjInspector are illustrated in the following subsections. 4.1. Analyzing projections and their quality measures Our first analysis aims at showing that the quality of a projection changes according to the metric. In Fig. 8(a)–(c) we present the quality metrics ESt, ENb, and ECc, respectively, evaluating the quality of the family of projections generated with ProjInspector during analysis of the wine-quality-red data set [43]. As one can clearly see, the projection that produces a better result changes considerably depending on the metric. While minimal stress and correlation coefficient measures are obtained with an interpolated projection, LSP produces the minimal smooth neighborhood preservation measure. The example shows the usefulness of the proposed projection inspector, which allows us to analyze a large family of projections from distinct perspectives. In fact, this experiment shows that a single projection method may not be sufficient to fully analyze the structure of highdimensional data. Results such as the one presented in Fig. 8 motivate the interactive exploration of the family of projections generated by ProjInspector towards finding a projection corresponding to a good compromise of the three quality metrics. Although the user can choose a single metric to be visualized in the polygon Ω, ProjInspector provides color bars
that encode each individual metric so as to enable the simultaneous analysis of the three metrics. As depicted in Fig. 9, when the user selects a point inside Ω, the values of the metrics computed from the resulting projection are highlighted in the corresponding color bars. Therefore, the user can interactively explore the family of projections so as to find a combination that minimizes the three metrics simultaneously. As far as we know, ProjInspector is the first method to coordinate the interactive exploration of multiple quality metrics towards generating tailored projections. 4.2. Analyzing the parameter space of a single projection ProjInspector can also be used to analyze how parameters affect the behavior of a single projection formulation. Fig. 10 shows the result of using ProjInspector to analyze the LAMP technique. As a local projection method, LAMP allows for tuning the number of control points used to project each data instance. The figure presents the change in the projection layout when the parameter that defines the number of control points in the neighborhood of each instance changes. These are projections of the wdbc data set. The wdbc is a breast cancer data set obtained from 569 digitized images of breast masses. Its instances present 30 dimensions and are classified into two distinct groups, malignant and benign cancer [42]. Numbers around the polygon Ω in Fig. 10(a) and (c) are the percentage of control points in the neighborhood of each instance used to perform the projection. The projections are arranged on the vertices of Ω according to the algorithm proposed in [44] taking into account the percentage of control points used. Therefore, projections with similar number of control points are placed next to each other around the polygon Ω.
606
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
Fig. 7. Different optimal projections attain different values of ESt, ENb, and ECc metrics.
Fig. 8. The quality of a projection can change drastically depending on the metric. (a) Stress. (b) Smooth neighborhood preservation. (c) Correlation coefficient. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Ordering the vertices is not crucial and it has been employed just to ease visual identification of similar projections. Stress and smooth neighborhood preservation are used as quality metrics. Fig. 10(b) and (d) show the layouts resulting from the interpolated projections that produce the smallest stress ESt and smooth
neighborhood preservation values, respectively. As one can see, the smallest stress results from an almost even combination of projections, while the smallest ENb takes place when the number of neighbor control points in LAMP is set close to 50%. This example shows once again the importance of analyzing multiple projections to get better results.
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
607
Fig. 9. ProjInspector allows the interactive analysis of multiple metrics towards providing projections that behave well with respect to all metrics. (a) Stress. (b) Smooth neighborhood preservation. (c) Correlation coefficient. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Fig. 10. Analyzing a family of projections generated from the LAMP technique by changing the number of control points in the neighborhood of each instance. (a) and (c) show the metrics ESt and ENb, respectively, when projecting the data set wdbc. (b) and (d) depict the layouts where ESt and ENb are minimal. (a) Stress. (b) Map with minimum stress. (c) Smooth neighborhood preservation. (d) Map with optimal smooth neighborhood preservation.
4.3. Some interesting statistics Fig. 11 shows the averages of false neighbors and tears for several projections generated with ProjInspector. More precisely, each pair in magenta and cyan corresponds to a projection, and the magenta and
cyan bars are the averages of false neighbors and tears, respectively, in each projection computed according to Eq. (6), that is, the magenta bar T is given by ð∑i W FN i Þ=jSj and the cyan bar is given by ð∑i W i Þ=jSj. Thirty four projections corresponding to the vertices of the triangles that discretize the interior of Ω were considered (see Fig. 4) as
608
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
Fig. 11. Histograms showing False Neighbors and Tears for a subset of projection generated by ProjInspector. (a) Caltech (k ¼30, jSj ¼ 3100). (b) wdbc (k ¼ 10, jSj ¼ 569). (c) Wine-quality-red (k¼ 20, jSj ¼ 1599Þ. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
Fig. 12. ENb k. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)
interpolation points. Projections whose indexes are written in red in the vertical axis are the “pure” projections, that is, layouts generated by LAMP (index 0), SMS (index 5), LSP (index 11), Hybrid (index 29), and PLMP (index 33). Notice that the number of false neighbors is significantly smaller than the number of tears in all projections and data sets. This fact corroborates the claim of many researchers that multidimensional projection methods can “miss” some neighbors but they avoid to build misleading neighborhoods. In other words, neighborhoods observed in projected layouts tend to reflect the original ones. It is worth mentioning that this kind of analysis becomes simple with the help of ProjInspector, helping to understand the analysis value of projections. Additionally, it would be difficult to perform the analysis depicted in Fig. 11 with the original definition of neighborhood preservation metric (see [38]), which just counts the number of misplaced instances without taking into account how far badly-placed instances are from their correct position. The proposed smooth neighborhood preservation metric ENb also lets us observe a phenomenon that is intuitive. Fig. 12 shows that neighborhoods tend to be better preserved when increasing the number k of nearest neighbors. The blue curve in Fig. 12 corresponds to the minimal ENb value among all projections depicted in Fig. 11 while the red curve corresponds to the maximal
ENb value. Notice that both the minimal and maximal values of ENb decrease when k increases, showing that the new metric is quite robust and reflects the expected behavior. We conclude this section analyzing the error introduced by interpolating the quality metrics in Ω. We have sampled Ω densely, and in each sampled point we evaluated the three quality metrics using the interpolation scheme discussed in Section 3.3 as well as computing the exact value. The average error between the exact and interpolated error was about 2%. Considering only the worst case, that is, the largest error for all projections and data sets, the error was about 3% and occurred when using the stress metric ESt. This fact confirms our claim that the use of interpolation does not drastically affect the analysis of projections with ProjInspector. As a remark about computational times, the tool is very interactive for data sets of moderate size. For the largest data set used in our experiments, which contains 19,029 instances (fibers), we could freely interact with the layout.
5. Discussion and limitations ProjInspector allowed us to analyze a large family of projections from different points of view and criteria. We were able to show that a single projection method may be inadequate in our quest to fully analyze the structure of high-dimensional data. But when utilizing a multi-projection approach like ProjInspector the user can interactively explore families of projections, thus s/he can more easily find a combination that optimizes all metrics simultaneously. Our experiments also demonstrated that the assessment of false neighbors and tears became a simple task with the use of ProjInspector. ProjInspector allows users to interactively enable and disable the basic projections from the polygon Ω. Therefore, if an interpolation between two or three MPs is of interest, a user can enable only those basic projections in the interface. If only two MPs remain, the polygon becomes a straight line connecting those two MPs.
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
This functionality allows users to explore any subset of projections from those loaded initially, and thus to focus the analysis. Since an MP method that optimizes a particular metric does not necessarily reveal information of interest, ProjInspector enables users to explore a space of projections towards better understanding the particularities of a data set and the behavior of a set projection techniques. Usually, finding out which projection technique is more appropriate to a particular data set is a cumbersome task that can be made considerably easier with the help of a tool like ProjInspector. We have chosen three metrics described in Section 3.2 because they measure distinct properties of a given projection layout. However, any other metric could be incorporated into the system without impacting the system's performance. Computationally costly metrics can be evaluated in a pre-processing step considering only layouts defined by the vertices of the triangles that discretize the polygon Ω. Given the metric values in the vertices, interpolation is performed quite efficiently in the graphics card, thus ensuring free user interaction. However, the current status of the method has limitations. Because the quality metrics employed are computationally expensive, they could not be calculated in real time for large families of projections. To maintain our goal of interactivity we resorted to calculating our quality metrics only on sampled points. And while our worst-case error measures confirmed our assessment that this does not drastically affect the analysis utilizing ProjInspector, we still believe it would be better to be able to measure quality of all points, not just on a sampled subset, if interactivity can still be kept. Moreover, computationally costly metrics might require a pre-processing computation step even when using the interpolation scheme. This might have an adverse affect on the immediacy of any interactive analysis that a user might wish to carry out. Another limitation of the current version is that visualizing local distortions is not possible. Adding existing solutions, such as the one proposed by ProxiLens [21] would strengthen ProjInspector considerably. In addition, we are considering various future improvements. For instance, we can make use of Procrustes analysis [45] as proposed by Garcia-Fernandez et al. [46] to align basic projections avoiding the use of control points. The use of multi-objective metrics combined with optimization strategies is also an aspect we are currently investigating, which should impact positively the overall user experience.
6. Conclusion We have presented ProjInspector, an interactive visual method and tool to analyze, assess, evaluate and combine different multidimensional projection techniques so as to further understand their performance relative to different data sets, situations, and quality metrics. We have provided the rationale for this method, its theoretical foundations, an example implementation, and a set of examples to demonstrate its usability and usefulness. In fact, the presented experiments have demonstrated that the concept of utilizing multiple projections and their combinations can be a powerful mechanism towards facilitating and improving the process of multi-dimensional projection visualization and analysis. The proposed method is not restricted to the applications and metrics shown here and we have outlined various opportunities for future work to improve the method. To the best of our knowledge, ProjInspector is the only method that allows a user to coordinate the interactive exploration of multiple quality metrics with the goal of generating customized and optimized projections. We envisage three ways in which the tool is very useful: first, a typical user can be a projections designer who wishes to attain a projection that is ideal or better for a particular data set, with the goal of applying it
609
for further data collections of the same type. That user would have the chance to define a projection with a good compromise in terms of quality. Second, even if the target is not to create a new projection, the tool can be used to compare available projections and investigate which one more adequately maps the data set under analysis. The third use is to gain further insight into the data set by employing more than one projection at once. By noticing the transition between projection patterns that were not observed for individual projections, users can locate groups of points with particular behaviors. For instances, groups in one projection can mix in another, or the shape of a group may change, tagging alternative views of a group of points. A very large number of applications make use of 2D embeddings of multidimensional data sets, from astrophysics to biology, to knowledge domain analyses. For most of them, quickly defining an adequate projection can save overall analysis time and improve the experiments' reproducibility. The projection that is thought to reach the best compromise can then be saved and imported to a system for exploration of the data set. The system and data are made freely available in our web site.
Acknowledgments This research has been funded by Fapesp-Brazil (grant #2011/ 22749-8) and CNPq-Brazil. Haim Levkowitz gratefully acknowledges his US Fulbright Scholar to Brazil award, which partially supported this research.
Appendix A. Supplementary material Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.neucom.2014.07.072. References [1] D. Cook, A. Buja, J. Cabrera, C. Hurley, Grand tour and projection pursuit, J. Comput. Graph. Stat. 4 (3) (1995) 155–172. [2] J. Seo, B. Shneiderman, A rank-by-feature framework for unsupervised multidimensional data exploration using low dimensional projections, in: IEEE Symposium on Information Visualization, IEEE, 2004, pp. 65–72. [3] C. Seifert, V. Sabol, W. Kienreich, Stress maps: analysing local phenomena in dimensionality reduction based visualisations, in: EuroVAST 2010: International Symposium on Visual Analytics Science and Technology, The Eurographics Association, 2010, pp. 13–18. [4] S. Lespinats, M. Aupetit, Checkviz: sanity check and topological clues for linear and non-linear mappings, in: Computer Graphics Forum, vol. 30, 2011, pp. 113–125. [5] E. Bertini, A. Tatu, D. Keim, Quality metrics in high-dimensional data visualization: an overview and systematization, IEEE Trans. Vis. Comput. Graph. 17 (12) (2011) 2203–2212. [6] I. Icke, A. Rosenberg, Visual and semantic interpretability of projections of high dimensional data for classification tasks, arXiv preprint arxiv:1205.4776. [7] A. Tatu, P. Bak, E. Bertini, D. Keim, J. Schneidewind, Visual quality metrics and human perception: an initial study on 2d projections of large multidimensional data, in: Proceedings of the International Conference on Advanced Visual Interfaces, 2010, pp. 49–56. [8] D. Marghescu, Evaluating the effectiveness of projection techniques in visual data mining, in: Proceedings of the 6th IASTED International Conference on Visualization, Imaging, and Image Processing, 2006, pp. 94–103. [9] M. Sedlmair, A. Tatu, T. Munzner, M. Tory, A taxonomy of visual cluster separation factors, Comput. Graph. Forum 31 (3pt4) (2012) 1335–1344. [10] D. Asimov, The grand tour: a tool for viewing multidimensional data, SIAM J. Sci. Stat. Comput. 6 (1) (1985) 128–143. [11] I.S. Dhillon, D.S. Modha, W. Spangler, Class visualization of high-dimensional data with applications, Comput. Stat. Data Anal. 41 (1) (2002) 59–90. [12] N. Elmqvist, P. Dragicevic, J.-D. Fekete, Rolling the dice: multidimensional visual exploration using scatterplot matrix navigation, IEEE Trans. Vis. Comput. Graph. 14 (6) (2008) 1148–1539. [13] J.H. Friedman, J.W. Tukey, A projection pursuit algorithm for exploratory data analysis, IEEE Trans. Comput. 100 (9) (1974) 881–890. [14] P.J. Huber, Projection pursuit, Ann. Stat. (1985) 435–475. [15] A. Tatu, G. Albuquerque, M. Eisemann, J. Schneidewind, H. Theisel, M. Magnor, D. Keim, Combining automated analysis and visualization techniques for effective exploration of high-dimensional data, in: IEEE Symposium on Visual Analytics Science and Technology (VAST), 2009.
610
P. Pagliosa et al. / Neurocomputing 150 (2015) 599–610
[16] M. Sips, B. Neubert, J.P. Lewis, P. Hanrahan, Selecting good views of highdimensional data using class consistency, Comput. Graph. Forum 28 (3) (2009) 831–838. [17] T. Schreck, T. von Landesberger, S. Bremm, Techniques for precision-based visual analysis of projected data, Inf. Vis. 9 (3) (2010) 181–193. [18] D.J. Lehmann, G. Albuquerque, M. Eisemann, M. Magnor, H. Theisel, Selecting coherent and relevant plots in large scatterplot matrices, in: Computer Graphics Forum, vol. 31, 2012, pp. 1895–1908. [19] J. Wang, W. Peng, M.O. Ward, E.A. Rundensteiner, Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets, in: IEEE Symposium on Information Visualization, IEEE, 2003, pp. 105–112. [20] M. Aupetit, Visualizing distortions and recovering topology in continuous projection techniques, Neurocomputing 70 (7) (2007) 1304–1330. [21] N. Heulot, M. Aupetit, J. Fekete, Proxilens: interactive exploration of highdimensional data using projections, in: Eurovis Workshop on Visual Analytics using Multidimensional Projections, 2013. [22] I. Daniels, E.W. Anderson, L.G. Nonato, C.T. Silva, et al., Interactive vector field feature identification, IEEE Trans. Vis. Comput. Graph. 16 (6) (2010) 1560–1568. [23] S. Ding, X. Hua, Recursive least squares projection twin support vector machines for nonlinear classification, Neurocomputing 130 (2014) 3–9. [24] E. Gomez-Nieto, F. San Roman, P. Pagliosa, W. Casaca, E.S. Helou, M.C.F. de Oliveira, L.G. Nonato, Similarity preserving snippet-based visualization of web search results, IEEE Trans. Vis. Comput. Gr. 20 (3) (2014) 457–470. [25] M.S. Floater, Mean value coordinates, Comput. Aided Geom. Des. 20 (1) (2003) 19–27. [26] F.V. Paulovich, C.T. Silva, L.G. Nonato, Two-phase mapping for projecting massive data sets, IEEE Trans. Vis. Comput. Graph. 16 (6) (2010) 1281–1290. [27] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: CVPR, vol. 2, 2003, pp. 264–271. [28] J. Yang, Y.-G. Jiang, A.G. Hauptmann, C.-W. Ngo, Evaluating bag-of-visualwords representations in scene classification, in: International Workshop on Multimedia Information Retrieval, 2007, pp. 197–206. [29] L. Wilkinson, G. Wills, Scagnostics distributions, J. Comput. Graph. Stat. 17 (2) (2008) 473–491. [30] L. Wilkinson, A. Anand, R. Grossman, Graph-theoretic scagnostics, in: Proceedings of the 2005 IEEE Symposium on Information Visualization, INFOVIS '05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 21. URL 〈http://dx. doi.org/10.1109/INFOVIS.2005.14〉. [31] J. Venna, J. Peltonen, K. Nybo, H. Aidos, S. Kaski, Information retrieval perspective to nonlinear dimensionality reduction for data visualization, J. Mach. Learn. Res. 11 (2010) 451–490, URL 〈http://dl.acm.org/citation.cfm? id=1756006.1756019〉. [32] R.M. Martins, D.B. Coimbra, R. Minghim, A. Telea, Visual analysis of dimensionality reduction quality for parameterized projections, Comput. Graph. 41 (0) (2014) 26–42, URL 〈http://dx.doi.org/10.1016/j.cag.2014.01.006〉. [33] J.B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29 (1964) 115–129. [34] X. Geng, D.-C. Zhan, Z.-H. Zhou, Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 35 (6) (2005) 1098–1107. [35] P. Joia, D. Coimbra, J.A. Cuminato, F.V. Paulovich, L.G. Nonato, Local affine multidimensional projection, IEEE Trans. Vis. Comput. Graph. 17 (2011) 2563–2571. [36] S. Kaski, J. Nikkilä, M. Oja, J. Venna, P. Törönen, E. Castrén, Trustworthiness and metrics in visualizing similarity of gene expression, BMC Bioinf. 4 (1) (2003) 48. [37] T.J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, Dover Publications. com, 2012. [38] F.V. Paulovich, L.G. Nonato, R. Minghim, H. Levkowitz, Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping, IEEE Trans. Vis. Comput. Graph. 14 (3) (2008) 564–575. [39] E. Pekalska, D. de Ridder, R.P.W. Duin, M.A. Kraaijveld, A new method of generalizing Sammon mapping with application to algorithm speed-up, in: M. Boasson, J.A. Kaandorp, J.F.M. Tonino, M.G. Vosselman (Eds.), Annual Conference of the Advanced School for Computing and Imaging, 1999, pp. 221–228. [40] A. Morrison, M. Chalmers, A pivot-based routine for improved parent-finding in hybrid MDS, Inf. Vis. 3 (2) (2004) 109–122. [41] E. Tejada, R. Minghim, L.G. Nonato, On improved projection techniques to support visual exploration of multidimensional data sets, Inf. Vis. 2 (4) (2003) 218–231. [42] A. Asuncion, D. Newman, UCI Machine Learning Repository (2007). URL 〈http://www.ics.uci.edu/ mlearn/{MLR}epository.html〉. [43] P. Cortez, A. Cerdeira, F. Almeida, T. Matos, J. Reis, Modeling wine preferences by data mining from physicochemical properties, Decis. Support Syst. 47 (4) (2009) 547–553. [44] A. Artero, M. de Oliveira, H. Levkowitz, Enhanced high dimensional data visualization through dimension reduction and attribute arrangement, in: Tenth International Conference on Information Visualization, IV 2006, 2006, pp. 707–712. [45] J.C. Gower, G.B. Dijksterhuis, Procrustes Problems, vol. 3, Oxford University Press Oxford, 2004. [46] F.J. García-Fernández, M. Verleysen, J.A. Lee, I. Díaz, Stability comparison of dimensionality reduction techniques attending to data and parameter variations, in: International Workshop on Visual Analytics using Multidimensional Projections (VAMP-EuroVis), 2013.
Paulo Pagliosa received the Ph.D. degree in structural engineering from the Escola de Engenharia of São Carlos at Universidade de São Paulo (1998). He is currently an associate professor of computer science in the Faculdade de Computação at Universidade Federal de Mato Grosso do Sul. His research interests include visualization, geometric processing, physicsbased animation and general-purpose computation on graphics hardware.
Fernando V. Paulovich received the B.Sc. in computer science from the Federal University of São Carlos, Brazil, in 2000, and the Ph.D. degree in computer sciences from the University of São Paulo, Brazil, in 2008, with a period as invited researcher at the Delft University of Technology, The Netherlands. He is currently an assistant professor at the Computer Science Department of the Institute for Mathematics and Computer Science, at the University of São Paulo, Brazil. His research interests involve computational visualization, more specifically information visualization, visual analytics and visual data mining. He is a member of the Brazilian Computer Society.
Rosane Minghim is an Associate professor at University of São Paulo, São Carlos, Brazil. She is interested in all aspects of visualization, information visualization and visual analytics.
Haim Levkowitz is Associate Chair and Associate Professor of Computer Science, University of Massachusetts Lowell, MA, USA. He is a twice-recipient of a Fulbright US Scholar to Brazil award. He is the author of “Color Theory and Modeling for Computer Graphics, Visualization, and Multimedia Applications” (Springer, 1997) among other. His research interests are HumanInformation Interaction in all its incarnations. He received his Ph.D. in Computer and Information Science in 1988 from the University of Pennsylvania, Philadelphia, PA, USA. He is Senior Member of the ACM and Member of the IEEE.
Luis Gustavo Nonato is an Associate professor at the Instituto de Ciências Matemáticas e de Computação (ICMC) – Universidade de São Paulo (USP) – Brazil. He received the Ph.D. degree in Applied Mathematics from the Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro – Brazil, in 1998. He spent a sabbatical leave in the Scientific Computing and Imaging Institute at the University of Utah from 2008 to 2010. Besides having served in several program committees he is currently in the editorial board of Computer Graphics Forum, he was the president of the Special Committee on Computer Graphics and Image Processing of Brazilian Computer Society and leads the Visual and Geometry Processing Group at ICMC-USP. His research interests include visualization and geometry processing.