J. Parallel Distrib. Comput. 73 (2013) 281–283
Editorial
Models and algorithms for high-performance distributed data mining

The problem of devising models and algorithms for high-performance Distributed Data Mining has traditionally been of great interest to the Data Mining and Database communities, together with researchers and scientists from the Distributed Computing area. In addition to this well-known trend, the emerging MapReduce initiative has shed new light on the research challenges posed by effectively and efficiently supporting Distributed Data Mining in high-performance environments.

Looking at fundamentals, Distributed Data Mining is well understood as a resource-intensive and time-consuming task devoted to extracting patterns and regularities from huge amounts of distributed data sets. Classical algorithms, mostly developed in the context of centralized environments, have already been proved unsuitable for mining data in distributed settings. This is due not only to conceptual and methodological drawbacks but, most importantly, to the novel challenges posed by distributed, resource-intensive, and time-consuming processing as dictated by high-level specifications of Distributed Data Mining algorithms. Owing to these challenges, performance aspects of Distributed Data Mining are now recognized as one of the most attractive topics for the Data Mining and Database research community, even with respect to next-generation computational platforms (e.g., Clouds, Grids, and Service-Oriented Architectures) and paradigms (e.g., Peer-to-Peer, MapReduce, and Service-Oriented Computing). Emerging application scenarios like Social Networks also play the role of interesting contexts that may stimulate further investigation in this field.

In Distributed Data Mining models and algorithms, high performance is not only an architecture- and resource-oriented matter: it also involves designing innovative models, algorithms and techniques capable of dealing, on the one hand, with the difficulties posed by such challenging distributed environments (e.g., network discontinuities, node faults, and so forth) and, on the other hand, with the conceptual Data Mining tasks (e.g., frequent itemset mining, association rule discovery, and so forth) codified within Distributed Data Mining algorithms, which may turn out to be inherently hard.

With the aim of filling research gaps deriving from theoretical and practical aspects of models and algorithms for high-performance Distributed Data Mining, this special issue on ‘‘Models and Algorithms for High-Performance Distributed Data Mining’’ of the Journal of Parallel and Distributed Computing contains eight papers, which have gone through two rigorous review rounds before being accepted for final inclusion. Some of the contributions of this special issue were invited for submission as best papers of the 10th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2010, with proceedings published in the LNCS series), held in Busan, Korea, during May 21–23, 2010, and led by the Editor.
The final goal of this special issue is to provide high-quality contributions in the context of models and algorithms for high-performance Distributed Data Mining, emphasizing both theoretical and practical aspects of this interesting yet not completely explored scientific area, which also has important and well-understood implications in different but related disciplines like bioinformatics, genomic computing, e-science, analytics over large-scale data repositories, and so forth. The contributions of this special issue focus on a variety of topics in high-performance Distributed Data Mining, from fundamentals (e.g., multi-task validation and random graph generation problems) to applications (e.g., Big Data and microarray data clustering). In the following, we provide a summary of the papers contained in this special issue.

The first paper, titled ‘‘Parallel approaches to machine learning—A comprehensive survey’’, by Sujatha R. Upadhyaya, presents a complete survey of parallel algorithms and architectures devoted to supporting Machine Learning methods and tools, embracing the wide development of this area since the inception of the idea in 1995 and identifying different phases across the period 1995–2011. As the author highlights, when it comes to performance enhancement, Graphics Processing Unit (GPU) platforms have carved out a special niche for themselves. The strength of these platforms comes from their capability of dramatically speeding up computations by way of parallel architecture/programming methods. While it is evident that computationally complex processes like image processing, mathematical modeling of complex systems, financial data processing, and so forth, stand to gain much from parallel architectures, studies suggest that general-purpose tasks such as machine learning, graph traversals, and finite state machines are also identified as the parallel applications of the future. MapReduce is another important technique that has evolved during this period, and it has proved to be an important aid in delivering the performance of Machine Learning algorithms on GPUs. Putting all this together, the paper surveys the path of developments in this interesting research area.

The second paper, titled ‘‘Parallel multitask cross validation for Support Vector Machine using GPU’’, by Qi Li, Raied Salman, Erik Test, Robert Strack and Vojislav Kecman, investigates the context of Support Vector Machines (SVM), recognized as an efficient Machine Learning tool with high accuracy performance. However, as the authors notice, in order to achieve the highest accuracy, n-fold cross validation is commonly used to identify the best hyper-parameters for SVM settings. This becomes a weak point of SVM due to the extremely long training time for various hyper-parameters of different kernel functions. Following this main motivation, the authors propose and experimentally assess a novel parallel SVM training implementation capable of accelerating the cross validation procedure by running multiple training tasks simultaneously on a GPU. All of these tasks with different hyper-parameters share the same cache memory, which stores the kernel matrix of the support vectors; this heavily reduces redundant computations of kernel values across different training tasks. Considering that the computations of kernel values are the most time-consuming operations in SVM training, the total time cost of the cross validation procedure decreases significantly. The experimental tests carried out by the authors indicate that the time cost of the multitask cross validation training is very close to the time cost of the slowest task trained alone. In addition, comparison tests show that the proposed method is 10 to 100 times faster than the state-of-the-art LIBSVM library.
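To give a flavor of why sharing kernel computations across cross-validation tasks pays off, the following minimal sketch (the Editor's illustration, not the authors' GPU code; the use of NumPy and scikit-learn, and all function names, are assumptions made here for clarity) computes each Gram matrix once and reuses it across every fold and every value of the regularization parameter C:

```python
# Illustrative sketch: amortize kernel computations across an n-fold
# cross-validation grid search, in the spirit of the paper's shared kernel
# cache (which, in the paper, resides in GPU memory and is shared by
# concurrently running training tasks).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import KFold

def rbf_gram(X, gamma):
    # Full RBF Gram matrix, computed once per gamma and then reused.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * d2)

def multitask_cross_validation(X, y, gammas, Cs, n_folds=5):
    folds = list(KFold(n_splits=n_folds, shuffle=True, random_state=0).split(X))
    scores = {}
    for gamma in gammas:
        K = rbf_gram(X, gamma)                 # shared by all tasks for this gamma
        for C in Cs:
            accs = []
            for train, test in folds:
                clf = SVC(kernel='precomputed', C=C)
                clf.fit(K[np.ix_(train, train)], y[train])
                accs.append(clf.score(K[np.ix_(test, train)], y[test]))
            scores[(gamma, C)] = float(np.mean(accs))
    return scores
```

The sketch is sequential; the paper's contribution is precisely to run the per-(gamma, C) training tasks concurrently on the GPU against the shared kernel matrix.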
The third paper, titled ‘‘An effective and efficient parallel approach for random graph generation over GPUs’’, by Stephane Bressan, Alfredo Cuzzocrea, Panagiotis Karras, Xuesong Lu and Sadegh Heyrani Nobari, addresses the problem of generating and manipulating random graphs, inspired by the widespread usage of such data structures in database applications over several years. This is because random graphs turn out to be very useful in a large family of database applications, ranging from simulation to sampling, and from the analysis of complex networks to the study of randomized algorithms. Amongst others, the Erdős–Rényi Γ(v, p) model is the most popular model for obtaining and manipulating random graphs. Unfortunately, as the authors recognize, it has been demonstrated that classical algorithms for generating Erdős–Rényi random graphs do not scale well to large instances and, in addition, fail to make use of the parallel processing capabilities of modern hardware. In order to fill this gap, the authors propose a novel parallel algorithm, called PPreZER, for generating random graphs under the Erdős–Rényi model, designed and implemented to run on GPUs. The authors demonstrate the benefits of the proposed solution via a succession of several intermediary algorithms, both sequential and parallel, which expose the limitations of classical approaches and the gains achieved by PPreZER. Finally, the authors provide a comprehensive experimental assessment and analysis that brings to light a significant average speedup of PPreZER over the above-mentioned baseline algorithms.
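To make the algorithmic idea concrete, the following sequential sketch (the Editor's illustration; PPreZER itself is a GPU algorithm that additionally precomputes skip probabilities) contrasts the naive per-edge coin flip with the geometric ‘‘skip’’ technique on which the ZER/PreZER/PPreZER succession builds:

```python
# Illustrative sketch of Erdős–Rényi Γ(v, p) generation, assuming 0 < p < 1.
import math
import random

def index_to_edge(idx, v):
    # Decode a linear index over the upper-triangular candidate edges
    # (0 .. v*(v-1)/2 - 1) into a concrete pair (i, j) with i < j.
    i, row = 0, v - 1
    while idx >= row:
        idx -= row
        i += 1
        row -= 1
    return (i, i + 1 + idx)

def er_naive(v, p):
    # Baseline: one Bernoulli trial per candidate edge, i.e. O(v^2) draws.
    return [(i, j) for i in range(v) for j in range(i + 1, v)
            if random.random() < p]

def er_skip(v, p):
    # Skip-based variant: sample the gap to the next selected edge from a
    # geometric distribution, so the number of random draws is proportional
    # to the number of edges actually produced, not to v^2.
    edges, n_cand = [], v * (v - 1) // 2
    log_q, idx = math.log(1.0 - p), -1
    while True:
        gap = int(math.log(1.0 - random.random()) / log_q)   # geometric, >= 0
        idx += gap + 1
        if idx >= n_cand:
            return edges
        edges.append(index_to_edge(idx, v))
```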
The fourth paper, titled ‘‘Fault tolerant decentralised K-Means clustering for asynchronous large-scale networks’’, by Giuseppe Di Fatta, Francesco Blasa, Simone Cafiero and Giancarlo Fortino, proposes a parallel, scalable adaptation of the classical K-Means algorithm for cluster analysis. Indeed, as the authors recognize, while the straightforward parallel formulation of K-Means is well suited for distributed-memory systems with reliable interconnection networks, such as massively parallel processors and clusters of workstations, in large-scale geographically distributed systems the straightforward parallel algorithm can be rendered useless by a single communication failure or by high latency in communication paths. As a consequence, the lack of scalable and fault-tolerant global communication and synchronization methods in large-scale systems has hindered the adoption of K-Means in applications over large networked systems such as Wireless Sensor Networks, Peer-to-Peer systems and Mobile Ad-Hoc Networks. In light of this evidence, the paper proposes a fully-distributed K-Means algorithm, called Epidemic K-Means, which does not require global communication and is intrinsically fault-tolerant. The proposed distributed K-Means algorithm provides a clustering solution which can approximate the solution of an ideal centralized algorithm over the aggregated data as closely as desired. A comparative performance analysis carried out by the authors against state-of-the-art sampling methods shows that Epidemic K-Means overcomes the limitations of sampling-based approaches for skewed cluster distributions. The results of this experimental analysis confirm that the proposed algorithm is highly accurate and fault-tolerant under unreliable network conditions (e.g., message loss and node failures), and that it is suitable for asynchronous networks of very large and extreme scale.
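The core mechanism can be illustrated compactly: each node computes K-Means sufficient statistics (per-cluster sums and counts) on its local data, and an epidemic protocol of random pairwise averaging drives every node's estimate toward the global values without any coordinator. The simulation sketch below (ours, not the authors' Epidemic K-Means; it idealizes the gossip rounds and omits failure handling) shows one decentralized iteration:

```python
# Simulation sketch of one gossip-based K-Means iteration.
import random
import numpy as np

def local_stats(points, centroids):
    # Assignment step over local data only: per-cluster sums and counts.
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for x in points:
        c = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
        sums[c] += x
        counts[c] += 1
    return sums, counts

def gossip_average(values, rounds=200):
    # Random pairwise averaging, a classic epidemic aggregation scheme:
    # each exchange preserves the global mean, so all nodes converge to it.
    n = len(values)
    for _ in range(rounds):
        a, b = random.randrange(n), random.randrange(n)
        if a != b:
            avg = (values[a] + values[b]) / 2.0
            values[a], values[b] = avg, avg.copy()
    return values

def epidemic_kmeans_iteration(node_data, centroids):
    stats = [local_stats(pts, centroids) for pts in node_data]
    sums = gossip_average([s for s, _ in stats])
    counts = gossip_average([c for _, c in stats])
    # The ratio of averaged sums to averaged counts equals the ratio of the
    # global sums to the global counts, so every node can update the
    # centroids independently once the gossip estimates have stabilized.
    s, c = sums[0], np.maximum(counts[0], 1e-9)
    return s / c[:, None]
```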
The fifth paper, titled ‘‘A decentralized approach for mining event correlations in distributed system monitoring’’, by Gang Wu, Huxing Zhang, Meikang Qiu, Zhong Ming, Jiayin Li and Xiao Qin, provides a range of contributions around the issue of monitoring, analyzing, and controlling large-scale distributed systems. In this context, events detected during monitoring are temporally correlated, which is helpful for resource allocation, job scheduling, and failure prediction. To discover correlations among detected events, traditional approaches store detected events in an event database and perform mining procedures on that database. In contrast, the authors argue that these approaches do not scale on large distributed systems, as monitored events grow so fast that event correlation discovery can hardly be carried out with the power of a single computational node. Following this main intuition, the authors propose a decentralized approach to efficiently detect interesting events, filter irrelevant events, and discover their temporal correlations. This solution is implemented by means of a MapReduce-based version of the well-known Apriori algorithm, called MapReduce-Apriori, which mines so-called event association rules and makes use of the computational resources of multiple dedicated nodes of the system. Experimental results reported by the authors clearly confirm that the proposed decentralized event correlation mining algorithm achieves a nearly ideal speedup in comparison to centralized mining approaches.
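The MapReduce decomposition of an Apriori pass is straightforward to illustrate. In the single-machine sketch below (ours, not the paper's MapReduce-Apriori; a real deployment would ship the map and reduce functions to a MapReduce runtime, and candidates would be generated from the frequent (k-1)-itemsets rather than enumerated exhaustively), each partition of the event log emits local candidate counts, and the reduce phase sums them and applies the support threshold:

```python
# Single-machine sketch of one MapReduce-style Apriori pass.
from collections import Counter
from itertools import combinations

def map_phase(partition, k):
    # Mapper: local counts of k-item candidates in one partition of the log.
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1
    return counts

def reduce_phase(mapped, min_support):
    # Reducer: aggregate the local counts and filter by global support.
    total = Counter()
    for counts in mapped:
        total.update(counts)
    return {s: n for s, n in total.items() if n >= min_support}

def apriori_pass(partitions, k, min_support):
    return reduce_phase((map_phase(p, k) for p in partitions), min_support)

# Example: two partitions of an event log, mining frequent pairs.
parts = [[{'a', 'b', 'c'}, {'a', 'b'}], [{'a', 'b', 'd'}, {'b', 'c'}]]
print(apriori_pass(parts, k=2, min_support=3))   # -> {('a', 'b'): 3}
```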
The sixth paper, titled ‘‘Parallel rare term vector replacement: Fast and effective dimensionality reduction for text’’, by Tobias Berka and Marian Vajtersic, focuses on the dimensionality reduction problem, an established area in Text Mining and Information Retrieval. Classical dimensionality reduction methods convert the highly sparse corpus matrix into a dense matrix format while preserving or improving the classification accuracy or retrieval performance. Indeed, as the authors recognize, according to Zipf's law, in canonical document corpora the majority of indexing terms occur only in a small number of documents. By exploiting this special feature, and in order to improve the performance of dimensionality reduction, the authors describe a novel approach for dimensionality reduction over text documents, along with a parallel algorithm suitable for private-memory parallel computer systems. The distinctive characteristic of the proposed algorithm consists in replacing rare terms of documents by computing a vector which expresses their semantics in terms of common terms. This process produces a so-called projection matrix, which can be applied either to a corpus matrix or to individual document and query vectors. In support of their proposal, the authors provide an experimental evaluation conducted on two benchmark corpora. These experiments show that the proposed algorithm delivers a substantial reduction in the number of features, together with a clear improvement in retrieval performance. The parallel implementation of the proposed algorithm has also been evaluated using the Message Passing Interface (MPI) with up to 32 processes on a Nehalem Xeon cluster, achieving significant computational gains in dimensionality reduction in the parallel environment as well.
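The following dense-matrix sketch conveys one plausible reading of the construction (it is our illustration, not the authors' parallel implementation; the thresholding rule and the normalization are assumptions): rare rows of the term-document matrix are folded into the common-term subspace via their co-occurrence profiles, yielding a single projection matrix that applies equally to the corpus and to incoming query vectors.

```python
# Sketch: build a projection that replaces rare terms by common-term vectors.
import numpy as np

def rare_term_projection(X, df_threshold):
    # X: term-by-document count matrix (dense here for simplicity).
    n_terms, _ = X.shape
    df = (X > 0).sum(axis=1)                  # document frequency per term
    common = np.where(df > df_threshold)[0]
    P = np.zeros((len(common), n_terms))
    P[np.arange(len(common)), common] = 1.0   # common terms map to themselves
    for t in np.where(df <= df_threshold)[0]:
        docs = np.where(X[t] > 0)[0]          # documents containing rare term t
        profile = X[np.ix_(common, docs)].sum(axis=1)
        norm = np.linalg.norm(profile)
        if norm > 0:
            P[:, t] = profile / norm          # semantics in terms of common terms
    return P, common

# The same projection serves corpus and queries:  X_red = P @ X,  q_red = P @ q.
```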
The seventh paper, titled ‘‘p-PIC: Parallel power iteration clustering for big data’’, by Weizhong Yan, Umang Brahmakshatriya, Ya Xue, Mark Gilder and Bowden Wise, focuses on the issue of making Power Iteration Clustering (PIC), a newly developed clustering algorithm, scalable on Big Data. PIC performs clustering by embedding data points in a low-dimensional subspace derived from the similarity matrix built from the target high-dimensional space. Compared to traditional clustering algorithms, PIC is simple, fast and relatively scalable. However, it requires the data and its associated similarity matrix to fit into memory, which makes the algorithm infeasible for Big Data applications. Addressing this main drawback of the conventional PIC algorithm, the paper proposes to enhance PIC's data scalability by implementing a parallel version, called parallel Power Iteration Clustering (p-PIC). In more detail, the authors investigate two different application scenarios. The first one is that of traditional clusters of powerful workstations, where they explore different parallelization strategies and implementation details for minimizing computation and communication costs. The second one is that of clusters of low-end commodity computers (e.g., COTS-based clusters and the general-purpose servers found at most commercial cloud providers), where they focus on ensuring that the algorithm works as well as in the previous (more stable) setting. In addition to conceptual and algorithmic solutions, the authors also provide an experimental evaluation and analysis of the proposed p-PIC algorithm that clearly shows its high scalability with respect to both data and computational resources.
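For readers unfamiliar with PIC, the serial computation at its core is compact, as the sketch below shows (our illustration of power iteration clustering in the style of Lin and Cohen, not the authors' p-PIC code; the stopping rule is simplified). The point of p-PIC is that the dominant cost, the repeated matrix-vector product, can be partitioned row-wise across processes so that no single node must hold the full similarity matrix:

```python
# Sketch of the core Power Iteration Clustering computation.
import numpy as np
from sklearn.cluster import KMeans

def pic_embedding(A, n_iter=100, eps=1e-6):
    # Row-normalize the similarity matrix and iterate it on a vector; the
    # slowly converging iterate serves as a one-dimensional embedding.
    W = A / A.sum(axis=1, keepdims=True)
    v = np.full(A.shape[0], 1.0 / A.shape[0])
    prev_delta = None
    for _ in range(n_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()          # L1 normalization
        delta = np.abs(v_new - v).max()
        if prev_delta is not None and abs(prev_delta - delta) < eps:
            return v_new                       # the "velocity" has stabilized
        v, prev_delta = v_new, delta
    return v

def pic_cluster(A, k):
    v = pic_embedding(A)
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```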
Finally, the eighth paper, titled ‘‘MicroClAn: Microarray clustering analysis’’, by Giulia Bruno and Alessandro Fiori, investigates the evaluation of clustering results, already recognized as a fundamental task in microarray data analysis due to the lack of sufficient biological knowledge to know in advance the true partition of genes. Many quality indexes for gene clustering evaluation have been proposed in the literature, mainly derived from classical Data Mining measures, as well as indexes expressly created to evaluate the biological meaning of clusters. As the authors highlight, a critical issue in this domain is to compare and aggregate quality indexes in order to select the best clustering algorithm and the optimal parameter setting for a given data set. Furthermore, due to the huge amount of data generated by microarray experiments and the need for external resources such as ontologies to compute biological indexes, performance degradation in terms of execution time is another critical issue to be faced. As a consequence, the distributed computation of algorithms and quality indexes becomes essential. Addressing these issues, the paper presents the MicroClAn framework, a distributed system for evaluating and comparing clustering algorithms over microarray data using the most widely exploited quality indexes. The best solution is selected through a two-step aggregation of the ranks produced by the quality indexes, and several scheduling strategies are exploited to distribute tasks in a reference Grid environment so as to optimize the overall execution time.
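The basic mechanism behind the selection step is easy to illustrate. In the sketch below (ours; the paper's two-step aggregation is more elaborate, and the scores are invented for the example), each quality index ranks the candidate clustering solutions and the ranks are then aggregated, Borda-style, to pick an overall winner:

```python
# Sketch of rank aggregation over clustering quality indexes.
import numpy as np

def aggregate_ranks(scores, higher_is_better):
    # scores: (n_indexes, n_solutions) matrix of quality-index values;
    # higher_is_better: one boolean flag per index.
    n_idx, n_sol = scores.shape
    ranks = np.empty_like(scores)
    for i in range(n_idx):
        order = np.argsort(-scores[i] if higher_is_better[i] else scores[i])
        ranks[i, order] = np.arange(1, n_sol + 1)   # 1 = best for this index
    mean_rank = ranks.mean(axis=0)                  # aggregate across indexes
    return int(np.argmin(mean_rank)), mean_rank

# Example: three indexes scoring four candidate algorithm/parameter settings.
scores = np.array([[0.7, 0.9, 0.6, 0.8],
                   [0.2, 0.1, 0.4, 0.3],   # an error-like index (lower is better)
                   [0.5, 0.8, 0.4, 0.7]])
best, ranks = aggregate_ranks(scores, [True, False, True])   # best == 1
```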
The Editor would like to express his sincere gratitude to the Editor-in-Chief of the Journal of Parallel and Distributed Computing, Prof. Viktor Prasanna, for accepting his proposal of a special issue focused on models and algorithms for high-performance Distributed Data Mining, and for assisting him whenever required. The Editor would also like to thank all the reviewers, who worked within a tight schedule and whose detailed and constructive feedback to the authors contributed to substantial improvements in the quality of the final papers.

The Editor
Alfredo Cuzzocrea
ICAR-CNR and University of Calabria, via P. Bucci 41C, 87036 Rende (CS), Italy
E-mail address: [email protected]
URL: http://si.deis.unical.it/~cuzzocrea

11 November 2012
Available online 16 November 2012