J. Parallel Distrib. Comput. 68 (2008) 1 – 2 www.elsevier.com/locate/jpdc
Editorial
Special issue on parallel techniques for information extraction

We live in an era in which every application of interest in science and engineering has to deal with a large amount of data. For instance, genomic data in biology are quite extensive. In homeland security, voluminous data of different kinds arise. Many of these applications demand real-time or near real-time performance.

This special issue is aimed at bringing together both theoreticians and practitioners who work on information extraction techniques for large amounts of data. Given the volume of data to be operated on, parallelism becomes inevitable. Parallelism paves the way for near real-time performance. This special issue deals with varied types of data.

There are seven papers in this special issue. Congiusta, Talia, and Trunfio present data-mining algorithms for grid environments. The grid is a distributed computing infrastructure that enables coordinated resource sharing. The grid has proven to be a cost-effective way of achieving parallelism and is being used to solve compute-intensive problems. The authors discuss how grid computing can be used to support distributed data mining and provide an outline of some research activities in grid-based data mining. They also point out some challenges in this area and sketch some promising future directions.

Distributed prediction from vertically partitioned data is the topic of Skillicorn and McConnell’s paper. Any data set can be thought of as consisting of records, with each record having several attributes. Vertically partitioned data refer to data for which each local site holds some of the attributes of all of the records. The task of prediction is to predict a particular attribute of each new record from its other attributes (based on an understanding of the structure of similar data). The authors show that a technique called attribute ensembles is very effective in prediction.
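To make the notion of vertically partitioned data concrete, the following is a minimal sketch in Python. It is not the method of Skillicorn and McConnell, whose details this editorial does not describe. It only assumes one plausible arrangement: each site sees every record but only its own attributes, trains a simple local predictor (here, a 1-nearest-neighbour rule), and the local predictions are combined by majority vote. The toy records, the column grouping, the function names, and the assumption that the target attribute is known at every site are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: four records with four numeric attributes each.
rows = [
    (1.0, 1.2, 0.9, 1.1),
    (0.8, 1.0, 1.1, 0.9),
    (5.0, 5.2, 4.9, 5.1),
    (5.1, 4.8, 5.0, 5.3),
]
# Assumption: the attribute to be predicted is available at every site.
labels = ["low", "low", "high", "high"]

# Hypothetical split: site 0 holds attributes 0 and 1, site 1 holds 2, site 2 holds 3.
column_groups = [(0, 1), (2,), (3,)]


def vertical_partition(records, groups):
    """Give each site all records, restricted to that site's attributes."""
    return [[tuple(r[c] for c in cols) for r in records] for cols in groups]


def nearest_label(local_rows, local_labels, query):
    """Local predictor: label of the nearest record (squared Euclidean distance)."""
    def dist(r):
        return sum((x - y) ** 2 for x, y in zip(r, query))
    best = min(range(len(local_rows)), key=lambda i: dist(local_rows[i]))
    return local_labels[best]


def ensemble_predict(sites, local_labels, groups, new_record):
    """Each site votes using only its own attributes; the majority label wins."""
    votes = [
        nearest_label(site, local_labels, tuple(new_record[c] for c in cols))
        for site, cols in zip(sites, groups)
    ]
    return Counter(votes).most_common(1)[0][0]


sites = vertical_partition(rows, column_groups)
prediction = ensemble_predict(sites, labels, column_groups, (4.9, 5.0, 5.2, 4.7))
```

In this sketch no site ever sees another site's attributes. Only the per-site votes are exchanged, which is the property that makes vertical partitioning interesting in a distributed setting.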
Glimcher, Jin, and Agrawal present an overview of two middleware systems they have developed for data mining on cluster and grid platforms. The first system, FREERIDE (FRamework for Rapid Implementation of Datamining Engines), is meant for a cluster environment and is based on the observation that well-known data-mining techniques can be parallelized by dividing the data records among the nodes. The authors have extended FREERIDE to obtain FREERIDE-G (FRamework for Rapid Implementation of Datamining Engines in Grid).

Identifying protein structures similar to a given query structure is an important problem in biology and has applications in function prediction, drug discovery, etc. In their paper, Gao and Zaki consider this vital problem. In particular, they develop a new method for extracting local structural features from protein structures. These feature vectors and suffix trees are then used to retrieve maximal matches from a database for a given query structure. The authors also demonstrate that their approach results in very good classification accuracy.

Ferreira, Koyuturk, Jagannathan, and Grama deal with the problem of semantic indexing in structured peer-to-peer networks. Unstructured and structured overlay networks are being used in many applications, such as file sharing and scientific data repositories. In this paper the authors present a novel structured overlay that integrates aspects of semantic indexing using non-orthogonal matrix decompositions. They employ distributed hash tables to enable efficient consolidation of patterns. An indexing method called pMINER results. The authors demonstrate excellent performance characteristics for their approach as well.

Mukherjee and Kargupta, in their paper, study the problem of distributed inferencing in a sensor network. The general version of the problem is intractable and hence researchers have concentrated on approximation algorithms.
The authors present a probabilistic algorithm called ‘Variational Inferencing in Distributed Environments (VIDE)’. The performance of the algorithm is analyzed for accuracy and energy consumed. Some experimental results are also presented. Both the analysis and the experimental data establish VIDE as a powerful algorithm for inferencing.

The paper by Plaza considers the problem of processing sensor data. Recent advances in computing technology have revolutionized the way of collecting remotely sensed data. For example, NASA is continuously collecting imagery data from the surface of the Earth. It is essential to develop efficient techniques for processing these voluminous data sets. The author reports several parallel algorithms for unsupervised information extraction and mining from hyperspectral image data sets. These algorithms
0743-7315/$ - see front matter © 2007 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2007.08.006
have been specifically designed for networks of workstations (NOWs). The author employs three approaches: clustering, classification, and spectral mixture analysis.

I thank the authors for their interesting contributions to this special issue.

Sanguthevar Rajasekaran
Department of Computer Science & Engineering, University of Connecticut, Storrs, CT 06269-2155, USA
E-mail address: [email protected]