Computers & Operations Research 37 (2010) 1359–1360
Computers & Operations Research journal homepage: www.elsevier.com/locate/cor
Guest editorial
In modern biology, the advent of high-throughput technologies, which simultaneously probe thousands of entities, has enabled the production of an enormous amount of data on complex biological systems. The challenge now is to understand how these various components interact, in order to obtain a complete picture of how the system functions. This requires the development of novel techniques aimed at extracting hidden information from such a huge amount of data. Such techniques can benefit strongly from the combination and constructive interplay of tools from data mining, statistical learning theory, and optimization.
This special issue contains nine papers that deal with optimization methodologies integrated with data analysis methods and efficient algorithms to extract information from biological data. We hope that reading it will stimulate the interest of operations researchers and computational mathematicians in investigating new methodologies for tackling these challenging data-intensive problems.
The first three papers deal with the application of machine learning techniques, with particular focus on clustering and classification algorithms, for organizing, analyzing, and uncovering the information hidden in large-scale gene expression measurements under different conditions.
The paper by Marco Antoniotti, Marco Carreras, Antonella Farinaccio, Giancarlo Mauri, Daniele Merico, and Italo Zoppis investigates the application of clustering techniques to large-scale gene expression measurement experiments. The authors study the applicability and performance of the cluster meta-analysis step, which aims at discovering relations among the components of cluster sets generated for correlated experiments. The analysis relies on kernel methods that exploit the graph structure of typical ontologies, such as the Gene Ontology, and considers the well-known Spellman Yeast Cell Cycle data set.
Ujjwal Maulik and Anirban Mukhopadhyay propose a clustering method for microarray data based on Variable-configuration-length Simulated Annealing (VSA) and artificial neural networks (ANNs). First, the technique evolves the number of clusters and the fuzzy membership matrix through VSA-based fuzzy clustering, minimizing the Xie–Beni cluster validity index. Then, it identifies high-confidence points (core points) for each cluster and uses them to train the ANN classifier. The remaining points are then classified using the trained classifier.
The work by Mariá Cristina Vasconcelos Nascimento, Franklina M.B. Toledo, and André C.P.L.F. de Carvalho focuses on a mathematical formulation to support the creation of consistent clusters for biological data. The authors provide a Greedy Randomized Adaptive Search Procedure (GRASP) for clustering. The initial solution is built using the Kaufman greedy initialization; then, the local search and the K-means algorithm are applied. A graph representation of the data sets is used, and a linearization of the mathematical model is exploited to run experiments in CPLEX. Tests are performed using external knowledge, and the results obtained by GRASP are compared with the partitions produced by known algorithms.
0305-0548/$ - see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.cor.2009.06.003
The development of efficient classification methods that integrate specialized optimization techniques is the leading theme of the following two papers, which aim at building predictive models for medical diagnosis and therapeutic responses.
Domenico Conforti and Rosita Guido focus on the problem of learning the best-performing kernel function to be embedded into a learning methodology based on support-vector-machine classifiers. The optimal kernel function is generated by solving a semidefinite programming model. A suitable model formulation guarantees that the optimal kernel matrix is positive semidefinite. Numerical experiments on medical diagnostic decision-making domains are provided, using publicly available patient data sets.
Francesco Archetti, Ilaria Giordani, and Leonardo Vanneschi investigate the usefulness of Genetic Programming (GP) for gaining insight into the functional relationships between gene expressions and the therapeutic responses to four clinical agents. The authors use the NCI-60 microarray data set, a panel of 60 cell lines derived from several different cancer types, including leukemias, melanomas, and ovarian, renal, prostate, colon, lung, and CNS cancers. The results provided by GP are compared with those obtained via linear regression and least-squares regression.
Two papers propose an innovative modelling approach based on game theory for inferring the relevance of genes in certain phenotypical conditions.
Roberto Lucchetti, Stefano Moretti, Fioravante Patrone, and Paola Radrizzani consider microarray technology, used to generate information on human gene expression, and a method for gene expression analysis based on coalitional games, which allows one to compute a numerical index. Such an index represents the relevance of each gene under a specific condition (e.g., a tumor), taking into account the behaviors of the other genes. The authors investigate the characterizations of two indices, namely the Banzhaf and Shapley values, on the class of microarray games. They then compare the results provided by these indices when applied to a colon tumor data set published in the literature.
Stefano Moretti proposes a game-theoretic analysis of the relevance of genes in determining a specific biological condition or response of interest in a population of cells. The author investigates the accuracy of the Shapley value on games arising from microarray experiments. Then, he develops a comparison between the relevance of genes under different biological conditions or responses (e.g., two different sub-types of tumors or two different treatments) and presents an algorithm to perform statistical inference, based on the distributions of the sample statistics of microarray games and the corresponding statistics of the Shapley values.
An optimization technique for the efficient solution of DNA sequencing problems is investigated by Paola Bertolazzi, Giovanni Felici, and Paola Festa. They deal with the cost of extracting a subset of SNPs (i.e., positions of the DNA sequences where the differences among individuals are embedded), called Tag SNPs, which retains most of the information contained in the whole DNA sequence. The authors develop an algorithm to find a minimum set of Tag SNPs, using a feature-selection method based on the solution of an integer program that is a particular type of set-covering problem. The quality of the selected Tags is tested using both a majority voting rule and a more sophisticated learning strategy.
Finally, the paper by Julio Vera, Carlos González-Alcón, Alberto Marín-Sanguino, and Néstor Torres focuses on the development of an optimization framework, which integrates qualitative biological knowledge and quantitative experimental data, for the technological improvement of biological systems. Computational issues are addressed by exploiting some structural features of the ODEs involved in the system models. Both mono- and multi-objective optimization methods are considered, and the approach is tested on various case studies.
We believe that this special issue will encourage researchers from operations research, computer science, biology, and the life sciences to interact further, in order to improve the applicability of computational models to challenging biomedical problems. Finally, we would like to express our thanks to the referees for their careful work and to the authors who submitted their research results.
Guest Editors
Enza Messina
Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Viale Sarca 336, 20126 Milano, Italy
E-mail address:
[email protected]
Marcello Sanguineti
Department of Communications, Computer and System Sciences (DIST), University of Genoa, Via Opera Pia 13, 16145 Genova, Italy
E-mail address:
[email protected]