Computer Methods and Programs in Biomedicine 111 (2013) 512–518
journal homepage: www.intl.elsevierhealth.com/journals/cmpb
GelClust: A software tool for gel electrophoresis images analysis and dendrogram generation

Sahand Khakabimamaghani a,b, Ali Najafi a,b,∗, Reza Ranjbar a, Monireh Raam a

a Molecular Biology Research Center, Baqiyatallah University of Medical Sciences, Tehran, Iran
b Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
Article info

Article history: Received 16 December 2012; received in revised form 16 April 2013; accepted 18 April 2013.

Keywords: Gel electrophoresis images; Image processing; Dendrogram; Software; Phylogenetic trees

Abstract

This paper presents GelClust, new software designed for processing gel electrophoresis images and generating the corresponding phylogenetic trees. Unlike most related commercial and non-commercial software, we found GelClust to be very user-friendly; it guides the user from image to dendrogram through seven simple steps. Furthermore, the software, which is implemented in the C# programming language under the Windows operating system, is more accurate than similar software regarding image processing and is the only software able to detect and correct gel ‘smile’ effects completely automatically. These claims are supported by experiments.

© 2013 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Gaining comprehensive knowledge about the constituent elements of an individual's genomic DNA is an important goal in molecular biology and related areas. Depending on the aims, a researcher might need complete or partial knowledge of the DNA under investigation. Complete insight is achievable using expensive and time-consuming sequencing techniques such as shotgun sequencing. However, in many cases, such as cloning of large plant DNA [1], constructing physical maps of chromosomes [2], identifying restriction fragment length polymorphisms (RFLPs), determining the number and size of chromosomes, and molecular typing [3,4], this is not what the researcher needs. For such cases, cheaper methods are used to gather partial knowledge.
Much of the rapid progress being made in molecular biology today depends upon the ability to separate, size, and visualize DNA molecules. The most common technique for this purpose is standard agarose gel electrophoresis [5]. Gel electrophoresis was first conducted in the 1930s using sucrose gel. Starch and acrylamide gels followed in 1955 and 1959, respectively; they provided more accurate separation and more control over pore size. Then, during the late 1970s, two-dimensional electrophoresis (1975) and agarose gels emerged. Agarose is the most popular type of gel nowadays and is usually used in pulsed-field gel electrophoresis (PFGE), a method introduced in 1983 that is able to separate large DNA molecules. Gel electrophoresis has various applications in forensics, molecular biology, genetics, microbiology, and biochemistry. As mentioned above, however, one of the most important applications of this method is molecular typing [6,7]. Molecular typing is used in epidemiology, in finding the food source of contamination, in determining the plant pathogenic species that have spread into an environment, and in specifying the genotype associated with a particular bacterium [8]. Moreover, this method gives more insight into epidemiological principles and into the evolution and spread of many bacterial diseases.

The goal of gel electrophoresis in molecular epidemiology is to investigate the DNA molecules extracted from samples more accurately and to compare them with each other or with a standard sample (i.e., a marker). This is accomplished by comparing the size/charge of segments of the extracted DNA molecules in three steps: cutting the DNA molecules with restriction enzymes, running the cut molecules of each sample in separate columns using an electrical current, and comparing the segment maps created for each sample in the gel matrix with each other. Of these three steps (DNA restriction, running the segments on an agarose or acrylamide gel, and analyzing the generated map), the first two are completed in the laboratory. The third step used to be done by eye, but the process was very time-consuming and error-prone. Accordingly, as the demand for gel electrophoresis increased, information technology was exploited to speed up the processing, analysis, and clustering of samples based on the gel image.

∗ Corresponding author at: Molecular Biology Research Center, Baqiyatallah University of Medical Sciences, Tehran, Iran. Tel.: +98 21 82482548. E-mail addresses: najafi [email protected], najafi [email protected] (A. Najafi).
0169-2607/$ – see front matter © 2013 Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.cmpb.2013.04.013
2. Related works
A few tools and software packages have been developed for gel electrophoresis image analysis, but unfortunately most of them are commercial or banned in Iran. Moreover, some of these packages do not fulfill all user requirements. For example, the QIAxcel System [4,9], a product of QIAGEN, is able to quantize the images but cannot cluster the samples. Another example is Biometra's BioDocAnalyze [10], which is not easy to work with when processing low-quality images. The same applies to the Gel Doc EZ System from Bio-Rad Laboratories [11]. In addition to these, there are other tools with clustering ability. For instance, ClusterVis from SequentiX [12] is able to draw the phylogenetic tree of samples based on the information provided by GelQuest (another SequentiX product). Although GelQuest is easy to work with, most of the tasks must be carried out manually, which slows down the analysis. Moreover, GelScan, a part of DIAS-II from Serva Electrophoresis [13], Phoretix 1D from Biostep [14], and Quantity One from Bio-Rad [15] are three commercial packages that share all the capabilities of the above-mentioned tools (manually or automatically) and are also able to cluster samples phylogenetically. While these tools fulfill most user requirements, their main problems are their high price and their sophisticated usage, which requires well-trained users [16]. Furthermore, Rementeria et al. [17] and Cardinali et al. [18] have discussed the drawbacks of some commercial packages and the discrepancies between them in detecting bands and determining the number of genotypes based on gel images.

In view of the above, some researchers have endeavored to develop their own simple, non-commercial tools for this purpose. One example is PyElph [19], an open-source package providing all of the abilities of the commercial software. This user-friendly software is mainly designed for educational use and is not very accurate in detecting columns and bands. Another is GelAnalyzer [20], a Java-based but not open-source product. Some features of these products are summarized in Table 1.

Given all this, there is a clear need for accurate, user-friendly software that reduces the need for user intervention in the analysis process. In the next sections the proposed software, GelClust, is introduced and its general and specific capabilities are discussed. This software is intended to address some of the issues of the existing tools mentioned above.
3. Methodology
GelClust processes the input image in five steps and clusters the samples in two steps. The software includes a control window containing a menu bar, user directions, and ‘next’ and ‘previous’ buttons. In addition, an operation window allows the user to view and edit the results. The following steps are followed when processing the image:
(1) Loading the image and selecting the region of interest: the software accepts most image formats as input. The image file is imported using the ‘File’ menu in the menu bar. The user can select a region of interest on which the processing will be conducted in the next steps.
(2) Automatic or semi-automatic detection of columns (Fig. 1): in this step, columns are automatically detected given two
Table 1 – Some features of existing gel electrophoresis image processing software.

| Software name | Commercial | Band correction | Smile effect correction | Dendrogram output |
| --- | --- | --- | --- | --- |
| QIAxcel System | Yes | Yes (Auto & Manual) | N/A | No |
| BioDocAnalyze | Yes | Yes (Auto & Manual) | Yes (Manual) | No |
| Gel Doc EZ System | Yes | Yes (Auto & Manual) | Yes (Manual) | No |
| Gelquest & ClusterVis | Yes | Yes (Manual) | Yes (Manual) | Yes |
| GelScan | Yes | Yes (Auto & Manual) | N/A | Yes |
| Phoretix 1D | Yes | Yes (Auto & Manual) | Yes (Manual) | Yes |
| Quantity One | Yes | Yes (Auto & Manual) | Yes (Manual) | Yes |
| GelAnalyzer | No | Yes (Auto & Manual) | Yes (Manual) | No |
| PyElph | No | Yes (Auto & Manual) | Yes (Manual) | Yes |
Fig. 1 – Automatic or semi-automatic detection of columns: the ideal case is when Sensitivity = 41 and Step = 6.
variables, ‘Step’ and ‘Sensitivity’. ‘Sensitivity’ determines how sensitive the software is to variations in light intensity: the higher its value, the more likely low-intensity columns are to be detected. ‘Step’ is set according to how densely the light bars are packed in the input image. That is, the more columns there are and the closer they are to each other, the smaller the value of ‘Step’ should be. The software recomputes the output and displays it as the user changes these values. The column detection algorithm is based on local variations of light intensity, which are computed by averaging the intensity of the pixels in each pixel column and comparing these averages.
(3) Manual correction of columns (if needed): if the results of the previous step are unsatisfactory or need to be modified, the user can make the modifications in this step. This is done with the left and right mouse buttons: the user moves the mouse pointer to the desired point and adds or removes overlaying bars with a left or right click, respectively. The overlaying bars indicate the gaps between the lanes.
(4) Automatic detection of bands and smile effects and identification of the marker column: at the beginning of this step, before processing starts, the user is asked whether the image contains a smile effect. The question is asked to save the processing time required for smile effect detection, which is performed only if the user answers ‘Yes’. After the user responds, the method used in the second step for column detection is applied here to detect the bands in each column. The same input arguments, ‘Sensitivity’ and ‘Step’, are also required here (Fig. 2), and their effect is similar to that in step two. Indeed, how to adjust these variables is the only thing the user needs to know. The band detection algorithm exploits local light intensity variations in the same way as the column detection algorithm. Here, however, instead of averaging the light values for each column of pixels, the average is computed for each row of pixels in the lane whose bands are being detected. At the end of this step, the user identifies the marker column by clicking on it. The default marker column is the leftmost column.
(5) Manual correction of bands (if needed): in this step the user can remove extra bands or add undetected ones at the desired point on the image using the right and left mouse buttons, respectively. The software helps the user by showing the position of the bar to be added as the mouse pointer moves.
After all these steps are completed, the information required for clustering the samples is available. This information includes the number of columns and the positions of the bands in each column. Clustering can then be conducted in two steps:
(1) Determining the amount of error: the amount of error is the vertical distance allowed between two bands from two different columns that in fact correspond to two fragments with similar or very close molecular weights. This distance can result from different causes such as mutations, errors in the image, or faults of the image processing algorithm. In this step, the user sees a corrected image (the result of automatic smile effect correction) and can set the amount of error. The software displays the margins of each band interactively as the user changes the amount of error, so the user can see the overlapping margins and decide on the suitable value
Fig. 2 – Automatic or semi-automatic detection of bands: the best results are obtained with Sensitivity = 46 and Step = 4.
of error (Fig. 3). We have used pixel-based tolerance values in line with the statement of Gerner-Smidt et al. that “the errors on the run length are much more constant over the gel than errors on the molecular sizes” [21]. This value is used in the next step for constructing the dendrogram.
(2) Determining the clustering algorithm and the similarity coefficient used in it (Figs. 4 and 5): the software provides the UPGMA [22] and Neighbor Joining [23] clustering algorithms and the Dice [24], Jaccard [25], and Pearson [26] similarity coefficients, which are among the most popular and significant methods. The user sees the new dendrogram immediately after changing the selected clustering algorithm or similarity coefficient.

Finally, the user can save the resulting dendrogram in several image formats (BMP, JPG, TIF, GIF, and PNG).
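The column and band detection of steps (2) and (4) can be illustrated with a minimal Python sketch. It is not GelClust's C# implementation: the helper names `mean_profile` and `find_peaks` are hypothetical, and the way ‘Sensitivity’ (taken here as a 0 to 100 value that lowers the required contrast) and ‘Step’ (taken here as the half-width of the comparison window) enter the computation is an assumption made only for illustration. The image is assumed to be a grayscale list of pixel rows with bright lanes on a dark background.

```python
def mean_profile(image, axis):
    """Average intensity per pixel column (axis=0) or per pixel row (axis=1)."""
    if axis == 0:
        n_rows = len(image)
        return [sum(row[x] for row in image) / n_rows for x in range(len(image[0]))]
    return [sum(row) / len(row) for row in image]


def find_peaks(profile, sensitivity, step):
    """Return indices that are local maxima within +/- step samples.

    Assumed mapping (not taken from GelClust): a higher sensitivity lowers
    the contrast a peak must show against its neighbourhood, so fainter
    lanes or bands are accepted.
    """
    if not profile:
        return []
    span = max(profile) - min(profile)
    threshold = span * (1.0 - sensitivity / 100.0)
    peaks = []
    for i, value in enumerate(profile):
        lo = max(0, i - step)
        hi = min(len(profile), i + step + 1)
        window = profile[lo:hi]
        # Keep only the first sample of a plateau as the peak position.
        is_first_max = value == max(window) and window.index(value) == i - lo
        contrast = value - min(window)
        if is_first_max and contrast > 0 and contrast >= threshold:
            peaks.append(i)
    return peaks


# Toy image: a bright lane around x = 2..3 and a faint lane at x = 8.
image = [[0, 0, 200, 200, 0, 0, 0, 0, 60, 0, 0, 0]] * 3
column_profile = mean_profile(image, axis=0)
strong_only = find_peaks(column_profile, sensitivity=50, step=2)   # [2]
with_faint = find_peaks(column_profile, sensitivity=80, step=2)    # [2, 8]
```

Band detection follows the same pattern on a single lane: with `lane = [row[x0:x1] for row in image]`, the row averages `mean_profile(lane, axis=1)` are passed to `find_peaks` in the same way.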
Fig. 3 – Determining the amount of error. In this figure error = 5.
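How the amount of error feeds into a band-based similarity coefficient can be sketched as follows. The source does not give GelClust's matching rule, so the greedy one-to-one matching below and the function names are assumptions. The Dice and Jaccard formulas are the standard ones: with m matched bands out of n_a and n_b bands in two lanes, Dice is 2m/(n_a + n_b) and Jaccard is m/(n_a + n_b − m).

```python
def count_matches(bands_a, bands_b, tolerance):
    """Count one-to-one matches between two lanes' band positions (pixels).

    Two bands match when their vertical distance is at most `tolerance`.
    A two-pointer sweep over the sorted positions pairs each band at most once.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = matches = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matches


def dice(bands_a, bands_b, tolerance):
    """Dice coefficient 2m / (n_a + n_b); two empty lanes count as identical here."""
    total = len(bands_a) + len(bands_b)
    if total == 0:
        return 1.0
    return 2.0 * count_matches(bands_a, bands_b, tolerance) / total


def jaccard(bands_a, bands_b, tolerance):
    """Jaccard coefficient m / (n_a + n_b - m); two empty lanes count as identical here."""
    m = count_matches(bands_a, bands_b, tolerance)
    union = len(bands_a) + len(bands_b) - m
    if union == 0:
        return 1.0
    return m / union


# Band positions in pixels for two lanes, compared with error = 5 as in Fig. 3.
lane_1 = [40, 95, 150, 210]
lane_2 = [43, 98, 180, 212]
shared = count_matches(lane_1, lane_2, tolerance=5)  # 3 bands match
```

With three of four bands matched in each lane, the Dice coefficient is 6/8 = 0.75 and the Jaccard coefficient is 3/5 = 0.6. Shrinking the tolerance to 1 pixel leaves no matches, which shows how strongly the chosen error value can affect the resulting dendrogram.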
Fig. 4 – Clustering with UPGMA using Dice similarity coefficient.
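For illustration, UPGMA can be written in a few lines of Python. This is a generic textbook sketch rather than GelClust's code. Converting a similarity s to a distance as 1 − s, the function names, and returning the tree as nested tuples `(left, right, height)` are assumptions of this example.

```python
def similarity_to_distance(similarity):
    """Turn a similarity matrix (1.0 = identical) into distances, assuming d = 1 - s."""
    return [[1.0 - s for s in row] for row in similarity]


def upgma(labels, distance):
    """Build a UPGMA tree from a symmetric distance matrix (at least one label).

    Leaves are label strings; each internal node is (left, right, height),
    where height is half the average distance between the merged clusters.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): distance[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of clusters (keys are stored as (small, large)).
        (a, b), d_ab = min(dist.items(), key=lambda item: item[1])
        del dist[(a, b)]
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # The distance from the merged cluster to any other cluster is the
        # size-weighted mean of the two old distances (average linkage).
        for c in clusters:
            d_ac = dist.pop((min(a, c), max(a, c)))
            d_bc = dist.pop((min(b, c), max(b, c)))
            dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2.0), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical similarity matrix (for example Dice values) for four lanes.
similarity = [
    [1.00, 0.75, 0.25, 0.00],
    [0.75, 1.00, 0.25, 0.00],
    [0.25, 0.25, 1.00, 0.50],
    [0.00, 0.00, 0.50, 1.00],
]
tree = upgma(["A", "B", "C", "D"], similarity_to_distance(similarity))
# tree == (("A", "B", 0.125), ("C", "D", 0.25), 0.4375)
```

Lanes A and B join first because they are the most similar pair, then C and D, and the two pairs are finally linked at the average of their four cross distances.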
4. Results and discussion
Based on the above, the capabilities of GelClust can be divided into two groups: general and specific. General capabilities are those also found in other similar software, albeit with some differences. These capabilities are:
(1) Accepting different image file formats.
(2) Automatic/semi-automatic detection of columns and their manual correction.
(3) Automatic/semi-automatic detection of bands and their manual correction.
(4) Identifying the marker column.
(5) Identifying the amount of error.
(6) A variety of available clustering algorithms and similarity coefficients.
(7) Saving the resulting dendrogram as an image file.
Fig. 5 – Clustering with Neighbor Joining using Pearson similarity coefficient.
Table 2 – Comparing GelClust with other similar non-commercial software regarding some general capabilities.

| Software name | Accuracy of column detection | Accuracy of band detection | Variety in input formats | Variety in clustering algorithms | Variety in similarity coefficients |
| --- | --- | --- | --- | --- | --- |
| GelAnalyzer | = | + | + | + | + |
| PyElph | + | + | = | − | − |

Table 2 compares GelClust with GelAnalyzer and PyElph regarding some important general capabilities. For this comparison, five different gel images were selected and all three packages were used to analyze them. The five images differed in aspects such as quality and the number of columns and of bands in each column. The images were analyzed by experienced users, who scored the three tools for their performance in different aspects. The results were averaged over the five images and are compared in Table 2. In this table, the signs +, =, and − respectively indicate superiority, equality, and inferiority of GelClust relative to the other software in the corresponding capability. As the table shows, GelClust is better than or equal to the other software in most of the capabilities. The only drawback of GelClust is that it provides less variety of clustering algorithms and similarity coefficients than PyElph. Given that GelClust includes the prominent clustering algorithms and similarity coefficients, and that more algorithms can be added in future work, this is not counted as an important drawback. Another point is the superiority of GelClust in band detection, which might be the most important feature of gel image analysis software. Other general features in which GelClust provides improvement are:
(1) A much more user-friendly and interactive graphical interface, with no need to use the keyboard.
(2) Fewer variables to be adjusted by the user: GelClust has 4 user-determined variables, while PyElph requires the user to set 6. GelAnalyzer, on the other hand, is completely automatic; however, it does not provide any clustering facilities and its accuracy in detecting bands is very poor, so it cannot be regarded as a suitable benchmark for GelClust regarding this feature.
(3) Simpler manual column and band corrections.

The other contributions of GelClust, its specific capabilities, are features found only in the proposed software (i.e., unique capabilities of GelClust). These are:
(1) Automatic detection of smile effects.
(2) The possibility of immediately viewing the results of changing the values of the variables (real-time adjustments).

5. Conclusions and further research

In this paper, after giving an overview of the existing tools for gel electrophoresis image analysis, new software named GelClust is presented with the aim of addressing the shortcomings of the current tools. This software is developed mainly for clustering the samples and does not provide gel quantization facilities yet. According to the results, GelClust is superior to existing software regarding image-processing accuracy. It is also more user-friendly than the existing tools, requiring less user intervention. Furthermore, it provides some unique capabilities: (1) automatic smile effect detection, (2) automatic smile effect correction, and (3) real-time variable adjustment. In general, GelClust is a useful tool applicable in forensics, molecular biology, genetics, microbiology, biochemistry, etc., where gel electrophoresis is an inseparable part of routines. The software can be used by both scientists and students in their day-to-day analysis of gel images. However, some capabilities are still missing in this version of the software and could be added in future versions to improve the applicability of GelClust. These are:
(1) Different image processing algorithms, among which the user can select the one with the best results.
(2) The possibility of accepting a binary matrix as input in place of the image.
(3) Quantization of the gel image.
(4) Options to apply band classification algorithms [27] and to choose the type of algorithm applied, in the first step of clustering.
(5) More clustering methods and similarity coefficients.
6. Availability
The win32 executable file of GelClust is available at: http://www.bmsu.ac.ir/Services/Event/View.aspx?OId=1766.
References
[1] J. Ecker, PFGE and YAC analysis of the Arabidopsis genome, in: B. Birren, E. Lai (Eds.), A Companion to Methods of Enzymology Pulsed Field Electrophoresis, vol. 1, Academic Press, San Diego, 1990, pp. 186–194.
[2] M.C. Elia, J.G. DeLuca, M.O. Bradley, Significance and measurement of DNA double strand breaks in mammalian cells, Pharmacology and Therapeutics 51 (1991) 291–327.
[3] S.N.R. Ranjbar, M.M. Soltan dallal, S. Farshad, Evaluation of a PCR based approach to study the relatedness among Shigella sonnei strains, Iranian Journal of Clinical Infectious Diseases 4 (2009) 163–166.
[4] S.N.R. Ranjbar, A. Ahmadi, et al., Antimicrobial susceptibility and AP-PCR typing of Iranian isolates of Acinetobacter spp. strains, Iranian Journal of Public Health 36 (2007) 50–56.
[5] L.S.B. Joppa, S. Cole, S. Gallagher, Pulsed Field Electrophoresis for Separation of Large DNA, Probe 2 (1992).
[6] M.C.R. Ranjbar, M.R. Pourshafie, M.M. Soltan-Dallal, Characterization of endemic Shigella boydii strains isolated in Iran by serotyping, antimicrobial resistance, plasmid profile, ribotyping and pulsed-field gel electrophoresis, BMC Research Notes 29 (2008).
[7] A.A.R. Ranjbar, G.M. Giammanco, A.M. Dionisi, N. Sadeghifard, C. Mammina, Genetic relatedness among isolates of Shigella sonnei carrying class 2 integrons in Tehran, Iran, 2002–2003, BMC Infectious Diseases 22 (2007).
[8] H.M.R. Ranjbar, A.R. Kaffashian, S. Farshad, An outbreak of shigellosis due to Shigella flexneri serotype 3a in a prison in Iran, Archives of Iranian Medicine 13 (2010) 413–416.
[9] M.S. Abu-Asab, M. Chaouchi, S. Alesci, S. Galli, M. Laassri, A.K. Cheema, F. Atouf, J. VanMeter, H. Amri, Biomarkers in the age of omics: time for a systems biology approach, OMICS 15 (Mar 2011) 105–112.
[10] M. Adamczyk, K. van Eunen, B.M. Bakker, H.V. Westerhoff, Enzyme kinetics for systems biology when, why and how, Methods in Enzymology 500 (2011) 233–257.
[11] Gel Doc EZ System, www.bio-rad.com/prd/en/US/LSR/PDP/./Gel-Doc-EZ-System
[12] L.G. Adams, S. Khare, S.D. Lawhon, C.A. Rossetti, H.A. Lewin, M.S. Lipton, J.E. Turse, D.C. Wylie, Y. Bai, K.L. Drake, Multi-comparative systems biology analysis reveals time-course biosignatures of in vivo bovine pathway responses to B. melitensis, S. enterica Typhimurium and M. avium paratuberculosis, BMC Proceedings 5 (Suppl. 4) (2011) S6.
[13] A. Aderem, J.N. Adkins, C. Ansong, J. Galagan, S. Kaiser, M.J. Korth, G.L. Law, J.G. McDermott, S.C. Proll, C. Rosenberger, G. Schoolnik, M.G. Katze, A systems biology approach to infectious disease research: innovating the pathogen–host research paradigm, MBio 2 (2011), e00325-10.
[14] N.J. Afacan, C.D. Fjell, R.E. Hancock, A systems biology approach to nutritional immunology – focus on innate immunity, Molecular Aspects of Medicine 33 (February) (2012) 14–25.
[15] A. Agusti, P. Sobradillo, B. Celli, Addressing the complexity of chronic obstructive pulmonary disease: from phenotypes and biomarkers to scale-free networks, systems biology, and P4 medicine, American Journal of Respiratory and Critical Care Medicine 183 (May) (2011) 1129–1137.
[16] P. Gerner-Smidt, L.M. Graves, S. Hunter, B. Swaminathan, Computerized analysis of restriction fragment length polymorphism patterns: comparative evaluation of two commercial software packages, Journal of Clinical Microbiology 36 (May) (1998) 1318–1323.
[17] A. Rementeria, L. Gallego, G. Quindos, J. Garaizar, Comparative evaluation of three commercial software packages for analysis of DNA polymorphism patterns, Clinical Microbiology and Infection 7 (June) (2001) 331–336.
[18] G. Cardinali, A. Martini, R. Preziosi, F. Bistoni, F. Baldelli, Multicenter comparison of three different analytical systems for evaluation of DNA banding patterns from Cryptococcus neoformans, Journal of Clinical Microbiology 40 (June) (2002) 2095–2100.
[19] A.B. Pavel, C.I. Vasile, PyElph – a software tool for gel images analysis and phylogenetics, BMC Bioinformatics 13 (2012) 9.
[20] GelAnalyzer, http://www.gelanalyzer.com/
[21] G. Cardinali, A. Martini, Critical observations on computerized analysis of banding patterns with commercial software packages, Journal of Clinical Microbiology 37 (March) (1999) 876–877.
[22] M.C.R. Sokal, A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin 38 (1958) 1409–1438.
[23] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Molecular Biology and Evolution 4 (July) (1987) 406–425.
[24] L.R. Dice, Measures of the amount of ecologic association between species, Ecology 26 (1945) 297–302.
[25] D.J. Rogers, T.T. Tanimoto, A computer program for classifying plants, Science 132 (1960) 1115–1118.
[26] H.E.Y. Soper, A.W. Cave, B.M.A. Lee, K. Pearson, On the distribution of the correlation coefficient in small samples. Appendix II to the papers of “Student” and R.A. Fisher. A co-operative study, Biometrika 11 (1917) 328–413.
[27] G. Cardinali, F. Maraziti, S. Selvi, Electrophoretic data classification for phylogenetics and biostatistics, Bioinformatics 19 (November) (2003) 2163–2165.