Combinatorial selection of short triplex forming oligonucleotides for fluorescence in situ hybridisation COMBO-FISH




Journal of Computational Science 3 (2012) 328–334


Journal of Computational Science journal homepage: www.elsevier.com/locate/jocs

Eberhard Schmitt a,b,∗, Jenny Wagner a, Michael Hausmann a

a Kirchhoff Institute for Physics, Heidelberg University, Im Neuenheimer Feld 227, D-69120 Heidelberg, Germany
b Institute for Numerical and Applied Mathematics, University of Goettingen, Lotzestr. 16-18, D-37083 Goettingen, Germany

Article info

Article history: Received 30 October 2010; received in revised form 30 August 2011; accepted 3 October 2011; available online 18 October 2011

MSC: 68P05; 68P10; 92D20

Abstract

For the selection of short triplex forming oligonucleotides for COMBO-FISH hybridisations, which are used in research and medicine for investigations of the 3D structure of the nucleus, the human genome has to be scanned for polypurine or polypyrimidine sequences which colocalise at the desired genomic region. Further binding sites of the selected oligonucleotides are required not to cluster anywhere else. We present an implementation of algorithms which design such oligonucleotide sets and exemplify the existence of COMBO-FISH probe sets by designing labelling sets for 29 genes and subsets thereof. The algorithms can be trivially parallelised and run on clusters, grids, and clouds. © 2011 Elsevier B.V. All rights reserved.

Keywords: Genomic sequence search; Sequence clusters; Lymphocytes and tissue sections; Labelling of vital cells

∗ Corresponding author at: Kirchhoff Institute for Physics, Heidelberg University, Im Neuenheimer Feld 227, D-69120 Heidelberg, Germany. Tel.: +49 6221 549815; fax: +49 6221 549112.
1877-7503/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.jocs.2011.10.001

1. Introduction

In biological research and in medical diagnostics, the investigation of the cell nucleus and its nano-architecture [1], of the distribution and geometry of chromatin [2], and of its functional relations to physiological processes [3] has gained importance alongside the long-standing question of the multiplicity of genes or other genetic elements [4]. This has led to an increasing demand for methods that label short DNA sequences whose exact localisation within the nucleus is then detected by advanced microscopic methods.

The general method of fluorescence in situ hybridisation (FISH) is based on attaching a molecule with detectable labels to the molecule of interest [5,6] whose existence is to be shown or whose occurrences are to be counted. For DNA hybridisation, one makes use of the fact that a free single stranded DNA molecule with appropriate markers binds to its complementary counterpart to be labelled within the chromosome. In its first version, DNA hybridisation was achieved by applying a solution of appropriate DNA sequences to the nucleus. This requires denaturation and fixation of the specimen, which leads to severe changes of the three dimensional nuclear structure. The DNA sequence applied was prepared by biochemical amplification (so called YAC-, BAC-, or X-clones) of some reference sequence which was cut out of the genome by enzymes. In this regard, the primary information for the labelling procedure was the knowledge of these specific cutting locations within the genome rather than the sequence of the desired genetic region. Furthermore, the length of the labelling probe was determined by the properties of these enzymatic and biochemical processes and not by the desired specificity of the marker. Nevertheless, the method is still applied with great success in various fields of research and diagnostics and has a fundamental impact on present-day medical care.

The conventional FISH method as described above has two major drawbacks, namely the rather unspecific labelling with a very long DNA probe and the destruction of the original three dimensional structure of the nucleus. While the latter is a general problem for research whenever relations between genetic alterations and the 3D structure of the chromatin are to be evaluated, the former is already a problem for labelling itself, because a more specific and exactly targeted labelling of a specific genomic region is required to get access to information about changes in copy number of genes or their parts, e.g., during tumour development. Therefore, in this case, the labelling should be based on the sequence of the



Fig. 1. Snapshot of a molecular dynamics simulation showing triplex structures of the two classical Hoogsteen pairs for parallel binding: C+*GC (left) and T*AT (right).

respective genomic region with carefully determined start and end points.

Both drawbacks can be overcome by the design of short oligonucleotide sequences which bind to the desired chromosome sequence. Such an approach has become feasible since the human genome has been sequenced in total [7] and since procedures are known to synthesise predetermined DNA sequences on a purely chemical basis. In principle, these two problems concerning hybridisation are independent. A specific labelling can already be achieved with a rather short sequence of about a hundred bases which is unique in the genome. In our experience, such a sequence can be found near any genomic region. On the other hand, the currently available chemical methods are not suited for a stable synthesis of DNA molecules of this length.

The other serious problem is the preservation of the 3D structure of the genomic architecture during the hybridisation process and the subsequent acquisition of the microscopic images. It has been shown that the application of triplex forming oligonucleotides (TFOs) can improve the situation dramatically, because denaturation of the specimen can be omitted. TFOs bind as a third strand into the major groove of the native double strand, at least for TFO lengths between 15 and 30 bases. On the other hand, oligonucleotides of such a length are far from being specific and appear up to several hundred times in the genome. The solution to this problem consists of designing a set of short oligonucleotides which bind to the selected region, but do not colocalise elsewhere in the genome. A priori, it is by no means clear that such a set exists. In this paper, we design and implement an algorithm to search for such sets and apply it to 29 diagnostically important genes to show the existence of the desired combination of short oligonucleotides.

2. Binding combinatorics

The principle of COMBO-FISH is based on a set of short oligonucleotides of 15–30 bases length which bind to the region of the genome to be labelled with fluorochromes [8–10]. Each oligonucleotide carries only one or two fluorochromes, so about 20 to 30 stretches have to colocalise within a small region to produce a spot which is detectable in the microscope. Furthermore, to guarantee the specificity of the labelling, this colocalisation spot should be the only one in the whole genome. Further binding locations of single oligonucleotides, which cannot be excluded, will contribute to the background noise. In principle, labelling can be performed with any DNA sequence when hybridisation

to single stranded denatured DNA is tolerable, although this destroys much of the three dimensional structure. Here, however, we focus on methods which also preserve the native architecture of the chromatin as much as possible. To avoid denaturation, hybridisation is performed with triplex forming oligonucleotides which bind as a third strand to the double strand. Such sequences exhibit special combinatorial properties which we describe first.

The first publication of experimentally determined binding structures of DNA nucleotides dates back to 1959, when Hoogsteen's paper [11] appeared, but it took another eight years until it was clear [12] that these double nucleotides were part of the unusual triple helical structures formed when a third strand binds into the major groove of a native double strand DNA. Later, they were called Hoogsteen pairs in distinction to Watson–Crick pairs. In their original form, the conditions for the formation of triplex strands can be summarised as follows. The sequence of the native double strand must consist exclusively of purines (bases A, adenine, and G, guanine) or of pyrimidines (C, cytosine, and T, thymine) in the binding region. Note that due to the pairing rules the opposite strand of a polypurine strand is a polypyrimidine strand and vice versa. The third strand then binds to the purine strand in one of two ways. Either it is a polypurine strand, in which case A binds to A and G to G, but the strand direction is reversed; or it is a polypyrimidine strand binding C to G and T to A, the polypyrimidine strand running in the same direction as the purine strand. This in turn means that it has the reverse direction with respect to the polypyrimidine strand of the double helix it binds to. In both cases, the third strand (the triplex forming oligonucleotide, TFO) has the same sequence as one of the strands of the double strand, but the opposite orientation.

We have included a snapshot (Fig. 1) of our own molecular dynamics simulation [13] showing the triplex structures for the two classical parallel binding Hoogsteen pairs. The biochemical properties, including the description of the H-bonds, can be found in the literature, but they are not important for the bioinformatical questions. Nevertheless, they are very important for the actual choice and synthesis of such strands, as different molecule types (locked nucleic acids, LNAs [14]; peptide nucleic acids, PNAs [15]; twisted intercalating nucleic acids, TINAs [16]; etc.) react differently under special physiological conditions and define the hybridisation properties. From a bioinformatical point of view, it is equivalent to design a set of polypurines or of polypyrimidines, which we call TFOs for short. The decision to realise the oligonucleotide probe set as one or the other depends on the experimental conditions.
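The pairing rule stated above (the Watson–Crick partner of a polypurine strand is a polypyrimidine strand and vice versa) can be checked with a few lines of Python. This is only an illustrative sketch; the function names `strand_class` and `opposite_strand` and the example sequence are hypothetical and not taken from the implementation described in this paper.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def strand_class(seq):
    """Classify a DNA sequence as 'polypurine', 'polypyrimidine' or 'mixed'."""
    bases = set(seq.upper())
    if bases and bases <= PURINES:
        return "polypurine"
    if bases and bases <= PYRIMIDINES:
        return "polypyrimidine"
    return "mixed"

def opposite_strand(seq):
    """Watson-Crick partner strand, read in its own 5'->3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]

target = "AGGAGAAGGGAGAGA"  # hypothetical 15-base polypurine stretch
print(strand_class(target))                   # polypurine
print(strand_class(opposite_strand(target)))  # polypyrimidine
```

Because both strands of such a stretch are homogeneous, a search for candidate binding sites only has to look for uninterrupted runs over {A, G} or over {C, T}.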



Fig. 2. The algorithmic design of the COMBO-FISH set is based on the location data of the gene of interest. The experimental results lead to optimisation and application of the set and may modify the biomedical question.

3. Problem formulation and data structures

The problem of labelling specific genomic sections by hybridising suitable oligonucleotide sequences with marker molecules to the corresponding DNA single or double strand arises in research and development applications as well as in routine diagnostics in standard medical applications (cf. Fig. 2). Within the underlying biomedical question, a gene or any other genetic region of interest (genetic ROI for short) is determined by specifying the location of its first and last nucleotide. These two numbers are the input to the algorithm, which determines all TFOs within the sequence of the gene of interest and subsequently determines a uniquely colocalising subset of these TFOs. This COMBO-FISH set is then investigated in test experiments for binding efficiency, fluorescence intensity, background noise, and other parameters important for the experimental procedure and subsequent image evaluation. If the experiments are not successful, the experimental procedure is checked and may be revised, or the COMBO-FISH set may be redesigned in an optimisation step using additional restrictions, e.g., based on considerations concerning binding efficiency, as depicted in Fig. 2. The additional constraints can be incorporated into the algorithm. Examples of how this interplay between medical diagnostics, industrial probe supply, and academic research can be realised and incorporated into a business model are described in [17]. If the COMBO-FISH set proves successful in the test experiments, it is used in research or medical applications in the laboratory or in routine tests, whose results can lead to reformulations and modifications of the biomedical question.

The bioinformatical problem of designing such a uniquely colocalising COMBO-FISH set of short triplex forming oligonucleotides can be formulated as follows.

Find a set of N TFOs P_i, 1 ≤ i ≤ N, with P_i having length l_i, L1 ≤ l_i ≤ L2, within the desired genomic region (specified by the start and end nucleotide locations), such that no M + 1 of these P_i bind to any other genomic region of length l ≤ S, S being a user defined parameter.

The formulation of this problem contains the following parameters, which can be adjusted to the special situation under consideration: the minimal length L1 and the maximal length L2 of the oligonucleotides, the maximal cluster size M of probes which are allowed to bind within a sequence of length S outside the labelling site, and the number N of probes to be determined. For our applications, we have chosen L1 = 15 and L2 = 25 as limits for the probe size and M = 5 and S = 250 kb for the maximal possible background clusters. In the present implementation, N is a result of the search, as it is the maximal number of oligonucleotides found which are consistent with the constraints. With the currently available microscopic methods, the probes can be used for labelling if N ≥ 20.

The search and application of the algorithm could be performed on the generic sequence database of the human genome as provided by the NCBI [18], but we prefer to establish our own database for this purpose, as the TFO sequences (polypurines and polypyrimidines) of a minimal length of 15 bases comprise only about 4% of the genome. To this end, the original contig files of all chromosomes are split into two parts: an annotation file containing all the descriptive information about the genes (names, locations, introns, exons, etc.), and a sequence file containing the sequence of nucleotides of the contig. Our database P for the COMBO-FISH algorithm is derived from these sequence files by reading each file and writing all the polypurine and polypyrimidine sequences of a minimal length of L1 = 15 bases to a TFO sequence file, including the location information. One entry T(i) in such a file F_j consists of the number B(i) of the beginning nucleotide, followed by the sequence O(i) of the nucleotides in ASCII notation. The set of all these TFO sequence files is the database P = {F_j} we use as input for our COMBO-FISH set design algorithm. The number of files F_j depends on the genome sequence release and the contig structure used. It can be one file per chromosome, but we also use the dissected chromosome notation, which amounts to at most 64 files per chromosome. Note that using this database implies L1 ≥ 15.

On the other hand, the annotation parts are used to determine the location of the genetic region of interest when the name of the gene is given. It should be noted that the sequence contig files are meanwhile based on the four letter alphabet (acgt), omitting the non-IUPAC terminology for groups of nucleotides. Polymorphisms are denoted in the annotation part.
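The construction of one TFO sequence file F_j from a contig sequence can be sketched as follows. This is a minimal illustration rather than the authors' C implementation: the function name `tfo_entries`, the use of a regular expression, and the 1-based numbering of B(i) are assumptions made here for the example.

```python
import re

def tfo_entries(sequence, min_len=15):
    """Yield (B, O) pairs for every maximal polypurine or polypyrimidine run.

    B is the 1-based position of the first nucleotide of the run (assumed
    convention), O is the run itself; runs shorter than min_len are skipped.
    """
    pattern = re.compile(r"[ag]{%d,}|[ct]{%d,}" % (min_len, min_len))
    for match in pattern.finditer(sequence.lower()):
        yield match.start() + 1, match.group()

# Hypothetical toy contig: a 16-base purine run, a 9-base pyrimidine run
# (too short), an unknown base, and a 15-base pyrimidine run.
contig = "tt" + "ag" * 8 + "c" + "ct" * 4 + "n" + "c" * 15
for begin, run in tfo_entries(contig):
    print(begin, run)
```

Since the quantifier is greedy and a run that is too short fails at every one of its start positions, each reported entry is a maximal run, which matches the description of the database entries T(i) = (B(i), O(i)).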


4. Solution and algorithms

Given the location of the genomic region of interest, which is determined from the annotation file of the respective chromosome in the human genome, the solution of the COMBO-FISH set problem is divided into three steps (Fig. 3): TFO collection from the database P, location search of all candidate TFOs within P, and removal of non-intended clusters. Whereas the second step is purely deterministic, the first and third steps contain elements which are implemented in a heuristic way, based on the experience gained by designing COMBO-FISH sets for several genes with user interaction.

Algorithm 1. Algorithm to collect all TFOs within the genetic region of interest. Inputs are P, the start and end points of the genetic region of interest, and the length constraints of the TFOs. Output is the list of all TFOs within the genetic ROI.

function: TFO_list = TFO_collection(start_pos, end_pos, L1, L2)
  initialise TFO_list = vector(TFO_names)
  for 1 ≤ i ≤ length(database) do
    TFO(i) = read(database, i)    {read i-th position/sequence entry in database}
    if B(i) ≥ start_pos AND B(i) ≤ end_pos then
      if length(TFO(i)) ≥ L1 then
        if length(TFO(i)) > L2 then
          TFO(i) = truncate(TFO(i), L2)    {length(TFO(i)) = L2 afterwards}
        end if
        TFO_list = append(TFO_list, TFO(i))    {append TFO(i) to list of TFOs within the genetic ROI}
      end if
    end if
    if B(i) > end_pos then
      break
    end if
  end for
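Algorithm 1 translates almost line by line into Python. The sketch below assumes that the database is available as an iterable of (B, O) pairs sorted by ascending B, and it truncates over-long entries to their first L2 bases; the paper describes a heuristic truncation that avoids repetitive parts, which is not reproduced here. The name `tfo_collection` mirrors the pseudocode, everything else is illustrative.

```python
def tfo_collection(database, start_pos, end_pos, l1=15, l2=25):
    """Collect the TFO entries whose start position lies inside the ROI.

    database: iterable of (B, O) pairs sorted by ascending start position B.
    Entries longer than l2 are cut to their first l2 bases (simplification).
    """
    tfo_list = []
    for begin, seq in database:
        if begin > end_pos:
            break  # the database is sorted, nothing further can lie in the ROI
        if begin >= start_pos and len(seq) >= l1:
            tfo_list.append((begin, seq[:l2]))
    return tfo_list

# Hypothetical toy database with four entries.
database = [(5, "a" * 15), (40, "ag" * 15), (90, "c" * 14), (300, "t" * 20)]
print(tfo_collection(database, start_pos=10, end_pos=100))
```

In the toy call only the entry at position 40 is returned: the entry at 5 lies before the ROI, the one at 90 is shorter than L1, and the one at 300 triggers the early exit.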

1. TFO Collection: In this first step, all admissible TFO sequences Q_j, 1 ≤ j ≤ n, within the genomic region to be labelled are collected. In particular, the polypurine and polypyrimidine sequences of length ≥ L1 between the first and the last base of the genetic element of interest (genetic ROI) are extracted from the database P. The database contains all such sequences and information on their location for all contigs in contig specific files, but here we only need to consider the contig file containing the genetic ROI. Sequences from P with length l_j > L2 are truncated to L2 in a heuristic way avoiding repetitive parts. In some cases, two or more sequences Q can be cut out of the respective database sequence entry. The final selection of this process is stored in a list containing the TFO sequences, which we call the basis set B = {Q_j}, and which is the input to step 2, Location Search.

2. Location Search: All the occurrences of all sequences Q_j within the genome are located in the same database P by a straightforward algorithm walking through P and searching all Q_j consecutively in each polypurine or polypyrimidine sequence entry. Here, the duplex forming locations also have to be listed, as binding to the natural opposite strand contributes to clusters and background noise as well. All occurrences of Q_j, 1 ≤ j ≤ n, and all clusters of more than M of the sequences Q_j within a genomic sequence of S base pairs are determined and stored in the input list C for step 3. This search strategy allows the clusters of size > M within a genomic sequence of length ≤ S to be determined concurrently with the locations of the Q_j. For the sake of clarity, we formulate Algorithm 2 in a way that decouples the three loops: location search, location sort, and cluster listing.

3. Declustering: Iteratively, sequences Q_j which appear in the clusters from step 2 are removed until no such clusters exist any more. Like step 1 (TFO Collection), step 3 (Declustering) differs from step 2 (Location Search), which is completely deterministic. For declustering, there are many instances which are decided by heuristic arguments, and therefore there are reasons to repeat this step with


different choices and to compare the results. Essentially, the clusters are inspected and sequences Q_j from the basis set are removed until no clusters remain. In our experience, this is best achieved in two stages. In the first, all sequences are removed which occur more than twenty times on one chromosome. This can already be done while defining the basis set. Usually, these are sequences which have a repetitive character. In the main stage, a greedy removal seems to be most promising: iteratively remove those sequences which occur most often in all clusters. This is done until there are no clusters left. On average, this process reduces the basis set by 10–50%. The remaining set R = {R_i, 1 ≤ i ≤ N} ⊂ {Q_j, 1 ≤ j ≤ n} is the resulting COMBO-FISH set.

Algorithm 2. Algorithm to search the entire database for the TFOs that are contained in the genetic ROI. Inputs are the database itself, the list of TFOs in the genetic ROI obtained from Algorithm 1, and the numbers M and S. The output is a list of clusters of more than M TFOs having a distance of less than S base pairs and occurring outside the genetic ROI.

function: cluster_list = location_search(TFO_list, database, M, S)
  initialise location_list = array(locations, TFO_names), cluster_list = vector(clusters), cluster = vector(TFO_names)
  for 1 ≤ j ≤ length(database) do
    for 1 ≤ i ≤ length(TFO_list) do
      possible_locations = find_location(TFO(i), database(j))    {find all locations in each database entry where TFO(i) occurs}
      if possible_locations ≠ {} then
        location_list = append(location_list, possible_locations, TFO(i))    {write locations and TFO names in location list}
      end if
    end for
  end for
  location_list = ascend_sort(location_list, 1)    {sort list of locations in ascending order of locations}
  for 1 ≤ i ≤ length(location_list) do
    j = find_idx(location_list, location_list(j) > location_list(i) + S)    {j is first index s.t. location of (entry j) > location of (entry i) + S}
    TFO_number = j − i
    if TFO_number > M then
      cluster = location_list.TFOnames(i → j − 1)
      cluster_list = append(cluster_list, cluster)
    end if
  end for

Algorithm 3. Algorithm to remove those TFOs from the COMBO-FISH set that also occur in other genomic regions. Inputs are the list of TFOs from Algorithm 1 and the list of clusters from Algorithm 2, together with the maximal allowed TFO number M within S base pairs. Output is the COMBO-FISH set that consists of those TFOs that do not cluster anywhere else.

function: COMBO_set = declustering(TFO_list, cluster_list, M)
  initialise TFO_occur = array(TFO_list, occurrence = 0)
  initialise COMBO_set = TFO_list
  while length(cluster_list) > 0 do
    for 1 ≤ i ≤ length(cluster_list) do
      TFO_number(i) = length(cluster(i))
      if TFO_number(i) ≤ M then
        cluster_list = remove(cluster_list, cluster(i))    {remove cluster(i) from list; clusters of at most M TFOs are tolerated}
      end if
    end for
    if length(cluster_list) = 0 then
      break
    end if
    TFO_occur = find_occurrence(COMBO_set, cluster_list)    {find occurrences of all TFOs in cluster list}
    max_idx = find_idx(max(TFO_occur))
    max_TFO = COMBO_set(max_idx)    {remember the most frequent TFO before the set is modified}
    COMBO_set = remove(COMBO_set, max_TFO)    {remove most frequent TFO from COMBO set}
    cluster_list = remove(cluster_list, max_TFO)    {remove all entries of this TFO in cluster list as well}
  end while
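The greedy removal of Algorithm 3 can be written compactly in Python. The sketch assumes that every cluster is a list of TFO identifiers taken from `tfo_list` and that clusters of at most M hits are tolerated, in line with the problem formulation (no M + 1 probes within S bases). Ties between equally frequent TFOs are broken by first occurrence, which is an arbitrary choice made here and not a rule from the paper.

```python
from collections import Counter

def declustering(tfo_list, cluster_list, m=5):
    """Greedily drop the most frequent TFO until no cluster exceeds m hits."""
    combo_set = list(tfo_list)
    clusters = [list(cluster) for cluster in cluster_list]
    while True:
        # Clusters that have shrunk to m hits or fewer are no longer a problem.
        clusters = [cluster for cluster in clusters if len(cluster) > m]
        if not clusters:
            return combo_set
        counts = Counter(tfo for cluster in clusters for tfo in cluster)
        worst, _ = counts.most_common(1)[0]
        combo_set = [tfo for tfo in combo_set if tfo != worst]
        clusters = [[tfo for tfo in cluster if tfo != worst] for cluster in clusters]

# Hypothetical identifiers: "q1" dominates both clusters and is removed first.
tfos = ["q1", "q2", "q3", "q4"]
clusters = [["q1", "q1", "q2", "q3"], ["q1", "q1", "q2", "q4"]]
print(declustering(tfos, clusters, m=3))  # ['q2', 'q3', 'q4']
```

Each pass removes at least one hit from every remaining cluster that contains the chosen TFO, so the loop terminates after at most as many passes as there are distinct TFOs in the clusters.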

5. Implementation

We have implemented a version of the above described Algorithms 1, 2, and 3, and the database P, on a Linux computer and a grid cluster. To establish the database P, we downloaded the human genome files from NCBI [18] via its ftp server, which takes between one and several hours, and extracted all polypurine and polypyrimidine sequences of a length ≥ 15 from the sequence files of the chromosomes by a straightforward algorithm whose running time is of the same order of magnitude. But this has to be done only once.

The algorithms for 1. TFO Collection, 2. Location Search, and 3. Declustering, as described in Algorithm 1, Algorithm 2, and Algorithm 3, are implemented in three different C programs. TFO Collection needs as input parameters L1 and L2 and the location of the genomic element, and outputs the list of the basis set Q_j, 1 ≤ j ≤ n, together with n. This is the input to Location Search, which, in addition, needs M and S. The run time of both programs for a 200 kb genetic element is less than a minute, most of the time being spent on input operations from P. The resulting list C of locations and clusters is the input to the third program for declustering.

There are two versions of Declustering (cf. Fig. 3). In the interactive mode, information about the occurrence of each remaining sequence Q_j on every single chromosome, as well as the composition of all clusters, is printed to the screen and is used to decide which sequences to remove. Removed sequences can also be reintroduced. In this way, sequences Q_j are iteratively removed by the user until no clusters remain. It should be noted that all intermediately stored files (defined as input and output files in Algorithms 1–3) are ASCII files and can also be edited manually. This is also the reason why we kept the original distinction between purines and pyrimidines, to stay close to the original data format. Using the interactive mode for several COMBO-FISH set designs, we gained the experience needed to implement the greedy removal approach described above in the automatic version, which runs on a stand-alone basis and outputs the final COMBO-FISH set without manual intervention within a few minutes.

Fig. 3. Based on the location information: (1) the TFOs within the genomic ROI are collected from the database, (2) their location in the whole genome is determined, and (3) clustering TFOs are removed. The last step can be performed with or without user interface (UI).

6. Application results

In a comparative 'experiment' for the design of a COMBO-FISH set, we determined the basis sets and their genome positions and clusters for 29 genes automatically. Thereafter, the declustering was also performed automatically in a computer run with the heuristic approach, and the final COMBO-FISH set was stored as the result (see Table 1). This took about 10 min per gene. To compare it with manual performance, the same starting sets were investigated and declustered by use of the program which permits user interaction. In all cases, both methods delivered a final set of uniquely colocalising TFOs, which means that COMBO-FISH seems to be feasible in general. The resulting sets did not differ much. In some cases, the manually generated final set was larger, because the user could see interrelations between the clusters earlier. On the other hand, we also tried a more sophisticated version of the automatic implementation in which the TFO sequence removed last was reintroduced and another TFO was removed for further declustering. In some cases, this extended iterative approach led to larger TFO sets than the manual procedure. Altogether, the automatic approach based on heuristic experience does not seem to be worse than the approach using manual editing via the user interface, which may capture more complex relations between clusters.

In addition to the whole genes, COMBO-FISH sets for the beginning and the end parts of the genes were also designed. In general, these sets contained additional sequences which had been discarded in the final set for the total gene sequence. But as the genetic ROIs in these cases are much shorter than the whole gene, the basis sets B are smaller and so are the final COMBO-FISH sets R, which very often had 10 or fewer members and did not allow detection with standard hybridisation and microscopy.

7. Performance analysis

As mentioned above, a complete run of the three algorithms for a typical 200 kb genetic element takes a few minutes. In practice, the complexity of all three algorithms is linear in the number of TFOs within the genetic ROI, which in turn is proportional to its length. Algorithm 1 is in essence a copy of the vector of TFO entries in P to the intermediate file containing the list of TFOs in the genetic ROI. Algorithm 2 has been formulated as a succession of three loop elements (the first one being a double loop), but in a skilled implementation the three loops are merged into one big loop over the database entry index j in P, which contains an inner loop that searches all occurrences of all TFOs from B in the respective database entry. These then have to be sorted, but due to the restricted length of such TFOs in the human genome, there are seldom more than two TFOs of the genomic ROI contained in one entry of the database, which is itself sorted in ascending order. Thus, in each step the list to be appended is almost sorted by default, and the clusters can easily be tracked at the same time.

All processes are trivially parallelisable by loop unrolling, and in several obvious ways. For one gene, the TFO list of the genetic ROI can be split up; the database can be distributed, which is the preferable parallelisation; or the declustering algorithm, the performance of which is subject to ongoing research, can be distributed in different versions with different optimisation rules. But these aspects become interesting only if more time consuming searches are done. This can be the case for large sets of genetic elements, especially when COMBO-FISH sets are designed which are to distinguish between different locations, e.g., for multicolour experiments. This question also arises when the method is extended to non-Hoogsteen pairings or hybridisation with mismatches.
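Distributing the database, the parallelisation named above as preferable, amounts to running the location search independently on each TFO sequence file and merging the hits afterwards. The following sketch illustrates this with Python's standard `concurrent.futures` module. The function names, the in-memory representation of the files as a dictionary, and the use of threads are assumptions for the example; a CPU-bound production run would rather use separate processes or cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def search_file(job):
    """Locate every TFO in one database file; returns (file_id, position, tfo) hits."""
    file_id, entries, tfo_list = job
    hits = []
    for begin, seq in entries:
        for tfo in tfo_list:
            offset = seq.find(tfo)
            while offset != -1:
                hits.append((file_id, begin + offset, tfo))
                offset = seq.find(tfo, offset + 1)
    return hits

def parallel_location_search(files, tfo_list, workers=4):
    """files: dict mapping a file id to its list of (B, O) entries."""
    jobs = [(file_id, entries, tfo_list) for file_id, entries in files.items()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(search_file, jobs))
    # Merge the per-file hit lists into one list sorted by file and position.
    return sorted(hit for hits in results for hit in hits)

# Hypothetical toy database split into two files.
files = {"chr1": [(10, "agagag")], "chr2": [(5, "ccagag")]}
print(parallel_location_search(files, ["agag"]))
```

Because the files do not overlap, no communication between the workers is needed until the final merge, which is why this decomposition maps directly onto clusters, grids, or clouds.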
With such extensions, the database P will increase dramatically in size, or one will have to work with the entire genome sequence database. Then, parallelisation seems to be indispensable. To this extent, the algorithms are also prepared for grid or cloud applications.

Table 1. Gene names with chromosome and base positions and sizes of basis set B and final COMBO-FISH set R.

Gene            Position on chromosome   Base position            |B|   |R|
NRAS            1p13.2                   4773158..4824158          62    35
AKT3            1q44                     1714365..2069716          55    37
MSH2            2p22.3-p22.1             5331083..5503008          80    49
GNLY            2p12-q11                 17295832..17300298       113    33
RASSF1          3p21.3                   587156..598306           134    61
FHIT            3p14.2                   3106362..3371407          84    50
PIM1            6p21.2                   27935054..27940329       101    52
ABCB1 (MDR)     7q21.1                   12367353..12576743        85    54
MET             7q31                     41489038..41613011        90    26
CMYC            8q24.12-q24.1            191836..197827           113    57
CDKN2A (p16)    9p21                     21935774..21963322       101    53
PTEN            10q23.3                  897571..1000507           66    41
ATM             11q22.3                  11636339..11779887        44    27
KRAS2           12p11.2                  2778774..2824914          69    43
RB1             13q14                    17452380..17630614        64    34
PNN             14q13                    19564443..19572199        52    32
SNRPNup         15q12                    3595436..3750373          40    21
IGF1R           15q25-q26                297646..606403           101    59
FANCA           16q24.3                  610999..690100            49    32
D17S125         17p12-p11.2              7147833..7148031         105    30
ERBB2           17                       1593840..1622888         159    77
LAMA3           18q11.2                  2933849..3024059         120    59
AKT2            19q13.1-q13.2            13007022..13060073        81    43
MYBL2           20q13.1                  7348623..7398038          77    39
PTPN1           20q13.1-q13.2            14179798..14253995        78    45
ZNF217          20q13.2-q13.3            17236471..17252494        69    39
PCNT2           21qtel                   3053753..3175376          71    38
PDGFB (SIS)     22q13.1                  18834148..18855844       114    60
TBX1            22q11.2                  2892106..2918996          64    38

8. Related work

We started our work with proof of principle experiments, designing the COMBO-FISH sets in an interactive mode with preliminary versions of the algorithms presented here [8]. Since then, the algorithms have converged to the present implementation, which has now been used to automatically generate the COMBO-FISH sets for the genetic ROIs listed in Table 1. Meanwhile, a stable experimental protocol has also been established [10]. To our knowledge, the development of algorithms for the search for combinations of specifically colocalising TFOs has not been considered by anyone else, though the search for TFOs within the genome seems to have become standard in the meantime. The methods known to us (e.g., [19]) are all based on BLAST (Basic Local Alignment Search Tool) searches [20] and need additional evaluation. When they are based on GUIs (graphical user interfaces), it is hard to use them for a search in a whole genome, because they are not suited for high throughput TFO search.

9. Discussion and conclusions

As experiments have shown, COMBO-FISH is a feasible method in research and medical diagnostics to investigate chromosomal nanostructure. In this method, a set of short oligonucleotides is designed and hybridised to chromatin in such a way that the oligonucleotides colocalise exclusively in a predefined genomic region and form no clusters anywhere else which are microscopically distinguishable from the usual background noise. The associated bioinformatical problem, namely to find combinations of short oligonucleotides with such properties, has been solved in this paper for the case of triplex forming oligonucleotides, which have remarkable additional properties when used in COMBO-FISH experiments. The bioinformatical advantage is the restriction of all searches and subsequent declustering procedures to polypurine

and polypyrimidine stretches, which make up only two to four percent of mammalian genomes. Therefore, an efficient strategy on a reduced database can be algorithmically implemented. To our experience, and as shown here in the application to 29 genes, the design of a COMBO-FISH probe set should be possible in any case, which in itself is a remarkable result and raises the question how randomly or chaotically the triplex forming sequences are distributed in the genome. In COMBO-FISH protocols with TFOs, the denaturation step can be omitted and thus the natural structure and nano-architecture of the chromatine are much better preserved than in usual FISH methods. This opens new perspectives for research and diagnostics. In particular, COMBO-FISH with TFOs can be applied to vital cells and may allow to follow the further development of the labelled cell in vivo. When extended to the full COMBO-FISH method with arbitrary sequences, the combinatorics of designing an appropriate COMBO-FISH set is not restricted to the relatively small part of TFO sequences, and therefore a much more focused labelling should be possible. On the other hand, the bioinformatical problem is much more involved, because then the whole genome has to be taken into consideration. Searches take orders of magnitude more time, and the question is, whether commonly applicable methods like BLAST [20] and its dialects can be utilised. In our case, we did not find major advantages, because editing of BLAST output seems to be more involved than doing the search on the reduced sequence database. Presently, our database and algorithms are transformed to a more efficient coding which will still increase performance by some factors. On the other hand, the possibility of interacting by editing ASCII files will be lost. In addition, different types of sequences other than polypurines and polypyrimidines will be included, as non-Hoogsteen triplexes have been found to exist as well [21,12]. 
Furthermore, the question arises, in how far mismatches interfere and allow hybridisation to sequences which are not exactly complementary to the probe sequence. In this context, the computation of binding energies plays an important role, but presently not much seems to be known.
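The restriction of the search space to polypurine and polypyrimidine stretches, discussed above, can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `find_homo_stretches` and `covered_fraction` and the minimum length `min_len` are assumptions chosen for illustration only. The sketch scans one strand of a DNA string for maximal runs of purines (A, G) or pyrimidines (C, T) and reports the fraction of the sequence they cover, which is the quantity quoted as two to four percent for mammalian genomes.

```python
import re


def find_homo_stretches(seq, min_len=15):
    """Return (start, end, kind) for each maximal polypurine or
    polypyrimidine run of at least `min_len` bases.

    Coordinates are 0-based and end-exclusive. `min_len` is a
    hypothetical threshold, not a value taken from the paper.
    """
    # Purine runs consist of A/G only, pyrimidine runs of C/T only.
    pattern = re.compile(r"[AG]{%d,}|[CT]{%d,}" % (min_len, min_len))
    hits = []
    for match in pattern.finditer(seq.upper()):
        kind = "purine" if match.group()[0] in "AG" else "pyrimidine"
        hits.append((match.start(), match.end(), kind))
    return hits


def covered_fraction(seq, hits):
    """Fraction of `seq` covered by the given stretches."""
    if not seq:
        return 0.0
    return sum(end - start for start, end, _ in hits) / len(seq)


if __name__ == "__main__":
    example = "ACGTAGGAAGCATTCTCCAG"
    stretches = find_homo_stretches(example, min_len=5)
    for start, end, kind in stretches:
        print(start, end, kind, example[start:end])
    print(covered_fraction(example, stretches))
```

Because purine and pyrimidine runs cannot overlap, a single left-to-right pass is sufficient, and each chromosome or contig can be scanned independently, which is in line with the trivial parallelisation mentioned in the paper. Mismatch-tolerant matching, as raised in the preceding paragraph, is deliberately not covered by this sketch.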


Another interesting question concerns the heuristic approaches used here. A more detailed mathematical analysis could help to understand the internal combinatorial structure and the distribution and clustering properties of the TFOs. With such knowledge, possibly based on explicit statistical descriptions, the selection of an oligonucleotide of a shorter length within a long stretch could be made more efficiently when designing the basis set in Algorithm 1. In essence, this problem occurs again when arbitrary sequences are considered. The same reasoning applies to the declustering process. We still lack a more deterministic reduction of the basis probe set to the final set with simultaneous elimination of the background clusters by a sophisticated selection of probes to be removed. However, current theoretical analyses show that the problem may be formulated as a linear optimisation problem, which can then be (deterministically) solved for a (global) optimum.

Acknowledgements

The authors gratefully acknowledge financial support to M. Hausmann (FKZ 01IG07015G: Services@MediGRID and 13N11163: COMBO-FISH) by BMBF and to J. Wagner (Heidelberger Graduiertenschule der mathematischen und computergestützten Methoden für die Wissenschaften) by DFG. We all thank Nick Kepper (BioQuant Heidelberg) for kindly supplying Figure 1. E. Schmitt wishes to express his gratitude to Jeremy Rodack and Graham Dorsey, Western Universality, whose iron will bridged the gap from noon till three when writing silly code.

References

[1] M. Bohn, P. Diesinger, R. Kaufmann, Y. Weiland, P. Müller, M. Gunkel, A. von Ketteler, P. Lemmer, M. Hausmann, D. Heermann, C. Cremer, Localization microscopy reveals expression-dependent parameters of chromatin nanostructure, Biophys. J. 99 (2010) 1358–1367.
[2] P. Lemmer, M. Gunkel, D. Baddeley, R. Kaufmann, A. Urich, Y. Weiland, J. Reymann, P. Müller, M. Hausmann, C. Cremer, SPDM-light microscopy with single molecule resolution at the nanoscale, Appl. Phys. B 93 (2008) 1–12.
[3] M. Hausmann, G. Hildenbrand, J. Schwarz-Finsterle, U. Spöri, H. Schneider, E. Schmitt, C. Cremer, New technologies measure genome domains: high resolution microscopy and novel labeling procedures enable 3-D studies of the functional architecture of gene domains in cell nuclei, Biophotonics Int. 12 (2005) 34–37.
[4] J. Erenpreisa, M. Kalejs, F. Ianzini, E. Kosmacek, M. Mackey, D. Emzinsh, M. Cragg, A. Ivanov, T. Illidge, Segregation of genomes in polyploid tumour cells following mitotic catastrophe, Cell Biol. Int. 29 (2005) 1005–1011.
[5] J. Walter, B. Joffe, A. Bolzer, H. Albiez, P. Benedetti, S. Müller, M. Speicher, T. Cremer, M. Cremer, I. Solovei, Towards many colors in FISH on 3D-preserved interphase nuclei, Cytogenet. Genome Res. 114 (3–4) (2006) 367–378.
[6] E. Volpi, J. Bridger, FISH glossary: an overview of the fluorescence in situ hybridization technique, Biotechniques 45 (2008) 385–409.
[7] I. Zahra Abdellah, et al., Finishing the euchromatic sequence of the human genome, Nature 431 (2004) 931–945.
[8] M. Hausmann, R. Winkler, G. Hildenbrand, J. Finsterle, A. Weisel, A. Rapp, E. Schmitt, S. Janz, C. Cremer, COMBO-FISH: specific labeling of nondenatured chromatin targets by computer-selected DNA oligonucleotide probe combinations, Biotechniques 35 (2003) 564–577.
[9] J. Schwarz-Finsterle, S. Stein, C. Großmann, E. Schmitt, L. Trakhtenbrot, G. Rechavi, N. Amariglio, C. Cremer, M. Hausmann, Comparison of triple-helical COMBO-FISH and standard FISH by means of quantitative microscopic image analysis of abl/bcr genome organisation, J. Biochem. Biophys. Methods 70 (3) (2007) 397–406.
[10] E. Schmitt, J. Schwarz-Finsterle, S. Stein, C. Boxler, P. Müller, A. Mokhir, R. Krämer, C. Cremer, M. Hausmann, Combinatorial oligo FISH: directed labeling of specific genome domains in differentially fixed cell material and live cells, Methods Mol. Biol. 659 (2) (2010) 185–202.
[11] K. Hoogsteen, The structure of crystals containing a hydrogen-bonded complex of 1-methylthymine and 9-methyladenine, Acta Crystallogr. 12 (10) (1959) 822–823.

[12] R. Besch, C. Giovannangeli, K. Degitz, Triplex-forming oligonucleotides: sequence-specific DNA ligands as tools for gene inhibition and for modulation of DNA-associated functions, Curr. Drug Targets 5 (2004) 691–703.
[13] J. Luchner, Molecular dynamics simulation of short triple helical DNA structures, Bachelor Thesis in Physics, Department of Physics and Astronomy, Heidelberg University, 2011.
[14] A.A. Koshkin, S.K. Singh, P. Nielsen, V.K. Rajwanshi, R. Kumar, M. Meldgaard, C.E. Olsen, J. Wengel, LNA (locked nucleic acids): synthesis of the adenine, cytosine, guanine, 5-methylcytosine, thymine and uracil bicyclonucleoside monomers, oligomerisation, and unprecedented nucleic acid recognition, Tetrahedron 54 (1998) 3607–3630.
[15] O.B. Peter, E. Nielsen, M. Egholm, Peptide nucleic acid (PNA). A DNA mimic with a peptide backbone, Bioconjug. Chem. 5 (1994) 3–7.
[16] V.V. Filichev, H. Gaber, T.R. Olsen, P.T. Jørgensen, C.H. Jessen, E.B. Pedersen, Twisted intercalating nucleic acids: intercalator influence on parallel triplex stabilities, Eur. J. Org. Chem. 2006 (2006) 3960–3968.
[17] F. Dickmann, J. Falkner, W. Gunia, J. Hampe, M. Hausmann, A. Herrmann, N. Kepper, T.A. Knoch, S. Lauterbach, J. Lippert, K. Peter, E. Schmitt, U. Schwardmann, J. Solodenko, D. Sommerfeld, T. Steinke, A. Weisbecker, U. Sax, Solutions for biomedical grid computing: case studies from the D-Grid Project Services@MediGRID, J. Comput. Sci. 3 (2012) 280–297.
[18] National Center for Biotechnology Information, Bethesda, MD 20894, USA, http://www.ncbi.nlm.nih.gov.
[19] S.S. Gaddis, Q. Wu, H.D. Thames, J. Digiovanni, E.F. Walborg, M.C. MacLeod, K.M. Vasquez, A web-based search engine for triplex-forming oligonucleotide target sequences, Oligonucleotides 16 (2006) 196–201.
[20] National Center for Biotechnology Information, Bethesda, MD 20894, USA, http://www.ncbi.nlm.nih.gov/Blast.cgi.
[21] M.P. Knauert, P.M. Glazer, Triplex forming oligonucleotides: sequence-specific tools for gene targeting, Hum. Mol. Genet. 10 (2001) 2243–2251.

Dr. Eberhard Schmitt received his Diploma in Biology and his Dr. rer. nat. in Mathematics from Universität Göttingen. Having started to work at the Max-Planck-Institut für Biophysikalische Chemie in Göttingen on structure calculations and algorithms for energy minimization of DNA structures, he continued this approach with his group at the Institute for Molecular Biotechnology in Jena, adding ODE descriptions of mitotic checkpoint dynamics to his research field, as well as combinatorial aspects of the genome. At the Kirchhoff-Institut für Physik since 2008, for the design of COMBO-FISH sets he puts on his designer’s COMBO-FISH spectacles.

Jenny Wagner studied physics, mathematics and computer science at Heidelberg University and received her diploma in 2008 for work on “data compression for the ALICE detector at CERN”. She is currently pursuing her PhD research on image processing, pattern analysis and computer vision, focusing on automated quality analysis in microarray experiments.

Michael Hausmann After studying physics, Michael Hausmann received his PhD in 1988 and his habilitation in 1996. Thereafter, he was appointed leader of the “Farfield/Nearfield Microscopy Lab” at the Institute of Molecular Biotechnology in Jena and worked as guest professor at the Universities of Jena and Amsterdam. In 2002, he became research group leader of the Microscopy Group of the Institute of Pathology in Freiburg. In 2004, he was appointed professor by the Faculty of Physics and Astronomy, University of Heidelberg. Since 2005, he has been the leader of the research division “Peptide Chips” of the Kirchhoff Institute of Physics, University of Heidelberg.