Pattern Recognition Letters 31 (2010) 1672–1682
Multi-objective optimisation of real-valued parameters of a hybrid MT system using Genetic Algorithms

Sokratis Sofianopoulos, George Tambouratzis

Institute for Language and Speech Processing, Artemidos 6, 151 25 Athens, Greece
Article history: Received 31 October 2008; available online 24 May 2010. Communicated by R.C. Guido.

Keywords: Multi-objective optimisation; Genetic Algorithms; SPEA2; Machine Translation; Evaluation of translation quality; Evolutionary computation
Abstract

In this paper, an automated method is proposed for optimising the real-valued parameters of a hybrid Machine Translation (MT) system that employs pattern recognition techniques together with extensive monolingual corpora in the target language from which statistical information is extracted. The absence of a parallel corpus prohibits the use of the training techniques traditionally employed in state-of-the-art Statistical Machine Translation systems. The proposed approach for fine-tuning the system parameters towards the generation of high-quality translations is based on a Genetic Algorithm and the multi-objective evolutionary algorithm SPEA2. In order to evaluate the translation quality, established MT automatic evaluation criteria are employed, such as BLEU and METEOR. Furthermore, various ways of combining these criteria are explored, in order to exploit each one's characteristics and evaluate the produced translations. The experimental results indicate the effectiveness of this approach, since the translation quality of the evaluation sentence sets used is substantially improved in all studied configurations, when compared to the output of the same system operating with manually-defined parameters. Out of all configurations, the multi-objective evolutionary algorithms, combining several MT evaluation metrics, are found to produce the highest quality translations.

© 2010 Elsevier B.V. All rights reserved.
1. Introduction

Machine Translation (MT) systems automatically translate text from a source language (SL) to a target language (TL). The most widely-used approaches nowadays are Statistical Machine Translation (SMT) and Hybrid Machine Translation (HMT). SMT employs statistical methods and large bilingual text corpora in order to create probabilistic language models for producing the translation (Brown et al., 1990). HMT systems combine the characteristics of various MT approaches in order to encompass their respective advantages and improve the translation quality. All modern MT systems are based on language models with large numbers of parameters derived from corpus analysis. An important step in the development stage of MT systems is the optimisation of the parameters of the statistical model, since they reflect the characteristics of the specific corpus and should be adjusted to any corpus changes. Most SMT systems use the Expectation Maximisation (EM) algorithm (Dempster et al., 1977) in order to calculate the latent variables of the probabilistic language model. Even though these parameters play an important role in
the translation process and considerable work has focussed on parameter optimisation, relatively limited work has been invested in improving the training process itself by applying other learning algorithms. Techniques reported in the literature for the optimisation of MT systems include the Downhill Simplex method (Bender et al., 2004), the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm (Lambert and Banchs, 2007), the Simplex Armijo Downhill algorithm (Zhao and Chen, 2009) and the Joint Optimisation Strategy for combining outputs from multiple MT systems in a single log-linear model (He and Toutanova, 2009).

In this article, an automatic method is proposed for optimising the parameters of an HMT system, based on Genetic Algorithms (GA) (Holland, 1975). This family of algorithms is based on the idea of natural evolution, working from many initial solutions simultaneously to reach a near-optimal solution (Goldberg, 1989). GAs have been applied successfully to a wide variety of optimisation problems, ranging from job scheduling, timetabling and routing to digital signal processing. In the field of MT, GAs have been applied to rule acquisition from translation examples in rule-based systems (Echizen-ya et al., 1996), to sentence alignment of bilingual corpora (Gautam and Sinha, 2007) and to semantic classification of sentences (Siegel and McKeown, 1996). However, very few approaches, to our knowledge, have relied on GAs for the fine-tuning of MT system parameters. A GA has been applied (Nabhan
and Rafea, 2005) to tune the parameters of GIZA++, a toolkit employed for producing statistical translation models and word alignments, but not the parameters of the MT system itself. A GA has also been applied for the parameter optimisation of the clue alignment approach used for the alignment of words and multi-word units in SMT systems (Tiedemann, 2005).

To evaluate the effectiveness of the proposed method, METIS-II was selected as the MT system (METIS-II – FP6-IST-003768 – was a 3-year IST Specific Targeted Research or Innovation Project funded by the IST programme of the European Commission and completed in 2007). METIS-II (Carl et al., 2008) is an HMT system combining pattern-matching techniques with the use of only a monolingual TL corpus. The METIS-II parameters are employed in various phases of the translation process and have been assigned values manually, based predominantly on linguistic intuition, while a fine-tuning phase, involving the iterative application of Machine Translation experiments, has led to their finalisation. However, ideally the system should not rely on predetermined parameter values; these values should reflect the corpus that the system uses, i.e. they should be domain- and data-specific. For this reason an automatic training process for the optimisation of the parameters is considered essential.

Classic SMT optimisation techniques are not appropriate, since they cannot be applied to METIS-II. The parameters of the statistical models of an SMT system derive from the analysis of large bilingual text corpora and depend on the size of those corpora. In METIS-II there is no such statistical information and no parallel corpus, but only a monolingual TL corpus and a few language-related real-valued parameters, so the optimisation techniques used by SMT systems cannot be applied.
Instead of trying to adjust the standard SMT training methods, it was decided to use a GA together with automatic evaluation criteria that measure the quality of the produced translation. A case study of the methodology has already been applied to METIS-II (Sofianopoulos et al., 2008): a basic GA was implemented, using a single MT evaluation criterion, BLEU (Papineni et al., 2002), to optimise a subset of the system parameters. The results were promising, since the BLEU scores obtained with the optimised parameters were substantially improved, yet there was still room for further improvement.

The aim of the present research is to optimise all system parameters at the same time, instead of focussing on only a small subset, and furthermore to experiment with existing MT evaluation criteria, such as BLEU, NIST (Nist, 2002), METEOR (Banerjee and Lavie, 2005) and IQMT (Giménez and Amigó, 2006), by using combinations of these criteria to evaluate the translation quality. Since none of the current evaluation criteria provides a single comprehensive measure of quality, multi-objective evolutionary algorithms have also been applied to the task.

The article is organised as follows. Section 2 presents an overview of the MT system, along with a description of the parameters to be tuned and their role in the translation process. Section 3 provides a detailed description of the optimisation algorithms developed. The results obtained are presented in Section 4 and are compared to older ones, as well as to other optimisation methods used within the MT paradigm. Section 5 concludes the article by discussing future work directions.
2. A hybrid MT system using only a monolingual corpus

2.1. The concept of translating with only a monolingual corpus

METIS-II is an HMT system that does not require bilingual text corpora. Instead, it relies only on a TL monolingual corpus and a bilingual lexicon and combines characteristics from various MT paradigms: rule-based and stochastic tools for SL and TL text processing and statistical information for the translation process. More specifically, translation involves determining alignments between items in the two languages. Unlike many MT systems, instead of sequences of n words (n-grams), the system handles syntactically-defined phrasal segments of varying length, i.e. clauses and chunks, which are generated by a shallow parser. In the METIS-II language representation (SL and TL), each word is represented by its lemma and its part-of-speech tag (PoS tag), while each phrasal segment (chunk), which consists of words, is represented by its type and the head token (the predominant word in the chunk). Fig. 1 gives an example of an English clause lemmatised, tagged and segmented into phrasal chunks.

Using the above representation and the TL monolingual corpus, the system translates an SL clause by establishing lexical selection and word and chunk alignment with "similar" TL clauses retrieved from the corpus. To achieve the SL–TL text alignment, the system employs a pattern-matching algorithm which, based on a series of parameters and combining information derived from the TL corpus with a bilingual dictionary, finds correspondences between the two languages. The translation process is essentially a comparison between SL and TL chunks and words, which results in aligned elements between the two languages. In this article the Greek-to-English implementation of METIS-II (Tambouratzis et al., 2006) is used.

2.2. The translation process of METIS-II and the system parameters

The translation process begins by retrieving the translation candidates from the TL corpus and ranking them in terms of similarity, by aligning the SL clause with each retrieved TL clause. Alignment is a top-down approach executed at two levels (clause level and chunk level), based on a general-purpose pattern-matching algorithm employing a set of parameters. The algorithm used is an implementation of the Hungarian algorithm, known as the Kuhn–Munkres algorithm, proposed by Kuhn (1955) and revised by Munkres (1957). The parameters are real-valued in the [0, 1] range and reflect the similarity of part-of-speech tags and chunk labels across SL and TL.

At the first level of the alignment process, the SL clause is compared with the retrieved TL clauses, so as to establish the correct order of chunks within the clause and disambiguate between possibly multiple translations of the chunk head tokens. The Hungarian algorithm aligns the chunks by calculating the similarity of the ith SL chunk and a TL chunk as the weighted sum of the comparison scores of three types of information, namely (a) chunk labels (LabelComp), (b) chunk head lemmas (LemmaComp) and (c) chunk head tags (TagComp) (Eq. (1)):

ChunkScore_i = bcf_i · LabelComp_i + tcf_i · TagComp_i + (1 − bcf_i − tcf_i) · LemmaComp_i    (1)

Fig. 1. English sample clause with lemma, PoS tag and chunk information:
  Clause:   two cows are grazing in a field
  Lemmata:  two cow be graze in a field
  PoS tags: CRD NN2 VBD VVG PRP AT0 NN1
  Chunks (chunk type and head word): NP[two cow], VG[be graze], PP[in NP[a field]]
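The first-level chunk similarity of Eq. (1) can be sketched as follows; the function name, parameter names and the comparison scores used in the example are illustrative assumptions, not the actual METIS-II implementation:

```python
# Sketch of the weighted chunk similarity of Eq. (1).  The comparison
# scores (label, tag, lemma) and the cost factors bcf/tcf are made up
# for illustration; in METIS-II they depend on the chunk label.

def chunk_score(label_comp: float, tag_comp: float, lemma_comp: float,
                bcf: float, tcf: float) -> float:
    """Weighted sum of label, tag and lemma comparison scores (Eq. 1)."""
    assert bcf + tcf < 1.0, "bcf and tcf must sum to less than one"
    return bcf * label_comp + tcf * tag_comp + (1.0 - bcf - tcf) * lemma_comp

# Example: identical chunk labels (1.0), matching head tags (1.0),
# partially matching head lemmas (0.5), with bcf = 0.3 and tcf = 0.2:
score = chunk_score(1.0, 1.0, 0.5, bcf=0.3, tcf=0.2)
# 0.3*1.0 + 0.2*1.0 + 0.5*0.5 = 0.75
```

The remaining weight (1 − bcf − tcf) falls on the lemma comparison, so the three weights always sum to one.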
where bcf_i and tcf_i are real positive-valued parameters that sum to less than one. The calculated similarity scores for all possible SL–TL chunk pairs are arranged into an m × m matrix and fed to the Hungarian algorithm to determine the optimum chunk alignment. Finally, the collective clause similarity score is calculated as the weighted sum of all chunk comparison scores selected by the algorithm (Eq. (2)):

ClauseScore = Σ_{i=1}^{m} ocf_i · ChunkScore_i / (Σ_{j=1}^{m} ocf_j),  where m > 1    (2)
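A toy illustration of the first-level alignment and of Eq. (2): for the small matrices involved here, exhaustive search over permutations yields the same optimum assignment as the Kuhn–Munkres (Hungarian) algorithm used by the system, so the sketch below substitutes a brute-force search. The similarity matrix matches the scores shown in Fig. 2, while the cost factors ocf are illustrative values:

```python
# Brute-force stand-in for the Hungarian algorithm, plus Eq. (2).
from itertools import permutations

def best_alignment(scores):
    """Return the chunk assignment maximising the total similarity score."""
    m = len(scores)
    return max(permutations(range(m)),
               key=lambda p: sum(scores[i][p[i]] for i in range(m)))

def clause_score(scores, ocf):
    """Weighted sum of the aligned chunk comparison scores (Eq. 2)."""
    assignment = best_alignment(scores)
    total_ocf = sum(ocf)
    return sum(ocf[i] * scores[i][assignment[i]] / total_ocf
               for i in range(len(scores)))

# 3x3 SL-TL chunk similarity matrix (scores as in Fig. 2) and
# illustrative per-chunk cost factors ocf.
scores = [[0.6282, 0.0, 0.0429],
          [0.0,    1.0, 0.0],
          [0.3685, 0.0, 0.9684]]
ocf = [0.4, 0.3, 0.3]

alignment = best_alignment(scores)   # (0, 1, 2): the diagonal assignment
overall = clause_score(scores, ocf)  # 0.4*0.6282 + 0.3*1.0 + 0.3*0.9684
```

For realistic clause lengths a proper O(m³) Hungarian implementation is needed, since the permutation search grows factorially with m.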
Each discrete chunk label value is assigned a unique tuple of real-valued parameters (ocf, tcf, bcf), which play an important role in the alignment process since they drive all chunk comparisons. At the second level of the alignment process, the tokens within each pair of SL chunk and TL chunk (that have been selected by the first level) are aligned, to establish the correct token order and thus select the appropriate translations. The token alignment process is similar to chunk alignment: first, each SL token is compared with all tokens of the TL chunk, calculating the token score as the weighted sum of the lemma comparison (LemmaComp_i) and the PoS tag comparison (TagComp_i) scores (Eq. (3)).
TokenScore_i = tcf_i · TagComp_i + (1 − tcf_i) · LemmaComp_i    (3)

All token comparison scores then form an n × n matrix, which is fed to the Hungarian algorithm in order to obtain the optimum token alignment. The collective chunk similarity score, ColChunkScore, is calculated as the weighted sum of all token comparison scores TokenScore_i (Eq. (4)), where ocf is the cost factor parameter of the SL token in each token pair. Each discrete SL PoS tag is assigned a unique (ocf, tcf) set of real-valued parameters. The weighted sum of all ColChunkScore scores yields the second-level clause comparison score.

ColChunkScore = Σ_{i=1}^{n} ocf_i · TokenScore_i / (Σ_{j=1}^{n} ocf_j),  where n > 1    (4)

Fig. 2 illustrates the application of the Hungarian algorithm for calculating the clause similarity score and the resulting chunk alignment. Each cell contains the similarity score of the corresponding SL chunk and TL chunk, while highlighted cells indicate the optimum SL–TL chunk mapping as defined by solving the assignment problem. When the alignment process has been completed, a final comparison score for each SL–TL clause pair is calculated as the product of the scores of the two levels. The TL clause with the highest final score is selected as the basis for the translation, while the order of elements at both chunk and token levels is established through the aforementioned process. As a final step, any mismatches exceeding a given level are eliminated by modification and/or substitution of chunks and tokens, based on statistical information retrieved from the TL corpus. For more details see Sofianopoulos et al. (2007).

2.3. The system parameters

The translation system, when initialised, is provided with 70 real-valued parameters in the range [0, 1], which fall under two types: (a) 40 PoS tag-related parameters, corresponding to 20 tag types with 2 parameters each, and (b) 30 chunk label-related ones, corresponding to 10 chunk types with 3 independent parameters each. These values need to be defined at the same time; it is probably not sufficient to optimise only a subset, since the translation process is indivisible and all parameters are active throughout the process. In the Greek-to-English implementation, all parameter values have been manually assigned, based on the given TL corpus and linguistic intuition. However, subsequent experiments indicated that these values are sub-optimal, as better quality translations are obtained by adjusting even a subset of these parameters.

Numerous learning methods exist for solving general optimisation problems with a large number of real-valued parameters and the task of selecting the most appropriate one is very difficult. Methods such as Weight Space Search (Black and Campbell, 1995) and Multilinear Regression (Graybill, 1961) were not chosen here due to their prohibitively high computational costs. Besides, as has already been noted, training algorithms used in SMT, such as EM, cannot be applied for the optimisation of the METIS-II parameters, due to the existence of only a monolingual corpus. Therefore, it has been decided to use Genetic Algorithms, as they have been proven efficient at solving similar optimisation problems involving a large number of real-valued parameters that interact nonlinearly with each other, with numerous local optima existing in the objective function (Ballester and Carter, 2003).

3. Parameter optimisation using Genetic Algorithms

3.1. Genetic Algorithms
GAs are a class of adaptive stochastic search and optimisation algorithms based on the principle of natural evolution. They operate on a population of candidate solutions by applying over each generation the law of survival of the fittest in order to produce progressively improved approximations to a solution. More specifically, a GA iteratively creates a new population by selecting individuals according to their level of fitness in the problem domain, and applying operators that simulate natural evolution. This process leads to the evolution of populations of individuals that are better adapted to the given environment than the individuals from which they originate. Candidate solutions, or individuals, are encoded as strings using various data types, i.e. binary, integer, real-valued etc. Very frequently, the initial population is created by randomly producing such strings. Populations are evaluated using an objective function which governs the selection process and denotes the level of fitness of each individual within the given environment. Individuals achieving a high fitness value will have a higher probability of
Fig. 2. Example indicating the alignment of chunks from the source to the target language and the respective comparison scores (overall SL score = 82.8%):

  SL chunks: NP[two{crd} cow{nn}]; VG[be{vbd} graze{vvg}]; PP[in{prp} NP_2[a{at0} field{nn}]]
  TL chunks: NP_NM[the{at} flock{nn}|herd{nn}|shoal{nn}]; VG[graze{vv}]; PP[at{prp}|in{prp}|on{prp} NP_AC[meadow{nn}]]

  Chunk-pair similarity scores (optimum assignment on the diagonal):
           NP_NM     VG      PP
    NP     62.82%    0       4.29%
    VG     0         100%    0
    PP     36.85%    0       96.84%
being selected for reproduction towards the new generation than individuals with lower fitness values. Reproduction is usually performed in pairs, through the probabilistic application of the genetic operators of selection, crossover and mutation, and generates offspring containing material from their parents. The offspring are then diversified via a mutation operator. This way, analogously to nature, the average fitness value is expected to increase over generations as good individuals survive and propagate their genetic material, while less fit individuals are discarded. New generations are created through the iterative application of the aforementioned process. The sequence of selection, reproduction and evaluation is repeated until some termination criterion is reached. Such a criterion could either be (i) the completion of a certain number of generations, (ii) the failure of the best individuals to improve within a certain number of generations, or (iii) the successful convergence of the genetic algorithm (Hui and Xi, 1996). Although GAs are not guaranteed to find an optimal solution, they have been shown to be most successful in finding a near-optimal solution, if allowed to operate for a sufficient number of iterations.

3.2. Applying the GA to the MT optimisation task

The GA implemented is an elitist real-valued GA with a constant population size p. The initial population is randomly created and for each new generation p/2 parent pairs are selected from the current population, based on their fitness, to produce p children through the application of genetic operators. Below we describe the details of the GA configuration in terms of individual representation, selection, reproduction, fitness function and termination schemes.

3.2.1. Individual representation using real coding

For optimising the MT system parameters, the first step is to code the parameters into GA individuals.
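This coding step can be sketched directly: each individual is a vector of the 70 real-valued system parameters, each drawn from [0, 1]. The population size of 30 below matches the configuration reported in Section 4; the names are illustrative:

```python
# Sketch of the real-coded representation: one individual = one vector W
# of the 70 system parameters (40 PoS tag-related + 30 chunk label-related),
# each value in [0, 1].  Names and population size are illustrative.
import random

N_PARAMS = 70
POP_SIZE = 30

def random_individual():
    return [random.random() for _ in range(N_PARAMS)]

# Random initial population, as used when no prior parameter set exists.
population = [random_individual() for _ in range(POP_SIZE)]
```

Initialising the MT system with such a vector is what allows a fitness value to be computed for the corresponding individual.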
Initially, a common method of applying GAs to real-valued parameter problems was to encode each parameter as a binary bit string (Wright, 1991) and then concatenate the parameter representations into a single string, called a chromosome. More recent approaches to optimisation problems with real-valued parameters (Deb, 2001; Herrera et al., 1998; Blanco et al., 2001) use vectors of real-valued parameters instead of binary bit strings (Herrera et al., 2003). In this work the latter approach has been used. Specifically, each individual is a vector W containing all the real-valued system parameters to be optimised (W = {w1, w2, . . ., w70}), with each parameter taking values in the range [0, 1]. In order for the MT system to execute translation tasks, it needs to be initialised with such a vector.

3.2.2. Selection

We use roulette wheel selection to select two parents from the current population. In roulette wheel selection, each individual of the population is given a probability of being selected relative to its fitness, so that individuals with a higher fitness value are selected more often. Genetic operators are then applied to the parents in order to produce new individuals.

3.2.3. Offspring generation

Following the selection, the next step is the generation of the new population. Each pair of parent individuals produces two children using uniform crossover (Syswerda, 1989). In uniform crossover, individual values of the vectors of the two parents are swapped with a fixed probability of 0.5 in order to create two new individuals. After the creation of the new individuals, a Gaussian mutation operator alters one or more parameter values of a number of randomly selected individuals. The number of parameter values for which the mutation operator will be applied depends
on the mutation rate of the GA. The Gaussian mutation operator is applied by adding a unit Gaussian distributed random value to a randomly chosen parameter of the selected individual. If the new parameter value falls outside the range [0, 1], another Gaussian distributed random value is produced and added to the selected parameter value. Mutation ensures diversity in the population, helping to prevent the population from stabilising at local optima and moving individuals towards new areas of the pattern space in the search for the optimal solution. In each offspring generation stage, the overall number of new individuals generated is twice the number of individuals in the population. Fig. 3 presents an example of the uniform crossover applied to two individuals, each represented as a vector of 8 real-valued parameters, and of the Gaussian mutation operator applied to 2 of the parameters of the pair of new individuals (mutation rate of 0.125).

3.2.4. Evaluation of population

For calculating the fitness of an individual k, the MT system is initialised with the parameter values of vector Wk and then generates a translation of a suitably-chosen set of SL sentences. The produced TL translation is then evaluated using MT evaluation criteria, and a fitness value is assigned to this individual according to the evaluation result of the translation, with the individuals yielding good translations characterised as fitter than those producing bad translations. The set of SL sentences used is the same for all individuals in the population throughout the generations. The appropriate choice of the SL text greatly contributes to the success of the GA. Therefore, the SL text needs to cover all the corresponding linguistic phenomena, so as to ensure that no parameter stays inert during the translation process.

3.2.5. Creation of new population

As we have already noted, the implemented GA combines elitism with a constant population size.
When creating new populations, we combine the fittest individual from the current generation with the fittest children, while keeping the population size constant. This is executed by passing to the new generation the fittest individual of the current generation without any modifications and supplementing the population with the fittest children, according to their fitness value, until the new generation reaches the desired size. The processes of selection, generation, evaluation and new population creation are repeated until the chosen termination criteria are satisfied.

3.2.6. Termination

In the present implementation two termination criteria have been used. According to the first criterion, the GA is stopped when a specified number of generations has been completed. According to the second criterion, evolution is terminated when, over a given number of generations, the fitness of the best individual remains within a pre-specified percentage of the fitness of the best individual found so far. Fig. 4 gives an overview of the implemented GA using pseudocode.

3.3. Automatic MT evaluation metrics used as fitness functions

Automatic evaluation metrics grade the output of an MT system by comparing it to one or more reference translations. They have played an essential role in the development of MT systems in recent years and remain fundamental in comparing the quality of MT systems, despite the criticism they have received for focussing on partial elements of the translation process and occasionally showing poor correlation with human judgments (Estrella et al., 2007). In the experiments reported here, BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) have been selected, these
Fig. 3. Example of the application of the crossover and mutation for two individuals.
Procedure GA {
    t = 0;
    initialize P(t);
    evaluate P(t);
    while not (termination-criterion) do {
        t = t + 1;
        select_parents from P(t-1) based on fitness using roulette wheel selection;
        generate_offspring O(t) using uniform crossover;
        mutate_offspring O(t) using Gaussian mutation;
        evaluate O(t);
        create P(t) from O(t) and P(t-1);
    }
}

Fig. 4. Overview of the implemented GA using pseudocode.
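The procedure of Fig. 4 can be fleshed out as a runnable sketch. Here a toy fitness function stands in for the translate-and-evaluate step of the MT task, and the vector length, population size and rates are illustrative only:

```python
# Runnable Python sketch of the GA in Fig. 4: roulette-wheel selection,
# uniform crossover, Gaussian mutation with redraw, elitist replacement.
import random

def toy_fitness(w):
    # Stand-in for BLEU/METEOR scoring: peaks when all parameters are 0.5.
    return 1.0 - sum((x - 0.5) ** 2 for x in w) / len(w)

def uniform_crossover(a, b):
    # Swap corresponding parameters of the two parents with probability 0.5.
    child_a, child_b = list(a), list(b)
    for i in range(len(child_a)):
        if random.random() < 0.5:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b

def gaussian_mutate(w, rate=0.05):
    # Add a unit-Gaussian value to randomly chosen parameters, redrawing
    # whenever the mutated value would leave the [0, 1] range.
    w = list(w)
    for i in range(len(w)):
        if random.random() < rate:
            new = w[i] + random.gauss(0.0, 1.0)
            while not 0.0 <= new <= 1.0:
                new = w[i] + random.gauss(0.0, 1.0)
            w[i] = new
    return w

def run_ga(n_params=8, pop_size=10, generations=30, seed=42):
    random.seed(seed)
    population = [[random.random() for _ in range(n_params)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [toy_fitness(ind) for ind in population]
        elite = max(population, key=toy_fitness)
        offspring = []
        while len(offspring) < 2 * pop_size:
            # Roulette-wheel selection of a parent pair.
            pa, pb = random.choices(population, weights=fitnesses, k=2)
            for child in uniform_crossover(pa, pb):
                offspring.append(gaussian_mutate(child))
        # Elitist replacement: keep the fittest current individual and
        # fill the remaining slots with the fittest children.
        offspring.sort(key=toy_fitness, reverse=True)
        population = [elite] + offspring[:pop_size - 1]
    return max(population, key=toy_fitness)

best = run_ga()
```

Because of the elitist replacement, the best fitness in the population never decreases from one generation to the next.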
being amongst the most widely-used modern evaluation metrics for MT tasks. BLEU was developed to evaluate the effectiveness of translation for SMT systems, based on n-grams over words. The score generated is a real value in the range [0, 1] that expresses the fraction of common n-grams between the system-generated translations and the reference ones, with 1 corresponding to a perfect translation and 0 to a very poor translation sharing no n-grams with the references. METEOR, on the other hand, works in a sequence of stages, with different criteria being used at each stage to detect and score single-word matches. METEOR scores also span the range [0, 1].

Within the MT paradigm, research has striven towards the creation of more sophisticated metrics, able to capture more linguistic aspects than simply distinguishing between correct and incorrect translations. As this is a very difficult task, instead of defining a single golden metric, methods have been proposed for combining multiple existing evaluation metrics in order to provide a representative score that reflects multiple aspects of the produced translation. Such a result is achieved by the IQMT Evaluation Framework (Giménez and Amigó, 2006), which combines different metrics and returns a single evaluation score.

Furthermore, techniques have been investigated for combining different MT evaluation criteria in order to create an appropriate fitness function for the GA. In this approach, the problem of MT parameter optimisation is treated as a multi-objective optimisation problem, attempting to optimise the system parameters by applying multiple MT evaluation metrics. Previous approaches to the simultaneous application of two MT evaluation metrics for the training and optimisation of MT systems combined them either by simply adding them up (Zhao and Chen, 2009) or by interpolating them (Mauser et al., 2008), and had only been tested on statistical systems.
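The n-gram overlap idea underlying BLEU can be illustrated in a few lines. This is a deliberate simplification: real BLEU combines modified n-gram precisions for n = 1..4 with a brevity penalty (Papineni et al., 2002), and the sentences below are illustrative:

```python
# Highly simplified illustration of clipped n-gram precision, the core
# quantity behind BLEU.  Not the full BLEU metric.
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the
    reference, with counts clipped to the reference counts."""
    cand = Counter(tuple(candidate[i:i + n])
                   for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(1, sum(cand.values()))

candidate = "two cows graze in a field".split()
reference = "two cows are grazing in a field".split()
p1 = ngram_precision(candidate, reference, 1)   # 5 of 6 unigrams match
```

Because "graze" does not match the reference form "grazing", the unigram precision here is 5/6; lemmatisation, as used throughout METIS-II, would remove exactly this kind of mismatch.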
The methodology presented in this paper combines two evaluation metrics for the optimisation of the
system parameters in two ways: (a) as a weighted sum and, alternatively, (b) as two distinct criteria in a Pareto-based multi-objective optimisation using the SPEA2 evolutionary algorithm. Optimising the parameters of a hybrid MT system by combining multiple MT evaluation criteria, using either a GA or a Pareto-based evolutionary multi-objective algorithm such as SPEA2, is novel in the field of MT; to the best of our knowledge, a Pareto-based combination has never before been applied either to the combination of translation-quality criteria or to the task of optimising the parameters of an MT system.
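The two combination schemes can be contrasted in a minimal sketch. The weights follow Eq. (5) in Section 4, while the candidate (BLEU, METEOR) score pairs are illustrative values only:

```python
# Scheme (a): weighted sum of two metric scores.
# Scheme (b): Pareto dominance over (BLEU, METEOR) objective pairs,
# the relation underlying SPEA2-style selection.

def weighted_fitness(bleu, meteor, w_bleu=0.63, w_meteor=0.27):
    """Collapse both metrics into a single scalar fitness."""
    return (w_bleu * bleu + w_meteor * meteor) / (w_bleu + w_meteor)

def dominates(a, b):
    """a dominates b if it is at least as good on every objective and
    strictly better on at least one (both objectives are maximised)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

# Three candidate parameter vectors scored as (BLEU, METEOR) pairs:
candidates = [(0.30, 0.55), (0.28, 0.60), (0.25, 0.50)]
front = [c for c in candidates
         if not any(dominates(o, c) for o in candidates if o != c)]
# The third candidate is dominated by the first; the first two are
# mutually non-dominated and together form the Pareto front.
```

The key difference: the weighted sum imposes a fixed trade-off between the metrics, whereas the Pareto-based scheme retains all non-dominated trade-offs and lets the evolutionary algorithm explore them.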
Fig. 5. Implementation of the GA for the MT task.
Fig. 5 illustrates the implemented GA for the MT task, for the general case of either single-objective or multi-objective approaches. 4. Experiments
1677
3. A GA using as its fitness function a weighted sum of the BLEU and METEOR scores obtained. The formula for calculating the fitness score is expressed by Eq. (5), where the weights of the two scores (0.63 and 0.27) have been defined by analysing the average values of BLEU and METEOR over a sample translation task.
4.1. Experimental setup The experiments conducted involved the Greek-to-English implementation of METIS-II (Tambouratzis et al., 2006). For the purposes of the experiments, the METIS-II mechanism of resolving translation ambiguities using collocations was not activated, since the aim was to optimise the main MT engine by maximising the resolution of translation ambiguities prior to invoking the ambiguity-resolving mechanism. The TL corpus used was a subset of the British National Corpus (BNC World, 2001), consisting of 1,488,000 clauses. BNC has been selected as the TL corpus, since it is one of the most widely-used general-purpose text corpora for the English language. The SL training dataset comprised two distinct Greek sentence sets, TRAIN_A and TRAIN_B. The sentences for both sentence sets were hand-picked by experts in linguistics from the METIS-II project evaluation test corpus, so that they cover a range of known translation problems, such as lexical ambiguities, word order, prepositions and others. It has been decided to use more than one training sets to obtain some indication of how each GA-based optimisation setup performs over different training data. Both training sets contained 50 sentences, whose size ranged between 3 and 15 tokens arranged into 3–8 chunks. For each of the sentences, from both training sets, a reference set of three English sentences, produced by professional translators, has been used for the automatic evaluation. The translators were instructed to check that the BNC corpus did not contain any of the translations they had produced for the sentences. Hence, both the Greek sentence sets TRAIN_A and TRAIN_B as well as the English translations of all sentences used in these experiments were prepared independently from the present study so that the optimisation results would not be biased. 
In order to test objectively the performance of the MT system on sentences other than those of the training set, two different test corpora were used, each consisting of 200 distinct sentences. These sets have also been compiled and used for the METIS-II project evaluation. The first one (EU200) contained sentences extracted from the European Parliament Proceedings Parallel Corpus (Koehn, 2005), while the second one (GR200) comprised sentences obtained from the Hellenic National Corpus (2006) with the additional requirement of covering various phenomena of the Greek language. Regarding the GA configuration, a population size of 30 individuals is chosen. New individuals are created with a crossover rate of 0.45 and a mutation rate of 0.05. During mutation, a unit Gaussian distributed random value is added to 5% of the parameters of the new individuals. If the new parameter value falls outside of the range [0, 1] we produce another Gaussian distributed random value. The termination criterion selected is the completion of 300 generations of individuals. This GA setup was used for experiments described in the present article. This experimental configuration does not necessarily represent the optimal setup, but has only been used as a case study, to examine the validity of the basic hypothesis that the MT system performance can be improved by optimising the system parameters through an evolutionary process. With respect to the fitness functions used, the following methods have been employed: 1. A GA using BLEU as the fitness function. 2. A GA using METEOR as the fitness function.
3. A GA using a weighted sum of the BLEU and METEOR scores as the fitness function:

   f(x) = (0.63 · BleuScore(x) + 0.27 · MeteorScore(x)) / 0.90    (5)
4. A GA employing IQMT as the fitness function. For the framework configuration, BLEU, NIST, METEOR and ROUGE were used as metrics within IQMT.
5. The multi-objective Pareto-based evolutionary algorithm SPEA2, with BLEU and METEOR as objective functions.

SPEA2 (Zitzler et al., 2001) is a Pareto-based multi-objective evolutionary algorithm that is reported to perform very well in comparison to other multi-objective evolutionary algorithms (Corne et al., 2000) and has been used in various applications (for instance Lahanas et al., 2001). The multi-objective algorithm was implemented by modifying SPEA2 so that it conformed to the GA configuration for the given task.

Previous work (Sofianopoulos et al., 2008) had shown that, with a very modest GA configuration applied over only a fraction of the MT system parameters, the optimised parameter set scored higher than the manually-defined parameters on both evaluation sets GR200 and EU200. Furthermore, these initial experiments had raised two important issues regarding the METIS-II system. The first issue was the excessive translation time of the system, with each run of the GA requiring 10–12 h to complete on a DELL workstation with dual Xeon processors clocked at 3.6 GHz. In order to incorporate the GA into the specific task, it proved essential to revisit major modules of the MT system, allowing us to improve the overall system behaviour in terms of memory allocation and processing speed. Via these system improvements, a 52.5% reduction in the total execution time was achieved, enabling the application of the GA in a more realistic setup, as well as increases in the GA population (from 20 to 30 individuals) and in the number of Greek sentences in the training set.
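The mutation scheme and the weighted-sum fitness of Eq. (5) described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the "5% of the parameters" rule means each parameter mutates independently with probability 0.05, and the `mutate`/`weighted_fitness` names are our own.

```python
import random

def mutate(vector, rate=0.05):
    """Gaussian mutation as described in the text: each parameter is
    perturbed (assumed: independently, with probability `rate`) by a
    unit-Gaussian random value; values leaving [0, 1] are redrawn."""
    mutated = list(vector)
    for i in range(len(mutated)):
        if random.random() < rate:
            value = mutated[i] + random.gauss(0.0, 1.0)
            # Redraw while the perturbed value falls outside [0, 1].
            while not 0.0 <= value <= 1.0:
                value = mutated[i] + random.gauss(0.0, 1.0)
            mutated[i] = value
    return mutated

def weighted_fitness(bleu, meteor):
    """Weighted-sum fitness of Eq. (5): (0.63*BLEU + 0.27*METEOR) / 0.90."""
    return (0.63 * bleu + 0.27 * meteor) / 0.90
```

Note that the redraw loop keeps every parameter inside [0, 1] without clipping, so the mutation distribution near the boundaries is not distorted towards the edge values.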
Following the reduction of the processing time by more than half, the time required to run 300 generations of either a GA or SPEA2 on one core of an Intel Q9300-based PC is roughly 50 days. This is because the METIS-II translation system needs approximately 8 min to translate once all 50 sentences of either sentence set TRAIN_A or TRAIN_B, to allow the metaheuristic to evaluate the quality of the translations produced with the parameter vector of the individual under evaluation.

The second issue was the discovery of indexing errors in the database tables used to store the monolingual corpus. These errors concerned the number of chunks recorded for some of the clauses in the clause database; for example, certain clauses containing only 3 chunks were indexed as if they contained 4 chunks. These clause indexing errors had occurred during the METIS-II project and, in some cases, resulted in the retrieval of an unsuitable candidate translation and thus in a more constrained search space from which to determine the final translation. The database was revised by re-indexing the clause database tables with the correct chunk numbers. Taking these issues into account, it was decided to execute the new experiments in two steps. In the first series of experiments, each of the new methods was applied only once to the optimisation task, without performing the revisions to the database, to ensure that the results would be comparable to previous evaluation
results of the MT system, indicating the improvements conferred by the metaheuristic optimisation process. Afterwards, a second series of experiments was carried out using the revised version of the database. In this second series, only three of the optimisation methods were retained, but each of these was repeated 10 times with different random initialisations. This number of runs is not as large as would ideally be chosen, since it was constrained by the long execution time required by each method and by the limited computational resources.
4.2. First series of experimental results

This section presents the results obtained by the aforementioned optimisation methods in our first series of experiments. The aim of this series was to roughly estimate the performance of each method, so as to select the best methods for further study. It was found that, for a modest population of 30 candidate solutions, over 200 iterations were required for an evolutionary optimisation process to settle on a stable solution; a limit of 300 iterations was therefore selected. Fig. 6 illustrates the evolution of the fittest individual of each generation, according to its BLEU score over training set TRAIN_B, for two of the optimisation methods implemented (the weighted sum and SPEA2, both using the BLEU and METEOR metrics), in comparison to the score achieved by the original METIS-II system. It is evident that both optimisation methods outperform the baseline system (Sofianopoulos et al., 2007) throughout the generation range studied.

In order to verify the quality of the results obtained by the optimisation methods, a population of 26,000 random parameter vectors was created and used with METIS-II to generate the corresponding translations of training set TRAIN_B, which were subsequently evaluated using BLEU. Fig. 7 presents the histogram of BLEU scores for this random population, together with indicators of the final BLEU scores achieved by all optimisation methods over the same text. The best result achieved via random search was 0.3060, while the highest score achieved by any GA-optimisation method was 0.3529. The only GA configuration that failed to produce a higher BLEU score than the random search was the one using METEOR as the sole fitness function. This may be attributable to the fact that BLEU and METEOR appear to have local maxima in different portions of the pattern space, and thus evaluating with BLEU cannot reveal the effective optimisation of the METEOR metric.
Still, the results show that the GA-based optimisation is superior to a random search. Throughout each GA simulation a total of 9,000 vectors are evaluated (30 individuals over 300 generations), while the random process generated more than 26,000 vectors (i.e. almost three times as many candidate solutions). A statistical analysis showed that the BLEU scores of the random vectors do not follow a normal distribution, and thus a parametric quantification of the quality of the GA solution is not possible. Still, the average of the BLEU scores is 0.2033, with a standard deviation of 0.0353. The BLEU score of 0.3529 obtained by SPEA2 therefore lies 4.24 standard deviations above the random-population mean, indicating a substantially higher solution quality. This illustrates the effectiveness of the GA weight-optimisation process, the most effective optimisation being achieved by SPEA2.

Fig. 7. Histogram of BLEU scores achieved by the first set of 26,000 random parameters, along with indicators of the highest BLEU scores achieved over training set TRAIN_B by the optimisation techniques of the first series of experiments.

Tables 1 and 2 present the scores achieved by all optimisation methods over the two evaluation sentence sets, for the two training sets respectively. As a rule, METIS-II with the newly-optimised parameters exceeds the performance of both the original METIS-II system (BLEU scores of 0.114 on the EU200 test set and 0.210 on GR200) and the case-study GA-optimised METIS-II presented in Sofianopoulos et al. (2008) (BLEU scores of 0.115 on EU200 and 0.219 on GR200). Regarding the EU200 test set, the GA with the weighted sum of BLEU and METEOR as a fitness function, trained over set TRAIN_A, produced the greatest improvement in translation quality (20.4% over the score of the original system). For the GR200 test set, the SPEA2 multi-objective algorithm trained over set TRAIN_B (cf. Table 2) produced the highest BLEU score, improving on the baseline system (BLEU score of 0.210) by 41.2%.
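The 4.24-standard-deviation figure quoted above is a straightforward z-score computation; the following snippet reproduces it from the numbers given in the text (the function name is ours).

```python
def z_distance(score, mean, std):
    """Distance of a score from a sample mean, in standard deviations."""
    return (score - mean) / std

# Figures from the text: best SPEA2 BLEU 0.3529 vs. the random
# population's mean of 0.2033 and standard deviation of 0.0353.
print(round(z_distance(0.3529, 0.2033, 0.0353), 2))  # 4.24
```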
Fig. 6. BLEU scores of the baseline system and of the fittest individual of each generation for the GA with the weighted BLEU and METEOR score (BLEU + METEOR) and the SPEA2 training methods.

Table 1
First series evaluation scores achieved by the optimised parameters trained over set TRAIN_A.

Opt. method   Evaluation       BLEU score          METEOR score
                               EU200     GR200     EU200     GR200
GA            BLEU             0.1284    0.2339    0.5477    0.6475
GA            METEOR           0.1166    0.2502    0.5495    0.6492
GA            BLEU + METEOR    0.1372    0.2558    0.5552    0.6511
GA            IQMT             0.1230    0.2564    0.6551    0.5441
SPEA2         BLEU, METEOR     0.1335    0.2480    0.5541    0.6485

Table 2
First series evaluation scores achieved by the optimised parameters trained over set TRAIN_B.

Opt. method   Evaluation       BLEU score          METEOR score
                               EU200     GR200     EU200     GR200
GA            BLEU             0.1293    0.2767    0.5511    0.6634
GA            METEOR           0.0880    0.1901    0.5405    0.6523
GA            BLEU + METEOR    0.1212    0.2721    0.5474    0.6635
GA            IQMT             0.1202    0.2759    0.5454    0.6623
SPEA2         BLEU, METEOR     0.1325    0.2965    0.5480    0.6767

For both test sets, the optimisation methods achieving the best scores were the ones based on the combination of multiple metrics. BLEU, being the most widely-used metric for MT evaluation, was used in most of the optimisation methods. Over both test sets, using only METEOR resulted in a lower performance of the optimised parameters than all other configurations. The experimental results indicate that the effectiveness of METEOR as the sole evaluation metric for GA-optimisation is questionable, as it fails to provide reliable results. On the other hand, when used in combination with other evaluation metrics such as BLEU, METEOR contributes to the optimisation by at least providing more diversity in the population, thus allowing the GA to escape from local optima.

The results over the EU200 test set were as a rule substantially lower than those over the GR200 set. This is due to the fact that the reference translations for the GR200 set were created from scratch by professional translators, while for the EU200 test set the corresponding parallel corpus sentences were employed, which represent free rather than exact translations. As a consequence, these translations are difficult for an automated MT system to emulate; the results over the GR200 set are therefore deemed more representative of the quality that can be obtained. Notably, the manual tuning of weights within METIS-II had been an extensive procedure, spanning a period of several months and involving expert knowledge encompassing various aspects of the translation process. No automated optimisation method can easily incorporate such knowledge on the basis of automatic evaluation metrics and a handful of sentences from which to extract information. Still, the higher scores achieved for both evaluation sets via the GA indicate that the approach proposed here succeeds in defining parameter values which replicate the extraction of expert knowledge via a learn-by-example process and which improve upon the manual optimisation.
4.3. Second series of experimental results

In order to confirm the preliminary results obtained in the first series of experiments, and also to evaluate the effect of the revised database on the optimisation results, a second series of experiments was carried out. For this second series, three of the optimisation methods were selected, on the basis of their previous evaluation results, for more extensive study: (a) the GA with BLEU as a fitness function, (b) the GA with the weighted sum of the BLEU and METEOR scores as a fitness function and (c) the SPEA2 multi-objective optimisation algorithm. Sentence set TRAIN_B was selected as the training data set and the same GA scheme as in the first series of experiments was used. For each of these optimisation methods, after the completion of the 300th generation, the mean and standard deviation of the BLEU and METEOR scores achieved for the evaluation sets GR200 and EU200 were calculated over multiple runs of the algorithm with different randomly seeded initial populations. Based on the availability of computing resources, each method was simulated for 10 independent runs of 300 generations.

In order to comparatively evaluate the quality of the results of this new series of experiments, a population of 15,000 random parameter vectors was created and evaluated using METIS-II and BLEU over the same training set (TRAIN_B). Fig. 8 presents the histogram of BLEU scores for this second random population, together with indicators of the mean BLEU scores achieved over the 10 runs by each of the three optimisation methods over training text TRAIN_B. When compared to the BLEU scores of the first random population (Fig. 7), the histogram of the new random population is displaced to the right (indicating a higher translation quality). The mean BLEU value of the new random population is 0.2327 with a standard deviation of 0.0448, while the previous random population had a mean value of 0.1986 and a standard deviation of 0.038. The highest BLEU score achieved by the random search over sentence set TRAIN_B was 0.3262, while the BLEU scores achieved by all three optimisation methods were significantly higher, with the GA using BLEU as a fitness function scoring the highest (0.4019). This indicates that the revisions to the database improve the translation quality of the MT system, by effectively expanding the search for the optimal translation.

Fig. 8. Histogram of BLEU scores achieved by the second set of 15,000 random parameters, along with indicators of the highest BLEU scores achieved over training set TRAIN_B by each of the three optimisation techniques of the second series of experiments.

The mean score and standard deviation over the 10 runs, as well as the maximum score, for the BLEU and METEOR MT evaluation metrics, for each of the optimisation methods over evaluation sets GR200 and EU200, are presented in Tables 3 and 4, respectively. The mean values of both the BLEU and METEOR scores for all three optimisation methods substantially exceed their previous performance, this being the beneficial effect of improving the database consistency. SPEA2 achieves the highest mean value for both MT metrics over both evaluation sets, thus validating the results of the first series of experiments, where SPEA2 had been found to share the top performance with the GA using the weighted sum of BLEU and METEOR.

Table 3
GR200 evaluation scores (mean, standard deviation and maximum over 10 runs) for the second series of experiments trained over set TRAIN_B.

Opt. method   Evaluation       BLEU score                    METEOR score
                               Mean (std. dev.)   Max        Mean (std. dev.)   Max
GA            BLEU             0.3082 (0.0073)    0.3210     0.6651 (0.0048)    0.6718
GA            BLEU + METEOR    0.3123 (0.0030)    0.3165     0.6684 (0.0037)    0.6739
SPEA2         BLEU, METEOR     0.3110 (0.0077)    0.3256     0.6693 (0.0051)    0.6794

Table 4
EU200 evaluation scores (mean, standard deviation and maximum over 10 runs) for the second series of experiments trained over set TRAIN_B.

Opt. method   Evaluation       BLEU score                    METEOR score
                               Mean (std. dev.)   Max        Mean (std. dev.)   Max
GA            BLEU             0.1529 (0.0037)    0.1593     0.5512 (0.0034)    0.5548
GA            BLEU + METEOR    0.1510 (0.0038)    0.1561     0.5524 (0.0023)    0.5559
SPEA2         BLEU, METEOR     0.1545 (0.0060)    0.1657     0.5535 (0.0043)    0.5611

From the standard deviations computed for all methods, it is evident that there is low variability between the scores obtained in different runs, for both the BLEU and METEOR metrics over both evaluation sets EU200 and GR200, with the SPEA2 algorithm exhibiting a slightly higher standard deviation than the two Genetic Algorithms. Even though the highest BLEU score achieved by the SPEA2 algorithm over the 10 runs on training set TRAIN_B was only 0.3811 (the lowest of the three optimisation methods, according to Fig. 8), it recorded the highest maximum BLEU and METEOR scores on both evaluation sets, as well as the highest BLEU and METEOR mean scores, with the sole exception of the BLEU mean score on the GR200 evaluation set. This indicates that SPEA2 achieves a more effective training, whilst the other GA-based methods may be more susceptible to overtraining. By evaluating the population using two distinct criteria, namely the BLEU and METEOR metrics, the system manages to generate better translations, which is reflected in the metric scores.

The highest BLEU score achieved for the EU200 evaluation set was 0.1657, by the SPEA2 algorithm, while in the first series of experiments the highest BLEU score had been 0.1372, achieved by the GA with the BLEU and METEOR weighted sum (a 20.7% improvement). For the GR200 evaluation set, the highest BLEU score was 0.3256, achieved by the SPEA2 algorithm, while in the first series of experiments the respective score had been 0.2965, also achieved by SPEA2 (a 9.8% improvement). This confirms that SPEA2 leads to the selection of the best weight vectors.

To investigate the properties of the populations generated by the three metaheuristics, a statistical analysis was performed comparing all possible pairs of metaheuristics. For each run, only the optimal set of weights was retained as the representative of the population, the aim being to determine the BLEU and METEOR scores reflecting the translation quality. Thus, the three metaheuristic methods were compared on the basis of populations of 10 scores each. More specifically, using an independent-samples T-test, it was found that the populations generated by the GA using BLEU and the GA using the weighted sum of BLEU and METEOR do not have statistically-significantly different means in terms of either their BLEU or METEOR scores.
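The pairwise comparison described above can be sketched with a pooled-variance independent-samples t statistic. This is an illustrative reconstruction, not the authors' code, and the score lists below are synthetic stand-ins for the 10 best-of-run scores of two methods, not the paper's data.

```python
from statistics import mean, stdev

def t_statistic(a, b):
    """Pooled-variance independent-samples t statistic for two score lists,
    as used to compare the best-of-run scores of two metaheuristics."""
    na, nb = len(a), len(b)
    # Pooled variance over (na + nb - 2) degrees of freedom.
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1.0 / na + 1.0 / nb)) ** 0.5

# Synthetic stand-ins for two sets of 10 best-of-run BLEU scores
# (illustrative values only, NOT the paper's data).
method_a = [0.3082, 0.3110, 0.3050, 0.3120, 0.3095,
            0.3070, 0.3105, 0.3088, 0.3140, 0.3060]
method_b = [0.3110, 0.3256, 0.3020, 0.3150, 0.3080,
            0.3190, 0.3045, 0.3130, 0.3175, 0.3095]

t = t_statistic(method_a, method_b)
# Two-tailed critical value for 18 degrees of freedom at the 95% level.
significant = abs(t) > 2.101
```

With 10 runs per method there are 18 degrees of freedom, hence the 2.101 critical value for a two-tailed test at the 95% significance level.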
However, SPEA2 was found to produce populations whose mean values differ in a statistically-significant sense from those of both GA-based methods: the BLEU scores of the SPEA2 populations were significantly lower (at a 95% level of significance), while their METEOR scores were significantly higher (again at the 95% level). This indicates the different behaviour of SPEA2, which searches the space more efficiently by using both BLEU and METEOR. Fig. 9 shows an example of how the Pareto front of the optimal SPEA2 solutions evolves over different generations, illustrating how the values of both the BLEU and METEOR metrics improve during the later generations of the evolution process.
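The Pareto front shown in Fig. 9 consists of the non-dominated (BLEU, METEOR) score pairs. The following sketch shows the dominance test at the heart of such Pareto-based ranking; it is a minimal illustration with made-up score pairs, not the full SPEA2 fitness assignment (which additionally uses strength and density information).

```python
def dominates(p, q):
    """True if objective vector p Pareto-dominates q (both maximised):
    p is at least as good in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def nondominated(points):
    """Return the Pareto front of a set of (BLEU, METEOR) score pairs."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Illustrative (BLEU, METEOR) pairs -- not the paper's data.
scores = [(0.30, 0.66), (0.32, 0.64), (0.28, 0.67), (0.31, 0.65), (0.29, 0.63)]
front = nondominated(scores)
```

In this example only (0.29, 0.63) is dominated (by (0.30, 0.66)); the remaining four pairs trade BLEU against METEOR and together form the front.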
4.4. Comparison with other optimisation methods

For comparison with the proposed method, the recorded improvements of four other MT optimisation methods proposed in the literature are presented here, together with the improvement achieved via a GA on the parameters of the Clue Alignment approach (Tiedemann, 2005), which is expressed in terms of precision and recall. The purpose is not to directly compare the effectiveness of these optimisation methods to the approach introduced in this article, but only to provide an indication of the effectiveness of the approach; a direct comparison cannot be carried out, as the various systems translate different language pairs and are evaluated over different sentence sets. The optimisation methods selected for comparison are the Joint Optimisation Strategy (He and Toutanova, 2009), the Downhill Simplex method (Bender et al., 2004), the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm (Lambert and Banchs, 2007) and the Simplex Armijo Downhill algorithm (Zhao and Chen, 2009).

The Joint Optimisation Strategy was applied to a Chinese–English translation system that combines the results of other translation systems in order to create a higher-quality translation. As a test set, He and Toutanova (2009) selected the second half of the NIST MT08 C2E test set, containing 691 newswire and 666 web-data sentences. According to their evaluation results, the best BLEU score achieved by the joint decoding method over the test set was 0.3720, while the best BLEU score achieved by one of the initial MT systems (more specifically, System B) was 0.3203, corresponding to a 16.14% improvement.

The Downhill Simplex method (Bender et al., 2004) was used to optimise the model scaling factors of a Chinese–English and a Japanese–English translation system, both employing the same statistical algorithm and both trained and evaluated using more than one corpus.
In each case, the test set used consisted of approximately 500 sentences. Bender et al. (2004) report an increase in the BLEU score from 0.453 to 0.467 (an increase of 3.9%) for the Japanese–English system.

The optimisation method using the SPSA algorithm (Lambert and Banchs, 2007) was applied to a Chinese–English SMT system for the optimisation of the feature function selection of the language model. The authors did not use any test data for the evaluation, but rather performed the optimisation process multiple times, each time starting from different initial points, and reported the BLEU scores at various iterations. The system parameters were tuned over the development set of the IWSLT'06 evaluation campaign (http://www.slc.atr.jp/IWSLT2006/). After performing the optimisation process twice, they concluded that one can on average expect a 0.4% BLEU improvement when optimising with the SPSA algorithm.

Fig. 9. Evolution of the resulting Pareto front of the optimal SPEA2 solutions over different generations.

Finally, the Simplex Armijo Downhill algorithm (Zhao and Chen, 2009), a variation of the Downhill Simplex method, was used for the optimisation of the parameters of a Chinese–English SMT system, trained and evaluated over sentences from the GALE P3/P3.5 evaluations. The test set consisted of approximately 565 sentences from transcriptions of broadcast news and conversations. When compared to the Downhill Simplex method over the same evaluation data, the Simplex Armijo Downhill algorithm scores 1.17% higher on the BLEU metric (0.3815 over 0.3771).

Table 5 illustrates the improvement in BLEU scores achieved by the proposed GA-based optimisation method in comparison to the recorded improvements of the other MT optimisation methods; additionally, the improvement achieved via a GA on the Clue Alignment approach parameters (Tiedemann, 2005) is expressed in terms of precision and recall. The GA-based method leads to a maximum improvement of 45.5% in the BLEU metric. This compares favourably with the improvements achieved by the other methods, which give an improvement of at most 16.4% in MT accuracy.

Table 5
Improvement of translation quality using various MT optimisation methods, including the proposed method.

Optimisation method                                        Improvement (basis of measurement)
Genetic Algorithm (Tiedemann, 2005)                        6% (F-score)
Joint Optimisation Strategy (He and Toutanova, 2009)       16.4% (BLEU score)
Downhill Simplex method (Bender et al., 2004)              3.9% (BLEU score)
SPSA algorithm (Lambert and Banchs, 2007)                  0.4% (BLEU score)
Simplex Armijo Downhill algorithm (Zhao and Chen, 2009)    1.17% (BLEU score)

Furthermore, even the GA-based Clue Alignment process gives an improvement of 6% when applied to Natural Language Processing tasks. As previously noted, since these methods are applied to the probabilistic parameters of SMT systems, they cannot be directly compared to the proposed method, as they are required to optimise the parameters of complex statistical language models. The one method that differs in this respect and is closest to the method proposed in the present article is the Joint Optimisation Strategy, as it is applied to a hybrid MT system with the purpose of combining various MT systems. Notably, even though this optimisation method gives an improvement of 16.4%, this improvement is much lower than that achieved by the method proposed here.
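The percentage improvements quoted throughout this comparison are relative gains over a baseline score. The following snippet (function name ours) reproduces two figures stated in the text: the first-series GR200 gain of SPEA2 (0.2965 vs. the baseline's 0.210) and the Joint Optimisation Strategy gain (0.3720 vs. 0.3203).

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, as a percentage."""
    return 100.0 * (new - old) / old

# First-series SPEA2 BLEU on GR200 vs. the baseline METIS-II score.
print(round(relative_improvement(0.2965, 0.210), 1))   # 41.2
# Joint Optimisation Strategy vs. its best component system.
print(round(relative_improvement(0.3720, 0.3203), 2))  # 16.14
```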
5. Conclusion

In this paper, a method has been proposed for the automated optimisation of the real-valued parameters employed by an MT system, towards the generation of high-quality translations. The optimisation method is based on the application of a single-objective elitist real-valued GA which tunes the system parameters, guided by a variety of MT evaluation metrics as fitness functions. Experiments were also conducted using the SPEA2 multi-objective evolutionary algorithm as an alternative to Genetic Algorithms. The experimental results verify the effectiveness of the GA-based approach, since the employment of the optimised parameters leads to translations of a substantially higher quality than those achieved by the baseline system. In a first series of experiments, this improvement is quantified by peak rises of 20.35% and 41.2% in the BLEU score over two distinct test sets of different origin and characteristics. In a second series of experiments, following refinements in the TL corpus database, more extensive experimentation was performed for the three optimisation methods that had proved most effective. In this case, the effect of different initial random seeds was also studied, by performing as many runs as the available computing resources allowed. With the improved TL corpus database, further rises of 9.8% and 20.7% were recorded in the BLEU scores over the two evaluation sets used. The new experiments confirmed that the results can be substantially improved if the quality of the corpus is enhanced, and that the
selected optimisation methods generate consistent results and comparable improvements over independent runs. The quality of the translations generated by the MT system via the optimisation techniques is further improved over the baseline system, thus validating the effectiveness of the proposed optimisation process. It should be noted that though SPEA2 does not necessarily give the best performance on the training set, it outperforms Genetic Algorithms on previously unseen test sets. To achieve this performance, SPEA2 needs more generations to settle. However, even more substantial improvements are attainable. To that end, further experimentation is considered essential to investigate key GA aspects, such as the details of the crossover and mutation operations or the population size. Additionally, more experiments need to be performed regarding the various MT evaluation metrics and the optimisation of their combination and concurrent exploitation. MT evaluation plays an important role in modern MT systems, and should be thoroughly examined in order to implement MT systems that can produce translations that are as natural as possible. At the same time, customisation to the texts being processed is essential to achieve the highest possible translation quality. Thus, the aspect of automatic optimisation of parameters gains an increased importance. In this respect, it is claimed that the work described in the current article can be of particular benefit to future research in the area of Machine Translation. Acknowledgement This research has been supported by the PENED programme 03ED251, funded by the General Secretariat for Research and Technology of Greece. The authors wish to acknowledge the assistance of Ms M. Vassiliou. References Ballester, P.J., Carter, J.N., 2003. Real parameter genetic algorithms for finding multiple optimal solutions in multimodal optimization. In: Erick CantuPaz et al. 
(Eds.), Proceedings of the Genetic and Evolutionary Computation Conference. Lecture Notes in Computer Science, Springer, pp. 706–717. Banerjee, S., Lavie, A., 2005. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43th Annual Meeting of the Association of Computational Linguistics (ACL’05). Ann Arbor, Michigan. Bender, O., Zens, R., Matusov, E., Ney, H., 2004. Alignment templates: the RWTH SMT system. In: IWSLT 2004, Proceedings of the International Workshop on Spoken Language Translation. Kyoto, Japan, pp. 79–84. Black, A.W., Campbell, N., 1995. Optimizing selection of units from speech databases for concatenative synthesis. In: Proceedings of Eurospeech’95, vol. 1. Madrid, Spain, pp. 581–584. Blanco, A., Delgado, M., Pegalajar, M.C., 2001. A real-coded genetic algorithm for training recurrent neural networks. Neural Networks 14 (1), 93–105. British National Corpus, version 2 (BNC World), 2001. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http:// www.natcorp.ox.ac.uk/. Brown, P.F., Cocke, J., Della Pietra, S., Della Pietra, V., Jelinek, F., Mercer, R., Roossin, P., 1990. A statistical approach to machine translation. Computational Linguistics 16 (2), 79–85. Carl, M., Melero, M., Badia, T., Vandeghinste, V., Dirix, P., Schuurman, I., Markantonatou, S., Sofianopoulos, S., Vassiliou, M., Yannoutsou, O., 2008. METIS-II: low resource machine translation. Mach. Translat. 22 (1–2), 67–99. Corne, D.W., Knowles, J.D., Oates, M.J., 2000. The Pareto envelope-based selection algorithm for multiobjective optimization. In Lecture Notes in Computer Science, vol. 1917. Springer, Berlin. pp. 839–848. Deb, K., 2001. Multi-objective Optimization using Evolutionary Algorithms. Wiley, Chichester. Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. 
Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39, 1–38. Echizen-ya, H., Araki, K., Momouchi, Y., Tochinai, K., 1996. Machine translation method using inductive learning with genetic algorithms. In: Proceedings of the COLING’96 Conference. Copenhagen, Denmark, pp. 1020–1023. Estrella, P., Popescu-Belis, A., King, M., 2007. A new method for the study of correlations between MT evaluation metrics. In: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation. Skovde, Sweden, pp. 55–64.
Gautam, M., Sinha, R.M.K., 2007. A hybrid approach to sentence alignment using genetic algorithm. In: Proceedings of the ICCTA’ 07 Conference. India, pp. 480– 484. Giménez, J., Amigó, E., 2006. IQMT: a framework for automatic machine translation evaluation. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06). Genoa, Italy, pp. 685–690. Goldberg, D.E., 1989. Genetic Algorithms in Search. Optimization and Machine Learning. Addison-Wesley. Graybill, F.A., 1961. An Introduction to Linear Statistical Models. McGraw-Hill, New York. He, X., Toutanova, K., 2009. Joint optimization for machine translation system combination. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore, pp. 1202–1211. Hellenic National Corpus, version 3. 2006. URL: http://www.hnc.ilsp.gr/. Herrera, F., Lozano, M., Verdegay, J.L., 1998. Tackling real-coded genetic algorithms: operators and tools for behavioral analysis. Artif. Intell. Rev. 12, 265–319. Herrera, F., Lozano, M., Sanchez, A.M., 2003. A taxonomy for the crossover operator for real-coded genetic algorithms: an experimental study. Int. J. Intell. Syst. 18, 309–338. Holland, J., 1975. Adaption in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, Michigan. Hui, W.J., Xi, Y.G., 1996. Operation mechanism analysis of genetic algorithm. Control Theory Appl. 13 (3), 297–303. Koehn, P., 2005. A parallel corpus for statistical machine translation. In: Proceedings of MT Summit X. Phuket, Thailand, pp. 79–86. Kuhn, H.W., 1955. The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2, 83–97. Lahanas, M., Milickovic, N., Baltas, D., Zamboglou, N., 2001. Application of multiobjective evolutionary algorithms for dose optimization problems in brachytherapy. In: Proceedings of the First International Conference on Evolutionary Multi-Criterion Optimization (EMO 2001). Lecture Notes in Computer Science, vol. 1993. 
Springer-Verlag, Berlin, pp. 574–587.
Lambert, P., Banchs, R.E., 2007. SPSA vs simplex in statistical machine translation optimization. In: Proceedings in Applied Mathematics and Mechanics, vol. 7, no. 1. Special Issue: Sixth International Congress on Industrial Applied Mathematics (ICIAM07) and GAMM Annual Meeting. Zürich, pp. 1062503–1062504.
Mauser, A., Hasan, S., Ney, H., 2008. Automatic evaluation measures for statistical machine translation system optimization. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech, Morocco, pp. 28–30.
Munkres, J., 1957. Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 5, 32–38.
Nabhan, A.R., Rafea, A., 2005. Tuning statistical machine translation parameters using perplexity. In: Proceedings of the IRI-05 Information Reuse and Integration Conference. Las Vegas, pp. 338–343.
NIST: Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics, 2002. Available from: .
Papineni, K., Roukos, S., Ward, T., Zhu, W.J., 2002. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, USA, pp. 311–318.
Siegel, E.V., McKeown, K.R., 1996. Gathering statistics to aspectually classify sentences with a genetic algorithm. In: Proceedings of the Second International Conference on New Methods in Language Processing. Ankara, Turkey.
Sofianopoulos, S., Spilioti, V., Vassiliou, M., Yannoutsou, O., Markantonatou, S., 2007. Demonstration of the Greek to English METIS-II MT system. In: Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-07). Skövde, Sweden, pp. 199–205.
Sofianopoulos, S., Tambouratzis, G., Carayannis, G., 2008. Parameter optimization of MT systems operating on monolingual corpora employing a genetic algorithm. In: Proceedings of the IEEE International Conference on Distributed Human-Machine Systems. Athens, Greece, pp. 260–265.
Syswerda, G., 1989. Uniform crossover in genetic algorithms. In: Schaffer, J.D. (Ed.), Proceedings of the 3rd International Conference on Genetic Algorithms. San Mateo, CA, pp. 2–9.
Tambouratzis, G., Sofianopoulos, S., Spilioti, V., Vassiliou, M., Yannoutsou, O., Markantonatou, S., 2006. Pattern matching-based system for machine translation (MT). In: Advances in Artificial Intelligence: 4th Hellenic Conference on AI, Heraklion, Greece. Lecture Notes in Artificial Intelligence, vol. 3955. Springer-Verlag, pp. 345–355.
Tiedemann, J., 2005. Optimization of word alignment clues. Nat. Lang. Eng. 11 (3), 279–293.
Wright, A.H., 1991. Genetic algorithms for real parameter optimization. In: Foundations of Genetic Algorithms. Morgan Kaufmann, pp. 205–218.
Zhao, B., Chen, S., 2009. A simplex Armijo downhill algorithm for optimizing statistical machine translation decoding parameters. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers. Boulder, Colorado, pp. 21–24.
Zitzler, E., Laumanns, M., Thiele, L., 2001. SPEA2: Improving the Strength Pareto Evolutionary Algorithm. Technical Report TIK-Report 103, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland.