Comparative genomic analysis of eutherian globin genes


Marko Premzl

PII: S2452-0144(16)30089-9
DOI: 10.1016/j.genrep.2016.10.009
Reference: GENREP 88
To appear in: Gene Reports
Received date: 23 October 2016
Revised date: 24 October 2016
Accepted date: 24 October 2016

Please cite this article as: Premzl, Marko, Comparative genomic analysis of eutherian globin genes, Gene Reports (2016), doi: 10.1016/j.genrep.2016.10.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Title: Comparative genomic analysis of eutherian globin genes
Author: Marko Premzl
Address: Laboratory of Genomics, Centre of Animal Reproduction, 55 Heinzel St., Zagreb, Croatia

E-mail contact address: [email protected]

Abstract

Using eutherian comparative genomic analysis protocol and public genomic sequence assemblies, the present study attempted to update and revise eutherian globin genes implicated in major physiological and pathological processes. The most comprehensive curated third party data gene data set of eutherian globin genes annotated 149 complete coding sequences among 299 potential coding sequences. The present analysis first described 8 major gene clusters of eutherian globin genes, and explained their differential gene expansion patterns. The integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis proposed updated classification and nomenclature of eutherian globin genes.

Keywords: comparative genomic analysis, gene annotations, molecular evolution, phylogenetic analysis, RRID:SCR_014401

1. Short Communication

The eutherian globin genes were implicated in elevation of haematocrit levels and polycythaemia after exposure of non-adapted humans to high-altitude environments, as well as in major physiological and pathological processes (Lorenzo et al. 2014). Specifically, the globins were implicated in oxygen and carbon dioxide transport by blood and in haemoglobinopathies (Voet and Voet 2004; Park et al. 2006; Levantino et al. 2012). In addition, the eutherian globins were paradigmatic in comparative genomic analyses (Hardison 2012; Storz et al. 2013), as well as in calculations of reference species phylogenies (Goodman et al. 1998). Indeed, the patterns of evolution of globin genes were studied in eutherians (Hardison and Miller 1993; Opazo et al. 2008; Hardison 2012; Storz et al. 2013; Gaudry et al. 2014), rodents (Storz et al. 2008), cetaceans (Nery et al. 2013) and primates (Moleirinho et al. 2015). Of note, there were several revisions of eutherian globin gene classifications and nomenclatures, describing 5 major globin types (Hardison and Miller 1993; Opazo et al. 2008; Hardison 2012). Specifically, there were α-globins (α-globin or HBA, μ-globin or HBM, θ-globin or HBQ and ζ-globin or HBZ), β-globins (β-globin or HBB, δ-globin or HBD, ε-globin or HBE, η-globin or HBH and γ-globin or HBG), cytoglobin or CYGB, myoglobin or MB and, finally, neuroglobin or NGB. Yet, due to the incompleteness of public eutherian genomic sequence data sets (International Human Genome Sequencing Consortium 2001; Harrow et al. 2012) and potential genomic sequence errors (International Human Genome Sequencing Consortium 2004), future revisions of comprehensive eutherian gene data sets were expected. For example, the eutherian comparative genomic analysis protocol, which integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis into one framework of eutherian gene descriptions, was established as guidance in protection against potential genomic sequence errors (Premzl 2016a, 2016b). The protocol included new genomics and protein molecular evolution tests applicable in revisions of 8 major eutherian gene data sets, including 1023 complete coding sequences deposited in the European Nucleotide Archive as curated third party data gene data sets. Thus, using the eutherian comparative genomic analysis protocol and public genomic sequence assemblies,
the present study attempted to update and revise comprehensive eutherian globin GLN gene data sets.

Therefore, the gene identifications of potential GLN coding sequences used public eutherian genomic sequence assemblies downloaded from the National Center for Biotechnology Information (NCBI) GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/) and NCBI's BLAST programs (ftp://ftp.ncbi.nlm.nih.gov/blast/). In addition, the public eutherian genomic sequences were downloaded from the Ensembl genome browser (http://www.ensembl.org). In analyses and manipulations of GLN nucleotide and protein sequences, the sequence alignment editor BioEdit 7.0.5.3 was used (http://www.mbio.ncsu.edu/BioEdit/bioedit.html). The GLN gene feature analyses included direct evidence of gene annotations available in NCBI's nr, est_human, est_mouse and est_others databases (https://www.ncbi.nlm.nih.gov). The tests of reliability of eutherian public genomic sequences used potential GLN coding sequences. Nucleotide sequence coverages of each potential GLN coding sequence were analysed in the first test steps, using NCBI's BLAST programs and genomic sequence reads deposited in NCBI's Trace Archive (https://www.ncbi.nlm.nih.gov/Traces/trace.cgi). The potential GLN coding sequences were described as complete GLN coding sequences only if consensus read nucleotide sequence coverages were available for every nucleotide of each potential GLN coding sequence. Otherwise, potential GLN coding sequences were described as putative GLN coding sequences. The complete GLN coding sequences were deposited in the European Nucleotide Archive as third party data and used in the present analyses (http://www.ebi.ac.uk/ena/about/tpa-policy) (Gibson et al. 2016). In the revised GLN gene nomenclature and classification, the guidelines of human (http://www.genenames.org/about/guidelines) and mouse (http://www.informatics.jax.org/mgihome/nomen/gene.shtml) gene nomenclatures were used. The multiple pairwise genomic sequence alignments of GLN genes used mVISTA's AVID program with default settings (http://genome.lbl.gov/vista/index.shtml). In alignments with base sequences (Homo sapiens), the cut-offs of detection of common genomic sequence regions were: 95% per 100 bp (Homo sapiens, Pan troglodytes, Gorilla gorilla), 90% per 100 bp (Pongo abelii, Nomascus leucogenys), 85% per 100 bp (Macaca mulatta, Papio hamadryas), 80% per 100 bp (Callithrix jacchus), 75% per 100 bp
(Tarsius syrichta, Otolemur garnettii), 65% per 100 bp (rodents) or 70% per 100 bp in other pairwise alignments (Premzl 2016a). In base sequences, the human GLN exons were annotated using transcripts BC007075.1 (GLNA1), BC070282.1 (GLNA2), BC010913.1 (GLNB1), BC056686.1 (GLND1) and BC035682.1 (GLNE1). The detection and masking of transposable elements in base sequences used the RepeatMasker program version open-4.0.5 with default settings, except that simple repeats and low complexity elements were not masked (sensitive mode, cross_match version 1.080812, RepBase Update 20140131 and RM database version 20140131) (http://www.repeatmasker.org/). Using ClustalW implemented in BioEdit 7.0.5.3, the common predicted primate GLN promoter genomic sequence regions were aligned at nucleotide sequence level and manually corrected. The pairwise nucleotide sequence identities of common predicted primate GLN promoter genomic sequence regions were calculated using BioEdit 7.0.5.3 and used in statistical analyses, including calculations of average pairwise identities (ā) and their average absolute deviations (āad), largest pairwise identities (amax) and smallest pairwise identities (amin) (Microsoft Office Excel).

The most comprehensive curated third party data gene data set of eutherian GLN genes annotated 149 complete coding sequences among 299 potential coding sequences (Figure 1A) (Supplementary data file 1). Under accession numbers LT548096-LT548244, the eutherian GLN gene data set was deposited in the European Nucleotide Archive (http://www.ebi.ac.uk/ena/data/view/LT548096-LT548244). The major cluster GLNA included 37 HBB, HBD and HBH genes (Supplementary data file 2A). There were 2 common predicted primate GLNA promoter genomic sequence region types annotated, having average pairwise nucleotide sequence identities ā = 0.89 (amax = 0.99, amin = 0.725, āad = 0.09) (Supplementary data file 3A) and ā = 0.922 (amax = 0.981, amin = 0.857, āad = 0.049) (Supplementary data file 3B). Likewise, the major cluster GLNB included 40 HBG and HBE genes (Supplementary data file 2B). There were 2 common predicted primate GLNB promoter genomic sequence region types annotated, having ā = 0.872 (amax = 0.993, amin = 0.68, āad = 0.088) (Supplementary data file 3C) and ā = 0.861 (amax = 0.988, amin = 0.697, āad = 0.099) (Supplementary data file 3D). The major cluster GLNC included 15 HBA genes (Supplementary data file 2C). The common predicted primate GLNC promoter genomic sequence region was annotated, having ā
= 0.883 (amax = 0.994, amin = 0.794, āad = 0.05) (Supplementary data file 3E). There were 10 HBQ genes included in major gene cluster GLND (Supplementary data file 2D). Among primates except Otolemur garnettii, the common predicted GLND promoter genomic sequence region was annotated, having ā = 0.838 (amax = 0.961, amin = 0.736, āad = 0.083) (Supplementary data file 3F). The major cluster GLNE included 14 HBM and HBZ genes (Supplementary data file 2E). The common predicted primate GLNE promoter genomic sequence region was annotated, having ā = 0.978 (Supplementary data file 3G). There were 6 CYGB genes included in major gene cluster GLNF (Supplementary data file 2F). Among primates, the common predicted GLNF promoter genomic sequence region was annotated, having ā = 0.939 (Supplementary data file 3H). The major cluster GLNG included 14 MB genes (Supplementary data file 2G). The common predicted primate GLNG promoter genomic sequence region was annotated, having ā = 0.934 (amax = 0.986, amin = 0.891, āad = 0.028) (Supplementary data file 3I). Finally, the major cluster GLNH included 13 NGB genes (Supplementary data file 2H). The common predicted primate GLNH promoter genomic sequence region was annotated, having ā = 0.881 (amax = 0.988, amin = 0.698, āad = 0.113) (Supplementary data file 3J).

Next, the complete GLN coding sequences were translated using BioEdit 7.0.5.3, and aligned at amino acid level using ClustalW implemented in BioEdit 7.0.5.3. After manual corrections of protein primary sequence alignments, GLN nucleotide sequence alignments were prepared accordingly. The MEGA 6.06 program was used in GLN phylogenetic tree calculations (http://www.megasoftware.net), using the neighbour-joining method (default settings, except gaps/missing data treatment=pairwise deletion), minimum evolution method (default settings, except gaps/missing data treatment=pairwise deletion), maximum parsimony method (default settings, except gaps/missing data treatment=use all sites) and unweighted pair group method with arithmetic mean (default settings, except gaps/missing data treatment=pairwise deletion). However, the maximum likelihood methods were not used in the present analysis because their homogeneity and stationarity assumptions were not satisfied (data not shown). The pairwise nucleotide sequence identities of GLN nucleotide sequence alignments were calculated using BioEdit 7.0.5.3, and used in statistical analyses including calculations of average pairwise identities (ā) and their average absolute deviations (āad),
largest pairwise identities (amax) and smallest pairwise identities (amin) (Microsoft Office Excel).

Using phylogenetic tree calculations (Figure 1A) and calculations of pairwise nucleotide sequence identity patterns (Supplementary data file 4), the present phylogenetic analysis first described 8 major gene clusters of eutherian globin genes GLNA-GLNH. The major GLN protein sequence alignment landmarks included protein primary structure features, as well as calculated invariant and forward amino acid sites (Figure 1B). Identical major phylogenetic tree topologies were calculated using different methods (Figure 1A). For example, the minimum evolution tree topology included clustering of major gene clusters GLNA and GLNB, clustering of major gene clusters GLNC-GLNE, and, finally, clustering of major gene clusters GLNF-GLNH. Each of the major gene clusters GLNA-GLNE included evidence of differential gene expansions, but each of the major gene clusters GLNF-GLNH included orthologues only. Of note, the present phylogenetic classification of eutherian GLN genes was in major disagreement with analyses of Hardison and Miller (1993), Opazo et al. (2008), Storz et al. (2008), Hardison (2012), Nery et al. (2013), Storz et al. (2013), Gaudry et al. (2014) and Moleirinho et al. (2015). Specifically, the major gene cluster GLNA integrated HBB, HBD and HBH genes, major gene cluster GLNB integrated HBG and HBE genes, major gene cluster GLNC included HBA genes, major gene cluster GLND included HBQ genes and, finally, major gene cluster GLNE integrated HBM and HBZ genes. On the other hand, the clustering of major gene clusters GLNF (CYGB genes), GLNG (MB genes) and GLNH (NGB genes) agreed with analyses of Opazo et al. (2008), Hardison (2012), Gaudry et al. (2014) and Moleirinho et al. (2015), but disagreed with the analysis of Storz et al. (2013). Indeed, the calculations of pairwise nucleotide sequence identity patterns among 8 major gene clusters confirmed the present phylogenetic classification of eutherian GLN genes (Supplementary data file 4). The complete GLN nucleotide sequence alignments included average pairwise nucleotide sequence identity ā = 0.498 (amax = 1, amin = 0.132, āad = 0.18). Within each of the 8 eutherian GLN major gene clusters, there were nucleotide sequence identity patterns of very close eutherian orthologues and paralogues (GLNA and GLNB), close eutherian orthologues and paralogues (GLNC, GLND, GLNE), close eutherian orthologues (GLNF) and typical eutherian orthologues (GLNG and GLNH). In comparisons between eutherian GLN
major gene clusters, there were nucleotide sequence identity patterns of very close eutherian homologues in comparisons between major gene clusters GLNA and GLNB, as well as in comparisons between major gene clusters GLNC, GLND and GLNE. In other comparisons between major gene clusters, there were nucleotide sequence identity patterns of close or typical eutherian homologues. However, the exceptions were nucleotide sequence identity patterns of distant eutherian homologues in comparisons between major gene cluster GLNH and each of the major gene clusters GLNB, GLNC and GLND.

Finally, the entire present eutherian GLN gene data set including 149 complete coding sequences was included in tests of protein molecular evolution (Supplementary data file 5). In calculations of GLN codon usage statistics, the MEGA 6.06 program was used. The ratios between observed and expected amino acid codon counts determined relative synonymous codon usage statistics (R). Specifically, the non-preferred amino acid codons (R≤0.7) included TTA (0.03), TTG (0.43), CTT (0.22), CTA (0.18), ATA (0.29), GTT (0.47), GTA (0.11), TCA (0.2), TCG (0.34), CCA (0.59), CCG (0.62), ACA (0.4), ACG (0.29), GCA (0.33), GCG (0.44), TAT (0.38), CAT (0.68), CAA (0.24), AAT (0.64), AAA (0.3), GAT (0.67), GAA (0.51), CGT (0.31), CGA (0.23), AGA (0.7), GGA (0.58) and GGG (0.55). Using protein and nucleotide sequence alignments, the reference human GLNA1 protein sequence amino acid sites were described as invariant amino acid sites (invariant alignment positions), forward amino acid sites (variant alignment positions not including amino acid codons with R≤0.7) or compensatory amino acid sites (variant alignment positions including amino acid codons with R≤0.7). In analysis of human GLNA1 tertiary structures 2dn1, 2dn2 and 2dn3 downloaded from the RCSB Protein Data Bank (http://www.rcsb.org/pdb/home/home.do), the DeepView/Swiss-PdbViewer 4.1.0 program was used (http://spdbv.vital-it.ch/). The SignalP 4.1 server was used in predictions of N-terminal signal peptides in GLN protein primary structures using default settings (http://www.cbs.dtu.dk/services/SignalP/), and the TMHMM 2.0 server was used in prediction of transmembrane regions in GLN protein primary structures using default settings (http://www.cbs.dtu.dk/services/TMHMM/). The eutherian GLN proteins were described as intracellular proteins (Voet and Voet 2004; Park et al. 2006; Levantino et al. 2012) and neither N-terminal signal peptides nor transmembrane regions were predicted in GLN protein
primary structures. The cysteine amino acid residues common to each eutherian GLN major protein cluster were used as major protein sequence alignment landmarks (Figure 1B). For example, the major protein cluster GLNG included no common cysteine amino acid residues, but there were 3 cysteine amino acid residues common to major protein cluster GLNH. Although described as intracellular proteins, the major protein clusters GLNB and GLNF each included 1 common potential N-glycosylation site (human GLNB1 N48, human GLNF N177). Using protein and nucleotide sequence alignments, the tests of protein molecular evolution integrated patterns of nucleotide sequence similarities with protein tertiary structures (Premzl 2016a, 2016b). In analysis of human GLNA1 tertiary structures 2dn1, 2dn2 and 2dn3 (Park et al. 2006; Levantino et al. 2012), there were 4 invariant amino acid sites (M1, F43, N58 and F123) and 1 forward amino acid site (W16) among 147 human GLNA1 amino acid residues (Figure 1C). For example, the human GLNA1 invariant amino acid site F43 or Phe(CD1)β was implicated in β haem models (Voet and Voet 2004; Park et al. 2006; Levantino et al. 2012).

In conclusion, the most comprehensive curated third party data gene data set of eutherian GLN genes first annotated 8 major gene clusters GLNA-GLNH and explained their differential gene expansion patterns. Thus, the integrated gene annotations, phylogenetic analysis and protein molecular evolution analysis proposed updated classification and nomenclature of eutherian GLN genes.

References

1. Gaudry MJ, Storz JF, Butts GT, Campbell KL, Hoffmann FG (2014) Repeated evolution of chimeric fusion genes in the β-globin gene family of laurasiatherian mammals. Genome Biol Evol 6:1219-1234
2. Gibson R, Alako B, Amid C, Cerdeño-Tárraga A, Cleland I, Goodgame N, Ten Hoopen P, Jayathilaka S, Kay S, Leinonen R, Liu X, Pallreddy S, Pakseresht N, Rajan J, Rosselló M, Silvester N, Smirnov D, Toribio AL, Vaughan D, Zalunin V, Cochrane G (2016) Biocuration of functional annotation at the European nucleotide archive. Nucleic Acids Res 44:D58-D66
3. Goodman M, Porter CA, Czelusniak J, Page SL, Schneider H, Shoshani J, Gunnell G, Groves CP (1998) Toward a phylogenetic classification of Primates based on DNA evidence complemented by fossil evidence. Mol Phylogenet Evol 9:585-598
4. Hardison RC (2012) Evolution of hemoglobin and its genes. Cold Spring Harb Perspect Med 2:a011627
5. Hardison R, Miller W (1993) Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Mol Biol Evol 10:73-102
6. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, Barnes I, Bignell A, Boychenko V, Hunt T, Kay M, Mukherjee G, Rajan J, Despacio-Reyes G, Saunders G, Steward C, Harte R, Lin M, Howald C, Tanzer A, Derrien T, Chrast J, Walters N, Balasubramanian S, Pei B, Tress M, Rodriguez JM, Ezkurdia I, van Baren J, Brent M, Haussler D, Kellis M, Valencia A, Reymond A, Gerstein M, Guigó R, Hubbard TJ (2012) GENCODE: The reference human genome annotation for The ENCODE Project. Genome Res 22:1760-1774
7. International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409:860-921
8. International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945
9. Levantino M, Spilotros A, Cammarata M, Schirò G, Ardiccioni C, Vallone B, Brunori M, Cupane A (2012) The Monod-Wyman-Changeux allosteric model accounts for the quaternary transition dynamics in wild type and a recombinant mutant human hemoglobin. Proc Natl Acad Sci U S A 109:14894-14899
10. Lorenzo FR, Huff C, Myllymäki M, Olenchock B, Swierczek S, Tashi T, Gordeuk V, Wuren T, Ri-Li G, McClain DA, Khan TM, Koul PA, Guchhait P, Salama ME, Xing J, Semenza GL, Liberzon E, Wilson A, Simonson TS, Jorde LB, Kaelin WG Jr, Koivunen P, Prchal JT (2014) A genetic mechanism for Tibetan high-altitude adaptation. Nat Genet 46:951-956
11. Moleirinho A, Lopes AM, Seixas S, Morales-Hojas R, Prata MJ, Amorim A (2015) Distinctive patterns of evolution of the δ-globin gene (HBD) in primates. PLoS One 10:e0123365
12. Nery MF, Arroyo JI, Opazo JC (2013) Genomic organization and differential signature of positive selection in the alpha and beta globin gene clusters in two cetacean species. Genome Biol Evol 5:2359-2367
13. Opazo JC, Hoffmann FG, Storz JF (2008) Differential loss of embryonic globin genes during the radiation of placental mammals. Proc Natl Acad Sci U S A 105:12950-12955
14. Park SY, Yokoyama T, Shibayama N, Shiro Y, Tame JR (2006) 1.25 Å resolution crystal structures of human haemoglobin in the oxy, deoxy and carbonmonoxy forms. J Mol Biol 360:690-701
15. Premzl M (2016a) Comparative genomic analysis of eutherian tumor necrosis factor ligand genes. Immunogenetics 68:125-132
16. Premzl M (2016b) Curated eutherian third party data gene data sets. Data Brief 6:208-213
17. Storz JF, Hoffmann FG, Opazo JC, Moriyama H (2008) Adaptive functional divergence among triplicated alpha-globin genes in rodents. Genetics 178:1623-1638
18. Storz JF, Opazo JC, Hoffmann FG (2013) Gene duplication, genome duplication, and the functional diversification of vertebrate globins. Mol Phylogenet Evol 66:469-478
19. Voet D, Voet JG (2004) Hemoglobin: protein function in microcosm. In: Voet D (ed) Biochemistry, 3rd edn. Wiley International Edition, Hoboken, pp 320-355

Figure legends

Figure 1: (A) Phylogenetic analysis of eutherian globin genes. The minimum evolution tree calculated using maximum composite likelihood method included bootstrap estimates higher than 50% after 1000 replicates. (B) Eutherian globin protein sequence alignment landmarks. The invariant amino acid sites were shown using violet squares, and forward amino acid sites were shown using red squares. The black squares indicated common cysteine amino acid residues. The numbers indicated numbers of amino acid residues. (C) Reference human GLNA1 protein primary structure. Whereas the 4 invariant amino acid sites were shown using white letters on violet backgrounds, 1 forward amino acid site was shown using white letter on red background.
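The invariant, forward and compensatory amino acid sites marked in Figure 1 follow from the relative synonymous codon usage statistic R and the R≤0.7 cut-off given in the text. The following is a minimal Python sketch of that classification, not the protocol's actual implementation (the study used MEGA 6.06 and BioEdit). The function names `rscu` and `classify_sites` are hypothetical. It assumes that a position is invariant when all sequences encode the same amino acid, that gapped or ambiguous codons are skipped, and that codons without an R value are treated as neutral (R = 1.0).

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(codon_counts):
    """Return R per codon: observed count divided by the count expected
    if all synonymous codons of that amino acid were used equally."""
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)
    values = {}
    for codons in families.values():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid absent from the data set
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values


def classify_sites(aligned_cds, rscu_values, threshold=0.7):
    """Label each codon column of a codon-aligned set of coding sequences."""
    length = len(aligned_cds[0])
    if length % 3 or any(len(s) != length for s in aligned_cds):
        raise ValueError("sequences must be codon-aligned and of equal length")
    labels = []
    for i in range(0, length, 3):
        # Keep only unambiguous codons; gaps such as '---' are skipped (assumption).
        codons = {s[i:i + 3].upper() for s in aligned_cds} & CODON_TABLE.keys()
        residues = {CODON_TABLE[c] for c in codons}
        if not codons:
            labels.append("undetermined")
        elif len(residues) == 1:
            labels.append("invariant")
        elif any(rscu_values.get(c, 1.0) <= threshold for c in codons):
            labels.append("compensatory")  # variant, includes a codon with R <= threshold
        else:
            labels.append("forward")       # variant, no codon with R <= threshold
    return labels


# Illustrative input only: GTA (0.11) is taken from the text, the other R values are made up.
example_r = {"ATG": 1.0, "CTG": 2.5, "GTA": 0.11, "AAG": 1.7, "CAG": 1.76}
print(classify_sites(["ATGCTGAAG", "ATGGTACAG"], example_r))
# → ['invariant', 'compensatory', 'forward']
```

In practice the `rscu_values` argument would come from `rscu` applied to codon counts over the whole data set, so that the cut-off reflects data-set-wide codon usage rather than a single alignment column.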


Fig. 1


List of Abbreviations

ā, average pairwise identity; āad, average absolute deviation for ā; amax, largest pairwise identity; amin, smallest pairwise identity; GLN, globin gene; GLNA-GLNH, major gene clusters of globin genes; R, relative synonymous codon usage statistic.
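The four identity statistics defined above (ā, āad, amax, amin) can be illustrated with a short Python sketch. The study itself computed pairwise identities in BioEdit 7.0.5.3 and the summary statistics in Microsoft Office Excel, so the code below only approximates that workflow. The function names are hypothetical. It assumes that columns with a gap in either sequence are skipped, which may differ from BioEdit's handling, and that the average absolute deviation is the mean absolute deviation from ā.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of identical characters over the columns where neither
    aligned sequence has a gap (assumption about gap handling)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not compared:
        raise ValueError("no ungapped columns to compare")
    return sum(x == y for x, y in compared) / len(compared)


def identity_statistics(aligned):
    """Summarise all pairwise identities of an alignment."""
    identities = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    if not identities:
        raise ValueError("at least two sequences are required")
    mean = sum(identities) / len(identities)
    aad = sum(abs(v - mean) for v in identities) / len(identities)
    return {"mean": mean, "aad": aad, "max": max(identities), "min": min(identities)}


# Toy alignment for illustration only, not data from the study.
stats = identity_statistics(["ACGT", "ACGA", "ACCA"])
print({name: round(value, 3) for name, value in stats.items()})
```

The keys `mean`, `aad`, `max` and `min` correspond to ā, āad, amax and amin respectively.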


Graphical abstract


Research highlights

• Update and revision of eutherian globin genes.
• First description of 8 major gene clusters of eutherian globin genes.
• New classification and nomenclature of eutherian globin genes.