Genome-wide identification, evolution, and expression analysis of GATA transcription factors in apple (Malus × domestica Borkh.)


Hongfei Chen, Hongxia Shao, Ke Li, Dong Zhang, Sheng Fan, Youmei Li, Mingyu Han

PII: S0378-1119(17)30499-7
DOI: 10.1016/j.gene.2017.06.049
Reference: GENE 42012
To appear in: Gene
Received date: 14 January 2017
Revised date: 16 June 2017
Accepted date: 28 June 2017

Please cite this article as: Hongfei Chen, Hongxia Shao, Ke Li, Dong Zhang, Sheng Fan, Youmei Li, Mingyu Han, Genome-wide identification, evolution, and expression analysis of GATA transcription factors in apple (Malus × domestica Borkh.), Gene (2017), doi: 10.1016/j.gene.2017.06.049

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Hongfei Chen, Hongxia Shao, Ke Li, Dong Zhang, Sheng Fan, Youmei Li, Mingyu Han*

College of Horticulture, Northwest A&F University, Yangling, Shaanxi 712100, China

*Corresponding author
College of Horticulture, Northwest A&F University, Yangling, Shaanxi 712100, China
Tel & Fax: +86-029-87082849
Email: [email protected]

Abstract

Plant GATA transcription factors are type-IV zinc-finger proteins that play important regulatory roles in plant growth and development. In this study, we identified 35 GATA genes, classified into four groups, in the whole genome sequence of Malus domestica. A physicochemical property analysis indicated that MdGATA proteins are largely unstable hydrophilic proteins. An analysis of conserved protein motifs uncovered three highly conserved motifs in addition to the GATA motif, which was present in all MdGATA proteins. These three motifs, CCT, TIFY, and ASXH, were found to occur in specific GATA groups and may be related to GATA gene function. We identified 10 pairs of putative paralogs, indicating that MdGATA genes have mainly undergone whole genome duplication. Eighteen orthologous gene pairs were also identified between Arabidopsis thaliana and M. domestica. Furthermore, many light-responsive cis-elements were found in MdGATA gene promoters. Tissue-specific expression analysis performed by quantitative real-time reverse transcription PCR showed that MdGATA genes were preferentially expressed in flowers, leaves, and buds. Apple seedlings maintained in darkness for 7 days exhibited a moderate decline in chlorophyll content along with significant down-regulation of most MdGATA genes, suggesting that MdGATA genes may be involved in light-responsive development and chlorophyll-level regulation. The distinctly higher expression levels observed for many MdGATA genes during three stages of floral induction also indicate that MdGATA genes may play a role in the apple flowering transition. The results presented here lay the foundation for further investigation of the putative functions of the MdGATA gene family and for the improvement of apple yields.

Keywords: GATA; apple; evolution; expression

Abbreviations

ASXH, additional sex homology; GDR, Genome Database for Rosaceae; GRAVY, grand average of hydropathicity; GSDS, Gene Structure Display Server; MeJA, methyl jasmonate; NCBI CDD, National Center for Biotechnology Information Conserved Domains Database; PlantCARE, Plant Cis-Acting Regulatory Element; RGAP, Rice Genome Annotation Project; TAIR, The Arabidopsis Information Resource

1. Introduction

GATA transcription factors are a group of regulatory proteins that exist in a wide range of eukaryotic organisms, including fungi, metazoans, and plants. These proteins are known for their specific binding of the consensus sequence WGATAR (W = T or A; R = G or A) (Lowry and Atchley, 2000). In animals, a (T/A)GATA(A/G) sequence to which six transcription factors (GATA1 to GATA6) can bind was initially identified in the chicken globin promoter (Evans et al., 1988). In plants, the first GATA transcription factor was identified in tobacco and named NTL1 based on its similarity to a protein in Neurospora crassa (Daniel-Vedele and Caboche, 1993). Type-IV zinc-finger motifs related to GATA transcription factors were subsequently identified in Arabidopsis thaliana (Teakle and Gilmartin, 1998). Although the structures of GATA proteins differ among species, each GATA protein contains a highly conserved type-IV zinc-finger motif consisting of the amino acid sequence CX2CX17–20CX2C followed by a highly basic region (Reyes et al., 2004). In animals, GATA proteins contain two typical conserved zinc fingers (CX2CX17CX2C), but only the C-terminal zinc finger is related to DNA binding (Lowry and Atchley, 2000). Previous research has shown that the N-terminal zinc finger (N-finger) can modulate the C-terminal zinc finger to bind DNA with different specificities (Newton et al., 2001; Patient and McGhee, 2002; Trainor et al., 2000) or mediate interactions between GATA transcription factors and transcription cofactors of the Friend of GATA (FOG) family (Tsang et al., 1997). The synergy between the two fingers can improve DNA-binding affinity and increase the recognition range of the finger domain. Most fungal GATA proteins, such as those in yeast, contain only a single CX2CX17CX2C or CX2CX18CX2C domain (Scazzocchio, 2000). A study of GATA transcription factors in Arabidopsis thaliana and Oryza sativa indicated that most plant GATA proteins contain only the single zinc-finger domain C-X2-C-X18-C-X2-C, with a few containing two zinc-finger domains or the C-X2-C-X20-C-X2-C zinc-finger domain (Reyes et al., 2004). That study identified 29 GATA family members in A. thaliana and 30 in O. sativa, divided into four and six subfamilies, respectively, according to their evolutionary relationships, domain structures, and exon–intron structures. Among the subfamilies in A. thaliana, subfamily I contained 14 members with two exons, subfamily II contained 10 members with two to three exons, subfamily III contained three members with seven exons, and subfamily IV members had no characteristic gene structure. In addition to the GATA domains present in all GATA proteins, acidic domains and CCT domains were identified in subfamilies I and III, respectively. Similar divisions were also seen in O. sativa. This study of the GATA gene family in A. thaliana and O. sativa provided a foundation for further identification, evolutionary analysis, and functional analysis of GATA transcription factors in other species.

Functional analysis has suggested that GATA transcription factors play significant roles in the regulation of plant growth and development. GATA elements have been found in many regulatory regions of light-responsive genes. Electrophoretic mobility shift assays and DNase I footprinting experiments have subsequently demonstrated that GATA transcription factors can bind to these elements, thereby implicating GATA transcription factors in light-mediated processes (Borello et al., 1993; Lam and Chua, 1989; Schindler and Cashmore, 1990; Terzaghi and Cashmore, 1995). A study of A. thaliana revealed that the expression of many GATA genes is responsive to growth in light, growth in darkness, and changes in circadian rhythms (Manfield et al., 2007). GATA2 (AT2G45050) can regulate the expression of light-responsive genes and plays an important role in photomorphogenesis (Luo et al., 2010). Besides light response, GATA transcription factors are also involved in floral development. An RNA gel blot experiment indicated that ZIM (At4g24470), a GATA gene, is highly expressed in shoot apices of immature flowers and in flowers in the reproductive phase (Nishii et al., 2000). HANABA TARANU, a GATA transcription factor, is believed to be involved in the establishment of boundaries between the meristem and its newly initiated organ primordia. This transcription factor affects the expression of WUSCHEL, a gene that functions near the boundary of the central zone and rib meristem in shoot and floral meristems and that can promote meristem activities (Zhao et al., 2004; Mayer et al., 1998). Extensive research suggests that GATA transcription factors also play a role in the regulation of carbon and nitrogen metabolism, chlorophyll levels, chloroplast size, and photosynthetic efficiency (An et al., 2014; Bi et al., 2005; Chiang et al., 2012; Hudson et al., 2011).

Apple (Malus domestica) is known as the king of temperate fruits and is of great economic importance. Some apple cultivars, however, especially ‘Fuji’, which accounts for 65% of apple cultivation in China, do not readily flower (Xing et al., 2016). Because this difficulty in flowering severely restricts apple production and economic benefits, the study of the molecular mechanisms of apple flowering is essential to increase apple yields. Despite efforts made to investigate the GATA gene family in model plants such as A. thaliana and O. sativa, this family is relatively uncharacterized in apple. In this study, we therefore performed a genome-wide search for GATA genes in the apple genome and analyzed their chromosomal distributions, evolutionary mechanisms, gene structures, functional domains, and cis-elements. We also analyzed their expression patterns in different tissues, in response to light, and during floral differentiation by quantitative real-time reverse transcription PCR (qRT-PCR) and RNA sequencing (RNA-seq). Our findings should not only contribute to future investigations of MdGATA gene regulation of light response and flowering in apple, but also provide guidance useful for increasing apple production.

2. Materials and methods

2.1 Identification and chromosomal distribution of GATA genes in M. domestica

To identify all members of the M. domestica GATA gene family, the hidden Markov model profile of the GATA domain (PF00320) was downloaded from the Pfam database (http://pfam.xfam.org/family/PF00320/hmm). This model was used as a query to search for GATA genes in the M. domestica complete genome based on an expected value (E-value) cutoff of 0.01 in HMMER 2.0 (Finn et al., 2011), which resulted in the initial identification of 39 M. domestica GATA genes. To further confirm the 39 predicted genes as GATA family members, the Pfam database (http://pfam.xfam.org/search/sequence) and NCBI CDD (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) were used to examine the integrity of the GATA domain based on an E-value cutoff of 0.01 (Wang et al., 2016). This analysis resulted in the identification of 35 confirmed MdGATA genes from the M. domestica complete genome. To obtain chromosomal location information for MdGATA genes, the DNA sequence of each MdGATA gene was used in BLASTN searches of the M. domestica genome in the Genome Database for Rosaceae (GDR) (http://www.rosaceae.org/tools/ncbi_blast). The 35 MdGATA genes were then named MdGATA1 to MdGATA35 based on their chromosomal locations and were mapped onto chromosomes using the software program MapInspect (http://www.plantbreeding.wur.nl/UK/software_mapinspect.html). The ProtParam tool on the ExPASy server (http://web.expasy.org/protparam/) was used to predict physicochemical characteristics of MdGATA proteins.
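
The two-step screen described in this section (an HMM search followed by domain confirmation, each at an E-value cutoff of 0.01) can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in this study: the function names, gene identifiers, and E-values are invented, and two points are assumptions, namely that a hit passes when its E-value is at or below the cutoff and that a candidate must be confirmed by both Pfam and NCBI CDD.

```python
E_VALUE_CUTOFF = 0.01  # cutoff stated in the text


def passing_ids(hits, cutoff=E_VALUE_CUTOFF):
    """Return the IDs whose best (lowest) E-value is at or below the cutoff.

    `hits` is an iterable of (gene_id, e_value) tuples; a gene may appear
    more than once if several domains or models matched it.
    """
    best = {}
    for gene_id, e_value in hits:
        best[gene_id] = min(e_value, best.get(gene_id, float("inf")))
    return {gene_id for gene_id, e_value in best.items() if e_value <= cutoff}


def confirm_candidates(hmm_hits, pfam_hits, cdd_hits):
    """Keep HMM candidates that are also confirmed by both domain databases.

    Requiring both confirmations is an assumption made for this sketch.
    """
    candidates = passing_ids(hmm_hits)
    confirmed = candidates & passing_ids(pfam_hits) & passing_ids(cdd_hits)
    return sorted(confirmed)


# Hypothetical identifiers and E-values, for illustration only.
hmm_hits = [("geneA", 1e-20), ("geneB", 5e-3), ("geneC", 0.5)]
pfam_hits = [("geneA", 1e-18), ("geneB", 0.2)]
cdd_hits = [("geneA", 1e-15), ("geneB", 1e-3)]
print(confirm_candidates(hmm_hits, pfam_hits, cdd_hits))
```

In this toy input, geneC fails the initial search and geneB fails the Pfam check, so only geneA is retained.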

2.2 Multiple sequence alignment and phylogenetic analysis

Based on the results of previous studies (Reyes et al., 2004; Ao et al., 2015), A. thaliana, O. sativa, and Ricinus communis GATA proteins were downloaded from The Arabidopsis Information Resource (http://www.arabidopsis.org/index.jsp), the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu/cgi-bin/ORF_infopage.cgi), and the Castor Bean Genome Annotation database (http://castorbean.jcvi.org/index.php), respectively. Multiple sequence alignment of GATA proteins was carried out using DNAMAN software (Zhang et al., 2016). To further understand the characteristics of MdGATA proteins, the online tool WebLogo (http://weblogo.berkeley.edu/) was used to examine sequence identities in the multiple sequence alignment. MEGA 7.0.14 (Kumar et al., 2016) was used for phylogenetic analysis of the GATA transcription factor family in A. thaliana, O. sativa, M. domestica, and R. communis. The phylogenetic tree was constructed using the neighbor-joining method with the following parameters: Poisson substitution model, pairwise deletion, and 1,000 bootstrap replicates.
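
For readers unfamiliar with the distance measure behind such a tree, the Poisson-corrected distance between two aligned protein sequences is d = -ln(1 - p), where p is the proportion of compared sites that differ; under pairwise deletion, a site containing a gap is dropped only for the pair being compared. The Python sketch below illustrates this calculation. It is not MEGA's implementation, and the function name and the treatment of '-' and '?' as missing data are assumptions.

```python
import math

MISSING = {"-", "?"}  # assumed symbols for gaps and missing data


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance between two aligned protein sequences.

    Sites where either sequence has a gap are skipped for this pair only
    (pairwise deletion). Raises ValueError if the distance is undefined.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in MISSING or b in MISSING:
            continue  # pairwise deletion of this site
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = differing / compared
    if p >= 1.0:
        raise ValueError("sequences are saturated; distance is undefined")
    return -math.log(1.0 - p)
```

A neighbor-joining tree is then built from the matrix of these pairwise distances, and bootstrap support is obtained by repeating the procedure on resampled alignment columns.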

2.3 Detection of homologous gene pairs and synteny analysis

Whole-genome protein sequences from A. thaliana and M. domestica were combined and compared for homology using BLASTP with an E-value cutoff of 1 × 10^-5. The OrthoMCL algorithm with default parameters was applied to detect paralogous and orthologous gene pairs, and MdGATA paralogous gene pairs within the M. domestica genome and GATA orthologous gene pairs between M. domestica and A. thaliana genomes were then extracted (Li et al., 2003). The homologous gene pairs obtained in this fashion were used to further identify syntenic blocks using the MCScan algorithm with default parameters (Wang et al., 2012). Syntenic blocks within the M. domestica genome and between M. domestica and A. thaliana genomes were downloaded from the Plant Genome Duplication Database (http://chibba.agtec.uga.edu/duplication/) and used to identify syntenic blocks containing MdGATA genes. Detected orthologous, paralogous, and syntenic relationships were illustrated using Circos (Krzywinski et al., 2009).

2.4 Gene structure, conserved motif, and promoter sequence analysis

Information on MdGATA and AtGATA gene structures was downloaded from the M. domestica genome database in Phytozome (http://www.phytozome.net/apple) and the TAIR database (http://www.arabidopsis.org/index.jsp), respectively. The MEME server (http://meme-suite.org/) was used to determine conserved motifs in MdGATA proteins using default parameters and a conserved motif number of 30. The conserved motifs were annotated using the Pfam database. The online tool GSDS 2.0 (http://gsds.cbi.pku.edu.cn/) was used to display exon–intron layouts. The 1,500-bp genomic DNA sequence upstream of the start codon of each MdGATA gene was obtained from the apple genome. Cis-elements in promoters were then identified using the PlantCARE database (http://bioinformatics.psb.ugent.be/webtools/plantcare/html/).
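
Extracting the 1,500-bp region upstream of a start codon requires attention to strand: for a gene on the minus strand, the upstream region lies at higher forward-strand coordinates and must be reverse-complemented. The Python sketch below shows one way to do this. The function names and the coordinate convention (0-based, half-open start-codon coordinates on the forward strand) are assumptions made for illustration and do not describe the tools used in this study.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chrom_seq, codon_start, codon_end, strand, length=1500):
    """Return up to `length` bp upstream of a start codon, read 5' to 3'.

    `codon_start` and `codon_end` are 0-based, half-open forward-strand
    coordinates of the start codon. The region is truncated if the gene
    lies closer than `length` bp to the end of the sequence.
    """
    if strand == "+":
        begin = max(0, codon_start - length)
        return chrom_seq[begin:codon_start]
    if strand == "-":
        end = min(len(chrom_seq), codon_end + length)
        return reverse_complement(chrom_seq[codon_end:end])
    raise ValueError("strand must be '+' or '-'")
```

The returned sequence always ends immediately before the start codon, so promoter coordinates reported by a cis-element scan can be interpreted the same way for genes on either strand.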

2.5 Analysis of MdGATA microarray expression data

Gene expression data from different tissues of ‘Golden Delicious’ and a range of apple hybrids (M14, M20, M49, M67, M74, and X8877) were downloaded from the Gene Expression Omnibus database (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE42873. We then manually extracted the MdGATA gene expression data and visualized the results using MeV 4.9.0 software.

2.6 Plant material, RNA deep sequencing, library construction, and determination of chlorophyll content

Plant material was collected from 6-year-old apple trees of ‘Nagafu No. 2’ (a ‘Fuji’ cultivar) grown on M.26 rootstocks. RNA extracted from buds at three time points (early, middle, and late stages of flower bud differentiation) was used for cDNA library construction and RNA-seq by the Biomarker Biotechnology Corporation (Beijing, China). The methods used for library construction, RNA deep sequencing, and data processing are described in detail in Xing et al. (2015). The RNA-seq expression profiles were visualized using MeV 4.9.0 software. Five types of young tissue (stem, leaf, flower, fruit, and bud) of mature ‘Nagafu No. 2’ trees were used as materials for tissue-specific expression analysis. To compare light- vs. dark-cultivated material, 21-day-old tissue-cultured ‘Nagafu No. 2’ apple seedlings were grown on Murashige-Skoog medium in a tissue culture room at 25°C under 2,000 lux light intensity. The flasks for controls were subjected to a 16:8 h light:dark photoperiod for 7 days, while flasks for dark treatments were covered with two layers of aluminum foil. Seedlings under these respective growth conditions were harvested into liquid nitrogen after subjective dawn on the 7th day (Manfield et al., 2007). Both sets of seedlings were divided into 0.2-g portions, which were cut into small pieces (≤0.2 cm) and soaked in 25 ml of 95% ethanol at 4°C in the dark for 3 days to extract chlorophyll. Total chlorophyll content was measured spectrophotometrically (Fukuda and Terao, 2015).

2.7 RNA extraction, purification, cDNA synthesis, and qRT-PCR

Total RNA was purified by the cetyltrimethylammonium bromide method and treated with RNase-free DNase I (Invitrogen, Shanghai, China) to remove any residual genomic DNA. First-strand cDNA was synthesized from 1 μl of total RNA using a SYBR Prime Script RT-PCR Kit II (Takara, Shanghai). Expression profiles of MdGATA genes in different tissues and during photomorphogenesis and skotomorphogenesis were analyzed by qRT-PCR. The qRT-PCR amplifications were performed with three technical replicates in 20-μl reaction volumes containing 10 μl of 2× SYBR Premix Ex Taq II (Tli RNaseH Plus) (Takara, Beijing, China) on a Bio-Rad CFX Connect Real-Time PCR Detection System. The specific PCR primers used in this study were designed using PrimerQuest (http://www.idtdna.com/Primerquest/). PCR amplification conditions were as follows: 95°C for 5 min, followed by 40 cycles of 94°C for 5 s, 60°C for 15 s, and 72°C for 10 s. Relative mRNA expression levels were calculated using the 2^(-ΔΔCT) method, with the apple actin gene used as an internal control for gene expression normalization.
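
The 2^(-ΔΔCT) calculation can be written out as a short Python sketch. The function and argument names are hypothetical, and averaging the technical replicates before subtraction is an assumption about how replicates are combined. The method itself presumes that the target and reference amplicons amplify with approximately equal, near-100% efficiency.

```python
from statistics import mean


def relative_expression(target_ct_sample, reference_ct_sample,
                        target_ct_control, reference_ct_control):
    """Relative expression of a target gene by the 2^(-ddCT) method.

    Each argument is a list of CT values from technical replicates.
    The reference gene would be actin in the setting described above.
    """
    # dCT: target CT normalized to the reference gene within each condition.
    delta_ct_sample = mean(target_ct_sample) - mean(reference_ct_sample)
    delta_ct_control = mean(target_ct_control) - mean(reference_ct_control)
    # ddCT: treated or test condition relative to the control condition.
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)
```

For example, if the target gene reaches threshold one cycle earlier in the sample than in the control while the reference gene is unchanged, ΔΔCT is -1 and the relative expression is 2, that is, a 2-fold increase.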

3. Results

3.1 Genome-wide identification of M. domestica GATA genes

Using HMMER 2.0 software, 35 members of the GATA gene family were identified in M. domestica (Table 1). To determine their chromosomal distributions, the DNA sequence of each MdGATA gene was searched using BLASTN against the M. domestica genome in the GDR database (http://www.rosaceae.org/tools/ncbi_blast). A total of 31 MdGATA genes were mapped onto 13 M. domestica chromosomes, representing all chromosomes except for 4, 5, 10, and 14, while the other four MdGATA genes were located on unanchored scaffolds. Chromosomes 8 and 15 contained the largest number of MdGATA genes, representing 6% and 17.14% of the total number, respectively. Seven chromosomes, namely, 1, 3, 6, 7, 12, 13, and 16, each contained only one MdGATA gene: MdGATA1, 5, 6, 7, 19, 20, and 27, respectively (Fig. 1). On the basis of their chromosomal locations, the 35 GATA genes were named MdGATA1–MdGATA35. Information on these 35 genes, including gene accession numbers, genomic locations, amino acid numbers, molecular weights, instability indexes, aliphatic indexes, grand average of hydropathicity (GRAVY) values, and coding sequence (CDS) lengths, is given in Table 1. MdGATA gene CDSs were between 273 and 3,486 bp long, with the molecular weights of predicted MdGATA proteins accordingly ranging from approximately 10 to 130 kDa. The instability index, a measure of protein stability (Guruprasad et al., 1990), was greater than 40 for each protein except for MdGATA25, MdGATA29, and MdGATA30. Theoretical isoelectric points of the 35 MdGATA proteins ranged from 5.23 to 10.62, with the majority above 7. Most MdGATA proteins within the same group had similar theoretical isoelectric points. All theoretical isoelectric points of MdGATA proteins in group B were higher than 8, while those of group C proteins were lower than 8. GRAVY values of all MdGATA proteins were less than zero.
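
As a reminder of what a GRAVY value measures, it is the sum of the Kyte-Doolittle hydropathy values of all residues in a protein divided by the number of residues, so a negative value indicates a protein that is hydrophilic on average. The Python sketch below reproduces this definition for illustration. The values reported in Table 1 were obtained from ProtParam, not from this code, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein_sequence):
    """Grand average of hydropathicity of a protein sequence.

    Raises ValueError for an empty sequence or a non-standard residue
    instead of silently skipping it.
    """
    sequence = protein_sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)
```

A sequence rich in lysine and arginine gives a strongly negative value, which is consistent with the highly basic region that follows the zinc finger in GATA proteins.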

3.2 Phylogenetic analysis and sequence alignment

To better analyze the evolutionary relationships of MdGATA genes, an unrooted phylogenetic tree was generated using the 35 MdGATA proteins and 19 R. communis, 29 O. sativa, and 30 A. thaliana GATA proteins (Fig. 2; Table S1). According to the results of a cluster analysis combined with previous GATA gene studies in model plants, we divided the MdGATA genes into four groups (A, B, C, and D). Examination of the tree revealed that group A contained 20 GATA members, accounting for more than half of the total number of MdGATA genes (57.1%). With only three members, namely, MdGATA6, MdGATA27, and MdGATA35, group D contained the lowest number of MdGATA genes (8.6%). To further analyze the sequence features of the 35 MdGATA proteins, their amino acid sequences were aligned. This multiple alignment revealed that most MdGATA proteins contained the intact conserved domain C-X2-C-X17–20-C-X2-C, whereas others, including MdGATA29, MdGATA30, and MdGATA35, had lost some of these amino acids (Fig. 3). Amino acid sequence characteristics of MdGATA proteins in each group were also generally consistent with previously studied GATA proteins in A. thaliana (Reyes et al., 2004). For example, M. domestica GATA proteins of group A were characterized by the presence of conserved Gln and Thr amino acids in positions 15 and 25, respectively, of the zinc-finger loop. In addition, a series of conserved sequences was present in the α-helix and the unstructured amino-terminal region of all group A members. Group B MdGATA proteins were characterized by the presence of a Ser residue in the 25th position of the zinc-finger loop and an Ile residue in the 32nd position. Similar to group C GATA proteins of other species, all MdGATA group C members had an insertion of two amino acids (Reyes et al., 2004). The MdGATA proteins of group D were characterized by the presence of a Val residue in the first position before the zinc-finger loop, with an 18-residue loop containing almost no conserved amino acid sites except for a His residue in the fifth position. The GATA motifs and conserved amino acid sites in MdGATA proteins may contribute to the various functions of these GATA proteins.
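
The C-X2-C-X17–20-C-X2-C pattern discussed in this section translates directly into a regular expression, which gives a quick way to check whether a protein retains an intact zinc finger and how long its loop is. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the test sequence is synthetic rather than a real MdGATA protein, and a pattern match alone does not establish a functional GATA domain.

```python
import re

# C-X2-C-X17-20-C-X2-C; the loop between the two cysteine pairs is captured.
ZINC_FINGER_PATTERN = re.compile(r"C.{2}C(.{17,20})C.{2}C")


def find_zinc_fingers(protein_sequence):
    """Return (start, loop_length) for each non-overlapping pattern match.

    `start` is the 0-based index of the first cysteine. The sequence is
    expected to be a single line without whitespace.
    """
    sequence = protein_sequence.upper()
    return [(match.start(), len(match.group(1)))
            for match in ZINC_FINGER_PATTERN.finditer(sequence)]


# Synthetic sequence with an 18-residue loop, for illustration only.
toy_protein = "MA" + "CAAC" + "G" * 18 + "CAAC" + "K"
print(find_zinc_fingers(toy_protein))  # [(2, 18)]
```

A protein that has lost one of the four cysteines, or whose loop falls outside 17 to 20 residues, returns an empty list and would warrant inspection in the alignment.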

3.3 Homologous gene pairs and synteny analysis

To analyze MdGATA gene duplication events, we identified 10 pairs of putative paralogous MdGATA genes within the MdGATA gene family (Fig. 4a). Tandem and segmental duplications are reported to be the two main mechanisms underlying gene family expansion (Cannon et al., 2004). Tandem duplication is thought to have occurred when two closely related genes are located within the same chromosomal region and separated by fewer than 20 genes (Xu et al., 2009). Segmental duplications can be divided into two classes: interchromosomal and intrachromosomal. Members of the interchromosomal class are duplicated on non-homologous chromosomes and many localize to pericentromeric and subtelomeric chromosomal regions. Intrachromosomal duplications, referred to as region- or chromosome-specific low-copy repeats, are typically found on a single chromosome or in a single chromosomal band (Emanuel and Shaikh, 2001). A whole genome duplication has also occurred in M. domestica, as shown by the distribution of gene pairs across different but homologous chromosomes, including pairings on chromosomes 1-7, 2-15, 8-15, 3-11, 4-12, 5-10, 6-14, 9-17, and 13-16 (Velasco et al., 2010). Our results indicate that MdGATA2/MdGATA11, MdGATA5/MdGATA17, MdGATA12/MdGATA15, and MdGATA21/MdGATA23 have undergone segmental duplications and that MdGATA2/MdGATA21, MdGATA2/MdGATA23, MdGATA11/MdGATA21, MdGATA11/MdGATA23, MdGATA12/MdGATA22, and MdGATA13/MdGATA26 have been formed by whole genome duplication events.

Orthologous gene pairs can provide effective information about evolutionary relationships between species (Wei et al., 2015). We therefore investigated orthologous GATA gene pairs between M. domestica and A. thaliana, which resulted in the identification of eighteen gene pairs across the two species (Fig. 4a; Table S2). Using these homologous genes, we then carried out a synteny analysis, which revealed that MdGATA2/MdGATA23, MdGATA2/AtGATA22, MdGATA11/MdGATA21, MdGATA12/MdGATA22, MdGATA13/MdGATA26, MdGATA13/AtGATA25, AtGATA22/MdGATA23, and AtGATA25/MdGATA26 were located in syntenic blocks (Fig. 4b; Table S3).
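
The duplication categories defined at the start of this section can be expressed as a simple decision rule. The Python sketch below is a simplified reading of those definitions, not the procedure applied in this study: it assumes that a paralogous pair on one of the homologous chromosome pairings of Velasco et al. (2010) is attributed to whole genome duplication, that a pair on the same chromosome with fewer than 20 intervening genes is tandem, and that all remaining pairs are segmental. The function name and the gene order indexes in any call are hypothetical.

```python
# Homologous chromosome pairings listed in the text (Velasco et al., 2010).
HOMOLOGOUS_CHROMOSOMES = {
    frozenset(pair)
    for pair in [(1, 7), (2, 15), (8, 15), (3, 11), (4, 12),
                 (5, 10), (6, 14), (9, 17), (13, 16)]
}
TANDEM_MAX_INTERVENING_GENES = 20  # "separated by fewer than 20 genes"


def classify_pair(chrom_a, order_a, chrom_b, order_b):
    """Classify a paralogous gene pair by chromosome and gene order.

    `order_a` and `order_b` are the ordinal positions of the two genes
    along their chromosomes (1st gene, 2nd gene, and so on).
    """
    if chrom_a == chrom_b:
        intervening = abs(order_a - order_b) - 1
        if intervening < TANDEM_MAX_INTERVENING_GENES:
            return "tandem"
        return "segmental (intrachromosomal)"
    if frozenset((chrom_a, chrom_b)) in HOMOLOGOUS_CHROMOSOMES:
        return "whole genome duplication"
    return "segmental (interchromosomal)"
```

In practice, an assignment to whole genome duplication would also be checked against syntenic block membership, as in the synteny analysis above.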

3.4 Gene structure and conserved motif analysis

To better analyze the structure of GATA genes and the conserved motifs of GATA proteins, we constructed a phylogenetic tree of AtGATA and MdGATA proteins. Exon–intron structures of 65 GATA genes were determined based on their full-length CDSs and corresponding genomic DNA sequences, thereby revealing that GATA genes have between 1 (MdGATA32 and MdGATA35) and 20 (MdGATA19) exons. Group A had the lowest average number of CDS segments per gene, 2.4, while group C had the highest, 8.6. Furthermore, GATA genes within the same group had analogous gene structures. For example, each GATA gene in group C was composed of more than five CDS segments, and the average total intron length of these genes (3,686 bp; data not shown) was greater than that of any other group. We identified 30 motifs (designated as motifs 1 to 30) in the 65 GATA proteins, with most GATA proteins in the same group containing similar motifs (Fig. 5). For example, GATA proteins in group D had an average of nine conserved motifs, including motif 1, which was assigned to the GATA zinc finger according to annotations in the Pfam database, and unique motifs 10, 20, 21, 23, and 27 (Table S4). In addition to motif 1, all GATA proteins in group C contained conserved motifs 4 and 8, representing CCT and TIFY domains, respectively. Apart from MdGATA35, all GATA proteins in group D contained motif 11, an ASXH domain. Taken together, these results reveal that the groups differ markedly from one another in motif composition. The order of exons encoding GATA motifs was also similar within each group. For example, most GATA domains of group B and group C GATA proteins were encoded by the last two and fifth exons, respectively.

3.5 Analysis of the promoter sequences of MdGATA genes

To further explore the function and regulatory patterns of MdGATA genes, the 1,500-bp genomic region upstream of the start codon of each gene was scanned for putative cis-regulatory elements using the PlantCARE database. This search identified 11 main types of cis-elements (Fig. 6a and b). More than 20 types of light-responsive cis-elements, such as ACE, L-box, and Sp1, were observed across the 35 MdGATA genes. Most notably, light-responsive cis-elements were found to constitute the bulk (up to 63%) of presumptive cis-elements (Fig. 6b). In addition, various cis-elements involved in hormone response (e.g., MeJA, salicylic acid, gibberellins, auxin, and ethylene), stress response (e.g., drought, low temperature, and heat), meristem expression, and circadian control were also identified in promoter sequences of the MdGATA genes.
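
A share such as the proportion of light-responsive elements is obtained by tallying every predicted element by functional category and dividing by the total. The Python sketch below shows this tally. The record layout (gene, element name, category) and the toy records are invented for illustration and are not taken from the PlantCARE output of this study.

```python
from collections import Counter


def category_shares(annotations):
    """Return the fraction of predicted cis-elements in each category.

    `annotations` is an iterable of (gene, element_name, category) tuples,
    one per predicted element occurrence. Returns an empty dict if there
    are no annotations.
    """
    counts = Counter(category for _gene, _element, category in annotations)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Invented toy records; only the layout matters here.
toy_annotations = [
    ("geneA", "ACE", "light response"),
    ("geneA", "L-box", "light response"),
    ("geneB", "Sp1", "light response"),
    ("geneB", "CGTCA-motif", "hormone response"),
    ("geneC", "LTR", "stress response"),
]
print(category_shares(toy_annotations)["light response"])  # 0.6
```

Counting occurrences rather than distinct element types is a design choice; tallying distinct types per category would answer a different question and give different shares.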

3.6 Chlorophyll content and expression profile analysis of MdGATA genes

To elucidate the expression patterns of MdGATA genes during apple growth and development, the expression patterns of individual MdGATA family members were analyzed in various tissues, including roots, stems, leaves, flowers, seeds, and seedlings, using microarray expression data. A heat map was generated to show the expression profiles (Fig. S1). MdGATA genes exhibited strong preferential expression in leaves, flowers, and fruits. On the basis of these results, 10 family members that were highly expressed in seedlings and leaves were subjected to qRT-PCR analysis (MdGATA2, MdGATA7, MdGATA13, MdGATA18, MdGATA19, MdGATA23, MdGATA26, MdGATA29, MdGATA32, and MdGATA34; Table S5). The qRT-PCR analysis uncovered high expression of MdGATA7, MdGATA13, MdGATA18, MdGATA19, and MdGATA26 in buds; high expression of MdGATA19, MdGATA26, and MdGATA29 in flowers; and high expression of MdGATA2, MdGATA13, MdGATA23, MdGATA26, MdGATA29, and MdGATA32 in leaves (Fig. 7).

After 7 days in darkness, seedlings appeared to be etiolated (Fig. 8a). To further investigate the effect of shading on the chlorophyll content of apple, we determined the total chlorophyll content of light-grown and dark-grown seedlings. Our results revealed that the chlorophyll content of apple seedlings decreased moderately after dark treatment (Fig. 8b). Most notably, the expression levels of all MdGATA genes highly expressed in leaves according to qRT-PCR were higher in light-grown seedlings than in dark-grown seedlings (Fig. 8c); for MdGATA23, in particular, this expression difference was greater than 2-fold. Other genes, including MdGATA7, MdGATA18, MdGATA32, and MdGATA34, showed higher expression in dark-grown seedlings.

To explore the possible roles of MdGATA genes in apple flowering, expression profiles during three stages of flower bud physiological differentiation were also analyzed by RNA-seq. As shown in Fig. S2, the expression of 16 MdGATA genes was detected during the three stages in this study. Expression of MdGATA1, MdGATA3, MdGATA5, MdGATA6, MdGATA13, MdGATA17, MdGATA20, and MdGATA26 remained at high levels during all three stages of flower bud physiological differentiation, while MdGATA12, MdGATA22, and MdGATA24 were significantly up-regulated only during the early and middle stages.

PT E

4. Discussion

GATA transcription factors play significant roles in various plant growth and development processes. With the rapid development of bioinformatics, genome-wide analyses of the GATA gene family have been performed in model plants such as A. thaliana and O. sativa. In this study, 35 MdGATA genes were identified and classified into four subfamilies, designated groups A to D. Consistent with A. thaliana and O. sativa, group A contained the most MdGATA genes (Reyes et al., 2004). All the homologous GATA gene pairs identified in this study grouped tightly together in the phylogenetic trees (Fig. 2, Fig. 4 and Fig. 5), indicating that the tree topologies were largely consistent with the synteny analysis. Protein stability may be a useful indicator of the suitability of a protein for medical and industrial production. Our analysis indicated that the instability indexes of most MdGATA proteins are above 40, suggesting that they may be unstable (Guruprasad et al., 1990). The GRAVY values of all identified MdGATA proteins were less than zero, indicating that they are hydrophilic. These results are consistent with those of a previous study of GATA genes in castor bean (Ao et al., 2015). Taken together, these findings suggest that the physicochemical properties of GATA proteins are broadly conserved across species.

Gene duplication events are thought to be an important mechanism in the evolution of plant genomes (Vision et al., 2000; Cannon et al., 2004; Zhou et al., 2004; Yang et al., 2008). Orthologous relationship analysis of M. domestica and A. thaliana indicated that many AtGATA genes have two or more counterparts in M. domestica, which suggests that the expansion of the MdGATA gene family may have resulted from genome duplication events. Further investigation revealed four and six additional gene pairs inferred to have arisen from segmental duplication and whole-genome duplication events, respectively. This finding suggests that whole-genome duplication is the main mechanism by which the MdGATA gene family has expanded in apple. The identification of orthologous GATA gene pairs between apple and A. thaliana, a model plant in which many GATA genes have been functionally characterized, has provided reference information about the evolutionary relationships and expression patterns of GATA genes in M. domestica. Patterns of synteny can provide insight into the evolutionary history of a genome. However, some homologous GATA genes may not have been mappable to any syntenic blocks because of chromosomal rearrangements, fusions, and selective gene loss, which obscure the identification of chromosomal syntenies (Zhang et al., 2012). We detected an interesting case in which apple duplications corresponded to Arabidopsis duplications, namely MdGATA13/MdGATA26-AtGATA24/AtGATA25/AtGATA28. AtGATA24 (ZML1), AtGATA25 (ZIM), and AtGATA28 (ZML2) were reported to be homologous genes in previous research (Shikata et al., 2004), which supports the accuracy of our results.
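The observation that many AtGATA genes have two or more apple counterparts amounts to grouping ortholog pairs by their Arabidopsis member and counting. The sketch below is only an illustration: the pair list is a hypothetical stand-in for Table S2, built from the two relationships named in the text (AtGATA18 with MdGATA24, and AtGATA24/AtGATA25/AtGATA28 with MdGATA13 and MdGATA26), and expanding the latter into all six pairwise combinations is an assumption. The function names are illustrative as well.

```python
from collections import defaultdict

# Hypothetical (apple gene, Arabidopsis gene) pairs for illustration only.
# Only relationships named in the text are used; listing every pairwise
# combination for the ZIM-like genes is an assumption, not Table S2 itself.
ORTHOLOG_PAIRS = [
    ("MdGATA24", "AtGATA18"),
    ("MdGATA13", "AtGATA24"), ("MdGATA26", "AtGATA24"),
    ("MdGATA13", "AtGATA25"), ("MdGATA26", "AtGATA25"),
    ("MdGATA13", "AtGATA28"), ("MdGATA26", "AtGATA28"),
]


def apple_counterparts(pairs):
    """Map each Arabidopsis gene to the set of apple genes paired with it."""
    counterparts = defaultdict(set)
    for apple_gene, arabidopsis_gene in pairs:
        counterparts[arabidopsis_gene].add(apple_gene)
    return dict(counterparts)


def multi_copy(counterparts, minimum=2):
    """Return Arabidopsis genes with at least `minimum` apple counterparts."""
    return sorted(at for at, mds in counterparts.items() if len(mds) >= minimum)


counterparts = apple_counterparts(ORTHOLOG_PAIRS)
expanded = multi_copy(counterparts)
print(expanded)
```

With the full pair list from Table S2 in place of the illustrative one, the same two functions would report which Arabidopsis genes have multiple apple counterparts.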

In addition to the GATA motif found in all GATA proteins, various MdGATA groups contained other conserved domains, including two domains new to GATA proteins, namely ASXH and TIFY, and the CCT domain previously identified in A. thaliana and rice (Reyes et al., 2004). The CCT domain was first identified in the CONSTANS (CO) protein, which is related to the circadian clock and flowering control in A. thaliana (Suárez-López et al., 2001). The fully conserved TIFY domain has previously been found to characterize a large family of transcription factors (Vanholme et al., 2007). Studies have suggested that the TIFY domain mediates homomeric and heteromeric interactions between TIFY proteins and other specific proteins (Melotto et al., 2008; Chini et al., 2009). The TIFY domain has also been found to be widespread in the JASMONATE ZIM-domain protein family and PEAPOD proteins and to be related to the jasmonic acid pathway (Bai et al., 2011). Current research on the ASXH domain has focused mostly on animals, where it is considered to regulate the combination of polycomb-group proteins (Aravind and Iyer, 2012). The presence of these different highly conserved domains may therefore be related to the various functions of MdGATA proteins.

Exon gain/loss has occurred extensively within many gene families during their evolution. In A. thaliana, most GATA genes in group A contain only two exons. In MdGATA group A, however, MdGATA32 contains one exon, while MdGATA15, MdGATA29, and MdGATA31, among others, possess more than two exons. Such differences also occur in other groups. Taken together, these results suggest that GATA genes have undergone moderate divergence in structure and function over the course of evolution.

Leaf tissue is significantly involved in the light-signaling pathway and photosynthesis during plant growth and development. Being very sensitive to environmental changes, tissue-cultured seedlings of apple are an ideal material for light regulation research (Li et al., 2012). To further analyze light-regulated expression in seedlings, we therefore selected MdGATA genes that were strongly expressed in leaves and seedlings according to microarray data. Tissue-specific gene expression analysis based on qRT-PCR revealed the highest expression of MdGATA genes in leaves, flowers, and buds, thus implicating MdGATA genes in the biological processes occurring in these tissues. For a few genes, however, the tissue-specific expression profiles obtained through qRT-PCR of mature tissues were not fully consistent with the RNA-seq data; for example, MdGATA7 showed lower expression in leaves according to qRT-PCR, possibly because of the different plant materials used. The identification of many light-responsive cis-elements in the promoters of MdGATA genes implies that their functions may be related to light regulation of development. In A. thaliana, GNC (AtGATA21) and CGA1 (AtGATA22) are both considered to be widely involved in the regulation of chlorophyll levels, chloroplast size, photosynthetic efficiency, and carbon and nitrogen metabolism (Bi et al., 2005; Hudson et al., 2011; Chiang et al., 2012). GNC homologs in rice and poplar, Os02g12790 (OsGATA11) and PdGNC, respectively, also play essential roles in regulating chlorophyll levels and carbon and nitrogen metabolism (Hudson et al., 2013; An et al., 2014). 29794.m003323, a homolog of the CGA1 gene in R. communis, has also been shown to potentially function in physiological processes of light regulation (Ao et al., 2015). The CGA1 ortholog MdGATA23, which is closely related to CGA1, GNC, 29794.m003323, and even Os02g12790 according to our phylogenetic analyses, exhibited the greatest difference in expression levels between photomorphogenesis and skotomorphogenesis of all analyzed MdGATA genes. Compared with the other MdGATA genes subjected to qRT-PCR expression analysis, MdGATA23 also contained the largest number of light-responsive cis-elements (i.e., 12), suggesting potential functions of MdGATA23 in light response and chlorophyll-level regulation. The highly consistent expression patterns of MdGATA genes between mature leaves and light-grown seedlings according to the qRT-PCR performed in this study also suggest that MdGATA genes are related to light-mediated regulation.

Environmental conditions are a fundamental factor influencing flowering in apple. Light affects apple flowering not only by photoperiodic induction (Suárez-López et al., 2001), but also via photosynthesis in the floral induction period, as shown in our previous research (Fan et al., 2016). It is not yet clear, however, whether MdGATA genes can function through the photoperiod pathway in a manner similar to other light-responsive genes such as CO (Onouchi et al., 2000), which control flower formation by responding to light changes and subsequently activating flowering-related genes via various signal transduction pathways. Nevertheless, a moderate decline in chlorophyll content accompanied the down-regulated expression of most MdGATA genes in our study, suggesting that MdGATA genes might regulate chlorophyll levels and photosynthetic efficiency and thereby indirectly regulate apple flowering. Understanding the details of this process, however, will require further research.

Floral induction is a decisive period for flowering in apple. In a previous study, ZIM (AtGATA25) and its two homologous genes AtGATA24 and AtGATA28 were all found to be highly expressed in shoot apices of the vegetative phase and inflorescences of the reproductive phase in A. thaliana (Shikata et al., 2004). Similarly, AtGATA18 (HANABA TARANU) has been reported to affect shoot apical meristem development and to act at the boundaries between the meristem and its newly initiated organ primordia and at the boundaries between different floral whorls (Zhao et al., 2004). According to our orthologous relationship analysis, the ortholog of AtGATA18 (HANABA TARANU) in apple is MdGATA24, and the orthologs of AtGATA24, AtGATA25, and AtGATA28 in apple are MdGATA13 and MdGATA26 (Table S2). Interestingly, RNA-seq during floral development indicated that MdGATA13 and MdGATA26 were highly expressed during all three stages of flower bud physiological differentiation, while MdGATA24 was also highly expressed during the early and middle stages. These results suggest that MdGATA13, MdGATA24, and MdGATA26 may play a role in floral induction in apple. The fact that the expression of the remaining six MdGATA genes (MdGATA1, MdGATA3, MdGATA5, MdGATA6, MdGATA17, and MdGATA20) remained elevated throughout the bud differentiation period also implies that they potentially have various functions in apple floral development. Despite these insights, the way in which GATA genes function in shoot apices to regulate flowering, such as by hormone induction or other types of signal transduction, is unknown. Improving flowering in apple has always been a topic of importance. Here, we have provided two possible perspectives, one direct and one indirect, for understanding the functions of MdGATA genes in apple flowering, with the ultimate goal of offering useful information for improving apple yield.

Overall, the genome-wide, evolutionary, and expression analyses of the GATA gene family in M. domestica carried out in our study resulted in the identification and characterization of 35 MdGATA genes. Based on their phylogenetic relationships, the 35 MdGATA genes were divided into four groups. MdGATA genes within the same group had similar conserved motifs, amino acid sites, and exon–intron organizations. Our analysis of gene duplication events suggests that whole-genome duplication may be the primary mechanism underlying expansion of the MdGATA gene family. Our analysis of promoter sequences revealed that light-response-related cis-acting elements are prevalent in the promoter regions of MdGATA genes. Tissue-specific expression analysis suggested that MdGATA genes are expressed mainly in leaves, flowers, and buds. The RNA-seq results showed that 11 MdGATA genes were highly expressed at different stages of flower bud physiological differentiation, suggesting that MdGATA genes may play a role in floral induction in apple. Moreover, most MdGATA genes, especially MdGATA23, were significantly down-regulated after 7 days of dark culture, suggesting a potential function in light-regulated transcription. Here, we have provided basic information about MdGATA genes that may be usefully applied, for example through genetic engineering and light control, to increase apple flowering and yield.
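The light versus dark comparison summarized above rests on relative expression values from qRT-PCR. This section does not state how those values were calculated, so the sketch below assumes the widely used 2^-ΔΔCt approach purely for illustration. The function name, the choice of a single reference gene, and all Ct values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene (treated vs control) by 2^-ddCt.

    Assumes roughly 100% amplification efficiency for both the target
    and the reference gene, as the 2^-ddCt method requires.
    """
    delta_treated = ct_target_treated - ct_ref_treated   # dCt in treated sample
    delta_control = ct_target_control - ct_ref_control   # dCt in control sample
    return 2.0 ** -(delta_treated - delta_control)


# Hypothetical Ct values: dark-grown seedlings as "treated",
# light-grown seedlings as "control".
dark_vs_light = relative_expression(
    ct_target_treated=27.0, ct_ref_treated=18.0,
    ct_target_control=24.0, ct_ref_control=18.0,
)
print(dark_vs_light)  # a value below 1 means lower expression in darkness
```

Under these made-up values the target needs three more cycles in darkness after normalization, which corresponds to an eightfold reduction.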

Acknowledgements

This work was supported by the National Science and Technology Supporting Project (2013BAD20B03) and the National Apple Industry Technology System of the Agricultural Ministry of China (CARS-28).

Conflict of interest

The authors declare that they have no conflict of interest.

References

An, Y., Han, X., Tang, S., Xia, X., Yin, W., 2014. Poplar GATA transcription factor PdGNC is capable of regulating chloroplast ultrastructure, photosynthesis, and vegetative growth in Arabidopsis under varying nitrogen levels. Plant Cell Tiss. Org. 119, 313-327.
Ao, T., Liao, X.J., Xu, W., Liu, A.Z., 2015. Identification and Characterization of GATA Gene Family in Castor Bean (Ricinus communis). Plant Diver. Resour. 37, 453-462.
Aravind, L., Iyer, L.M., 2012. The HARE-HTH and associated domains: novel modules in the coordination of epigenetic DNA and protein modifications. Cell Cycle. 11, 119-131.
Bai, Y., Meng, Y., Huang, D., Chen, M., Qi, Y., 2011. Origin and evolutionary analysis of the plant-specific TIFY transcription factor family. Genomics. 98, 128-136.
Bi, Y.M., Zhang, Y., Signorelli, T., Zhao, R., Zhu, T., Rothstein, S., 2005. Genetic analysis of Arabidopsis GATA transcription factor gene family reveals a nitrate-inducible member important for chlorophyll synthesis and glucose sensitivity. Plant J. 44, 680-692.
Borello, U., Ceccarelli, E., Giuliano, G., 1993. Constitutive, light-responsive and circadian clock-responsive factors compete for the different I box elements in plant light-regulated promoters. Plant J. 4, 611-619.
Cannon, S.B., Mitra, A., Baumgarten, A., Young, N.D., May, G., 2004. The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol. 4, 10.
Chiang, Y.H., Zubo, Y.O., Tapken, W., Kim, H.J., Lavanway, A.M., Howard, L., Pilon, M., Kieber, J.J., Schaller, G.E., 2012. Functional Characterization of the GATA Transcription Factors GNC and CGA1 Reveals Their Key Role in Chloroplast Development, Growth, and Division in Arabidopsis. Plant Physiol. 160, 332-348.
Chini, A., Fonseca, S., Chico, J.M., Fernández-Calvo, P., Solano, R., 2009. The ZIM domain mediates homo- and heteromeric interactions between Arabidopsis JAZ proteins. Plant J. 59, 77-87.
Chehadeh, W., Albaksami, O., Altawalah, H., Ahmad, S., Madi, N., John, S.E., Abraham, P.S., Al-Nakib, W., 2015. Phylogenetic analysis of HIV-1 subtypes and drug resistance profile among treatment-naïve people in Kuwait. J. Med. Virol. 87, 1521-1526.
Daniel-Vedele, F., Caboche, M., 1993. A tobacco cDNA clone encoding a GATA-1 zinc finger protein homologous to regulators of nitrogen metabolism in fungi. Mol. Genet. Genomics. 240, 365-373.
Emanuel, B.S., Shaikh, T.H., 2001. Segmental duplications: an 'expanding' role in genomic instability and disease. Nat. Rev. Genet. 2, 791-800.
Evans, T., Reitman, M., Felsenfeld, G., 1988. An erythrocyte-specific DNA binding factor recognizes a regulatory sequence common to all chicken globin genes. Proc. Natl. Acad. Sci. USA. 85, 5976-5980.
Fan, S., Zhang, D., Lei, C., Chen, H.F., Xing, L.B., Ma, J.J., Zhao, C.P., Han, M.Y., 2016. Proteome Analyses Using iTRAQ Labeling Reveal Critical Mechanisms in Alternate Bearing Malus prunifolia. J. Proteome Res. 15, 3602-3616.
Finn, R.D., Clements, J., Eddy, S.R., 2011. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29-W37.
Fukuda, A., Terao, T., 2015. QTLs for Shoot Length and Chlorophyll Content of Rice Seedlings Grown under Low-Temperature Conditions, using a Cross between Indica and Japonica Cultivars. Plant Prod. Sci. 18, 128-136.
Guruprasad, K., Reddy, B.B., Pandit, M.W., 1990. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Eng. 4, 155-161.
Higgins, D.G., Sharp, P.M., 1988. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 73, 237-244.
Hudson, D., Guevara, D.R., Hand, A.J., Xu, Z., Hao, L., Chen, X., Zhu, T., Bi, Y.M., Rothstein, S.J., 2013. Rice cytokinin GATA transcription Factor1 regulates chloroplast development and plant architecture. Plant Physiol. 162, 132-144.
Hudson, D., Guevara, D., Yaish, M.W., Hannam, C., Long, N., Clarke, J.D., Bi, Y.M., Rothstein, S.J., 2011. GNC and CGA1 modulate chlorophyll biosynthesis and glutamate synthase (GLU1/Fd-GOGAT) expression in Arabidopsis. PLoS One. 6, e26765.
Komeda, Y., 2004. Genetic regulation of time to flower in Arabidopsis thaliana. Annu. Rev. Plant Biol. 55, 521-535.
Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., Marra, M.A., 2009. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639-1645.
Kumar, S., Stecher, G., Tamura, K., 2016. MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33, 1870-1874.
Lam, E., Chua, N.H., 1989. ASF-2: a factor that binds to the cauliflower mosaic virus 35S promoter and a conserved GATA motif in Cab promoters. Plant Cell. 1, 1147-1156.
Li, L., Stoeckert, C.J., Roos, D.S., 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178-2189.
Lowry, J.A., Atchley, W.R., 2000. Molecular evolution of the GATA family of transcription factors: conservation within the DNA-binding domain. J. Mol. Evol. 50, 103-115.
Luo, X.M., Lin, W.H., Zhu, S., Zhu, J.Y., Sun, Y., Fan, X.Y., Cheng, M., Hao, Y.Q., Oh, E., Tian, M., Liu, L., Zhang, M., Xie, Q., Chong, K., Wang, Z.Y., 2010. Integration of light- and brassinosteroid-signaling pathways by a GATA transcription factor in Arabidopsis. Dev. Cell. 19, 872-883.
Li, Y.Y., Mao, K., Zhao, C., Zhao, X.Y., Zhang, H.L., Shu, H.R., Hao, Y.J., 2012. MdCOP1 ubiquitin E3 ligases interact with MdMYB1 to regulate light-induced anthocyanin biosynthesis and red fruit coloration in apple. Plant Physiol. 160, 1011-1022.
Liu, A., Yong, W., Dang, C., Zhang, D., Song, H., Yao, Q., Chen, K., 2012. A genome-wide identification and analysis of the basic helix-loop-helix transcription factors in the ponerine ant, Harpegnathos saltator. BMC Evol. Biol. 12, 165.
Manfield, I.W., Devlin, P.F., Jen, C.H., Westhead, D.R., Gilmartin, P.M., 2007. Conservation, convergence, and divergence of light-responsive, circadian-regulated, and tissue-specific expression patterns during evolution of the Arabidopsis GATA gene family. Plant Physiol. 143, 941-958.
Mayer, K.F., Schoof, H., Haecker, A., Lenhard, M., Jurgens, G., Laux, T., 1998. Role of WUSCHEL in regulating stem cell fate in the Arabidopsis shoot meristem. Cell. 95, 805-815.
Melotto, M., Mecey, C., Niu, Y., Chung, H.S., Katsir, L., Yao, J., Zeng, W., Thines, B., Staswick, P., Browse, J., Howe, G.A., 2008. A critical role of two positively charged amino acids in the Jas motif of Arabidopsis JAZ proteins in mediating coronatine- and jasmonoyl isoleucine-dependent interactions with the COI1 F-box protein. Plant J. 55, 979-988.
Onouchi, H., Igeño, M.I., Périlleux, C., Graves, K., Coupland, G., 2000. Mutagenesis of plants overexpressing CONSTANS demonstrates novel interactions among Arabidopsis flowering-time genes. Plant Cell. 12, 885-900.
Newton, A., Mackay, J., Crossley, M., 2001. The N-terminal zinc finger of the erythroid transcription factor GATA-1 binds GATC motifs in DNA. J. Biol. Chem. 276, 35794-35801.
Nishii, A., Takemura, M., Fujita, H., Shikata, M., Yokota, A., Kohchi, T., 2000. Characterization of a novel gene encoding a putative single zinc-finger protein, ZIM, expressed during the reproductive phase in Arabidopsis thaliana. Biosci. Biotechnol. Biochem. 64, 1402-1409.
Patient, R.K., McGhee, J.D., 2002. The GATA family (vertebrates and invertebrates). Curr. Opin. Genet. Dev. 12, 416-422.
Reyes, J.C., Muro-Pastor, M.I., Florencio, F.J., 2004. The GATA family of transcription factors in Arabidopsis and rice. Plant Physiol. 134, 1718-1732.
Schindler, U., Cashmore, A.R., 1990. Photoregulated gene expression may involve ubiquitous DNA binding proteins. EMBO J. 9, 3415-3427.
Scazzocchio, C., 2000. The fungal GATA factors. Curr. Opin. Microbiol. 3, 126-131.
Suárez-López, P., Wheatley, K., Robson, F., Onouchi, H., Valverde, F., Coupland, G., 2001. CONSTANS mediates between the circadian clock and the control of flowering in Arabidopsis. Nature. 410, 1116-1120.
Shikata, M., Matsuda, Y., Ando, K., Nishii, A., Takemura, M., Yokota, A., Kohchi, T., 2004. Characterization of Arabidopsis ZIM, a member of a novel plant-specific GATA factor gene family. J. Exp. Bot. 55, 631-639.
Teakle, G.R., Gilmartin, P.M., 1998. Two forms of type IV zinc-finger motif and their kingdom-specific distribution between the flora, fauna and fungi. Trends Biochem. Sci. 23, 100-102.
Terzaghi, W.B., Cashmore, A.R., 1995. Light-regulated transcription. Annu. Rev. Plant Physiol. 46, 445-474.
Trainor, C.D., Ghirlando, R., Simpson, M.A., 2000. GATA zinc finger interactions modulate DNA binding and transactivation. J. Biol. Chem. 275, 28157-28166.
Tsang, A.P., Visvader, J.E., Turner, C.A., Fujiwara, Y., Yu, C., Weiss, M.J., Crossley, M., Orkin, S.H., 1997. FOG, a multitype zinc finger protein, acts as a cofactor for transcription factor GATA-1 in erythroid and megakaryocytic differentiation. Cell. 90, 109-119.
Vanholme, B., Grunewald, W., Bateman, A., Kohchi, T., Gheysen, G., 2007. The tify family previously known as ZIM. Trends Plant Sci. 12, 239-244.
Velasco, R., Zharkikh, A., Affourtit, J., Dhingra, A., Cestaro, A., Kalyanaraman, A., Fontana, P., Bhatnagar, S.K., Troggio, M., Pruss, D., et al., 2010. The genome of the domesticated apple (Malus domestica Borkh.). Nat. Genet. 42, 833-839.
Vision, T.J., Brown, D.G., Tanksley, S.D., 2000. The origins of genomic duplications in Arabidopsis. Science. 290, 2114-2117.
Wang, W., Wu, P., Li, Y., Hou, X.L., 2016. Genome-wide analysis and expression patterns of ZF-HD transcription factors under different developmental tissues and abiotic stresses in Chinese cabbage. Mol. Genet. Genomics. 291, 1451-1464.
Wang, Y., Tang, H., DeBarry, J.D., Tan, X., Li, J., Wang, X., Lee, T., Jin, H., Marler, B., Guo, H., Kissinger, J.C., Paterson, A.H., 2012. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 40, e49-e49.
Wei, X., Wang, L., Yu, J., Zhang, Y.X., Li, D.H., Zhang, X.R., 2015. Genome-wide identification and analysis of the MADS-box gene family in sesame. Gene. 569, 66-76.
Xing, L., Zhang, D., Song, X., Weng, K., Shen, Y., Li, Y., Zhao, C., Ma, J., An, N., Han, M., 2016. Identifying genome-wide sequence variations and comparing floral-associated traits based on re-sequencing of two varieties of apple (Malus domestica Borkh.) 'Nagafu No. 2' and 'Qinguan'. Front. Plant Sci. 7.
Xing, L.B., Zhang, D., Li, Y.M., Shen, Y.W., Zhao, C.P., Ma, J.J., An, N., Han, M., 2015. Transcription profiles reveal sugar and hormone signaling pathways mediating flower induction in apple (Malus domestica Borkh.). Plant Cell Physiol. 56, 2052-2068.
Xu, G., Ma, H., Nei, M., Kong, H., 2009. Evolution of F-box genes in plants: different modes of sequence divergence and their relationships with functional diversification. Proc. Natl. Acad. Sci. USA. 106, 835-840.
Yang, S., Zhang, X., Yue, J.X., Tian, D., Chen, J.Q., 2008. Recent duplications dominate NBS-encoding gene expansion in two woody species. Mol. Genet. Genomics. 280, 187-198.
Zhang, Y.C., Gao, M., Singer, S.D., Fei, Z.J., Wang, H., Wang, X.P., 2012. Genome-wide identification and analysis of the TIFY gene family in grape. PLoS One. 7, e44465.
Zhang, H.X., Jin, J.H., He, Y.M., Lu, B.Y., Li, D.W., Chai, W.G., Khan, A., Gong, Z.H., 2016. Genome-wide identification and analysis of the SBP-box family genes under Phytophthora capsici stress in pepper (Capsicum annuum L.). Front. Plant Sci. 7.
Zhao, Y., Medrano, L., Ohashi, K., Fletcher, J.C., Yu, H., Sakai, H., Meyerowitz, E.M., 2004. HANABA TARANU is a GATA transcription factor that regulates shoot apical meristem and flower development in Arabidopsis. Plant Cell. 16, 2586-2600.
Zhou, T., Wang, Y., Chen, J.Q., Araki, H., Jing, Z., Jiang, K., Shen, J., Tian, D., 2004. Genome-wide identification of NBS genes in japonica rice reveals significant expansion of divergent non-TIR NBS-LRR genes. Mol. Genet. Genomics. 271, 402-415.

Table legends

Table 1 The 35 putative GATA genes identified in Malus domestica in this study along with their predicted and tallied physicochemical properties

Table S1 Thirty Arabidopsis thaliana GATA genes and 29 Oryza sativa GATA genes used in phylogenetic analyses

Table S2 Eighteen orthologous GATA gene pairs between Malus domestica and Arabidopsis thaliana

Table S3 Syntenic block regions detected using MCScanX toolkit

Table S4 Distribution of motifs in MdGATA proteins

Table S5 MdGATA gene-specific primers used for qRT-PCR analysis

Figure legends

Fig. 1 Chromosomal distribution of GATA genes in Malus domestica. The scale is in kilobases (kb). The chromosome number is shown at the top of each chromosome.

Fig. 2 An unrooted phylogenetic tree representing the relationships of GATA genes in Malus domestica, Oryza sativa, Ricinus communis and Arabidopsis thaliana. A total of 35 MdGATA, 30 AtGATA, 29 OsGATA, and 19 RcGATA proteins were used to construct the tree. GATA proteins from different species are indicated by different symbols: M. domestica by filled circles, A. thaliana by solid yellow triangles, O. sativa by solid squares and R. communis by solid prisms. The four different groups are represented by different colors. Numbers at nodes represent bootstrap values based on 1,000 replicates. Bootstrap values below 50% are not shown (Liu et al., 2012; Chehadeh et al., 2015).

Fig. 3 Alignment of amino acid sequences from 35 putative GATA genes in Malus domestica. GATA motifs and amino acid sites are marked at the top, and sequence identities are shown at the bottom.

Fig. 4 Syntenic relationships of GATA genes in apple and Arabidopsis thaliana. (a) Results of paralogous relationship analysis of MdGATA genes and orthologous relationship analysis of GATA genes between apple and A. thaliana. (b) Results of synteny analyses of MdGATA and GATA genes between apple and A. thaliana. Homologous genes and syntenic gene regions are connected by colored curves.

Fig. 5 Exon–intron structures of GATA genes and a schematic diagram of the amino acid motifs of GATA proteins in Malus domestica and Arabidopsis thaliana. The position of the sequence encoding the GATA motif in GATA genes is shown as a feature. The phylogenetic tree representing the relationships of GATA genes was constructed using MEGA 7.0.14 according to the neighbor-joining method with 1,000 bootstrap test replicates. Bootstrap values below 50% are not shown.

Fig. 6 Distribution of cis-elements in the promoters of putative MdGATA genes. (a) The number of various cis-elements in the promoters of each MdGATA gene. (b) The relative proportions of different cis-elements in the promoters of MdGATA genes are indicated by the pie chart. Cis-elements sharing identical or similar functions are represented by the same color.

Fig. 7 Expression profiles of 10 MdGATA genes in different tissues, including stems, leaves, flowers, fruits, and buds, investigated by qRT-PCR. Values are means of three replicates ± SE. Small letters indicate significant differences at the 0.05 level.

Fig. 8 Comparative analysis of light- and dark-cultured apple seedlings. (a) Phenotypes of seedlings grown in light (L) and in darkness (D). (b–c) Chlorophyll content of light-grown and dark-grown seedlings (b) and expression changes of 10 MdGATA genes during photomorphogenesis (L) and skotomorphogenesis (D) determined by real-time quantitative reverse transcription PCR (c). Values are means of three replicates ± SE. The ratios of chlorophyll content and expression level between light-grown and dark-grown seedlings are shown.

Fig. S1 Transcriptional profile of MdGATA genes in tissues of ‘Golden Delicious’ and different apple hybrids. GSM numbers correspond to different RNA sequencing samples. The bar at the top of the heat map represents relative expression values.

Fig. S2 Flowering-related MdGATA gene expression profiles at early (ES), middle (MS), and late (LS) stages of flower bud physiological differentiation. Fragments per kilobase of transcript per million mapped reads (FPKM) values and the hierarchical method were used for the cluster analysis.
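The FPKM unit named in the Fig. S2 legend normalizes a raw fragment count by transcript length and by sequencing depth. The following minimal sketch shows that arithmetic only; the function name and the example numbers are illustrative and are not taken from the study's data or pipeline.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Illustrative numbers: 100 fragments on a 1,000 bp transcript
# in a library of one million mapped fragments.
example = fpkm(100, 1000, 1_000_000)
print(example)
```

Because both normalizations are simple divisions, doubling the transcript length or the library size halves the value, which is what allows genes and samples to be compared before clustering.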

Figures 1–8 (image pages; graphics not reproduced in this text version)

Table 1 The 35 putative GATA genes in M. domestica with their predicted and tallied physicochemical properties.

Gene locus | CDS size (bp) | Number of amino acids (aa) | Molecular weight (Da) | pI | Instability index | Aliphatic index | Hydrophobicity
MDP0000220844 | 1245 | 414 | 44895.9 | 6.43 | 65.9 | 62.15 | -0.577
MDP0000255235 | 1032 | 343 | 37820.1 | 8.29 | 67.31 | 53.99 | -0.829
MDP0000824445 | 996 | 331 | 36377.2 | 6.29 | 69.74 | 55.35 | -0.758
MDP0000172464 | 966 | 321 | 35536.2 | 5.66 | 52.46 | 71.31 | -0.45
MDP0000777336 | 972 | 323 | 35496.8 | 5.88 | 52.22 | 70.28 | -0.533
MDP0000528111 | 732 | 243 | 27417.9 | 9.37 | 58.46 | 43.05 | -0.9
MDP0000462997 | 1305 | 434 | 47505.7 | 6.07 | 50.32 | 69.49 | -0.503
MDP0000290156 | 1125 | 374 | 40848.2 | 6.01 | 52.28 | 59.2 | -0.724
MDP0000248210 | 1125 | 374 | 40859.4 | 6.31 | 54.71 | 58.66 | -0.694
MDP0000542350 | 852 | 283 | 31634.2 | 8.95 | 42.52 | 56.18 | -0.673
MDP0000566760 | 1584 | 527 | 59551 | 9.16 | 51.17 | 58.52 | -0.729
MDP0000542351 | 273 | 90 | 9935.4 | 10.29 | 39.39 | 43.33 | -1.036
MDP0000401351 | 357 | 118 | 13170.1 | 8.68 | 34.2 | 85.85 | -0.205
MDP0000224540 | 582 | 193 | 21172.9 | 9.22 | 58.15 | 56.58 | -0.715
MDP0000182176 | 1092 | 363 | 39844.9 | 8.61 | 49.48 | 59.39 | -0.753
MDP0000137305 | 1056 | 351 | 37928.1 | 5.86 | 60.18 | 69.23 | -0.489
MDP0000248942 | 1884 | 627 | 72000.7 | 10.62 | 62.96 | 85.39 | -0.434
MDP0000166889 | 906 | 301 | 33653.7 | 8.34 | 57.03 | 61.3 | -0.734
MDP0000338280 | 1086 | 363 | 40470.9 | 7.69 | 57.29 | 69.06 | -0.575
MDP0000275252 | 819 | 272 | 30448.4 | 8.77 | 57.18 | 62.06 | -0.757
MDP0000310271 | 1044 | 347 | 38351.5 | 9.8 | 50.89 | 59.63 | -0.623
MDP0000253174 | 810 | 269 | 29607.5 | 8.32 | 55.83 | 46.77 | -0.798
MDP0000263391 | 543 | 180 | 20096.4 | 10.09 | 26.6 | 46.11 | -0.663
MDP0000309902 | 3486 | 1161 | 129736.1 | 8.98 | 51.02 | 87.08 | -0.299
MDP0000131803 | 1029 | 342 | 37077.6 | 9.49 | 60.76 | 58.57 | -0.644
MDP0000190038 | 1080 | 359 | 38758.3 | 9.35 | 56.5 | 57.13 | -0.667
MDP0000739900 | 1074 | 357 | 39063.5 | 9.17 | 40.59 | 61.79 | -0.638
MDP0000237740 | 1068 | 355 | 38684.4 | 9.2 | 44.76 | 60.25 | -0.655
MDP0000283079 | 1554 | 517 | 55849.1 | 7.66 | 42.95 | 65.86 | -0.591
MDP0000316985 | 1653 | 550 | 60344 | 5.31 | 45.42 | 58.15 | -0.827
MDP0000303048 | 921 | 306 | 33338.8 | 5.53 | 56.27 | 61.54 | -0.701
MDP0000192617 | 1212 | 403 | 44360.6 | 5.23 | 48.54 | 68.39 | -0.674
MDP0000309356 | 1875 | 624 | 70048.3 | 7.01 | 52.25 | 75.69 | -0.467
MDP0000129092 | 1824 | 607 | 67711.5 | 6.56 | 58.32 | 71.3 | -0.575
MDP0000703990 | 663 | 220 | 24582.1 | 10.61 | 76.74 | 73.14 | -0.67

Gene name and Group (A, B, C, D) columns: row alignment is not recoverable in this text version. Gene names appear in the source in the following order: MdGATA9, MdGATA8, MdGATA20, MdGATA16, MdGATA7, MdGATA15, MdGATA22, MdGATA12, MdGATA28, MdGATA33, MdGATA29, MdGATA30, MdGATA32, MdGATA31, MdGATA10, MdGATA1, MdGATA5, MdGATA18, MdGATA17, MdGATA4, MdGATA24, MdGATA25, MdGATA19, MdGATA2, MdGATA23, MdGATA21, MdGATA11, MdGATA34, MdGATA26, MdGATA6, MdGATA27, MdGATA35, MdGATA14, MdGATA3, MdGATA13.
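The two screening rules applied to Table 1 in the Discussion (an instability index above 40 taken as a sign of possible instability, and a negative hydrophobicity/GRAVY value taken as hydrophilic) can be checked directly against the tabulated values. The sketch below copies the gene locus, instability index, and hydrophobicity values from Table 1; the function and variable names are illustrative, and the cutoff of 40 is the one quoted in the text.

```python
# (gene locus, instability index, hydrophobicity/GRAVY) copied from Table 1.
TABLE1 = [
    ("MDP0000220844", 65.9, -0.577), ("MDP0000255235", 67.31, -0.829),
    ("MDP0000824445", 69.74, -0.758), ("MDP0000172464", 52.46, -0.45),
    ("MDP0000777336", 52.22, -0.533), ("MDP0000528111", 58.46, -0.9),
    ("MDP0000462997", 50.32, -0.503), ("MDP0000290156", 52.28, -0.724),
    ("MDP0000248210", 54.71, -0.694), ("MDP0000542350", 42.52, -0.673),
    ("MDP0000566760", 51.17, -0.729), ("MDP0000542351", 39.39, -1.036),
    ("MDP0000401351", 34.2, -0.205), ("MDP0000224540", 58.15, -0.715),
    ("MDP0000182176", 49.48, -0.753), ("MDP0000137305", 60.18, -0.489),
    ("MDP0000248942", 62.96, -0.434), ("MDP0000166889", 57.03, -0.734),
    ("MDP0000338280", 57.29, -0.575), ("MDP0000275252", 57.18, -0.757),
    ("MDP0000310271", 50.89, -0.623), ("MDP0000253174", 55.83, -0.798),
    ("MDP0000263391", 26.6, -0.663), ("MDP0000309902", 51.02, -0.299),
    ("MDP0000131803", 60.76, -0.644), ("MDP0000190038", 56.5, -0.667),
    ("MDP0000739900", 40.59, -0.638), ("MDP0000237740", 44.76, -0.655),
    ("MDP0000283079", 42.95, -0.591), ("MDP0000316985", 45.42, -0.827),
    ("MDP0000303048", 56.27, -0.701), ("MDP0000192617", 48.54, -0.674),
    ("MDP0000309356", 52.25, -0.467), ("MDP0000129092", 58.32, -0.575),
    ("MDP0000703990", 76.74, -0.67),
]

INSTABILITY_CUTOFF = 40.0  # cutoff quoted in the text (Guruprasad et al., 1990)


def summarize(rows, cutoff=INSTABILITY_CUTOFF):
    """Count proteins flagged as possibly unstable and as hydrophilic."""
    unstable = [locus for locus, instability, _ in rows if instability > cutoff]
    hydrophilic = [locus for locus, _, gravy in rows if gravy < 0]
    return {
        "total": len(rows),
        "unstable": len(unstable),
        "hydrophilic": len(hydrophilic),
    }


summary = summarize(TABLE1)
print(summary)
```

Run on the tabulated values, this agrees with the statements that most instability indexes exceed 40 and that every GRAVY value is negative.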

Highlights

1. A total of 35 Malus domestica GATA genes were identified and divided into four groups based on phylogenetic analysis.
2. Eighteen orthologous gene pairs were found between Arabidopsis and M. domestica. Segmental and whole-genome duplications may account for the expansion of GATA genes in Malus domestica.
3. Four main conserved motifs were identified in GATA proteins, and the exon–intron physical and encoding layouts were then analyzed.
4. RT-qPCR and RNA-seq results showed that MdGATA genes were expressed mainly in leaf, flower, and bud; most MdGATA genes were significantly down-regulated after 7 days of dark culture, suggesting a potential function in light-regulated transcription; 11 MdGATA genes were highly expressed at different stages of flower bud physiological differentiation, suggesting that MdGATA genes might play a role during floral induction in apple.