Polymorphisms of IFS1 and IFS2 gene are associated with isoflavone concentrations in soybean seeds


Plant Science 175 (2008) 505–512

Contents lists available at ScienceDirect

Plant Science journal homepage: www.elsevier.com/locate/plantsci

Hao Cheng a, Oliver Yu b, Deyue Yu a,*

a National Center for Soybean Improvement, National Key Laboratory of Crop Genetics and Germplasm Enhancement, Nanjing Agricultural University, Nanjing 210095, China
b Donald Danforth Plant Science Center, St. Louis, MO 63132, USA

ARTICLE INFO

Article history: Received 29 January 2008; received in revised form 13 May 2008; accepted 29 May 2008; available online 6 June 2008

Keywords: Association analysis; Isoflavone synthase; Linkage disequilibrium; Isoflavone; Soybean

ABSTRACT

Soybean isoflavones are associated with many health benefits of soy consumption, and isoflavone level is one of the most important output traits of soybean. Isoflavones are synthesized via the phenylpropanoid pathway. Isoflavone synthase (IFS) acts as the key metabolic entry point for the formation of all kinds of isoflavones. We have cloned, sequenced, and analyzed the IFS1 and IFS2 genomic regions from 33 Chinese soybean accessions, including 16 Glycine soja and 17 Glycine max. The isoflavone levels in these accessions vary greatly, ranging from 536.6 μg/g to 5509.1 μg/g dry seed weight. High nucleotide diversity and a low extent of linkage disequilibrium (LD) in these two genes provided sufficient genetic resolution for association analysis between polymorphisms in these genes and soybean seed isoflavone levels. As a result, three single nucleotide polymorphisms (SNPs) in the IFS1 gene and two SNPs in the IFS2 gene were found to be closely associated (P < 0.05) with all individual and total isoflavone levels in seeds, regardless of population structure. These results indicate that the IFS1 and IFS2 genes both contribute to the levels of isoflavones in seeds. These polymorphisms may serve as important molecular markers for breeding. © 2008 Elsevier Ireland Ltd. All rights reserved.

* Corresponding author: D. Yu. Tel.: +86 25 84396410; fax: +86 25 84396410.
0168-9452/$ – see front matter © 2008 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.plantsci.2008.05.020

1. Introduction

Isoflavones are phenolic secondary metabolites found mostly in legumes. Epidemiological studies comparing populations in Asia, where soy consumption is high, with those in Western countries, where soy consumption is relatively low, suggested that soybean food may contribute to multiple health benefits [1]. Additional research has demonstrated that soybean isoflavones are beneficial against certain cancers, osteoporosis, cardiovascular disease, and menopausal symptoms in animal models and some human trials [2–5]. Although the health benefits of isoflavones are generally accepted, they are not without controversy [6]. Some concerns persist about isoflavones fed to infants in formula [7] and about safe levels of isoflavones for adults [8]. Thus, metabolic engineering of isoflavonoid biosynthesis to either increase or decrease isoflavone levels in soybean seeds may have significant nutritional impact by controlling dietary isoflavone levels for the improvement of human health [9].

In soybean seeds, the three types of isoflavones, daidzein (Dai), genistein (Gen), and glycitein (Gly), occur predominantly as glucosides or malonyl-glucosides [10,11]. Total seed isoflavone concentrations are the sum of these aglycones and conjugates. Total isoflavones fluctuate with crop year and planting location, indicating a large environmental effect [12–14]. Other studies have shown large variations in isoflavone concentrations and compositions among soybean genotypes as well [15]. Together, genetic and environmental factors make breeding for isoflavone levels very difficult in soybean. Several quantitative trait loci (QTL) for individual and total isoflavone concentrations in soybean seeds have recently been discovered, demonstrating the complexity of isoflavone traits [16,17]. Additionally, it is not clear whether or how these isoflavone variations are affected by critical structural enzymes. Since sequence variations are heritably stable and can have a major impact on how the organism develops and responds to the environment [18], variations in the genes encoding the key isoflavone synthesis enzymes could have a major impact on seed isoflavone levels.

In plants, isoflavone biosynthesis is a part of the phenylpropanoid pathway (Fig. 1). Isoflavone synthase (IFS) is the first committed enzyme in the isoflavone pathway and converts flavanone substrates to isoflavone products. The unique aryl migration reaction that creates isoflavones is mediated by this enzyme, which belongs to the CYP93C subfamily of cytochrome P450 monooxygenases and whose encoding gene has been identified [19–21]. When IFS genes were silenced in soybean, isoflavone biosynthesis was blocked, and no isoflavone accumulation was detected [22]. Clearly, IFS genes are essential for the biosynthesis and accumulation of isoflavones in plants. So far, two IFS genes, designated IFS1 and IFS2, have been reported in soybean [21]. Thus, in our study, we focused on the polymorphisms of both IFS1 and IFS2 to evaluate how these two genes affect the accumulation of soybean isoflavones.

Fig. 1. Partial diagram of isoflavonoid biosynthesis in soybean. CHI, chalcone isomerase; CHR, chalcone reductase; CHS, chalcone synthase; F3H, flavanone-3-hydroxylase; F6H, flavanone-6-hydroxylase; FLS, flavonol synthase; FNS, flavone synthase; GT, glucosyl-transferase; IFS, isoflavone synthase; IMT, isoflavone methyl-transferase; MT, malonyl-transferase. Dotted arrows represent multiple or uncertain steps.

Single nucleotide polymorphisms (SNPs), which include single base pair changes and small insertions/deletions (indels), can serve as molecular genetic markers. SNPs are abundant and relatively stable in the genome, and have been discovered within genes underlying observed traits [23]. To identify the causative SNPs in the IFS1 and IFS2 genes that are associated with soybean seed isoflavone levels, we carried out a set of association analyses in this study.

Association analysis, also known as association mapping, is a population-based survey used to identify trait-marker relationships based on linkage disequilibrium (LD) [24]. Association analysis has the potential to identify a single polymorphism within a gene that is responsible for a difference in phenotype [24]. It was originally developed for mapping human disease genes [25,26] and has been used extensively in medical genetics [27]. The primary obstacle to successful association studies in plants is population structure [28]. The presence of subgroups with an unequal distribution of alleles within a population can result in non-functional, spurious associations [29]. To solve this problem, Pritchard et al. [30] developed methods that account for population structure by including in the association model a vector of subpopulation memberships derived from simple sequence repeat (SSR) markers. Thornsberry et al. [31] extended Pritchard's method to quantitative traits and applied it to evaluate maize flowering times. Since then, there have been increasing reports on association analysis in plants, such as in maize [32–34], Arabidopsis [35,36], and wheat [37]. These reports showed that association analysis can be a reliable method for finding causative SNPs.

In our analysis, we found many SNPs at each of the two genes that were closely associated (P < 0.05) with either individual types of seed isoflavones or the total isoflavone concentration, regardless of population structure. More importantly, common sites, including three SNPs in the IFS1 gene and two SNPs in the IFS2 gene, were found to be closely associated (P < 0.05) with all individual types and with total seed isoflavone concentrations. These results indicate that both IFS1 and IFS2 play critical roles in the synthesis and accumulation of soybean isoflavones. These SNPs will be valuable genetic markers for future breeding efforts in soybean.

2. Materials and methods

2.1. Plant materials

A collection of 33 Chinese soybean accessions, consisting of 16 wild (Glycine soja) and 17 cultivated (Glycine max) accessions collected from latitude 19–49°N and longitude 106–131°E, was included in the analysis (Table 1). This population of accessions was selected to sample not only all six ecological regions of soybean in China [38] but also soybeans with diverse seed isoflavone levels. Seeds of all accessions were obtained from the Germplasm Storage of the Chinese National Center for Soybean Improvement (Nanjing Agricultural University, Nanjing, China).

Table 1 Seed isoflavone levels in 33 soybean accessions and GenBank accession numbers of the sequences obtained from each accession for the IFS1 and IFS2 genes

Accession   Total (μg/g)   Dai (μg/g)   Gen (μg/g)   Gly (μg/g)   GenBank accession no. for IFS1   GenBank accession no. for IFS2
C_HC01       536.6          200.2        231.9       104.5        EU391463                         EU391505
W_HC02      1776.8          390.2        896.1       490.4        EU391476                         EU391519
W_HC03      2099.4          603.2        902.2       594.0        EU391491                         EU391517
C_HC04      2371.8          970.9        974.0       426.9        EU391489                         EU391513
W_HC05      4075.8         1407.4       1958.6       709.8        EU391482                         EU391503
W_HC06      2816.4         1174.7       1149.4       492.3        EU391478                         EU391523
C_HC07      3103.0         1191.6       1567.9       343.4        EU391468                         EU391507
W_HC08      4824.6         2138.6       2130.2       555.9        EU391483                         EU391524
W_HC09      4124.0         1537.2       1825.0       761.7        EU391479                         EU391520
W_HC10      3787.1         1697.5       1687.5       402.2        EU391477                         EU391525
C_HC11      1527.5          544.9        576.9       405.7        EU391475                         EU391493
W_HC12      5509.1         2394.3       2291.3       823.5        EU391469                         EU391499
W_HC14      3600.1         1792.2       1308.7       499.3        EU391487                         EU391516
W_HC15      3884.8         1443.2       1918.9       522.8        EU391484                         EU391518
C_HC16      2028.6          960.4        854.4       213.8        EU391460                         EU391508
W_HC17      3249.2         1583.7       1089.2       576.3        EU391485                         EU391498
W_HC18      4319.2         1639.8       2138.0       541.4        EU391480                         EU391521
W_HC19      3930.5         1724.2       1752.4       454.0        EU391481                         EU391500
W_HC20      3853.1         1062.8       2380.4       409.9        EU391492                         EU391496
C_HC21      4508.0         1532.5       2587.2       388.3        EU391462                         EU391510
W_HC22      4341.6         1424.1       2568.4       349.1        EU391488                         EU391522
W_HC23      2415.6         1027.6       1074.7       313.3        EU391486                         EU391497
C_HC24      4002.4         1229.4       2076.5       696.5        EU391490                         EU391494
C_HC25      3087.7          854.6       1625.4       607.7        EU391470                         EU391502
C_HC26      3081.8         1063.2       1399.9       618.7        EU391461                         EU391511
C_HC27      3542.6         1593.5       1422.4       526.7        EU391472                         EU391509
C_HC28      3678.3         1365.4       1841.3       471.6        EU391465                         EU391495
C_HC29      4580.1         1970.7       2064.9       544.4        EU391473                         EU391504
C_HC31      3535.3         1443.4       1529.7       562.3        EU391474                         EU391506
C_HC32      2368.9          740.6       1200.8       427.5        EU391466                         EU391515
C_HC33      2393.9         1028.5        970.0       395.4        EU391471                         EU391514
C_HC35      3775.9         1591.3       1916.2       268.4        EU391464                         EU391512
C_HC37      2119.7          801.0       1093.7       224.9        EU391467                         EU391501
Overall     3298.5         1276.4       1545.6       476.5
G. max      2955.4         1122.5       1407.8       425.1
G. soja     3663.0         1440.0       1691.9       531.0

Plants were grown at the Jiangpu Trial Station of Nanjing Agricultural University. Three replications of all accessions were grown in the same field in a randomized block design. Plots consisted of single rows, 0.8 m apart and 4 m long, with 5 hill-plots per row and 8 plants in each hill-plot.

2.2. Phenotypic analysis

Mature soybean seeds from each plot were harvested. A bulk sample of seeds was randomly selected from the seed bag and ground to a fine powder using a POLYMIX Analytical Mill (model A 10, Kinematica, Switzerland). Each 1 g of powder was extracted with 5 mL MeOH–H2O (4:1) and analyzed by HPLC (Agilent 1100, Wilmington, DE) equipped with a diode array detector and an Agilent ZORBAX SB-C18 column (4.6 mm × 150 mm, 3.5 μm, Agilent) maintained at 30 °C [39]. The column was eluted at 1 mL/min with a linear solvent gradient from 10% acetonitrile (CH3CN) with 0.1% trifluoroacetic acid (TFA) to 26% CH3CN with 0.1% TFA over 15 min, and 100% CH3CN with 0.1% TFA from 15 to 18 min. The quantity of each of the 12 isoflavones was calculated by calibration against standard curves prepared from authentic compounds. Five of the 12 authentic compounds (daidzein, genistein, glycitein, daidzin and genistin) were purchased from Sigma–Aldrich (St. Louis, MO); the other seven (glycitin, 6″-O-malonyldaidzin, 6″-O-malonylgenistin, 6″-O-malonylglycitin, 6″-O-acetyldaidzin, 6″-O-acetylgenistin and 6″-O-acetylglycitin) were purchased from Fijicco (Tokyo, Japan). The moisture content of the seeds was calculated following standard procedures (Chinese National Standard GB 3523-1983). The final isoflavone concentration was converted to dry seed weight. The analysis of variance (ANOVA) for the preliminary data was performed with SAS for Windows version 9.0 (SAS Institute, Cary, NC).

2.3. DNA isolation, PCR amplification, cloning of amplification products and sequencing

Total genomic DNA was extracted from bulked leaf tissue (8–10 g) of G. soja or G. max plants as described by Keim et al. [40]. Primers were designed using OLIGO primer design software version 6.71 (Molecular Biology Insights, Inc., Cascade, CO). To design PCR primers for IFS1, we used BioLign software version 4.0.6 (http://en.bio-soft.net/dna/BioLign.html) to assemble the published sequences of the IFS1 gene of G. max cv. Wye, whose accession numbers are AY530096 [41], AF195818 and AF195798 [21]. The assembled sequence is 4464 bp long, with two exons. We amplified approximately 2800 nucleotides including a partial promoter, a 5′-UTR, the complete CDS, an intron and a partial 3′-UTR. These sequences were amplified by PCR using the following three primer pairs at different annealing temperatures:
(1) ATGTGTTTCTGGGGTTATTG/AGTTGTCGTAAGTGAGGCGTC, at 60 °C;
(2) GACGCCTCACTTACGACAACT/AGAAAAAGTCCTACATACCCA, at 56 °C;
(3) TGGGTATGTAGGACTTTTTCT/ATGTAACCTTAATTACTTGAT, at 52 °C.

To design PCR primers for IFS2, we assembled the published sequences of the IFS2 gene of G. max cv. Wye, whose accession numbers are AY530097 [41] and AF195819 [21]. The assembled sequence is 3040 bp long, also with two exons. We amplified approximately 3000 bp including a promoter, a 5′-UTR, the complete CDS, an intron and a partial 3′-UTR. These sequences were amplified by PCR using the following two primer pairs at different annealing temperatures:
(1) CAGGCAAAGAGAACCAAAACA/TTTACAGTGGTGGCGTTGGGA, at 59 °C;
(2) TCCCAACGCCACCACTGTAAA/AAACGAAGACAAATGGGAGAT, at 58 °C.

PCR was conducted in a 25 μL reaction volume using ExTaq polymerase (TAKARA, Kobe, Japan) following the manufacturer's recommendations. PCR was carried out on a PTC-225 thermal cycler (MJ Research, Watertown, MA) with 1 cycle of 3 min at 94 °C; 30 cycles of 30 s at 94 °C, 1 min at the annealing temperature of the respective primer pair, and 1 min at 72 °C; followed by 1 cycle of 10 min at 72 °C. The PCR product was separated by electrophoresis on a 1% agarose gel, and the band of expected size was excised and purified with the Quick PCR gel purification kit (BioDev-Tech., Beijing, China). Purified PCR products were cloned into the pGEM T-easy plasmid vector (Promega, Madison, WI) and transformed into DH5α E. coli competent cells. Plasmids were purified using the Wizard Miniprep kit (Promega) and sequenced on both strands with both M13 and sequence-specific primers. DNA sequencing was performed with BigDye v3.0 (Applied Biosystems, Foster City, CA) and run on an Applied Biosystems 3730 automated DNA sequencer. Each clone was sequenced in both directions to ensure accuracy. Sequences have been deposited in GenBank (accession numbers in Table 1).

2.4. Analysis of DNA sequences

All sequences were checked visually using Chromas software version 2.31 (http://www.technelysium.com.au/chromas.html) and, if necessary, edited according to the electropherograms. The sequencing results were assembled using BioLign software. Assembled sequences were then aligned using CLUSTAL X software version 1.83 [42]. A manual check was performed in every case to ensure sequencing and alignment quality. Polymorphism data analysis and neutrality tests were carried out using DnaSP software version 4.10 [43].
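
Assembling the reads from separate amplicons into one gene sequence depends on adjacent amplicons overlapping. The primer pairs listed in Section 2.3 appear to be laid out for this: each reverse primer is the reverse complement of the next forward primer. The short Python sketch below checks that property. The function and variable names are illustrative assumptions and are not part of the original methods; only the primer sequences come from the text.

```python
# Illustrative check (not part of the original methods): adjacent amplicons
# overlap when a reverse primer is the reverse complement of the next
# forward primer. Primer sequences are copied from Section 2.3.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# (forward, reverse) primer pairs, each written 5'->3'
IFS1_PRIMERS = [
    ("ATGTGTTTCTGGGGTTATTG", "AGTTGTCGTAAGTGAGGCGTC"),
    ("GACGCCTCACTTACGACAACT", "AGAAAAAGTCCTACATACCCA"),
    ("TGGGTATGTAGGACTTTTTCT", "ATGTAACCTTAATTACTTGAT"),
]
IFS2_PRIMERS = [
    ("CAGGCAAAGAGAACCAAAACA", "TTTACAGTGGTGGCGTTGGGA"),
    ("TCCCAACGCCACCACTGTAAA", "AAACGAAGACAAATGGGAGAT"),
]


def amplicons_tile(pairs) -> bool:
    """True if each reverse primer sits on the next forward primer's site."""
    return all(
        reverse_complement(rev) == next_fwd
        for (_, rev), (next_fwd, _) in zip(pairs, pairs[1:])
    )


if __name__ == "__main__":
    print("IFS1 amplicons tile:", amplicons_tile(IFS1_PRIMERS))
    print("IFS2 amplicons tile:", amplicons_tile(IFS2_PRIMERS))
```

Under this layout, neighbouring amplicons share only the primer-length junction, so assembly across amplicons rests on those shared primer sites.
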
Levels of nucleotide diversity were estimated as π, the average number of nucleotide differences per site between two sequences [44], and θ, the number of segregating sites (S) corrected for the sample size, $\theta = S / a_{n-1}$, where $a_{n-1} = \sum_{i=1}^{n-1} 1/i$ and n is the sample size [45]. We tested the hypothesis of neutral polymorphisms using Tajima's D test [46] and Fu and Li's D* and F* tests [47].

LD between pairs of polymorphic sites, excluding singletons, in each of these two genes was estimated with the TASSEL software version 2.0 [48]. Squared allele frequency correlations (r²) [49] were chosen for LD calculations. The significance of LD between sites was determined by Fisher's exact test. The trend lines fitted to the data were calculated using TableCurve software version 5.01 (Systat Software Inc., San Jose, CA) and drawn with SigmaPlot software version 10.0 (Systat Software).

2.5. Population structure and association analysis

All lines were genotyped with 55 unlinked simple sequence repeat (SSR) markers providing even coverage of the soybean genome. The SSR markers employed are publicly available (http://bldg6.arsusda.gov/pooley/soy/cregan/soymap1.html). Population structure was inferred from the SSR data using the Structure software version 2.2 [50]. The No Admixture model was applied. The optimum number of populations (K) was selected after five independent runs, each with a burn-in of 500,000 iterations followed by 500,000 iterations, testing K = 1–8 (see the Structure 2.2 documentation at http://pritch.bsd.uchicago.edu/software). Structure produces a Q matrix that lists the estimated membership coefficients for each individual in each cluster. The estimated Q matrices were used in the subsequent association analysis, a logistic regression carried out in the TASSEL software. Mean phenotypic values (Table 1) were used for the association analysis. In addition, the general linear model (GLM) analysis in TASSEL was employed to identify associations without considering population structure. All polymorphisms (including singletons) were tested, and the P-value for each polymorphism was estimated from 1000 permutations of the dataset, for both GLM and logistic regression. Polymorphisms with P < 0.05 were considered significantly associated with the traits.
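
The idea of a permutation-based P-value can be sketched in a few lines of Python. This is a simplified illustration, not TASSEL's implementation: it assumes a biallelic marker, uses the absolute difference in mean phenotype between the two allele classes as the test statistic, and ignores population structure. The function name, the default of 1000 permutations (chosen to mirror the text) and the example data are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(genotypes, phenotypes, n_perm=1000, seed=0):
    """Permutation P-value for a biallelic marker and a quantitative trait.

    Test statistic: absolute difference in mean phenotype between the two
    allele classes. This is a simplification of a model-based test and does
    not account for population structure.
    """
    alleles = sorted(set(genotypes))
    if len(alleles) != 2 or len(genotypes) != len(phenotypes):
        raise ValueError("expected two alleles and one phenotype per genotype")

    def statistic(values):
        group_a = [v for g, v in zip(genotypes, values) if g == alleles[0]]
        group_b = [v for g, v in zip(genotypes, values) if g == alleles[1]]
        return abs(mean(group_a) - mean(group_b))

    observed = statistic(phenotypes)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype-phenotype link
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical data: 8 accessions per allele with clearly separated traits.
    genotypes = ["A"] * 8 + ["G"] * 8
    phenotypes = [1000 + 10 * i for i in range(8)] + [3000 + 10 * i for i in range(8)]
    print("P =", permutation_p_value(genotypes, phenotypes))
```

Shuffling the phenotypes while keeping the genotypes fixed gives the distribution of the statistic when marker and trait are unrelated; the P-value is the share of shuffles that match or exceed the observed statistic.
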

3. Results

3.1. Isoflavone concentrations vary significantly in different soybean varieties and accessions

Mature soybean seeds were harvested from field-grown plants, including three replications of 16 G. soja accessions and 17 G. max accessions. Isoflavones, such as daidzein (Dai), genistein (Gen), and glycitein (Gly), were extracted following previous methods [39]. The amounts of isoflavone aglycone, glucose-conjugate, acetylglucose-conjugate, and malonyl-glucose-conjugate were determined individually by HPLC and later combined to give the total seed isoflavone concentration (Total), as shown in Table 1. A wide range of isoflavone concentrations was observed among the 33 accessions. Mean values for individual accessions ranged from 536.6 to 5509.1 μg/g dry seed weight for Total, 200.2–2394.3 μg/g for Dai, 231.9–2587.2 μg/g for Gen, and 104.5–823.5 μg/g for Gly. Overall, means were 3298.5 μg/g for Total, 1276.4 μg/g for Dai, 1545.6 μg/g for Gen, and 476.5 μg/g for Gly. The mean values of G. soja were all higher than those of G. max, suggesting that wild soybeans in general contain higher seed isoflavone levels than cultivated soybeans.

Coefficients of correlation between the different types of isoflavones and total isoflavone levels are given in Table 2. The correlations observed between these traits were all positive and significant, except that between Gen and Gly. As expected, Gen (0.91) and Dai (0.92) levels were the most closely correlated with total isoflavone levels, and these two together constituted 71–92% of total isoflavones. These results are similar to previous reports [51].

Table 2 Coefficients of correlation for 33 soybean accessions between mean Total, Dai, Gen and Gly

       Total     Dai       Gen
Dai    0.91**
Gen    0.92**    0.70**
Gly    0.57**    0.46*     0.40

* p < 0.05. ** p < 0.01.

3.2. High nucleotide diversity was found in both IFS genes

The genomic sequences of the IFS1 and IFS2 genes were cloned and subsequently sequenced from the above 33 accessions. For IFS1, the uppermost 5′ primer was located 846 bp upstream of the start codon, and the last 3′ primer was 117 bp downstream of the stop codon. For IFS2, the uppermost 5′ primer was located 1107 bp upstream of the start codon, and the last 3′ primer was 147 bp downstream of the stop codon. Therefore, in addition to the coding regions, at least a portion of the 5′-UTRs, the 3′-UTRs, and the intron sequences were obtained as well and used in the following association analysis.

Two parameters of nucleotide diversity were calculated for the combined samples and for the G. soja and G. max samples individually (Table 3 and Supplementary Table 1): π, defined as the average number of nucleotide differences per site between two sequences [44], and θ, defined as the number of segregating sites corrected for the sample size [45]. The calculations were based on the SNPs (excluding indels) identified in each of the two genes. For both the IFS1 (Fig. 2a) and IFS2 (Fig. 2b) genes, nucleotide diversity was lowest in the coding regions and highest in the 5′-UTR regions. For the entire IFS1 gene, nucleotide diversity was π = 0.00494 in the combined samples. The π value in the G. max samples (π = 0.00461) was similar to that in the G. soja samples (π = 0.00466). For the entire IFS2 gene, nucleotide diversity was slightly lower (π = 0.00319) in the combined samples. In addition, the G. soja samples had higher nucleotide diversity than the G. max samples (π = 0.00319 vs. π = 0.00258). The θ values showed the same trends as the π values (Supplementary Table 1). In general, the nucleotide diversity at these two genes was higher than the reported genome-wide average nucleotide diversity in 25 cultivated soybeans [52], suggesting that wild soybeans possess more genetic diversity than cultivated ones.

Fig. 2. Nucleotide diversity (π) calculated along the IFS1 (a) and IFS2 (b) genes for G. soja, G. max and the combined samples. π is shown in sliding windows of 100 bp with a step size of 10 bp.

To test the neutrality of the polymorphisms in these two genes, we calculated Tajima's D [46] and Fu and Li's D* and F* [47] (Table 3 and Supplementary Table 1). For the IFS1 gene, Tajima's D was negative and significant in all regions of the combined samples and in the coding region of the G. soja samples. For the IFS2 gene, Tajima's D was negative and significant in all regions of the combined samples and in the coding region of the two separate samples. The results indicated purifying selection as well as the presence of low-frequency alleles at these two genes. This was consistent with an excess of singletons found in both the IFS1 and IFS2 genes. Similar results were obtained for the Fu and Li statistics D* and F* (Supplementary Table 1).
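
For readers who want to see how these statistics relate, the following Python sketch computes π, Watterson's θ and Tajima's D from a small alignment. It is not DnaSP's implementation: it assumes equal-length aligned sequences, compares sites character by character, does not handle gaps or ambiguity codes, and uses the standard variance constants from Tajima's test. The function name and the toy sequences are illustrative assumptions.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return (pi, theta_w, tajima_d) for aligned sequences.

    pi and theta_w are per site. tajima_d is None when no site segregates.
    Gaps and ambiguity codes are not handled in this sketch.
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must share the same non-zero length")

    # S: number of segregating (polymorphic) alignment columns
    seg_sites = sum(1 for column in zip(*seqs) if len(set(column)) > 1)

    # k: mean number of differences over all sequence pairs
    pair_diffs = [
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(seqs, 2)
    ]
    k = sum(pair_diffs) / len(pair_diffs)

    a1 = sum(1 / i for i in range(1, n))  # sum of 1/i for i = 1..n-1
    pi = k / length
    theta_w = seg_sites / a1 / length
    if seg_sites == 0:
        return pi, theta_w, None

    # Variance constants of Tajima's D
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    tajima_d = (k - seg_sites / a1) / sqrt(variance)
    return pi, theta_w, tajima_d


if __name__ == "__main__":
    # Toy alignment in which every variant is a singleton.
    print(diversity_stats(["AAAA", "AAAT", "AATA", "ATAA"]))
```

In the toy alignment every variant is a singleton, so the pairwise estimate falls below the segregating-site estimate and D is negative, which is the pattern described above for an excess of low-frequency alleles.
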

3.3. The extent of LD of the two IFS genes

LD was estimated between pairs of polymorphic sites, excluding singletons, for the G. soja and G. max samples separately and for the combined samples (Fig. 3). Squared allele frequency correlations (r²) [49] were plotted against the base pair distance between sites, and trend lines were fitted to the data. The average decay of LD in G. soja in the two genes analyzed declined to r² = 0.1 between 400 and 750 bp. In contrast, G. max demonstrated a longer extent of LD across these two genes. For the IFS1 gene, LD extended more than 2200 bp (Fig. 3a), while for the IFS2 gene, LD showed no decay across the whole gene (Fig. 3b). The different extent of LD found in the two populations suggested that the G. soja population was more suitable for fine mapping, while the G. max population was more suitable for genome-wide scanning.
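
As a reminder of what r² measures, the Python sketch below computes the squared allele-frequency correlation between two biallelic sites. It assumes each sequence can be treated as one haplotype and that both sites are biallelic; it is an illustration of the statistic rather than the TASSEL routine, and the function name and example alleles are hypothetical.

```python
def r_squared(site_a, site_b):
    """Squared allele-frequency correlation between two biallelic sites.

    site_a and site_b hold one allele per haplotype, in the same order.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("sites must cover the same non-empty set of haplotypes")
    alleles_a = sorted(set(site_a))
    alleles_b = sorted(set(site_b))
    if len(alleles_a) != 2 or len(alleles_b) != 2:
        raise ValueError("both sites must be biallelic and polymorphic")

    n = len(site_a)
    ref_a, ref_b = alleles_a[0], alleles_b[0]
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n

    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


if __name__ == "__main__":
    print(r_squared("AAGG", "CCTT"))  # alleles always co-occur
    print(r_squared("AAGG", "CTCT"))  # alleles are independent
```

Plotting this value for every pair of sites against the distance between them gives the kind of decay plot summarized in Fig. 3.
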

Table 3 DNA polymorphisms and diversity in the IFS1 and IFS2 genes in 33 soybean accessions

Gene   Region       Size (bp)   SNP   Indel (size)    π         Tajima's D
IFS1   Entire       2748        116   11 (1–21 bp)    0.00494   −2.08717*
IFS1   Coding       1560        40    0               0.00234   −2.35108**
IFS1   Non-coding   1185        76    11 (1–21 bp)    0.00856   −1.87208*
IFS2   Entire       2956        104   12 (1–9 bp)     0.00319   −2.42094**
IFS2   Coding       1560        41    4 (1–2 bp)      0.00170   −2.70233***
IFS2   Non-coding   1393        63    8 (1–9 bp)      0.00487   −2.13583*

* P < 0.05. ** P < 0.01. *** P < 0.001.

Taken together, in the combined samples the average decay of LD in the two genes declined to r² = 0.1 between 400 and 750 bp, similar to that in the G. soja population but much more rapid than that in the G. max population, suggesting that sufficient resolution was available for the association analysis between the polymorphisms of the two genes and the seed isoflavone concentrations. Our results also showed that LD was highly variable not only among populations but also between different regions of the soybean genome, as previously reported [53].

Fig. 3. Linkage disequilibrium (LD) plots of r² against physical distance (bp) along the IFS1 (a) and IFS2 (b) genes in G. soja, G. max and the combined samples. Singletons were excluded.

3.4. Nucleotide variations in IFS genes are associated with seed isoflavone levels

Prior to association analysis, population structure was estimated with the Structure software based on 55 unlinked SSR markers (http://bldg6.arsusda.gov/pooley/soy/cregan/soymap1.html), which provided even coverage of the soybean genome. The analysis of population structure identified the highest log likelihood (see the Structure 2.2 documentation at http://pritch.bsd.uchicago.edu/software) with the number of populations set at two (K = 2). Two sub-populations, in agreement with G. max and G. soja, were confirmed as the most likely subdivision of our plant materials. These population structure estimates were used in the TASSEL software [48] to test for associations between IFS1 and IFS2 polymorphisms and mean seed isoflavone levels for Total, Dai, Gen and Gly separately (Table 1). All polymorphisms, including singletons, were considered in the association analysis. Significant sites (P < 0.05) were identified by both general linear model (GLM) analysis (not considering population structure) and logistic regression analysis (considering population structure) in the TASSEL software.

For each of the two genes, several SNPs were closely associated (P < 0.05) with each of the four traits separately (data not shown), indicating that both IFS1 and IFS2 polymorphisms were associated with seed isoflavone concentrations. Only those identified by both analysis methods and significantly associated (P < 0.05) with all four traits are shown in Table 4. For the IFS1 gene, out of the 116 SNPs (Table 3), two in the 5′-UTR (−689 A/G and −150 A/G) and one in the first exon (+298 T/C), which caused a serine to proline change, were significantly associated with all four traits. For the IFS2 gene, out of the 104 SNPs (Table 3), one in the first exon (+402 C/T, synonymous) and one in the second exon (+1247 G/A), which caused a valine to methionine change, were significantly associated with all four traits. All these changes are written in the direction that marks a decrease in isoflavones. These SNPs seemed to be the most probable causative polymorphisms for soybean seed isoflavone concentrations.

Table 4 Significant sites (P < 0.05) associated with all four traits (Total, Dai, Gen and Gly) identified by both general linear model analysis (not considering population structure) and logistic regression analysis (considering population structure)

Gene   Site     Location   Type       Amino acid change
IFS1   −689     5′-UTR     SNP(A/G)
IFS1   −150     5′-UTR     SNP(A/G)
IFS1   +298     Exon1      SNP(T/C)   Ser 100 Pro
IFS2   +402     Exon1      SNP(C/T)   Synonymous
IFS2   +1247    Exon2      SNP(G/A)   Val 371 Met

4. Discussion

In this report, the associations of gene polymorphisms with isoflavone levels in soybean seeds were calculated based on 33 soybean accessions. A set of SNPs with significant association with seed isoflavone levels was discovered in both the IFS1 and IFS2 genes. These SNPs can serve as genetic markers for future soybean breeding.

4.1. High densities of SNPs provided genetic resolution for association analysis

The genomic sequences of IFS1 and IFS2 were obtained from 16 G. soja and 17 G. max accessions. The nucleotide diversity (π) at these two genes (from 0.00166 to 0.00248 in coding regions and from 0.00362 to 0.00856 in non-coding regions, Table 3 and Supplementary Table 1) was much higher than that reported by Zhu et al. [52] in a genome-wide soybean SNP analysis. In their study of 25 diverse soybean genotypes, nucleotide diversity (π) was 0.00053 in coding regions and 0.00111 in non-coding perigenic regions. This dramatic difference could not be completely explained by the selection of populations. Although our data included 16 G. soja accessions, whereas the genotypes used by Zhu et al. [52] were all G. max, the π values in the two genomic regions were still substantially higher even within our G. max population. Another possible explanation is the wide geographic diversity of the accessions in this study (19–49°N, 106–131°E), which covered all six ecological regions of soybean in China [38]. Cronk [54] showed the presence of one SNP per 100 bp in poplar in a small set of populations, but the density increased to one in every 50 bp when geographically diverse species were included in the study. These two explanations were supported by the report of Hyten et al. [53]. In their study, 26 G. soja plant introductions from China, Korea, Taiwan, Russia, and Japan (23–50°N, 106–140°E) were selected to sample all of the geographical areas within the range of G. soja. The nucleotide diversity (π) of this G. soja population was 0.00105 in coding regions and 0.00217 in non-coding regions. Yet these numbers were still lower than our results. Since the results of both Zhu et al. and Hyten et al. were mean values over randomly selected genes, we suspect that nucleotide diversity differs among genes and that the nucleotide diversity of the two genes in our study was higher than the mean level. The two IFS genes in the study were thus suitable for association analysis because of their high level of polymorphism.

Consistent with the high level of nucleotide diversity of the two genes, the extent of LD of these genes in the combined samples was quite short (Fig. 3a and b), less than 1000 bp. The combined samples we examined were a wide geographic sample of germplasm and would have had a long time for genetic associations to decay. To see whether we could detect higher LD in a population of more recent origin, we analyzed LD in the G. soja and G. max samples separately. Although LD still declined rapidly in the G. soja samples, it declined much more slowly in the G. max samples. LD can be expected to be higher among cultivated accessions than among more distantly related genetic resources because of population bottlenecks and selection [55]. The power to detect associations between an SNP and quantitative traits largely depends on having a sufficient density of SNP markers to ensure that some SNPs will be in LD with the molecular variant that contributes to phenotypic variation [56]. Therefore, the population of 33 accessions in our study showed sufficient density of SNPs and genetic resolution for association analysis of polymorphisms.
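
The SNP densities implied by Table 3 can be compared with the per-bp figures quoted from Cronk [54] using a short calculation. The Python sketch below simply divides each region size by its SNP count; the dictionary layout and function name are illustrative assumptions, and the counts are those of the combined sample of both species, which includes singletons.

```python
# Region sizes (bp) and SNP counts copied from Table 3 (combined samples).
TABLE3 = {
    ("IFS1", "Entire"): (2748, 116),
    ("IFS1", "Coding"): (1560, 40),
    ("IFS1", "Non-coding"): (1185, 76),
    ("IFS2", "Entire"): (2956, 104),
    ("IFS2", "Coding"): (1560, 41),
    ("IFS2", "Non-coding"): (1393, 63),
}


def bp_per_snp(size_bp: int, snp_count: int) -> float:
    """Average spacing between SNPs in base pairs."""
    return size_bp / snp_count


if __name__ == "__main__":
    for (gene, region), (size, snps) in TABLE3.items():
        print(f"{gene} {region}: one SNP per {bp_per_snp(size, snps):.1f} bp")
```

For the entire genes this works out to roughly one SNP per 24 bp in IFS1 and one per 28 bp in IFS2, denser than the one per 50 to 100 bp quoted for poplar, although the comparison is only indicative because sample composition and region type differ.
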


4.2. Identifying causative polymorphisms for soybean seed isoflavone concentrations

For each of the two genes we tested, several SNPs were found to be closely associated (P < 0.05) with each of the four isoflavone traits, Total, Dai, Gen and Gly, by either GLM analysis (not considering population structure) or logistic regression analysis (considering population structure). One common concern about population structure is that LD can be caused by admixture of subpopulations, which leads to false-positive results if not correctly controlled in statistical analysis [57]. The complex breeding history of many crops and the limited gene flow in most wild plants have created complex stratification within germplasm, which complicates association studies [58]. To reduce this risk, estimates of population structure must be included in association analysis. However, if the distribution of functional alleles is highly correlated with population structure, statistically controlling for population structure can result in false negatives, particularly for small samples [57]. Hence, to reduce both false-positive and false-negative risk, only those SNPs that were detected by both analysis methods were taken into account in this study. Although some of the associated sites differed among the four traits, there were common sites found for all four traits (Table 4). Three SNPs in the IFS1 gene and two SNPs in the IFS2 gene that were significantly associated (P < 0.05) with all four traits by both analysis methods were the most plausible causative polymorphisms for soybean seed isoflavone concentrations.

4.3. Deriving functional markers for soybean seed isoflavone traits

Functional markers can be derived from the causative polymorphisms identified by association analysis [59].
The non-synonymous SNPs in the first exon of the IFS1 gene, which causes a serine-to-proline change, and in the second exon of the IFS2 gene, which causes a valine-to-methionine change, could be considered candidate causative polymorphisms from which functional markers could be derived. Studies of gene expression and enzyme activity would further elucidate the allelic effects of these polymorphisms. However, since these non-synonymous SNPs were present only in a single accession (C_HC01), the association results need to be interpreted with caution until they are validated in more accessions. At present, a population consisting of more than 200 soybean accessions is under construction in our laboratory for association analysis and additional candidate gene association research. Ultimately, all polymorphisms identified in this study could be evaluated in the larger population to enhance both the genetic resolution and the power of the association analysis before they are applied in marker-assisted selection breeding for soybean isoflavones.

Acknowledgements

We thank Drs. Zhiwu Zhang and Ed Buckler for their help with the TASSEL software, Dr. Jianqun Chen for his help with data analysis, and Dr. Jianming Yu for his advice on association mapping. This work was supported in part by National 973 Projects (No. 2004CB117206, No. 2002CB111304), National 863 Projects (No. 2006AA10Z1C1, No. 2006AA10A111), and the National Natural Science Foundation of China (No. 30490250, No. 30771362).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.plantsci.2008.05.020.


References

[1] M. Messina, Soy, soy phytoestrogens (isoflavones), and breast cancer, Am. J. Clin. Nutr. 70 (1999) 574–575.
[2] P. Nestel, Isoflavones: their effects on cardiovascular risk and functions, Curr. Opin. Lipidol. 14 (2003) 3–8.
[3] P.S. Williamson-Hughes, B.D. Flickinger, M.J. Messina, M.W. Empie, Isoflavone supplements containing predominantly genistein reduce hot flash symptoms: a critical review of published studies, Menopause 13 (2006) 831–839.
[4] C. Duffy, K. Perez, A. Partridge, Implications of phytoestrogen intake for breast cancer, CA Cancer J. Clin. 57 (2007) 260–277.
[5] M.A. Goetzl, P.J. Van Veldhuizen, J.B. Thrasher, Effects of soy phytoestrogens on the prostate, Prostate Cancer Prostatic Dis. 10 (2007) 216–223.
[6] W. Wuttke, H. Jarry, D. Seidlova-Wuttke, Isoflavones-safe food additives or dangerous drugs, Ageing Res. Rev. 6 (2007) 150–188.
[7] D. Turck, Soy protein for infant feeding: what do we know, Curr. Opin. Clin. Nutr. Metab. Care 10 (2007) 360–365.
[8] I.L. Nielsen, G. Williamson, Review of the factors affecting bioavailability of soy isoflavones in humans, Nutr. Cancer 57 (2007) 1–10.
[9] O. Yu, B. McGonigle, Metabolic engineering of isoflavone biosynthesis, Adv. Agron. 86 (2005) 147–190.
[10] H.J. Wang, P.A. Murphy, Mass balance study of isoflavones during soybean processing, J. Agric. Food Chem. 44 (1996) 2377–2383.
[11] A.P. Griffith, M.W. Collison, Improved methods for the extraction and analysis of isoflavones from soy-containing foods and nutritional supplements by reversed-phase high-performance liquid chromatography and liquid chromatography–mass spectrometry, J. Chromatogr. A 913 (2001) 397–413.
[12] A. Eldridge, W. Kwolek, Soybean isoflavones: effect of environment and variety on composition, J. Agric. Food Chem. 31 (1983) 394–396.
[13] H.J. Wang, P.A. Murphy, Isoflavone composition of American and Japanese soybeans in Iowa: effects of variety, crop year, and location, J. Agric. Food Chem. 42 (1994) 1674–1677.
[14] J.A. Hoeck, W.R. Fehr, P.A. Murphy, G.A. Welke, Influence of genotype and environment on isoflavone contents of soybean, Crop Sci. 40 (2000) 48–51.
[15] J.S. Choi, T.W. Kwon, J.S. Kim, Isoflavone contents in some varieties of soybean, Foods Biotechnol. 5 (1996) 167–169.
[16] M.A. Kassem, K. Meksem, M.J. Iqbal, V.N. Njiti, W.J. Banz, T.A. Winters, A. Wood, D.A. Lightfoot, Definition of soybean genomic regions that control seed phytoestrogen amounts, J. Biomed. Biotechnol. 2004 (2004) 52–60.
[17] V.S. Primomo, V. Poysa, G.R. Ablett, C.J. Jackson, M. Gijzen, I. Rajcan, Mapping QTL for individual and total isoflavone content in soybean seeds, Crop Sci. 45 (2005) 2454–2464.
[18] C. Lopez, B. Piégu, R. Cooke, M. Delseny, J. Tohme, V. Verdier, Using cDNA and genomic sequences as tools to develop SNP strategies in cassava (Manihot esculenta Crantz), Theor. Appl. Genet. 110 (2005) 425–431.
[19] T. Akashi, T. Aoki, S. Ayabe, Cloning and functional expression of a cytochrome P450 cDNA encoding 2-hydroxyisoflavanone synthase involved in biosynthesis of the isoflavonoid skeleton in licorice, Plant Physiol. 121 (1999) 821–828.
[20] C.L. Steele, M. Gijzen, D. Qutob, R.A. Dixon, Molecular characterization of the enzyme catalyzing the aryl migration reaction of isoflavonoid biosynthesis in soybean, Arch. Biochem. Biophys. 367 (1999) 146–150.
[21] W. Jung, O. Yu, S.M. Lau, D.P. O'Keefe, J. Odell, G. Fader, B. McGonigle, Identification and expression of isoflavone synthase, the key enzyme for biosynthesis of isoflavones in legumes, Nat. Biotechnol. 18 (2000) 208–212.
[22] S. Subramanian, G. Stacey, O. Yu, Endogenous isoflavones are essential for the establishment of symbiosis between soybean and Bradyrhizobium japonicum, Plant J. 48 (2006) 261–273.
[23] D. Edwards, J. Forster, D. Chagné, J. Batley, What are SNPs, in: N. Oraguzie, E. Rikkerink, S. Gardiner, H.D. Silva (Eds.), Association Mapping in Plants, Springer, 2007, pp. 41–52.
[24] S.A. Flint-Garcia, J.M. Thornsberry, B. Edwards, Structure of linkage disequilibrium in plants, Annu. Rev. Plant Biol. 54 (2003) 357–374.
[25] B. Kerem, J.M. Rommens, J.A. Buchanan, D. Markiewicz, T.K. Cox, A. Chakravarti, M. Buchwald, L.C. Tsui, Identification of the cystic fibrosis gene: genetic analysis, Science 245 (1989) 1073–1080.
[26] E.H. Corder, A.M. Saunders, N.J. Risch, W.J. Strittmatter, D.E. Schmechel, P.C. Gaskell, J.B. Rimmler, P.A. Locke, P.M. Conneally, K.E. Schmader, Protective effect of apolipoprotein E type 2 allele for late onset Alzheimer disease, Nat. Genet. 7 (1994) 180–184.
[27] S.J. Chanock, T. Manolio, M. Boehnke, E. Boerwinkle, D.J. Hunter, G. Thomas, J.N. Hirschhorn, G. Abecasis, D. Altshuler, J.E. Bailey-Wilson, Replicating genotype–phenotype associations, Nature 447 (2007) 655–660.
[28] E.S. Buckler, J.M. Thornsberry, Plant molecular diversity and applications to genomics, Curr. Opin. Plant Biol. 5 (2002) 107–111.
[29] W.C. Knowler, R.C. Williams, D.J. Pettitt, A.G. Steinberg, Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture, Am. J. Hum. Genet. 43 (1988) 520–526.
[30] J.K. Pritchard, M. Stephens, P. Donnelly, Inference of population structure using multilocus genotype data, Genetics 155 (2000) 945–959.
[31] J.M. Thornsberry, M.M. Goodman, J. Doebley, S. Kresovich, D. Nielsen, E.S. Buckler IV, Dwarf8 polymorphisms associate with variation in flowering time, Nat. Genet. 28 (2001) 286–289.


[32] S.R. Whitt, L.M. Wilson, M.I. Tenaillon, B.S. Gaut, E.S. Buckler, Genetic diversity and selection in the maize starch pathway, Proc. Natl. Acad. Sci. U.S.A. 99 (2002) 12959–12962.
[33] S.J. Szalma, E.S. Buckler, M.E. Snook, M.D. McMullen, Association analysis of candidate genes for maysin and chlorogenic acid accumulation in maize silks, Theor. Appl. Genet. 110 (2005) 1324–1333.
[34] J.R. Andersen, I. Zein, G. Wenzel, B. Krützfeldt, J. Eder, M. Ouzunova, T. Lübberstedt, High levels of linkage disequilibrium and associations with forage quality at a Phenylalanine Ammonia-Lyase locus in European maize (Zea mays L.) inbreds, Theor. Appl. Genet. 114 (2007) 307–319.
[35] M.J. Aranzana, S. Kim, K. Zhao, E. Bakker, M. Horton, K. Jakob, C. Lister, J. Molitor, C. Shindo, C. Tang, et al., Genome-wide association mapping in Arabidopsis identifies previously known flowering time and pathogen resistance genes, PLoS Genet. 1 (2005) e60.
[36] K. Zhao, M.J. Aranzana, S. Kim, C. Lister, C. Shindo, C. Tang, C. Toomajian, H. Zheng, C. Dean, P. Marjoram, et al., An Arabidopsis example of association mapping in structured samples, PLoS Genet. 3 (2007) e4.
[37] C. Ravel, S. Praud, A. Canaguier, P. Dufour, S. Giancola, F. Balfourier, B. Chalhoub, D. Brunel, L. Linossier, M. Dardevet, DNA sequence polymorphisms and their application to bread wheat quality, Euphytica 158 (2007) 331–336.
[38] Y. Wang, J. Gai, Study on the ecological regions of soybean in China II. Ecological environment and representative varieties, Ying Yong Sheng Tai Xue Bao 13 (2002) 71–75.
[39] J. Bennett, O. Yu, L. Heatherly, H. Krishnan, Accumulation of genistein and daidzein, soybean isoflavones implicated in promoting human health, is significantly elevated by irrigation, J. Agric. Food Chem. 52 (2004) 7574–7579.
[40] P. Keim, T.C. Olson, R.C. Shoemaker, A rapid protocol for isolating soybean DNA, Soybean Genet. Newsl. 15 (1988) 150–152.
[41] S. Subramanian, X. Hu, G. Lu, J.T. Odell, O. Yu, The promoters of two isoflavone synthase genes respond differentially to nodulation and defense signals in transgenic soybean roots, Plant Mol. Biol. 54 (2004) 623–639.
[42] J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, D.G. Higgins, The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools, Nucleic Acids Res. 25 (1997) 4876–4882.
[43] J. Rozas, J.C. Sanchez-DelBarrio, X. Messeguer, R. Rozas, DnaSP, DNA polymorphism analyses by the coalescent and other methods, Bioinformatics 19 (2003) 2496–2497.

[44] M. Nei, Molecular Evolutionary Genetics, Columbia University Press, New York, 1987.
[45] G. Watterson, On the number of segregating sites in genetical models without recombination, Theor. Popul. Biol. 7 (1975) 256–276.
[46] F. Tajima, Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics 123 (1989) 585–595.
[47] Y. Fu, W. Li, Statistical tests of neutrality of mutations, Genetics 133 (1993) 693–709.
[48] P.J. Bradbury, Z. Zhang, D.E. Kroon, T.M. Casstevens, Y. Ramdoss, E.S. Buckler, TASSEL: software for association mapping of complex traits in diverse samples, Bioinformatics 23 (2007) 2633–2635.
[49] B. Weir, Genetic Data Analysis, Chinese ed., Publishing House of Agricultural Science, Beijing, 1996.
[50] D. Falush, M. Stephens, J.K. Pritchard, Inference of population structure using multilocus genotype data: dominant markers and null alleles, Mol. Ecol. Notes 7 (2007) 574–578.
[51] H.J. Wang, P.A. Murphy, Isoflavone content in commercial soybean foods, J. Agric. Food Chem. 42 (1994) 1666–1673.
[52] Y. Zhu, Q. Song, D. Hyten, C.V. Tassell, L. Matukumalli, D. Grimm, S. Hyatt, E. Fickus, N. Young, P. Cregan, Single-nucleotide polymorphisms in soybean, Genetics 163 (2003) 1123–1134.
[53] D. Hyten, I. Choi, Q. Song, R. Shoemaker, R. Nelson, J. Costa, J. Specht, P. Cregan, Highly variable patterns of linkage disequilibrium in multiple soybean populations, Genetics 175 (2007) 1937–1944.
[54] Q. Cronk, Plant eco-devo: the potential of poplar as a model organism, New Phytol. 166 (2005) 39–48.
[55] M.I. Tenaillon, M.C. Sawkins, A.D. Long, R.L. Gaut, J.F. Doebley, B.S. Gaut, Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.), Proc. Natl. Acad. Sci. U.S.A. 98 (2001) 9161–9166.
[56] A. Long, R. Lyman, C. Langley, T. Mackay, Two sites in the Delta gene region contribute to naturally occurring variation in bristle number in Drosophila melanogaster, Genetics 149 (1998) 999–1017.
[57] J. Yu, E. Buckler, Genetic association mapping and genome organization of maize, Curr. Opin. Biotechnol. 17 (2006) 155–160.
[58] T. Sharbel, B. Haubold, T. Mitchell-Olds, Genetic isolation by distance in Arabidopsis thaliana: bio-geography and postglacial colonization of Europe, Mol. Ecol. 9 (2000) 2109–2118.
[59] J. Andersen, T. Lübberstedt, Functional markers in plants, Trends Plant Sci. 8 (2003) 554–560.