A versatile tool for retrieving molecular sequences


COMPUTER CORNER

TIBS 19 - FEBRUARY 1994

Rescuing contaminated PCR primers

Michael Lush ([email protected]) suspected that he had accidentally contaminated a PCR primer stock solution with a DNA fragment approximately 6 kb in length. Since it would be very expensive to replace the entire primer solution, he asked for advice on how to eliminate the contaminating DNA fragment.

Provided the proper equipment and experienced personnel are available, the most reliable method would be to use strong-anion-exchange chromatography or reverse-phase chromatography. One person suggested purifying the primer stock on a polyacrylamide gel and extracting the banded oligonucleotide. However, all of these would be time-consuming.

One faster and less expensive alternative for separating the primers from the double-stranded (ds) DNA fragments would be to spin the mixture through a Millipore Ultrafree™-MC filter unit containing a membrane with a 30 kDa cut-off, or a Centricon 100 column (Amicon Inc.), which has a 100 kDa cut-off. This would trap the larger DNA molecules on the membrane, while allowing the smaller primers to pass through into the collection reservoir. Using this purification scheme, however, might not remove all the contaminating DNA, since it could have been partially degraded, and smaller fragments might pass through the membrane and end up in the filtrate.

Another suggestion was to place the solution in a clear microcentrifuge tube and expose it to 254 nm UV light for a minimum of five minutes. Presumably, this would crosslink the larger DNA fragments and prevent annealing of the primer, or prohibit the polymerase from passing the crosslinked region during extension. For example, ten minutes in a Stratalinker UV crosslinker from Stratagene should be sufficient to eliminate visible bands of amplified template DNA during PCR².

Alternatively, the primers could be treated by adding at least 0.5 units of DNase to the concentrated stock solution or to the PCR mix before adding the template and polymerase. Incubation for ten minutes at room temperature, followed by boiling for ten minutes, would probably eliminate the unwanted dsDNA and also destroy the DNase activity. Although it might not be necessary for every application, some people use DNase religiously to treat PCR reactions before template addition, regardless of suspected contamination. This might not always be a good idea, and care must be taken to remove all remaining DNase, as even the slightest residual activity could be detrimental to the amplification of low-abundance DNA.


A problem that DNase treatment has in common with spin filtration is that partial digestion could allow small fragments of single-stranded DNA left in the solution to act as primers, leading to possible false positive bands or smears of multiple PCR products seen after agarose gel electrophoresis of the sample. Using a combination of DNase, UV light and filtration might be sufficient to allow the use of the primers for routine PCR applications. Although loss of the primer stock would be costly, most netters felt that the risk of obtaining extraneous PCR products from the suspect primer stock would be too great and that, once contaminated, the primers would not be sufficiently clean to be trusted.
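As a rough check on the spin-filtration suggestion, the masses involved can be estimated. The sketch below is illustrative only: it assumes rule-of-thumb averages of about 330 Da per nucleotide for single-stranded DNA and about 650 Da per base pair for double-stranded DNA, and a hypothetical 20-nucleotide primer, none of which come from the column itself. Real membranes do not separate nucleic acids strictly by nominal mass, so the comparison is only indicative.

```python
# Rule-of-thumb average masses (assumptions, not values from the column).
DA_PER_NT_SS = 330.0   # daltons per nucleotide, single-stranded DNA
DA_PER_BP_DS = 650.0   # daltons per base pair, double-stranded DNA


def ssdna_mass_kda(length_nt: int) -> float:
    """Approximate mass of a single-stranded oligonucleotide in kDa."""
    return length_nt * DA_PER_NT_SS / 1000.0


def dsdna_mass_kda(length_bp: int) -> float:
    """Approximate mass of a double-stranded fragment in kDa."""
    return length_bp * DA_PER_BP_DS / 1000.0


def below_cutoff(mass_kda: float, cutoff_kda: float) -> bool:
    """Naive size test: True if the nominal mass is under the membrane cut-off."""
    return mass_kda < cutoff_kda


primer_kda = ssdna_mass_kda(20)          # hypothetical 20-mer primer
contaminant_kda = dsdna_mass_kda(6000)   # the ~6 kb contaminating fragment

for cutoff_kda in (30.0, 100.0):         # the two cut-offs mentioned in the text
    print(
        f"{cutoff_kda:.0f} kDa cut-off: "
        f"primer passes={below_cutoff(primer_kda, cutoff_kda)}, "
        f"6 kb fragment passes={below_cutoff(contaminant_kda, cutoff_kda)}"
    )
```

On these assumptions the intact fragment is several hundred times heavier than the primer, which is why filtration separates them, while a degraded fragment of a few dozen base pairs would fall under the 30 kDa cut-off, in line with the caveat about partial degradation.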

References

1 Labarca, C. and Paigen, K. (1980) Anal. Biochem. 102, 344-352
2 Sarkar, G. and Sommer, S. S. (1990) Nature 343, 27

PAUL N. HENGEN
National Cancer Institute, Frederick Cancer Research and Development Center, Frederick, MD 21702-1201, USA. Email: [email protected]

[Any statements made by the author are not meant to advocate the use of a particular commercial product or endorse any company. All opinions are those of the author and do not reflect the opinion of the National Cancer Institute or the National Institutes of Health.]


Entrez (release 8.0, December 1993) produced by the US National Center for Biotechnology Information. $76.00 (USA)/$84.00-108.00 (rest of world): CD-ROM updated bimonthly; network access free

'How can I find all the Ras-related sequences stored in the databanks?'; 'I want to find a reference, but I've forgotten the author's name, and it doesn't come up on Medline with any of the keywords I've tried.'; 'I want to get sequences of all the olfactory receptors that were cloned in that Nature paper. How can I retrieve them without typing in three dozen accession numbers?' Until recently, the answer to all these questions was: 'With some difficulty.'


Figure 1. The organization of the Entrez databases (taken from the user manual). [Diagram not reproduced. Labels: MEDLINE records; nucleotide sequences; protein sequences; literature citations in sequence databases; coding region features; term frequency statistics; nucleotide sequence similarity; amino acid sequence similarity.]

© 1994, Elsevier Science Ltd 0968-0004/94/$07.00


Box 1. An example of Entrez in use

Suppose we want to find all the papers that report the identification of large families of olfactory receptor genes. We can do a standard keyword search on 'olfact...' and 'recept...'. Unfortunately, Entrez finds 111 records that contain these terms. Well, we could look through them all... but then we remember that in one of the papers we want, the olfactory receptors were cloned from catfish. There are probably not too many 'catfish' references. So we add 'catfish' to the query (a). This time, just one reference comes back (b). Choosing neighbours to this one summons a list of twelve similar references (c), several of which are reports of olfactory receptor gene families. Selecting these references, we find that they are linked to a total of 70 associated protein sequences, which can be grabbed en masse, if necessary, using a couple of mouse clicks.
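The query refinement in this box amounts to intersecting the sets of records that match each term, with a trailing '...' standing for truncation. A minimal sketch of that idea follows; the record IDs and titles are invented for illustration, and the functions are hypothetical helpers, not part of the Entrez software.

```python
import re
from typing import Dict, Iterable, List, Optional, Set

# Toy records: hypothetical IDs and titles, for illustration only.
RECORDS: Dict[int, str] = {
    1: "Olfactory receptor gene family in the rat",
    2: "Odorant receptors expressed in catfish olfactory neurons",
    3: "Catfish taste bud physiology",
    4: "Acetylcholine receptor structure",
}


def tokenize(text: str) -> List[str]:
    """Split text into lower-case alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(records: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each word to the set of record IDs that contain it."""
    index: Dict[str, Set[int]] = {}
    for record_id, text in records.items():
        for word in tokenize(text):
            index.setdefault(word, set()).add(record_id)
    return index


def match_term(index: Dict[str, Set[int]], term: str) -> Set[int]:
    """Return records matching one term; a trailing '...' means prefix match."""
    term = term.lower()
    if term.endswith("..."):
        prefix = term[:-3]
        hits: Set[int] = set()
        for word, ids in index.items():
            if word.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get(term, ()))


def and_query(index: Dict[str, Set[int]], terms: Iterable[str]) -> Set[int]:
    """Boolean AND: keep only records that match every term."""
    result: Optional[Set[int]] = None
    for term in terms:
        hits = match_term(index, term)
        result = hits if result is None else result & hits
    return result if result is not None else set()


INDEX = build_index(RECORDS)
broad = and_query(INDEX, ["olfact...", "recept..."])
narrow = and_query(INDEX, ["olfact...", "recept...", "catfish"])
print(sorted(broad), sorted(narrow))
```

Adding a term can only shrink the result set, which is why the extra 'catfish' term cuts the list down so sharply.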

[Box 1 screenshots not reproduced: (a) the query window with the terms 'olfact...', 'recept...' and 'catfish' selected; (b) the single MEDLINE record retrieved, 'The family of genes encoding odorant receptors in the channel catfish', Cell 72:657-66 (1993); (c) the list of neighbouring references.]

Powerful biological information retrieval tools are widely available, but the number of arcane systems needed to accomplish such simple tasks can be daunting. Additionally, for those working with limited budgets, the cost of commercial on-line search services can soon become prohibitive. Entrez is a new system produced by the US National Center for Biotechnology Information (NCBI), which satisfies most of the database querying needs of the average molecular biologist in a single low-cost package. It gives access to a nonredundant set of DNA and protein sequence databases, and also to a relevant subset of Medline life-science bibliographic references. Two innovations make Entrez distinctive. First, links between the databases allow users to move easily between cognate protein and nucleic acid sequences, and in either direction between a sequence and its parent reference. Second, the entries in both the sequence and the bibliographic databases are not only indexed on keywords, but are also grouped according to similarity, a concept referred to by the NCBI as 'neighbouring' (Fig. 1).

Network or CD-ROM?

Entrez can be used in two ways. For use on a single machine or for sharing over a local network, NCBI distributes the database files on CD-ROM at bimonthly intervals, on a non-profit basis. The Entrez databases currently occupy two CDs, but as they inevitably overflow onto more disks, the advantages of the Internet version of the software will be ever more apparent. For increased speed when using Entrez on CD, either just the indices or the entire contents of the CDs can be transferred to a local hard disk. The indices take up just 1% of the space of the entire database, so transferring them is highly recommended, but if several users access Entrez simultaneously, response time slows down unacceptably unless the entire database is transferred to a hard disk.

Alternatively, for sites with a direct connection to the Internet, a network version of the Entrez software is available, which accesses the latest release of the databases held on a central server in the USA. The service operates in a similar way to Gopher, and anyone who has access to Gopher should be able to run the network version of Entrez. There is no charge for registration to use the network service and, if an Internet connection is already in place, this can be a very cheap way to access the databases. Even heavy use is unlikely to overload an Internet connection, since the amount of data transferred during typical use is very small, thanks to the specificity of searching made possible by neighbouring.

Sequence neighbouring

The neighbouring of the sequence databases is done by self-comparison using the BLAST algorithm¹, which identifies many, but not all, biologically significant sequence similarities. The BLAST algorithm is conveniently amenable to statistical analysis, and NCBI include as neighbours only those matches that would be expected to occur by chance less than once in the entire database. Sequence neighbouring allows families of sequences to be retrieved very easily, avoiding the problems of changing nomenclature which can make multigene families difficult to search for. If an initial sequence is found using keywords or an accession number, many related sequences can be found almost instantly by looking for neighbours. Retrieval of neighbours is extremely fast, since no fresh sequence comparisons need to be done. The most efficient way to retrieve a family of coding nucleotide sequences is to search for neighbours in the protein database, where similarity is more evident, and then to link from that to the corresponding nucleotide sequences. Neighbouring at the nucleotide sequence level is most useful for identifying overlapping contiguous regions (contigs) between published sequences.

As well as increasing convenience of retrieval, BLAST neighbouring may also pick out previously unnoticed sequence similarities. Since the comparison process is automated and largely unsupervised, each new release of the Entrez databases is potentially loaded with novel sequence relationships waiting to be discovered.
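The reason neighbour retrieval is fast is that the comparisons are done once, in advance, and stored. The sketch below illustrates that idea under stated assumptions: the hit list, sequence identifiers and expectation values are invented, and the rule 'expected to occur by chance less than once in the entire database' is modelled simply as an expectation value below 1.0. It is not NCBI's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (sequence A, sequence B, expected number of chance matches in the whole database)
Hit = Tuple[str, str, float]


def build_neighbours(hits: Iterable[Hit], max_expect: float = 1.0) -> Dict[str, Set[str]]:
    """Precompute a symmetric neighbour table from pairwise comparison results."""
    table: Dict[str, Set[str]] = defaultdict(set)
    for a, b, expect in hits:
        if a != b and expect < max_expect:
            table[a].add(b)
            table[b].add(a)
    return dict(table)


def nucleotide_family(
    protein_id: str,
    neighbours: Dict[str, Set[str]],
    protein_to_nucleotide: Dict[str, str],
) -> List[str]:
    """Find protein neighbours, then follow the links to nucleotide records."""
    family = {protein_id} | neighbours.get(protein_id, set())
    return sorted(
        protein_to_nucleotide[p] for p in family if p in protein_to_nucleotide
    )


# Hypothetical identifiers and scores, for illustration only.
HITS: List[Hit] = [
    ("P_A", "P_B", 1e-40),
    ("P_A", "P_C", 1e-12),
    ("P_A", "P_D", 3.5),    # would be expected by chance: not a neighbour
    ("P_B", "P_C", 1e-20),
]
LINKS: Dict[str, str] = {"P_A": "N_A", "P_B": "N_B", "P_C": "N_C", "P_D": "N_D"}

NEIGHBOURS = build_neighbours(HITS)
family = nucleotide_family("P_A", NEIGHBOURS, LINKS)
print(family)
```

At query time the lookup is a dictionary access followed by a link traversal, with no sequence comparison involved.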

A field guide to the ECM

Guidebook to the Extracellular Matrix and Adhesion Proteins edited by Thomas Kreis and Ronald Vale, Oxford University Press, 1993. £18.50 (xi + 176 pages) ISBN 0 19 859933 1

Reference neighbouring

The neighbouring of the reference database is central to the operation of Entrez. References are grouped according to the number of terms they share, but different shared terms are given different weights. These weights are assigned according to the results of a preliminary self-comparison of the database², using roughly estimated weights based on term frequency. A term is then given a high weighting if references that share that term frequently share many other terms. Hence 'Staphylococcus' is a high-weighted term, because when it occurs in two references they are often closely related, whereas 'phenomenon' is weighted low since, as a shared term, it is a poor predictor of reference relatedness.

The value of reference neighbouring is that it allows roughly specified searches to achieve a balance between comprehensiveness and selectivity. Searching for keywords, one is faced with the dilemma of either missing relevant references by specifying too many terms, or being inundated with matches to a loosely specified query. A good solution, using Entrez, is to specify a narrow query, and then to view the neighbours of the few matching references found (Box 1). Alternatively, if a broad query is specified, it can be narrowed by selecting the first few interesting references and then looking for neighbours. Neighbouring can be repeated as many times as necessary, making it possible to meander freely and easily around a subject.
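The effect of weighting shared terms can be illustrated with a small sketch. Note the substitution: the weights here are a plain inverse-document-frequency measure, log(N/df), used as a simple stand-in; this is not the co-occurrence-based weighting described above, and the reference IDs and term sets are invented for illustration.

```python
import math
from typing import Dict, List, Set, Tuple

# Hypothetical references, each reduced to a set of index terms.
REFS: Dict[str, Set[str]] = {
    "ref1": {"staphylococcus", "toxin", "gene", "phenomenon"},
    "ref2": {"staphylococcus", "toxin", "plasmid"},
    "ref3": {"phenomenon", "gene", "yeast"},
    "ref4": {"phenomenon", "memory", "rat"},
}


def term_weights(refs: Dict[str, Set[str]]) -> Dict[str, float]:
    """Stand-in weights: rarer terms score higher (inverse document frequency)."""
    n = len(refs)
    df: Dict[str, int] = {}
    for terms in refs.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}


def similarity(a: Set[str], b: Set[str], weights: Dict[str, float]) -> float:
    """Sum the weights of the terms two references share."""
    return sum(weights[term] for term in a & b)


def neighbours(
    ref_id: str, refs: Dict[str, Set[str]], weights: Dict[str, float], top: int = 3
) -> List[Tuple[str, float]]:
    """Rank the other references by weighted shared-term score, best first."""
    scored = [
        (other, similarity(refs[ref_id], terms, weights))
        for other, terms in refs.items()
        if other != ref_id
    ]
    scored = [pair for pair in scored if pair[1] > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]


WEIGHTS = term_weights(REFS)
ranked = neighbours("ref1", REFS, WEIGHTS)
print([ref for ref, _ in ranked])
```

In this toy set, ref2 and ref3 each share two terms with ref1, so a raw count cannot separate them; the weighting ranks ref2 first because the widely used term 'phenomenon' contributes little.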

Limitations

Entrez is not designed as a tool for sequence analysis and does not provide facilities for screening novel sequences against its databases, or for retrieving sequences containing motifs specified by the user. The various public email servers that allow FASTA or BLAST searches fill that niche well and were described in a previous Computer Corner article (TIBS 18, 267-268). Alternatively, it is possible to extract the sequences from the Entrez databases onto a hard disk in a format that can then be searched by sequence comparison programs such as BLAST and FASTA.

Since the Entrez databases are updated only every two months, Entrez is not the best way to access the very latest sequences (the Gopher and email servers maintained by the databanks should be more up to date). However, while the CD distribution looks set to continue to be bimonthly, the central databases should receive updates more frequently in future.

Finally, Entrez is not intended to compete in the comprehensiveness of its reference database with the full version of Medline, or with commercial reference databases such as the Science Citation Index (which offers the useful bonus of citation-based searching). The reference database offered by Entrez currently includes all publications that report novel sequence data, and other references judged to be relevant by subject heading. Although it will remain restricted to a molecular biology subset of Medline, the reference coverage offered by Entrez is expected to expand in future releases. For typical use in a molecular biology laboratory, it is already very useful.

Concluding remarks

For sites with a connection to the Internet, Network Entrez works like a dream. Sequences and references float effortlessly to your desktop as never before, and the only cost is the overhead for use of the Internet. For the Internet-less, Entrez is still a bargain: it can (for the moment) run quite happily using a single CD-ROM drive; it integrates sequences and references in a time-saving way and, with neighbouring, it introduces a novel and useful search-refinement mechanism.

Availability

NCBI-produced versions of Entrez are available for the Apple Macintosh, Windows PCs, VAX/VMS and many Unix systems. Independent groups have also written limited versions for other systems, including text-based implementations for Unix and for VAX/VMS, which will be useful for those whose only connection to the Internet is through a text-based link. The Entrez software is not copyright material, and can be freely distributed. The latest version of the software, together with documentation, can be obtained via the Internet by anonymous FTP to ncbi.nlm.nih.gov. For information about Entrez, including how to subscribe to the CD-ROM distribution, send email to [email protected], or contact National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bldg 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.

References

1 Altschul, S. F. et al. (1990) J. Mol. Biol. 215, 403-410
2 Wilbur, W. J. (1992) in Proceedings of the 55th American Society of Information Science Annual Meeting (Shaw, D., ed.), pp. 216-220, Pittsburgh

MATTHEW COCKERILL
ICRF Clare Hall Laboratories, South Mimms, Potters Bar, UK EN6 3LD.

For those of us who teach about the interactions of cells with each other and the extracellular matrix (ECM), this small guidebook is indispensable. Investigators in the field will probably also benefit from having this guidebook as a ready reference. In just 176 pages, the book provides a broad introduction to the diverse macromolecules of the ECM and to the adhesion proteins in the ECM and on the cell surface that specify the molecular associations in this system. Often a neglected aspect of biochemistry and cell biology in the past, the ECM and adhesion molecules now get top billing, since it has been appreciated how important they are in the development and maintenance of complex organisms. The main problem addressed by this book is how to keep track of the