[30]
MACROMOLECULARCIF
571
ments for the types of data to be deposited and proper ways of checking the validity and consistency of the data will be developed in cooperation with the experimental community for each category of structure data archived by the PDB.
[30] M a c r o m o l e c u l a r
C r y s t a l l o g r a p h i c I n f o r m a t i o n File
By PHILIP E.
B O U R N E , H E L E N M. B E R M A N , B R I A N M C M A H O N , KEITH D. W A T E N P A U G H , J O H N D. W E S T B R O O K , and PAULA M. D. FITZGERALD
Introduction The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and nuclear magnetic resonance (NMR) studies. This representation has served the community well since its inception in the 1970s1 and a large amount of software that uses this representation has been written. However, it is widely recognized that the current PDB format cannot express adequately the large amount of data (content) associated with a single macromolecular structure and the experiment from which it was derived in a way (context) that is consistent and permits direct comparison with other structure entries. Structure comparison, for such purposes as better understanding biological function, assisting in the solution of new structures, drug design, and structure prediction, becomes increasingly valuable as the number of macromolecular structures continues to grow at a near exponential rate. It could be argued that the description of the required content of a structure submission could be met by additional PDB record types. However, this format does not permit the maintenance of the automated level of consistency, accuracy, and reproducibility required for such a large body of data. A variety of approaches for improved scientific data representation is being explored. 2 The approach described here, which has been developed under the auspices of the International Union of Crystallography (IUCr), is to extend the Crystallographic Information File (CIF) data representation used for describing small-molecule structures and associated diffraction experiments. This extension is referred to as the macromolecular Crystallo1 F. C. Bernstein, T. F. Koetzle, G. J. B. Williams, E. F. Meyer, Jr., M. D. Brice, J. R. Rogers, O. Kennard, T. Shimanouchi, and M. Tasumi, J. Mol. Biol. 112, 535 (1977). 2 I E E E Metadata: http://www.llnLgov/liv_comp/metadata/(1996).
METHODS IN ENZYMOLOGY,VOL. 277
Copyright © 1997by AcademicPress All rights of reproduction in any form reserved. 0076-6879/97$25.00
572
PRESENTATION AND ANALYSIS
[301
graphic Information File (mmCIF) and is the subject of this chapter. We briefly cover the history of mmCIF, similarities to and differences from the PDB format, contents of the m m C I F dictionary, and how to represent structures using mmCIF. The m m C I F home page (mmCIF 3) contains a historic description of the development of the dictionary, current versions of the dictionary in text and H y p e r T e x t markup language ( H T M L ) formats, software tools, archives of the mmCIF discussion list, and a detailed online tutorial. 4 Background The CIF was developed to describe small-molecule organic structures and the crystallographic experiment by the IUCr Working Party on Crystallographic Information at the behest of the IUCr Commission on Crystallographic Data and the IUCr Commission on Journals. The result of this effort was a core dictionary of data items 4a sufficient for archiving the smallmolecule crystallographic experiment and its results, s,6 This core dictionary was adopted by the IUCr at its 1990 Congress in Bordeaux. The format of the small-molecule CIF dictionary and the data files based on that dictionary conform to a restricted version of the Self-Defining Text Archive and Retrieval (STAR) representation developed by Hall and others. 7,8 S T A R permits a data organization that may be understood by analogy with a spoken language (Fig. i). S T A R defines a set of encoding rules similar to saying the English language is composed of 26 letters. A dictionary definition language ( D D L ) is defined that uses those rules and that provides a framework from which to define a dictionary of the terms needed by the discipline. The D D L is a computer-readable way of declaring that words are made up of arbitrary groups of letters and that words are organized into sentences and paragraphs. The D D L provides a convention for naming and defining data items within the dictionary, declaring specific attributes of those data items (e.g., a range of values and the data type), and for declaring relationships between data items. In other words, the D D L defines the format of the dictionary and any new words that are added must conform to that format. Just as words are constantly being added to a language, data items will be added to the dictionaries as the discipline evolves. The 3mmCIF: http://ndbserver, rutgers.edu./mmcif/(1996). Bourne, http://www.sdsc.edu/pb/cif/overview.html (1996). aa A data item refers to a data name and its associated value, as discussed subsequently. 5S. R. Hall, F. H. Allen, and I. D. Brown, Acta Crystallogr. A47, 655 (1991). 6 IUCr: ftp://ftp.iucr.ac.uk/pub/cifdic.c91 (1996). 7A. Cook and S. R. Hall, J. Chem. Inf. Comput. Sci. 31, 326 (1992). 8S. R. Hall and N. Spadaccini,J. Chem. Inf. Comput. Sci. 34, 505 (1994). 4 p . E.
[30]
MACROMOLECULAR CIF
573
STAR/ClF
I
i
Data Files
nco Z' I
DDL
English LanguageAnalogy There are 26 letters in the alphabet, I before E except after C is a rule.
Words are allowable groups of letters separated by white space and punctuation.
Words are defined in a dictionary,
This chapter uses only words in the dictionary.
FIG. 1. Components of the STAR/CIF data representation and their analogy to a natural language.
S T A R encoding rules and the D D L are being used to develop a variety of dictionaries and reference files, for example, the powder diffraction dictionary, the modulated structures dictionary, a file of ideal geometry for amino acids, and an N M R dictionary. This extensibility is attractive because the same basic reading and browsing software (context-based tools) can be used irrespective of the data content. Data files (this chapter is an example in our language analogy) are composed of data items found in the dictionaries. In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to the macromolecular crystallographic experiment. Version 1.0 of the mmCIF dictionary (Fitzgerald et aL 9 mmCIF3), which encompasses many data items from the current core dictionary (IUCrl°), is in the final stage of review by COMCIFs, the IUCr appointed committee overseeing CIF developments. This dictionary has been written using D D L version 2.1.1 (Westbrook and Halln), which is significantly enhanced, yet upwardly compatible with D D L version 1.4 (IUCr a2) currently used for the small-molecule dictionary. 9p. M. D. Fitzgerald,H. M. Berman, P. E. Bourne, B.McMahon, K. Watenpaugh, and J. D. Westbrook, Acta Crystallogr. (Suppl.) A52, C575 (1996). 10IUCr: flp://ftp.iucr.ac.uk/pub/cifdic.c96 (1996). u j. D. Westbrook and S. R. Hall, http://ndbserver.rutgers.edu/mmcif/ddl/(1995). 12IUCr: fip://ftp.iucr.ac.uk/pub/ddldic.c95 (1995).
574
PRESENTATION AND ANALYSIS
1301
C o n s i d e r a t i o n s in D e v e l o p m e n t of m m C I F Dictionary In developing version 1.0 of the m m C I F dictionary we made the following decisions. 1. Every field of every PDB record type should be represented by a data item if that PDB field is important for describing the structure, the experiment that was conducted in determining the structure, or the revision history of the entry. It is important to note that it is straightforward to convert a m m C I F data file to a PDB file without loss of information because all information is parsable. It is not possible, however, to automate completely the conversion of a PDB file to an mmCIF, because many m m C I F data items either are not present in the PDB file or are present in PDB R E M A R K records that in some instances cannot be parsed. The content of PDB R E M A R K records are maintained as separate data items within m m C I F so as to preserve all information, even if that information is not parsable. 2. Data items should be defined such that all the information described in the Materials and Methods section of a structure paper could be referenced. This includes major features of the crystal, the diffraction experiment, phasing methodology, and refinement. 3. Data items should be defined such that the biologically active molecule could be described as well as any structural subcomponents deemed important by the crystallographer. 4. Atomic coordinates should be representable as either orthogonal (in .~ngstr6ms) or fractional. 5. Data items should be provided to describe final h,k,l values, including those collected at different wavelengths. 6. For the most part data items specific to an N M R experiment or modeling study would not be included in version 1.0. Exceptions are the data items that summarize the features of an ensemble of structures and permit the description of each m e m b e r of the ensemble. 7. Crystallographic and noncrystallographic symmetry should be defined. 8. A comprehensive set of data items for providing a higher order structure description, for example, to cover supersecondary structure and functional classification, was considered beyond the scope of version 1.0. 9. Data items should be present for describing the characteristics and geometry of canonical and noncanonical amino acids, nucleotides, and heterogen groups. 10. Data items should be present that permit a detailed description of the chemistry of the component parts of the macromolecule, including the provision for two-dimensional (2D) projections.
[301
MACROMOLECULARCIF
575
11. Data items should be present that provide specific pointers from elements of the structure (e.g., the sequence, bound inhibitors) to the appropriate entries in publicly available databases. 12. Data items should be present that provide meaningful 3D views of the structure so as to highlight functional and structural aspects of the macromolecule. On the basis of the preceding decisions, a mmCIF dictionary with approximately 1500 data items (including those data items taken from the smallmolecule dictionary) was developed. It is not expected that all relevant data items will be present in each mmCIF data file. What data items are mandatory to describe the structure and experiment adequately need to be decided by community consensus.
Comparing mmCIF Data Files with PDB Files The format of an mmCIF containing structural data can best be introduced through analogy with the existing PDB format. A PDB file consists of a series of records each identified by a keyword (e.g., HEADER, COMPND) of up to six characters. The format and content of fields within a record are dependent on the keyword. An mmCIF, on the other hand, always consists of a series of name-value pairs (a data item) defined by STAR, where the data name is preceded by a leading underscore (_) to distinguish it from the data value. Thus, every field in a PDB record is represented in mmCIF by a specific data name. The PDB H E A D E R record, HEADER
PLANT SEED PROTEIN
II-OCT-91
ICBN
becomes _struct.entry_id struct.title
'ICBN' 'PLANT SEED PROTEIN'
_struct_keywords.entry_id _struct_keywords.text
'ICBN' 'plant seed protein'
database_2.database_id database_2.database_code
'PDB' 'ICBN'
database_PDB_rev.rev_num _database_PDB_rev.date_original
1 '1991-i0-ii'
The name-value pairing represents a major departure from the PDB file format and has the advantage of providing an explicit reference to each item of data within the data file, rather than having the interpretation left
576
PRESENTATIONAND ANALYSIS
[30]
to the software reading the file. The name matches an entry in the m m C I F dictionary where characteristics of that data item are defined explicitly. When multiple values for the same data item exist, the name of the data item or items concerned is declared in a header and the associated values follow in strict rotation. This is a S T A R rule referred to as a loop_ construct. This loop_ construct is illustrated in the representation of atomic coordinates. loop_ _a t o m _ s i t e. g r o u p _ P D B _atom_site. type_symbol _atom_site. label_atom_id _atom_site. label_comp_id _atom_site. label_asym_id _atom_site. label_seq_id _atom_site. label_alt_id _atom_site. cartn_x _ a t o m _ s i t e . cart n _ y _atom_site. cartn_z _atom_site. occupancy _ a t o m _ s i t e. B _ i s o _ o r _ e q u i v _atom_site. footnot e_id _atom_site. auth_seq_id _ a t o m _ s i t e . id ATOM N N VAL A Ii . 2 5 . 3 6 9 ATOM C CA VAL A ii . 2 5 . 9 7 0 ATOM C C VAL A ii . 2 5 . 5 6 9
30.691 31.965 32.010
11.795 12.332 13.881
1.00 1.00 1.00
17.93 17.75 17.83
. ii 1 . ii 2 . ii 3
[data omitted]
Note that the name construct is of the form _category.extension. The category explicitly defines a natural grouping of data items such that all data items of a single category are contained within a single loop_. There is no restriction on the length of name, beyond the record length limit of 80 characters mentioned below, and while there is no formal syntax within name beyond the category and extension separated by a period, by convention the category and extension are represented as an informal hierarchy of parts, with each part separated by an underscore (_). The names _atom__site.label_atom_id and _atom_site.label_comp_id are examples. Questions that arise concerning the separation of data names and data values are solved with some additional syntax. For example, what if the data value contains white space, an underscore, or runs over several lines? Similarly, what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules, which are a m o r e restricted set of rules than permitted by S T A R , complete the m m C I F description.
[30]
MACROMOLECULARCIF
577
Comments are preceded by a hash (#) and terminated by a new line. Data values on a single line may be delimited by pairs of single (') or double (") quotes. Data values that extend beyond a single line are enclosed within semicolons (;) as the first character of the line that begins the text block and the first character of the line following the last line of text. Data values that are unknown are represented by a question mark (?). Data values that are undefined are represented by a period (.). The length of a record in mmCIF is restricted to 80 characters. Only printable ASCII characters are permitted. Only a single level of loop_ is permissible. To complete the introductory picture of the appearance of an mmCIF data file consider the notion of scope. A PDB file has essentially one form of scope--the complete file. Thus, a single structure or an ensemble of structures is represented by a single file with each member of the ensemble separated by a PDB M O D E L keyword record. There is no computerreadable mechanism for associating components of, say, the R E M A R K records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique. Contents of m m C I F Dictionary Table I summarizes the category groups, their associated individual categories, and their definitions as found in the mmCIF dictionary version 0.9.01 dated Jan. 31, 1997. This comprehensive hierarchy of categories follows closely the progress of the experiment and the subsequent structure description. S t r u c t u r e Representation Using m m C I F The categories describing the crystallographic experiment are relatively self-explanatory and are not detailed here. We do, however, outline the data model used to describe the resulting structure and its description. The structural data model can be described most simply as containing three interrelated groups of categories: A T O M _ S I T E categories, which give coordinates and related information for the structure; E N T I T Y categories, which describe the chemistry of the components of the structure; and S T R U C T categories, which analyze and describe the structure.
578
PRESENTATION AND ANALYSIS
I301
TABLE I mmCIF CATEGORY GROUPS AND ASSOCIATED CATEGORIESa Category groups and members INCLUSIVE G R O U P ATOM GROUP ATOM_SITE ATOM_SITE_ANISOTROP ATOM_SITES ATOM_SITES_ALT ATOM_SITES_ALT ENS ATOM_SITES_ALT_GEN ATOM_SITES_FOOTNOTE ATOM_TYPE AUDIT GROUP AUDIT AUDIT_AUTHOR AUDIT_CONTACT_AUTHOR CELL GROUP CELL CELL_MEASUREMENT CELL_MEAS UREMENT_REFLN
CHEM_COMP GROUP CHEM_COMP CHEM_COMP_ANGLE CHEM_COMP_ATOM CHEM_COMP BOND CHEM_COMP_CHIR CHEM_COMP CHIR_ATOM CHEM_COMP_LINK CHEM_COMP_PLANE CHEM_COMP PLANE_ATOM CHEM_COMP TOR CHEM_COMP TOR_VALUE CHEM_LINK GROUP CHEM_LINK CHEM_LINK_ANGLE CHEM_LINK_BOND CHEM_LINK_CHIR CHEM_LINK_CHIR_ATOM CHEM_LINK_PLANE CHEM_LINK_PLANE_ATOM CHEM_LINK_TOR CHEM_LINK_TOR_VALUE
Definition All category groups Details of each atomic position Anisotropic thermal displacement Details pertaining to all atom sites Details pertaining to alternative atom sites as found in disorder etc. Details pertaining to alternative atom sites as found in ensembles, e.g., from NMR and modeling experiments Generation of ensembles from multiple conformations Comments concerning one or more atom sites Properties of an atom at a particular atom site Detail on the creation and updating of the mmCIF Author(s) of the mmCIF including address information Author(s) to be contacted Unit cell parameters How the cell parameters were measured Details of the reflections used to determine the unit cell parameters Details of the chemical components Bond angles in a chemical component Atoms defining a chemical component Characteristics of bonds in a chemical component Details of the chiral centers in a chemical component Atoms comprising a chiral center in a chemical component Linkages between chemical groups Planes found in a chemical component Atoms comprising a plane in a chemical component Details of the torsion angles in a chemical component Target values for the torsion angles in a chemical component Details of the linkages between chemical components Details of the angles in the chemical component linkage Details of the bonds in the chemical component linkage Chiral centers in a link between two chemical components Atoms bonded to a chiral atom in a linkage between two chemical components Planes in a linkage between two chemical components Atoms in the plane forming a linkage between two chemical components Torsion angles in a linkage between two chemical components Target values for torsion angles enumerated in a linkage between two chemical components
TABLE I (continued) Category groups and members CHEMICAL GROUP CHEMICAL CHEMICAL_CONN_ATOM CHEMICAL_CONN_BOND CHEMICAL_FORMULA CITATION GROUP CITATION CITATION_AUTHOR CITATION_EDITOR COMPUTING GROUP COMPUTING SOFTWARE DATABASE GROUP DATABASE DATABASE_2 DATABASE_PDB_CAVEAT DATABASE_PDB_MATRIX DATABASE_PDB_REMARK DATABASE_PDB_REV DATAB ASE_PDB_REV RECORD DATABASE_PDB_TVECT DIFFRN GROUP DIFFRN DIFFRN_ATTENUATOR DIFFRN DETECTOR DIFFRN MEASUREMENT DIFFRN ORIENT_MATRIX DIFFRN ORIENT_REFLN DIFFRN_RADIATION DIFFRN SOURCE DIFFRN REFLN DIFFRN_REFLNS DIFFRN SCALE_GROUP DIFFRN_RADIATION_ WAVELENGTH DIFFRN STANDARD_REFLN DIFFRN STANDARDS ENTITY GROUP ENTITY ENTITY_KEYWORDS ENTITY_LINK
Definition Composition and chemical properties Atom position for 2-D chemical diagrams Bond specifications for 2-D chemical diagrams Chemical formula Literature cited in reference to the data block Author(s) of the citations Editor(s) of citations where applicable Computer programs used in the structure analysis More detailed description of the software used in the structure analysis Superseded by DATABASE_2 Codes assigned to mmCIFs by maintainers of recognized databases CAVEAT records originally found in the PDB version of the mmCIF data file MATRIX records originally found in the PDB version of the mmCIF data file REMARK records originally found in the PDB version of the mmCIF data file Taken from the PDB REVDAT records Taken from the PDB REVDAT records TVECT records originally found in the PDB version of the mmCIF data file Details of diffraction data and the diffraction experiment Diffraction attenuator scales Describes the detector used to measure scattered radiation Details on how the diffraction data were measured Orientation matrices used when measuring data Reflections that define the orientation matrix Details on the radiation and detector used to collect data Details pertaining to the radiation source Unprocessed reflection data Details pertaining to all reflection data Details of reflections used in scaling Wavelength of radiation used to measure diffraction intensities Details of the standard reflections used during data collection Details pertaining to all standard reflections Details pertaining to each unique chemical component of the structure Keywords describing each entity Details of the links between entities
continued
580
PRESENTATION AND ANALYSIS
[30]
TABLE I (continued) Category groups and members ENTITY_NAME_COM ENTITY_NAME_SYS ENTITY_POLY ENTITY_POLY_SEQ ENTITY_SRC_GEN ENTITY_SRC_NAT ENTRY GROUP ENTRY EXPTL GROUP EXPTL EXPTL_CRYSTAL EXPTL_CRYSTAL_FACE EXPTL CRYSTAL_GROW EXPTL_CRYSTAL_GROW COMP GEOM GROUP GEOM GEOM ANGLE GEOM BOND GEOM CONTACT GEOM TORSION JOURNAL GROUP JOURNAL PHASING GROUP PHASING PHASING_AVERAGING PHASING_ISOMORPHOUS PHASING_MAD PHASING_MAD_CLUST PHASING_MAD_EXPT PHASING_MAD_RATIO PHASING_MADSET PHASING_MIR PHASING_MIR_DER PHASING_MIR_DER REFLN PHASING_MIR_DER SHELL PHASING_MIR_DER SITE PHASING_MIR_SHELL PHASING_SET PHASING_SET_REFLN PUBL GROUP PUBL PUBL_AUTHOR PUBL_MANUSCRIPT_INCL
Definition Common name for the entity Systematic name for the entity Characteristics of a polymer Sequence of monomers in a polymer Source of the entity Details of the natural source of the entity Identifier for the data block Experimental details relating to the physical properties of the material, particularly absorption Physical properties of the crystal Details pertaining to the crystal faces Conditions and methods used to grow the crystals Components of the solution from which the crystals were grown Derived geometry information Derived bond angles Derived bonds Derived intermolecular contacts Derived torsion angles Used by journals and not the mmCIF preparer General phasing information Phase averaging of multiple observations Phasing information from an isomorphous model Phasing via multiwavelength anomalous dispersion (MAD) Details of a cluster of MAD experiments Overall features of the MAD experiment Ratios between pairs of MAD datasets Details of individual MAD datasets Phasing via single and multiple isomorphous replacement Details of individual derivatives used in MIR Details of calculated structure factors As above but for shells of resolution Details of heavy atom sites Details of each shell used in MIR Details of data sets used in phasing Values of structure factors used in phasing Used when submitting a publication as a mmCIF Authors of the publication To include special data names in the processing of the manuscript
TABLE I (continued) Category groups and members REFINE GROUP REFINE REFINE_ANALYZE REFINE_B ISO REFINE_HIST REFINE LS RESTR REFINE_LS_RESTR_NCS REFINE LS SHELL REFINE_OCCUPANCY REFLN GROUP REFLN REFLNS REFLNS_SCALE REFLNS_SHELL STRUCT GROUP STRUCT STRUCT_ASYM STRUCT_BIOL STRUCT_BIOL_GEN STRUCT_BIOL_KEYWORDS STRUCT_BIOL_VIEW STRUCT_CONF STRUCT_CONF_TYPE STRUCT_CONN STRUCT_CONN_TYPE STRUCT_KEYWORDS STRUCT_MON_DETAILS STRUCT_MON NUCL STRUCT_MON_PROT STRUCT_MON_PROT_CIS STRUCT_NCS_DOM STRUCT_NCS DOM_LIM STRUCT_NCS_ENS STRUCT_NCS ENS_GEN STRUCT_NCS_OPER STRUCT REF
Definition
Details of the structure refinement Items used to assess the quality of the refined structure Details pertaining to the refinement of isotropic B values History of the refinement Details pertaining to the least squares restraints used in refinement Details of the restraints applied to atomic positions related by noncrystallographic symmetry Results of refinement broken down by resolution Details pertaining to the refinement of occupancy factors Details pertaining to the reflections used to derive the atom sites Details pertaining to all reflections Details pertaining to scaling factors used with respect to the structure factors As REFLNS, but by shells of resolution Details pertaining to a description of the structure Details pertaining to structure components within the asymmetric unit Details pertaining to components of the structure that have biological significance Details pertaining to generating biological components Keywords for describing biological components Description of views of the structure with biological significance Conformations of the backbone Details of each backbone conformation Details pertaining to intermolecular contacts Details of each type of intermolecular contact Description of the chemical structure Calculation summaries at the monomer level Calculation summaries specific to nucleic acid monomers Calculation summaries specific to protein monomers Calculation summaries specific to cis peptides Details of domains within an ensemble of domains Beginning and end points within polypeptide chains forming a specific domain Description of ensembles Description of domains related by noncrystallographic symmetry Operations required to superimpose individual members of an ensemble External database references to biological units within the structure
continued
582
PRESENTATIONAND ANALYSIS
[301
TABLE I (continued) Category groups and members STRUCT_REF_SEQ STRUCT_REF_SEQ_DIF STRUCT_SHEET STRUCT_SHEET_HBOND STRUCT SHEET_ORDER STRUCT_SHEET_RANGE STRUCT_SHEET_TOPOLOGY STRUCT_SITE STRUCT_SITE_GEN STRUCT_SITE_KEYWORDS STRUCT_SITE_VIEW
Definition Describes the alignment of the external database sequence with that found in the structure Describes differences in the external database sequence with that found in the structure Beta sheet description Hydrogen bond description in beta sheets Order of residue ranges in beta sheets Residue ranges in beta sheets Topologyof residue ranges in beta sheets Details pertaining to specific sites within the structure Details pertaining to how the site is generated Keywords describing the site Description of views of the specified site
SYMMETRY GROUP
SYMMETRY SYMMETRY EQUIV
Details pertaining to space group symmetry Equivalent positions for the specified space group
a Taken from http://ndbserver.rutgers.edu/mmcif/dictionary/dict-html/cif-mm.dic/Index/.
The data items in the ATOM__SITE category record details about the atom sites including the coordinates, the thermal displacement parameters, the errors in the parameters, and a specification of the component of the asymmetric unit to which an atom belongs. The E N T I T Y category categorizes the unique chemical components of the asymmetric unit as to whether they are polymer, nonpolymer, or water. The characteristics of a polymer are described by the ENTITY__POL Y category and the sequence of the chemical components composing the polymer by the ENTITY__POLY_SEQ category. The C H E M _ C O M P categories describe the standard geometries of the m o n o m e r units such as the amino acids and nucleotides as well as that of the ligands and solvent groups. The STR UCT__BIOL category allows the author to describe the biologically relevant features of a structure and its component parts. The STRUCT__BIOL_GEN category provides the information about how to generate the biological unit from the components of the asymmetric unit, which are in turn specified by the STR U C T _ A S Y M category. Various features of the structure such as intermolecular hydrogen bonds, special sites, and secondary structure are specified in STRUCT_CONN, STRUCT_SITE, and STRUCT_CONF, respectively. Figure 2 illustrates the interrelationships among these categories.
[30]
MACROMOLECtJLARCIF
583
a
struct_biol-.~
, struc,bl,gen I st,uc,_~,o,_v,,w1 [ ~,,uct ~,o,_~e~orOs]
I struct_conn_type -~ struct_conn-~ J
• ,entity • -~ ~-~ chemcomp
ent,_po,,
E ~
atom_site
-entitypolyseq~3_~
-struct-c°nf-~ i~ structconftypeJ~
Fie. 2. (a) The relationships between categoriesthat describe biologicallyrelevant structure. (b) The relationships between categories describingpolymerstructure, the atomiccoordinates, and those categories that describe structural features such as hydrogen bonding and secondary structure.
These and other major descriptive features of the m m C I F dictionary are best explored by example. A browsable dictionary can be found at the m m C I F World Wide Web (WWW) site ( m m C I F 3) as well as some complete examples. Complete examples for all nucleic acids can be found at the Nucleic Acid Database W W W site (NDB13). Partial m m C I F s for every 13NDB: http://ndbserver.rutgers.edu/(1996).
584
PRESENTATION AND ANALYSIS
[30]
structure in the PDB are available at two WWW sites [PDB 14, San Diego Supercomputer Center (SDSC, San Diego, CA) 15] having been generated with the program pdb2cif (Bernstein et all6).
Example 1 Starting simply, consider the protein crambin, which is a single [9olypeptide chain of 48 residues, and in the low-temperature form at 0.83 A resolution (Teeter et al.17; PDB code 1CBN) has nearly all the protein-bound solvent resolved as well as an ethanol molecule cocrystallized. The protein shows recognizable sequence microheterogeneity at positions 22 (Pro/Ser) and 25 (Leu/Ile) and 24% of residues show discrete disorder. Although microheterogeneity and disorder are described using data items in the mmCIF dictionary, they are not detailed here for the sake of simplicity. Because the biological function of this molecule is unknown, no biologically relevant structural components are justified. A single identifier (crambin_l) is used to identify the unknown biological function of this molecule. _struct_biol.id crambin_l _struct_biol.details ; The function of this protein is unknown and therefore the biological unit is assumed to be the single polypeptide chain without co-crystallization factors i.e. ethanol.
The single biological descriptor, crambin__l, is generated from the single polypeptide chain found in the asymmetric unit without any symmetry transformations applied. The polypeptide chain is designated chain_a. _struct_biol_gen.biol_id _struct_biol_gen.asym.id _struct_biol_gen.symmetry
crambin 1 chain_a 1_555
The chemical components of the asymmetric unit are three entities: a single polypeptide chain characterized as a polymer, ethanol characterized as nonpolymer, and water. Whether the source of the entity is a natural product, or has been synthesized, is also indicated.
14 PDB: http://www.pdb.bnl.gov/cgi-bin/pdbmain (1996). 15 SDSC: http://www.sdsc.edu/moose (1996). 16 H. J. Bernstein, F. C. Bernstein, and P. E. Bourne, J. Appl. Cryst. accepted (1997). 17 M. M. Teeter, S. M. Roe, and N. Ho Heo, J. Mol. Biol. 230, 292 (1993).
[30]
MACROMOLECULAR CIF
loop_ _entity.id _entity.type _entity.formula_weight _entity.src_method A polymer ethanol non-polymer H20 water
4716 52 18
585
'NATURAL' 'SYNTHETIC'
It is then possible to expand on this basic description of each entity using the entity.id as a reference. So, for example, the common and systematic names are specified as _entity_name_com.entity _entity_name_com.name
id
_entity_name_sys.entity_id _entity_name_sys.name
A crambin A 'Crambe Abyssinica '
Similarly, the natural and synthetic description can be given in more detail, so for the natural product we have _entity_src_nat.entity_id _entity_src_nat.common_name _entity_src_nat.genus _entity_src_nat.species _entity_src_nat.details
A 'Abyssinian cabbage seed' Crambe Abyssinica ?
By using the entities as building blocks, the contents of the asymmetric unit are specified. Crambin is straightforward because each entity appears only once in the asymmetric unit. loop_ _struct_asym.id _ s t r u c t _ a s y m . e n t i t y id _struct_asym.details chain a A ethanol ethanol H20 H20
'Single polypeptide chain' 'Cocrystallized ethanol molecule'
Entities classified as polymer, in this instance only that entity identified as A, are further described. First, the overall features of the polypeptide chain: _entity_poly.entity_id _entity_poly.type _entity_poly.nstd_chirality _entity_poly.nstd_linkage _entity_poly.nstd_monomers _entity_poly.type_details
A po lypept ide (L ) no no no 'Microheterogeneity at 22 and 25'
586
PRESENTATION AND ANALYSIS
[30]
and then the component parts, loop_ _entity_poly_seq.entity _entity_poly_seq.num _entity_poly_seq.mon_id A 1 THR
#
A
2
THR
PRO ALA
A A
23 25
GLU LEU
ALA
A
48
ASN
[data omitted] A A
#
id
22 24
[data omitted] A
47
The entity also may exist in other databases and these references may be cited and described. For the entity designated A, which is defined in GenBank but without sequence microheterogeneity, we have loop_ _struct_ref.id struct_ref.entity_id _struct_ref.biol_id _struct_ref.db_name _struct_ref.db_eode _ s t r u c t ref.secL_align _struct_ref.seq_dif _struct_ref.details 1 A crambin_l 2 A crambin_l
'Genbank' 'PDB'
'493916' 'ICBN'
'entire' 'entire'
'no' 'no'
Once each polymer entity is defined, the details of the secondary structure are defined using the STRUCT_CONF category. loop_ _struct_conf.id _struct_conf.conf_type.id _struct_conf.beg_label_comp_id _struct_conf.beglabel_asym_id _ s t r u c t _ c o n f . b e g label_seq_id _ s t r u c t _ c o n f . e n d label_comp id _struct_conf.end_label_asym_id _struct_conf.end_label_seq_id _struct_conf.details HI HELX RH A L P ILE chain_a 7 H2 HELX RH A L P GLU chain_a 23 S1 STRN_P CYS chain_a 32 $2 STRN_P THR chain_a 1 $3 STRN_P ASN chain_a 46 $4 STRN__P THR chain_a 39 T1 TURN-TYI_P ARG chain_a 17 T2 TURN-TYI_P PRO chain_a 41
PRO THR ILE CYS ASN PRO GLY TYR
chain_a chain_a chain_a chain_a chain_a chain_a chain_a chain_a
19 30 35 4 46 41 2O 44
'HELX-RH3T 17-19' 'Alpha-N start' . . . .
[30]
MACROMOLECULARCIF
587
These assignments are further enumerated over those made in a PDB file for the record types HELIX, TURN, and SHEET. Moreover, the STRUCT_CONF_TYPE category (Table I) specifies the method of assignment that could, for example, be deduced by the crystallographer from the electron-density maps or defined algorithmically. loop_ _struct_conf_type.id _struct_conf_type.criteria
_struct_conf_type.reference HELX_RH ALP STRN_P TURN_TYI_P HELX RH P
'author 'author 'author 'Kabsch
judgement' judgement' judgement' and Sander'
'Biopolymers
(1983)
22:2577'
The commented entry at the end is a hypothetical example of a calculated assignment. Data items also exist (Table I) for the description of/3 sheets, but are not shown in this introductory example. Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE categories. loop_ struct_conn.id struct_conn.conn_type_id struct_conn.ptnrl_label_comp_id struct_conn.ptnrl_label_asym_id struct_conn.ptnrl_label_seq_id struct_conn.ptnrl_label_atom_id
_struct_conn.ptnrl_role _struct_conn.ptnrl_symmetry _struct_conn.ptnr2_label_comp_id _struct_conn.ptnr2_label_asym_id struct_conn.ptnr2_label_seq_id _struct_conn.ptnr2_label_atom_id _struct_conn.ptnr2_role _struct_conn.ptnr2_symmetry _struct_conn.details SSI disulf CYS chain_a 3 S . 1_555 SS2 disulf CYS chain_a 4 S . 1_555
CYS CYS
chain_a 40 chain_a 32
[data omitted] HBI
hydrog
HB2
hydrog
SER LEU ARG ASP
[data omitted]
chain_a 6 chain_a 8 chain_a 17 chain_a 43
OG 0 N 0
positive negative positive negative
1_555 1_556 1_555 1_554
S . 1_555 S . 1_555
588
[301
PRESENTATION AND ANALYSIS
These intermolecular interactions are partially specified on PDB CONNECT records. However, mmCIF provides an additional level of detail such that the criteria used to define an interaction may be given using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge and a hydrogen bond: loop_ _struct_conn_type.id _struct_conn_type.criteria
_struct_conn_type.reference saltbr hydrog
'negative to p o s i t i v e d i s t a n c e > 2.5 \%A a n d < 3.2 \%A ' 'N to O d i s t a n c e > 2.5 \%A, < 3.2 \%A, N O C a n g l e < 120\%' .
Example 2 Consider an mmCIF representation of a more complex structure: the gene regulatory protein 434 CRO complexed with a 20-base pair D N A segment-containing operator (Mondragon and Harrison18; PDB code
3CRO). loop_ _struct_biol.id _struct_biol.details complex T h e c o m p l e x c o n s i s t s of 2 p r o t e i n 20 b a s e p a i r D N A segment.
domains
bound
to a
protein E a c h of the 2 p r o t e i n d o m a i n s is a s i n g l e h o m o l o g o u s p o l y p e p t i d e c h a i n of 71 r e s i d u e s d e s i g n a t e d L a n d R. DNA
two s t r a n d s b a s e offset. The
(A and B)
are c o m p l e m e n t a r y
given
a one
The protein-DNA complex, the protein, and the D N A are considered as three separate biological components each generated from the contents of the asymmetric unit. No crystallographic symmetry need be applied to generate the biologically relevant components.
18 A. Mondragon and S. C. Harrison, J. Mol. Biol. 219, 321 (1991).
[30]
MACROMOLECULAR CIF
589
loop_ _struct_biol gen.biol id _struct_biol_gen.asym.id _struct_biol_gen.symmetry complex L 1_555 complex R 1_555 complex A 1_555 complex B 1_555 protein L 1_555 protein R 1_555 DNA A 1_555 DNA B 1_555 loop_ _entity.id _entity.type dimer DNA A DNA_B water
polymer polymer polymer water
Because each protein domain is chemically identical they constitute a single entity that has been designated diner. The complementary D N A strands are not chemically identical and therefore constitute two separate entities: loop_ _struct_asym.id _struct_asym.entity_id _struct_asym.details L dimer R dimer A DNA_A B DNA_B H20 water
'71 residue polypeptide chain' '71 residue polypeptide chain' '20 base strand' '20 base strand' 'solvent'
Features of the CRO 434 secondary structure and intermolecular contacts can be described in the same way in which crambin was represented and are not repeated.
Conclusion In preparing these examples of representing macromolecular structure using mmCIF it was necessary to return to the original articles because not all the relevant information could be retrieved from the PDB entry. This is evidence that mmCIF provides additional information that also has the advantage of being in a computer-readable form. The consequence is that
590
PRESENTATIONAND ANALYSIS
[31]
it places additional emphasis on the person preparing the mmCIF. It is anticipated that full use of the expressive power of m m C I F will be made only when existing structure solution and refinement programs are modified to maintain m m C I F data items and software tools are developed to help prepare and use an m m C I F effectively. A variety of software tools have been developed for m m C I F (Bernstein et all6; Westbrook et all9). A description of a variety of other efforts can be found elsewhere (Bourne2°). Code and documentation are available at the m m C I F W W W site (mmCIF3). A long-term goal might be to maintain all aspects of the structure determination in an electronic laboratory notebook that uses m m C I F as its underlying data representation.
Acknowledgments The development of the mmCIF dictionary has been a communityeffort. The Background and Introduction sections of the mmCIF WWW site describe the contributions of the many people who have participated in this project (mmCIF3). 19j. D. Westbrook, S. H. Hsieh, and P. M. D. Fitzgerald,J. Appl. Crystallogr. 30, 79-83 (1997). 20p. E. Bourne (ed.), "Proceedings of the First Macromolecular CIF Tools Workshop." Tarrytown, New York, 1993.
[31] P H A S E S - 9 5 : A P r o g r a m P a c k a g e f o r P r o c e s s i n g Analyzing Diffraction Data from Macromolecules
and
By W. FUREY and S. SWAMINATHAN Background The determination of macromolecular crystal structures has always been a computationally demanding task, often further complicated by the need for frequent algorithm modifications, by lack of general, compatible software for the many different procedures required, and by difficulties in exchanging data with the (usually incompatible) hardware needed for visual feedback. In P H A S E S the idea was to overcome these problems by creating a processing package that is efficient, general, reasonably thorough, easily upgradeable, nearly hardware independent, compatible with other popular software, and (most of all) easy to use. Several events took place in the late 1980s and early 1990s that facilitated such a package. First, increases in CPU speed aided efficiency and justified use of general algorithms for
METHODS IN ENZYMOLOGY, VOL. 277
Copyright © 1997 by Academic Press All rights of reproduction in any form reserved. 0076-6879/97 $25.00