Stockholm Bioinformatics Center, SBC

Lecture 30 Oct 2001 Per Kraulis

Databases in bioinformatics

7. Macromolecular 3D structure databases


The PDB is the main primary database for 3D structures of biological macromolecules determined by X-ray crystallography and NMR. Structural biologists usually deposit their structures in the PDB on publication, and some scientific journals require deposition before accepting a paper. The PDB also accepts the experimental data used to determine the structures (X-ray structure factors and NMR restraints), as well as homology models. As of 23 Oct 2001 the PDB contained 16,358 entries, the majority of which (12,304, about 75%) are X-ray structures.

The Protein Data Bank (PDB) was established in the 1970s at Brookhaven National Laboratory on Long Island, New York State, US. In 1999, its management was moved to the Research Collaboratory for Structural Bioinformatics (RCSB, a joint organisation between Rutgers University, the San Diego Supercomputer Center and NIST).

The PDB entries contain the atomic coordinates, together with some structural parameters that are either associated with the atoms (B-factors, occupancies) or computed from the structures (secondary structure). The entries also contain some annotation, but it is not as comprehensive as in SWISS-PROT. Fortunately, there are cross-links between the two databases in both file formats. An example of an entry is the Ras-binding domain of the human Raf-1 oncogene, available in the traditional PDB format and in the mmCIF format.
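To make the contents of an entry concrete, here is a minimal sketch of how the coordinates, occupancy and B-factor can be read from a single ATOM record in the traditional fixed-column PDB format. The column positions follow the published format description; the `Atom` class, the `parse_atom_record` function and the sample line are illustrative assumptions, and the sample values are made up rather than taken from a real entry.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom_record(line: str) -> Atom:
    """Parse one ATOM/HETATM record of the traditional PDB format.

    The format is column-based, so fields are taken by position,
    not by splitting on whitespace (fields may run together).
    """
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),          # columns 7-11
        name=line[12:16].strip(),        # columns 13-16
        res_name=line[17:20].strip(),    # columns 18-20
        chain_id=line[21].strip(),       # column 22
        res_seq=int(line[22:26]),        # columns 23-26
        x=float(line[30:38]),            # columns 31-38, in Angstrom
        y=float(line[38:46]),            # columns 39-46
        z=float(line[46:54]),            # columns 47-54
        occupancy=float(line[54:60]),    # columns 55-60
        b_factor=float(line[60:66]),     # columns 61-66
    )


# Hypothetical sample record (made-up values, for illustration only).
SAMPLE = "ATOM      1  N   MET A   1      12.345   6.789  -1.234  1.00 20.50           N"

atom = parse_atom_record(SAMPLE)
print(atom.res_name, atom.name, atom.x, atom.y, atom.z, atom.b_factor)
```

Slicing by column rather than splitting on whitespace matters because adjacent numeric fields are allowed to touch, for example when a coordinate is large and negative.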

There are no legal restrictions on the use of the data in the PDB.


The SCOP (Structural Classification of Proteins) database was started by Alexey Murzin in 1994 (Lab of Molecular Biology, MRC, Cambridge, UK). Its purpose is to classify protein 3D structures in a hierarchical scheme of structural classes. It is maintained by experts ("by hand"). All protein structures in the PDB are classified, and the database is updated as new structures are deposited in the PDB.

This is a typical secondary database: it is based on data in a primary database (in this case the PDB), but adds information through analysis and/or organisation, in this case the classification of protein 3D structures into a hierarchical scheme of folds, superfamilies and families.
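The idea of a hierarchical scheme layered on top of primary data can be sketched as a nested mapping from fold to superfamily to family to the classified domains. This is only an illustration of the data structure, not how SCOP itself is stored or distributed; the function names and all labels in `RECORDS` are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical records: (fold, superfamily, family, domain identifier).
RECORDS = [
    ("fold-1", "superfamily-1a", "family-1a-i", "domain-001"),
    ("fold-1", "superfamily-1a", "family-1a-i", "domain-002"),
    ("fold-1", "superfamily-1a", "family-1a-ii", "domain-003"),
    ("fold-1", "superfamily-1b", "family-1b-i", "domain-004"),
    ("fold-2", "superfamily-2a", "family-2a-i", "domain-005"),
]


def build_hierarchy(records):
    """Group flat records into fold -> superfamily -> family -> [domains]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for fold, superfamily, family, domain in records:
        tree[fold][superfamily][family].append(domain)
    return tree


def domains_in_fold(tree, fold):
    """Return all domains classified anywhere below the given fold."""
    return sorted(
        domain
        for families in tree.get(fold, {}).values()
        for domains in families.values()
        for domain in domains
    )


tree = build_hierarchy(RECORDS)
print(domains_in_fold(tree, "fold-1"))
```

Domains in the same family are also in the same superfamily and fold, so a query at a higher level simply collects everything beneath it.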


The CATH database (Class, Architecture, Topology, Homologous superfamily) is a hierarchical classification of protein domain structures, which clusters proteins at four major structural levels. Although the aim is very similar to that of SCOP, the scheme it uses is different, and the philosophy and practical details of producing the classification differ considerably. For instance, a larger fraction of the decisions made when classifying a new protein 3D structure is made automatically by software. It was started by Christine Orengo in Janet Thornton's lab (University College London) in 1996.
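The four levels can be illustrated with a small sketch that assumes a position in the hierarchy is written as four dot-separated numbers, one per level (C.A.T.H). The `CathCode` type, the helper functions and the example codes are illustrative assumptions, and the class names in `CLASS_NAMES` are given as commonly described for CATH rather than taken from these notes.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


# Assumed names of the top-level classes.
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted code such as '3.10.20.90' into its four levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {code!r}")
    return CathCode(*(int(part) for part in parts))


def shared_levels(a: CathCode, b: CathCode) -> int:
    """Count how many levels, from Class downwards, two codes have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Illustrative codes only; they are not claimed to belong to specific proteins.
first = parse_cath_code("3.10.20.90")
second = parse_cath_code("3.10.20.30")
print(CLASS_NAMES[first.class_], shared_levels(first, second))
```

Two domains that share the first three levels have the same class, architecture and topology, but have been placed in different homologous superfamilies.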

Copyright © 2001 Per Kraulis. Last modified: 2001/11/09 15:19:05