Multiple Motifs Charles Yan Spring 2006
From Single Motif to Multiple Motifs A single motif is often not sufficient to discriminate a protein family. Multiple motifs have stronger discriminating power.
Multiple Motifs Protein function prediction using multiple motifs • Each protein family is characterized by a set of motifs (instead of a single one). • If a protein contains a set of motifs, it probably belongs to the family that the set of motifs corresponds to.
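The rule on this slide can be sketched in a few lines. This is a minimal illustration, assuming each motif is simplified to a regular expression; the family names and patterns in `FAMILY_MOTIFS` are invented for the example and are not taken from any database.

```python
import re

# Hypothetical families, each described by a set of motif patterns.
# The patterns are invented for illustration only.
FAMILY_MOTIFS = {
    "family_A": ["C..C", "H...H", "GKT"],
    "family_B": ["DEAD", "GKT"],
}

def predict_families(sequence, family_motifs=FAMILY_MOTIFS):
    """Return the families whose motifs are ALL present in the sequence."""
    hits = []
    for family, motifs in family_motifs.items():
        if all(re.search(m, sequence) for m in motifs):
            hits.append(family)
    return hits

print(predict_families("MACAACQQHAAAHWWGKTLL"))  # all three family_A motifs present
print(predict_families("MMGKTLLDEADQQ"))         # both family_B motifs present
print(predict_families("MMGKTLL"))               # GKT alone is not enough
```

Note that the last sequence contains the shared `GKT` motif but is assigned to neither family, which is the point of requiring the whole set.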
PRINTS • PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/) is a database of protein fingerprints. • A fingerprint is a group of conserved motifs used to characterize a protein family. • ftp.bioinf.man.ac.uk/pub/prints • PRINTS is now maintained at the University of Manchester • PRINTS VERSION 38.0 (16 June, 2005) • 1900 FINGERPRINTS, encoding 11,435 single motifs
PRINTS • Each fingerprint has been defined and iteratively refined using the SWISS-PROT/TrEMBL composite database. • Two types of fingerprint are represented in the database, simple and composite, depending on their complexity: simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs. The bulk of the database entries are of the latter type because discrimination power is greater for multi-component searches. • Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D space. • Fingerprints can encode protein folds and functionalities more flexibly and powerfully than single motifs can; their full diagnostic potency derives from the mutual context provided by motif neighbors.
PRINTS • A motif is a conserved element corresponding to a region whose function or structure is known. It is likely to be predictive of any subsequent occurrence of such a structural/functional region in any other protein sequence. • A motif is represented as a conserved alignment of multiple sequences. • A fingerprint is a set of motifs used to predict the occurrence of similar motifs in an individual sequence.
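One simple way to turn an aligned motif into a predictor is to count residues per alignment column and slide the resulting matrix along a query sequence. The sketch below assumes an ungapped alignment and an unweighted count-based score; the function names and the three aligned segments are illustrative, not PRINTS data or the PRINTS scoring scheme.

```python
from collections import Counter

def frequency_matrix(aligned_motif):
    """Count the residues observed in each column of an ungapped alignment."""
    width = len(aligned_motif[0])
    if any(len(row) != width for row in aligned_motif):
        raise ValueError("all aligned segments must have the same length")
    return [Counter(row[i] for row in aligned_motif) for i in range(width)]

def best_window(sequence, matrix):
    """Slide the matrix along the sequence; return (score, start) of the best window."""
    width = len(matrix)
    best = (-1, -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # A residue never seen in a column contributes 0 (Counter default).
        score = sum(matrix[i][res] for i, res in enumerate(window))
        if score > best[0]:
            best = (score, start)
    return best

motif = ["GKTS", "GKSS", "GKTT"]  # invented aligned segments
matrix = frequency_matrix(motif)
print(best_window("AAGKTSLL", matrix))  # → (10, 2)
```

The best window starts at offset 2 (0-based), where `GKTS` agrees with the most frequent residue in every column.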
PRINTS • The starting point is a multiple sequence alignment of a small number of sequences • Once a motif, or set of motifs, has been identified, the conserved regions are excised in the form of local alignments • The motif/s are used to scan the database • Only those sequences that match all motifs are regarded as true matches • The additional sequence data from the new true set is then used to generate another set of aligned motifs, and the database is searched again • The process is repeated until convergence, i.e. until a scan adds no new sequences to the true set
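The iterative procedure above can be sketched as a loop that rescans a database until the true set stops changing. This is a toy version under stated assumptions: a motif "matches" when some window reaches a fixed fraction of identical residues with any aligned segment (the `threshold` of 0.75 is arbitrary), and the database, identifiers, and seed motifs are invented. The actual PRINTS scanning method is not reproduced here.

```python
def best_hit(sequence, segments):
    """Best window in sequence against any segment; returns (identity, window)."""
    width = len(segments[0])
    best = (0.0, None)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for seg in segments:
            identity = sum(a == b for a, b in zip(window, seg)) / width
            if identity > best[0]:
                best = (identity, window)
    return best

def refine(motifs, database, threshold=0.75, max_iter=20):
    """motifs: list of lists of equal-length segments; database: dict id -> sequence."""
    true_set = set()
    for _ in range(max_iter):
        new_true, new_motifs = set(), [set() for _ in motifs]
        for seq_id, seq in database.items():
            hits = [best_hit(seq, m) for m in motifs]
            # Only sequences matching ALL motifs count as true matches.
            if all(identity >= threshold for identity, _ in hits):
                new_true.add(seq_id)
                for segs, (_, window) in zip(new_motifs, hits):
                    segs.add(window)
        if new_true == true_set:  # converged: the scan added nothing new
            break
        # Rebuild the aligned motifs from the enlarged true set and rescan.
        true_set, motifs = new_true, [sorted(s) for s in new_motifs]
    return true_set, motifs

database = {  # invented toy sequences
    "P1": "AAGKTSLLDEADQQ",
    "P2": "MMGKSSWWDEADLL",
    "P3": "GKTTPPPPDEVDKK",
    "P4": "GRRSLLLLDEADWW",
    "P5": "LLGKSTAADEADGG",
}
true_set, motifs = refine([["GKTS"], ["DEAD"]], database)
print(sorted(true_set))
print(motifs)
```

In this toy run, P5 fails the first scan (its `GKST` window is only 50% identical to the seed `GKTS`) but is picked up in the second scan once `GKSS` and `GKTT` have joined the motif, while P4 never matches the first motif and stays out.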
PRINTS a) General field
PRINTS b) Summary field A good fingerprint should exhibit a clear discrimination cut-off, i.e. it shows all true positives matching all n motifs, perhaps some noise, and few or no matches at intermediate positions of the summary table.
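To make the idea of a discrimination cut-off concrete, the sketch below tallies how many sequences match exactly k of the n motifs. The function name, the input format (sequence id mapped to number of motifs matched), and the counts are assumptions for illustration; the layout of the real PRINTS summary field is not reproduced.

```python
from collections import Counter

def summary_table(motifs_matched, n_motifs):
    """Map k -> number of sequences matching exactly k motifs, for k = n..1."""
    counts = Counter(motifs_matched.values())
    return {k: counts.get(k, 0) for k in range(n_motifs, 0, -1)}

# Invented scan result: sequence id -> number of motifs matched (n = 5).
hits = {"P1": 5, "P2": 5, "P3": 5, "P4": 2, "P5": 2, "P6": 1}
table = summary_table(hits, n_motifs=5)
for k, count in table.items():
    print(f"{k} motifs: {count} sequences")
```

Here three sequences match all five motifs, none match four or three, and a few match only one or two: an empty middle of the table is the clear cut-off the slide describes.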
PRINTS • Motif name • Iteration number • PCODE: the protein identification codes of the initial sequences • ST: the location of the motifs within those sequences • INT: the interval between adjacent motifs. For the first motif, this is simply the distance from the beginning of the sequence to the start of the motif.
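The relationship between ST and INT can be illustrated with a small calculation. The sketch assumes 0-based start offsets and defines an interval as the number of residues between the end of one motif and the start of the next; the numbering convention used in actual PRINTS entries may differ (for example, 1-based positions), so treat the function and the numbers as illustrative only.

```python
def intervals(starts, widths):
    """Intervals between adjacent motifs.

    starts: 0-based start offsets of the motifs, in sequence order (assumed convention).
    widths: motif lengths, in the same order.
    The first interval is the distance from the sequence start to the first motif.
    """
    result = []
    previous_end = 0
    for start, width in zip(starts, widths):
        if start < previous_end:
            raise ValueError("motifs overlap or are out of order")
        result.append(start - previous_end)
        previous_end = start + width
    return result

# Invented positions for three motifs in one sequence.
print(intervals([24, 60, 110], [12, 15, 10]))  # → [24, 24, 35]
```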
PRINTS FPScan Submit a protein sequence to find the closest matching PRINTS fingerprint/s.
PRINTS GRAPHScan A graphical view of the result of scanning a fingerprint against a sequence. Matching motifs are highlighted if they score above the threshold % identity.
PRINTS MULScan This facility allows multiple sequences to be scanned against the database. Results are returned via email.
Related Projects • InterPro - Integrated Resource of Protein Domains and Functional Sites • BLOCKS - BLOCKS db • Pfam - Protein families db (HMM derived) [Mirror at St. Louis (USA)] • PRINTS - Protein Motif fingerprint db • ProDom - Protein domain db (Automatically generated) • PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins • SBASE - SBASE domain db • SMART - Simple Modular Architecture Research Tool • TIGRFAMs - TIGR protein families db