(Some) Databases and tools for analyzing protein structures and domains

(Some) Databases and tools for analyzing protein structures and domains • Domains (Sequence Based): • PFAM • UNIPROT • JackHmmer • Structure Based: • PDB • Pymol • Dali server

Sequence: PFAM Database of protein domain families. They are based on manually curated alignments of known domains. These alignments are translated into a HMM profile. If a sequence aligns with this profile, we say that it contains the domain. These are SEQUENCE and functional domains, not structural domains. http://www.uniprot.org/uniprot/H9L4B6.fasta 1 2

Sequence: PFAM Positions of the HMM profile that align with our sequence Size of the HMM profile that represents this domain family Not all the sequence is aligned, only those regions that align with the HMM profiles of given domain families (compare the numbers given here with those in the Pfam entry for this protein, H9L4B6). If the #SEQ line contains lower case letters it means that they are not aligned with the HMM profile, so they are insertions in the sequence that are not considered part of the domain. 1 2 3 4

Sequence: PFAM Identifier (ID) Accession

Sequence: PFAM

Sequence: PFAM This family is similar to others and they are grouped in a “Clan”

Sequence: PFAM You can visualize or download many different alignments for proteins in this family: Seed alignment is the one used to generate the HMM profile that defines the family. It is supposed to be manually curated and reviewed Number of sequences of this domain that Pfam knows in each database (not necessarily the real, updated, number)

Sequence: PFAM Using the “Logo” the degree of conservation of each position in the domain can easily be seen:

Sequence: PFAM All structures known by Pfam that contain a domain of this family. Observe that several structures could be from the same protein. Do not trust “PDB residues” since these numbers some times differ from the PDB residue numeration.

Sequence: PFAM The HMM profile derived from the seed alignment that defines this domain can be downloaded. This way, we can use it to create highly reliable multiple alignments. Once downloaded, open it with a text editor. Observe the different positions in the profile and the scores (probabilities) for a “match” of each residue in the profile, for an insertion in the profile and for transitions from one state to another (match to match, match to insert, match to deletion, etc ).

Sequence: fromUniprot to PFAM http://www.uniprot.org/uniprot/H9L4B6 1 2 3

Sequence: PFAM Annotated features and known Pfam domains in our selected sequence

Sequence: PFAM Annotated features and known Pfam domains in our selected sequence 1 2 3

Sequence:Jackhmmer Find remote-homolog proteins by iteratively searching a protein sequence against a protein database seqDB seqDB Build HMM profile Multiple alignment queryseqxxxxxxxxxxxx S1 xxxxxxxxxxxx S2 xxxxxxxxxxxx … Sn xxxxxxxxxxxx >s1 xxxxxxxxxx >s2 xxxxxxxxxx … >sn xxxxxxxxxxx >s1 xxxxxxxxxx >s2 xxxxxxxxxx … >sn xxxxxxxxxxx >queryseq XXXXXXXXXX profile Question: When do we stop?

Sequence:Jackhmmer https://www.ebi.ac.uk/Tools/hmmer/search/jackhmmer http://www.uniprot.org/uniprot/Q63JW5.fasta 3 4 5 1 2 6

Sequence:Jackhmmer Iteration 1 Submit iteration 2 Result sequences Pfam data Number of significant matches 1 2

Sequence:Jackhmmer Domain arquitectures Different Pfam Domain Arquitectures found in our sequences

Sequence:Jackhmmer iterating Without multiple iterations there is no remote homologous search Iteration 2 results (open in new tab) 1 2

Sequence:Jackhmmer iterating 1

Sequence:Jackhmmer iterating After 5 iterations we see relations between this protein and others with other domains

(Some) Databases and tools for analyzing protein structures and domains • Domains (Sequence Based): • PFAM • UNIPROT • JackHmmer • Structure Based: • PDB • Pymol • Dali server

Structure: PDB THE Database PDB stores proteins (and DNA, lipids, small molecules …) with known structure. This means that we know the coordinates X,Y and Z of their atoms. So we can see their “shape” and KNOW which residues are buried, which are exposed, the distance between them, their interactions, etc. Every structure deposited has a 4 digit code that starts with a number and then has 3 other characters that are normally letters. They currently have no meaning, although the first ones did (Ex 4CPA = CarboxyPeptidase A) Apart from the coordinates it also has information derived from them, from the state in which the protein has been solved to the experimental conditions or the presence of catalytic, binding and allosteric sites, ligands, secondary structures, mutations, etc.

Structure: PDB THE Database We will use 4BYZ structure as an example

Structure: PDB Structure Summary Expression system Method used to solve the structure and, if by X-Ray diffraction, its resolution View the structure interactively (rotate, zoom, cartoons, surface, etc) and its electron density • Publications: • View this pubmed entry • Search PDB by this pubmed reference

Structure: PDB Structure view • Translate, rotate, zoom … • Lines, backbone, cartoons, surface … • Color by chain, element … • Play with it

Structure: PDB Electron density • You can see how good the atoms fit the electron density so how good is this structure • Try using: • Mouse whell • Ctrl + mouse wheel • Shift + mouse whell • Mouse right button and movement

Structure: PDB Structure Summary All proteins (chains) present in this structure (links to uniprot) Different known and predicted features Ligands present in this structure (their structure is also solved)

Structure: PDB Protein Feature View Protein: complete sequence Several 3D structures of the same protein exist (in this case two: chain A of 4byz and chain A of 4bz0) Some residues in 4bz0 could not be solved

Structure: PDB Data in Uniprot PDB codes for this protein, the method used to solve the structure, resolution and what chain in the pdb structure this protein is Which PDB should I use (Europe, USA, Japan) Positions in the Uniprot sequence with the structure solved in the pdb file. (WARNING: this range may include internal aa with unresolved structure, so there will not be X,Y,Z coordinates for their atoms in the PDB file).

Structure: PDB Structure Summary We download the structure in a text file with “pdb” format (the most standard, all software read this format) We can download the structure as it was solved in the experiment (PDB format) or In the form it works in nature (Biological Assembly). There may be several structures here because sometimes the oligomerisation state is not clear or the protein may work as a monomer but be solved as an oligomer or complex

Structure: PDB Structure Sunmary We have to take care when downloading the “FASTA” sequence, this is not necessarily the protein sequence nor the sequence with the X,Y,Z coordinates in the pdb file. This is the sequence that the experimentalist who solved the structure used. It can contain mutations, be just a fragment or include tags used for purification that are not present in the real protein. At the same time, perhaps not all atoms and/or residues could be resolved, so not all the aa in the sequence may have their structure solved

Structure: PDB Structure Sunmary What can be downloaded can also be displayed

Structure: PDB PDB file This : https://files.rcsb.org/view/4BYZ.pdb Is a pdb file, it contains a header with all the info present in the web page (sequence, ligands, binding site, Uniprot code, etc.) and the XYZ coordinates for the atoms of the molecules present in the structure, with the following format: (Lines with atom coordinates begin with the tag ATOM or HETATM) #--------------------------------------------------------------------------- #Field | Column | FORTRAN | # No. | range | format | Description #--------------------------------------------------------------------------- # 1. | 1 - 6 | A6 | Record ID (eg ATOM, HETATM) # 2. | 7 - 11 | I5 | Atom serial number # - | 12 - 12 | 1X | Blank # 3. | 13 - 16 | A4 | Atomname (eg " CA " , " ND1") # 4. | 17 - 17 | A1 | Alternativelocationcode (ifany) # 5. | 18 - 20 | A3 | Standard 3-letter amino acidcodeforresidue # - | 21 - 21 | 1X | Blank # 6. | 22 - 22 | A1 | Chainidentifiercode # 7. | 23 - 26 | I4 | Residuesequencenumber # 8. | 27 - 27 | A1 | Insertioncode (ifany) # - | 28 - 30 | 3X | Blank # 9. | 31 - 38 | F8.3 | Atom's x-coordinate # 10. | 39 - 46 | F8.3 | Atom's y-coordinate # 11. | 47 - 54 | F8.3 | Atom's z-coordinate # 12. | 55 - 60 | F6.2 | Occupancyvalueforatom # 13. | 61 - 66 | F6.2 | B-value (thermal factor) # - | 67 - 67 | 1X | Blank # 14. | 68 - 70 | I3 | Footnotenumber #--------------------------------------------------------------------------- # # 1 2 3 4 5 6 #12345678901234567890123456789012345678901234567890123456789012345678 #-------------------------------------------------------------------- #@<<<<<@>>>> @<<<@@<< @@>>>@ @>>>>>>>@>>>>>>>@>>>>>>> #ATOM 1751 N GLY C 250 32.286 1.882 43.206 1.00 22.00 #ATOM 31 P POM 3 8.010 4.280 1.939 #ATOM 3000 SE MSE A 401 -1.600 -14.228 65.136 0.70 15.94 SE This lines are not in the pdb file They are here for you to view the column range (each character is a column) and if fields are left or right aligned http://www.wwpdb.org/documentation/file-format http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html

Structure: PDB Annotations

Structure: PDB Sequence Here we can view the sequence for all chains and all features annotated in the pdb file

Structure: PDB Sequence • Change the sequence displayed: • Sequence used by the experimentalist to solve this structure • Sequence in Uniprot for this protein • Mouse over a residue to see the residue and its number (numeration in PDB is not necessarily the position in the sequence, this can be a cause for confusion) • A section of the protein may have not been solved, so you will not see structure here (no XYZ coordinates for its atoms) 1 2

Structure: PDB Multiple Chains Be careful! A pdb structure may contain multiple chains, which can be from the same or different proteins. In the following example we see that the 4CPA structure has 4 chains (A, B, I, J). - A and B are from Carboxypeptidase A from Bostaurus(cow) UniprotID: P00730. - I and J are from Metallocarboxypeptidase inhibitor from Solanum tuberosum(potato) UniprotID: P01075

Structure: PDB Advanced Search Like in most databases we can combine multiple searches using Boolean operators 1 2 3

Structure: PDB Advanced Search There are 722 pdbentries containing this protein and 138 containing this ligand Add search criteria 1 2 3 3 4 5 6 7

Structure: PDB Advanced Search There are only 12 pdb entries containing both this protein and this ligand

Structure: PDB Ligand Search We can also search for pdb files by the ligands they contain Or by protein sequence 1 2 3 1 2 3

BfpC shares two structural domains with our query protein, we can superpose both proteins (the entire solved structure, not just by domains) to see if they are really similar Structure: Operations Pairwaise Structure superposition 1 2 3 4 5 6 7 8

Structure: Operations Pairwaise Structure superposition We can see that the two structures are very similar (RMSD=2.90Å) in spite of having a very low sequence identity (11.6%)

Structure: Operations Pairwise Structure superposition

Structure: Operations Pairwise Structure superposition (ALTERNATIVE VERSION) 1 2 3

Structure: Operations Pairwise Structure superposition (ALTERNATIVE VERSION)

Structure: Operations Pairwise Structure superposition (ALTERNATIVE VERSION) 2 1 3 4

Structure: Operations Pairwaise Structure superposition (ALTERNATIVE VERSION) Move and zoom the molecule (mouse: left button and wheel) - EQR: number of superposed residues - Len1 & Len2 : Length of the superposed structures - RMSD: Root Mean Square Difference (average distance between superposed atoms) This alignment is not based on the sequence but based on the structure 1

Structure: Operations Pairwise Structure superposition with Pymol Pymol is a molecular visualization software that includes several functions to analyze protein structures. It also has a built in superposition algorithm

Structure: Operations Pairwise Structure superposition with Pymol 4BYZ 1 2

(Some) Databases and tools for analyzing protein structures and domains

(Some) Databases and tools for analyzing protein structures and domains

Presentation Transcript

Some statistics

Some think fast, some slow

Some Statistics

Some comments

Some

Some examples…….

Some functions

Some Things

Some definitions

Some notes...

Some Concepts

Some news

Some Definitions

Some BACKGROUND

Some Technology, Some Math

All Gave Some, Some Gave All

some comparisons

Some terminology:

SOME START

Some Questions

Some info …