
Computational Molecular Biology

Protein Structure: Introduction and Prediction


Protein Folding

  • One of the most important problems in molecular biology

  • Given the one-dimensional amino-acid sequence that specifies the protein, what is the protein’s fold in three dimensions?

My T. Thai

[email protected]


Overview

  • Understand protein structures

    • Primary, secondary, tertiary

  • Why study protein folding:

    • Structure can reveal functional information which we cannot find from the sequence

    • Misfolded proteins can cause disease, e.g. mad cow disease

    • Useful in drug design

    Overview of Protein Structure

    • Proteins make up about 50% of the dry mass of the average human

    • Play a vital role in keeping our bodies functioning properly

    • Biopolymers made up of amino acids

    • The order of the amino acids in a protein and the properties of their side chains determine the three dimensional structure and function of the protein

    Amino Acid

    • Building blocks of proteins

    • Consist of:

      • An amino group (-NH2)

      • Carboxyl group (-COOH)

      • Hydrogen (-H)

      • A side chain group (-R) attached to the central α-carbon

    • There are 20 standard amino acids

    • Primary protein structure is the sequence of amino acids in the chain

    [Figure: amino acid structure, with the side chain, amino group, and carboxyl group labeled]

    Side chains (Amino Acids)

    • 20 amino acids have side chains that vary in structure, size, hydrogen bonding ability, and charge.

    • R gives the amino acid its identity

    • R can be as simple as a hydrogen atom (glycine) or more complex, such as an aromatic ring (tryptophan)

    How Amino Acids Become Proteins

    Peptide bonds

    Polypeptide

    • A chain of more than fifty amino acids is called a polypeptide.

    • A protein is usually composed of 50 to 400+ amino acids.

    • We call the units of a protein amino acid residues.

    [Figure labels: amide nitrogen, carbonyl carbon]

    Side chain properties

    • Carbon does not readily form hydrogen bonds with water, so carbon-rich side chains are hydrophobic.

      • These ‘water fearing’ side chains tend to sequester themselves in the interior of the protein

    • O and N are generally more likely than C to hydrogen-bond with water, so side chains containing them are hydrophilic

      • These tend to turn outward to the exterior of the protein

    Primary Structure

    Primary structure: Linear String of Amino Acids

    Side-chain

    Backbone

    ... ALA PHE LEU ILE LEU ARG ...

    Each amino acid within a protein is referred to as a residue

    Each different protein has a unique sequence of amino acid residues; this sequence is its primary structure

    Secondary Structure

    • Refers to the spatial arrangement of contiguous amino acid residues

    • Regularly repeating local structures stabilized by hydrogen bonds

      • A hydrogen bond involves a hydrogen atom attached to a relatively electronegative atom

    • Examples of secondary structure are the α-helix and the β-pleated sheet

    Alpha-Helix

    • Amino acids adopt the form of a right handed spiral

    • The polypeptide backbone forms the inner part of the spiral

    • The side chains project outward

    • Every backbone N-H group donates a hydrogen bond to the backbone C=O group of the residue four positions earlier

    Beta-Pleated-Sheet

    • Consists of long polypeptide chains called beta-strands, aligned adjacent to each other in parallel or anti-parallel orientation

    • Hydrogen bonding between the strands keeps them together, forming the sheet

    • Hydrogen bonding occurs between the backbone N-H and C=O groups of different strands

    Parallel Beta Sheets

    Anti-Parallel Beta Sheets

    Mixed Beta Sheets

    Tertiary Structure

    • The full three-dimensional structure, describing the overall shape of a protein

    • Also known as its fold

    Quaternary Structure

    • Many proteins are made up of multiple polypeptide chains, each called a subunit

    • The spatial arrangement of these subunits is referred to as the quaternary structure

    • Sometimes distinct proteins must combine together in order to form the correct 3-dimensional structure for a particular protein to function properly.

    • Example: the protein hemoglobin, which carries oxygen in blood. Hemoglobin is made of four similar proteins that combine to form its quaternary structure.

    Other Units of Structure

    • Motifs (super-secondary structure):

      • Frequently occurring combinations of secondary structure units

      • A pattern of alpha-helices and beta-strands

    • Domains: A protein chain often consists of different regions, or domains

      • Domains within a protein often perform different functions

      • Can have completely different structures and folds

      • Typically 100 to 400 residues long

    What Determines Structure

    • What causes a protein to fold in a particular way?

    • At a fundamental level, chemical interactions between all the amino acids in the sequence contribute to a protein’s final conformation

    • There are four fundamental chemical forces:

      • Hydrogen bonds

      • Hydrophobic effect

      • Van der Waals forces

      • Electrostatic forces

    Hydrogen Bonds

    • Occur when a pair of nucleophilic (electronegative) atoms, such as oxygen and nitrogen, share a hydrogen between them

    • The pattern of hydrogen bonding is essential in stabilizing basic secondary structures

    Van der Waals Forces

    • Interactions between immediately adjacent atoms

    • Result from the attraction between an atom’s nucleus and its neighbor’s electrons

    Electrostatic Forces

    • Oppositely charged side chains can form salt bridges, which pull chains together

    Experimental Determination

    • A centralized database for depositing protein structures, the Protein Data Bank (PDB), is accessible at http://www.rcsb.org/pdb/index.html

    • Two main techniques are used to determine/verify the structure of a given protein:

      • X-ray crystallography

      • Nuclear Magnetic Resonance (NMR)

    • Both are slow, labor-intensive, and expensive (solving a structure sometimes takes longer than a year)

    X-ray Crystallography

    • A technique that can reveal the precise three dimensional positions of most of the atoms in a protein molecule

    • The protein is first isolated to yield a high-concentration solution of the protein

    • This solution is then used to grow crystals

    • The resulting crystal is then exposed to an X-ray beam

    Disadvantages

    • Not all proteins can be crystallized

    • The crystalline structure of a protein may differ from its native structure in solution

    • Multiple maps may be needed to get a consensus

    NMR

    • The spinning of certain atomic nuclei generates a magnetic moment

    • NMR measures the energy levels of such magnetic nuclei (radio frequency)

    • These levels are sensitive to the environment of the atom:

      • What they are bonded to, which atoms they are close to spatially, what distances are between different atoms…

    • Thus, by careful measurement, the structure of the protein can be reconstructed

    Disadvantages

    • Limited by the size of the protein: the upper bound is about 200 residues

    • Protein structure is very sensitive to pH.

    Computational Methods

    • Because experimental methods are long and painful, we need computational approaches to predict a protein's structure from its sequence.

    Functional Region Prediction

    Protein Secondary Structure

    Tertiary Structure Prediction

    More Details on X-ray Crystallography

    Overview

    Overview

    Crystal

    • A crystal can be defined as an arrangement of building blocks which is periodic in three dimensions

    Crystallize a Protein

    • Have to find the right combination of all the different influences to get the protein to crystallize

    • This can take a few hundred or even thousands of experiments

    • Most popular way to conduct these experiments

      • Hanging-drop method

    Hanging drop method

    • The reservoir contains a precipitant concentration twice as high as the protein solution

    • The protein drop is made up of 50% stock protein solution and 50% reservoir solution

    • Over time, water will diffuse from the protein drop into the reservoir

    • Both the protein concentration and the precipitant concentration in the drop will increase

    • Crystals will appear after days, weeks, or months

    Properties of protein crystal

    • Very soft

    • Mechanically fragile

    • Large solvent areas (30-70%)

    A Schematic Diffraction Experiment

    Why do we need Crystals

    • A single molecule could never be oriented and handled properly for a diffraction experiment

    • In a crystal, we have about 10^15 molecules in the same orientation so that we get a tremendous amplification of the diffraction

    • Crystals produce much simpler diffraction patterns than single molecules

    Why do we need X-rays

    • X-rays are electromagnetic waves with a wavelength close to the distance between atoms in the protein molecules

    • To get information about where the atoms are, we need to resolve them, so we need radiation of a comparable wavelength

    A Diffraction Pattern

    Resolution

    • The primary measure of crystal order/quality of the model

    • Ranges of resolution:

      • Low resolution (> 3-5 Å): side chains are difficult to see; only the overall structural fold is visible

      • Medium resolution (2.5-3 Å)

      • High resolution (2.0 Å)

    Some Crystallographic Terms

    • h,k,l: Miller indices (like a name of the reflection)

    • I(h,k,l): intensity

    • 2θ: angle between the incident X-ray beam and the reflected beam

    Diffraction by a Molecule in a Crystal

    • The electric vector of the X-ray wave forces the electrons in our sample to oscillate with the same frequency as the incoming wave

    Description of Waves

    Structure Factor Equation

    • f_j: proportional to the number of electrons that atom j has

    • One of the fundamental equations in X-ray Crystallography
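As a rough illustration, the sketch below uses the standard textbook form of the structure factor, F(hkl) = Σ_j f_j · exp[2πi(h·x_j + k·y_j + l·z_j)], where (x_j, y_j, z_j) are the fractional coordinates of atom j in the unit cell. The function name and the atom tuple layout are assumptions made for this example, and f_j is simplified to a constant per atom.

```python
import cmath

def structure_factor(atoms, h, k, l):
    """Sum the scattered waves of all atoms for the reflection (h, k, l).

    atoms: list of (f_j, x_j, y_j, z_j) tuples with fractional coordinates
    (a hypothetical layout chosen for this sketch).
    """
    total = 0j
    for f_j, x, y, z in atoms:
        angle = 2 * cmath.pi * (h * x + k * y + l * z)
        total += f_j * cmath.exp(1j * angle)
    return total

# Two identical atoms half a cell apart along x: their waves cancel for h = 1
# and add up for h = 2.
atoms = [(6.0, 0.0, 0.0, 0.0), (6.0, 0.5, 0.0, 0.0)]
F = structure_factor(atoms, 1, 0, 0)
amplitude, phase = abs(F), cmath.phase(F)
```

The result is a complex number: `abs(F)` is the amplitude and `cmath.phase(F)` is the phase angle of the reflection.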

    The Phase

    • From the measurement, we can only obtain the intensity I(hkl) of any given reflection (hkl)

    • The phase α(hkl) cannot be measured

    How to Determine the Phase

    • Small changes are introduced into the crystal of the protein of interest:

      • E.g., soaking the crystal in a solution containing a heavy-atom compound

    • A second diffraction data set needs to be collected

    • Comparing the two data sets allows the phases to be determined (and the heavy atoms to be located)

    Other Phase Determination Methods

    Electron Density Map

    • Once we know the complete diffraction pattern (amplitudes and phases), we need to calculate an image of the structure

    • The above equation returns the electron density (so we get a map of where the electrons are and their concentration)

    Interpretation of Electron Density

    • Now, the electron density has to be interpreted in terms of atom identities and positions.

    • (1): the packing of whole molecules in the crystal is shown

    • (2): a chain of seven amino acids is shown with the resulting structure superimposed

    • (3): the electron density of a tryptophan side chain is shown

    Refinement and the R-Factor

    Nuclear Magnetic Resonance

    • Concentrated protein solution (highly purified)

    • Magnetic field

    • Effect of radio frequencies on the resonance of different atoms is measured.

    NMR

    • Behavior of any atom is influenced by neighboring atoms

    • More closely spaced residues are more perturbed than distant residues

    • Distances can be calculated from the perturbation

    NMR spectrum of a protein


    Computational Molecular Biology

    Protein Structure: Secondary Prediction


    Primary Structure: Symbolic Definition

    • A = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} – set of symbols denoting the 20 amino acids

    • A* - set of all finite sequences formed out of elements of A, called protein sequences

    • Elements of A* are denoted by x, y, z, ..., i.e. we write x ∈ A*, y ∈ A*, z ∈ A*, etc.

    • PROTEIN PRIMARY STRUCTURE: any x ∈ A* is also called a protein sequence or protein sub-unit
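A minimal sketch of the membership test x ∈ A*, assuming A is the set of 20 standard one-letter amino acid codes. The names `AMINO_ACIDS` and `is_protein_sequence` are made up for this illustration.

```python
# Assumption: A holds the 20 standard one-letter amino acid codes.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")

def is_protein_sequence(x):
    """Return True if every symbol of x belongs to the alphabet A, i.e. x is in A*."""
    return all(symbol in AMINO_ACIDS for symbol in x)

print(is_protein_sequence("KELVLALYDYQEK"))
print(is_protein_sequence("KELXB"))
```

Since A* contains all finite sequences over A, the empty string also passes this check.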

    Protein Secondary Structure (PSS)

    • Secondary structure: the arrangement of the peptide backbone in space. It is produced by hydrogen bonding between amino acids

    • PROTEIN SECONDARY STRUCTURE consists of: protein sequence and its hydrogen bonding patterns called SS categories

    Protein Secondary Structure

    • Databases for protein sequences are expanding rapidly

    • The number of determined protein structures (PSS – protein secondary structures) is still small compared with the number of known protein sequences

    • PSSP (Protein Secondary Structure Prediction) research is trying to bridge this gap.

    Protein Secondary Structure

    • The most commonly observed conformations in secondary structure are:

      • Alpha Helix

      • Beta Sheets/Strands

      • Loops/Turns

    Turns and Loops

    • Secondary structure elements are connected by regions of turns and loops

    • Turns – short regions of non-α, non-β conformation

    • Loops – larger stretches with no secondary structure.

    Three secondary structure states

    • Prediction methods are normally assessed for 3 states:

      • H (helix)

      • E (strands)

      • L (others (loop or turn))

    Secondary Structure

    8 different categories:

    • H: α-helix

    • G: 3_10-helix

    • I: π-helix (extremely rare)

    • E: β-strand

    • B: β-bridge

    • T: turn

    • S: bend

    • L: the rest

    Three SS states: Reduction methods

    • Method 1, used by DSSP program:

      • H (helix) = {G (3_10-helix), H (α-helix)}

      • E (strands) = {E (β-strand), B (β-bridge)}

      • L = all the rest

      • In short: E, B => E; G, H => H; rest => L (also written C)

    • Method 2, used by STRIDE program:

      • H as in Method 1

      • E = {E (β-strand), b (isolated β-bridge)}

      • L = all the rest

    Three SS states: Reduction methods

    • Method 3, used by DEFINE program:

      • H(helix) as in Method 1

      • E (strands) = {E (β-strand)}

      • L = all the rest
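The three reduction methods can be sketched as lookup tables from 8-state symbols to the three states H, E, and L. The dictionary keys `method1` to `method3` and the function name are labels invented for this example; any symbol not listed in a table falls into L ("all the rest").

```python
# Each table lists only the symbols mapped to H or E; everything else becomes L.
REDUCTIONS = {
    "method1": {"H": "H", "G": "H", "E": "E", "B": "E"},
    "method2": {"H": "H", "G": "H", "E": "E", "b": "E"},
    "method3": {"H": "H", "G": "H", "E": "E"},
}

def reduce_states(ss8, method="method1"):
    """Convert a string of 8-state symbols into a 3-state (H/E/L) string."""
    table = REDUCTIONS[method]
    return "".join(table.get(symbol, "L") for symbol in ss8)

print(reduce_states("HGIEBTSL", "method1"))  # HHLEELLL
print(reduce_states("HGIEBTSL", "method3"))  # HHLELLLL
```

The choice of reduction matters when comparing published accuracies, because the same 8-state assignment yields different 3-state strings under different methods.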

    Example of typical PSS Data

    • Example:

      • Sequence

        • KELVLALYDYQEKSPREVTHKKGDILTLLNSTNKDWWKYEYNDRQGFVP

      • Observed SS

        • HHHHHLLLLEEEHHHLLLEEEEEELLLHHHHHHHHLLLEEEEEELLLHHH

    PSS: Symbolic Definition

    • Given A = {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} – set of symbols denoting amino acids and a protein sequence x ∈ A*

    • Let S = {H, E, L} be the set of symbols of 3 states: H (helix), E (strands) and L (loop), and let S* be the set of all finite sequences of elements of S.

    • We denote elements of S* by e, i.e. e ∈ S*

    PSS: Symbolic Definition

    • Any one-to-one function

      f : A* → S*, i.e. f ⊆ A* × S*,

      is called a protein secondary structure (PSS) identification function

    • An element (x, e) ∈ f is called a protein secondary structure (of the protein sequence x)

    • The element e ∈ S* (of (x, e) ∈ f) is called the secondary structure.

    PSSP

    • If a protein sequence shows clear similarity to a protein of known three dimensional structure

      • then the most accurate method of predicting the secondary structure is to align the sequences by standard dynamic programming algorithms

      • Why?

        • homology modelling is much more accurate than secondary structure prediction for high levels of sequence identity.

    PSSP

    • Secondary structure prediction methods are of most use when sequence similarity to a protein of known structure is undetectable.

    • It is important that there is no detectable sequence similarity between sequences used to train and test secondary structure prediction methods.

    Classification and Classifiers

    • Given a database table DB with a special attribute C, called a class attribute (or decision attribute), the values C1, C2, ..., Cn of the class attribute are called class labels.

    • Example:

    Classification and Classifiers

    • The attribute C partitions the records in the DB:

      • It divides the records into disjoint subsets defined by the values of attribute C, i.e. it CLASSIFIES the records.

      • This means we use the attribute C and its values to divide the set R of records of DB into n disjoint classes:

        C1 = {r ∈ DB: C = c1}, ..., Cn = {r ∈ DB: C = cn}

    • Example (from our table)

      C1 = { (1,1,m,g), (1,0,m,b)} = {r1,r3}

      C2 = { (0,1,v,g)} ={r2}
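The partition can be sketched as grouping records by the value of one column. Which column of the example table is the class attribute is not recoverable from the text here, so the sketch assumes it is the first one; that choice reproduces the classes {r1, r3} and {r2} listed above. The function name is made up for this illustration.

```python
from collections import defaultdict

def partition_by_class(records, class_index):
    """Group records into disjoint classes keyed by the class attribute's value."""
    classes = defaultdict(list)
    for record in records:
        classes[record[class_index]].append(record)
    return dict(classes)

# r1, r2, r3 from the example; column 0 is assumed to be the class attribute C.
records = [(1, 1, "m", "g"), (0, 1, "v", "g"), (1, 0, "m", "b")]
classes = partition_by_class(records, class_index=0)
for label, members in classes.items():
    print(label, members)
```

Every record lands in exactly one class, so the classes are disjoint and together cover the whole table.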

    Classification and Classifiers

    • An algorithm is called a classification algorithm if it uses the data and its classification to build a set of patterns.

    • Those patterns are structured in such a way that we can use them to classify unknown sets of objects, i.e. unknown records.

    • For that reason (because of the goal), the classification algorithm is often simply called a classifier.

    • The name classifier implies more than just a classification algorithm: a classifier is the final product of a data set and a classification algorithm.

    Classification and Classifiers

    • Building a classifier consists of two phases:

      training and testing.

    • In both phases we use data (a training data set and a test data set disjoint from it) for which the class labels are known for ALL of the records.

    • We use the training data set to create patterns

    • We evaluate the created patterns using the test data, whose classification is known.

    • The measure of a trained classifier's accuracy is called predictive accuracy.

    • The classifier is built, i.e. we terminate the process, once it has been trained and tested and its predictive accuracy is at an acceptable level.

    Classifiers Predictive Accuracy

    • PREDICTIVE ACCURACY of a classifier is the percentage of correctly classified records in the test data set.

    • Predictive accuracy depends heavily on the choice of the test and training data.

    • There are many methods of choosing test and training sets and hence of evaluating predictive accuracy. This is a separate field of research.

    Accuracy Evaluation

    • Use the training data to adjust the parameters of the method until it gives the best agreement between its predictions and the known classes

    • Use the testing data to evaluate how well the method works (without adjusting parameters!)

    • How do we report the performance?

    • Average accuracy = fraction of all test examples that were classified correctly

    Accuracy Evaluation

    • Multiple cross-validation tests have to be performed to exclude a potential dependency of the evaluated accuracy on the particular test set chosen

    • Jack-Knife:

      • Use 129 chains for setting up the tool (training set)

      • 1 for estimating the performance (testing)

      • This has to be repeated 130 times until each protein has been used once for testing

      • The average over all 130 tests gives an estimate of the prediction accuracy
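The jack-knife (leave-one-out) procedure can be sketched as a loop that holds out one item at a time. To keep the example small, each chain is reduced to a single label and the "classifier" is a toy that predicts the most common training label; `train_majority`, `jackknife_accuracy`, and the data are all hypothetical stand-ins, not a real PSSP method.

```python
from collections import Counter

def train_majority(training_set):
    """Toy training step: remember the most common label in the training data."""
    labels = Counter(label for _, label in training_set)
    return labels.most_common(1)[0][0]

def jackknife_accuracy(dataset, train, predict):
    """Train on all items but one, test on the held-out item, repeat for every item."""
    correct = 0
    for i, (item, label) in enumerate(dataset):
        training_set = dataset[:i] + dataset[i + 1:]
        model = train(training_set)
        if predict(model, item) == label:
            correct += 1
    return correct / len(dataset)

data = [("p1", "H"), ("p2", "H"), ("p3", "H"), ("p4", "E")]
accuracy = jackknife_accuracy(data, train_majority, lambda model, item: model)
print(accuracy)  # 0.75
```

With 130 chains the loop would run 130 times, each time training on the other 129.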

    PSSP Datasets

    • Historic RS126 dataset. Contains 126 sub-units with known secondary structure selected by Rost and Sander. It is no longer used today

    • CB513 dataset. Contains 513 sub-units with known secondary structure selected by Cuff and Barton in 1999. Used quite frequently in PSSP research

    • HS1771 dataset. Created by Hobohm and Scharf. In March 2002 it contained 1771 sub-units

    • Many authors have their own “secret” datasets

    Measures for PSSP accuracy

    • http://cubic.bioc.columbia.edu/eva/doc/measure_sec.html (for more information)

    • Q3: Three-state prediction accuracy (percentage of residues classified correctly)

    • Qi %obs: Of the residues observed in state i, what percentage were correctly predicted?

    • Qi %prd: Of the residues predicted to be in state i, what percentage were correctly predicted?

    Measures for PSSP Accuracy

    • A_{ij} = number of residues predicted to be in structure type j and observed to be in type i

    • Number of residues predicted to be in structure i: \sum_{j} A_{ji}

    • Number of residues observed to be in structure i: \sum_{j} A_{ij}

    Measures for SSP Accuracy

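    • The percentage of residues correctly predicted to be in class i relative to those observed to be in class i: Q_i^{\%obs} = 100 \cdot A_{ii} / \sum_{j} A_{ij}

    • The percentage of residues correctly predicted to be in class i relative to all residues predicted to be in class i: Q_i^{\%prd} = 100 \cdot A_{ii} / \sum_{j} A_{ji}

    • Overall 3-state accuracy: Q_3 = 100 \cdot \sum_{i} A_{ii} / N, where N is the total number of residues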
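A small sketch of these measures, built on a matrix where `A[i][j]` counts residues observed in state i and predicted in state j. The function names and the two short strings are made up for this example.

```python
STATES = "HEL"

def confusion_matrix(observed, predicted):
    """A[i][j] = number of residues observed in state i and predicted in state j."""
    A = {i: {j: 0 for j in STATES} for i in STATES}
    for o, p in zip(observed, predicted):
        A[o][p] += 1
    return A

def q3(A):
    """Overall three-state accuracy in percent."""
    total = sum(A[i][j] for i in STATES for j in STATES)
    return 100.0 * sum(A[i][i] for i in STATES) / total

def q_obs(A, i):
    """Percent of residues observed in state i that were predicted as i."""
    return 100.0 * A[i][i] / sum(A[i][j] for j in STATES)

def q_prd(A, i):
    """Percent of residues predicted as state i that were observed in i."""
    return 100.0 * A[i][i] / sum(A[j][i] for j in STATES)

observed = "HHHHEEEELL"
predicted = "HHHLEEELLL"
A = confusion_matrix(observed, predicted)
print(q3(A), q_obs(A, "H"), q_prd(A, "L"))  # 80.0 75.0 50.0
```

Note that `q_obs` and `q_prd` divide by zero if a state never occurs in the observed or predicted string; a fuller implementation would need to handle that case.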

    PSSP Algorithms

    • There are three generations in PSSP algorithms

      • First Generation: based on statistical information of single amino acids (1960s and 1970s)

      • Second Generation: based on windows (segments) of amino acids. Typically a window contains 11-21 amino acids (dominated the field until the early 1990s)

      • Third Generation: based on the use of windows on evolutionary information

    PSSP: First Generation

    • First generation PSSP systems are based on statistical information on a single amino acid

    • The most relevant algorithms:

      • Chou-Fasman, 1974

      • GOR, 1978

    • Both algorithms claimed a predictive accuracy of 74-78%, but when tested on better-constructed datasets their predictive accuracy proved to be about 50% (Nishikawa, 1983)

    Chou-Fasman method

    • Uses a table of conformational parameters determined primarily from measurements of known structures (obtained by experimental methods)

    • The table consists of one “likelihood” for each structure for each amino acid

    • Based on frequencies of residues in α-helices, β-sheets and turns

    • Notation: P(H): propensity to form alpha helices

    • f(i): probability of being in position 1 (of a turn); f(i+1), f(i+2), f(i+3) likewise for positions 2, 3, 4

    Chou-Fasman Pij-values

    Chou-Fasman

    • A prediction is made for each type of structure for each amino acid

      • Can result in ambiguity if a region has high propensities for both helix and sheet (higher value usually chosen)

    Chou-Fasman

    How it works:

    1. Assign all of the residues the appropriate set of parameters

    2. Identify α-helix and β-sheet regions. Extend the regions in both directions.

    3. If structures overlap, compare the average values for P(H) and P(E) and assign the secondary structure with the better score.

    4. Turns are calculated using 2 different probability values.

    Assign Pij values

    1. Assign all of the residues the appropriate set of parameters

    Scan peptide for α-helix regions

    2. Identify regions where 4 out of 6 contiguous residues have P(H) > 100: an “alpha-helix nucleus”
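The nucleus scan can be sketched as a sliding window over per-residue propensities. The function name, the half-open `(start, end)` convention, and the numbers in the example are assumptions for illustration; real values would come from the Chou-Fasman table.

```python
def find_nuclei(p, window=6, min_hits=4, threshold=100.0):
    """Return half-open (start, end) windows in which at least `min_hits`
    of `window` contiguous residues have a propensity above `threshold`."""
    nuclei = []
    for i in range(len(p) - window + 1):
        hits = sum(1 for value in p[i:i + window] if value > threshold)
        if hits >= min_hits:
            nuclei.append((i, i + window))
    return nuclei

# Hypothetical P(H) values for a 10-residue peptide.
ph = [90, 110, 120, 90, 130, 105, 80, 70, 60, 50]
print(find_nuclei(ph))  # [(0, 6), (1, 7)]
```

Calling the same function with `window=5, min_hits=3` on P(E) values gives the analogous scan for strand nuclei.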

    Extend α-helix nucleus

    3. Extend the helix in both directions until a set of four consecutive residues with P(H) < 100 is reached.

    • Find sum of P(H) and sum of P(E) in the extended region

      • If region is long enough ( >= 5 letters) and sum P(H) > sum P(E) then declare the extended region as alpha helix
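One possible reading of the extension rule, as a sketch: grow the region one residue at a time and stop in a direction once the next four residues all have P(H) below 100. The function names, the half-open region convention, and the handling of sequence ends (shorter windows are checked as they are) are assumptions of this example.

```python
def extend_region(p, start, end, threshold=100.0, window=4):
    """Grow the half-open region [start, end) in both directions.

    Growth in a direction stops once the next `window` residues in that
    direction are all below `threshold`, or the sequence ends.
    """
    n = len(p)
    while end < n and not all(v < threshold for v in p[end:end + window]):
        end += 1
    while start > 0 and not all(v < threshold for v in p[max(0, start - window):start]):
        start -= 1
    return start, end

def is_helix(ph, pe, start, end, min_length=5):
    """Declare a helix if the region is long enough and sum P(H) > sum P(E)."""
    return (end - start) >= min_length and sum(ph[start:end]) > sum(pe[start:end])

# Hypothetical propensities: a nucleus at residues 6..9 inside a helix-friendly stretch.
ph = [40] * 4 + [130] * 8 + [40] * 4
pe = [90] * 16
region = extend_region(ph, 6, 10)
print(region, is_helix(ph, pe, *region))  # (4, 12) True
```

Under this reading a region can pick up a few low-propensity residues at its edges, because extension only stops when a whole four-residue window is below the threshold.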

    Scan peptide for β-sheet regions

    4. Identify regions where 3 out of 5 contiguous residues have P(E) > 100: a “β-sheet nucleus”

    5. Extend the β-sheet until 4 contiguous residues with an average P(E) < 100 are reached

    6. If the region's average P(E) > 100 and the average P(E) > average P(H), then declare it a β-sheet

    Overlapping

    • Resolving overlapping alpha helix & beta sheet

      • Compute sum of P(H) and sum of P(E) in the overlap.

      • If sum P(H) > sum P(E) => alpha helix

      • If sum P(E) > sum P(H) => beta sheet
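The overlap rule reduces to comparing two sums over the overlapping residues. In this sketch the function name and the half-open index convention are assumptions, and a tie returns `None` because the rule above does not say how to break it.

```python
def resolve_overlap(ph, pe, start, end):
    """Assign the overlap [start, end) to helix ("H") or strand ("E")."""
    sum_h = sum(ph[start:end])
    sum_e = sum(pe[start:end])
    if sum_h > sum_e:
        return "H"
    if sum_e > sum_h:
        return "E"
    return None  # tie: not covered by the rule

# Hypothetical propensities for a three-residue overlap.
print(resolve_overlap([120, 110, 90], [100, 100, 130], 0, 3))  # E
```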

    Turn Prediction

    • An amino acid at position i is predicted as a turn if all of the following hold:

      • f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075

      • Avg(P(t)) > 100 over positions i+k, for k = 0, 1, 2, 3

      • Sum(P(t)) > Sum(P(H)) and Sum(P(t)) > Sum(P(E)) over positions i+k (k = 0, 1, 2, 3)
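The three turn conditions can be checked together for the four residues starting at position i. This is a sketch under assumptions: `f` is taken to be a list of 4-tuples, where `f[r][k]` is the frequency of residue r at turn position k, and `pt`, `ph`, `pe` are per-residue turn, helix, and strand propensities. The function name, data layout, and example numbers are all hypothetical.

```python
def is_turn(f, pt, ph, pe, i, cutoff=0.000075):
    """Check the three turn conditions for residues i, i+1, i+2, i+3."""
    if i + 4 > len(pt):
        return False  # not enough residues left for a four-residue turn
    positions = range(i, i + 4)
    product = 1.0
    for k, r in enumerate(positions):
        product *= f[r][k]  # frequency of residue r at turn position k
    sum_pt = sum(pt[r] for r in positions)
    sum_ph = sum(ph[r] for r in positions)
    sum_pe = sum(pe[r] for r in positions)
    return (product > cutoff
            and sum_pt / 4 > 100
            and sum_pt > sum_ph
            and sum_pt > sum_pe)

# Hypothetical values for a four-residue peptide.
f = [(0.1, 0.1, 0.1, 0.1)] * 4
pt, ph, pe = [120] * 4, [90] * 4, [80] * 4
print(is_turn(f, pt, ph, pe, 0))  # True
```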

    PSSP: Second Generation

    • Based on the information contained in a window of amino acids (11-21 aa.)

    • Most systems use algorithms based on:

      • Statistical information

      • Physico-chemical properties

      • Sequence patterns

      • Graph-theory

      • Multivariate statistics

      • Expert rules

      • Nearest-neighbour algorithms

    PSSP: First & Second Generation

    • Main problems:

      • Prediction accuracy <70%

        • SS assignments differ even between crystals of the same protein

        • SS formation is partially determined by long-range interactions, i.e., by contacts between residues that are not visible by any method based on windows of 11-21 adjacent residues


      • Prediction accuracy for β-strands is 28-48%, only slightly better than random

        • Beta-sheet formation is determined by more nonlocal contacts than alpha-helix formation

      • Predicted helices and strands are usually too short

        • Overlooked by most developers

    Example of Second Generation

    • Example of a typical second-generation secondary structure prediction.

    • The protein sequence (SEQ) is that of an SH3 structure.

    • The observed secondary structure (OBS) was assigned by DSSP (H = helix; E = strand; blank = non-regular structure; dashes indicate continuation).

    • The typical prediction of overly short segments (TYP) poses the following problems in practice:

      • (i) Are the residues predicted as strand in segments 1, 5, and 6 errors, or should the helices be elongated?

      • (ii) Should the 2nd and 3rd strands be joined, should one of them be ignored, or does the prediction indicate two strands here?

    • Note: the three-state per-residue accuracy of the prediction shown is 60%.
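
    Three-state per-residue accuracy is the percentage of residues whose predicted state (H, E or L) equals the observed state. A minimal sketch, where the function name q3 and the two example strings are made up for illustration and are not the SH3 data from the slide:

```python
def q3(observed: str, predicted: str) -> float:
    """Percentage of positions where predicted and observed states agree."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("empty assignment")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)


# Hypothetical ten-residue example: 7 of 10 states agree.
print(q3("HHHHLLLEEE", "HHLLLLLEEH"))
```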

    PSSP: Third Generation

    • PHD: First algorithm in this generation (1994)

    • Evolutionary information improves the prediction accuracy to 72%

    • Use of evolutionary information:

      1. Scan a database of known sequences with an alignment method to find similar sequences

      2. Filter the resulting list with a threshold to keep the most significant sequences

      3. Build amino acid exchange profiles from the probable homologs (the most significant sequences)

      4. Use the profiles in the prediction, i.e. in building the classifier

    My T. Thai

    mythai@cise.ufl.edu



    • Many of the second generation algorithms have been updated to the third generation


    • Thanks to the growth of protein databases, i.e. better evolutionary information, today’s prediction accuracy is ~80%

    • The maximum reachable accuracy is believed to be 88%. Why this conjecture?

    Why 88%?

    • SS assignments may vary between two versions of the same structure

      • Proteins are dynamic objects, with some regions more mobile than others

      • Assignments differ by 5-15% between different X-ray (NMR) versions of the same protein

      • Assignments differ by about 12% between structural homologues

    • B. Rost, C. Sander, and R. Schneider, Redefining the goals of protein secondary structure prediction, J. Mol. Biol.

    PSSP Data Preparation

    • Public protein data sets used in PSSP research contain protein secondary structure sequences. To use classification algorithms, we must transform the secondary structure sequences into classification data tables.

    • Records in the classification data tables are called (learning) instances in the PSSP literature.

    • The mechanism used in this transformation is called a window.

    • A window algorithm takes a secondary structure as input and returns a classification table: a set of instances for the classification algorithm.

    Window

    • Consider a secondary structure (x, e),

      where (x, e) = (x1x2…xn, e1e2…en)

    • A window of length w chooses a subsequence of length w of x1x2…xn and an element ei of e1e2…en corresponding to a special position in the window, usually the middle

    • The window moves along the sequences

      x = x1x2…xn and e = e1e2…en

      simultaneously, starting at the beginning and moving to the right one letter at a time.

    Window: Sequence to Structure

    • Such a window is called a sequence-to-structure window. We will call it a window for short.

    • The process terminates when the window or its middle position reaches the end of the sequence x.

    • The pair (subsequence, element of e), often written in the form

      subsequence → H, E or L

      is called an instance, or a rule.

    Example: Window

    • Consider a secondary structure (x, e) and a window of length 5 with the special position in the middle

    • The first position of the window is:

      x = A R N S T V V S T A A ….

      e = H H H H L L L E E E

    • The window returns the instance:

      • A R N S T → H


    • The second position of the window is:

      x = A R N S T V V S T A A ….

      e = H H H H L L L E E E

    • The window returns the instance:

      • R N S T V → H

    • The next instances are:

      • N S T V V → L

      • S T V V S → L

      • T V V S T → L
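
    The sliding window in this example can be reproduced with a short generator. This is a sketch only: the name window_instances is an assumption, the special position is taken to be the middle of the window, and the process is stopped when the window no longer fits inside the shorter of x and e.

```python
from typing import Iterator, Tuple


def window_instances(x: str, e: str, w: int = 5) -> Iterator[Tuple[str, str]]:
    """Yield (subsequence, label) pairs; the label is taken at the window middle."""
    mid = w // 2
    n = min(len(x), len(e))  # only positions with a known label are used
    for i in range(0, n - w + 1):
        yield x[i:i + w], e[i + mid]


x = "ARNSTVVSTAA"
e = "HHHHLLLEEE"
for subsequence, label in window_instances(x, e):
    print(" ".join(subsequence), "->", label)
```

    The first five instances printed are the ones listed above: A R N S T → H, R N S T V → H, N S T V V → L, S T V V S → L, T V V S T → L.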

    Symbolic Notation

    • Let f be a protein secondary structure (PSS) identification function, where A is the amino-acid alphabet and S = {H, E, L}:

    • f : A* → S*, i.e. f ⊆ A* × S*

    • Let x = x1x2…xn, e = e1e2…en, and f(x) = e. We define

    • f(x1x2…xn)|{xi} = ei, i.e. f(x)|{xi} = ei

    Example: Semantics of Instances

    • Let

    • x = A R N S T V V S T A A ….

    • e = H H H H L L L E E E

    • Assume that the window returns the instance:

    • A R N S T → H

    • The semantics of the instance is:

    • f(x)|{N} = H,

    • where f is the identification function, N is preceded by A R and followed by S T, and the window has length 5

    Classification Data Base (Table)

    • We build the classification table with attributes being the positions p1, p2, p3, p4, p5, …, pw

      in the window, where w is the length of the window.

      The corresponding attribute values are the elements of the subsequence at the given positions.

    • The classification attribute is S, with values in the set {H, E, L} assigned by the window operation (instance, rule).

    • The classification table for our example (first few records) is the following.

    Classification Table (Example)

    • x = A R N S T V V S T A A ….

    • e = H H H H L L L E E E

      p1  p2  p3  p4  p5  S
      A   R   N   S   T   H
      R   N   S   T   V   H
      N   S   T   V   V   L
      S   T   V   V   S   L
      T   V   V   S   T   L

    The semantics of a record r = r(p1, p2, p3, p4, p5, S) is: f(x)|{Vp3} = VS

    where Va denotes the value of attribute a.

    Size of classification datasets (tables)

    • The window mechanism produces very large datasets

    • For example, a window of size 13 applied to the CB513 dataset of 513 protein subunits produces about 70,000 records (instances)

    Window

    • Window has the following parameters:

    • PARAMETER 1: i ∈ N+, the starting point of the window as it moves along the sequence x = x1x2…xn. The value i = 1 means that the window starts at x1; i = 5 means that the window starts at x5

    • PARAMETER 2: w ∈ N+, the size (length) of the window.

    • For example, the PHD system of Rost and Sander (1994) uses two window sizes: 13 and 17.


    • PARAMETER 3: p ∈ {1, 2, …, w},

      where p is the special position of the window that returns the classification attribute value from S = {H, E, L} and w is the size (length) of the window

    • PSSP PROBLEM:

      find the optimal size w and optimal special position p for the best prediction accuracy

    Window: Symbolic Definition

    • Window Arguments: window parameters and secondary structure (x,e)

    • Window Value: (subsequence of x, element of e)

    • OPERATION (sequence-to-structure window)

      W is a partial function

      W: N+ × N+ × {1,…, k} × (A* × S*) → A* × S

      W(i, k, p, (x,e)) = (xi x(i+1)… x(i+k-1), f(x)|{x(i+p-1)}) where (x,e) = (x1x2…xn, e1e2…en)
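
    The window operation can be written as a small partial function. In this sketch, i and p are 1-based as in the parameter definitions, and the special position p is counted from 1 inside the window, so the labelled residue is x(i+p-1). Returning None where W is undefined is an implementation choice made here.

```python
from typing import Optional, Tuple


def W(i: int, k: int, p: int, x: str, e: str) -> Optional[Tuple[str, str]]:
    """Sequence-to-structure window; returns None where W is undefined."""
    if i < 1 or k < 1 or not (1 <= p <= k) or len(x) != len(e):
        return None
    if i + k - 1 > len(x):
        return None  # the window would run past the end of x
    subsequence = x[i - 1:i - 1 + k]   # x_i ... x_(i+k-1), converted to 0-based
    label = e[i + p - 2]               # label of x_(i+p-1), converted to 0-based
    return subsequence, label


x = "ARNSTVVSTA"
e = "HHHHLLLEEE"
print(W(1, 5, 3, x, e), W(3, 5, 3, x, e), W(7, 5, 3, x, e))
```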

    Neural network models

    • Machine learning approach

    • Provide training sets of structures (e.g. α-helices, non-α-helices)

    • Networks are trained to recognize patterns in known secondary structures

    • Provide a test set (proteins with known structures)

    • Accuracy ~70-75%

    Reasons for improved accuracy

    • Align the sequence with other related proteins of the same protein family

    • Find members that have a known structure

    • If there are significant matches between structure and sequence, assign secondary structures to the corresponding residues

    3 State Neural Network

    Neural Network

    Input Layer

    • Most approaches set w = 17. Why?

      • Based on evidence of statistical correlation with secondary structure as far as 8 residues on either side of the prediction point

    • The input layer consists of:

      • 17 blocks, each representing one position of the window

      • Each block has 21 units:

        • The first 20 units represent the 20 aa

        • One unit provides a null input, used when the moving window overlaps the amino- or carboxyl-terminal end of the protein

    Binary Encoding Scheme

    • Example:

    • Let w = 5, and say we have the sequence:

      A E G K Q….

    • Then the input layer is (one row per window position, one column per amino acid):

         A C D E F G … N P Q R S T V W Y
      A: 1 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0
      E: 0 0 0 1 0 0 … 0 0 0 0 0 0 0 0 0
      G: 0 0 0 0 0 1 … 0 0 0 0 0 0 0 0 0
      …
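
    A sketch of this binary encoding, using 21 units per window position (20 amino acids in alphabetical one-letter order plus one null unit for positions that fall off either end of the protein). The names ALPHABET and encode_window, and centring the window on a given index, are assumptions made for illustration.

```python
from typing import List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids, one-letter codes


def encode_window(seq: str, center: int, w: int = 17) -> List[int]:
    """One-hot encode the w residues centred on index `center` (0-based)."""
    half = w // 2
    units: List[int] = []
    for pos in range(center - half, center + half + 1):
        block = [0] * 21
        if 0 <= pos < len(seq):
            block[ALPHABET.index(seq[pos])] = 1
        else:
            block[20] = 1  # null unit: the window overlaps a terminus
        units.extend(block)
    return units


v = encode_window("AEGKQ", center=2, w=5)
print(len(v), sum(v))
```

    For w = 17 the same function returns 17 × 21 = 357 input units.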

    Hidden Layer

    • Represents the structure of the central aa

    • Encoding scheme:

      • Can use two units to represent:

        • (1,0) = H, (0,1) = E, (0,0) = L

      • Some use three units:

        • (1,0,0) = H, (0,1,0) = E, (0,0,1) = L

    • Each connection is assigned a weight value.

    • The weights are adjusted during training to best fit the data

    Output Level

    • Based on the hidden level and some function f, calculate the output.

    • Helix (H) is assigned to any group of 4 or more contiguous residues

      • having helix output values greater than sheet outputs and greater than some threshold t

    • Strand (E) is assigned to any group of 2 or more contiguous residues having sheet output values greater than helix outputs and greater than t

    • Otherwise, the residue is assigned to L

    • Note that t can be adjusted as well (training)
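
    The assignment rule can be sketched as a two-pass procedure: first label each residue by comparing its helix and sheet outputs with the threshold, then keep only runs that are long enough. The name assign_states and the default t = 0.5 are assumptions; the slide leaves t as a trainable parameter.

```python
from typing import List, Sequence


def assign_states(h_out: Sequence[float], e_out: Sequence[float],
                  t: float = 0.5) -> str:
    """Turn per-residue helix/sheet outputs into a string over {H, E, L}."""
    raw: List[str] = []
    for h, e in zip(h_out, e_out):
        if h > e and h > t:
            raw.append("H")
        elif e > h and e > t:
            raw.append("E")
        else:
            raw.append("L")
    min_len = {"H": 4, "E": 2}  # minimum run lengths from the rule above
    result = ["L"] * len(raw)
    i = 0
    while i < len(raw):
        j = i
        while j < len(raw) and raw[j] == raw[i]:
            j += 1  # j ends one past the current run of identical labels
        state = raw[i]
        if state != "L" and j - i >= min_len[state]:
            result[i:j] = [state] * (j - i)
        i = j
    return "".join(result)


helix = [0.9, 0.8, 0.7, 0.9, 0.1, 0.1, 0.2, 0.8, 0.1]
sheet = [0.1, 0.1, 0.2, 0.0, 0.9, 0.8, 0.1, 0.1, 0.9]
print(assign_states(helix, sheet))
```

    In the example, the isolated helix-like residue and the single strand-like residue at the end are too short to count and fall back to L.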

    How PHD works

    Step 1. BLAST search with input sequence

    Step 2. Perform multiple sequence alignment and calculate aa frequencies for each position
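
    The per-position frequencies of Step 2 could be computed along these lines. The sketch assumes the alignment is already available as equal-length strings with "-" for gaps and that gaps are ignored when counting; the function name column_frequencies and the toy alignment are illustrative, and the BLAST search itself is not shown.

```python
from collections import Counter
from typing import Dict, List


def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Return, for each alignment column, the relative frequency of each aa."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile: List[Dict[str, float]] = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile


# Toy alignment of five sequences and two columns (hypothetical data).
toy_alignment = ["NK", "SK", "SR", "AK", "A-"]
print(column_frequencies(toy_alignment))
```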


    Step 3. First Level: “Sequence to structure net”

    Input: alignment profile, Output: units for H, E, L

    Calculate the “occurrence” of each residue in an α-helix, β-strand, or loop.

    [Figure: alignment positions 1-7; example profile column N=0.2, S=0.4, A=0.4; example output values H = 0.05, E = 0.18, L = 0.67]


    Step 3. Second Level: “Structure to structure net”

    Input: First Level values, Output: units for H, E, L

    Window size = 17

    [Figure: example values H = 0.59, E = 0.09, L = 0.31; E = 0.18]

    Step 4. Decision level

    Prepare Data for PHD Neural Nets

    • Starting from a sequence of unknown structure (SEQUENCE), the following steps are required to feed evolutionary information into the PHD neural networks:

      • a database search for homologues (method: BLAST),

      • a refined profile-based dynamic-programming alignment of the most likely homologues (method: MaxHom),

      • a decision on which proteins will be considered homologues (length-dependent cut-off for pairwise sequence identity),

      • a final refinement and extraction of the resulting multiple alignment.

    • Numbers 1-3 indicate the points where users of the PredictProtein service can intervene to improve prediction accuracy without changing the final prediction method, PHD.

    • http://cubic.bioc.columbia.edu/papers/2000_rev_humana/paper.html

    PHD Neural Network

    Prediction Accuracy

    Where can I learn more?

    Protein Structure Prediction Center

    Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory, Livermore, CA

    http://predictioncenter.llnl.gov/Center.html

    DSSP

    Dictionary of Protein Secondary Structure (secondary structure assignment from 3D coordinates)

    http://www.sander.ebi.ac.uk/dssp/


    Computational Molecular Biology

    Protein Structure: Tertiary Prediction via Threading


    Objective

    • Study the problem of predicting the tertiary structure of a given protein sequence

    A Few Examples

    [Figure: pairs of actual and predicted structures]


    Two Comparative Modeling Approaches

    • Homology modeling: identify homologous proteins through sequence alignment; predict structure by placing residues into the “corresponding” positions of homologous structure models

    • Protein threading: predict structure by identifying a “good” sequence-structure fit

    • We will focus on protein threading.

    Why it Works?

    • Observations:

      • Many protein structures in the PDB are very similar

        • E.g., many 4-helical bundles, globins… in the set of solved structures

    • Conjecture:

      • There are only a limited number of “unique” protein folds in nature

    Threading Method

    • General Idea:

      • Try to determine the structure of a new sequence by finding its best “fit” to some fold in a library of structures

    • Sequence-Structure Alignment Problem:

      • Given a solved structure T for a sequence t1t2…tn and a new sequence S = s1s2… sm, we need to find the “best match” between S and T

    What to Consider

    • How to evaluate (score) a given alignment of S with a structure T?

    • How to efficiently search over all possible alignments?

    Three Main Approaches

    • Protein Sequence Alignment

    • 3D Profile Method

    • Contact Potentials

    Protein Sequence Alignment Method

    • Align two sequences S and T

    • If in the alignment, si aligns with tj, assign si to the position pj in the structure

    • Advantages:

      • Simple

    • Disadvantages:

      • Similar structures have lots of sequence variability, thus sequence alignment may not be very helpful

    3D Profile Method

    • Actually uses structural information

    • Main idea:

      • Reduce the 3D structure to a 1D string describing the environment of each position in the protein, called the 3D profile of the fold

      • To determine if a new sequence S belongs to a given fold T, we align the sequence with the fold’s 3D profile

    • First question: How to create the 3D profile?

    Create the 3D Profile

    • For a given fold, do:

      1. For each residue, determine:

        • How buried is it?

        • Fraction of surrounding environment that is polar

        • What secondary structure is it in (alpha-helix, beta-sheet, or neither)


    2. Assign an environment class to each position:

    Six classes describe the burial and polarity criteria (exposed, partially buried, very buried, different fractions of polar environment)


    • These environment classes depend on the number of surrounding polar residues and how buried the position is.

    • Each of these can occur in 3 secondary structure states, giving 6 × 3 = 18 environment classes


    3. Convert the known structure T to a string E of environment descriptors

    4. Align the new sequence S with E using dynamic programming

    Scores for Alignment

    • Need scores for aligning individual residues with environments.

    • Key: different amino acids prefer different environments, so scores are determined from statistical data


    • Choose a database of known structures

    • Tabulate the number of times a particular residue is seen in a particular environment class, then compute a score for each (environment class, aa) pair

    • Choose gap penalties, e.g. one may charge more for gaps in alpha and beta environments…

    Alignment

    • This gives us a table of scores for aligning an aa sequence with an environment string

    • Using this scoring and Dynamic Programming, we can find an optimal alignment and score for each fold in our library

    • The fold with the highest score is the best fold for the new sequence
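
    The alignment and fold selection can be sketched with a standard global dynamic-programming recurrence. Everything specific here is an assumption for illustration: the function names, a single linear gap penalty (rather than environment-dependent penalties), the two toy environment classes "B" (buried) and "X" (exposed), and the score values.

```python
from typing import Dict, Sequence, Tuple

Score = Dict[Tuple[str, str], float]  # (environment class, amino acid) -> score


def align_to_profile(seq: str, envs: Sequence[str], score: Score,
                     gap: float = -2.0) -> float:
    """Best global alignment score of a sequence against an environment string."""
    m, n = len(seq), len(envs)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = dp[i - 1][j - 1] + score[(envs[j - 1], seq[i - 1])]
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[m][n]


def best_fold(seq: str, library: Dict[str, Sequence[str]], score: Score,
              gap: float = -2.0) -> str:
    """Name of the library fold whose 3D profile scores highest for seq."""
    return max(library, key=lambda name: align_to_profile(seq, library[name], score, gap))


toy_score = {("B", "L"): 2.0, ("B", "K"): -1.0, ("X", "L"): -1.0, ("X", "K"): 2.0}
toy_library = {"foldA": ["B", "X", "B"], "foldB": ["X", "B", "X"]}
print(best_fold("LKL", toy_library, toy_score))
```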

    Contact Potentials Method

    • Take 3D structure into account more carefully

    • Include information about how residues interact with each other

      • Consider pairwise interactions between the positions pi, pj in the fold

      • For a given alignment, produce a score which is the sum over these interactions:

    Problem

    • Have a sequence from the database T = t1…tn with known positions p1…pn, and a new sequence S = s1…sm.

    • Find 1 <= r1 < r2 < … < rn <= m that maximize the sum of pairwise interaction scores,

      where ri is the index of the aa in S which occupies position pi

    • This problem is NP-complete for pairwise interactions

    How to Define that Score?

    • Use so-called “knowledge-based potentials”, which come from databases of observed interactions.

    • The general form:

    How to Define the Score

    • General Idea:

      • Define cutoff parameter for “contact” (e.g. up to 6 Angstroms)

      • Use the PDB to count up the number of times aa i and j are in contact

    • Several methods exist for normalization, e.g. normalizing by hypothetical random frequencies
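
    One possible way to turn contact counts into a potential is sketched below. It assumes each structure is given as a list of (amino acid, 3D coordinate) pairs, counts every pair within the cutoff (including sequence neighbours, which a real implementation would probably exclude), and uses the log-odds form -ln(observed / expected) with expected counts derived from amino-acid composition. That formula is one common convention assumed here, not necessarily the general form shown on the slides.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Residue = Tuple[str, Tuple[float, float, float]]  # (amino acid, coordinates)


def count_contacts(structure: List[Residue], cutoff: float = 6.0) -> Counter:
    """Count residue-type pairs whose coordinates lie within `cutoff` angstroms."""
    counts: Counter = Counter()
    for (a, pa), (b, pb) in combinations(structure, 2):
        if math.dist(pa, pb) <= cutoff:
            counts[tuple(sorted((a, b)))] += 1
    return counts


def contact_potential(counts: Counter,
                      composition: Dict[str, float]) -> Dict[Tuple[str, str], float]:
    """-ln(observed / expected), with expected counts from random pairing."""
    total = sum(counts.values())
    potential: Dict[Tuple[str, str], float] = {}
    for (a, b), observed in counts.items():
        expected = total * composition[a] * composition[b] * (1 if a == b else 2)
        potential[(a, b)] = -math.log(observed / expected)
    return potential


# Hypothetical five-residue structure: three leucines, two lysines.
toy_structure = [("L", (0.0, 0.0, 0.0)), ("L", (4.0, 0.0, 0.0)),
                 ("K", (20.0, 0.0, 0.0)), ("K", (23.0, 0.0, 0.0)),
                 ("L", (0.0, 5.0, 0.0))]
toy_counts = count_contacts(toy_structure)
print(contact_potential(toy_counts, {"L": 0.6, "K": 0.4}))
```

    Pairs that are never observed in contact get no entry in this sketch, since the logarithm of zero is undefined; handling them needs a further convention such as pseudocounts.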

    Other Variations

    • Many other variations in defining the potentials

    • In addition to pairwise potentials, consider single residue potentials

    • Distance-dependent intervals:

      • Counting up pairwise contacts separately for intervals within 1 Angstrom, between 1 and 2 Angstroms…

    Threading via Tree-Decomposition

    Contact Graph

    • Each residue is a vertex

    • There is an edge between two residues if their spatial distance is within a given cutoff.

    • Cores are the most conserved segments in the template

    Simplified Contact Graph

    Alignment Example

    Calculation of Alignment Score

    Graph Labeling Problem

    • Each core is a vertex

    • Two cores interact if there is an interaction between any two residues, one in each core

    • Add one edge between two cores that interact.

    [Figure: example core-interaction graph G]

    • Each possible sequence alignment position for a single core can be treated as a possible label assignment to a vertex in G

    • Let D[i] be the set of all possible label assignments to vertex i.

    • Then for each label assignment A(i) in D[i], we have:

    Tree Decomposition

    Tree Decomposition [Robertson & Seymour, 1986]

    Greedy: minimum degree heuristic

    • Choose the vertex with minimum degree

    • The chosen vertex and its neighbors form a component

    • Add an edge between every pair of neighbors of the chosen vertex

    • Remove the chosen vertex

    • Repeat the above steps until the graph is empty
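
    The greedy steps above can be sketched as follows. The graph is assumed to be an adjacency dictionary without self-loops, ties in degree are broken alphabetically (a choice made here, not stated on the slide), and the function only returns the components in elimination order; connecting them into a tree is not shown.

```python
from typing import Dict, FrozenSet, List, Set


def min_degree_components(graph: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Greedy minimum-degree elimination; returns one component per vertex."""
    adj = {v: set(nbrs) for v, nbrs in graph.items()}  # work on a copy
    components: List[FrozenSet[str]] = []
    while adj:
        # Choose the vertex with minimum degree (alphabetical tie-break).
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj[v]
        # The chosen vertex and its neighbours form a component.
        components.append(frozenset(nbrs | {v}))
        # Connect the neighbours pairwise, then remove the chosen vertex.
        for u in nbrs:
            adj[u] |= nbrs - {u}
            adj[u].discard(v)
        del adj[v]
    return components


cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
bags = min_degree_components(cycle)
print(bags, "width:", max(len(bag) for bag in bags) - 1)
```

    Components produced late in the elimination are often subsets of earlier ones and could be dropped; the sketch keeps them so that the output mirrors the five steps one to one.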

    Tree Decomposition (Cont’d)

    [Figure: the example graph and its tree decomposition, with components abd, acd, cdem, defm, fgh, clk, eij]


    Tree Decomposition-Based Algorithms

    1. Bottom-to-Top: calculate the minimal F function

    2. Top-to-Bottom: extract the optimal assignment

    [Figure: a tree decomposition rooted at Xr; labels mark the score of component Xi and the scores of the subtrees rooted at Xi, Xj, and Xl]
