- 101 Views
- Uploaded on

Download Presentation
## CSE182-L6

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

Possible domain queries

- Case 1:
- You have a collection of sequences that belong to a family (contain a functional domain).
- Given an ‘orphan’ sequence, does it belong to the family?
- There are different solutions depending upon the representation of the domain (patterns/alignments/HMM/profiles)
- Case 2:
- You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them (Harder problem).

CSE182

EX: Innexins

- The Macagno lab is studying Gap junction proteins, Innexins (invertebrate analogs of connexins) in Hirudo
- Innexins have been found in C. elegans, and Drosophila.
- In C. elegans, 25 members of this family have been found, and partially categorized.

CSE182

Innexins in Hirudo

- When certain Innexins are knocked out, they cause serious defects in cells in the ganglia.
- The EST database (partial gene sequences) contains a number of putative Innexins, discovered via BLAST.
- Project:
- Q: Can you confirm that these are Innexins. Can you find more members? (this lecture)
- Q: Can you characterize them w.r.t known innexins in C. elegans, and Drosophila?
- Q: Use your method for other families of interest. Netrins, and their receptors.

CSE182

Protein sequence motifs

- Premise:
- The sequence of a protein sequence gives clues about its structure and function.
- Not all residues are equally important in determining function.
- Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
- The key residues can be identified if we had structural information, or through conserved residues in an alignment of the family.

CSE182

Representation of domains/families.

- We will consider a number of representations that describe key residues, characteristic of a family
- Patterns (regular expressions)
- Alignments
- Profiles
- HMMs
- Start with the following:
- A collection of sequences with the same function.
- Region/residues known to be significant for maintaining structure and function.
- Develop a pattern of conserved residues around the residues of interest
- Iterate for appropriate sensitivity and specificity

CSE182

From alignment to patterns

*

ALRDFATHDDF

SMTAEATHDSI

ECDQAATHEAS

ATH-[DE]

- Search a database with the resulting pattern
- Refine pattern to eliminate false positives
- Iterate

CSE182

Regular Expression Patterns

- Zinc Finger motif
- C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
- 2 conserved C, and 2 conserved H
- How can we search a database using these motifs?
- The motif is described using a regular expression. What is a regular expression?

CSE182

Regular Expressions

- Concise representation of a set of strings over alphabet .
- Described by a string over
- R is a r.e. if and only if

CSE182

Regular Expression

- Q: Let ={A,C,E}
- Is (A+C)*EEC* a regular expression?
- Is *(A+C) regular?

- Q: When is a string s in a regular expression?
- R =(A+C)*EEC*
- Is CEEC in R?
- AEC?
- ACEE?

CSE182

Regular Expression & Automata

- Every R.E can be expressed by an automaton (a directed graph) with the following properties:
- The automaton has a start and end node
- Each edge is labeled with a symbol from , or

- Suppose R is described by automaton A
- S R if and only if there is a path from start to end in A, labeled with s.

CSE182

Regular Expression Matching

- Given a database D, and a regular expression R, is a substring of D in R?

- Is there a string D[l..c] that is accepted by the automaton of R?

- Simpler Q: Is D[1..c] accepted by the automaton of R?

CSE182

Alg. For matching R.E.

- If D[1..c] is accepted by the automaton RA
- There is a path labeled D[1]…D[c] that goes from START to END in RA

D[1]

D[2]

D[c]

CSE182

Alg. For matching R.E.

- If D[1..c] is accepted by the automaton RA
- There is a path labeled D[1]…D[c] that goes from START to END in RA
- There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END

u

D[1] .. D[c-1]

D[c]

CSE182

D.P. to match regular expression

u

v

- Define:
- A[u,] = Automaton node reached from u after reading
- Eps(u): set of all nodes reachable from node u using epsilon transitions.
- N[c] = subset of nodes reachable from START node after reading D[1..c]
- Q: when is v N[c]

u

Eps(u)

CSE182

D.P. to match regular expression

- Q: when is v N[c]?
- A: If for some u N[c-1], w = A[u,D[c]],
- v {w}+ Eps(w)

CSE182

Algorithm

CSE182

The final step

- We have answered the question:
- Is D[1..c] accepted by R?
- Yes, if END N[c]
- We need to answer
- Is D[l..c] (for some l, and some c) accepted by R

CSE182

Representation 2: Profiles

- Profiles versus regular expressions
- Regular expressions are intolerant to an occasional mis-match.
- The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members.
- Profiles capture some of these ideas.

CSE182

Profiles

- Start with an alignment of strings of length m, over an alphabet A,
- Build an |A| X m matrix F=(fki)
- Each entry fki represents the frequency of symbol k in position i

0.71

0.14

0.28

0.14

CSE182

Profiles

- Start with an alignment of strings of length m, over an alphabet A,
- Build an |A| X m matrix F=(fki)
- Each entry fki represents the frequency of symbol k in position i

0.71

0.14

0.28

0.14

CSE182

Scoring matrices

i

- Given a sequence s, does it belong to the family described by a profile?
- We align the sequence to the profile, and score it
- Let S(i,j) be the score of aligning position i of the profile to residue sj
- The score of an alignment is the sum of column scores.

s

sj

CSE182

Domain analysis via profiles

- Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences.
- What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile?

CSE182

Psi-BLAST idea

- Iterate:
- Find homologs using Blast on query
- Discard very similar homologs
- Align, make a profile, search with profile.
- Why is this more sensitive?

Seq Db

CSE182

- If profile of length m must score >= T
- Then, a sub-profile of length l must score >= lT|/m
- Generate all l-mers that score at least lT|/M
- Search using an automaton

- Multiple alignment:
- Use ungapped multiple alignments only

- Two time consuming steps.
- Multiple alignment of homologs
- Searching with Profiles.
- Does the keyword search idea work?

CSE182

Representation 3: HMMs

- Question:
- your ‘friend’ likes to gamble.
- He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar.
- Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin.
- Can you say what fraction of the times he loads the coin?

CSE182

Representation 3: HMMs

- Building good profiles relies upon good alignments.
- Difficult if there are gaps in the alignment.
- Psi-BLAST/BLOCKS etc. work with gapless alignments.
- An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework.

V

CSE182

The generative model

- Think of each column in the alignment as generating a distribution.
- For each column, build a node that outputs a residue with the appropriate distribution

0.71

Pr[F]=0.71

Pr[Y]=0.14

0.14

CSE182

A simple Profile HMM

- Connect nodes for each column into a chain. Thie chain generates random sequences.
- What is the probability of generating FKVVGQVILD?
- In this representation
- Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S]
- What is the difference with Profiles?

CSE182

Profile HMMs can handle gaps

- The match states are the same as on the previous page.
- Insertion and deletion states help introduce gaps.
- A sequence may be generated using different paths.

CSE182

Example

A L - L

A I V L

A I - L

- Probability [ALIL] is part of the family?
- Note that multiple paths can generate this sequence.
- M1I1M2M3
- M1M2I2M3
- In order to compute the probabilities, we must assign probabilities of transition between states

CSE182

Profile HMMs

- Directed Automaton M with nodes and edges.
- Nodes emit symbols according to ‘emission probabilities’
- Transition from node to node is guided by ‘transition probabilities’
- Joint probability of seeing a sequence S, and path P
- Pr[S,P|M] = Pr[S|P,M] Pr[P|M]
- Pr[ALIL AND M1I1M2M3]

= Pr[ALIL| M1I1M2M3,M] Pr[M1I1M2M3|M]

- Pr[ALIL | M] = ?

CSE182

Protein structure basics

CSE182

Side chains determine amino-acid type

- The residues may have different properties.
- Aspartic acid (D), and Glutamic Acid (E) are acidic residues

CSE182

Various constraints determine 3d structure

- Constraints
- Structural constraints due to physiochemical properties
- Constraints due to bond angles
- H-bond formation
- Surprisingly, a few conformations are seen over and over again.

CSE182

Alpha-helix

- 3.6 residues per turn
- H-bonds between 1st and 4th residue stabilize the structure.
- First discovered by Linus Pauling

CSE182

Beta-sheet

- Each strand by itself has 2 residues per turn, and is not stable.
- Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
- Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

CSE182

Domains

- The basic structures (helix, strand, loop) combine to form complex 3D structures.
- Certain combinations are popular. Many sequences, but only a few folds

CSE182

3D structure

- Predicting tertiary structure is an important problem in Bioinformatics.
- Premise: Clues to structure can be found in the sequence.
- While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals.
- The PDB database is a compendium of structures

PDB

CSE182

Searching structure databases

- Threading, and other 3d Alignments can be used to align structures.
- Database filtering is possible through geometric hashing.

CSE182

Nobel Citation 2002

CSE182

Nobel Citation, 2002

CSE182

Mass Spectrometry

CSE182

Download Presentation

Connecting to Server..