CSE182-L6

Dictionary matching

Pattern matching

CSE182

Today, we might look at regular expressions (R. Expr.)
  • In Assignment 1, you were asked to look for all mouse sequences.
  • One way is to build a Perl regular expression out of all the possibilities:
  • MATCH [Mm]us OR [Mm]ouse, but DO NOT match [Ll][Ii][Kk][Ee]
  • How can we do these searches? Are they relevant to bioinformatics?

Stay tuned
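
As a rough sketch of the search described on this slide, the snippet below applies the two character-class patterns to a few header lines. Python is used here in place of Perl, and the exclusion of "like" is done with a second pattern rather than inside one expression. The function name `is_mouse_header` and the sample headers are made up for illustration; the actual format of the assignment data is not shown in these slides.

```python
import re

# The two patterns come from the slide; everything else is an illustrative assumption.
MOUSE = re.compile(r"[Mm]us|[Mm]ouse")
LIKE = re.compile(r"[Ll][Ii][Kk][Ee]")


def is_mouse_header(line):
    """Return True if the line mentions mus/mouse and does not contain 'like'."""
    return MOUSE.search(line) is not None and LIKE.search(line) is None


if __name__ == "__main__":
    # Hypothetical header lines, not taken from the assignment.
    for header in [">seq1 Mus musculus kinase",
                   ">seq2 mouse-LIKE protein",
                   ">seq3 Homo sapiens kinase"]:
        print(header, is_mouse_header(header))
```

Note that `[Mm]us` also matches inside longer words such as "muscle", so a real filter would probably need word boundaries or a stricter pattern.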

An O(n) algorithm for keyword matching
  • Start with the first position in the database (db) and the root node.
  • If the transition succeeds
    • Increment the ‘current’ pointer
    • Move to the new node
    • If it is a terminal node, report “success”
  • Else (if at the root)
    • Increment the ‘current’ pointer
    • Move the ‘start’ pointer up to ‘current’
    • Stay at the root
  • Else
    • Move the ‘start’ pointer forward
    • Move to the failure node
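
The loop above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the names `build_automaton` and `scan` are invented here, positions are 0-based, and the keywords POTATO, POTASSIUM and TASTE are an assumed dictionary. Like the slide, the sketch reports a hit only when it lands on a terminal node, so it assumes that no keyword is a proper substring of another (true, for example, when all keywords have the same length).

```python
from collections import deque


def build_automaton(patterns):
    """Build the keyword trie plus failure links (node 0 is the root)."""
    goto, depth, terminal = [{}], [0], [None]
    for p in patterns:
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({})
                depth.append(depth[v] + 1)
                terminal.append(None)
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        terminal[v] = p
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # children of the root fail to the root
    while queue:                          # breadth-first, so shallower links exist
        v = queue.popleft()
        for ch, w in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[w] = goto[f].get(ch, 0)
            queue.append(w)
    return goto, fail, depth, terminal


def scan(text, patterns):
    """Return (start position, keyword) for every hit, following the slide's loop."""
    goto, fail, depth, terminal = build_automaton(patterns)
    hits = []
    l = c = 0        # 'start' and 'current' pointers
    v = 0            # current node
    while c < len(text):
        ch = text[c]
        if ch in goto[v]:                 # successful transition
            v = goto[v][ch]
            c += 1
            if terminal[v] is not None:
                hits.append((l, terminal[v]))
        elif v == 0:                      # failed at the root
            c += 1
            l = c
        else:                             # failed elsewhere: jump to the failure node
            v = fail[v]
            l = c - depth[v]
    return hits


if __name__ == "__main__":
    print(scan("POTASTPOTATO", ["POTATO", "POTASSIUM", "TASTE"]))
```

Throughout the scan the invariant `c - l == depth[v]` holds, which is why the failure step can recompute `l` from the depth of the failure node.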

Failure function
  • Every node v corresponds to a string s_v that is a prefix of some pattern.
  • Define F[v] to be the node u ≠ v such that s_u is the longest proper suffix of s_v that is itself a prefix of some pattern.
  • If we fail to match at v, we should jump to F[v] and resume matching from there.
  • Let lp[v] = |s_u|.
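
To make the definition concrete, here is a small brute-force sketch that computes F and lp directly from the definition, identifying each node v with its string s_v. It is meant for checking small examples by hand, not as the efficient construction. The function name `failure_table` and the keyword set are assumptions made for illustration.

```python
def failure_table(patterns):
    """Map each non-empty prefix s_v to (s_u, lp[v]) by direct search."""
    # Every trie node corresponds to one prefix; "" is the root.
    prefixes = {p[:i] for p in patterns for i in range(len(p) + 1)}
    table = {}
    for s in prefixes:
        if not s:
            continue
        # Try proper suffixes from longest to shortest; "" always succeeds.
        for k in range(1, len(s) + 1):
            suffix = s[k:]
            if suffix in prefixes:
                table[s] = (suffix, len(suffix))
                break
    return table


if __name__ == "__main__":
    table = failure_table(["POTATO", "POTASSIUM", "TASTE"])
    for node in ["POTAS", "POTAT", "TAST", "P"]:
        print(node, "->", table[node])
```

With this assumed dictionary, the node for POTAS fails to the node for TAS (lp = 3), and the node for POTAT fails to the node for T (lp = 1).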

[Figure: keyword trie with nodes n1 to n10 and single-letter edge labels; v marks a node]

Illustration
  • What is F(n10)?
  • What is F(n5)?
  • F(n3)?
  • Lp(n10)?

[Figure: keyword trie with nodes n1 to n10 and single-letter edge labels; v marks a node]

Illustration

Text: P O T A S T P O T A T O

Successive frames of the scan, giving the start pointer l and the current pointer c (positions counted from 1):

  • l = 1, c = 1
  • l = 1, c = 2
  • l = 1, c = 6
  • l = 3, c = 6
  • l = 3, c = 7
  • l = 7, c = 7
  • l = 7, c = 8
  • l = 7, c = 7

[Figure: the keyword trie, repeated in each frame, with v marking the current node]

Time analysis
  • In each step, either c is incremented or l is incremented.
  • Neither pointer is ever decremented: a failure jump moves l to c - lp[v], and lp[v] < c - l.
  • Neither l nor c exceeds n.
  • Total time <= 2n steps.
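
One way to spell out the bound is a potential argument. The sketch below assumes the pointers start at l = c = 1, as in the illustration, and that the scan stops once c has passed position n, so that l ≤ c ≤ n + 1 throughout. The symbols Φ_t (the potential after t steps) and T (the total number of steps) are introduced here for the argument and do not appear in the slides.

```latex
\begin{align*}
\Phi_t &= l_t + c_t, \qquad \Phi_0 = 2 \\
\Phi_{t+1} &\ge \Phi_t + 1
  \quad \text{(every step advances } l \text{ or } c \text{, and neither moves back)} \\
\Phi_T &\le 2(n+1) \\
T &\le \Phi_T - \Phi_0 \le 2n
\end{align*}
```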

[Figure: pointers l and c marked on the text P O T A S T P O T A T O]

BLAST: Putting it all together
  • Input: query of length m, database of size n
  • Select word size, scoring matrix, gap penalties, E-value cutoff
  • Run BLAST

BLAST steps
  • Generate an automaton of all query keywords.
  • Scan the database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
  • Extend each hit using a variant of the “local alignment” algorithm, using the scoring matrix and gap penalties.
  • For each alignment with score S, compute the E-value and the P-value. Report alignments in order of increasing E-value until the cut-off is reached.
  • Output results.
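
A heavily simplified sketch of the first two steps and the scoring step is given below. It departs from the slides in ways worth stating plainly: it looks up exact query words in a hash table rather than running an automaton, it ignores neighbouring words and the extension step, and the statistical parameters K and lambda are passed in by the caller because their values depend on the scoring system. The E-value uses the standard Karlin-Altschul form E = K·m·n·exp(-λS), and P = 1 - exp(-E). All function names are invented for this example.

```python
import math
from collections import defaultdict


def query_words(query, w):
    """Index every length-w word of the query by its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_hits(query, database, w):
    """Return (query position, database position) for every exact word hit."""
    index = query_words(query, w)
    hits = []
    for j in range(len(database) - w + 1):
        for i in index.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments scoring at least `score`."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance alignment, given E-value e."""
    return 1.0 - math.exp(-e)


if __name__ == "__main__":
    print(find_hits("ACGT", "TTACGTT", 3))
```

Each hit would then be handed to the extension step, and the resulting alignment scores converted with `e_value` and `p_value` before sorting.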

BLAST output
  • Look up Blast Results with RID
    • HA5YXH5C012

[Figure: sequences A, B and C]

Protein Sequence Analysis
  • What can you do if BLAST does not return a hit?
    • Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
  • A: Accept hits at a higher E-value.
    • This increases the probability that the sequence similarity is a chance event.
    • How can we get around this paradox?
    • Reformulated Q: suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish the two cases?

Silly Quiz

Skin patterns

Facial Features

Not all features (residues) are important

Skin patterns

Facial Features

[Figure: a family Fam(B) together with sequences A and C]

Protein sequence motifs
  • Premise:
    • The sequence of a protein gives clues about its structure and function.
    • Not all residues are equally important in determining function.
  • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
  • How can we identify these key residues?

Prosite
  • In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors ( 3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.

Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, “The PROSITE database, its status in 1999”

Basic idea
  • It is a heuristic approach. Start with the following:
    • A collection of sequences with the same function.
    • Region/residues known to be significant for maintaining structure and function.
  • Develop a pattern of conserved residues around the residues of interest
  • Iterate for appropriate sensitivity and specificity

Proteins containing zinc finger (zf) domains

How can we find a motif corresponding to a zf domain?

From alignment to regular expressions

ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS

[In the slide, * marks a column of the alignment]

Pattern: ATH-[DE]

  • Search Swissprot with the resulting pattern
  • Refine pattern to eliminate false positives
  • Iterate
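
The first step, turning a block of aligned columns into a pattern, can be sketched as below. The function name `columns_to_pattern` and the choice of columns are assumptions for illustration. The output uses PROSITE-style notation, in which every position is separated by a dash, so the slide's ATH-[DE] comes out as A-T-H-[DE].

```python
def columns_to_pattern(alignment, start, end):
    """Build a PROSITE-style pattern from alignment columns start..end-1.

    A column with a single residue becomes that letter; a column with
    several residues becomes a bracketed set such as [DE].
    """
    elements = []
    for col in range(start, end):
        residues = sorted({seq[col] for seq in alignment})
        if len(residues) == 1:
            elements.append(residues[0])
        else:
            elements.append("[" + "".join(residues) + "]")
    return "-".join(elements)


if __name__ == "__main__":
    alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
    # Columns 5 to 8 (0-based) hold the conserved block in this toy alignment.
    print(columns_to_pattern(alignment, 5, 9))
```

Searching a database with the result and pruning false positives, as in the bullets above, would then decide whether neighbouring columns should be added to the pattern.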

The sequence analysis perspective
  • Zinc Finger motif
    • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
    • 2 conserved C, and 2 conserved H
  • How can we search a database using these motifs?
    • The motif is described using a regular expression. What is a regular expression?
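
As a concrete illustration of searching with such a motif, the sketch below translates a subset of PROSITE pattern syntax into a Python regular expression. It handles only the constructs that appear in the zinc finger pattern plus `{...}` exclusions (single residues, `x`, bracketed sets, and `(n)` or `(n,m)` repeats); anchors and other PROSITE features are not covered. The function names and the test sequence are invented for this example.

```python
import re

# One pattern element: a residue, x, [set] or {excluded set}, with an optional repeat.
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a dash-separated PROSITE-style pattern into regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"

if __name__ == "__main__":
    print(prosite_to_regex(ZINC_FINGER))
    # Artificial sequence built to contain one match starting at position 2.
    seq = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
    print(find_motif(ZINC_FINGER, seq))
```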

Regular Expressions
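  • Concise representation of a set of strings over an alphabet Σ.
  • Described by a string over Σ together with operator symbols (+, ·, * and parentheses).
  • R is a r.e. if and only if it is built by the following rules, where R1 and R2 are regular expressions:
    • R = {ε}
    • R = {σ}, σ ∈ Σ
    • R = R1 + R2
    • R = R1 · R2
    • R = R1*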

Regular Expression
  • Q: Let ={A,C,E}
    • Is (A+C)*EEC* a regular expression?
    • *(A+C)?
    • AC*..E?
  • Q: When is a string s in a regular expression?
    • R =(A+C)*EEC*
    • Is CEEC in R?
    • AEC?
    • ACEE?
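
The membership questions can be checked mechanically. The sketch below leans on Python's `re` module, rewriting the slide's `+` (union) as `|`; this shortcut is an assumption that works for expressions like the one here, which use only letters, parentheses, `+` and `*`. The helper name `in_language` is made up.

```python
import re


def in_language(r, s):
    """Return True if string s belongs to the set described by r.

    r uses the slide's notation, where + means union.
    """
    return re.fullmatch(r.replace("+", "|"), s) is not None


if __name__ == "__main__":
    R = "(A+C)*EEC*"
    for s in ["CEEC", "AEC", "ACEE"]:
        print(s, in_language(R, s))
```

CEEC and ACEE are in R, while AEC is not, because R requires two consecutive E symbols.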

Regular Expression & Automata
  • Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
    • The automaton has a start node and an end node.
    • Each edge is labeled with a symbol from Σ, or with ε.
  • Suppose R is described by automaton A.
    • s ∈ R if and only if there is a path from start to end in A labeled with s.
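
The path condition can be tested by tracking the set of nodes reachable after each symbol. The sketch below is one possible encoding, not the course's: edges are (from, label, to) triples, a label of `None` stands for ε, and the automaton `EDGES` for (A+C)*EEC* is written by hand with node numbers chosen for this example.

```python
def accepts(edges, start, end, s):
    """Return True if some path from start to end is labeled with s."""

    def closure(nodes):
        # Add every node reachable through epsilon (None-labeled) edges.
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for a, label, b in edges:
                if a == u and label is None and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    current = closure({start})
    for ch in s:
        current = closure({b for a, label, b in edges
                           if a in current and label == ch})
    return end in current


# Hand-built automaton for (A+C)*EEC*: node 0 is the start, node 3 the end.
EDGES = [(0, "A", 0), (0, "C", 0), (0, "E", 1),
         (1, "E", 2), (2, None, 3), (3, "C", 3)]

if __name__ == "__main__":
    for s in ["CEEC", "AEC", "ACEE"]:
        print(s, accepts(EDGES, 0, 3, s))
```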

Examples: Regular Expression & Automata
  • (A+C)*EEC*

[Figure: automaton for (A+C)*EEC*, with start and end nodes and edges labeled A, C, E, E and C]

Constructing automata from R.E

  • R = {}
  • R = {},   
  • R = R1 + R2
  • R = R1 · R2
  • R = R1*
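
The five cases translate directly into a recursive construction. The sketch below follows the usual Thompson-style approach of gluing sub-automata together with ε edges; the slides do not say which construction the course uses, so treat this as one possible realisation. The tuple encoding of expressions, the names `build` and `matches`, and the use of `None` for ε are all choices made for this example.

```python
import itertools


def build(r, edges, fresh):
    """Add an automaton for r to `edges` and return its (start, end) nodes.

    r is a nested tuple: ("eps",), ("sym", a), ("alt", r1, r2),
    ("cat", r1, r2) or ("star", r1). A label of None stands for epsilon.
    """
    start, end = next(fresh), next(fresh)
    kind = r[0]
    if kind == "eps":                       # R = {epsilon}
        edges.append((start, None, end))
    elif kind == "sym":                     # R = {sigma}
        edges.append((start, r[1], end))
    elif kind == "alt":                     # R = R1 + R2
        for sub in r[1:]:
            a, b = build(sub, edges, fresh)
            edges += [(start, None, a), (b, None, end)]
    elif kind == "cat":                     # R = R1 . R2
        a1, b1 = build(r[1], edges, fresh)
        a2, b2 = build(r[2], edges, fresh)
        edges += [(start, None, a1), (b1, None, a2), (b2, None, end)]
    elif kind == "star":                    # R = R1*
        a, b = build(r[1], edges, fresh)
        edges += [(start, None, end), (start, None, a),
                  (b, None, a), (b, None, end)]
    else:
        raise ValueError(f"unknown node kind: {kind}")
    return start, end


def matches(r, s):
    """Build the automaton for r and test whether it has a path labeled s."""
    edges = []
    start, end = build(r, edges, itertools.count())

    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for a, label, b in edges:
                if a == u and label is None and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    current = closure({start})
    for ch in s:
        current = closure({b for a, label, b in edges
                           if a in current and label == ch})
    return end in current


# (A+C)*EEC* written as a tree of the five cases.
A, C, E = ("sym", "A"), ("sym", "C"), ("sym", "E")
R = ("cat", ("cat", ("cat", ("star", ("alt", A, C)), E), E), ("star", C))

if __name__ == "__main__":
    for s in ["CEEC", "AEC", "ACEE"]:
        print(s, matches(R, s))
```

The automaton produced this way has many more ε edges than a hand-drawn one, but it accepts the same strings.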

End of L6

Side chains determine amino-acid type
  • The residues may have different properties.
  • Aspartic acid (D), and Glutamic Acid (E) are acidic residues

Various constraints determine 3d structure
  • Constraints
    • Structural constraints due to physicochemical properties
    • Constraints due to bond angles
    • H-bond formation
  • Surprisingly, a few conformations are seen over and over again.

Alpha-helix
  • 3.6 residues per turn
  • H-bonds between residue i and residue i+4 stabilize the structure.
  • First discovered by Linus Pauling

Beta-sheet
  • Each strand by itself has 2 residues per turn, and is not stable.
  • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
  • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

Domains
  • The basic structures (helix, strand, loop) combine to form complex 3D structures.
  • Certain combinations are popular. Many sequences, but only a few folds

3D structure
  • Predicting tertiary structure is an important problem in Bioinformatics.
  • Premise: Clues to structure can be found in the sequence.
  • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals.
  • The PDB database is a compendium of structures

Searching structure databases
  • Threading and other 3D alignment methods can be used to align structures.
  • Database filtering is possible through geometric hashing.

Trivia Quiz
  • What research won the Nobel prize in Chemistry in 2004?
  • In 2002?

Sample Preparation

Enzymatic Digestion (Trypsin) + Fractionation

Single Stage MS

Mass Spectrometry

LC-MS: 1 MS spectrum / second

Tandem MS

Secondary Fragmentation

Ionized parent peptide
