
Pairwise Sequence Analysis-I


upton


  1. Pairwise Sequence Analysis-I
     • Analogous Concepts
     • Domain Concepts
     • Dot-matrix Analysis
     • Pairwise Sequence Analysis
       • Why is it useful?
       • What are the underlying concepts?
       • What algorithms are used?
       • Limitations/Open questions
     Lecture 2 CS566

  2. Concepts
     • Pairwise “everything” analysis
       • Humans are experts at pairwise comparison
       • “Peas of a pod”
       • Basis for pecking order in society
       • Basis for notion of fairness (“He got a better deal than I did!”)
       • Basis for grouping
       • Basis for sorting/ranking algorithms
       • Basis for search strategies

  3. Concepts
     • Identity versus similarity
       • Brand name versus generic products
       • Coincidence versus plagiarism
       • Co-discoveries versus “Microsoft Explorer”
     • “What you see is not what you got”
       • Process not evident from product
     • “If it ain’t broke, don’t fix it”
       • Why messy code that works is left to remain messy

  4. Domain Concepts
     • Sequences change over generations
       • More changes in DNA than in proteins (Why?)
     • Causes of change
       • Error rates of replication ~ 1 in a billion
       • Exchange
         • Legitimate: Recombination
         • Illegitimate: Lateral transfer (a la Napster)
       • Mutation
         • Exposure to chemicals

  5. Dot-matrix Analysis
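A dot matrix places one sequence along the rows and the other along the columns, and marks each cell where the residues agree; similar regions show up as diagonal runs. The sketch below is one minimal way to build such a matrix. The function names `dot_matrix` and `render`, and the `window`/`threshold` noise filter, are illustrative assumptions and are not taken from the lecture.

```python
def dot_matrix(a: str, b: str, window: int = 1, threshold: int = 1) -> list[list[bool]]:
    """Rows follow `a`, columns follow `b`.

    Cell (i, j) is True when the length-`window` stretches starting at
    a[i] and b[j] agree in at least `threshold` positions.
    (window/threshold filtering is an assumed, commonly used noise filter.)
    """
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append(matches >= threshold)
        matrix.append(row)
    return matrix


def render(a: str, b: str, matrix: list[list[bool]]) -> str:
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    header = "  " + b[: len(matrix[0])] if matrix else ""
    body = [
        a[i] + " " + "".join("*" if cell else "." for cell in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join([header] + body)


if __name__ == "__main__":
    print(render("GATTACA", "GATTACA", dot_matrix("GATTACA", "GATTACA")))
```

Comparing a sequence with itself gives a full main diagonal; off-diagonal dots are repeats or chance matches, which a larger `window` tends to suppress.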

  6. Why (Motivation)
     • Given sequence A, what are its properties?
       • Find any sequence(s) B similar to A, where B ∈ {Sequences with known properties}
     • Given sequences A & B, are they
       • Identical?
       • Similar?
       • Evolutionarily related?
     • Given sequence A, is this a new discovery?
     • Given a set of sequences, can they be clustered into families of related sequences?

  7. Components
     • Compare the two sequences and assign score
     • Evaluate the score
     • Use of a scoring scheme to assign a value to a candidate alignment
     • Finding the best alignment between two sequences
     • Use of a probability model to assess the significance of the similarity (coincidence or plagiarism?)
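One simple way to ask "coincidence or plagiarism?" is to compare the observed score with scores obtained after shuffling one sequence. The sketch below is an empirical shuffle test, not the probability model used in the lecture or in any particular tool. The names `ungapped_identity` and `empirical_p_value` are hypothetical, and the toy score ignores gaps entirely.

```python
import random
from typing import Callable


def ungapped_identity(a: str, b: str) -> int:
    """Toy score (assumption): positions where the two sequences agree, no gaps."""
    return sum(x == y for x, y in zip(a, b))


def empirical_p_value(
    a: str,
    b: str,
    score_fn: Callable[[str, str], int],
    n_shuffles: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of shuffled versions of `b` that score at least as well as `b`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = score_fn(a, b)
    letters = list(b)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        if score_fn(a, "".join(letters)) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    print(empirical_p_value("ACGTACGTACGTACGT", "ACGTACGTACGTACGT", ungapped_identity))
```

Shuffling preserves the letter composition of `b`, so a small value suggests the similarity comes from the order of the residues and not merely from shared composition.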

  8. Concepts - Alignment
     • Sequences may be aligned ‘locally’ (look for regions/subsequences that are similar) or ‘globally’ (align along entire length)
     • The result of a local alignment may differ from that of a global alignment (Why?)
     • Potentially large number of alignments to be considered
     • Each alignment is a path through a fully connected dot matrix graph, i.e., a subset of the set of all diagonal edges, connected by horizontal or vertical lines.
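To get a feel for "potentially large number of alignments", one can count the paths from the top-left to the bottom-right corner of the matrix when each step is diagonal, horizontal, or vertical. The sketch below counts those paths with a small dynamic program (these counts are the Delannoy numbers). The function name `count_alignment_paths` is hypothetical, and some texts count alignments slightly differently, for example by treating adjacent gaps in either order as the same alignment.

```python
def count_alignment_paths(m: int, n: int) -> int:
    """Number of corner-to-corner paths in an (m+1) x (n+1) grid using
    diagonal, horizontal, and vertical unit steps."""
    row = [1] * (n + 1)  # with zero residues of A, only horizontal steps are possible
    for _ in range(m):
        new = [1]  # first column: only vertical steps
        for j in range(1, n + 1):
            # arrive from above, from the left, or along the diagonal
            new.append(row[j] + new[j - 1] + row[j - 1])
        row = new
    return row[n]


if __name__ == "__main__":
    for length in (1, 2, 3, 10, 50):
        print(length, count_alignment_paths(length, length))
```

The count grows exponentially with sequence length, which is why exhaustive enumeration is not an option and dynamic programming or heuristics are used instead.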

  9. Concepts – Scoring scheme
     • Scores are based on amino-acid substitution matrices (log-odds ratio of observed versus expected random substitutions)
     • PAM (Point Accepted Mutation) matrices: Based on an evolutionary model. E.g., PAM1, PAM250.
     • BLOSUM (Blocks substitution) matrices: Built from blocks of aligned sequences clustered by percent identity. E.g., BLOSUM50, BLOSUM62.
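A log-odds score compares how often two residues are observed aligned with how often they would pair by chance given their background frequencies. The sketch below shows the arithmetic only. The function name `log_odds_score` is hypothetical, the frequencies in the demo are invented for illustration and are not real PAM or BLOSUM data, and the default `scale=2.0` (half-bit units) is an assumption about the scaling convention.

```python
import math


def log_odds_score(q_ab: float, p_a: float, p_b: float, scale: float = 2.0) -> int:
    """Rounded, scaled log-odds: scale * log2(q_ab / (p_a * p_b)).

    q_ab : observed frequency of residues a and b aligned together
    p_a, p_b : background frequencies of a and b
    """
    return round(scale * math.log2(q_ab / (p_a * p_b)))


if __name__ == "__main__":
    # Hypothetical numbers: the pair occurs 4x more often than chance predicts.
    print(log_odds_score(q_ab=0.04, p_a=0.1, p_b=0.1))
    # Hypothetical numbers: the pair occurs 4x less often than chance predicts.
    print(log_odds_score(q_ab=0.0025, p_a=0.1, p_b=0.1))
```

Positive scores mark substitutions seen more often than expected by chance, negative scores mark those seen less often, and a score near zero means the pairing carries no evidence either way.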

  10. Concepts – Scoring scheme
     • Identity-based scoring (match XOR non-match)
         Isawa---dancer
         Isawatapdancer

         Isawa----danc-er
         Isawapinkpanther
     • Similarity-based scoring (partial matching)
         Eyl dkv
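Identity-based scoring can be applied column by column to the two example alignments on this slide: each column is a match, a mismatch, or a gap. The sketch below tallies those three cases. The function name `identity_score` is hypothetical, and the alignment rows are the slide's strings with the spaces removed, which is an assumption about the intended layout.

```python
def identity_score(row_a: str, row_b: str, gap: str = "-") -> dict[str, int]:
    """Count matches, mismatches, and gap columns in an already-aligned pair."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return {"matches": matches, "mismatches": mismatches, "gaps": gaps}


if __name__ == "__main__":
    print(identity_score("Isawa---dancer", "Isawatapdancer"))
    print(identity_score("Isawa----danc-er", "Isawapinkpanther"))
```

The first pair has 11 matches and 3 gap columns; the second has 9 matches, 2 mismatches, and 5 gap columns. Under identity scoring the two mismatches count simply as non-matches, whereas a similarity-based scheme would grade them by how alike the two symbols are.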

  11. Algorithms
     • Optimal/Exact solutions:
       • Take longer to run
       • Typically used for comparing a small number of sequences
       • (Needleman-Wunsch; Smith-Waterman)
     • Heuristic:
       • Frequently close to the optimal solution
       • Rapid
       • Typically used for 1:n searches involving a large number of sequences
       • (BLAST, FASTA)
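The two exact methods named above are both dynamic programs over the same matrix; they differ in initialization and in whether a cell may reset to zero. The sketch below computes scores only (no traceback) with a simple scheme of match +1, mismatch -1, and a linear gap penalty of -1. Those parameter values and the function names are assumptions for illustration, not the lecture's settings.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score (both sequences aligned end to end)."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalized
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str, match: int = 1,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score (best-scoring pair of substrings)."""
    best = 0
    prev = [0] * (len(b) + 1)  # an alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 = restart
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch_score("Isawadancer", "Isawatapdancer"))
    print(smith_waterman_score("xxxACGTyyy", "zzACGTzz"))
```

Both functions fill a matrix with len(a) × len(b) cells, so time grows with the product of the lengths; that cost is what makes heuristics attractive for 1:n database searches.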

  12. Open Questions/Limitations
     Highly similar sequences or highly dissimilar sequences are easily found BUT
     • Gray area lies in weak similarity; probability distribution is continuous
     • Other forms of similarity (e.g., structural or species similarity) may be necessary to reach a decision
     • Sequence similarity per se does not guarantee functional equivalence; a small difference can be responsible for a large difference in function
