1 / 27

Parsing

Parsing. Analyze text: split it into meaningful units, tokens Extract relevant information , disregard irrelevant information ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for? Phone book: meaningful tokens are words, numbers

jade-gamble
Download Presentation

Parsing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parsing • Analyze text: split it into meaningful units, tokens • Extract relevant information, disregard irrelevant information • ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for? • Phone book: meaningful tokens are words, numbers • Search phone book for all people named “Ole Hansen” • Search phone book for phone numbers starting with 86

  2. Blast • Program for finding similarities between biological sequences • Different versions, one is blastn which compares DNA sequences with DNA sequences • Input: • Fasta file with query sequences • Formatted Fasta file with database sequences • Sensitivity parameter • Output: • Each query compared to each database sequence

  3. Example run Query file: arachis.fasta Database file: arabidopsis.fasta Format the database: formatdb –i arabidopsis.fasta –p F –o T Command: /users/chili/BiRC/MolBio/Blast-2.2.6/blastall -p blastn-e 0.000000002-d arabidopsis.fasta-i arachis.fasta-o arachis_arab.bn

  4. Example output BLASTN 2.2.6 [Apr-09-2003] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= CL5Contig1 (797 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done ***** No hits found ****** ..

  5. Example output Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 gb|AV519674.1|AV519674 AV519674 Arabidopsis 68 1e-09 gb|AV557401.1|AV557401 AV557401 Arabidopsis 42 3e-05 gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis 43 1e-04 ..

  6. Example output >gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009 Score = 69.9 bits (35), Expect = 3e-10 Identities = 47/51 (92%) Strand = Plus / Plus Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117 ||||||||||||||||| ||||| ||||| |||||||| |||||||||||| Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826 .. General form of output: Repetitions of (query, subject matches, alignments)

  7. Extract information from blast output • Extract the best hit for each query sequence Query= CL69Contig1 (372 letters) .. Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 ..

  8. Algorithm • Read blast output file line by line • Introduce two states: • Looking for next query • Looking for hit list • Return dictionary of query  best hit

  9. First state: Looking for next query Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Look for a line starting with Query= (the = is important!)

  10. Why we look for Query= and not just Query >gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis Length = 1009 Score = 69.9 bits (35), Expect = 3e-10 Identities = 47/51 (92%) Strand = Plus / Plus Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117 ||||||||||||||||| ||||| ||||| |||||||| |||||||||||| Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826 ..

  11. Second state: Looking for hit list Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Case A: hits were found

  12. Case B: no hits were found Query= CL5Contig1 (797 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done ***** No hits found ****** Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 ..

  13. Second state: Looking for hit list Query= CL69Contig1 (372 letters) Database: arabidopsis_nucleotides.fasta 1,004,711 sequences; 901,539,077 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10 gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10 .. Look for a line starting with Searching Then read a few more lines to distinguish case A/B and read best hit

  14. blastparser.py (part 1) Find the query ID: Query= CL69Contig1

  15. blastparser.py (part 2) Find the best match ID: gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

  16. Test >>> from blastparser import parseBlastallOutput >>> d = parseBlastallOutput(“arachis_arab.bn”) >>> d[“gi|30419745”] ‘gb|BP625785.1|BP625785’ >>> d[“gi|30419753”] ‘none’

  17. Two pass-parsing Evolutionary tree of life (animal kingdom) • Huge hierarchy of groups and subgroups • Each node in the tree has a name and a (possibly empty) list of descendant trees (sons) Source: The origin and evolution of model organisms, Nature Genetics, Nov. 2002, vol. 3.

  18. Abstract data structure to represent a general tree (not necessarily binary) tree.py

  19. How can we write a tree to a sequential file? • Informally: • A tree is a labeled node containing a (possibly empty) list of other trees • Write tree node using start and end tags: <N=“Insects”> [sons] </N> • Formally (context-free grammar): T→ <N=“L”>S</N> S→λ| TS L→string label Insects Flies Beetles B C A E D

  20. Recursive method for string representing of tree Insects Flies Beetles B C A E D tree.py First obtain string representation of sons (empty string if no sons) by calling function recursively.. .. then create string with start tag, label, sons’ representation, and end tag .. <N=“Beetles”><N=“C”></N><N=“D”></N><N=“E”></N></N> ..

  21. Larger tree – How can we read a tree from a sequential file? <N="Terrestrialvertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N> part_of_the_tree_of_life.txt We need a parser!

  22. Two-pass parsing Complex parsing is often split in two passes: • Lexical analysis • Identify and assemble tokens: logical units of text • Structural analysis • Determine the structural hierarchy of the tokens

  23. Lexical analysis tree Match either a start tag or an end tag Define a group containing the start tag’s label phylogenyparser.py (part 1) Search text from index pointer Create token of right type Move index pointer

  24. Structural analysis 1 2 3 phylogenyparser.py (part 2) .. <N="Kangaroo"></N><N="Koala"></N>.. current_node current_node current_node new_node Kangaroo 1 2 3

  25. Terrestrial vertebrates Turtle Testudines Reptilia Lizard Synapsida Lepidosauromorpha Diapsida Snake Therapsida Archosauromorpha Mammalia Owl Kangaroo Eutheria Penguin Marsupilia Tyrannosaurus Proboscidea Koala Primates Elephant Carnivora Human Wolf Gorilla Walrus Chimpanzee

  26. Testprogram phylogenyparser.py

  27. Navigating in the tree Name: Diapsida Father: Reptilia Siblings: Testudines Sons: Archosauromorpha Lepidosauromorpha (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b Number of sibling (0-0)? 0 Name: Testudines Father: Reptilia Siblings: Diapsida Sons: Turtle (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p <N="Testudines"><N="Turtle"></N></N> Name: Testudines Father: Reptilia Siblings: Diapsida Sons: Turtle (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f Name: Reptilia Father: Terrestrial vertebrates Siblings: Synapsida Sons: Diapsida Testudines Turtle Testudines Reptilia Lepidosauromorpha Diapsida Archosauromorpha

More Related