Parsing
This presentation is the property of its rightful owner.
Sponsored Links
1 / 28

Parsing PowerPoint PPT Presentation


  • 128 Views
  • Uploaded on
  • Presentation posted in: General

Parsing. Analyze text: split it into meaningful units, tokens Extract relevant information , disregard irrelevant information ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?. Blast. Program package for finding similarities between biological sequences

Download Presentation

Parsing

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Parsing

Parsing

  • Analyze text: split it into meaningful units, tokens

  • Extract relevant information, disregard irrelevant information

  • ‘Meaningful’ and ‘relevant’ depend on application: what are we looking for?


Blast

Blast

  • Program package for finding similarities between biological sequences

  • blastn compares DNA sequences with DNA sequences

  • Input:

    • Fasta file with query sequences

    • Formatted Fasta file with database sequences

    • Sensitivity parameter (and more)

  • Output:

    • Result of comparing each query to each database sequence


Example run

Example run

Query file: arachis.fasta

Database file: arabidopsis_nucleotides.fasta

Format the database: formatdb –i arabidopsis.fasta –p F –o T

Command:

/users/chili/usr/blast-2.2.13/bin/blastall -p blastn-e 0.000000002-d arabidopsis.fasta-i arachis.fasta-o arachis_arab.bn


Example output query with no match

Example output – query with no match

BLASTN 2.2.6 [Apr-09-2003]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Query=CL5Contig1

(797 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

..


Example output query with matches

Example output – query with matches

Query=CL69Contig1

(372 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

gb|AV519674.1|AV519674 AV519674 Arabidopsis 68 1e-09

gb|AV557401.1|AV557401 AV557401 Arabidopsis 42 3e-05

gb|BP670151.1|BP670151 BP670151 RAFL21 Arabidopsis 43 1e-04

..


Example output match alignment

Example output – match alignment

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis

Length = 1009

Score = 69.9 bits (35), Expect = 3e-10

Identities = 47/51 (92%)

Strand = Plus / Plus

Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117

||||||||||||||||| ||||| ||||| |||||||| ||||||||||||

Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..

General form of output:

Repetitions of (query, subject matches, alignments)


Extract information from blast output

Extract information from blast output

  • Extract the best hit for each query sequence

Query= CL69Contig1

(372 letters)

..

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

..


Algorithm

Algorithm

  • Read blast output file line by line

  • Introduce two states:

    • Looking for next query

    • Looking for hit list

  • Return dictionary of query  best hit


First state looking for next query

First state: Looking for next query

Query= CL69Contig1

(372 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

..

Look for a line starting with

Query=

(the = is important!)


Why we look for query and not just query

Why we look for Query= and not just Query

>gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis

Length = 1009

Score = 69.9 bits (35), Expect = 3e-10

Identities = 47/51 (92%)

Strand = Plus / Plus

Query:67 gagctattaacaggtaagggtcttttgaagggaacaggcttcttggacttc 117

||||||||||||||||| ||||| ||||| |||||||| ||||||||||||

Sbjct:776 gagctattaacaggtaaaggtctattgaaaggaacagggttcttggacttc 826

..


Second state looking for hit list

Second state: Looking for hit list

Query= CL69Contig1

(372 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

..

Case A: hits were found


Case b no hits were found

Case B: no hits were found

Query= CL5Contig1

(797 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

Query= CL69Contig1

(372 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

..


Second state looking for hit list1

Second state: Looking for hit list

Query= CL5Contig1

(797 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

***** No hits found ******

Query= CL69Contig1

(372 letters)

Database: arabidopsis_nucleotides.fasta

1,004,711 sequences; 901,539,077 total letters

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

gb|AV791771.1|AV791771 AV791771 RAFL7 Arabidopsis 70 3e-10

..

Look for a line starting with

Searching

Then read a few more lines to distinguish case A/B


Parsing

blastparser.py (part 1)

Find the query ID:

Query= CL69Contig1


Parsing

blastparser.py (part 2)

Find the best match ID:

Searching..................................................done

Score E

Sequences producing significant alignments: (bits) Value

gb|AV791675.1|AV791675 AV791675 RAFL7 Arabidopsis 70 3e-10

Find the best match ID:

Searching..................................................done

***** No hits found ******


Parsing

Test

>>> from blastparser import parseBlastallOutput

>>> d = parseBlastallOutput(“arachis_arab.bn”)

>>> d[“gi|30419745”]

‘gb|BP625785.1|BP625785’

>>> d[“gi|30419753”]

‘none’


Two pass parsing

Two pass-parsing

Evolutionary tree of life

(animal kingdom)

  • Huge hierarchy of groups and subgroups

  • Each node in the tree has a name and a (possibly empty) list of descendant trees (sons)

Source: The origin and evolution of model organisms, Nature Genetics, Nov. 2002, vol. 3.


Abstract data structure to represent a general tree not necessarily binary

Abstract data structure to represent a general tree (not necessarily binary)

tree.py


How can we write a tree to a sequential file

How can we write a tree to a sequential file?

(File format should be readable by other systems, so we

can’t use cPickle)

  • A tree is a labeled node containing

    a (possibly empty) list of other (sub)trees

  • Write tree node using start and end tags:

    <N=“Insects”> [sons] </N>

  • Formally (context-free grammar):

    T→ <N=“L”>S</N>

    S→λ| TS

    L→string label

  • Insects

    Flies

    Beetles

    B

    C

    A

    E

    D


    Recursive method string representation of tree

    Recursive method: string representation of tree

    Insects

    Flies

    Beetles

    B

    C

    A

    E

    D

    tree.py

    First obtain string representation of sons (empty string if no sons) by calling function recursively..

    .. then create string with start tag, label, sons’ representation, and end tag

    .. <N=“Beetles”><N=“C”></N><N=“D”></N><N=“E”></N></N> ..


    Larger tree how can we read a tree from a sequential file

    Larger tree – How can we read a tree from a sequential file?

    <N="Terrestrialvertebrates"><N="Synapsida"><N="Therapsida"><N="Mammalia"><N="Marsupialia"><N="Kangaroo"></N><N="Koala"></N></N><N="Eutheria"><N="Primates"><N="Human"></N><N="Gorilla"></N><N="Chimpanzee"></N></N><N="Carnivora"><N="Walrus"></N><N="Wolf"></N></N><N="Proboscidea"><N="Elephant"></N></N></N></N></N></N><N="Reptilia"><N="Diapsida"><N="Archosauromorpha"><N="Tyrannosaurus"></N><N="Penguin"></N><N="Owl"></N></N><N="Lepidosauromorpha"><N="Lizard"></N><N="Snake"></N></N></N><N="Testudines"><N="Turtle"></N></N></N></N>

    part_of_the_tree_of_life.txt

    We need a parser!


    Two pass parsing1

    Two-pass parsing

    Complex parsing is often split in two passes:

    • Lexical analysis

      • Identify and assemble tokens: logical units of text

    • Structural analysis

      • Determine the structural hierarchy of the tokens

        In our case, the tokens are the two kinds of tag:


    Parsing

    Lexical analysis

    Match either a start tag or an end tag

    Define a group containing the start tag’s label

    phylogenyparser.py (part 1)

    Search text from index pointer

    Create token of right type

    Move index pointer


    Structural analysis

    Structural analysis

    Real root will be first son of this node

    1

    2

    3

    phylogenyparser.py (part 2)

    .. <N="Kangaroo"></N><N="Koala"></N>..

    .. <N="Kangaroo"></N><N="Koala"></N>..

    current_node

    current_node

    current_node

    new_node

    Kangaroo

    1

    2

    3


    Parsing

    Terrestrial vertebrates

    Turtle

    Testudines

    Reptilia

    Lizard

    Synapsida

    Lepidosauromorpha

    Diapsida

    Snake

    Therapsida

    Archosauromorpha

    Mammalia

    Owl

    Kangaroo

    Eutheria

    Penguin

    Marsupilia

    Tyrannosaurus

    Proboscidea

    Koala

    Primates

    Elephant

    Carnivora

    Human

    Wolf

    Gorilla

    Walrus

    Chimpanzee


    Test program

    Testprogram

    phylogenyparser.py


    Navigating in the tree

    Navigating in the tree

    Name: Diapsida

    Father: Reptilia

    Siblings: Testudines

    Sons: Archosauromorpha Lepidosauromorpha

    (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? b

    Number of sibling (0-0)? 0

    Name: Testudines

    Father: Reptilia

    Siblings: Diapsida

    Sons: Turtle

    (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? p

    <N="Testudines"><N="Turtle"></N></N>

    Name: Testudines

    Father: Reptilia

    Siblings: Diapsida

    Sons: Turtle

    (f)ather, (s)on, si(b)ling, (p)rint, (q)uit? f

    Name: Reptilia

    Father: Terrestrial vertebrates

    Siblings: Synapsida

    Sons: Diapsida Testudines

    Turtle

    Testudines

    Reptilia

    Lepidosauromorpha

    Diapsida

    Archosauromorpha


    On to the exercises

    .. on to the exercises


  • Login