EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall, 2006
Administrative
Class presentation schedule is online
First class presentation is “kernel based classification” by Han Bin on Nov 6th
Project design is due Oct 30th
Overview
Gene ontology
Challenges
This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices. The philosophy of indexing everything in existence?
(J. Bard, BioEssays, 2003)
Image from http://microscopy.fsu.edu
Comparison is difficult – in particular across species or across databases
Function (what) | Process (why)
Drive nail (into wood) | Carpentry
Drive stake (into soil) | Gardening
Smash roach | Pest Control
Clown’s juggling object | Entertainment
Eukaryotic Genome Sequences
Organism | Year | Genome size (Mb) | # Genes
Yeast (S. cerevisiae) | 1996 | 12 | 6,000
Worm (C. elegans) | 1998 | 97 | 19,100
Fly (D. melanogaster) | 2000 | 120 | 13,600
Plant (A. thaliana) | 2001 | 125 | 25,500
Human (H. sapiens, 1st Draft) | 2001 | ~3000 | ~35,000
A Common Language for Annotation of Genes from Yeast, Flies and Mice
…and Plants and Worms
…and anything else!
insulin receptor activity
glucose-6-phosphate isomerase activity
a commonly recognized series of events
Metabolism: degradation or synthesis of biomolecules
Development: how a group of cells becomes a tissue
A child is a subset or an instance of a parent’s elements
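The child/parent relation can be sketched in Python as a small graph walk. This is only an illustration: the `IS_A` table below is a hypothetical, simplified chain of term names, not the real GO graph, and `ancestors` and `is_subsumed_by` are names made up for this sketch.

```python
from collections import deque

# Hypothetical, simplified child -> parents links (NOT the real GO graph).
IS_A = {
    "protein serine/threonine kinase activity": ["protein kinase activity"],
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term, is_a=IS_A):
    """Return every term reachable from `term` by following child -> parent links."""
    seen = set()
    queue = deque(is_a.get(term, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(is_a.get(parent, []))
    return seen


def is_subsumed_by(child, parent, is_a=IS_A):
    """True if `child` is the same term as `parent` or a descendant of it."""
    return child == parent or parent in ancestors(child, is_a)
```

Because a child is a subset or an instance of its parent, anything annotated to the child term is implicitly annotated to every term returned by `ancestors`.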
Evidence Code
Annotation Example
What is the experiment that was done?
Text Mining can help you review
…for B. napus PERK1 protein (Q9ARH1)
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
PubMed ID: 12374299
Function: protein serine/threonine kinase activity GO:0004674
Component: integral to plasma membrane GO:0005887
Process: response to wounding GO:0009611
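The three annotations above can be held as simple records. A minimal Python sketch follows; the `Annotation` class, its field names, and `by_aspect` are assumptions made for illustration and are not an official GO file format. No evidence code is stored because the slide does not give one.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

# GO identifiers are written as "GO:" followed by seven digits.
GO_ID = re.compile(r"^GO:\d{7}$")


@dataclass(frozen=True)
class Annotation:
    protein: str    # protein identifier as given on the slide
    aspect: str     # "Function", "Component" or "Process"
    term: str       # human-readable term name
    go_id: str      # GO identifier
    pubmed_id: str  # literature reference supporting the annotation

    def __post_init__(self):
        if not GO_ID.match(self.go_id):
            raise ValueError(f"malformed GO id: {self.go_id}")


# The PERK1 annotations from the slide.
ANNOTATIONS = [
    Annotation("Q9ARH1", "Function", "protein serine/threonine kinase activity",
               "GO:0004674", "12374299"),
    Annotation("Q9ARH1", "Component", "integral to plasma membrane",
               "GO:0005887", "12374299"),
    Annotation("Q9ARH1", "Process", "response to wounding",
               "GO:0009611", "12374299"),
]


def by_aspect(annotations):
    """Group GO ids by ontology aspect."""
    grouped = defaultdict(list)
    for a in annotations:
        grouped[a.aspect].append(a.go_id)
    return dict(grouped)
```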
<a href>Frank Rizzo</a> <a href>this home</a> from <a href>Lake View Real Estate</a>
Frank Rizzo bought
his home from Lake
View Real Estate in
He paid $200,000
under a 15-year loan
from MW Financial.
Loanee: Frank Rizzo
Agency: Lake View
Term: 15 years
Mining Text Data
Data Mining / Knowledge Discovery
Structured Data | Multimedia | Free Text | Hypertext
(Taken from ChengXiang Zhai, CS 397cxz, UIUC, CS – Fall 2003)
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal.
Now we are engaged in a great civil war, testing whether that nation, or …
nation – 5
civil – 1
war – 2
men – 2
died – 4
people – 5
Liberty – 1
God – 1
Loses all order-specific information!
Severely limits context!
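The bag-of-words idea can be sketched in a few lines of Python. This is a minimal illustration: the tokenizer (lowercase, letters and apostrophes only) is an assumption, and the function name `bag_of_words` is invented here. `TEXT` holds only the excerpt quoted above, so its counts are smaller than the slide's, which presumably come from the full speech.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Only the excerpt quoted on the slide, not the full speech.
TEXT = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation"
)

bag = bag_of_words(TEXT)

# Word order is discarded: these two sentences get identical representations.
same = bag_of_words("dog chases boy") == bag_of_words("boy chases dog")
```

The `same` comparison is the point of the two warnings above: once only counts are kept, "dog chases boy" and "boy chases dog" are indistinguishable.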
A dog is chasing a boy on the playground
Scared(x) if Chasing(_,x,_).
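The rule above can be applied mechanically once the sentence has been turned into a fact. A small Python sketch, assuming facts are stored as `(predicate, arguments)` tuples; this encoding and the name `infer_scared` are illustrative choices, not part of the slides.

```python
# The fact extracted from "A dog is chasing a boy on the playground".
FACTS = {("Chasing", ("dog", "boy", "playground"))}


def infer_scared(facts):
    """Apply the rule Scared(x) if Chasing(_, x, _) to a set of facts."""
    return {
        ("Scared", (args[1],))
        for predicate, args in facts
        if predicate == "Chasing" and len(args) == 3
    }


derived = infer_scared(FACTS)
```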
A person saying this may be reminding another person to get the dog back… (speech act)
Natural Language Processing
(himself = John or Bill?)
Humans rely on context to interpret (when possible).
This context may extend beyond a given document!
parched
Pick the most likely tag sequence.
Most common tag
Part-of-Speech Tagging
Training data (Annotated text)
This sentence serves as an example of annotated text…
Det N V1 P Det N P V2 N
This is a new sentence.
Det Aux Det Adj N
“This is a new sentence.”
?
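The "most common tag" idea can be sketched as a baseline tagger in Python. It is trained here on the two annotated sentences from the slides; the function names `train` and `tag`, the lowercasing, and the fallback to the overall most frequent tag for unseen words are assumptions of this sketch. It does not pick the most likely tag sequence, since each word is tagged independently of its neighbours.

```python
from collections import Counter, defaultdict

# The two annotated sentences from the slides.
TRAINING = [
    ("This sentence serves as an example of annotated text".split(),
     "Det N V1 P Det N P V2 N".split()),
    ("This is a new sentence".split(),
     "Det Aux Det Adj N".split()),
]


def train(tagged_sentences):
    """Build a word -> most frequent tag lexicon plus a default tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for words, tags in tagged_sentences:
        for word, t in zip(words, tags):
            counts[word.lower()][t] += 1
            all_tags[t] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default = all_tags.most_common(1)[0][0]  # used for unseen words
    return lexicon, default


def tag(words, lexicon, default):
    """Tag each word with its most common training tag, ignoring context."""
    return [lexicon.get(w.lower(), default) for w in words]


lexicon, default = train(TRAINING)
tags = tag("This is an example sentence".split(), lexicon, default)
```

Because context is ignored, an ambiguous word always receives the same tag, which is why this is treated as a baseline.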
“The difficulties of computational linguistics are rooted in ambiguity.”
N Aux V P N
Word Sense Disambiguation
Probability of this tree = 0.000015
S → NP VP
NP → Det BNP
NP → NP PP
VP → Aux V NP
VP → VP PP
PP → P NP
Probability of this tree = 0.000011
Choose most likely parse tree…
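Choosing between two parses can be sketched in Python as multiplying the probabilities of the rules used in each tree. The grammar rules are the ones listed above, but the slides do not give their probabilities, so every number in `RULE_PROB` is hypothetical. Det, BNP, Aux, V and P are treated as leaves, so lexical probabilities are left out and the results will not match the 0.000015 and 0.000011 figures.

```python
# Rule probabilities are HYPOTHETICAL; the slides list only the rules.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "BNP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("VP", ("Aux", "V", "NP")): 0.4,
    ("VP", ("VP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
}


def label(node):
    """A node is either a leaf string or a tuple (label, child, child, ...)."""
    return node if isinstance(node, str) else node[0]


def tree_prob(node):
    """Multiply the probabilities of all rules used in the tree."""
    if isinstance(node, str):  # leaf: contributes no rule in this sketch
        return 1.0
    head, children = node[0], node[1:]
    p = RULE_PROB[(head, tuple(label(c) for c in children))]
    for child in children:
        p *= tree_prob(child)
    return p


NP = ("NP", "Det", "BNP")
PP = ("PP", "P", NP)

# "on the playground" attached to the verb phrase.
VERB_ATTACH = ("S", NP, ("VP", ("VP", "Aux", "V", NP), PP))
# "on the playground" attached to the noun phrase "a boy".
NOUN_ATTACH = ("S", NP, ("VP", "Aux", "V", ("NP", NP, PP)))

best = max([VERB_ATTACH, NOUN_ATTACH], key=tree_prob)
```

With these made-up numbers the verb attachment wins; with different rule probabilities the choice could flip, which is the ambiguity the two tree probabilities on the slide illustrate.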