EECS 800 Research Seminar
Mining Biological Data

Instructor: Luke Huan

Fall, 2006



Administrative

  • Class presentation schedule is online

    • First class presentation is “kernel based classification” by Han Bin on Nov 6th

  • Project design is due Oct 30th



Overview

  • Gene ontology

    • Challenges

    • What is gene ontology?

    • Constructing gene ontology

  • Text mining, natural language processing and information extraction: An Introduction

  • Summary



Ontology

  • <philosophy> A systematic account of Existence.

  • <artificial intelligence> (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them.

  • <information science> The hierarchical structuring of knowledge about things by subcategorising them according to their essential (or at least relevant and/or cognitive) qualities.

    This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices. The philosophy of indexing everything in existence?


Aristotle’s (384-322 BC) Ontology

  • Substance

    • plants, animals, ...

  • Quality

  • Quantity

  • Relation

  • Where

  • When

  • Position

  • Having

  • Action

  • Passion


Ontology and Informatics

  • In information sciences, ontology is better defined as: “a domain of knowledge, represented by facts and their logical connections, that can be understood by a computer”.

    (J. Bard, BioEssays, 2003)

  • “Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing”

    (Gruber, 1993)



Information Exchange in Bio-sciences

  • Basic challenges:

    • Definition, definition, definition

  • What is a name?

  • What is a function?


Cell

(series of image slides showing cell structure; images from http://microscopy.fsu.edu)



What’s in a name?

  • The same name can be used to describe different concepts



What’s in a name?

  • Glucose synthesis

  • Glucose biosynthesis

  • Glucose formation

  • Glucose anabolism

  • Gluconeogenesis

  • All refer to the process of making glucose from simpler components



What’s in a name?

  • The same name can be used to describe different concepts

  • A concept can be described using different names

 ⇒ Comparison is difficult – in particular across species or across databases


What is Function? The Hammer Example

  Function (what)             Process (why)
  Drive nail (into wood)      Carpentry
  Drive stake (into soil)     Gardening
  Smash roach                 Pest Control
  Clown’s juggling object     Entertainment



Information Explosion


Entering the Genome Sequencing Era

Eukaryotic Genome Sequences

  Genome                           Year   Genome Size (Mb)   # Genes
  Yeast (S. cerevisiae)            1996          12            6,000
  Worm (C. elegans)                1998          97           19,100
  Fly (D. melanogaster)            2000         120           13,600
  Plant (A. thaliana)              2001         125           25,500
  Human (H. sapiens, 1st draft)    2001       ~3000          ~35,000



What is the Gene Ontology?

A Common Language for Annotation of Genes from Yeast, Flies and Mice … and Plants and Worms … and Humans … and anything else!



http://www.geneontology.org/



What is the Gene Ontology?

  • Gene annotation system

  • Controlled vocabulary that can be applied to all organisms

    • Organism independent

  • Used to describe gene products

    • proteins and RNA - in any organism



The 3 Gene Ontologies

  • Molecular Function = elemental activity/task

    • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity

  • Biological Process = biological goal or objective

    • broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions

  • Cellular Component = location or complex

    • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme



Cellular Component

  • where a gene product acts


Cellular Component

  • Enzyme complexes in the component ontology refer to places, not activities.


Molecular Function

  • Examples: insulin binding, insulin receptor activity


Molecular Function

  • activities or “jobs” of a gene product

    • Example: glucose-6-phosphate isomerase activity



Molecular Function

  • A gene product may have several functions; a function term refers to a single reaction or activity, not a gene product.

  • Sets of functions make up a biological process.


Biological Process

  • a commonly recognized series of events

    • Example: cell division


Biological Process

  • Example: transcription


Biological Process

  • Metabolism: degradation or synthesis of biomolecules


Biological Process

  • Development: how a group of cells becomes a tissue


Biological Process

  • Example: courtship behavior



Ontology applications

  • Can be used to:

    • Formalise the representation of biological knowledge

    • Standardise database submissions

    • Provide unified access to information through ontology-based querying of databases, both human and computational

    • Improve management and integration of data within databases.

    • Facilitate data mining



Gene Ontology Structure

  • Ontologies can be represented as directed acyclic graphs (DAGs), in which nodes are connected by edges

    • Nodes = terms in biology

    • Edges = relationships between the terms

      • is-a

      • part-of
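To make the DAG idea concrete, here is a minimal Python sketch; the class, term names, and edges are illustrative only, not the real GO data or API. Terms are nodes, typed is-a / part-of edges point from child to parent, and a walk up the edges collects all ancestors of a term.

    from collections import defaultdict

    class OntologyDAG:
        """Terms as nodes; typed edges (is-a, part-of) point from child to parent."""
        def __init__(self):
            self.parents = defaultdict(list)   # child -> list of (parent, relation)

        def add_edge(self, child, parent, relation):
            self.parents[child].append((parent, relation))

        def ancestors(self, term):
            """Every term reachable by following edges upward (any relation type)."""
            seen, stack = set(), [term]
            while stack:
                for parent, _ in self.parents[stack.pop()]:
                    if parent not in seen:
                        seen.add(parent)
                        stack.append(parent)
            return seen

    # Illustrative edges in the spirit of the slides, not taken from the GO file:
    go = OntologyDAG()
    go.add_edge("nuclear chromosome", "chromosome", "is-a")
    go.add_edge("nuclear chromosome", "nucleus", "part-of")
    go.add_edge("nucleus", "cell", "part-of")
    print(go.ancestors("nuclear chromosome"))   # {'chromosome', 'nucleus', 'cell'}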


Parent-Child Relationships

  • A child is a subset or instance of a parent’s elements

    • Example: chromosome → cytoplasmic chromosome, mitochondrial chromosome, nuclear chromosome, plastid chromosome


Parent-Child Relationships

  (figure: a small DAG relating cell, membrane, chloroplast, mitochondrial membrane, and chloroplast membrane through is-a and part-of edges)



Annotation in GO

  • A gene product is usually a protein but can be a functional RNA

  • An annotation is a piece of information associated with a gene product

  • A GO annotation is a Gene Ontology term associated with a gene product



Terms, Definitions, IDs

  • Term: MAPKKK cascade (mating sensu Saccharomyces)

  • Goid: GO:0007244

  • Definition: OBSOLETE. MAPKKK cascade involved in transduction of mating pheromone signal, as described in Saccharomyces.

  • Evidence code: how annotation is done

  • Definition_reference: PMID:9561267


Annotation Example

  • Gene product: nek2

  • GO term: centrosome (GO:0005813)

  • Evidence code: IDA (Inferred from Direct Assay)

  • Reference: PMID: 11956323
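Spelled out as data, such an annotation is just a small record; a minimal sketch (the field names are mine, not an official GO schema) holding the example above:

    from dataclasses import dataclass

    @dataclass
    class GOAnnotation:
        gene_product: str    # usually a protein, can be a functional RNA
        go_id: str           # GO term identifier
        go_term: str         # GO term name
        evidence_code: str   # how the annotation was made (IDA, ISS, IEA, ...)
        reference: str       # supporting publication

    nek2_example = GOAnnotation(
        gene_product="nek2",
        go_id="GO:0005813",
        go_term="centrosome",
        evidence_code="IDA",          # Inferred from Direct Assay
        reference="PMID:11956323",
    )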


GO Annotation



Evidence Code

  • Indicate the type of evidence in the cited source that supports the association between the gene product and the GO term

    http://www.geneontology.org/GO.evidence.html



Types of evidence codes

  • Types of evidence code

    • Experimental codes - IDA, IMP, IGI, IPI, IEP

    • Computational codes - ISS, IEA, RCA, IGC

    • Author statement - TAS, NAS

    • Other codes - IC, ND

  • Two types of annotation

    • Manual Annotation

    • Electronic Annotation
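A small lookup sketch built directly from the grouping above; the only assumption beyond the slides is that IEA is the one code assigned without curator review, which is how the manual/electronic distinction is drawn later in this section.

    # Map each GO evidence code to its type, following the grouping above.
    EVIDENCE_TYPE = {
        **dict.fromkeys(["IDA", "IMP", "IGI", "IPI", "IEP"], "experimental"),
        **dict.fromkeys(["ISS", "IEA", "RCA", "IGC"], "computational"),
        **dict.fromkeys(["TAS", "NAS"], "author statement"),
        **dict.fromkeys(["IC", "ND"], "other"),
    }

    def annotation_kind(code):
        """Electronic annotation (no manual checking) vs. manual, curated annotation."""
        return "electronic" if code == "IEA" else "manual"

    print(EVIDENCE_TYPE["IDA"], annotation_kind("IDA"))   # experimental manual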



IDA: Inferred from Direct Assay

  • direct assay for the function, process, or component indicated by the GO term

    • Enzyme assays

    • In vitro reconstitution (e.g. transcription)

    • Immunofluorescence (for cellular component)

    • Cell fractionation (for cellular component)



IMP: Inferred from Mutant Phenotype

  • variations or changes such as mutations or abnormal levels of a single gene product

    • Gene/protein mutation

    • Deletion mutant

    • RNAi experiments

    • Specific protein inhibitors

    • Allelic variation



IGI: Inferred from Genetic Interaction

  • Any combination of alterations in the sequence or expression of more than one gene or gene product

    • Traditional genetic screens

      • - Suppressors, synthetic lethals

    • Functional complementation

    • Rescue experiments

  • An entry in the ‘with’ column is recommended



IPI: Inferred from Physical Interaction

  • Any physical interaction between a gene product and another molecule, ion, or complex

    • 2-hybrid interactions

    • Co-purification

    • Co-immunoprecipitation

    • Protein binding experiments

  • An entry in the ‘with’ column is recommended



IEP: Inferred from Expression Pattern

  • Timing or location of expression of a gene

    • Transcript levels

      • Northerns, microarray

  • Exercise caution when interpreting expression results



ISS: Inferred from Sequence or structural Similarity

  • Sequence alignment, structure comparison, or evaluation of sequence features such as composition

    • Sequence similarity

    • Recognized domains/overall architecture of protein

  • An entry in the ‘with’ column is recommended



RCA: Inferred from Reviewed Computational Analysis

  • non-sequence-based computational method

    • large-scale experiments

      • genome-wide two-hybrid

      • genome-wide synthetic interactions

    • integration of large-scale datasets of several types

    • text-based computation (text mining)


IGC: Inferred from Genomic Context

  • Chromosomal position

  • Most often used for Bacteria - operons

    • Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well-established



IEA: Inferred from Electronic Annotation

  • depend directly on computation or automated transfer of annotations from a database

    • Hits from BLAST searches

    • InterPro2GO mappings

  • No manual checking

  • Entry in ‘with’ column is allowed (ex. sequence ID)



TAS: Traceable Author Statement

  • publication used to support an annotation doesn't show the evidence

    • Review article

    • Text mining!

  • Would be better to track down cited reference and use an experimental code



NAS: Non-traceable Author Statement

  • Statements in a paper that cannot be traced to another publication



ND: No biological Data available

  • Can find no information supporting an annotation to any term

  • Indicate that a curator has looked for info but found nothing

    • Place holder

    • Date



IC: Inferred by Curator

  • annotation is not supported by evidence, but can be reasonably inferred from other GO annotations for which evidence is available

  • ex. evidence = transcription factor (function)

    • IC = nucleus (component)



Choosing the correct evidence code

Ask yourself:

What is the experiment that was done?

Text Mining can help you review papers faster!



Beyond GO – Open Biomedical Ontologies

  • Orthogonal to existing ontologies to facilitate combinatorial approaches

    • Share unique identifier space

    • Include definitions



Gene Ontology and Text Mining

  • Derive ontology from text data

  • More general goal: understand text data automatically



Finding GO terms

…for B. napus PERK1 protein (Q9ARH1)

In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, which is predicted to encode a novel receptor-like kinase. We have shown that, like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the localization of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…

PubMed ID: 12374299

Function: protein serine/threonine kinase activity GO:0004674

Component: integral to plasma membrane GO:0005887

Process: response to wounding GO:0009611
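As a toy illustration of why this is a text-mining problem, the sketch below does the naive thing: exact-phrase lookup of the three GO term names in the abstract, followed by a crude word-overlap score. The exact lookup finds nothing, because even the closest term is worded slightly differently in the text and the other two appear only as paraphrases ("integral membrane protein", "wound response").

    abstract = ("... the kinase domain of PERK1 has serine/threonine kinase activity ... "
                "PERK1 is an integral membrane protein ... implicated in early stages "
                "of wound response ...").lower()

    go_terms = {                       # term name -> GO id, from the example above
        "protein serine/threonine kinase activity": "GO:0004674",
        "integral to plasma membrane": "GO:0005887",
        "response to wounding": "GO:0009611",
    }

    # Exact-phrase lookup finds nothing: the closest term is worded differently in the
    # text, and the other two annotations appear only as paraphrases.
    print({name: go_id for name, go_id in go_terms.items() if name in abstract})   # {}

    # A crude relaxation: fraction of a term's words that occur anywhere in the text.
    words = set(abstract.replace("/", " ").split())
    for name, go_id in go_terms.items():
        term_words = set(name.replace("/", " ").split())
        print(go_id, round(len(term_words & words) / len(term_words), 2))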


Mining Text Data

Data Mining / Knowledge Discovery draws on a spectrum of sources: Structured Data, Multimedia, Free Text, and Hypertext.

Example: hypertext, free text, and structured records describing the same home purchase:

  • Hypertext: <a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>. <p>...

  • Free text: Frank Rizzo bought his home from Lake View Real Estate in 1992. He paid $200,000 under a 15-year loan from MW Financial.

  • Structured records: HomeLoan(Loanee: Frank Rizzo, Lender: MWF, Agency: Lake View, Amount: $200,000, Term: 15 years); Loans($200K, [map], ...)

(Taken from ChengXiang Zhai, CS 397cxz, UIUC, Fall 2003)
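A hypothetical sketch of the free text → structured record step in this example, using nothing but regular expressions; real information extraction relies on the NLP machinery discussed below, but the target output is the same kind of record.

    import re

    text = ("Frank Rizzo bought his home from Lake View Real Estate in 1992. "
            "He paid $200,000 under a 15-year loan from MW Financial.")

    home_loan = {
        "Loanee": re.match(r"(\w+ \w+) bought", text).group(1),
        "Agency": re.search(r"from ([\w ]+?) in \d{4}", text).group(1),
        "Amount": re.search(r"\$[\d,]+", text).group(0),
        "Term":   re.search(r"(\d+)-year loan", text).group(1) + " years",
        "Lender": re.search(r"loan from ([\w ]+)\.", text).group(1),
    }
    print(home_loan)
    # {'Loanee': 'Frank Rizzo', 'Agency': 'Lake View Real Estate',
    #  'Amount': '$200,000', 'Term': '15 years', 'Lender': 'MW Financial'}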


Bag-of-Tokens Approaches

Documents → feature extraction → token sets (word counts)

Example: the Gettysburg Address (“Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”) reduces to counts such as:

  nation – 5, civil – 1, war – 2, men – 2, died – 4, people – 5, Liberty – 1, God – 1

  • Loses all order-specific information!

  • Severely limits context!
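A minimal sketch of that feature-extraction step: the passage collapses into an unordered bag of token counts, which is exactly where word order and context are lost. (The slide's larger counts are over the full address; the code only sees this excerpt.)

    from collections import Counter

    doc = ("Four score and seven years ago our fathers brought forth on this "
           "continent, a new nation, conceived in Liberty, and dedicated to the "
           "proposition that all men are created equal. Now we are engaged in a "
           "great civil war, testing whether that nation ...")

    tokens = [w.strip(".,").lower() for w in doc.split()]
    bag = Counter(tokens)
    print(bag["nation"], bag["civil"], bag["war"])   # 2 1 1 for this excerpt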


Natural Language Processing

Example: “A dog is chasing a boy on the playground”

  • Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun

  • Syntactic analysis (parsing): noun phrases, a complex verb, and a prepositional phrase combine into verb phrases and finally a sentence

  • Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

    • Inference: Scared(x) if Chasing(_,x,_). ⇒ Scared(b1)

  • Pragmatic analysis (speech act): a person saying this may be reminding another person to get the dog back…



General NLP—Too Difficult!

  • Word-level ambiguity

    • “design” can be a noun or a verb (Ambiguous POS)

    • “root” has multiple meanings (Ambiguous sense)

  • Syntactic ambiguity

    • “natural language processing” (Modification)

    • “A man saw a boy with a telescope.” (PP Attachment)

  • Anaphora resolution

    • “John persuaded Bill to buy a TV for himself.”

      (himself = John or Bill?)

  • Presupposition

    • “He has quit smoking.” implies that he smoked before.

Humans rely on context to interpret (when possible).

This context may extend beyond a given document!



Shallow Linguistics

  • Progress on Useful Sub-Goals:

    • English Lexicon

    • Part-of-Speech Tagging

    • Word Sense Disambiguation

    • Phrase Detection / Parsing


WordNet

  (figure: a fragment of WordNet’s adjective network around “dry” and “wet”; words such as arid, parched, anhydrous, moist, watery, and damp, linked by synonym and antonym edges)

  • An extensive lexical network for the English language

    • Contains over 138,838 words.

    • Several graphs, one for each part-of-speech.

    • Synsets (synonym sets), each defining a semantic sense.

    • Relationship information (antonym, hyponym, meronym …)

    • Downloadable for free (UNIX, Windows)

    • Expanding to other languages (Global WordNet Association)

    • Funding of more than $3 million, mainly from government (translation interest), awarded to George Miller (National Medal of Science, 1991).

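A short usage sketch of the NLTK interface to WordNet; this assumes the nltk package is installed and the wordnet corpus has been fetched with nltk.download("wordnet"), and exact synsets vary by WordNet version.

    from nltk.corpus import wordnet as wn

    # Synsets: one entry per sense of the adjective "dry"
    for synset in wn.synsets("dry", pos=wn.ADJ)[:3]:
        print(synset.name(), "-", synset.definition())

    # Relations hang off synsets/lemmas: antonyms, hypernyms/hyponyms, meronyms, ...
    dry = wn.synsets("dry", pos=wn.ADJ)[0].lemmas()[0]
    print([ant.name() for ant in dry.antonyms()])       # typically ['wet']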


Part-of-Speech Tagging

  Training data (annotated text), e.g.:

    “This sentence serves as an example of annotated text…”
    ( Det   N      V1     P  Det  N      P  V2        N )

  A trained POS tagger applied to a new sentence:

    “This is a new sentence.”
    ( Det  Aux Det Adj N )

  • Pick the most likely tag sequence.

    • Independent assignment: most common tag for each word

    • Partial dependency: hidden Markov model (HMM)
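A toy Viterbi decoder makes the partial-dependency (HMM) idea concrete: each tag depends on the previous tag and on the word it emits, and the decoder keeps the highest-probability tag path. All probabilities below are hand-set for illustration; a real tagger estimates them from annotated training text.

    def viterbi(words, tags, start_p, trans_p, emit_p):
        # column 0: start probability * emission probability
        V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
        for w in words[1:]:
            V.append({t: max(
                (V[-1][prev][0] * trans_p[prev][t] * emit_p[t].get(w, 1e-6),
                 V[-1][prev][1] + [t])
                for prev in tags) for t in tags})
        return max(V[-1].values())[1]   # best tag sequence for the whole sentence

    tags = ["Det", "N", "V"]
    start_p = {"Det": 0.6, "N": 0.3, "V": 0.1}
    trans_p = {"Det": {"Det": 0.01, "N": 0.9,  "V": 0.09},
               "N":   {"Det": 0.1,  "N": 0.3,  "V": 0.6},
               "V":   {"Det": 0.5,  "N": 0.4,  "V": 0.1}}
    emit_p  = {"Det": {"this": 0.5, "a": 0.5},
               "N":   {"sentence": 1.0},
               "V":   {"is": 1.0}}
    print(viterbi(["this", "is", "a", "sentence"], tags, start_p, trans_p, emit_p))
    # ['Det', 'V', 'Det', 'N']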


Word Sense Disambiguation

  “The difficulties of computational linguistics are rooted in ambiguity.” (N Aux V P N)

  Which sense of “rooted” is meant here?

  • Supervised Learning

  • Features:

    • Neighboring POS tags (NAuxVPN)

    • Neighboring words (linguistics are rooted in ambiguity)

    • Stemmed form (root)

    • Dictionary/Thesaurus entries of neighboring words

    • High co-occurrence words (plant, tree, origin,…)

    • Other senses of word within discourse

  • Algorithms:

    • Rule-based Learning (e.g. IG guided)

    • Statistical Learning (e.g. Naïve Bayes)

    • Unsupervised Learning (e.g. Nearest Neighbor)
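A small supervised WSD sketch along these lines, assuming scikit-learn is available: neighboring words are the features and a Naïve Bayes classifier picks the sense of "root". The four training snippets and their sense labels are invented for illustration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    contexts = ["the tree root grew deep into the soil",           # botany sense
                "the plant root absorbs water and minerals",        # botany sense
                "the root cause of the software problems",          # origin sense
                "ambiguity is the root of many parsing problems"]   # origin sense
    senses = ["botany", "botany", "origin", "origin"]

    wsd = make_pipeline(CountVectorizer(), MultinomialNB())
    wsd.fit(contexts, senses)
    print(wsd.predict(["computational linguistics is rooted in ambiguity"]))
    # expected: ['origin'], since "ambiguity" is the telling neighboring word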


Parsing

  • Choose the most likely parse tree.

  Probabilistic CFG for “A dog is chasing a boy on the playground”

  Grammar (each rule carries a probability; e.g. S → NP VP has probability 1.0):

    S → NP VP
    NP → Det BNP | BNP | NP PP
    BNP → N
    VP → V | Aux V NP | VP PP
    PP → P NP

  Lexicon:

    V → chasing;  Aux → is;  N → dog | boy | playground;  Det → the | a;  P → on

  Two candidate parse trees differ in where the prepositional phrase “on the playground” attaches; their probabilities (the products of the probabilities of the rules used, ≈0.000015 and ≈0.000011 here) decide which parse is chosen.
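A minimal sketch of how a probabilistic CFG scores a candidate tree: multiply the probabilities of every grammar rule used in the tree and keep the tree with the larger product. The rule probabilities below are illustrative stand-ins, not the exact values from the slide, and lexical rules are omitted for brevity.

    from math import prod

    # rule -> probability (illustrative; all rules for one left-hand side sum to 1)
    rule_p = {
        ("S",  ("NP", "VP")):       1.0,
        ("NP", ("Det", "BNP")):     0.3,
        ("NP", ("BNP",)):           0.4,
        ("NP", ("NP", "PP")):       0.3,
        ("BNP", ("N",)):            1.0,
        ("VP", ("Aux", "V", "NP")): 0.5,
        ("VP", ("VP", "PP")):       0.2,
        ("VP", ("V",)):             0.3,
        ("PP", ("P", "NP")):        1.0,
    }

    def tree_probability(rules_used):
        """Probability of a parse tree = product of the probabilities of its rules."""
        return prod(rule_p[r] for r in rules_used)

    # "A dog is chasing a boy on the playground": attach the PP to the VP ...
    vp_attach = [("S", ("NP", "VP")), ("NP", ("Det", "BNP")), ("BNP", ("N",)),
                 ("VP", ("VP", "PP")), ("VP", ("Aux", "V", "NP")),
                 ("NP", ("Det", "BNP")), ("BNP", ("N",)),
                 ("PP", ("P", "NP")), ("NP", ("Det", "BNP")), ("BNP", ("N",))]
    # ... or attach it to the object NP ("a boy on the playground")
    np_attach = [("S", ("NP", "VP")), ("NP", ("Det", "BNP")), ("BNP", ("N",)),
                 ("VP", ("Aux", "V", "NP")), ("NP", ("NP", "PP")),
                 ("NP", ("Det", "BNP")), ("BNP", ("N",)),
                 ("PP", ("P", "NP")), ("NP", ("Det", "BNP")), ("BNP", ("N",))]
    print(tree_probability(vp_attach), tree_probability(np_attach))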



Obstacles

  • Ambiguity

    • “A man saw a boy with a telescope.”

  • Computational Intensity

    • Imposes a context horizon.

  • Text Mining NLP Approach:

    • Locate promising fragments using fast IR methods (bag-of-tokens).

    • Only apply slow NLP techniques to promising fragments.



Summary: Shallow NLP

  • However, shallow NLP techniques are feasible and useful:

  • Lexicon – machine understandable linguistic knowledge

    • possible senses, definitions, synonyms, antonyms, type-of, etc.

  • POS Tagging – limit ambiguity (word/POS), entity extraction

    • “...research interests include text mining as well as bioinformatics.”

      • N P N

  • WSD – stem/synonym/hyponym matches (doc and query)

    • Query: “Foreign cars” Document: “I’m selling a 1976 Jaguar…”

  • Parsing – logical view of information (inference?, translation?)

    • “A man saw a boy with a telescope.”

  • Even without complete NLP, any additional knowledge extracted from text data can only be beneficial.

  • Ingenuity will determine the applications.



Reference for GO

  • Gene ontology teaching resources:

    • http://www.geneontology.org/GO.teaching.resources.shtml



References for TM

  • C. D. Manning and H. Schütze, “Foundations of Statistical Natural Language Processing”, MIT Press, 1999.

  • S. Russell and P. Norvig, “Artificial Intelligence: A Modern Approach”, Prentice Hall, 1995.

  • S. Chakrabarti, “Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data”, Morgan Kaufmann, 2002.

  • G. Miller, R. Beckwith, C. FellBaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Princeton University, August 1993.

  • C. Zhai, Introduction to NLP, Lecture Notes for CS 397cxz, UIUC, Fall 2003.

  • M. Hearst, Untangling Text Data Mining, ACL’99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html

  • R. Sproat, Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003.

  • A Road Map to Text Mining and Web Mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text-mining/

  • Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/

