TEXT ANALYSIS FOR SEMANTIC COMPUTING
Cartic Ramakrishnan, Meenakshi Nagarajan, Amit Sheth

preparing for a tutorial
Preparing for a Tutorial

A great way to find out…

How little you really know

acknowledgements
Acknowledgements

We have used material from several popular books, papers, course notes and presentations made by experts in this area. We have provided all references to the best of our knowledge. This list, however, serves only as a pointer to work in this area and is by no means a comprehensive resource.

slide4
To Contact us..
  • KNO.E.SIS
    • knoesis.org
  • Director: Amit Sheth
    • knoesis.wright.edu/amit/
  • Graduate Students:
understanding natural language
Understanding Natural Language

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

An Overview of Empirical Natural Language Processing, Eric Brill, Raymond J. Mooney

slide6
Brief History
  • Traditional (Rationalist) Natural Language Processing
    • Main insight: Using rule-based representations of knowledge and grammar (hand-coded) for language study

Text + KB → NLP System → Analysis

slide7
Brief History
  • Empirical Natural Language Processing
    • Main insight: Using distributional environment of a word as a tool for language study

Corpus → Learning System → KB; Text + KB → NLP System → Analysis

slide8
Brief History
  • The two approaches are not incompatible; several systems use both.
  • Many empirical systems make use of manually created domain knowledge.
  • Many empirical systems use the representations of rationalist methods, replacing hand-coded rules with rules acquired from data.
goals of this tutorial
Goals of this Tutorial
  • Several algorithms and methods for each task, covering rationalist and empiricist approaches
  • What does an NL processing task typically entail?
  • How do systems and applications perform these tasks?
  • Syntax : POS Tagging, Parser
  • Semantics : Meaning of words, using context/domain knowledge to enhance tasks
what is text analysis for semantic computing
What is Text Analysis for Semantic Computing?

Finding more about what we already know

Ex. patterns that characterize known information

The search/browse OR ‘finding a needle in a haystack’ paradigm

Discovering what we did not know

Deriving new information from data

e.g., previously unknown relationships between known entities

The ‘extracting ore from rock’ paradigm

levels of text analysis
Levels of Text Analysis
  • Information Extraction – techniques that operate directly on the text input
      • this includes entity, relationship and event detection 
  • Inferring new links and paths between key entities
      • sophisticated representations for information content, beyond the "bag-of-words" representations used by IR systems
  • Scenario detection techniques
      • discover patterns of relationships between entities that signify some larger event, e.g. money laundering activities. 
what do they all have in common
What do they all have in common?
  • They all make use of knowledge of language (exploiting syntax and structure to different extents)
    • Named entities begin with capital letters
    • Morphology and meanings of words
  • They all use some fundamental text analysis operations
    • Pre-processing, parsing, chunking, part-of-speech tagging, lemmatization, tokenization
  • To some extent, they all deal with some language understanding challenges
    • Ambiguity, co-reference resolution, entity variations etc.
  • Use of a core subset of theoretical models and algorithms
    • State machines, rule systems, probabilistic models, vector-space models, classifiers, EM etc.
slide13
Some Differences to note too..
  • Wikipedia-like text (GOOD)
    • “Thomas Edison invented the light bulb.”
  • Scientific literature (BAD)
    • “This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest.”
  • Text from Social Media (UGLY)
    • "heylooo..ano u must hear it loadsss bu your propa faabbb!!"
goals of this tutorial1
Goals of this Tutorial
  • Illustrate analysis of and challenges posed by these three text types throughout the tutorial
text mining
TEXT MINING

WHAT CAN TM DO FOR HARRY POTTER?

A bag of words

??

a cool text mining example
A cool text mining example

UNDISCOVERED PUBLIC KNOWLEDGE

Discovering connections hidden in text

motivation for text mining
Motivation for Text Mining
  • Undiscovered Public Knowledge [Swanson] – as mentioned in [Hearst99]
  • Search no longer enough
      • Information overload – prohibitively large number of hits
      • UPK increases with increasing corpus size
  • Manual analysis very tedious
  • Examples [Hearst99]
      • Example 1 – Using Text to Form Hypotheses about Disease
      • Example 2 – Using Text to Uncover Social Impact
example 1
Example 1

Swanson’s discoveries

Associations between Migraine and Magnesium [Hearst99]

stress is associated with migraines

stress can lead to loss of magnesium

calcium channel blockers prevent some migraines

magnesium is a natural calcium channel blocker

spreading cortical depression (SCD) is implicated in some migraines

high levels of magnesium inhibit SCD

migraine patients have high platelet aggregability

magnesium can suppress platelet aggregability

example 2
Example 2

Mining popularity from Social Media

Goal: Top X artists from MySpace artist comment pages

Traditional Top X lists are derived from radio plays and CD sales; this is an attempt at creating a list closer to listeners' preferences

Mining positive, negative affect / sentiment

Slang, casual text necessitates transliteration

‘you are so bad’ == ‘you are good’

text mining two objectives
Text Mining – Two objectives

Mining text to improve existing information access mechanisms

Search [Storylines]

IR [QA systems]

Browsing [Flamenco]

Mining text for

Discovery & insight [Relationship Extraction]

Creation of new knowledge

Ontology instance-base population

Ontology schema learning

mining text to improve search
Mining text to improve search
  • Web search – aims at optimizing for top k (~10) hits
  • Beyond top 10
    • Pages expressing related latent views on topic
    • Possible reliable sources of additional information
  • Storylines in search results [3]
mining text to improve fact extraction
Mining text to improve Fact Extraction

TextRunner[4]

A system that uses the result of dependency parses of sentences to train a Naïve Bayes classifier for Web-scale extraction of relationships

Does not require parsing for extraction – only required for training

Training features – POS tag sequences, whether the object is a proper noun, number of tokens to the right or left, etc.

This system is able to respond to queries like "What did Thomas Edison invent?"

mining text to improve browsing
Mining text to improve Browsing

Castanet [1]

Semi-automatically builds faceted hierarchical metadata structures from text

This is combined with Flamenco[2] to support faceted browsing of content

castanet building faceted hierarchies from text
Castanet – building faceted hierarchies from text

Pipeline: Documents + WordNet → Select terms → Build core tree → Augment core tree → Compress tree → Remove top-level categories → Divide into facets

details of castanet s method
Details of Castanet’s method

Example: the WordNet hypernym paths for "sundae" (entity → substance, matter → nutriment → dessert → frozen dessert → ice cream sundae) and for "sherbet, sorbet" (entity → substance, matter → nutriment → dessert → frozen dessert → sherbet, sorbet) share a common prefix and are merged into a single core tree.

Domains used to prune applicable senses in Wordnet (e.g. “dip”)

ontology population vs ontology creation
Ontology Population vs. ontology creation?

Figure: the UMLS Semantic Network defines classes such as Biologically active substance, Lipid and Disease or Syndrome, linked by schema-level relationships such as affects, causes and complicates. The MeSH terms Fish Oils and Raynaud's Disease (appearing in PubMed document sets of 4733, 9284 and 5 documents) are instances of these classes; the relationship between the two instances themselves (???????) is what remains to be discovered.

text mining for knowledge creation ontology population
Text Mining for knowledge creation – Ontology population

Finding attribute-like relation instances [Nguyen07]

Finding class instances [Hearst92], [Ramakrishnan et al. 08]

text mining for knowledge creation ontology learning
Text Mining for Knowledge creation – Ontology Learning

Automatic acquisition of

Class Labels

Class hierarchies

Attributes

Relationships

Constraints

Rules

text analysis preliminaries
TEXT ANALYSIS PRELIMINARIES

SYNTAX, SEMANTICS, STATISTICAL NLP, TOOLS, RESOURCES, GETTING STARTED

language understanding is hard
Language Understanding is hard
  • [hearst 97] Abstract concepts are difficult to represent
  • “Countless” combinations of subtle, abstract relationships among concepts
  • Many ways to represent similar concepts
    • E.g. space ship, flying saucer, UFO
  • Concepts are difficult to visualize
    • High dimensionality
    • Tens or hundreds of thousands of features
why is nl hard some challenges
Why is NL hard – some challenges
  • Ambiguity (sense)
    • Keep that smile playin’ (Smile is a track)
    • Keep that smile on!
  • Variations (spellings, synonyms, complex forms)
    • Illeal Neoplasm vs. Adenomatous lesion of the Illeal wall
  • Coreference resolution
    • “John wanted a copy of Netscape to run on his PC on the desk in his den; fortunately, his ISP included it in their startup package.”
text mining is also easy
Text Mining is also easy
  • [hearst 97] Highly redundant data
    • …most of the methods count on this property
  • Just about any simple algorithm can get “good” results for simple tasks:
    • Pull out “important” phrases
    • Find “meaningfully” related words
    • Create some sort of summary from documents
tm an interdisciplinary field
TM - An Interdisciplinary Field
  • Concerned with processing documents in natural language
    • Computational Linguistics, Information Retrieval, Machine learning, Statistics, Information Theory, Data Mining etc.
  • TM generally concerned with practical applications
    • As opposed to, for example, lexical acquisition in CL
helping us get there
Helping us get there
  • Computing Resources
    • Faster disks, CPUs, Networked Information
  • Data Resources
    • Large corpora, tree banks, lexical data for training and testing systems
  • Tools for analysis
    • NL analysis: taggers, parsers, noun-chunkers, tokenizers; statistical text analysis: classifiers, language model generators
  • Emphasis on applications and evaluation
    • Practical systems experimentally evaluated on real data
concepts relevant to tm systems covered here
Concepts Relevant to TM systems covered here
  • Computational Linguistics - Syntax
    • Parts of speech, morphology, phrase structure, parsing, chunking
  • Semantics
    • Lexical semantics, Syntax-driven semantic analysis, domain model-assisted semantic analysis (WordNet),
  • Getting your hands dirty
    • Text encoding, Tokenization, sentence splitting, morphology variants, lemmatization
    • Using parsers, understanding outputs
    • Tools, resources, frameworks
computational linguistics syntax
Computational Linguistics - Syntax

POS Tags, Taggers, Ambiguities, Examples

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

pos tagging
POS Tagging
  • Assigning a pos or syntactic class marker to a word in a sentence/corpus.
    • Word classes, syntactic/grammatical categories
  • Usually preceded by tokenization
    • delimit sentence boundaries, tag punctuations and words.
  • Publicly available tree banks, documents tagged for syntactic structure
  • Typical input and output of a tagger
      • Cancel that ticket. Cancel/VB that/DT ticket/NN ./.
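A tagger for the input/output shown above can be invoked in a few lines. This is a minimal sketch using NLTK's default tagger (an assumption – any of the taggers listed later would do equally well) and it assumes the relevant NLTK data packages are installed.

import nltk

# Tokenize, then tag; the output is a list of (word, POS) pairs
# using the Penn Treebank tag set.
tokens = nltk.word_tokenize("Cancel that ticket.")
print(nltk.pos_tag(tokens))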
ambiguity in tagging an example
Ambiguity in Tagging – an example
  • Lexical ambiguity
    • Words have multiple usages and parts-of-speech
      • A duck in the pond; Don't duck when I bowl
        • Is duck a noun or a verb?
      • Yes, we can; Can of soup; I canned this idea
        • Is can an auxiliary, a noun or a verb?
  • Problem in tagging is resolving such ambiguities
significance of pos tagging
Significance of POS Tagging
  • Information about a word and its neighbors
    • has implications on language models
      • Possessive pronouns (mine, her, its) are usually followed by a noun
    • Understand new words
      • Toves did gyre and gimble.
    • On IE
      • Nouns as cues for named entities
      • Adjectives as cues for subjective expressions
pos tagger algorithms
POS Tagger Algorithms
  • Rule-based
    • Database of hand-written/learned rules to resolve ambiguity – EngCG
  • Probability / Stochastic taggers
    • Use a training corpus to compute probability of a word taking a tag in a specific context - HMM Tagger
  • Hybrids – transformation-based
    • The Brill tagger
  • A comprehensive list of available taggers
    • http://www-nlp.stanford.edu/links/statnlp.html#Taggers
pos taggers rule based
POS Taggers – Rule-based
  • Not a complete representation
  • EngCG based on the Constraint Grammar Approach
  • Two step architecture
    • Use a lexicon of words and likely pos tags to first tag words
    • Use a large list of hand-coded disambiguation rules that assign a single pos tag for each word
engcg architecture
EngCG - Architecture
  • Sample lexicon

Word | POS | Additional POS features

    • Slower | ADJ | COMPARATIVE
    • Show | V | PRESENT
    • Show | N | NOMINATIVE
  • Sample rules
probabilistic stochastic approaches
Probabilistic/Stochastic Approaches
  • What is the best possible tag given this sequence of words?
    • Takes context into account; global
  • Example: HMM (hidden Markov models)
    • A special case of Bayesian Inference
    • likely tag sequence is the one that maximizes the product of two terms:
      • probability of sequence of tags and probability of each tag generating a word
example
Example
  • Peter/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN
  • to/TO race/??? → ti = argmaxj P(tj | ti-1) P(wi | tj)
    • P(VB|TO) × P(race|VB)
  • Based on the Brown Corpus:
    • Probability that you will see this POS transition and that the word will take this POS
    • P(VB|TO) = .34 and P(race|VB) = .00003, so P(VB|TO) × P(race|VB) ≈ .00001
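A small worked sketch of the comparison above: the VB probabilities are the ones quoted on the slide (Brown corpus), while the NN probabilities are illustrative stand-ins, not taken from the slide, so that the two candidate tags for "race" can be compared.

# Bigram-HMM disambiguation for "to/TO race/???": pick the tag t that
# maximizes P(t | TO) * P(race | t).
candidates = {
    "VB": {"p_tag_given_prev": 0.34,  "p_word_given_tag": 0.00003},   # from the slide
    "NN": {"p_tag_given_prev": 0.021, "p_word_given_tag": 0.00041},   # assumed, illustrative
}

def score(tag):
    c = candidates[tag]
    return c["p_tag_given_prev"] * c["p_word_given_tag"]

for tag in candidates:
    print(tag, score(tag))
print("best tag:", max(candidates, key=score))   # VB wins for this example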
why is this important
Why is this Important?
  • Be aware of possibility of ambiguities
  • Possibly one has to normalize content before sending it to the tagger
  • Pre / post transliteration:
      • “Rhi you were da coolest last eve”
      • Rhi/VB you/PRP were/VBD da/VBG coolest/JJ last/JJ eve/NN
      • “Rhi you were the coolest last eve”
      • Rhi/VB you/PRP were/VBD the/DT coolest/JJ last/JJ eve/NN
computational linguistics syntax1
Computational Linguistics - Syntax

Understanding Phrase Structures, Parsing, Chunking

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

you saw pos tags what are phrases
You saw POS tags; what are Phrases?
  • Words don’t just occur in some order
  • Words are organized in phrases
    • groupings of words that clump together
  • Major phrase types
    • Noun Phrases
    • Prepositional phrases
    • Verb phrases
what is syntactic parsing
What is Syntactic Parsing?
    • Deriving the syntactic structure of a sentence based on a language model (grammar)
  • Natural Language Syntax described by a context free grammar
    • the Start-Symbol S ≡ sentence
    • Non-Terminals NT ≡ syntactic constituents
    • Terminals T ≡ lexical entries / words
    • Productions P: NT → (NT ∪ T)+ ≡ grammar rules

http://www.cs.umanitoba.ca/~comp4190/2006/NLP-Parsing.ppt

sample grammar
Sample Grammar
  • S ∈ NT, Part-of-Speech ∈ NT, Constituents ∈ NT, Words ∈ T. Rules:
    • S → NP VP (statement)
    • S → Aux NP VP (question)
    • S → VP (command)
    • NP → Det Nominal
    • NP → Proper-Noun
    • Nominal → Noun | Noun Nominal | Nominal PP
    • VP → Verb | Verb NP | Verb PP | Verb NP PP
    • PP → Prep NP
    • Det → that | this | a
    • Noun → book | flight | meal | money
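The grammar above can be written down and exercised directly. This is a minimal sketch using NLTK's CFG and chart parser; the lexical rules for Verb, Prep, Proper-Noun and Aux are assumptions added only to make the fragment runnable.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP | Aux NP VP | VP
NP -> Det Nominal | ProperNoun
Nominal -> Noun | Noun Nominal | Nominal PP
VP -> Verb | Verb NP | Verb PP | Verb NP PP
PP -> Prep NP
Det -> 'that' | 'this' | 'a'
Noun -> 'book' | 'flight' | 'meal' | 'money'
Verb -> 'book' | 'include'
Prep -> 'from' | 'to'
ProperNoun -> 'Denver'
Aux -> 'does'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("book that flight".split()):
    print(tree)   # prints the parse tree for the command "book that flight"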
parsing approaches
Parsing Approaches
  • Bottom-up parsing or data-driven
  • Top-down parsing or goal-driven
    • Example top-down derivation for "does this flight include a meal":
    • S → Aux NP VP → does (Det Nominal) (Verb NP) → does this flight include (Det Nominal) → does this flight include a meal
constituency vs dependency parsers
Constituency vs. Dependency Parsers

Constituency Parse - Nested Phrasal Structures

Dependency parse - Role Specific Structures

Natural Language Parsers, Peter Hellwig, Heidelberg

sample parser results
Sample Parser Results
  • Tagging
    • John/NNP bought/VBD a/DT book/NN ./.
  • Constituency Parse
      • Nested phrasal structure
      • (ROOT (S (NP (NNP John)) (VP (VBD bought) (NP (DT a) (NN book))) (. .)))
  • Typed dependencies
    • Role specific structure
      • nsubj(bought-2, John-1)
      • det(book-4, a-3)
      • dobj(bought-2, book-4)
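For comparison, a minimal sketch that reproduces this style of typed-dependency output using spaCy as a stand-in parser (an assumption; the slide used the Stanford parser, and label names can differ slightly between the two schemes).

import spacy

# Assumes the small English model is installed
# (python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("John bought a book.")
for token in doc:
    # Prints lines in the style nsubj(bought, John), det(book, a), dobj(bought, book)
    print(f"{token.dep_}({token.head.text}, {token.text})")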
how are parse results used
How are Parse Results Used?
  • Grammar checking: sentences that cannot be parsed may have grammatical errors
  • Using results of Dependency parse
    • Word sense disambiguation (dependencies as features or co-occurrence vectors)
dependency parsers
Dependency Parsers
  • MINIPAR
    • http://www.cs.ualberta.ca/~lindek/minipar.htm
  • Link Grammar parser: http://www.link.cs.cmu.edu/link/
  • Standard “CFG” parsers like the Stanford parser
    • http://www-nlp.stanford.edu/software/lex-parser.shtml
  • ENJU’s probabilistic HPSG grammar
    • http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
parsing vs chunking
Parsing vs. Chunking
  • Some applications don’t need the complex output of a full parse
  • Chunking / Shallow Parse / Partial Parse
    • Identifying and classifying flat, non-overlapping contiguous units in text
      • Segmenting and tagging
  • Example of chunking a sentence
      • [NP The morning flight] from [NP Denver] [VP has arrived]
  • Chunking algorithms: see the sketch below
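A minimal sketch of chunking with NLTK's regular-expression chunker over an already-tagged sentence; the NP/VP patterns are simple illustrative assumptions, not a production chunk grammar.

import nltk

sentence = [("The", "DT"), ("morning", "NN"), ("flight", "NN"),
            ("from", "IN"), ("Denver", "NNP"),
            ("has", "VBZ"), ("arrived", "VBN")]

grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}   # noun phrase: optional determiner, adjectives, nouns
  VP: {<VB.*>+}             # verb phrase: one or more verb tokens
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(sentence))
# (S (NP The/DT morning/NN flight/NN) from/IN (NP Denver/NNP) (VP has/VBZ arrived/VBN))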
using chunking results
Using Chunking Results
  • Entity recognition
      • people, locations, organizations
  • Studying linguistic patterns (Hearst 92)
      • gave NP
      • gave up NP in NP
      • gave NP NP
      • gave NP to NP
test drive a parser
Test Drive a Parser
  • Stanford and Enju parser demos; analyzing results
    • http://www-tsujii.is.s.u-tokyo.ac.jp/enju/demo.html
    • http://nlp.stanford.edu:8080/parser/
  • If you want to know how to run it stand alone
    • Talk to one of us or see their very helpful help pages
computational linguistics semantics
Computational Linguistics - Semantics

COLORLESS GREEN IDEAS SLEEP FURIOUSLY

Word Sequence → Syntactic Parser → Parse Tree → Semantic Analyzer → Literal Meaning → Discourse Analyzer → Meaning

the need for meaning
The Need for Meaning
  • When neither the raw linguistic inputs nor any structures derived from them facilitate the required semantic processing
  • When we need to link linguistic information to the non-linguistic real-world knowledge
  • Typical sources of knowledge
    • Meaning of words, grammatical constructs, discourse, topic..
meaning
Meaning
  • Lexical Semantics
    • The meanings of individual words
  • Formal Semantics (Compositional Semantics or Sentential Semantics)
    • How those meanings combine to make meanings for individual sentences or utterances
  • Discourse or Pragmatics
    • How those meanings combine with each other and with other facts about various kinds of context to make meanings for a text or discourse

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

what is a word
What is a word ?
  • Lexeme: set of forms taken by a single word
    • run, runs, ran and running forms of the same lexeme RUN
  • Lemma: a particular form of a lexeme that is chosen to represent a canonical form
      • Carpet for carpets; Sing for sing, sang, sung
  • Lemmatization: Meaning of a word approximated by meaning of its lemma
    • Mapping a morphological variant to its root
      • Derivational and Inflectional Morphology
word senses
Word Senses
  • Word sense: Meaning of a word (lemma)
    • Varies with context
  • Significance
    • Lexical ambiguity
      • consequences on tasks like parsing and tagging
      • implications on results of Machine translation, Text classification etc.
  • Word Sense Disambiguation
    • Selecting the correct sense for a word
semantic relations between word senses
Semantic Relations between Word Senses
  • Homonymy
  • Polysemy
  • Synonymy
  • Antonymy
  • Hypernymy
  • Hyponymy
  • Meronymy
  • Why do we care?

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

homonymy polysemy
Homonymy, Polysemy
  • Homonymy: share a form, relatively unrelated senses
    • Bank (financial institution, a sloping mound)
  • Polysemy: semantically related
    • Bank as a financial institution, as a blood bank
    • Verbs tend more to polysemy
synonyms hypernymy hyponymy
Synonyms, Hypernymy, Hyponymy
  • Different words/lemmas that have the same sense
    • Couch/chair
  • One sense more specific than the other (hyponymy)
    • Car is a hyponym of vehicle
  • One sense more general than the other (hypernymy)
    • Vehicle is a hypernym of car
meronymy holonymy part whole
Meronymy, holonymy – part whole
  • Meronymy
    • Engine part of car; engine meronym of car
  • Holonymy
    • Car is a holonym of engine
more general sense relations
More general sense relations
  • Semantic fields
    • Cohesive chunks of knowledge
    • Air travel:
      • Flight, travel, reservation, ticket, departure…
wordnet
WordNet
  • Models these sense relations
  • A hierarchically organized lexical database
  • On-line thesaurus + aspects of a dictionary
      • Versions for other languages are under development

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

sample wordnet entry
Sample WordNet Entry

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

verb and noun relations
Verb and Noun Relations
  • Verbs and Nouns in separate hierarchies

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

word senses in wordnet
Word Senses in WordNet
  • The set of near-synonyms for a WordNet sense is called a synset (synonym set)
    • Their version of a sense or a concept
  • Duck as a verb to mean
      • to move (the head or body) quickly downwards or away
      • dip, douse, hedge, fudge, evade, put off, circumvent, parry, elude, skirt, dodge, duck, sidestep
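A minimal sketch querying WordNet through NLTK for the verb synsets of "duck", printing each synset's member lemmas and its gloss.

from nltk.corpus import wordnet as wn

for synset in wn.synsets("duck", pos=wn.VERB):
    lemmas = [lemma.name() for lemma in synset.lemmas()]
    print(synset.name(), lemmas, "-", synset.definition())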
how are word senses used
How are Word Senses used
  • IR and QnA
    • Indexing using similar (synonymous) words/query or specific to general words (hyponymy / hypernymy) improves text retrieval
  • Machine translation, QnA
    • Need to know if two words are similar to know if we can substitute one for another
semantic relations b w senses
Semantic Relations b/w Senses
  • Most well developed
    • Synonymy or similarity
  • Synonymy – a binary relationship between words, or rather between their senses
  • Approaches
    • Thesaurus based : measuring word/sense similarity in a thesaurus
    • Distributional methods: finding other words with similar distributions in a corpus
thesaurus based methods for sense relations intuition
Thesaurus based methods for sense relations - Intuition
  • Thesaurus based
    • Path based similarity – two words are similar if they are similar in the thesaurus hierarchy

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
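A minimal sketch of the path-based intuition using NLTK's WordNet interface: senses that sit close together in the hypernym hierarchy score higher than unrelated senses.

from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")
vehicle = wn.synset("vehicle.n.01")
fork = wn.synset("fork.n.01")
print("car ~ vehicle:", car.path_similarity(vehicle))   # close in the hierarchy
print("car ~ fork:", car.path_similarity(fork))         # far apart, lower score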

disadvantages
Disadvantages
  • We don’t have a thesaurus for every language. Even if we do, many words are missing
    • Wordnet: Strong for nouns, but lacking for adjectives and even verbs
    • Expensive to build
  • They rely on hyponym info for similarity
    • car hyponym of vehicle
  • Alternative - Distributional methods for word similarity

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

distributional methods intuition
Distributional Methods - Intuition
  • Firth (1957): “You shall know a word by the company it keeps!”
  • Similar words appear in similar contexts - Nida example noted by Lin:
      • A bottle of tezgüino is on the table
      • Everybody likes tezgüino
      • Tezgüino makes you drunk
      • We make tezgüino out of corn.

Partial material from http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt
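A minimal sketch of the distributional idea: represent each word as a vector of co-occurrence counts over context words and compare the vectors with cosine similarity. The counts below are invented purely for illustration.

import numpy as np

# Context words: bottle, table, drunk, corn, likes (toy, assumed counts)
cooccurrence = {
    "tezguino": np.array([1, 1, 1, 1, 1]),
    "wine":     np.array([1, 1, 1, 0, 1]),
    "corn":     np.array([0, 0, 0, 1, 1]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print("tezguino ~ wine:", cosine(cooccurrence["tezguino"], cooccurrence["wine"]))
print("tezguino ~ corn:", cosine(cooccurrence["tezguino"], cooccurrence["corn"]))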

similarity synonymy detected plagiarism
Similarity/Synonymy – Detected Plagiarism

http://www.stanford.edu/class/cs224u/224u.07.lec2.ppt

your own miner
YOUR OWN MINER

So you want to build your own text miner!

tm systems
TM Systems
  • Infrastructure intensive
  • Luckily, plenty of open source tools, frameworks, resources..
    • http://www-nlp.stanford.edu/links/statnlp.html
    • http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html
take a sample application
Take a Sample Application
  • Mining opinions from casual text
  • Data – user comments on artist pages from MySpace
    • “Your musics the shit,…lovve your video you are so bad”
    • “Your music is wicked!!!!”
  • Goal
    • Popularity lists generated from listeners' comments to complement radio plays / CD sales lists
typical order of tasks
Typical Order of Tasks

“Your musics the shit,…lovve your video you are so bad”

  • Pre-processing
      • strip html, normalizing text from different sources..
  • Tokenization
      • Splitting text into tokens : word tokens, number tokens, domain specific requirements
  • Sentence splitting
      • ! . ? … ; harder in casual text
  • Normalizing words
      • Stop word removal, lemmatization, stemming, transliterations (da == the)
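A minimal sketch (assuming NLTK and its stop-word/WordNet data are installed) that strings the steps above together for a MySpace-style comment: sentence splitting, tokenization, a tiny invented slang/transliteration table, stop-word removal and lemmatization.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "Your musics the shit. lovve your video you are so bad"
slang = {"da": "the", "lovve": "love"}          # assumed transliteration table
lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

for sentence in nltk.sent_tokenize(text):
    tokens = [slang.get(t.lower(), t.lower()) for t in nltk.word_tokenize(sentence)]
    content = [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop]
    print(content)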
more concrete tasks
More Concrete Tasks

‘The smile is so wicked!!’

  • Syntax : Marking sentiment expression from syntax or a dictionary
      • The/DT smile/NN is/VBZ so/RB wicked/JJ !/. !/.
  • Semantics : Surrounding context
      • On Lily Allen’s MySpace page. Cues for Co-ref resolution
      • Smile is a track by Lily Allen. Ambiguity
  • Background knowledge / resources
      • Using urbandictionary.com for semantic orientation of ‘wicked’
frameworks to get you started
Frameworks to get you started
  • GATE – General Architecture for Text Engineering, since 1995 at the University of Sheffield, UK
  • UIMA - Unstructured Information Management Architecture, IBM
    • Document processing tools, syntactic and NLP components, and an integrating framework
text mining part ii
TEXT MINING – PART II

SAMPLE APPLICATIONS, SURVEY OF EFFORTS IN TWO SAMPLE AREAS

a text mining process text analytics
A Text Mining process – Text Analytics

Information Extraction = segmentation + classification + association + mining

Text mining = entity identification + named relationship extraction + discovering association chains…

"This MEK dependency was observed in BRAF mutant cells regardless of tissue lineage, and correlated with both downregulation of cyclin D1 protein expression and the induction of G1 arrest."

Segmentation and Classification

*MEK dependency ISA Dependency_on_an_Organic_chemical

*BRAF mutant cells ISA Cell_type

*downregulation of cyclin D1 protein expression ISA Biological_process

*tissue lineage ISA Biological_concept

*induction of G1 arrest ISA Biological_process

Named Relationship Extraction and Mining

MEK dependency – observed in → BRAF mutant cells

MEK dependency – correlated with → downregulation of cyclin D1 protein expression

MEK dependency – correlated with → induction of G1 arrest
nei named entity identification
NEI – Named Entity Identification
  • The task of classifying token sequences in text into one or more predefined classes
  • Approaches
    • Look up a list
      • Sliding window
    • Use rules
    • Machine learning
  • Compound entities
  • Applied to
    • Wikipedia like text
    • Biomedical text
approaches to nei
Approaches to NEI
  • The simplest approach
    • Proper nouns make up majority of named entities
    • Look up a gazetteer
      • CIA fact book for organizations, country names etc.
    • Poor recall
      • coverage problems
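A minimal sketch of the gazetteer / sliding-window idea: move a window over the token sequence and mark any window that exactly matches a list entry. The three-entry gazetteer is invented for illustration; real lists (e.g. from the CIA fact book) are much larger, which is exactly where the coverage and recall problems come from.

gazetteer = {("United", "Nations"): "ORG", ("Baghdad",): "LOC", ("David", "Petraeus"): "PER"}
max_len = max(len(entry) for entry in gazetteer)

def tag_entities(tokens):
    spans, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):   # longest match first
            window = tuple(tokens[i:i + n])
            if window in gazetteer:
                spans.append((i, i + n, gazetteer[window]))
                i += n
                break
        else:
            i += 1
    return spans

print(tag_entities("David Petraeus heads for Baghdad".split()))
# [(0, 2, 'PER'), (4, 5, 'LOC')]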
approaches to nei1
Approaches to NEI
  • Rule based [Mikheev et al. 1999]

Frequency Based

"China International Trust and Investment Corp”

"Suspended Ceiling Contractors Ltd”

"Hughes“ when "Hughes Communications Ltd.“ is already marked as an organization

  • Scalability issues:
    • Expensive to create manually
    • Leverages domain specific information – domain specific
    • Tend to be corpus-specific – due to manual process
approaches to nei2
Approaches to NEI
  • Machine learning approaches
    • Ability to generalize better than rules
    • Can capture complex patterns
    • Requires training data
      • Often the bottleneck
  • Techniques [list taken from Agichtein2007]
    • Naive Bayes
    • SRV [Freitag 1998], Inductive Logic Programming
    • Rapier [Califf and Mooney 1997]
    • Hidden Markov Models [Leek 1997]
    • Maximum Entropy Markov Models [McCallum et al. 2000]
    • Conditional Random Fields [Lafferty et al. 2001]
features used for ner
Features used for NER
  • Context Features
    • Window of words
      • Fixed
      • Variable
  • Orthographical Features
    • CD28, a protein
  • Part-of-speech features
    • Current word
    • Adjacent words – within fixed window
  • Word shape features
    • Kappa-B replaced with Aaaaa-A
  • Dictionary features
    • Inexact matches
  • Prefixes and Suffixes
    • “~ase” = protein
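A minimal sketch of the word-shape feature above: map uppercase letters to 'A', lowercase letters to 'a' and digits to '0', which turns "Kappa-B" into "Aaaaa-A".

def word_shape(token):
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

print(word_shape("Kappa-B"))   # Aaaaa-A
print(word_shape("CD28"))      # AA00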
details of some machine learning techniques used for ner
Details of some Machine Learning techniques used for NER
  • HMMs
    • a powerful tool for representing sequential data
    • are probabilistic finite state models with parameters for state-transition probabilities and state-specific observation probabilities
    • the observation probabilities are typically represented as a multinomial distribution over a discrete, finite vocabulary of words
    • Training is used to learn parameters that maximize the probability of the observation sequences in the training data
  • Generative
    • Find parameters to maximize P(X,Y)
    • When labeling Xi future observations are taken into account (forward-backward)
  • Problems
    • Feature overlap in NER
      • E.g. to extract previously unseen company names from a newswire article
        • the identity of a word alone is not very predictive
        • knowing that the word is capitalized, that is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive
      • Would like the observations to be parameterized with these overlapping features
    • Feature independence assumption

Several features about same word can affect parameters

details of some machine learning techniques used for ner1
Details of some Machine Learning techniques used for NER
  • MEMMs [McCallum et. al, 2000]
    • Discriminative
      • Find parameters to maximize P(Y|X)
    • No longer assume that features are independent
      • f(“Apple”, Company) = 1.
    • Do not take future observations into account (no forward-backward)
  • Problems
    • Label bias problem
details of some machine learning techniques used for ner2
Details of some Machine Learning techniques used for NER
  • CRFs [Lafferty et. al, 2001]
    • Discriminative
    • Doesn’t assume that features are independent
    • When labeling Yi future observations are taken into account
    • Global optimization – label bias prevented
  • The best of both worlds!
using crfs for ner
Using CRFs for NER
  • Example
    • [ORG U.S. ] general [PER David Petraeus] heads for [LOC Baghdad ] .

Token      POS   Chunk   Tag
--------------------------------
U.S.       NNP   I-NP    I-ORG
general    NN    I-NP    O
David      NNP   I-NP    B-PER
Petraeus   NNP   I-NP    I-PER
heads      VBZ   I-VP    O
for        IN    I-PP    O
Baghdad    NNP   I-NP    I-LOC
.          .     O       O

    • CONLL format – Mallet
    • Major bottleneck is training data
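A minimal CRF tagging sketch in the spirit of the CoNLL example above. It uses the sklearn-crfsuite package rather than Mallet (an assumption), a toy feature function, and it trains on the single sentence shown, so it only illustrates the data flow, not a realistic setup.

import sklearn_crfsuite

def token_features(sent, i):
    word, pos = sent[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "pos": pos,
        "prev_pos": sent[i - 1][1] if i > 0 else "BOS",
        "next_pos": sent[i + 1][1] if i < len(sent) - 1 else "EOS",
    }

train_sent = [("U.S.", "NNP"), ("general", "NN"), ("David", "NNP"), ("Petraeus", "NNP"),
              ("heads", "VBZ"), ("for", "IN"), ("Baghdad", "NNP"), (".", ".")]
train_tags = ["I-ORG", "O", "B-PER", "I-PER", "O", "O", "I-LOC", "O"]

X_train = [[token_features(train_sent, i) for i in range(len(train_sent))]]
y_train = [train_tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))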
semi automated training data generation
Semi-automated training data generation
  • Context Induction approach [Talukdar2006]
      • Starting with a few seed entities, it is possible to induce high-precision context patterns by exploiting entity context redundancy.
      • New entity instances of the same category can be extracted from unlabeled data with the induced patterns to create high-precision extensions of the seed lists.
      • Features derived from token membership in the extended lists improve the accuracy of learned named-entity taggers.

Figure: seed entities → induced and pruned extraction patterns → extended entity lists → feature generation for the CRF

entity identification
Entity Identification
  • Machine Learning
    • Best performance
  • Problem
    • Training data bottleneck
  • Pattern induction
    • Reduce training data creation time
relationship fact extraction
Relationship/Fact Extraction
  • Knowledge Engineering approach
    • Manually crafted rules
        • Over lexical items
        • Over syntactic structures – parse trees
    • GATE
  • Machine learning approaches
    • Supervised
    • Semi-supervised
    • Unsupervised
machine learning approach to relationship extraction
Machine Learning approach to relationship extraction
  • Supervised
      • BioText – extraction of relationships between diseases and their treatments [Rosario et. al 2004]
      • Rule-based supervised approach [Rinaldi et. al 2004]
        • Semantics of specific relationship encoded as rules
        • Identify a set of relations along with their morphological variants (bind, regulate, signal etc.)
          • subj(bind,X,_,_), pobj(bind,Y,to,_), prep(Y,to,_,_) => bind(X,Y).
        • Axiom formulation was however a manual process involving a domain expert.
other supervised approaches to relationship extraction
Other Supervised approaches to relationship extraction
  • Hand-coded domain specific rules that encode patterns used to extract
      • Molecular pathways [Freidman et. al. 2001]
      • Protein interaction [Saric et. al. 2006]
  • All of the above in the biomedical domain
    • Notice – specificity of relationship types
    • Amount of effort required
    • Also notice types of entities involved in the relationships
a related problem supervised
A related problem – Supervised
  • Semantic Role Labeling
    • Features
  • Detailed tutorial on SRL is available
    • By Yih & Toutanova here
related approaches
Related approaches
  • Other approaches
    • Discovering concept-specific relations
        • Dmitry Davidov, et. al 2007,
    • preemptive IE approach
        • Rosenfeld & Feldman 2007
    • Open Information Extraction
      • Banko et. al 2007
      • Self supervised approach
      • Uses dependency parses to train extractors
    • On-demand information extraction
      • Sekine 2006
        • IR driven
        • Patterns discovery
        • Paraphrase
fact extraction from wikipedia
Fact Extraction from Wikipedia
  • Rule and Heuristic based method
    • YAGO Suchanek et. al, 2007
    • Pattern-based approach
    • Uses WordNet
  • Subtree mining over dependency parse trees
    • Nguyen et. al, 2007
compound entities
Compound Entities
  • Entities (MeSH terms) in sentences occur in modified forms
    • “adenomatous” modifies “hyperplasia”
    • “An excessive endogenous or exogenous stimulation” modifies “estrogen”
  • Entities can also occur as composites of 2 or more other entities
    • “adenomatous hyperplasia” and “endometrium” occur as “adenomatous hyperplasia of the endometrium”
joint extraction of compound entities and relationships
Joint Extraction of Compound Entities and Relationships
  • Small set of rules over dependency types dealing with
    • modifiers (amod, nn), subjects and objects (nsubj, nsubjpass), etc.
  • Since dependency types are arranged in a hierarchy
    • We use this hierarchy to generalize the more specific rules
    • There are only 4 rules in our current implementation

Carroll, J., G. Minnen and E. Briscoe (1999) `Corpus annotation for parser evaluation'. In Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora, Bergen, Norway. 35-41. Also in Proceedings of the ATALA Workshop on Corpus Annotés pour la Syntaxe - Treebanks, Paris, France. 13-20.

Figure: the relationship head in the dependency parse links a subject head to one or more object heads.
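A minimal sketch (not the authors' implementation) of rule-based extraction over dependency types: the verb is taken as the relationship head, its nsubj/nsubjpass child as the subject head, its dobj/obj child as the object head, and each head is expanded with its amod/compound (nn) modifiers. spaCy is used here as the dependency parser purely for illustration.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

def expand(head):
    # Pull in adjectival and noun-noun modifiers to form the compound entity.
    mods = [t.text for t in head.lefts if t.dep_ in ("amod", "compound")]
    return " ".join(mods + [head.text])

def extract_triples(sentence):
    triples = []
    for token in nlp(sentence):
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj")]
            for s in subjects:
                for o in objects:
                    triples.append((expand(s), token.lemma_, expand(o)))
    return triples

print(extract_triples("Adenomatous hyperplasia affects the endometrium."))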

representation
Representation

Modifiers

Modified entities

Composite Entities

evaluation
Evaluation
  • Manual Evaluation
    • Test if the RDF conveys same “meaning” as the sentence
    • Juxtapose the triple with the sentence
    • Allow user to assess correctness/incorrectness of the subject, object and triple
discovering complex connection patterns
Discovering complex connection patterns

Discovering informative subgraphs (Harry Potter)

Given a pair of end-points (entities)

Produce a subgraph with relationships connecting them such that

The subgraph is small enough to be visualized

And contains relevant “interesting” connections

We defined an interestingness measure based on the ontology schema

In the future, for the biomedical domain, the scientist will control this with the help of a browsable ontology

Our interestingness measure takes into account

Specificity of the relationships and entity classes involved

Rarity of relationships etc.

Cartic Ramakrishnan, William H. Milnor, Matthew Perry, Amit P. Sheth: Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations 7(2): 56-63 (2005)

heuristics
Heuristics

Two factors influencing interestingness

discovery algorithm
Discovery Algorithm
  • Bidirectional lock-step growth from S and T
  • Choice of next node based on interestingness measure
  • Stop when there are enough connections between the frontiers
  • This is treated as the candidate graph
discovery algorithm1
Discovery Algorithm

Model the Candidate graph as an electrical circuit

S is the source and T the sink

Edge weights derived from the ontology schema are treated as conductance values

Using Ohm's and Kirchhoff's laws we find maximum-current-flow paths through the candidate graph from S to T

At each step this path is added to the output graph to be displayed; we repeat the process till a predefined number of nodes is reached
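A minimal sketch of the electrical analogy under stated assumptions (the paper's exact procedure may differ): schema-derived edge weights are treated as conductances, unit current is injected at S with T grounded, node potentials are obtained by solving the graph Laplacian, and one path is read off by greedily following the highest-current edge.

import numpy as np
import networkx as nx

def max_current_path(G, s, t):
    nodes = list(G.nodes())
    idx = {n: i for i, n in enumerate(nodes)}
    L = np.zeros((len(nodes), len(nodes)))            # weighted graph Laplacian
    for u, v, data in G.edges(data=True):
        c = data.get("conductance", 1.0)
        L[idx[u], idx[u]] += c
        L[idx[v], idx[v]] += c
        L[idx[u], idx[v]] -= c
        L[idx[v], idx[u]] -= c
    b = np.zeros(len(nodes))
    b[idx[s]] = 1.0                                   # unit current injected at the source
    keep = [i for i in range(len(nodes)) if i != idx[t]]   # ground the sink
    potentials = np.zeros(len(nodes))
    potentials[keep] = np.linalg.solve(L[np.ix_(keep, keep)], b[keep])
    path, current = [s], s
    while current != t and len(path) <= len(nodes):   # follow the largest-current edge
        nbrs = G[current]
        nxt = max(nbrs, key=lambda v: nbrs[v].get("conductance", 1.0) *
                  (potentials[idx[current]] - potentials[idx[v]]))
        path.append(nxt)
        current = nxt
    return path

# Toy graph; conductances stand in for schema-derived weights (assumed values).
G = nx.Graph()
G.add_edge("S", "A", conductance=2.0)
G.add_edge("A", "T", conductance=2.0)
G.add_edge("S", "B", conductance=0.5)
G.add_edge("B", "T", conductance=0.5)
print(max_current_path(G, "S", "T"))                  # ['S', 'A', 'T']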

Results

Arnold Schwarzenegger, Edward Kennedy

Other related work

Semantic Associations

take away message
Take Away Message
  • Text Mining, Analysis → understanding → utilization in decision making → knowledge discovery
  • Entity Identification → focus change from simple to compound entities
  • Relationship extraction → implicit vs. explicit
  • Need more unsupervised approaches
  • Need to think of incentives to evaluate
major issue
Major issue
  • Existing corpora
    • GENIA, BioInfer many others
    • Narrow focus
  • Precision and Recall
  • Utility
    • How useful is the extracted information? How do we measure utility?
      • Swanson’s discovery, Enrichment of Browsing experience
  • Text types and mining
    • Systematically compensating for (in)formality
useful links
Useful links

http://www.cs.famaf.unc.edu.ar/~laura/text_mining/

http://www.stanford.edu/class/cs276/cs276-2005-syllabus.html

http://www-nlp.stanford.edu/links/statnlp.html

http://www.cedar.buffalo.edu/~rohini/CSE718/References2.html

references
References

Automating Creation of Hierarchical Faceted Metadata Structures Emilia Stoica, Marti Hearst, and Megan Richardson, in the proceedings of NAACL-HLT, Rochester NY, April 2007

Finding the Flow in Web Site Search, Marti Hearst, Jennifer English, Rashmi Sinha, Kirsten Swearingen, and Ping Yee, Communications of the ACM, 45 (9), September 2002, pp.42-49.

R. Kumar, U. Mahadevan, and D. Sivakumar, "A graph-theoretic approach to extract storylines from search results",  in Proc. KDD, 2004, pp.216-225.

Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676

Hearst, M. A. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th Conference on Computational Linguistics - Volume 2 (Nantes, France, August 23 - 28, 1992).

Dat P. T. Nguyen, Yutaka Matsuo, Mitsuru Ishizuka: Relation Extraction from Wikipedia Using Subtree Mining. AAAI 2007: 1414-1420

"Unsupervised Discovery of Compound Entities for Relationship Extraction" Cartic Ramakrishnan, Pablo N. Mendes, Shaojun Wang and Amit P. Sheth EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management Knowledge Patterns

Mikheev, A., Moens, M., and Grover, C. 1999. Named Entity recognition without gazetteers. In Proceedings of the Ninth Conference on European Chapter of the Association For Computational Linguistics (Bergen, Norway, June 08 - 12, 1999).

McCallum, A., Freitag, D., and Pereira, F. C. 2000. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Proceedings of the Seventeenth international Conference on Machine Learning

Lafferty, J. D., McCallum, A., and Pereira, F. C. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth international Conference on Machine Learning

references1
References
  • Barbara, R. and A.H. Marti, Classifying semantic relations in bioscience texts, in Proceedings of the 42nd ACL. 2004, Association for Computational Linguistics: Barcelona, Spain.
  • M.A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING‘ 92, pages 539–545
  • M. Hearst, "Untangling text data mining," 1999. [Online]. Available: http://citeseer.ist.psu.edu/563035.html
  • Friedman, C., et al., GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 2001. 17 Suppl 1: p. 1367-4803.
  • Saric, J., et al., Extraction of regulatory gene/protein networks from Medline. Bioinformatics, 2005.
  • Ciaramita, M., et al., Unsupervised Learning of Semantic Relations between Concepts of a Molecular Biology Ontology, in 19th IJCAI. 2005.
  • Dmitry Davidov, Ari Rappoport, Moshe Koppel. Fully Unsupervised Discovery of Concept-Specific Relationships by Web Mining. Proceedings, ACL 2007, June 2007, Prague.
  • Rosenfeld, B. and Feldman, R. 2007. Clustering for unsupervised relation identification. In Proceedings of the Sixteenth ACM Conference on Conference on information and Knowledge Management (Lisbon, Portugal, November 06 - 10, 2007).
  • Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, Oren Etzioni: Open Information Extraction from the Web. IJCAI 2007: 2670-2676
  • Sekine, S. 2006. On-demand information extraction. In Proceedings of the COLING/ACL on Main Conference Poster Sessions (Sydney, Australia, July 17 - 18, 2006). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 731-738.
  • Suchanek, F. M., Kasneci, G., and Weikum, G. 2007. Yago: a core of semantic knowledge. In Proceedings of the 16th international Conference on World Wide Web (Banff, Alberta, Canada, May 08 - 12, 2007). WWW '07.