Presentation Transcript

Meaning from Text: Teaching Computers to Read

Steven Bethard

University of Colorado

Query: “Who is opposing the railroad through Georgia?”

Result 1: en.wikipedia.org/wiki/Sherman's_March_to_the_Sea
…they destroyed the railroads and the manufacturing and agricultural infrastructure of the state…Henry Clay Work wrote the song Marching Through Georgia…

Result 3: www.ischool.berkeley.edu/~mkduggan/politics.html
While the piano piece "Marching Through Georgia" has no words...Party of California (1882) has several verses opposing the "railroad robbers"...

Result 71: www.azconsulatela.org/brazaosce.htm
Azerbaijan, Georgia and Turkey plan to start construction of Kars-Akhalkalaki-Tbilisi-Baku railroad in May, 2007…However, we’ve witnessed a very strong opposition to this project both in Congress and White House. President George Bush signed a bill prohibiting financing of this railroad…

What went wrong?
  • Didn’t find some similar word forms (Morphology; a normalization sketch follows this list)
    • Finds opposing but not opposition
    • Finds railroad but not railway
  • Didn’t know how words should be related (Syntax)
    • Looking for: opposing railroad
    • Finds: opposing the “railroad robbers”
  • Didn’t know that “is opposing” means current (Semantics/Tense)
    • Looking for: recent documents
    • Finds: Civil War documents
  • Didn’t know that “who” means a person (Semantics/Entities)
    • Looking for: <person> opposing
    • Finds: several verses opposing
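A minimal sketch of the morphology fix, assuming a hand-built lemma table (every entry here is hypothetical; a real system would use a stemmer or morphological analyzer):

```python
# Hypothetical lemma table: maps surface forms to a shared canonical form.
LEMMAS = {
    "opposing": "oppose",
    "opposition": "oppose",
    "railway": "railroad",
    "railroads": "railroad",
}

def normalize(word):
    # Lowercase, strip trailing punctuation, then look up the lemma.
    w = word.lower().strip('."?')
    return LEMMAS.get(w, w)

query = "Who is opposing the railroad through Georgia?"
snippet = "a very strong opposition to this railway project"
overlap = {normalize(w) for w in query.split()} & {normalize(w) for w in snippet.split()}
print(overlap)  # {'oppose', 'railroad'} -- a match the raw strings would miss
```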
Teaching Linguistics to Computers
  • Natural Language Processing (NLP)
    • Symbolic approaches
    • Statistical approaches
  • Machine learning overview
  • Statistical NLP
    • Example: Identifying people and places
    • Example: Constructing timelines
Early Natural Language Processing
  • Symbolic approaches
  • Small domains
  • Example:
    • SHRDLU blocks world
    • Vocabulary of ~50 words
    • Simple word combinations
    • Hand-written rules to understand sentences

Person: WHAT DOES THE BOX CONTAIN?

Comp: THE BLUE PYRAMID.

Person: WHAT IS THE PYRAMID SUPPORTED BY?

Comp: THE BOX.

Person: HOW MANY BLOCKS ARE NOT IN THE BOX?

Comp: SEVEN OF THEM.

Recent Natural Language Processing
  • Large scale linguistic corpora
    • e.g. Penn TreeBank: ~1 million words of syntax (a sample tree is sketched after this slide)
  • Statistical machine learning
    • e.g. Charniak parser
      • Trained on the TreeBank
      • Builds new trees with 90% accuracy
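To make the TreeBank bullet concrete, here is what a Penn-TreeBank-style bracketing looks like for a SHRDLU-style sentence; the snippet assumes NLTK is installed and uses nltk.Tree only to render the tree:

```python
from nltk import Tree

# A TreeBank-style bracketing: phrase labels (S, NP, VP) over
# part-of-speech tags (DT, NN, VBZ) over the words themselves.
t = Tree.fromstring(
    "(S (NP (DT The) (NN box)) (VP (VBZ contains) (NP (DT the) (JJ blue) (NN pyramid))))"
)
t.pretty_print()   # draws the tree as ASCII art
print(t.leaves())  # ['The', 'box', 'contains', 'the', 'blue', 'pyramid']
```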
Machine Learning
  • General approach
    • Analyze data
    • Extract preferences
    • Classify new examples using learned preferences
  • Supervised machine learning
    • Data have human-annotated labels
      • e.g. each sentence in the TreeBank has a syntactic tree
    • Learns human preferences

[Figure: “A Two-Dimensional Space” scatter plot of labeled points, with unlabeled points marked “?”]

Supervised Machine Learning Models
  • Given:
    • An N-dimensional feature space
    • Points in that space
    • A human-annotated label for each point
  • Goal:
    • Learn a function to assign labels to points
  • Methods:
    • K-nearest-neighbors, support vector machines, etc.
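A minimal sketch of one such method, k-nearest-neighbors, on made-up points in a two-dimensional space (all coordinates and labels below are invented for illustration):

```python
import math
from collections import Counter

# Human-labeled training points in a 2-D feature space (toy data).
train = [((1.0, 1.2), "A"), ((0.8, 0.9), "A"),
         ((3.1, 3.0), "B"), ((2.9, 3.3), "B")]

def knn(point, k=3):
    # Sort training points by distance, then let the k nearest vote.
    nearest = sorted(train, key=lambda p: math.dist(point, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn((1.1, 1.0)))  # 'A' -- this '?' point lands in the A cluster
print(knn((3.0, 3.1)))  # 'B'
```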

Machine Learning Examples
  • Character Recognition
    • Feature space: 256 pixels (0 = black, 1 = white)
    • Labels: A, B, C, …
  • Cardiac Arrhythmia
    • Feature space: age, sex, heart rate, …
    • Labels: has arrhythmia, doesn’t have arrhythmia
  • Mushrooms
    • Feature space: cap shape, gill color, stalk surface, …
    • Labels: poisonous, edible
  • … and many more:
    • http://www.ics.uci.edu/~mlearn/MLRepository.html
Machine Learning and Language
  • Example:
    • Identifying people, places, organizations (named entities)
    • However, we’ve witnessed a very strong opposition to this project both in [ORG Congress] and [ORG White House]. President [PER George Bush] signed a bill prohibiting financing of this railroad.
  • This doesn’t look like that lines and dots example!
    • What’s the classification problem?
    • What’s the feature space?
Named Entities: Classification
  • Word-by-word classification
  • Is the word at the beginning of, inside, or outside a named entity?
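A common encoding of this beginning/inside/outside decision is BIO tagging; a toy illustration on the running example (the tags shown are what a trained model should output, not output from a real system):

```python
# One BIO tag per word: B-X begins an entity of type X, I-X continues
# it, O marks words outside any entity.
words = ["President", "George", "Bush", "signed", "a", "bill"]
tags  = ["O", "B-PER", "I-PER", "O", "O", "O"]
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```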
Named Entities: Clues
  • The word itself
    • U.S. is always a Location
    • (though Turkey is not)
  • Part of speech
    • The Locations Turkey and Georgia are nouns
    • (though the White of White House is not)
  • Is the first letter of the word capitalized?
    • Bush and Congress are capitalized
    • (though the von of von Neumann is not)
  • Is the word at the start of the sentence?
    • In the middle of a sentence, Will is likely a Person
    • (but at the start it could be an auxiliary verb)
Named Entities: Clues as Features
  • Each clue defines part of the feature space
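A minimal sketch of this step for one word: each clue becomes one entry in a feature dictionary. The part_of_speech argument is a stand-in for a real POS tagger (here stubbed to always answer NNP):

```python
def word_features(words, i, part_of_speech=lambda ws, i: "NNP"):
    # Collect the clues from the previous slide for word i.
    w = words[i]
    return {
        "word": w.lower(),
        "pos": part_of_speech(words, i),   # stubbed POS tagger
        "is_capitalized": w[0].isupper(),
        "starts_sentence": i == 0,
    }

words = "President George Bush signed a bill".split()
print(word_features(words, 1))
# {'word': 'george', 'pos': 'NNP', 'is_capitalized': True, 'starts_sentence': False}
```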
Named Entities: String Features
  • But machine learning models need numeric features!
    • True → 1
    • False → 0
    • Congress → ?
    • adjective → ?
  • Solution:
    • Binary feature for each word
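A sketch of that solution: one binary dimension per (feature, value) pair, using a toy vocabulary (a real vocabulary would be collected from the training data):

```python
# Toy vocabulary of indicator-feature names.
vocabulary = ["word=congress", "word=house", "word=white",
              "pos=NNP", "pos=DT", "is_capitalized", "starts_sentence"]
index = {name: i for i, name in enumerate(vocabulary)}

def to_vector(features):
    # Turn a feature dict into a 0/1 vector over the vocabulary.
    vec = [0] * len(vocabulary)
    for key, value in features.items():
        if isinstance(value, bool):
            if not value:
                continue             # False contributes nothing
            name = key               # True -> the bare feature name
        else:
            name = f"{key}={value}"  # string value -> its own dimension
        if name in index:
            vec[index[name]] = 1
    return vec

print(to_vector({"word": "congress", "pos": "NNP",
                 "is_capitalized": True, "starts_sentence": False}))
# [1, 0, 0, 1, 0, 1, 0]
```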
Named Entities: Review

…[ORG Congress] and [ORG White House]…

Named Entities: Features and Models
  • String features
    • word itself
    • part of speech
    • starts sentence
    • has initial capitalization
  • How many numeric features?
    • N = N_words + N_parts-of-speech + 1 + 1
    • N_words ≈ 10,000
    • N_parts-of-speech ≈ 50
  • Need efficient implementations, e.g. TinySVM
Named Entities in Use
  • We know how to:
    • View named entity recognition as classification
    • Convert clues to an N-dimensional feature space
    • Train a machine learning model
  • How can we use the model?
Named Entities in Research
  • TREC-QA
    • Factoid question answering
    • Various research systems compete
    • All use named entity matching
  • State of the art performance: ~90%
    • That’s 10% wrong!
    • But good enough for real use
    • Named entities are a “solved” problem
  • So what’s next?
Learning Timelines
  • The top commander of a Cambodian resistance force said Thursday he has sent a team to recover the remains of a British mine removal expert kidnapped and presumed killed by Khmer Rouge guerrillas almost two years ago.
Why Learn Timelines?
  • Timelines are summarization
    • 1996
      • Khmer Rouge kidnapped and killed British mine removal expert
    • 1998
      • Cambodian commander sent recovery team
  • Timelines allow reasoning
    • Q: When was the expert kidnapped? A: Almost two years ago.
    • Q: Was the team sent before the expert was killed? A: No, afterwards.
Learning Timelines: Classification
  • Standard questions:
    • What’s the classification problem?
    • What’s the feature space?
  • Three different problems
    • Identify times
    • Identify events
    • Identify links (temporal relations)
Times and Events: Classification
  • Word-by-word classification
  • Time features:
    • word itself
    • has digits
  • Event features:
    • word itself
    • suffixes (e.g. -ize, -tion)
    • root (e.g. evasion → evade)
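A minimal sketch of these two feature extractors (the suffix list is illustrative, and the root lookup from the last bullet would need a real morphological dictionary, so it is omitted):

```python
import re

EVENT_SUFFIXES = ("ize", "tion", "ed", "ing")  # illustrative subset

def time_features(word):
    return {"word": word.lower(),
            "has_digits": bool(re.search(r"\d", word))}

def event_features(word):
    w = word.lower()
    return {"word": w,
            "suffixes": [s for s in EVENT_SUFFIXES if w.endswith(s)]}

print(time_features("2007"))           # {'word': '2007', 'has_digits': True}
print(event_features("construction"))  # {'word': 'construction', 'suffixes': ['tion']}
```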
Times and Events: State of the Art
  • Performance:
    • Times: ~90%
    • Events: ~80%
    • Mr Bryza, it's been [Event reported] that Azerbaijan, Georgia and Turkey [Event plan] to [Event start] [Event construction] of Kars-Akhalkalaki-Tbilisi-Baku railroad in [Time May], [Time 2007].
  • Why are events harder?
    • No orthographic cues (capitalization, digits, etc.)
    • More parts of speech (nouns, verbs and adjectives)
Temporal Links
  • Everything so far looked like:
    • Aaaa [X bb] ccccc [Y dd eeeee] fff [Z gggg]
  • But now we want this:
    [Figure: links drawn between the event and time annotations]
  • Word-by-word classification won’t work!
Temporal Links: Classification
  • Pairwise classification
    • Each event with each time
  • Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory and [Event release] soldiers [Event captured] during the Iran-Iraq [Event war].
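A sketch of how that sentence becomes classification instances: every (event, time) pair is one example to label with a temporal relation (annotations copied from the sentence above):

```python
times = ["today"]
events = ["sought", "peace", "promising", "withdraw",
          "release", "captured", "war"]

# Pairwise classification: one instance per (event, time) pair.
instances = [(event, time) for event in events for time in times]
print(len(instances), "pairs:", instances[:3])
# 7 pairs: [('sought', 'today'), ('peace', 'today'), ('promising', 'today')]
```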
Temporal Links: Clues
  • Tense of the event
    • said (past tense) is probably Before today
    • says (present tense) is probably During today
  • Nearby temporal expression
    • In “said today”, said is During today
    • In “captured in 1989”, captured is During 1989
  • Negativity
    • In “People believe this”, believe is During today
    • In “People don’t believe this any more”, believe is Before today
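A minimal sketch of turning these clues into pairwise features; the tense test is a crude suffix heuristic and the negation list is a stand-in for real components:

```python
NEGATIONS = {"not", "don't", "didn't"}  # stand-in negation cues

def link_features(event, time, words):
    i, j = words.index(event), words.index(time)
    return {
        "event_tense": "past" if event.endswith("ed") else "other",  # crude proxy
        "time_is_adjacent": abs(i - j) == 1,   # e.g. "said today"
        "event_is_negated": i > 0 and words[i - 1] in NEGATIONS,
    }

words = "Saddam Hussein today sought peace".split()
print(link_features("sought", "today", words))
# {'event_tense': 'other', 'time_is_adjacent': True, 'event_is_negated': False}
```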
Temporal Links: Features

Saddam Hussein [Time today] [Event sought] [Event peace] on another front by [Event promising] to [Event withdraw] from Iranian territory…

Temporal Links: State of the Art
  • Corpora with annotated links:
    • PropBank: links between verbs and their subjects/objects
    • TimeBank: certain pairs of events (e.g. reporting event and event reported)
    • TempEval A: events and times in the same sentence
    • TempEval B: events in a document and document time
  • Performance on TempEval data:
    • Same-sentence links (A): ~60%
    • Document time links (B): ~80%
What will make timelines better?
  • Larger corpora
    • TempEval is only ~50 documents
    • TreeBank is ~2,400 documents
  • More types of links
    • Event-time pairs for all events
      • TempEval only considers high-frequency events
    • Event-event pairs in the same sentence
Summary
  • Statistical NLP asks:
    • What’s the classification problem?
      • Word-by-word?
      • Pairwise?
    • What’s the feature space?
      • What are the linguistic clues?
      • What does the N-dimensional space look like?
  • Statistical NLP needs:
    • Learning algorithms efficient when N is very large
    • Large-scale corpora with linguistic labels
References
  • Symbolic NLP
    • Terry Winograd. 1972. Understanding Natural Language. Academic Press.
  • Statistical NLP
    • Daniel M. Bikel, Richard Schwartz, Ralph M. Weischedel. 1999. “An Algorithm that Learns What's in a Name.” Machine Learning.
    • Kadri Hacioglu, Ying Chen, and Benjamin Douglas. 2005. “Automatic Time Expression Labeling for English and Chinese Text.” In Proceedings of CICLing-2005.
    • Ellen M. Voorhees and Hoa Trang Dang. 2005. “Overview of the TREC 2005 Question Answering Track.” In Proceedings of the Fourteenth Text REtrieval Conference.
References
  • Corpora
    • Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. “Building a Large Annotated Corpus of English: The Penn Treebank.” Computational Linguistics, 19:313-330.
    • Martha Palmer, Dan Gildea, and Paul Kingsbury. 2005. “The Proposition Bank: An Annotated Corpus of Semantic Roles.” Computational Linguistics, 31(1).
    • James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro and Marcia Lazo. 2003. “The TIMEBANK Corpus.” In Proceedings of Corpus Linguistics 2003: 647-656.
Feature Windowing (1)
  • Problem:
    • Word-by-word gives no context
  • Solution:
    • Include surrounding features
Feature Windowing (2)
  • From previous word: features, label
  • From current word: features
  • From following word: features
  • Need special values like !START! and !END!
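A minimal sketch of windowing, assuming per-word features and (already-predicted) labels stored one per word:

```python
def windowed(features, labels, i):
    # Pad with special values at the sentence edges.
    prev_feats = features[i - 1] if i > 0 else "!START!"
    prev_label = labels[i - 1] if i > 0 else "!START!"
    next_feats = features[i + 1] if i < len(features) - 1 else "!END!"
    return {"prev": prev_feats, "prev_label": prev_label,
            "cur": features[i], "next": next_feats}

feats = ["word=white", "word=house", "word=said"]
labels = ["B-ORG", "I-ORG", "O"]
print(windowed(feats, labels, 0))
# {'prev': '!START!', 'prev_label': '!START!', 'cur': 'word=white', 'next': 'word=house'}
```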