Natural Language Processing
Presentation Transcript

Why “natural language”?

    • Natural vs. artificial

      • Not precise, ambiguous, wide range of expression

  • Language vs. English

    • English, French, Japanese, Spanish

  • Natural language processing = programs and theories aimed at understanding a problem or question posed in natural language and answering it


Approaches

    • System building

      • Interactive

      • Understanding only

      • Generation only

  • Theoretical

    • Draws on linguistics, psychology, philosophy




Natural language is useful

    • Question-answering systems

      • http://tangra.si.umich.edu/clair/NSIR/NSIR.cgi

  • Mixed initiative systems

    • http://www.cs.columbia.edu/~noemie/match.mpg

  • Information extraction

    • http://nlp.cs.nyu.edu/info-extr/biomedical-snapshot.jpg

  • Systems that write/speak

    • http://www-2.cs.cmu.edu/~awb/synthesizers.html

    • MAGIC

  • Machine translation

    • http://world.altavista.com/babelfish


Topics

    • Syntax

    • Semantics

    • Pragmatics

    • Statistical NLP: combining learning and NL processing


Goal of Interpretation

    • Identify sentence meaning

    • Do something with meaning

      • Need some representation of action/meaning


Analysis of form: Syntax

    • Which parts were damaged by larger machines?

    • Which parts damaged larger machines?

    • Which larger machines damaged parts?

    • Approaches:

      • Statistical part-of-speech tagging (see the tagging sketch below)

      • Parsing using a grammar

      • Shallow parsing: identify meaningful chunks
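
Below is a minimal part-of-speech tagging sketch for the two ambiguous questions above. NLTK and its pretrained tagger are assumptions here (the slides do not name a toolkit), and the downloaded resource names may vary by NLTK version.

```python
# Minimal POS-tagging sketch (assumes NLTK; resource names vary by NLTK version).
import nltk

nltk.download("punkt", quiet=True)                        # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)   # tagger model

for sentence in ["Which parts were damaged by larger machines?",
                 "Which parts damaged larger machines?"]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
# The distinction to look for: "damaged" as VBN (past participle, passive)
# vs. VBD (past tense, active); actual tagger output may vary.
```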


Which parts were damaged by larger machines?

    [Parse tree diagram: S(Q) → NP(Q) [Det(Q) which, N parts], VP [V(past) damage, NP [ADJ larger, N machines]]]


Which parts were damaged by machines? – with functional roles

    [Parse tree diagram with functional roles: S(Q) → NP(Q)(OBJ) [Det(Q) which, N parts], VP [V(past) damage, NP(SUBJ) [ADJ larger, N machines]]]


Which parts damaged machines? – with functional roles

    [Parse tree diagram with functional roles: S(Q) → NP(Q)(SUBJ) [Det(Q) which, N parts], VP [V(past) damage, NP(OBJ) [ADJ larger, N machines]]]


Parsers

    • Grammar

      • S -> NP VP

      • NP -> DET {ADJ*} N (see the toy parsing sketch below)

  • Different types of grammars

    • Context Free vs. Context Sensitive

    • Lexical Functional Grammar vs. Tree Adjoining Grammars

  • Different ways of acquiring grammars

    • Hand-encoded vs. machine learned

    • Domain independent (TreeBank, Wall Street Journal)

    • Domain dependent (Medical texts)
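
A minimal sketch of the grammar-based approach, using a toy context-free grammar in the spirit of the rules above. NLTK's chart parser is an assumption (the slides do not name a toolkit), and the grammar covers only the example sentence.

```python
# Toy CFG parsing sketch (assumes NLTK; the grammar is illustrative, not the course's).
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N | Det ADJ N | ADJ N
VP  -> V NP
Det -> 'which'
ADJ -> 'larger'
N   -> 'parts' | 'machines'
V   -> 'damaged'
""")

parser = nltk.ChartParser(grammar)
tokens = "which parts damaged larger machines".split()
for tree in parser.parse(tokens):
    tree.pretty_print()   # prints the same kind of tree as the slide diagrams
```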


Semantics: analysis of meaning

    • Word meaning

      • John picked up a bad cold

      • John picked up a large rock.

      • John picked up Radio Netherlands on his radio.

      • John picked up a hitchhiker on Highway 66.

  • Phrasal meaning

    • Baby bonuses -> allocations

    • Senior citizens -> personnes âgées

    • Causing havoc -> sème le désarroi

  • Approaches

    • Representing meaning

    • Statistical word disambiguation

    • Symbolic rule-based vs. shallow statistical semantics



OMEGA

    • http://omega.isi.edu:8007/index

    • http://omega.is.edu/doc/browsers.html


Statistical Word Sense Disambiguation

    Context within the sentence determines which sense is correct

    • The candidate picked up [sense6] thousands of additional votes.

    • He picked up [sense2] the book and started to read.

    • Her performance in school picked up [sense13].

    • The swimmers got out of the river and climbed the bank [sloping land] to retrieve their towels.

    • The investors took their money out of the bank [financial institution] and moved it into stocks and bonds.


Goal

    • A program which can predict which sense is the correct sense given a new sentence containing “pick up” or “bank”

    • Avoid manually itemizing all words which can occur in sentences with different meanings

    • Can we use machine learning?


What do we need?

    • Data

    • Features

    • Machine Learning algorithm

      • Decision tree vs. SVM/Naïve Bayes

      • Inspecting the output

  • Accuracy of these methods



Training data for “machines”


Predicting the correct sense in unseen text

    • Use presence of salient words in the context

    • 50-word context window

    • Use Bayes’ rule to compute probabilities for the different categories (see the sketch after the “Crane” slide)


“Crane”

    • Occurred 74 times in Grolier’s; 36 as animal, 38 as machine

    • Predictions in new sentences were 99% correct

    • Example context: “… lift water and to grind grain. Treadmills attached to cranes were used to lift heavy objects from Roman times.”
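
A minimal sketch of the Bayes-rule disambiguator described above, applied to the two senses of “crane”. scikit-learn stands in for the original implementation, and the four training sentences are invented for illustration (the real system trained on Grolier’s contexts with a 50-word window).

```python
# Toy word-sense disambiguation for "crane" with Naive Bayes over context words.
# scikit-learn is an assumption; the training sentences are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_sentences = [
    "the crane lifted the steel beam onto the barge",         # machine
    "treadmills attached to cranes were used to lift loads",  # machine
    "the crane waded through the marsh hunting fish",         # animal
    "the migrating cranes nested near the river delta",       # animal
]
labels = ["machine", "machine", "animal", "animal"]

# Bag-of-words features from the surrounding context (here, the whole sentence;
# the system in the slides used a 50-word window).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_sentences)

clf = MultinomialNB()   # applies Bayes' rule with independence assumptions
clf.fit(X, labels)

test = ["cranes were used to lift heavy objects from Roman times"]
print(clf.predict(vectorizer.transform(test)))   # expected: ['machine']
```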


Going Home – A play in one act

    • Scene 1: Pennsylvania Station, NYC. Bonnie: “Long Beach?” Passerby: “Downstairs, LIRR Station.”

    • Scene 2: Ticket counter, LIRR. Bonnie: “Long Beach?” Clerk: “$4.50.”

    • Scene 3: Information Booth, LIRR. Bonnie: “Long Beach?” Clerk: “4:19, Track 17.”

    • Scene 4: On the train, vicinity of Forest Hills. Bonnie: “Long Beach?” Conductor: “Change at Jamaica.”

    • Scene 5: On the next train, vicinity of Lynbrook. Bonnie: “Long Beach?” Conductor: “Right after Island Park.”


Question Answering on the web

    • Input: English question

    • Data: documents retrieved by a search engine from the web

    • Output: The phrase(s) within the documents that answer the question


Examples

    • When was X born?

      • When was Mozart born?

      • Mozart was born in 1756.

      • When was Gandhi born?

      • Gandhi (1869-1948)

    • Where are the Rocky Mountains located?

    • What is nepotism?


Common Approach

    • Create a query from the question

      • When was Mozart born -> Mozart born

      • Use WordNet to expand terms and increase recall:

        • Which high school was ranked highest in the US in 1998?

        • “high school” -> (high & school) | (senior & high & school) | (senior & high) | high | highschool

  • Use search engine to find relevant documents

  • Pinpoint the passage within the document that contains the answer, using patterns

    • From IR to NLP (see the sketch below)
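
A minimal sketch of this pipeline for “When was X born?” questions. The query formation and the two answer patterns mirror the examples above (“Mozart was born in 1756”, “Gandhi (1869-1948)”); the retrieved snippets are hard-coded stand-ins, since the actual search engine and pattern set are not shown here.

```python
# Sketch: turn a question into a keyword query, then pinpoint an answer with patterns.
# The "retrieved" snippets are hard-coded stand-ins for real search-engine results.
import re

STOPWORDS = {"when", "was", "were", "is", "are", "the", "a", "an"}

def make_query(question: str) -> str:
    """Drop wh-words/stopwords: 'When was Mozart born?' -> 'Mozart born'."""
    words = re.findall(r"[A-Za-z]+", question)
    return " ".join(w for w in words if w.lower() not in STOPWORDS)

def find_birth_year(name: str, snippets):
    """Illustrative answer patterns: 'X was born in 1756' and 'X (1869-1948)'."""
    patterns = [
        rf"{re.escape(name)} was born (?:on [A-Za-z]+ \d+, )?in (\d{{4}})",
        rf"{re.escape(name)} \((\d{{4}})\s*[-–]\s*\d{{4}}\)",
    ]
    for text in snippets:
        for pat in patterns:
            m = re.search(pat, text)
            if m:
                return m.group(1)
    return None

snippets = ["Mozart was born in 1756 in Salzburg.", "Gandhi (1869-1948) led ..."]
print(make_query("When was Mozart born?"))   # Mozart born
print(find_birth_year("Mozart", snippets))   # 1756
print(find_birth_year("Gandhi", snippets))   # 1869
```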


PRODUCE A BIOGRAPHY OF [PERSON]. Only these fields are relevant:

    • Name(s), aliases:

    • *Date of Birth or Current Age:

    • *Date of Death:

    • *Place of Birth:

    • *Place of Death:

    • Cause of Death:

    • Religion (Affiliations):

    • Known locations and dates:

    • Last known address:

    • Previous domiciles:

    • Ethnic or tribal affiliations:

    • Immediate family members

    • Native Language spoken:

    • Secondary Languages spoken:

    • Physical Characteristics

    • Passport number and country of issue:

    • Professional positions:

    • Education

    • Party or other organization affiliations:

    • Publications (titles and dates):


Biography of Han Ming

    • Han Ming, born 1944 March in Pyongyan, South Korean Lei Fa Women’s University in French law, literature, a former female South Korean people, chairman of South Korea women’s groups,…Han, 62, has championed women’s rights and liberal political ideas. Han was imprisoned from 1979 to 1981 on charges of teaching pro-Communist ideas to workers, farmers and low-income women. She became the first minister of gender equality in 2001 and later served as an environment minister.


Biography – two approaches

    • To obtain high precision, we handle each slot independently using bootstrapping to learn IE patterns.

    • To improve the recall, we utilize a biography Language Model.


Approach

    • Characteristics of the IE approach

      • Training resource: Wikipedia and its manual annotations

      • Bootstrapping interleaves two corpora to improve precision

        • Wikipedia: reliable but small

        • Web: noisy but many relevant documents

      • No manual annotation or automatic tagging of corpus

      • Use seed tuples (person, date-of-birth) to find patterns

      • This approach is scalable for any corpus

        • Irrespective of size

        • Irrespective of whether it is static or dynamic

    • The IE system is augmented with language models to increase recall


Biography as an IE task

    • We need patterns to extract information from a sentence

    • Creating patterns manually is a time consuming task, and not scalable

    • We want to find these patterns automatically



Biography patterns from Wikipedia

    • Martin Luther King, Jr., (January 15, 1929 – April 4, 1968) was the most …

    • Martin Luther King, Jr., was born on January 15, 1929, in Atlanta, Georgia.


Run IdFinder on these sentences

    • <Person> Martin Luther King, Jr. </Person>, (<Date>January 15, 1929</Date> – <Date> April 4, 1968</Date>) was the most…

    • <Person> Martin Luther King, Jr. </Person>, was born on <Date> January 15, 1929 </Date>, in <GPE> Atlanta, Georgia </GPE>.

    • Take the token sequence that includes the tags of interest + some context (2 tokens before and 2 tokens after)


Convert to Patterns:

    • <My_Person> ( <My_Date> – <Date> ) was the

    • <My_Person> , was born on <My_Date> , in

    • Remove more specific patterns – if one pattern contains another, keep the smaller one, as long as it is longer than k tokens.

    • <My_Person> , was born on <My_Date>

    • <My_Person> ( <My_Date> – <Date> )

    • Finally, verify the patterns manually to remove irrelevant ones (a pattern-induction sketch follows).
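
A minimal sketch of this conversion step, assuming the IdFinder output is available as (token, tag) pairs. The two-token context window follows the slide; the function and data layout are invented for illustration.

```python
# Sketch: turn an entity-tagged sentence into an IE pattern by keeping the tags of
# interest plus 2 tokens of context on each side. The (token, tag) layout is assumed.

def make_pattern(tagged_tokens, target_indices, context=2):
    """tagged_tokens: list of (token, tag) pairs, tag is e.g. 'Person', 'Date', or None.
    target_indices: positions of the tags of interest (the <My_...> slots)."""
    start = max(0, min(target_indices) - context)
    end = min(len(tagged_tokens), max(target_indices) + context + 1)
    out = []
    for i in range(start, end):
        token, tag = tagged_tokens[i]
        if i in target_indices:
            out.append(f"<My_{tag}>")      # slot we want to fill
        elif tag is not None:
            out.append(f"<{tag}>")         # other entity kept as a typed placeholder
        else:
            out.append(token)              # literal context word
    return " ".join(out)

# From: <Person>Martin Luther King, Jr.</Person>, was born on
#       <Date>January 15, 1929</Date>, in <GPE>Atlanta, Georgia</GPE>.
tagged = [("Martin Luther King, Jr.", "Person"), (",", None), ("was", None),
          ("born", None), ("on", None), ("January 15, 1929", "Date"),
          (",", None), ("in", None), ("Atlanta, Georgia", "GPE"), (".", None)]

print(make_pattern(tagged, target_indices={0, 5}))
# <My_Person> , was born on <My_Date> , in
```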


Examples of Patterns:

    • 502 distinct place-of-birth patterns:

      • 600 <MY_Person> was born in <MY_GPE>

      • 169 <MY_Person> ( born <Date> in <MY_GPE> )

      • 44 Born in <MY_GPE> <MY_Person>

      • 10 <MY_Person> was a native <MY_GPE>

      • 10 <MY_Person> 's hometown of <MY_GPE>

      • 1 <MY_Person> was baptized in <MY_GPE>

    • 291 distinct date-of-death patterns:

      • 770 <MY_Person> ( <Date> - <MY_Date> )

      • 92 <MY_Person> died on <MY_Date>

      • 19 <MY_Person> <Date> - <MY_Date>

      • 16 <MY_Person> died in <GPE> on <MY_Date>

      • 3 < MY_Person> passed away on < MY_Date >

      • 1 < MY_Person> committed suicide on <MY_Date>


Biography as an IE task

    • This approach is good for the consistently annotated fields in Wikipedia: place of birth, date of birth, place of death, date of death

    • Not all fields of interest are annotated; a different approach is needed to cover the rest of the slots



Bouncing between Wikipedia and Google

    • Use one seed tuple only:

      • <my person> and <target field>

        • Google: “Arafat” “civil engineering”, we get:

          • Arafat graduated with a bachelor’s degree in civil engineering

          • Arafat studied civil engineering

          • Arafat, a civil engineering student

        • Using these snippets, corresponding patterns are created, then filtered out manually

      • To get more seed pairs, go to Wikipedia biography pages only and search for:

        • “graduated with a bachelor’s degree in”

        • We get:


Bouncing between Wikipedia and Google

    • New seed tuples:

      • “Burnie Thompson” “political science“

      • “Henrey Luke” “Environment Studies”

      • “Erin Crocker” “industrial and management engineering”

      • “Denise Bode” “political science”

    • Go back to Google and repeat the process to get more seed patterns! (A bootstrapping sketch follows.)
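
A minimal sketch of the bootstrapping loop described above. The two search functions are hard-coded stand-ins for the Google and Wikipedia queries, and, as the slides note, the harvested patterns would still be filtered manually.

```python
# Sketch of the Wikipedia/Google bootstrapping loop for one slot (education).
# The search functions are hard-coded stand-ins for real Google/Wikipedia queries.

def search_web(person, field_value):
    """Stand-in for a Google query like: "Arafat" "civil engineering"."""
    return [f"{person} graduated with a bachelor's degree in {field_value}",
            f"{person} studied {field_value}"]

def search_wikipedia_bios(pattern):
    """Stand-in for searching Wikipedia biography pages for a pattern string."""
    return [("Burnie Thompson", "political science"),
            ("Erin Crocker", "industrial and management engineering")]

def snippet_to_pattern(snippet, person, field_value):
    """Generalize a snippet by replacing the seed pair with slot placeholders."""
    return (snippet.replace(person, "<My_Person>")
                   .replace(field_value, "<Field>"))

seeds = {("Arafat", "civil engineering")}
patterns = set()

for _ in range(2):                        # a couple of bootstrapping rounds
    for person, value in list(seeds):
        for snippet in search_web(person, value):
            patterns.add(snippet_to_pattern(snippet, person, value))
    for pattern in list(patterns):
        seeds.update(search_wikipedia_bios(pattern))   # harvest new seed pairs

print(sorted(patterns))
print(sorted(seeds))
```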


Bouncing between Wikipedia and Google

    • This approach worked well for a few fields, such as education, publications, immediate family members, and party or other organization affiliations

    • It did not provide good patterns for every field (e.g., religion, ethnic or tribal affiliations, previous domiciles); we got a lot of noise

    • For some slots, we created patterns manually


Biography as Sentence Selection and Ranking

    • To obtain high recall, we also want to include sentences that IE may miss, perhaps due to ill-formed sentences (ASR and MT)

    • Get the top 100 documents from Indri

    • Extract all sentences that contain the person or a reference to him/her

    • Use a variety of features to rank these sentences (see the sketch below)
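
A minimal sketch of the selection-and-ranking step, assuming the top documents have already been retrieved (the Indri step is not shown). The name variants and scoring features are invented placeholders; the actual system uses a richer feature set and a biography language model.

```python
# Sketch: select sentences mentioning the target person, then rank them with
# simple features. Retrieval (Indri) is assumed to have produced `documents`.
import re

def select_and_rank(documents, name_variants, biography_terms):
    sentences = []
    for doc in documents:
        # Naive sentence splitting; a real system would use a proper splitter.
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip())

    def mentions_person(s):
        return any(v.lower() in s.lower() for v in name_variants)

    def score(s):
        # Toy features: count of biography-flavored terms, minus a length penalty.
        hits = sum(term in s.lower() for term in biography_terms)
        return hits - 0.01 * len(s.split())

    candidates = [s for s in sentences if mentions_person(s)]
    return sorted(candidates, key=score, reverse=True)

docs = ["Han Ming was born in 1944. She became the first minister of gender "
        "equality in 2001. The weather was mild."]
# "she" is a crude stand-in for coreference ("reference to him/her" on the slide).
ranked = select_and_rank(docs, ["Han Ming", "Han", "she"],
                         ["born", "minister", "imprisoned", "served"])
print(ranked[:2])
```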

