Introduction to NLP Tools

09/23/2003



Motivation

  • Machine Translation

    • From English to French

  • What’s needed?

Motivation Cont’d (1)

  • Syntactic parser

  • Part-Of-Speech Tagger

    • Example: NP -> adj noun

  • Morphological Analyzer

    • Example: “tools” -> “tool”

      “Who is he?” -> “Who is he ?”

  • Semantic Analyzer

    • Word sense disambiguation (“wash dishes”)

    • Choose the correct translation

Motivation Cont’d (2)

  • Lexicons

    • Information about each word

      How many senses does it have? What are its possible translations?

  • Corpus

    • Useful for training a tool

    • Useful for evaluation

Outline

  • Lexicons

  • Text corpora

  • Morphological tools

  • Part-Of-Speech (POS) taggers

  • Syntactic parsers

  • Semantic knowledge bases and semantic parser

  • Speech tools

Lexicons

  • Definition

    • A repository for words

  • Lexicons in LDC (Linguistic Data Consortium)

    • LDC creates and shares linguistic resources: data, tools and standards


  • WordNet

CELEX

  • Dutch Center for Lexical Information

  • Lexical databases of English, Dutch and German

  • 21,000 nouns, 8,000 adjectives and 6,000 verbs

  • English:

    • English Orthography, Lemmas

    • English Phonology, Lemmas

    • English Morphology, Lemmas

    • English Syntax, Lemmas

    • English Frequency, Lemmas

    • English Orthography, Wordforms

    • English Phonology, Wordforms

    • English Morphology, Wordforms

    • English Frequency, Wordforms

    • English Corpus Types

    • English Frequency, Syllables

WordNet

  • A database of lexical relations

  • Inspired by current psycholinguistic theories of human lexical memory

  • Synset: a set of synonyms, representing one underlying lexical concept

    • Example:

      • fool {chump, fish, fool, gull, mark, patsy, fall guy, sucker, schlemiel, shlemiel, soft touch, mug}

  • Relations link the synsets: hypernym, Has-Member, Member-Of, Antonym, etc.

WordNet Cont’d

  • Example: $ wn bike -partn

    Part Meronyms of noun bike

    2 senses of bike

    Sense 1

    motorcycle, bike

    HAS PART: mudguard, splashguard

    Sense 2

    bicycle, bike, wheel

    HAS PART: bicycle seat, saddle

    HAS PART: bicycle wheel

    HAS PART: chain

    HAS PART: coaster brake

    HAS PART: handlebar

    HAS PART: mudguard, splashguard

    HAS PART: pedal, treadle, foot lever

    HAS PART: sprocket, sprocket wheel

  • Example: $ wn bike

  • Information available for noun bike

  • -hypen Hypernyms

  • -hypon, -treen Hyponyms & Hyponym Tree

  • -synsn Synonyms (ordered by frequency)

  • -partn Has Part Meronyms

  • -meron All Meronyms

  • -famln Familiarity & Polysemy Count

  • -coorn Coordinate Sisters

  • -simsn Synonyms (grouped by similarity of meaning)

  • -hmern Hierarchical Meronyms

  • -grepn List of Compound Words

  • -over Overview of Senses

  • Information available for verb bike

  • -hypev Hypernyms

  • -hypov, -treev Hyponyms & Hyponym Tree

  • -synsv Synonyms (ordered by frequency)

  • -famlv Familiarity & Polysemy Count

  • -framv Verb Frames

  • -simsv Synonyms (grouped by similarity of meaning)

  • -grepv List of Compound Words

  • -over Overview of Senses
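The "wn bike -partn" query shown earlier on this slide can be mimicked with a small in-memory structure. This is a minimal Python sketch, not WordNet's storage format or API: the Synset class and the part_meronyms function are hypothetical names, and the data is copied from the two senses of bike listed above (only some parts of sense 2 are included).

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonyms plus typed links to other synsets (toy model)."""
    words: tuple
    relations: dict = field(default_factory=dict)  # relation name -> list of Synset


def part_meronyms(word, synsets):
    """For each sense of `word`, return its words and the words of its HAS PART synsets."""
    result = []
    for synset in synsets:
        if word in synset.words:
            parts = [part.words for part in synset.relations.get("has_part", [])]
            result.append((synset.words, parts))
    return result


# Data copied from the slide; sense 2 lists only three of its parts here.
mudguard = Synset(("mudguard", "splashguard"))
chain = Synset(("chain",))
handlebar = Synset(("handlebar",))
motorcycle = Synset(("motorcycle", "bike"), {"has_part": [mudguard]})
bicycle = Synset(("bicycle", "bike", "wheel"),
                 {"has_part": [chain, handlebar, mudguard]})

senses = part_meronyms("bike", [motorcycle, bicycle, chain])
for words, parts in senses:
    print(", ".join(words), "HAS PART:", [", ".join(p) for p in parts])
```

Storing relations as named links between synsets, rather than between words, is what lets one word ("bike") carry different parts in each of its senses.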

Corpus

  • Definition

    • Collections of text and speech

  • LDC

  • Penn Treebank

  • DSO

  • Hansard

Some of the Top Corpora from LDC


    • Information Retrieval, Data Extraction datasets

    • TIPSTER project, TREC project

  • TIMIT Acoustic-Phonetic Continuous Speech Corpus

    • A corpus of read speech designed to provide speech data for the acquisition of acoustic-phonetic knowledge

    • Useful for the development and evaluation of automatic speech recognition systems

  • ECI (European Corpus Initiative Multilingual Corpus): a multilingual electronic text corpus


    • A phonetically balanced, continuous-speech, telephone-bandwidth speech database

Penn Treebank

  • A collection of corpora

  • Tagged with POS, Syntactic roles, predicate/argument structure, dysfluency annotation

  • How are they made?

    • Hand correction of the output of an errorful automatic process

  • 3 million words

    • 1 million words tagged with predicate/argument structure for extracting semantic knowledge

Penn Treebank Cont’d

  • Corpora

    • Wall Street Journal

    • ATIS (Air Travel Information System)

    • Brown Corpus

    • IBM Manual Sentences

    • Library of America Texts: Mark Twain, Henry Adams, Herman Melville ...

    • MUC-3 Messages

  • Example (bracketed parse):

      ( (S (NP-SBJ Rally 's)
           (VP operates
               and
               franchises
               (NP (NP (QP about 160)
                       fast-food restaurants)
                   (PP-LOC throughout
                           (NP the U.S))))

  • Example (POS-tagged text):

      Seeking/VBG to/TO block/VB
      [ the/DT investors/NNS ]
      from/IN buying/VBG
      [ more/JJR shares/NNS ]
      ./.
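The word/TAG lines in the example above are easy to read programmatically. A minimal Python sketch follows, assuming whitespace-separated tokens and treating the square brackets as chunk markers to skip; parse_tagged is a hypothetical helper, not part of any Treebank distribution.

```python
def parse_tagged(text):
    """Split Penn Treebank style 'word/TAG' text into (word, tag) pairs.

    Square brackets are treated as chunk markers and skipped.
    """
    pairs = []
    for token in text.split():
        if token in ("[", "]"):
            continue
        # Split on the last slash so that words containing '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = ("Seeking/VBG to/TO block/VB [ the/DT investors/NNS ] "
          "from/IN buying/VBG [ more/JJR shares/NNS ] ./.")
pairs = parse_tagged(tagged)
print(pairs)
```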

DSO

  • Word Sense Corpus

    • Contains sentences in which about 192,800 word occurrences have been tagged with WordNet senses

    • Taken from the Brown corpus and the Wall Street Journal corpus

    • 121 nouns and 70 verbs

Hansard

  • Official records (Hansards) of the 36th Canadian Parliament, in both English and French

  • 1.3 million pairs of aligned sentences of English and French

    • Example

      • Comme il est 14 h 30, la Chambre s'ajourne jusqu'à lundi prochain, à 11 heures, conformément au paragraphe 24(1) du Règlement.

      • It being 2.30 p.m., the House stands adjourned until Monday next at 11 a.m., pursuant to Standing Order 24(1).

  • Useful for Machine Translation

Morphological Tools


    • A two-level morphological parser

  • Porter Stemmer

  • Penn Treebank Tokenizer

    • Separates a document into words

    • “dog?” -> “dog ?”
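The “dog?” -> “dog ?” behaviour can be approximated with one regular expression. This is a rough Python sketch only: the real Penn Treebank tokenizer has many more rules (contractions, quotes, and so on), and tokenize is a hypothetical name.

```python
import re


def tokenize(text):
    """Separate words from punctuation, e.g. 'dog?' -> ['dog', '?'].

    A deliberately small approximation: runs of word characters are tokens,
    and every other non-space character is a token of its own.
    """
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Who is he?")
print(" ".join(tokens))
```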

Porter Stemmer

  • Simple algorithm that uses a set of cascaded rewrite rules

    • Example

      • ATIONAL -> ATE (relational -> relate)

  • Stem:

    • The main morpheme of the word, supplying the main meaning

  • Fast

  • Used very widely in Information Retrieval

    • Run the stemmer on the query keywords and on the words in the documents
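A single stage of suffix rewriting can be sketched in a few lines of Python. This is not the full Porter algorithm: the real stemmer cascades several such stages and also checks a "measure" condition on the remaining stem, which is omitted here, so a word like "rational" would be rewritten wrongly. RULES and apply_rules are hypothetical names.

```python
# A few suffix rewrite rules; the longer suffix is listed (and tried) first.
RULES = [("ational", "ate"), ("tional", "tion"), ("izer", "ize")]


def apply_rules(word, rules=RULES):
    """Rewrite the first matching suffix; leave the word unchanged otherwise."""
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


stems = [apply_rules(w) for w in ("relational", "conditional", "tool")]
print(stems)
```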

Part-Of-Speech (POS) Taggers

  • Part-Of-Speech: noun, verb, pronoun, etc.

  • Brill’s Tagger

  • HMM Tagger


Brill’s Tagger

  • Transformation-Based Learning (TBL) tagger

  • /projects/nlp/brill-pos-tagger

  • First labels every word with its most likely tag

  • Then uses learned TBL rules to correct mistakes

    • Example:

      • Change NN to VB when the previous tag is TO
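The two steps above can be sketched in Python as follows. The lexicon is hypothetical (a real tagger estimates most-likely tags from a tagged corpus and learns its rules automatically), and initial_tags and apply_rule are illustrative names, not Brill's code.

```python
# Hypothetical most-likely tags, as if estimated from a tagged corpus.
MOST_LIKELY_TAG = {"to": "TO", "race": "NN", "the": "DT",
                   "is": "VBZ", "expected": "VBN"}


def initial_tags(words):
    """Step 1: label every word with its most likely tag (NN if unknown)."""
    return [MOST_LIKELY_TAG.get(w.lower(), "NN") for w in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Step 2: change from_tag to to_tag when the previous tag is prev_tag."""
    fixed = list(tags)
    for i in range(1, len(fixed)):
        if fixed[i] == from_tag and fixed[i - 1] == prev_tag:
            fixed[i] = to_tag
    return fixed


words = ["is", "expected", "to", "race"]
before = initial_tags(words)
after = apply_rule(before, from_tag="NN", to_tag="VB", prev_tag="TO")
print(list(zip(words, after)))
```

Here "race" starts out as NN and is corrected to VB because it follows a word tagged TO.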

HMM Tagger

  • Also called Maximum Likelihood Tagger

  • Xerox PARC's HMM tagger:

  • Choose the tag sequence with the highest probability given the words seen.
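Choosing the most probable tag sequence is usually done with the Viterbi algorithm. The Python sketch below uses a two-tag toy model; all probabilities are made up for illustration and nothing here reflects the Xerox tagger's actual implementation.

```python
def viterbi(words, tags, start, trans, emit):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[i][t] = (probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (start[t] * emit[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans[p][t] * emit[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Hypothetical model parameters.
TAGS = ["NN", "VB"]
START = {"NN": 0.7, "VB": 0.3}
TRANS = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
EMIT = {"NN": {"dogs": 0.6, "bark": 0.1}, "VB": {"dogs": 0.1, "bark": 0.7}}

best_path = viterbi(["dogs", "bark"], TAGS, START, TRANS, EMIT)
print(best_path)
```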

MXPOST: Maximum Entropy POS Tagger

  • A maximum entropy model is a framework that integrates many information sources (called features) for classification

  • Each candidate tag is a class

  • Given the features of a word (the surrounding words, its morphological features, the surrounding tags, etc.), decide which class it belongs to.
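The classification step can be sketched in Python as a weighted vote over active features, normalised into probabilities. The weights and feature names below are invented for illustration; a real model such as MXPOST learns its weights from a tagged corpus, and this is not its code.

```python
import math

# Hypothetical feature weights, keyed by (feature, tag).
WEIGHTS = {
    ("prev_tag=TO", "VB"): 2.0,
    ("prev_tag=TO", "NN"): -1.0,
    ("suffix=s", "NNS"): 1.5,
    ("prev_word=the", "NN"): 1.8,
}
TAGS = ["NN", "NNS", "VB"]


def tag_probabilities(features, tags=TAGS, weights=WEIGHTS):
    """p(tag | features) is proportional to exp(sum of active feature weights)."""
    scores = {
        t: math.exp(sum(weights.get((f, t), 0.0) for f in features))
        for t in tags
    }
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()}


probs = tag_probabilities(["prev_tag=TO", "prev_word=to"])
best_tag = max(probs, key=probs.get)
print(best_tag, round(probs[best_tag], 3))
```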

Syntactic Parsers

  • Collins’ Parser

  • XTAG

  • MXPOST: Maximum Entropy Parser

Collins’ Parser

  • Context-free Grammar

  • Use frequencies to solve ambiguities

  • To get a feel for this kind of parser, try:

    • Web-based Chart parser
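"Use frequencies to solve ambiguities" can be illustrated with a plain probabilistic context-free grammar. This Python sketch is not the parser's actual model, which is considerably richer; the rule counts are invented, and rule_probability and parse_score are hypothetical names.

```python
from collections import Counter
from math import prod

# Hypothetical rule counts, as if read off a treebank.
RULE_COUNTS = Counter({
    ("VP", ("V", "NP")): 60,
    ("VP", ("V", "NP", "PP")): 40,
    ("NP", ("NP", "PP")): 20,
    ("NP", ("Det", "N")): 80,
})


def rule_probability(lhs, rhs, counts=RULE_COUNTS):
    """Relative frequency: count(lhs -> rhs) / count(lhs -> anything)."""
    lhs_total = sum(c for (left, _), c in counts.items() if left == lhs)
    return counts[(lhs, rhs)] / lhs_total


def parse_score(rules):
    """Score a parse as the product of the probabilities of its rules."""
    return prod(rule_probability(lhs, rhs) for lhs, rhs in rules)


# Two analyses of "saw the man with the telescope":
# the PP attaches either to the VP or to the object NP.
vp_attach = [("VP", ("V", "NP", "PP")),
             ("NP", ("Det", "N")), ("NP", ("Det", "N"))]
np_attach = [("VP", ("V", "NP")), ("NP", ("NP", "PP")),
             ("NP", ("Det", "N")), ("NP", ("Det", "N"))]

best = max([vp_attach, np_attach], key=parse_score)
print("VP attachment" if best is vp_attach else "NP attachment")
```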

XTAG

  • An ongoing project to develop a wide-coverage grammar for English using a lexicalized Tree Adjoining Grammar (TAG) formalism

    • Context sensitive grammar

  • Consists of a parser, an X-windows grammar development interface and a morphological analyzer

  • /projects/nlp/xtag/

Semantic Knowledge Bases and Semantic Parser

  • Analyze what the text says

  • WordNet

  • Penn Treebank

  • Web-based Semantic Parser

WordNet

  • Represents lexical relations

  • Useful in word sense disambiguation

Penn Treebank

Predicate: fool(Kris)

Semantic Parser

  • A web-based chart parser enriched with semantic constraints

  • Example:

    • Input: My dog has fleas.

    • Output: has(my(dog),fleas)

Speech Tools

  • ISIP

  • EPOS

  • CSLU Toolkit

ISIP

  • ISIP (Institute for Signal and Information Processing) public domain speech recognition system

  • Open research software

  • Online courses, tutorials, dictionaries, databases

  • Build your own speech recognition system

EPOS

  • A language-independent, rule-driven Text-to-Speech (TTS) system

  • Supports several main speech generation algorithms

CSLU Toolkit

  • Basic framework and tools for people to build, investigate and use interactive language systems

  • Speech recognition, natural language understanding, speech synthesis and facial animation technologies

  • Easy to use; has spread from higher education into homes