A Survey of NLP Toolkits

Jing Jiang, Mar 8, 2007



  1. A Survey of NLP Toolkits Jing Jiang Mar 8, 2007

  2. Outline • WordNet • Statistics-based phrases • POS taggers • Parsers • Chunkers (syntax-based phrases) • NER • SNoW, OpenNLP and LingPipe

  3. Outline (cont.) • What does the tool provide? • Is the tool easy to use as a stand-alone program? • Is the tool easy to modify or integrate with my program?

  4. WordNet • Background: • Princeton, George Miller, 1985 • “WordNet: An Electronic Lexical Database” • Current version: WordNet 3.0 • What does it provide? • A database of words and their relations • Nouns, verbs, adjectives and adverbs • Lexical relations: morphology • Semantic relations: synonyms, hypernyms/hyponyms, holonyms/meronyms, etc.

  5. WordNet • To use as a stand-alone program? • A command line program • Web interface • To modify or integrate with my program? • API in C • Online manual not very clear (http://wordnet.princeton.edu/doc) • Interfaces in other languages (http://wordnet.princeton.edu/links#local) • Java • Perl • Many others

  6. WordNet::Similarity • Background • Ted Pedersen et al. • What does it provide: • Semantic similarity between two words measured in various ways using WordNet • Need to understand the measures to make the best use of them • Demo: • http://marimba.d.umn.edu/cgi-bin/similarity.cgi
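To give a feel for what a WordNet-based similarity measure computes, here is a minimal sketch of a path-based score: the inverse of the number of nodes on the shortest path between two concepts. The taxonomy below is a hypothetical toy, not real WordNet data, and the function names are illustrative; the package itself is a Perl module and offers several other measures.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parent); not real WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}

def neighbors(node):
    """Concepts one hypernym or hyponym link away from `node`."""
    linked = set()
    if node in HYPERNYM:
        linked.add(HYPERNYM[node])
    linked.update(child for child, parent in HYPERNYM.items() if parent == node)
    return linked

def path_similarity(a, b):
    """Return 1 / (number of nodes on the shortest path from a to b)."""
    queue = deque([(a, 1)])  # (concept, nodes on the path so far)
    seen = {a}
    while queue:
        node, n_nodes = queue.popleft()
        if node == b:
            return 1.0 / n_nodes
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, n_nodes + 1))
    return 0.0  # no path between the two concepts

print(path_similarity("dog", "cat"))     # path: dog, canine, carnivore, feline, cat
print(path_similarity("dog", "canine"))
```

A score of 1.0 means the two concepts are identical; the score shrinks as the concepts sit farther apart in the taxonomy.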

  7. WordNet::Similarity • To use as a stand-alone program? • A Perl script to call from command line • Web interface • To modify or integrate with my program? • A Perl module • Online API with details and examples

  8. Ngram Statistics Package • What does it provide: • N-grams from a corpus ranked by a user-selected statistical measure of association (e.g. mutual information, chi-squared test)
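As an illustration of ranking n-grams by a measure of association, the sketch below scores bigrams by pointwise mutual information. It is a simplified stand-in written for this survey, not the package's own implementation; the function name, the `min_count` parameter, and the toy corpus are assumptions.

```python
import math
from collections import Counter

def rank_bigrams_by_pmi(tokens, min_count=1):
    """Return (bigram, PMI) pairs sorted from highest to lowest PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Toy corpus, already tokenized.
tokens = "new york is big and new york is busy and paris is big".split()
ranked = rank_bigrams_by_pmi(tokens, min_count=2)
for bigram, score in ranked:
    print(bigram, round(score, 2))
```

On this toy corpus "new york" ranks first because both words occur almost only together, whereas "is" combines with many different neighbours.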

  9. Ngram Statistics Package • To use as a stand-alone program? • count.pl, statistic.pl • Input can be flat text • Regular expressions to define tokens can be specified by the user • To modify or integrate with my program? • Perl module • Online API with details and examples • User can define new statistical measures of association

  10. LingPipe: Significant Phrases • What does it provide: • Collocations (similar to NSP) • Relatively new terms • Foreground vs. background • Web application: Amazon “SIPs”, Yahoo “Buzz Index”, Google “in the news” • http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
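The foreground vs. background idea can be sketched as a comparison of how often a phrase occurs in a recent corpus against how often it occurs in an older reference corpus. The ratio with add-one smoothing below is an assumption chosen for simplicity and is not claimed to be LingPipe's scoring method; all names and both corpora are hypothetical.

```python
from collections import Counter

def novelty_scores(foreground, background, smoothing=1.0):
    """Rank foreground bigrams by P(foreground) / smoothed P(background)."""
    fg = Counter(zip(foreground, foreground[1:]))
    bg = Counter(zip(background, background[1:]))
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = len(set(fg) | set(bg))
    scores = {}
    for bigram, count in fg.items():
        p_fg = count / fg_total
        # Smoothing keeps bigrams unseen in the background from dividing by zero.
        p_bg = (bg[bigram] + smoothing) / (bg_total + smoothing * vocab)
        scores[bigram] = p_fg / p_bg
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

background = "the market rose and the market fell".split()
foreground = "the market rose as bird flu fears spread and bird flu cases rose".split()
novel = novelty_scores(foreground, background)
print(novel[0])   # most novel phrase in the foreground
print(novel[-1])  # phrase best explained by the background
```

Phrases that are frequent in the foreground but rare in the background float to the top, which is the intuition behind "relatively new terms".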

  11. POS Taggers • What do they provide? • POS tags • How many POS tags are there? • Penn Treebank Tag Set http://www.cis.upenn.edu/~treebank/ • Which tags are useful to your task?

  12. Brill Tagger • Background • Eric Brill, PhD thesis, U Penn, 1993 • Transformation-based error-driven learning • Accuracy and speed • ~96% • ~5000 sentences → ~4 seconds
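The tagging side of transformation-based learning can be sketched as follows: assign each word its most frequent tag, then apply an ordered list of correction rules. The lexicon and the single rule below are hypothetical illustrations, and the learning step that selects rules by error reduction is not shown.

```python
# Hypothetical most-frequent-tag lexicon; a real tagger derives it from a tagged corpus.
MOST_FREQUENT_TAG = {"we": "PRP", "want": "VBP", "to": "TO", "race": "NN", "the": "DT"}

# Each rule: (from_tag, to_tag, required_previous_tag).
# Illustrative rule: change NN to VB when the previous tag is TO.
RULES = [("NN", "VB", "TO")]

def tag(tokens):
    """Tag tokens with their most frequent tag, then apply the rules in order."""
    tags = [MOST_FREQUENT_TAG.get(token.lower(), "NN") for token in tokens]
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(tokens, tags))

print(tag("We want to race".split()))
print(tag("the race".split()))
```

In the first sentence the initial guess NN for "race" is corrected to VB because it follows TO; in the second sentence no rule fires and NN is kept.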

  13. Brill Tagger • To use as a stand-alone program? • Call from command line • Input must be one sentence per line, tokenized • E.g. We ’re going today , are you ? • To modify or integrate with my program? • No API
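Preparing input in the one-sentence-per-line, tokenized form can be sketched with a small regular expression. This is a rough approximation that assumes sentences are already split; it is not the Penn Treebank tokenizer, and the function name is hypothetical.

```python
import re

# A word, a clitic such as 're or 's (straight or curly apostrophe),
# or a single punctuation character.
TOKEN = re.compile(r"\w+|['’]\w+|[^\w\s]")

def tokenize_line(sentence):
    """Return the sentence as space-separated tokens on one line."""
    # Limitation: "don't" becomes "don 't" here, not the Treebank-style "do n't".
    return " ".join(TOKEN.findall(sentence))

sentences = ["We're going today, are you?"]
for sentence in sentences:
    print(tokenize_line(sentence))
```

Each printed line can then be written to a file that is handed to the tagger on the command line.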

  14. Charniak Parser • Background • Eugene Charniak, Brown University • State-of-the-art • What does it provide? • Syntactic parse tree

  15. Charniak Parser • To use as a stand-alone program? • Call from command line • Input must be one sentence per line • To modify or integrate with my program? • No API

  16. Collins Parser • Background • Michael Collins, PhD thesis, U Penn, 1999 • Head-driven statistical models • What does it provide? • Syntactic parse trees • Head word for each production (dependency relations, but no relation labels)

  17. Collins Parser • To use as a stand-alone program? • Call from command line • Input must be one sentence per line, tokenized, POS tagged • To modify or integrate with my program? • No API

  18. MiniPar • Background • Dekang Lin, U Alberta • What does it provide? • Dependency parse trees • Dependency relation labels • Accuracy and speed • ~88% precision, ~80% recall for dependency relations • 300 words / second (Pentium II 300, 128MB)
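Precision and recall for dependency relations can be computed by comparing the set of relations a parser outputs against a gold-standard set. The sketch below treats each relation as a (head, label, dependent) triple; the example triples are made up for illustration and are not MiniPar output.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted relations against gold relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold-standard and predicted triples.
gold = {
    ("produce", "subj", "investigation"),
    ("investigation", "det", "an"),
    ("investigation", "mod", "of"),
    ("say", "subj", "jury"),
    ("jury", "det", "the"),
}
predicted = {
    ("produce", "subj", "investigation"),
    ("investigation", "det", "an"),
    ("say", "subj", "jury"),
    ("say", "obj", "friday"),  # wrong relation, hurts precision
}
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Precision is the share of predicted relations that are correct; recall is the share of gold relations that were found.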

  19. Examples of Dependency Relations • The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced… • say V:s:N Fulton County Grand Jury • Fulton County Grand Jury N:det:Det the • Fulton County Grand Jury N:lex-mod:U Fulton • Fulton County Grand Jury N:lex-mod:U County • Fulton County Grand Jury N:lex-mod:U Grand • say V:subj:N Fulton County Grand Jury • say V:guest:N Friday • produce V:s:N investigation • investigation N:det:Det an • investigation N:mod:Prep of
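Relation lines in the layout shown on this slide (head, then category:label:category, then dependent) can be read into a small record type. The pattern assumes exactly this slide layout, which may differ from the tool's raw output; the `Relation` type and `parse_relation` function are hypothetical names.

```python
import re
from typing import NamedTuple

class Relation(NamedTuple):
    head: str
    head_cat: str
    label: str
    dep_cat: str
    dependent: str

# head words, then cat:label:cat, then dependent words.
PATTERN = re.compile(
    r"^(?P<head>.+?)\s+(?P<hc>\w+):(?P<label>[\w-]+):(?P<dc>\w+)\s+(?P<dep>.+)$"
)

def parse_relation(line):
    """Parse one relation line; raise ValueError if it does not match."""
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a relation line: {line!r}")
    return Relation(match["head"], match["hc"], match["label"],
                    match["dc"], match["dep"])

print(parse_relation("say V:s:N Fulton County Grand Jury"))
print(parse_relation("Fulton County Grand Jury N:lex-mod:U Fulton"))
```

Multi-word heads and dependents are kept as single strings, since the slide gives no way to tell word boundaries from phrase boundaries.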

  20. MiniPar • To use as a stand-alone program? • A command line program • Input must be one sentence per line • To modify or integrate with my program? • API in C • Parse tree and dependency relations are stored in some data structure for easy access

  21. Comparison of Parsers • Accuracy: • Charniak > Collins > MiniPar • Dependency relations: • Collins, MiniPar • Dependency relation labels: • MiniPar • Speed • MiniPar

  22. Chunkers (Shallow Parsers) • What do they provide? • Phrase structure of a sentence • E.g. [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] • Compare with collocations
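Bracketed chunker output like the example on this slide is easy to consume downstream. The sketch below assumes flat, non-nested `[LABEL words]` brackets; the function name is hypothetical.

```python
import re

# One chunk: "[", a label, whitespace, then everything up to the closing "]".
CHUNK = re.compile(r"\[(\w+)\s+([^\]]+)\]")

def parse_chunks(text):
    """Return a list of (label, [words]) pairs from bracketed chunk output."""
    return [(label, words.split()) for label, words in CHUNK.findall(text)]

chunked = ("[NP He] [VP reckons] [NP the current account deficit] "
           "[VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September]")
chunks = parse_chunks(chunked)
noun_phrases = [" ".join(words) for label, words in chunks if label == "NP"]
print(noun_phrases)
```

Filtering on the NP label yields syntax-based phrases that can be compared directly with statistically derived collocations.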

  23. Named Entity Recognizers • What do they provide? • Named entities of various pre-defined types (e.g. Person, Location, Organization, Number, etc.)

  24. SNoW-based Tools • Use SNoW as the underlying learner • In C++ • API available for many components

  25. SNoW-based Tools • Sentence splitter • Tokenizer • POS tagger • Dependency parser • Chunker • NE tagger • SRL

  26. OpenNLP • Java-based, open source project • Maximum entropy models • Pipeline structure • Sentence detector → tokenizer → POS tagger → Chunker • Java API

  27. OpenNLP • Sentence boundary detector • Tokenizer • POS tagger • Chunker • Parser • Name Finder • Coreference

  28. LingPipe • Java-based libraries for various kinds of linguistic analysis • http://www.alias-i.com/lingpipe/index.html
