A Survey of NLP Toolkits. Jing Jiang Mar 8, 2007. Outline. WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases) NER SNoW, OpenNLP and LingPipe. Outline (cont.). What does the tool provide? Is the tool easy to use as a stand-alone program?
A Survey of NLP Toolkits Jing Jiang Mar 8, 2007
Outline • WordNet • Statistics-based phrases • POS taggers • Parsers • Chunkers (syntax-based phrases) • NER • SNoW, OpenNLP and LingPipe
Outline (cont.) • What does the tool provide? • Is the tool easy to use as a stand-alone program? • Is the tool easy to modify or integrate with my program?
WordNet • Background: • Princeton, George Miller, 1985 • “WordNet: An Electronic Lexical Database” • Current version: WordNet 3.0 • What does it provide? • A database of words and their relations • Nouns, verbs, adjectives and adverbs • Lexical relations: morphology • Semantic relations: synonyms, hypernyms/hyponyms, holonyms/meronyms, etc.
WordNet • To use as a stand-alone program? • A command line program • Web interface • To modify or integrate with my program? • API in C • Online manual not very clear (http://wordnet.princeton.edu/doc) • Interfaces in other languages (http://wordnet.princeton.edu/links#local) • Java • Perl • Many others
WordNet::Similarity • Background • Ted Pedersen et al. • What does it provide: • Semantic similarity between two words measured in various ways using WordNet • Need to understand the measures to make the best use of them • Demo: • http://marimba.d.umn.edu/cgi-bin/similarity.cgi
WordNet::Similarity • To use as a stand-alone program? • A Perl script to call from command line • Web interface • To modify or integrate with my program? • A Perl module • Online API with details and examples
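To make "similarity measured using WordNet" concrete, here is a minimal Python sketch of one simple kind of measure: the inverse of the number of nodes on the shortest path between two words in a hypernym hierarchy. The hierarchy below is a hand-made toy (an assumption for illustration, not real WordNet data), and the function names are hypothetical; this is not the WordNet::Similarity Perl API.

```python
from collections import deque

# Hypothetical toy hypernym links (child -> parent); not real WordNet data.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}

def neighbors(word):
    """Words one hypernym or hyponym link away from `word`."""
    linked = set()
    if word in HYPERNYM:
        linked.add(HYPERNYM[word])
    linked.update(child for child, parent in HYPERNYM.items() if parent == word)
    return linked

def path_similarity(a, b):
    """Return 1 / (number of nodes on the shortest path), or 0.0 if unconnected."""
    queue = deque([(a, 1)])
    seen = {a}
    while queue:
        word, nodes = queue.popleft()
        if word == b:
            return 1.0 / nodes
        for nxt in neighbors(word):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, nodes + 1))
    return 0.0

print(path_similarity("dog", "wolf"))  # path dog-canine-wolf has 3 nodes
print(path_similarity("dog", "cat"))   # path through "mammal" has 5 nodes
```

A path-based score like this depends entirely on how the hierarchy is drawn, which is one reason it pays to understand each measure before relying on it.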
Ngram Statistics Package • What does it provide: • N-grams from a corpus ranked by a user-selected statistical measure of association (e.g. mutual information, chi-squared test)
Ngram Statistics Package • To use as a stand-alone program? • count.pl, statistic.pl • Input can be flat text • Regular expressions to define tokens can be specified by the user • To modify or integrate with my program? • Perl module • Online API with details and examples • User can define new statistical measures of association
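The idea of ranking n-grams by a measure of association can be sketched in a few lines of Python. The example below scores bigrams by pointwise mutual information; it is an illustrative sketch, not the package's Perl code, and the function name, the `\w+` token pattern, and the `min_count` cutoff are assumptions made for the example.

```python
import math
import re
from collections import Counter

def rank_bigrams_by_pmi(text, min_count=2):
    """Rank bigrams by pointwise mutual information, highest first."""
    # The token definition is a regular expression, here simply runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare bigrams give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), math.log2(p_xy / (p_x * p_y))))
    return sorted(scored, key=lambda item: item[1], reverse=True)

text = "new york is big . new york is old . the city is big"
for bigram, score in rank_bigrams_by_pmi(text):
    print(bigram, round(score, 2))
```

Swapping in a different measure (for example a chi-squared statistic) only changes the scoring line, which mirrors the way the package lets the user choose or define the measure.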
LingPipe: Significant Phrases • What does it provide: • Collocations (similar to NSP) • Relatively new terms • Foreground vs. background • Web application: Amazon “SIPs”, Yahoo “Buzz Index”, Google “in the news” • http://www.alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html
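The foreground vs. background idea can be illustrated with a small Python sketch: a phrase is "new" if it is much more frequent in the foreground corpus than in the background corpus. The slides do not say which statistic LingPipe uses, so the smoothed frequency ratio below is a stand-in chosen for simplicity, and all names are hypothetical.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in lowercased text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

def newness_scores(foreground, background):
    """Score foreground bigrams by foreground rate / background rate (add-one smoothed)."""
    fg = bigram_counts(foreground)
    bg = bigram_counts(background)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    vocab = len(set(fg) | set(bg))
    scores = {}
    for phrase, count in fg.items():
        fg_rate = (count + 1) / (fg_total + vocab)
        bg_rate = (bg[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

background = "the stock market fell . the stock market rose ."
foreground = "the stock market fell as bird flu spread . bird flu fears grew ."
for phrase, score in newness_scores(foreground, background)[:3]:
    print(phrase, round(score, 2))
```

Phrases common to both corpora score below 1, while phrases that appear only in the foreground rise to the top.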
POS Taggers • What do they provide? • POS tags • How many POS tags are there? • Penn Treebank Tag Set http://www.cis.upenn.edu/~treebank/ • Which tags are useful to your task?
Brill Tagger • Background • Eric Brill, PhD thesis, U Penn, 1993 • Transformation-based error-driven learning • Accuracy and speed • ~96% • ~5000 sentences in ~4 seconds
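Transformation-based tagging starts from an initial guess for each word and then applies an ordered list of correction rules. The Python sketch below shows only the tagging step (applying rules), not how rules are learned from errors. The lexicon, the single rule, and the function names are hypothetical examples, not taken from the Brill tagger distribution.

```python
# Hypothetical most-frequent-tag lexicon; unknown words default to NN.
LEXICON = {"we": "PRP", "want": "VBP", "to": "TO", "race": "NN", "the": "DT"}

# Each rule: (from_tag, to_tag, required_previous_tag).
# Example: change NN to VB when the previous tag is TO.
RULES = [("NN", "VB", "TO")]

def initial_tags(words):
    """Initial state: give every word its most frequent tag."""
    return [LEXICON.get(w.lower(), "NN") for w in words]

def apply_rules(tags, rules):
    """Apply each transformation in order, left to right over the sentence."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags

def tag(words):
    """Tag a tokenized sentence: initial guess, then rule-based corrections."""
    return list(zip(words, apply_rules(initial_tags(words), RULES)))

print(tag("We want to race".split()))
print(tag("the race".split()))
```

In the first sentence the rule rewrites "race" from NN to VB because it follows "to"; in the second it stays NN.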
Brill Tagger • To use as a stand-alone program? • Call from command line • Input must be one sentence per line, tokenized • E.g. We ’re going today , are you ? • To modify or integrate with my program? • No API
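Because the input has to be tokenized with one sentence per line, raw text needs a preprocessing step. The Python sketch below is a deliberately simplified tokenizer that reproduces the example above; it is not the tokenizer used to build the Penn Treebank and it does not handle cases such as "don't" in the Treebank way. The function name is an assumption.

```python
import re

# Clitic starting with an apostrophe, or a word, or a single punctuation mark.
TOKEN = re.compile(r"[’']\w+|\w+|[^\w\s]")

def to_tagger_input(sentences):
    """Return one line per sentence, with tokens separated by single spaces."""
    return "\n".join(" ".join(TOKEN.findall(s)) for s in sentences)

print(to_tagger_input(["We’re going today, are you?"]))
```

The resulting text can be written to a file and passed to the command-line tagger.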
Charniak Parser • Background • Eugene Charniak, Brown University • State-of-the-art • What does it provide? • Syntactic parse tree
Charniak Parser • To use as a stand-alone program? • Call from command line • Input must be one sentence per line • To modify or integrate with my program? • No API
Collins Parser • Background • Michael Collins, PhD thesis, U Penn, 1999 • Head-driven statistical models • What does it provide? • Syntactic parse trees • Head word for each production (dependency relations, but no relation labels)
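Knowing the head word of each production is enough to read off unlabeled dependencies: within a production, the heads of the non-head children depend on the head of the head child. The Python sketch below illustrates this on a hand-built tree. The tuple format and the hand-assigned head indices are assumptions for the example; they are not the Collins parser's output format or its head rules.

```python
def head_and_dependencies(node, deps):
    """Return the head word of `node`, appending (head, dependent) pairs to `deps`."""
    if isinstance(node[1], str):
        return node[1]  # leaf: (POS tag, word)
    _label, head_index, children = node  # internal: (label, head child index, children)
    heads = [head_and_dependencies(child, deps) for child in children]
    head = heads[head_index]
    for i, word in enumerate(heads):
        if i != head_index:
            deps.append((head, word))  # unlabeled dependency: no relation name
    return head

tree = ("S", 1, [
    ("NP", 1, [("DT", "the"), ("NN", "dog")]),
    ("VP", 0, [("VBD", "chased"),
               ("NP", 1, [("DT", "a"), ("NN", "cat")])]),
])

deps = []
root = head_and_dependencies(tree, deps)
print(root)
print(deps)
```

The output pairs say which word depends on which, but not whether the relation is subject, object, or modifier, which is the limitation noted above.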
Collins Parser • To use as a stand-alone program? • Call from command line • Input must be one sentence per line, tokenized, POS tagged • To modify or integrate with my program? • No API
MiniPar • Background • Dekang Lin, U Alberta • What does it provide? • Dependency parse trees • Dependency relation labels • Accuracy and speed • ~88% precision, ~80% recall for dependency relations • 300 words / second (Pentium II 300, 128MB)
Examples of Dependency Relations • The Fulton County Grand Jury said Friday an investigation of Atlanta 's recent primary election produced… • say V:s:N Fulton County Grand Jury • Fulton County Grand Jury N:det:Det the • Fulton County Grand Jury N:lex-mod:U Fulton • Fulton County Grand Jury N:lex-mod:U County • Fulton County Grand Jury N:lex-mod:U Grand • say V:subj:N Fulton County Grand Jury • say V:guest:N Friday • produce V:s:N investigation • investigation N:det:Det an • investigation N:mod:Prep of
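Lines in the layout shown above are easy to load into a program. The Python sketch below parses one line into its parts, reading the middle field as head category, relation label, and dependent category. The field names are my own labels, and the layout is assumed to match the slide exactly; actual MiniPar output may be formatted differently.

```python
import re
from typing import NamedTuple

class Dependency(NamedTuple):
    head: str
    head_cat: str
    relation: str
    dep_cat: str
    dependent: str

# Middle field looks like "N:det:Det" or "N:lex-mod:U"; head and dependent may be multi-word.
TRIPLE = re.compile(r"^(.+?)\s+([^\s:]+):([^\s:]+):([^\s:]+)\s+(.+)$")

def parse_dependency(line):
    """Split one 'head Cat:rel:Cat dependent' line into a Dependency record."""
    match = TRIPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency line: {line!r}")
    return Dependency(*match.groups())

dep = parse_dependency("Fulton County Grand Jury N:det:Det the")
print(dep.head, "|", dep.relation, "|", dep.dependent)
```

Once parsed, the records can be filtered by relation label, for example to keep only `subj` relations.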
MiniPar • To use as a stand-alone program? • A command line program • Input must be one sentence per line • To modify or integrate with my program? • API in C • Parse tree and dependency relations are stored in some data structure for easy access
Comparison of Parsers • Accuracy: • Charniak > Collins > MiniPar • Dependency relations: • Collins, MiniPar • Dependency relation labels: • MiniPar • Speed • MiniPar
Chunkers (Shallow Parsers) • What do they provide? • Phrase structure of a sentence • E.g. [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] • Compare with collocations
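Bracketed chunker output like the example above can be turned into syntax-based phrases with a short script. The Python sketch below assumes the flat `[LABEL words]` notation shown on the slide, with no nested brackets; the function names are hypothetical.

```python
import re

# One chunk: "[", a label, whitespace, the words, "]". Nested brackets are not handled.
CHUNK = re.compile(r"\[(\w+)\s+([^\]]+?)\s*\]")

def parse_chunks(bracketed):
    """Return a list of (label, [words]) pairs in sentence order."""
    return [(label, words.split()) for label, words in CHUNK.findall(bracketed)]

def noun_phrases(bracketed):
    """Return the text of every NP chunk."""
    return [" ".join(words) for label, words in parse_chunks(bracketed) if label == "NP"]

sentence = ("[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] "
            "[PP to] [NP only 1.8 billion] [PP in] [NP September]")
print(noun_phrases(sentence))
```

Unlike statistical collocations, these phrases come from the syntactic analysis of a single sentence, so they do not depend on corpus frequency.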
Named Entity Recognizers • What do they provide? • Named entities of various pre-defined types (e.g. Person, Location, Organization, Number, etc.)
SNoW-based Tools • Use SNoW as the underlying learner • In C++ • API available for many components
SNoW-based Tools • Sentence splitter • Tokenizer • POS tagger • Dependency parser • Chunker • NE tagger • SRL
OpenNLP • Java-based, open source project • Maximum entropy models • Pipeline structure • Sentence detector → tokenizer → POS tagger → chunker • Java API
OpenNLP • Sentence boundary detector • Tokenizer • POS tagger • Chunker • Parser • Name Finder • Coreference
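A pipeline structure means that each component consumes the output of the previous one. The Python sketch below shows that shape with toy stand-ins for the first three stages. It is not the OpenNLP Java API and uses no maximum entropy models; the stage functions, the tiny lexicon, and `run_pipeline` are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive sentence detector: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    """Placeholder tagger: a tiny hypothetical lexicon, NN for everything else."""
    lexicon = {"the": "DT", "a": "DT", "sat": "VBD", "ran": "VBD", ".": "."}
    return [(tok, lexicon.get(tok.lower(), "NN")) for tok in tokens]

def run_pipeline(text, stages):
    """Apply the first stage to the text, then chain the remaining stages per sentence."""
    first, *rest = stages
    results = []
    for item in first(text):
        for stage in rest:
            item = stage(item)
        results.append(item)
    return results

for tagged in run_pipeline("The cat sat. A dog ran.", [split_sentences, tokenize, tag]):
    print(tagged)
```

Because each stage only depends on the format of its input, one stage can be replaced (for example, a better tokenizer) without touching the others.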
LingPipe • Java-based libraries for various kinds of linguistic analysis • http://www.alias-i.com/lingpipe/index.html