School of Computing FACULTY OF ENGINEERING Word Bi-grams and PoS Tags COMP3310 Natural Language Processing Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)
Reminder • FreqDist counts of tokens and their distribution can be useful • Eg find main characters in Gutenberg texts • Eg compare word-lengths in different languages • Humans can predict the next word … • N-gram models are based on counts in a large corpus • Auto-generate a story ... (but it gets stuck in a local maximum) • Grammatical trends: modal verb distribution predicts genre
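A minimal sketch of the FreqDist reminder, assuming NLTK 3 with the Gutenberg corpus downloaded (nltk.download('gutenberg')); 'austen-emma.txt' is just an illustrative choice:

import nltk
from nltk.corpus import gutenberg

# Count every token in Emma; main characters such as Emma and Harriet
# tend to surface among the frequent title-case tokens
fd = nltk.FreqDist(gutenberg.words('austen-emma.txt'))
print([w for (w, n) in fd.most_common(100) if w.istitle()])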
Why do puns make us groan? • He drove his expensive car into a tree and found out how the Mercedes bends. • Isn't the Grand Canyon just gorges? • Time flies like an arrow. Fruit flies like a banana.
Predicting Next Words • One reason puns make us groan is that they play on our assumptions of what the next word will be – human language processing involves predicting the most probable next word • They also exploit • homonymy – same sound, different spelling and meaning (bends, Benz; gorges, gorgeous) • polysemy – same spelling, different meaning • NLP programs can also make use of word-sequence modeling
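A sketch of next-word prediction from bigram counts, assuming NLTK 3 and the Gutenberg corpus (the text and the probe word 'the' are illustrative choices):

import nltk
from nltk.corpus import gutenberg

# ConditionalFreqDist over word bigrams: cfd[w] is a FreqDist of the
# words observed immediately after w
cfd = nltk.ConditionalFreqDist(nltk.bigrams(gutenberg.words('austen-emma.txt')))
print(cfd['the'].max())   # the single most probable word after 'the'

Always taking .max() is exactly what gets a story generator stuck in a local maximum: from some word onwards it keeps regenerating the same highest-probability loop.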
Auto-generate a Story • How to fix a generator that gets stuck in a local maximum? Use a random number generator.
Auto-generate a Story • The choice() function picks one item at random from a list (made available by from random import *, or called as random.choice())
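A sketch of the fix, assuming NLTK 3: pick a random successor with choice() instead of always the most likely one.

import nltk
from random import choice
from nltk.corpus import gutenberg

cfd = nltk.ConditionalFreqDist(nltk.bigrams(gutenberg.words('austen-emma.txt')))

word = 'The'
story = [word]
for _ in range(25):
    # choice() picks uniformly among the observed successors; weighting
    # by frequency (e.g. random.choices with the counts as weights)
    # would follow the corpus statistics more closely
    word = choice(list(cfd[word]))
    story.append(word)
print(' '.join(story))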
Part-of-Speech Tagging: Terminology • Tagging • The process of associating labels with each token in a text, using an algorithm to select a tag for each word, eg • Hand-coded rules • Statistical taggers • Brill (transformation-based) tagger • Hybrid tagger: combination, eg by “vote” • Tags • The labels • Tag Set • The collection of tags used for a particular task, eg Brown or LOB tagset Modified from Diane Litman's version of Steve Bird's notes
Example from the GENIA corpus • Typically a tagged text is a sequence of white-space separated word/tag tokens: These/DT findings/NNS should/MD be/VB useful/JJ for/IN therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN immunosuppressants/NNS targeting/VBG the/DT CD28/NN costimulatory/NN pathway/NN ./.
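NLTK can split such word/tag tokens back into pairs with str2tuple; a minimal sketch using the GENIA sentence above:

import nltk

sent = ('These/DT findings/NNS should/MD be/VB useful/JJ for/IN '
        'therapeutic/JJ strategies/NNS and/CC the/DT development/NN of/IN '
        'immunosuppressants/NNS targeting/VBG the/DT CD28/NN '
        'costimulatory/NN pathway/NN ./.')
tagged = [nltk.tag.str2tuple(t) for t in sent.split()]
print(tagged[:3])   # [('These', 'DT'), ('findings', 'NNS'), ('should', 'MD')]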
What does Tagging do? • Collapses Distinctions • Lexical identity may be discarded • e.g., all personal pronouns tagged with PRP • Introduces Distinctions • Ambiguities may be resolved • e.g. deal tagged with NN or VB • Helps in classification and prediction Modified from Diane Litman's version of Steve Bird's notes
Significance of Parts of Speech • A word’s POS tells us a lot about the word and its neighbors: • Limits the range of meanings (deal), pronunciation (OBject vs obJECT) or both (wind) • Helps in stemming • Limits the range of following words • Can help select nouns from a document for summarization • Basis for partial parsing (chunked parsing) • Parsers can build trees directly on the POS tags instead of maintaining a lexicon Modified from Diane Litman's version of Steve Bird's notes
Choosing a tagset • The choice of tagset greatly affects the difficulty of the problem • Need to strike a balance between • Getting better information about context • Making it possible for classifiers to do their job Slide modified from Massimo Poesio's
Some of the best-known Tagsets • Brown corpus: 87 tags • (more when tags are combined, eg isn’t) • LOB corpus: 132 tags • Penn Treebank: 45 tags • Lancaster UCREL C5 (used to tag the BNC): 61 tags • Lancaster C7: 145 tags Slide modified from Massimo Poesio's
The Brown Corpus • An early digital corpus (1961) • Francis and Kucera, Brown University • Contents: 500 texts, each 2000 words long • From American books, newspapers, magazines • Representing genres: • Science fiction, romance fiction, press reportage, scientific writing, popular lore Modified from Diane Litman's version of Steve Bird's notes
help(nltk.corpus.brown) • >>> help(nltk.corpus.brown) • | paras(self, fileids=None, categories=None) • | raw(self, fileids=None, categories=None) • | sents(self, fileids=None, categories=None) • | tagged_paras(self, fileids=None, categories=None, simplify_tags=False) • | tagged_sents(self, fileids=None, categories=None, simplify_tags=False) • | tagged_words(self, fileids=None, categories=None, simplify_tags=False) • | words(self, fileids=None, categories=None)
nltk.corpus.brown • >>> nltk.corpus.brown.words() • ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...] • >>> nltk.corpus.brown.tagged_words() • [('The', 'AT'), ('Fulton', 'NP-TL'), ...] • >>> nltk.corpus.brown.tagged_sents() • [[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), …
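A quick sketch of what the tagged corpus supports, assuming the Brown corpus is downloaded (nltk.download('brown')):

import nltk
from nltk.corpus import brown

# Frequency of each Brown tag across the whole corpus; singular nouns
# (NN) head the list
tag_fd = nltk.FreqDist(tag for (word, tag) in brown.tagged_words())
print(tag_fd.most_common(5))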
Penn Treebank • First large syntactically annotated corpus • 1 million words from Wall Street Journal • Part-of-speech tags and syntax trees Modified from Diane Litman's version of Steve Bird's notes
help(nltk.corpus.treebank) • | parsed(*args, **kwargs) • | @deprecated: Use .parsed_sents() instead. • | parsed_sents(self, files=None) • | raw(self, files=None) • | read(*args, **kwargs) • | @deprecated: Use .raw() or .sents() or .tagged_sents() or .parsed_sents() instead. • | sents(self, files=None) • | tagged(*args, **kwargs) • | @deprecated: Use .tagged_sents() instead. • | tagged_sents(self, files=None) • | tagged_words(self, files=None)
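A minimal sketch of the non-deprecated accessors, assuming the Penn Treebank sample shipped with NLTK (nltk.download('treebank'); note that newer NLTK versions take fileids= rather than the files= shown in this older help text):

from nltk.corpus import treebank

print(treebank.tagged_sents()[0])   # [(word, tag), ...] for sentence 0
print(treebank.parsed_sents()[0])   # the corresponding syntax tree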
How hard is POS tagging? • In the Brown corpus, 12% of word types are ambiguous … • … but 40% of word tokens are ambiguous, because the ambiguous types tend to be the most frequent words Slide modified from Massimo Poesio's
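These proportions can be checked directly; a sketch assuming NLTK 3 (exact figures shift slightly with case-folding and tagset choices):

import nltk
from nltk.corpus import brown

# One FreqDist of tags per (lower-cased) word type
cfd = nltk.ConditionalFreqDist(
    (w.lower(), t) for (w, t) in brown.tagged_words())
types = cfd.conditions()
ambiguous = [w for w in types if len(cfd[w]) > 1]
print(len(ambiguous) / len(types))          # share of ambiguous types
print(sum(cfd[w].N() for w in ambiguous) /
      sum(cfd[w].N() for w in types))       # share of ambiguous tokens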
Tagging with lexical frequencies • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Problem: assign a tag to race given its lexical frequency • Solution: choose the tag with the greater probability of generating race • P(race|VB) • P(race|NN) • Actual estimates from the Switchboard corpus: • P(race|NN) = .00041 • P(race|VB) = .00003 • This suggests we should always tag race/NN (correct 41/44 ≈ 93%) Modified from Massimo Poesio's lecture
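A sketch of this lexical-probability-only strategy using NLTK's UnigramTagger, which assigns each word its single most frequent tag from the training data (the Switchboard estimates above were produced separately; this just reproduces the idea on Brown):

import nltk
from nltk.corpus import brown

tagger = nltk.UnigramTagger(brown.tagged_sents())
# 'race' gets its overall most frequent tag (NN), which is wrong here;
# words never seen in training, like 'Secretariat', come back as None
print(tagger.tag('Secretariat is expected to race tomorrow'.split()))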
Reminder • Puns play on our assumptions of the next word… • … eg they present us with an unexpected homonym (bends) • ConditionalFreqDist() counts word-pairs: word bigrams • Used for story generation, speech recognition, … • Parts of Speech: groups words into grammatical categories • … and separates different functions of a word • In English, many words are ambiguous: 2 or more PoS-tags • Very simple tagger: choose by lexical probability (only) • Better PoS-taggers: to come…