
CSA2050 Introduction to Computational Linguistics


eyal


  1. CSA2050 Introduction to Computational Linguistics Lecture 3 Examples

  2. Course Contents CSA2050 - Lecture III: Examples

  3. Outline
     • Examples in the areas of
       • Tokenisation
       • Morphological Analysis
       • Tagging
       • Syntactic Analysis

  4. Information Extraction
     [Diagram; labels: raw text, tokenisation, tagged text, morphological analysis, syntactic analysis, named entity recognition]

  5. Tokenisation
     • The basic idea of tokenisation is to identify the basic tokens that are present in a text.
     • Mostly, tokens are the same as words, but not always.
     • Why should this be a problem?
       John’s car cost €10,000.00. “And it’s worth every penny”, he exclaimed.
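
The example sentence can be used to try out a small regular-expression tokeniser. This is only a sketch: the pattern `TOKEN_RE` and the function `tokenise` are hypothetical names, and the choice to keep "John’s" and "€10,000.00" as single tokens is one possible convention, not the lecture's prescribed answer.

```python
import re

# Hypothetical token pattern; the alternatives are tried left to right.
TOKEN_RE = re.compile(r"""
    [€$]?\d+(?:[.,]\d+)*     # numbers and amounts such as 10,000.00 or 1.99
  | \w+(?:[’']\w+)*          # words, keeping word-internal apostrophes
  | [^\w\s]                  # any other single non-space character
""", re.VERBOSE)

def tokenise(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)

print(tokenise("John’s car cost €10,000.00."))
print(tokenise("“And it’s worth every penny”, he exclaimed."))
```

The sentence-final full stop is split off because the number pattern only absorbs a period when digits follow it. A different convention (for example, splitting "John’s" into "John" and "’s") would need a different second alternative.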

  6. Tokenisation Problems: Punctuation
     • novel forms: .net, Micro$oft, :-)
     • hyphenation:
       • linebreaks vs word-internal: e-mail, 898-0587
       • multi-word: the 90-cent-an-hour raise
       • confusion with dash
     • apostrophes in contractions: we'll
     • periods
       • part of names: Amazon.com
       • numerical expressions: $1.99
       • abbreviations, end of sentence, haplology
     • commas: 1,000,000

  7. Other Problems
     • Token-internal whitespace: 898 0464
     • Interaction: the New York-New Haven railroad
     • Mixed language tokens: u
     • Automated language guesser
     • Token equivalence (when are two tokens the same?)
     • Case-normalization
     • Sentence boundary detection
     • Inconsistency: database, data-base, data base
     • Demo: Xerox tokeniser
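
One way to make token equivalence concrete is to map each token to a comparison key. The sketch below assumes that case, hyphens and internal whitespace should all be ignored; `normalise` and `same_token` are hypothetical helpers, and the policy is deliberately crude.

```python
import re

def normalise(token):
    """Map spelling variants of a token to one comparison key."""
    key = token.casefold()              # case normalisation
    return re.sub(r"[-\s]+", "", key)   # drop hyphens and internal whitespace

def same_token(a, b):
    """Treat two tokens as equivalent when their keys coincide."""
    return normalise(a) == normalise(b)

print(same_token("database", "data-base"))
print(same_token("Data base", "database"))
print(same_token("database", "databank"))
```

The cost of such a policy is over-merging: under these assumptions "re-cover" and "recover" would also count as the same token.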

  8. Morphology
     • Simple versus complex words: dog, dogs
     • Complex words are formed by concatenation of morphemes.
     • Morpheme: the smallest unit in a word that bears some meaning, such as dog and s.

  9. Morphological Analysis
     • Morphological analysis of a word involves a segmentation problem
     • Segmentation: discovery of the component morphemes
       dogs → dog + s
       enlargement → en + large + ment
     • Possible ambiguities:
       enlargement → enlarge + ment
                   → en + largement
     • Role of lexicon
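
The role of the lexicon in segmentation can be illustrated with a short exhaustive search. This is a sketch under assumptions: `LEXICON` is a hypothetical six-entry morpheme list, and `segmentations` simply enumerates every way of cutting the word into listed pieces.

```python
# Hypothetical morpheme lexicon, just large enough for the slide's examples.
LEXICON = {"dog", "s", "en", "large", "ment", "enlarge"}

def segmentations(word, lexicon=LEXICON):
    """Return every split of word into a sequence of lexicon entries."""
    if not word:
        return [[]]                      # one way to segment the empty string
    results = []
    for i in range(1, len(word) + 1):
        prefix, rest = word[:i], word[i:]
        if prefix in lexicon:
            for tail in segmentations(rest, lexicon):
                results.append([prefix] + tail)
    return results

print(segmentations("dogs"))
print(segmentations("enlargement"))
```

With this lexicon, "enlargement" receives two segmentations (en + large + ment and enlarge + ment), while en + largement is never proposed because "largement" is not listed.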

  10. Morphological Analysis
      John has a couple of rabbits
      • rabbits → rabbit + s
      • s indicates plural of noun rabbit
      • Is this the only possibility?

  11. Morphological Analysis
      John rabbits on and on
      • rabbits → rabbit + s
      • s indicates 3rd person singular of verb rabbit
      • The suffix “s” is a realisation of two entirely different morphemes.
      • The morpheme is something more abstract than the string which realises it.

  12. Morphological Analysis
      [Diagram; labels: -s, -a, suffix world, morpheme world, +3S, +PL]

  13. Morphological Analysis
      [Diagram: Input Word (rabbits) → Morphological Parser → Output Analysis (rabbit N PL, rabbit V 3S)]
      • Output is a string of morphemes
      • Morpheme is employed in a loose sense that is useful for further processing
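
A toy version of such a parser can be written as suffix stripping against a stem lexicon. The tables `STEMS` and `SUFFIXES` and the function `analyse` are hypothetical, and the sketch only handles suffixed forms; it is not how ENGTWOL or the Xerox tools work internally.

```python
# Hypothetical stem lexicon: stem -> categories it can belong to.
STEMS = {"rabbit": {"N", "V"}, "dog": {"N"}}

# Hypothetical suffix table: suffix -> feature contributed per category.
SUFFIXES = {"s": {"N": "PL", "V": "3S"}}

def analyse(word):
    """Return (stem, category, feature) analyses for a suffixed word."""
    analyses = []
    for suffix, features in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        stem = word[:-len(suffix)]
        for category in sorted(STEMS.get(stem, ())):
            if category in features:
                analyses.append((stem, category, features[category]))
    return analyses

print(analyse("rabbits"))
print(analyse("dogs"))
```

The single string "s" yields PL for the noun reading and 3S for the verb reading, which is the ambiguity the slides describe.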

  14. Morphological Analysis: ENGTWOL & Xerox
      • Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995
      • ENGTWOL demo
      • Xerox morphological analysis

  15. Morphological Synthesis
      [Diagram: Input (rabbit N PL, rabbit V 3S) → Morphological Parser → Output Word (rabbits)]
      • Input is a string of morphemes
      • Output is a word

  16. Reversibility
      Lookup
        APPLY UP> left
        left    leave+Verb+PastBoth+123SP
        left    left+Adv
        left    left+Adj
        left    left+Noun+Sg
      Lookdown
        APPLY DOWN> leave+Adj
        left
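
Reversibility can be pictured as one relation between surface forms and analyses that is queried in either direction. The sketch below stores the four pairs from the lookup output in a plain list; `apply_up` and `apply_down` are hypothetical Python functions named after the prompts on the slide, not the API of the Xerox tools.

```python
# (surface form, analysis) pairs copied from the lookup output on the slide.
RELATION = [
    ("left", "leave+Verb+PastBoth+123SP"),
    ("left", "left+Adv"),
    ("left", "left+Adj"),
    ("left", "left+Noun+Sg"),
]

def apply_up(surface):
    """Analysis direction: surface form -> all of its analyses."""
    return [analysis for form, analysis in RELATION if form == surface]

def apply_down(analysis):
    """Synthesis direction: analysis -> all surface forms realising it."""
    return [form for form, a in RELATION if a == analysis]

print(apply_up("left"))
print(apply_down("leave+Verb+PastBoth+123SP"))
```

Because both functions read the same table, analysis and synthesis cannot drift apart; any pair added to the relation becomes available in both directions.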

  17. POS Tagging
      • In POS tagging, the task is to assign the most appropriate morphosyntactic label from amongst those listed in the lexicon, given the context.
      • John leaves presents.
      • Proper Names
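
To see how context can select among the labels the lexicon offers, the sketch below enumerates every tag sequence for "John leaves presents" and keeps the best-scoring one. Everything here is an assumption made for illustration: the Penn Treebank-style tag names, the lexicon entries, and above all the invented scores in `BIGRAM`. Brute-force enumeration is used for clarity and is not presented as the lecture's method.

```python
from itertools import product

# Hypothetical lexicon: each word lists the tags it may carry.
LEXICON = {
    "John": ["NNP"],
    "leaves": ["NNS", "VBZ"],
    "presents": ["NNS", "VBZ"],
}

# Invented preferences for adjacent tag pairs; "<s>" marks sentence start.
BIGRAM = {
    ("<s>", "NNP"): 1.0,
    ("NNP", "VBZ"): 1.0,
    ("VBZ", "NNS"): 1.0,
    ("NNS", "VBZ"): 0.6,
    ("NNS", "NNS"): 0.3,
    ("NNP", "NNS"): 0.2,
    ("VBZ", "VBZ"): 0.05,
}

def score(tags):
    """Multiply the preferences of all adjacent tag pairs."""
    total, previous = 1.0, "<s>"
    for tag in tags:
        total *= BIGRAM.get((previous, tag), 0.0)
        previous = tag
    return total

def tag_sentence(words):
    """Pick the highest-scoring tag sequence allowed by the lexicon."""
    candidates = product(*(LEXICON[word] for word in words))
    best = max(candidates, key=score)
    return list(zip(words, best))

print(tag_sentence(["John", "leaves", "presents"]))
```

With these invented scores the verb reading of "leaves" followed by the noun reading of "presents" wins. Words missing from `LEXICON` raise a `KeyError` in this sketch.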

  18. Semantic Tagging
      • Named Entity Recognition
      • Basic idea is to recognise and tag named entities and classify them as being of type
        • Persons
        • Locations
        • Organisations
      • Named Entity Recognition - Demo
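
The simplest way to recognise and classify named entities is a list lookup. The sketch below assumes a tiny hypothetical `GAZETTEER` built from names that appear elsewhere in these slides and prefers the longest match at each position; real recognisers also use context, which this example does not.

```python
# Hypothetical gazetteer: token sequence -> entity type.
GAZETTEER = {
    ("John",): "Person",
    ("New", "York"): "Location",
    ("New", "Haven"): "Location",
    ("Xerox",): "Organisation",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)

def tag_entities(tokens):
    """Return (entity text, type) pairs, preferring the longest match."""
    entities = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in GAZETTEER:
                entities.append((" ".join(span), GAZETTEER[span]))
                i += n           # skip past the matched entity
                break
        else:
            i += 1               # no entity starts at this token
    return entities

print(tag_entities("John moved from New York to Xerox".split()))
```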

  19. Syntactic Analysis
      • Problem: given sentence and grammar/lexicon, discover assigned tree structure.
      • XIP Parser Demo
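
The parsing problem can be tried out with a small chart parser. This sketch uses the CKY algorithm over a hypothetical three-word lexicon and a two-rule grammar in Chomsky normal form; `BINARY`, `LEXICAL` and `parse` are illustrative names, and the approach is not claimed to be what the XIP parser does.

```python
# Hypothetical toy grammar in Chomsky normal form: (left, right) -> parent.
BINARY = {("NP", "VP"): "S", ("V", "NP"): "VP"}

# Hypothetical lexicon: word -> categories it can have.
LEXICAL = {"John": ["NP"], "leaves": ["V", "NP"], "presents": ["V", "NP"]}

def parse(words):
    """Return one S tree for words as nested tuples, or None."""
    n = len(words)
    # chart[i][j] maps a category to one tree spanning words[i:j].
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for category in LEXICAL.get(word, []):
            chart[i][i + 1][category] = (category, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):            # split point
                for lcat, ltree in chart[i][k].items():
                    for rcat, rtree in chart[k][j].items():
                        parent = BINARY.get((lcat, rcat))
                        if parent and parent not in chart[i][j]:
                            chart[i][j][parent] = (parent, ltree, rtree)
    return chart[0][n].get("S")

print(parse(["John", "leaves", "presents"]))
```

Although "leaves" and "presents" are each listed as both V and NP, only the reading with "leaves" as the verb yields a complete S under this grammar. The chart keeps just the first tree found per category, so genuinely ambiguous sentences would need a list of trees in each cell.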
