
CLINT


Faraday

Presentation Transcript


  1. CLINT Tokenisation Introduction to Computational Linguistics

  2. Information Food Chain • Inference • Knowledge Representation • Meaning Extraction • Semantic Relationships • Chunking (noun phrases; verb phrases) • Part of Speech Annotation • Paragraph and sentence identification • Tokenisation • Raw Text

  3. Start with a Corpus • A corpus is an organised body of materials from language that is used as a basis for empirical studies. • Corpora are classified according to • Representativeness • Medium • Language • Information Content • Structure

  4. Examples of Corpora • Project Gutenberg: public domain text resources. http://www.promo.net/pg • Brown Corpus: a tagged corpus of about 1M words put together at Brown University, 1960-70 • Penn Treebank: a corpus of parsed sentences based on text from the WSJ (Wall Street Journal) • Canadian Hansards: bilingual (English-French) corpus of the proceedings of the Canadian parliament.

  5. Low Level Issues • Preprocessing: getting rid of junk such as whitespace, images, certain formatting information etc. • Normalisation: deciding on standard character representations; adopting upper or lower case (or both) • Tokenisation

  6. Tokenisation • Tokenisation is a process which divides input text into individual units called tokens. • Tokens are normally taken to be indivisible by the next level of analysis, but they can be associated with various kinds of information. • An example of such information is the type of the token: word, punctuation, number
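
A minimal sketch of a tokeniser that attaches a type to each token, using the three types named on the slide. The pattern and the function name `tokenise` are assumptions made for illustration; characters that match none of the three patterns (whitespace, underscores) are silently skipped.

```python
import re

# Hypothetical pattern: one named group per token type from the slide.
TOKEN_PATTERN = re.compile(
    r"(?P<number>\d+(?:\.\d+)?)"   # integers and simple decimals
    r"|(?P<word>[^\W\d_]+)"        # runs of letters
    r"|(?P<punctuation>[^\w\s])"   # any single non-word, non-space character
)

def tokenise(text):
    """Return a list of (token, type) pairs."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]

print(tokenise("The cat ate 2 fish."))
```

The next level of analysis can then treat each pair as an indivisible unit and use the type as extra information.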

  7. What counts as a word? • Words are quite tricky to define • The standard definition: a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes but no other punctuation marks (Kucera and Francis 1967) • It is easy to find exceptions.
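
The standard definition can be sketched as a regular expression applied to whitespace-delimited items. This is one possible reading of the definition, restricted to ASCII letters and digits (an assumption); `classic_words` is a hypothetical helper name.

```python
import re

# Alphanumeric runs, optionally joined by single hyphens or apostrophes.
CLASSIC_WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")

def classic_words(text):
    """Split on whitespace; separate items that fit the definition from those that do not."""
    words, rejected = [], []
    for item in text.split():
        (words if CLASSIC_WORD.fullmatch(item) else rejected).append(item)
    return words, rejected

words, rejected = classic_words("I'll pay $22.50 on Wednesday.")
print(words)     # items accepted by the definition
print(rejected)  # the exceptions
```

Both `$22.50` and `Wednesday.` are rejected here, which illustrates how quickly exceptions turn up.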

  8. Problems Identifying Words • VfB Stuttgart scored twice in quick success-ion early in the second half on their way to a deserved 2-1 victory over Manchester United in the Champions League on Wednesday. (example from Mary Dalrymple, University of London) • VfB Stuttgart, Manchester United • succession • 2-1 • Wednesday

  9. Problems Identifying Words: Problems Involving Spaces • Lack of spaces between words: Lebensversicherungsgesellschaftsangestellter (life insurance company employee); Ix-Xemx • The presence of spaces may not indicate a word break: Coca Cola; +356 21 456 457

  10. Problems Involving Special Characters • Words often include non-alphanumeric characters which are actually part of the word: $22.50; www.di-ve.com.mt; BSc. IT; :-) • Words are often terminated by punctuation which is not part of the word. • Sometimes, terminating punctuation is part of the word.
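
One common way to handle such cases is an ordered list of hand-written patterns in which the special cases are tried before the general word pattern. The sketch below covers only three of the examples from the slide; the patterns and the name `tokenise_special` are illustrative assumptions, not a complete rule set.

```python
import re

# Alternatives are tried left to right, so special cases come first.
SPECIAL_CASES = re.compile(
    r"\$\d+(?:\.\d+)?"              # currency amounts such as $22.50
    r"|www\.[\w-]+(?:\.[\w-]+)+"    # web addresses such as www.di-ve.com.mt
    r"|:-\)"                        # a single emoticon
    r"|\w+(?:[-']\w+)*"             # ordinary words, with inner hyphens or apostrophes
    r"|[^\w\s]"                     # any other single non-space character
)

def tokenise_special(text):
    """Return the tokens found by the ordered rule list."""
    return SPECIAL_CASES.findall(text)

print(tokenise_special("It costs $22.50 at www.di-ve.com.mt :-)"))
```

Because the web-address pattern requires characters after each dot, a sentence-final period after the address is left as a separate token.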

  11. Periods • In general, punctuation marks attach to words, and can be removed. However, there are special cases: • Most periods mark the end of a sentence • Others mark abbreviations, e.g. "e.g.", "Wash." • Note that when an abbreviation occurs at the end of a sentence there is only one period.
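
A minimal sketch of the abbreviation check, assuming a small hand-made abbreviation list (hypothetical; a real list would be much larger). A final period is detached unless the item is a known abbreviation.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Wash.", "Dr."}

def split_periods(text):
    """Detach a trailing period from each whitespace-delimited item unless it is an abbreviation."""
    tokens = []
    for item in text.split():
        if item.endswith(".") and item not in ABBREVIATIONS and len(item) > 1:
            tokens.extend([item[:-1], "."])
        else:
            tokens.append(item)
    return tokens

print(split_periods("She moved to Wash. last year."))
print(split_periods("He lives in Wash."))
```

In the second call the single period is kept with the abbreviation, so no sentence-final token is produced: this is the "only one period" case from the slide, and a list lookup alone cannot resolve it.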

  12. Apostrophe • English contractions such as won't or I'll count as one word according to the classic definition • However there are reasons for wanting two separate tokens – such as interaction with grammar rules (S → NP VP) • Penn Treebank splits such contractions into two words.

  13. Apostrophe • This sometimes leaves odd words. For example, isn't yields is + n't • 's is ambiguous • Abbreviation for is (he's strange) • Possessive (John's car) • Word-final apostrophe is ambiguous • end of quotation • possessive of word ending in s
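
A rough sketch of contraction splitting in the style described above. The clitic list is an assumption and is not the full Penn Treebank rule set; the code expects the straight apostrophe (') and does not try to decide whether 's is a verb or a possessive.

```python
import re

# Hypothetical clitic list; matched case-insensitively at the end of a word.
CLITIC = re.compile(r"(?i)(n't|'s|'ll|'re|'ve|'d|'m)$")

def split_contraction(word):
    """Split a word into [stem, clitic] if it ends in a known clitic."""
    m = CLITIC.search(word)
    if m and m.start() > 0:
        return [word[:m.start()], m.group()]
    return [word]

for w in ["isn't", "won't", "he's", "John's", "I'll", "cat"]:
    print(w, split_contraction(w))
```

Splitting won't this way leaves the stem wo, which is the kind of odd word the slide mentions.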

  14. Exercise • How is the apostrophe used in Maltese? • How should a Maltese tokeniser deal with it?

  15. Hyphen • Issue: do sequences of words joined by hyphens count as one word or more? • Typesetting hyphens (at end of line) and hyphens in measure phrases (35-year-old) are usually removed. • Typesetting hyphens can be ambiguous • Lexical hyphens are usually kept: hi-fi • Hyphens – standing alone – are used as punctuation. • Texts are often inconsistent in their usage of hyphens
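
A small sketch of how typesetting hyphens at a line break might be undone. It assumes a toy lexicon (hypothetical) to decide whether the hyphen is lexical; when the hyphenated form is not in the lexicon the hyphen is simply dropped, which is a design choice and will be wrong for some words.

```python
import re

# Toy lexicon, for illustration only.
LEXICON = {"succession", "hi-fi"}

def join_line_breaks(text):
    """Rejoin words that were split by a hyphen at the end of a line."""
    def rejoin(m):
        left, right = m.group(1), m.group(2)
        hyphenated = f"{left}-{right}"
        # Keep the hyphen only if the hyphenated form is a known word.
        return hyphenated if hyphenated.lower() in LEXICON else left + right
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", rejoin, text)

print(join_line_breaks("in quick success-\nion early"))
print(join_line_breaks("a new hi-\nfi system"))
```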

  16. Case • Types vs. Tokens • How many tokens in the following sentence: The cat chased the rat on the table • How many types? • Tokenisation should correctly identify word types, i.e. • Tokens of the same type should be identified • Tokens of different type should be distinguished • Case representation of ordinary words must be standardised.
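
The token and type counts for the example sentence can be checked with a few lines of code. This sketch splits on whitespace only, and the `fold_case` switch is an assumed parameter that shows why case standardisation matters for the type count.

```python
def count_tokens_and_types(sentence, fold_case=True):
    """Return (number of tokens, number of distinct types)."""
    tokens = sentence.split()
    if fold_case:
        tokens = [t.lower() for t in tokens]
    return len(tokens), len(set(tokens))

sentence = "The cat chased the rat on the table"
print(count_tokens_and_types(sentence, fold_case=False))  # (8, 7)
print(count_tokens_and_types(sentence, fold_case=True))   # (8, 6)
```

Without case folding, The and the count as different types; with folding, all three occurrences fall under one type.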

  17. Case • Heuristics • Map first character of a sentence to standard case • Map all words in titles to lowercase • Problems • Identification of sentence boundaries • Identification of proper names

  18. Normalisation • Character representations. • Converting all letters to lower or upper case • Removing punctuation • Removing accent marks and other diacritics from letters • Expanding abbreviations
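
Three of these steps (case, punctuation, diacritics) can be sketched with the Python standard library. The function name `normalise` is hypothetical, only ASCII punctuation is removed, and abbreviation expansion is left out because it needs a lookup table.

```python
import string
import unicodedata

def normalise(text):
    """Lower-case, strip diacritics and remove ASCII punctuation."""
    # Decompose accented letters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = stripped.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

print(normalise("Café, Schütze!"))
```

Each step discards information, so whether to apply it depends on what the later stages of analysis need.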

  19. Further Normalisation • Stemming: are eats and eating different words? • They are two different wordforms that have the same stem, eat, but different suffixes, -s and -ing • Stemming versus full morphological analysis.
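
A deliberately naive suffix-stripping sketch, using only the two suffixes from the slide. It is not a real stemming algorithm; `naive_stem` and its length check are assumptions made for illustration.

```python
SUFFIXES = ("ing", "s")  # only the two suffixes mentioned on the slide

def naive_stem(word):
    """Strip the first matching suffix, provided at least two characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

for w in ["eats", "eating", "running", "sing"]:
    print(w, naive_stem(w))
```

The result for running (runn) shows why plain suffix stripping falls short of full morphological analysis.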

  20. Summary • The tokenisation problem interacts with design decisions at different levels concerning • Handling of non alphanumeric characters • Case • Punctuation • Typically many of these problems are dealt with by hand crafting special rules which match a particular case. • Such rules are often built out of regular expressions.

  21. Sources • Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press, 1999
