Computational linguistics

Presentation Transcript


  1. Computational linguistics: Syntactic parsing and Part-of-Speech tagging (Apr. 5)

  2. Computational linguistics covers a wide range of topic areas: • Syntax, semantics, grammar, and the lexicon • Lexical semantics and ontologies • Phonology/morphology, word segmentation, and tagging • Summarization • Language generation • Paraphrasing and textual entailment • Parsing and chunking • Spoken language processing, understanding, and speech-to-speech translation • Linguistic, psychological, and mathematical models of language • Computational pragmatics • Dialogue and conversational agents • Computational models of discourse • Information retrieval • Question answering • Word sense disambiguation • Information extraction and text mining • Semantic role labeling • Sentiment analysis and opinion mining • Corpus-based modeling of language • Machine translation and translation aids • Multilingual processing • Multimodal systems and representations • Statistical and machine learning methods • Applications • Corpus development and language resources • Evaluation methods and user studies

  3. Computational linguistics • Emphasis on integrating linguistic and other knowledge to produce working systems. • System performance is important: computational linguistics deals with language as it is actually used. • There is little need to worry about rare constructions and fine distinctions, • but much need to worry about fragments, typos, false starts, ambiguities, non-native speakers, etc. • Ambiguity in natural language is pervasive, and it is a large part of what makes computational linguistics hard.

  4. Ambiguity • Lexical: • bank (river bank vs. financial institution) • unlockable (un+lockable, "cannot be locked", vs. unlock+able, "can be unlocked") • Syntactic: • I shot an elephant in my pajamas. (How he got in my pajamas, I'll never know.) • I forgot how good beer tastes. • I met Mary and Elena's mother at the mall yesterday. • Semantic: • Every cat chases a mouse. (a different mouse per cat, or one shared mouse?) • The police refused the demonstrators a permit because ... they feared violence / ... they advocated violence. (who does "they" refer to in each case?)

  5. Parsing • Parsing: taking an input and producing some sort of structure for it. • A syntactic parser is a device (or algorithm) that takes a phrase or sentence as input and uses a grammar (including a lexicon) to produce the syntactic structure(s) appropriate for that phrase or sentence, often called parse trees or just trees.

  6. Context-free grammar • The type of grammar most often used in parsing is the context-free grammar. • A context-free grammar is a set of rules/productions (plus a lexicon) that specify how a syntactic constituent can be composed of smaller constituents. (The term "context-free" means that how a constituent expands does not depend on what other constituents are around it.)

  7. Context-free grammar • The symbols for constituents (e.g., phrases and sentences) are called non-terminal symbols; those representing words are called terminal symbols. • Each rule has a single non-terminal symbol on the left-hand side of the arrow. This symbol is expanded into the symbols (non-terminal or terminal) on the right-hand side; the non-terminal symbols on the right-hand side can then be expanded by other rules. • The vertical stroke | is just shorthand for alternative expansions. • The grammar "accepts" a sentence if there is a way of expanding S (the start symbol), then expanding all the sub-constituents, and so on, until the leaves of the tree match the words in the sentence (which are terminal symbols). To accept noun phrases instead, we can treat NP as the start symbol. A toy grammar in this notation is sketched below.
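
To make the notation concrete, here is a toy grammar and parser written with the NLTK toolkit (an assumption for illustration; the slides do not use NLTK, and the rules and words are invented to cover the pajamas example from slide 4):

    import nltk

    # A toy grammar in the rule notation described above; | separates
    # alternative expansions, and quoted items are terminals (words).
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | NP PP | 'I'
        VP -> V NP | VP PP
        PP -> P NP
        Det -> 'an' | 'my'
        N  -> 'elephant' | 'pajamas'
        V  -> 'shot'
        P  -> 'in'
    """)

    # The parser expands S until the leaves match the input words.
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("I shot an elephant in my pajamas".split()):
        print(tree)   # two trees: the PP attaches to the VP or to the NP

The two printed trees are exactly the syntactic ambiguity from slide 4: "in my pajamas" modifies either the shooting or the elephant.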

  8. Parsing • Parsing runs a grammar "backwards" to find the possible structures of a sentence; it can be viewed as a search problem. • Top-down strategy: all the expansions of the start symbol are considered, then expansions of each of those constituents, and so on, until we reach expansions that match all the words in the sentence. (What are the problems? Two classic ones: left-recursive rules such as NP -> NP PP send the parser into an infinite loop, and much effort is spent on predictions the input could have ruled out immediately. A sketch follows.)
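
A minimal top-down recognizer in plain Python, as a sketch (the dictionary grammar is a variant of the toy grammar above with the left-recursive NP -> NP PP rule removed, since naive top-down expansion never terminates on left recursion):

    # Toy grammar: non-terminal -> list of possible right-hand sides.
    # Note the absence of left-recursive rules such as NP -> NP PP: a
    # naive top-down parser would expand them forever without consuming
    # a single input word.
    GRAMMAR = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"], ["Pro"]],
        "VP":  [["V", "NP"], ["V", "NP", "PP"]],
        "PP":  [["P", "NP"]],
        "Det": [["an"], ["my"]],
        "N":   [["elephant"], ["pajamas"]],
        "Pro": [["I"]],
        "V":   [["shot"]],
        "P":   [["in"]],
    }

    def expand(symbols, words):
        """Yield each suffix of `words` left over after matching `symbols`."""
        if not symbols:
            yield words
            return
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                  # non-terminal: try every rule
            for rhs in GRAMMAR[first]:
                yield from expand(rhs + rest, words)
        elif words and words[0] == first:     # terminal: must match next word
            yield from expand(rest, words[1:])

    def recognize(sentence):
        return any(left == [] for left in expand(["S"], sentence.split()))

    print(recognize("I shot an elephant in my pajamas"))   # True

The wasted-work problem is visible in a trace: the recognizer predicts NP -> Det N before looking at the input at all, even though the sentence starts with "I".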

  9. Parsing • Bottom-up strategy: the words are examined and all the small constituents that might contain them are postulated; then we see which of those can be fitted together into larger constituents, and so on, until we reach a tree. (What are the problems? The mirror image of top-down: the parser builds many locally plausible constituents that can never combine into a complete S. A bottom-up sketch follows.)
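
For contrast, a bottom-up sketch using the CYK algorithm, one standard bottom-up method (illustrative; the toy grammar is restated in the binary-rules-plus-lexicon form that CYK requires):

    from itertools import product

    # Binary rules, keyed by the pair of children, plus a word lexicon.
    BINARY = {
        ("NP", "VP"): {"S"},
        ("Det", "N"): {"NP"},
        ("NP", "PP"): {"NP"},
        ("V", "NP"):  {"VP"},
        ("VP", "PP"): {"VP"},
        ("P", "NP"):  {"PP"},
    }
    LEXICON = {
        "I": {"NP"}, "shot": {"V"}, "an": {"Det"}, "my": {"Det"},
        "elephant": {"N"}, "pajamas": {"N"}, "in": {"P"},
    }

    def cyk(words):
        n = len(words)
        # chart[i][j] = the set of non-terminals that can span words[i:j]
        chart = [[set() for _ in range(n + 1)] for _ in range(n)]
        for i, w in enumerate(words):
            chart[i][i + 1] = set(LEXICON.get(w, ()))
        for span in range(2, n + 1):          # combine smaller constituents
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):     # try every split point
                    for b, c in product(chart[i][k], chart[k][j]):
                        chart[i][j] |= BINARY.get((b, c), set())
        return "S" in chart[0][n]

    print(cyk("I shot an elephant in my pajamas".split()))   # True

The bottom-up problem shows up in the chart: it fills with constituents (every Det N pair becomes an NP) regardless of whether they can ever contribute to a complete S.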

  10. Parsing • The left-corner strategy (top-down prediction with bottom-up verification): make the left-most expansion (top-down), find rules that can actually start with the left-most words (bottom-up), and repeat. • Example: Does this flight include a meal? A sketch of the left-corner filtering idea follows.
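
A sketch of the key left-corner idea, computed over the dictionary grammar above (illustrative, not a full parser): precompute which symbols can appear as the leftmost descendant of each category, then use that table to discard top-down predictions the next word cannot begin.

    # left_corners[A] = every symbol that can start an A, computed as the
    # transitive closure of "first symbol on some right-hand side of A".
    def left_corners(grammar):
        lc = {a: {rhs[0] for rhs in rules} for a, rules in grammar.items()}
        changed = True
        while changed:
            changed = False
            for a in lc:
                for b in list(lc[a]):
                    new = lc.get(b, set()) - lc[a]
                    if new:
                        lc[a] |= new
                        changed = True
        return lc

    print(left_corners(GRAMMAR)["S"])
    # {'NP', 'Det', 'Pro', 'an', 'my', 'I'} (set order may vary)

If the grammar also had S -> Aux NP VP with Aux -> 'does', then on seeing "Does this flight include a meal?" the parser would keep only the S expansions whose left-corner set contains Aux: top-down prediction verified bottom-up by the first word.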

  11. Parsing • Does this flight include a meal? (left-corner parse, continued)

  12. Parsing • Does this flight include a meal? (left-corner parse, continued)

  13. Probabilistic CFGs and Statistical Parsing • Attach probabilities to context-free grammar rules (PCFG): the probabilities of the expansions for a given non-terminal sum to 1. • Goal: find a single parse tree for a sentence (the maximum-probability tree) instead of all possible parse trees, as sketched below.
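
A sketch of finding the maximum-probability tree's score by extending the CYK chart above with probabilities (probabilistic/Viterbi CYK; all rule probabilities are invented for illustration, with each non-terminal's expansions summing to 1):

    # (B, C) -> [(A, P(A -> B C))]. NP also expands to 'I' with prob 0.2,
    # so the three NP expansions sum to 1. All numbers are illustrative.
    RULES = {
        ("NP", "VP"): [("S", 1.0)],
        ("Det", "N"): [("NP", 0.6)],
        ("NP", "PP"): [("NP", 0.2)],
        ("V", "NP"):  [("VP", 0.7)],
        ("VP", "PP"): [("VP", 0.3)],
        ("P", "NP"):  [("PP", 1.0)],
    }
    LEX = {
        "I": {"NP": 0.2}, "shot": {"V": 1.0}, "in": {"P": 1.0},
        "an": {"Det": 0.5}, "my": {"Det": 0.5},
        "elephant": {"N": 0.5}, "pajamas": {"N": 0.5},
    }

    def best_parse_prob(words):
        """best[i][j][A] = probability of the best A-tree over words[i:j]."""
        n = len(words)
        best = [[{} for _ in range(n + 1)] for _ in range(n)]
        for i, w in enumerate(words):
            best[i][i + 1] = dict(LEX.get(w, {}))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for b, pb in best[i][k].items():
                        for c, pc in best[k][j].items():
                            for a, pr in RULES.get((b, c), ()):
                                p = pr * pb * pc    # rule prob x best subtrees
                                if p > best[i][j].get(a, 0.0):
                                    best[i][j][a] = p   # keep only the max
        return best[0][n].get("S", 0.0)

    print(best_parse_prob("I shot an elephant in my pajamas".split()))

Keeping only the maximum in each cell is what lets the parser return a single best tree without enumerating every possible parse.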

  14. Probabilistic CFGs and Statistical Parsing • Two candidate parse trees for the same sentence, each scored as the product of the probabilities of the rules it uses: .15 × .40 × .05 × .05 × … = 1.5 × 10⁻⁶ versus .15 × .40 × .40 × .05 × … = 1.7 × 10⁻⁶; the second, higher-probability tree is preferred.

  15. Probabilistic CFGs and Statistical Parsing • Rule probabilities can be estimated from an annotated database (a treebank), such as the Penn Treebank, by relative frequency: P(A -> β) = Count(A -> β) / Count(A). A sketch follows.
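
A sketch using NLTK's bundled sample of the Penn Treebank (an assumption: NLTK and its treebank data must be installed, e.g. via nltk.download('treebank')):

    import nltk
    from nltk.corpus import treebank

    # Collect every grammar rule used in the hand-annotated trees, then
    # estimate each rule's probability by relative frequency.
    productions = [p for t in treebank.parsed_sents() for p in t.productions()]
    pcfg = nltk.induce_pcfg(nltk.Nonterminal("S"), productions)

    for rule in pcfg.productions()[:5]:
        print(rule)   # each rule prints with its estimated probability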

  16. Human parsing • While most sentences are ambiguous in some way, people rarely notice these ambiguities; they seem to see only one interpretation for a sentence. • Lexical subcategorization preferences: • The women kept the dogs on the beach. Read as "kept the dogs which were on the beach": 5%; read as "kept them (the dogs) on the beach": 95%. • The women discussed the dogs on the beach. Read as "discussed the dogs which were on the beach": 90%; read as "discussed them (the dogs) while on the beach": 10%. • (keep prefers VP -> V NP PP; discuss prefers VP -> V NP) • Part-of-speech preferences: • The complex houses married and single students and their families. ("houses" is more likely to be a noun, which is what makes this garden-path sentence hard: here it is the verb)

  17. Head lexicalization of PCFGs • The head word of a phrase gives a good representation of the phrase's structure and meaning; e.g., a rule like VP -> V NP PP becomes VP(kept) -> V(kept) NP(dogs) PP(beach). • Lexicalization puts the properties of words back into a PCFG. • Lexicalized probabilistic context-free grammars perform much better than plain PCFGs (roughly 88% vs. 73% accuracy).

  18. Part-of-Speech tagging • Part-of-Speech (POS) tagging: Input: the lead paint is unsafe. Output: the/Det lead/N paint/N is/V unsafe/Adj. • Uses of POS tagging: • Text-to-speech: how do we pronounce "lead"? (see e.g. http://www.ivona.com/) Which words bear a pitch accent? • It can differentiate word senses that involve part-of-speech differences (what is the meaning of "interest"?). • Tagged text helps linguists find interesting syntactic constructions in texts ("google", "ssh", etc. used as verbs). • POS tagging is not parsing. It is highly accurate: the state of the art is about 97% accuracy. But the baseline is already 90%: 1. tag every word with its most frequent tag; 2. tag unknown words as nouns. This baseline is sketched below.
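
A sketch of that 90% baseline (the (word, tag) training format and the tag names follow the slide's example and are assumptions for illustration):

    from collections import Counter, defaultdict

    def train_baseline(tagged_sents):
        """Learn each word's single most frequent tag."""
        counts = defaultdict(Counter)
        for sent in tagged_sents:             # sent = [(word, tag), ...]
            for word, tag in sent:
                counts[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def baseline_tag(words, most_frequent_tag):
        # Rule 1: most frequent tag; rule 2: unknown words become nouns.
        return [(w, most_frequent_tag.get(w.lower(), "N")) for w in words]

    model = train_baseline([[("the", "Det"), ("lead", "N"), ("paint", "N"),
                             ("is", "V"), ("unsafe", "Adj")]])
    print(baseline_tag("the lead paint is unsafe".split(), model))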

  19. Part-of-Speech tagging • The Penn Treebank tag set, with 45 different POS tags, is the most widely used.

  20. Part-of-Speech tagging • Percentage of words accented ("stressed") under each part-of-speech category in different speech genres (table not reproduced here).

  21. Hidden Markov Model POS tagger • Hidden Markov Models (HMMs) are widely used in many fields: natural language processing, speech synthesis/recognition, computer vision, biology, economics, climatology, etc. • In the standard diagram, the top row is the unobserved (hidden) states, interpreted as POS tags; the bottom row is the observed output (words). • The task: find the most likely hidden state sequence (POS tag sequence) given an observation sequence (word sequence).

  22. Hidden Markov Model POS tagger • Representation for paths (hidden state sequences): the trellis, with one column of candidate tags per word. The standard algorithm for finding the best path through the trellis is sketched below.
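
A minimal Viterbi sketch: each `best` dictionary below is one column of the trellis just described, holding, for every tag, the log-probability and path of the best way to reach that tag (all probability tables are invented for illustration):

    import math

    TAGS  = {"Det", "N", "V", "Adj"}
    PI    = {"Det": 0.8, "N": 0.1, "V": 0.05, "Adj": 0.05}  # P(first tag)
    TRANS = {"Det": {"N": 0.9, "Adj": 0.1},                 # P(next | current)
             "N":   {"N": 0.3, "V": 0.6, "Adj": 0.1},
             "V":   {"Det": 0.4, "Adj": 0.5, "N": 0.1},
             "Adj": {"N": 0.8, "Adj": 0.2}}
    EMIT  = {"Det": {"the": 1.0},                           # P(word | tag)
             "N":   {"lead": 0.4, "paint": 0.6},
             "V":   {"is": 0.8, "lead": 0.1, "paint": 0.1},
             "Adj": {"unsafe": 1.0}}

    def lp(p):   # log probability; -inf marks impossible events
        return math.log(p) if p > 0 else float("-inf")

    def viterbi(words):
        """Return the most likely tag sequence for `words` under the HMM."""
        best = {t: (lp(PI.get(t, 0)) + lp(EMIT[t].get(words[0], 0)), [t])
                for t in TAGS}
        for w in words[1:]:                   # extend the trellis one column
            best = {t: max((s + lp(TRANS[q].get(t, 0)) + lp(EMIT[t].get(w, 0)),
                            path + [t])
                           for q, (s, path) in best.items())
                    for t in TAGS}
        return max(best.values())[1]

    print(viterbi("the lead paint is unsafe".split()))
    # -> ['Det', 'N', 'N', 'V', 'Adj']

Because each column keeps only the best path into each tag, the search grows linearly with sentence length rather than exponentially with the number of possible tag sequences.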

  23. HAL (the talking computer from 2001: A Space Odyssey)
