
STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS

STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS. Massimo Poesio, Università di Venezia. Course objectives: an introduction to the use of corpora and to statistical methods. Course plan: foundations of statistics, use of corpora.



  1. STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS • Massimo Poesio, Università di Venezia

  2. Course objectives • An introduction to the use of corpora and to statistical methods

  3. Course plan • Foundations of statistics, use of corpora • Basic tasks & techniques: word prediction, n-grams, smoothing, spelling correction, Bayesian inference • POS tagging: tagsets, Brill tagger, HMM tagging • System evaluation • The lexicon • Probabilistic grammars, statistical parsing

  4. Today • Statistics and Linguistics (Abney, 1996) • Foundations of probability • Corpora

  5. Practical details • Schedule: 10:30-13:00, 14:30-17:00 • Labs: 17:00 to 18:00 (not today) • Office hours: 9:30-10:30, 18:00-19:00 • Email: poesio@essex.ac.uk • Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/

  6. Empiricism vs. Rationalism • Chomskyan linguistics. Assumption: linguistic knowledge is mostly innate; emphasis on explanation; primary goal: simplicity of the theory • Empirical methods. Assumption: linguistic knowledge derives primarily from generalizations over experience; emphasis on data; primary goal: fact discovery • Computational Linguistics between 1960 and 1980 was mostly Chomskyan

  7. Problems statistical methods are meant to address • Ambiguity resolution; the earlier approaches were narrow domains (to avoid ambiguity), hand-coded rules, and hand-tuned preference weights • Adaptation to new domains • Measuring improvement

  8. Case study: POS tagging • “Time flies like an arrow” • Candidate tags per word: Time = N/V, flies = N/V, like = V/N/CJ, an = Det, arrow = N
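To make the size of the ambiguity concrete, the candidate tags on the slide can be multiplied out. The following is a minimal sketch, not a tagger: it only enumerates every tag sequence the candidate lists allow. The variable and function names are illustrative assumptions.

```python
from itertools import product

# Candidate tags per word, copied from the slide.
candidates = [
    ("Time", ["N", "V"]),
    ("flies", ["N", "V"]),
    ("like", ["V", "N", "CJ"]),
    ("an", ["Det"]),
    ("arrow", ["N"]),
]


def tag_sequences(cands):
    """Return every tag sequence compatible with the per-word candidates."""
    tag_lists = [tags for _, tags in cands]
    return [list(seq) for seq in product(*tag_lists)]


sequences = tag_sequences(candidates)
print(len(sequences))  # 2 * 2 * 3 * 1 * 1 = 12
```

Even this five-word sentence admits 12 tag sequences, which is why a tagger needs some principled way to rank them.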

  9. The rise of statistical methods • The first area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR) • ASR techniques were then used for POS tagging, and later in all areas of CL • A synthesis of statistical methods and linguistic insights is now underway

  10. Modern empiricism in Computational Linguistics • Large data collections • Rigorous collection techniques (inter-annotator agreement) • Rigorous evaluation techniques • Discovery of generalizations via learning techniques

  11. Statistics & the study of language? • Theoretical advances. Language acquisition: the role of experience; linguistic theory: graded grammaticality; language change: shifts in grammaticality • Empirical. Quantify linguistic phenomena; analyze data; test hypotheses • Psychological. Express preferences

  12. Some interesting statistics about language • Lexical biases. Category: “bank” = Noun 85%, Verb 15%; sense: Bank(river) 22%, Bank(money) 78% • Syntax. Subcategorization of “realised”: NP 20%, S 65%, Other 15% • Semantics / discourse. “he” is in subject position 65% of the time
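Figures like these are typically obtained as relative frequencies over annotated occurrences. The sketch below illustrates that calculation only. The sample is a hypothetical list of 100 tagged occurrences of "bank", built to match the 85% / 15% split on the slide; it is not data from any actual corpus.

```python
from collections import Counter


def relative_frequencies(observations):
    """Estimate P(label) as count(label) / total number of observations."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical sample: 100 occurrences of "bank" with their POS category.
sample = ["Noun"] * 85 + ["Verb"] * 15
probs = relative_frequencies(sample)
print(probs)
```

The same function applies unchanged to senses or subcategorization frames, since it only needs one label per observed occurrence.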

  13. Corpora • The use of statistical techniques has been made possible by the availability of CORPORA, large collections of text that are typically ANNOTATED with linguistic information • The Brown corpus (1M words) and the British National Corpus (100 million words), annotated with POS tags (English) • Penn Treebank (4M words), syntactically annotated (English) • SEMCOR (250K words), annotated with word-sense information • The MapTask, annotated with dialogue information • Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word-sense information), IPAR (‘MapTask Italiano’)

  14. Basic uses of corpora: Collocations • COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: “penalty kick”) • PHRASAL VERBS: “wake up”, “come on” • PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, roughly “we have hit rock bottom”)

  15. Bigrams: New York
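A simple starting point for finding collocation candidates such as "New York" is to count adjacent word pairs (bigrams). The sketch below assumes whitespace tokenization and uses a made-up example sentence, not text from any of the corpora listed above.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Made-up example text, lowercased and split on whitespace.
text = "she moved to new york because new york never sleeps"
tokens = text.split()
counts = bigram_counts(tokens)
print(counts.most_common(1))  # [(('new', 'york'), 2)]
```

On real corpora, raw counts tend to be dominated by pairs of function words such as "of the", so frequency is usually combined with a filter or an association measure before a pair is treated as a collocation.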

  16. Statistical Language Processing • Statistical inference: collect statistics about occurrences of X; predict new occurrences • Example: language modeling. Problem: predict the word that follows, given the previous ones, i.e. find W_n that maximizes P(W_n | W_1 ... W_{n-1}) • Applications: speech recognition, spell-checking, POS tagging …
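The prediction problem on this slide can be sketched with a bigram model. This rests on a simplifying assumption the slide does not state: the full history W_1 ... W_{n-1} is approximated by the single previous word W_{n-1}, and probabilities are plain relative frequencies with no smoothing. The toy corpus and the function names are illustrative.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following


def predict_next(model, prev):
    """Return (word, P(word | prev)) for the most likely next word.

    Returns None when prev was never seen with a following word.
    """
    if prev not in model:
        return None
    counts = model[prev]
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())


# Toy corpus: "the" is followed by "cat" twice and by "mat" once.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)
print(predict_next(model, "the"))
```

Here the most likely word after "the" is "cat", with estimated probability 2/3. An unseen context returns None, which is the gap that the smoothing techniques in the course plan are meant to address.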

  17. Bibliography • Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996. • Textbooks: • Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall. More general and easier to follow • Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press. More comprehensive and written from a more linguistic perspective, but technically more advanced
