Explore the use of corpora and statistical methods in linguistics, covering tasks like word prediction, n-grams, Bayesian inference, and more. Understand the empirical vs. rationalistic approaches and the impact of statistical methods on language analysis. Dive into the role of large data collections, empirical evaluation, and generalizations in modern computational linguistics.
STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS Massimo Poesio Università di Venezia
Course objectives • An introduction to the use of corpora and to statistical methods
Course plan • Foundations of statistics, use of corpora • Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference • POS tagging: tagsets, Brill tagger, HMM tagging • System evaluation • The lexicon • Probabilistic grammars, statistical parsing
Today • Statistics and Linguistics (Abney, 1996) • Foundations of probability • Corpora
Practical details • Schedule: 10:30-13, 14:30-17 • Labs: 17 to 18 (not today) • Office hours: 9:30-10:30, 18-19 • Email: poesio@essex.ac.uk • Web page (temporary): csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/
Empiricism vs. Rationalism • Chomskyan linguistics: • Assumption: linguistic knowledge mostly innate • Emphasis on explanation • Primary goal: simplicity of the theory • Empirical methods • Assumption: linguistic knowledge primarily derives from generalizations over experience • Emphasis on data • Primary goal: fact discovery • Computational Linguistics between 1960 & 1980 mostly Chomskyan
Problems statistical methods are meant to address • Ambiguity resolution: previous choices were • Narrow domains to avoid ambiguity • Hand-coded rules • Hand-tuned preference weights • Adaptation to new domains • Measuring improvement
Case study: POS tagging • “Time flies like an arrow” • Candidate tags per word: Time = N/V, flies = N/V, like = V/N/CJ, an = Det, arrow = N
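To make the size of the ambiguity concrete, the sketch below enumerates every tag sequence allowed by the candidate tags listed on the slide. The names `candidates` and `tag_sequences` are illustrative assumptions, not part of the course material, and the code only counts the possibilities: it does not choose among them.

```python
from itertools import product

# Candidate tags per word, copied from the slide.
candidates = [
    ("Time", ["N", "V"]),
    ("flies", ["N", "V"]),
    ("like", ["V", "N", "CJ"]),
    ("an", ["Det"]),
    ("arrow", ["N"]),
]

def tag_sequences(cands):
    """Return every tag sequence the per-word candidates allow."""
    return list(product(*(tags for _, tags in cands)))

sequences = tag_sequences(candidates)
print(len(sequences))  # 2 * 2 * 3 * 1 * 1 = 12
```

Even a five-word sentence yields 12 candidate analyses, which is why a tagger needs some principled way of ranking them.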
The rise of statistical methods • First area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR) • ASR techniques then used for POS tagging, and then in all areas of CL • A synthesis of statistical methods and linguistic insights now underway
Modern empiricism in Computational Linguistics • Large data collections • Rigorous collection techniques (interannotator agreement) • Rigorous evaluation techniques • Discovery of generalizations: via learning techniques
Statistics & the study of language? • Theoretical advances • Language acquisition: the role of experience • Linguistic theory: graded grammaticality • Language change: shifts in grammaticality • Empirical • Quantify linguistic phenomena • Analyze data • Test hypotheses • Psychological • Express preferences
Some interesting statistics about language • Lexical biases • Category: “bank” = Noun 85%, Verb 15% • Sense: Bank(river) 22%, Bank(money) 78% • Syntax • Subcategorization of “realised”: NP 20%, S 65%, Other 15% • Semantics / discourse • “he” in subject position 65% of the time
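One simple way to use biases like these is a "most frequent label" baseline: always pick the option with the highest relative frequency. The sketch below assumes the percentages from the slide; the dictionary layout and the function name `most_likely` are hypothetical, chosen only for illustration.

```python
# Relative frequencies copied from the slide.
category_bias = {"bank": {"Noun": 0.85, "Verb": 0.15}}
sense_bias = {"bank": {"river": 0.22, "money": 0.78}}

def most_likely(word, table):
    """Return the label with the highest relative frequency for `word`."""
    dist = table[word]
    return max(dist, key=dist.get)

print(most_likely("bank", category_bias))
print(most_likely("bank", sense_bias))
```

With these figures the baseline would be right 85% of the time on the category of "bank" and 78% of the time on its sense, without looking at the context at all.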
Corpora • The use of statistical techniques has been made possible by the availability of CORPORA – large collections of text typically ANNOTATED with linguistic information: • The Brown corpus (1M words) and British National Corpus (100 million words), annotated with POS tags (English) • Penn Treebank (4M words), syntactically annotated (English) • SEMCOR (250K words), annotated with word sense information • The MapTask, annotated with dialogue information • Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word sense information), IPAR (‘MapTask Italiano’)
Basic uses of corpora: Collocations • COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: “penalty kick”) • PHRASAL VERBS: “wake up”, “come on” • PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, roughly “we have hit rock bottom”)
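A corpus lets us look for collocations by checking which adjacent word pairs occur together more often than their individual frequencies would suggest. The sketch below uses pointwise mutual information (PMI), one common association measure; the slide does not say which measure the course uses, so this choice is an assumption, as are the function name `pmi_scores` and the invented toy sentence.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information (base 2) for each adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented toy text, far too small for real conclusions.
toy = ("the disk drive failed so the new disk drive "
       "and the new computer program stopped").split()
scores = pmi_scores(toy)

# Pairs ranked from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this toy text "disk drive" scores higher than "the new" or "the disk". PMI is known to overrate pairs seen only once, so on real corpora it is usually combined with a minimum frequency threshold.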
Statistical Language Processing • Statistical inference: • Collect statistics about occurrence of X • Predict new occurrences • Example: language modeling • Problem: predict the word that follows, given the previous ones • Find W_n that maximizes P(W_n | W_1 ... W_{n-1}) • Applications: • Speech recognition • Spell-checking • POS tagging …
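The word prediction problem can be sketched with a bigram model, which approximates the full history by the single previous word and estimates the conditional probability by relative frequency. This is a minimal sketch under those assumptions; the function names `train_bigram` and `predict_next` and the toy sentence are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, prev):
    """Return (word, probability) maximizing the estimated P(word | prev).

    Returns None when `prev` was never seen with a follower.
    """
    followers = counts.get(prev)
    if not followers:
        return None
    total = sum(followers.values())
    word, count = followers.most_common(1)[0]
    return word, count / total

# Invented toy text, used only to exercise the functions.
toy = ("time flies like an arrow and fruit flies like a banana "
       "and time flies fast").split()
model = train_bigram(toy)
print(predict_next(model, "flies"))
```

Here "flies" is followed by "like" twice and by "fast" once, so the model predicts "like" with estimated probability 2/3. An unseen context returns `None`, which is the gap that smoothing techniques are meant to fill.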
Bibliography • Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996. • Textbooks: • Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall • More general, and easier to follow • Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press • More complete, and written from a more linguistic perspective, but technically more advanced