Corpora and Statistical Methods

Corpora and Statistical Methods Albert Gatt

Course goals CSA5011 -- Corpora and Statistical Methods Introduce the field of statistical natural language processing (statistical NLP). Describe the main directions, problems, and algorithms in the field. Discuss the theoretical foundations. Involve students in hands-on experiments with real problems.

A general introduction

Language
• We can define a language formally as:
  • a set of symbols (“alphabet”)
  • a set of rules to combine those symbols
• This mathematical definition covers many classes of languages, not just human language.

Java: An artificial (formal) language
• fixed set of basic symbols:
  • public, static, for, while, {, }…
• fixed syntax for symbol combination:
    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) { … }
    }

Natural language
• Often much more complicated than an artificial language.
  • NB: Some theorists view NL as a special kind of formal language as well (Montague…).
• It does conform to the formal definition:
  • there are symbols
  • there are modes of combination
• However, there are many levels at which these symbols and rules are defined.

Levels of analysis in Natural language (I)
• Acoustic properties (phonetics)
  • defines a basic set of sounds in terms of their features
  • studies the combination of these phonemes
• Higher-order acoustic features (phonology)
  • how combinations of phonemes combine into larger units, with suprasegmental features such as intonation.

Levels of analysis in Natural language (II)
• Word formation (morphology)
  • combines morphemes into words
• Combination into longer units in a structure-dependent way (syntax)
  • “legal” word combinations in a language
  • recursive phrasal combination
• Interpretation (semantics):
  • of words (lexical semantics)
  • of longer units (sentential/propositional semantics)
• Interpretation in context (pragmatics)

Natural Language Processing
• Studies language at all its levels.
  • phonology, morphology, syntax, semantics…
• focuses on process (Sparck-Jones 2007)
  • computational methods to understand and generate human language
• Often, the distinction between NLP and computational linguistics is fuzzy

Kindred disciplines: Linguistics
• Theoretical linguistics tends to be less process-oriented than NLP
  • Q: how can we characterise knowledge that native speakers have of their language?
  • this leads to declarative models of speaker’s knowledge of language
  • tends to say less about how speakers process language in real time
  • NB: This depends on the theoretical orientation!
• NLP has strong ties to theoretical linguistics
  • it has also been an important contributor: process models can serve as tests for declarative models

Kindred disciplines: Psycholinguistics
• Like NLP, psycholinguistics tends to be strongly process-oriented
  • studies the online processes of language understanding and language production
  • NLP has benefited from such models.
• NLP has also been a contributor:
  • it is increasingly common to test psycholinguistic theories by building computational models.

Paradigms in NLP (I)
• Knowledge-based:
  • system is based on a priori rules and constraints
  • e.g. a syntactic parser might have hand-crafted rules such as:
      NP → Det AdjP N
      AdjP → A+
  • Problem: it is extremely difficult to hand-code all the relevant knowledge.
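
To make the knowledge-based approach concrete, here is a minimal sketch of a recogniser for the rule NP → Det AdjP N with AdjP → A+. The lexicon, the function name `is_np`, and the encoding of the rule as a regular expression over tags are illustrative assumptions, not part of any particular parser.

```python
import re

# Hypothetical hand-built lexicon: every word has to be listed by hand.
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A", "old": "A",
    "woman": "N", "boy": "N",
}

# NP -> Det AdjP N, with AdjP -> A+ (one or more adjectives),
# encoded as a regular expression over the space-separated tag sequence.
NP_PATTERN = re.compile(r"Det( A)+ N")

def is_np(words):
    """Return True if the word sequence matches the hand-written NP rule."""
    try:
        tags = [LEXICON[w.lower()] for w in words]
    except KeyError:
        # A word missing from the lexicon makes the rule-based system fail.
        return False
    return NP_PATTERN.fullmatch(" ".join(tags)) is not None

print(is_np(["the", "tall", "strange", "boy"]))
print(is_np(["the", "pointless", "boy"]))
```

The second call fails only because "pointless" was never added to the lexicon, which is the hand-coding problem in miniature.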

Paradigms in NLP (II)
• Statistical:
  • starting point is a large repository of text or speech (a corpus)
  • corpus is often annotated with relevant information, e.g.:
    • parsed corpora (syntax)
    • tagged corpora (part-of-speech)
    • word-sense annotated corpora (semantics)
  • tries to learn a model from the data
  • tries to generalise this model to new data

The paradigms: a bird’s-eye view
• We find similar “divisions” within mainstream linguistics:
  • generative linguistics tends to formulate generalisations about “internalised speaker knowledge of language” (competence, I-Language…)
  • corpus linguistics tends to formulate generalisations based on patterns observed in corpora
• The two paradigms are viewed as having roots in different traditions:
  • rationalist tradition (Plato, Descartes…)
  • empiricist tradition (Locke…)

The idea of “linguistic knowledge”
• Traditional linguistic theory (since the 1950s) introduced a dichotomy:
  • competence: a person’s knowledge of language, formalised as a set of rules
  • performance: actual production and perception of language in concrete situations
• Much of linguistic theory has focused on characterising competence.

The idea of “linguistic knowledge”
• The use of data (corpora) involves an increased focus on “performance”.
• The idea is that exposure to the regularities found in actual language use is a crucial part of human language learning.

An initial example
• Suppose you’re a linguist interested in the syntax of verb phrases.
• Some verbs are transitive, some intransitive
  • I ate the meat pie (transitive)
  • I swam (intransitive)
• What about:
  • quiver
  • quake
  (Most traditional grammars characterise these as intransitive.)
• Corpus data suggests they have transitive uses:
  • the insect quivered its wings
  • it quaked his bowels (with fear)

Example II: lexical semantics
• Quasi-synonymous lexical items exhibit subtle differences in context.
  • strong
  • powerful
• A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
• Some differences between strong and powerful (source: British National Corpus):
  • strong: wind, feeling, accent, flavour
  • powerful: tool, weapon, punch, engine
• The differences are subtle, but examining their collocates helps.
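
Examining collocates comes down to counting which words occur next to a target word. The sketch below counts the words that immediately follow "strong" and "powerful". The function name `right_collocates` is hypothetical and the toy text is invented for illustration; it is not drawn from the British National Corpus.

```python
from collections import Counter

def right_collocates(tokens, target):
    """Count the words that immediately follow `target` in a token list."""
    return Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target
    )

# Toy text, invented for illustration (not real corpus data).
text = (
    "a strong wind blew and a strong accent was heard "
    "a powerful engine and a powerful weapon and a strong wind"
)
tokens = text.split()

strong = right_collocates(tokens, "strong")
powerful = right_collocates(tokens, "powerful")
print(strong.most_common())
print(powerful.most_common())
```

On a real corpus, raw counts like these are usually turned into an association score, since very frequent words follow almost everything.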

Statistical approaches to language
• Do not rely on categorical judgements of grammaticality etc. Examples:
  • Degrees of grammaticality: people often do not have categorical judgements of acceptability.
  • Category blending: We live nearer town than you thought.
    • Is near an adjective or a preposition?
  • Syntactic ambiguity: She killed the man with the gun.
    • What is the most likely parse?

Statistical NLP vs. Corpus Linguistics (I)
• Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  • generally viewed as a methodology
  • tests hypotheses empirically on data
  • aim is to refine a theory of language, or discover novel generalisations
• Statistical NLP shares these aims; however:
  • it is often corpus-driven rather than corpus-based
  • the “theory” or “model” learned is often not a priori given

Statistical NLP vs. Corpus Linguistics (II)
• The term “corpus” may mean different things to different people:
  • To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. The British National Corpus)
  • Representativeness allows generalisations to be made more rigorously.
• In statistical NLP, there has traditionally been less emphasis on these properties.
  • emphasis on algorithms for learning language models
  • we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations

Some applications of Statistical NLP

[Figure: Language Technology. Components: Speech Recognition, Natural Language analysis and understanding, Natural Language Generation, Speech Synthesis, Machine translation and summarisation. Levels: Speech, Text, Structure, Meaning.]

A (very) rough division of NLP tasks
• understanding: tasks that typically take free text or speech as input and conduct some structural or semantic analysis
  • POS tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition…
• generation: tasks that typically take textual or non-linguistic input and output some text/speech
  • automatic weather reporting, summarisation, machine translation
• How effective are statistical NLP tools at carrying out these and other tasks?
• Are statistical techniques actually useful for learning things about language?

Example 1: Semantics
• Example of an automatically acquired thesaurus of similar words.
  [Figure: thesaurus entry for “goat”]
• Data: 1.5 bn words obtained from the web. (www.sketchengine.co.uk)
• How does this work?

Example 1: Semantics (cont/d)
• Corpus-based lexical semantic acquisition typically uses vector-space models.
  • represent a word as a vector containing information about the contexts in which it is likely to occur
  • some models also include grammatical relations (subject-of, object-of etc)
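
A vector-space model of this kind can be sketched in a few lines: each word is represented by counts of the words that occur near it, and two words are compared by the cosine of their count vectors. The window size, the function names, and the three toy sentences are assumptions made for illustration; this is not the method used by the Sketch Engine thesaurus, which may weight and filter contexts differently.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy sentences, invented for illustration.
sentences = [
    "the goat eats grass".split(),
    "the sheep eats grass".split(),
    "the car needs petrol".split(),
]
vecs = context_vectors(sentences)
print(cosine(vecs["goat"], vecs["sheep"]))
print(cosine(vecs["goat"], vecs["car"]))
```

Here "goat" and "sheep" share all their contexts, while "goat" and "car" share only "the", so the first similarity is higher than the second.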

Example 2: POS Tagging
“The tall woman and the strange boy thought statistical NLP was pointless.”
    <tok pos="at">The</tok> <tok pos="jj">tall</tok> <tok pos="nn">woman</tok> <tok pos="cc">and</tok>
    <tok pos="at">the</tok> <tok pos="jj">strange</tok> <tok pos="nn">boy</tok> <tok pos="vbd">thought</tok>
    <tok pos="jj">statistical</tok> <tok pos="nn">NLP</tok> <tok pos="bedz">was</tok> <tok pos="jj">pointless</tok> <tok pos=".">.</tok>
• Output from a statistical POS Tagger, trained on the Brown Corpus (LingPipe demo library)
• Uses of POS Tagging:
  • pre-parsing
  • corpus analysis for linguistics
  • …
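
To show what "trained on a tagged corpus" means in the simplest case, the sketch below builds a most-frequent-tag baseline: each known word receives the tag it had most often in training, and unknown words receive the overall most frequent tag. This is a deliberately crude stand-in, not the model behind the LingPipe output above. The function names and the tiny training set are hypothetical; the tag labels are taken from the example sentence.

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Learn each word's most frequent tag from (word, tag) training pairs."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # fallback for unknown words
    return model, default_tag

def tag(words, model, default_tag):
    """Tag each word with its learned tag, or the default if unseen."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]

# Tiny hand-made training set (invented; a real tagger needs far more data).
train = [
    [("The", "at"), ("tall", "jj"), ("woman", "nn")],
    [("the", "at"), ("strange", "jj"), ("boy", "nn")],
    [("the", "at"), ("boy", "nn"), ("thought", "vbd")],
    [("woman", "nn"), ("thought", "vbd")],
]
model, default_tag = train_baseline_tagger(train)
tagged = tag(["The", "strange", "woman", "laughed"], model, default_tag)
print(tagged)
```

The unseen word "laughed" still gets a tag (the default, "nn"), even though that tag is wrong.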

Example 3: parsing
• Parsed using the Stanford Parser.
• Based on probabilistic context-free grammar of English
  • trained on a treebank
  • CFG rules with probabilities
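
In a probabilistic context-free grammar, the probability of a parse tree is the product of the probabilities of the rules used to build it. The sketch below computes that product for a small tree. The grammar, its probabilities, and the tuple encoding of trees are invented for illustration, and lexical rule probabilities are ignored; it is not the Stanford Parser's implementation.

```python
# Toy PCFG: probabilities are invented; for each left-hand side they sum to 1.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("Det", "Adj", "N")): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
}

def tree_probability(tree):
    """Multiply the probabilities of all phrasal rules used in `tree`.

    A tree is (label, child, ...); a preterminal is (label, "word").
    Lexical probabilities are treated as 1 to keep the sketch short.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # preterminal over a word
    rhs = tuple(child[0] for child in children)
    prob = RULE_PROB[(label, rhs)]
    for child in children:
        prob *= tree_probability(child)
    return prob

tree = ("S",
        ("NP", ("Det", "the"), ("Adj", "tall"), ("N", "woman")),
        ("VP", ("V", "thought")))
p = tree_probability(tree)
print(p)
```

When a sentence has several possible trees, a parser of this kind prefers the one with the highest probability.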

Example 4: Machine translation
• Input: (Maltese translation of example sentence)
• Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.
• Translated using Maltese-English Google Translate.
• Obvious shortcomings, but robust, i.e. some output returned, even if garbled.
• Based on automatic alignment between parallel text corpora.

Example 5: Generation/Summarisation
“[…] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified […]”
• Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009)
• Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3-M_syndrome)
• Summarised from multiple documents drawn from the web.
• Uses automatically acquired templates from human-authored texts to ensure coherence.

Features of Statistical NLP systems
• Robustness: typically, don’t break down with new or unknown input (although they may output garbage)
• Portability: statistical learning algorithms can in principle be ported to new domains (given data)
• Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts
• All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.
• In practice, we distinguish between:
  • training/development data: for learning a model and fine-tuning it
  • test data: for evaluation on unseen but compatible data
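
The split into training, development and test data can be sketched as a shuffle followed by slicing. The 80/10/10 proportions, the fixed seed, and the function name `split_corpus` are assumptions chosen for illustration; the right proportions depend on how much data is available.

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sentences and split them into training, development and test sets."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed makes the split repeatable
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

# Placeholder corpus of 100 "sentences", invented for illustration.
corpus = [f"sentence {i}" for i in range(100)]
train, dev, test = split_corpus(corpus)
print(len(train), len(dev), len(test))
```

The key property is that the three sets do not overlap, so the test set really is unseen when the model is evaluated.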

References
• Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33(3): 437-441.
• McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)
