Introduction to Corpora@Stanford

Florian Jaeger, tiflo@stanford.edu
For the Methods class, December 3rd, 2003

Some basic questions
• Where are our corpora? Where is the software?
• Is there a list of all the stuff we have?
• How can I access the software?
• Where do I start? What information is available where?
• Are there tutorials for the available software?
• What kind of corpus work is supported at Stanford?
• Corpora are only for those computational folks … ;-)
• And the most important question:

Why bother at all …
• Because we are often wrong with our (ad hoc) intuitions – linguistic methodology is …
• well, let’s not go there.
• While corpora have a lot of drawbacks (no negative evidence, genre-specific, etc.), they offer a lot of opportunities.
• To illustrate my point, a little case study …

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” (talk for the Sem. Workshop, 10/31/2002)
• Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.”
  • 0.5 apples/apple
  • 1.0 apples/apple
  • 1.5 apples/apple
  • zero apples/apple

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” (talk for the Sem. Workshop, 10/31/2002)
• Hagit Borer’s judgments:
  • 0.5 apples/*apple
  • 1.0 apples/*apple
  • 1.5 apples/*apple
  • zero apples/*apple

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” (talk for the Sem. Workshop, 10/31/2002)
• Google’s count:
  • 0.5 apples (120)/*apple (179)
  • 1.0 apples (42)/*apple (23,600)
  • 1.5 apples (59)/*apple (362)
  • zero apples (194)/*apple (124)
• This also makes some of the problems clear, so let’s take pears

Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” (talk for the Sem. Workshop, 10/31/2002)
• Google’s count:
  • 0.1 pears (32)/*pear (118)
  • 0.5 pears (37)/*pear (50)
  • 0.7 pears (9)/*pear (14)
  • 1.0 pears (14)/*pear (24,000)
  • 1 pears (14)/?pear (7,480)
  • One pears (1,130)/?pear (3,060)
  • 1.5 pears (28)/*pear (316)
  • zero pears (3)/*pear (0)
• Conclusion:
  • It is amazing how many programs or computer products use fruit names.
  • The original judgments seem questionable.
  • BUT: can we trust Google?
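The pear counts are easier to compare as proportions. A minimal sketch in Python that uses only the hit counts from the slide; the names `pear_counts` and `singular_share` are illustrative, and the numbers remain raw Google hits with all the caveats raised in the conclusion.

```python
# Google hit counts from the slide: quantifier -> (hits for "pears", hits for "pear")
pear_counts = {
    "0.1": (32, 118),
    "0.5": (37, 50),
    "0.7": (9, 14),
    "1.0": (14, 24000),
    "1": (14, 7480),
    "one": (1130, 3060),
    "1.5": (28, 316),
    "zero": (3, 0),
}


def singular_share(plural, singular):
    """Fraction of all hits that use the singular form (None if there are no hits)."""
    total = plural + singular
    return singular / total if total else None


shares = {q: singular_share(p, s) for q, (p, s) in pear_counts.items()}

for quantifier, share in shares.items():
    print(f"{quantifier:>5}: {share:.1%} of hits are singular")
```

In these counts the singular form accounts for more than half of the hits for every quantifier except "zero", which is the opposite of what the starred judgments predict.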

Looking for a corpus
• There are several sites on the web that can help you find out whether what you are looking for exists:
  • Databases like David Lee’s site (see also our Top 10 list)
  • The LDC database
  • Our list of corpora (next page)
  • Email lists, see our site under ‘Support’
    • Local: corpora@csli.stanford.edu
    • Global: MAJORDOMO@UIB.NO

Types of corpora
• Different languages
• Different media (speech, video, text)
• Different levels of annotation
  • No annotation
  • Transcribed speech or video
  • Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.)
  • Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.)
  • Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)

Looking for a specific corpus
• List of available corpora
• If the corpus is on AFS
• If the corpus is on the Corpus Computer
• If the corpus is on CD
• If the corpus is on the WWW
• If the corpus has special license conditions
• If we don’t have the corpus

Tools & software
• General
• Where to start:
  • Local online tutorials (see also external references and manuals)
  • The corpus TA
  • corpora@csli.stanford.edu
• Little helpers

A brief look at some tools
• BNC Web
  • Problem: Superiority “who the hell …”
  • Problem: Distribution of “… is like …” – age dependent?
  • General information
  • Age (easy export to e.g. Excel)
  • Crosstabs
• TGrep2 and TGrep
  • Tutorial
  • Examples:
    • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)'
    • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)'
    • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)'
    • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
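When queries like these are run from a script rather than typed at a prompt, the quoting around the pattern is easy to get wrong. A minimal sketch, assuming Python 3.8 or later; `tgrep2_command` is a hypothetical helper, and the `-c` and `-l` flags are copied from the slide's examples rather than re-derived from the manual. The sketch only builds and prints the command. Running it (for instance with `subprocess.run(cmd)`) additionally assumes that tgrep2 is installed and that the corpus file is in the current directory.

```python
import shlex


def tgrep2_command(corpus, pattern):
    """Build the argument list for a tgrep2 call like the ones on the slide."""
    # Keeping the pattern as a single list element means no shell ever
    # re-interprets the <, $, ( and ) characters inside it.
    return ["tgrep2", "-c", corpus, "-l", pattern]


cmd = tgrep2_command("wsj_mrg.t2c.gz", "VP < (NP $. PP-DTV)")

# shlex.join shows how the command has to be quoted when typed into a shell.
print(shlex.join(cmd))
```

The printed line uses plain single quotes on both sides of the pattern, which is what the shell expects.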

Note: Tgrep is right-headed
• Relations in a chain all apply to the first node of the chain unless parentheses regroup them.
• The following pattern matches an S with two children, A and C, where A in turn has a child B:
  S < (A < B) < C
• However, this pattern means that S has a child A and that A has children B and C:
  S < ((A < B) < C)
• It is equivalent to this:
  S < (A < B < C)
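The difference between the two groupings can be checked on toy trees. A minimal sketch in plain Python, not TGrep2 itself: trees are hypothetical `(label, children)` tuples, the two functions hard-code one pattern each, and they test only the root node instead of searching the whole tree.

```python
def node(label, *children):
    """A toy tree node: (label, list of child nodes)."""
    return (label, list(children))


def children_labelled(n, label):
    """All children of n that carry the given label."""
    return [c for c in n[1] if c[0] == label]


def matches_flat(s):
    # S < (A < B) < C : S has a child A with a child B, and S itself has a child C
    return (s[0] == "S"
            and any(children_labelled(a, "B") for a in children_labelled(s, "A"))
            and bool(children_labelled(s, "C")))


def matches_nested(s):
    # S < (A < B < C) : S has a child A, and that A has children B and C
    return (s[0] == "S"
            and any(children_labelled(a, "B") and children_labelled(a, "C")
                    for a in children_labelled(s, "A")))


c_under_s = node("S", node("A", node("B")), node("C"))  # C is a sister of A
c_under_a = node("S", node("A", node("B"), node("C")))  # C is a child of A

print(matches_flat(c_under_s), matches_nested(c_under_s))
print(matches_flat(c_under_a), matches_nested(c_under_a))
```

Each tree satisfies exactly one of the two patterns, which is why the placement of the parentheses matters.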

Some more TGrep2 syntax
• A < B     A is the parent of (immediately dominates) B.
• A > B     A is the child of B.
• A <N B    B is the Nth child of A (the first child is <1).
• A >N B    A is the Nth child of B (the first child is >1).
• A <, B    Synonymous with A <1 B.
• A >, B    Synonymous with A >1 B.
• A <-N B   B is the Nth-to-last child of A (the last child is <-1).
• A >-N B   A is the Nth-to-last child of B (the last child is >-1).
• A <- B    B is the last child of A (synonymous with A <-1 B).
• A >- B    A is the last child of B (synonymous with A >-1 B).
• A <` B    B is the last child of A (also synonymous with A <-1 B).
• A >` B    A is the last child of B (also synonymous with A >-1 B).
• A <: B    B is the only child of A.
• A >: B    A is the only child of B.
• A << B    A dominates B (A is an ancestor of B).
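A few of these relations, spelled out on a toy tree. This is a sketch in plain Python rather than a TGrep2 implementation: the `(label, children)` tuples and the helper names `nth_child`, `only_child` and `dominates` are hypothetical and exist only to make the definitions concrete.

```python
def node(label, *children):
    """A toy tree node: (label, list of child nodes)."""
    return (label, list(children))


def nth_child(a, n):
    """The child picked out by A <N B (n > 0) or A <-N B (n < 0); None if absent."""
    kids = a[1]
    if n == 0 or abs(n) > len(kids):
        return None
    return kids[n - 1] if n > 0 else kids[n]


def only_child(a):
    """The child picked out by A <: B, or None unless A has exactly one child."""
    return a[1][0] if len(a[1]) == 1 else None


def dominates(a, label):
    """A << B: True if some descendant of A carries the given label."""
    return any(c[0] == label or dominates(c, label) for c in a[1])


vp = node("VP",
          node("VBD", node("gave")),
          node("NP", node("PRP", node("him"))),
          node("NP", node("DT", node("a")), node("NN", node("book"))))

print(nth_child(vp, 1)[0])    # label of the first child (VP <1 ...)
print(nth_child(vp, -1)[0])   # label of the last child (VP <-1 ...)
print(only_child(vp))         # VP has three children, so VP <: ... cannot match
print(dominates(vp, "book"))  # VP << book
```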

Some more TGrep2 syntax
• A >> B    A is dominated by B (A is a descendant of B).
• A <<, B   B is a left-most descendant of A.
• A >>, B   A is a left-most descendant of B.
• A <<` B   B is a right-most descendant of A.
• A >>` B   A is a right-most descendant of B.
• A <<: B   There is a single path of descent from A and B is on it.
• A >>: B   There is a single path of descent from B and A is on it.
• A . B     A immediately precedes B.
• A , B     A immediately follows B.
• A .. B    A precedes B.
• A ,, B    A follows B.
• A $ B     A is a sister of B (and A ≠ B).
• A $. B    A is a sister of and immediately precedes B.
• A $, B    A is a sister of and immediately follows B.
• A $.. B   A is a sister of and precedes B.
• A $,, B   A is a sister of and follows B.
• A = B     The node matched by A is also matched by B.
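The sister relations can be made concrete in the same way. A minimal sketch in plain Python, assuming toy `(label, children)` tuples; `sisters` is a hypothetical helper that checks the children of one given parent and is not how TGrep2 is implemented.

```python
def node(label, *children):
    """A toy tree node: (label, list of child nodes)."""
    return (label, list(children))


def sisters(parent, a, b, relation="$"):
    """True if a child labelled a stands in the given sister relation to a child labelled b."""
    labels = [c[0] for c in parent[1]]
    for i, label_a in enumerate(labels):
        for j, label_b in enumerate(labels):
            # i == j is skipped because a node is never its own sister.
            if i == j or label_a != a or label_b != b:
                continue
            if (relation == "$"
                    or (relation == "$." and j == i + 1)
                    or (relation == "$," and j == i - 1)
                    or (relation == "$.." and j > i)
                    or (relation == "$,," and j < i)):
                return True
    return False


dative = node("VP", node("VBD"), node("NP"), node("PP-DTV"))
double_object = node("VP", node("VBD"), node("NP"), node("NP"))

print(sisters(dative, "NP", "PP-DTV", "$."))   # NP immediately precedes PP-DTV
print(sisters(dative, "VBD", "PP-DTV", "$."))  # VBD is not adjacent to PP-DTV
print(sisters(dative, "NP", "NP"))             # only one NP, and it is not its own sister
print(sisters(double_object, "NP", "NP"))      # two distinct NPs are sisters
```

The last two calls show why `NP $ NP` in the earlier example query picks out double-object VPs: the pattern needs two different NP nodes under the same parent.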

The alternative with windows
• TigerSearch 2.1; screenshots:
  • Grammar search
  • Collocation search

The end, my friends
• Want to help?
• The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.)
• Tschuessi!
