Introduction to Corpora@Stanford

Florian Jaeger, tiflo@stanford.edu. For the Methods class, December 3rd, 2003.


  1. Introduction to Corpora@Stanford Florian Jaeger, tiflo@stanford.edu For the Methods class, December 3rd, 2003

  2. Some basic questions • Where are our corpora? Where is the software? • Is there a list of all the stuff we have? • How can I access the software? • Where do I start? What information is available where? • Are there tutorials for the available software? • What kind of corpus work is supported at Stanford? • Corpora are only for those computational folks … ;-) • And the most important question:

  3. Why bother at all … • Because we are often wrong with our (ad-hoc) intuitions – linguistic methodology is … • well, let’s not go there. • While corpora have a lot of drawbacks (no negative evidence, genre specific, etc.) they offer a lot of opportunities. • To illustrate my point, a little case study …

  4. Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Claim: “The interpretation of bare plurals does not, actually, consist of any subset of (well-defined) singulars.” • 0.5 apples/apple • 1.0 apples/apple • 1.5 apples/apple • zero apples/apple

  5. Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Hagit Borer’s judgments: • 0.5 apples/*apple • 1.0 apples/*apple • 1.5 apples/*apple • zero apples/*apple

  6. Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Google’s count: • 0.5 apples (120)/*apple (179) • 1.0 apples (42)/*apple (23,600) • 1.5 apples (59)/*apple (362) • zero apples (194)/*apple (124) • This also makes some of the problems clear, so let’s take pears

  7. Hagit Borer: “Some notes on the Syntax and Semantics of Quantity” Talk for the Sem. Workshop, 10/31/2002 • Google’s count: • 0.1 pears (32)/*pear (118) • 0.5 pears (37)/*pear (50) • 0.7 pears (9)/*pear (14) • 1.0 pears (14)/*pear (24,000) • 1 pears (14)/?pear (7,480) • One pears (1,130)/?pear (3,060) • 1.5 pears (28)/*pear (316) • zero pears (3)/*pear (0) • Conclusion: • It is amazing how many programs or computer products use fruit names. • The original judgments seem questionable. • BUT: can we trust Google?
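The raw hit counts on the pear slide are easier to compare as proportions. The following is a minimal sketch in Python that only restates the slide's numbers; the names `PEAR_COUNTS`, `plural_share`, and `summarize` are made up for illustration, and the counts themselves remain as noisy as the slide warns (product names, duplicated pages, and so on).

```python
# Hit counts copied from the pear slide:
# (quantity expression, hits with "pears", hits with "pear").
PEAR_COUNTS = [
    ("0.1", 32, 118),
    ("0.5", 37, 50),
    ("0.7", 9, 14),
    ("1.0", 14, 24000),
    ("1", 14, 7480),
    ("one", 1130, 3060),
    ("1.5", 28, 316),
    ("zero", 3, 0),
]


def plural_share(plural, singular):
    """Return the fraction of hits using the plural, or None if there are no hits."""
    total = plural + singular
    return plural / total if total else None


def summarize(counts):
    """Map each quantity expression to its plural share."""
    return {quantity: plural_share(p, s) for quantity, p, s in counts}


if __name__ == "__main__":
    for quantity, share in summarize(PEAR_COUNTS).items():
        print(f"{quantity:>5}: {share:.1%} of hits use 'pears'")
```

A share well below 100% for a quantity like 0.5 or 1.5 is what makes the starred judgments look questionable, though it says nothing about which hits are genuine noun phrases.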

  8. Looking for a corpus • There are several sites on the web that can help you to find out if what you are looking for exists: • Databases like David Lee’s site (see also our Top 10 list) • The LDC database • Our list of corpora (next page) • Email lists, see our site under ‘Support’ • Local: corpora@csli.stanford.edu • Global: MAJORDOMO@UIB.NO

  9. Types of corpora • Different languages • Different media (speech, video, text) • Different levels of annotation • No annotation • Transcribed speech or video • Sociological annotation (gender of speaker, average age of audience, dialect of speaker, etc.) • Discourse and textual information (publication date, number of discourse participants, discussion panel vs. novel, etc.) • Linguistic annotation (phonemes, prosody, syntax, morpho-syntax, lexemes, phonological segments & syllables, etc.)

  10. Looking for a specific corpus • List of available corpora • If the corpus is on AFS • If the corpus is on the Corpus Computer • If the corpus is on CD • If the corpus is on the WWW • If the corpus has special license conditions • If we don’t have the corpus

  11. Tools & software • General • Where to start: • Local online tutorials (see also external references and manuals) • The corpus TA • corpora@csli.stanford.edu • Little helpers

  12. A brief look at some tools • BNC Web • Problem: Superiority “who the hell …” • Problem: Distribution of “… is like …” – age dependent? • General information • Age (easy export to e.g. Excel) • Crosstabs • TGrep2 and Tgrep • Tutorial • Examples: • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. NP)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP < (NP $. PP-DTV)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ NP)' • tgrep2 -c wsj_mrg.t2c.gz -l 'VP=foo < (/VB*/ < gave) & < (NP $ PP-DTV)'
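To see what the first two example patterns are asking for, here is a toy re-implementation in Python. It is not TGrep2 and does not read the WSJ corpus: trees are hand-built nested tuples, and the helper names (`subtrees`, `label`, `vp_with_np_then`) are hypothetical. It only mimics the idea behind `VP < (NP $. X)`: a VP with an NP child whose immediately following sister is labeled X.

```python
# A tree is a tuple (label, child, child, ...); a leaf word is a plain string.
def subtrees(tree):
    """Yield every non-leaf node of the tree, top-down."""
    if isinstance(tree, str):
        return
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)


def label(node):
    """Return the category label of a node, or None for a leaf word."""
    return None if isinstance(node, str) else node[0]


def vp_with_np_then(tree, second_label):
    """Collect VPs that have an NP child immediately followed by a sister
    labeled second_label (a rough stand-in for 'VP < (NP $. X)')."""
    hits = []
    for node in subtrees(tree):
        if label(node) != "VP":
            continue
        kids = node[1:]
        for left, right in zip(kids, kids[1:]):
            if label(left) == "NP" and label(right) == second_label:
                hits.append(node)
                break
    return hits


# Two invented example sentences: double object vs. prepositional dative.
double_object = ("S", ("NP", "Kim"),
                 ("VP", ("VBD", "gave"), ("NP", "Sandy"),
                  ("NP", ("DT", "a"), ("NN", "book"))))
prep_dative = ("S", ("NP", "Kim"),
               ("VP", ("VBD", "gave"), ("NP", ("DT", "a"), ("NN", "book")),
                ("PP-DTV", ("TO", "to"), ("NP", "Sandy"))))

if __name__ == "__main__":
    print(len(vp_with_np_then(double_object, "NP")))    # NP NP frame
    print(len(vp_with_np_then(prep_dative, "PP-DTV")))  # NP PP-DTV frame
```

Comparing how often each frame occurs with a given verb is the kind of question the two pairs of tgrep2 commands on the slide are set up to answer.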

  13. Note: Tgrep is right-headed • The following pattern matches an S that has a child A and another child C, where the A has a child B: • S < (A < B) < C • However, this pattern means that S has a child A and that A has children B and C: • S < ((A < B) < C) • It is equivalent to this: • S < (A < B < C)
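The effect of the parentheses can be checked on two tiny hand-built trees. This is a sketch in plain Python, not TGrep2: the functions `pattern_flat` and `pattern_nested` are hypothetical names that hard-code the two readings, and they test only the root node, whereas a real search would try every S in the corpus.

```python
# A tree is a tuple (label, child, child, ...); a leaf word is a plain string.
def label(node):
    """Return the category label of a node, or None for a leaf word."""
    return None if isinstance(node, str) else node[0]


def children(node):
    """Return the list of children of a node (empty for a leaf)."""
    return [] if isinstance(node, str) else list(node[1:])


def has_child(node, wanted):
    """True if some immediate child of node carries the label wanted."""
    return any(label(child) == wanted for child in children(node))


def pattern_flat(s):
    """Reading of 'S < (A < B) < C': C is a child of S."""
    return (label(s) == "S"
            and any(label(a) == "A" and has_child(a, "B") for a in children(s))
            and has_child(s, "C"))


def pattern_nested(s):
    """Reading of 'S < (A < B < C)': C is a child of A."""
    return (label(s) == "S"
            and any(label(a) == "A" and has_child(a, "B") and has_child(a, "C")
                    for a in children(s)))


c_under_s = ("S", ("A", ("B", "b")), ("C", "c"))
c_under_a = ("S", ("A", ("B", "b"), ("C", "c")))

if __name__ == "__main__":
    print(pattern_flat(c_under_s), pattern_nested(c_under_s))
    print(pattern_flat(c_under_a), pattern_nested(c_under_a))
```

Each tree satisfies exactly one of the two readings, which is the distinction the slide is drawing.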

  14. Some more TGrep2 syntax
     • A < B     A is the parent of (immediately dominates) B.
     • A > B     A is the child of B.
     • A <N B    B is the Nth child of A (the first child is <1).
     • A >N B    A is the Nth child of B (the first child is >1).
     • A <, B    Synonymous with A <1 B.
     • A >, B    Synonymous with A >1 B.
     • A <-N B   B is the Nth-to-last child of A (the last child is <-1).
     • A >-N B   A is the Nth-to-last child of B (the last child is >-1).
     • A <- B    B is the last child of A (synonymous with A <-1 B).
     • A >- B    A is the last child of B (synonymous with A >-1 B).
     • A <` B    B is the last child of A (also synonymous with A <-1 B).
     • A >` B    A is the last child of B (also synonymous with A >-1 B).
     • A <: B    B is the only child of A.
     • A >: B    A is the only child of B.
     • A << B    A dominates B (A is an ancestor of B).

  15. Some more TGrep2 syntax
     • A >> B    A is dominated by B (A is a descendant of B).
     • A <<, B   B is a left-most descendant of A.
     • A >>, B   A is a left-most descendant of B.
     • A <<` B   B is a right-most descendant of A.
     • A >>` B   A is a right-most descendant of B.
     • A <<: B   There is a single path of descent from A and B is on it.
     • A >>: B   There is a single path of descent from B and A is on it.
     • A . B     A immediately precedes B.
     • A , B     A immediately follows B.
     • A .. B    A precedes B.
     • A ,, B    A follows B.
     • A $ B     A is a sister of B (and A ≠ B).
     • A $. B    A is a sister of and immediately precedes B.
     • A $, B    A is a sister of and immediately follows B.
     • A $.. B   A is a sister of and precedes B.
     • A $,, B   A is a sister of and follows B.
     • A = B     The node matched by A is also matched by B.

  16. The alternative with windows • TigerSearch 2.1; screen shots: • Grammar search • Collocation search

  17. The end, my friends • Want to help? • The website can always use additions (short blurbs about software, your opinion about the user-friendliness of a certain web interface, etc.) • Tschuessi! (Bye!)
