Nancy Ide • Vassar College Catherine Macleod • New York University


  1. Nancy Ide • Vassar College Catherine Macleod • New York University

  2. Why we need an ANC • Brown Corpus of American English • Too small to provide representative examples • Pre-1960 only • No spoken data • British National Corpus • Not representative of American English • Texts up to 1993 only

  3. British vs. American English • Lexical Items • Bobby vs. cop, underground vs. subway, lorry vs. truck, pavement vs. sidewalk, football vs. soccer… • Grammatical structures • “She could not endure to live with him” vs. “She could not endure living with him.” • “Have you a pen?” vs. “Do you have a pen?” • Modals • “shall” vs. “should” vs. “ought” vs. “will” vs. “would” • Adverbial Usage • “Immediately I get home” vs. “As soon as I get home” • Support Verbs • “take a decision” vs. “make a decision”

  4. ANC Background • June 1998 • ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide, Daniel Jurafsky, Catherine Macleod • May 1998 • Publisher’s Day in Berkeley in conjunction with DSNA • November 1999 • Organizational meeting, New York University

  5. ANC Consortium • Pearson Education • Random House Publishers • Langenscheidt Publishing Group • Harper Collins Publishers • Cambridge University Press • LexiQuest • Microsoft Corporation • Shogakukan, Inc. • Associated Liberal Creators Press • Taishukan Publishers • Oxford University Press • Kenkyusha Publishers • IBM Corporation

  6. Contributors • “Founding” consortium members • $21,000 over 3 years • Texts • Linguistic Data Consortium • Management and distribution of the ANC • Manpower and expertise to create initial version • NYU and Vassar • Expertise and manpower for corpus creation and annotation

  7. ANC Makeup • Core “static” corpus • Texts and transcriptions of spoken data • 1990 onwards • Comparable in balance to BNC • Enables comparative studies • At least 100 million words • Snapshot of American English at the end of the millennium

  8. “Dynamic” component • Not necessarily balanced • Dictated by availability • Includes email, ephemera, rap lyrics, newsgroups, etc. plus historically important works from various time periods • Add 10% every five years • Layered organization • Dynamic component layered chronologically as added
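The growth rule on this slide can be worked through numerically. The sketch below assumes a core of exactly 100 million words (the slides say "at least") and shows two readings of "add 10% every five years": 10% of the core each period, or 10% of the running total. The slides do not say which reading is intended, and the function name and parameters are illustrative only.

```python
def dynamic_words(core_words, years, percent=10, period=5, compound=False):
    """Words in the dynamic component after `years` (hypothetical model)."""
    periods = years // period          # completed five-year periods
    if not compound:
        # Reading 1: each period adds a fixed 10% of the core corpus.
        return core_words * percent // 100 * periods
    # Reading 2: each period adds 10% of the total accumulated so far.
    total = core_words
    for _ in range(periods):
        total += total * percent // 100
    return total - core_words


CORE = 100_000_000  # assumed core size; the slides give this as a minimum

for years in (5, 10, 20):
    print(years, dynamic_words(CORE, years), dynamic_words(CORE, years, compound=True))
```

Under either reading the first increment is 10 million words; the two readings diverge only slowly over the following decades.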

  9. Eventual components • annotated and aligned speech data • dialects of American and Canadian English • other major languages of North America • Spanish, French Canadian • aligned to parallel translations in English. High costs of production prevent inclusion at this stage

  10. Encoding and annotation • Markup compliant with the XML Corpus Encoding Standard (XCES) • Annotation • part of speech • Sub-paragraph elements • E.g., tokens, names, dates, numbers • Produced in a two-stage process

  11. Stage 1: Base level corpus • Produced after year 1, using limited resources • XML markup compliant with XCES level 0 • Markup produced by automatic transduction from original formats • Automatically tagged for part of speech • Only spot checking for validity • Minimal header • hand-produced • Includes domain information • Useful for concordance generation, collocation analysis
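To make the last bullet concrete, here is a minimal sketch of concordance generation (keyword in context) and a crude collocation count over part-of-speech-tagged tokens. The token list, tag labels, and function names are invented for illustration; they are not ANC data or part of any ANC software.

```python
from collections import Counter

# Hypothetical tagged tokens: (word, part-of-speech tag) pairs.
TOKENS = [
    ("Do", "VBP"), ("you", "PRP"), ("have", "VB"), ("a", "DT"), ("pen", "NN"),
    ("?", "."), ("I", "PRP"), ("have", "VBP"), ("a", "DT"), ("truck", "NN"),
    (".", "."),
]


def kwic(tokens, keyword, width=2):
    """Return (left context, hit, right context) for each match of keyword."""
    hits = []
    for i, (word, _tag) in enumerate(tokens):
        if word.lower() == keyword.lower():
            left = " ".join(w for w, _ in tokens[max(0, i - width):i])
            right = " ".join(w for w, _ in tokens[i + 1:i + 1 + width])
            hits.append((left, word, right))
    return hits


def bigram_counts(tokens):
    """Count adjacent word pairs, a very simple basis for collocation analysis."""
    words = [w.lower() for w, _ in tokens]
    return Counter(zip(words, words[1:]))


for left, hit, right in kwic(TOKENS, "have"):
    print(f"{left:>10} [{hit}] {right}")
print(bigram_counts(TOKENS).most_common(1))
```

Real collocation analysis would normally use an association measure rather than raw counts; the raw count is only the starting point.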

  12. Stage 2: Final corpus • Available after year 3 • XML markup conformant to XCES level 1 • Full header • Markup for major structural divisions, paragraphs, sentence boundaries • Markup for some sub-paragraph elements, where this can be done automatically • E.g., tokens, names, dates, numbers • 10% markup and annotation hand-validated • “gold standard” corpus

  13. Data architecture • Follow XCES specifications for “stand-off” markup • Annotations in separate XML documents, linked to original • Easy to modify and/or add to • Enables a distributed development model • Different sites independently add annotation • Suitable for delivery over the WWW
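The stand-off idea can be sketched as follows: the primary text is stored once, and each annotation layer lives in its own XML document that points into the text by character offsets. The element and attribute names used here (`annotations`, `span`, `from`, `to`, `tag`) are made up for this example and are not the actual XCES vocabulary, which may link to the original differently.

```python
import xml.etree.ElementTree as ET

# Primary text: stored once and never edited by annotators.
PRIMARY_TEXT = "Nancy Ide teaches at Vassar College."

# Layer 1: part-of-speech annotation (hypothetical markup, not real XCES).
POS_LAYER = """
<annotations type="pos" target="primary.txt">
  <span from="0" to="5" tag="NNP"/>
  <span from="6" to="9" tag="NNP"/>
  <span from="10" to="17" tag="VBZ"/>
  <span from="18" to="20" tag="IN"/>
  <span from="21" to="27" tag="NNP"/>
  <span from="28" to="35" tag="NNP"/>
</annotations>
"""

# Layer 2: name annotation, which another site could add independently.
NAME_LAYER = """
<annotations type="name" target="primary.txt">
  <span from="0" to="9" tag="person"/>
  <span from="21" to="35" tag="org"/>
</annotations>
"""


def resolve(primary, layer_xml):
    """Pair each annotated span of the primary text with its tag."""
    root = ET.fromstring(layer_xml)
    return [
        (primary[int(s.get("from")):int(s.get("to"))], s.get("tag"))
        for s in root.iter("span")
    ]


print(resolve(PRIMARY_TEXT, POS_LAYER))
print(resolve(PRIMARY_TEXT, NAME_LAYER))
```

Because neither layer touches the primary text or the other layer, either one can be corrected, replaced, or extended without affecting the rest, which is what makes the distributed development model on this slide workable.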

  14. Software • ANC project will provide search and access software • Encoding via XML and layered architecture enables exploiting the evolving XML environment for search, access, manipulation of ANC data • XML Transformation Language (XSLT) • Resource Description Framework (RDF)

  15. Availability • Freely available to non-profit educational and research organizations from the outset • No restrictions on obtaining the corpus based on geographical location • Consortium members have exclusive access for commercial exploitation for 5 years • Distributed by LDC

  16. Licensing • LDC • obtains licenses from text providers • issues licenses to users • no redistribution without publisher’s permission • “open sub-corpus” portion of the ANC • licensed on the model of open-source software

  17. ANC Status • Founding memberships closed March 31 2001 • Consortium membership now $40K • Text gathering, format transduction, header production underway • Base corpus due March 31 2002 • Preparing production of level 1 corpus • Gathering technical input from research community • ANLP/NAACL workshop (Seattle, April 2000) • LREC workshop (Athens, June 2000) • Seeking major funding • Final core corpus due March 31 2004

  18. Information • ANC: • http://AmericanNationalCorpus.org • Project Director: • Catherine Macleod <macleod@cs.nyu.edu> • Technical Director: • Nancy Ide <ide@cs.vassar.edu> • XCES: • http://www.cs.vassar.edu/XCES
