
CIS 530 Orientation November 2001 Linguistic Data Consortium University of Pennsylvania


Presentation Transcript


  1. CIS 530 Orientation November 2001 Linguistic Data Consortium University of Pennsylvania Philadelphia, PA 19104

  2. Motivation • There are several thousand languages; over 320 are each spoken by more than 1,000,000 speakers. • The ability to process foreign languages supports the global economy, internationalization of business, software localization, military roles, intelligence gathering, humanitarian efforts, and foreign policy. • Developing language technology requires large amounts of data, appropriately selected, sampled, organized, and annotated in corpora. • Corpus creation requires special equipment, unique legal arrangements and business models, and specialized skills not usually taught in the programs of users of language data. • LDC exists to make language data broadly available for linguistic education, research, and technology development.

  3. LDC Role • LDC began in 1993 as a specialized publisher of language data, which was typically produced elsewhere. • Distributed over 14,000 copies of 196 corpora to more than 1,000 organizations worldwide. • LDC gradually developed the ability to create language resources locally: • newswire/text collection, collection of conversational data via telephone, broadcast news collection • transcription, time-alignment, topic relevance annotation, named entity annotation, phonological/morphological resources • LDC more recently extended its research program: • TalkBank & Linguistic Exploration, Open Languages Archives, African Language Lexicons, DASL • Linguistic technologies: • Information Detection, Extraction and Summarization • Speech Recognition and Speech Synthesis • Machine Translation • Language and Speaker Identification • Language Teaching, Linguistics

  4. Annotating LDC Corpora: TDT • Topic Detection & Tracking (TDT) Corpora • TDT4 Corpus (most recent) contains 9 months of data in 6 languages • Subset of 4 months of English, Chinese, Arabic for annotation • Topics selected and defined from all sources • Topic is a specific event or activity along with all directly related events (e.g., Hurricane Mitch) • Multiple levels of annotation • segmentation of audio signal into individual stories • topic-story relevance judgements • first story identification • story-link identification • Millions of annotation decisions

  5. Audio Segmentation • Using commercial transcripts or closed captions, annotators: • assess existing story boundaries • add, delete, or move boundaries as needed • classify units as “news” or “not news” (commercials, etc.) • set and confirm timestamps for all story boundaries
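The boundary-editing steps on this slide can be sketched in code. The following is a minimal illustration, not LDC's annotation tool: the `Segmentation` class, its method names, and the timestamps are all hypothetical, and story boundaries are assumed to be plain times in seconds within one recording.

```python
from bisect import insort
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Segmentation:
    """Story boundaries for one broadcast, in seconds (hypothetical model)."""
    duration: float
    boundaries: List[float] = field(default_factory=list)  # interior boundaries, kept sorted

    def _check(self, t: float) -> None:
        if not 0.0 < t < self.duration:
            raise ValueError(f"boundary {t} lies outside the recording")

    def add(self, t: float) -> None:
        self._check(t)
        if t not in self.boundaries:
            insort(self.boundaries, t)

    def delete(self, t: float) -> None:
        self.boundaries.remove(t)  # raises ValueError if there is no such boundary

    def move(self, old: float, new: float) -> None:
        self._check(new)  # validate first so a bad move leaves the state untouched
        self.delete(old)
        self.add(new)

    def units(self) -> List[Tuple[float, float]]:
        """Return (start, end) pairs for every unit between consecutive boundaries."""
        edges = [0.0, *self.boundaries, self.duration]
        return list(zip(edges, edges[1:]))


# Start from boundaries taken from an existing transcript, then correct them.
seg = Segmentation(duration=120.0, boundaries=[30.0, 75.0])
seg.add(100.0)
seg.move(30.0, 42.5)
seg.delete(75.0)

# Classify each resulting unit as "news" or "not news".
labels: Dict[Tuple[float, float], str] = {unit: "news" for unit in seg.units()}
labels[(100.0, 120.0)] = "not news"
```

Keeping only the boundary times and deriving the units from them means that add, delete, and move can never produce overlapping or gapped units.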

  6. Topic-Story Annotation • Annotators read and evaluate news stories against topic list • Classify story as directly, briefly or not at all related to a target topic
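One way to picture the three-way judgement is as a table keyed by topic and story. The sketch below is an assumption for illustration only: the `Relevance` names, the identifiers, and the `stories_for` helper are hypothetical and do not reflect the actual TDT file format.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Relevance(Enum):
    """The three judgements named on the slide (member names are hypothetical)."""
    DIRECT = "directly related"
    BRIEF = "briefly related"
    NONE = "not related"


# (topic_id, story_id) -> judgement; identifiers are made up for illustration.
judgements: Dict[Tuple[str, str], Relevance] = {
    ("topic-01", "story-a"): Relevance.DIRECT,
    ("topic-01", "story-b"): Relevance.BRIEF,
    ("topic-01", "story-c"): Relevance.NONE,
    ("topic-02", "story-a"): Relevance.NONE,
}


def stories_for(topic_id: str,
                table: Dict[Tuple[str, str], Relevance],
                include_brief: bool = False) -> List[str]:
    """List the stories judged on-topic, optionally counting brief mentions."""
    wanted = {Relevance.DIRECT}
    if include_brief:
        wanted.add(Relevance.BRIEF)
    return sorted(story for (topic, story), label in table.items()
                  if topic == topic_id and label in wanted)
```

For example, `stories_for("topic-01", judgements)` returns only `story-a`, while passing `include_brief=True` also returns `story-b`.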

  7. Annotating LDC Corpora: ACE • Automatic Content Extraction Project (ACE) • Develop technology to support automatic processing of human language in text form • Classification, filtering, representing language content • Four annotation tasks • Identify all nominal entities in news story • Categorize according to type • Persons, organizations, GPE, location, facility • Name, nominal, pronominal • Co-index all mentions of single entity within story • Classify relations among entities
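The first three annotation tasks (identify mentions, categorize them, co-index mentions of one entity) can be sketched as a small data structure. This is an illustrative assumption, not the ACE annotation format: the `Mention` class, the `chains` helper, the entity ids, and the example sentence are all hypothetical. Relation classification is not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

# Categories as listed on the slide.
ENTITY_TYPES = {"person", "organization", "GPE", "location", "facility"}
MENTION_LEVELS = {"name", "nominal", "pronominal"}


@dataclass(frozen=True)
class Mention:
    text: str
    start: int          # character offset in the story
    entity_type: str
    level: str
    entity_id: str      # mentions sharing an id refer to the same entity

    def __post_init__(self) -> None:
        if self.entity_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.entity_type}")
        if self.level not in MENTION_LEVELS:
            raise ValueError(f"unknown mention level: {self.level}")


def chains(mentions: List[Mention]) -> Dict[str, List[str]]:
    """Group mention texts by entity id, in order of appearance."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for m in sorted(mentions, key=lambda m: m.start):
        grouped[m.entity_id].append(m.text)
    return dict(grouped)


# Hypothetical sentence, not taken from any ACE corpus.
STORY = "The mayor of Lisbon said she would reopen the bridge."


def mark(surface: str, entity_type: str, level: str, entity_id: str) -> Mention:
    return Mention(surface, STORY.index(surface), entity_type, level, entity_id)


mentions = [
    mark("The mayor of Lisbon", "person", "nominal", "E1"),
    mark("Lisbon", "GPE", "name", "E2"),
    mark("she", "person", "pronominal", "E1"),
    mark("the bridge", "facility", "nominal", "E3"),
]
entity_chains = chains(mentions)
```

Here the nominal mention "The mayor of Lisbon" and the pronominal mention "she" share the id `E1`, which is what co-indexing within a story amounts to in this sketch.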

  8. Nominal Entity Tagging

  9. Best practices in use of large-scale corpora in study of linguistic variation • Focus on -t/d deletion in American English (well-known variable) • Four LDC Corpora, all created for linguistic technology development • All data already transcribed, segmented to provide fine-grained access • Basic demographic information available (gender, age, education, region, race/ethnicity)

  10. DASL Technology • Create concordance: regular-expression search of corpus • Create tag set: specify which factors to code • Create annotation file: combines data with tag set • Annotate using web browser: play each example (tool supports common audio formats); code factors in each factor group, adding comments when needed; demographic information displayed • Save results and output to text file: can be exported to Excel spreadsheet, statistical analysis package
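The first step, building a concordance by regular-expression search, might look roughly like the sketch below. It is not the DASL tool itself: the `concordance` function, the `Hit` record, the sample utterances, and the search pattern are assumptions for illustration. The pattern is a crude orthographic proxy for candidate -t/d deletion sites; a real study would search against pronunciation or phonological information.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One concordance line: the match plus its surrounding context."""
    utterance_id: str
    left: str
    match: str
    right: str


def concordance(utterances: Iterable[Tuple[str, str]],
                pattern: str,
                width: int = 30) -> List[Hit]:
    """Return every regex match with up to `width` characters of context per side."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits: List[Hit] = []
    for utt_id, text in utterances:
        for m in regex.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            hits.append(Hit(utt_id, left, m.group(0), right))
    return hits


# Hypothetical transcript lines; not taken from any LDC corpus.
SAMPLE = [
    ("utt1", "I just missed the last bus"),
    ("utt2", "she told me to go west on Main"),
    ("utt3", "we walked along the river"),
]

# Words spelled with final -st, -ld or -ked (illustrative pattern only).
PATTERN = r"\b\w+(?:st|ld|ked)\b"

hits = concordance(SAMPLE, PATTERN)
for hit in hits:
    print(f"{hit.utterance_id}\t{hit.left}[{hit.match}]{hit.right}")
```

Each hit keeps its utterance id, so a later annotation step could attach coded factors and speaker demographics to the same record before exporting tab-separated text.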

  11. TDT Overview

  12. Transcripts

  13. ASR Output

  14. Boundary Table

  15. Relevance Table

  16. Story Links
