Developing Asian Language Corpora: standards and practice

Richard Xiao, Tony McEnery, Paul Baker, Andrew Hardie (Lancaster University)


Presentation Transcript


  1. Developing Asian Language Corpora: standards and practice
     Richard Xiao, Tony McEnery, Paul Baker, Andrew Hardie
     Lancaster University
     ALR04 - Sanya, China

  2. An overview of the talk
     • Corpus development standards
     • The EMILLE (Enabling Minority Language Engineering) Corpus
     • The Lancaster Corpus of Mandarin Chinese (LCMC)
     • XML-aware, Unicode-compliant corpus exploration tools
     • Software demonstration

  3. Corpus development standards (1)
     • Why is standardization important?
       • To be compliant with major international standards
       • To facilitate electronic data exchange
       • To foster cooperation and coordination between different centres and projects
       • To meet the requirements of corpus validation
     • The ALR Committee is working in the right direction

  4. Corpus development standards (2)
     • Corpus constituents
       • Corpus manifest
         • Type (paper document, computer file, audio/video recording, etc.)
         • Carrier (computer file name and location, document title, etc.)
         • Status (integral part of corpus, descriptive metadata, associated annotation, documentation, etc.)
         • Digital components and the storage format (character encoding, binary format, record structure, etc.)
       • Primary data: corpus files
       • Ancillary data: corpus documentation
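
The manifest fields listed on this slide map naturally onto a small record type. The following is a minimal sketch, assuming one record per corpus constituent; the class name, the helper function, and the file names are hypothetical and are not part of any standard described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """One corpus constituent, using the four manifest fields from the slide."""
    type: str            # e.g. "computer file", "audio recording"
    carrier: str         # file name and location, or document title
    status: str          # e.g. "integral part of corpus", "documentation"
    storage_format: str  # character encoding, binary format, record structure


# Hypothetical entries: the file names below are illustrative only.
manifest = [
    ManifestEntry("computer file", "corpus/text_A01.xml",
                  "integral part of corpus", "XML, UTF-8"),
    ManifestEntry("computer file", "doc/manual.pdf",
                  "documentation", "PDF"),
]


def primary_data(entries):
    """Return the entries that belong to the corpus itself (primary data)."""
    return [e for e in entries if e.status == "integral part of corpus"]
```

Filtering on the status field is one simple way to separate primary data from ancillary data such as documentation.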

  5. Corpus development standards (3)
     • Data formats
       • Primary data
         • Text files: XML/SGML conforming to a standard or supplied DTD or schema
         • Audio: MP3 or WAV
         • Video: MPEG or QuickTime
         • Image files: PNG or JPG
       • Ancillary data
         • Documentation: PDF, HTML, or XML

  6. Corpus development standards (4)
     • File structure, markup and annotation
       • Corpus header
         • Providing metadata about the corpus file
         • TEI/CES-compliance
       • Corpus body
         • Containing the corpus data
         • TEI/CES-compliance
         • Markup for paragraphs and sentences
         • Preferably annotated with various levels of linguistic analysis (POS tagging…)
       • Character encoding
         • Unicode-compliance (UTF-8/16)
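
Two of the requirements on this slide, Unicode encoding and XML markup, can be checked mechanically. The sketch below assumes a corpus file stored as UTF-8 and tests only that it decodes and is well-formed XML; it does not check TEI/CES conformance, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_corpus_file(raw):
    """Return a list of problems; an empty list means both checks passed."""
    # Check 1: the bytes must decode as UTF-8.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return ["not valid UTF-8: " + err.reason]
    # Check 2: the document must be well-formed XML.
    try:
        ET.fromstring(raw)
    except ET.ParseError as err:
        return ["not well-formed XML: " + str(err)]
    return []
```

A file that passes both checks is still only a candidate for validation against a DTD or schema, which would need a validating parser.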

  7. The EMILLE project
     • The EMILLE project
       • Funded by the UK EPSRC (Grant references GR/N19106, GR/M70735, GR/N28542 and GR/R42429/01)
       • Research partners: Lancaster University, Sheffield University, and the Central Institute of Indian Languages (CIIL) in Mysore, India
     • Three main goals
       • To build corpora of South Asian languages
       • To extend the GATE (General Architecture for Text Engineering) LE architecture
       • To develop basic LE tools
     • Project site: http://www.emille.lancs.ac.uk/
     • GATE: http://gate.ac.uk/sale/tao/index.html#x1-550002.26

  8. The EMILLE Corpus: An overview
     • Three components
       • Monolingual, Annotated, and Parallel
     • 14 South Asian languages
     • Spoken data for five languages
     • Monolingual corpora contain more than 96 million words
     • Spoken data over 2.6 million words
     • The Urdu corpus is POS tagged
     • Part of the Hindi corpus is annotated for anaphora
     • Parallel corpus covers English and five South Asian languages
     • Corpus building tools: Uni-codify, Uni-viewer, Uni-editor

  9. The EMILLE Monolingual Corpora (size in words)

     | Language | Written | Spoken | Total |
     |---|---|---|---|
     | Assamese | 2,620,000 | 0 | 2,620,000 |
     | Bengali | 5,520,000 | 442,000 | 5,962,000 |
     | Gujarati | 12,150,000 | 564,000 | 12,714,000 |
     | Hindi | 12,390,000 | 588,000 | 12,978,000 |
     | Kannada | 2,240,000 | 0 | 2,240,000 |
     | Kashmiri | 2,270,000 | 0 | 2,270,000 |
     | Malayalam | 2,350,000 | 0 | 2,350,000 |
     | Marathi | 2,210,000 | 0 | 2,210,000 |
     | Oriya | 2,730,000 | 0 | 2,730,000 |
     | Punjabi | 15,600,000 | 521,000 | 16,121,000 |
     | Sinhala | 6,860,000 | 0 | 6,860,000 |
     | Tamil | 19,980,000 | 0 | 19,980,000 |
     | Telugu | 3,970,000 | 0 | 3,970,000 |
     | Urdu | 1,640,000 | 512,000 | 2,152,000 |
     | Total | 93,530,000 | 2,627,000 | 96,157,000 |

  10. The EMILLE Annotated Corpora
      • POS tagging
        • The whole monolingual Urdu corpus
        • The Urdu component of the EMILLE Parallel Corpora
      • Anaphoric annotation
        • Around 100,000 words of news material (20 excerpts from the Ranchi Express data) from the Hindi Monolingual Corpus

  11. The EMILLE Parallel Corpus
      • 75 advice leaflets published by the UK government
      • Approximately 200,000 words of English originals with accompanying translations in five South Asian languages
        • Hindi, Bengali, Punjabi, Gujarati, and Urdu
      • Covering a range of term-rich domains

  12. The EMILLE corpus building tools
      • Uni-codify
        • Allows users to convert 30 (or so) different 8-bit encodings of South Asian scripts into 16-bit little-endian Unicode format
        • Compiled program accompanied by documentation
      • Uni-Viewer
        • Allows users to view Unicode texts
      • Uni-Editor
        • Allows users to edit Unicode texts
      • Urdu POS tagger
        • POS tagging Unicode-encoded Urdu texts
        • Accompanied by the tagset and the user manual

  13. The EMILLE Corpus: Availability
      • The full release of the EMILLE Corpus and tools is distributed free of charge for use in non-profit-making research
      • Digital sound files will also be released soon
      • An indexed version for use with Xara will be available soon
      • Corpus download site
        • http://www.ling.lancs.ac.uk/corplang/emille

  14. The LCMC Corpus: Aims
      • Built for the ESRC project Contrasting tense and aspect in English and Chinese (Grant Ref. RES-000-220135)
      • A Chinese match for FLOB/Frown for BrE/AmE
      • A publicly available balanced corpus of Mandarin Chinese
      • Distributed free of charge for use in non-profit-making research

  15. LCMC: Profile
      • One million words
      • 1990-1993
      • 15 text categories
      • 500 text samples
      • Major text provider: SSReader Digital Library in China
      • Unicode (UTF-8)
      • XML-conformant mark-up
      • Marked for paragraphs and sentences
      • POS-tagged (precision rate 98%+)
      • Standard character and Romanized Pinyin versions

  16. Major Chinese corpus resources

      | Corpus | POS | Bal. | Channel | Variety | Contr. |
      |---|---|---|---|---|---|
      | LCMC | Yes | Yes | Written | Mainland | E – C |
      | Sinica | Yes | Yes | Mixed | Taiwan | No |
      | PH | No | No | Written | Mainland | No |
      | PFR | Yes | No | Written | Mainland | No |
      | LIVAC | No | No | Written | Mixed | C – C |
      | SCCSD | No | Yes | Spoken | Mainland | No |
      | TREC | No | No | Written | Mainland | No |
      | Gigaword | No | No | Written | Mainland | No |
      | Callhome | No | ? | Spoken | Mixed | No |

  17. LCMC: Sampling frame

      | Code | Text category | Samples | Proportion |
      |---|---|---|---|
      | A | Press reportage | 44 | 8.8% |
      | B | Press editorials | 27 | 5.4% |
      | C | Press reviews | 17 | 3.4% |
      | D | Religion | 17 | 3.4% |
      | E | Skills/trades/hobbies | 38 | 7.6% |
      | F | Popular lore | 44 | 8.8% |
      | G | Biographies/essays | 77 | 15.4% |
      | H | Miscellaneous | 30 | 6% |
      | J | Science | 80 | 16% |
      | K | General fiction | 29 | 5.8% |
      | L | Mystery/detective fiction | 24 | 4.8% |
      | M | Science fiction | 6 | 1.2% |
      | N | Western/adventure fiction | 29 | 5.8% |
      | P | Romantic fiction | 29 | 5.8% |
      | R | Humor | 9 | 1.8% |
      | Total | | 500 | 100% |
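
The Proportion column of the sampling frame follows directly from the sample counts: each category's share is its number of samples divided by the 500-sample total. A short sketch of that arithmetic, using the counts from the slide (the variable and function names are illustrative):

```python
# Sample counts per text category code, copied from the sampling frame.
SAMPLES = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 38,
    "F": 44, "G": 77, "H": 30, "J": 80, "K": 29,
    "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}


def proportions(samples):
    """Return each category's share of all samples, as a percentage."""
    total = sum(samples.values())
    return {code: round(100 * count / total, 1)
            for code, count in samples.items()}
```

With these counts the total is 500 samples, so one sample corresponds to 0.2% of the corpus.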

  18. LCMC: Markup

      | Level | Code | Gloss | Attribute | Value |
      |---|---|---|---|---|
      | 1 | text | Text type | TYPE | As per Table 2 Text Category |
      | | | | ID | As per Table 2 Code |
      | 2 | file | Corpus file | ID | Text ID plus file number starting from 01 |
      | 3 | p | Paragraph | --- | --- |
      | 4 | s | Sentence | n | Starting from 0001 onwards |
      | 5 | w | Word | POS | Part-of-speech tags as per the LCMC tagset |
      | | c | Punctuation and symbol | | |
      | | gap | Omission | --- | --- |
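
The markup levels above nest as text, file, paragraph, sentence, and word, so tagged words can be read back with any XML parser. The sketch below assumes that nesting and uses the element and attribute names from the markup table; the fragment itself (words, POS values, and IDs) is invented for illustration and is not taken from LCMC.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element and attribute names follow the markup table,
# but the words, tags, and IDs are invented for illustration.
SAMPLE = """
<text TYPE="Press reportage" ID="A">
  <file ID="A01">
    <p>
      <s n="0001"><w POS="n">语料库</w><w POS="v">发布</w><c>。</c></s>
    </p>
  </file>
</text>
"""


def tagged_tokens(xml_text):
    """Yield (sentence number, word, POS tag) for every <w> element."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("s"):
        for word in sentence.iter("w"):
            yield sentence.get("n"), word.text, word.get("POS")
```

Calling `list(tagged_tokens(SAMPLE))` gives one tuple per word element and skips the punctuation element.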

  19. LCMC: Annotation
      • Segmentation
      • POS tagging
        • Applying the Peking University tagset
          • 26 Level 1 POS tags
          • 50 Level 2 POS tags
      • ICTCLAS (Chinese Lexical Analysis System)
        • Developed by the Institute of Computing Technology, Chinese Academy of Sciences (Zhang & Liu 2002)
        • A frequency dictionary of 80,000 words
        • Based on a multi-layer hidden Markov model
        • Applying the n-shortest paths method
        • Automatic tagging with a precision rate of 97.16%
        • Post-editing improved the precision to over 98%
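
To give a feel for path-based word segmentation, here is a much-simplified sketch that keeps only the single shortest path, that is, the segmentation with the fewest words, found by dynamic programming over a word list. It is not ICTCLAS: the real system keeps several candidate paths and scores them with a hidden Markov model. The function name, the four-entry lexicon in the usage note, and the `max_len` limit are all assumptions made for illustration.

```python
def shortest_path_segment(text, lexicon, max_len=4):
    """Segment text into the fewest words, using lexicon entries where possible."""
    n = len(text)
    # best[i] = (word count, start of last word) for the prefix text[:i]
    best = [(0, 0)] + [(float("inf"), 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            # Single characters are always allowed, so any input can be segmented.
            if piece in lexicon or end - start == 1:
                cost = best[start][0] + 1
                if cost < best[end][0]:
                    best[end] = (cost, start)
    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]
```

With the toy lexicon `{"语料", "语料库", "语言", "语言学"}`, the input `"语料库语言学"` is split into two words rather than three, because the two-word path is shorter. When several paths tie, this sketch simply keeps the first one it finds, which is where a statistical model becomes necessary.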

  20. LCMC: Potential use
      • Monolingual study
        • Studying Mandarin Chinese as a whole
        • Exploring variation across text categories
      • Contrastive study (in conjunction with FLOB/Frown)
        • Contrasting Chinese and BrE/AmE
        • Contrasting text categories in Chinese and English

  21. LCMC: Availability
      • Distributed free of charge for use in non-profit-making research
      • Accompanied by the user manual
      • Online search available via WebConc
      • The LCMC website
        • http://www.ling.lancs.ac.uk/corplang/lcmc
      • The Chinese mirror site (Chinese Academy of Social Sciences)
        • http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

  22. Corpus exploration tools
      • XML-aware, Unicode-compliant corpus exploration tools
      • WordSmith Tools version 4
        • Presently under beta test
        • Beta version available
        • http://www.lexically.net/wordsmith/version4/index.htm
      • Xara (XML-aware Sara)
        • Sara: SGML-aware Retrieval Application
        • For use with the British National Corpus (BNC)
        • For either local or remote access
        • Presently under beta test
        • Documentation available at http://www.oucs.ox.ac.uk/rts/xara/
        • A tutorial available at the LCMC website

  23. Software demonstration
      • Using Xara for local access to LCMC
        • Query types: Quick query, word query (pattern), POS query, pattern query (regex), Query builder (e.g. a-n vs. a-de-n), etc.
        • Display mode: KWIC mode vs. sentence mode
        • Display format: Plain vs. XML
        • Status bar: Reference
        • Other useful features: distribution, sort, collocation, partition, user-defined stylesheets, etc.
      • Using Xara for local access to EMILLE
      • Using WebConc to access LCMC
        • http://www.ling.lancs.ac.uk/corplang/lcmc
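
The combination of a regex pattern query with KWIC (keyword in context) display can be sketched in a few lines. This is a generic illustration of the technique, not how Xara implements it; the function name and the `width` parameter are assumptions, and the input is taken to be an already segmented list of tokens.

```python
import re


def kwic(tokens, pattern, width=3):
    """Return (left context, keyword, right context) for each matching token."""
    regex = re.compile(pattern)
    lines = []
    for i, token in enumerate(tokens):
        # fullmatch: the pattern must describe the whole token, not a substring.
        if regex.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Sentence mode would instead return the whole sentence containing each hit, which requires sentence boundaries such as the sentence-level markup in the corpus files.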

  24. And… Thank you!
      Richard Xiao
      z.xiao@lancaster.ac.uk
