How Useful Is the Web as a Linguistic Corpus?

How Useful Is the Web as a Linguistic Corpus? William H. Fletcher, United States Naval Academy. 2002 North American Symposium on Corpus Linguistics, American Association of Applied Corpus Linguistics, Indianapolis, IN, 1-3 November 2002. Making the Web More Useful as a Corpus.


Presentation Transcript


  1. How Useful Is the Web as a Linguistic Corpus? William H. Fletcher, United States Naval Academy. 2002 North American Symposium on Corpus Linguistics, American Association of Applied Corpus Linguistics, Indianapolis, IN, 1-3 November 2002

  2. Making the Web More Useful as a Corpus • Objective of this ongoing study: to develop and evaluate linguistic methods and PC tools that identify domain-relevant and linguistically representative documents more efficiently • Long-range goal: to establish the Web both as a "corpus of first resort" and as a supplementary corpus for language professionals and learners

  3. Advantages of Web • Virtually comprehensive coverage of major languages and language varieties, content domains and written text types • Ready availability and low cost throughout developed world • Freshness and topicality: emerging usage and current issues well documented • Easy to compile an ad-hoc corpus to answer a specific question or meet a specialized information need • User familiarity with Web and independent motivation to become more proficient in using it

  4. Disadvantages of Web • Generally unknown provenance and authorship, reliability and authoritativeness of texts, both for content and linguistic form • Predominance of certain text types among coherent texts, especially legal, journalistic, commercial and academic prose • Overall lower standards of form and content verification than printed sources • Systematically accessible only through commercial search engines, which support only very rough search criteria • Counts of a given linguistic feature give only a general numeric indication, not statistical proof

  5. “Noise” Filter for HRDs • Highly Repetitive Documents • Discussion groups where replies incorporate original post • Internal links • Boilerplate • Search engine spam • Strategy: identify documents with frequent n-grams • Useful range: 8-grams, 12-grams, 25-grams • Either eliminate the document or eliminate the redundant text
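
The n-gram strategy for HRDs could be sketched roughly as follows. This is a minimal illustration, not the tool used in the study: the function names, whitespace tokenization, and the 0.2 threshold are all assumptions; only the n-gram lengths come from the slide.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every n-gram (as a tuple) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repeated_ngram_share(text, n=8):
    """Fraction of n-gram positions filled by an n-gram that occurs more than once."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(grams)


def is_highly_repetitive(text, n=8, threshold=0.2):
    """Flag a document as an HRD; the threshold is a hypothetical value."""
    return repeated_ngram_share(text, n) > threshold
```

A document flagged this way can either be dropped or have its repeated spans stripped; the sketch covers only the detection step.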

  6. “Noise” Filter for VIDs • Virtually Identical Documents • Mirrored documents with slight differences • News stories • Rank and absolute frequency of 3- to 5-grams flag likely VIDs
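
One way to compare documents by their most frequent short n-grams is sketched below. The slide does not say how rank and frequency were combined, so the frequency-weighted overlap measure, the function names, and the `top_k` cutoff are assumptions made for illustration.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=50):
    """Map the top_k most frequent n-grams of a text to their absolute counts."""
    tokens = text.lower().split()  # naive whitespace tokenization (assumption)
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(grams).most_common(top_k))


def profile_overlap(a, b):
    """Frequency-weighted overlap of two profiles: sum of min counts / sum of max counts."""
    keys = set(a) | set(b)
    if not keys:
        return 0.0
    shared = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    total = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return shared / total
```

Two mirrored pages that differ in a date line or a footer should score close to 1.0, while unrelated pages score near 0.0; where to draw the cutoff would have to be tuned on real data.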

  7. “Noise” Filter for IDs • (Fully) Identical Documents • Mirrored documents • Multiple URLs for same document • Server-generated error messages • A cryptographic hash such as MD5 (Message Digest 5, 16-byte digest) or SHA-1 (Secure Hash Algorithm 1, 20-byte digest) reduces normalized text of any length to a short fixed-length code with high probability of uniqueness • Hash codes from thousands of documents can be stored in a binary tree for efficient comparison and elimination of redundant documents
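
Hash-based elimination of identical documents might look like the sketch below. It uses Python's standard `hashlib` and a built-in set where the slide mentions a binary tree; the normalization rule (collapse whitespace, lowercase) and the function names are assumptions.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase the text (an assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def digest(text, algorithm="md5"):
    """Return the raw digest of the normalized text.

    MD5 yields 16 bytes; SHA-1 ("sha1") yields 20 bytes.
    """
    return hashlib.new(algorithm, normalize(text).encode("utf-8")).digest()


def drop_identical(docs, algorithm="md5"):
    """Keep the first (url, text) pair for each distinct digest."""
    seen = set()  # a hash set stands in for the binary tree on the slide
    unique = []
    for url, text in docs:
        key = digest(text, algorithm)
        if key not in seen:
            seen.add(key)
            unique.append((url, text))
    return unique
```

A set gives the same membership test as a tree of digests with less code; a sorted structure would only matter if the digests also had to be traversed in order.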

  8. Unproven “Noise” Filters • Microsoft Word Spelling Checker to recognize and normalize ill-formed documents automatically • Some success; deserves further attention • Problem: large number of items (personal, commercial and place names, technological terms) not in default lexicon, so it rejects too many good documents • Patterns of 1- and 2-grams to recognize PFDs (Primarily Fragmentary Documents) • Some high-frequency types (articles, copula) rare in fragments, others (common prepositions) frequent • Content words and special terms (see above) relatively prominent
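
The 1-gram idea for PFDs can be illustrated with a very small sketch that measures how much of a text consists of articles and copula forms. The word list, the tokenization, and the function name are assumptions, and the slide itself marks this filter as unproven, so no cutoff is suggested here.

```python
import re

# Hypothetical list of English articles and copula forms.
ARTICLES_AND_COPULA = {"a", "an", "the", "am", "is", "are", "was", "were", "be", "been"}


def article_copula_share(text):
    """Fraction of word tokens that are articles or copula forms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in ARTICLES_AND_COPULA)
    return hits / len(tokens)
```

Running prose should score noticeably higher than a navigation bar or link list, which is the contrast the slide describes.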

  9. Size as A Priori Filter • Webpages under 3 kB or over 150 kB have lower “signal to noise” ratio • In these extreme ranges documents consist of coherent text less frequently or to a lesser degree • Shorter files tend to have much lower ratio of text file size to HTML file size (49% vs. 64% overall) • Rule of thumb: download and process only pages larger than 5 kB and smaller than 200 kB (size before stripping HTML tags)
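
The rule of thumb, read as a window between the two bounds, could be applied before any further processing along these lines. The function names and the choice of 1 kB = 1024 bytes are assumptions; the 5 kB and 200 kB limits come from the slide.

```python
MIN_BYTES = 5 * 1024    # lower bound from the slide, assuming 1 kB = 1024 bytes
MAX_BYTES = 200 * 1024  # upper bound from the slide, same assumption


def within_size_window(html_bytes, min_bytes=MIN_BYTES, max_bytes=MAX_BYTES):
    """True if the raw page (before stripping HTML tags) falls inside the window."""
    return min_bytes < len(html_bytes) < max_bytes


def text_to_html_ratio(text, html_bytes):
    """Ratio of extracted text size to raw HTML size, both measured in bytes."""
    if not html_bytes:
        return 0.0
    return len(text.encode("utf-8")) / len(html_bytes)
```

The second function gives the text-to-HTML ratio mentioned on the slide, which can serve as an additional signal once a page has been downloaded and stripped.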

  10. My Web Corpus 1 • Compiled one afternoon in October 2001 via KWiCFinder searches on the 20 most frequent words in English • Preliminary studies of 100 and 5859 webpages respectively revealed great bias towards commercial sites due to "paid positioning" on AltaVista; sites ranked highest for this reason were excluded from this study • Initially consisted of 11,201 online documents (OLDs) • Various "noise filters" were applied to make the results more useful • 7294 survived automatic elimination of IDs and VIDs • 256 HRDs were eliminated • Remaining documents were viewed individually and classified as • Primarily useful text • "Noisy" text • Primarily non-text (link lists, fragments, headers / footers predominated...)

  11. My Web Corpus 2 • 4949 unique documents passed all automatic tests and human classification • 5.25 million tokens in 35 MB of files • Longer coherent texts from government, academic, legal, religious (Christian, Jewish, Muslim, Hindu), journalistic and commercial sources, plus many “hobbyist” pages on a wide range of topics • Compared to the British National Corpus (BNC) as a standard reference corpus (see appendix with annotated comparison of n-gram frequencies) • Generally quite comparable, but important differences: • UK vs. US bias in institutions, place names, spelling • BNC: bias toward third person, past tense, narrative style • Web Corpus (WC): bias toward first (especially we) and second person, present tense, interactive style • Words referring to Internet concepts and information missing or rare in BNC, highly prominent in WC (and in contemporary English)
