Mining the Web to Create Minority Language Corpora

Rayid Ghani

Accenture Technology Labs - Research

Rosie Jones

Carnegie Mellon University

Dunja Mladenic

J. Stefan Institute, Slovenia


Who Needs a Language Specific Corpus?

  • Language Technology Applications

  • Language Modeling

  • Speech Recognition

  • Machine Translation

  • Linguistic and Socio-Linguistic Studies

  • Multilingual Retrieval


What Corpora are Available?

  • Explicit, marked-up corpora: Linguistic Data Consortium -- 20 languages [Liberman and Cieri 1998]

  • Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese

    • Excite - 12 languages

    • Google - 25 languages

    • AltaVista - 25 languages

    • Lycos - 25 languages


BUT what about Slovenian? Or Tagalog? Or Tatar?

You’re just out of luck!


The Human Solution

  • Start from Yahoo->Slovenia…

  • Crawl www.*.si

  • Search on the web, look at documents, modify query, analyze documents, modify query,…

  • Repetitive, time-consuming, requires reasonable familiarity with the language


Task

  • Given:

    • 1 Document in Target Language

    • 1 Other Document (negative example)

    • Access to a Web Search Engine

  • Create a Corpus of the Target Language quickly with no human effort


Algorithm

[Diagram: components labeled Seed Docs (Initial Docs), Learning (Word Statistics), Query Generator (Build Query), WWW (Web), and Language Filter (Filter), with the filter output split into Relevant and Non-Relevant documents]
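The component labels above suggest a loop: learn word statistics from the current documents, build a query, send it to the Web, filter the results by language, and repeat. A minimal sketch of such a loop follows. It is an illustration under stated assumptions, not the authors' system: `search` and `is_target_language` are hypothetical callables supplied by the caller, and the term selection shown (most frequent words unique to each document set) is a simplification chosen only to keep the example short.

```python
from collections import Counter


def build_corpus(seed_doc, negative_doc, search, is_target_language,
                 k=2, max_queries=10):
    """Grow a corpus by alternating query generation, search, and filtering.

    search(query) -> iterable of document strings (hypothetical wrapper
        around a web search engine)
    is_target_language(doc) -> bool (hypothetical language filter)
    """
    relevant, nonrelevant = [seed_doc], [negative_doc]
    seen = {seed_doc, negative_doc}
    issued = set()
    for _ in range(max_queries):
        # Learning: recompute word statistics for both document sets.
        rel_counts = Counter(w for d in relevant for w in d.split())
        non_counts = Counter(w for d in nonrelevant for w in d.split())
        # Query generation (simplified): frequent words seen only in
        # relevant documents are included, frequent words seen only in
        # non-relevant documents are excluded.
        include = [w for w, _ in rel_counts.most_common()
                   if w not in non_counts][:k]
        exclude = [w for w, _ in non_counts.most_common()
                   if w not in rel_counts][:k]
        query = " ".join([f"+{w}" for w in include] +
                         [f"-{w}" for w in exclude])
        if not include or query in issued:
            break  # no new query can be formed
        issued.add(query)
        # Filtering: route each new document to the matching set.
        for doc in search(query):
            if doc in seen:
                continue
            seen.add(doc)
            if is_target_language(doc):
                relevant.append(doc)
            else:
                nonrelevant.append(doc)
    return relevant, nonrelevant
```

Passing the search engine and the language filter in as functions keeps the loop itself independent of any particular engine or language identifier.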


Query Generation

  • Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones

  • A Query consists of m inclusion terms and n exclusion terms

    • e.g., +intelligence +web -military
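As a small illustration of the query format described above, the sketch below joins inclusion and exclusion terms into one string, writing the exclusion marker as an ASCII hyphen. The function name `build_query` is hypothetical, and whether a given search engine accepts exactly this `+`/`-` syntax is an assumption.

```python
def build_query(inclusion_terms, exclusion_terms):
    """Prefix required terms with '+' and excluded terms with '-'."""
    parts = [f"+{t}" for t in inclusion_terms]
    parts += [f"-{t}" for t in exclusion_terms]
    return " ".join(parts)


query = build_query(["intelligence", "web"], ["military"])
print(query)  # +intelligence +web -military
```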


Query Term Selection Methods

  • Uniform (UN) – select k words randomly from the current vocabulary

  • Term-Frequency (TF) – select top k words ranked according to their frequency

  • Probabilistic TF (PTF) – select k words with probability proportional to their frequency
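The three frequency-based methods above can be sketched as follows. This is an illustration only: the function names are hypothetical, the two sample documents are invented, and drawing without replacement (so a query never repeats a word) is an assumption the slides do not state.

```python
import random
from collections import Counter


def select_uniform(counts, k, rng):
    """UN: k distinct words chosen uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts, k):
    """TF: the k most frequent words."""
    return [w for w, _ in counts.most_common(k)]


def select_ptf(counts, k, rng):
    """PTF: k distinct words, each draw weighted by word frequency."""
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        weights = [pool[w] for w in words]
        w = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(w)
        del pool[w]  # draw without replacement (assumption)
    return chosen


# Toy vocabulary built from two invented documents.
docs = ["to je slovenski dokument", "to je drugi dokument o jeziku"]
counts = Counter(word for d in docs for word in d.split())
rng = random.Random(0)  # fixed seed so runs are repeatable

print(select_uniform(counts, 3, rng))
print(select_tf(counts, 3))
print(select_ptf(counts, 3, rng))
```

TF always returns the same words for the same documents, while UN and PTF vary from run to run unless the random seed is fixed.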


Query Term Selection Methods

  • RTFIDF – select top k words according to their RTFIDF scores

  • Odds-Ratio (OR) – top k words according to their odds-ratio scores

  • Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
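The slides do not give the scoring formulas, so the sketch below uses one common definition of the odds ratio for a word w, log(odds(w | relevant) / odds(w | non-relevant)), estimated from smoothed document frequencies. The smoothing constant, the function names, and the restriction of POR sampling to positively scored words are all assumptions made for this illustration. RTFIDF is not shown because its definition is not given here.

```python
import math
import random
from collections import Counter


def odds_ratio_scores(relevant_docs, nonrelevant_docs, smoothing=0.5):
    """Score words by log(odds(w | relevant) / odds(w | non-relevant)).

    Probabilities are smoothed document frequencies; the smoothing
    constant is an assumption that keeps them strictly between 0 and 1.
    """
    def doc_freq(docs):
        return Counter(w for d in docs for w in set(d.split()))

    rel_df, non_df = doc_freq(relevant_docs), doc_freq(nonrelevant_docs)
    n_rel, n_non = len(relevant_docs), len(nonrelevant_docs)
    scores = {}
    for w in set(rel_df) | set(non_df):
        p_rel = (rel_df[w] + smoothing) / (n_rel + 2 * smoothing)
        p_non = (non_df[w] + smoothing) / (n_non + 2 * smoothing)
        scores[w] = (math.log(p_rel / (1 - p_rel))
                     - math.log(p_non / (1 - p_non)))
    return scores


def select_or(scores, k):
    """OR: the k highest-scoring words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def select_por(scores, k, rng):
    """POR: k distinct words drawn with weight proportional to their score.

    Only positively scored words are eligible (assumption), since
    sampling weights cannot be negative.
    """
    pool = {w: s for w, s in scores.items() if s > 0}
    chosen = []
    while pool and len(chosen) < k:
        words = list(pool)
        w = rng.choices(words, weights=[pool[x] for x in words], k=1)[0]
        chosen.append(w)
        del pool[w]
    return chosen
```

Under this definition, words with positive scores are more typical of the relevant documents, and words with negative scores are more typical of the non-relevant ones.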


Evaluation

  • Goal: Collect as many relevant documents as possible while minimizing the cost

  • Cost

    • Number of total documents retrieved from the Web

    • Number of distinct Queries issued to the Search Engine

  • Evaluation Measures

    • Percentage of retrieved documents that are relevant

    • Number of relevant documents retrieved per unique query
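Both evaluation measures above reduce to simple ratios. The sketch below computes them from a log of retrieved documents and issued queries; the function name `evaluate` and the shape of its inputs are assumptions made for illustration.

```python
def evaluate(retrieved, queries_issued):
    """Compute the two evaluation measures.

    retrieved: list of (url, is_relevant) pairs, one per document fetched
    queries_issued: list of query strings sent to the search engine
    """
    total = len(retrieved)
    relevant = sum(1 for _, is_rel in retrieved if is_rel)
    unique_queries = len(set(queries_issued))
    pct_relevant = 100.0 * relevant / total if total else 0.0
    per_query = relevant / unique_queries if unique_queries else 0.0
    return pct_relevant, per_query


log = [("a", True), ("b", False), ("c", True), ("d", True)]
queries = ["+x", "+y", "+x"]  # two distinct queries
print(evaluate(log, queries))  # (75.0, 1.5)
```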


Experimental Setup

  • Language: Slovenian

  • Initial documents: 1 web page in Slovenian, 1 in English

  • Search engine: AltaVista



Results – Precision at 3000

[Chart: Percentage of Target Docs after 3000 Docs Retrieved]



Results - Summary

  • In terms of documents:

    • For lengths 1-3, Odds-Ratio works best

  • In terms of queries:

    • Odds-Ratio is consistently better than others

  • Long queries are usually very precise but do not result in a lot of documents (low recall)


Further Experiments

  • Comparison to AltaVista’s “More Like This”

    • Better performance than AltaVista’s feature

  • Keywords

    • Similar results when initializing with keywords instead of documents

  • Other Languages

    • Similar results with Croatian, Czech and Tagalog


Conclusions

  • Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines

  • Not sensitive to initial “seed” documents

  • System and Corpora are/will be available at

    www.cs.cmu.edu/~TextLearning/CorpusBuilder


Ideas for Future Work

  • Explore other Term-Selection methods

  • From Language specific corpus to Topic Specific corpus as an alternative to focused spidering

  • Finding documents matching a user profile – Personal Agent

