Mining the Web to Create Minority Language Corpora

Rayid Ghani

Accenture Technology Labs - Research

Rosie Jones

Carnegie Mellon University

Dunja Mladenic

J. Stefan Institute, Slovenia

Who Needs a Language Specific Corpus?
  • Language Technology Applications
  • Language Modeling
  • Speech Recognition
  • Machine Translation
  • Linguistic and Socio-Linguistic Studies
  • Multilingual Retrieval
What Corpora are Available?
  • Explicit, marked-up corpora: Linguistic Data Consortium -- 20 languages [Liberman and Cieri 1998]
  • Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese
    • Excite - 12 languages
    • Google - 25 languages
    • AltaVista - 25 languages
    • Lycos - 25 languages
The Human Solution
  • Start from Yahoo->Slovenia…
  • Crawl www.*.si
  • Search on the web, look at documents, modify query, analyze documents, modify query,…
  • Repetitive, time-consuming, requires reasonable familiarity with the language
Task
  • Given:
    • 1 Document in Target Language
    • 1 Other Document (negative example)
    • Access to a Web Search Engine
  • Create a Corpus of the Target Language quickly with no human effort
Algorithm

[Diagram: Seed Docs, Query Generator, WWW, Language Filter]

[Diagram: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]

Query Generation
  • Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones
  • A Query consists of m inclusion terms and n exclusion terms
    • e.g. +intelligence +web -military
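As a small illustration of the query format above, the following sketch joins inclusion and exclusion terms into a single string. The function name `format_query` is hypothetical, and the `+`/`-` prefix syntax is taken from the slide's example.

```python
def format_query(include: list[str], exclude: list[str]) -> str:
    """Join m inclusion terms and n exclusion terms into a +/- query string."""
    parts = [f"+{term}" for term in include] + [f"-{term}" for term in exclude]
    return " ".join(parts)


print(format_query(["intelligence", "web"], ["military"]))
# prints: +intelligence +web -military
```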
Query Term Selection Methods
  • Uniform (UN) – select k words randomly from the current vocabulary
  • Term-Frequency (TF) – select top k words ranked according to their frequency
  • Probabilistic TF (PTF) – select k words with probability proportional to their frequency
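The three frequency-based methods can be sketched as below. The function names are hypothetical, and sampling without replacement in PTF is an assumption, since the slides do not say whether a word may be drawn twice.

```python
import random
from collections import Counter


def select_uniform(counts: Counter, k: int, rng: random.Random) -> list[str]:
    """UN: k words chosen uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))


def select_tf(counts: Counter, k: int) -> list[str]:
    """TF: the k most frequent words."""
    return [word for word, _ in counts.most_common(k)]


def select_ptf(counts: Counter, k: int, rng: random.Random) -> list[str]:
    """PTF: k distinct words, each draw weighted by frequency (assumed without replacement)."""
    pool = dict(counts)
    chosen: list[str] = []
    while pool and len(chosen) < k:
        words = list(pool)
        word = rng.choices(words, weights=[pool[w] for w in words], k=1)[0]
        chosen.append(word)
        del pool[word]  # remove so the same word is not selected again
    return chosen
```

Passing an explicit `random.Random` instance makes the randomized methods reproducible across runs.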
Query Term Selection Methods
  • RTFIDF – top k words according to their rtfidf scores
  • Odds-Ratio (OR) – top k words according to their odds-ratio scores
  • Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
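The odds-ratio methods can be sketched as follows, using the standard odds-ratio definition log[p(1-q) / ((1-p)q)], where p and q are the probabilities that a word occurs in a relevant and a non-relevant document. The slides do not give the formula, so the document-level estimates, the smoothing constant `eps`, and the restriction of POR to positive scores are all assumptions. RTFIDF is not defined in the slides and is not attempted here.

```python
import math
import random


def odds_ratio_scores(
    rel_docs: list[set[str]], non_docs: list[set[str]], eps: float = 0.5
) -> dict[str, float]:
    """Smoothed log odds ratio per word; documents are given as sets of words."""
    vocab = set().union(*rel_docs, *non_docs)
    scores = {}
    for word in vocab:
        # Smoothed fraction of documents containing the word (assumed estimator).
        p = (sum(word in d for d in rel_docs) + eps) / (len(rel_docs) + 2 * eps)
        q = (sum(word in d for d in non_docs) + eps) / (len(non_docs) + 2 * eps)
        scores[word] = math.log(p * (1 - q) / ((1 - p) * q))
    return scores


def select_or(scores: dict[str, float], k: int) -> list[str]:
    """OR: the k highest-scoring words (ties broken alphabetically)."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def select_por(scores: dict[str, float], k: int, rng: random.Random) -> list[str]:
    """POR: k distinct words drawn with probability proportional to positive scores."""
    pool = {w: s for w, s in scores.items() if s > 0}
    chosen: list[str] = []
    while pool and len(chosen) < k:
        words = sorted(pool)
        word = rng.choices(words, weights=[pool[w] for w in words], k=1)[0]
        chosen.append(word)
        del pool[word]
    return chosen
```

Words with a score of zero or below carry no evidence for the target language, which is why this sketch leaves them out of the POR sampling pool.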
Evaluation
  • Goal: Collect as many relevant documents as possible while minimizing the cost
  • Cost
    • Number of total documents retrieved from the Web
    • Number of distinct Queries issued to the Search Engine
  • Evaluation Measures
    • Percentage of retrieved documents that are relevant
    • Number of relevant documents retrieved per unique query
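The two evaluation measures above reduce to simple ratios. The sketch below uses hypothetical function names, and the numbers in the usage lines are illustrative only, not results from the presentation.

```python
def percent_relevant(relevant_retrieved: int, total_retrieved: int) -> float:
    """Percentage of retrieved documents that are relevant."""
    if total_retrieved == 0:
        return 0.0
    return 100.0 * relevant_retrieved / total_retrieved


def relevant_per_query(relevant_retrieved: int, unique_queries: int) -> float:
    """Number of relevant documents retrieved per unique query."""
    if unique_queries == 0:
        return 0.0
    return relevant_retrieved / unique_queries


# Illustrative numbers only.
print(percent_relevant(1500, 3000))   # prints: 50.0
print(relevant_per_query(1500, 300))  # prints: 5.0
```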
Experimental Setup
  • Language: Slovenian
  • Initial documents: 1 web page in Slovenian, 1 in English
  • Search engine: AltaVista
Results - Precision at 3000

[Chart: Percentage of Target Docs after 3000 Docs Retrieved]

Results - Summary
  • In terms of documents:
    • For lengths 1-3, Odds-Ratio works best
  • In terms of queries:
    • Odds-Ratio is consistently better than others
  • Long queries are usually very precise but return few documents (low recall)
Further Experiments
  • Comparison to AltaVista’s “More Like This”
    • Better performance than AltaVista’s feature
  • Keywords
    • Similar results when initializing with keywords instead of documents
  • Other Languages
    • Similar results with Croatian, Czech and Tagalog
Conclusions
  • Successfully built corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines
  • Not sensitive to initial “seed” documents
  • System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder

Ideas for Future Work
  • Explore other Term-Selection methods
  • From Language specific corpus to Topic Specific corpus as an alternative to focused spidering
  • Finding documents matching a user profile – Personal Agent