Language Identification of Web Data for Building Linguistic Corpora

Marija Stupar, Tereza Jurić, Nikola Ljubešić. Faculty of Humanities and Social Sciences, University of Zagreb, Croatia. INFuture2011: “Information Sciences and e-Society”, Zagreb, 10 November 2011.


Presentation Transcript


  1. Language Identification of Web Data for Building Linguistic Corpora
     Marija Stupar, Tereza Jurić, Nikola Ljubešić
     Faculty of Humanities and Social Sciences, University of Zagreb, Croatia
     INFuture2011: “Information Sciences and e-Society”, Zagreb, 10 November 2011

  2. Overview
     • Introduction
     • Experimental setup
       • Languages observed
       • Methods used
         • Main approaches
         • Hybrid approaches
     • Results
       • Document level
       • Paragraph level
     • Conclusion
     Stupar, Jurić, Ljubešić, Language Identification of Web Data for Building Linguistic Corpora

  3. Introduction
     • Web as a rich source of linguistic material
     • More than one natural language within such sources
     • Defining the method for language identification of the data collected from the Web
     • Comparison of two main and two hybrid approaches
     • Ultimate goal
       • Using Web resources as a basis for constructing corpora – building hrWaC, the Croatian Web corpus

  4. Experimental setup
     • Twelve languages observed

           cs  de  en  es  fr  hr  hu  it  pl  sk  sl  sv
     cs     -  18  22  26  22  53  25  31  42  70  54  23
     de    18   -  34  34  35  12  17  31  20  17  18  53
     en    22  34   -  27  33  16  16  35  15  17  19  35
     es    26  34  27   -  62  22  18  56  18  23  28  38
     fr    22  35  33  62   -  18  15  48  15  18  22  35
     hr    53  12  16  22  18   -  11  31  39  51  74  24
     hu    25  17  16  18  15  11   -  14  10  22  13  21
     it    31  31  35  56  48  31  14   -  22  28  38  32
     pl    42  20  15  18  15  39  10  22   -  50  40  18
     sk    70  17  17  23  18  51  22  28  50   -  55  22
     sl    54  18  19  28  22  74  13  38  40  55   -  26
     sv    23  53  35  38  35  24  21  32  18  22  26   -

     Table 1: A snippet from the Language Similarity Table (Scannell, 2007)

  5. Methods used
     • Main approaches
       • Function word distributions
       • Second-order Markov models
     • Hybrid approaches
       • Harmonic balance
       • Sophisticated method
     • Language identification on document and paragraph level
     Table 2: Amount of data collected for each basic method

  6. Methods used – main approaches
     • Function word distributions
       • Lists of function words for all languages in question
       • The algorithm chooses the language for which the highest percentage of words in the text can be identified as function words of that language
     • Second-order Markov models
       • Conditional probability of a character given the two preceding characters, estimated from character bigram and trigram distributions calculated on a training set
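
     The two main approaches on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the tiny function-word lists, the add-one smoothing, the padding with two spaces, and all function names are assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical, heavily truncated function-word lists (assumption:
# real lists would be far longer and cover all twelve languages).
FUNCTION_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is"},
    "hr": {"i", "u", "je", "se", "na", "da"},
    "de": {"der", "die", "und", "das", "ist", "in"},
}

def function_word_scores(text):
    """Share of tokens that are function words of each language."""
    tokens = text.lower().split()
    if not tokens:
        return {lang: 0.0 for lang in FUNCTION_WORDS}
    return {
        lang: sum(tok in words for tok in tokens) / len(tokens)
        for lang, words in FUNCTION_WORDS.items()
    }

def train_markov(text):
    """Count character trigrams and their two-character contexts."""
    padded = "  " + text.lower()  # two spaces give the first character a context
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts, len(set(padded))

def markov_log_prob(text, model):
    """Log-probability of text under a second-order character model."""
    trigrams, contexts, alphabet_size = model
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        # Add-one smoothing is an assumption; the slides do not say
        # how unseen trigrams are handled.
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + alphabet_size))
    return total
```

     With one trained model per language, the Markov method would pick the language whose model assigns the highest log-probability, while the function words method would pick the language with the highest score from `function_word_scores`.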

  7. Methods used – hybrid approaches
     • Harmonic balance
       • Harmonic mean of the certainty of the function words method and the Markov model method
       • Certainty is calculated as a/(a+b), where a is the best result and b the second-best result
     • Sophisticated hybrid method
       • Takes into account the strengths of each main method
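
     The certainty and harmonic mean described on this slide can be written down as below. This is a sketch under assumptions: scores are taken to be non-negative with larger meaning better (log-probabilities would first need rescaling), and the function names are invented for the example. The slide does not state how the combined value is turned into a final language label, so that step is left out.

```python
def certainty(scores):
    """a / (a + b), where a is the best and b the second-best score.

    Assumes `scores` maps language codes to non-negative numbers
    and contains at least two languages.
    """
    ranked = sorted(scores.values(), reverse=True)
    a, b = ranked[0], ranked[1]
    return a / (a + b) if (a + b) > 0 else 0.0

def harmonic_mean(x, y):
    """Harmonic mean of two certainties; 0.0 if both are zero."""
    return 2 * x * y / (x + y) if (x + y) > 0 else 0.0

# Example with made-up scores for the two main methods.
fw_certainty = certainty({"hr": 3.0, "sl": 1.0, "en": 0.0})
mm_certainty = certainty({"hr": 9.0, "sl": 1.0, "en": 0.5})
combined = harmonic_mean(fw_certainty, mm_certainty)
```

     The harmonic mean stays low whenever either method is uncertain, which is presumably why it was chosen over the arithmetic mean.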

  8. Methods used – hybrid approaches
     • Sophisticated hybrid method algorithm
       • If the Markov model and the function words method give the same result, that result is accepted
       • If the two results differ, but the second-best result of the Markov model method is identical to the best result of the function words method and its certainty is over 0.6, the result of the function words method is accepted
       • Otherwise the result of the Markov model method is accepted
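
     The decision rule on this slide can be sketched as a short function. The function and parameter names are hypothetical, and one reading is assumed: "its certainty" is taken to mean the certainty of the function words method, which the slide does not state explicitly.

```python
def sophisticated_hybrid(markov_ranked, fw_ranked, fw_certainty, threshold=0.6):
    """Combine two ranked language lists (best first) into one label.

    markov_ranked, fw_ranked: language codes ordered from best to worst.
    fw_certainty: certainty of the function words method (assumed reading).
    """
    markov_best = markov_ranked[0]
    fw_best = fw_ranked[0]

    # Rule 1: both methods agree.
    if markov_best == fw_best:
        return markov_best

    # Rule 2: the function words winner is the Markov runner-up
    # and the certainty exceeds the threshold.
    if (len(markov_ranked) > 1
            and markov_ranked[1] == fw_best
            and fw_certainty > threshold):
        return fw_best

    # Rule 3: fall back to the Markov model.
    return markov_best
```

     For example, `sophisticated_hybrid(["sl", "hr"], ["hr", "sl"], 0.8)` would return `"hr"`, while the same call with a certainty of 0.5 would keep the Markov result `"sl"`.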

  9. Methods used – evaluation
     • Document level
       • 20 documents per language
       • Documents in which no single language accounts for at least 70% of the content are considered unsolvable
     • Paragraph level
       • Paragraphs in 50 documents were labeled with the language they are written in
       • 750 paragraphs in total
     • Evaluation measure is accuracy
       • (a + d) / (a + b + c + d)

  10. Results
      • Main approaches

                        Document level                  Paragraph level
                  Function words  Markov model    Function words  Markov model
      Positive         234            239              745            747
      Negative           6              1                5              3
      Accuracy       0.975          0.996            0.993          0.996

      Table 3: Results of the evaluation of the traditional approaches
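
      As a quick arithmetic check, the accuracies in Table 3 can be recomputed from the Positive and Negative counts. The sketch assumes that "Positive" means correctly identified items and "Negative" incorrectly identified ones; the function and variable names are invented for the example.

```python
def accuracy(correct, incorrect):
    """Fraction of items labeled correctly."""
    return correct / (correct + incorrect)

# (correct, incorrect) counts copied from Table 3.
table3 = {
    ("document", "function words"): (234, 6),
    ("document", "Markov model"): (239, 1),
    ("paragraph", "function words"): (745, 5),
    ("paragraph", "Markov model"): (747, 3),
}

for (level, method), (correct, incorrect) in table3.items():
    print(f"{level:9s} {method:15s} {accuracy(correct, incorrect):.3f}")
```

      The totals also match the evaluation setup: 240 documents (20 for each of the twelve languages) and 750 paragraphs.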

  11. Results
      • Hybrid approaches
      Table 4: Results of the evaluation of hybrid methods

  12. Conclusion
      • The Markov model outperforms the function words method
      • Hybrid approaches proved more efficient on the document level (mixed-language content)
      • Power-law-like distribution of languages
        • Three languages account for 99% of the data
      • Around 96% of documents are written in only one language
        • 4% have mixed content

