Global Expert Meeting Multilingualism in Cyberspace for Inclusive Sustainable Development

Global Expert Meeting Multilingualism in Cyberspace for Inclusive Sustainable Development, 4 June - 9 June, 2017, Khanty-Mansiysk, Russian Federation. Daniel Pimienta pimienta@funredes.org Networks & Development Foundation http://funredes.org

Presentation Transcript


  1. Global Expert Meeting Multilingualism in Cyberspace for Inclusive Sustainable Development - 4 June - 9 June, 2017 Khanty-Mansiysk, Russian Federation

  2. Daniel Pimienta pimienta@funredes.org Networks & Development Foundation http://funredes.org Observatory of languages & cultures in the Internet http://funredes.org/lc Executive Committee Member of http://maaya.org

  3. Linguistic Indicators in cyberspace: the biases is all Daniel Pimienta pimienta@funredes.org MAAYA

  4. CREDITS Part of the information offered is taken from a study conducted on behalf of MAAYA by the team D. Prado/D. Pimienta in 2015-2017 about the place of French in the Internet and which is funded by

  5. CREDITS The original idea to approach the space of languages on the Internet by collecting Internet application/spaces figures & transforming country figures into language figures belongs to Daniel Prado (2012)

  6. CREDITS Thanks to Deirdre Williams… (and to William Shakespeare) for the idea of borrowing from Hamlet Act 5 Scene 2 Not a whit, we defy augury; there's a special providence in the fall of a sparrow.  If it be now, ‘tis not to come, if it be not to come, it will be now; if it be not now, yet it will come.  The readiness is all.  Since no man has aught of what he leaves, what is't to leave betimes? Let be.

  7. ANTECEDENTS • IN 1998-2007 FUNREDES/UNION LATINE AND LOP PRODUCED INDICATORS FOR LANGUAGES IN THE INTERNET. • POST-2007 EVOLUTION OF THE WEB AND SEARCH ENGINES ENDED THE PRODUCTION AND LEFT A TERRIBLE VOID. • 2012-2014: DILINET PROJECT, A FAILED ATTEMPT COORDINATED BY MAAYA TO PROVIDE A HIGH-LEVEL RESPONSE.

  8. ANTECEDENTS THE VOID WAS FILLED BY INTERNETWORLDSTATS (the 10 languages with the highest number of Internauts) and W3TECHS (contents by language)

  9. BUT the data nicely provided by IWS and W3T is used by many people without a careful look at the BIASES !!!!

  10. and the BIASES need to be understood if you plan to derive serious conclusions from the data…

  11. And, by the way, most of those BIASES ARE NOT NEUTRAL: they provoke an overestimation of the place of English in the Net.

  12. THE AIM OF THIS PRESENTATION • IS TO WARN YOU ABOUT THOSE SERIOUS BIASES

  13. AND ALSO TO REVEAL ALTERNATIVE APPROACHES WHICH SERVE TO MEASURE THE BIASES • THE IDEA IS CERTAINLY NOT TO DENIGRATE EXISTING AND USEFUL INITIATIVES • BUT TO CALL FOR CAUTION ON THE USE OF THEIR DATA.

  14. LINGUISTIC DIVERSITY INDICATORS PARADOX [Figure: timeline from 1997 to 2017 with the labels INTEREST and CAPACITY, showing the initiatives LOP, FUNREDES/UL, W3TECH, IDESCAT, IWS, ALIS/ISOC, OCLC, FUNREDES and XEROX]

  15. ARE BIASES DEEPLY ROOTED IN THE SUBJECT OF LANGUAGES ON THE INTERNET? But why so then?

  16. [Diagram: LANGUAGES and THE INTERNET, with the labels: Marketing took over; Infiniteness of the web; Search engine vs. ad engine; Demo-linguistic data; No consensus; Fuzzy boundaries; Huge domain; L1? L2? Li?] CLASH OF BIASES

  17. BIASES TAXONOMY • Sources bias • Methodological bias • Statistical bias • Hypothesis bias

  18. SOURCE BIAS: “good practice” example ITU provides the most important data on Internet penetration: the percentage of individuals using the Internet per country. Along with the data, ITU provides a precise definition of the collection process and the associated assumptions. http://www.itu.int/en/ITU-D/Statistics/Documents/statistics/2016/Individuals_Internet_2000-2015.xls

  19. ITU DATA • Method: country survey • Definition: individuals from 16 to 75 years old having connected to the Internet from any type of device on a fixed or mobile network at least once in the last 3 months. • Analysis: 60% of the data is given by ITU. 15% is given by Eurostat with the same criteria. The rest is given by country authorities, often with different criteria (on age especially).

  20. ITU DATA • Best practice indeed, yet bias-free data does not exist. • Careful screening shows that figures do not all correspond to the same year (no big deal!) • Some data from countries willing to promote their “efficient digital divide policies” are probably exaggerated… • Careful with country definitions!

  21. STATISTICAL BIAS: extremely influential… and terrible example From 1999 to 2007, the (wrong) steady figure of 80% of web pages in English was propagated in the media from an OCLC study with 2 publications, in 1999 and 2003, using the same methodology as a pioneering 1997 study from ALIS Technology, a Canadian company, with ISOC support. The fact that it was totally flawed did not prevent it from being treated as the truth by the media for 10 years!!!

  22. OCLC WEB CHARACTERIZATION METHODOLOGY: • Random selection of 3000 IP numbers leading to webpages. • Application of language recognition algorithm • Publication of results. • Where was the major flaw?

  23. ONE SHOT VS. STATISTICAL DISTRIBUTION
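
A minimal sketch of the contrast named in this slide title, under stated assumptions: the population below is a toy list of language labels (not OCLC data), and `share_of`, `one_shot` and `repeated` are hypothetical names chosen for illustration. A single draw yields one number with no indication of its spread, whereas repeating the draw yields a distribution whose mean and standard deviation can be reported.

```python
import random
import statistics


def share_of(sample, language):
    """Fraction of items in the sample labelled with the given language."""
    return sum(1 for label in sample if label == language) / len(sample)


def one_shot(population, size, language, rng):
    """Single random sample: one estimate, with no error information."""
    return share_of(rng.sample(population, size), language)


def repeated(population, size, language, runs, rng):
    """Repeat the sampling and return every estimate (a distribution)."""
    return [one_shot(population, size, language, rng) for _ in range(runs)]


# Toy population of language labels (hypothetical, NOT real web data).
rng = random.Random(0)
population = ["en"] * 600 + ["es"] * 250 + ["fr"] * 150

single = one_shot(population, 100, "en", rng)
estimates = repeated(population, 100, "en", 50, rng)
mean = statistics.mean(estimates)
spread = statistics.stdev(estimates)

print(f"one shot: {single:.2f}")
print(f"repeated: mean={mean:.2f}, stdev={spread:.3f}")
```

The sketch only illustrates reporting a spread alongside an estimate; it does not model how IP numbers map to web pages, which is a separate question.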

  24. STATISTICAL BIAS: gross example INKTOMI (a former search engine) published its study in 2000, with absolutely no methodology revealed… but strong marketing, with a figure of 86.5% of web pages in English.

  25. STATISTICAL BIAS: INKTOMI (2000) GROSS FLAW 100%

  26. W3TECHS: WHERE ARE THE BIASES? METHODOLOGY: On a daily basis W3Techs applies a language recognition algorithm to the 10 million sites classified by ALEXA as the most visited in the world.

  27. W3TECHS BIASES?

  28. W3TECHS: A COHERENCE CONTROL The content productivity indicator P(L) = % content (L) / % internauts (L) Experience has shown that this indicator rarely falls outside the window 0.5 to 1.5, showing some understandable statistical law between the number of internauts and the amount of content.
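
The indicator above can be sketched in a few lines of Python. This is an illustration only: the function names `productivity` and `within_window` are assumptions, and the percentages in `examples` are made-up values, not figures from W3Techs or ITU.

```python
def productivity(content_pct, internaut_pct):
    """P(L) = % content (L) / % internauts (L)."""
    if internaut_pct <= 0:
        raise ValueError("internaut_pct must be positive")
    return content_pct / internaut_pct


def within_window(p, low=0.5, high=1.5):
    """True when the indicator stays inside the 0.5 to 1.5 window."""
    return low <= p <= high


# Hypothetical (content %, internauts %) pairs, for illustration only.
examples = {
    "language A": (6.0, 5.0),
    "language B": (2.0, 20.0),
}

for name, (content, internauts) in examples.items():
    p = productivity(content, internauts)
    verdict = "plausible" if within_window(p) else "suspect"
    print(f"{name}: P = {p:.2f} ({verdict})")
```

A value far outside the window does not say which of the two percentages is wrong, only that the pair deserves a closer look.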

  29. W3TECHS: A COHERENCE CONTROL Data from previous FUNREDES/UNION LATINE studies (2005 & 2008) and data from W3Techs (2017) combined with internaut data derived from ITU. Hardly credible…

  30. W3TECHS: A COHERENCE CONTROL Data from W3Techs (2017) combined with country internaut data derived from ITU and transformed into per-language data by simple arithmetic weighting. [Chart labels: Too high; Too low]

  31. W3TECHS: A COHERENCE CONTROL FROM DATA Data from W3Techs (2017) combined with country internaut data derived from ITU and transformed into per-language data by simple arithmetic weighting. [Chart labels: Way too low!!!; Too low]

  32. China + India together represent more than 1 billion persons connected to the Internet in 2016, that is, close to 1 in 3 Internauts… Would you believe they have less than 7% of the world content? For some reason W3Techs is unable to reflect the reality of Asian languages in the Internet… Is that related to Alexa?

  33. AS FOR CHINA, ONE REASON HAS TO DO WITH THE FACT THAT ONLY 20% OF CHINESE DOMAINS ARE IN THE ICANN DNS ROOT!!!! THIS MEANS AN UNDERESTIMATION BY W3TECHS BY A FACTOR OF 5!!!

  34. ALEXA: WHERE IS THE BIAS? AND HOW DOES IT AFFECT W3Techs? ALEXA OFFERS MARKETING DATA TO WEBSITE OWNERS. ALEXA.COM measures traffic to websites thanks to a banner. It is not transparent about the distribution of the banner per country or per language. They only claim to have “millions of banners installed”. Wikipedia reported 10 million in 2005.

  35. ALEXA SUSPECTED BIAS • TELL ME THE BANNER DISTRIBUTION BY COUNTRY AND I’LL TELL YOU WHERE THE BIAS IS • A PRIORI ONE CAN EXPECT A PRO-WESTERN BIAS AND PROBABLY ALSO A PRO-ENGLISH ONE → HOW TO CONTROL IT IN A CONTEXT OF ZERO TRANSPARENCY?

  36. ALEXA BIAS Comparison of Alexa traffic data with subscriber data for: Facebook, Twitter and LinkedIn. The data is transformed from a per-country basis into a per-language basis by weighting with the distribution of languages in each country (more later).
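
A minimal sketch of the per-country to per-language transformation described above, assuming a simple arithmetic weighting. The function names, the country codes "X" and "Y" and all numbers are hypothetical placeholders, not data from the study.

```python
from collections import defaultdict


def country_to_language(country_values, language_shares):
    """Spread each country's value across languages by speaker share.

    country_values:  {country: value}, e.g. subscribers per country
    language_shares: {country: {language: share of population, 0..1}}
    """
    totals = defaultdict(float)
    for country, value in country_values.items():
        for language, share in language_shares.get(country, {}).items():
            totals[language] += value * share
    return dict(totals)


def to_percent(values):
    """Express per-language totals as percentages of their sum."""
    total = sum(values.values())
    return {key: 100.0 * value / total for key, value in values.items()}


# Hypothetical countries, shares and subscriber counts (illustration only).
subscribers = {"X": 100.0, "Y": 50.0}
shares = {
    "X": {"en": 0.75, "fr": 0.25},
    "Y": {"fr": 1.0},
}

by_language = country_to_language(subscribers, shares)
percentages = to_percent(by_language)
print(by_language)
print(percentages)
```

The shares for one country may add up to more than 1 when second-language speakers are counted alongside first-language speakers, so whether to renormalize the result is a choice that should be stated explicitly.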

  37. RATIO WORLD % OF TRAFFIC / WORLD % OF SUBSCRIBERS

  38. RATIO WORLD % OF TRAFFIC / WORLD % OF SUBSCRIBERS

  39. ALEXA BIAS The pro-Western bias appears clearly in the test, although with some exceptions which call for further study…

  40. BEFORE WE CHECK IWS, AN EXCLUSIVE: MY OWN BIAS • INSPIRED BY THE CURRENT WORK OF MAAYA FOR OIF • FIGURES ON: • INTERNAUTS PER LANGUAGE IN THE INTERNET • CONTENTS PER LANGUAGE IN THE WEB

  41. MY FIGURES ON LANGUAGES IN THE INTERNET

  42. COMPARING WITH W3Techs RABBIT > W3Techs RABBIT < W3Techs

  43. COMPARING WITH IWS RABBIT > W3Techs RABBIT < W3Techs

  44. FINE COMPARING IWS/RABBIT

  45. COMPARING IWS/RABBIT We are supposed to rely on the same ITU source, so L1+L2 should explain all differences. However, simulating Rabbit with IWS L1+L2 figures and 100% reasoning (instead of 125%) still shows an underestimation of between 20% and 50% for French, Russian, German and Spanish… Why so? Multilingualism management. The simulation comparison yields a negative figure for the remaining languages…
