Language and the Internet Assessing Linguistic Bias. Measuring the Information Society WSIS, Tunis, November 15, 2005 John C. Paolillo, Indiana University. Overview. Sources of Linguistic Bias Linguistic Bias: examples Text Communication Internet Host Names Web Programming


Language and the Internet: Assessing Linguistic Bias

Measuring the Information Society

WSIS, Tunis, November 15, 2005

John C. Paolillo, Indiana University

Overview
  • Sources of Linguistic Bias
  • Linguistic Bias: examples
    • Text Communication
    • Internet Host Names
    • Web Programming
  • Global Linguistic Diversity
    • Who bears the costs?
  • Conclusions
Sources of Linguistic Bias

(Friedman and Nissenbaum 1997)

  • Pre-existing
    • originate from outside the technical system
      • National, trans-national and institutional policies
      • Technology companies
  • Technical
    • are built into the technical system itself
      • Developers’ language backgrounds, national origins
      • Legacy standards, “backward” compatibility
  • Emergent
    • arise in specific contexts of use of a technical system
      • Economics of technology industry (marketing, monopoly power, unstable markets, etc.)
      • Rapid technologization
Text Communication
  • Requires an encoding and its support
    • Assign code numbers to script characters
      • ASCII (American English)
      • ISO-8859-1 (European Languages)
      • Unicode (most languages, but support is uneven)
    • Support means many things
      • Fonts, rendering, sorting, spell-checking etc.
  • Computer-Mediated Communication
    • Web pages, Email, chat, etc.
    • Language use is not uniform in these modes
      • Multilinguals tend to favor different languages for specific purposes
  • Represents both technical and emergent biases
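
As a rough illustration of the encoding point above, the following Python sketch checks which of the three encodings named on the slide can assign code numbers to a sample word. The sample words and the helper names `encodable` and `coverage` are illustrative assumptions, not part of the original presentation.

```python
# Sample words are assumptions chosen only to exercise each encoding.
SAMPLES = {
    "English": "language",
    "French": "français",
    "Hindi": "हिन्दी",
}

def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def coverage(samples=SAMPLES, encodings=("ascii", "iso-8859-1", "utf-8")):
    """Map each language to a {encoding: can_encode} table."""
    return {
        lang: {enc: encodable(word, enc) for enc in encodings}
        for lang, word in samples.items()
    }

if __name__ == "__main__":
    for lang, row in coverage().items():
        print(lang, row)
```

Being encodable is only the first layer of support: fonts, rendering, sorting and spell-checking are separate problems that this check does not cover.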
Unicode Status: Examples

Language        Unicode  Browser      Script    Pop.
Chinese         yes      good         Chinese   1,240M
English         yes      good         Roman     400M
French          yes      good         Roman     81M
German          yes      good         Roman     82M
Spanish         yes      good         Roman     358M
Finnish         yes      good         Roman     5M
Russian         yes      good (late)  Cyrillic  132M
Arabic          yes      good (late)  Arabic    247M
Hindi           yes      poor         Indic     213M
Sinhala         yes      none         Indic     15M
S. Azerbaijani  no       none         Arabic    26M

Legend: good support / poor support / no support

Internet Host Names
  • The Domain Name System
    • Uses a 30-year-old 7-bit ASCII standard
      • Now supports non-ASCII names through Punycode (an ASCII-compatible encoding of Unicode)
      • Imposes a maximum name length (63 octets per label, 255 per full name)
    • Run by ICANN under US Dept of Commerce contract
      • More concerned with trademark protection
      • Host/domain naming is widely abused (e.g. the .tv domain)
      • Names provided by the DNS are not that useful
  • An example of emergent bias
    • Technical origin
    • Economic and political forces amplify and sustain it
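
To make the Punycode point concrete, here is a minimal Python sketch that converts a Unicode host name to the ASCII form the DNS carries and checks the per-label length limit. It relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the host name `bücher.example` and the helper names are illustrative assumptions.

```python
MAX_LABEL_OCTETS = 63  # DNS limit on a single dot-separated label

def to_ascii_hostname(name: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible (Punycode) form."""
    return name.encode("idna").decode("ascii")

def label_fits(label: str) -> bool:
    """Check whether one label still fits the DNS limit after encoding."""
    try:
        return len(label.encode("idna")) <= MAX_LABEL_OCTETS
    except UnicodeError:
        # The codec itself rejects labels that are too long.
        return False

if __name__ == "__main__":
    print(to_ascii_hostname("bücher.example"))
    print(label_fits("a" * 63), label_fits("a" * 64))
```

Because the encoded form is usually longer than the original, names in non-Roman scripts reach the length limit sooner than ASCII names do, which is one way the technical legacy keeps producing bias.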
Web Programming and Unicode
  • Markup & web scripting languages
    • Unicode is standard
    • Browser support, fonts, etc. lag behind
    • Databases and development environments tend to lack proper Unicode support
    • End-user oriented, not programmer oriented
  • All of the most important technologies are Open-Source software (FLOSS)
    • User extensible/modifiable
    • Language localization of these is possible but rare
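
One small example of what "proper Unicode support" can involve, sketched in Python: the same visible text may be stored as different code point sequences, so a database or tool that compares raw code points can treat identical words as different. The helper name `same_text` is an assumption made for this illustration.

```python
import unicodedata

composed = "\u00e9"     # "é" stored as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

if __name__ == "__main__":
    print(composed == decomposed)           # raw comparison
    print(same_text(composed, decomposed))  # normalized comparison
```

Scripts that rely heavily on combining marks are more exposed to this kind of mismatch than unaccented Roman text is.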
Linguistic Bias in Web Programming
  • English is the source language for most programming & markup languages
    • Keywords
    • Operator-argument order
    • Programming constructs, etc.
  • Programming as a linguistic act
    • Complex concepts are rendered into text
    • Different languages have different ways of doing this
  • Emergent language biases
Linguistic Properties of Programming
  • LISP
    • Predicates precede their arguments
      • Like Arabic, Celtic, Hebrew, etc.

(defun fact (x)(if (<= x 0) 1 (* x (fact (- x 1)))))

  • Postscript
    • Predicates follow their arguments
      • Like Farsi, Hindi, Japanese, Tamil, Turkish, etc.

/factorial { dup 1 gt { dup 1 sub factorial mul } if } def
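
The contrast between the two orders can be sketched in a third language. The Python below evaluates the same arithmetic once in predicate-first (LISP-like) form and once in predicate-last (Postscript-like) form; the function names and the tuple/list representations are assumptions made for this illustration, not part of either language.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_prefix(expr):
    """Evaluate a nested tuple such as ("*", 4, ("-", 4, 1)): operator first."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    return OPS[op](eval_prefix(left), eval_prefix(right))

def eval_postfix(tokens):
    """Evaluate a flat list such as [4, 4, 1, "-", "*"]: operator last."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        else:
            stack.append(tok)
    return stack.pop()

if __name__ == "__main__":
    print(eval_prefix(("*", 4, ("-", 4, 1))))
    print(eval_postfix([4, 4, 1, "-", "*"]))
```

Both forms compute the same value; only the position of the predicate relative to its arguments differs, which is the property the slide compares to word order in natural languages.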

The Linguistic Digital Divide
  • Language issues go beyond content
    • WSIS repeatedly re-affirms principles of
      • Transparency
      • Self-determination
      • Open access to participation for all parties

These principles cannot be guaranteed unless speakers of different languages can manipulate all aspects of IT use in a way that is native-like

  • The linguistic divide has broader consequences
    • Costs are borne in
      • Education: large costs for non-English-speaking people
      • Technical development: small costs, in comparison

(there is a trade-off)


Language Diversity

Who bears the costs?


(source data: www.ethnologue.com)

A typical language group has around 10-50 thousand people

80% of language groups have fewer than 100 thousand members


(source data: www.ethnologue.com)

90% of the world’s population belongs to a language group with at least 1 million people (416 groups)

Many languages with hundreds of millions of speakers lack adequate support

Conclusions
  • Linguistic Bias is manifest in many ways
    • Technical biases are sometimes overt
    • Emergent biases can be subtle
  • All potential sources of bias need to be examined and questioned if we are to uphold principles affirmed by WSIS
  • Without this effort, the linguistic digital divide will simply amplify existing disparities in wealth and power

Language Diversity

On The Internet

Linguistic Diversity

Based on Entropy:

$\text{Diversity} = -2 \sum_i p_i \ln p_i$

Diversity is the long-run per-individual average variance in language category

(similar to log-likelihood)
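
A minimal Python sketch of the entropy-based measure, assuming that p_i is the proportion of individuals belonging to language category i (the slide does not define it explicitly). The function name `diversity` and the example counts are illustrative assumptions.

```python
import math

def diversity(counts):
    """Return -2 * sum(p_i * ln p_i) for a list of per-language counts."""
    total = sum(counts)
    # Categories with zero members contribute nothing to the sum.
    props = [c / total for c in counts if c > 0]
    return -2.0 * sum(p * math.log(p) for p in props)

if __name__ == "__main__":
    print(diversity([100]))         # a single language
    print(diversity([50, 50]))      # two equally sized languages
    print(diversity([90, 5, 5]))    # one dominant language
```

Under this assumption the measure is zero when everyone shares one language and grows as the population is spread more evenly over more languages.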