Understanding Text Processes in Globalised Computer Systems

Globalisation & Computer systems Week 7 • Text processes and globalisation part 1: • Sorting strings: collation • Searching strings and regular expressions

Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” • Text processes operate over text elements

Text processes Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)

Sorting Language specific • sort order • phonetically based sort • graphically based sort • sort element

Sorting Levels of comparison • Level 1 (primary difference) • Levels 2 and 3 (similar) • Level 4 (exact match)

Sorting Levels of comparison • Level 4: exact match • match in code value • character equivalence • resumes = resumes

Sorting Levels of comparison • Level 1 (primary difference: alphabetic) • resume < resumes • Level 2 (similar: no accent < accent) • resume < résumé • resumes < résumés • Level 3 (similar: lower case < upper case) • résumé < Résumé
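The three levels can be sketched as a tuple-valued sort key in Python. This is a minimal illustration, not a full collation algorithm: `collation_key` is a hypothetical helper, and it assumes that accents can be separated from base letters by Unicode decomposition (NFD).

```python
import unicodedata

def collation_key(word):
    """Hypothetical three-level key: base letters, then accents, then case."""
    decomposed = unicodedata.normalize("NFD", word)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    # Level 1: base letters only, ignoring accents and case.
    primary = "".join(c.lower() for c in base)
    # Level 2: one flag per base letter, 0 = no accent, 1 = accent.
    secondary = []
    for c in decomposed:
        if unicodedata.combining(c):
            if secondary:
                secondary[-1] = 1
        else:
            secondary.append(0)
    # Level 3: one flag per base letter, 0 = lower case, 1 = upper case.
    tertiary = tuple(1 if c.isupper() else 0 for c in base)
    return (primary, tuple(secondary), tertiary)

words = ["Résumé", "resumes", "résumé", "résumés", "resume"]
ordered = sorted(words, key=collation_key)
print(ordered)  # resume, résumé, Résumé, resumes, résumés
```

Because Python compares tuples element by element, a Level 1 difference always outweighs any Level 2 or Level 3 difference.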

Sorting Forward and backward sequence sort • Forward sequence • Start comparison from beginning of string • Backward sequence • Start comparison from end of string
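The difference between the two directions shows up when accents are compared. The sketch below is an assumption-laden illustration: the helper names and the sample words are hypothetical, and backward comparison of accents is the convention traditionally associated with French dictionary ordering.

```python
import unicodedata

def base_letters(word):
    """Strip accents, keeping only the base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def accent_flags(word):
    """One flag per base letter: 0 = no accent, 1 = accent."""
    flags = []
    for c in unicodedata.normalize("NFD", word):
        if unicodedata.combining(c):
            if flags:
                flags[-1] = 1
        else:
            flags.append(0)
    return flags

def forward_key(word):
    # Accent differences are examined from the beginning of the string.
    return (base_letters(word), accent_flags(word))

def backward_key(word):
    # Accent differences are examined from the end of the string.
    return (base_letters(word), accent_flags(word)[::-1])

words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=forward_key)    # cote, coté, côte, côté
backward = sorted(words, key=backward_key)  # cote, côte, coté, côté
```

The base letters are identical in all four words, so only the direction in which the accent flags are read decides the order.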

Sorting Implementation • Sort keys • assign set of weights to each character in the string • compare substrings according to weighting • switch weightings on / off
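A sort key built from per-character weights might look like the following sketch. The `WEIGHTS` table, its numeric values, and the `levels` parameter are all hypothetical and cover only the letters needed for the example; real collation tables are far larger.

```python
import unicodedata

# Hypothetical weight table: character -> (primary, secondary, tertiary).
WEIGHTS = {
    "e": (1, 0, 0), "\u00e9": (1, 1, 0),  # e and e-acute share a primary weight
    "m": (2, 0, 0),
    "r": (3, 0, 0), "R": (3, 0, 1),       # r and R differ only at level 3
    "s": (4, 0, 0),
    "u": (5, 0, 0),
}

def sort_key(word, levels=3):
    """Build a key from the first `levels` weight levels; lower `levels` switches the rest off."""
    weights = [WEIGHTS[c] for c in unicodedata.normalize("NFC", word)]
    # All primary weights come first, then all secondary, then all tertiary.
    return tuple(tuple(w[level] for w in weights) for level in range(levels))

words = ["Résumé", "resumes", "résumé", "resume"]
all_levels = sorted(words, key=sort_key)
# resume, résumé, Résumé, resumes
primary_only = sorted(words, key=lambda w: sort_key(w, levels=1))
# Résumé, résumé, resume, resumes (the first three tie and keep their input order)
```

Switching a level off simply drops its weights from the key, so strings that differ only at that level compare as equal.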

Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

Regular Expressions • Basis of many web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a set of strings • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

Regular Expressions • A search needs a regular expression (the pattern) and a text corpus (the text to search) • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk

Regular Expressions • Find occurrences of /Nokia/ in the text: egrep -n 'Nokia' nokia_corpus.txt
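The same kind of search can be sketched in Python, whose `re` module stands in for egrep here. The `corpus` lines are hypothetical sample text, not the contents of nokia_corpus.txt, and `grep_n` is an illustrative helper.

```python
import re

# Hypothetical sample lines standing in for nokia_corpus.txt.
corpus = [
    "Nokia shares fell on Tuesday.",
    "Analysts said the warning was expected.",
    "The nokia handset unit reported a profit.",
]

def grep_n(pattern, lines):
    """Return (line number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]

hits = grep_n("Nokia", corpus)
print(hits)  # [(1, 'Nokia shares fell on Tuesday.')]
```

Line 3 is not reported because a literal pattern is case-sensitive.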


Regular Expressions • set operator: egrep -n '[Nn]okia' nokia_corpus.txt
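A short Python sketch of the set operator, using hypothetical sample lines: `[Nn]` matches exactly one character, either `N` or `n`.

```python
import re

# Hypothetical sample lines.
lines = [
    "Nokia shares fell.",
    "The nokia handset unit grew.",
    "Ericsson was flat.",
]

# [Nn] accepts either capitalisation of the first letter.
matched = [line for line in lines if re.search(r"[Nn]okia", line)]
print(matched)  # ['Nokia shares fell.', 'The nokia handset unit grew.']
```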

Regular Expressions • optional operator: egrep -n 'shares?' nokia_corpus.txt
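In Python the optional operator behaves the same way: `?` makes the preceding character optional. The sample sentence is hypothetical; note that the pattern also matches inside longer words.

```python
import re

# Hypothetical sample text.
text = "One share rose; most shares fell; shareholders sold."

# The final "s" is optional, so both "share" and "shares" match.
found = re.findall(r"shares?", text)
print(found)  # ['share', 'shares', 'share']
```

The third match is the `share` at the start of `shareholders`.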


Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of the previous character (here, g)” • /string+/ “one or more occurrences of the previous character (here, g)”
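The difference between `*` and `+` can be checked against a few candidate strings. In this Python sketch the candidate list is hypothetical, and `re.fullmatch` is used so that the whole string must match the pattern.

```python
import re

# Hypothetical candidates: "profi" followed by zero to three "t" characters.
candidates = ["profi", "profit", "profitt", "profittt"]

# * applies only to the preceding "t": zero or more of them.
star = [w for w in candidates if re.fullmatch(r"profit*", w)]
# + requires at least one "t".
plus = [w for w in candidates if re.fullmatch(r"profit+", w)]

print(star)  # ['profi', 'profit', 'profitt', 'profittt']
print(plus)  # ['profit', 'profitt', 'profittt']
```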

Regular Expressions • Wildcard operator: • /string./ “string followed by any single character” • Combine wildcard and Kleene: • /string.*/ “string followed by zero or more characters of any kind” • /string.+/ “string followed by one or more characters of any kind”

Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
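A Python sketch of the wildcard on its own and combined with the Kleene operators. The sample line is hypothetical; `.` matches any single character except a newline, and `.*` is greedy, so it runs to the end of the line.

```python
import re

# Hypothetical sample line.
line = "Nokia said profits fell sharply."

one_char = re.search(r"profit.", line).group()
to_end = re.search(r"profit.*", line).group()
print(one_char)  # profits
print(to_end)    # profits fell sharply.

# .* accepts nothing after "profit"; .+ needs at least one more character.
bare_star = re.search(r"profit.*", "profit")
bare_plus = re.search(r"profit.+", "profit")
print(bare_star is not None, bare_plus is not None)  # True False
```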

Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
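Anchor placement matters: `^` is written before the pattern and `$` after it. A minimal Python sketch with hypothetical sample lines:

```python
import re

# Hypothetical sample lines.
lines = [
    "said the chairman.",
    "The chairman said",
    "He said nothing.",
]

starts = [line for line in lines if re.search(r"^said", line)]
ends = [line for line in lines if re.search(r"said$", line)]
# A pattern with $ in front asks for text after the end of the line, so nothing matches.
misplaced = [line for line in lines if re.search(r"$said", line)]

print(starts)     # ['said the chairman.']
print(ends)       # ['The chairman said']
print(misplaced)  # []
```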

Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”

Regular Expressions • Disjunction: egrep -n 'weak|warning|drop' nokia_corpus.txt • egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
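The pipe and range forms of disjunction can be sketched in Python as follows. The sample lines and word list are hypothetical. Adding `.*` to each alternative selects the same lines but extends the matched text to the end of the line.

```python
import re

# Hypothetical sample lines.
lines = [
    "Nokia issued a profit warning.",
    "Demand was weaker than forecast.",
    "Shares dropped sharply.",
    "Sales rose in Asia.",
]

pattern = re.compile(r"weak|warning|drop")
hits = []
for number, line in enumerate(lines, start=1):
    match = pattern.search(line)
    if match:
        hits.append((number, match.group()))
print(hits)  # [(1, 'warning'), (2, 'weak'), (3, 'drop')]

# With .* the match runs from the keyword to the end of the line.
extended = re.search(r"weak.*|warn.*|drop.*", lines[1]).group()
print(extended)  # weaker than forecast.

# A range is a disjunction over a run of characters.
capitalised = [w for w in ["String", "string", "1tring"] if re.fullmatch(r"[A-Z]tring", w)]
print(capitalised)  # ['String']
```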

Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lower-case letter”
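A negated set still consumes exactly one character; it only restricts which character that may be. A Python sketch with a hypothetical word list:

```python
import re

# Hypothetical candidates.
words = ["string", "String", "9tring", "tring"]

# [^a-z] matches any single character that is not a lower-case letter.
matched = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]
print(matched)  # ['String', '9tring']
```

`tring` is rejected because there is no character in front of it for `[^a-z]` to match.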

Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
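The effect of precedence can be checked directly. In this Python sketch the word list is hypothetical; `re.fullmatch` requires the whole word to match.

```python
import re

# Hypothetical candidates.
words = ["supply", "suppliers", "iers", "supplies"]

# Without parentheses, | splits the whole pattern: "supply" or "iers".
loose = [w for w in words if re.fullmatch(r"supply|iers", w)]
# With parentheses, | applies only inside the group: "suppl" + ("y" or "iers").
grouped = [w for w in words if re.fullmatch(r"suppl(y|iers)", w)]

print(loose)    # ['supply', 'iers']
print(grouped)  # ['supply', 'suppliers']
```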
