Explore the intricacies of text processes in globalised computer systems, from sorting and searching strings to character encoding design. Understand the diverse objects text processes operate on and delve into the levels of comparisons in sorting. Discover the fundamentals of regular expressions and their applications in web and word-processor-based searches.
Globalisation & Computer systems Week 7 • Text processes and globalisation part 1: • Sorting strings: collation • Searching strings and regular expressions
Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” • Text processes operate over text elements
Text processes Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects
Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)
Sorting Language specific • sort order • phonetically based sort • graphically based sort • sort element
Sorting Levels of comparison • Level 1 (primary difference) • Levels 2 and 3 (similar) • Level 4 (exact match)
Sorting Levels of comparison • Level 4: exact match • match in code value • character equivalence • resumes : resumes
Sorting Levels of comparison • Level 1 (primary difference: alphabetic)
Sorting Levels of comparison • Level 1 (primary difference) • resume < resumes • Level 2 (similar: no accent < accent) • resume < résumé • resumes < résumés • Level 3 (similar: lower case < upper case) • résumé < Résumé
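The comparison levels above can be sketched in a few lines of Python. This is only an illustration, not a full collation algorithm: the helper names `levels` and `comparison_level` are hypothetical, and the sketch assumes that accents can be separated from base letters through Unicode decomposition, which holds for the Latin examples on the slide but not for every script.

```python
import unicodedata

def levels(text):
    """Split a string into the three comparison levels used on the slide."""
    nfc = unicodedata.normalize("NFC", text)
    # Level 1: base letters, ignoring accents and case.
    base = [unicodedata.normalize("NFD", ch)[0].lower() for ch in nfc]
    # Level 2: the combining marks (accents) attached to each letter.
    accents = [unicodedata.normalize("NFD", ch)[1:] for ch in nfc]
    # Level 3: whether each letter is upper case.
    case = [ch.isupper() for ch in nfc]
    return base, accents, case

def comparison_level(a, b):
    """Return the first level at which a and b differ, or 4 for an exact match."""
    for level, (x, y) in enumerate(zip(levels(a), levels(b)), start=1):
        if x != y:
            return level
    return 4

words = ["Résumé", "résumés", "resume", "résumé", "resumes"]
ordered = sorted(words, key=levels)
print(ordered)
print(comparison_level("resume", "résumé"))
```

Sorting with `levels` as the key compares base letters first, then accents, then case, which reproduces the ordering given on the slide.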
Sorting Forward and backward sequence sort • Forward sequence • Start comparison from beginning of string • Backward sequence • Start comparison from end of string
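A small sketch may help to show why the direction of comparison matters. It assumes, beyond what the slide states, that the backward sequence is applied only to the accent level, as in traditional French dictionary ordering; the function names and the word list are illustrative.

```python
import unicodedata

def base_letters(word):
    nfc = unicodedata.normalize("NFC", word)
    return "".join(unicodedata.normalize("NFD", ch)[0] for ch in nfc).lower()

def accent_marks(word):
    nfc = unicodedata.normalize("NFC", word)
    return [unicodedata.normalize("NFD", ch)[1:] for ch in nfc]

def key(word, backward=False):
    marks = accent_marks(word)
    if backward:
        # Assumption: only the accent level is compared from the end of the string.
        marks.reverse()
    return (base_letters(word), marks)

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=key)
backward = sorted(words, key=lambda w: key(w, backward=True))
print(forward)
print(backward)
```

With the forward sequence the first accent in the word decides the order; with the backward sequence the last accent does, so `côte` and `coté` swap places.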
Sorting Implementation • Sort keys • assign set of weights to each character in the string • compare substrings according to weighting • switch weightings on / off
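The sort-key idea can be sketched as follows. The weights here are invented for illustration (code points for the primary level, a count of combining marks for the secondary level, a 0/1 flag for case); real collation tables assign weights per language. The `use_accents` and `use_case` parameters are hypothetical names for the "switch weightings on / off" step.

```python
import unicodedata

def sort_key(text, use_accents=True, use_case=True):
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        primary.append(ord(decomposed[0].lower()))   # weight of the base letter
        secondary.append(len(decomposed) - 1)        # 0 = no accent, 1+ = accented
        tertiary.append(1 if ch.isupper() else 0)    # lower case sorts before upper case
    key = [primary]
    if use_accents:
        key.append(secondary)
    if use_case:
        key.append(tertiary)
    return tuple(key)

# With every weighting switched on, the three spellings are distinct.
print(sort_key("resume") < sort_key("résumé") < sort_key("Résumé"))
# With accent and case weightings switched off, they compare as equal.
print(sort_key("Résumé", False, False) == sort_key("resume", False, False))
```

Keys are built once per string and then compared as plain tuples, which is the main practical reason for using sort keys rather than comparing the strings character by character each time.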
Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects
Regular Expressions • Basis of many web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a set of strings • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)
Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk
Regular Expressions • Find occurrences of /Nokia/ in the text egrep -n 'Nokia' nokia_corpus.txt
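For readers without the corpus file, the behaviour of `egrep -n` can be approximated in Python. This is a rough sketch: `grep_n` is a hypothetical helper, the three corpus lines are invented stand-ins for nokia_corpus.txt, and Python's `re` syntax is assumed to agree with egrep only for the basic operators covered in these slides.

```python
import re

def grep_n(pattern, lines):
    """Return (line number, line) pairs for lines containing a match, like egrep -n."""
    regex = re.compile(pattern)
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if regex.search(line)]

# Invented sample lines standing in for the corpus file.
corpus = [
    "Nokia issued a profit warning.",
    "Shares in the company fell.",
    "Analysts said nokia may recover.",
]

hits = grep_n("Nokia", corpus)
for number, line in hits:
    print(f"{number}:{line}")
```

The search is case-sensitive, so the third line, which spells the name with a lowercase n, is not reported.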
Regular Expressions • set operator egrep -n '[Nn]okia' nokia_corpus.txt
Regular Expressions • optional operator egrep -n 'shares?' nokia_corpus.txt
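The set and optional operators can be tried out on a few invented sample lines (not taken from the corpus), assuming Python's `re` module treats `[...]` and `?` the same way egrep does:

```python
import re

# Invented sample lines for illustration.
sample = [
    "Nokia shares fell",
    "nokia cut its forecast",
    "one share was sold",
    "the shareholder meeting",
]

# [Nn] matches either an upper-case or a lower-case first letter.
set_hits = [s for s in sample if re.search(r"[Nn]okia", s)]

# s? makes the final "s" optional, so both "share" and "shares" match.
optional_hits = [s for s in sample if re.search(r"shares?", s)]

print(set_hits)
print(optional_hits)
```

The pattern `shares?` also finds "share" inside "shareholder", because the search looks for the pattern anywhere in the line rather than for whole words.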
Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”
Regular Expressions • Wildcard operator: • /string./ “any character after the previous character” • Combine wildcard and kleene: • /string.*/ “zero or more instances of any character after the previous character” • /string.+/ “one or more instances of any character after the previous character”
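A short experiment makes the difference between the Kleene operators and the wildcard concrete. The test strings are invented, and `matches` is a hypothetical helper that requires the whole string to match so that the effect of each operator is visible:

```python
import re

def matches(pattern, text):
    """True if the whole of text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

candidates = ["profi", "profit", "profittt", "profits"]

# * applies to the previous character only: zero or more "t".
star = [c for c in candidates if matches(r"profit*", c)]
# + requires at least one "t".
plus = [c for c in candidates if matches(r"profit+", c)]
# . stands for exactly one character of any kind.
dot = [c for c in candidates if matches(r"profit.", c)]

# .* runs on to the end of the line.
tail = re.search(r"profit.*", "a profit warning today").group(0)

print(star, plus, dot, tail, sep="\n")
```

Note that `profit.*` extends as far as it can, so in a line-based search it matches from "profit" to the end of the line.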
Regular Expressions egrep -n 'profit.*' nokia_corpus.txt
Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
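The two anchors can be checked against a few invented lines, again assuming that Python's `re` handles `^` and `$` like egrep for single-line input:

```python
import re

# Invented sample lines for illustration.
lines = [
    "said the chairman",
    "the chairman said",
    "he said so",
]

# ^ ties the match to the beginning of the line.
starts = [line for line in lines if re.search(r"^said", line)]
# $ ties the match to the end of the line, so it follows the text it anchors.
ends = [line for line in lines if re.search(r"said$", line)]

print(starts)
print(ends)
```

The third line contains "said" but is selected by neither pattern, because the word is neither at the start nor at the end of the line.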
Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”
Regular Expressions • Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
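The three forms of disjunction (set, range and pipe) can be compared on invented sample text. The sentence below is made up for illustration, and the sketch assumes Python's `re` agrees with egrep on these operators:

```python
import re

sentence = "Nokia gave a warning as weak demand made shares drop"

# Pipe: any one of the alternatives, reported in order of appearance.
alternatives = re.findall(r"weak|warning|drop", sentence)

# With .* each alternative runs on to the end of the line.
extended = re.search(r"weak.*|warn.*|drop.*", sentence).group(0)

# Range: any single capital letter in first position.
capitalised = re.findall(r"[A-Z]okia", "Nokia and nokia and Tokia")

print(alternatives)
print(extended)
print(capitalised)
```

Both egrep commands on the slide select the same lines; the `.*` versions only change how much of each line is counted as the match.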
Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lowercase letter”
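A quick check with invented candidates shows what the negated set accepts, assuming Python's `re` treats `[^...]` as egrep does:

```python
import re

candidates = ["String", "string", "1tring", "_tring", "tring"]

# [^a-z] matches exactly one character that is not a lowercase letter a-z.
negated = [c for c in candidates if re.fullmatch(r"[^a-z]tring", c)]

print(negated)
```

The negated set still has to match one character, so digits and punctuation qualify, while the bare "tring" does not. The caret only means negation when it is the first character inside the brackets; outside brackets it is the beginning-of-line anchor.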
Regular Expressions • Precedence (highest to lowest) • Parentheses ( ) • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
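The effect of precedence can be verified directly. The sketch below uses whole-string matching on a few invented candidates and assumes Python's `re` follows the same precedence order as egrep:

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that match the pattern in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

words = ["supply", "suppliers", "iers"]

# Sequence binds tighter than |, so the alternatives are "supply" and "iers".
without_parens = full_matches(r"supply|iers", words)
# Parentheses limit the disjunction to the ending.
with_parens = full_matches(r"suppl(y|iers)", words)

# Kleene star binds tighter than sequence: it repeats only "b" unless grouped.
star_on_char = full_matches(r"ab*", ["abbb", "abab"])
star_on_group = full_matches(r"(ab)*", ["abbb", "abab"])

print(without_parens, with_parens, star_on_char, star_on_group, sep="\n")
```

Grouping with parentheses is therefore needed whenever an operator should apply to more than the single character or sequence it would bind to by default.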