Globalisation & Computer systems

Presentation Transcript


  1. Globalisation & Computer systems Week 7 • Text processes and globalisation part 1: • Sorting strings: collation • Searching strings and regular expressions

  2. Text processes Character encoding design: “must provide the set of code values that allows programmers to design applications capable of implementing a variety of text processes in the desired language” • Text processes operate over text elements

  3. Text processes Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

  4. Sorting Sorting (collation) “The process of ordering units of textual information. Collation is usually specific to a particular language” (Unicode version 3: glossary)

  5. Sorting Language specific • sort order • phonetically based sort • graphically based sort • sort element

  6. Sorting Levels of comparison • Level 1 (primary difference) • Levels 2 and 3 (similar) • Level 4 (exact match)

  7. Sorting Levels of comparison • Level 4: exact match • match in code value • character equivalence • resumes : resumes
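The difference between a match in code value and character equivalence can be sketched with Python's standard `unicodedata` module. The two spellings of "résumé" below are illustrative assumptions, not taken from the slides: one uses precomposed letters, the other a plain "e" followed by a combining acute accent.

```python
import unicodedata

composed = "r\u00e9sum\u00e9"        # é stored as one precomposed code point
decomposed = "re\u0301sume\u0301"    # e followed by a combining acute accent

# The code values differ, so a plain comparison reports a mismatch.
code_value_match = composed == decomposed


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


equivalent = canonically_equal(composed, decomposed)
```

Here `code_value_match` is false while `equivalent` is true, which is the sense in which an exact match may need to look past raw code values.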

  8. Sorting Levels of comparison • Level 1 (primary difference: alphabetic)

  9. Sorting Levels of comparison • Level 1 (primary difference) • resume < resumes

  10. Sorting Levels of comparison • Level 1 (primary difference) • resume < resumes • Level 2 (similar: no accent < accent) • resume < résumé • resumes < résumés • Level 3 (similar: lower case < upper case) • résumé < Résumé
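The three levels above can be sketched as a tuple-valued sort key. This is a simplified illustration, not the Unicode Collation Algorithm: the function name `level_key` is hypothetical, and it assumes accents can be detected as combining marks after NFD normalisation.

```python
import unicodedata


def level_key(s: str):
    """Build a (primary, secondary, tertiary) key for one string."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            # A combining mark flags the preceding base letter as accented.
            if secondary:
                secondary[-1] = 1
            continue
        primary.append(ch.lower())                 # level 1: base letter only
        secondary.append(0)                        # level 2: no accent < accent
        tertiary.append(1 if ch.isupper() else 0)  # level 3: lower < upper
    return ("".join(primary), tuple(secondary), tuple(tertiary))


words = ["Résumé", "résumés", "resumes", "résumé", "resume"]
ordered = sorted(words, key=level_key)
# ordered == ["resume", "résumé", "Résumé", "resumes", "résumés"]
```

Because Python compares tuples element by element, accents are consulted only when the base letters tie, and case only when the accents tie as well.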

  11. Sorting Forward and backward sequence sort • Forward sequence • Start comparison from beginning of string • Backward sequence • Start comparison from end of string
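A small sketch of the difference, under the assumption that the backward sequence is applied to the accent level only, as in traditional French dictionary ordering (the slide does not name a language). The helper names and the word list are hypothetical.

```python
import unicodedata


def base_letters(s: str) -> str:
    """Return the string with accents stripped and case folded."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(c for c in nfd if not unicodedata.combining(c)).lower()


def accent_weights(s: str):
    """Return one weight per base letter: 0 = unaccented, 1 = accented."""
    weights = []
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            if weights:
                weights[-1] = 1
        else:
            weights.append(0)
    return tuple(weights)


def forward_key(s: str):
    # Accent differences are compared from the beginning of the string.
    return (base_letters(s), accent_weights(s))


def backward_key(s: str):
    # Accent differences are compared from the end of the string.
    return (base_letters(s), accent_weights(s)[::-1])


words = ["côté", "coté", "côte", "cote"]
forward = sorted(words, key=forward_key)
backward = sorted(words, key=backward_key)
# forward  == ["cote", "coté", "côte", "côté"]
# backward == ["cote", "côte", "coté", "côté"]
```

The two middle words swap places because the forward comparison meets the accent on "ô" first, while the backward comparison meets the accent on the final "é" first.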

  12. Sorting Implementation • Sort keys • assign set of weights to each character in the string • compare substrings according to weighting • switch weightings on / off
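One way to picture sort keys with switchable weightings is shown below. It is a minimal sketch: the weights are derived on the fly (code point of the base letter, accent flag, case flag) rather than taken from a real collation table, and the `levels` parameter is a hypothetical stand-in for switching levels on or off.

```python
import unicodedata


def char_weights(ch: str):
    """Return (primary, secondary, tertiary) weights for one character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    primary = ord(base.lower())               # which letter it is
    secondary = 1 if len(decomposed) > 1 else 0  # accented or not
    tertiary = 1 if base.isupper() else 0     # upper or lower case
    return primary, secondary, tertiary


def sort_key(s: str, levels: int = 3):
    """Build a sort key that uses only the first `levels` weight levels."""
    weights = [char_weights(ch) for ch in unicodedata.normalize("NFC", s)]
    # All primary weights come first, then all secondary, then all tertiary.
    return tuple(tuple(w[level] for w in weights) for level in range(levels))


same_at_level_1 = sort_key("resume", levels=1) == sort_key("Résumé", levels=1)
same_at_level_2 = sort_key("résumé", levels=2) == sort_key("Résumé", levels=2)
differ_at_level_3 = sort_key("résumé", levels=3) < sort_key("Résumé", levels=3)
```

With `levels=1` the comparison ignores accents and case, which suits a loose search; with `levels=3` all three distinctions take part in the ordering.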

  13. Searching Text elements • The objects of a text • Depends on perspective • Different text processes operate over different objects

  14. Regular Expressions • Basis of all web-based and word-processor-based searches • Definition 1. An algebraic notation for describing a string • Definition 2. A set of rules that you can use to specify one or more items, such as words in a file, by using a single character string (Sarwar et al.)

  15. Regular Expressions • regular expression, text corpus • regular expression algebra has variants: Perl, Unix tools • Unix tools: egrep, sed, awk

  16. Regular Expressions • Find occurrences of /Nokia/ in the text egrep -n 'Nokia' nokia_corpus.txt

  17. Regular Expressions egrep -n 'Nokia' nokia_corpus.txt

  18. Regular Expressions • set operator egrep -n '[Nn]okia' nokia_corpus.txt
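The same two searches can be tried in Python's `re` module. The corpus lines below are invented for illustration, since the contents of nokia_corpus.txt are not given, and `grep_n` is a hypothetical helper that imitates the line numbering of `egrep -n`.

```python
import re

# Invented sample lines standing in for the corpus file.
lines = [
    "Nokia shares fell on Tuesday.",
    "Analysts said nokia would recover.",
    "The market was quiet.",
]


def grep_n(pattern: str, lines):
    """Return (line number, line) pairs for every line the pattern matches."""
    regex = re.compile(pattern)
    return [(n, line) for n, line in enumerate(lines, start=1) if regex.search(line)]


exact = grep_n(r"Nokia", lines)            # capital N only
either_case = grep_n(r"[Nn]okia", lines)   # set operator: N or n
```

`exact` contains only line 1, while `either_case` also picks up the lower-case spelling on line 2.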

  19. Regular Expressions • optional operator egrep -n 'shares?' nokia_corpus.txt

  20. Regular Expressions egrep -n 'shares?' nokia_corpus.txt
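A quick check of the optional operator in Python, using an invented sentence rather than the corpus file:

```python
import re

# "?" makes the preceding character (the final "s") optional.
pattern = re.compile(r"shares?")

text = "One share rose; all other shares fell."
matches = pattern.findall(text)
# matches == ["share", "shares"]
```

Both the singular and the plural form are found by the one pattern.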

  21. Regular Expressions • Kleene operators: • /string*/ “zero or more occurrences of previous character” • /string+/ “1 or more occurrences of previous character”
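The two Kleene operators differ only in whether zero occurrences are allowed. A small sketch with an invented word list:

```python
import re

words = ["gd", "god", "good", "goood"]

# "o*" accepts zero or more "o" characters between "g" and "d".
star = [w for w in words if re.fullmatch(r"go*d", w)]

# "o+" requires at least one "o".
plus = [w for w in words if re.fullmatch(r"go+d", w)]

# star == ["gd", "god", "good", "goood"]
# plus == ["god", "good", "goood"]
```

`re.fullmatch` is used so that the whole word has to fit the pattern, which makes the difference easy to see.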

  22. Regular Expressions • Wildcard operator: • /string./ “the string followed by any single character”

  23. Regular Expressions • Wildcard operator: • /string./ “the string followed by any single character” • Combine wildcard and Kleene: • /string.*/ “the string followed by zero or more characters of any kind” • /string.+/ “the string followed by one or more characters of any kind”
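The behaviour of the wildcard on its own and combined with the Kleene operators can be sketched as follows. The sample phrases are invented; note that in Python the wildcard does not match a newline by default.

```python
import re


def first_match(pattern: str, text: str):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group() if m else None


one_char = first_match(r"profit.", "net profits rose sharply")
greedy = first_match(r"profit.*", "net profits rose sharply")
star_at_end = first_match(r"profit.*", "a profit")
plus_at_end = first_match(r"profit.+", "a profit")

# one_char    == "profits"
# greedy      == "profits rose sharply"
# star_at_end == "profit"
# plus_at_end is None
```

The `.*` form runs to the end of the line because the Kleene star is greedy, and the `.+` form fails when nothing follows the word.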

  24. Regular Expressions egrep -n 'profit.*' nokia_corpus.txt

  25. Regular Expressions • Anchors • Beginning of line operator: ^ egrep '^said' nokia_corpus.txt • End of line operator: $ egrep 'said$' nokia_corpus.txt
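Anchor placement matters: `^` goes before the text it anchors and `$` goes after it. A minimal Python sketch with invented lines:

```python
import re

lines = [
    "said the chairman",
    "the chairman said",
    "he said nothing",
]

# "^said" matches only when "said" opens the line.
at_start = [l for l in lines if re.search(r"^said", l)]

# "said$" matches only when "said" closes the line.
at_end = [l for l in lines if re.search(r"said$", l)]

# Putting "$" first asks for text after the end of the line, so nothing matches.
misplaced = [l for l in lines if re.search(r"$said", l)]

# at_start  == ["said the chairman"]
# at_end    == ["the chairman said"]
# misplaced == []
```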

  26. Regular Expressions • Disjunction: • set operator /[Ss]tring/ “a string which begins with either S or s” • Range /[A-Z]tring/ “a string beginning with a capital letter” • pipe | /string1|string2/ “either string 1 or string 2”

  27. Regular Expressions • Disjunction egrep -n 'weak|warning|drop' nokia_corpus.txt egrep -n 'weak.*|warn.*|drop.*' nokia_corpus.txt
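The pipe form of disjunction can be tried in Python as below. The four lines are invented stand-ins for the corpus.

```python
import re

lines = [
    "Nokia issued a profit warning.",
    "Demand was weaker than expected.",
    "Shares dropped sharply.",
    "Sales rose in Asia.",
]

# Any one of the three alternatives is enough for a line to match.
plain = re.compile(r"weak|warning|drop")
hits = [n for n, line in enumerate(lines, start=1) if plain.search(line)]

# Adding ".*" extends each match to the end of the line.
extended = re.compile(r"weak.*|warn.*|drop.*")
span = extended.search(lines[1]).group()

# hits == [1, 2, 3]
# span == "weaker than expected."
```

On these sample lines both patterns select the same lines; the second pattern changes how much of each line is reported as the match.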

  28. Regular Expressions • Negation: /[^a-z]tring/ “any string that does not begin with a lower-case letter”
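A negated set still consumes exactly one character, which the following sketch (with an invented word list) makes visible:

```python
import re

words = ["string", "String", "1tring", "tring"]

# "[^a-z]" matches any single character that is not a lower-case letter a-z.
not_lower = [w for w in words if re.fullmatch(r"[^a-z]tring", w)]

# not_lower == ["String", "1tring"]
```

"string" is rejected because it starts with a lower-case letter, and "tring" is rejected because there is no character at all in the negated position.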

  29. Regular Expressions • Precedence (highest to lowest) • Parentheses • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | (a) /supply|iers/

  30. Regular Expressions • Precedence (highest to lowest) • Parentheses • Kleene and optional operators * + ? • Anchors and sequences • Disjunction operator | • /supply|iers/ matches /supply/ or /iers/ • /suppl(y|iers)/ matches /supply/ or /suppliers/
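The effect of precedence on the two patterns can be checked directly in Python. The word list is invented for the test.

```python
import re

words = ["supply", "suppliers", "iers"]

# Sequences bind tighter than "|", so this reads as (supply) or (iers).
flat = [w for w in words if re.fullmatch(r"supply|iers", w)]

# Parentheses limit the disjunction to the endings "y" and "iers".
grouped = [w for w in words if re.fullmatch(r"suppl(y|iers)", w)]

# flat    == ["supply", "iers"]
# grouped == ["supply", "suppliers"]
```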
