
Regex

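Regex. wangyoutian@nilnul.com. char: indivisible (within the scope of the current discussion), e.g. A, b, ",", "@", " ", Chinese characters; mostly visually recognizable; also includes whitespace such as space, newline, and Tab. Set<char>: a finite set of characters is called an alphabet; usually non-empty; e.g. English letters, digits, Chinese characters, and punctuation. String: characters joined in order form a string; usually of finite length; may contain one, two, … or zero characters (ε, "", the empty string); e.g. "abcdaaab".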


Presentation Transcript


  1. Regex wangyoutian@nilnul.com

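  2. char • Indivisible • (within the scope of the current discussion) • For example: • A, b • ",", "@" • " " • Chinese characters • Mostly visually recognizable • Also includes whitespace, such as space, newline, Tab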

  3. Set<char> • A finite set of characters is called an alphabet • Usually non-empty • For example: English letters, digits, Chinese characters, and punctuation

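  4. String • Characters joined in order form a string • Usually of finite length • May contain one, two, … or zero characters (ε, "", the empty string) • For example: • "abcdaaab" • In ES (ECMAScript), a string literal is enclosed in single or double quotes.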

  5. Algebra of string • + is string concatenation • empty string + s = s + empty string = s, for any string s
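The identity law on this slide can be checked directly in JavaScript. This is only a small sketch; the variable names are chosen for illustration.

```javascript
// String concatenation with +; the empty string "" acts as the identity element.
const s = "abcdaaab";
const empty = "";

console.log(empty + s === s); // true
console.log(s + empty === s); // true
console.log(empty.length);    // 0
console.log("ab" + "cd");     // abcd

// Single and double quotes produce the same string value.
console.log('ab' === "ab");   // true
```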

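  6. Alphabet vs string • Alphabet is a set: • unordered • no duplicates • its size is its cardinality • String is a list: • ordered • duplicates allowed • its size is its length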

  7. Set<string> • A set of strings • Usually infinite • The length of its strings is unbounded • May also be finite • For example: the empty set, {"a","ab"}

  8. Algebra of Set<string> • | (alternation) • Equivalent to set union; the result is still a Set<string> • Ф = the union of zero Set<string> • {"a","bc","e"} | {"a","1"} = {"a","bc","e","1"} • Many use ∪, +, or ∨ for alternation
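As a rough illustration, alternation on finite sets of strings can be modelled with the built-in JavaScript Set. The helper name union is an assumption made for this sketch, not something from the slides.

```javascript
// Alternation as set union over any number of Set<string> arguments.
// `union` is a hypothetical helper written for this example.
const union = (...sets) => new Set(sets.flatMap((s) => [...s]));

const result = union(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
console.log([...result]);   // [ 'a', 'bc', 'e', '1' ]

// The union of zero sets is the empty set (Ф on the slide).
console.log(union().size);  // 0
```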

  9. Concatenation • Similar to the Cartesian product (each pair of strings is concatenated); the result is still a Set<string> • For example: • {"a","bc","e"} {"a","1"} = {"aa","a1","bca","bc1","ea","e1"}
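The product on this slide can be sketched for finite sets as a pair of nested loops. The helper name concat is hypothetical and only serves this example.

```javascript
// Concatenation of two Set<string>: every string of R followed by every string of S.
// `concat` is a hypothetical helper written for this example.
const concat = (R, S) => {
  const out = new Set();
  for (const r of R) {
    for (const s of S) out.add(r + s);
  }
  return out;
};

const product = concat(new Set(["a", "bc", "e"]), new Set(["a", "1"]));
console.log([...product]); // [ 'aa', 'a1', 'bca', 'bc1', 'ea', 'e1' ]
```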

  10. Power • {"a","1"}^2 = {"a","1"} {"a","1"} = {"aa","a1","1a","11"} • {"a","1"}^3 = {"a","1"} {"a","1"} {"a","1"} = {"aaa","aa1","a1a","a11","1aa","1a1","11a","111"} • Definitions • {"a","1"}^1 = {"a","1"} • so that {"a","1"}^2 = {"a","1"}^(1+1) = {"a","1"}^1 {"a","1"}^1 • {"a","1"}^0 = {ε} • so that {"a","1"}^1 = {"a","1"}^(0+1) = {ε} {"a","1"}
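A minimal sketch of the power operation, assuming finite sets and the same pairwise concatenation as on the previous slide. The helpers concat and power are hypothetical names; starting from {ε} is what makes S^0 = {ε} fall out naturally.

```javascript
// Pairwise concatenation of two finite sets of strings (hypothetical helper).
const concat = (R, S) => {
  const out = new Set();
  for (const r of R) for (const s of S) out.add(r + s);
  return out;
};

// S^n: start from {ε} (the set holding only "") and concatenate S onto it n times.
const power = (S, n) => {
  let out = new Set([""]);
  for (let i = 0; i < n; i++) out = concat(out, S);
  return out;
};

const S = new Set(["a", "1"]);
console.log([...power(S, 0)]);  // [ '' ]
console.log([...power(S, 1)]);  // [ 'a', '1' ]
console.log([...power(S, 2)]);  // [ 'aa', 'a1', '1a', '11' ]
console.log(power(S, 3).size);  // 8
```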

  11. Combining power and alternation • S{m,n} = S^m | S^(m+1) | S^(m+2) | … | S^n • S{m,} = S^m | S^(m+1) | S^(m+2) | … • S? = S^0 | S^1 • S+ = S^1 | S^2 | S^3 | … • S* = S^0 | S^1 | S^2 | S^3 | … • * is called the Kleene star
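One way to see these definitions at work is to enumerate every string over a small alphabet up to some length and ask the JavaScript engine which ones each quantifier accepts. The cut-off at length 3 is an assumption of this sketch, since S+, S* and S{m,} are infinite sets.

```javascript
// All strings over the alphabet {a, 1} of length 0..3, shortest first.
const alphabet = ["a", "1"];
let level = [""];
const words = [""];
for (let len = 1; len <= 3; len++) {
  level = level.flatMap((w) => alphabet.map((c) => w + c));
  words.push(...level);
}

// S is {"a","1"}; `matching` lists the enumerated words accepted by S followed by a quantifier.
const S = "(?:a|1)";
const matching = (q) =>
  words.filter((w) => new RegExp("^" + S + q + "$").test(w));

console.log(matching("{1,2}"));        // [ 'a', '1', 'aa', 'a1', '1a', '11' ]
console.log(matching("?"));            // [ '', 'a', '1' ]
console.log(matching("{2,}").length);  // 12 (lengths 2 and 3 only, because of the cut-off)
console.log(matching("+").length);     // 14
console.log(matching("*").length);     // 15
```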

  12. Priority of ops • * (highest) • concatenation • alternation (lowest) • Given these priorities, parentheses may be omitted. For example, • (ab)c can be written as abc, • and a|(b(c*)) can be written as a|bc*.
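The precedence rules can be observed in JavaScript patterns. In this sketch the non-capturing groups and anchors are added only so that the whole string must match; they are not part of the slide's notation.

```javascript
// a|bc* parses as a|(b(c*)): either "a", or "b" followed by any number of "c".
const re = /^(?:a|bc*)$/;
console.log(["a", "b", "bcc", "ac", "bcbc"].map((s) => re.test(s)));
// [ true, true, true, false, false ]

// Parentheses change the grouping: (a|b)c* does accept "ac".
console.log(/^(?:a|b)c*$/.test("ac"));  // true

// * binds tighter than concatenation: ab* is a(b*), not (ab)*.
console.log(/^ab*$/.test("abbb"));      // true
console.log(/^ab*$/.test("abab"));      // false
console.log(/^(?:ab)*$/.test("abab"));  // true
```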

  13. Regular Expression • Some Set<string> are called regular expressions (RE, RegExp, Regex) • The following are RegExp • {"a"} is a RegExp, for any char a in the alphabet • R|S is a RegExp, if R and S are both RegExp • RS is a RegExp, if R and S are both RegExp • R* is a RegExp, if R is a RegExp • So {ε} is a RegExp • If a Set<string> cannot be built by the above process, it is not a RegExp

  14. Note • Ф is often included in RegExp

  15. RegExp@JS

  16. See Standard

  17. Empty • Empty allowed • [] • () • |
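A short sketch of how the three empty forms behave in a JavaScript engine: an empty group and an empty alternative both match the empty string, while an empty character class matches nothing at all.

```javascript
// () is an empty group: it matches the empty string at position 0.
console.log(/()/.exec("abc")[0] === ""); // true

// | has two empty alternatives, so it also matches the empty string.
console.log(/|/.test("abc"));            // true

// [] is an empty character class: no character belongs to it, so it never matches.
console.log(/[]/.test("abc"));           // false
console.log(/[]/.test(""));              // false
```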

  18. Assertion • ^ • $ • \b • Word boundary • Between a word char (_, [0-9], [A-Za-z]) and a non-word char (anything else, or the edge of the string) • \B • Not \b • (?=expression) • (?!expression)
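The assertions listed here consume no characters; they only test a position. The sample strings below are made up for illustration.

```javascript
// ^ and $ anchor to the start and end of the input.
console.log(/^a/.test("ab"), /b$/.test("ab"));  // true true

// \b is a word boundary; \B is any position that is not one.
console.log("cat concat".match(/\bcat\b/g));    // [ 'cat' ]  (the standalone word)
console.log("cat concat".match(/\Bcat/g));      // [ 'cat' ]  (the one inside "concat")

// The underscore counts as a word char, so there is no boundary inside x_y.
console.log("x_y".match(/\b\w+\b/g));           // [ 'x_y' ]

// Lookahead: (?=...) requires what follows, (?!...) forbids it.
console.log("a1 b2 c".match(/[a-z](?=\d)/g));   // [ 'a', 'b' ]
console.log("a1 b2 c".match(/[a-z](?!\d)/g));   // [ 'c' ]
```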

  19. quantifier • ? • + • * • {m,} • {m,n} • Each of the above becomes lazy when followed by another ?
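A brief sketch contrasting greedy and lazy forms of the same quantifier; the input strings are invented for the example.

```javascript
// Greedy: take as much as possible. Lazy (extra ?): take as little as possible.
console.log("<a><b>".match(/<.+>/)[0]);    // <a><b>
console.log("<a><b>".match(/<.+?>/)[0]);   // <a>

console.log("aaaa".match(/a{2,3}/)[0]);    // aaa
console.log("aaaa".match(/a{2,3}?/)[0]);   // aa

// A lazy ? prefers zero occurrences, so it matches the empty string first.
console.log("aaaa".match(/a??/)[0] === ""); // true
```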

  20. capture • () capturing group • (?:expression) non-capturing group
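The difference between the two group forms shows up in the array returned by exec. The pattern below is an invented example.

```javascript
// Two capturing groups around one non-capturing group.
const m = /(\d+)-(?:\d+)-(\d+)/.exec("12-34-56");

console.log(m[0]);     // 12-34-56  (the whole match)
console.log(m[1]);     // 12
console.log(m[2]);     // 56  (the middle group was not captured)
console.log(m.length); // 3
```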

  21. Atom Escape

  22. \c followed by a lower- or upper-case letter (a control character) • \a = a • because a has no designated special meaning • likewise for some other letters • \u002F (Unicode escape) • \0 (<NUL>)
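A small sketch of these escapes. It assumes a web-compatible engine and no u flag; under the u flag an escape such as \a is rejected, so that line illustrates legacy behaviour only.

```javascript
// \c + letter is a control character: \cJ (or \cj) is U+000A, the line feed.
console.log(/\cJ/.test("\n"));   // true
console.log(/\cj/.test("\n"));   // true

// \uXXXX names a code unit: U+002F is "/".
console.log(/\u002F/.test("/")); // true

// \0 matches the NUL character.
console.log(/\0/.test("\0"));    // true

// \a has no special meaning, so (without the u flag) it matches a plain "a".
console.log(/\a/.test("a"));     // true
```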

  23. Character Class • [] • [abc] = {"a","b","c"} • [a-c] = {"a","b","c"} • [-ca] where - is literal • [ac-] where - is literal • [^a-c] = alphabet \ [a-c]
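To make the set reading concrete, the sketch below filters a tiny sample alphabet through each class. The helper inClass and the five-character alphabet are assumptions made for the example.

```javascript
// Which characters of a small sample alphabet belong to each class?
// `inClass` is a hypothetical helper written for this example.
const inClass = (re) => ["a", "b", "c", "d", "-"].filter((ch) => re.test(ch));

console.log(inClass(/^[abc]$/));   // [ 'a', 'b', 'c' ]
console.log(inClass(/^[a-c]$/));   // [ 'a', 'b', 'c' ]
console.log(inClass(/^[-ca]$/));   // [ 'a', 'c', '-' ]  (leading - is literal)
console.log(inClass(/^[ac-]$/));   // [ 'a', 'c', '-' ]  (trailing - is literal)
console.log(inClass(/^[^a-c]$/));  // [ 'd', '-' ]       (complement)
```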

  24. Escape Class • [\b] = {backspace} • [\]] = {"]"} • [\B] error • [\1] error • Outside a class, \1 refers to the first captured group
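A hedged sketch of escapes inside a class. The two error cases are checked with the u flag, where the grammar is strict; without that flag, web-compatible engines may accept [\B] and [\1] as legacy escapes, so the behaviour can differ from the slide. The helper compiles is hypothetical.

```javascript
// Reports whether a pattern compiles with the given flags (hypothetical helper).
const compiles = (pattern, flags) => {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (e) {
    return false;
  }
};

// Inside a class, \b is the backspace character (U+0008), not a word boundary.
console.log(/[\b]/.test("\b"));       // true
console.log(/[\b]/.test("b"));        // false

// \] lets a class contain a literal "]".
console.log(/[\]]/.test("]"));        // true

// With the u flag, \B and \1 are rejected inside a class.
console.log(compiles("[\\B]", "u"));  // false
console.log(compiles("[\\1]", "u"));  // false
```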

  25. . Any char except newline • \d digit • \D not digit • \w word char • \W not word char • \s whitespace • \S not whitespace
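These shorthand classes can be tried against a short sample string. The string below is invented for the example and contains a letter, a digit, an underscore, three whitespace characters and a hyphen.

```javascript
const sample = "a1_ \t\n-";

console.log(sample.match(/\d/g));        // [ '1' ]
console.log(sample.match(/\D/g).length); // 6
console.log(sample.match(/\w/g));        // [ 'a', '1', '_' ]
console.log(sample.match(/\W/g).length); // 4  (space, tab, newline, hyphen)
console.log(sample.match(/\s/g).length); // 3  (space, tab, newline)
console.log(sample.match(/\S/g));        // [ 'a', '1', '_', '-' ]

// . matches any character except a line terminator such as "\n".
console.log(/./.test("\n"));             // false
console.log("a\nb".match(/./g));         // [ 'a', 'b' ]
```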

  26. Back reference • \1 • \0 • is <NUL>, not a back reference • \1000000000 • Error if there are not that many capture groups.
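A back reference matches the same text that the numbered group captured, not the group's pattern again. The doubled-word pattern below is an invented example.

```javascript
// \1 must repeat exactly what group 1 captured.
console.log(/(a|b)\1/.test("bb"));  // true
console.log(/(a|b)\1/.test("ab"));  // false ("b" is not the captured "a")

// Finding a doubled word.
const doubled = /\b(\w+) \1\b/;
console.log(doubled.exec("it is is so")[1]);  // is
console.log(doubled.test("it is so"));        // false
```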

  27. Literal

  28. / … /gim • g for global • i for case-insensitive • m for multiline
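The effect of each flag can be seen on small inputs; the strings below are made up for illustration.

```javascript
// g: find every match instead of stopping at the first.
console.log("aAa".match(/a/).length);  // 1
console.log("aAa".match(/a/g));        // [ 'a', 'a' ]

// i: ignore case.
console.log("aAa".match(/a/gi));       // [ 'a', 'A', 'a' ]

// m: ^ and $ also match at line breaks inside the input.
console.log(/^y$/.test("x\ny"));       // false
console.log(/^y$/m.test("x\ny"));      // true
```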

  29. Note: • // is parsed as the start of a comment, not as an empty regex literal • Use /(?:)/ for the empty pattern
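The same convention is visible when an empty pattern is built through the constructor: the engine reports its source as (?:) so that the printed form is still a valid literal. A short sketch:

```javascript
// An empty pattern built with the constructor.
const emptyPattern = new RegExp("");

console.log(emptyPattern.source);   // (?:)
console.log(String(emptyPattern));  // /(?:)/

// The empty pattern matches the empty string, so it matches at every position.
console.log(/(?:)/.test("abc"));    // true
console.log("abc".split(/(?:)/));   // [ 'a', 'b', 'c' ]
```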

  30. RegExp

  31. RegExp. • RegExp is a function • Can construct Regular Expressions • RegExp(pattern, flags) • new RegExp(pattern, flags) • RegExp.prototype

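  32. RegExp.prototype. • constructor • exec • Returns the match as an array (or null if there is no match) • Captures are ordered by the position of their opening ( • Index 0 holds the whole match, as if an implicit () surrounded the whole pattern • test • Returns a boolean • toString • Returns a string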
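A sketch of these prototype members on one pattern with nested groups, chosen to show that captures are numbered by the position of their opening parenthesis. The pattern and input are invented for the example.

```javascript
const re = new RegExp("((a)(b))(c)", "i");

// exec returns an array: index 0 is the whole match, then groups in order of "(".
const m = re.exec("xABC");
console.log(m[0], m[1], m[2], m[3], m[4]);  // ABC AB A B C
console.log(m.index);                       // 1

// No match gives null.
console.log(re.exec("zzz"));                // null

// test returns a boolean; toString returns the literal form.
console.log(re.test("abc"));                // true
console.log(re.toString());                 // /((a)(b))(c)/i

// RegExp can be called with or without new.
console.log(RegExp("a", "g") instanceof RegExp);    // true
console.log(re.constructor === RegExp);             // true
```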

  33. Members of RegExp instance • source • global • ignoreCase • multiline • lastIndex integer • { [[Writable]]: true, • [[Enumerable]]: false, • [[Configurable]]: false }.
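The instance members can be inspected directly. With the g flag, exec starts searching at lastIndex and updates it after each match; because lastIndex is writable, it can also be set by hand. The input "banana" is an invented example.

```javascript
const re = /a/g;
console.log(re.source, re.global, re.ignoreCase, re.multiline, re.lastIndex);
// a true false false 0

// Each successful exec moves lastIndex just past the match.
const positions = [];
let m;
while ((m = re.exec("banana")) !== null) {
  positions.push([m.index, re.lastIndex]);
}
console.log(positions);     // [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ] ]
console.log(re.lastIndex);  // 0  (reset after the failed final exec)

// lastIndex is writable: the next search starts from index 4.
re.lastIndex = 4;
console.log(re.exec("banana").index);  // 5
```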

  34. <ZWNJ> and <ZWJ> are format-control characters that are used to make necessary distinctions when forming words or phrases in certain languages.

  35. The Unicode format-control characters (i.e., the characters in category "Cf" in the Unicode Character Database such as LEFT-TO-RIGHT MARK or RIGHT-TO-LEFT MARK) are control codes used to control the formatting of a range of text in the absence of higher-level protocols for this (such as mark-up languages). • All format control characters may be used within comments, and within string literals and regular expression literals. • In ECMAScript source text, <ZWNJ> and <ZWJ> may also be used in an identifier after the first character.

  36. <BOM> is a format-control character used primarily at the start of a text to mark it as Unicode and to allow detection of the text's encoding and byte order. <BOM> characters intended for this purpose can sometimes also appear after the start of a text, for example as a result of concatenating files. <BOM> characters are treated as white space characters (see 7.2).

  37. The special treatment of certain format-control characters outside of comments, string literals, and regular expression literals is summarised in Table 1. • Table 1: Format-Control Character Usage • Code Unit Value | Name | Formal Name | Usage • \u200C | Zero width non-joiner | <ZWNJ> | IdentifierPart • \u200D | Zero width joiner | <ZWJ> | IdentifierPart • \uFEFF | Byte Order Mark | <BOM> | Whitespace
