
Introduction to Regular Expressions

Christine Moulen, MIT Libraries. ELUNA 2014.



  1. Introduction to Regular Expressions Christine Moulen MIT Libraries ELUNA 2014

  2. What is a regular expression? • Regular expressions are : • A language or syntax that lets you specify patterns for matching e.g. filenames or strings • Used to identify the files or lines you want to work with • Used inside of substitution functions to change the contents of a string

  3. Command line examples • ls 14* • * is a wildcard here, not regex • 14 followed by zero or more of any character • ls 14[0-1][0-9]* • [0-1] and [0-9] are character classes, written the same way in shell wildcards as in regex; each matches a single character from the listed range, 0 to 1 and 0 to 9, respectively • ls 14[0-1][0-9][0-3][0-9]* • 6 digits that look like a date YYMMDD, mostly
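The "mostly" in the last bullet can be checked with a short Perl sketch. The filenames and the helper name looks_like_yymmdd are made up for illustration; the pattern is the bracket part of the ls example rewritten as a regular expression, so treat it as one possible reading rather than the presenter's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a filename starts with 14 followed by
# four digits in the ranges used by 14[0-1][0-9][0-3][0-9]*.
sub looks_like_yymmdd {
    my ($name) = @_;
    return $name =~ /^14[0-1][0-9][0-3][0-9]/ ? 1 : 0;
}

# Made-up filenames for illustration only.
for my $file ('140315.log', '141231.mrc', '141939.log', '142001.log') {
    printf "%-12s %s\n", $file,
        looks_like_yymmdd($file) ? 'matches' : 'no match';
}
```

Here 141939.log matches even though month 19 and day 39 do not exist, which is why the classes only approximate a date.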

  4. More command line examples • mv [b-z]* $data_scratch • An alphabetical class, which depending on your system might match the lower case letters from b through z, OR a mix of upper and lower case: b C c D d ... Z z • grep 'MIT01$' sysnos.txt • Find lines that end ($) with MIT01 • ^ can be used to match at the beginning of a line
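The two anchors can be tried side by side in Perl. This is a minimal sketch: the three lines stand in for the contents of sysnos.txt and are invented for the example.

```perl
use strict;
use warnings;

# Made-up lines standing in for the contents of sysnos.txt.
my @lines = ('001234567MIT01', 'MIT01 001234567', '001234568MIT50');

# Same test as grep 'MIT01$': keep lines that end with MIT01.
my @ends_with = grep { /MIT01$/ } @lines;

# ^ anchors the match at the beginning of the line instead.
my @starts_with = grep { /^MIT01/ } @lines;

print "ends with MIT01:   @ends_with\n";
print "starts with MIT01: @starts_with\n";
```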

  5. UNIX/Linux editors • In vi, you can use regular expressions with the s/// substitution operator • With emacs, use M-x query-replace-regexp • Replace $ with MIT01 • Take a list of system numbers and make it valid input to an Aleph service by adding the library code to the end of each line
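In vi, replacing $ with MIT01 on every line can be typed as :%s/$/MIT01/. The same idea in Perl is sketched below; the helper name add_library_code and the sample system number are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: append a library code to one system number.
sub add_library_code {
    my ($line, $code) = @_;
    chomp $line;            # drop a trailing newline if present
    $line =~ s/$/$code/;    # $ matches the empty position at end of line
    return $line;
}

# Made-up system number for illustration.
print add_library_code("001234567\n", 'MIT01'), "\n";
```

Nothing is removed by this substitution: $ matches a position rather than a character, so the replacement text is simply inserted at the end of the line.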

  6. Matching example in Perl • Look through a MARC file in Aleph sequential format for lines with tag 260 • 001234567 260 L $$aCambridge$$bMIT Press • if ($matched =~ m/^\d{9}\s260.+/) { ... } • $matched is the while loop variable representing the line we're working on • =~ is the binding operator; it applies the matching (m), substitution (s), or transliteration (tr) operator to the variable on its left • m// is the pattern matching operator

  7. m/^\d{9}\s260.+/ • ^ start at the beginning of the line • \d Perl-speak for the digits character class • {9} a quantifier. Find exactly 9 of \d • \s Perl-speak for the whitespace char class • 260 the MARC tag I'm looking for • . any character • + a quantifier. Find 1 or more of .
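The same breakdown can be written into the pattern itself with Perl's /x modifier, which ignores whitespace inside the pattern and allows comments. This is a sketch, assuming a helper name (is_260_line) that does not appear in the slides; the sample line is the one from slide 6.

```perl
use strict;
use warnings;

# Hypothetical helper: true when a line carries MARC tag 260.
sub is_260_line {
    my ($line) = @_;
    return $line =~ m{
        ^        # start at the beginning of the line
        \d{9}    # exactly nine digits
        \s       # one whitespace character
        260      # the MARC tag
        .+       # one or more of any character
    }x ? 1 : 0;
}

# Single quotes keep the $$ subfield markers literal.
my $sample = '001234567 260 L $$aCambridge$$bMIT Press';
print is_260_line($sample) ? "tag 260 found\n" : "not a 260 line\n";
```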

  8. m/^\d{9}\s260.+/

  9. Working with MARC fields • Look for deleted records • LDR position 05 is d • $my_LDR =~ /LDR L .....d/ • Look for e-resource records • $my_245 =~ /\$\$h\[electronic resource\]/ • Look for OCLC numbers • $my_035 =~ /(\(OCoLC\)\d{8,10})/ • Note the double use of () here: the escaped \( and \) match literal parentheses, while the unescaped outer pair captures the whole number
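The two roles of parentheses in the OCLC pattern are easier to see when the capture is printed. A minimal sketch, with an invented 035 field value and a hypothetical helper name (oclc_number):

```perl
use strict;
use warnings;

# Hypothetical helper: return the OCLC number with its prefix, or undef.
sub oclc_number {
    my ($field) = @_;
    # \( and \) match literal parentheses; the outer ( ) captures into $1.
    return $field =~ /(\(OCoLC\)\d{8,10})/ ? $1 : undef;
}

# Made-up field content; single quotes keep $$ literal.
my $my_035 = '$$a(OCoLC)123456789';
my $found  = oclc_number($my_035);
print defined $found ? "captured: $found\n" : "no OCLC number\n";
```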

  10. Counting up records at the end of a script
      if ($hash{$tmp} =~ m/SKIP/ || $hash{$tmp} =~ m/NEW/) {
          $new_count++ if (m/ FMT L /);
          $skip_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP/);
          $bre_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Brief/);
          $bks_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP Books24x7/);
          $eebo_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EEBO/);
          $epda_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP EPDA/);
          $sta_count++ if (m/ FMT L / && $hash{$tmp} =~ m/SKIP STA/);
      }

  11. Substitution example in Perl • We have a browse index of URLs • An Aleph browse index only sorts the first 69 characters of the field • When we have many URLs from the same site, we need to get the unique part closer to the beginning • Following is an SFX OpenURL from the MARCit! service

  12. This OpenURL ... • http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&rft.object_id=3710000000092335&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

  13. ... becomes this. • http://owens.mit.edu/sfx_local?rft.object_id=3710000000092335&url_ver=Z39.88-2004&ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&rfr_id=info:sid/sfxit.com:opac_856&url_ctx_fmt=info:ofi/fmt:kev:mtx:ctx&sfx.ignore_date_threshold=1&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&

  14. The substitution expression • $my_856 =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/; • s is the substitution operator • s/pattern/replacement/: whatever matches the pattern is replaced by the replacement • Parentheses are used here to group different sections of the pattern, which can then be rearranged as $1, $2, $3, $4

  15. s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/

  16. s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/ • Now change the order from $1$2$3$4 to $1$3$2$4
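To see the four groups move, the substitution can be run on a sample URL. This is a sketch under stated assumptions: the helper name move_object_id_first is ours, and the URL is a shortened version of the OpenURL from slide 12.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the slides.
sub move_object_id_first {
    my ($url) = @_;
    # $1 = base URL up to the question mark
    # $2 = parameters that come before rft.object_id
    # $3 = the rft.object_id parameter itself
    # $4 = everything after it
    $url =~ s/(^.*sfx_local\?)(.*)(rft\.object_id\=\d{1,}\&)(.*$)/$1$3$2$4/;
    return $url;
}

# Shortened version of the OpenURL from slide 12, for readability.
my $before = 'http://owens.mit.edu/sfx_local?url_ver=Z39.88-2004'
           . '&sfx.ignore_date_threshold=1'
           . '&rft.object_id=3710000000092335'
           . '&svc_val_fmt=info:ofi/fmt:kev:mtx:sch_svc&';

print move_object_id_first($before), "\n";
```

If a URL has no rft.object_id parameter, the pattern does not match and the string is returned unchanged.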

  17. Parsing thesis notes • Thesis degree, year, and department are stored in a single free text MARC field 502 • We have applied some structure to this, but it has varied over time • In DSpace, we want to get these 3 bits into separate fields, so the note is parsed on the way from MARC to Dublin Core

  18. Parsing thesis notes • $MIT = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.'; • ? is the zero or one quantifier. • | match the pattern alternative before or after this • $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f'; • A few small character classes, to allow for case variation, and Department vs Dept.

  19. Parsing thesis notes • $Month = 'January|February|March|April|May|June|July|August|September|October|November|December'; • match any one month name when $Month is used inside a pattern

  20. Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering • /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o

  21. Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. Dept. of Mechanical Engineering • /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o
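Putting the $MIT and $Dept building blocks together with the pattern above gives the three pieces DSpace needs. The sketch below assumes a helper name (parse_thesis_note) and a return order that are not in the slides; the pattern and the sample note are copied from them.

```perl
use strict;
use warnings;

# Building blocks from the slides; single quotes keep the backslashes.
my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# Hypothetical helper: returns (year, degree, department),
# or an empty list when the note does not fit this pattern.
sub parse_thesis_note {
    my ($note) = @_;
    if ($note =~ /^Thesis\.\s+(\d+)\.?\s+([\w\.\s]+)--($MIT)\.?\s+($Dept)?\s*(.+)$/o) {
        return ($1, $2, $5);
    }
    return;
}

my ($year, $degree, $dept) = parse_thesis_note(
    'Thesis. 1975. Sc.D.--Massachusetts Institute of Technology. '
  . 'Dept. of Mechanical Engineering'
);
print "year=$year degree=$degree dept=$dept\n";
```

Groups $3 and $4 hold the institution and the "Dept. of" prefix; they are matched so they can be skipped, which is why only $1, $2, and $5 are returned here.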

  22. More thesis examples • Massachusetts Institute of Technology. Dept. of Economics. Thesis. 1968. Ph.D. • Massachusetts Institute of Technology, Dept. of Civil Engineering, Thesis. 1965. Sc. D. • /^($MIT)(\.|,)?\s+($Dept)?\s*([\w\s\.,]+)\s+Thesis.\s*(\d{4})\.?\s*(.*)$/o

  23. More thesis examples • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1973. • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics an Astronautics. • Thesis. (M.S.)--Sloan School of Management, 1983. • Thesis (Sc. D.)--Massachusetts Institute of Technology, Dept. of Mechanical Engineering, 1951. • Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Linguistics and Philosophy, February 2004. • /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*(.*)(,\s+(\d{4}))?\.?$/o
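One thing worth checking when adapting this pattern: (.*) is greedy, so it can take the trailing year with it and leave the optional year group empty. The slides do not show how the captured groups are used afterwards, so this may be handled downstream. The sketch below compares the pattern as written with a variant that uses (.*?); that variant, the helper name dept_and_year, and its $lazy switch are our assumptions, not part of the presentation.

```perl
use strict;
use warnings;

my $MIT  = 'Massachusetts Institute of Technology\.?|M\.\s?I\.\s?T\.';
my $Dept = '[Dd]epartment\s[Oo]f|[dD]ept\.\s+[Oo]f';

# Hypothetical helper: returns (department text, year) from a note.
# With $lazy true, the only change is (.*) becoming (.*?).
# No /o modifier here, because the pattern varies between calls.
sub dept_and_year {
    my ($note, $lazy) = @_;
    my $rest = $lazy ? '(.*?)' : '(.*)';
    if ($note =~ /^Thesis\.?\s*\(([^\)]*)\)(\s*--?\s*|\s+)?(($MIT)[\.,]?)?\s*($Dept)?\s*${rest}(,\s+(\d{4}))?\.?$/) {
        return ($6, $8);
    }
    return;
}

my $note = 'Thesis (Sc. D.)--Massachusetts Institute of Technology, '
         . 'Dept. of Mechanical Engineering, 1951.';

for my $lazy (0, 1) {
    my ($dept, $year) = dept_and_year($note, $lazy);
    printf "%-6s dept=[%s] year=[%s]\n",
        $lazy ? 'lazy' : 'greedy', $dept, $year // 'none';
}
```

With the greedy form the department text comes back as "Mechanical Engineering, 1951." and the year group is undefined; with the lazy form the two are separated. A note ending in "February 2004." still keeps the month and year inside the department text in both forms.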

  24. More thesis examples • Thesis (Ph. D.)--Joint Program in Oceanography/Applied Ocean Science and Engineering (Massachusetts Institute of Technology, Dept. of Earth, Atmospheric, and Planetary Sciences; and the Woods Hole Oceanographic Institution), 2013. • /^Thesis\.?\s*\(([^\)]*)\)(\s*--(Joint Program in ([\w\.\s]+)\((($MIT)[\.,]?)?\s*($Dept)?\s*([\w,;\s]+)\)))(,\s+(\d{4}))?\.?$/o

  25. Questions? orbitee@mit.edu
