Regular Expressions & Pattern Matching

James Wasmuth, University of Edinburgh, james.wasmuth@ed.ac.uk. Pattern matching means searching for a specified pattern within a string, for example a sequence motif, the accession number of a sequence, parsing HTML, or validating user input.


Presentation Transcript


  1. James Wasmuth, University of Edinburgh, james.wasmuth@ed.ac.uk. Regular Expressions & Pattern Matching

  2. Definitions
     • Pattern match: searching for a specified pattern within a string. For example:
         • a sequence motif,
         • the accession number of a sequence,
         • parsing HTML,
         • validating user input.
     • Regular expression (regex): the notation used to describe the pattern to match.

  3. Regular Expressions
     • Effectively a separate small language for describing patterns.
     • Available in most popular languages, usually as a separate library.
     • In Perl, regexes are fully incorporated into the core language syntax.

  4. How Regex Work
     [Diagram labels: regex code, Perl compiler, regex engine, input data (e.g. sequence file), output]
     Overview:
     • how to create regular expressions
     • how to use them to match and extract data
     • biological context

  5. Simple Patterns
     • Place the regex between a pair of forward slashes ( / / ).
     • Try:
         #!/usr/bin/perl
         # Read lines typed on the terminal and report those containing 'abc'.
         while (<STDIN>) {
             if (/abc/) {
                 print ">> found 'abc' in $_\n";
             }
         }
     • Save, then run the program. Type something on the terminal, then press return. Press Ctrl+C to exit the script.
     • If you type anything containing 'abc', the print statement is executed.

  6. Binding Operator
     • The previous example matched against $_.
     • Want to match against a scalar variable?
     • The binding operator "=~" matches the pattern on its right against the string on its left.
     • The m operator is usually added for clarity of code:
         $string =~ m/pattern/
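As a minimal sketch of the binding operator: the strings and variable names below are made up for illustration, and the negated form `!~` is standard Perl rather than something the slides cover.

```perl
use strict;
use warnings;

# Hypothetical example string, not taken from the slides.
my $string = 'Caenorhabditis elegans';

# Explicit binding: test $string rather than $_.
my $bound = ($string =~ m/elegans/) ? 1 : 0;

# Negated binding: true when the pattern does NOT match.
my $not_bound = ($string !~ m/briggsae/) ? 1 : 0;

# Implicit match against $_ (set here by the for loop).
my $implicit = 0;
for ('abcdef') {
    $implicit = 1 if /abc/;
}

print "bound=$bound not_bound=$not_bound implicit=$implicit\n";
```

All three flags end up set to 1, since the first and last patterns are found and the second is absent.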

  7. Simple Patterns (2)
     • Can also read files and pattern match using I/O.
     • Try:
         #!/usr/bin/perl
         # Print every line of the file that mentions 'elegans'.
         open IN, "<genomes_desc.txt" or die "Cannot open genomes_desc.txt: $!";
         while ($line = <IN>) {
             if ($line =~ m/elegans/) {    # true if it finds 'elegans'
                 print $line;
             }
         }
         close IN;

  8. Flexible matching
     • Within a regex, many characters have special meanings: these are metacharacters.
     • Star (*) matches zero or more instances of the preceding character.
         /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
     • Plus (+) matches at least one instance.
         /ab+c/ => 'a' followed by one or more 'b' followed by 'c'
     • Question mark (?) matches zero or one instance.
         /ab?c/ => 'a' followed by zero or one 'b' followed by 'c'
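The three quantifiers above can be compared side by side. This is a small sketch; the helper name `matching_patterns` and the test strings are invented for illustration.

```perl
use strict;
use warnings;

# Return a comma-separated list of the slide's patterns that match $text.
sub matching_patterns {
    my ($text) = @_;
    my @hits;
    push @hits, 'ab*c' if $text =~ m/ab*c/;   # zero or more 'b'
    push @hits, 'ab+c' if $text =~ m/ab+c/;   # one or more 'b'
    push @hits, 'ab?c' if $text =~ m/ab?c/;   # zero or one 'b'
    return join ',', @hits;
}

for my $text ('ac', 'abc', 'abbbc') {
    printf "%-6s matches: %s\n", $text, matching_patterns($text);
}
```

Here 'ac' is matched by `ab*c` and `ab?c` only, 'abc' by all three, and 'abbbc' by `ab*c` and `ab+c` only.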

  9. More Flexibility
     • Match a character a specific number, or range, of instances.
     • {x} matches exactly x instances.
         /ab{3}c/ => abbbc
     • {x,y} matches between x and y instances.
         /a{2,4}bc/ => aabc or aaabc or aaaabc
     • {x,} matches x or more instances.
         /abc{3,}/ => abccc or abccccccc or abcccccccc
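A short sketch of the counted quantifiers, with made-up test strings. The `hit` helper and the use of `qr//` to store patterns are illustrative choices, not from the slides. Note that these patterns are unanchored, so a longer run of 'a' still contains a match for `a{2,4}bc`.

```perl
use strict;
use warnings;

# Return 1 if $text matches $regex, otherwise 0.
sub hit {
    my ($text, $regex) = @_;
    return $text =~ $regex ? 1 : 0;
}

my @checks = (
    [ 'abbc',     qr/ab{3}c/   ],   # 0: only two 'b'
    [ 'abbbc',    qr/ab{3}c/   ],   # 1: exactly three 'b'
    [ 'abc',      qr/a{2,4}bc/ ],   # 0: only one 'a'
    [ 'aaabc',    qr/a{2,4}bc/ ],   # 1: three 'a'
    [ 'aaaaaabc', qr/a{2,4}bc/ ],   # 1: the last four 'a' plus 'bc' match
    [ 'abcc',     qr/abc{3,}/  ],   # 0: only two 'c'
    [ 'abccccc',  qr/abc{3,}/  ],   # 1: three or more 'c'
);

for my $check (@checks) {
    my ($text, $regex) = @$check;
    print "$text: ", hit($text, $regex), "\n";
}
```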

  10. More Flexibility
     • Dot (.) is a wildcard character: it matches any character except a newline (\n).
         /a.c/ => 'a' followed by any character followed by 'c'
     • Combine metacharacters:
         /a.{4}c/ => 'a' followed by four instances of any character followed by 'c'
       so it will match addddc, afgthc, ab569c

  11. Escaping Metacharacters
     • To use a *, +, ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
         /C\. elegans/ => C. elegans only
         /C. elegans/ => will also match Ca elegans, Cb elegans, C3 elegans, C> elegans, etc.
     • The 'delimiter' of the regex, forward slash '/', and the 'escape' character, backslash '\', are also metacharacters. These need to be escaped if required in the regex.
     • Important when trying to match URLs and email addresses.
         /joe\.bloggs\@darwin\.co\.uk/
         /www\.envgen\.nox\.ac\.uk\/biolinux\.html/
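The difference between an escaped and an unescaped dot can be sketched as follows. The second species name is deliberately made up. The built-in `quotemeta` function, which the slides do not mention, is shown as one way to escape a literal string without writing the backslashes by hand.

```perl
use strict;
use warnings;

# 'Cx elegans' is an invented string used only to show the loose match.
my @names = ('C. elegans', 'Cx elegans');

# Unescaped dot: any character is accepted in that position.
my @loose = grep { m/C. elegans/ } @names;

# Escaped dot: only a literal full stop is accepted.
my @strict = grep { m/C\. elegans/ } @names;

# quotemeta() backslash-escapes every non-word character in a string.
my $literal = 'www.envgen.nox.ac.uk/biolinux.html';
my $escaped = quotemeta $literal;
my $found   = ('http://www.envgen.nox.ac.uk/biolinux.html' =~ m/$escaped/) ? 1 : 0;

print scalar(@loose), " loose, ", scalar(@strict), " strict, found=$found\n";
```

The loose pattern accepts both names, the strict pattern accepts only 'C. elegans', and the escaped URL is found inside the longer string.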

  12. Finding Sequence Identifiers
     • The file nemaglobins.embl contains EMBL database entries for globins of the phylum Nematoda. Write a script that counts the number of entries.
     • Try:
         #!/usr/bin/perl
         # Count the accession (AC) lines, one per entry.
         $count = 0;
         open IN, "<nemaglobins.embl" or die;
         while ($line = <IN>) {
             if ($line =~ m/AC   .*/) {    # that's three spaces
                 $count++;
             }
         }
         print "total=$count\n";

  13. Grouping Patterns
     • So far, metacharacters have been applied to a single character.
     • Patterns can be grouped by placing them within parentheses "()".
     • Powerful when coupled with quantifiers.
         /MLSTSTG+/ => MLSTSTGGGGGGGGG…
         /MLS(TSTG)+/ => MLSTSTGTSTGTSTG…TSTG
         /ML(ST){2}G/ => MLSTSTG
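To see what each grouped pattern actually consumes, the sketch below wraps the whole pattern in an extra pair of parentheses and returns the matched text (captures are explained under Memory Variables). The helper name `matched_text` and the two test sequences are invented for illustration.

```perl
use strict;
use warnings;

# Return the part of $text matched by $regex, or '' if there is no match.
sub matched_text {
    my ($text, $regex) = @_;
    return $text =~ m/($regex)/ ? $1 : '';
}

my $repeated_unit = 'MLSTSTGTSTGTSTG';   # the unit TSTG repeated three times
my $repeated_g    = 'MLSTSTGGGG';        # only the final G repeated

# The quantifier applies to the single character G.
print matched_text($repeated_unit, qr/MLSTSTG+/), "\n";    # MLSTSTG
print matched_text($repeated_g,    qr/MLSTSTG+/), "\n";    # MLSTSTGGGG

# The quantifier applies to the whole group TSTG.
print matched_text($repeated_unit, qr/MLS(TSTG)+/), "\n";  # MLSTSTGTSTGTSTG
print matched_text($repeated_g,    qr/MLS(TSTG)+/), "\n";  # MLSTSTG
```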

  14. Alternative Matching
     • Match this or this.
     • There are two ways, depending on the nature of the pattern.
     • 1) Use a vertical bar '|':
         • matches if either the left side or the right side matches,
         • /(human|mouse|rat)/ => any string containing human or mouse or rat.

  15. 2) Use a character class: a list of characters within '[]'. It matches any single character in the class.
     • /[wxyz1234\t]/ => any of the nine.
     • A range can be specified with '-'.
         /[w-z1-4\t]/ => as above
     • To match a literal hyphen, place it first (or last) in the class, or escape it.
         /[-a-zA-Z]/ => any letter or a hyphen
     • Negate a class with '^' as its first character.
         /[^z]/ => any character except z
         /[^abc]/ => any character except a or b or c
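Both forms of alternative matching can be sketched with two small checks. The function names and test strings are hypothetical. The first check uses a negated class to spot characters other than A, C, G or T. The second shows that an unanchored alternation also matches inside longer words.

```perl
use strict;
use warnings;

# True (1) if the sequence contains any character other than A, C, G or T.
sub has_non_dna {
    my ($seq) = @_;
    return $seq =~ m/[^ACGT]/ ? 1 : 0;
}

# True (1) if the text contains 'human', 'mouse' or 'rat' anywhere.
sub mentions_species {
    my ($text) = @_;
    return $text =~ m/(human|mouse|rat)/ ? 1 : 0;
}

print has_non_dna('ACGTACGT'), "\n";            # 0
print has_non_dna('ACGNACGT'), "\n";            # 1: 'N' is outside the class
print mentions_species('mouse liver'), "\n";    # 1
print mentions_species('strategy'), "\n";       # 1: 'rat' occurs inside the word
print mentions_species('zebrafish'), "\n";      # 0
```

The 'strategy' case is a reminder that alternation matches substrings unless the pattern is anchored or otherwise constrained.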

  16. Revisiting EMBL file
     • Want to find the number of globins from Ascaris and Toxocara.
         #!/usr/bin/perl
         # Count organism (OS) lines naming either genus.
         $count = 0;
         open IN, "<nemaglobins.embl" or die;
         while ($line = <IN>) {
             if ($line =~ m/OS   (Ascaris|Toxocara)/) {    # three spaces, as on AC lines
                 $count++;
             }
         }
         print "Found $count globins from Ascaris or Toxocara\n";

  17. Shortcuts
     • \d => any digit [0-9]
     • \w => any "word" character [A-Za-z0-9_]
     • \s => any white space [\t\n\r\f ]
     • \D => any character except a digit [^\d]
     • \W => any character except a "word" character [^\w]
     • \S => any character except white space [^\s]
     • Any of these can be used in conjunction with quantifiers:
         /\s*/ => any amount of white space

  18. Anchoring a Pattern
     • /pattern/ will match anywhere in the string.
     • Anchors hold the pattern to a point in the string.
     • Caret "^" (Shift+6) marks the beginning of the string, while dollar "$" marks the end of the string.
         /^elegans/ => elegans only at the start of the string. Not C. elegans.
         /Canis$/ => Canis only at the end of the string. Not Canis lupus.
         /^\s*$/ => a blank line.
     • '$' also matches just before a trailing newline character '\n', so lines read from a file need not be chomped first.
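Shortcuts and anchors combine naturally when deciding what kind of line has been read. The following is a sketch only: the `classify` function and its category names are invented for this example.

```perl
use strict;
use warnings;

# Classify a line using anchored patterns built from the shortcuts.
sub classify {
    my ($line) = @_;
    return 'blank'  if $line =~ m/^\s*$/;   # nothing but white space
    return 'number' if $line =~ m/^\d+$/;   # digits only
    return 'word'   if $line =~ m/^\w+$/;   # word characters only
    return 'other';
}

# Each test line keeps its trailing newline, as when read from a file.
for my $line ("  \t\n", "1301\n", "gene_7\n", "C. elegans\n") {
    print classify($line), "\n";
}
# prints: blank, number, word, other (one per line)
```

Because `$` matches before a trailing newline, "1301\n" is still classified as a number. The order of the tests matters, since a string of digits would also satisfy `^\w+$`.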

  19. Memory Variables
     • Sections of the matched text can be extracted and stored in variables.
     • The text matched by each part of the pattern within parentheses '()' is stored in a special variable.
     • The first group is stored in $1, the second in $2, the fourth in $4…
     • Extract from file:
         Organism: Homo sapiens
     • From Perl script:
         if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
             $genus = $1;      # Homo
             $species = $2;    # sapiens
         }
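An alternative to reading $1 and $2 is to evaluate the match in list context, which returns the captured groups directly. The sketch below is not from the slides, and `parse_organism` is a hypothetical helper name. A practical reason to prefer one of these two guarded forms is that $1 and $2 keep their old values after a failed match, so they should only be read once the match is known to have succeeded.

```perl
use strict;
use warnings;

# Return (genus, species) from an 'Organism:' line.
# On a failed match both values are undef.
sub parse_organism {
    my ($line) = @_;
    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list.
    my ($genus, $species) = $line =~ m/Organism:\s(\w+)\s(\w+)/;
    return ($genus, $species);
}

my ($genus, $species) = parse_organism('Organism: Homo sapiens');
print "genus=$genus species=$species\n";    # genus=Homo species=sapiens

my ($g, $s) = parse_organism('no organism on this line');
print "no match\n" unless defined $g;
```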

  20. Revisiting EMBL File (again)
     • Use shortcuts and anchors to find what you want.
     • Previously:
         if ($line =~ m/AC   .*/) {    # found lots
     • Try:
         if ($line =~ m/^AC\s{3}([.\w]+)\s*/) {
             $accession = $1;    # info stored to use later
         }
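The anchored accession pattern can be tried without the course file. In this sketch the EMBL-style records are invented (they are not real database entries), and the data is read through an in-memory filehandle so the script is self-contained. The variable names are illustrative.

```perl
use strict;
use warnings;

# Invented EMBL-style lines for illustration only.
my $embl = <<'END';
ID   EXAMPLE1; invented record
AC   XX000001;
OS   Ascaris suum
//
ID   EXAMPLE2; invented record
AC   XX000002;
OS   Toxocara canis
//
ID   EXAMPLE3; invented record
AC   XX000003;
OS   Caenorhabditis elegans
//
END

# Open the string as if it were a file.
open my $in, '<', \$embl or die "Cannot open in-memory data: $!";

my @accessions;
my $genus_count = 0;
while (my $line = <$in>) {
    # Anchored accession line: capture the identifier, not the ';'.
    if ($line =~ m/^AC\s{3}([.\w]+)/) {
        push @accessions, $1;
    }
    # Anchored organism line for either genus.
    $genus_count++ if $line =~ m/^OS\s{3}(Ascaris|Toxocara)/;
}
close $in;

print "accessions: @accessions\n";
print "Ascaris or Toxocara entries: $genus_count\n";
```

With a real file, the in-memory handle would be replaced by an ordinary `open` on the file name.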

  21. Substitutions
     • Match a pattern within a string and replace it with another string.
     • Uses the 's' operator.
         s/abc/xyz/ => find abc and replace with xyz
     • Only the first instance of a match is replaced. Using the 'g' modifier will find and replace all.
         $line = 'abcaabbcabca';
         $line =~ s/abc/xyz/g;
         print $line;    # xyzaabbcxyza
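The effect of the 'g' modifier can be checked directly. This sketch reuses the slide's string. The variable names are made up, and it relies on the fact that `s///` returns the number of substitutions made.

```perl
use strict;
use warnings;

# Without 'g': only the first match is replaced.
my $first_only = 'abcaabbcabca';
$first_only =~ s/abc/xyz/;

# With 'g': every match is replaced, and the count is returned.
my $all = 'abcaabbcabca';
my $replaced = ($all =~ s/abc/xyz/g);

print "$first_only\n";                   # xyzaabbcabca
print "$all ($replaced replacements)\n"; # xyzaabbcxyza (2 replacements)
```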

  22. More Substitutions
     • Remove all gap characters from a multiple sequence alignment:
         $aln = 'AADG--ASD--P-GSTST';
         $aln =~ s/-//g;
         print $aln;    # AADGASDPGSTST
     • Inserting information:
         $line = 'vector:';
         $line =~ s/(vector:)/$1 M13MP7/;    # vector: M13MP7
         $name = 'Daniel';
         $name =~ s/(Daniel)/Jack $1/;       # Jack Daniel
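Captured groups in the replacement, combined with anchors and shortcuts, allow small reformatting jobs. The two helpers below are illustrative sketches with invented names: one abbreviates a binomial name and the other trims surrounding white space.

```perl
use strict;
use warnings;

# Turn 'Genus species' into 'G. species'; other strings are left unchanged.
sub abbreviate {
    my ($name) = @_;
    $name =~ s/^(\w)\w+\s(\w+)$/$1. $2/;
    return $name;
}

# Remove leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print abbreviate('Homo sapiens'), "\n";             # H. sapiens
print abbreviate('Caenorhabditis elegans'), "\n";   # C. elegans
print abbreviate('unknown'), "\n";                  # unknown (no match, unchanged)
print '[', trim("  AADG \n"), "]\n";                # [AADG]
```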

  23. Resources
     • Learning Perl (O'Reilly), Ch. 7-9
     • Regular Expression Pocket Reference (O'Reilly)
     • perldoc perlre
     • http://etext.lib.virginia.edu/helpsheets/regex.html
     • http://www.nematodes.org/~jamesw/Perl/regex
     • Mastering Regular Expressions (O'Reilly)