Regular Expressions

Michael Smith. Regular Expressions. Not regular facial expressions. Regular expressions help us find the information we want. They are incredibly powerful. They are vital to the field of bioinformatics. Computer scientists use them nearly every day as a filter.

Presentation Transcript


  1. Michael Smith Regular Expressions

  2. Not regular facial expressions

  3. Regular expressions help us find the information we want • They are incredibly powerful • They are vital to the field of bioinformatics

  4. Computer scientists use them nearly every day as a filter • You use regular expressions every day too • Humans are good at matching small patterns, but computers excel at matching against large text files, especially from databases

  5. ‘grep’ made regular expressions popular • It takes a bunch of text and prints lines with matching regular expressions
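To make the idea concrete, here is a minimal Perl sketch of what grep does: keep only the lines that match a pattern. The helper name `grep_lines` and the sample lines are made up for illustration; this is not how grep itself is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that match a pattern,
# which is what grep would print.
sub grep_lines {
    my ($pattern, @lines) = @_;
    return grep { /$pattern/ } @lines;
}

# Made-up input lines, for illustration only.
my @text = (
    "ATGGCGTAA start of a gene",
    "no sequence on this line",
    "TTATGCCC another fragment",
);

# Keep only the lines that contain the start codon ATG.
my @hits = grep_lines('ATG', @text);
print "$_\n" for @hits;
```

The match is case-sensitive, so the middle line is filtered out even though it contains ordinary English text.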

  6. Fundamentals • Often shortened to ‘regex’ or ‘regexp’, or called a pattern • A string is made up of characters: numbers, letters, and symbols • A regex describes a set of strings • Regexes have metacharacters that carry special meanings

  7. Fundamentals cont… • Searching for many strings with one string • Based on 3 ideas • Repetition: an asterisk (*) indicates 0 or more repetitions of the character before it • Alternation: a pattern like (a|b) matches the string ‘a’ or ‘b’ • Concatenation: a pattern like (ab) means ‘a’ followed by ‘b’
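The three ideas can be tried out in a few lines of Perl. This is only a sketch: the variable names are made up, and the `^` and `$` anchors (which force the pattern to cover the whole string) are an addition not covered on the slide.

```perl
use strict;
use warnings;

# Repetition: 'ab*' is an 'a' followed by zero or more 'b' characters.
my $repetition = qr/^ab*$/;

# Alternation: '(a|b)' is either 'a' or 'b'.
my $alternation = qr/^(a|b)$/;

# Concatenation: 'ab' is 'a' followed directly by 'b'.
my $concatenation = qr/^ab$/;

# Report which test strings belong to the set each pattern describes.
for my $string ('a', 'b', 'ab', 'abbb') {
    printf "%-5s repetition=%-3s alternation=%-3s concatenation=%s\n",
        $string,
        ($string =~ $repetition    ? 'yes' : 'no'),
        ($string =~ $alternation   ? 'yes' : 'no'),
        ($string =~ $concatenation ? 'yes' : 'no');
}
```

Each pattern describes a set of strings: the repetition pattern accepts ‘a’, ‘ab’, and ‘abbb’, the alternation pattern accepts only ‘a’ and ‘b’, and the concatenation pattern accepts only ‘ab’.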

  8. Motifs • One of the most common tasks in bioinformatics is looking for motifs: short segments of DNA or protein of particular interest • Often the motifs we look for are not one specific sequence; they can have several variants
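As a small illustration of a motif with variants, the sketch below searches for the TATA-box consensus TATAWAW, where W stands for A or T. The choice of motif and the DNA fragment are assumptions made for this example; they do not come from the presentation.

```perl
use strict;
use warnings;

# TATA-box consensus TATAWAW, where W is A or T,
# written with a character class for each variable position.
my $tata_box = qr/TATA[AT]A[AT]/;

# Made-up DNA fragment, for illustration only.
my $dna = "GGCTATAAAAGGCCTATATATGC";

# Collect every variant of the motif with its 0-based start position.
my @found;
while ($dna =~ /($tata_box)/g) {
    push @found, [ $-[1], $1 ];
}
printf "%s at position %d\n", $_->[1], $_->[0] for @found;
```

One pattern finds both variants in the fragment, which is the point of using a regex instead of searching for each variant separately.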

  9. Motifs cont… • Motif databases have commonly been used to: • Classify proteins • Provide functional alignment • Identify structural and evolutionary relationships

  10. Perl • Perl (a programming language) has powerful text-processing features • Easily manipulates text files • For my tutorial I will be using Perl, so you need to understand some special syntax

  11. Comic by: xkcd.com

  12. Perl Syntax • ‘$’ is the symbol for a scalar. A scalar is a single value (a number, string, or reference) • ‘=~’ means “apply the operation on the right to the string in the variable on the left” and is known as the binding operator • Inside a pattern, a period (‘.’) stands for any character except a newline
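A short sketch of these three pieces of syntax working together. The variable names and strings are made up for illustration.

```perl
use strict;
use warnings;

# '$' marks a scalar: one value, here a string.
my $sequence = "GATTACA";

# '=~' binds the scalar on the left to the pattern on the right.
# Each '.' stands for any single character except a newline.
if ($sequence =~ /G.TT.CA/) {
    print "$sequence matches G.TT.CA\n";
}

# '.' does not match a newline, so this pattern fails on a
# string where a newline sits between 'GA' and 'TT'.
my $two_lines = "GA\nTT";
my $dot_matches_newline = ($two_lines =~ /GA.TT/) ? 1 : 0;
print "dot matched newline: $dot_matches_newline\n";
```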

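  13. Perl Regex Syntax • The match operator is m//. In scalar context it returns true or false • The substitution operator is s///. It changes the string in place and returns the number of substitutions made (with the ‘r’ modifier it returns the new string instead) • Regular expressions can have ‘modifiers’ that change the meaning of the expression. They come after the final slash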
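A minimal sketch of both operators with a modifier on each. The DNA string is made up for illustration. Note that in Perl, s/// without the ‘r’ modifier edits the variable in place and returns a count of the substitutions, not the new string.

```perl
use strict;
use warnings;

# Made-up DNA string, for illustration only.
my $dna = "ATGGCGATGTAA";

# m// in scalar context is true when the pattern is found.
# The 'i' modifier, written after the closing slash, ignores case.
my $has_start = ($dna =~ m/atg/i) ? 1 : 0;

# s/// changes the variable in place and returns the number of
# substitutions. The 'g' modifier replaces every match, not just the first.
my $rna   = $dna;
my $count = ($rna =~ s/T/U/g);

print "has start codon: $has_start\n";
print "replaced $count bases: $rna\n";
```

Copying `$dna` into `$rna` first keeps the original sequence untouched, since the substitution modifies whatever variable it is bound to.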

  14. Regular expressions are used in many programming languages. Because Perl handles them so elegantly, other languages have modeled their own implementations on it.

  15. What does this mean for you? • You can find patterns in large databases! • Just as Andrew’s presentation covered Biopython, there is also a BioPerl toolkit • Sequence manipulation • Accessing web databases • Parsing of the results • Open source

  16. use Bio::Perl; $seq_object = get_sequence('swiss', "ROA1_HUMAN"); • This program fetches the ROA1_HUMAN sequence from the Swiss-Prot database

  17. Available functions • get_sequence • read_sequence • read_all_sequences • new_sequence • write_sequence • translate • translate_as_string • blast_sequence • write_blast

  18. But wait, there’s more! • You don’t have to program to find useful information

  19. Database Patterns • http://expasy.org/tools/scanprosite/ • Sites like this use different regular expression ‘symbols’ than Perl, but the same concepts • One-letter codes for amino acids • The symbol ‘x’ is a wildcard • Alternation is provided by the ‘[]’ brackets • Negated alternation is provided by the ‘{}’ brackets

  20. A ‘-’ is just a separator • x(3) = x-x-x • A(2,4) = A-A or A-A-A or A-A-A-A • Example: [AC]-x-V-x(4)-{ED} • This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp} • (I imagine this means a lot more to biologists than to me)
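Because the concepts are the same, this notation can be translated into a Perl regex mechanically. The sketch below is one possible way to do it, assuming only the symbols listed on these two slides; the helper name `prosite_to_regex` and the peptide are made up, and other parts of the site's pattern syntax are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a ScanProsite-style pattern into a
# Perl regex string. Handles only x, [..], {..}, (n) and (n,m).
sub prosite_to_regex {
    my ($pattern) = @_;
    my $regex = '';
    for my $element (split /-/, $pattern) {
        # Split an element such as 'x(4)' into its symbol and optional repeat.
        my ($symbol, $repeat) = $element =~ /^([^(]+)(?:\((\d+(?:,\d+)?)\))?$/
            or die "Unrecognized pattern element: $element\n";
        if (lc($symbol) eq 'x') {
            $symbol = '.';                      # wildcard
        }
        elsif ($symbol =~ /^\{(\w+)\}$/) {
            $symbol = '[^' . $1 . ']';          # negated set
        }
        # '[..]' sets and single letters are already valid regex syntax.
        $regex .= $symbol;
        $regex .= '{' . $repeat . '}' if defined $repeat;
    }
    return $regex;
}

my $regex = prosite_to_regex('[AC]-x-V-x(4)-{ED}');
print "regex: $regex\n";

# Made-up peptide, for illustration only.
my $peptide = "MKAGVLLLLKQ";
print "match: $1\n" if $peptide =~ /($regex)/;
```

The ‘-’ separators simply disappear in the translation, because concatenation is implicit in a Perl regex.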

  21. My Query • http://en.wikipedia.org/wiki/Amino_acid • Uses the standard one-letter amino acid abbreviations • [ACG]-XXAG-V-X(4)-{AEGD} • [alanine or cysteine or glycine], any, any, alanine, glycine, valine, any, any, any, any, {not alanine, glutamic acid, glycine, or aspartic acid}
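The same query can be written by hand as a Perl regex and run against a sequence. This is a sketch only: the protein string is made up, and the code reports non-overlapping matches, which may differ from how the website scans.

```perl
use strict;
use warnings;

# Hand translation of [ACG]-XXAG-V-X(4)-{AEGD}:
#   [ACG] stays a character class, each X becomes '.',
#   X(4) becomes '.{4}', and {AEGD} becomes the negated class [^AEGD].
my $query = qr/[ACG]..AGV.{4}[^AEGD]/;

# Made-up protein sequence, for illustration only.
my $protein = "MSCLKAGVTTRRKPGQWAGVNNNNAL";

# Record each hit with its 1-based start position.
my @hits;
while ($protein =~ /($query)/g) {
    push @hits, { start => $-[1] + 1, match => $1 };
}

for my $hit (@hits) {
    my $end = $hit->{start} + length($hit->{match}) - 1;
    printf "%d-%d %s\n", $hit->{start}, $end, $hit->{match};
}
```

The sequence contains ‘AGV’ twice, but the second occurrence is followed four residues later by an alanine, which the {AEGD} element excludes, so only the first one is reported.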

  22. Results • A LOT of hits

  23. According to nature.com, the real power of databases is the ability to unearth patterns hidden across different types of data. • Databases are starting to be geared specifically toward the life sciences, with features such as: • Pattern recognition functions • Built-in BLAST search • Regular expressions for complex word-pattern matching

  24. Uses • For as long as biocomputing has been of interest, regular expressions have been used for sequence alignment • I found an article from as recently as 2007 that combines probabilities, gaps, and local optimization with regular expressions, with results comparable to CLUSTALW

  25. In Conclusion • We discussed how we use regular expressions every day • We explored their practical uses in a field like bioinformatics • We learned how to write simple programs that quickly perform tasks that are very boring for humans • You don’t have to be a computer scientist to unlock their power!

  26. Extra • http://www.ncbi.nlm.nih.gov/pubmed/19534754 - Article June 2009, Regular expression Blasting algorithm