1 / 63

Topics

Topics. Advanced SQL for Multiple Table Databases JOINS Objected Oriented Perl BioPerl PDB BLAST. Some of the slides are based on previous material by J. Grefenstette. JOINS. Multiple Table Databases. Suppose we want to store info such as:

tevin
Download Presentation

Topics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Topics • Advanced SQL for Multiple Table Databases • JOINS • Objected Oriented Perl • BioPerl • PDB • BLAST Some of the slides are based on previous material by J. Grefenstette BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  2. JOINS Multiple Table Databases Suppose we want to store info such as: +----------+----------+----------------+-----------+------------+ | name | city | country | city size | population | +----------+----------+----------------+-----------+------------+ | Wu | Shanghai | China | 17500000 | 1321851888 | | Lakshmi | Bombay | India | 21600000 | 1129866154 | | Prabhu | Delhi | India | 21500000 | 1129866154 | | Susan | New York | United States | 21900000 | 301139947 | | Richard | Chicago | United States | 9800000 | 301139947 | | Robert | London | United Kingdom | 12000000 | 60776238 | | Ali | London | United Kingdom | 12000000 | 60776238 | | Sohyoung | Seoul | South Korea | 23400000 | 49044790 | +----------+----------+----------------+-----------+------------+ Problem: • When population of city or country changes, we have to update many records • data may become inconsistent if we update, say, the city size for one person! Solution: • Design Database so that information is only stored in one place • called Normalizing the Database • Uses several tables to eliminate many-to-one relationships within a table • Queries can join several tables together BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  3. JOINS Multiple Table Databases The PERSON table: mysql> select * from person; +----+----------+----------+ | id | name | city | +----+----------+----------+ | 1 | Susan | New York | | 2 | Robert | London | | 3 | Ali | London | | 4 | Richard | Chicago | | 5 | Wu | Shanghai | | 6 | Lakshmi | Bombay | | 7 | Sohyoung | Seoul | | 8 | Prabhu | Delhi | +----+----------+----------+ 8 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  4. JOINS Multiple Table Databases The CITY table: mysql> select * from city limit 10; +-------------+---------------+------------+ | name | country | population | +-------------+---------------+------------+ | Tokyo | Japan | 33600000 | | Seoul | South Korea | 23400000 | | Mexico City | Mexico | 22400000 | | New York | United States | 21900000 | | Bombay | India | 21600000 | | Delhi | India | 21500000 | | Sao Paulo | Brazil | 20600000 | | Los Angeles | United States | 18000000 | | Shanghai | China | 17500000 | | Osaka | Japan | 16700000 | +-------------+---------------+------------+ 10 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  5. JOINS Multiple Table Databases The COUNTRY table: mysql> select * from country limit 10; +---------------+------------+ | name | population | +---------------+------------+ | China | 1321851888 | | India | 1129866154 | | United States | 301139947 | | Indonesia | 234693997 | | Brazil | 190010647 | | Pakistan | 169270617 | | Bangladesh | 150448339 | | Russia | 141377752 | | Nigeria | 135031164 | | Japan | 127467972 | +---------------+------------+ 10 rows in set (0.01 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  6. JOINS Wikipedia Definition of a JOIN • An SQL JOIN clause combines records from two or more tables in a database. It creates a set that can be saved as a table or used as is. A JOIN is a means for combining fields from two tables by using values common to each. ANSI standard SQL specifies four types of JOINs: INNER, OUTER, LEFT, and RIGHT. In special cases, a table (base table, view, or joined table) can JOIN to itself in a self-join. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  7. JOINS JOINing Two Tables To JOIN two tables, use: SELECT columns From table1, table2 WHERE join conditions mysql> select * from person, city where person.city = city.name; +----+----------+----------+----------+----------------+------------+ | id | name | city | name | country | population | +----+----------+----------+----------+----------------+------------+ | 1 | Susan | New York | New York | United States | 21900000 | | 2 | Robert | London | London | United Kingdom | 12000000 | | 3 | Ali | London | London | United Kingdom | 12000000 | | 4 | Richard | Chicago | Chicago | United States | 9800000 | | 5 | Wu | Shanghai | Shanghai | China | 17500000 | | 6 | Lakshmi | Bombay | Bombay | India | 21600000 | | 7 | Sohyoung | Seoul | Seoul | South Korea | 23400000 | | 8 | Prabhu | Delhi | Delhi | India| 21500000 | +----+----------+----------+----------+----------------+------------+ 8 rows in set (0.01 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  8. JOINS Ambiguous Column Names To name a field, use table.column if column name appears in multiple tables: mysql> select person.name, person.city, country, population from person, city where person.city = city.name; +----------+----------+----------------+------------+ | name | city | country | population | +----------+----------+----------------+------------+ | Susan | New York | United States | 21900000 | | Robert | London | United Kingdom | 12000000 | | Ali | London | United Kingdom | 12000000 | | Richard | Chicago | United States | 9800000 | | Wu | Shanghai | China | 17500000 | | Lakshmi | Bombay | India | 21600000 | | Sohyoung | Seoul | South Korea | 23400000 | | Prabhu | Delhi | India | 21500000 | +----------+----------+----------------+------------+ 8 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  9. JOINS Aliases for Column Tables You can provide an ALIAS for any column name: mysql> select person.name, person.city, city.country, city.population as 'city size' from person, city where person.city = city.name; +----------+----------+----------------+-----------+ | name | city | country | city size | +----------+----------+----------------+-----------+ | Susan | New York | United States | 21900000 | | Robert | London | United Kingdom | 12000000 | | Ali | London | United Kingdom | 12000000 | | Richard | Chicago | United States | 9800000 | | Wu | Shanghai | China | 17500000 | | Lakshmi | Bombay | India | 21600000 | | Sohyoung | Seoul | South Korea | 23400000 | | Prabhu | Delhi | India | 21500000 | +----------+----------+----------------+-----------+ 8 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  10. JOINS WHERE conditions in JOINS Use WHERE conditions to limit which records are returned. Example: show date for people from India mysql> select person.name, person.city, city.country, city.population as 'city size' from person, city where person.city = city.name and city.country = 'India'; +---------+--------+---------+-----------+ | name | city | country | city size | +---------+--------+---------+-----------+ | Lakshmi | Bombay | India | 21600000 | | Prabhu | Delhi | India | 21500000 | +---------+--------+---------+-----------+ 2 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  11. JOINS JOIN Three Tables Can JOIN three of more tables. Example: Show both country and city population for each person: mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.name; +----------+----------+----------------+-----------+------------+ | name | city | country | city size | population | +----------+----------+----------------+-----------+------------+ | Susan | New York | United States | 21900000 | 301139947 | | Robert | London | United Kingdom | 12000000 | 60776238 | | Ali | London | United Kingdom | 12000000 | 60776238 | | Richard | Chicago | United States | 9800000 | 301139947 | | Wu | Shanghai | China | 17500000 | 1321851888 | | Lakshmi | Bombay | India | 21600000 | 1129866154 | | Sohyoung | Seoul | South Korea | 23400000 | 49044790 | | Prabhu | Delhi | India | 21500000 | 1129866154 | +----------+----------+----------------+-----------+------------+ 8 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  12. JOINS ORDER BY Same as above, by sort results by country population: mysql> select person.name, person.city, city.country, city.population as 'city size', country.population from person, city, country where person.city = city.name and city.country = country.name order by country.population desc; +----------+----------+----------------+-----------+------------+ | name | city | country | city size | population | +----------+----------+----------------+-----------+------------+ | Wu | Shanghai | China | 17500000 | 1321851888 | | Lakshmi | Bombay | India | 21600000 | 1129866154 | | Prabhu | Delhi | India | 21500000 | 1129866154 | | Susan | New York | United States | 21900000 | 301139947 | | Richard | Chicago | United States | 9800000 | 301139947 | | Robert | London | United Kingdom | 12000000 | 60776238 | | Ali | London | United Kingdom | 12000000 | 60776238 | | Sohyoung | Seoul | South Korea | 23400000 | 49044790 | +----------+----------+----------------+-----------+------------+ 8 rows in set (0.00 sec) BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  13. BioPerl OOP BioPerl • Object Oriented Programming (OOP) • Intro to BioPerl • Bio::Seq • Bio::SeqIO • Bio::DB • Bio::Perl • Others BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  14. BioPerl OOP Perl Modules • A Module is a file containing Perl code • The file should have a .pm extension • Code within a module is brought into a program with the use statement: use moduledir::module; where moduledir is the directory in which the module file exists. • the moduledir can be omitted if module is in current directory or in the standard Perl directories (specified by the @INC list) • You can access data and subroutines by using “full name” use myModule; $myModule::counter = 3; myModule::translate($dna); BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  15. BioPerl OOP Object Oriented Programming • Object-oriented programming (OOP) is a software engineering technique for modularizing code • Think about a program as a set of interacting objects • Objects have attributes • data that describes the object • Objects also have actions (or methods) • code that controls how an object interacts with other objects • Main ideas: • Hide the details of an object’s data • so it can be changed easily (by the developer) • users don’t need to know excess details • All interactions are performed via the object’s methods BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  16. BioPerl OOP Basics of Perl Objects • In Procedural Programming, data and functions to operate on data that are kept separate. Example: $reverse_complement = revcom( $sequence ); • In Object-Oriented Programming, both data and functions are encapsulated into the same construct (called objects). Example: $reversed_obj = $seq_obj->revcom(); • $seq_obj is called an object, and revcom is called a method • In other words, the object $seq_obj “knows” how to calculate and return its own reverse complement. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  17. BioPerl OOP Basics of Perl Objects • Objects are separated into types called classes • The class of an object defines both the data that it can hold and the methods that it knows. • A specific object that has a defined class is referred to as an instance of that class. • Perl uses the term "module" rather than "class". • An object in Perl is a special kind of hash reference. • In Bioperl the data that the object contains is stored in a single, complex hash and the object, like $seq_obj, is a reference to this hash. In addition, the methods that the object can use are also stored in this hash as particular kinds of references. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  18. BioPerl BioPerl(www.bioperl.org) • A collection of modules that facilitates the development of Perl scripts for bioinformatics applications • Some Simple Procedural Functions • Bio::Perl • Sequences • Bio::Seq • Bio:SeqIO • Bio::PrimarySeq • Multiple Sequence Alignments • Bio::SimpleAlign • Access to Databases • Bio::DB::GenBank • Access to Tools • Bio::Tools::Blast • Bio::Tools::BPlite See Table 12.1 of Tisdall BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  19. BioPerl Methods on Bio::Seq Objects # these methods return strings, and accept strings in some cases: $seqobj->seq(); # string of sequence $seqobj->subseq(5,10); # part of the sequence as a string $seqobj->accession_number(); # when there, the accession number $seqobj->alphabet(); # one of 'dna','rna',or 'protein' $seqobj->seq_version() # when there, the version $seqobj->keywords(); # when there, the Keywords line $seqobj->length() # length $seqobj->desc(); # description $seqobj->primary_id(); # a unique id for this sequence regardless # of its display_id or accession number $seqobj->display_id(); # the human readable id of the sequence Most of these features are included in GenBank records. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  20. BioPerl Using Bio::Seq • Each Class contains a method called new that returns an new object of the class • The new function takes arguments that initialize the data belonging to the new object • Once an object is created, you can access all the methods associated with its class by calling its attached functions use Bio::Seq; # module to handle sequence data # create a new sequence object $seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT', -id => 'human_id', -accession_number => 'AL000012', -alphabet => 'dna'); # the seq() method returns the object's sequence as a string print "DNA: ", $seq_obj->seq(), "\n"; # create a new sequence object contain protein sequence $protein_obj = $seq_obj->translate(); print "Protein: ", $protein_obj->seq(), "\n"; OUTPUT: DNA: ATGGGGGTGGTGGTACCCT Protein: MGVVVP BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  21. BioPerl Using Bio::Seq and Bio::SeqIO #!/usr/bin/perl # File: bioseqio.pl use strict; use warnings; use Bio::Seq; use Bio::SeqIO; # module to read/write many sequence formats # create a new sequence object my $seq_obj = Bio::Seq->new(-seq => 'ATGGGGGTGGTGGTACCCT', -id => 'human_id', -accession_number => 'AL000012', -alphabet => 'dna'); # create a new FASTA file handler object (NOTE: ">" for writing) my $seqio_obj = Bio::SeqIO->new(-file=>">myseq.fsa",'-format'=>'fasta'); # tell the file handler to print the sequence object $seqio_obj->write_seq($seq_obj); % cat myseq.fsa >human_id ATGGGGGTGGTGGTACCCT BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  22. BioPerl Reading FASTA file with Bio::SeqIO #!/usr/bin/perl # File: read_fasta.pl use strict; use warnings; use Bio::Seq; use Bio::SeqIO; # module to read/write many sequence formats my $file = "seq3.fsa"; #this is a FASTA file with 3 seqs # create a new FASTA file handler object (no ">" for reading!) my $seqio_obj = Bio::SeqIO->new(-file=> $file, -format=>"fasta"); # read sequences from the FASTA file while (my $seq_obj = $seqio_obj->next_seq()) { my $seq_string = $seq_obj->seq(); print "$seq_string\n"; } OUTPUT: ATTGGGTGCGCGTGCNCCTTCC aaaaaatcatactacatgtagggtaca aaaaaatcatact BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  23. BioPerl LOCUS NM_007553 1541 bp mRNA linear ROD 07-JAN-2002 DEFINITION Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA. ACCESSION NM_007553 VERSION NM_007553.1 GI:6680793 KEYWORDS . SOURCE house mouse. ORGANISM Mus musculus Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus. REFERENCE 1 (bases 1 to 1541) AUTHORS Feng,J.Q., Harris,M.A., Ghosh-Choudhury,N., Feng,M., Mundy,G.R. and Harris,S.E. TITLE Structure and sequence of mouse bone morphogenetic protein-2 gene (BMP-2): comparison of the structures and promoter regions of BMP-2 and BMP-4 genes JOURNAL Biochim. Biophys. Acta 1218 (2), 221-224 (1994) MEDLINE 94289485 REFERENCE 2 (bases 1 to 1541) AUTHORS Heller,L.C., Li,Y., Abrams,K.L. and Rogers,M.B. TITLE Transcriptional regulation of the Bmp2 gene. Retinoic acid induction in F9 embryonal carcinoma cells and Saccharomyces cerevisiae JOURNAL J. Biol. Chem. 274 (3), 1394-1400 (1999) MEDLINE 99098878 PUBMED 9880512 COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence was derived from L25602.1. FEATURES Location/Qualifiers source 1..1541 /organism="Mus musculus" /db_xref="taxon:10090" /chromosome="2" /map="2 76.1 cM" /cell_type="spleen" gene 1..1541 /gene="Bmp2" /db_xref="LocusID:12156" /db_xref="MGD:88177" CDS 357..1541 /gene="Bmp2" /codon_start=1 /db_xref="LocusID:12156" /db_xref="MGD:88177" /product="bone morphogenetic protein 2" /protein_id="NP_031579.1" /db_xref="GI:6680794" /translation="MVAGTRCLLVLLLPQVLLGGAAGLIPELGRKKFAAASSRPLSRP SEDVLSEFELRLLSMFGLKQRPTPSKDVVVPPYMLDLYRRHSGQPGAPAPDHRLERAA SRANTVRTFHQLEAVEELPEMSGKTARRFFFNLSSVPSDEFLTSAELQIFREQIQEAL GNSSFQHRINIYEIIKPAAANLKFPVTRLLDTRLVNQNTSQWESFDVTPAVMRWTTQG HTNHGFVVEVAHLEENPGVSKRHVRISRSLHQDEHSWSQIRPLLVTFGHDGKGHPLHK REKRQAKHKQRKRLKSSCKRHPLYVDFSDVGWNDWIVAPPGYHAFYCHGECPFPLADH LNSTNHAIVQTLVNSVNSKIPKACCVPTELSAISMLYLDENEKVVLKNYQDMVVEGCG CR" A Sample GenBank Record - I BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  24. BioPerl misc_feature 492..1154 /note="TGFb_propeptide; Region: TGF-beta propeptide" misc_feature 1236..1538 /note="TGFB; Region: Transforming growth factor-beta (TGF-beta) family" misc_feature 1236..1538 /note="TGF-beta; Region: Transforming growth factor beta like domain" BASE COUNT 361 a 444 c 414 g 322 t ORIGIN 1 cgcactcctc cccctgctcg aggctgtgtg tcagcacttg gctggagact tcttgaactt 61 gccgggagag tgacttgggc tccccacttc gcgccggtgt cctcgcccgg cggatccagt 121 cttgccgcct ccagcccgat cacctctctt cctcagcccg ctggcccacc ccaagacaca 181 gttccctaca gggagaacac ccggagaagg aggaggaggc gaagaaaagc aacagaagcc 241 cagttgctgc tccaggtccc tcggacagag ctttttccat gtggagactc tctcaatgga 301 cgtgccccct agtgcttctt agacggactg cggtctccta aagccgcagg tcgaccatgg 361 tggccgggac ccgctgtctt ctagtgttgc tgcttcccca ggtcctcctg ggcggcgcgg 421 ccggcctcat tccagagctg ggccgcaaga agttcgccgc ggcatccagc cgacccttgt 481 cccggccttc ggaagacgtc ctcagcgaat ttgagttgag gctgctcagc atgtttggcc 541 tgaagcagag acccaccccc agcaaggacg tcgtggtgcc cccctatatg ctagatctgt 601 accgcaggca ctcaggccag ccaggagcgc ccgccccaga ccaccggctg gagagggcag 661 ccagccgcgc caacaccgtg cgcacgttcc atcaactaga agccgtggag gaacttccag 721 agatgagtgg gaaaacggcc cggcgcttct tcttcaattt aagttctgtc cccagtgacg 781 agtttctcac atctgcagaa ctccagatct tccgggaaca gatacaggaa gctttgggaa 841 acagtagttt ccagcaccga attaatattt atgaaattat aaagcctgca gcagccaact 901 tgaaatttcc tgtgaccaga ctattggaca ccaggttagt gaatcagaac acaagtcagt 961 gggagagctt cgacgtcacc ccagctgtga tgcggtggac cacacaggga cacaccaacc 1021 atgggtttgt ggtggaagtg gcccatttag aggagaaccc aggtgtctcc aagagacatg 1081 tgaggattag caggtctttg caccaagatg aacacagctg gtcacagata aggccattgc 1141 tagtgacttt tggacatgat ggaaaaggac atccgctcca caaacgagaa aagcgtcaag 1201 ccaaacacaa acagcggaag cgcctcaagt ccagctgcaa gagacaccct ttgtatgtgg 1261 acttcagtga tgtggggtgg aatgactgga tcgtggcacc tccgggctat catgcctttt 1321 actgccatgg ggagtgtcct tttccccttg ctgaccacct gaactccact aaccatgcca 1381 tagtgcagac tctggtgaac tctgtgaatt ccaaaatccc taaggcatgc tgtgtcccca 1441 cagagctcag cgcaatctcc atgttgtacc tagatgaaaa tgaaaaggtt gtgctaaaaa 1501 attatcagga catggttgtg gagggctgcg ggtgtcgtta g // A Sample GenBank Record - II BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  25. BioPerl Reading GenBank files with Bio::SeqIO #!/usr/bin/perl #File: readgb.pl use strict; use warnings; use Bio::Seq; use Bio::SeqIO; my $inputfile = $ARGV[0]; # create file input object my $in = Bio::SeqIO->new(-file=> $inputfile, -format=> 'GenBank'); while (my $seq = $in->next_seq()){ # print out parts of the sequence print "Sequence ", $seq->id(), " with accession ", $seq->accession_number, " and description:\n", $seq->desc, "\n\n"; } • Bio::SeqIO provides methods for reading and printing files in many file formats: FASTA, GenBank, SwissProt, GCG, EMBL, etc BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  26. BioPerl The Program in Action [binf:~/binf634/bioperl] jsolka% ./readgb.pl genbank_sample.txt Sequence NM_007553 with accession NM_007553 and description: Mus musculus bone morphogenetic protein 2 (Bmp2), mRNA. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  27. BioPerl Direct Access to Databases • You can create new Bio::Seq objects by importing them directly from a remote database using the Bio::DB family of modules • GenBank: Bio::DB::GenBank • GenPept Bio::DB::GenPept • SwissProt Bio::DB::SwissProt • GDB Bio::DB::GDB • ACEDB Bio::DB:Ace BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  28. BioPerl Fetching a Sequence from GenBank #!/usr/bin/perl use strict; use warnings; use Bio::Seq; use Bio::DB::GenBank; # File: getgb.pl # get GI number from command line my ($id) = @ARGV; # create a new GenBank handle object my $gb = new Bio::DB::GenBank(); # Fetch the sequence from GenBank, returning a seq_object my $seq_obj = $gb->get_Seq_by_id($id); # print out parts of the sequence print "Id = ", $id, "\n", "Accession = ", $seq_obj->accession_number, "\n", "Description = ", $seq_obj->desc, "\n", "Sequence = ", $seq_obj->seq, "\n"; BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  29. BioPerl The Program in Action [binf:~/binf634/bioperl] jsolka% getgb.pl 1293614 Id = 1293614 Accession = AAA98665 Description = TCP1-beta [Saccharomyces cerevisiae]. Sequence = SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM • Check out this site for more information • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  30. BioPerl Bio::Perl Bio::Perl has a number of easy to use functions, including get_sequence - gets sequence from standard internet databases read_sequence - reads a sequence from a file read_all_sequences - reads all sequences from a file new_sequence - makes a bioperl sequence from a string write_sequence - writes one or more sequences to a file translate - provides a translation of a sequence translate_as_string - provides translation of a sequence as string blast_sequence - BLASTs sequence against databases at NCBI write_blast - writes a blast report out to a file BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  31. BioPerl Bio::Perl The Bio::Perl module provides some simple access functions, for example, this script will retrieve a GenBank sequence and write it out in two different formats #!/usr/bin/perl # File: getgenbank.pl use strict; use warnings; use Bio::Perl; # this script will only work with an internet connection # on the computer it is run on my $seq_object = get_sequence('genbank',"ROA1_HUMAN"); # write the sequence in FASTA format write_sequence(">roa1.fasta",'fasta',$seq_object); # write the sequence in GenBank format write_sequence(">roa1.genbank",'genbank',$seq_object); exit; BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  32. % cat roa1.fasta >ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS SSSSYGSGRRF % cat roa1.genbank LOCUS ROA1_HUMAN 371 aa linear HUMAN 01-MAR-1989 DEFINITION Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein) (Single-strand binding protein) (hnRNP core protein A1). ACCESSION P09651 KEYWORDS Nuclear protein; RNA-binding; Repeat; Ribonucleoprotein; Methylation; Phosphorylation; Alternative splicing; 3D-structure; Polymorphism. SOURCE Human ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. ORIGIN 1 sksespkepe qlrklfiggl sfettdeslr shfeqwgtlt dcvvmrdpnt krsrgfgfvt 61 yatveevdaa mnarphkvdg rvvepkravs redsqrpgah ltvkkifvgg ikedteehhl 121 rdyfeqygki evieimtdrg sgkkrgfafv tfddhdsvdk iviqkyhtvn ghncevrkal 181 skqemasass sqrgrsgsgn fgggrgggfg gndnfgrggn fsgrggfggs rggggyggsg 241 dgyngfgndg gyggggpgys ggsrgygsgg qgygnqgsgy ggsgsydsyn ngggrgfggg 301 sgsnfggggs yndfgnynnq ssnfgpmkgg nfggrssgpy ggggqyfakp rnqggyggss 361 ssssygsgrr f // BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  33. BioPerl LWP: fetch WWW documents • To automate WWW access • LWP::Simple - procedural interface to LWP • Example of usage: use LWP::Simple; $url = "http://www.whatever.com/data.html"; $page = get($url); if ($page) { # do something } else { print "Problems getting $url"; } • See the following URL for an example • http://stein.cshl.org/genome_informatics/chervitz/perl-modules/ • http://oreilly.com/openbook/webclient/ch05.html BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  34. PDB Protein Data Bank • The Protein Data Bank (PDB) is a repository of protein data structure • http://www.rcsb.org/pdb/home/home.do • The nature of proteins • Primary sequences • Secondary structures • Alpha helices • Beta strands • Turns • Tertiary structures • Beta sheets • Alpha-alpha units • Quaternary structures • Enzyme • Virus BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  35. BioPerl PDB Overview of PDB • ASCII flat files that are human readable • One structure per file • Problems with consistency of the formats • Code you create today may not work tomorrow BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  36. PDB Files and Folders • PDB is distributed as files within directories • There are in-depth discussions of how to handle these directory structures within Chapter 11 of Tisdall • We will talk a little about recursion next week BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  37. PDB HEADER IMMUNOGLOBULIN 10-MAR-99 43C9 TITLE CRYSTALLOGRAPHIC STRUCTURE OF THE ESTEROLYTIC AND TITLE 2 AMIDOLYTIC 43C9 ANTIBODY COMPND MOL_ID: 1; COMPND 2 MOLECULE: IMMUNOGLOBULIN (LIGHT CHAIN); COMPND 3 CHAIN: A, C, E, G; COMPND 4 FRAGMENT: FV; COMPND 5 ENGINEERED: YES; COMPND 6 MOL_ID: 2; COMPND 7 MOLECULE: IMMUNOGLOBULIN (HEAVY CHAIN); COMPND 8 CHAIN: B, D, F, H; COMPND 9 FRAGMENT: FV; COMPND 10 ENGINEERED: YES SOURCE MOL_ID: 1; SOURCE 2 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 3 ORGANISM_COMMON: MOUSE; SOURCE 4 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 5 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3); SOURCE 6 MOL_ID: 2; SOURCE 7 ORGANISM_SCIENTIFIC: MUS MUSCULUS; SOURCE 8 ORGANISM_COMMON: MOUSE; SOURCE 9 EXPRESSION_SYSTEM: ESCHERICHIA COLI; SOURCE 10 EXPRESSION_SYSTEM_CELL_LINE: BL21(DE3) KEYWDS IMMUNOGLOBULIN EXPDTA X-RAY DIFFRACTION AUTHOR M.M.THAYER,E.D.GETZOFF,V.A.ROBERTS REVDAT 1 18-AUG-99 43C9 0 JRNL AUTH M.M.THAYER,E.H.OLENDER,A.S.ARVAI,C.K.KOIKE, JRNL AUTH 2 I.L.CANESTRELLI,J.D.STEWART,S.J.BENKOVIC, JRNL AUTH 3 E.D.GETZOFF,V.A.ROBERTS JRNL TITL STRUCTURAL BASIS FOR AMIDE HYDROLYSIS CATALYZED BY JRNL TITL 2 THE 43C9 ANTIBODY JRNL REF J.MOL.BIOL. V. 291 329 1999 JRNL REFN ASTM JMOBAK UK ISSN 0022-2836 0070 REMARK 1 REMARK 1 REFERENCE 1 REMARK 1 AUTH V.A.ROBERTS,J.STEWART,S.J.BENKOVIC,E.D.GETZOFF REMARK 1 TITL CATALYTIC ANTIBODY MODEL AND MUTAGENESIS IMPLICATE REMARK 1 TITL 2 ARGININE IN TRANSITION- STATE STABILIZATION REMARK 1 REF J.MOL.BIOL. V. 235 1098 1994 REMARK 1 REFN ASTM JMOBAK UK ISSN 0022-2836 0070 REMARK 1 REFERENCE 2 . . . An Example PDB File - I BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  38. PDB REMARK 800 REMARK 800 HIS 91 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G) - REMARK 800 NUCLEOPHILE THAT FORMS A COVALENT BOND TO THE REMARK 800 SUBSTRATE REMARK 800 REMARK 800 ARG 96 IN THE VARIABLE LIGHT CHAIN (A,C,E, OR G)- REMARK 800 SIDE CHAIN STABILIZES NEGATIVE CHARGE FORMED IN REMARK 800 THE TRANSITION STATES OF THE REACTION REMARK 800 REMARK 900 REMARK 900 RELATED ENTRIES REMARK 900 RELATED ID: 43CA RELATED DB: PDB DBREF 43C9 A 1 107 GB 207840 M68968 1 113 DBREF 43C9 B 0 112 GB 207840 M68968 127 244 DBREF 43C9 C 1 107 GB 207840 M68968 1 113 DBREF 43C9 D 0 112 GB 207840 M68968 127 244 DBREF 43C9 E 1 107 GB 207840 M68968 1 113 DBREF 43C9 F 0 112 GB 207840 M68968 127 244 DBREF 43C9 G 1 107 GB 207840 M68968 1 113 DBREF 43C9 H 0 112 GB 207840 M68968 127 244 SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS SEQRES 1 B 118 GLY GLN VAL GLN LEU VAL GLU SER GLY PRO GLY LEU VAL SEQRES 2 B 118 ALA PRO SER GLN SER LEU SER ILE THR CYS THR VAL SER SEQRES 3 B 118 GLY ILE SER LEU SER ARG TYR ASN VAL HIS TRP VAL ARG SEQRES 4 B 118 GLN SER PRO GLY LYS GLY LEU GLU TRP LEU GLY MET ILE SEQRES 5 B 118 TRP GLY GLY GLY SER ILE GLU TYR ASN PRO ALA LEU LYS SEQRES 6 B 118 SER ARG LEU SER ILE SER LYS ASP ASN SER LYS SER GLN SEQRES 7 B 118 ILE PHE LEU LYS MET ASN SER LEU GLN THR ASP ASP SER SEQRES 8 B 118 ALA MET TYR TYR CYS VAL SER TYR GLY TYR GLY GLY ASP SEQRES 9 B 118 ARG PHE SER TYR TRP GLY GLN GLY THR LEU VAL THR VAL SEQRES 10 B 118 SER . . . FORMUL 9 HOH *162(H2 O1) SHEET 1 A 4 MET A 4 THR A 7 0 SHEET 2 A 4 VAL A 19 LYS A 24 -1 N LYS A 24 O THR A 5 SHEET 3 A 4 ASP A 70 ILE A 75 -1 N ILE A 75 O VAL A 19 SHEET 4 A 4 PHE A 62 SER A 67 -1 N SER A 67 O ASP A 70 SHEET 1 B 5 SER A 10 SER A 14 0 SHEET 2 B 5 THR A 102 LYS A 107 1 N LYS A 103 O LEU A 11 SHEET 3 B 5 ALA A 84 GLN A 90 -1 N TYR A 86 O THR A 102 SHEET 4 B 5 LEU A 33 GLN A 38 -1 N GLN A 38 O ASP A 85 SHEET 5 B 5 PRO A 44 VAL A 48 -1 N VAL A 48 O TRP A 35 SHEET 1 C 4 GLN B 3 SER B 7 0 SHEET 2 C 4 LEU B 18 SER B 25 -1 N SER B 25 O GLN B 3 SHEET 3 C 4 GLN B 77 MET B 82 -1 N MET B 82 O LEU B 18 . . . An Example PDB File - II BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  39. PDB ATOM 1 N ASP A 1 7.422 50.761 33.806 1.00 64.60 N ATOM 2 CA ASP A 1 8.287 50.204 34.870 1.00 77.57 C ATOM 3 C ASP A 1 8.848 48.834 34.503 1.00 77.32 C ATOM 4 O ASP A 1 9.752 48.357 35.183 1.00 82.14 O ATOM 5 CB ASP A 1 7.507 50.107 36.205 1.00 86.78 C ATOM 6 CG ASP A 1 8.392 49.691 37.423 1.00 92.57 C ATOM 7 OD1 ASP A 1 9.630 49.891 37.411 1.00 93.28 O ATOM 8 OD2 ASP A 1 7.826 49.183 38.422 1.00 96.45 O ATOM 9 N VAL A 2 8.333 48.184 33.458 1.00 70.66 N ATOM 10 CA VAL A 2 8.864 46.859 33.118 1.00 63.77 C ATOM 11 C VAL A 2 10.397 47.012 32.964 1.00 58.68 C ATOM 12 O VAL A 2 10.879 47.598 32.010 1.00 62.61 O ATOM 13 CB VAL A 2 8.149 46.194 31.855 1.00 62.22 C . . . CONECT 6928 6353 MASTER 381 0 0 13 80 0 4 24 7254 8 16 76 END An Example PDB File-III BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  40. PDB PDB File Format - I • In-depth discussions are available here • http://www.wwpdb.org/docs.html • PDB files are composed of 80 columns that begin with one of several record names and end with a newline • Blank columns are padded with spaces • A record type is one or more lines with the same record name • Each record type has a different number of fields with them BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  41. PDB PDB File Formats -II • Record Types • DBREF • Reference to the entry in the sequence database • SEQADV • Identification of conflicts between pdb and the named sequence database • SEQRES • Primary sequence of backbone residues • MODRES • Identification of modification to standard residues BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  42. PDB Our Example PDB File – The DBREF Section DBREF 43C9 A 1 107 GB 207840 M68968 1 113 DBREF 43C9 B 0 112 GB 207840 M68968 127 244 DBREF 43C9 C 1 107 GB 207840 M68968 1 113 DBREF 43C9 D 0 112 GB 207840 M68968 127 244 DBREF 43C9 E 1 107 GB 207840 M68968 1 113 DBREF 43C9 F 0 112 GB 207840 M68968 127 244 DBREF 43C9 G 1 107 GB 207840 M68968 1 113 DBREF 43C9 H 0 112 GB 207840 M68968 127 244 • This tells us there is a PDB file called pdb43cp.ent and that the residues in chain A are labeled from 1 to 107 in the original GenBank database and the ID number is 207840 and the name m68968 are assigned to that database and the residues are numbered 1 to 113 in PDB BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  43. PDB Our Example PDB File – The SEQRES Section SEQRES 1 A 113 ASP VAL VAL MET THR GLN THR PRO SER SER LEU ALA MET SEQRES 2 A 113 SER VAL GLY GLN LYS VAL THR MET SER CYS LYS SER SER SEQRES 3 A 113 GLN SER LEU LEU ASN ILE SER ASN GLN LYS ASN TYR LEU SEQRES 4 A 113 ALA TRP TYR GLN GLN LYS PRO GLY GLN SER PRO LYS LEU SEQRES 5 A 113 LEU VAL TYR PHE ALA SER THR ARG GLU SER GLY VAL PRO SEQRES 6 A 113 ASP ARG PHE ILE GLY SER GLY SER GLY THR ASP PHE THR SEQRES 7 A 113 LEU THR ILE SER SER VAL GLN ALA GLU ASP GLN ALA ASP SEQRES 8 A 113 TYR PHE CYS GLN GLN HIS TYR ARG ALA PRO ARG THR PHE SEQRES 9 A 113 GLY GLY GLY THR LYS LEU GLU ILE LYS BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  44. BioPerl and PDB #!/usr/local/bin/perl -w # Parsing PDB reports with BioPerl's Bio::Structure:IO module use strict; use Bio::Structure::IO; my $inFile; # Prompt the user for the file name if it's not an argument # NOTE: PDB file must be in text (not html) format if (! $ARGV[0]) { print "What is the PDB file to parse? "; # Get input and remove the newline character at the end chomp ($inFile = <STDIN>); } else { $inFile = $ARGV[0]; } my $structio = Bio::Structure::IO->new(-file => $inFile); my $struc = $structio->next_structure; for my $chain ($struc->get_chains) { my $chainid = $chain->id; # one-letter chaincode if present, 'default' otherwise for my $res ($struc->get_residues($chain)) { my $resid = $res->id; # format is 3-lettercode - dash - residue number, e.g. PHE-20 my $atoms = $struc->get_atoms($res); # actually a list of atom objects, used here to get a count print join "\t", $chainid,$resid,$atoms,"\n"; } } Using BioPerl to Parse a pdb File BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  45. BioPerl and PDB ./pdb_parse_0.pl pdb43c9.ent > pdb_parse_0.out A ASP-1 8 A VAL-2 7 A VAL-3 7 A MET-4 8 A THR-5 7 A GLN-6 9 A THR-7 7 A PRO-8 7 A SER-9 6 A SER-10 6 A LEU-11 8 A ALA-12 5 . . . The Code in Action BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  46. BioPerl and PDB The Code in Action (cont.) H GLY-106 4 H THR-107 7 H LEU-108 8 H VAL-109 7 H THR-110 7 H VAL-111 7 H SER-112 7 default HOH-1 1 default HOH-2 1 default HOH-3 1 default HOH-4 1 default HOH-5 1 . . . default HOH-440 1 default HOH-449 1 default HOH-454 1 default HOH-456 1 default HOH-463 1 default HOH-469 1 default HOH-472 1 BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  47. BLAST BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome) • The Basic Local Alignment Search Tool (BLAST) • Tests a query against a library of known sequences to find similarity • Collection of programs • query to database pairs • nucleotide-nucleotide • protein-nucleotide • protein-protein • nucleotid protein BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  48. BLAST BLAST Advantages of a Local Installation Versus Public Server • Using public server • Databases used by BLAST such as GenBank are always up to date • Using local server • Proprietary sequences • Sequences that are run time and time again BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  49. BLAST String Matching and Homology • BLAST is basically a string matching program • Biological string matching looks for similarity as an indication of homology • Similarity between the query and the sequences in the database may be measured by the percent identity, or the number of bases in the query that exactly match a corresponding region of a sequence from the database. • It may be measured by the degree of conservation , which finds matches between equivalent (redundant) codons or between amino acid residues with similar properties that don’t alter the function of a protein • Two sequences are or are not homologous, there’s no degree of homology • Beginning PERL FOR Bioinformatics, Tisdall BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

  50. BLAST Homology • It is difficult to find a single consistent definition of homology • “In genetics, the term "homolog" is used both to refer to a homologous protein, and to the gene (DNA sequence) encoding it.” • -- Wikipedia • In-depth philosophical discussions can be found in • G. P. Wagner, “The Biological Homology Concept,” Annu. Rev. Ecol. Syst. 1989. 20:51~9. BINF 634 FALL 2013 - Lecture 12 (JOINS, BioPerl, PDB, and BLAST)

More Related