1 / 36

Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 13928761660

Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc.edu 13928761660. Exercise 1. Ask for a protein file in fasta format Ask for an amino acid Count the frequency of that amino acid TKFHSNAHFYDCWRMLQYQLDMRCMRAISTFSPHCGMEHMPDQTHNQGEMCKPRMWQVSMNQSCNHTPPFRKTYVEWDYMAKALIAPYTLGWLASTCFIW. Exercise 2.

cole-ware
Download Presentation

Bioinformatics 生物信息学理论和实践 唐继军 jtang@cse.sc 13928761660

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Bioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu13928761660Bioinformatics生物信息学理论和实践唐继军jtang@cse.sc.edu13928761660

  2. Exercise 1 • Ask for a protein file in fasta format • Ask for an amino acid • Count the frequency of that amino acid • TKFHSNAHFYDCWRMLQYQLDMRCMRAISTFSPHCGMEHMPDQTHNQGEMCKPRMWQVSMNQSCNHTPPFRKTYVEWDYMAKALIAPYTLGWLASTCFIW

  3. Exercise 2 • Ask for an RNA file in fasta format • Convert it to RNA • Ask for a codon • Count the frequency of that codon • TCGTACTTAGAAATGAGGGTCCGCTTTTGCCCACGCACCTGATCGCTCCTCGTTTGCTTTTAAGAACCGGACGAACCACAGAGCATAAGGAGAACCTCTAGCTGCTTTACAAAGTACTGGTTCCCTTTCCAGCGGGATGCTTTATCTAAACGCAATGAGAGAGGTATTCCTCAGGCCACATCGCTTCCTAGTTCCGCTGGGATCCATCGTTGGCGGCCGAAGCCGCCATTCCATAGTGAGTTCTTCGTCTGTGTCATTCTGTGCCAGATCGTCTGGCAAATAGCCGATCCAGTTTATCTCTCGAAACTATAGTCGTACAGATCGAAATCTTAAGTCAAATCACGCGACTAGACTCAGCTCTATTTTAGTGGTCATGGGTTTTGGTCCCCCCGAGCGGTGCAACCGATTAGGACCATGTAGAACATTAGTTATAAGTCTTCTTTTAAACACAATCTTCCTGCTCAGTGGTACATGGTTATCGTTATTGCTAGCCAGCCTGATAAGTAACACCACCACTGCGACCCTAATGCGCCCTTTCCACGAACACAGGGCTGTCCGATCCTATATTACGACTCCGGGAAGGGGTTCGCAAGTCGCACCCTAAACGATGTTGAAGGCTCAGGATGTACACGCACTAGTACAATACATACGTGTTCCGGCTCTTATCCTGCATCGGAAGCTCAATCATGCATCGCACCAGCGTGTTCGTGTCATCTAGGAGGGGCGCGTAGGATAAATAATTCAATTAAGATATCGTTATGCTAGTATACGCCTACCCGTCACCGGCCAACAGTGTGCAGATGGCGCCACGAGTTACTGGCCCTGATTTCTCCGCTTCTAATACCGCACACTGGGCAATACGAGCTCAAGCCAGTCTCGCAGTAACGCTCATCAGCTAACGAAAGAGTTAGAGGCTCGCTAAATCGCACTGTCGGGGTCCCTTGGGTATTTTACACTAGCGTCAGGTAGGCTAGCATGTGTCTTTCCTTCCAGGGGTATG

  4. Subroutine • Some code needs to be reused • A good way to organize code • Called "function" in some languages • Name • Return • Parameters (@_)

  5. sub codon2aa { my($codon) = @_; if ( $codon =~ /GC./i) { return 'A' } Alanine elsif ( $codon =~ /TG[TC]/i) { return 'C' } Cysteine elsif ( $codon =~ /GA[TC]/i) { return 'D' } Aspartic Acid elsif ( $codon =~ /GA[AG]/i) { return 'E' } Glutamic Acid elsif ( $codon =~ /TT[TC]/i) { return 'F' } Phenylalanine elsif ( $codon =~ /GG./i) { return 'G' } Glycine elsif ( $codon =~ /CA[TC]/i) { return 'H' } Histidine elsif ( $codon =~ /AT[TCA]/i) { return 'I' } Isoleucine elsif ( $codon =~ /AA[AG]/i) { return 'K' } Lysine elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } Leucine elsif ( $codon =~ /ATG/i) { return 'M' } Methionine elsif ( $codon =~ /AA[TC]/i) { return 'N' } Asparagine elsif ( $codon =~ /CC./i) { return 'P' } Proline elsif ( $codon =~ /CA[AG]/i) { return 'Q' } Glutamine elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } Arginine elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } Serine elsif ( $codon =~ /AC./i) { return 'T' } Threonine elsif ( $codon =~ /GT./i) { return 'V' } Valine elsif ( $codon =~ /TGG/i) { return 'W' } Tryptophan elsif ( $codon =~ /TA[TC]/i) { return 'Y' } Tyrosine elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } Stop else {print STDERR "Bad codon \"$codon\"!!\n"; exit; } }

  6. !/usr/bin/perl –w print "Please type the filename: "; $dna_filename = <STDIN>; chomp $dna_filename; open(DNAFILE, $dna_filename); $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n"; sub dna2peptide { my ($dna) = @_; my $protein = ""; for(my $i=0; $i < (length($dna) - 2) ; $i += 3) { $codon = substr($dna,$i,3); $protein .= codon2aa($codon); } return $protein; } sub codon2aa { ... }

  7. Modules • A Perl Module is a self-contained pieceof [Perl] code that can be used by a Perl program later • Like a library • End with extension .pm • Needs a 1 at the end

  8. Bio.pm sub codon2aa { .... .... } sub dna2peptide { .... .... } 1

  9. !/usr/bin/perl -w use Bio; print "Please type the filename: "; $dna_filename = <STDIN>; chomp $dna_filename; open(DNAFILE, $dna_filename); $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

  10. Bio.pm sub codon2aa { .... .... } sub dna2peptide { .... .... } sub fasta_read { print "Please type the filename: "; my $dna_filename = <STDIN>; chomp $dna_filename; unless (open(DNAFILE, $dna_filename)) { print "Cannot open file ", $dna_filename, "\n"; } $name = <DNAFILE>;@DNA = <DNAFILE>;close DNAFILE; $DNA = join( '', @DNA);$DNA =~ s/\s//g; return $DNA; } 1

  11. !/usr/bin/perl -w use Bio; $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

  12. Scope • my provides lexical scoping; a variable declared with my is visible only within the block in which it is declared. • Blocks of code are hunks within curly braces {}; files are blocks. • Use use vars qw([list of var names]) or our ([var_names]) to create package globals.

  13. !/usr/bin/perl -w use Bio; use strict; use warnings; $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

  14. Variable "$DNA" is not imported at frame2.pl line 6. Variable "$DNA" is not imported at frame2.pl line 8. Variable "$DNA" is not imported at frame2.pl line 9. Variable "$DNA" is not imported at frame2.pl line 10. Variable "$DNA" is not imported at frame2.pl line 12. Variable "$DNA" is not imported at frame2.pl line 12. Variable "$DNA" is not imported at frame2.pl line 13. Variable "$DNA" is not imported at frame2.pl line 14. Variable "$DNA" is not imported at frame2.pl line 15. Global symbol "$DNA" requires explicit package name at frame2.pl line 6. Global symbol "$DNA" requires explicit package name at frame2.pl line 8. Global symbol "$DNA" requires explicit package name at frame2.pl line 9. Global symbol "$DNA" requires explicit package name at frame2.pl line 10. Global symbol "$DNA" requires explicit package name at frame2.pl line 12. Global symbol "$DNA" requires explicit package name at frame2.pl line 12. Global symbol "$DNA" requires explicit package name at frame2.pl line 13. Global symbol "$DNA" requires explicit package name at frame2.pl line 14. Global symbol "$DNA" requires explicit package name at frame2.pl line 15. Execution of frame2.pl aborted due to compilation errors.

  15. !/usr/bin/perl -w use Bio; use strict; use warnings; my $DNA = fasta_read(); print "First ", dna2peptide($DNA), "\n"; print "Second ", dna2peptide(substr($DNA, 1)), "\n"; print "Third ", dna2peptide(substr($DNA, 2)), "\n"; $DNA = reverse $DNA; $DNA =~ tr/ACGTacgt/TGCAtgca/; print "Fourth ", dna2peptide($DNA), "\n"; print "Fifth ", dna2peptide(substr($DNA, 1)), "\n"; print "Sixth ", dna2peptide(substr($DNA, 2)), "\n";

  16. my $x = 10; for (my $x = 0; $x < 5; $x++) { Scope(); print $x, "\n"; } print $x, "\n"; sub Scope { my $x = 0; }

  17. sub get_file_data { my($filename) = @_; use strict; use warnings; # Initialize variables my @filedata = ( ); unless( open(GET_FILE_DATA, $filename) ) { print STDERR "Cannot open file \"$filename\"\n\n"; exit; } @filedata = <GET_FILE_DATA>; close GET_FILE_DATA; return @filedata; }

  18. sub extract_sequence_from_fasta_data { my(@fasta_file_data) = @_; my $sequence = ''; foreach my $line (@fasta_file_data) { if ($line =~ /^\s*$/) { next; } elsif($line =~ /^\s*#/) { next; } elsif($line =~ /^>/) { next; } else { $sequence .= $line; } } # remove non-sequence data (in this case, whitespace) from $sequence string $sequence =~ s/\s//g; return $sequence; }

  19. Molecular Scissors Molecular Cell Biology, 4th edition

  20. Discovering Restriction Enzymes • HindII - first restriction enzyme – was discovered accidentally in 1970 while studying how the bacterium Haemophilus influenzae takes up DNA from the virus • Recognizes and cuts DNA at sequences: • GTGCAC • GTTAAC

  21. Recognition Sites of Restriction Enzymes Molecular Cell Biology, 4th edition

  22. Uses of Restriction Enzymes • Recombinant DNA technology • Cloning • cDNA/genomic library construction • DNA mapping

  23. Restriction Enzyme Database • http://rebase.neb.com/rebase/rebase.html

  24. http://rebase.neb.com/rebase/rebase.files.html

  25. R = G or A Y = C or T M = A or C K = G or T S = G or C W = A or T B = not A (C or G or T) D = not C (A or G or T) H = not G (A or C or T) V = not T (A or C or G) N = A or C or G or T

  26. sub IUB_to_regexp { my($iub) = @_; my $regular_expression = ‘’; my %iub2character_class = ( A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]', M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]', ); $iub =~ s/\^//g; for ( my $i = 0 ; $i < length($iub) ; ++$i ) { $regular_expression .= $iub2character_class{substr($iub, $i, 1)}; } return $regular_expression; }

  27. Hash • Initialize: my %hash = (); • Add key/value pair: $hash{$key} = $value; • Add more keys: • %hash = ( 'key1', 'value1', 'key2', 'value2 ); • %hash = ( key1 => 'value1', key2 => 'value2', ); • Delete: delete $hash{$key};

  28. while ( my ($key, $value) = each(%hash) ) { print "$key => $value\n"; } for my $key ( keys %hash ) { my $value = $hash{$key}; print "$key => $value\n"; }

  29. sub codon2aa { my($codon) = @_; $codon = uc $codon; my %genetic_code = ( 'TCA' => 'S', # Serine 'TCC' => 'S', # Serine 'TCG' => 'S', # Serine 'TCT' => 'S', # Serine 'TTC' => 'F', # Phenylalanine 'TTT' => 'F', # Phenylalanine 'TTA' => 'L', # Leucine 'TTG' => 'L', # Leucine #Many more ); if(exists $genetic_code{$codon}) { return $genetic_code{$codon}; }else{ print STDERR "Bad codon \"$codon\"!!\n"; exit; } }

  30. sub parseREBASE { my($rebasefile) = @_; my @rebasefile = ( );my %rebase_hash = ( );my $name;my $site;my $regexp; open($rebase_filehandle, $rebasefile) or die "Cannot open file\n"; while(<$rebase_filehandle>) { # Discard header lines ( 1 .. /Rich Roberts/ ) and next; # Discard blank lines /^\s*$/ and next; # Split the two (or three if includes parenthesized name) fields my @fields = split( " ", $_); $name = shift @fields; $site = pop @fields; # Translate the recognition sites to regular expressions $regexp = IUB_to_regexp($site); # Store the data into the hash $rebase_hash{$name} = "$site $regexp"; } # Return the hash containing the reformatted REBASE data return %rebase_hash; }

  31. Range • ( 1 .. /Rich Roberts/ ) and next • from first line till some line containing Rich Roberts • If that is true, it will check the statement after "and" • If that is not true, it will not check the statement after "and" • open(…) or die • If can open, the statement is already true, no need to check the statement after "or" • If cannot open, the statement is false, need to check the statement after "or" to see if it can be true

  32. @fred = (1,2,3); @barney = @fred; @huh = 1; @fred = qw(one two); @barney = (4,5,@fred,6,7); @barney = (8,@barney); ($a,$b,$c) = (1,2,3); @fred = (@barney = (2,3,4)); @fred = @barney = (2,3,4); @fred = (1,2,3); $fred[3] = "hi"; $fred[6] = "ho"; # @fred is now (1,2,3,"hi",undef,undef,"ho")

  33. Array operators • push and pop (right-most element) • @mylist = (1,2,3); push(@mylist,4,5,6); • $oldvalue = pop(@mylist); • shift and unshift (left-most element) • @fred = (5,6,7); unshift(@fred,2,3,4); • $x = shift(@fred); • reverse: @a = (7,8,9); @b = reverse(@a); • sort: @a = (7,9,9); @b = sort(@a);

  34. sub match_positions { my($regexp, $sequence) = @_; use BeginPerlBioinfo; my @positions = ( ); while ( $sequence =~ /$regexp/ig ) { push ( @positions, pos($sequence) - length($&) + 1); } return @positions; }

  35. use BeginPerlBioinfo; my %rebase_hash = ( ); my @file_data = ( ); my $query = ''; my $dna = ''; my $recognition_site = ''; my $regexp = ''; my @locations = ( ); @file_data = get_file_data("sample.dna"); $dna = extract_sequence_from_fasta_data(@file_data); %rebase_hash = parseREBASE('bionet'); do { print "Search for what restriction site for (or quit)?: "; $query = <STDIN>; chomp $query; if ($query =~ /^\s*$/ ) {exit; } if ( exists $rebase_hash{$query} ) { ($recognition_site, $regexp) = split ( " ", $rebase_hash{$query}); @locations = match_positions($regexp, $dna); if (@locations) { print "Searching for $query $recognition_site $regexp\n"; print "Restriction site for $query at :", join(" ", @locations), "\n"; } else { print "A restriction enzyme $query is not in the DNA:\n"; } } } until ( $query =~ /quit/ ); exit;

More Related