1 / 20

Computer Science for Bioinformatics, 140.636 F. Pineda

Computer Science for Bioinformatics, 140.636 F. Pineda. Please fill out questionnaire while waiting for class to start If you did not get permission from instructor to register for credit please talk to me after class...it could end badly.

plank
Download Presentation

Computer Science for Bioinformatics, 140.636 F. Pineda

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Computer Science for Bioinformatics, 140.636F. Pineda • Please fill out questionnaire while waiting for class to start • If you did not get permission from instructor to register for credit please talk to me after class...it could end badly...

  2. If all you have is a hammer, everything looks like a nail. • You can do just about anything in any of the modern languages, but different languages are optimized for different purposes, but the right language makes things simple... • Rapid Prototyping or scripting languages • Perl– Best for tasks that are 90% text manipulation and 10% everything else. • R– Best for tasks that are 90% stats/plotting and 10% everything else • Python– Good general purpose programming language. • Widely used development languages (strongly typed) • C – best for low-level programming where speed matters • Java – Very widely used • Databases • SQL – the standard language for relational databases

  3. Some typical text-oriented bioinformatics tasks • I have a huge output file from a BLAST search. What are all the Mus musculus matches with high scores? • I want to digest (with Trypsin) each of the ~900 proteins in a FASTA file and calculate the mass of each fragments. Oh, I also want the results in a format that I can feed to the MASCOT proteomics search engine . • What’s the most common 6-mer consisting of only A’s and T’s in the P. falciparum genome (14 chromosomes and 2.2 million nucleotides) • I want to automatically update the database on our local BLAST server every weekend with the latest sequences deposited at NCBI. • Search for potential RNA hairpins in the 14 chromosomes of the Plasmodium falciparum genome The common characteristic of these tasks is that theyinvolve ~90% text manipulation, and ~10% everything else.

  4. A quickie task Question: What 6-mer containing only A’s and T’s occurs mostly commonly in the P. falciparumgenome Answer: Scan through the 2.2 million nucleotides of the 14 chromosomes using a 6-nt windowand find that ATATAT occurs most frequently Runs in seconds on a modern laptop --------- A: 40.31 % T: 40.30 % C:  9.70 % G:  9.69 % total(ATCG only ) = 22853497 nucleotides total(everything) = 22853764 nucleotides --------- 1       ATATAT  570577 2       TATATA  529826 3       AAAAAA  420232 4       TTTTTT  417439 5       TATTAT  147506 6       ATAATA  146948 7       TATATT  146711 8       AATATA  146695 -- snip -- 57      ATTTAA  50617 58      ATTAAT  49856 59      TTAAAT  49731 60      TAATTA  45781 61      AATTAA  45278 62      TTAATT  45212 63      AATTTA  45101 64      TAAATT  44553

  5. A bigger task: “middleware” for a “pipeline” • Applications for processing raw data from DNA sequencers • A trace editor to analyze, display, and allow biologists to edit the short DNA read chromatograms from DNA-sequencing machines. • A read assembler to find overlaps between the reads and assemble them in long contiguous sections. • An assembly editor to view the assemblies and make changes in places where the assembler went wrong. • A database to keep track of everything. • Write a set of scripts that runs the applications sequentially and converts the output format of one application into the input format of the subsequent application. see e.g. L. Stein “How Perl Saved the Human Genome Project”

  6. MBL/TIGR Assembly Pipeline From Advanced Genomics and Bioinformatics course F. Pineda, 10/31/2002 ABI 3700 sequencer sample DNA .chr Traceviewer Legend PreTA (Lucy, Phred) PUC19 PUC19.splice applications Closure format conversion files .x .y Lab phd2fasta Consed .qual .seq No Closed? .fasta RunTA TIGR Assembler .contig ace2contig .ace Yes RepeatFinder .asm ta2ace Finished sequence .repeats .stats .details .dot Bambus .mates ? To database for annotation

  7. Got perl? • Unix, Linux and MacOS systems come with Perl. • Open a shell window for entering command lines • Find the DEFAULT perl interpreter by typing: whereis perl • To find out which version of perl you have type “perl -v” • If you insist on working on a Windows PC, you have two choices (neither of which we will support): • Cygwin (a “linux-like environment with a limited set of tools) • http://www.cygwin.com/ • Active state perl (a native perl interpreter). • (http://www.activestate.com/activeperl/)

  8. Invoking the perl interpreter • Perl is an application that processes perl statements. • Feeding perl a one line command • perl -e ‘print “hello world\n”’ • Feeding perl a sequence of commands • step 1. invoke perl • step 2. write the statement(s) • step 3. tell perl you are finished writing statements (control-D) (this causes perl to interpret the statements) print “an old pond\nthe sound of a frog jumping\ninto water\n”;

  9. Perl application = Compiler+Virtual Machine • Java, Perl, Python, Forth, etc. are implemented as virtual machines. • The perl interpreter is an application consisting of a compiler and a “virtual machine.” • The perl interpreter accepts input consisting of perl instructions and executes them in a two-step process: • A compiler converts source code into instructions for a “virtual machine” • Bottom-up parser • Top-down Optimizer & Peephole Optimizer • Generates Opcodes (instructions) • Code is compiled each time it is executed – no binaries • The Perl “virtual machine” executes the instructions • Executes the opcodes • Emulates a stack-based computer (like HP calculator or Forth)

  10. 1st perl program: Hello World # my first perl program print(“hello world\n”); • Text between ‘#’ and the end of the line is a comment • print() is a function that prints the text string • Double quotes delimit a string • The two characters \n are not interpreted literally. Instead they represents newline. • Run the program by invoking the perl interpreter with hello.pl as an argument. perl helloworld.pl

  11. stand alone perl script • Three things to remember • Add a special first line that will tell bash where to find perl • Tell bash it has permission to execute this script (just once) • Invoke the script #!/usr/bin/perl # my first perl script print(“hello world\n”); bash> chmod u+x helloperl.pl bash> ./helloperl.pl

  12. Character data, Sequences & codes • 99% of what you will do with perl is to manipulate characters and text files, so you might as well get intimately familiar with character data and how it is represented (coded) on computers. • Many important biological compounds are polymers. These are compounds of usually high molecular weight consisting of up to millions of repeated linked units, each a relatively light and simple molecule. Different single letter codes are used to represent simple molecules in different classes of compounds (e.g. DNA, RNA and proteins).

  13. Biological sequencesrepresenting information contained in biological polymers

  14. DNA RNA Proteins The fundamental “dogma” • after discovery of retroviruses Replication Reverse-transcription Transcription Translation

  15. DNA • Double stranded molecule with a spiral arrangement • Nucleotides pair by hydrogen bonding • Purines (A,G) bind to pyrimidines (T,C) A pairs to T with 2 hydrogen bonds G pairs to C with 3 hydrogen bonds • The strands run opposite to each other (antiparallel) • 4 letter alphabet: A,T,G,C • A human genome is about 3 billion base-pairs long spread across 23 chromosomes

  16. RNA • Single stranded molecule with complex secondary structure • Nucleotides pair by hydrogen bonding • Purines bind to pyrimidines • A pairs to U with 2 hydrogen bonds • G pairs to C with 3 hydrogen bonds • G pairs with U ( a wobble pair) • 4 letter alphabet: A,U,G,C

  17. Proteins20 letter amino acid alphabet of proteins

  18. Summary of the important polymers & their “alphabets” • DNA • Double stranded (helical) • deoxyribose-phosphate backbone • 4 letter alphabet (A,T,C,G) represent nucleotides • Complementary base pairing (A-T), (C-G) • RNA • Single stranded (messenger RNA) • ribose-phosephate backbone • 4 letter alphabet (A,U,C,G) • Complementary base pairing (A-U), (C-G),(U-G) • Proteins • Single stranded (complex folding structure) • 20 letter alphabet (represent amino acid residues)

  19. downstream upstream Notational and directional conventions “gene” 5’ sense (+) strand 3’ DNA anti-sense (-) strand (template) 3’ 5’ promoter RNA polymerease Transcription RNA 5’ 3’ UTR UTR Ribosome Translation Protein N-terminal ( amino end ) C -terminal (carboxyl end)

  20. Suppose I write out the human genome sequence as one long text file.How many human genomes will fit on a 4GB usb stick or how many 4GB sticks do I need to store 1 human genome?Don’t overthink this.

More Related