
Computational Genomics

Izabela Makalowska. July 15, 2006.


Presentation Transcript


  1. Computational Genomics. Izabela Makalowska, July 15, 2006

  2. The main task in modern biology is to find out how this… TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

  3. …becomes this

  4. DNA sequence contains all information but we need to decipher it. TGCATCGATCGTAGCTAGCTAGCGCATGCTAGCTAGCTAGCTAGCTACGATGCATCG TGCATCGATCGATGCATGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTAGCTATTGG CGCTAGCTAGCATGCATGCATGCATCGATGCATCGATTATAAGCGCGATGACGTCAG CGCGCGCATTATGCCGCGGCATGCTGCGCACACACAGTACTATAGCATTAGTAAAAA GGCCGCGTATATTTTACACGATAGTGCGGCGCGGCGCGTAGCTAGTGCTAGCTAGTC TCCGGTTACACAGGTAGCTAGCTAGCTGCTAGCTAGCTGCTGCATGCATGCATTAGT AGCTAGTGTAGCTAGCTAGCATGCTGCTAGCATGCAGCATGCATCGGGCGCGATGCT GCTAGCGCTGCTAGCTAGCTAGCTAGCTAGGCGCTAATTATTTATTTTGGGGGGTTA AAAAAAAAAATTTCGCTGCTTATACCCCCCCCCACATGATGATCGTTAGTAGCTACT AGCTCTCATCGCGCGGGGGGATGCTTAGCGTGGTGTGTGTGTGTGGTGTGTGTGGTC CTATAATTAGTGCATCGGCGCATCGATGGCTAGTCGATCGATCGATTTTATATATCT AAAGACCCCATCTCTCTCTCTTTTCCCTTCTCTCGCTAGCGGGCGGTACGATTTACC

  5. Program for life
     • DNA in our cells stores information in a way that is very similar to the way computers do.
     • Instead of being a binary memory, where everything is either 0 or 1, DNA uses a 4-letter alphabet: A, C, G, T
     • Using the computer metaphor we can say that:
       • A plant cell does not look like a mouse cell because their “programs” are different
       • Liver cells work differently than lung cells because of different input to the program
       • Children look like their parents because their program is a “revision” of the parents’ program
     • Many diseases are caused by “bugs” in the program:
       • Tay-Sachs disease: a simple mistake in one line of code
       • Huntington’s disease: a “line” of code gets repeated a bunch of times by accident
     • Different ways to solve the same problem:
       • Plants: photosynthesis = turn light into sugar
       • Animals: eat plants or other animals

  6. What exactly are we looking for in the DNA sequence?
     • Genes
       • Protein coding
       • RNA genes
       • Retrogenes
     • Regulatory elements
       • Promoters
       • Enhancers
       • siRNA
     • Repetitive elements
       • LINEs
       • SINEs
       • Simple repeats

  7. Are genes just protein or RNA coding elements?
     Makorin1-p1 is a non-coding pseudogene of Makorin1. Makorin1-p1 regulates the expression of its related coding gene: it stabilizes the Makorin1 mRNA by blocking a cis-acting RNA decay element within the 5’ region of Makorin1.

  8. Are repeats just junk DNA?
     Translation of an mRNA containing an Alu cassette results in a soluble form of the protein.
     Caras, I.W., Davitz, M.A., Rhee, L., Weddell, G., Martin Jr., D.W., Namba, T., Sugimoto, Y., Negishi, M., Irie, A., Ushikubi, F., Kaki-Nussenzweig, V., Cloning of decay-accelerating factor suggests novel use of splicing to generate two proteins. Nature. 1987 Feb 5-11;325(6104):545-9.

  9. Getting all genes
     • EST sequencing: the most direct way to identify a gene is to document the transcription of a fragment of the genome
       • Requires less sequencing since it is focused on coding sequence only
       • Low rate of false positives, although as many as 10% of EST sequences could be artifacts
       • Genes with very restricted expression may never be discovered
       • In most cases gives only partial sequences
     • Genome sequencing
       • Gives access to the entire genome and allows us to learn more about genome organization
       • Regulatory elements
       • Only a small percentage of the genome codes for genes
       • Hard to identify less typical genes
       • High rate of false positives

  10. Constructing ESTs (workflow diagram; steps in order)
      • Cell or tissue
      • Isolate mRNA and reverse transcribe into cDNA
      • Clone cDNA into a vector to make a cDNA library
      • Pick individual clones
      • Sequence the 5' and 3' ends of cDNA inserts
      • Analyze
      (Diagram labels: vector, cDNA, 5' EST, 3' EST.)

  11. Problems with EST data
      • Contamination
      • Low quality: the error rates are high in individual ESTs
      • Highly redundant: for highly expressed genes we can have hundreds of ESTs representing a single gene
      • The databases are skewed toward sequences near the 3’ end of mRNA
      • For most ESTs there is no indication as to the gene from which it was derived
      • Overlapping genes
      • Splice variants

  12. Chromatogram (three example traces)
      • An example of a good chromatogram showing well-resolved peaks and no ambiguities.
      • A region of a chromatogram fairly far along the sequence, where some bases in runs of 2 or more are no longer visible as single peaks. Many peaks are beginning to broaden and smear into one another, interpretation of the peaks has become more difficult, and the basecalling software has begun to use 'N's.
      • A region of a chromatogram where the traces have become too ambiguous for accurate basecalling. While some parts of this region can be useful for linking to existing sequences following manual editing, it should not be considered accurate.

  13. Sequence quality screening • ABI sequencing software contains a program for quality screening • PHRED - reads DNA sequence data, calls bases, and writes the base calls and quality values to output files

  14. Quality file

  15. Phred quality scores
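
The score table shown on this slide did not survive in the transcript. As a reminder, a Phred quality score Q is conventionally defined from the estimated base-call error probability P as Q = -10 log10(P). The following is a minimal Python sketch of that conversion; the function names are illustrative and are not part of PHRED itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to the estimated error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) to a Phred quality score Q."""
    return -10 * math.log10(p)


# Print the error probability implied by a few commonly quoted scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: estimated error probability {phred_to_error_prob(q):.4f}")
```

Under this definition, each step of 10 in Q corresponds to a tenfold drop in the estimated chance that the base call is wrong.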

  16. Contamination
      • Vectors - DNA/cDNAs from the biological source organism/organelle are usually inserted into a cloning vector so that they can be cloned, propagated, and manipulated. Sequencing of such constructs frequently produces raw sequences that include segments derived from vector.
      • Adapters, linkers, and PCR primers - Various oligonucleotides can be attached to the DNA/RNA under investigation as part of the cloning or amplification process.
      • Impurities in the DNA/RNA - Nucleic acid preparations may contain DNA/RNA from sources other than the intended one:
        • nucleic acids from an organelle
        • mRNA/DNA present in a reagent used in the isolation, purification, or cloning procedures
        • nucleic acids from other organisms present in the material from which the DNA/RNA was isolated
        • other DNAs/RNAs used in the laboratory (e.g., from accidental mixing of samples or cross contamination from dirty pipettes, tips, tubes, or equipment)

  17. Consequences of contamination
      • Time and effort wasted on meaningless analyses
      • Erroneous conclusions drawn about the biological significance of the sequence
      • Misassembly of sequence contigs and false clustering of Expressed Sequence Tags (ESTs)

  18. Why contamination causes problems
      (Figure labels: Gene A, Gene B, Assembly, Contigs.)

  19. VecScreen http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen_docs.html

  20. Cross_match • Cross_match is a general purpose application for comparing any two DNA sequence sets. For example, it can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads. It is slower but more sensitive than BLAST. GCACGCACAACCAGACCATGCTCGGACGACCCGCTGTACATCGGCCTGCGGCAGAGGCGCGTGCGCGGCGCCGCG TACGACGAGTTCGTCGACGAGTTCATGCAGGCGGTCGTCAAGCGCTTCGGGCAGAACTGCCTCATACAGTTCGAG GACTTCGCCAACGCGAACGCGTTCCGCCTGCTCGAGAAGTACCGCGGCAGGTACTGCACGTTCAACGACGACCTC CAGGGCACGGCGGCGGTGGCGGTGGCCGGGCTGCTCGCGTCGCTGCGCATCACCGGCAAGCGGCTCTCCGACAAC GTGTTCGTGTTCCAGGGAGCCGGCGAGGCATCTCTGGGTATCGCCGAGCTGTGCGTGATGGCGATGAAGAACGAG GGTACATCGGACGCCGATGCCCGCTGCAAGATTTGGATGGTGGACTCCAAGGGTCTCATCGTGAAGAACCGTCCT GAAGGTGGACTGAACGAACACAAGGAGAAGTTTGCCCAGAACTGCTCCCCCATTCGGACACTTGCCGAAGTTATA AATGTTGCTAAGCCTTCTGTACTGATTGGCXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXXXXXXXXXXXGCTCGACGGCGGAAGGAAAA
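
In the example read above, the vector-derived portion has been replaced by a run of `X` characters. A downstream step usually needs the unmasked parts only. The sketch below shows one way to pull them out in Python; it assumes masking uses `X` (or `x`), and the function names are hypothetical helpers, not part of Cross_match.

```python
import re


def unmasked_segments(read: str, min_len: int = 1):
    """Return (start, end, sequence) for every run of bases not masked with X."""
    return [
        (m.start(), m.end(), m.group())
        for m in re.finditer(r"[^Xx]+", read)
        if len(m.group()) >= min_len
    ]


def longest_unmasked(read: str) -> str:
    """Return the longest unmasked stretch of a vector-masked read ('' if none)."""
    segments = unmasked_segments(read)
    if not segments:
        return ""
    return max(segments, key=lambda s: s[1] - s[0])[2]


# Toy read: insert sequence, masked vector, then a short trailing fragment.
toy_read = "ACGTXXXXACGTACGTXXAC"
print(unmasked_segments(toy_read))
print(longest_unmasked(toy_read))
```

Keeping only the longest unmasked stretch is a simple policy choice; a pipeline might instead keep every segment above a minimum length by passing `min_len`.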

  21. Qscreen
      • Sequencing data management and quality checking system
      • Quality screening
      • Relational database for sequence data management and archive
      • Easy data access via web interface
      • Easy data sharing
      • Sequence and trace view
      • Project statistics
      • Password protected

  22. Qscreen: project and user management

  23. Gene discovery strategy
      • Cluster and assemble EST sequences to lower redundancy and to increase the length of transcripts
      • Find coding regions and the reading frame
      • Use the deduced protein sequence to search databases and assign a function to the gene
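
The second step, finding the coding region and reading frame, can be illustrated with a naive open reading frame (ORF) scan over all six frames. This is only a sketch under simplifying assumptions: it requires an ATG start and an in-frame stop codon, whereas assembled ESTs are often partial and may lack either. The function names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf(seq: str):
    """Return (strand, frame, start, end, orf) for the longest ATG..stop ORF.

    Coordinates refer to the scanned strand. Returns None if no complete ORF
    is found in any of the six reading frames.
    """
    seq = seq.upper()
    best = None
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if best is None or len(orf) > len(best[4]):
                        best = (strand, frame, start, i + 3, orf)
                    start = None  # look for the next ORF in this frame
    return best


print(longest_orf("CCATGAAATTTGGGTAACC"))
```

The ORF returned here would then be translated and the deduced protein used for the database search described in the third step.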

  24. Cluster and assemble EST - resources
      • Assembly tools:
        • TGICL http://www.tigr.org/tdb/tgi/software/
        • CAP3 http://genome.cs.mtu.edu/sas.html
        • Phrap http://www.phrap.org/
      • EST cluster databases
        • UniGene http://www.ncbi.nlm.nih.gov/
        • TGI http://www.tigr.org/
      • EST analysis pipeline
        • SMAPi http://smapi.cbio.psu.edu

  25. Assembling ESTs
      • Two-stage process:
        • Clustering ESTs based on similarity and clone ID
        • Assembling inside each cluster

  26. Important parameters
      • Length and similarity level of overlapping fragments
      • Length of overhanging fragments
      • Criteria too stringent = many ESTs will not be assembled and genes will stay fragmented
      • Criteria too loose = ESTs from genes of the same family will be assembled into one gene
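
The trade-off on this slide can be made concrete with a toy single-linkage clustering. The sketch below assumes pairwise overlaps have already been computed and are given as tuples of (read A, read B, overlap length, identity, overhang); all values and thresholds are made up for illustration and do not come from any particular tool.

```python
def cluster_reads(n_reads, overlaps, min_overlap, min_identity, max_overhang):
    """Single-linkage clustering: join two reads if their overlap passes all thresholds."""
    parent = list(range(n_reads))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, overlap_len, identity, overhang in overlaps:
        if (overlap_len >= min_overlap
                and identity >= min_identity
                and overhang <= max_overhang):
            parent[find(a)] = find(b)

    groups = {}
    for read in range(n_reads):
        groups.setdefault(find(read), []).append(read)
    return sorted(groups.values())


# Hypothetical pairwise overlaps between five reads (0..4).
overlaps = [
    (0, 1, 120, 0.99, 5),
    (1, 2, 60, 0.97, 10),
    (2, 3, 150, 0.90, 0),   # lower identity, e.g. two members of a gene family
    (3, 4, 45, 0.96, 25),
]

stringent = cluster_reads(5, overlaps, min_overlap=100, min_identity=0.98, max_overhang=10)
loose = cluster_reads(5, overlaps, min_overlap=40, min_identity=0.90, max_overhang=30)
print("stringent:", stringent)
print("loose:", loose)
```

With the stringent settings most reads stay as separate fragments, while the loose settings merge everything, including the low-identity pair, into a single cluster.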

  27. Contig quality ATGTCTCTNTCACTGA TCTGTCCC-CAGTCACGATCGAN ATGTCTCTGTCNCTNAGTCACGATCGAN ATGTCTCTNTCACTGA TCTGTCCC-CAGTCACGATCGAN ATGTCTCGGTCAC-CAGTCACGATCGAT ATGTCTCGGTCAC-CAGTCACGATCGAT TTGTCTGGGTCAC-CTCC GGTGGC-CAGTCACGATNGAN ATGTCTCGGTCAC-CAGTCACGATCGAT ATGTCTCTGTCNCTNAGTCACGATCGAN ATGTCTCGGTCAC-CAGTCACGATCGAT

  28. One gene one cluster?
      (Figure labels: 5’ ESTs, 3’ ESTs; joined based on similarity; joined based on clone ID.)

  29. One cluster one gene?

  30. CAP3
      • Use of forward-reverse constraints to correct assembly errors and link contigs.
      • Use of base quality values in alignment of sequence reads.
      • Automatic clipping of 5' and 3' poor regions of reads.
      • Generation of assembly results in ‘ace’ file format for Consed.

  31. Input files • CAP3 takes as input a file of sequence reads in FASTA format. CAP3 takes two optional files: a file of quality values in FASTA format and a file of forward-reverse constraints. The file of quality values must be named "xyz.qual", and the file of forward-reverse constraints must be named "xyz.con", where "xyz" is the name of the sequence file. CAP3 uses the same format of a quality file as Phrap.
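
The naming rule above is easy to get wrong, because the suffix is appended to the full sequence file name. The sketch below derives the companion file names and formats one quality record in the FASTA-like layout used for quality files (a header line followed by whitespace-separated integer scores). The helper names and the example file name are hypothetical.

```python
from pathlib import Path


def cap3_companion_files(seq_file):
    """Return the expected quality and constraint file paths for a sequence file."""
    seq = Path(seq_file)
    return {
        "sequences": seq,
        "quality": seq.with_name(seq.name + ".qual"),     # "xyz" -> "xyz.qual"
        "constraints": seq.with_name(seq.name + ".con"),  # "xyz" -> "xyz.con"
    }


def format_quality_record(read_name, quals):
    """Format one read's quality values as a FASTA-style record."""
    return f">{read_name}\n{' '.join(str(q) for q in quals)}\n"


files = cap3_companion_files("reads/xyz")
for role, path in files.items():
    status = "found" if path.exists() else "missing"
    print(f"{role}: {path} ({status})")

print(format_quality_record("read1", [20, 30, 40]), end="")
```

Both companion files are optional, so a missing file is not an error; the check only makes it visible which inputs an assembly run would actually pick up.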

  32. http://bio.ifom-firc.it/ASSEMBLY/assemble.html Web interface

  33. Output files

  34. TGICL – TIGR Gene Indices clustering tool
      • Clustering – uses a modified megablast program to cluster sequences together
      • Assembly – CAP3 is used to assemble sequences inside each cluster

  35. TGICL – TIGR Gene Indices clustering tool
      • Sequences need to be cleaned before using TGICL (Lucy, UniVec, SeqClean)
      • mRNA sequences may be used as ‘seeds’ for clustering. Caution: partial mRNAs mislabeled as complete can prevent cluster extension beyond the seed.
      • Difficulty with highly expressed genes that have several thousand ESTs in a single cluster (the assembly program may run out of memory)

  36. Phrap
      • part of the Phred/Phrap/Consed package
      • program for assembling shotgun DNA sequence data
      • allows use of the entire read and not just the trimmed high-quality part
      • uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy
      • constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus
      • provides extensive assembly information to assist in troubleshooting assembly problems
      • handles large datasets
      • It is strongly recommended that phrap be used in conjunction with the base calls and base quality values produced by the basecaller, phred, and with the sequence editor/assembly viewer, consed.

  37. Phrap output in Consed

  38. Apple EST assembly: 183,732 ESTs
      • Phrap: 24,199 contigs; 9,765 singletons
      • CAP3: 19,791 contigs; 18,927 singletons
      • TGICL: 22,481 contigs; 28,279 singletons
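
To compare the three results, it helps to derive a few numbers from the counts on this slide. The sketch below assumes that every input EST ends up either in a contig or as a singleton, which the slide does not state explicitly; the `summarize` helper is illustrative.

```python
TOTAL_ESTS = 183_732

# (contigs, singletons) as reported on the slide.
results = {
    "Phrap": (24_199, 9_765),
    "CAP3": (19_791, 18_927),
    "TGICL": (22_481, 28_279),
}


def summarize(contigs, singletons, total=TOTAL_ESTS):
    """Return (unique sequences, ESTs placed in contigs, ESTs per unique sequence)."""
    unique = contigs + singletons
    in_contigs = total - singletons  # assumes no ESTs were discarded
    return unique, in_contigs, total / unique


for tool, (contigs, singletons) in results.items():
    unique, in_contigs, ratio = summarize(contigs, singletons)
    print(f"{tool}: {unique} unique sequences, "
          f"{in_contigs} ESTs in contigs, {ratio:.1f} ESTs per unique sequence")
```

The number of unique sequences (contigs plus singletons) is the figure that bounds the gene count estimate, so the spread between tools shows how strongly the choice of assembler affects it.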

  39. NCBI UniGene

  40. Clustering at NCBI
      • An EST must have at least 100 bp after removing contaminants
      • The overlap between similar ESTs must be at least 70 bp
      • Similarity must be at least 96% over at least 70% of the overlapping region
      (Figure labels: >100bp; >100bp; >70bp; >70% of overlap with at least 96% of similarity.)
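
The three criteria can be written as a single check. The sketch below is one reading of the slide, not NCBI's implementation: it assumes the third rule means that a region aligned at 96% identity or better must cover at least 70% of the overlap. The function and parameter names are hypothetical.

```python
def passes_clustering_criteria(len_a, len_b, overlap_len,
                               best_region_len, best_region_identity,
                               min_clean_len=100, min_overlap=70,
                               min_identity=0.96, min_region_fraction=0.70):
    """Check two cleaned ESTs against the thresholds listed on the slide.

    len_a, len_b          lengths (bp) after removing contaminants
    overlap_len           length (bp) of the overlap between the two ESTs
    best_region_len       length (bp) of the best-aligned part of the overlap
    best_region_identity  fraction of identical bases in that part (0..1)
    """
    if len_a < min_clean_len or len_b < min_clean_len:
        return False
    if overlap_len < min_overlap:
        return False
    if best_region_identity < min_identity:
        return False
    return best_region_len >= min_region_fraction * overlap_len


print(passes_clustering_criteria(250, 180, 100, 80, 0.97))  # all criteria met
print(passes_clustering_criteria(250, 90, 100, 80, 0.97))   # second EST too short
print(passes_clustering_criteria(250, 180, 60, 60, 0.99))   # overlap too short
```

Exposing the thresholds as keyword arguments makes it easy to try other settings, such as the overlap and similarity values quoted for other clustering resources.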

  41. TIGR Gene Indices www.tigr.org

  42. What are TIGR gene indices?
      • Clustered and assembled ESTs and mRNA sequences
      • Each gene and each splice variant is represented by a single consensus sequence
      • Provides ORF annotation, genome mapping, expression profiles, domain annotation, unique oligomers
      (Figure labels: >40bp of overlap with at least 95% of similarity; <30bp; <30bp.)

  43. TGI annotations

  44. TGI sequence annotations
