
CSCE555 Bioinformatics

CSCE555 Bioinformatics. Lecture 2 Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555. University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu. Roadmap. DNA, Chromosomes, Genomes


Presentation Transcript


  1. CSCE555 Bioinformatics Lecture 2 Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

  2. Roadmap • DNA, Chromosomes, Genomes • Genome Sequencing and whole genomes • DNA Sequence Representation, Models • Sequence Retrieval, Manipulation • Basic Analysis and Questions of Genomes • Summary

  3. Tools to Learn Concepts Quickly • Wikipedia.org • Searching “Genome” brings up a lot of related information • In Google, type “keywords wiki” • Google search tips • Find info from university websites • Genome, site:edu • Find info as PowerPoint files • Genome, tutorial, filetype:ppt

  4. DNA Bases A: adenine C: cytosine G: guanine T: thymine • Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms. • Backbone: sugars and phosphate groups • DNA is a long polymer of simple units called nucleotides

  5. Microbial Genome: Clostridium sp. OhILAs CTGCTGTACTAGGATGCTGGTGGAGAGAGCTGCATATAAATCTTTGAGAGATGCACCAAG AATCACCATCATGGTTTCCGCCATAGGGGCTTCTTTTTTTATTCAAAATCTTGCCATTGT TTTATTTGGTGGTAGACCGAAAACTGTTCCAACGGTGGAGGTATTGTCCGGGGTGATAAA GCTGGGGTCCGTATCTCTACAAAGGCTGACCTTAGTGATTCCAGTAGTAACCATACTGCT ATTATTTCTTTTGATGTTTTTAGTGAACCAAACGAAAACTGGAATGGCAATGCGTGCCGT ATCCAAGGACTATGAAACCGCGCGGCTTATGGGAATTGACGTCAATAAAATTATTACCAT AACCTTTGGTATTGGCTCTGCTCTGGCAGCTATTGGTGGCATCATGTGGGGCGCAAAATT TCCTAAAATAGACCCTTTTGTTGGGACTATGCCGGGTATTAAATGCTTTATTGCTGCAGT TCTAGGTGGAATCGGAAACATTCCCGGTGCAGTAATCGGGGGGTTCATCTTAGGGATTGG AGAGATTATGCTCATTGCTTTTCTACCGAGCCTAACTGGCTATCGAGATGCCTTTGCTTT CATACTACTGATTATCATTCTACTGTTTAAGCCAACAGGAATCATGGGTGAAAAAATTGC GGAGAAGGTGTAGACGATGAAAAAGAAAAATACCATATTAACTGGATTAGCAGTATTGCT TTTATTGATTTATTTGATTTATGCAAATAAGAATTATGATTCTTATAAAATTAGAGTTCT AAATCTATGTGCAATTTATGCTGTATTGGGACTCAGTATGAATTTGATCAATGGATTTAC AGGTTTATTTTCCCTTGGACATGCAGGTTTTATGGCAGTAGGTGCCTATACTACCGCTCT TCTGACCATGACACCGCAAAGTAAGGAGGCAACATTCTTCTTAGTGCCCATTGTAGAGCC TTTGGCTAAAATTCAGCTTCCTTTTTTTGTGGCACTGATCATCGGTGGACTACTTTCAGC AATGGTGGCATTTTTAATCGGTGCACCGACTTTAAGGCTGAAGGGCGATTATTTAGCCAT • Complementary Base Pairing: A-T, C-G • Exercise: write a program to output the complementary sequence.
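
One way to approach the programming exercise on this slide is a simple lookup table. The sketch below is a minimal Python illustration, assuming an uppercase or lowercase sequence over A, C, G, T only; the names `complement` and `reverse_complement` are illustrative and do not come from the lecture.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Return the complementary strand, read in the same direction."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3' (reversed)."""
    return complement(seq)[::-1]


# First 20 bases of the sequence shown on the slide.
fragment = "CTGCTGTACTAGGATGCTGG"
print(complement(fragment))
print(reverse_complement(fragment))
```

The reverse complement is included because the opposite strand is conventionally written in its own 5' to 3' direction, which is the form most tools expect.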

  6. Genome of organisms • The genome of an organism is the complete DNA sequence of one set of its chromosomes

  7. Sequencing: Basic Ideas • Current lab techniques can sequence only small DNA pieces (about 700 base pairs). • Use restriction enzymes to cut DNA into pieces • Sort pieces of different sizes using gel electrophoresis and use the sorting to read them • Mapping and Walking • Sequence one piece, get 700 letters, make a primer that allows you to read the next 700, and work sequentially down the clone • Estimated time to sequence the human genome using this method: 100 years • Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes • Obtain random sequence reads from a genome • Assemble them into contigs on the basis of sequence overlaps • Straightforward for simple genomes (with no or few repeat sequences) • Merge reads containing overlapping sequence • Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches

  8. How Sequencing Works Beckman CEQ 8000

  9. Sequencing small DNA pieces • Use DNA cloning or PCR to make multiple copies. • Put the copies in 4 test tubes marked G, A, T and C. • In test tube G, use a restriction enzyme that cuts at G. • Do the same for the other test tubes. • Run gel electrophoresis separately on the contents of each test tube. • The data results in the table on the left. • Reading the table, we get: G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14, 15, 16; T has lengths 4, 5, 9, 18; and C has lengths 3, 10, 17. • Each length gives the position of that base, so ordering the lengths gives us the sequence.
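
The last step on this slide, reading the sequence off the fragment lengths, can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, assuming every position from 1 to 19 appears in exactly one tube; the function name `read_sequence` is made up for this example.

```python
def read_sequence(fragment_lengths):
    """Rebuild a sequence from {base: [fragment lengths]}.

    A fragment of length n in tube X means position n of the
    sequence is base X, so sorting all lengths spells the sequence.
    """
    position_to_base = {}
    for base, lengths in fragment_lengths.items():
        for length in lengths:
            position_to_base[length] = base
    return "".join(position_to_base[pos] for pos in sorted(position_to_base))


# Fragment lengths as listed on the slide.
gel = {
    "G": [1, 7, 12, 13, 19],
    "A": [2, 6, 8, 11, 14, 15, 16],
    "T": [4, 5, 9, 18],
    "C": [3, 10, 17],
}
print(read_sequence(gel))  # GACTTAGATCAGGAAACTG
```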

  10. Methods for very large scale sequencing • A hierarchical approach • Map on a large scale (physical mapping), sequence specific clones whose position in the genome is known • Shotgun sequencing • “Tear up” the genome and sequence random fragments until it is done • Sequence tagged connectors (STC) • Sequence the ends of many clones and use this info to pick overlapping clones

  11. “Shotgun” sequencing • Subclone • Copy • Clone to sequence • Sequence and “assemble” • ….GTCTACCTGTACTGATCTAGC... • …. CCTGTACTGATCTAGCATTA... • …. GTACTGATCTAGCATTACG...
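
The "assemble" step can be illustrated with the three reads shown on this slide. The sketch below merges reads by their longest suffix/prefix overlap; it assumes the reads are error-free, already in left-to-right order, and on the same strand, which real assemblers cannot assume. The function names and the `min_len` parameter are illustrative.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def assemble_in_order(reads, min_len=3):
    """Merge reads left to right into one contig using exact overlaps."""
    contig = reads[0]
    for read in reads[1:]:
        n = overlap(contig, read, min_len)
        if n == 0:
            raise ValueError("no overlap found for read: " + read)
        contig += read[n:]  # append only the part not already covered
    return contig


reads = [
    "GTCTACCTGTACTGATCTAGC",
    "CCTGTACTGATCTAGCATTA",
    "GTACTGATCTAGCATTACG",
]
print(assemble_in_order(reads))  # GTCTACCTGTACTGATCTAGCATTACG
```

Repeats are what make this hard in practice: when the same overlap string occurs in several places in the genome, the longest overlap is no longer guaranteed to be the correct one.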

  12. Emerging Sequencing Methods • Sequencing by Hybridization (SBH) • Mass Spectrometric Sequencing • Direct Visualization of Single DNA Molecules by Atomic Force Microscopy (AFM) • Single Molecule Sequencing Techniques • Single Nucleotide Cutting • Nanopore Sequencing • Readout of Cellular Gene Expression

  13. Whole Genomes of Species • Bacterial Genomes • Eukaryotic Genomes • Human Genome Project • Other Animal and Plant Genomes • Model Genomes The genomes of more than 180 organisms have been sequenced since 1995 http://www.genomenewsnetwork.org/resources/sequenced_genomes/genome_guide_p1.shtml

  14. Sizes of Genomes • You will learn to download all these genomes onto your computer’s hard drive. • Refer to Table 1.1, page 2 of the Intro to Comp Genomics book.

  15. Roadmap • DNA, Chromosomes, Genomes • Genome Sequencing and whole genomes • DNA Sequence Representation, Models • Sequence Retrieval, Manipulation • Basic Analysis and Questions of Genomes • Summary

  16. DNA Sequence Representation • DNA Sequence: a string of letters with alphabet {A, C, G, T} • Protein sequence: a string of amino acids with alphabet {ARNDCEQGHILKMFPSTWYV} • 20 standard amino acids • Genetic code:

  17. Genetic Code: Codon • DNA (ATCG) → RNA (AUCG) • Three bases of DNA encode an amino acid
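
The mapping from base triplets to amino acids can be sketched as a lookup table. The example below builds the standard genetic code from its usual compact 64-letter layout (bases ordered T, C, A, G; `*` marks a stop codon). It assumes the input is the coding strand, read in frame from its first base; the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA codon by codon; trailing bases that do not fill a codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(transcribe("ATCG"))      # AUCG
print(translate("ATGGCCTAA"))  # MA*
```

Because 64 codons map onto 20 amino acids plus stop, several codons share one amino acid; that is the degeneracy of the code.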

  18. Genetic Code with Degeneracy

  19. Representation of Sequences • Single DNA sequence • ATCCTTAAGGAAA • Multiple sequences with similarity • Regular Expression • ATAAA • ACAAAA • ATAAAAAA • A[TC]A+
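
The regular expression on this slide can be checked directly with Python's standard `re` module. This short sketch confirms that `A[TC]A+` describes all three example sequences and shows how the same pattern could be used to scan a longer sequence; the helper name `find_motif` and the test sequence are made up for illustration.

```python
import re

# A, then T or C, then one or more A's.
MOTIF = re.compile(r"A[TC]A+")

examples = ["ATAAA", "ACAAAA", "ATAAAAAA"]
for seq in examples:
    # fullmatch requires the whole string to fit the pattern.
    print(seq, bool(MOTIF.fullmatch(seq)))


def find_motif(seq):
    """Return (start position, matched text) for each motif occurrence."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(seq)]


print(find_motif("GGATAAAGG"))  # [(2, 'ATAAA')]
```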

  20. Representation of Sequences • Probabilistic Model: Position-specific scoring matrices (PSSM)
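
A PSSM can be sketched as one score per base per alignment column. The example below is one simple way to build it, not necessarily the construction used in the lecture: it assumes gap-free aligned sites of equal length, a uniform background of 0.25 per base, and a pseudocount of 1 to avoid taking the log of zero. The aligned sites are hypothetical.

```python
import math


def build_pssm(aligned, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Return a list (one dict per column) of log2-odds scores per base."""
    pssm = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return pssm


def score(pssm, seq):
    """Sum the column scores for a candidate site of the same length."""
    return sum(column[base] for column, base in zip(pssm, seq))


# Hypothetical aligned binding sites, for illustration only.
sites = ["ATAAA", "ACAAA", "ATAAA", "ATAAA"]
pssm = build_pssm(sites)
print(round(score(pssm, "ATAAA"), 2))
print(round(score(pssm, "GGGGG"), 2))
```

Unlike a regular expression, which only answers yes or no, the matrix gives a graded score, so a site with one unusual base still scores higher than an unrelated sequence.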

  21. Representation of Sequence: FASTA format • text-based format for representing either nucleic acid sequences or peptide sequences, • allows for sequence names and comments to precede the sequences.
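
In a FASTA file, each record starts with a header line beginning with `>`, followed by one or more lines of sequence. The sketch below parses such text into a dictionary; the record names and sequences are invented for illustration, and the parser assumes header lines are unique.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]          # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record FASTA text.
example = """>seq1 example DNA sequence
ATCCTTAAGG
AAA
>seq2 example peptide
MKVLA
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```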

  22. Roadmap • DNA, Chromosomes, Genomes • Genome Sequencing and whole genomes • DNA Sequence Representation, Models • Sequence Retrieval, Manipulation • Basic Analysis and Questions of Genomes • Summary

  23. Sequence Retrieval, Manipulation • Where to download genome/sequence data • Online databases: EMBL, GenBank • Entrez cross-database search (life science search engine) • Google

  24. Example: Download H. influenzae Genome • First bacterial genome: H. influenzae, 1,830 kb • http://www.ncbi.nlm.nih.gov/sites/entrez • NC_007146: Haemophilus influenzae 86-028NP, complete genome • DNA; circular; Length: 1,914,490 nt • Replicon Type: chromosome • Created: 2005/06/27

  25. Genome Information of H. influenzae

  26. Download the Complete Genome Sequence in Fasta Format

  27. Roadmap • DNA, Chromosomes, Genomes • Genome Sequencing and whole genomes • DNA Sequence Representation, Models • Sequence Retrieval, Manipulation • Basic Analysis and Questions of Genomes • Summary

  28. Simple Questions and Analysis of Genome Sequence • Frequencies of Bases A/C/G/T by simple counting • Sliding windows to check local density • AT AG AC TA TG TC • K-mers frequent/unusual words • 2-mers AT AG AC TA TG TC etc. • 3-mers
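
Base frequencies and k-mer counts both come down to simple counting. The sketch below uses `collections.Counter` from the standard library; it assumes a non-empty sequence, and the function names and the short test sequence are illustrative.

```python
from collections import Counter


def base_frequencies(seq):
    """Fraction of each of A, C, G, T in the sequence (other symbols count toward the total)."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}


def kmer_counts(seq, k):
    """Count every overlapping word of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "ATCCTTAAGGAAA"  # short example sequence from an earlier slide
print(base_frequencies(seq))
print(kmer_counts(seq, 2).most_common(3))
print(kmer_counts(seq, 3).most_common(3))
```

Comparing observed k-mer counts with what the single-base frequencies would predict is one way to spot frequent or unusual words.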

  29. Genomic landscape: GC content analysis • The overall GC content of the human genome is 41%. • A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right. Page 627
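
GC content per window can be computed with a sliding window over the sequence. The sketch below defaults to 20 kb non-overlapping windows to match the slide; the `step` parameter, the function names, and the tiny test sequence are assumptions added for illustration.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=20000, step=None):
    """GC content of each full window; windows do not overlap unless step < window."""
    if step is None:
        step = window
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


toy = "GGCCAATT"
print(windowed_gc(toy, window=4))          # [1.0, 0.0]
print(windowed_gc(toy, window=4, step=2))  # [1.0, 0.5, 0.0]
```

A histogram of the values returned for a whole genome gives the kind of GC profile described on this slide.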

  30. GC content of the human genome: mean 41% Fig. 17.15 Page 628 Source: IHGSC (2001)

  31. Genomic landscape: CpG islands • CpG dinucleotides are under-represented in genomic DNA, occurring at one fifth the expected frequency. • CpG dinucleotides are often methylated on cytosine (which may subsequently be deaminated to thymine). • Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions. • Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression. • They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
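
The "expected frequency" of CpG is usually taken from the single-base counts: if C and G were placed independently, a sequence of length N would contain about (number of C × number of G) / N CpG dinucleotides. The sketch below computes a commonly used observed/expected ratio on that assumption; the function name is illustrative and the slide does not specify which formula its figures are based on.

```python
def cpg_observed_expected(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    observed = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = c_count * g_count / len(seq)
    return observed / expected


print(cpg_observed_expected("CGCGCG"))  # 2.0
print(cpg_observed_expected("CCGG"))    # 1.0
```

A ratio near 0.2 over bulk genomic DNA corresponds to the one-fifth figure on this slide, while CpG islands stand out as windows where the ratio is much higher.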

  32. Broad genomic landscape: CpG islands • Findings: • 50,267 CpG islands in human genome • 28,890 after masking repeats with RepeatMasker • 5-15 CpG islands per megabase • (about <40 genes per megabase)

  33. Summary • DNA, Chromosome, Genome • Sequence models • Sequence database, retrieval • Whole genome sequence analysis

  34. Slide Credits • Slides in this presentation are partially based on slides found on the Internet.
