
Sequencing a genome


brady-white


  1. Sequencing a genome

  2. Definition • Determining the identity and order of the nucleotides in the genetic material (usually DNA, sometimes RNA) of an organism

  3. Basic problem • Genomes are large (typically millions or billions of base pairs) • Current technology can only reliably ‘read’ a short stretch – typically hundreds of base pairs

  4. Elements of a solution • Automation – over the past decade, the amount of hand-labor in the ‘reads’ has been steadily and dramatically reduced • Assembly of the reads into sequences is an algorithmic and computational problem

  5. A human drama • There are competing methods of assembly • The competing – public and private – sequencing teams used competing assembly methods

  6. Assembly: • Putting sequenced fragments of DNA into their correct chromosomal positions

  7. BAC • Bacterial artificial chromosome: bacterial DNA spliced with a medium-sized fragment of a genome (100 to 300 kb) to be amplified in bacteria and sequenced.

  8. Contig • Contiguous sequence of DNA created by assembling overlapping sequenced fragments of a chromosome (whether natural or artificial, as in BACs)

  9. Cosmid • DNA from a bacterial virus spliced with a small fragment of a genome (45 kb or less) to be amplified and sequenced

  10. Directed sequencing • Successively sequencing DNA from adjacent stretches of chromosome

  11. Draft sequence • Sequence with lower accuracy than a finished sequence; some segments are missing or in the wrong order or orientation

  12. EST • Expressed sequence tag: a unique stretch of DNA within a coding region of a gene; useful for identifying full-length genes and as a landmark for mapping

  13. Exon • Region of a gene’s DNA that encodes a portion of its protein; exons are interspersed with noncoding introns

  14. Genome • The entire chromosomal genetic material of an organism

  15. Intron • Region of a gene’s DNA that is not translated into a protein

  16. Kilobase (kb) • Unit of DNA equal to 1000 bases

  17. Locus • Chromosomal location of a gene or other piece of DNA

  18. Megabase (Mb) • Unit of DNA equal to 1 million bases

  19. PCR • Polymerase chain reaction: a technique for amplifying a piece of DNA quickly and cheaply

  20. Physical map • A map of the locations of identifiable markers spaced along the chromosomes; a physical map may also be a set of overlapping clones

  21. Plasmid • Loop of bacterial DNA that replicates independently of the chromosomes; artificial plasmids can be inserted into bacteria to amplify DNA for sequencing

  22. Regulatory region • A segment of DNA that controls whether a gene will be expressed and to what degree

  23. Repetitive DNA • Sequences of varying lengths that occur in multiple copies in the genome; repetitive DNA makes up much of the genome

  24. Restriction enzyme • An enzyme that cuts DNA at specific sequences of base pairs

  25. RFLP • Restriction fragment length polymorphism: genetic variation in the length of DNA fragments produced by restriction enzymes; useful as markers on maps
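
The restriction enzyme and RFLP definitions above can be illustrated with a small sketch. It assumes the EcoRI recognition site GAATTC (cut between G and AATTC); the two toy alleles and the function name `digest` are hypothetical and exist only for illustration.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Cut seq at every occurrence of site and return the fragment lengths."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)      # cut falls just after the G
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]

# Two hypothetical alleles: allele_b has one base change (A -> G)
# that destroys the second recognition site.
allele_a = "ATGC" + "GAATTC" + "TTAGGCC" + "GAATTC" + "AAT"
allele_b = "ATGC" + "GAATTC" + "TTAGGCC" + "GAGTTC" + "AAT"

print(digest(allele_a))  # three fragments
print(digest(allele_b))  # two fragments: the lost site merges two of them
```

The differing fragment lengths are the polymorphism that an RFLP marker detects.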

  26. Scaffold • A series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence

  27. Shotgun sequencing • Breaking DNA into many small pieces, sequencing the pieces, and assembling the fragments

  28. STS • Sequence tagged site: a unique stretch of DNA whose location is known; serves as a landmark for mapping and assembly

  29. YAC • Yeast artificial chromosome: yeast DNA spliced with a large fragment of a genome (up to 1 Mb) to be amplified in yeast cells and sequenced

  30. Readings • Myers, “Whole Genome DNA Sequencing,” http://www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf • Venter et al., “The Sequence of the Human Genome,” Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2) • Waterston, Lander, Sulston, “On the sequencing of the human genome,” PNAS, 19 March 2002, Vol. 99, No. 6, 3712-3716 • Myers et al., “On the sequencing and assembly of the human genome,” www.pnas.org/cgi/doi/10.1073/pnas.092136699

  31. Hierarchical sequencing • Create a high-level physical map, using ESTs and STSs • Shred genome into overlapping clones • Multiply clones in BACs • ‘shotgun’ each clone • Read each ‘shotgunned’ fragment • Assemble the fragments

  32. Physical map

  33. Whole genome sequencing (WGS) • Make multiple copies of the target • Randomly ‘shotgun’ each target, discarding very big and very small pieces • Read each fragment • Reassemble the ‘reads’
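
A minimal sketch of the shotgun step, under simplifying assumptions: the target is a random toy sequence, every read has the same fixed length, and reads are error-free. The function name `shotgun_reads` and all parameters are hypothetical.

```python
import random

def shotgun_reads(target, read_len, n_reads, seed=0):
    """Sample n_reads fixed-length reads from uniformly random start positions."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(target) - read_len + 1)
        reads.append(target[start:start + read_len])
    return reads

# Toy target of 1,000 bases (a real target is millions of bases or more).
target_rng = random.Random(1)
target = "".join(target_rng.choice("ACGT") for _ in range(1000))

reads = shotgun_reads(target, read_len=50, n_reads=100)
coverage = len(reads) * 50 / len(target)   # average times each base is read
print(len(reads), "reads, average coverage", coverage)
```

Because start positions are random, some bases are read many times and others not at all, which is why average coverage must be well above 1.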

  34. Hierarchical v. whole-genome

  35. The fragment assembly problem • Aim: infer the target from the reads • Difficulties: • Incomplete coverage leaves contigs separated by gaps of unknown size. • Sequencing errors: the error rate increases with the length of the read, but stays below some bound ε. • Unknown orientation: we don’t know whether to use a read or its Watson-Crick (reverse) complement.
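
The orientation difficulty can be sketched as follows. This is an illustration under strong assumptions, not an assembler: overlaps are exact (no sequencing errors), and the helper names `reverse_complement`, `overlap`, and `best_overlap` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(read):
    """Watson-Crick complement of a read, written 5' to 3'."""
    return read.translate(COMPLEMENT)[::-1]

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def best_overlap(a, b, min_len=3):
    """Try b as given and reverse-complemented; return (length, oriented b)."""
    fwd = overlap(a, b, min_len)
    rc = reverse_complement(b)
    rev = overlap(a, rc, min_len)
    return (fwd, b) if fwd >= rev else (rev, rc)

a = "GGATCCTTAC"
b = "TCCGTAAG"            # read from the opposite strand
n, oriented = best_overlap(a, b)
merged = a + oriented[n:]  # join the two reads into a small contig
print(n, oriented, merged)
```

Only after flipping `b` does an overlap with `a` appear, so an assembler has to test both orientations of every read.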

  36. Scaling and computational complexity • Increasing size of target G. • 1990 – 40 kb (one cosmid) • 1995 – 1.8 Mb (H. influenzae) • 2001 – 3,200 Mb (H. sapiens)

  37. The repeat problem • Repeats • Bigger G means more repeats • Complex organisms have more repetitive elements • Small repeats may appear multiple times in a read • Long repeats may be bigger than reads (no unique region)

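  38. Gaps • Read length $L_R$ hasn’t changed much • $\rho = L_R / G$ gets steadily smaller as the target $G$ grows • Expected number of gaps $\approx R\,e^{-\rho R}$, where $R$ is the number of reads and $\rho R$ is the coverage (Lander & Waterman)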

  39. How deep must coverage be?
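
One way to explore this question is the Lander-Waterman estimate: with R reads of length L_R on a target of length G, coverage is c = R × L_R / G and the expected number of gaps is roughly R × e^(−c). The sketch below ignores the minimum overlap needed to detect a join. The read length of 500 is an assumed value, and the target size is the 1.8 Mb figure from slide 36.

```python
import math

def expected_gaps(genome_len, read_len, n_reads):
    """Lander-Waterman estimate of the number of gaps, ignoring minimum overlap."""
    coverage = n_reads * read_len / genome_len
    return n_reads * math.exp(-coverage)

G = 1_800_000   # target length in bases (about 1.8 Mb)
L = 500         # assumed read length in bases

for c in (1, 2, 5, 8, 10):
    n = c * G // L                       # reads needed for coverage c
    print(f"coverage {c:2d}x: {n:6d} reads, ~{expected_gaps(G, L, n):.1f} gaps")
```

The number of reads grows linearly with coverage while the expected number of gaps falls exponentially, which is why several-fold coverage closes most gaps but never guarantees that all are closed.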

  40. Double-barreled shotgun sequencing • Choose longer fragments (say, 2 × the read length $L_R$) • Read both ends • Such fragments probably span gaps • This gives an approximate size of the gap • This links contigs into scaffolds
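
How a pair of end reads yields an approximate gap size can be sketched as follows, assuming the fragment (insert) length is known roughly, one end read lies in contig A and its mate lies in the next contig B. The function name, the coordinate convention, and the numbers are hypothetical.

```python
from statistics import mean

def estimate_gap(insert_len, contig_a_len, read_start_in_a, mate_end_in_b):
    """Gap = insert length minus the parts of the insert lying inside A and B."""
    in_a = contig_a_len - read_start_in_a   # from the read's start to the end of A
    in_b = mate_end_in_b                    # from the start of B to the mate's end
    return insert_len - in_a - in_b

# One mate pair: a 2,000-base insert, 800 bases in A and 700 bases in B.
print(estimate_gap(2000, 5000, 4200, 700))

# Several pairs spanning the same gap can be averaged to smooth out
# the uncertainty in each insert's true length.
pairs = [(2000, 4200, 700), (2000, 4500, 950), (2000, 4000, 520)]
print(mean(estimate_gap(ins, 5000, s, e) for ins, s, e in pairs))
```

A pair whose two reads fall in different contigs both orders the contigs and bounds the distance between them, which is what turns contigs into a scaffold.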

  41. Genomic results

  42. HGSC v Celera results

  43. To do or not to do? • “The idea is gathering momentum. I shiver at the thought.” – David Baltimore, 1986 • “If there is anything worth doing twice, it’s the human genome.” – David Haussler, 2000

  44. Public or private? • “This information is so important that it cannot be proprietary.” – C Thomas Caskey, 1987 • “If a company behaves in what scientists believe is a socially responsible manner, they can’t make a profit.” – Robert Cook-Deegan, 1987

  45. HW for Feb 17 • Comment on these assertions (500-1000 words): • WLS – “Our analysis indicates that the Celera paper provides neither a meaningful test of the WGS approach nor an independent sequence of the human genome.” • Venter – “This conclusion is based on incorrect assumptions and flawed reasoning.”
