Biology 162: Computational Genetics Fall 2004

Biology 162: Computational Genetics Fall 2004. Todd Vision Assistant Professor Department of Biology, UNC Chapel Hill. Bioinformatics vs computational genetics. Bioinformatics : The application of computing technology to molecular biology

Presentation Transcript


  1. Biology 162: Computational Genetics, Fall 2004 • Todd Vision, Assistant Professor, Department of Biology, UNC Chapel Hill

  2. Bioinformatics vs computational genetics • Bioinformatics: The application of computing technology to molecular biology • Computational genetics: The interdisciplinary intersection of genetics, computer science and statistics

  3. Course emphasis • Data analysis in molecular genetics • We will not cover • Developments in IT hardware • Analysis of protein structure • Modeling of metabolic pathways, cells, tissues, organs, etc. (i.e. systems biology)

  4. Prerequisites • Bio 50: Molecular Biology and Genetics • Gene/protein structure and expression • Principles of inheritance • Comp Sci 14: Introduction to Programming • Algorithms and their design • Fundamental programming skills • Stat 31: Introduction to Statistics • Probability and distributions • Hypothesis testing and parameter estimation

  5. Related courses at UNC • Biology 170/Math 107, Mathematical and Computational Models in Biology (Tim Elston and Maria Servedio) • Summer courses in • Computer Science • Graduate courses in • Bioinformatics and Computational Biology • Biostatistics • School of Pharmacy

  6. Readings • Gibson and Muse, A Primer of Genome Science, Sinauer Associates. • Available in Student Bookstore • Primarily covers genomic technologies • Brief on computational/statistical aspects • Supplemental papers • Handed out in class or posted on Blackboard • Includes • More detail on computational/statistical aspects • Papers which you will review for class assignments

  7. https://blackboard.unc.edu

  8. Computer labs / Problem sets • Thursdays 3:30-4:30 in Wilson 132 • Assignments are due the following Tuesday • Purpose: • Familiarity with genomic databases and tools • Functional and evolutionary sequence analysis • Gene expression analysis • Mapping of genomes and complex traits • Comfort with command-line tools and computing • Exercise of scientific reasoning and biological judgement • No programming required (but learn Perl anyway!)

  9. Research paper • Critical review of the computational challenges involved in assembly of the human genome • Based on opposing articles from the main players in the drama • Paper will be judged on • Understanding of content • Critical and synthetic reasoning • Clarity of scientific writing

  10. Late policy • Assignments are due at the beginning of class on the due date • Late assignments receive half-credit • Exceptions can be made but require more than 24 hours' notice

  11. Group work • You are encouraged to work together on most assignments (some exceptions) • What you turn in should be your own • Show your work • Be able to defend your answers • Know and love the UNC Honor Code • http://honor.unc.edu

  12. Exams • Two midterms • Final exam will be cumulative • May include material from labs/problem sets, readings and lectures • Most questions will be similar to those on lab/problem sets • You will receive a study guide in advance

  13. Grading • 10 Labs/problem sets - 50% (5% each) • Review paper - 10% • Midterms - 20% (10% each) • Final exam - 20% • Final grades • No curve, point divisions at discretion of instructor • Different divisions for undergraduate/graduate students
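The weighting on the Grading slide can be checked with a few lines of Python. This is only a sketch: the weights come from the slide, while the function name `course_score` and the example percentages are made up for illustration. It does not model the late policy or the instructor's point divisions.

```python
# Weights taken from the Grading slide; everything else here is illustrative.
WEIGHTS = {
    "labs": 0.50,          # 10 labs/problem sets at 5% each
    "review_paper": 0.10,
    "midterms": 0.20,      # two midterms at 10% each
    "final_exam": 0.20,
}


def course_score(scores):
    """Return the weighted course score (0-100) from per-component percentages."""
    # Sanity check: the component weights should account for the whole grade.
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical student: component averages expressed as percentages.
example = {"labs": 90.0, "review_paper": 80.0, "midterms": 70.0, "final_exam": 85.0}
print(round(course_score(example), 2))
```

Because labs and problem sets carry half the grade, a 10-point change in the lab average moves the course score by 5 points, the same as a 25-point change on the final exam.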

  14. Computer lab server: Biolinux • All necessary analysis software is installed • Dell PowerEdge server • Red Hat Linux operating system • 2 Xeon processors • 2 GB RAM • 60 GB disk space • Requires an ONYEN for login • Uses AFS file space

  15. Connecting to Biolinux • biolinux.bio.unc.edu (IP 152.2.66.25) • Windows • Zip archive contains necessary connection software • Mac OS X • X11 for graphical sessions • Fugu for secure FTP • Linux/Solaris/etc. • Should work as is

  16. https://onyen.unc.edu

  17. http://cilantro.bio.unc.edu/biolinux

  18. Cretaceous Park? • In 1994, researchers reported a remarkably well-preserved Cretaceous dinosaur fossil. • DNA was extracted • Care was taken to prevent contamination • Specific regions were amplified • 20 different PCR primer pairs used, including 6 pairs from mitochondrial cytB • How would you design primers for dinosaur DNA? • All yielded products in mammals, birds and reptiles • Only one cytB pair yielded a product from the fossil • Negative controls did not reveal contamination

  19. Cretaceous Park? • One cytB fragment amplified • 9 sequences obtained from two bone samples • Variability was present within and between the two samples; no two sequences were identical • Consensus sequences used to search for homologs • GenBank (215,000 sequences) • BLAST • Measured percent identity • Closest matches were ~70% identical • Equidistant to mammals, birds, and reptiles
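The "percent identity" measure mentioned on this slide can be sketched in Python as follows. This is one common convention (matches divided by the number of aligned columns in which neither sequence has a gap), not necessarily the one used in the original study. The function name `percent_identity` and the toy sequences are illustrative only.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap.

    `a` and `b` are two rows of an existing alignment (equal length, '-' for gaps).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (not real cytB data): 8 of 10 columns match.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAA"))
```

Percent identity alone treats every mismatch equally, which is part of why the reanalyses described on the next slide turned to multiple alignment, protein scoring matrices and phylogenetic methods.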

  20. Cretaceous Park? • One would expect dinosaur DNA to be most similar to that of birds, and then crocodilians • Other authors reanalyzed the data • Multiple alignment • Protein sequence scoring matrix • Phylogenetic analysis • All concluded that the DNA was clearly mammalian, possibly human • One group showed that similar sequences could be amplified from human nuclear DNA

  21. Cretaceous Park? • Three possibilities • Preparation of human nuclear DNA could have been contaminated by dinosaur DNA • Dinosaurs and humans might have hybridized during the Cretaceous • Dinosaur extracts were contaminated by human DNA • Study revealed an interesting aspect of human molecular evolution, but not much about dinosaurs • Lesson learned: naïve computational analysis can lead to very misguided conclusions!

  22. Discussion question • You are given the sequence of a new gene and asked to determine its function. • How would you begin? • What ‘wet lab’ approaches are possible? • What ‘in silico’ approaches are possible? • What approaches might require both wet lab and in silico components?

  23. Biological topics • Sequence alignment and assembly • Sequence homology searching • Sequence evolution and phylogenetics • Finding genes and other features • Patterns of gene expression • Genetic mapping • Dissecting genetic diseases and quantitative traits

  24. Computational topics • Dynamic programming • Regular expressions and suffix trees • Markov chains • Hidden Markov models and machine learning • Techniques for clustering and classification • Maximum likelihood and Bayesian statistics • Graph traversal

  25. Some informatics tools • GenBank, UniProt, and major sequence repositories • InterPro and protein signature DBs • Gene Ontology • Model organism genome databases (SGD, FlyBase, Ensembl) • A sampling of software programs • Chosen primarily for pedagogical utility

  26. Genomics • Genetics on lots of genes? • Hypothesis-free science? • Some technologies • Enabled by • Robotics • Computers

  27. Genome database examples • Primary databases • GenBank/EMBL/DDBJ • Secondary databases • Pfam (protein domains) • Organism-specific • SGD (yeast genomics) • Specialized DBs • OMIM (human genetic disorders) • Annual database issue of Nucleic Acids Research: http://www3.oup.co.uk/nar/database/c/

  28. Growth of GenBank

  29. http://www.expasy.org/cgi-bin/show_thumbnails.pl?2

  30. First bacterial genome: 1995 • Haemophilus influenzae (TIGR) • 1.8 x 10^6 bp shotgun assembly • Required 9 months of computer time • Now there are hundreds • 160 Bacterial • 19 Archaeal • 32 Eukaryotic • Over a thousand projects ongoing • And a bacterial genome takes only days to sequence and assemble

  31. Tree of life

  32. More protein families await

  33. Other types of genomic data • Spatiotemporal gene expression • Alternative transcription • Genetic knockout/overexpression phenotypes • Genetic variability • Molecular polymorphism • Phenotypic variation / disease • Comparative data / molecular evolution • Protein • Structure, including modifications • Interactions with other molecules • Metabolic profiling, etc., etc.

  34. Algorithmic/statistical innovations • The most fundamental and heavily used application in the field is pairwise alignment • Smith-Waterman algorithm (1981) • Still too slow for general database search • BLAST (1990) • Made database search of 10^7-10^8 sequences feasible • Statistical ranking of each alignment • Statistical methods in molecular evolution <25 yrs old • Modern genetic mapping methods ~15 yrs old
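A minimal Python sketch of the Smith-Waterman local alignment score may help make the slide concrete. It assumes a simple match/mismatch scheme with a linear gap penalty. The parameter values and the function name `smith_waterman_score` are illustrative choices, not values from the course. It returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets a local alignment restart at any cell.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences: the shared substring "ACG" gives the best local score.
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The nested loops fill one table cell per pair of positions, so the running time grows with the product of the two sequence lengths. Repeating that against every sequence in a large database is what makes the exact algorithm slow for general search and motivates heuristics such as BLAST.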

  35. Things to review • Chemical differences among amino acids • Prokaryotic and eukaryotic gene structure • The central dogma • Anatomy of a typical protein

  36. Reading for Thursday • Gibson and Muse, Ch.1 Genome Projects, pgs. 1-58.
