Bioinformatics Stuart M. Brown, Ph.D. NYU School of Medicine

Explore the use of computers to collect, analyze, and interpret biological information at the molecular level. Discover the challenges and impact of bioinformatics in genomics and medicine.

Presentation Transcript


  1. Bioinformatics Stuart M. Brown, Ph.D. NYU School of Medicine

  2. What is Bioinformatics? • The use of computers to collect, analyze, and interpret biological information at the molecular level. "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." • A set of software tools for molecular sequence analysis

  3. Introduction • The Human Genome Project • Challenges of Molecular Biology computing • The changing role of the Biologist in the Age of Information • Bioinformatics software • Genomics • Impact on medicine

  4. I. The Human Genome Project The genome sequence is complete - almost! • approximately 3.2 billion base pairs.

  5. All the Genes • Any human gene can now be found in the genome by similarity searching with over 99% certainty. • However, the sequence still has many gaps • hard to find an uninterrupted genomic segment for any gene • still can’t identify pseudogenes with certainty • This will improve as more sequence data accumulates

  6. Raw Genome Data:

  7. The next step is obviously to locate all of the genes and describe their functions. This will probably take another 15-20 years!

  8. Celera says that there are only ~34,000 genes • so why are there ~60,000 human genes on Affymetrix GeneChips? • Why does GenBank have 49,000 human gene coding sequences and UniGene have 96,000 clusters of unique human ESTs? • Clearly we are in desperate need of a theoretical framework to go with all of this data

  9. Implications for Biomedicine • Physicians will use genetic information to diagnose and treat disease. • Virtually all medical conditions have a genetic component. • Faster drug development research • Individualized drugs • Gene therapy • All Biologists will use gene sequence information in their daily work

  10. II. Bioinformatics Challenges The huge dataset • Lots of new sequences being added - automated sequencers - Human Genome Project - EST sequencing • GenBank has over 16 billion bases and is doubling every year!! (problem of exponential growth...) • How can computers keep up?
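
The "doubling every year" claim can be made concrete with a few lines of arithmetic. This is a rough sketch, not a forecast: the starting size is the slide's "over 16 billion bases", and the function name `projected_bases` and the constant one-year doubling time are assumptions made for illustration.

```python
def projected_bases(current_bases: float, years: float,
                    doubling_time_years: float = 1.0) -> float:
    """Size of a database that doubles once every `doubling_time_years`."""
    return current_bases * 2 ** (years / doubling_time_years)


start = 16e9  # "over 16 billion bases", taken from the slide

# Print the projected size every two years for a decade.
for year in range(0, 11, 2):
    print(f"year {year:2d}: {projected_bases(start, year):.3g} bases")
```

Ten doublings multiply the starting size by 1024, which is why storage and search time both become a problem if the growth rate holds.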

  11. New Types of Biological Data • Microarrays - gene expression • Multi-level maps: genetic, physical, sequence, annotation • Networks of Protein-protein interactions • Cross-species relationships • Homologous genes • Chromosome organization

  12. Similarity Searching the Databanks • What is similar to my sequence? • Searching gets harder as the databases get bigger - and quality degrades • Tools: BLAST and FASTA = time saving heuristics (approximate) • Statistics + informed judgement of the biologist

  13. >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
      Length = 369

      Score = 272 bits (137), Expect = 4e-71
      Identities = 258/297 (86%), Gaps = 1/297 (0%)
      Strand = Plus / Plus

      Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
                 |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
      Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

      Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
                 |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
      Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

      Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
                  |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
      Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

      Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
                 ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
      Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

      Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
                 || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
      Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
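
The "Identities" and "Gaps" figures in a report like this are simple column counts over the aligned sequences. The sketch below recomputes them for the first 60 columns of the alignment on the slide. The function name `alignment_stats` is hypothetical and this is not how BLAST itself is implemented.

```python
def alignment_stats(query: str, subject: str) -> tuple[int, int, int]:
    """Return (identities, gaps, columns) for two equal-length aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    pairs = list(zip(query, subject))
    # A column is an identity when both rows carry the same base (not a gap).
    identities = sum(q == s and q != "-" for q, s in pairs)
    # A column is a gap when either row carries the gap character "-".
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(pairs)


# First 60 alignment columns from the slide (Query 17-76, Sbjct 1-59).
query = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
sbjct = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

identities, gaps, columns = alignment_stats(query, sbjct)
print(f"Identities = {identities}/{columns}, Gaps = {gaps}/{columns}")
```

Running the same count over all five blocks would give the totals reported in the header line of the hit.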

  14. Alignment • Alignment is the basis for finding similarity • Pairwise alignment = dynamic programming • Multiple alignment: protein families and functional domains • Multiple alignment is "impossible" for lots of sequences • Another heuristic - progressive pairwise alignment
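
As a minimal sketch of the dynamic programming idea behind pairwise alignment, the function below fills a Needleman-Wunsch score table for global alignment. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration. Real tools use substitution matrices and affine gap penalties, and they also trace back through the table to recover the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score (Needleman-Wunsch, linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is the best of: align two residues, or open a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # residue against residue
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGT", "ACT"))
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the cost grows with the product of the sequence lengths. Extending the same recurrence to many sequences at once is what makes exact multiple alignment impractical and motivates the progressive pairwise heuristic.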

  15. Sample Multiple Alignment

  16. Structure-Function Relationships • Can we predict the function of protein molecules from their sequence? sequence > structure > function • Conserved functional domains = motifs • Prediction of some simple 3-D structures (α-helix, β-sheet, membrane spanning, etc.)

  17. Protein domains (from ProDom database)

  18. DNA Sequencing • Automated sequencers > 40 kb per day • 500 bp reads must be assembled into complete genes - errors, especially insertions and deletions - error rate is highest at the ends where we want to overlap the reads - vector sequences must be removed from ends • Faster sequencing relies on better software • overlapping deletions vs. shotgun approaches: TIGR
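
The core step in assembling short reads is finding where the end of one read overlaps the start of another. The sketch below shows that step with exact matching only. The function names and the `min_overlap` default are assumptions, and a real assembler would also have to tolerate the insertion and deletion errors mentioned on the slide.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads at their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    return left + right[size:] if size else None


print(merge_reads("ACGTTGCA", "TGCAGGT"))
```

Requiring a minimum overlap guards against joining reads that share only a few bases by chance.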

  19. Finding Genes in Genome Sequence is Not Easy • About 2% of human DNA encodes functional genes. • Genes are interspersed among long stretches of non-coding DNA. • Repeats, pseudogenes, and introns confound matters

  20. Pattern Finding Tools • It is possible to use DNA sequence patterns to predict genes: • promoters • translational start and stop codons (ORFs) • intron splice sites • codon bias • Can also use similarity to known genes/ESTs
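
Of the patterns listed above, open reading frames are the simplest to scan for. The sketch below looks for ATG...stop stretches in the three forward reading frames. It is only an illustration: the function name and the `min_codons` threshold are assumptions, it ignores the reverse strand, and it knows nothing about introns, promoters, or codon bias.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. `min_codons` counts the
    start codon and every codon up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAACCCGGGTAACC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

In a genome with introns, an ORF scan like this finds candidate exons at best, which is why gene prediction combines several of the signals listed above.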

  21. Phylogenetics • Evolution = mutation of DNA (and protein) sequences • Can we define evolutionary relationships between organisms by comparing DNA sequences? • is there one molecular clock? • phenetic vs. cladistic approaches • lots of methods and software, what is the "correct" analysis?

  22. II. The Biologist in the Age of Information

  23. The Internet provides a wealth of biological information • can be overwhelming • e-mail • USENET • Web • Info skill = finding the information that you need efficiently

  24. Computing in the lab - everyday tasks (not computational biology) • ordering supplies • online reference books • lab notes • literature searching

  25. Training "computer savvy" scientists • Know the right tool for the job • Get the job done with tools available • Network connection is the lifeline of the scientist • Jobs change, computers change, projects change, scientists need to be adaptable

  26. The job of the biologist is changing • As more biological information becomes available … • The biologist will spend more time using computers • The biologist will spend more time on data analysis (and less doing lab biochemistry) • Biology will become a more quantitative science (think how the periodic table and atomic theory affected chemistry)

  27. III. Molecular Biology Software Tools

  28. GCG (Wisconsin Package) • The most popular and most comprehensive set of tools for the molecular biologist. - Runs on mainframe computers (UNIX) - Web, X-Windows (SeqLab) interfaces - Inexpensive for large numbers of users - Requires local databases (on the mainframe computer) - Allows for custom databases and programming

  29. The Web • Many of the best tools are free over the Web • BLAST • ENTREZ/PUBMED • Protein motifs databases • Bioinformatics “service providers” • DoubleTwist™, Celera, BioNavigator™ • Hodgepodge collection of other tools • PCR primer design • Pairwise and Multiple Alignment

  30. Personal Computer Programs • Macintosh and Windows applications - Commercial: Vector NTI™, MacVector™, OMIGA™, Sequencher™ - Freeware: Phylip, Fasta, Clustal, etc. • Better graphics, easier to use • Can't access very large databases or perform demanding calculations • Integration with web databases and computing services

  31. Putting it all together • The current state of the art requires the biologist to jump around from Web to mainframe to personal computer • The trend is for integration: • Web + personal computer will replace text interface to mainframe? • Will the Web become the ultimate interface for all computing?

  32. The Role of the RCR • Provide software (site licenses), computing hardware, and databases • Train scientists to use the software • Courses • Newsletter & e-mail updates • Seminars • One-on-one training • Technical support (on our software!) • Phone, e-mail, lab/office visits • Consulting • Recommendations, joint work, do it for you, custom software development

  33. IV. Genomics • The application of high-throughput automated technologies to molecular biology. • The experimental study of complete genomes.

  34. Genomics Technologies • Automated DNA sequencing • Automated annotation of sequences • DNA microarrays • gene expression (measure RNA levels) • single nucleotide polymorphisms (SNPs) • Protein chips (SELDI, etc.) • Protein-protein interactions

  35. cDNA spotted microarrays

  36. Affymetrix Gene Chips

  37. Microarray Data Analysis • Clustering and pattern detection • Data mining and visualization • Controls and normalization of results • Statistical validation • Linkage between gene expression data and gene sequence/function/metabolic pathways databases • Discovery of common sequences in co-regulated genes • Meta-studies using data from multiple experiments
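
To illustrate what "normalization of results" can mean in practice, the sketch below converts two-channel intensities to log2 ratios and centers them on the array median. This is one common and simple scheme, not necessarily the one used in any package named in these slides. The function name and the input values are made up for illustration.

```python
import math
from statistics import median


def median_centered_log_ratios(sample: list[float],
                               reference: list[float]) -> list[float]:
    """log2(sample / reference) per spot, shifted so the array median is 0."""
    ratios = [math.log2(s / r) for s, r in zip(sample, reference)]
    center = median(ratios)
    return [ratio - center for ratio in ratios]


# Hypothetical spot intensities for three genes on one array.
sample = [200.0, 400.0, 800.0]
reference = [100.0, 100.0, 100.0]
print(median_centered_log_ratios(sample, reference))
```

Centering on the median assumes that most genes on the array do not change between the two samples, so a global shift in the ratios reflects labeling or scanning differences rather than biology.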

  38. Pharmacogenomics • The use of DNA sequence information to measure and predict the reaction of individuals to drugs. • Personalized drugs • Faster clinical trials • Selected trial populations • Fewer drug side effects • Toxicogenomics

  39. Impact on Bioinformatics • Genomics produces high-throughput, high-quality data, and bioinformatics provides the analysis and interpretation of these massive data sets. • It is impossible to separate genomics laboratory technologies from the computational tools required for data analysis.

  40. Genomics Software @ the RCR • Affymetrix Gene Chip Analysis Suite • GeneSpring • Research Genetics Pathways (nylon filters) • TIGR Spotfinder, ScanAlyze, Cluster • Coming soon: a shared microarray database
