Explore the fundamental concepts of molecular biology, from cell structure to DNA transcription, in this comprehensive overview of living systems. Discover how bioinformatics uses computer calculations to unravel biological data, sequences, structures, and functions.
IBGP/BMI 730 Introduction to Bioinformatics Director: Prof. Victor Jin
Basic Molecular Biology • All living things are made of Cells • Prokaryote, Eukaryote • Cell Signaling • What is Inside the cell: From DNA, to RNA, to Proteins
Cells • Fundamental working units of every living system. • Every organism is composed of one of two radically different types of cells: prokaryotic cells or eukaryotic cells. • Prokaryotes and Eukaryotes are descended from the same primitive cell. • All extant prokaryotic and eukaryotic cells are the result of a total of 3.5 billion years of evolution.
Cell Structure • A cell is the smallest structural unit of an organism that is capable of independent functioning • All cells have some common features
Cell Cycle • Born, eat, replicate, and die
The Tree of Life • According to the most recent evidence, there are three main branches to the tree of life. • Prokaryotes include Archaea (“ancient ones”) and Bacteria. • Eukaryotes form the domain Eukarya, which includes plants, animals, fungi and certain algae.
Signaling Pathways: Control Gene Activity • Instead of having brains, cells make decisions through complex networks of chemical reactions, called pathways • Synthesize new materials • Break other materials down for spare parts • Signal to eat or die
Cells Information and Machinery • A cell stores all the information it needs to replicate itself • The human genome is around 3 billion base pairs long • Almost every cell in the human body contains the same set of genes • But not all genes are used or expressed by those cells • Machinery: • Collect and manufacture components • Carry out replication • Kick-start its new offspring
Terminology • Genome: an organism’s genetic material • Gene: a discrete unit of hereditary information located on the chromosomes and consisting of DNA • Genotype: the genetic makeup of an organism • Phenotype: the physically expressed traits of an organism • Nucleic acid: biological molecules (RNA and DNA) that allow organisms to reproduce • Amino acid: organic molecules that are the building blocks of proteins • Protein: a large, complex molecule that is an essential part of organisms, participates in every process within cells, and achieves a particular function.
Three critical molecules • DNAs • Hold information on how the cell works • RNAs • Act to transfer short pieces of information to different parts of the cell • Provide templates for protein synthesis • Proteins • Form enzymes that send signals to other cells and regulate gene activity • Form the body’s major components (e.g. hair, skin, etc.)
Overview of DNA to RNA to Protein • A gene is expressed in two steps • Transcription: RNA synthesis • Translation: Protein synthesis
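The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a full model of gene expression: it assumes the input is the coding strand written 5' to 3', that translation starts at the first base, and that the standard genetic code applies. The function names `transcribe` and `translate` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in UCAG order.
# '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_dna: str) -> str:
    """Transcription (simplified): the mRNA matches the coding strand with T -> U."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translation (simplified): read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTGGTAA")   # AUG GCC UGG UAA
protein = translate(mrna)           # Met, Ala, Trp, then stop
```

Real genes add complications this sketch ignores, such as introns, untranslated regions, and alternative start sites.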
DNA: the Genetic Makeup • Genes are inherited and are expressed • genotype (genetic makeup) • phenotype (physical expression) • The figure on the left shows the eye-color phenotypes of the green and black eye genes.
Central Dogmas of Molecular Biology 1) The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance). 2) The DNA of an organism encodes its genetic information. It is a double-stranded helix with a deoxyribose sugar-phosphate backbone carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). [Note that only 4 values (A, C, G, T) need to be encoded, which can be done using 2 bits. But to allow ambiguity codes (like N, which means any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
Central Dogmas of Molecular Biology 3) Each base on one strand of the double helix faces its complementary base on the other strand: A pairs with T, and G pairs with C. 4) Biochemical processes that copy the DNA (replication and transcription) always synthesize the new strand from its 5′ side towards its 3′ side, so sequences are read and written 5′ to 3′. 5) A gene can be located on either the plus strand or the minus strand. Rule 4 imposes the orientation of reading, and rule 3 (complementarity) tells us to complement each base. E.g. if the sequence on the + strand is ACGTGATCGATGCTA, the - strand must be read off by taking the complement of this sequence going backwards, i.e. TAGCATCGATCACGT
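Rules 3 to 5 combine into the reverse complement operation. A minimal Python sketch, assuming an input made only of the letters A, C, G and T (the function name is hypothetical):

```python
# Complement each base (rule 3), then reverse so the result reads 5' to 3' (rule 4).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand of seq, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

minus_strand = reverse_complement("ACGTGATCGATGCTA")
print(minus_strand)  # TAGCATCGATCACGT, the slide's example
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.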
Central Dogmas of Molecular Biology 6) DNA information is copied over to mRNA, which acts as a template to produce proteins. We often concentrate on protein-coding genes, because proteins are the building blocks of cells and make up the majority of bio-active molecules (but let's not forget the various RNA genes).
Bioinformatics Bioinformatics (computational biology) solves biological problems on the molecular level with the use of techniques including: • applied mathematics • statistics • computer science • artificial intelligence
Bioinformatics = Biological Data + Computer Calculations
Molecular Biology as an Information Science • Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype • Molecules: Sequence, Structure, Function • Processes: Mechanism, Specificity, Regulation • Central Paradigm for Bioinformatics: Genomic Sequence -> Transcript -> Protein Structure -> Protein Function • Large Amounts of Information • Data Management • Computer Algorithms • Statistical Methods
Major research efforts • Sequence alignment • Gene finding • Genome assembly • RNA structure prediction • Protein structure prediction • Analysis of gene regulation • Prediction of protein-protein interactions • Modeling of evolution
Major research areas • Sequence analysis • Genome annotation • Computational evolutionary biology • Measuring biodiversity • Analysis of gene expression • Analysis of regulation • Analysis of protein expression • Analysis of mutations in cancer • Analysis of epigenetics in cancer • High-throughput in vivo binding analysis • Prediction of protein structure • Comparative genomics • Modeling biological systems • High-throughput image analysis • Protein-protein docking • Software and tools • Databases • Web services in bioinformatics
Data types • DNA sequences • RNA sequences • Protein sequences • Gene Expression • cDNA, mRNA microarray data • Now tiling array technology • 50 M data points to tile the human genome at ~50 bp res. • Can only sequence genome once but can do an infinite variety of array experiments • Protein-DNA interactions • ChIP-chip, ChIP-seq, ChIP-PET and so on • Phenotype Experiments • KOs • Protein Interactions • Yeast hybrid • Proteomics
Other Integrative Data • Information to understand genomes • Metabolic Pathways • Regulatory Networks • Signaling Networks • Whole Organisms Phylogeny • The Literature (MEDLINE)
Exponential Growth of Data Matched by Development of Computer Technology • CPU vs Disk & Internet • Driving Force in Bioinformatics • [Figure: number of Internet hosts and number of protein domain structures over time]
Types of Relational databases • The Internet can be thought of as one enormous relational database. • The “links”/URLs are the primary keys. • SQL (Structured Query Language) • Sybase, Oracle, Access (database systems) • Sybase used at NCBI. • SRS (a database querying system used in biology)
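To make the ideas of a primary key and an SQL query concrete, here is a small sketch using Python's built-in `sqlite3` module rather than Sybase or Oracle. The table name, column names, and ID values are made up for illustration and do not come from any real database.

```python
import sqlite3

# An in-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# gene_id is the primary key: it identifies each row uniquely,
# much as a URL identifies a page.
conn.execute(
    "CREATE TABLE gene ("
    " gene_id INTEGER PRIMARY KEY,"
    " symbol TEXT NOT NULL,"
    " organism TEXT NOT NULL)"
)

# Illustrative rows; the IDs are arbitrary.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [
        (1, "TP53", "Homo sapiens"),
        (2, "BRCA1", "Homo sapiens"),
        (3, "CDC28", "Saccharomyces cerevisiae"),
    ],
)

# A parameterized SELECT: all gene symbols for one organism.
rows = conn.execute(
    "SELECT symbol FROM gene WHERE organism = ? ORDER BY symbol",
    ("Homo sapiens",),
).fetchall()
human_genes = [row[0] for row in rows]
print(human_genes)  # ['BRCA1', 'TP53']
```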
XML Database and vocabularies for life science • HTML: Hypertext Markup Language • XML: a general-purpose specification for creating custom markup languages. It is classified as an extensible language, because it allows the user to define the mark-up elements • BSML: an extensible language specification and container for bioinformatic data. BSML was developed under a 1997 grant from the National Human Genome Research Institute (NHGRI) as an evolving public domain standard for the bioinformatics community
Examples of XML • <?xml version="1.0" encoding="UTF-8"?> • <element_name attribute_name="attribute_value">Element Content</element_name> • <book>This is a book... </book>
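As a rough illustration of how such a document is consumed by a program, the generic element from the examples above can be parsed with Python's standard `xml.etree.ElementTree` module. Only the element shown on the slide is used; no BSML-specific vocabulary is assumed.

```python
import xml.etree.ElementTree as ET

# The declaration plus the generic element from the slide, as bytes.
doc = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<element_name attribute_name="attribute_value">Element Content</element_name>'
)

root = ET.fromstring(doc)

tag = root.tag                          # the user-defined element name
value = root.attrib["attribute_name"]   # the attribute value
text = root.text                        # the text between the tags

print(tag, value, text)
```

Because the markup elements are user-defined, the parser does not need to know the vocabulary in advance; it only requires the document to be well-formed.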
Primary Databases • A primary database is a repository of data derived from experiments or from research knowledge. • GenBank (nucleotide repository) • Protein databases: Swiss-Prot • PDB (MMDB) • PubMed (literature) • Genome mapping databases • KEGG database (pathways)
Secondary Databases • A secondary database contains information derived from other sources. • RefSeq (curated collection of GenBank at NCBI) • UniGene (clustering of ESTs at NCBI) • GeneID (unique ID for each gene at NCBI) • Organism-specific databases are often a mix between primary and secondary.
Biological Databases • Nucleotide databases: • GenBank: International Collaboration • NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia) • A “bank”: no curation. Submission to these databases is required for publication in a journal. • Organism-specific databases (Quick quiz: Find URLs using search engines) • FlyBase • ChickGBASE • pigbase • wormpep • YPD (Yeast Protein Database) • SGD (Saccharomyces Genome Database)
Protein Databases: • NCBI: more on this next week • Swiss-Prot (free for academic use, otherwise commercial; licensing restrictions on discoveries made using the DB; the 1998 version is free of any licensing) • http://www.expasy.ch (latest pay version) • NCBI has the latest free version. • Translated proteins from GenBank submissions • EMBL • TrEMBL is a computer-annotated supplement of SWISS-PROT that contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS-PROT • PIR
Structure databases: • PDB: Protein structure database. • Http://www.rscb.org/pdb/ • MMDB: NCBI’s version of PDB with entrez links. • Http://www.ncbi.nlm.nih.gov • Genome mapping information: • http://www.il-st-acad-sci.org/health/genebase.html • NCBI (Human) • Genome Centers: Stanford, Washington University, UC Berkeley • Research Centers and Universities
Literature databases: • NCBI: PubMed: All biomedical literature. • www.ncbi.nlm.nih.gov • Abstracts and links to publisher sites for • full text retrieval/ordering • journal browsing. • Publisher web sites. • Biomednet: Commercial site for literature search. • Pathways database: • KEGG: Kyoto Encyclopedia of Genes and Genomes: www.genome.ad.jp/kegg/kegg/html • Genome Search and Visualization database: • UCSC Genome Browser (genome.uscs.edu/)
Information techniques • Databases • Building, Querying • Complex data • Text String Comparison • Text Search • 1D Alignment • Significance Statistics • Alta Vista, grep • Finding Patterns • Machine Learning • Clustering • Data mining • Geometry • Robotics • Graphics (Surfaces, Volumes) • Comparison and 3D Matching (Vision, recognition) • Physical Simulation • Newtonian Mechanics • Electrostatics • Numerical Algorithms • Simulation
Bioinformatics as New Paradigm for Scientific Computing • Physics • Prediction based on physical principles • EX: Exact Determination of Rocket Trajectory • Emphasizes: Supercomputer, CPU • Biology • Classifying information and discovering unexpected relationships • EX: Gene Expression Network • Emphasizes: networks, “federated” database
Topics -- Genome Sequence • Finding Genes in Genomic DNA • introns • exons • promoters • Characterizing Repeats in Genomic DNA • Statistics • Patterns • Duplications in the Genome • Large scale genomic alignment • Whole-Genome Comparisons • Finding Structural RNAs
Topics -- Protein Sequence • Sequence Alignment • How to align two strings optimally via Dynamic Programming • Local vs Global Alignment • Suboptimal Alignment • Hashing to increase speed (BLAST, FASTA) • Amino acid substitution scoring matrices • Multiple Alignment and Consensus Patterns • How to align more than one sequence and then fuse the result in a consensus representation • HMMs, Profiles • Motifs • Scoring schemes and Matching statistics • How to tell if a given alignment or match is statistically significant (a P-value or an E-value)? • Score Distributions • Low Complexity Sequences • Evolutionary Issues • Rates of mutation and change
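Aligning two strings optimally via dynamic programming can be sketched as follows. This is a minimal global alignment scorer in the style of Needleman-Wunsch with a linear gap penalty. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, not a substitution matrix from the course, and the function name is hypothetical.

```python
def global_alignment_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))  # 1: three matches and one gap
```

A local alignment variant would additionally clamp each cell at zero and report the maximum over the whole table instead of the final cell.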
Topics -- Structures • Secondary Structure “Prediction” • via Propensities • Neural Networks, Genetic Alg. • Simple Statistics • TM-helix finding • Assessing Secondary Structure Prediction • Structure Prediction: Protein vs RNA • Tertiary Structure Prediction • Fold Recognition • Threading • Ab initio • Direct Function Prediction • Active site identification • Relation of Sequence Similarity to Structural Similarity
Topics -- Structures • Structure Comparison • Basic Protein Geometry and Least-Squares Fitting • Distances, Angles, Axes, Rotations • Calculating a helix axis in 3D via fitting a line • LSQ fit of 2 structures • Molecular Graphics • Calculation of Volume and Surface • How to represent a plane • How to represent a solid • How to calculate an area • Hinge prediction • Packing Measurement • Structural Alignment • Aligning sequences on the basis of 3D structure • DP does not converge, unlike sequences; what to do? • Other Approaches: Distance Matrices, Hashing • Fold Library • Docking and Drug Design as Surface Matching
Topics -- Function Genomics • Expression Analysis • Time Courses clustering • Measuring differences • Identifying Regulatory Regions • Large scale cross referencing of information • Function Classification and Orthologs • The Genomic vs. Single-molecule Perspective • Genome Comparisons • Ortholog Families, pathways • Large-scale censuses • Frequent Words Analysis • Genome Annotation • Identification of interacting proteins • Networks • Global structure and local motifs • Structural Genomics • Folds in Genomes, shared & common folds • Bulk Structure Prediction • Genome Trees
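Frequent words analysis is the simplest item on this list to sketch: count every overlapping word of length k (a k-mer) in a sequence. The function name and the toy sequence below are illustrative assumptions.

```python
from collections import Counter

def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping length-k words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ACGACGA", 3)
# Windows, left to right: ACG, CGA, GAC, ACG, CGA
for word, n in counts.most_common():
    print(word, n)
```

On a whole genome, the same counts can be compared with the frequencies expected by chance to flag over-represented or under-represented words.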
Bioinformatics tools • Sequence comparison (pairwise and multiple alignments, e.g. ClustalW, Blastz) • Phylogenetic reconstruction (e.g. Phylip, IQPNNI, SplitsTree) • Database search (e.g. BLAST, HMMer) • Comparative sequence assembly (e.g. OSLay) • Gene finding (e.g. genscan, FirstEF) • Motif discovery (e.g. MEME, Weeder) • Protein structure (e.g. CE)
Bioinformatics algorithms • Dynamic Programming • EM algorithms • Neural Networks • Hidden Markov Models • Support Vector Machine • Phylogenetic Trees • Clustering
Bioinformatics Topics? • (YES?) Digital Libraries • Automated Bibliographic Search and Textual Comparison • Knowledge bases for biological literature • (YES) Motif Discovery Using Gibbs Sampling • (YES) Metabolic Pathway Simulation • (YES) Gene identification by sequence inspection • Prediction of splice sites • (YES) Linkage Analysis • Linking specific genes to various traits • (YES) RNA structure prediction • Identification in sequences • (YES) Homology modeling