Lecture 1: Biological information database and data mining

Lecture 1: Biological information database and data mining • Biology as an information intensive science • Typical databases • Introduction to data mining • Data mining in biology

Biology as an information intensive science Organization of living systems: Ecosystems => Communities => Populations => Organisms => Organ systems => Organs => Tissues => Cells => Molecules. • Ecosystem: • All living things in a particular area (such as an island) and all non-living, physical components of the environment that affect living things (such as air, soil, water, sunlight). • Community: • All living things in an ecosystem (such as all animals, plants, bacteria, fungi, viruses, etc. in a rain forest). • Population: • A group of interbreeding individuals of one species (such as all flying squirrels in a rain forest). • Organism: • An individual living thing (such as one flying squirrel). • Organ system: • A group of related body components that perform a specific type of function (such as the central nervous system, CNS). • Organ: • A functional unit of an organ system (such as the brain).

Biology as an information intensive science • Fundamental Theory: • Evolution: • Simple molecules => Organic molecules => RNA-based life systems => Single cells => Multicellular organisms => Higher organisms • Molecular Basis of Life: • DNA (Genes) => RNAs => Proteins: • Structural organization • Chemical reactions, synthesis and destruction of molecules • Signal transduction • Transportation of molecules • Regulation

Biology as an information intensive science • Cell Organization and Function: • Structural organization • Chemical reactions, synthesis and destruction of molecules • Signal transduction • Transportation of molecules • Regulation

Biology as an information intensive science • Information (Molecular Level): • DNA: • 30,000 ~ 100,000 genes for human (many with unknown functions) • 3×10^9 base pairs for human DNA (< 10% coding region) • Protein: • 60,000 ~ 100,000 proteins for human. • Individual level: • sequence, 3D structure, molecular function. • Group level: • pathways, cellular location, collective function. • Classification: • Family: superfamily, family, subfamily (based on evolution and function) • Type: receptor, ion channel, enzyme, carrier, regulator, structure • Function: • Physiological function, diseases, therapeutics, toxicity, pharmacokinetics, agriculture, plant, environmentally relevant.

Typical Databases • Category: • General • Sequence • 3D structure • Protein function, proteomics, and pathways • Pharmainformatics • Medical informatics and disease information • Reference: • Nucleic Acids Res. 30, 1-12 (2002). • Internet links: • http://www.cz3.nus.edu.sg/~yzchen/database.html

Typical Databases • General: • The National Center for Biotechnology Information (NCBI). (http://www3.ncbi.nlm.nih.gov/) • Integrated ENTREZ retrieval software and databases for genetics, gene and protein sequences, 3D structures, and on-line PubMed library. CAM (Complementary and Alternative Medicine) on PubMed. • Pedro's BioMolecular Research Tools. A Collection of WWW Links to Information and Services Useful to Molecular Biologists. Other mirror sites in Germany, and Switzerland. • The CMS Molecular Biology Resource. This site is a compendium of electronic and Internet-accessible tools and resources for Molecular Biology, Biotechnology, Molecular Evolution, Biochemistry, and Biomolecular Modeling. Other mirror sites in Japan, Canada, France, Germany, Italy, and UK.

Typical Databases • Sequence: • GenBank DataBase (GenBank). (http://www.ncbi.nih.gov/Genbank/) • The GenBank database contains and distributes publicly available DNA sequences from more than 130,000 different organisms. It contains DNA sequences, their derived protein sequences, and annotations describing biological, structural, and other relevant features. It currently contains 27213748 loci and 33865022251 bases, from 27213748 reported sequences. • SWISS-PROT (http://us.expasy.org/sprot/) • Annotated protein sequence database. Information includes a description of the function of a protein, its domain structure, post-translational modifications, variants, etc. • Release 42.0 of Swiss-Prot (10-Oct-2003) contains 135850 sequence entries, comprising 50046799 amino acids abstracted from 109694 references.

Typical Databases • Sequence-related knowledge databases: • Online Mendelian Inheritance in Man (OMIM). (http://www3.ncbi.nlm.nih.gov/omim/) • Database that catalogs human genes and genetic disorders. Located at NCBI. It currently contains 14831 entries. • Pfam: Protein families database of alignments and HMMs. (http://www.sanger.ac.uk/Software/Pfam/) • A large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. In this way, proteins are grouped into domain-based families. It currently covers 6190 families.

Typical Databases Structure: Protein Data Bank (PDB). (http://www.rcsb.org/pdb/) 3D crystal and NMR structures of proteins, DNA, RNA and ligand-bound complexes. Official mirror site in Singapore, and other mirrors in China, Japan, Taiwan and several places in the USA: Boston, North Carolina. It currently contains 22874 structures. Nucleic Acids Database (NDB). 3D crystal structures of DNA and RNA. Mirror sites in the UK, Japan, and other sites in the USA: San Diego.

Typical Databases Structure-derived knowledge databases: SCOP. Structural classification of proteins. Mirror sites in Singapore, China, the U.S., and Japan. CATH. Protein Structure Classification. A hierarchical domain classification of protein structures in PDB. MODBASE. A database of Comparative Protein Structure Models. Models were generated by PSI-BLAST and MODELLER. As of Aug 2000, there are 3,379 reliable models for domains in 2,220 proteins, and 5,433 reliable fold assignments for domains in 3,083 proteins.

Typical Databases Function and pathways: GeneCards. A database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. PROSITE. Protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. Mirror sites in Australia, Canada, China, Taiwan. PRINTS. Protein fingerprint database. A fingerprint is a group of conserved motifs used to characterise a protein family. PROCAT. A database of 3D enzyme active site templates. It can be thought of as the 3D equivalent of the 1D templates found in sequence motif databases such as PROSITE and PRINTS. KEGG: Kyoto Encyclopedia of Genes and Genomes. Site contains Pathway Info, Disease Catalogs, Cell Catalogs, Molecule Catalog, and Genomic Info. It also provides links to pathway and other databases. SPAD: Signaling Pathway Database. An integrated database for genetic information and signal transduction systems. Divided into four categories based on the extracellular signal molecules (Growth factor, Cytokine, and Hormone) and stress that initiate the intracellular signaling pathway.

Typical Databases Pharmainformatics: TTD: Therapeutic Target Database. A database that provides information about known and newly proposed therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs/ligands directed at each of these targets. Links to relevant databases are also provided. MedChem/Biobyte QSAR Database. A collection of 10,000 QSAR datasets that covers both biological and physical-organic chemistry. The NCI Drug Information System 3D Database. A collection of 3D structures for over 400,000 drugs, built and maintained by the Developmental Therapeutics Program, Division of Cancer Treatment, National Cancer Institute. The database is an extension of the NCI Drug Information System. Drug Discovery Databases compiled by the Biophysical Pharmacology Group at NCI. Site has links to several therapeutics program databases and tools, and a 2D-gel protein expression database. Pharmaceutical Information Network. A comprehensive information database about drugs and diseases. U.S. Food and Drug Administration Center for Drug Evaluation and Research.

Introduction to Data Mining • Main Objective: • Pattern identification, Classification, Extraction of related data (character) set. • Tasks: • Generation of association rules. • Classification and clustering. • Pre-processing and post-processing of relevant dataset. • General Procedure: • Understanding of application domain. • Data source identification and data selection. • Pre-processing: feature selection, discretization, data cleaning. • Data mining: pattern extraction and model building. • Post-processing: identification of interesting/useful/novel patterns/rules. • Incorporation of patterns in real world tasks.

Introduction to Data Mining • Example: • Generation of association rules: • Record of customer purchases: • John: Jacket, Boots • Alfred: Milk, Cheese, Bread, Shoes • Green: Milk, Bread • Brown: Milk, Bread, Shoes, Greeting Cards, Pork • Eric: Cheese, Milk, Shoes, Beef • Bob: Jacket, Boots, Ski Pants • Form of association rules: • Item A => Item B [sup, conf] • sup = support = % of records containing both items A and B • conf = confidence = sup / (% of records containing item A), i.e. the fraction of records containing A that also contain B
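The support and confidence of a rule can be computed directly from the purchase records on this slide. The following is a minimal sketch, assuming the standard definition of confidence for A => B (records containing both A and B, divided by records containing A); the function names `count_records` and `rule_stats` are illustrative and not part of the lecture.

```python
# Purchase records copied from the slide; each basket is a set of items.
records = {
    "John": {"Jacket", "Boots"},
    "Alfred": {"Milk", "Cheese", "Bread", "Shoes"},
    "Green": {"Milk", "Bread"},
    "Brown": {"Milk", "Bread", "Shoes", "Greeting Cards", "Pork"},
    "Eric": {"Cheese", "Milk", "Shoes", "Beef"},
    "Bob": {"Jacket", "Boots", "Ski Pants"},
}


def count_records(items, records):
    """Number of records whose basket contains every item in `items`."""
    wanted = set(items)
    return sum(1 for basket in records.values() if wanted <= basket)


def rule_stats(a, b, records):
    """Return (support, confidence) of the rule a => b as fractions in [0, 1]."""
    n_both = count_records({a, b}, records)
    n_a = count_records({a}, records)
    support = n_both / len(records)
    # Assumed convention: a rule whose left-hand side never occurs gets confidence 0.
    confidence = n_both / n_a if n_a else 0.0
    return support, confidence


if __name__ == "__main__":
    for a, b in [("Milk", "Bread"), ("Bread", "Milk"), ("Jacket", "Boots")]:
        sup, conf = rule_stats(a, b, records)
        print(f"{a} => {b} [sup={sup:.0%}, conf={conf:.0%}]")
```

For these six records, Milk => Bread has support 3/6 and confidence 3/4, while Bread => Milk has the same support but confidence 3/3, which shows that confidence depends on the direction of the rule.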

Data Mining in Biology • Types of Tasks: • Search for a similar pattern in a subsection of each member of a dataset (e.g. protein sequence motifs). • Classification of datasets into groups (e.g. proteins into families). • Search for a dataset matching given characteristics (e.g. alignment of a protein sequence against all entries in a protein sequence database). • Extraction of particular information from literature (e.g. drugs that bind to a particular protein). • References: • Proc. Natl. Acad. Sci. USA 95, 10710-10715 (1998) • Structure 7, 1099-1112 (1999) • Bioinformatics 17, 721-728 (2001) • Bioinformatics 17, 155-161 (2001); 17, 359-363 (2001)
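The first task type above, finding a shared pattern such as a protein sequence motif, can be illustrated with a regular-expression scan. This is only a sketch: it uses the PROSITE N-glycosylation pattern N-{P}-[ST]-{P} translated by hand into a Python regex, and the three sequences are hypothetical toy strings, not real database entries. The names `MOTIF` and `find_motif` are illustrative.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro. Wrapping it in a lookahead lets
# finditer report overlapping occurrences as well.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequences):
    """Map each sequence name to a list of (1-based position, matched text)."""
    hits = {}
    for name, seq in sequences.items():
        hits[name] = [(m.start() + 1, m.group(1)) for m in MOTIF.finditer(seq)]
    return hits


# Hypothetical toy sequences, made up for illustration only.
toy_sequences = {
    "seqA": "MKNGSAALNPTQ",
    "seqB": "MNNTSGNASL",
    "seqC": "MKLVPAGH",
}

if __name__ == "__main__":
    for name, found in find_motif(toy_sequences).items():
        print(name, found)
```

A real motif search would run over a sequence database and usually relies on profiles or hidden Markov models rather than a single regular expression, but the structure is the same: one pattern, applied to a subsection of every member of the dataset.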

Homework • Write a very short report about a database assigned to you. • Can you give at least two more examples for each type of task in biological data mining? • Read the reference about typical biological databases and get a broad picture of the current status of publicly accessible bioinformatics databases. • Read at least one of the references about data mining in biology and be prepared to give a brief description of the paper.
