CSCE555 Bioinformatics. Lecture 21 Integrative Genomics Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: University of South Carolina Department of Computer Science and Engineering 2008 Outline.

CSCE555 Bioinformatics
  • Lecture 21 Integrative Genomics

Meeting: MW 4:00PM-5:15PM SWGN2A21

Instructor: Dr. Jianjun Hu

Course page:

University of South Carolina

Department of Computer Science and Engineering


Outline

  • What is Integrative Genomics
  • Why Integrative Genomics
  • The Data Sources
  • Integration Strategies
  • Issues in Integrative Genomics
  • Application Example: Disease Gene Prioritization





Information is not knowledge - Albert Einstein

Integrative Genomics - what is it?

Acquisition, Integration, Curation, and Analysis of biological data

Integrative Genomics: the study of complex interactions between genes, organism, and environment, the triple helix of biology. Gene <-> Organism <-> Environment

It is definitely beyond the buzzword stage - Universities now have programs named 'Integrated Genomics.'

Why Integrative Genomics? Support Complex Queries
  • Show me all genes involved in brain development that are expressed in the Central Nervous System.
  • Show me all genes involved in brain development in human and mouse that also show iron ion binding activity.
  • For this set of genes, what aspects of function and/or cellular localization do they share?
  • For this set of genes, what mutations are reported to cause pathological conditions?
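
Once annotations from several sources sit in one place, a query like the second one above reduces to set operations. The following is only a minimal sketch under that assumption: the `annotations` layout, the `genes_with` helper, and every gene label are hypothetical and do not come from any real database.

```python
# Toy annotations: (species, gene) -> set of annotation terms.
# All gene labels are hypothetical placeholders.
annotations = {
    ("human", "geneA"): {"brain development", "iron ion binding"},
    ("human", "geneB"): {"brain development"},
    ("mouse", "geneA"): {"brain development", "iron ion binding"},
    ("mouse", "geneC"): {"iron ion binding"},
}


def genes_with(terms, species):
    """Return genes of one species annotated with every term in `terms`."""
    wanted = set(terms)
    return {
        gene
        for (sp, gene), annots in annotations.items()
        if sp == species and wanted <= annots
    }


query = ["brain development", "iron ion binding"]
# Genes satisfying the query in both species. Genes are matched by shared
# label here; a real system would need an ortholog mapping instead.
hits = genes_with(query, "human") & genes_with(query, "mouse")
print(sorted(hits))
```

The hard part in practice is not the intersection itself but getting the terms from different sources to mean the same thing, which is where ontologies come in.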
Integrative genomics for Biomedicine
  • To correlate diseases with
    • anatomical parts affected,
    • the genes/proteins involved, and
    • the underlying physiological processes (interactions, pathways, processes).
  • To support personalized or “tailor-made” medicine.

How can multiple types of genome-scale data be integrated across experiments and phenotypes to find genes associated with diseases?


Two Separate Worlds…..

Disease World






  • Name
  • Synonyms
  • Related/Similar Diseases
  • Subtypes
  • Etiology
  • Predisposing Causes
  • Pathogenesis
  • Molecular Basis
  • Population Genetics
  • Clinical findings
  • System(s) involved
  • Lesions
  • Diagnosis
  • Prognosis
  • Treatment
  • Clinical Trials……






Medical Informatics

Bioinformatics & the “omes”



Disease Database

Patient Records


Clinical Synopsis

Clinical Trials

382 “omes” so far………

and there is “UNKNOME” too - genes with no function known

With Some Data Exchange…

Data Sources: The –Omics

Clinical data

Disease data


Bioinformatic Data: 1978 to present

  • DNA sequence
  • Gene expression
  • Protein expression
  • Protein Structure
  • Genome mapping
  • SNPs & Mutations
  • Metabolic networks
  • Regulatory networks
  • Trait mapping
  • Gene function analysis
  • Scientific literature
  • and others………..

Human Genome Project – Data Deluge

No. of Human Gene Records currently in NCBI: 29413 (excluding pseudogenes, mitochondrial genes and obsolete records).

Includes ~460 microRNAs

NCBI Human Genome Statistics – as of February 12, 2008


Information Deluge…..

A researcher would have to scan 130 different journals and read 27 papers per day to follow a single disease, such as breast cancer (Baasiri et al., 1999 Oncogene 18: 7958-7965).

  • 3 scientific journals in 1750
  • Now - >120,000 scientific journals!
  • >500,000 medical articles/year
  • >4,000,000 scientific articles/year
  • >16 million abstracts in PubMed derived from >32,500 journals

Methods for Integration

  • Link driven federations
    • Explicit links between databanks.
  • Warehousing
    • Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.
  • Integrative analysis
  • Others: Semantic Web, etc.

Link-driven Federations

  • Creates explicit links between databanks
  • Query: retrieve the results of interest, then follow web links to reach related data in other databanks
  • Examples: NCBI-Entrez, SRS
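
The idea can be sketched in a few lines: the databanks stay separate, and each record carries explicit cross-references that a query can follow. This is an illustration only, not how Entrez or SRS are implemented; the databank names, record IDs, fields, and the `follow_links` helper are all hypothetical.

```python
# Two toy databanks that stay separate; each record carries explicit
# links (cross-references) to records in the other databank.
# Databank names, IDs, and fields are hypothetical.
databanks = {
    "genes": {
        "G1": {"symbol": "geneA", "links": [("articles", "A1"), ("articles", "A2")]},
        "G2": {"symbol": "geneB", "links": [("articles", "A2")]},
    },
    "articles": {
        "A1": {"title": "Toy article one", "links": [("genes", "G1")]},
        "A2": {"title": "Toy article two", "links": [("genes", "G1"), ("genes", "G2")]},
    },
}


def follow_links(bank, record_id, target_bank):
    """Return the records in `target_bank` that one record links to."""
    record = databanks[bank][record_id]
    return [
        databanks[linked_bank][linked_id]
        for (linked_bank, linked_id) in record["links"]
        if linked_bank == target_bank
    ]


# Start from a gene record and hop to the articles it links to.
titles = [article["title"] for article in follow_links("genes", "G1", "articles")]
print(titles)
```

Nothing is copied between databanks here; the cost is that a query can only go where a link already exists.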

Data Warehousing

Data is downloaded, filtered, integrated and stored in a warehouse. Answers to queries are taken from the warehouse.

  • Advantages
    • Good for very specific, task-based queries and studies.
    • Since it is custom-built and usually expert-curated, it is relatively less error-prone.
  • Disadvantages
    • Can quickly become outdated; needs constant updates.
    • Limited functionality, e.g., restricted to one disease or one system.
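
The download, filter, integrate, and query steps can be sketched with an in-memory SQLite database from Python's standard library. This is a toy under stated assumptions: the two "sources", the table layout, the `genes_for` helper, and all values are hypothetical.

```python
import sqlite3

# Toy "downloads" from two sources; all values are hypothetical.
gene_source = [("geneA", "1p"), ("geneB", "2q"), ("geneC", None)]
disease_source = [
    ("geneA", "disease X"),
    ("geneB", "disease X"),
    ("geneB", "disease Y"),
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE gene (symbol TEXT PRIMARY KEY, cytoband TEXT)")
warehouse.execute("CREATE TABLE gene_disease (symbol TEXT, disease TEXT)")

# Filter step: drop gene records with a missing cytoband before loading.
warehouse.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [row for row in gene_source if row[1] is not None],
)
warehouse.executemany("INSERT INTO gene_disease VALUES (?, ?)", disease_source)


def genes_for(disease):
    """Answer a query from the warehouse only, never from the sources."""
    rows = warehouse.execute(
        "SELECT g.symbol, g.cytoband FROM gene g "
        "JOIN gene_disease d ON d.symbol = g.symbol "
        "WHERE d.disease = ? ORDER BY g.symbol",
        (disease,),
    )
    return rows.fetchall()


print(genes_for("disease X"))
```

The sketch also shows the main disadvantage: once the sources change, the warehouse is stale until the load step is run again.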
Integrative data analysis
  • Data is downloaded and filtered
  • Inference algorithms integrate the heterogeneous data
  • Evidence from a single data source is usually weak; integration strengthens the signal
  • Agreement across sources has a cross-validation effect that reduces false positives
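
One simple way to see how weak evidence adds up is to sum log likelihood ratios across sources, a naive-Bayes-style combination that assumes the sources are independent. This is one possible inference scheme chosen for illustration, not a method named in the lecture; the likelihood ratios, the prior odds, and the gene labels are made up.

```python
import math

# Likelihood ratios P(evidence | disease gene) / P(evidence | not a disease gene),
# one per data source. Values are made up; each alone is weak.
evidence = {
    "geneA": {"expression": 2.0, "interaction": 2.5, "pathway": 2.0},
    "geneB": {"expression": 3.0},
    "geneC": {"expression": 0.8, "interaction": 1.1},
}


def combined_log_odds(ratios, prior_odds=0.01):
    """Sum log likelihood ratios, assuming the sources are independent."""
    return math.log(prior_odds) + sum(math.log(r) for r in ratios.values())


# Rank genes by combined evidence, strongest first.
ranking = sorted(
    evidence, key=lambda gene: combined_log_odds(evidence[gene]), reverse=True
)
print(ranking)
```

Here geneA outranks geneB even though geneB has the single strongest signal, because three modest sources agree on geneA. The independence assumption is rarely true of real genomic data, so scores like this are better read as a ranking than as calibrated probabilities.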

Common Issues in Integrative Genomics

  • Heterogeneous Data Sets - Data Integration
    • From Genotype to Phenotype
    • Experimental and Consensus Views
  • Incorporation of Large Datasets
    • Whole genome annotation pipelines
    • Large scale mutagenesis/variation projects (dbSNP)
  • Computational vs. Literature-based Data Collection and Evaluation (MedLine)
  • Data Mining
    • extraction of new knowledge
    • testable hypotheses (Hypothesis Generation)

Gene World

Biomedical World

No Integrative Genomics is Complete without Ontologies
  • Gene Ontology (GO)
  • Unified Medical Language System (UMLS)

The 3 Gene Ontologies (Recap)

  • Molecular Function = elemental activity/task
    • the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity
    • What a product ‘does’, precise activity
  • Biological Process = biological goal or objective
    • broad biological goals, such as DNA repair or purine metabolism, that are accomplished by ordered assemblies of molecular functions
    • Biological objective, accomplished via one or more ordered assemblies of functions
  • Cellular Component = location or complex
    • subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
    • ‘is located in’ (‘is a subcomponent of’ )


What can researchers do with GO?

  • Access gene product functional information
  • Find how much of a proteome is involved in a process/ function/ component in the cell
  • Map GO terms and incorporate manual annotations into own databases
  • Provide a link between biological knowledge and
    • gene expression profiles
    • proteomics data
  • Getting the GO and GO_Association Files
  • Data Mining
    • My Favorite Gene
    • By GO
    • By Sequence
  • Analysis of Data
    • Clustering by function/process
  • Other Tools
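
As a sketch of working with a GO association file, the snippet below groups gene symbols by GO aspect and term, then computes what share of the annotated genes carries one process term. It assumes the tab-separated GAF column layout (symbol in column 3, GO ID in column 5, aspect code P/F/C in column 9, comment lines starting with `!`). The input rows are toy data with made-up gene products and only the first nine columns; the GO IDs serve only as grouping keys.

```python
from collections import defaultdict

ASPECTS = {
    "P": "biological_process",
    "F": "molecular_function",
    "C": "cellular_component",
}

# Toy lines in GAF column order (first nine columns only).
# Database "DB", products "X1"/"X2", and references are made up.
gaf_lines = [
    "!gaf-version: 2.2",
    "DB\tX1\tgeneA\t\tGO:0006281\tREF:1\tIDA\t\tP",
    "DB\tX1\tgeneA\t\tGO:0005634\tREF:1\tIDA\t\tC",
    "DB\tX2\tgeneB\t\tGO:0016887\tREF:2\tIDA\t\tF",
    "DB\tX2\tgeneB\t\tGO:0005634\tREF:2\tIDA\t\tC",
]


def genes_by_aspect(lines):
    """Map each GO aspect to {GO ID: set of gene symbols}."""
    index = {name: defaultdict(set) for name in ASPECTS.values()}
    for line in lines:
        if not line or line.startswith("!"):  # skip header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        symbol, go_id, aspect = cols[2], cols[4], cols[8]
        index[ASPECTS[aspect]][go_id].add(symbol)
    return index


index = genes_by_aspect(gaf_lines)
annotated = {
    symbol
    for terms in index.values()
    for genes in terms.values()
    for symbol in genes
}
# Share of annotated genes involved in one biological process term.
share = len(index["biological_process"]["GO:0006281"]) / len(annotated)
print(share)
```

The same index answers the "By GO" style of lookup: pick an aspect and a term, and read off the gene set.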
Unified Medical Language System (UMLS)
  • The UMLS Metathesaurus contains information about biomedical concepts and terms from many controlled vocabularies and classifications used in patient records, administrative health data, bibliographic and full-text databases, and expert systems.
  • The Semantic Network, through its semantic types, provides a consistent categorization of all concepts represented in the UMLS Metathesaurus. The links between the semantic types provide the structure for the Network and represent important relationships in the biomedical domain.
  • The SPECIALIST Lexicon is an English language lexicon with many biomedical terms, containing syntactic, morphological, and orthographic information for each term or word.

Example Study: Disease Gene Identification and Prioritization

Hypothesis: The majority of genes that impact or cause disease share membership in one or more functional relationships; put differently, functionally similar or related genes cause similar phenotypes.

  • Functional Similarity – Common/shared
    • Gene Ontology term
    • Pathway
    • Phenotype
    • Chromosomal location
    • Expression
    • Cis regulatory elements (Transcription factor binding sites)
    • miRNA regulators
    • Interactions
    • Other features…..
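
The hypothesis suggests a simple scoring scheme: a candidate ranks higher the more features it shares with known disease genes. The sketch below uses Jaccard overlap averaged over feature types. It is an illustrative stand-in, not the scoring method of any particular tool; the feature values, gene labels, and the `score` helper are hypothetical.

```python
# Feature sets per gene, keyed by feature type; all values are toy data.
training = {  # known disease genes
    "geneA": {"go": {"t1", "t2"}, "pathway": {"p1"}},
    "geneB": {"go": {"t2", "t3"}, "pathway": {"p1", "p2"}},
}
candidates = {
    "geneX": {"go": {"t2"}, "pathway": {"p1"}},
    "geneY": {"go": {"t9"}, "pathway": set()},
}


def jaccard(a, b):
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(candidate):
    """Average Jaccard overlap with the pooled training profile, per feature type."""
    feature_types = sorted({ft for gene in training.values() for ft in gene})
    per_type = []
    for ft in feature_types:
        pooled = set().union(*(gene.get(ft, set()) for gene in training.values()))
        per_type.append(jaccard(candidate.get(ft, set()), pooled))
    return sum(per_type) / len(per_type)


# Rank candidates by similarity to the training set, most similar first.
ranked = sorted(candidates, key=lambda gene: score(candidates[gene]), reverse=True)
print(ranked)
```

Averaging treats every feature type as equally informative; weighting the types differently is an obvious refinement.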

Which of these interactants are potential new candidates?


Known Disease Genes




Mining human interactome


Direct Interactants of Disease Genes

Indirect Interactants of Disease Genes

  • Prioritize candidate genes in the interacting partners of the disease-related genes
  • Training sets: disease related genes
  • Test sets: interacting partners of the training genes
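
Collecting direct and indirect interactants amounts to a breadth-first search outward from the known disease genes. A minimal sketch, assuming a toy undirected interaction network; the gene labels and the `interactants` helper are hypothetical.

```python
from collections import deque

# Toy undirected interaction network; gene labels are hypothetical.
interactions = [
    ("D1", "a"), ("D1", "b"), ("D2", "b"),
    ("b", "c"), ("a", "d"), ("d", "e"),
]
known_disease_genes = {"D1", "D2"}  # training set


def build_graph(edges):
    """Build an adjacency map from a list of undirected edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def interactants(graph, seeds, max_dist=2):
    """Multi-source BFS; return {gene: distance} for non-seed genes within max_dist."""
    dist = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if dist[node] == max_dist:
            continue  # do not expand beyond the distance limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return {gene: d for gene, d in dist.items() if gene not in seeds}


distances = interactants(build_graph(interactions), known_disease_genes)
direct = {gene for gene, d in distances.items() if d == 1}
indirect = {gene for gene, d in distances.items() if d == 2}
print(sorted(direct), sorted(indirect))
```

The union of `direct` and `indirect` is the test set of candidates to prioritize against the training genes.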

ToppGene – General Schema

ToppGene - Data Sources
  • Gene Ontology: GO and NCBI Entrez Gene
  • Mouse Phenotype: MGI (used for the first time for human disease gene prioritization)
  • Pathways: KEGG, BioCarta, BioCyc, Reactome, GenMAPP, MSigDB
  • Domains: UniProt (Pfam, InterPro, etc.)
  • Interactions: NCBI Entrez Gene (BioGRID, Reactome, BIND, HPRD, etc.)
  • PubMed IDs: NCBI Entrez Gene
  • Expression: GEO
  • Cytoband: MSigDB
  • Cis-Elements: MSigDB
  • miRNA Targets: MSigDB

New features added


Benefits of Integrative Genomics

  • To unravel the connection between genotype and phenotype: systematically identify novel phenotype–genotype relationships.
  • Hypothesis generator.
  • Paves the way for prognosis, diagnosis, and personalized medicine (adverse drug reactions, etc.).
  • Deeper understanding of disease and an enhanced integration of medicine with biology.
  • Increasing knowledge of the genes associated with diseases will allow researchers to address more complicated issues, including the relative contributions to disease of genes in the core biological set shared by all species and those encoding proteins specific to humans; how sequence features (such as conservation and polymorphism) relate to disease characteristics; and how protein function relates to the outcome of clinical treatment.
  • And MANY MORE……..
  • Networks and integration of databases are keys to success in Bioinformatics.
  • Integration of computation and data into a single cohesive whole will increase the efficiency of research effort
    • by reducing the serendipity and hit-and-miss nature of empirical research, and
    • by giving biomedical researchers valuable clues for choosing experiments under limited funds, manpower, and time.
  • Users have to know what resources are available, how to access and use them, and what their limitations are.

Algorithms in bioinformatics

  • string algorithms
  • dynamic programming
  • machine learning (NN, k-NN, SVM, GA, ...)
  • Markov chain models
  • hidden Markov models
  • Markov Chain Monte Carlo (MCMC) algorithms
  • stochastic context-free grammars
  • EM algorithms
  • Gibbs sampling
  • clustering
  • tree algorithms
  • text analysis
  • hybrid/combinatorial techniques and more