An introduction to bioinformatics high school version

An Introduction to Bioinformatics (high-school version). Ying Xu, Institute of Bioinformatics, and Biochemistry and Molecular Biology Department, University of Georgia




The Basics

[Figure labels: genome and sequencing; chromosome; genes; protein; metabolic pathway/network; cell]


Bioinformatics (or Computational Biology)

  • This interdisciplinary science … is about providing computational support to studies on linking the behavior of cells, organisms and populations to the information encoded in the genomes

    • Temple Smith


Information Encoded in Genomes

  • What information? And how to find and interpret it?

  • Working molecules (proteins, RNAs) in our cells

[Figure: bacterial cell]


Information Encoded in Genomes

  • How to find where protein-encoding genes are in a genome?

  • A genome is like a book written in “words” consisting of 4 letters (A, C, G, T), and each protein-encoding gene is like an instruction about how the protein is made

  • People have found that six-letter words (e.g., AAGTGC) occur at different frequencies in genes than in non-gene regions



Information Encoded in Genomes

Frequency in genes (AAA ATT) = 1.4%; Frequency in non-genes (AAA ATT) = 5.2%

Frequency in genes (AAA GAC) = 1.9%; Frequency in non-genes (AAA GAC) = 4.8%

Frequency in genes (AAA TAG) = 0.0%; Frequency in non-genes (AAA TAG) = 6.3%

….

AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT …..

Is this a gene or non-gene region if you have to make a bet?


Information Encoded in Genomes

  • Preference model:

    • for each 6-letter word X (e.g., AAA AAA), calculate its frequencies in gene and non-gene regions, FC(X) and FN(X), respectively

    • calculate X’s preference value P(X) = log (FC(X)/FN(X))

  • Properties:

    • P(X) is 0 if X has the same frequency in gene and non-gene regions

    • P(X) is positive if X has a higher frequency in gene than in non-gene regions; the larger the difference, the more positive the score

    • P(X) is negative if X has a higher frequency in non-gene than in gene regions; the larger the difference, the more negative the score

  • Gene prediction: given a DNA region, calculate the sum of P(X) values for all 6-letter words X in the region;

    • if the sum is larger than zero, predict “gene”

    • otherwise predict “non-gene”
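
The preference model can be tried out on the betting question from the earlier slide. The sketch below is only an illustration, not the tool used in practice. It uses the three frequencies quoted on that slide and makes several assumptions that the slides do not state: the region is read as non-overlapping 6-letter words, the logarithm is the natural log, a small floor (PSEUDO, a made-up value) replaces a frequency of 0.0% so that log(0) is never taken, and words with no listed frequency (such as AAACAC) are skipped.

```python
import math

# Frequencies (in percent) quoted on the slides: (in genes, in non-genes).
FREQS = {
    "AAAATT": (1.4, 5.2),
    "AAAGAC": (1.9, 4.8),
    "AAATAG": (0.0, 6.3),
}

PSEUDO = 0.01  # assumption: small floor so that log(0) is never taken


def preference(word):
    """P(X) = log(FC(X) / FN(X)), with a floor applied to both frequencies."""
    fc, fn = FREQS[word]
    return math.log(max(fc, PSEUDO) / max(fn, PSEUDO))


def score_region(seq, k=6):
    """Sum P(X) over non-overlapping k-letter words; unknown words are skipped."""
    seq = seq.upper()
    words = [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]
    return sum(preference(w) for w in words if w in FREQS)


def predict(seq):
    """Predict "gene" if the summed preference is above zero, else "non-gene"."""
    return "gene" if score_region(seq) > 0 else "non-gene"


region = "AAAATTAAAATTAAAGACAAAATTAAAGACAAACACAAAATTAAATAGAAATAGAAAATT"
print(predict(region))
```

Every listed word in this region is more frequent in non-genes than in genes, so each term of the sum is negative and the bet is “non-gene”.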


Information Encoded in Genomes

  • You just learned your first bioinformatics method for gene prediction. Congratulations!


Information Encoded in Genomes

  • OK, we have now learned how to find genes encoded in a genome

  • How do we find out what they do (their biological functions, e.g., sensors, transporters, regulators, enzymes)?


Information Encoded in Genomes

  • People have observed that similar protein sequences tend to have similar functions

  • Over the years, many genes have been thoroughly studied in different organisms, e.g., human, mouse, fly, …, rice, …

    • their biological functions have been identified and documented

  • For a new protein, scientists can often predict its function by identifying well-studied proteins in other organisms that have high sequence similarity to it

    • This works for ~60% of genes in a newly sequenced genome
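
The idea of borrowing a function from a similar, well-studied protein can be sketched in a few lines. Everything here is an assumption made for illustration: the three sequences and their labels in KNOWN_PROTEINS are made up, the 0.7 threshold is arbitrary, and Python's general-purpose difflib.SequenceMatcher stands in for the sequence-alignment tools that real annotation pipelines rely on.

```python
from difflib import SequenceMatcher

# Hypothetical mini-database: protein sequence -> documented function.
KNOWN_PROTEINS = {
    "MKTAYIAKQRQISFVKSHFSRQ": "transporter",
    "MSDNELLKRAGVTLEEIHNGGW": "regulator",
    "MGSSHHHHHHSSGLVPRGSHMA": "enzyme",
}


def similarity(a, b):
    """Crude similarity in [0, 1]; real tools use sequence alignment instead."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def predict_function(query, threshold=0.7):
    """Return (function, score) for the best match, or (None, score) if too weak."""
    best_seq = max(KNOWN_PROTEINS, key=lambda s: similarity(query, s))
    score = similarity(query, best_seq)
    if score >= threshold:
        return KNOWN_PROTEINS[best_seq], score
    return None, score


# A query that differs from the first database entry by one letter.
print(predict_function("MKTAYIAKQRQISFVKSHFSRA"))
```

Returning None below the threshold mirrors the point above: a similarity-based guess is only available when a close, well-studied relative exists.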


Information Encoded in Genomes

  • Scientists have developed computational techniques for

    • identifying regulatory signals that control gene transcription

    • predicting protein-protein interactions

    • elucidating biological networks for a particular function

    • … and extracting many other kinds of information


Information Encoded in Genomes

E. coli O157 and O111 are pathogenic to humans, while E. coli K-12 is not.

Can we tell why? Which genes or pathways in E. coli O157 and O111 are responsible for the pathogenicity?


Information Encoded in Genomes

[Figure panels: human chromosome #1; B. pseudomallei; E. coli K-12; E. coli O157; random sequence; P. furiosus]


Information Encoded in Genomes

Red: prokaryotes

Blue: eukaryotes

Green: plastids

Orange: plasmids

Black: mitochondria

x-axis: average of variations of the k-mer frequencies

y-axis: average barcode similarity among fragments of a genome
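
The slides do not define how a barcode is computed. One plausible reading, sketched below as an assumption rather than the author's method, is a table with one row per genome fragment and one column per k-mer, where each entry is the frequency of that k-mer in that fragment. The fragment length of 1000 and k = 4 are arbitrary choices made for the example.

```python
from itertools import product


def kmer_frequencies(fragment, k=4):
    """Frequency of every k-mer over A, C, G, T within one fragment."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    fragment = fragment.upper()
    total = 0
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if word in counts:  # skips words containing ambiguous letters such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def barcode(genome, fragment_len=1000, k=4):
    """One row of k-mer frequencies per non-overlapping fragment of the genome."""
    return [
        kmer_frequencies(genome[i:i + fragment_len], k)
        for i in range(0, len(genome) - fragment_len + 1, fragment_len)
    ]


rows = barcode("ACGT" * 500)
print(len(rows), len(rows[0]))
```

Comparing rows within one genome, and rows across genomes, would then give quantities in the spirit of the two axes described above.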


Information Encoded in Genomes

  • Yes, biologists can derive a lot of information from genomes now

  • … but we are still far from fully understanding any genome, even those of the simplest living organisms, bacteria

  • We can clearly use new ideas from bright young minds. Interested in doing bioinformatics?


Linking Genome Information to Biological Systems Behaviors

[Figure labels: gene; protein]

  • To fully understand cellular behaviors, we need to

    • elucidate information encoded in the genome, and

    • understand how the working molecules encoded by the genome behave according to the physical laws on Earth!


Key Drivers of Bioinformatics

  • The Human Genome Project has fundamentally changed biological science

  • A key consequence of the genome project is that scientists learned they can produce biological data on a massive scale

    • genome sequences

    • microarray data for gene expression levels

    • yeast two-hybrid data for protein-protein interactions

    • …… and other “high-throughput” biological data

These data reflect the cellular states, molecular structures and functions, in complex ways


Key Drivers of Bioinformatics

  • … and let bioinformaticians (help to) decipher the meaning of these data, as with genome sequences

  • Together, high-throughput probing technologies and bioinformatics are transforming biological science into a new science more like physics


Key Drivers of Bioinformatics

  • Like physics, where general rules and laws are taught at the start, biology will surely be presented to future generations of students as a set of basic systems ....... duplicated and adapted to a very wide range of cellular and organismic functions, following basic evolutionary principles constrained by Earth’s geological history.

    • Temple Smith, Current Topics in Computational Molecular Biology


Biomarker Identification

  • Our goal is to identify markers in blood that can tell if a person has a particular form of cancer

  • …… in a similar fashion to doing a pregnancy test using a test kit, possibly at home


Biomarker Identification

  • Microarray gene expression data allow comparative analyses of gene expression patterns in cancer versus normal tissues

  • Finding genes showing maximum difference in their expression levels between cancer and normal tissues

[Figure panels: on cancer tissues; on normal tissues]
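
Finding the genes with the largest expression difference can be sketched as a ranking by fold change. The gene names and expression values below are made up for illustration, and ranking by the absolute log2 ratio of mean expression is one simple choice assumed here; the slides do not say which measure was used, and real analyses would also apply statistical tests across many samples.

```python
import math
from statistics import mean

# Hypothetical expression levels (arbitrary units) for three made-up genes.
CANCER = {"geneA": [820, 790, 860], "geneB": [100, 110, 95], "geneC": [40, 35, 50]}
NORMAL = {"geneA": [105, 98, 110], "geneB": [102, 108, 99], "geneC": [300, 280, 330]}


def log2_fold_change(gene):
    """log2 of mean cancer expression over mean normal expression."""
    return math.log2(mean(CANCER[gene]) / mean(NORMAL[gene]))


def rank_genes():
    """Genes sorted by absolute log2 fold change, largest difference first."""
    return sorted(CANCER, key=lambda g: abs(log2_fold_change(g)), reverse=True)


for gene in rank_genes():
    print(gene, round(log2_fold_change(gene), 2))
```

A positive value means the gene is more highly expressed in the cancer samples, and a negative value means it is more highly expressed in the normal samples.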


Biomarker Identification

proteins A, …, Z highly expressed in cancer


Biomarker Identification

  • Question: Can we predict which of these tissue marker proteins can get secreted into blood circulation so we can get markers in blood?

  • Through a literature search, we found over … proteins that are secreted into blood circulation under various physiological conditions

  • We then trained a “classifier” to identify “features” that distinguish between proteins that can be secreted into blood and proteins that cannot


Biomarker Identification

  • We have developed a classifier to distinguish blood-secretory proteins from other proteins

  • On a test set with 52 positive and 3,629 negative examples, our classifier achieves

    • 89.6% sensitivity, 98.5% specificity and 94% AUC
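
For readers new to these measures, sensitivity is the fraction of truly blood-secretory proteins that the classifier finds, and specificity is the fraction of other proteins that it correctly rejects. The short sketch below shows the arithmetic; the counts in it are hypothetical and are not the results reported above.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that the classifier recovers."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that the classifier rejects."""
    return tn / (tn + fp)


# Hypothetical confusion-matrix counts, not the results reported above.
tp, fn, tn, fp = 45, 5, 980, 20
print(f"sensitivity = {sensitivity(tp, fn):.1%}")
print(f"specificity = {specificity(tn, fp):.1%}")
```

With far more negatives than positives, as in the test set described above, both numbers are needed: a classifier that rejects everything would score perfect specificity and zero sensitivity.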


Biomarker Identification

  • The predicted marker proteins can be validated using mass spectrometry experiments


Biomarker Identification

  • If successful, it will be possible to test for cancer using a test kit, much like a pregnancy test kit


Take-Home Message

  • Biological science is under rapid transformation because of high-throughput measurement technologies and bioinformatics

  • As an emerging field, bioinformatics is about using computational techniques to solve biological problems, and represents the future of biology


THANK YOU!

