
Applications of Data Mining and Machine Learning in Bioinformatics




  1. Applications of Data Mining and Machine Learning in Bioinformatics • Yen-Jen Oyang, Dept. of Computer Science and Information Engineering

  2. Basics of Protein Structures • A typical protein consists of hundreds to thousands of amino acids. • There are 20 basic amino acids, each of which is denoted by a one-letter code.
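Because every residue maps to one letter, a protein can be handled as a plain string over a 20-letter alphabet. The following is a minimal sketch of that representation; the function name `invalid_residues` and the test strings are illustrative assumptions, not part of the slides.

```python
# The 20 standard amino acids, written in their one-letter codes.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence):
    """Return the sorted characters in `sequence` that are not standard one-letter codes."""
    return sorted(set(sequence.upper()) - AMINO_ACIDS)


# Illustrative inputs (assumed for this sketch).
print(len(AMINO_ACIDS))                # size of the alphabet
print(invalid_residues("MKAIGAQNLL"))  # uses only standard codes
print(invalid_residues("MAXB"))        # X and B are not among the 20 basic codes
```

A check like this is a common first step before passing sequences to any similarity or classification tool.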

  3. The 20 Amino Acids - 1 Source: http://prowl.rockefeller.edu/aainfo/struct.htm

  4. The 20 Amino Acids - 2 Source: http://prowl.rockefeller.edu/aainfo/struct.htm

  5. The 20 Amino Acids - 3 Source: http://prowl.rockefeller.edu/aainfo/struct.htm

  6. Three-dimensional Structure of Myoglobin Source: Lectures of BioInfo by yukijuan

  7. Prediction of Protein Functions • Given a protein sequence, biochemists are interested in its functions and its tertiary structure.

  8. Protein Classification Based on the Homology Model • Modern protein databases are growing rapidly. • To expedite the identification of protein functions, it is desirable to classify the protein of interest before biochemistry experiments are conducted.

  9. One widely used approach to classifying proteins is based on the homology model, i.e., classifying proteins by the similarity of their amino acid sequences. • BLAST and FASTA are two of the most widely used software utilities for computing the similarity between two sequences. • We can cluster the proteins in an existing protein database in advance, as the next slide exemplifies.

  10. An Example of Similar Protein Sequences 3BP2_HUMAN MAAEEMHWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGTQLQLLKWPLRFVIIHKRCVYYFKSSTSASPQGAFSLSGYNRVMRAAEETTSNNVFPFKIIHISKKHRTWFFSASSEEERKSWMALLRREIGHFHEKKDLPLDTSDSSSDTDSFYGAVERPVDISLSPYPTDNEDYEHDDEDDSYLEPDSPEPGRLEDALMHPPAYPPPPVPTPRKPAFSDMPRAHSFTSKGPGPLLPPPPPKHGLPDVGLAAEDSKRDPLCPRRAEPCPRVPATPRRMSDPPLSTMPTAPGLRKPPCFRESASPSPEPWTPGHGACSTSSAAIMATATSRNCDKLKSFHLSPRGPPTSEPPPVPANKPKFLKIAEEDPPREAAMPGLFVPPVAPRPPALKLPVPEAMARPAVLPRPEKPQLPHLQRSPPDGQSFRSFSFEKPRQPSQADTGGDDSDEDYEKVPLPNSVFVNTTESCEVERLFKATSPRGEPQDGLYCIRNSSTKSGKVLVVWDETSNKVRNYRIFEKDSKFYLEGEVLFVSVGSMVEHYHTHVLPSHQSLLLRHPYGYTGPR 3BP2_MOUSE MAAEEMQWPVPMKAIGAQNLLTMPGGVAKAGYLHKKGGTQLQLLKWPLRFVIIHKRCIYYFKSSTSASPQGAFSLSGYNRVMRAAEETTSNNVFPFKIIHISKKHRTWFFSASSEDERKSWMAFVRREIGHFHEKKELPLDTSDSSSDTDSFYGAVERPIDISLSSYPMDNEDYEHEDEDDSYLEPDSPGPMKLEDALTYPPAYPPPPVPVPRKPAFSDLPRAHSFTSKSPSPLLPPPPPKRGLPDTGSAPEDAKDALGLRRVEPGLRVPATPRRMSDPPMSNVPTVPNLRKHPCFRDSVNPGLEPWTPGHGTSSVSSSTTMAVATSRNCDKLKSFHLSSRGPPTSEPPPVPANKPKFLKIAEEPSPREAAKFAPVPPVAPRPPVQKMPMPEATVRPAVLPRPENTPLPHLQRSPPDGQSFRGFSFEKARQPSQADTGEEDSDEDYEKVPLPNSVFVNTTESCEVERLFKATDPRGEPQDGLYCIRNSSTKSGKVLVVWDESSNKVRNYRIFEKDSKFYLEGEVLFASVGSMVEHYHTHVLPSHQSLLLRHPYGYAGPR
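To make "similarity between two sequences" concrete, here is a minimal sketch of a Needleman-Wunsch global alignment followed by a percent-identity calculation. This is not how BLAST or FASTA work internally (both use faster heuristics and substitution matrices); the scoring values `match=1`, `mismatch=-1`, `gap=-2` and the function names are assumptions chosen for illustration. The two fragments are the first 30 residues of the 3BP2_HUMAN and 3BP2_MOUSE sequences shown above.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


human_fragment = "MAAEEMHWPVPMKAIGAQNLLTMPGGVAKA"  # first 30 residues of 3BP2_HUMAN
mouse_fragment = "MAAEEMQWPVPMKAIGAQNLLTMPGGVAKA"  # first 30 residues of 3BP2_MOUSE
score, top, bottom = global_align(human_fragment, mouse_fragment)
print(top)
print(bottom)
print(score, round(percent_identity(top, bottom), 1))
```

Percent identity over the alignment length is only one of several conventions; real tools also report statistical significance, which this sketch omits.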

  11. When a protein of unknown function is submitted, the classification software identifies the protein clusters that contain the most similar proteins. • Biochemists can then predict the functions of the protein from the output of the classification software. • Clustering the proteins in advance expedites the search.
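The search described above can be sketched as a two-stage lookup: compare the query with one representative per cluster, then rank only the members of the best cluster. Everything specific here is an assumption made for illustration: the k-mer Jaccard similarity (a cheap stand-in for a BLAST or FASTA score), the cluster layout, and the toy sequences, which are invented and are not real proteins.

```python
def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def classify(query, clusters, k=3):
    """Return (cluster_name, most_similar_member_id, similarity)."""
    # Stage 1: one comparison per cluster, against its representative only.
    best_cluster = max(
        clusters, key=lambda name: jaccard(query, clusters[name]["representative"], k))
    # Stage 2: full comparison restricted to the members of that cluster.
    members = clusters[best_cluster]["members"]
    best_member = max(members, key=lambda pid: jaccard(query, members[pid], k))
    return best_cluster, best_member, jaccard(query, members[best_member], k)


# Hypothetical pre-computed clusters with invented sequences.
clusters = {
    "cluster_A": {
        "representative": "MKVLAAGIVGLLLAQ",
        "members": {"A1": "MKVLAAGIVGLLLAQ", "A2": "MKVLSAGIVGLLLAQ"},
    },
    "cluster_B": {
        "representative": "GSHMTEYKLVVVGAG",
        "members": {"B1": "GSHMTEYKLVVVGAG", "B2": "GSHMTEYKLVVLGAG"},
    },
}

print(classify("MKVLAAGIVGLLMAQ", clusters))
```

With C clusters of average size S, the query is compared with roughly C + S sequences instead of C × S, which is the speed-up that clustering in advance provides.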

  12. Applications of Data Classification in Microarray Data Analysis • In microarray data analysis, data classification is employed to predict the class of a new sample from existing samples whose classes are known.

  13. For example, the Leukemia data set contains 72 samples and 7129 genes. • 25 Acute Myeloid Leukemia (AML) samples. • 38 B-cell Acute Lymphoblastic Leukemia samples. • 9 T-cell Acute Lymphoblastic Leukemia samples.
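As a rough illustration of classifying a new sample from labeled ones, the sketch below uses a nearest-centroid rule: average the expression profiles of each class, then assign a new sample to the class with the closest average. The slides do not name a specific classifier, so the method is an assumption, and the expression values are invented (3 genes instead of 7129). Only the class counts come from the slide.

```python
import math

# Class sizes as listed on the slide; they account for all 72 samples.
class_counts = {"AML": 25, "B-ALL": 38, "T-ALL": 9}
assert sum(class_counts.values()) == 72


def centroid(rows):
    """Per-gene mean over a list of expression profiles."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def train(samples, labels):
    """Group the training profiles by class and compute one centroid per class."""
    by_class = {}
    for profile, label in zip(samples, labels):
        by_class.setdefault(label, []).append(profile)
    return {label: centroid(rows) for label, rows in by_class.items()}


def predict(centroids, profile):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Invented toy data: each row is one sample, each column one gene.
samples = [
    [9.0, 1.0, 1.2], [8.5, 1.4, 0.9],   # AML-like
    [1.1, 8.8, 1.0], [0.8, 9.3, 1.3],   # B-ALL-like
    [1.0, 1.2, 9.1], [1.3, 0.7, 8.7],   # T-ALL-like
]
labels = ["AML", "AML", "B-ALL", "B-ALL", "T-ALL", "T-ALL"]

centroids = train(samples, labels)
print(predict(centroids, [8.0, 2.0, 1.0]))
```

With thousands of genes and only dozens of samples, a practical pipeline would usually select informative genes first and estimate accuracy by cross-validation; both steps are left out here.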

  14. Model of Microarray Data Sets

  15. Applications of Data Clustering in Microarray Data Analysis • Data clustering has been employed in microarray data analysis for: • identifying genes with similar expression patterns; • identifying subtypes of samples.
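One way to group genes with similar expression is average-linkage agglomerative clustering on a correlation-based distance, sketched below. The slides do not specify a clustering algorithm, so this choice is an assumption, and the gene names and expression values are invented for illustration.

```python
import math


def pearson_distance(u, v):
    """1 - Pearson correlation: near 0 for profiles that rise and fall together."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # a flat profile is treated as uncorrelated with everything
    return 1.0 - cov / (norm_u * norm_v)


def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        distances = [pearson_distance(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(distances) / len(distances)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


# Invented expression profiles: one list of values per gene across four samples.
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.5],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [8.0, 6.5, 4.0, 2.0],
}
print(agglomerate(profiles, 2))
```

The same function can be applied to the transposed data (one profile per sample across genes) to look for sample subtypes.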
