Applying NLP models to the Biological Domain

Presentation Transcript


  1. Applying NLP models to the Biological Domain Eugen Buehler Lyle Ungar CLUNCH

  2. Overview • “Languages” of Computers and Biology • Probability Models for NL and Biology • Maximum Entropy • Basic ME amino acid model • The “Whole Protein Model” • Results in a gene prediction model

  3. Bits and Bytes: The Alphabet of Computers • Computer electronics are complicated: RAM, processor, etc. • It all comes down to bits (1s and 0s). • Bits can be organized into bytes (8 bits each). • Bytes can represent, among other things, letters (ASCII), which can form sentences.
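
The bits-to-bytes-to-letters chain on this slide can be sketched in a few lines of Python. The function names `text_to_bits` and `bits_to_text` are illustrative and not from the talk; the sketch assumes plain 8-bit ASCII with no framing or error handling beyond a length check.

```python
def text_to_bits(text: str) -> str:
    """Encode ASCII text as a string of '0'/'1' characters, 8 bits per letter."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_text(bits: str) -> str:
    """Decode a string of '0'/'1' characters, read as consecutive 8-bit ASCII bytes."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    # Slice the bit string into bytes and convert each from base 2.
    byte_values = [int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)]
    return bytes(byte_values).decode("ascii")


if __name__ == "__main__":
    encoded = text_to_bits("Mary")
    print(encoded)
    print(bits_to_text(encoded))
```

Finding the "words" is easy only because the byte boundaries and the encoding are known in advance.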

  5. DNA: Biology’s Alphabet • Biology is complicated. • It comes down to nucleotides (A, C, G, T). • Nucleotides can be grouped into codons (3 nucleotides each). • Codons represent amino acids; amino acids make proteins/genes.
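
As a rough illustration of the nucleotide-to-codon-to-amino-acid chain, the sketch below translates DNA with the standard genetic code. The name `translate` and the choice to read from the first base in a single frame are assumptions made for the example; real gene finding also has to locate the start position and reading frame.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# at each of the three positions ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate DNA into one-letter amino acid codes, reading from position 0."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    # Substring taken from the "Find the genes!" sequence in this deck.
    print(translate("ATGAAACGCATTAGCACC"))
```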

  6. Find the words! 0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001

  7. Find the genes! AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC

  8. NL and Biological Modeling “Mary went to the ____.” MSGTIPSCPTAL ____

  9. Markov Models

  10. ME, In a Nutshell • Constrain the model. • Maximize entropy.

  11. Constraining features • “is the” occurs with frequency 1/10000. • Define a feature: • Require that:

  12. Exponential Solution • A unique solution exists with maximum entropy:
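
The formulas for slides 11 and 12 did not survive in this transcript. For orientation only, the block below gives the usual textbook form of a maximum entropy model for the “is the” example; the symbols h (history), w (next word), f_i, λ_i and Z are the conventional ones and are an assumption here, not necessarily the notation used on the slides.

```latex
% Indicator feature for the bigram "is the"
f_{\text{is the}}(h, w) =
\begin{cases}
  1 & \text{if } h \text{ ends in ``is'' and } w = \text{``the''} \\
  0 & \text{otherwise}
\end{cases}

% Constraint: the model's expected value of the feature equals the observed frequency
\sum_{h, w} p(h, w)\, f_{\text{is the}}(h, w) = \frac{1}{10000}

% Among all distributions satisfying the constraints on features f_i,
% the one with maximum entropy has exponential form
p(h, w) = \frac{1}{Z} \exp\Big( \sum_i \lambda_i f_i(h, w) \Big),
\qquad
Z = \sum_{h', w'} \exp\Big( \sum_i \lambda_i f_i(h', w') \Big)
```

Each λ_i is a weight fitted so that the corresponding constraint holds, and Z normalizes the distribution.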

  13. Triggers • Triggers – Words that increase the likelihood of other words. • Crop → Harvest • Cuban → Havana • Iran → Hashemi • Hate → Hate (a self-trigger)
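
A trigger pair can be written as a binary feature on the history and the candidate next word. The sketch below is one plausible form, assuming the history is a list of lowercased tokens; `make_trigger_feature` is a hypothetical helper name, not something defined in the talk.

```python
from typing import Callable, Sequence


def make_trigger_feature(trigger: str, target: str) -> Callable[[Sequence[str], str], int]:
    """Build f(history, word): 1 if `word` is the target and the trigger occurred earlier."""
    def feature(history: Sequence[str], word: str) -> int:
        return int(word == target and trigger in history)
    return feature


if __name__ == "__main__":
    crop_harvest = make_trigger_feature("crop", "harvest")
    print(crop_harvest(["the", "crop", "was", "ready", "for"], "harvest"))  # trigger seen
    print(crop_harvest(["the", "rain", "delayed", "the"], "harvest"))       # trigger absent

    # A self-trigger is the special case where trigger and target coincide.
    hate_hate = make_trigger_feature("hate", "hate")
    print(hate_hate(["i", "hate", "rain", "and", "i"], "hate"))
```

Unlike a bigram feature, the trigger may appear anywhere in the history, which is what makes it a long-distance feature.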

  14. Unigram and Bigram Caches • Caches – frequency tables built from the history. • Is “supercalifragilisticexpialidocious” a common word? • Allow for model adaptation.
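
A unigram cache is a frequency table updated as the document is read. In the sketch below the cache estimate is mixed with a static probability by simple linear interpolation; that mixing rule and the `cache_weight` value are assumptions chosen for brevity, whereas in an ME model the cache would enter as a feature.

```python
from collections import Counter


class UnigramCache:
    """Frequency table built from the words seen so far in the current document."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def observe(self, word: str) -> None:
        self.counts[word] += 1
        self.total += 1

    def probability(self, word: str) -> float:
        # Relative frequency in the history; 0.0 before anything has been seen.
        return self.counts[word] / self.total if self.total else 0.0


def interpolate(p_static: float, p_cache: float, cache_weight: float = 0.1) -> float:
    """Mix a static model probability with the cache estimate (illustrative rule)."""
    return (1.0 - cache_weight) * p_static + cache_weight * p_cache


if __name__ == "__main__":
    cache = UnigramCache()
    for w in "supercalifragilisticexpialidocious is a word , supercalifragilisticexpialidocious".split():
        cache.observe(w)
    # A word that is rare in general becomes likely once it has appeared in this text.
    print(interpolate(1e-9, cache.probability("supercalifragilisticexpialidocious")))
```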

  15. Applying ME Models in Computational Biology • Significant improvement for NLP. • Same for biological models? • Amino acid (AA) sequences: a simple test case.

  16. Feature Sets • Unigrams and Bigrams • Self-triggers - frequency of a specific amino acid. • Class based self-triggers - frequency of a specific amino acid class. • Unigram Cache - Amino acid frequency for this protein.
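
The self-trigger and class-based self-trigger features can be sketched as counts over the part of the protein already read. The four-way grouping of amino acids in `AA_CLASSES` is a coarse illustrative assumption; the talk does not say which classes were used.

```python
# Hypothetical amino acid classes, chosen only to make the example concrete.
AA_CLASSES = {
    "hydrophobic": "AVLIMFWP",
    "polar": "STNQYCG",
    "positive": "KRH",
    "negative": "DE",
}
CLASS_OF = {aa: name for name, members in AA_CLASSES.items() for aa in members}


def self_trigger_count(history: str, aa: str) -> int:
    """How often the candidate amino acid has already occurred in this protein."""
    return history.count(aa)


def class_trigger_count(history: str, aa: str) -> int:
    """How often any amino acid of the candidate's class has already occurred."""
    target_class = CLASS_OF[aa]
    return sum(1 for seen in history if CLASS_OF.get(seen) == target_class)


if __name__ == "__main__":
    history = "MSGTIPSCPTAL"  # prefix shown on the "NL and Biological Modeling" slide
    print(self_trigger_count(history, "S"))
    print(class_trigger_count(history, "T"))
```

Dividing either count by `len(history)` gives the frequency form named on the slide.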

  17. Training and testing data • Burset et al. set of 571 proteins. • Homologous proteins eliminated. • Resulting set of 204 proteins split into 2 groups of 102 each.

  19. Results • “Long distance” features help. • Best model gives a 30% reduction in perplexity over the unigram model. • Our model may improve predictions made by Genscan, a eukaryotic gene finding algorithm.
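
Perplexity is the quantity behind the 30% figure. The sketch below shows how it could be computed for a unigram baseline over an amino acid alphabet; the add-one smoothing and the function names are assumptions for illustration, and no numbers from the talk are reproduced.

```python
import math
from collections import Counter


def unigram_model(train: str, alphabet: str) -> dict:
    """Add-one smoothed unigram probabilities over a fixed alphabet."""
    counts = Counter(train)
    total = len(train) + len(alphabet)
    return {symbol: (counts[symbol] + 1) / total for symbol in alphabet}


def perplexity(model: dict, test: str) -> float:
    """2 ** (average negative log2 probability per symbol)."""
    log_prob = sum(math.log2(model[symbol]) for symbol in test)
    return 2 ** (-log_prob / len(test))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop in perplexity relative to the baseline model."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    model = unigram_model("MSGTIPSCPTAL", alphabet)
    print(perplexity(model, "MAGTVTEAWDVAVFAAR"))
```

A uniform model over the 20 amino acids has perplexity 20, so any useful model should score below that.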

  20. Limitations of this model • Artificial model. • Cannot represent all global features.

  21. The “Whole Sentence” Model

  22. Secondary Structure MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA -----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE---------

  23. “Whole Protein” Results • 19 features evaluated • Two were selected: • Mean length of alpha helix region • Maximum length of any structural region • 59% increase in protein likelihood
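
The two selected features can be computed directly from a secondary structure string like the one on the previous slide. The sketch assumes the common three-state convention (H for helix, E for strand, - for everything else) and that a "region" is a maximal run of one state; whether coil runs count as structural regions is not stated in the talk, so the `states` parameter leaves that choice open.

```python
from itertools import groupby


def structural_runs(ss: str) -> list:
    """Collapse a structure string into (state, run_length) pairs."""
    return [(state, len(list(group))) for state, group in groupby(ss)]


def mean_helix_length(ss: str) -> float:
    """Mean length of the maximal runs of 'H'; 0.0 if there is no helix."""
    lengths = [n for state, n in structural_runs(ss) if state == "H"]
    return sum(lengths) / len(lengths) if lengths else 0.0


def max_region_length(ss: str, states: str = "HE-") -> int:
    """Longest run among the given states (by default, any state including coil)."""
    return max((n for state, n in structural_runs(ss) if state in states), default=0)


if __name__ == "__main__":
    example = "--HHHH--EEE-HH"
    print(structural_runs(example))
    print(mean_helix_length(example), max_region_length(example))
```

Both values describe the protein as a whole, which is what separates these features from the position-by-position features of the earlier model.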

  24. Improved Glimmer Models • Glimmer uses IMMs (interpolated Markov models) to predict genes in bacteria. • Will adding amino acid triggers improve these models? How much?

  25. H. pylori Genome • 1562 coding sequences • Split into: • Training (>500 bp): 1154 genes, 1,354,167 bp • Testing (<500 bp): 408 genes, 129,045 bp

  26. Glimmer Depth

  27. Lateral Gene Transfer • Many genes in bacteria come not from their ancestors but from other bacterial species. • Different bacteria “prefer” to use different codons. • Analogous to plagiarism detection?
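
Codon preference can be made concrete as a frequency table per gene or per genome. The sketch below compares two such tables with an L1 distance; that distance is an arbitrary illustrative choice and not a method described in the talk.

```python
from collections import Counter


def codon_usage(cds: str) -> dict:
    """Relative frequency of each codon in a coding sequence, read in frame from position 0."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(usage_a: dict, usage_b: dict) -> float:
    """L1 distance between two codon usage tables (0 = identical, 2 = disjoint)."""
    codons = set(usage_a) | set(usage_b)
    return sum(abs(usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) for c in codons)


if __name__ == "__main__":
    host = codon_usage("ATGAAAAAAGCGGCGCTG")
    candidate = codon_usage("ATGAAGAAGGCTGCTTTA")
    # A gene whose usage differs strongly from the host's is a candidate for lateral transfer.
    print(usage_distance(host, candidate))
```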

  28. Model Adaptation • Gene models are trained separately for every organism. • Lots of unused information. • Analogous to cross-domain application of NLP models.

  29. Thanks • Lyle Ungar • Roni Rosenfeld • NIH Grant

  31. N-Gram Features • Unigram (frequency of individual words) • Bigram (frequency of pairs of words)
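
Unigram and bigram frequencies are simple to tabulate. The helper below (`ngram_frequencies`, an illustrative name) works on any sequence, so it accepts a list of words or a protein string alike.

```python
from collections import Counter


def ngram_frequencies(tokens, n: int) -> dict:
    """Relative frequency of each n-gram (as a tuple) in a sequence of tokens."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


if __name__ == "__main__":
    words = "mary went to the store and went to the park".split()
    print(ngram_frequencies(words, 1)[("went",)])
    print(ngram_frequencies(words, 2)[("to", "the")])
    # The same function applies to amino acid sequences.
    print(ngram_frequencies("MSGTIPSCPTAL", 2))
```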

  33. Trigger feature function
