Applying NLP models to the Biological Domain

Eugen Buehler and Lyle Ungar, CLUNCH

Overview
• “Languages” of Computers and Biology
• Probability Models for NL and Biology
• Maximum Entropy
• Basic ME amino acid model
• The “Whole Protein Model”
• Results in a gene prediction model

Bits and Bytes: The Alphabet of Computers
• Computer electronics are complicated: RAM, processor, etc.
• It all comes down to bits (1s and 0s).
• Bits can be organized into bytes (8 bits each).
• Bytes can represent, among other things, letters (ASCII), which can form sentences.

DNA: Biology’s Alphabet
• Biology is complicated.
• It comes down to nucleotides (A, C, G, T).
• Nucleotides can be grouped into codons (3 nucleotides each).
• Codons represent amino acids; amino acids make up the proteins that genes encode.
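As a minimal sketch of the nucleotide → codon → amino acid mapping described on this slide, the snippet below translates DNA using the standard genetic code. The names `CODON_TABLE` and `translate` are illustrative and not from the talk; the sketch assumes reading frame 0 and stops at the first stop codon.

```python
# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the 64-entry codon table: codon string -> one-letter amino acid ("*" = stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate DNA in frame 0, codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("ATGAAACGC")` reads the codons ATG, AAA, CGC and returns `"MKR"`.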

Find the words! 0101000110010100100100011010100011100101101101101001011101010101000001110101010100010001001110011101001100111001001110010100110010100100010010010001000100100010001001001100010001001100110010011101010100110011001001100101010001000110100100010000100100100010100100100010001101010100010101011100101011100011110001111000110011101001111101000011010000011110100111110010011000111100101111000111010101011001 CLUNCH

Find the genes! AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT GCCCAAATAAAACATGTCCTGCATGGCATTAGTTTGTTGGGGCAGTGCCCGGATAGCATCAACGCTGCGC TGATTTGCCGTGGCGAGAAAATGTCGATCGCCATTATGGCCGGCGTATTAGAAGCGCGCGGTCACAACGT TACTGTTATCGATCCGGTCGAAAAACTGCTGGCAGTGGGGCATTACCTCGAATCTACCGTCGATATTGCT GAGTCCACCCGCCGTATTGCGGCAAGCCGCATTCCGGCTGATCACATGGTGCTGATGGCAGGTTTCACCG CCGGTAATGAAAAAGGCGAACTGGTGGTGCTTGGACGCAACGGTTCCGACTACTCTGCTGCGGTGCTGGC TGCCTGTTTACGCGCCGATTGTTGCGAGATTTGGACGGACGTTGACGGGGTCTATACCTGCGACCCGCGT CAGGTGCCCGATGCGAGGTTGTTGAAGTCGATGTCCTACCAGGAAGCGATGGAGCTTTCCTACTTCGGCG CTAAAGTTCTTCACCCCCGCACCATTACCCCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC CGGAAATCCTCAAGCACCAGGTACGCTCATTGGTGCCAGCCGTGATGAAGACGAATTACCGGTCAAGGGC ATTTCCAATCTGAATAACATGGCAATGTTCAGCGTTTCTGGTCCGGGGATGAAAGGGATGGTCGGCATGG CGGCGCGCGTCTTTGCAGCGATGTCACGCGCCCGTATTTCCGTGGTGCTGATTACGCAATCATCTTCCGA ATACAGCATCAGTTTCTGCGTTCCACAAAGCGACTGTGTGCGAGCTGAACGGGCAATGCAGGAAGAGTTC CLUNCH

NL and Biological Modeling
• “Mary went to the ____.”
• MSGTIPSCPTAL ____

Markov Models
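The slide's figure is not in the transcript. As a rough illustration of the idea, the sketch below fits a first-order (bigram) Markov model that predicts the next symbol from the previous one, which applies equally to words and to amino acids. The function names and the add-one smoothing are assumptions made for this example, not details from the talk.

```python
from collections import Counter, defaultdict


def train_bigram(sequences, alphabet):
    """Estimate P(next | previous) from sequences, with add-one smoothing."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        # Add-one smoothing: every symbol gets one pseudo-count.
        total = sum(counts[prev].values()) + len(alphabet)
        model[prev] = {s: (counts[prev][s] + 1) / total for s in alphabet}
    return model


def predict_next(model, history):
    """Return the most probable next symbol given only the last symbol of history."""
    last = history[-1]
    return max(model[last], key=model[last].get)
```

A first-order model sees only the immediately preceding symbol, which is the limitation that longer-range features such as triggers and caches are meant to address.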

ME, in a Nutshell
• Constrain the model.
• Maximize entropy.

Constraining features
• “is the” occurs with frequency 1/10000.
• Define a feature:
• Require that:
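The feature definition and the constraint were shown as equations that are missing from the transcript. One plausible reading, sketched below, is a binary indicator that fires when the previous word is “is” and the current word is “the”, with the model required to match that feature's observed frequency. The names `f_is_the` and `empirical_expectation` are hypothetical.

```python
def f_is_the(history, word):
    """Indicator feature: 1 if the previous word is "is" and the current word is "the"."""
    return 1 if history and history[-1] == "is" and word == "the" else 0


def empirical_expectation(feature, tokens):
    """Average value of the feature over every (history, word) event in the corpus."""
    events = [(tokens[:i], tokens[i]) for i in range(len(tokens))]
    return sum(feature(h, w) for h, w in events) / len(events)
```

In a maximum entropy model, the constraint would require the model's expected value of `f_is_the` to equal this empirical value (1/10000 in the slide's example).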

Exponential Solution
• A unique solution exists with maximum entropy:
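The equation on this slide did not survive in the transcript. For reference, the standard form of a conditional maximum entropy model is given below; the symbols $\lambda_i$ (feature weights), $f_i$ (feature functions), and $Z(h)$ (normalizer) follow the usual convention and are assumed here rather than taken from the slide.

```latex
p(w \mid h) = \frac{1}{Z(h)} \exp\!\Big( \sum_i \lambda_i f_i(h, w) \Big),
\qquad
Z(h) = \sum_{w'} \exp\!\Big( \sum_i \lambda_i f_i(h, w') \Big)
```

Here $h$ is the history, $w$ is the predicted symbol, and the weights $\lambda_i$ are chosen so that each feature's expected value under the model matches its constraint.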

Triggers
• Triggers: words that increase the likelihood of other words.
• Crop → Harvest
• Cuban → Havana
• Iran → Hashemi
• Hate → Hate
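A trigger pair can be expressed as a binary feature over the whole history rather than just the previous word. The sketch below is one simple way to write it; `make_trigger` is a hypothetical helper, and real systems may restrict the history window or weight by distance.

```python
def make_trigger(trigger_word, target_word):
    """Build a feature that fires when target_word is predicted and
    trigger_word has already appeared anywhere in the history."""
    def feature(history, word):
        return 1 if word == target_word and trigger_word in history else 0
    return feature


# A regular trigger pair and a self-trigger, as on the slide.
crop_harvest = make_trigger("crop", "harvest")
hate_hate = make_trigger("hate", "hate")
```

A self-trigger such as Hate → Hate captures the observation that a word which has occurred once in a document is more likely to occur again.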

Unigram and Bigram Caches
• Caches: frequency tables built from the history.
• Is “supercalifragilisticexpialidocious” a common word?
• Allow for model adaptation.
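To show how a cache lets a model adapt, the sketch below mixes a static unigram probability with the word's frequency in the history so far. Linear interpolation is used here only because it is simple; it is a stand-in, not the ME formulation from the talk, and `cache_probability` and its `weight` parameter are assumptions.

```python
from collections import Counter


def cache_probability(word, history, static_prob, weight=0.1):
    """Interpolate a static unigram probability with a cache estimate
    built from the words seen so far in this document."""
    cache = Counter(history)
    cache_prob = cache[word] / len(history) if history else 0.0
    return (1 - weight) * static_prob + weight * cache_prob
```

A word that is globally rare but has already appeared several times in the current document receives a much higher probability than the static model alone would give it.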

Applying ME Models in Computational Biology
• ME models gave a significant improvement in NLP.
• Will the same hold for biological models?
• Amino acid (AA) sequences: a simple test case.

Feature Sets
• Unigrams and bigrams
• Self-triggers: frequency of a specific amino acid.
• Class-based self-triggers: frequency of a specific amino acid class.
• Unigram cache: amino acid frequency for this protein.
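The self-trigger and class-based self-trigger features can be sketched as counts over the portion of the protein seen so far. The grouping in `AA_CLASSES` is an illustrative assumption (a coarse physicochemical split); the transcript does not say which amino acid classes the authors used.

```python
# Hypothetical amino acid classes, for illustration only.
AA_CLASSES = {
    "hydrophobic": set("AVLIMFW"),
    "polar": set("STNQCYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}
CLASS_OF = {aa: name for name, members in AA_CLASSES.items() for aa in members}


def self_trigger_count(history, aa):
    """How many times this exact amino acid has already appeared."""
    return history.count(aa)


def class_self_trigger_count(history, aa):
    """How many earlier residues belong to the same class as this amino acid."""
    cls = CLASS_OF[aa]
    return sum(1 for residue in history if CLASS_OF.get(residue) == cls)
```

With the history `"MSGTIPSCPTAL"` from the earlier slide, `self_trigger_count` for `"S"` is 2, and the class count for `"A"` is 4 under this grouping (M, I, A, L).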

Training and testing data
• Burset et al. set of 571 proteins.
• Homologous proteins eliminated.
• Resulting set of 204 proteins split into 2 groups of 102 each.

Results
• “Long distance” features help.
• Best model gives a 30% reduction in perplexity over the unigram model.
• Our model may improve predictions made by Genscan, a eukaryotic gene-finding algorithm.
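Perplexity, the metric behind the 30% figure, can be computed from the probabilities a model assigns to each symbol of a held-out sequence. The sketch below uses the standard definition; the helper names are illustrative, and no numbers from the talk's experiments are reproduced here.

```python
import math


def perplexity(probabilities):
    """Perplexity = 2 ** (average negative log2 probability per symbol)."""
    log_sum = sum(math.log2(p) for p in probabilities)
    return 2 ** (-log_sum / len(probabilities))


def relative_reduction(baseline, improved):
    """Fractional reduction in perplexity relative to a baseline model."""
    return (baseline - improved) / baseline
```

As a sanity check, a model that assigns probability 1/20 to every amino acid has a perplexity of about 20, the size of the alphabet; lower perplexity means the model is less "surprised" by the test data.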

Limitations of this model
• Artificial model.
• Cannot represent all global features.

The “Whole Sentence” Model

Secondary Structure MAGTVTEAWDVAVFAARRRNDEDDTTRDSLFTYTNSNNTRGPFEGPNYHIAPRWVYNITSVWMIFVVIASIFTNGLVLVATAKFKKLRHPLNWILVNLAIADLGETVIASTISVINQISGYFILGHPMCVLEGYTVSTCGISALWSLAVISWERWVVVCKPFGNVKFDAKLAVAGIVFSWVWSAVWTAPPVFGWSRYWPHGLKTSCGPDVFSGSDDPGVLSYMIVLMITCCFIPLAVILLCYLQVWLAIRAVAAQQKESESTQKAEKEVSRMVVVMIIAYCFCWGPYTVFACFAAANPGYAFHPLAAALPAYFAKSATIYNPIIYVFMNRQFRNCIMQLFGKKVDDGSELSSTSRTEVSSVSNSSVSPA -----HHHHHHHHHHH--------------EEE--------------------EEEE---EEEEEEEEEEEEH--HEEHHHHHHHH------HHHHHHHHH---HHEEEEEEEEEEE---EEE-----EEE----EEEE-EHEHHHHEHHHH-HEEEE---------HHHHHEHEEEEEEEHH-------------H------------E----------EEEEEEEEE------EHHHHHHHHHHHHHHHHHH-H----HHHHHHHHHHHHHEEEEEEEE-------EEEHHH----------HHHHHHHHHH--------EEEEEH-----HHHHHHH---------------EEEEE--------- CLUNCH

“Whole Protein” Results
• 19 features evaluated
• Two were selected:
  • Mean length of alpha helix region
  • Maximum length of any structural region
• 59% increase in protein likelihood
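The two selected features are properties of the whole protein and can be read off a secondary structure string like the one on the previous slide. The sketch below assumes `H` marks helix, `E` marks strand, and `-` marks neither, and it takes a "structural region" to be a maximal run of `H` or `E`; both readings are assumptions, since the transcript does not define them.

```python
from itertools import groupby


def runs(structure):
    """Collapse a structure string into (symbol, run_length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(structure)]


def mean_helix_length(structure):
    """Mean length of the helix (H) runs, or 0.0 if there are none."""
    lengths = [n for symbol, n in runs(structure) if symbol == "H"]
    return sum(lengths) / len(lengths) if lengths else 0.0


def max_region_length(structure):
    """Length of the longest helix (H) or strand (E) run."""
    lengths = [n for symbol, n in runs(structure) if symbol in "HE"]
    return max(lengths, default=0)
```

Features like these depend on the entire sequence, so they cannot be expressed as conditional features over a left-to-right history, which is the motivation for a whole-sequence model.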

Improved Glimmer Models
• Glimmer uses interpolated Markov models (IMMs) to predict genes in bacteria.
• Will adding amino acid triggers improve these models? How much?

H. pylori Genome
• 1562 coding sequences
• Split into:
  • Training (>500 bp): 1154 genes, 1,354,167 bp
  • Testing (<500 bp): 408 genes, 129,045 bp

Glimmer Depth

Lateral Gene Transfer
• Many genes in bacteria come not from their ancestors but from other bacterial species.
• Different bacteria “prefer” to use different codons.
• Analogous to plagiarism detection?
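One simple way to make the codon preference idea concrete is to compare a gene's codon usage with that of a reference set. The sketch below computes codon frequencies and an L1 distance between two usage profiles; both function names and the choice of distance are assumptions for illustration, not the method from the talk.

```python
from collections import Counter


def codon_usage(gene):
    """Relative frequency of each codon in a gene, read in frame 0."""
    codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def usage_distance(usage_a, usage_b):
    """L1 distance between two codon usage profiles (0 = identical, 2 = disjoint)."""
    codons = set(usage_a) | set(usage_b)
    return sum(abs(usage_a.get(c, 0.0) - usage_b.get(c, 0.0)) for c in codons)
```

A gene whose profile sits far from the rest of the genome would be a candidate for lateral transfer, much as a passage whose style differs from the surrounding text might be flagged as borrowed.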

Model Adaptation
• Gene models are trained separately for every organism.
• Lots of unused information.
• Analogous to cross-domain application of NLP models.

Thanks
• Lyle Ungar
• Roni Rosenfeld
• NIH Grant

N-Gram Features
• Unigram (frequency of individual words)
• Bigram (frequency of pairs of words)

Trigger feature function
