2005 ITTC Research Review April 7, 2005

Beyond Genome Annotation - Characterizing Chromosome Features
Terry Clark
Assistant Professor, Electrical Engineering and Computer Science
The University of Kansas

ABSTRACT

Genome sequence data and their annotations are routinely used for determining genetic variation, assessing gene products, designing primers for various experiments, designing microarrays, and other laboratory and computational applications. Well-known methods for genome analysis include sequence alignment, motif-based systems, and stochastic models, among others. Genome sequences are also representative of a dynamic chemical and physical interplay among proteins and DNA in the eukaryotic nucleus involving chromatin and various proteins. This organization of nuclear DNA is critical to the function and specialization of cells through regulation of genes. Toward understanding genome structure, our laboratory develops and applies methods ranging from computational linguistics to molecular modeling. One such method is an unsupervised, alignment-free approach that naturally tolerates the re-organizations and insertions common to genome evolution; because it is unsupervised, it also permits de novo determination of features and feature associations. In this presentation I develop a notion in an unsupervised, alignment-free context that we call a lexicon: an inductively generated set of nucleotide “words” of varying length devised to represent a given sequence optimally. The resulting lexicon and parse provide points of departure for sequence analyses utilizing lexicon content, the sequence representation, and sequence information content. The insights gained from bioinformatics are rationalized by, and also steer, molecular modeling studies. A representative application will be presented in this talk. (Selected slides from the presentation follow.)

DNA sequencing: a basic tool for genome study

DNA sequence GCTGAGGGAAGTGAGAGACTGAGGTGGGGNCTGGAGGAGCCTGAAAAGCAGAAGTAGGAGGAAGCAGAGC TGCTCGGAACAGATCCAGAAACAGCATGTACTCACCCATCCCCCAGAGCGGCTCTCCGTTCCCACCGACC GTGAAGCTCCCTGGCCTGCACATATGGAGGGTGGAGAAGCTGAAGCCAGTGCCTGTGGCCCCTGAGAACT ACGGCATTTTCTTCTCGGGAGACTCCTACCTGGTGCTGCACAATGGCCCGGAAGAGCTCTCCCACCTGCA CCTGTGGATCGGCCAGCAGTCGTCCCGGGACGAGCAGGGGGGCTGCGCCATATTGGCCGTGCACCTCAAC ACCCTGCTCGGAGAGCGGCCTGTGCAGCACCGAGAGTCACAGGGCAATGAGTCCGACCTCTTCATGAGCT ACTTCCCCCACGGCCTCAAGTACCAGGAAGGCGGCGTGGAGTCGGCGTTTCACAAGACCTCCCCAGGAAC CGCCCCAGCTGCCATCAAGAAACTCTACCAGGTGAAGGGCAAGAAGAACATTCGTGCCACTGAGCGGGTG CTGAGCTGGGACAGTTTCAACACAGGGGACTGCTTCATCCTGGATCTGGGCCAGAACATCTTTGCCTGGT GTGGTGCGAAGTCCAACATATTGGAGCGGAACAAGGCACGGGACCTGGCACTGGCCATCCGGGACAGCGA GCGGCAGGGCAAGGCCCACGTGGAGATCGTCACCGATGGGGAGGAGCCTGCCGACATGATACAGGTCTTG GGTCCCAAGCCCTCTCTGAAGGAGGGTAACCCTGAGGAAGACCTCACAGCTGACCGGACAAACGCACAGG CCGCGGCTCTGTATAAGGTCTCTGACGCCACTGGACAGATGAACCTGACCAAGCTGGCTGATTCCAGCCC CTTCGCCCTCGAGCTGCTGATACCCGATGACTGCTTTGTGTTGGACAACGGACTCTGCGGCAAGATCTAC ATCTGGAAGGGGCGCAAAGCTAATGAGAAGGAGAGGCAGGCGGCCCTCCAAGTGGCGGAGGACTTTATCA CCCGCATGCGGTATGCCCCAAACACTCAGGTGGAGATTCTGCCCCAGGGCCGCGAGAGTGCCATCTTCAA GCAATTCTTCAAGGACTGGAAGTGAGGGTGGGCATCTCCCTGCCCCTACCTCCTACCCACTTGCTCCTCC

Human Chromosome 12

The Model: DNA as a Sequence of Features

[Figure: a DNA sequence drawn as a series of labeled features: gene, LTR, gene, transposon, LTR, LTR, LTR, binding site]

To detect features in a nucleotide sequence without prior knowledge, based solely on nucleotide occurrence patterns, we apply an unsupervised algorithm developed initially for modeling speech acquisition. The text (the DNA sequences) is presented to the algorithm as unbroken sequences of characters over the nucleotide alphabet. The task is to find the vocabulary for the text, which we also call a corpus. A chromosome may be thought of as a collection of different languages. This analogy follows intuitively from the inhomogeneities in nucleotide composition arising from the various functions that DNA performs.

A central computation in this approach is the probability of a parameter in the representation of the sequence (corpus). For this, the well-known forward-backward algorithm is used, which takes into account all paths through a lattice of representations, where a representation of a sequence is a concatenation of words.

[Figure: lattice of paths with word w spanning sequence positions a and b]

The figure represents two positions in a sequence, namely positions a and b. The arcs into these locations are all possible paths, each using some combination of words from the current lexicon. Roughly, $\alpha_a$ is the sum of the probabilities of paths in the model from the front of the sequence to location a, whereas $\beta_b$ is the same from the end of the sequence back to location b. The parameter under consideration, word w, spans the sequence between locations a and b.
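The forward and backward sums described above can be sketched for a toy case. This is a minimal illustration, not the presenter's implementation: the function names, the `alpha`/`beta` array names, the toy lexicon, and its probabilities are all assumptions, and each word is treated as an independent draw from the lexicon.

```python
def forward(seq, lexicon):
    """alpha[i]: summed probability of all ways to parse seq[:i] into lexicon words."""
    n = len(seq)
    max_len = max(len(w) for w in lexicon)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0  # the empty prefix has exactly one (empty) parse
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            word = seq[i - k:i]  # candidate word ending at position i
            if word in lexicon:
                alpha[i] += alpha[i - k] * lexicon[word]
    return alpha


def backward(seq, lexicon):
    """beta[i]: summed probability of all ways to parse seq[i:] into lexicon words."""
    n = len(seq)
    max_len = max(len(w) for w in lexicon)
    beta = [0.0] * (n + 1)
    beta[n] = 1.0  # the empty suffix has exactly one (empty) parse
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            word = seq[i:i + k]  # candidate word starting at position i
            if word in lexicon:
                beta[i] += lexicon[word] * beta[i + k]
    return beta


# Hypothetical toy lexicon (word -> probability) and sequence.
toy_lexicon = {"A": 0.4, "C": 0.2, "G": 0.1, "T": 0.1, "AC": 0.2}
toy_seq = "ACAC"
alpha = forward(toy_seq, toy_lexicon)
beta = backward(toy_seq, toy_lexicon)
```

In this toy case `alpha[-1]` and `beta[0]` both equal the total probability of the sequence over its four possible parses (about 0.0784), which is a convenient consistency check on the two passes.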

With the forward probability $\alpha_a$, the backward probability $\beta_b$, and the probability $P(w)$ of the parameter under consideration, w, the probability of w spanning the region from a to b in sequence s is given by:

$$P(a \xrightarrow{w} b \mid s) = \frac{\alpha_a \, P(w) \, \beta_b}{P(s)}$$

where $P(s)$ is the total probability of s summed over all representations. Applying this equation over all representations and all points a and b determines the count of parameter w. Such counts are the basis of the expectation step in the EM optimization algorithm; the maximization step adjusts probabilities in the model to maximize the expectation of the evidence based on the model. Parameters are added to and deleted from the lexicon by combining existing parameters based on the evidence and the estimated cost/benefit of the new parameter to the description length.
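One EM iteration of the kind described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the presenter's code: the function names, the toy lexicon, and its probabilities are hypothetical, and the step that adds or deletes lexicon entries according to description length is deliberately omitted.

```python
from collections import defaultdict


def lattice_sums(seq, lexicon):
    """Forward sums (alpha) and backward sums (beta) over all parses of seq."""
    n = len(seq)
    max_len = max(len(w) for w in lexicon)
    alpha = [1.0] + [0.0] * n
    beta = [0.0] * n + [1.0]
    for i in range(1, n + 1):
        for k in range(1, min(max_len, i) + 1):
            alpha[i] += alpha[i - k] * lexicon.get(seq[i - k:i], 0.0)
    for i in range(n - 1, -1, -1):
        for k in range(1, min(max_len, n - i) + 1):
            beta[i] += lexicon.get(seq[i:i + k], 0.0) * beta[i + k]
    return alpha, beta


def expected_counts(seq, lexicon):
    """E-step: sum, over every span (a, b) that matches w, the posterior
    alpha[a] * P(w) * beta[b] / P(seq)."""
    alpha, beta = lattice_sums(seq, lexicon)
    total = alpha[len(seq)]  # P(seq) summed over all parses
    counts = defaultdict(float)
    for a in range(len(seq)):
        for word, p in lexicon.items():
            if seq.startswith(word, a):
                b = a + len(word)
                counts[word] += alpha[a] * p * beta[b] / total
    return dict(counts)


def m_step(counts):
    """M-step: renormalize expected counts into word probabilities."""
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}


# Hypothetical toy lexicon (word -> probability) and sequence.
toy_lexicon = {"A": 0.4, "C": 0.2, "G": 0.1, "T": 0.1, "AC": 0.2}
toy_seq = "ACAC"
counts = expected_counts(toy_seq, toy_lexicon)
new_probs = m_step(counts)
```

A useful sanity check is that the expected counts, weighted by word length, sum to the sequence length, since every parse covers every position exactly once. Words that never match the sequence receive no count and drop out of `new_probs` in this simplified version.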

A portion of a lexicon from a chunk containing satellites

word                            count in representation   frequency
A                               1363                      0.312471
T                               664                       0.152224
C                               624                       0.143054
G                               465                       0.106602
...
CCTTA                           9                         0.00206327
AAACCCTAAT                      9                         0.00206327
GTTTT                           9                         0.00206327
TCCTAAACCCT                     9                         0.00206327
CAAACC                          8                         0.00183402
CCAT                            8                         0.00183402
AACCCTAAACC                     8                         0.00183402
ACTCCA                          8                         0.00183402
CCTTAAACCCTAAACC                8                         0.00183402
CTAAACCCTAA                     8                         0.00183402
CTTTAAAACCTAAATCCTA             8                         0.00183402
CTAG                            8                         0.00183402
ATCCTACTTTAGCTTC                8                         0.00183402
TTCGTATGATTTTTGGTTTTC           7                         0.00160477
GGATT                           7                         0.00160477
ACCCTAAACATTAAAACCTAAACCC       7                         0.00160477
ATCTTCCAACAAGGAAAGAACACTTTA     7                         0.00160477
ATCTAGTCATATTTGAC               7                         0.00160477
AAAGTATATTTGGTC                 7                         0.00160477
CTTCTA                          7                         0.00160477
GTTGCGGTTCTAGTTCTTATACTCAATC    7                         0.00160477

Number of words contained in lexicons around this region

% wc -l chr4range007[789]_Lexicon_Frequency.txt
 201 chr4range0077_Lexicon_Frequency.txt
 117 chr4range0078_Lexicon_Frequency.txt
 215 chr4range0079_Lexicon_Frequency.txt
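The frequency column in the lexicon listing appears to be each word's count divided by the total number of words in the representation. The sketch below checks that reading on a few rows and converts each frequency to an information content in bits. The total of 4362 is not stated on the slide; it is an assumption inferred from 1363 / 0.312471, and the variable names are hypothetical.

```python
import math

# Assumption: total words in the representation, inferred from 1363 / 0.312471.
TOTAL_WORDS = 4362

# A few (word, count) rows copied from the lexicon listing.
rows = {
    "A": 1363,
    "T": 664,
    "C": 624,
    "G": 465,
    "CCTTA": 9,
    "CAAACC": 8,
    "GGATT": 7,
}

# Relative frequency of each word in the representation.
freq = {word: count / TOTAL_WORDS for word, count in rows.items()}

# Information content in bits: rarer words carry more bits per occurrence.
bits = {word: -math.log2(f) for word, f in freq.items()}

for word in rows:
    print(f"{word:8s} {rows[word]:5d} {freq[word]:.6g} {bits[word]:.2f} bits")
```

Under this assumed total, the computed frequencies agree with the slide's values to the precision shown, which supports the count-over-total interpretation without confirming it.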

Protein and DNA Sequences: 8 Histones and 2 DNA Strands

>1KX5:A HISTONE H3
ARTKQTARKSTGGKAPRKQLATKAARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTEL
LIRKLPFQRLVREIAQDFKTDLRFQSSAVMALQEASEAYLVALFEDTNLCAIHAKRVTIM
PKDIQLARRIRGERA
>1KX5:B HISTONE H4
SGRGKGGKGLGKGGAKRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKV
FLENVIRDAVTYTEHAKRKTVTAMDVVYALKRQGRTLYGFGG
. . .
>1KX5:H HISTONE H2B.2
PEPAKSAPAPKKGSKKAVTKTQKKDGKKRRKTRKESYAIYVYKVLKQVHPDTGISSKAMS
IMNSFVNDVFERIAGEASRLAHYNKRSTITSREIQTAVRLLLPGELAKHAVSEGTKAVTK
YTSAK
>1KX5:I DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGAATCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT
>1KX5:J DNA
ATCAATATCCACCTGCAGATACTACCAAAAGTGTATTTGGAAACTGCTCCATCAAAAGGC
ATGTTCAGCTGGATTCCAGCTGAACATGCCTTTTGATGGAGCAGTTTCCAAATACACTTT
TGGTAGTATCTGCAGGTGGATATTGAT

ITTC High Performance Computing Infrastructure

• 128 processor cluster (64 nodes)
• 3.2 GHz Processors (Xeon based)
• 4 GB RAM / node
• 146 GB SCSI Disk / node
• 8 dual processor server nodes
• 25-Terabyte File Server
• Tape Robot System (LTO3 Ultrium)
• High Performance Network

Compute nodes and server cluster components. System housed in newly expanded and remodeled machine room 218, Nichols Hall.
