Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa 50613. The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach.


Kevin C. O'Kane

Department of Computer Science

The University of Northern Iowa

Cedar Falls, Iowa 50613

The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach



The area of natural language text indexing and retrieval has been studied since the mid-50's. In text retrieval, the problem is to locate documents related to a natural language query.

To this purpose, natural language text indexing programs have employed many techniques to identify the terms in a document most likely to be content descriptors, as opposed to terms that are poor content descriptors.

By eliminating poor descriptors and pre-indexing documents by descriptors more likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.

The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace in which documents and queries are represented as vectors.


Overview

In text retrieval, the problem is to locate documents related to a natural language query.

Natural language text indexing programs identify terms in a document most likely to be content descriptors.

The goal of these experiments is to apply text indexing techniques to genomic data bases.



Natural language indexing

Natural language text indexing and retrieval has been studied since the mid-50's. In text retrieval, the problem is to locate documents related to a natural language query.

Natural language text indexing programs employ techniques to identify terms in a document most likely to be content descriptors.

By eliminating poor descriptors and pre-indexing documents by descriptors likely to be good discriminators, the speed of selection and precision of document relevance ranking can be improved.

The “vector space model”, developed by G. Salton, views the problem as an n-dimensional hyperspace of documents and queries.



[Figure: Document Hyperspace]


[Figure: Hyperspace Queries]


[Figure: Clustering Objects by Feature]


[Figure: Cosine Similarity Coefficient]
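
The cosine similarity coefficient named on this slide measures the angle between a document vector and a query vector in the term hyperspace. The following is a minimal sketch rather than the author's implementation; the three-term vocabulary, the document names, and the weights are made up for illustration.

```python
import math

def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_doc * norm_query)

# Hypothetical 3-term vocabulary; the weights are invented for this example.
documents = {
    "doc_a": [2.0, 0.0, 1.0],
    "doc_b": [0.0, 3.0, 0.0],
}
query = [1.0, 0.0, 1.0]

# Rank documents by decreasing similarity to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(documents[name], query),
    reverse=True,
)
```

Because the coefficient depends only on direction, a long document and a short one with the same proportions of terms score the same against a query.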


Genomic data bases

EMBL (http://www.embl.org)

SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)

PROSITE (http://www.expasy.org/prosite/)

PIR (http://pir.georgetown.edu/home.shtml)

NCBI/NLM GenBank (http://www.ncbi.nih.gov/)

MGD: The Mouse Genome Database (http://www.informatics.jax.org/)

OMIM - Online Mendelian Inheritance in Man (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)



“nt” sequence data base

NCBI “nt” data base: ~12 billion bytes in length comprising 2,584,440 sequences in FASTA format (Sept 2004).

Example sequence:

>gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone
CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATGCATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTCTGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACTGGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCGACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCCTAGCGCTACGGGCTGCTTAGCAACCGAATTC


GenBank

LOCUS       AAB2MCG1     289 bp    DNA     linear   PRI 23-AUG-2002
DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
ACCESSION   AF032092
VERSION     AF032092.1  GI:3265027
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      Aotus azarai (Azara's night monkey)
  ORGANISM  Aotus azarai
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
REFERENCE   1  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
            Cavanez,C.
  TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
  JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
  MEDLINE   98298008
   PUBMED   9634477
REFERENCE   2  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
            Fairchild Building Campus West Dr. Room D-100, Stanford, CA
            94305-5126, USA
FEATURES             Location/Qualifiers
     source          1..289
                     /organism="Aotus azarai"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:30591"
     sig_peptide     134..193
     exon            <134..200
                     /number=1
     intron          201..>289
                     /number=1
ORIGIN
        1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
       61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
      121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
      181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc


Sequence matching

Current access to sequence data bases is mainly by heuristic-assisted pattern matching on flat or nearly flat files using programs such as BLAST and FASTA.

The underlying data bases are growing rapidly, with a consequent deterioration of search times even on large, multiprocessor systems as current software tools reach their design limits.

BLAST systems index data base sequences according to short code letter words (usually 3 letters for amino acid and 11 for nucleotide data bases) and use scoring matrices.

Queries are also decomposed into similar short code words. The data base is scanned, and sequences with words in common with the query are processed to extend the initial code word match.


Example BLAST output

                                                               Score    E
Sequences producing significant alignments:                    (bits) Value

emb|BX015832.1|CNS08KDO Single read from an extremity of a ...   918   0.0
emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...   902   0.0
emb|BX065445.1|CNS09MNT Single read from an extremity of a ...   894   0.0
emb|BX052703.1|CNS09CTV Single read from an extremity of a ...   894   0.0
emb|BX030708.1|CNS08VUW Single read from an extremity of a ...   894   0.0
emb|BX030663.1|CNS08VTN Single read from an extremity of a ...   894   0.0
.............................................................................

>emb|BX015832.1|CNS08KDO Single read from an extremity of a full-length cDNA clone made from Anopheles gambiae total adult females. 3-PRIME end of clone FK0AAA23DA12
          Length = 866

 Score = 918 bits (463), Expect = 0.0
 Identities = 535/559 (95%)
 Strand = Plus / Plus

Query: 1   tctttactattattggggaatttcgaggaacatttgttccccttacaatgcatttctata 60
           ||||||||||||||||  |||||||| |||||||||||||||||||||||||||||||||
Sbjct: 1   tctttactattattggtcaatttcgatgaacatttgttccccttacaatgcatttctata 60

Query: 61  acctacacctggagtaggtggttccggttcagccacttcagtgggaggaacttccgtttc 120
           | |||||||||||||||||||||||||||||||||||||||||||||| |||||||||||
Sbjct: 61  aactacacctggagtaggtggttccggttcagccacttcagtgggaggcacttccgtttc 120


Developing a vector space approach to sequence indexing

This work explores natural language indexing techniques applied to genomic data bases through:

Weight-based indexing of k-tuples derived from the NCBI “nt” sequence data base;

Text terms used in genomic sequence data banks and literature.

Both applications are implemented for Linux and written in Mumps and MDH, a Mumps-related C++ toolkit that can index data sets of up to 256 terabytes using a B-tree based multidimensional data model and that includes many retrieval and sequence matching functions.


Inverse document frequency wgt

The IDF weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread.

Thus, words of broad context are weighted lower than words of narrow context.

Words of low weight are hypothesized to be poor indexing terms while words with high weights are hypothesized to be good indexing terms.

The bulk of the words, as is the case in natural language text, reside in the middle range.



Natural language example

Word        Freq(i,j)  TotFreq  DocFreq   Wgt1    Wgt2  Wgt3       MCA

[1] Death of a cult. (Apple Computer needs to alter its strategy) (column)
apple               4      261      112  1.716   9.757    17   -1.1625
computer            4      706      358  2.028   5.109    10  -19.4405
mac                 2      146       71  0.973   6.290     6   -0.0256
macintosh           4      210      107  2.038   9.940    20   -0.5855
strategy            2       79       67  1.696   6.406    11   -0.0592

[3] WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.
edit                2      111       77  1.387   6.128     8   -0.0961
frame               2        9        7  1.556  10.924    17    0.0131
import              2       29       19  1.310   8.927    12    0.0998
macintosh           3      210      107  1.529   7.705    12   -0.5855
macro               3       38       24  1.895  12.189    23    0.1075
outstand            1       10        9  0.900   5.711     5    0.0168
user                4      861      435  2.021   4.330     9  -26.8094
wordperfect         8       24        8  2.667  39.627   106    0.1747

[4] Radius Pivot for Built-In Video and Radius Color Pivot. (Hardware Review) (new Mac monitors)(includes related article on design of
built-in            3       35       29  2.486  11.621    29    0.0678
color               3       81       47  1.741  10.173    18    0.0809
mac                 2      146       71  0.973   6.290     6   -0.0256
monitor             6       88       52  3.545  18.739    66    0.0946
resolution          2       50       32  1.280   7.884    10    0.0288
screen              2       92       62  1.348   6.561     9    0.0199
video               4      106       61  2.302  12.188    28    0.0187


Indexing experiment

Sequences from the NCBI "nt" (non-redundant nucleotide) data base were used.

The “nt” data base is approximately 12 billion bytes in length, comprising 2,584,440 sequences in FASTA format (Sept 2004).

A “word” size of 11 was used throughout. A total of 4,194,299 words were identified, slightly fewer than the theoretical maximum of 4,194,304 (4^11).
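
As a rough illustration of what an 11-letter “word” is, the sketch below slides a window over a nucleotide sequence and maps each word to an integer. It is not the author's Mumps/MDH code: the 2-bit letter code and the decision to skip words containing letters other than A, C, G and T are assumptions made for this example.

```python
WORD_SIZE = 11
# Assumed 2-bit code per letter; the slides do not give the actual mapping.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_word(word):
    """Return the base-4 integer for a word, or None if it has a non-ACGT letter."""
    value = 0
    for letter in word.upper():
        code = CODES.get(letter)
        if code is None:
            return None  # assumption: ambiguous letters invalidate the word
        value = value * 4 + code
    return value

def words(sequence, size=WORD_SIZE):
    """Yield the numeric code of every overlapping window of `size` letters."""
    for start in range(len(sequence) - size + 1):
        code = encode_word(sequence[start:start + size])
        if code is not None:
            yield code

# Number of distinct 11-letter words over a 4-letter alphabet.
MAX_WORDS = 4 ** WORD_SIZE
```

With overlapping windows, a sequence of L letters yields L - 10 words, which agrees with the 279 query words reported for the 289-letter query in the index scoring output.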


Calculating the idf weight

The overall frequencies of occurrence of all possible 11 character words from each sequence were determined along with the number of sequences in which each unique word was found.

A weight Wgt_i for each word i was calculated by taking the Log10 of the total number of sequences (N) divided by the number of sequences in which the word occurred (DocFreq_i), multiplying by 10, and truncating the result to an integer.

\[ \mathrm{Wgt}_i = \left\lfloor 10 \log_{10}\left( \frac{N}{\mathrm{DocFreq}_i} \right) \right\rfloor \]

In natural language indexing, this is referred to as the inverse document frequency (IDF) weight.
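
The weight formula can be sketched in a few lines. The function name and the range check are illustrative additions; only the formula itself and the value of N come from the slides.

```python
import math

def idf_weight(total_sequences, doc_freq):
    """Truncated 10 * log10(N / DocFreq) for one word."""
    if doc_freq <= 0 or doc_freq > total_sequences:
        raise ValueError("doc_freq must be between 1 and total_sequences")
    # N >= DocFreq, so the value is non-negative and int() truncates toward zero.
    return int(10 * math.log10(total_sequences / doc_freq))

N = 2_584_440  # sequences in the "nt" data base (Sept 2004)

# A word found in every sequence carries no weight; rarer words weigh more.
weight_common = idf_weight(N, N)
weight_rare = idf_weight(N, 1)
```

A word that occurs in every sequence gets weight 0, and the weight grows as the word becomes rarer, which is the behaviour described on the inverse document frequency slide.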


File sizes

Initial file analysis produces about 110 intermediate files of about 440 million bytes each from the input data base (12 GB).

out.table is a large (40 billion byte) word-sequence file.

freq.bin (53 million bytes) contains the inverse document frequency weight for each word.

index (76 million bytes) gives, for each word, the eight-byte offset of the word's entry in out.table.

index and freq.bin are merged into ITABLE (112 million bytes) which contains for each word its weight, offset, and a pointer to a list of aliases (not used with the “nt” data base).



Data base

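Vector of M weights:

\[ W = (w_1, w_2, w_3, \ldots, w_M) \]

Word-sequence matrix:

\[ F = \begin{pmatrix}
f_{1,1} & f_{1,2} & f_{1,3} & \cdots & f_{1,N} \\
f_{2,1} & f_{2,2} & f_{2,3} & \cdots & f_{2,N} \\
f_{3,1} & f_{3,2} & f_{3,3} & \cdots & f_{3,N} \\
\vdots  & \vdots  & \vdots  &        & \vdots  \\
f_{M,1} & f_{M,2} & f_{M,3} & \cdots & f_{M,N}
\end{pmatrix} \]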


Number of words at each weight

for i = 1 to 120
    z_i ← 0
    for j = 1 to M
        if w_j = i then z_i ← z_i + 1
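
The counting loop for the number of words at each weight can be written compactly as a histogram over the weight vector. This is only a sketch; the six-element vector W is a made-up stand-in for the real vector of M weights.

```python
from collections import Counter

def words_per_weight(weights, max_weight=120):
    """z[i] = number of words whose weight equals i, for i = 1..max_weight."""
    counts = Counter(weights)
    return {i: counts.get(i, 0) for i in range(1, max_weight + 1)}

# Hypothetical weight vector W for six words.
W = [3, 3, 7, 120, 7, 3]
z = words_per_weight(W)
```

Counting in a single pass over W avoids rescanning all M weights once per weight value, which matters when M is in the millions.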


[Figure: Number of Words at Each IDF Wgt.]


Sum of all instances of each weight

for i = 1 to 120                // for each weight
    x_i ← 0
    for j = 1 to M              // for each word
        for k = 1 to N          // for each sequence
            if f_{j,k} = i then x_i ← x_i + 1
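
The sketch below follows the triple loop for the sum of all instances of each weight literally: it counts the entries of F that equal each value i. The slides do not say how the entries of F are defined, so the small matrix here is purely hypothetical and the function makes no assumption beyond the comparison written in the pseudocode.

```python
def instances_per_value(F, max_value=120):
    """x[i] = number of entries f[j][k] of F that equal i, for i = 1..max_value."""
    x = {i: 0 for i in range(1, max_value + 1)}
    for row in F:          # one row per word j
        for entry in row:  # one column per sequence k
            if entry in x:
                x[entry] += 1
    return x

# Hypothetical matrix with 3 words (rows) and 4 sequences (columns).
F = [
    [5, 0, 5, 0],
    [0, 9, 9, 9],
    [5, 0, 0, 0],
]
x = instances_per_value(F)
```

Visiting each entry once gives the same totals as looping over all 120 values for every entry, at a fraction of the work.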


[Figure: Number of Occurrences at Each IDF Level]


Sequence retrieval

For retrieval, a query sequence is read and decomposed into 11 character words. These words are reduced to a numeric equivalent which is used as an index into the word-sequence table. Entries in a master vector corresponding to sequences are incremented by the weight of the word if the word occurs in the sequence and the weight of the word lies within a specified range. When all words have been processed, entries in the master sequence vector are normalized according to the length of the underlying sequence relative to the length of the query. Finally, the master sequence vector is sorted and the top-scoring entries are either printed or submitted to a Smith-Waterman alignment, re-sorted, and then printed. Optionally, the Smith-Waterman alignments themselves can be printed, and the selected sequences can be extracted from the nt data base and stored in a separate output file for additional processing. FASTA post-processing is an option.
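
The scoring stage of this retrieval procedure might look roughly like the sketch below. It is not the author's implementation: the in-memory dictionaries stand in for the on-disk word-sequence table, each distinct query word is counted once, and the length normalization (query length divided by sequence length, capped at 1) is an assumed formula, since the slides do not give one. The default weight range 65-120 is taken from one of the experiments reported in the slides.

```python
from collections import defaultdict

def score_sequences(query_words, index, weights, lengths, query_length,
                    min_weight=65, max_weight=120, top=10):
    """Rank sequences by the summed weights of the query words they contain."""
    master = defaultdict(float)  # the "master vector": sequence id -> score
    for word in set(query_words):  # assumption: each distinct word counts once
        weight = weights.get(word, 0)
        if not (min_weight <= weight <= max_weight):
            continue  # skip words outside the selected weight range
        for seq_id in index.get(word, ()):
            master[seq_id] += weight
    # Assumed normalization: scale down sequences longer than the query.
    for seq_id in master:
        master[seq_id] *= min(1.0, query_length / lengths[seq_id])
    return sorted(master.items(), key=lambda item: item[1], reverse=True)[:top]

# Hypothetical toy data: numeric word codes, posting lists, weights, lengths.
index = {10: ["s1", "s2"], 11: ["s1"], 12: ["s2", "s3"]}
weights = {10: 70, 11: 90, 12: 40}
lengths = {"s1": 100, "s2": 400, "s3": 100}
ranked = score_sequences([10, 11, 12, 10], index, weights, lengths, query_length=100)
```

In this toy run, word 12 falls below the weight range and is ignored, so sequence s3 never enters the master vector; the surviving candidates would then go on to Smith-Waterman alignment.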


[Figure: Unweighted Result for 500 Random Queries]


[Figure: Result for 500 Random Queries, Weight Range 65-120]


[Figure: Overall Results for 500 Random Queries]


Index scoring results

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

Query string has 289 letters

Searching ...

68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds

29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds

29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds

28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds

28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),

27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds

.............................................................................................

Total fetch time used: 13

Total number of accessions found: 501

Total primary query word count: 279

Total alias count: 0

Total number of sequences searched: 2584440

Max Indx 1300023



Smith-Waterman result scoring

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

Query string has 289 letters

top= >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate:71

166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
    ::::::::::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::: :::
  1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
    :::::::::::::::::: ::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::
 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
    ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454
    :::::::::::::::::::::::::::::::::::::::::::::::::
241 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 289

score=566


S-W scores

566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds

502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),

501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),

498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds

497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol

497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds

495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds

495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds

494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds

494 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

493 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p

...

Total fetch time used: 26

Total number of accessions found: 31

Total primary query word count: 279

Total alias count: 0

Total number of sequences searched: 2584440

Max Indx 1300023



Larger sequences

On larger query sequences (5,000 to 6,000 letters), the IDF method performed slightly better than BLAST. For 25 randomly generated query sequences, the IDF method ranked the original sequence first 24 times and third once. BLAST ranked the original sequence first 21 times; the remaining 4 were ranked 2, 2, 3 and 4. The average time per query was 47.4 seconds for the IDF method and 122.8 seconds for BLAST.


The next step

Future work

Weighted Term Vectors.

Other weighting schemes such as the Modified Centroid Algorithm.

Sequence-Sequence and Term-Term Correlations.

Sequence clustering.



References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410.

O'Kane, K.C. and Lockner, M.J. (2004) Indexing genomic sequence libraries. Information Processing and Management 41:265-274.

O'Kane, K.C. (2004) The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, submitted.

Pearson, W.R. (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol. Biol. 132:185-219.

Salton, G. (1983) Introduction to Modern Information Retrieval. McGraw-Hill, New York.

Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.


[Figure: Hierarchical Data Base]


Bioinformatics

Sloan Report on Bioinformatics from June 2004.

Number of graduates

* There were only 26 new PhDs produced...

* 102 master's degrees awarded...

* Only 17 bachelor's degrees produced...

* The data is for January 2002 until March 2003.

"... in the next few years the number of graduates is expected to increase by two or three times."

Average program enrollment:

* 103 = Bachelor's

* 435 = Master's

* 296 = PhD


B.S. in Bioinformatics at UNI

Mathematics: 800:060; 800:061; 800:064; 800:152; 800:164 (17 hours)

Computer Science: 810:061; 810:062; 810:065; 810:066; 810:080; 810:114; 810:115; 810:180 (24 hours)

Biology: 840:051; 840:052; 840:130; 840:140; 840:153 (19 hours)

Chemistry: 860:070* or both 860:044 and 860:048; 860:063 (9-12 hours)

Elective: One course from the following (3 hours)

Computer Science: 810:143; 810:147; 810:153; 810:155; 810:161; 810:172; 810:181

Total 73-75 hours


Courses

800:060. Calculus I. The derivatives and integrals of elementary functions and their applications.

800:061. Calculus II. Continuation of 800:060.

800:064. Elementary Probability and Statistics for Bioinformatics. Descriptive statistics, basic probability concepts, confidence intervals, hypothesis testing, correlation and regression, elementary concepts of survival analysis.

800:152. Introduction to Probability. Axioms of probability, sample spaces having equally likely outcomes, conditional probability and independence, random variables, expectation, moment generating functions, jointly distributed random variables, weak law of large numbers, central limit theorem.

800:164. Statistical Methods in Bioinformatics. Analysis of a DNA sequence, analysis of multiple DNA and protein sequences, BLAST.

810:061. Computer Science I. Introduction to computer programming in the context of a modern object-oriented programming language. Emphasis on good programming techniques, object-oriented design, and style through extensive practice in designing, coding, and debugging programs.

810:062. Computer Science II. Intermediate programming in an object-oriented environment. Topics include object-oriented design, implementation of classes and methods, dynamic polymorphism, frameworks, patterns, software reuse, limitations, exceptions, and threads.

810:065. Computing for Bioinformatics I. Intermediate programming with emphasis on bioinformatics. Includes file handling, memory management, multi-threading, B-trees, introduction to dynamic programming including Needleman-Wunsch and Smith-Waterman algorithms for optimal alignments, exploration of BLAST, FASTA and gapped alignment, substitution matrices.

810:066. Computing for Bioinformatics II. Advanced bioinformatics computing: Perl and CGI programming; data base facilities for bioinformatics; pattern matching with regular expressions; advanced dynamic programming: optimal versus local alignment, multiple alignments; data base mining tools, Entrez, SRS, BLAST, FASTA, CLUSTAL; graphical 3-D representation of proteins; phylogenetic trees.

810:080. Discrete Structures. Topics include propositional and first-order logic; proofs and inference; mathematical induction; sets, relations, and functions; and graphs, lattices, and Boolean algebra, all in the context of computer science.

810:114. Database Systems. Storage of, and access to, physical databases; data models, query languages, transaction processing, and recovery techniques; object-oriented and distributed database systems; and database design.

810:115. Information Storage and Retrieval. Natural language processing; analysis of textual material by statistical, syntactic, and logical methods; retrieval systems models, dictionary construction, query processing, file structures, content analysis; automatic retrieval systems and question-answering systems; and evaluation of retrieval effectiveness.


810:180. Undergraduate Research in Computer Science

840:051. General Biology: Organismal Diversity. Study of organismic biology emphasizing evolutionary patterns and diversity of organisms and interdependency of structure and function in living systems.

840:052. General Biology: Cell Structure and Function. Study of cells, genetics, and DNA technology emphasizing the chemical basis of life and flow of information.

840:130. Molecular Biology of the Cell. Introduction to the molecular, biochemical, and cellular structure and function of cells, DNA structure and functions, and the translation of genetic information into functional structures of living cells. DNA replication, transcription of genes, and synthesis and processing of proteins will be emphasized.


840:140. Genetics. Analytical approach to classical, molecular, and population genetics.

840:153. Recombinant DNA Techniques. Study of techniques for manipulating and analyzing DNA, including genomic library construction, polymerase chain reaction, oligonucleotide synthesis, genomic analysis with computers, and DNA and RNA isolation.

860:070. General Chemistry I-II. Accelerated course for well-prepared students. Content similar to 860:044 and 860:048 but covered in one semester. Completion satisfies General Chemistry requirement of any chemistry major.


860:063. Applied Organic and Biochemistry. Basic concepts in organic chemistry and biochemistry, including nomenclature, functional groups, reactivity, and macromolecules.

Elective from:

810:143(g). Operating Systems. History and evolution of operating systems; process and processor management; primary and auxiliary storage management; performance evaluation, security, and distributed systems issues; and case studies of modern operating systems.


810:147. Networking. Network architectures and communication protocol standards. Topics include communication of digital data, data-link protocols, local-area networks, network-layer protocols, transport-layer protocols, applications, network security, and management.

810:153. Design and Analysis of Algorithms. Algorithm design techniques such as dynamic programming and greedy algorithms; complexity analysis of algorithms; efficient algorithms for classical problems; intractable problems and techniques for addressing them; and algorithms for parallel machines.

810:155. Translation of Programming Languages. Introduction to analysis of programming languages and construction of translators.


810:161. Artificial Intelligence. Models of intelligent behavior and problem solving; knowledge representation and search methods; learning; topics such as knowledge-based systems, language understanding, and vision; optional 1-hour lab in symbolic programming techniques: heuristic programming; symbolic representations and algorithms; and applications to search, parsing, and high-level problem-solving tasks.

810:172. Software Engineering. Study of software life cycle models and their phases--planning, requirements, specifications, design, implementation, testing, and maintenance. Emphasis on tools, documentation, and applications.

810:181. Theory of Computation. Topics include regular languages and grammars; finite state automata; context-free languages and grammars; language recognition and parsing; and Turing computability and undecidability.