BioinfoGRID Symposium 2007

Mathematical methods for the analysis of the Barcode Of Life
P. Bertolazzi, G. Felici
Istituto di Analisi dei Sistemi e Informatica del Consiglio Nazionale delle Ricerche, Optimization Laboratory for Data Mining

DNA Barcoding 1/4
• DNA barcoding is a new technique that uses a short DNA sequence from a standardized, agreed-upon position in the genome as a molecular diagnostic for species-level identification.
• The sequence chosen as barcode is a small portion of mitochondrial DNA (mt-DNA) that differs by several percent even between closely related species, and carries enough information to identify the species of an individual.
• It is easy to isolate and analyze.
• Moreover, it summarizes many properties of the entire mt-DNA sequence.

DNA Barcoding 2/4
Figure: a typical animal cell. Within the cytoplasm, the major organelles and cellular structures include: (1) nucleolus (2) nucleus (3) ribosome (4) vesicle (5) rough endoplasmic reticulum (6) Golgi apparatus (7) cytoskeleton (8) smooth endoplasmic reticulum (9) mitochondria (10) vacuole (11) cytosol (12) lysosome (13) centriole.

DNA Barcoding 3/4
• The first studies on DNA barcoding (2003) are due to Hebert (see [1] for the latest results and a complete bibliography)
• Two mt-DNA subsequences (genes) have been proposed as barcodes:
  • Cytochrome c Oxidase I (COI)
  • Cytochrome b
• Since 2003, Hebert has used COI to study fishes, birds, and other species
• Hebert employs the Neighbor Joining (NJ) method [2], originally proposed for building phylogenetic trees, and identifies each species as a distinct, non-overlapping cluster of sequences in the NJ tree.

DNA Barcoding 4/4
• Recent studies [1] show that even fragments of the COI sequence have the same expressive power as the entire sequence
• The Consortium for the Barcode of Life (CBOL) is an international initiative devoted to developing DNA barcoding as a global standard for the identification of biological species: http://www.barcoding.si.edu/
• Data Analysis Working Group (CBOL and DIMACS, Rutgers): http://dimacs.rutgers.edu/Workshops/BarcodeResearchChallenges2007/ and http://dimacs.rutgers.edu/Workshops/DNAInitiative/

Research challenges 1/2
• Specimen identification versus species discovery: the knowledge about species is not always complete
• Is a species similar to another or not?
• Optimizing sample size: barcodes are not easy to measure, and large samples are very expensive
• Using character-based barcodes: an alternative to comparing specimens in terms of overall percent sequence similarity
• Shrinking the barcode

Research challenges 2/2
• Shrinking the barcode: we want to identify the portions of the barcode that are relevant for each species; this would make it easier to identify and study new data on species
A barcode fragment for 2 species:
Species 1: CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC
Species 2: CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC
• 45 sites
• The bases differ at only 9 sites
• In this case a single site is sufficient to distinguish the 2 species
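The site-by-site comparison of the two fragments can be sketched in a few lines of Python. This is only an illustration: the function name `differing_sites` is hypothetical, positions are reported 1-based by assumption, and the two strings are the fragments shown on the slide.

```python
def differing_sites(seq_a: str, seq_b: str) -> list[int]:
    """Return the 1-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b), start=1) if a != b]


species_1 = "CTGGCATAGTAGGTACTGCCCTTAGCCTCCCCCCAGTCCTCTCCC"
species_2 = "CTGGCATAGTCGGAACCGCTCTCAGCCTACCCCCAGCCCTTTCTC"

sites = differing_sites(species_1, species_2)
print(f"{len(species_1)} sites, differences at positions {sites}")
```

With only two species, any one of the reported positions is enough to tell them apart; with many species, choosing a small set of positions becomes the feature selection problem discussed later.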

Main goal
• Given samples from different species, the objective is to identify the combinations of mutated nucleotides that have determined the differences among species along the evolutionary path

Our work 1/2
• We have addressed these challenges using supervised learning and special methods that we have developed for the classification of large logic data sets
• Such methods have already proved successful in other bio-computational problems (Tag SNP selection, microarray data analysis, genetic diagnostics) [3]:
  • Feature selection methods based on Integer Programming: they aim at finding a subset of bases that makes it possible to distinguish among species
  • Logic classification methods based on logic programming: they aim at finding rules or formulas that can distinguish between individuals of different species

Our Work 2/2
• Feature selection is used to select a limited number of positions where the values of the bases differ consistently from species to species;
• Logic classification methods use the selected positions to find formulas with high semantic value that can provide a deep understanding of the analyzed data.
• Logic classification is computationally expensive, so it is applied after feature selection: many features → feature selection → few features → logic classification → compact logic formulas

Supervised Learning: the standard setting
• Given:
  • A finite number of objects described by a finite number of measures X = (x1, x2, …, xn)
  • An interesting characteristic of these objects, Y (TARGET)
• Determine the relation between TARGET and measures, Y = F(X)
  • It is only required that, for each set of measures X, F produces an output Y
• Very important:
  • The form of the relation is decided up front;
  • The parameters of the relation are determined by the data;
  • The results of the estimation need to be validated.
• Estimation: Y ∈ Q ⊆ R
• Classification: Y ∈ {0, 1, 2, …}

Supervised Learning: training and testing
Workflow: Data → Training Set → Choose model → Estimate model → Testing Set → Performance ok?
• If the data contains good information and the choice of the model is correct, the model has a good fit on the training data
• A good fit on the training data may provide information and knowledge
• The model should also adapt well to test data
• Only a good fit on the test data strengthens the knowledge extracted and validates the forecasting or classification function of the model

The Special case of Logic Data 1/2
• Frequently, objects are (or can be) described only by logic attributes (True/False), i.e. logic variables that each represent a property of the object
• The classification function is expressed in terms of:
  • Logic variables p, q, r, with True/False value
  • conjunction (p ∧ q), disjunction (p ∨ q), negation (¬p)
• Example: “If the object shows both the presence of property 1 and the absence of property 3, then it is class A, else it is class I”

The Special case of Logic Data 2/2
(-Xi denotes the negation of Xi)
True for red, false for blue:
IF [ -X1 & -X2 ] OR [ -X2 & -X3 ] OR [ -X1 & -X3 ]
True for blue, false for red:
IF [ X2 & X6 ] OR [ X1 & X3 ] OR [ X2 & X3 ] OR [ X1 & X2 ]
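To make the DNF notation concrete, the two formulas above can be encoded and evaluated as follows. This is a minimal sketch: the encoding (a list of clauses, each a list of `(index, required_value)` literals) and the names `RED`, `BLUE` and `evaluate_dnf` are illustrative assumptions, not part of Lsquare.

```python
# A DNF formula is a list of clauses; a clause is a list of literals.
# The literal (i, True) stands for Xi, and (i, False) for -Xi (negation).
RED = [
    [(1, False), (2, False)],
    [(2, False), (3, False)],
    [(1, False), (3, False)],
]
BLUE = [
    [(2, True), (6, True)],
    [(1, True), (3, True)],
    [(2, True), (3, True)],
    [(1, True), (2, True)],
]


def evaluate_dnf(formula, x):
    """Return True if at least one clause has all of its literals satisfied.

    x maps a variable index to its True/False value.
    """
    return any(all(x[i] == value for i, value in clause) for clause in formula)


record = {1: False, 2: False, 3: True, 6: False}
print(evaluate_dnf(RED, record), evaluate_dnf(BLUE, record))
```

A record is assigned to a class when the formula for that class evaluates to True; a record satisfying both formulas, or neither, would need a tie-breaking policy that the slide does not specify.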

Mining Logic Data
• The Lsquare Logic Miner [4,5,6]
  • Builds logic separations in Disjunctive Normal Form (DNF)
  • Iteratively identifies the clauses of the DNF that separate the largest part of the objects in one class from all the objects of the other class
  • Clause identification is based on the solution of a Minimum Cost Satisfiability Problem (MINSAT), which is computationally hard

Mining Logic Data
• Two MINSAT problems are solved at each iteration
• Steps at each iteration: select the largest separable subset; identify the clause with the desired support
• Lsquare finds 4 DNF formulas:
  1: A from B, min support
  2: A from B, max support
  3: B from A, min support
  4: B from A, max support
  ALL: majority vote, use undecided
• The behavior of the formulas strongly interacts with the quality of the data…

Feature Selection
• Most methods for FS are based on a greedy construction of the feature set, and do not properly take into account the interactions among the selected features (correlation, collinearity)
• An interesting method adopts Integer Programming (Set Covering) to select the smallest set of features that makes it possible to differentiate logic records belonging to different classes
The Set Covering Approach for FS (Rutgers - LAD approach)
• A binary variable is associated with each feature
• A linear constraint is associated with each pair of objects belonging to different classes
• The set covering model requires that each pair of objects belonging to different classes is differentiated by at least one of the selected features
• a(ij)k = 1 if i,k belong to different classes and i,k are different on feature j
• xj = 1 if feature j is selected
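The set covering idea can be illustrated on a toy instance. The sketch below is an assumption-laden stand-in: it replaces the Integer Programming solver with an exhaustive search over subsets (only viable for a handful of features), and the function name `smallest_separating_features` and the toy data are hypothetical.

```python
from itertools import combinations


def smallest_separating_features(records, labels):
    """Find a minimum-size set of features such that every pair of records
    from different classes differs on at least one selected feature.

    Returns a tuple of 0-based feature indices, or None if no set works.
    """
    n_features = len(records[0])
    # One covering constraint per pair of records in different classes:
    # the set of features on which the two records differ.
    constraints = [
        {j for j in range(n_features) if records[i][j] != records[k][j]}
        for i, k in combinations(range(len(records)), 2)
        if labels[i] != labels[k]
    ]
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            chosen = set(subset)
            if all(chosen & constraint for constraint in constraints):
                return subset
    return None


records = [(1, 0, 1, 0), (1, 1, 1, 0), (0, 0, 0, 1), (0, 1, 0, 1)]
labels = ["A", "A", "B", "B"]
print(smallest_separating_features(records, labels))
```

The number of constraints grows with the product of the class sizes, which is what motivates the simplified model on the next slide.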

Feature Selection: a simplified version
To overcome the intractable size of the quadratic model, we propose a simplified version of the set covering model (Linear SC), where the number of constraints is equal to the number of samples
• fi: feature i
• PA(i) = proportion of elements with fi = 1 in class A
• PB(i) = proportion of elements with fi = 1 in class B
• xi = 1 iff feature i is chosen in the solution
• if PA(i) > PB(i): [definition not recoverable from the transcript]
• Coverage is maximized, given the bound β on the number of features
• Coverage is given
• Rows are linear in the number of examples
• Redundancy can be controlled
• May still need heuristics for a large number of features…
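One plausible reading of "coverage is maximized, given the bound on the number of features" is a max-min model: choose at most a fixed number of features so that the least-covered sample is covered as many times as possible. The sketch below follows that reading only as an assumption; it uses brute force instead of an IP solver, and the names `max_min_coverage`, `a` and `beta` are illustrative.

```python
from itertools import combinations


def max_min_coverage(a, beta):
    """Pick min(beta, n_features) features maximizing the minimum row coverage.

    a[i][j] = 1 if feature j covers sample i (one row per sample).
    Coverage never decreases when a feature is added, so only subsets of
    the largest allowed size need to be examined.
    Returns (chosen feature indices, minimum coverage over the samples).
    """
    n_features = len(a[0])
    size = min(beta, n_features)
    best_subset, best_alpha = None, -1
    for subset in combinations(range(n_features), size):
        alpha = min(sum(row[j] for j in subset) for row in a)
        if alpha > best_alpha:
            best_subset, best_alpha = subset, alpha
    return best_subset, best_alpha


a = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
subset, alpha = max_min_coverage(a, beta=2)
print(subset, alpha)
```

The model has one row per sample rather than one per pair of samples, which matches the slide's remark that rows are linear in the number of examples.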

Application to Barcode Data
• Data Set from the 2006 conference
  • 1623 samples belonging to 150 different species
  • Each sample is described by 690 nucleotides (columns)
• We search for a compact rule for each one of the 150 classes
• For each species k, we solve a 2-class learning problem:
  • class A: all samples in class k
  • class B: samples of all classes different from k
• We use Linear Set Covering for feature selection and logic mining to determine the formulas on a training subset (80-90%) of the available data, and then test their classification capabilities on the remaining data.
• Training and testing samples are drawn at random, maintaining the same proportion in each class
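The sampling scheme and the one-vs-rest relabeling described above can be sketched as follows. The function names, the fixed seed and the rounding rule are assumptions made for illustration; the slides do not say how the draw was implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices into train and test, drawing the test samples
    at random within each class so that class proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def one_vs_rest(labels, k):
    """Relabel samples for the 2-class problem: A = species k, B = all others."""
    return ["A" if label == k else "B" for label in labels]


labels = ["s0"] * 10 + ["s1"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx), one_vs_rest(labels, "s0")[:3])
```

With this rounding rule, very small classes may contribute no test samples, so a real experiment would need a policy for species with only a few specimens.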

Application to Barcode Data
• DATA SET: split into TRAINING SET and TEST SET
• FEATURE SELECTION (training set): the Integer Programming model associated with the Linear Set Covering model is solved to optimality with the commercial solver ILOG CPLEX, with 10, 20 and 30 as values for the bound β on the number of features
• FORMULA EXTRACTION (training set): LSQUARE is used to separate each species from the other 149, and a compact formula explaining each species is obtained
• TEST SET: the formulas are used to predict the species of each element in the test set

Application to Barcode Data
• LSC construction, for each class k and column j:
  • PAjk = proportion of samples in class k with nucleotide = A in column j
  • PCjk = proportion of samples in class k with nucleotide = C in column j
  • PGjk = proportion of samples in class k with nucleotide = G in column j
  • PTjk = proportion of samples in class k with nucleotide = T in column j
  • PNAjk = proportion of samples in classes other than k with nucleotide = A in column j
  • PNCjk = proportion of samples in classes other than k with nucleotide = C in column j
  • PNGjk = proportion of samples in classes other than k with nucleotide = G in column j
  • PNTjk = proportion of samples in classes other than k with nucleotide = T in column j
• aij = 1 iff one of the following holds:
  • Sample i is in class k, the nucleotide of i in column j is A, and PAjk > 2 PNAjk
  • Sample i is in class k, the nucleotide of i in column j is C, and PCjk > 2 PNCjk
  • Sample i is in class k, the nucleotide of i in column j is G, and PGjk > 2 PNGjk
  • Sample i is in class k, the nucleotide of i in column j is T, and PTjk > 2 PNTjk
• Select β columns with the largest coverage
• Use these columns to formulate a separation problem for each class with Lsquare (1 vs all)
• Obtain a logic formula for each class
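The construction of the coefficients aij for one class k can be sketched directly from the rules above. The function name `coverage_matrix`, the `ratio` parameter (set to the factor 2 used on the slide) and the toy sequences are illustrative assumptions.

```python
def coverage_matrix(samples, labels, k, ratio=2.0):
    """Build a[i][j] for class k: a[i][j] = 1 iff sample i is in class k and
    the nucleotide X of sample i in column j satisfies PXjk > ratio * PNXjk,
    where PXjk / PNXjk are the proportions of X in column j inside / outside k.
    """
    inside = [s for s, label in zip(samples, labels) if label == k]
    outside = [s for s, label in zip(samples, labels) if label != k]

    def proportion(group, j, base):
        if not group:
            return 0.0
        return sum(1 for s in group if s[j] == base) / len(group)

    matrix = []
    for sample, label in zip(samples, labels):
        row = []
        for j, base in enumerate(sample):
            covered = (
                label == k
                and base in "ACGT"
                and proportion(inside, j, base) > ratio * proportion(outside, j, base)
            )
            row.append(1 if covered else 0)
        matrix.append(row)
    return matrix


samples = ["ACG", "ACT", "TCG", "TCT"]
labels = [1, 1, 2, 2]
print(coverage_matrix(samples, labels, k=1))
```

In this toy case only the first column is informative for class 1: the base A appears in every sample of the class and in none of the others.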

An example
• With β = 1:
  • x3 = 1, x1=x2=x4=0: non separable
  • x4 = 1, x1=x2=x3=0: separable
• With β = 2:
  • x3 = x4 = 1, x1=x2=0: separable
• Resulting rules:
  • IF (X4=T) THEN CLASS 1
  • IF (X4=A) THEN CLASS 2
  • IF (X4=C) THEN CLASS 3

Results
• For each row we solve:
  • 1 set covering problem
  • 150 logic classification problems
• β ∈ {10, 20, 30}
• Test % ∈ {10, 20}
• 3 random repetitions for each setting

Results
Site 580 appears 54 times in the formulas that discriminate each class from the rest (approx. 1/3)

Results
SPECIES  DIM  CLAUSE(S)
0        1    * 274 A 499 T 580 C
1        1    * 19 T 172 T
2        1    * 340 G 343 A 445 C
3        1    * 445 T 499 T 580 G
4        1    * 172 C 445 T 493 G 499 C 580 A
5        1    * 58 G 430 T
6        1    * 268 C 289 A 334 C 430 T
7        1    * 136 A 277 C 445 T
8        2    * 58 A 121 T 172 C * 277 T 499 G
9        1    * 163 T 274 A 334 G
10       1    * 19 T 331 C 652 T
11       1    * 4 C 10 T 274 C 289 A 445 A
12       1    * 340 G 445 G
13       1    * 277 A 340 A 430 T 445 C
14       1    * 121 C 331 T
15       1    * 58 A 283 C 331 T

Results
• IF BASE IN POSITION 19 IS T AND BASE IN POSITION 172 IS T THEN SPECIES IS … 1
• IF BASE IN POSITION 340 IS G AND BASE IN POSITION 343 IS A AND BASE IN POSITION 445 IS C THEN SPECIES IS … 2
• IF (BASE IN POSITION 58 IS A AND BASE IN POSITION 121 IS T AND BASE IN POSITION 172 IS C) OR (BASE IN POSITION 277 IS T AND BASE IN POSITION 499 IS G) THEN SPECIES IS … 8
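Applying such rules to a new barcode amounts to a few position lookups. The sketch below encodes the three rules above; it assumes that positions are 1-based (the slides do not say), and the names `RULES` and `matching_species` as well as the synthetic test sequence are hypothetical.

```python
# Each species maps to a DNF formula: a list of clauses, where a clause is a
# dict {position: base} whose conditions must all hold (positions are
# assumed to be 1-based).
RULES = {
    1: [{19: "T", 172: "T"}],
    2: [{340: "G", 343: "A", 445: "C"}],
    8: [{58: "A", 121: "T", 172: "C"}, {277: "T", 499: "G"}],
}


def matching_species(barcode, rules=RULES):
    """Return the species whose formula is satisfied by the barcode."""

    def clause_holds(clause):
        return all(
            pos <= len(barcode) and barcode[pos - 1] == base
            for pos, base in clause.items()
        )

    return [
        species
        for species, clauses in rules.items()
        if any(clause_holds(clause) for clause in clauses)
    ]


# Synthetic 690-site barcode: unknown bases everywhere except two sites.
sequence = ["N"] * 690
sequence[277 - 1] = "T"
sequence[499 - 1] = "G"
print(matching_species("".join(sequence)))
```

Only a handful of the 690 sites is inspected for each species, which is the practical benefit of compact formulas.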

Related work
• Haplotype Inference by Parsimony: a new, very efficient heuristic
• Tag SNP and SNP reconstruction problem
• Phylogenetic trees in polyploid organisms

Conclusions and Future work
• The logic technique is very powerful for identifying small non-contiguous subsequences of the barcode
• We are testing the technique on a very large data set from Lepidoptera
• We will compare our technique with the NJ method [2]
• CBOL and DAWG have asked us to implement our technique as a web service
• We are designing a software platform that implements algorithms for all the above problems

References
• [1] M. Hajibabaei, G. A. C. Singer, E. L. Clare, P. D. N. Hebert, Design and applicability of DNA arrays and DNA barcodes in biodiversity monitoring, BMC Biology, 2007
• [2] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees, Mol Biol Evol, 1987
• [3] P. Bertolazzi, G. Felici, P. Festa, G. Lancia, Logic Classification and Feature Selection for Biomedical Data, Computers & Mathematics with Applications, on-line version (2007)
• [4] G. Felici, K. Truemper, A Minsat Approach for Learning in Logic Domains, INFORMS JOC, 2002
• [5] K. Truemper, Design of Logic-Based Intelligent Systems, Wiley-Interscience, 2004
• [6] G. Felici, K. Truemper, The Lsquare System for Mining Logic Data, Encyclopedia of Data Warehousing and Mining, 2005
• [7] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, poster, EMBO Conference, Rome, May 2007
• [8] P. Bertolazzi, G. Felici, Species Classification with Optimized Logic Formulas, invited talk, Second BOL Conference, Taipei, Sept. 2007
