Studying the Protein Folding Problem by Means of a New Data Mining Approach
by Huy N.A. Pham and
Triantaphyllou Evangelos
Department of Computer Science, Louisiana State University
298 Coates Hall, Baton Rouge, LA 70803
Email: [email protected] and [email protected]
ICDM 2005 Workshop on Temporal Data Mining: Algorithms, Theory and Applications
November 27-30, 2005, Houston, TX
This research was done under the LBRN program (www.lbrn.lsu.edu)
=> Classification problem.
=> Develop an algorithm that balances overfitting and overgeneralization.
A Mining Approach
Some basic concepts
Example: B, A1, and A2 are homogenous clauses, while A is a nonhomogenous clause. A can be partitioned into two smaller homogenous clauses, A1 and A2. The example is a 2D representation; higher-dimensional cases can be treated similarly.
=> Unclassified examples covered by clause B can more accurately be assumed to belong to the same class than those in the original clause A.
A is superimposed on a hypergrid and the density of each cell is computed
=> standard deviation = 0
B is superimposed on a hypergrid and the density of each cell is computed
=> standard deviation > 0
Determine the homogenous values of clauses A and B.
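The hypergrid check above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the clause region is a hypercube with bounds `low` and `high`, takes the density of a cell to be its raw example count, and treats a clause as homogenous when the standard deviation of the cell counts is zero (or below a hypothetical tolerance `tol`). The function names are illustrative only.

```python
from collections import Counter
from statistics import pstdev


def cell_densities(points, low, high, cells_per_axis):
    """Count the examples that fall in each cell of a regular hypergrid.

    points: list of equal-length coordinate tuples inside [low, high]
    low, high: bounds of the region along every axis (assumed hypercube)
    """
    dims = len(points[0])
    width = (high - low) / cells_per_axis
    counts = Counter()
    for p in points:
        # Clamp so that points on the upper boundary land in the last cell.
        idx = tuple(min(int((x - low) / width), cells_per_axis - 1) for x in p)
        counts[idx] += 1
    # Empty cells are included: they affect the spread of the densities.
    total_cells = cells_per_axis ** dims
    return list(counts.values()) + [0] * (total_cells - len(counts))


def is_homogenous(points, low, high, cells_per_axis, tol=0.0):
    """Assumption: homogenous means the cell counts have (near) zero spread."""
    return pstdev(cell_densities(points, low, high, cells_per_axis)) <= tol


evenly_spread = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5)]
clustered = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.3), (1.5, 1.5)]
print(is_homogenous(evenly_spread, 0.0, 2.0, 2))
print(is_homogenous(clustered, 0.0, 2.0, 2))
```

With a 2 x 2 grid, the evenly spread points put one example in every cell, while the clustered points leave two cells empty.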
Some basic concepts - Cont'd
The density formula uses n = #(examples in the cell), D = #(dimensions), and a kernel function.
[Figure: clauses A and B, each shown with its R_Unit]
The density of homogenous clause A > the density of homogenous clause B
Some basic concepts - Cont'd
F’s radius = C’s radius + (G’s radius – C’s radius)/(2 *D)



[Figure: expansion of C toward G, shown for D = 6]


Stopping conditions for expansion:
F’s radius ≤ D * C’s radius
#(Noisy points) ≤ (D * n) / 100
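The radius update and the two stopping conditions can be written down directly. The sketch below is only an illustration under two assumptions: D is read as the density of the clause being expanded, and expansion is allowed to continue only while both inequalities hold. The function and parameter names are hypothetical.

```python
def expanded_radius(c_radius, g_radius, density):
    """F's radius = C's radius + (G's radius - C's radius) / (2 * D)."""
    return c_radius + (g_radius - c_radius) / (2 * density)


def may_keep_expanding(f_radius, c_radius, density, n_examples, noisy_points):
    """Assumption: expansion continues only while both conditions hold."""
    within_radius = f_radius <= density * c_radius
    within_noise = noisy_points <= (density * n_examples) / 100
    return within_radius and within_noise


# Example values: C's radius = 2, G's radius = 8, D = 6, n = 50 examples.
f_radius = expanded_radius(2.0, 8.0, 6)
print(f_radius)
print(may_keep_expanding(f_radius, 2.0, 6, 50, noisy_points=3))
print(may_keep_expanding(f_radius, 2.0, 6, 50, noisy_points=4))
```

A larger D gives a smaller step toward G but a more generous radius bound and noise allowance.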
F = expanded area, C = original area, and G = enveloping area.
BEA
Input: positive and negative examples
Output: a suitable classification
Step 1: Find positive and negative clauses using the k-means clustering-based approach with the Euclidean distance.
Step 2: Find positive and negative homogenous clauses from positive and negative clauses respectively.
Step 3: Sort positive and negative homogenous clauses on densities.
Step 4: FOR each homogenous clause C DO
  IF (its density > a threshold = (max - min)/2 of densities) THEN
    Expand C using its density D.
    Accept (D*n)/100 noisy examples, where n = #(its examples).
  ELSE
    Reduce C into smaller homogenous clauses by considering each cell of its hypergrid as a new homogenous clause.
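The decision made in Steps 3 and 4 can be sketched as below. This covers only the expand-or-reduce choice and the noise allowance, not the expansion or the reduction itself. The `Clause` record, the function name, and the descending sort order are assumptions made for illustration; the slides do not state the sort direction.

```python
from dataclasses import dataclass


@dataclass
class Clause:
    name: str
    density: float
    n_examples: int


def plan_step4(clauses):
    """Decide, per homogenous clause, whether to expand or reduce it."""
    densities = [c.density for c in clauses]
    # Threshold as given on the slide: (max - min) / 2 of the densities.
    threshold = (max(densities) - min(densities)) / 2
    plan = []
    # Step 3: sort on density (descending order is an assumption here).
    for c in sorted(clauses, key=lambda c: c.density, reverse=True):
        if c.density > threshold:
            # Accept (D * n) / 100 noisy examples during expansion.
            allowed_noise = (c.density * c.n_examples) / 100
            plan.append((c.name, "expand", allowed_noise))
        else:
            plan.append((c.name, "reduce", 0.0))
    return plan


clauses = [Clause("B", 2, 40), Clause("A", 6, 50), Clause("C", 1, 10)]
for step in plan_step4(clauses):
    print(step)
```

Here the threshold is (6 - 1)/2 = 2.5, so only clause A is expanded.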
[Figure: positive clauses, homogenous clauses, the expansion step, and extended homogenous clauses (Extended HC)]
BEA - Cont'd
or e is refined to e’.
q_i = c_i/n_i, where n_i = #(examples in the i-th class)
and c_i = #(true examples in the i-th class).
w_i = n_i/N, where N = total number of examples of a given class.
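The per-class accuracy and weight can be computed as in the sketch below. Two assumptions are made that the surviving slide text does not confirm: N is taken to be the total number of test examples, and the overall accuracy is taken to be the weighted sum of the per-class accuracies. The function names are hypothetical.

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {class: (q_i, w_i)} with q_i = c_i / n_i and w_i = n_i / N."""
    n = {}  # n_i: number of examples whose true class is i
    c = {}  # c_i: number of those examples that were predicted as class i
    for truth, predicted in zip(true_labels, predicted_labels):
        n[truth] = n.get(truth, 0) + 1
        if truth == predicted:
            c[truth] = c.get(truth, 0) + 1
    total = len(true_labels)  # assumption: N = all test examples
    return {k: (c.get(k, 0) / n[k], n[k] / total) for k in n}


def weighted_accuracy(scores):
    """Assumption: overall accuracy is the sum of w_i * q_i over classes."""
    return sum(q * w for q, w in scores.values())


truth = ["a", "a", "a", "b"]
predicted = ["a", "a", "b", "b"]
scores = per_class_scores(truth, predicted)
print(scores)
print(weighted_accuracy(scores))
```

Under these assumptions the weighted sum reduces to the fraction of all test examples that were classified correctly.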
BEA
The BEA provides a 15.5% improvement in classification accuracy over C.J. Lin’s SVMs.
Q1: The average accuracy of the SVMs with the independent test method in [Ding and Dubchak, 2001, Table 6, p. 11].
Q2: The average accuracy of the Neural Networks with the independent test method in [Ding and Dubchak, 2001, Table 6, p. 11].
Q3: The average accuracy of the SVMs-AAC method in [Zerrin, 2004].
Q4: The average accuracy of the SVMs-trioAAC method in [Zerrin, 2004].
http://people.sabanciuniv.edu/~berrin/methods/foldclassificationiscis04.pdf
http://www.brc.dcs.gla.ac.uk/~actan/methods/actanGIW03.pdf
http://www.kernelmachines.org/methods/upload_4192_bioinfo.ps
Thank you!
Any questions?