
Data Mining 2



Presentation Transcript


  1. Data Mining 2 Data Mining is one aspect of Database Query Processing (on the "what if" or pattern-and-trend end of Query Processing, rather than the "please find" or straightforward end). To say it another way, data mining queries are on the ad hoc or unstructured end of the query spectrum rather than the standard report generation or "retrieve all records matching a criterion" SQL side. Still, Data Mining queries ARE queries and are processed (or will eventually be processed) by a Database Management System the same way queries are processed today, namely: 1. SCAN and PARSE (SCANNER-PARSER): a Scanner identifies the tokens or language elements of the DM query; the Parser checks for syntax or grammar validity. 2. VALIDATE: the Validator checks for valid names and semantic correctness. 3. CONVERT: the Converter converts the query to an internal representation. 4. QUERY OPTIMIZE: the Optimizer devises a strategy for executing the DM query (chooses among alternative internal representations). 5. CODE GENERATION: generate code to implement each operator in the selected DM query plan (the optimizer-selected internal representation). 6. RUNTIME DATABASE PROCESSING: run the plan code. Developing new, efficient and effective Data Mining Query (DMQ) processors is the central need and issue in DBMS research today (far and away!). These notes concentrate on step 5, i.e., generating code (algorithms) to implement operators (at a high level), namely operators that do: Association Rule Mining (ARM), Clustering (CLU), Classification (CLA).

  2. Database analysis can be broken down into 2 areas, Querying and Data Mining. Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas, Clustering and Classification. Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based. Classification can be broken down into 2 types, Model-based and Neighbor-based. Machine Learning is almost always based on Near Neighbor Set(s), NNS. Clustering, even density-based, identifies near neighbor cores 1st (round NNSs, ε about a center). Classification is continuity based and Near Neighbor Sets (NNS) are the central concept in continuity: ∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector; or, equivalently, ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical, then ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a). Caution: for classification, boundary analysis may also be needed to see the class (done by projecting?). Finding an NNS in a lower dimension may still be the 1st step. E.g., points 1,2,3,4,5,6,7,8 are all within ε of a (the unclassified sample); 1,2,3,4 are red-class, 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace, then taking ε/2, we see that the ε/2 neighborhood about a contains only blue-class (5,6) votes. Using horizontal data, NNS derivation requires ≥1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2 n) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.)

  3. Association Rule Mining (ARM). Assume a relationship between two entities, T (e.g., a set of Transactions an enterprise performs) and I (e.g., a set of Items which are acted upon by those transactions). In Market Basket Research (MBR), a transaction is a checkout transaction and an item is an item in that customer's market basket going through checkout. An I-Association Rule, A⇒C, relates 2 disjoint subsets of I (itemsets) and has 2 main measures, support and confidence (A is called the antecedent, C is called the consequent). The support of an I-set, A, is the fraction of T-instances related to every I-instance in A, e.g., if A={i1,i2} and C={i4} then supp(A) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. Note: | | means set size or count of elements in the set; i.e., t2 and t4 are the only transactions from the total transaction set, T={t1,t2,t3,t4,t5}, that are related to both i1 and i2 (buy i1 and i2 during the pertinent T-period of time). The support of the rule, A⇒C, is defined as supp(A∪C) = |{t2,t4}| / |{t1,t2,t3,t4,t5}| = 2/5. The confidence of the rule, A⇒C, is supp(A∪C)/supp(A) = (2/5)/(2/5) = 1. DM queriers typically want STRONG RULES: supp ≥ minsupp, conf ≥ minconf (minsupp and minconf are threshold levels). Note that conf(A⇒C) is also just the conditional probability of t being related to C, given that t is related to A. There are also the dual concepts of T-association rules (just reverse the roles of T and I above). Examples of Association Rules include: in MBR, the relationship between customer cash-register transactions, T, and purchasable items, I (t is related to i iff i is being bought by that customer during that cash-register transaction); in Software Engineering (SE), the relationship between Aspects, T, and Code Modules, I (t is related to i iff module i is part of the aspect t); in Bioinformatics, the relationship between experiments, T, and genes, I (t is related to i iff gene i expresses at a threshold level during experiment t); in ER diagramming, any "part of" relationship in which i∈I is part of t∈T (t is related to i iff i is part of t) and any "ISA" relationship in which i∈I ISA t∈T (t is related to i iff i IS A t) . . .
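A minimal Python sketch of the two measures just defined. Only the facts stated above (t2 and t4 are the transactions related to i1, i2 and i4) are taken from the slide; the remaining toy data is an assumption of the sketch.

    # Assumed toy transaction/item relationship (consistent with the slide's example).
    transactions = {
        "t1": {"i1"},
        "t2": {"i1", "i2", "i4"},
        "t3": {"i3"},
        "t4": {"i1", "i2", "i4"},
        "t5": {"i3", "i4"},
    }

    def supp(itemset):
        """Fraction of transactions related to every item in the itemset."""
        hits = [t for t, items in transactions.items() if itemset <= items]
        return len(hits) / len(transactions)

    def conf(antecedent, consequent):
        """Conditional probability of the consequent given the antecedent."""
        return supp(antecedent | consequent) / supp(antecedent)

    A, C = {"i1", "i2"}, {"i4"}
    print(supp(A))        # 0.4  (= 2/5)
    print(supp(A | C))    # 0.4  (support of the rule A => C)
    print(conf(A, C))     # 1.0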

  4. Finding Strong Assoc Rules. The relationship between Transactions and Items can be expressed in a Transaction Table, where each transaction is a row containing its ID and the list of the items related to that transaction; alternatively, the items can be expressed using "Item bit vectors". Suppose minsupp is set by the querier at .5 and minconf at .75. To find frequent or Large itemsets (support ≥ minsupp), start by finding the large 1-ItemSets (the slide tabulates the 1-itemset supports, with Large meaning supp ≥ 2). APRIORI METHOD: iteratively find the large k-itemsets, k=1,...; then find all association rules supported by each large itemset. Ck denotes the candidate k-itemsets generated at each step and Lk denotes the large k-itemsets. FACT: any subset of a large itemset is large. Why? (e.g., if {A,B} is large, {A} and {B} must be large); this justifies the pruning step. PseudoCode (assume the items in Lk-1 are ordered): Step 1, self-joining Lk-1: insert into Ck select p.item1, p.item2, ..., p.itemk-1, q.itemk-1 from Lk-1 p, Lk-1 q where p.item1=q.item1, ..., p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1. Step 2, pruning: forall itemsets c in Ck do forall (k-1)-subsets s of c do if (s is not in Lk-1) delete c from Ck.
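A minimal Python sketch of this candidate-generation step (self-join then prune). Representing each itemset as a sorted tuple of items is an assumption of the sketch; the demo input is the L2 from the walk-through a few slides below.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Generate candidate k-itemsets Ck from the large (k-1)-itemsets L_prev."""
        L_prev = set(L_prev)
        # Step 1: self-join -- combine itemsets that agree on their first k-2 items.
        candidates = set()
        for p in L_prev:
            for q in L_prev:
                if p[:k-2] == q[:k-2] and p[k-2] < q[k-2]:
                    candidates.add(p[:k-2] + (p[k-2], q[k-2]))
        # Step 2: prune -- every (k-1)-subset of a candidate must itself be large.
        return {c for c in candidates
                if all(s in L_prev for s in combinations(c, k - 1))}

    L2 = [("1", "3"), ("2", "3"), ("2", "5"), ("3", "5")]
    print(apriori_gen(L2, 3))   # {('2', '3', '5')} -- matches the walk-through's L3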

  5. Ptree Review: A data table, R(A1..An), contains horizontal structures (records) but is processed vertically (vertical scans). Vertical basic binary Predicate-tree (P-tree): vertically partition the table; compress each vertical bit slice into a basic binary P-tree. E.g., the table R(A1 A2 A3 A4), shown in binary as the rows 010 111 110 001, 011 111 110 000, 010 110 101 001, 010 111 101 111, 101 010 001 100, 010 010 001 101, 111 000 001 100, 111 000 001 100, is scanned vertically into the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 (Rij = jth bit slice of Ai). The basic binary P-tree, P11, for R11 = 0 0 0 0 1 0 1 1 is built top-down by recording the truth of the predicate "pure1" recursively on halves, until purity: 1. the whole file is not pure1, so the root is 0; 2. the 1st half (0 0 0 0) is not pure1, so that node is 0, but it is pure (pure0), so that branch ends; 3. the 2nd half (1 0 1 1) is not pure1, so that node is 0; 4. the 1st half of the 2nd half (1 0) is not pure1; 5. the 2nd half of the 2nd half (1 1) is pure1; 6. the 1st half of the 1st half of the 2nd half (1) is pure1; 7. the 2nd half of the 1st half of the 2nd half (0) is not. Once the basic P-trees P11..P43 are built, queries are processed using multi-operand logical ANDs. E.g., to count the number of occurrences of 111 000 001 100, compute P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43; the only 1-bit appears at the 21-level, so the 1-count = 1*2^1 = 2.
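A minimal Python sketch (not the authors' implementation) of the top-down construction just described: record the truth of the pure1 predicate, recursing on halves until purity.

    def build_ptree(bits):
        """Return a nested P-tree node as (is_pure1_bit, children)."""
        if all(b == 1 for b in bits):
            return (1, [])                      # pure1 leaf
        if all(b == 0 for b in bits):
            return (0, [])                      # pure0 leaf -- branch ends
        mid = len(bits) // 2
        return (0, [build_ptree(bits[:mid]), build_ptree(bits[mid:])])

    R11 = [0, 0, 0, 0, 1, 0, 1, 1]              # bit slice from the slide's example
    print(build_ptree(R11))
    # (0, [(0, []), (0, [(0, [(1, []), (0, [])]), (1, [])])])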

  6. Top-down construction of basic binary P-trees is good for understanding, but bottom-up is more efficient. Bottom-up construction of P11 (for the bit slice R11 = 0 0 0 0 1 0 1 1) is done using an in-order tree traversal and the collapsing of pure siblings, as follows: the leaves are filled in slice order and any pair of pure siblings is collapsed into its parent, yielding the same tree as the top-down construction.

  7. Processing Efficiencies? (prefixed leaf sizes have been removed) In decimal, R(A1 A2 A3 A4) = (2,7,6,1), (6,7,6,0), (2,7,5,1), (2,7,5,7), (5,2,1,4), (2,2,1,5), (7,0,1,4), (7,0,1,4), with the basic P-trees P11..P43 built from the corresponding bit slices R11..R43. To count occurrences of (7,0,1,4), use the pure pattern 111 000 001 100: compute P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43. The 21-level has the only 1-bit, so the 1-count = 1*2^1 = 2.
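A minimal Python sketch of the AND-based count above, with uncompressed bit vectors standing in for the P-trees (the complemented slices are used wherever the pattern bit is 0).

    # Table from the slide, in decimal; each attribute is a 3-bit value.
    R = [(2,7,6,1), (6,7,6,0), (2,7,5,1), (2,7,5,7),
         (5,2,1,4), (2,2,1,5), (7,0,1,4), (7,0,1,4)]

    def bit_slice(col, j):
        """j-th bit (j=2 is the high-order bit of a 3-bit value) of column col."""
        return [(row[col] >> j) & 1 for row in R]

    def count_pattern(pattern):
        """Count rows equal to pattern by ANDing (complemented) bit slices."""
        acc = [1] * len(R)
        for col, val in enumerate(pattern):
            for j in (2, 1, 0):
                s = bit_slice(col, j)
                want = (val >> j) & 1
                acc = [a & (b if want else 1 - b) for a, b in zip(acc, s)]
        return sum(acc)

    print(count_pattern((7, 0, 1, 4)))   # 2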

  8. Example ARM using uncompressed P-trees (note: the 1-count is placed at the root of each P-tree). Database D has 4 transactions. Build the P-trees and scan D: P1 = 1010 (count 2), P2 = 0111 (3), P3 = 1110 (3), P4 = 1000 (1), P5 = 0111 (3), giving L1 = {1,2,3,5}. Scan D again and AND pairwise: P1^P2 = 0010 (1), P1^P3 = 1010 (2), P1^P5 = 0010 (1), P2^P3 = 0110 (2), P2^P5 = 0111 (3), P3^P5 = 0110 (2), giving L2 = {13, 23, 25, 35}. For C3, {123} is pruned since {12} is not large and {135} is pruned since {15} is not large; P1^P3^P5 = 0010 (1), P2^P3^P5 = 0110 (2), giving L3 = {235}.

  9. 1-ItemSets don't support Association Rules (they would have no antecedent or no consequent). 2-ItemSets do support ARs. Are there any Strong Rules supported by Large 2-ItemSets (at minconf = .75)? {1,3}: conf({1}⇒{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG!; conf({3}⇒{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75. {2,3}: conf({2}⇒{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75; conf({3}⇒{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75. {2,5}: conf({2}⇒{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG!; conf({5}⇒{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG! {3,5}: conf({3}⇒{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75; conf({5}⇒{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75. Are there any Strong Rules supported by Large 3-ItemSets? {2,3,5}: conf({2,3}⇒{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG!; conf({2,5}⇒{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75. No subset antecedent can yield a strong rule either (i.e., no need to check conf({2}⇒{3,5}) or conf({5}⇒{2,3}), since both denominators will be at least as large and therefore both confidences will be at least as low). conf({3,5}⇒{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75. No need to check conf({3}⇒{2,5}) or conf({5}⇒{2,3}). DONE!
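A minimal Python sketch of this rule-generation step: given support counts, enumerate the candidate rules an itemset supports and keep the strong ones. The support table below is an assumption that only covers the itemsets used in the demo calls, with counts read off the walk-through's L1/L2.

    from itertools import chain, combinations

    supp = {frozenset("1"): 2, frozenset("2"): 3, frozenset("3"): 3, frozenset("5"): 3,
            frozenset("13"): 2, frozenset("23"): 2, frozenset("25"): 3}

    def strong_rules(itemset, minconf):
        """Yield (antecedent, consequent, confidence) for every strong rule A => I-A."""
        items = frozenset(itemset)
        proper_subsets = chain.from_iterable(combinations(items, r)
                                             for r in range(1, len(items)))
        for a in map(frozenset, proper_subsets):
            c = supp[items] / supp[a]
            if c >= minconf:
                yield (set(a), set(items - a), c)

    print(list(strong_rules("13", minconf=0.75)))   # only {1} => {3} (conf 1.0)
    print(list(strong_rules("25", minconf=0.75)))   # {2} => {5} and {5} => {2} (conf 1.0)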

  10. Ptree-ARM versus Apriori on aerial photo (RGB) data together with yield data. P-ARM is compared to Horizontal Apriori (classical) and to FP-growth (an improvement of it). • In P-ARM, we find all frequent itemsets, not just those containing Yield (for fairness). • Aerial TIFF images (R,G,B) with synchronized yield (Y). • Charts: scalability with number of transactions and scalability with support threshold. • Identical results. • P-ARM is more scalable for lower support thresholds. • The P-ARM algorithm is more scalable to large spatial datasets. • 1320 x 1320 pixel TIFF-Yield dataset (total number of transactions is ~1,700,000).

  11. P-ARM versus FP-growth (see the literature for its definition). 17,424,000 pixels (transactions). Charts: scalability with support threshold and scalability with number of transactions. • FP-growth = an efficient, tree-based frequent pattern mining method (details later). • For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance. • P-ARM achieves better performance in the case of low support thresholds.

  12. Other methods (other than FP-growth) to improve Apriori's efficiency (see the literature or the html notes 10datamining.html in Other Materials for more detail): • Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent. • Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans. • Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB. • Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine the completeness. • Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent. • The core of the Apriori algorithm: use only large (k-1)-itemsets to generate candidate large k-itemsets; use database scans and pattern matching to collect counts for the candidate itemsets. • The bottleneck of Apriori: candidate generation. 1. Huge candidate sets: 10^4 large 1-itemsets may generate 10^7 candidate 2-itemsets; to discover a large pattern of size 100, e.g., {a1..a100}, we need to generate 2^100 ≈ 10^30 candidates. 2. Multiple scans of the database (it needs n+1 scans, where n = the length of the longest pattern).

  13. Classification. 1. Build a MODEL of the feature_tuple→class relationship. (TDS layout: tuple#, feature1, f2, f3, ..., fn, class; tuple1: value1,1, val2,1, val3,1, ..., valn,1, class1; ...; tuplem: value1,m, val2,m, val3,m, ..., valn,m, classm. Diagram: unclassified samples go into the INPUT hopper; the predicted class of each unclassified sample comes out on the OUTPUT conveyor.) Using a Training Data Set (TDS) in which each feature tuple is already classified (has a class value attached to it in the class column, called its class label): 1. Build a model of the TDS (the TRAINING PHASE). 2. Use that model to classify unclassified feature tuples (unclassified samples). E.g., TDS = last year's aerial image of a crop field (the feature columns are the R, G, B columns together with last year's crop yields attached in a class column, e.g., class values = {Hi, Med, Lo} yield); the unclassified samples are the RGB tuples from this year's aerial image. 3. Predict the class of each unclassified tuple (in the e.g., predict the yield for each point in the field). 3 steps: build a Model of the TDS feature-to-class relationship, Test that model, Use the model (to predict the most likely class of each unclassified sample). Note: other names for this process: regression analysis, case-based reasoning, ... Other typical applications: • Targeted product marketing (the so-called classical Business Intelligence problem). • Medical diagnosis (the so-called Computer Aided Diagnosis or CAD). • Nearest Neighbor Classifiers (NNCs) use a portion of the TDS as the model (neighboring tuples vote); finding the neighbor set is much faster than building other models, but it must be done anew for each unclassified sample. (NNC is called a lazy classifier because it gets lazy and doesn't take the time to build a concise model of the relationship between feature tuples and class labels ahead of time.) • Eager Classifiers (~all other classifiers) build 1 concise model once and for all, then use it for all unclassified samples. The model building can be very costly, but that cost can be amortized over the classification of a large number of unclassified samples (e.g., all RGB points in a field).

  14. Eager Classifiers. TRAINING PHASE: the Classification Algorithm creates the Classifier (Model) from the Training Data; e.g., a Model as a rule set: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'. CLASSIFICATION or USE PHASE: unclassified samples go into the INPUT hopper and predictions come out on the OUTPUT conveyor, e.g., (herb, professor, 3) → yes; (clyde, professor, 8) → yes; (tim, assoc. professor, 4) → no.

  15. Test Process (2): Usually some of the Training Tuples are set aside as a Test Set and, after a model is constructed, the Test Tuples are run through the Model. The Model is acceptable if, e.g., the % correct > 60%; if not, the Model is rejected (never used). Testing Data (NAME, RANK, YEARS, TENURED): Tom, Assistant Prof, 2, no; Merlisa, Associate Prof, 7, no; George, Associate Prof, 5, yes; Joseph, Assistant Prof, 7, no. Result: Correct = 3, Incorrect = 1, i.e., 75% correct classifications. Since 75% is above the acceptability threshold, accept the model!

  16. Classification by Decision Tree Induction • Decision tree (instead of a simple case statement of rules, the rules are prioritized into a tree) • Each internal node denotes a test or rule on an attribute (the test attribute for that node) • Each branch represents an outcome of the test (a value of the test attribute) • Leaf nodes represent class label decisions (the plurality class of the leaf is the predicted class) • Decision tree model development consists of two phases • Tree construction • At the start, all the training examples are at the root • Partition the examples recursively based on selected attributes • Tree pruning • Identify and remove branches that reflect noise or outliers • Decision tree use: classify unclassified samples by filtering them down the decision tree to their proper leaf, then predict the plurality class of that leaf (often only one class, depending upon the stopping condition of the construction phase)

  17. Algorithm for Decision Tree Induction • Basic ID3 algorithm (a simple greedy top-down algorithm) • At start, the current node is the root and all the training tuples are at the root • Repeat, down each branch, until the stopping condition is true • At current node, choose a decision attribute (e.g., one with largest information gain). • Each value for that decision attribute is associated with a link to the next level down and that value is used as the selection criterion of that link. • Each new level produces a partition of the parent training subset based on the selection value assigned to its link. • stopping conditions: • When all samples for a given node belong to the same class • When there are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • When there are no samples left
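A minimal Python sketch of the decision-attribute choice in the basic ID3 step above ("choose the attribute with largest information gain"). The toy rows are made-up assumptions echoing the rank/years example used earlier.

    from collections import Counter
    from math import log2

    def entropy(rows):
        counts = Counter(r["class"] for r in rows)
        total = len(rows)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def info_gain(rows, attr):
        """Entropy reduction obtained by partitioning rows on attr."""
        total = len(rows)
        parts = Counter(r[attr] for r in rows)
        remainder = sum((n / total) * entropy([r for r in rows if r[attr] == v])
                        for v, n in parts.items())
        return entropy(rows) - remainder

    def best_attribute(rows, attrs):
        return max(attrs, key=lambda a: info_gain(rows, a))

    rows = [{"rank": "professor", "years": "<=6", "class": "yes"},
            {"rank": "professor", "years": ">6",  "class": "yes"},
            {"rank": "assistant", "years": ">6",  "class": "no"},
            {"rank": "assistant", "years": "<=6", "class": "no"}]
    print(best_attribute(rows, ["rank", "years"]))   # 'rank' (it separates the classes perfectly)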

  18. Bayesian Classification (eager: the Model is based on conditional probabilities; prediction is done by taking the most conditionally probable class). A Bayesian classifier is a statistical classifier based on the following theorem, known as Bayes' theorem: Let X be a data sample whose class label is unknown and let H be the hypothesis that X belongs to a particular class. P(H|X) is the conditional probability of H given X and P(H) is the prior probability of H; then P(H|X) = P(X|H)P(H)/P(X).

  19. Naïve Bayesian Classification. Given a training set, R(f1..fn, C), where C={C1..Cm} is the class label attribute, a Naive Bayesian Classifier will predict the class of an unknown data sample, X=(x1..xn), to be the class Cj having the highest conditional probability conditioned on X. That is, it will predict the class to be Cj iff P(Cj|X) ≥ P(Ci|X) for all i ≠ j (a tie-handling algorithm may be required). • From Bayes' theorem, P(Cj|X) = P(X|Cj)P(Cj)/P(X). • P(X) is constant for all classes, so we need only maximize P(X|Cj)P(Cj). • The P(Cj)s are known. • To reduce the computational complexity of calculating all the P(X|Cj)s, the naive assumption is class conditional independence: P(X|Cj) is the product of the P(xi|Cj)s.
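A minimal Python sketch of the Naive Bayesian prediction described above for categorical features. The Laplace smoothing and the toy training rows are assumptions added for the sketch, not part of the slide.

    from collections import Counter, defaultdict

    def train(rows):
        priors = Counter(r[-1] for r in rows)        # class counts
        cond = defaultdict(Counter)                  # (feature index, class) -> value counts
        vocab = defaultdict(set)                     # feature index -> distinct values
        for r in rows:
            c = r[-1]
            for i, v in enumerate(r[:-1]):
                cond[(i, c)][v] += 1
                vocab[i].add(v)
        return priors, cond, vocab, len(rows)

    def predict(x, priors, cond, vocab, n):
        def score(c):
            p = priors[c] / n                        # P(Cj)
            for i, v in enumerate(x):                # naive assumption: product of P(xi|Cj)
                p *= (cond[(i, c)][v] + 1) / (priors[c] + len(vocab[i]))
            return p
        return max(priors, key=score)                # argmax_j P(X|Cj)P(Cj)

    rows = [("professor", ">6", "yes"), ("assistant", "<=6", "no"),
            ("professor", "<=6", "yes"), ("assistant", ">6", "yes")]
    model = train(rows)
    print(predict(("professor", ">6"), *model))      # 'yes'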

  20. Neural Network Classification • A Neural Network is trained to make the prediction • Advantages • prediction accuracy is generally high • it is generally robust (works when training examples contain errors) • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes • it provides fast classification of unclassified samples • Criticism • it is difficult to understand the learned function (it involves complex and almost magic weight adjustments) • it is difficult to incorporate domain knowledge • long training time (for large training sets, it is prohibitive!)

  21. A Neuron. (Figure: input vector x = (x0..xn) with weight vector w = (w0..wn), a weighted sum with bias μk, an activation function f, and output y.) • The input feature vector x = (x0..xn) is mapped into the variable y by means of the scalar product with the weight vector, a bias μk, and a nonlinear function mapping f (called the damping or activation function): y = f(Σi wi xi - μk).
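A minimal Python sketch of the single neuron above; using a sigmoid as the activation (damping) function f is an assumption of the sketch.

    import math

    def neuron(x, w, mu_k, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
        """Weighted sum of the inputs minus the bias mu_k, passed through f."""
        s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
        return f(s)

    print(neuron(x=[1.0, 0.5, -1.0], w=[0.4, 0.2, 0.1], mu_k=0.1))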

  22. Neural Network Training • The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classify correctly (usually using a time-consuming "back propagation" procedure which is based, ultimately, on Newton's method; see the literature or Other Materials - 10datamining.html for examples and alternate training techniques). • Steps • Initialize the weights with random values • Feed the input tuples into the network • For each unit • Compute the net input to the unit as a linear combination of all the inputs to the unit • Compute the output value using the activation function • Compute the error • Update the weights and the bias

  23. Neural Multi-Layer Perceptron. (Figure: the input vector xi feeds the input nodes, weights wij connect them to the hidden nodes, and the output nodes produce the output vector.)

  24. For Nearest Neighbor Classification, a distance is needed (to make sense of "nearest"; other classifiers also use distance). A distance is a function, d, applied to two n-dimensional points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn), such that: d(X,Y) is positive definite: if X ≠ Y, d(X,Y) > 0; if X = Y, d(X,Y) = 0; d(X,Y) is symmetric: d(X,Y) = d(Y,X); d(X,Y) satisfies the triangle inequality: d(X,Y) + d(Y,Z) ≥ d(X,Z). Common choices: Minkowski or Lp distance, dp(X,Y) = (Σi |xi - yi|^p)^(1/p); Manhattan distance (p = 1), Σi |xi - yi|; Euclidean distance (p = 2), sqrt(Σi (xi - yi)^2); Max distance (p = ∞), maxi |xi - yi|; Canberra distance, Σi |xi - yi| / (xi + yi); squared chi-squared distance, Σi (xi - yi)^2 / (xi + yi); squared cord distance, Σi (sqrt(xi) - sqrt(yi))^2. These next 3 slides treat the concept of distance in great detail. You may feel you don't need this much detail; if so, skip what you feel you don't need.
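A minimal Python sketch of several of the distances listed above (standard formulas), checked on the example points of the next slide.

    import math

    def minkowski(X, Y, p):
        return sum(abs(x - y) ** p for x, y in zip(X, Y)) ** (1 / p)

    def manhattan(X, Y):            # p = 1
        return sum(abs(x - y) for x, y in zip(X, Y))

    def euclidean(X, Y):            # p = 2
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

    def max_dist(X, Y):             # p = infinity
        return max(abs(x - y) for x, y in zip(X, Y))

    def canberra(X, Y):
        return sum(abs(x - y) / (x + y) for x, y in zip(X, Y) if x + y != 0)

    X, Y = (2, 1), (6, 4)
    print(manhattan(X, Y), euclidean(X, Y), max_dist(X, Y))   # 7 5.0 4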

  25. An Example in a two-dimensional space: X = (2,1), Y = (6,4), and Z = (6,1) is the corner point. Manhattan: d1(X,Y) = |XZ| + |ZY| = 4 + 3 = 7. Euclidean: d2(X,Y) = |XY| = 5. Max: d∞(X,Y) = max(|XZ|, |ZY|) = |XZ| = 4. In fact, for any positive integer p, d1 ≥ d2 ≥ ... ≥ d∞ always.

  26. Neighborhoods of a Point. A neighborhood (disk neighborhood) of a point, T, is a set of points, S, such that X ∈ S iff d(T,X) ≤ r. If X is a point on the boundary, d(T,X) = r. (Figure: the 2r-wide disks about T under the Manhattan, Euclidean and Max distances.)

  27. Classical k-Nearest Neighbor Classification • Select a suitable value for k (how many Training Data Set (TDS) neighbors do you want to vote as to the best predicted class for the unclassified feature sample?) • Determine a suitable distance metric (to give meaning to "neighbor") • Find the k nearest training set points to the unclassified sample • Let them vote (tally up the counts of TDS neighbors for each class) • Predict the highest-vote (plurality) class from among the k-nearest neighbor set.
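A minimal Python sketch of the classical kNN procedure above (Euclidean distance assumed; distance ties are broken arbitrarily, which is exactly what the Closed-kNN variant on the following slides avoids). The toy training set is made up.

    from collections import Counter
    import math

    def knn_classify(sample, training, k=3):
        """training is a list of (feature_tuple, class_label) pairs."""
        nearest = sorted(training, key=lambda t: math.dist(sample, t[0]))[:k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]            # plurality class

    training = [((1, 1), "red"), ((1, 2), "red"), ((5, 5), "blue"),
                ((6, 5), "blue"), ((6, 6), "blue")]
    print(knn_classify((5, 6), training, k=3))       # 'blue'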

  28. Closed-KNN Example. Assume 2 features (one in the x-direction and one in the y-direction); T is the unclassified sample. Using k = 3, find the three nearest neighbors. KNN arbitrarily selects one point from the boundary line shown, whereas Closed-KNN includes all points on the boundary. Closed-KNN yields higher classification accuracy than traditional KNN (thesis of Md. Maleq Khan, NDSU, 2001). The P-tree method always produces closed neighborhoods (and is faster!).

  29. k-Nearest Neighbor (kNN) Classification and Closed-k-Nearest Neighbor (CkNN) Classification. 1) Select a suitable value for k. 2) Determine a suitable distance or similarity notion. 3) Find the [closed] k-nearest-neighbor set of the unclassified sample. 4) Find the plurality class in the nearest neighbor set. 5) Assign the plurality class as the predicted class of the sample. T is the unclassified sample; use Euclidean distance; k = 3: find the 3 closest neighbors by moving out from T until ≥ 3 neighbors are enclosed (that's 1, that's 2, that's more than 3!). kNN arbitrarily selects one point from that boundary line as the 3rd nearest neighbor, whereas CkNN includes all points on that boundary line. CkNN yields higher classification accuracy than traditional kNN. At what additional cost? Actually, at negative cost (faster and more accurate!!).

  30. The slides numbered 28 through 93 give great detail on the relative performance of kNN and CkNN, on the use of other distance functions, on some examples, etc. There may be more detail on these issues than you want/need; if so, just scan for what you are most interested in or skip ahead to slide 94 on CLUSTERING. Experiments were run on two sets of (aerial) remotely sensed images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contains 6 bands: Red, Green, Blue reflectance values, Soil Moisture, Nitrate, and Yield (class label). Band values range from 0 to 255 (8 bits). Eight classes or levels of yield values are considered.

  31. Performance – Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate). 1997 Dataset. (Chart: Accuracy (%), roughly 40 to 80, vs. Training Set Size (no. of pixels) from 256 to 262144, for kNN-Manhattan, kNN-Euclidian, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance.)

  32. Performance – Accuracy (3 horizontal methods in the middle, 3 vertical methods: the 2 most accurate and the least accurate). 1998 Dataset. (Chart: Accuracy (%), roughly 20 to 65, vs. Training Set Size (no. of pixels) from 256 to 262144, for kNN-Manhattan, kNN-Euclidian, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance.)

  33. Performance – Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest). Hint: NEVER use a log scale to show a WIN!!! 1997 Dataset, both axes in logarithmic scale. (Chart: Per-sample classification time (sec), 0.0001 to 1, vs. Training Set Size (no. of pixels) from 256 to 262144, for kNN-Manhattan, kNN-Euclidian, kNN-Max, kNN using HOBbit distance, P-tree Closed-KNN-max, and Closed-kNN using HOBbit distance.)

  34. Performance – Speed (3 horizontal methods in the middle, 3 vertical methods: the 2 fastest (the same 2) and the slowest). A win-win situation!! (almost never happens): P-tree CkNN and CkNN-H are more accurate and much faster. kNN-H is not recommended because it is slower and less accurate (it doesn't use closed neighbor sets and it requires another step to get rid of ties; why do it?). Horizontal kNNs are not recommended because they are less accurate and slower! 1998 Dataset, both axes in logarithmic scale. (Chart: Per-sample classification time (sec), 0.0001 to 1, vs. Training Set Size (no. of pixels) from 256 to 262144, for the same six methods.)

  35. WALK THRU: 3NN CLASSIFICATION of an unclassified sample, a = (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0). HORIZONTAL APPROACH (the relevant attributes are a5, a6, a10=C, a11, a12, a13, a14). Scan the training table below, keeping the 3 nearest neighbors found so far and replacing the farthest of them whenever a closer tuple is encountered. After the 1st scan the 3 nearest neighbors are t13 (distance 1), t53 (distance 1) and t12 (distance 2), with class values 1, 0, 1, so C=1 wins! Note that only 1 of the many training tuples at distance=2 from the sample got to vote; we didn't know that distance=2 was going to be the vote cutoff until the end of the 1st scan. Finding the other distance=2 voters (the Closed 3NN set, C3NN) requires another scan. Training table:
Key a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1
t13 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1
t15 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0
t16 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0
t21 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1
t27 0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0
t31 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1
t32 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1
t33 0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1
t35 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0
t51 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1
t53 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1
t55 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0
t57 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0
t61 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1
t72 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1
t75 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0

  36. WALK THRU of the required 2nd scan to find the Closed 3NN set. Does it change the vote? The 3NN set after the 1st scan is {t13, t53, t12} and the vote is C=1. On the 2nd scan, every other tuple at distance 2 from the sample is included as well: t15, t16, t33, t51, t55, t57, t72, t75 (the tuples at distance 3 or 4 are not included; t13, t53 and t12 have already voted). The Closed 3NN set now has 11 voters (2 at distance 1 and 9 at distance 2); tallying their class labels: YES! C=0 wins now! (Training table as on the previous slide.)

  37. WALK THRU: Closed 3NNC using P-trees. First let all training points at distance=0 vote, then distance=1, then distance=2, ..., until ≥ 3 have voted. For distance=0 (exact matches), construct the P-tree Ps (the AND of the relevant attribute P-trees, each complemented since the corresponding sample bit is 0; on the slide black denotes a complemented P-tree and red an uncomplemented one), then AND Ps with PC and with PC' to compute the vote. Here Ps is all zeros: there are no neighbors at distance=0. (The attribute bit columns and the training table are as on slide 35.)

  38. WALK THRU, C3NNC distance=1 neighbors: construct the P-tree PD(s,1) = OR_i P_i, where P_i = P{|si-ti|=1 and |sj-tj|=0 for j≠i} = P(si,1) AND (AND over j of P(sj,0)), with i ranging over the relevant attributes {5,6,11,12,13,14} and j over {5,6,11,12,13,14}-{i}. The resulting PD(s,1) = 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 has 1-bits exactly at t13 and t53, the two distance=1 neighbors. (The operand P-tree columns are the attribute bit slices and their complements, as on the preceding slides.)

  39. WALK THRU, C3NNC distance=2 neighbors: PD(s,2) = OR{all double-dimension interval P-trees} = OR over i,j of P_{i,j}, where P_{i,j} = P(si,1) AND P(sj,1) AND (AND over k of P(sk,0)), with i,j ∈ {5,6,11,12,13,14} and k ∈ {5,6,11,12,13,14}-{i,j}. Once the voter count reaches 3 we could quit and declare C=1 the winner (plain 3NN), but including ALL of the distance=2 neighbors gives the complete C3NN set, and we can now declare C=0 the winner! (The operand P-tree columns and the training table are as on the preceding slides.)

  40. In the previous example, there were no exact matches (dis=0 neighbors, or similarity=6 neighbors) for the sample. There were two neighbors found at a distance of 1 (dis=1 or sim=5) and nine dis=2, sim=4 neighbors. All 11 neighbors got an equal vote even though the two sim=5 neighbors are much closer than the nine sim=4 neighbors. Also, processing for the 9 is costly. A better approach would be to weight each vote by the similarity of the voter to the sample. (We will use a vote weight function which is linear in the similarity; admittedly, a better choice would be a function which is Gaussian in the similarity, but, so far, it has been too hard to compute.) As long as we are weighting votes by similarity, we might as well also weight attributes by relevance (assuming some attributes are more relevant than others; e.g., the relevance weight of a feature attribute could be the correlation of that attribute to the class label). P-trees accommodate this method very well (in fact, a variation on this theme won the KDD-cup competition in '02: http://www.biostat.wisc.edu/~craven/kddcup/ ).
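A minimal Python sketch of the similarity-weighted vote described above. The neighbor list is an assumption assembled from the walk-through (2 neighbors at similarity 5 and 9 at similarity 4), with the per-class split read off the training table.

    def similarity_weighted_vote(neighbors):
        """neighbors: iterable of (similarity, class_label) pairs; returns the tallies."""
        tallies = {}
        for sim, cls in neighbors:
            # each neighbor votes with weight equal to its similarity (linear weighting)
            tallies[cls] = tallies.get(cls, 0.0) + sim
        return tallies

    neighbors = [(5, 1), (5, 0)] + [(4, 1)] * 4 + [(4, 0)] * 5
    print(similarity_weighted_vote(neighbors))   # {1: 21.0, 0: 25.0} -- C=0 still wins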

  41. Association for Computing Machinery KDD-Cup-02, NDSU Team

  42. Closed Manhattan Nearest Neighbor Classifier (uses a linear function of Manhattan similarity). The sample is (0 0 0 0 0 0); the weight of each relevant attribute is its subscript; on the slide, black is the attribute complement and red is uncomplemented (the complemented P-trees are written here with a prime). The vote is even simpler than the "equal vote" case: every tuple votes in accordance with its weighted similarity (if its ai value equals that of the sample (000000), the vote contribution is the subscript of that attribute, else zero). Thus we can just add up the root counts of each relevant complemented attribute ANDed with the class P-tree, weighted by the subscript. Class=1 root counts: rc(PC^Pa5')=4, rc(PC^Pa6')=8, rc(PC^Pa11')=7, rc(PC^Pa12')=4, rc(PC^Pa13')=4, rc(PC^Pa14')=7, so the C=1 vote is 343 = 4*5 + 8*6 + 7*11 + 4*12 + 4*13 + 7*14. Similarly, the C=0 vote is 258 = 6*5 + 7*6 + 5*11 + 3*12 + 3*13 + 4*14.

  43. We note that the Closed Manhattan NN Classifier uses an influence function which is pyramidal. It would be much better to use a Gaussian influence function, but it is much harder to implement. One generalization of this method to the case of integer values rather than Boolean would be to weight each bit position in a more Gaussian shape (i.e., weight the bit positions b, b-1, ..., 0, high order to low order, using Gaussian weights). By so doing, at least within each attribute, influences are Gaussian. We can call this method Closed Manhattan Gaussian NN Classification. Testing the performance of either CM NNC or CMG NNC would make a great paper for this course (thesis?). Improving it in some way would make an even better paper (thesis).

  44. Review of slide 2 (with additions): Database analysis can be broken down into 2 areas, Querying and Data Mining. Data Mining can be broken down into 2 areas, Machine Learning and Association Rule Mining. Machine Learning can be broken down into 2 areas, Clustering and Classification. Clustering can be broken down into 2 types, Isotropic (round clusters) and Density-based. Classification can be broken down into 2 types, Model-based and Neighbor-based. Machine Learning is based on Near Neighbor Set(s), NNS. Clustering, even density-based, identifies near neighbor cores 1st (round NNSs, ε about a center). Classification is continuity based and Near Neighbor Sets (NNS) are the central concept in continuity: ∀ε>0 ∃δ>0 : d(x,a)<δ ⇒ d(f(x),f(a))<ε, where f assigns a class to a feature vector; or, equivalently, ∀ ε-NNS of f(a), ∃ a δ-NNS of a in its pre-image. If f(Dom) is categorical, then ∃δ>0 : d(x,a)<δ ⇒ f(x)=f(a). Caution: for classification, boundary analysis may also be needed to see the class (done by projecting?). Finding an NNS in a lower dimension may still be the 1st step. E.g., points 1,2,3,4,5,6,7,8 are all within ε of a (the unclassified sample); 1,2,3,4 are red-class, 5,6,7,8 are blue-class. Any ε that gives us a vote gives us a tie vote (0-to-0, then 4-to-4). But projecting onto the vertical subspace, then taking ε/2, we see that the ε/2 neighborhood about a contains only blue-class (5,6) votes. Using horizontal data, NNS derivation requires ≥1 scan (O(n)). L∞ ε-NNS can be derived using vertical data in O(log2 n) (but Euclidean disks are preferred). (Euclidean and L∞ coincide in binary data sets.) Solution (next slide): circumscribe the desired Euclidean ε-NNS with a few intersections of functional contours, i.e., f⁻¹([b,c]) sets, until the intersection is scannable, then scan it for Euclidean ε-neighborhood membership. Advantage: the intersection can be determined before scanning: create and AND functional-contour P-trees.

  45. Functional Contours: a function, f:R(A1..An) → Y, is, equivalently, a derived attribute, Af, with DOMAIN(Af) ⊆ Y (the equivalence is x.Af = f(x) ∀x ∈ R). graph(f) = {(x, f(x)) | x ∈ R}. ∀ S ⊆ Y, contour(f,S) = f⁻¹(S); equivalently, contour(Af,S) = SELECT A1..An FROM R* WHERE x.Af ∈ S. If S = {a}, Isobar(f,a) = contour(f,{a}). Graphically: ∀ partition {Si} of Y, the contour set {f⁻¹(Si)} is a partition of R (a clustering of R). Examples: a weather map, where f = barometric pressure or temperature and {Si} = an equi-width partition of the Reals; f = local density (e.g., OPTICS: f = reachability distance, {Sk} = the partition produced by the intersection points of graph(f), plotted wrt some walk of R, and a horizontal threshold line). A grid is the intersection of dimension-projection contour partitions (next slide for more definitions). A Class is a contour under f:R→ClassAttr wrt the partition {Ci} of ClassAttr (where the {Ci} are the classes). An L∞ ε-disk about a is the intersection of all ε dimension-projection contours containing a.

  46. GRIDs. Want square cells or a square pattern? (Figure: a 2.lo grid and a 1.hi grid over a space whose horizontal dimension has bitwidth 3, labels 000..111, and whose vertical dimension has bitwidth 2, labels 00..11.) f:R→Y; ∀ partition S={Sk} of Y, {f⁻¹(Sk)} is the S,f-grid of R (grid cells = contours). If Y = Reals, the j.lo f-grid is produced by agglomerating over the j lo bits of Y, for each fixed (b-j) hi bit pattern. The j lo bits walk [isobars of] cells; the b-j hi bits identify cells (lo = extension / hi = intention). Let b-1,...,0 be the b bit positions of Y. The j.lo f-grid is the partition of R generated by f and S = {S(b-1)...(b-j)}, where S(b-1)...(b-j) = [(b-1)(b-2)...(b-j)0..0, (b-1)(b-2)...(b-j)1..1), a partition of Y = Reals. If F={fh}, the j.lo F-grid is the intersection partition of the j.lo fh-grids (intersection of partitions). The canonical j.lo grid is the j.lo grid of the coordinate projections {d: R→R[Ad] | d = the dth coordinate projection}. j.hi gridding is similar (the b-j lo bits walk cell contents / the j hi bits identify cells).

  47. j.lo and j.hi gridding continued: horizontal_bitwidth = vertical_bitwidth = b iff the j.lo grid = the (b-j).hi grid; e.g., for hb = vb = b = 3 and j = 2, the 2.lo grid and the 1.hi grid coincide. (Figure: the grid with both axes labeled 000..111.)

  48. A distance, d, generates a similarity in many ways, e.g., s(x,y) = 1/(1+d(x,y)) (or, if the relationship varies by location, a location-dependent numerator over 1+d(x,y)); or a Gaussian, s(x,y) = a*e^(-b*d(x,y)^2); or a truncated Gaussian, s(x,y) = 0 if d(x,y) > ε and s(x,y) = a*e^(-b*d(x,y)^2) - a*e^(-b*ε^2) if d(x,y) ≤ ε. (Vote weighting IS a similarity assignment, so the similarity-to-distance graph IS a vote weighting for classification.) Similarity NearNeighborSets (SNNS): given a similarity s:RxR → PartiallyOrderedSet (e.g., Reals), i.e., s(x,y)=s(y,x) and s(x,x) ≥ s(x,y) ∀x,y ∈ R, and given any C ⊆ R, the Ordinal disks, skins and rings are: disk(C,k) ⊇ C such that |disk(C,k)-C| = k and s(x,C) ≥ s(y,C) ∀x ∈ disk(C,k), y ∉ disk(C,k); skin(C,k) = disk(C,k) - C (the skin comprises C's k immediate neighbors and is a kNNS of C); ring(C,k) = cskin(C,k) - cskin(C,k-1); closeddisk(C,k) = the union of all disk(C,k); closedskin(C,k) = the union of all skin(C,k). The Cardinal disks, skins and rings are (PartiallyOrderedSet = Reals): disk(C,r) = {x ∈ R | s(x,C) ≥ r}, which is also the functional contour f⁻¹([r,∞)) where f(x) = sC(x) = s(x,C); skin(C,r) = disk(C,r) - C; ring(C,r2,r1) = disk(C,r2) - disk(C,r1) = skin(C,r2) - skin(C,r1), also the functional contour sC⁻¹((r1,r2]). Note: closeddisk(C,r) is redundant, since all r-disks are closed, and closeddisk(C,k) = disk(C, s(C,y)) where y = the kth NN of C. L∞ skins: skin(a,k) = {x | ∃d, xd is one of the k NNs of ad} (a local normalizer?).

  49. Partition tree: R / … \ C1 … Cn, each Ci further partitioned into Ci,1 … Ci,ni, and so on. P-trees are vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization: vertical (by relation) or horizontal (by tree node) or some combination? Horizontal parallelization is pretty, but network multicast overhead is huge. Use active networking? Clusters of Playstations? ... Formally, P-trees are defined as any of the following: Partition-tree: a tree of nested partitions (a partition P(R)={C1..Cn}; each component is partitioned by P(Ci)={Ci,1..Ci,ni}, i=1..n; each of those components is partitioned by P(Ci,j)={Ci,j,1..Ci,j,nij}; ...). Predicate-tree: for a predicate on the leaf nodes of a partition-tree (which also induces predicates on interior nodes using quantifiers); Predicate-tree nodes can be truth values (Boolean P-tree), quantified existentially (1 or a threshold %) or universally; or Predicate-tree nodes can count the number of true leaf children of that component (Count P-tree). Purity-tree: a universally quantified Boolean Predicate-tree (e.g., if the predicate is "=1", the Pure1-tree or P1tree); a 1-bit at a node iff the corresponding component is pure1 (universally quantified). There are many other useful predicates, e.g., NonPure0-trees, but we will focus on P1trees. All P-trees shown so far were 1-dimensional (recursively partition by halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), ..., or based on purity runs or LZW-runs or ... Further observations about P-trees: Partition-trees have set nodes; Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree); Purity-trees, being universally quantified Boolean Predicate-trees, have Boolean nodes (since the count is always the "full" count of leaves, expressing Purity-trees as count-trees is redundant). A Partition-tree can be sliced at a level if each partition is labeled with the same label set (e.g., the Month partition of years). A Partition-tree can be generalized to a Set-graph when the siblings of a node do not form a partition.

  50. The partitions used to create P-trees can come from functional contours. (Note: there is a natural duality between partitions and functions: a partition creates a function from the space of points partitioned to the set of partition components, and a function creates the pre-image partition of its domain.) In functional contour terms (i.e., f⁻¹(S) where f:R(A1..An)→Y, S⊆Y), the uncompressed Ptree or uncompressed Predicate-tree 0Pf,S = the bitmap of the set-containment predicate, 0Pf,S(x)=true iff x ∈ f⁻¹(S); 0Pf,S is, equivalently, the existential R*-bit map of the predicate R*.Af ∈ S. The Compressed Ptree, sPf,S, is the compression of 0Pf,S with equi-width leaf size, s, as follows: 1. Choose a walk of R (converts 0Pf,S from a bit map to a bit vector). 2. Equi-width partition 0Pf,S with segment size s (s = leafsize; the last segment can be short). 3. Eliminate and mask to 0 all pure-zero segments (call the mask the NotPure0 Mask or EM). 4. Eliminate and mask to 1 all pure-one segments (call the mask the Pure1 Mask or UM). (EM = existential aggregation, UM = universal aggregation.) Compressing each leaf of sPf,S with leafsize s2 gives s1,s2Pf,S; recursively, s1,s2,s3Pf,S, s1,s2,s3,s4Pf,S, ... (this builds an EM and a UM tree). BASIC P-trees: if Ai is Real or Binary and fi,j(x) = the jth bit of xi, then {(*)Pfi,j,{1} = (*)Pi,j}j=b..0 are the basic (*)P-trees of Ai, * = s1..sk. If Ai is Categorical and fi,a(x)=1 if xi=a, else 0, then {(*)Pfi,a,{1} = (*)Pi,a}a∈R[Ai] are the basic (*)P-trees of Ai. Notes: the UM masks (e.g., of 2k,...,20Pi,j, with k = roof(log2|R|)) form a (binary) tree. Whenever the EM bit is 0, that entire subtree can be eliminated (since it represents a pure0 segment); then a 0-node at level k (lowest level = level 0) with no sub-tree indicates a 2^k-run of zeros. In this construction the UM tree is redundant. We call these EM trees the basic binary P-trees. The next slide shows a top-down (easy to understand) construction, and the following slide a (much more efficient) bottom-up construction of the same. We have suppressed the leafsize prefix.
