
Instance based and Bayesian learning


Presentation Transcript


  1. Instance based and Bayesian learning Kurt Driessens, with slide ideas from, among others, Hendrik Blockeel, Pedro Domingos, David Page, Tom Dietterich and Eamon Keogh

  2. Overview Nearest neighbor methods • Similarity • Problems: • dimensionality of data, efficiency, etc. • Solutions: • weighting, edited NN, kD-trees, etc. Naïve Bayes • Including an introduction to Bayesian ML methods

  3. Nearest Neighbor: A very simple idea Imagine the world’s music collection represented in some space When you like a song, other songs residing close to it should also be interesting … Picture from Oracle

  4. Nearest Neighbor Algorithm • Store all the examples <xi,yi> • Classify a new example x by finding the stored example xk that most resembles it and predicting that example’s class yk (figure: labeled + / - examples around an unlabeled query point ?)
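A minimal sketch of this idea in Python (not from the slides; Euclidean distance over numeric attributes is assumed):

```python
import math

def nearest_neighbor_predict(examples, x):
    """Predict the label of x from a list of (attributes, label) pairs
    by returning the label of the single closest stored example."""
    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    closest = min(examples, key=lambda ex: euclidean(ex[0], x))
    return closest[1]

# Example: two numeric attributes, classes '+' and '-'
train = [((1.0, 2.0), '+'), ((4.0, 4.5), '-'), ((1.2, 1.8), '+')]
print(nearest_neighbor_predict(train, (1.1, 2.1)))  # '+'
```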

  5. Some properties • Learning is very fast (although we come back to this later) • No information is lost • Hypothesis space • variable size • complexity of the hypothesis rises with the number of stored examples

  6. Decision Boundaries The decision boundaries form a Voronoi diagram around the stored examples Boundaries are not computed explicitly! (figure: Voronoi tessellation of + / - examples)

  7. Keeping All Information Advantage: no details are lost Disadvantage: "details" may be noise (figure: + / - training set containing noisy examples)

  8. k-Nearest-Neighbor: kNN To improve robustness against noisy learning examples, use a set of k nearest neighbors instead of a single one For classification: use voting among the neighbors’ classes (figure: query point ? classified by majority vote)

  9. k-Nearest-Neighbor: kNN (2) For regression: use the mean of the k neighbors’ target values (figure: numeric target values surrounding a query point ?)
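A hedged sketch of both kNN variants (my own helper names; Euclidean distance assumed):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_neighbors(examples, x, k):
    # the k stored (attributes, target) pairs closest to the query x
    return sorted(examples, key=lambda ex: euclidean(ex[0], x))[:k]

def knn_classify(examples, x, k=3):
    # classification: majority vote among the k nearest neighbors
    votes = Counter(label for _, label in knn_neighbors(examples, x, k))
    return votes.most_common(1)[0][0]

def knn_regress(examples, x, k=3):
    # regression: mean target value of the k nearest neighbors
    neighbors = knn_neighbors(examples, x, k)
    return sum(y for _, y in neighbors) / len(neighbors)
```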

  10. Lazy vs Eager Learning kNN doesn’t do anything until it needs to make a prediction = lazy learner • Learning is fast! • Predictions require work and can be slow Eager learners start computing as soon as they receive data Decision tree algorithms, neural networks, … • Learning can be slow • Predictions are usually fast!

  11. Similarity measures Distance metrics: measure of dis-similarity E.g. Manhattan, Euclidean or Ln-norm for numerical attributes Hamming distance for nominal attributes
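For reference, the standard definitions behind these names (my notation; the slide’s Ln-norm is written here as the general Lp / Minkowski norm, for two examples x and y with n attributes):

```latex
d_{\text{Manhattan}}(x,y) = \sum_{i=1}^{n} |x_i - y_i|
\qquad
d_{\text{Euclidean}}(x,y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
d_{L_p}(x,y) = \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p}
\qquad
d_{\text{Hamming}}(x,y) = \sum_{i=1}^{n} \mathbb{1}[x_i \neq y_i]
```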

  12. Distance definition = critical! E.g. comparing humans on (height, age): • 1.85m, 37yrs • 1.83m, 35yrs • 1.65m, 37yrs gives d(1,2) = 2.0001…, d(1,3) = 0.2, d(2,3) = 2.008…, so person 3 is the nearest neighbor of person 1. Measuring height in cm instead: • 185cm, 37yrs • 183cm, 35yrs • 165cm, 37yrs gives d(1,2) = 2.8284…, d(1,3) = 20, d(2,3) = 18.1107…, and now person 2 is nearest to person 1: the choice of units decides which attribute dominates the distance

  13. Normalize attribute values Rescale all dimensions such that the range is equal, e.g. [-1,1] or [0,1] For a [0,1] range: xi' = (xi - mi) / (Mi - mi), with mi the minimum and Mi the maximum value for attribute i

  14. Curse of dimensionality Assume a uniformly distributed set of 5000 examples To capture the 5 nearest neighbors we need: • in 1 dim: 0.1% of the range • in 2 dim: (0.1%)^(1/2) ≈ 3.1% of the range • in n dim: (0.1%)^(1/n) of the range
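The arithmetic behind these numbers, written out (assuming points uniform in a unit hypercube and an axis-aligned query box that must contain a fraction p = 5/5000 = 0.001 of the points):

```latex
\text{required side length per dimension} \approx p^{1/d}
\qquad
0.001^{1/1} = 0.1\%, \quad
0.001^{1/2} \approx 3.16\%, \quad
0.001^{1/10} \approx 50\%
```

The last value is the roughly 50% coverage per attribute that the next slide mentions for 10 dimensions.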

  15. Curse of Dimensionality (2) With 5000 points in 10 dimensions, approx. 50% of each attribute’s range must be covered to find the 5 nearest neighbors

  16. Curse of Noisy Features Irrelevant features destroy the metric’s meaningfulness Consider a 1dim problem where the query x is at the origin, the nearest neighbor x1 is at 0.1 and the second neighbor x2 at 0.5 (after normalization) • Now add a uniformly random feature. What is the probability that x2 becomes the closest neighbor? approx. 15% !!
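A quick Monte Carlo check of this claim (my own setup, assuming the query and both neighbors each receive one extra, independent feature drawn uniformly from [0,1]):

```python
import random

def prob_x2_wins(trials=1_000_000):
    """Estimate P(x2 becomes the nearest neighbor) after adding one
    uniformly random, irrelevant feature to x, x1 and x2."""
    wins = 0
    for _ in range(trials):
        q, u1, u2 = random.random(), random.random(), random.random()
        d1 = 0.1 ** 2 + (u1 - q) ** 2   # squared distance from x to x1
        d2 = 0.5 ** 2 + (u2 - q) ** 2   # squared distance from x to x2
        wins += d2 < d1
    return wins / trials

print(prob_x2_wins())  # roughly 0.15, in line with the slide's "approx. 15%"
```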

  17. Curse of Noisy Features (2) (figure: location of x1 vs x2 on the informative dimension)

  18. Weighted Distances Solution: Give each attribute a different weight in the distance computation
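The slide’s formula is not reproduced in the transcript; a common form of such a weighted (Euclidean) distance is:

```latex
d_w(x, y) = \sqrt{\sum_{i=1}^{n} w_i\,(x_i - y_i)^2}, \qquad w_i \ge 0
```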

  19. Selecting attribute weights Several options: • Experimentally find out which weights work well (cross-validation) • Other solutions, e.g. (Langley, 1996): • Normalize attributes (to scale 0-1) • Then select each attribute’s weight according to the "average attribute similarity within class", averaged over every class and every example in that class

  20. More distances Strings • Levenshtein distance/edit distance = minimal number of changes needed to change one word into the other Allowed edits/changes: • delete character • insert character • change character (not used by some other edit-distances)
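A standard dynamic-programming implementation of this edit distance, as a sketch (not taken from the slides):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                           # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                 # delete ca
                curr[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),    # change character (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```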

  21. Even more distances Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance D(Q,C) compares them point by point Start and end times are critical! (figure: time series Q, C and R, with D(Q,C) and D(Q,R) compared)

  22. Sequence distances (2) • Dynamic Time Warping: with a fixed time axis, sequences are aligned "one to one"; with a "warped" time axis, nonlinear alignments are possible (cf. edit distance!) • Dimensionality reduction
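A compact DTW sketch (illustration only; squared point-wise costs and an unconstrained warping window are assumed):

```python
import math

def dtw_distance(q, c):
    """Dynamic Time Warping distance between two numeric sequences,
    allowing nonlinear (warped) alignments of their time axes."""
    n, m = len(q), len(c)
    INF = float("inf")
    # dtw[i][j] = cost of the best alignment of q[:i] and c[:j]
    dtw = [[INF] * (m + 1) for _ in range(n + 1)]
    dtw[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            dtw[i][j] = cost + min(dtw[i - 1][j],      # stretch c
                                   dtw[i][j - 1],      # stretch q
                                   dtw[i - 1][j - 1])  # advance both
    return math.sqrt(dtw[n][m])

print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0: same shape, shifted
```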

  23. Distance-weighted kNN k places an arbitrary border on example relevance • Idea: give higher weight to closer instances One can now use all training instances instead of only k ("Shepard’s method")! In high-dimensional spaces, a weighting function of d that "goes to zero fast enough" is needed (again the "curse of dimensionality")
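The slide’s formulas are missing from the transcript; Shepard’s method is usually written with inverse-square distance weights, roughly:

```latex
w_i = \frac{1}{d(x, x_i)^2}
\qquad
\hat{y}(x) = \frac{\sum_i w_i\, y_i}{\sum_i w_i}
\qquad
\text{(with } \hat{y}(x) = y_i \text{ if } d(x, x_i) = 0\text{)}
```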

  24. Fast Learning – Slow Predictions Efficiency • For each prediction, kNN needs to compute the distance (i.e. compare all attributes) for ALL stored examples • Prediction time = linear in the size of the data-set For large training sets and/or complex distances, this can be too slow to be practical

  25. (1) Edited k-nearest neighbor Use only part of the training data, either by incremental deletion or by incremental addition of examples • Less storage • Order dependent • Sensitive to noisy data • More advanced alternatives exist (e.g. IB3)

  26. (2) Pipeline filters Reduce time spent on far-away examples by using more efficient distance-estimates first • Eliminate most examples using rough distance approximations • Compute more precise distances for examples in the neighborhood

  27. (3) kD-trees Use a clever data-structure to eliminate the need to compute all distances kD-trees are similar to decision trees except • splits are made on the median/mean value of dimension with highest variance • each node stores one data point, leaves can be empty

  28. Example kD-tree Use a form of A* search, with the minimum possible distance to a node’s region as an underestimate of the true closest distance Finds the closest neighbor in time roughly logarithmic in the number of examples (proportional to the depth of the tree)
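In practice one rarely codes this search by hand; scipy’s `KDTree` illustrates the build-once / query-fast idea (usage sketch with made-up random data):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((5000, 3))    # 5000 stored examples, 3 numeric attributes

tree = KDTree(points)             # build once: learning is no longer free
dist, idx = tree.query([0.5, 0.5, 0.5], k=5)   # 5 nearest neighbors of the query
print(idx, dist)
```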

  29. kD-trees (cont.) Building a good kD-tree may take some time • Learning time is no longer 0 • Incremental learning is no longer trivial • the kD-tree will no longer stay balanced • re-building the tree is recommended when the max-depth becomes larger than 2× the minimal required depth (= log2(N) for N training examples) Moving away from lazy learning Cover trees are more advanced, more complex, and more efficient!!

  30. (4) Using Prototypes The rough decision surfaces of nearest neighbor can sometimes be considered a disadvantage • Solve two problems at once by using prototypes = a representative for a whole group of instances (figure: groups of + and - examples, each replaced by a single prototype)

  31. Prototypes (cont.) Prototypes can be: • a single instance, replacing a group • another structure (e.g., rectangle, rule, ...) -> in this case: a distance to the prototype must be defined Moving further and further away from lazy learning (figure: rectangular prototypes covering groups of + and - examples)

  32. Recommender Systems through instance based learning • Predict ratings for films users have not yet seen (or rated).

  33. Recommender Systems Predict through instance based regression: combine the average rating of user i with a weighted average of how other users j rated entry k (relative to their own average rating), using the Pearson correlation coefficient between users as the weight; the other users can be all users or only the k nearest neighbors (see the formula sketch below)
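The slide’s formula is missing from the transcript; the standard user-based collaborative-filtering prediction it appears to describe is (my reconstruction):

```latex
\hat{r}_{ik} = \bar{r}_i + \frac{\sum_{j} w_{ij}\,(r_{jk} - \bar{r}_j)}{\sum_{j} |w_{ij}|}
```

where r_jk is the rating by user j of entry k, r̄_i the average rating of user i, w_ij the Pearson correlation between users i and j, and the sum runs over all other users or only the k nearest neighbors.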

  34. Some Comments on k-NN Positive • Easy to implement • Good “baseline” algorithm / experimental control • Incremental learning easy • Psychologically plausible model of human memory Negative • Led astray by irrelevant features • No insight into domain (no explicit model) • Choice of distance function is problematic • Doesn’t exploit/notice structure in examples

  35. Summary • Generalities of instance based learning • Basic idea, (dis)advantages, Voronoi diagrams, lazy vs. eager learning • Various instantiations • kNN, distance-weighted methods, ... • Rescaling attributes • Use of prototypes

  36. Bayesian learning This is going to be very introductory • Describing (results of) learning processes • MAP and ML hypotheses • Developing practical learning algorithms • Naïve Bayes learner • application: learning to classify texts • Learning Bayesian belief networks

  37. Bayesian approaches Several roles for probability theory in machine learning: • describing existing learners • e.g. compare them with the “optimal” probabilistic learner • developing practical learning algorithms • e.g. the “Naïve Bayes” learner Bayes’ theorem plays a central role

  38. Basics of probability • P(A): probability that A happens • P(A|B): probability that A happens, given that B happens (“conditional probability”) • Some rules: • complement: P(not A) = 1 - P(A) • disjunction: P(A or B) = P(A) + P(B) - P(A and B) • conjunction: P(A and B) = P(A) P(B|A) = P(A) P(B) if A and B are independent • total probability: P(A) = Σi P(A|Bi) P(Bi), with the Bi mutually exclusive and exhaustive

  39. Bayes’ Theorem P(A|B) = P(B|A) P(A) / P(B) Mainly 2 ways of using Bayes’ theorem: • Applied to learning a hypothesis h from data D: P(h|D) = P(D|h) P(h) / P(D) ∝ P(D|h) P(h) • P(h): a priori probability that h is correct • P(h|D): a posteriori probability that h is correct • P(D): probability of obtaining data D • P(D|h): probability of obtaining data D if h is correct • Applied to classification of a single example e: P(class|e) = P(e|class) P(class) / P(e)

  40. Bayes’ theorem: Example Example: • assume some lab test for a disease has a 98% chance of giving a positive result if the disease is present, and a 97% chance of giving a negative result if the disease is absent • assume furthermore that 0.8% of the population has this disease • given a positive result, what is the probability that the disease is present? P(Dis|Pos) = P(Pos|Dis) P(Dis) / P(Pos) = 0.98*0.008 / (0.98*0.008 + 0.03*0.992) = 0.00784 / 0.0376 ≈ 0.21, so even after a positive test the disease is present with only about 21% probability

  41. MAP and ML hypotheses Task: Given the current data D and some hypothesis space H, return the hypothesis h in H that is most likely to be correct Note: this h is optimal in a certain sense • no method exists that finds the correct h with higher probability

  42. MAP hypothesis Given some data D and a hypothesis space H, find the hypothesis h ∈ H that has the highest probability of being correct; i.e., P(h|D) is maximal This hypothesis is called the maximum a posteriori hypothesis hMAP: hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h) • the last equality holds because P(D) is constant So: we need P(D|h) and P(h) for all h ∈ H to compute hMAP

  43. ML hypothesis P(h): a priori probability that h is correct What if there is no preference for one h over another? • Then assume P(h) = P(h’) for all h, h’ ∈ H • Under this assumption hMAP is called the maximum likelihood hypothesis hML: hML = argmax_{h∈H} P(D|h) (because P(h) is constant) • How to find hMAP or hML? • brute force method: compute P(D|h) and P(h) for all h ∈ H • usually not feasible

  44. Naïve Bayes classifier Simple & popular classification method • Based on Bayes’ rule + assumption of conditional independence • assumption often violated in practice • even then, it usually works well Example application: classification of text documents

  45. Classification using Bayes rule Given attribute values a1…an, what is the most probable value vMAP of the target variable? vMAP = argmax_{vj} P(vj|a1…an) = argmax_{vj} P(a1…an|vj) P(vj) Problem: far too much data would be needed to estimate P(a1…an|vj) directly

  46. The Naïve Bayes classifier Naïve Bayes assumption: attributes are independent, given the class P(a1,…,an|vj) = P(a1|vj) P(a2|vj) … P(an|vj) • also called conditional independence (given the class) • Under that assumption, vMAP becomes argmax_{vj} P(vj) Πi P(ai|vj)

  47. Learning a Naïve Bayes classifier To learn such a classifier: just estimate P(vj), P(ai|vj) from data How to estimate? • simplest: standard estimate from statistics • estimate probability from sample proportion • e.g., estimate P(A|B) as count(A and B) / count(B) • in practice, something more complicated needed…

  48. Estimating probabilities Problem: • What if attribute value ai never observed for class vj? • Estimate P(ai|vj)=0 because count(ai and vj) = 0 ? • Effect is too strong: this 0 makes the whole product 0! Solution: use m-estimate • interpolates between observed value nc/n and a priori estimate p -> estimate may get close to 0 but never 0 • m is weight given to a priori estimate
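The m-estimate formula itself did not survive the transcript; its usual form is:

```latex
\hat{P}(a_i \mid v_j) = \frac{n_c + m\,p}{n + m}
```

where n is the number of training examples with class vj, nc the number of those that also have attribute value ai, p the a priori estimate (e.g. uniform over the possible values), and m the weight (equivalent sample size) given to that prior.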

  49. Learning to classify text Example application: • given the text of a newsgroup article, guess which newsgroup it was taken from • Naïve Bayes turns out to work well on this application • How to apply NB? • Key issue: how do we represent examples? what are the attributes?

  50. Representation Binary classification (+/-) or multiple classes possible Attributes = word frequencies • Vocabulary = all words that occur in learning task • # attributes = size of vocabulary • Attribute value = word count or frequency in the text (using m-estimate) = “Bag of Words” representation
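A compact bag-of-words Naïve Bayes sketch tying slides 44–50 together (illustrative only; add-one Laplace smoothing stands in for the m-estimate with p = 1/|vocabulary| and m = |vocabulary|):

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    def fit(self, docs, labels):
        """docs: list of token lists, labels: list of class values."""
        self.vocab = {w for doc in docs for w in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)   # class -> word -> count
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
        self.total = len(labels)
        return self

    def predict(self, doc):
        best, best_logp = None, -math.inf
        for y, ny in self.class_counts.items():
            # log P(vj) + sum_i log P(ai|vj), with add-one smoothing
            logp = math.log(ny / self.total)
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            for w in doc:
                logp += math.log((self.word_counts[y][w] + 1) / denom)
            if logp > best_logp:
                best, best_logp = y, logp
        return best

docs = [["cheap", "pills", "buy"], ["meeting", "agenda", "notes"]]
labels = ["spam", "ham"]
model = NaiveBayesText().fit(docs, labels)
print(model.predict(["buy", "cheap"]))  # 'spam'
```

Working with sums of log-probabilities rather than raw products avoids numerical underflow when documents contain many words.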
