Overview of Today’s Lecture

Presentation Transcript
Overview of Today’s Lecture
• Last Time: course introduction
• Reading assignment posted to class webpage
• Don’t get discouraged
• Today: introduction to “Supervised Machine Learning”
• Our first ML algorithm: K-nearest neighbor
• HW 0 out online
• Create a dataset of “fixed-length feature vectors”
• Due next Tuesday Sept 19 (4 PM)
• Instructions for handing in HW0 coming soon
Supervised Learning: Overview

[Diagram: Real World → (humans select features: HW 0) → Digital Representation (feature space) → (machine constructs classifier: HW 1-2) → classification rules, e.g., “If feature 2 = X then APPLY BRAKE = TRUE”]

• Given
• A collection of positive examples of some concept/class/category (i.e., members of the class) and, possibly, a collection of negative examples (i.e., non-members)
• Produce
• A description that covers (includes) all (most) of the positive examples and none (few) of the negative examples

(which, hopefully, properly categorizes most future examples!)

The Key Point!

Note: one can easily extend this definition to handle more than two classes.

Example

[Figure: a collection of positive examples and a collection of negative examples (shapes); how does a new symbol classify?]

• Possible concepts:
• Solid red circle in a regular polygon
• Figure with red solid circles not in a larger red circle
• Figures on the left side of the page, etc.
• Step 1: Choose a Boolean (true/false) concept
• Subjective judgment (can’t articulate)
• Books I like/dislike
• Movies I like/dislike
• www pages I like/dislike
• “time will tell” concepts
• Medical treatment (at time t, predict outcome at time t + ∆t)
• Sensory interpretation
• Face recognition (See text)
• Handwritten digit recognition
• Sound recognition
• Hard-to-program functions
• Step 2: Choose a feature space
• We will use fixed-length feature vectors
• Choose N features
• Each feature has V_i possible values
• Each example is represented by a vector of N feature values

(i.e., is a point in the feature space)

e.g., <red, 50, round> (features: color, weight, shape)

• Feature Types
• Boolean
• Nominal
• Ordered
• Hierarchical
• Step 3: Collect examples (“I/O” pairs)

(Choosing the N features defines a space.)

We will not use hierarchical features

[ISA hierarchy for shape: closed splits into polygon (square, triangle) and continuous (circle, ellipse)]
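To make the fixed-length feature-vector idea concrete, here is a minimal Python sketch (the feature names and values are made up for illustration; this is not part of HW 0):

```python
# Minimal sketch: each example is a fixed-length feature vector plus a label.
# Feature names and values are invented for illustration.

FEATURE_NAMES = ("color", "weight", "shape")      # N = 3 features

training_examples = [                             # (feature vector, label) pairs
    (("red",  50, "round"),  True),               # e.g., <red, 50, round> is a positive example
    (("blue", 12, "square"), False),
    (("red", 200, "round"),  True),
]

def as_vector(color, weight, shape):
    """Pack raw attribute values into one point in the feature space."""
    return (color, weight, shape)

new_example = as_vector("green", 75, "square")    # an unlabeled point to classify later
```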

Standard Feature Types for representing training examples – a source of “domain knowledge”
• Nominal (Boolean is a special case)
• No relationship among possible values

e.g., color ∈ {red, blue, green} (vs. color = 1000 Hertz)

• Linear (or Ordered)
• Possible values of the feature are totally ordered

e.g., size ∈ {small, medium, large} ← discrete

weight ∈ [0…500] ← continuous

• Hierarchical
• Possible values are partially ordered in an ISA hierarchy

e.g., for shape (see the ISA hierarchy above)

[Figure: a product ISA hierarchy as one hierarchical feature – Product at the root, 99 product classes (e.g., Pet Foods, Tea), 2302 product subclasses (e.g., Dried Cat Food, Canned Cat Food), and ~30k individual products (e.g., Friskies Liver, 250g)]

• Structure of one feature!
• “the need to be able to incorporate hierarchical (knowledge about data types) is shown in every paper.”
• - From eds. intro to the special issue (on applications) of the KDD journal*, Vol 15, 2001

* Officially, “Data Mining and Knowledge Discovery”, Kluwer Publishers

Some Famous Examples
• Car Steering (Pomerleau): digitized camera image → learned function → steering angle
• Medical Diagnosis (Quinlan): medical record (e.g., age = 13, sex = M, wgt = 18) → learned function → ill vs. healthy
• DNA Categorization
• TV-pilot rating
• Chemical-plant control
• Backgammon playing
• WWW page scoring
• Credit application scoring

• Choose a dataset
• based on interest/familiarity
• meets basic requirements
• >1000 examples
• category (function) learned should be binary valued
• ~500 “true” and “false” examples

→ Internet Movie Database (IMDb)

Example Database: IMDb
[Schema diagram:
• Studio – Name, Country, Movies
• Director/Producer – Name, Year of birth, Movies
• Actor – Name, Year of birth, Gender, Oscars, Movies
• Movie – Title, Genre, Year, Opening Weekend, BO receipts, List of actors/actresses, Release season
• Relationships – Acted in, Directed, Produced]

Choose Boolean target function (category)

• Some examples:
• Opening weekend box office receipts > $2 million
• Movie is a drama? (vs. action, sci-fi, …)
• Movies I like/dislike (e.g., TiVo)
• Movie
• Average age of actors
• Number of producers
• Percent female actors
• Studio
• Average movie gross
• Percent movies released in US

• Director/Producer
• Years of experience
• Most prevalent genre
• Number of award winning movies
• Average movie gross
• Actor
• Gender
• Has previous Oscar award or nominations
• Most prevalent genre
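As a hypothetical sketch, a few of the derived features above (average actor age, percent female actors, number of producers) could be computed from raw records along these lines; the dictionary keys are illustrative assumptions, not the actual IMDb schema:

```python
# Hypothetical sketch: turning raw movie records into derived features.
# The keys "actors", "producers", "birth_year", "gender" are assumptions.

def movie_features(movie, release_year):
    """Return (average actor age, percent female actors, number of producers)."""
    actors = movie["actors"]
    ages = [release_year - a["birth_year"] for a in actors]
    avg_actor_age = sum(ages) / len(ages)
    pct_female = 100.0 * sum(a["gender"] == "F" for a in actors) / len(actors)
    return (avg_actor_age, pct_female, len(movie["producers"]))

example = {"actors": [{"birth_year": 1970, "gender": "F"},
                      {"birth_year": 1960, "gender": "M"}],
           "producers": ["p1", "p2"]}
print(movie_features(example, release_year=2000))   # -> (35.0, 50.0, 2)
```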

David Jensen’s group at UMass used Naïve Bayes (NB) to predict the following based on attributes they selected and a novel way of sampling from the data:

• Opening weekend box office receipts > $2 million
• 25 attributes
• Accuracy = 83.3%
• Default accuracy = 56%
• Movie is drama?
• 12 attributes
• Accuracy = 71.9%
• Default accuracy = 51%
Back to Supervised Learning

One way learning systems differ is in how they represent concepts:

[Diagram: the same training examples can be turned into any of several concept representations:]

• Neural net (trained with Backpropagation)
• Decision tree (C4.5, CART)
• Rules (AQ, FOIL), e.g., Φ ← X ∧ Y, Φ ← Z
• …
• SVMs, e.g., “If 5x1 + 9x2 - 3x3 > 12 then +”

Feature Space

If examples are described in terms of values of features, they can be plotted as points in an N-dimensional space.

[Figure: a 3-dimensional feature space with axes Size, Color, and Weight; a new example “?” lies at (Size = Big, Color = Gray, Weight = 2500)]

A “concept” is then a (possibly disjoint) volume in this space.

Supervised Learning = Learning from Labeled Examples
• Most common & successful form of ML

[Venn diagram: + and - examples plotted as points in a feature space]

• Examples – points in multi-dimensional “feature space”
• Concepts – “function” that labels points in feature space
• (as +, -, and possibly ?)
Brief Review

• Instances
• Conjunctive Concept (“and”): Color(?obj1, red) ∧ Size(?obj1, large)
• Disjunctive Concept (“or”): Color(?obj2, blue) ∨ Size(?obj2, small)

Empirical Learning and Venn Diagrams

Concept = A or B (a disjunctive concept)

Examples = labeled points in feature space

Concept = a label for a set of points

[Venn diagram: + and - points scattered across the feature space, with two regions A and B each enclosing a cluster of + points]

Aspects of an ML System
• “Language” for representing examples
• “Language” for representing “Concepts”
• Technique for producing a concept “consistent” with the training examples
• Technique for classifying new instances

Each of these limits the expressiveness/efficiency of the supervised learning algorithm.

(Representing examples is the focus of HW 0; the remaining aspects are covered by the later HWs.)

Nearest-Neighbor Algorithms

(aka. Exemplar models, instance-based learning (IBL), case-based learning)

• Learning ≈ memorize training examples
• Problem solving = find most similar example in memory; output its category

[Figure: + and - training examples with a query point “?”; the boundaries between nearest-neighbor regions form “Voronoi diagrams” (pg 233)]

Sample Experimental Results

Simple algorithm works quite well!

Simple Example – 1-NN

(1-NN ≡ one nearest neighbor)

Training Set

• Ex 1: a=0, b=0, c=1 → +
• Ex 2: a=0, b=1, c=1 → -
• Ex 3: a=1, b=1, c=1 → -

Test Example

• a=0, b=1, c=0 → ?
• “Hamming distance” to each training example:
• Ex 1 = 2
• Ex 2 = 1
• Ex 3 = 2

So output -
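The same calculation as a small Python sketch (the helper names are ours, not from the lecture):

```python
# 1-NN with Hamming distance on the toy training set above.

training_set = [
    ((0, 0, 1), "+"),   # Ex 1
    ((0, 1, 1), "-"),   # Ex 2
    ((1, 1, 1), "-"),   # Ex 3
]

def hamming(x, y):
    """Number of features on which two examples disagree."""
    return sum(a != b for a, b in zip(x, y))

def one_nn(test, examples):
    """Label of the single closest stored example."""
    _, label = min((hamming(test, x), lab) for x, lab in examples)
    return label

print(one_nn((0, 1, 0), training_set))   # distances 2, 1, 2 -> output "-"
```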

K-NN Algorithm

Collect K nearest neighbors, select majority classification (or somehow combine their classes)

• What should K be?
• It is probably problem dependent
• Can use tuning sets (later) to select a good setting for K

Shouldn’t really “connect the dots” (Why?)
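A hedged sketch of the K-NN procedure just described (majority vote over the K closest training examples); the function name and data layout are illustrative:

```python
# K-NN: sort stored examples by distance to the test point, vote over the top K.
from collections import Counter

def knn_classify(test, examples, k, distance):
    """examples: list of (feature_vector, label); returns the majority label."""
    neighbors = sorted(examples, key=lambda ex: distance(test, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# e.g., with the earlier sketch: knn_classify((0, 1, 0), training_set, 3, hamming) -> "-"
```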

Tuning Set

[Plot: tuning-set error rate as a function of K, for K = 1, 2, 3, 4, 5]
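One way the tuning-set idea might look in code, reusing the hypothetical knn_classify sketch above; the candidate K values are just the ones on the plot:

```python
def tuning_error(k, train, tune, distance):
    """Fraction of tuning-set examples that K-NN built on train misclassifies."""
    wrong = sum(knn_classify(x, train, k, distance) != label for x, label in tune)
    return wrong / len(tune)

def choose_k(train, tune, distance, candidates=(1, 2, 3, 4, 5)):
    """Pick the K with the lowest tuning-set error rate."""
    return min(candidates, key=lambda k: tuning_error(k, train, tune, distance))
```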

Some Common Jargon
• Classification
• Learning a discrete-valued function
• Regression
• Learning a real-valued function

IBL is easily extended to regression tasks (and to multi-category classification), i.e., to real-valued as well as discrete outputs.
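A sketch of the regression extension: replace the majority vote with an average of the neighbors' real-valued outputs (names are illustrative):

```python
def knn_regress(test, examples, k, distance):
    """examples: list of (feature_vector, real_valued_output); average the K nearest."""
    neighbors = sorted(examples, key=lambda ex: distance(test, ex[0]))[:k]
    return sum(y for _, y in neighbors) / len(neighbors)
```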

Variations on a Theme

(From Aha, Kibler and Albert in ML Journal)

• IB1 – keep all examples
• IB2 – keep the next instance only if it is incorrectly classified by the previously stored instances
• Uses less storage
• Order dependent
• Sensitive to noisy data
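A rough sketch of the IB2 idea (keep an instance only when the instances stored so far would misclassify it); this is just the core loop, not the full algorithm from the paper, and it reuses the one_nn sketch from earlier:

```python
def ib2(stream):
    """stream: iterable of (feature_vector, label), processed in arrival order."""
    stored = []
    for x, label in stream:
        # keep the instance only if the currently stored instances misclassify it
        if not stored or one_nn(x, stored) != label:
            stored.append((x, label))
    return stored
```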
Variations on a Theme (cont.)
• IB3 – extends IB2 to more intelligently decide which examples to keep (see the article)
• Better handling of noisy data
• Another idea – cluster the examples into groups and keep a representative “example” from each (median/centroid)
Next time
• Finish K-NN
• Begin linear separators
• Naïve Bayes