# Machine Learning and Review



### Machine Learning and Review

Reading: Ch. 18

Bayesian Approach
• Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright
• Prior knowledge can be combined with observed data to determine the probability of a hypothesis
• Bayesian methods can accommodate hypotheses that make probabilistic predictions
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
Applying Bayes Theorem
• Best hypothesis = most probable hypothesis
• Maximum a posteriori (MAP) hypothesis
• Variables
• h = hypothesis
• D = data
• Prior probability of hypothesis h: P(h)
• Prior probability that training data D is observed: P(D)
• P(D|h) = probability of observing data D in a world where hypothesis h holds
• Bayes theorem:
• P(h|D) = P(D|h) · P(h) / P(D)
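As a quick worked instance with illustrative numbers (these values are assumptions, not from the slides): if P(h) = 0.3, P(D|h) = 0.8, and P(D) = 0.5, then P(h|D) = (0.8 × 0.3) / 0.5 = 0.48.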
Defining the MAP hypothesis
• h_MAP = argmax_{h∈H} P(h|D)
• h_MAP = argmax_{h∈H} P(D|h) · P(h) / P(D)   (using Bayes theorem)
• h_MAP = argmax_{h∈H} P(D|h) · P(h)   (P(D) is a constant independent of h)
• h_MAP = argmax_{h∈H} P(D|h)   (when we can assume each hypothesis h is equally probable)
Bayes Optimal Classifier
• The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
• Possible classifications: vj ∈ V
• argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) · P(hi|D)
Example
• V = {p, n}
• P(h1|D) = 0.4   P(p|h1) = 0   P(n|h1) = 1
• P(h2|D) = 0.3   P(p|h2) = 1   P(n|h2) = 0
• P(h3|D) = 0.3   P(p|h3) = 1   P(n|h3) = 0
• Σ_{hi∈H} P(n|hi) · P(hi|D) = 0.4
• Σ_{hi∈H} P(p|hi) · P(hi|D) = 0.6
• argmax_{vj∈{p,n}} Σ_{hi∈H} P(vj|hi) · P(hi|D) = p
• Note that the MAP hypothesis h1 predicts n, yet the Bayes optimal classification is p
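A small Python sketch reproducing this computation (the dictionary layout and names are my own illustration, not from the slides):

```python
# Posterior probability of each hypothesis given the data D (from the slide).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}

# P(v|h): each hypothesis's prediction for each class in V = {p, n}.
predictions = {
    "h1": {"p": 0.0, "n": 1.0},
    "h2": {"p": 1.0, "n": 0.0},
    "h3": {"p": 1.0, "n": 0.0},
}

def bayes_optimal(classes, posteriors, predictions):
    """Return the class v maximizing sum over h of P(v|h) * P(h|D)."""
    scores = {v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
              for v in classes}
    return max(scores, key=scores.get), scores

best, scores = bayes_optimal(["p", "n"], posteriors, predictions)
print(best, scores)  # p {'p': 0.6, 'n': 0.4}
```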

Properties of Bayesian Approach
• Bayesian learning is optimal
• Easy to estimate P(h) by counting in training data
• Estimating P(D|h) is not feasible
• Why? D is a conjunction of many attribute values, so there are exponentially many combinations, and we would need to observe each one many times to estimate its probability reliably
Naïve Bayes
• Assume the attributes are conditionally independent given the classification
• D = a1, a2, …, an
• P(a1, a2, …, an|vj) = ∏_i P(ai|vj)
• Substitute into the v_MAP formula
• v_NB = argmax_{vj∈V} P(vj) · ∏_i P(ai|vj)
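Below is a minimal, illustrative Naïve Bayes implementation over categorical attributes; the data layout, function names, and log-space trick are my additions, not from the slides:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs.
    Returns class priors P(v) and counts for estimating P(a_i | v)."""
    class_counts = Counter(label for _, label in examples)
    # cond[v][i][a] = number of class-v examples whose i-th attribute is a
    cond = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for i, a in enumerate(attrs):
            cond[label][i][a] += 1
    n = len(examples)
    priors = {v: c / n for v, c in class_counts.items()}
    return priors, cond, class_counts

def classify(attrs, priors, cond, class_counts):
    """v_NB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
    best, best_score = None, -math.inf
    for v, prior in priors.items():
        score = math.log(prior)
        for i, a in enumerate(attrs):
            p = cond[v][i][a] / class_counts[v]  # raw estimate n_c / n
            score += math.log(p) if p > 0 else -math.inf
        if score > best_score:
            best, best_score = v, score
    return best
```

A single unseen attribute value drives a class's score to −∞; that is precisely the zero-probability problem the next slide's m-estimate addresses.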
Estimating Probabilities
• What happens when the number of data elements is small?
• Suppose the true P(S-length=high|virginica) = .05
• There are only 2 instances with C=virginica
• We estimate the probability by n_c/n, i.e., #(S-length=high ∧ C=virginica) / #(C=virginica)
• With so little data, the count n_c will almost certainly be 0
• Then, instead of .05, we use an estimated probability of 0
• Two problems
• Biased underestimate of the probability
• This zero probability term will dominate: the whole Naïve Bayes product becomes 0
• Solution: use priors as well (the m-estimate)
• (n_c + m·p) / (n + m)
• Where p = the prior estimate of the probability
• m is a constant called the equivalent sample size
• Determines how heavily to weight p relative to the observed data
• Typical method: assume a uniform prior (e.g., p = 1/k for an attribute with k possible values)
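A tiny helper capturing this formula (the parameter names and the example values are my own):

```python
def m_estimate(n_c, n, p, m):
    """Smoothed probability estimate (n_c + m*p) / (n + m).

    n_c: count of examples in the class with the attribute value
    n:   total examples in the class
    p:   prior estimate (e.g., 1/k for k possible attribute values)
    m:   equivalent sample size, weighting p against observed data
    """
    return (n_c + m * p) / (n + m)

# With the slide's scenario: n_c = 0, n = 2, a uniform prior p = 0.5
# over an assumed 2-valued attribute, and m = 10:
print(m_estimate(0, 2, 0.5, 10))  # 0.4166... instead of a hard 0
```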
Benefits of Naïve Bayes
• Practical
• As effective as, and in some cases more effective than, other machine learning methods
Review for Midterm
• Concepts you should know
• Search algorithms
• Depth-first, breadth-first, iterative deepening, A*, greedy, hill-climbing, beam
• Constraint propagation
• Game playing
• Bayesian Nets
• A little on machine learning
Midterm format
• Multiple choice
• Problem solving
• Essay
• An example midterm will be posted under links
Concepts
• Any words in yellow or light blue or pink on slides
Uninformed Search
• Breadth-first
• Depth-first
• Iterative deepening
Formulating Problems as Search

Given an initial state and a goal, find the sequence of actions leading through a sequence of states to the final goal state.

Terms:

• Successor function: given a state, returns the set of ⟨action, successor state⟩ pairs reachable from it
• State space: the set of all states reachable from the initial state
• Path: a sequence of states connected by actions
• Goal test: is a given state the goal state?
• Path cost: function assigning a numeric cost to each path
• Solution: a path from initial state to goal state
Breadth-first
• OPEN = start node; CLOSED = empty
• While OPEN is not empty do
• Remove leftmost state from OPEN, call it X
• If X = goal state, return success
• Put X on CLOSED
• SUCCESSORS = Successor function (X)
• Remove any successors on OPEN or CLOSED
• Put remaining successors on right end of OPEN
• End while
Depth-first
• OPEN = start node; CLOSED = empty
• While OPEN is not empty do
• Remove leftmost state from OPEN, call it X
• If X = goal state, return success
• Put X on CLOSED
• SUCCESSORS = Successor function (X)
• Remove any successors on OPEN or CLOSED
• Put remaining successors on left end of OPEN
• End while
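The two blocks above differ only in which end of OPEN receives new successors. A runnable sketch of that shared scheme (the graph encoding and function names are illustrative assumptions):

```python
from collections import deque

def search(start, goal, successors, depth_first=False):
    """Generic OPEN/CLOSED list search.

    successors(state) -> iterable of successor states.
    Breadth-first appends successors to the right of OPEN (a queue);
    depth-first pushes them on the left (a stack).
    """
    open_list, closed = deque([start]), set()
    while open_list:
        x = open_list.popleft()          # remove leftmost state from OPEN
        if x == goal:
            return True                  # success
        closed.add(x)
        succs = [s for s in successors(x)
                 if s not in closed and s not in open_list]
        if depth_first:
            open_list.extendleft(succs)  # left end of OPEN (order reversed,
        else:                            # which is fine for a sketch)
            open_list.extend(succs)      # right end of OPEN
    return False

# Example on a small assumed graph:
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(search("A", "D", lambda s: graph[s]))                    # True (BFS)
print(search("A", "D", lambda s: graph[s], depth_first=True))  # True (DFS)
```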
Can we combine benefits of both?
• Depth limited
• Select some limit in depth to explore the problem using DFS
• How do we select the limit?
• Iterative deepening
• DFS with depth limit 1
• Then DFS with depth limit 2, and so on, up to depth d
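A minimal recursive sketch of iterative deepening; it assumes an acyclic successor structure (it does no cycle checking), and the names are mine:

```python
def depth_limited(state, goal, successors, limit):
    """DFS that stops expanding below the depth limit."""
    if state == goal:
        return True
    if limit == 0:
        return False
    return any(depth_limited(s, goal, successors, limit - 1)
               for s in successors(state))

def iterative_deepening(start, goal, successors, max_depth=50):
    """Run depth-limited DFS with limits 1, 2, ... until the goal is found."""
    return any(depth_limited(start, goal, successors, limit)
               for limit in range(1, max_depth + 1))
```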
Complexity Analysis
• Completeness: is the algorithm guaranteed to find a solution when there is one?
• Optimality: Does the strategy find the optimal solution?
• Time: How long does it take to find a solution?
• Space: How much memory is needed to perform the search?

Is this notion of completeness the same as completeness in logic?

Cost variables
• Time: number of nodes generated
• Space: maximum number of nodes stored in memory
• Branching factor: b
• Maximum number of successors of any node
• Depth: d
• Depth of shallowest goal node
• Path length: m
• Maximum length of any path in the state space
Informed Search
• Best-first
• A*
• Greedy
• Hill climbing
• Variants
• Randomness, simulated annealing, local beam search
• Online search will not be on midterm
Greedy Search
• OPEN = start node; CLOSED = empty
• While OPEN is not empty do
• Remove leftmost state from OPEN, call it X
• If X = goal state, return success
• Put X on CLOSED
• SUCCESSORS = Successor function (X)
• Remove any successors on OPEN or CLOSED
• Compute heuristic function for each node
• Put remaining successors on either end of OPEN (the position does not matter, since OPEN is sorted next)
• Sort nodes on OPEN by value of heuristic function
• End while
A* Search
• Try to expand node that is on least cost path to goal
• Evaluation function = f(n)
• f(n)=g(n)+h(n)
• h(n) is the heuristic function: estimated cost from node n to the goal
• g(n) is the cost from the initial state to node n
• f(n) is the estimated cost of cheapest solution that passes through n
• If h(n) is an underestimate of true cost to goal
• A* is complete
• A* is optimal
• A* is optimally efficient: no other algorithm using h(n) is guaranteed to expand fewer states
• Admissible heuristic: a heuristic that never overestimates the cost to the goal
• h1 and h2 are admissible heuristics
• Consistency: the estimated cost of reaching the goal from n is no greater than the step cost of getting to a successor n' plus the estimated cost to the goal from n'
• h(n) ≤ c(n,a,n') + h(n')
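A sketch unifying the two informed strategies above: greedy best-first orders OPEN by f(n) = h(n) alone, while A* uses f(n) = g(n) + h(n). The graph encoding and names are my assumptions:

```python
import heapq
from itertools import count

def best_first(start, goal, successors, h, use_g=True):
    """Greedy best-first search (use_g=False, f = h) or A* (use_g=True, f = g + h).

    successors(state) -> iterable of (next_state, step_cost) pairs.
    Returns the path cost g at the goal, or None if no path exists.
    """
    tie = count()  # tie-breaker so heap never compares states directly
    frontier = [(h(start), next(tie), 0, start)]   # (f, tie, g, state)
    best_g = {start: 0}
    while frontier:
        f, _, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        for nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                f2 = g2 + h(nxt) if use_g else h(nxt)
                heapq.heappush(frontier, (f2, next(tie), g2, nxt))
    return None
```

With an admissible h, the A* variant returns an optimal-cost solution; the greedy variant is often faster but not optimal in general.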
Local Search Algorithms
• Operate using a single current state
• Move only to neighbors of the state
• Paths followed by search are not retained
• Iterative improvement
• Keep a single current state and try to improve it
Problems for hill climbing

When higher values of the function are better, we look for maxima (objective functions); when lower values are better, we look for minima (cost functions).

• Local maxima: A local maximum is a peak that is higher than each of its neighboring states, but lower than the global maximum
• Ridges: a sequence of local maxima
• Plateaux: an area of the state space landscape where the evaluation function is flat
Some solutions
• Stochastic hill-climbing
• Choose at random from among the uphill moves
• First-choice hill climbing
• Generates successors randomly until one is generated that is better than current state
• Random-restart hill climbing
• Keep restarting from randomly generated initial states, stopping when goal is found
• Simulated annealing
• Generate a random move. Accept it if it improves the state. Otherwise accept with a continually decreasing probability (a sketch follows this list).
• Local beam search
• Keep track of k states rather than just 1
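A minimal simulated-annealing sketch, assuming a maximization problem and a toy 1-D state space (all names, the schedule, and the example are illustrative):

```python
import math
import random

def simulated_annealing(start, neighbors, value, schedule):
    """Maximize `value`. Accept a random neighbor if it improves;
    otherwise accept with probability exp(delta / T), where the
    temperature T decreases over time, so worse moves become rarer."""
    current = start
    for t in range(1, 10_000):
        T = schedule(t)
        if T <= 0:
            return current
        nxt = random.choice(neighbors(current))
        delta = value(nxt) - value(current)
        if delta > 0 or random.random() < math.exp(delta / T):
            current = nxt
    return current

# Toy example: maximize -(x - 3)^2 over the integers.
best = simulated_annealing(
    0,
    neighbors=lambda x: [x - 1, x + 1],
    value=lambda x: -(x - 3) ** 2,
    schedule=lambda t: 1.0 / t,
)
print(best)  # typically 3 or very close
```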
CSP algorithm

Depth-first search often used

• Initial state: the empty assignment {}; all variables are unassigned
• Successor fn: assign a value to any unassigned variable, provided it does not conflict with the constraints
• All CSP search algorithms generate successors by considering possible assignments for only a single variable at each node in the search tree
• Goal test: the current assignment is complete
• Path cost: a constant cost for every step
Local search
• Complete-state formulation
• Every state is a complete assignment that might or might not satisfy the constraints
• Hill-climbing methods are appropriate
General purpose methods for efficient implementation
• Which variable should be assigned next?
• In what order should its values be tried?
• Can we detect inevitable failure early?
• Can we take advantage of problem structure?
Order
• Choose the most constrained variable first
• The variable with the fewest remaining values
• Minimum Remaining Values (MRV) heuristic
• What if more than one variable ties?
• Tie breaker: most constraining variable
• Choose the variable with the most constraints on remaining variables
Order on value choice
• Given a variable, choose the least constraining value
• The value that rules out the fewest values in the remaining variables
Forward Checking
• Keep track of remaining legal values for unassigned variables
• Terminate search when any variable has no legal values
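Tying the last three slides together, a sketch of backtracking search with the MRV heuristic and forward checking; the `constraints` callback, data layout, and example are my own conventions, not the course's code:

```python
def backtrack(assignment, domains, constraints):
    """Backtracking CSP search with MRV ordering and forward checking.

    domains: dict var -> set of remaining legal values
    constraints(var, val, assignment) -> True if val is consistent
    """
    if all(v in assignment for v in domains):          # goal test: complete
        return assignment
    # MRV: choose the unassigned variable with the fewest remaining values.
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for val in list(domains[var]):
        if not constraints(var, val, assignment):
            continue
        assignment[var] = val
        # Forward checking: prune now-illegal values from unassigned vars.
        pruned, failure = [], False
        for other in domains:
            if other in assignment:
                continue
            for v2 in list(domains[other]):
                if not constraints(other, v2, assignment):
                    domains[other].discard(v2)
                    pruned.append((other, v2))
            if not domains[other]:                     # a domain wiped out:
                failure = True                         # terminate this branch
                break
        if not failure:
            result = backtrack(assignment, domains, constraints)
            if result is not None:
                return result
        for other, v2 in pruned:                       # undo the pruning
            domains[other].add(v2)
        del assignment[var]
    return None

# Tiny map-coloring usage (an assumed example): three mutually adjacent
# regions, so three distinct colors are required.
doms = {"A": {1, 2}, "B": {1, 2}, "C": {1, 2, 3}}
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
def ok(var, val, asg):
    return all(asg.get(n) != val for n in adj[var])
print(backtrack({}, doms, ok))  # {'A': 1, 'B': 2, 'C': 3}
```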
Game Playing
• Minimax
• Alpha-beta pruning
• Evaluation function (what is the difference between a cost function, a utility function, a heuristic function, an evaluation function?)
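For review, a minimal minimax with alpha-beta pruning over an explicit game tree (the list-of-lists tree encoding is an illustrative assumption):

```python
def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    """Minimax value of `node` with alpha-beta pruning.

    A node is either a number (a leaf utility) or a list of child nodes.
    """
    if isinstance(node, (int, float)):
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:
                break          # beta cutoff: MIN would never allow this line
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break              # alpha cutoff: MAX would never allow this line
    return value

# Classic 3-ply example: root value is 3, and the rest of the
# [2, 4, 6] subtree is pruned after the leaf 2 is seen.
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # 3
```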
Bayesian nets
• Example problem