By **jaden**

## Machine Learning and Review

### Machine Learning and Review

Reading: C. 18

Bayesian Approach

- Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright
- Prior knowledge can be combined with observed data to determine the probability of a hypothesis
- Bayesian methods can accommodate hypotheses that make probabilistic predictions
- New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities

Applying Bayes Theorem

- Best hypothesis = most probable hypothesis
- Maximum a posteriori (MAP) hypothesis
- Variables
- h = hypothesis
- D = data
- Prior probability
- h: P(h)
- training data observed: P(D)
- P(D|h) = probability of observing data D given some world where hypothesis holds
- Bayes theorem:
- $P(h \mid D) = \dfrac{P(D \mid h)\,P(h)}{P(D)}$

Defining the MAP hypothesis

- $h_{MAP} = \arg\max_{h \in H} P(h \mid D)$
- $h_{MAP} = \arg\max_{h \in H} \dfrac{P(D \mid h)\,P(h)}{P(D)}$ (using Bayes theorem)
- $h_{MAP} = \arg\max_{h \in H} P(D \mid h)\,P(h)$ (P(D) is a constant independent of h)
- $h_{MAP} = \arg\max_{h \in H} P(D \mid h)$ (when we can assume that every hypothesis h is equally probable)
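
The selection rule above can be sketched in a few lines of Python. This is only an illustration: the function name `map_hypothesis`, the three hypotheses, and all the numbers are assumptions made up for the example, not values from the slides.

```python
def map_hypothesis(priors, likelihoods):
    """Return (h_MAP, posteriors) for dicts keyed by hypothesis name.

    priors[h]      = P(h)
    likelihoods[h] = P(D | h)
    """
    # Unnormalized posterior P(D|h) * P(h); P(D) is the same for every h,
    # so it does not affect which hypothesis wins the argmax.
    scores = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(scores.values())  # P(D), by the law of total probability
    posteriors = {h: s / evidence for h, s in scores.items()}
    h_map = max(scores, key=scores.get)
    return h_map, posteriors


# Hypothetical numbers, chosen only for illustration.
priors = {"h1": 0.7, "h2": 0.2, "h3": 0.1}
likelihoods = {"h1": 0.1, "h2": 0.3, "h3": 0.4}

h_map, posteriors = map_hypothesis(priors, likelihoods)
h_ml = max(likelihoods, key=likelihoods.get)  # argmax of P(D|h) alone
```

With these made-up numbers the MAP hypothesis is `h1` while the hypothesis maximizing P(D|h) alone is `h3`, which shows why the last simplification is only valid when the priors are equal.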

Bayes Optimal Classifier

- The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities
- Possible classifications: $v_j \in V$
- $\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)$

Example

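- V = {p, n}
- P(h1|D) = .4, P(p|h1) = 0, P(n|h1) = 1
- P(h2|D) = .3, P(p|h2) = 1, P(n|h2) = 0
- P(h3|D) = .3, P(p|h3) = 1, P(n|h3) = 0
- $\sum_{h_i \in H} P(n \mid h_i)\,P(h_i \mid D) = .4$
- $\sum_{h_i \in H} P(p \mid h_i)\,P(h_i \mid D) = .6$
- $\arg\max_{v_j \in \{p, n\}} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) = p$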
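
The example can be checked with a short Python sketch. The function name `bayes_optimal` and the dictionary layout are assumptions for illustration; the probabilities are the ones given in the example.

```python
def bayes_optimal(posteriors, predictions, labels):
    """Pick the label v maximizing sum_i P(v | h_i) * P(h_i | D)."""
    totals = {
        v: sum(predictions[h][v] * posteriors[h] for h in posteriors)
        for v in labels
    }
    return max(totals, key=totals.get), totals


# Values from the example: P(h_i | D) and P(v | h_i).
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"p": 0.0, "n": 1.0},
    "h2": {"p": 1.0, "n": 0.0},
    "h3": {"p": 1.0, "n": 0.0},
}

label, totals = bayes_optimal(posteriors, predictions, ["p", "n"])
```

Note that the single most probable hypothesis, h1, predicts n, yet the weighted vote over all hypotheses selects p.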

Properties of Bayesian Approach

- Bayesian learning is optimal
- Easy to estimate P(h) by counting in training data
- Estimating P(D|h) not feasible
- Why?

Naïve Bayes

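- Assume the attributes are conditionally independent given the classification
- D = $a_1, a_2, \ldots, a_n$
- $P(a_1, a_2, \ldots, a_n \mid v_j) = \prod_i P(a_i \mid v_j)$
- Substitute into the $v_{MAP}$ formula
- $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)$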
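
A minimal sketch of this classifier for categorical attributes, estimating every probability by simple counting. The function names and the tiny weather-style data set are hypothetical and exist only to make the example runnable; no smoothing is applied here.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(i, v)][a] = number of class-v rows whose attribute i equals a
    value_counts = defaultdict(Counter)
    for row, v in zip(rows, labels):
        for i, a in enumerate(row):
            value_counts[(i, v)][a] += 1
    return class_counts, value_counts


def classify(row, class_counts, value_counts):
    """Return argmax over v of P(v) * prod_i P(a_i | v)."""
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for v, n_v in class_counts.items():
        score = n_v / n  # P(v_j)
        for i, a in enumerate(row):
            score *= value_counts[(i, v)][a] / n_v  # P(a_i | v_j) by counting
        if score > best_score:
            best_label, best_score = v, score
    return best_label


# Hypothetical training data: (outlook, temperature) -> class.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "cool")]
labels = ["n", "n", "p", "p", "p"]

model = train_naive_bayes(rows, labels)
prediction_a = classify(("rainy", "cool"), *model)
prediction_b = classify(("sunny", "hot"), *model)
```

In this toy data set "rainy" never occurs with class n, so the count-based estimate of P(rainy|n) is 0 and the whole product for n collapses to 0.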

Estimating Probabilities

- What happens when the number of data elements is small?
- Suppose the true P(S-length=high|Virginica) = .05
- There are only 2 instances with C=Virginica
- We estimate the probability by n_c/n, i.e. #(S-length=high and C=Virginica) / #(C=Virginica)
- With only 2 instances, #(S-length=high and C=Virginica) will most likely be 0
- Then, instead of .05, we use an estimated probability of 0
- Two problems
- Biased underestimate of the probability
- This probability term will dominate: a single zero factor makes the whole Naïve Bayes product zero

Instead

- Use priors as well
- $\dfrac{n_c + m\,p}{n + m}$
- Where p = prior estimate of the probability
- m is a constant called the equivalent sample size
- Determines how heavily to weight p relative to the observed data
- Typical method: assume a uniform prior (p = 1/k when the attribute has k possible values)
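
As a rough illustration of the formula, the sketch below applies it to the two-instance case. The function name `m_estimate`, the choice of k = 3 attribute values (so p = 1/3), and m = 3 are assumptions for the example only.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# Two instances of the class, none with the attribute value of interest.
raw = m_estimate(n_c=0, n=2, p=1 / 3, m=0)        # m = 0 reduces to n_c / n
smoothed = m_estimate(n_c=0, n=2, p=1 / 3, m=3)   # assumed uniform prior, k = 3
```

Here `raw` is 0.0 while `smoothed` is 0.2, so the term no longer zeroes out the product. A larger m pulls the estimate further toward p.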

Benefits of Naïve Bayes

- Practical
- As effective as, and in some cases more effective than, other machine learning methods

Review for Midterm

- Concepts you should know
- Search algorithms
- Depth-first, breadth-first, iterative deepening, A*, greedy, hill-climbing, beam
- Constraint propagation
- Game playing
- Bayesian Nets
- A little on machine learning

Midterm format

- Multiple choice
- Short answer questions
- Problem solving
- Essay
- An example midterm will be posted under links

Concepts

- Any words in yellow or light blue or pink on slides

Uninformed Search

- Depth-first
- Breadth-first
- Iterative Deepening

Formulating Problems as Search

Given an initial state and a goal, find the sequence of actions leading through a sequence of states to the final goal state.

Terms:

- Successor function: given a state, returns the set of {action, successor} pairs reachable from it
- State space: the set of all states reachable from the initial state
- Path: a sequence of states connected by actions
- Goal test: is a given state the goal state?
- Path cost: function assigning a numeric cost to each path
- Solution: a path from initial state to goal state

Breadth first

- OPEN = start node; CLOSED = empty
- While OPEN is not empty do
- Remove leftmost state from OPEN, call it X
- If X = goal state, return success
- Put X on CLOSED
- SUCCESSORS = Successor function (X)
- Remove any successors on OPEN or CLOSED
- Put remaining successors on right end of OPEN
- End while
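
The OPEN/CLOSED loop above can be sketched in Python as follows. The function name, the `parent` bookkeeping used to rebuild the path, and the small graph are assumptions added for illustration. The `parent` dictionary holds every state already placed on OPEN or CLOSED, so one membership test covers both lists.

```python
from collections import deque


def breadth_first_search(start, is_goal, successors):
    """Return a list of states from start to a goal, or None."""
    open_list = deque([start])  # OPEN, used as a FIFO queue
    parent = {start: None}      # every state ever put on OPEN or CLOSED
    while open_list:
        x = open_list.popleft()             # remove leftmost state
        if is_goal(x):
            path = []
            while x is not None:            # walk parent links back to start
                path.append(x)
                x = parent[x]
            return path[::-1]
        for s in successors(x):
            if s not in parent:             # skip states on OPEN or CLOSED
                parent[s] = x
                open_list.append(s)         # right end of OPEN
    return None


# Hypothetical state space.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}

path = breadth_first_search("A", lambda s: s == "F", lambda s: graph[s])
```

Replacing `open_list.append(s)` with `open_list.appendleft(s)` turns the frontier into a stack, which gives the depth-first variant of the same loop.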

Depth-first

- OPEN = start node; CLOSED = empty
- While OPEN is not empty do
- Remove leftmost state from OPEN, call it X
- If X = goal state, return success
- Put X on CLOSED
- SUCCESSORS = Successor function (X)
- Remove any successors on OPEN or CLOSED
- Put remaining successors on left end of OPEN
- End while

Can we combine benefits of both?

- Depth limited
- Select some limit in depth to explore the problem using DFS
- How do we select the limit?
- Iterative deepening
- DFS with depth limit 1
- DFS with depth limit 2, and so on up to depth d
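
A minimal sketch of depth-limited search wrapped in an iterative-deepening loop. The function names, the `max_depth` cutoff, and the example graph are assumptions for illustration. The loop here starts at limit 0 (test the start state only) rather than 1.

```python
def depth_limited_search(state, is_goal, successors, limit, path=None):
    """DFS that never goes more than `limit` steps below `state`."""
    path = (path or []) + [state]
    if is_goal(state):
        return path
    if limit == 0:
        return None
    for s in successors(state):
        if s in path:  # avoid cycles along the current path
            continue
        found = depth_limited_search(s, is_goal, successors, limit - 1, path)
        if found is not None:
            return found
    return None


def iterative_deepening_search(start, is_goal, successors, max_depth):
    """Run depth-limited search with limits 0, 1, ..., max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited_search(start, is_goal, successors, limit)
        if found is not None:
            return found, limit
    return None, None


# Hypothetical state space.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}

path, depth = iterative_deepening_search(
    "A", lambda s: s == "F", lambda s: graph[s], max_depth=5)
```

Each iteration only stores the current path, so memory use stays close to that of DFS, while the increasing limit means the shallowest goal is found first, as in BFS.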

Complexity Analysis

- Completeness: is the algorithm guaranteed to find a solution when there is one?
- Optimality: Does the strategy find the optimal solution?
- Time: How long does it take to find a solution?
- Space: How much memory is needed to perform the search?

Is this notion of completeness the same as completeness in logic?

Cost variables

- Time: number of nodes generated
- Space: maximum number of nodes stored in memory
- Branching factor: b
- Maximum number of successors of any node
- Depth: d
- Depth of shallowest goal node
- Path length: m
- Maximum length of any path in the state space

Informed Search

- Best-first
- A*
- Greedy
- Hill climbing
- Variants
- Randomness, simulated annealing, local beam search
- Online search will not be on midterm

Greedy Search

- OPEN = start node; CLOSED = empty
- While OPEN is not empty do
- Remove leftmost state from OPEN, call it X
- If X = goal state, return success
- Put X on CLOSED
- SUCCESSORS = Successor function (X)
- Remove any successors on OPEN or CLOSED
- Compute heuristic function for each node
- Put remaining successors on either end of OPEN
- Sort nodes on OPEN by value of heuristic function
- End while
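
One way to sketch this loop in Python is shown below. Instead of re-sorting OPEN after every expansion, this version keeps OPEN in a priority queue (`heapq`), which yields the same ordering by heuristic value. The function name, the graph, and the heuristic values are hypothetical.

```python
import heapq


def greedy_best_first_search(start, is_goal, successors, h):
    """Always expand the OPEN node with the smallest heuristic value h."""
    open_heap = [(h(start), start)]  # OPEN ordered by h
    parent = {start: None}           # states already on OPEN or CLOSED
    while open_heap:
        _, x = heapq.heappop(open_heap)
        if is_goal(x):
            path = []
            while x is not None:
                path.append(x)
                x = parent[x]
            return path[::-1]
        for s in successors(x):
            if s not in parent:
                parent[s] = x
                heapq.heappush(open_heap, (h(s), s))
    return None


# Hypothetical state space and heuristic estimates.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}
h_values = {"A": 3, "B": 2, "C": 1, "D": 1, "E": 2, "F": 0}

path = greedy_best_first_search(
    "A", lambda s: s == "F", lambda s: graph[s], lambda s: h_values[s])
```

Because only h is consulted, the path found depends entirely on the heuristic and is not guaranteed to be the cheapest one.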

A* Search

- Try to expand node that is on least cost path to goal
- Evaluation function = f(n)
- f(n)=g(n)+h(n)
- h(n) is heuristic function: cost from node to goal
- g(n) is cost from initial state to node
- f(n) is the estimated cost of cheapest solution that passes through n
- If h(n) never overestimates the true cost to the goal (i.e., h is admissible)
- A* is complete
- A* is optimal
- A* is optimally efficient: no other algorithm using h(n) is guaranteed to expand fewer states
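
A compact sketch of A* using f(n) = g(n) + h(n) as the priority. The function name, the weighted graph, and the heuristic table are assumptions made up for the example; the heuristic was chosen so that it never overestimates the true remaining cost.

```python
import heapq


def a_star_search(start, goal, neighbors, h):
    """Return (path, cost); neighbors(n) yields (successor, step_cost) pairs."""
    open_heap = [(h(start), 0, start)]  # entries are (f, g, state)
    best_g = {start: 0}                 # cheapest known cost from start
    parent = {start: None}
    while open_heap:
        f, g, n = heapq.heappop(open_heap)
        if g > best_g[n]:
            continue                    # stale entry; a cheaper route was found
        if n == goal:
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], g
        for s, cost in neighbors(n):
            g_s = g + cost
            if g_s < best_g.get(s, float("inf")):
                best_g[s] = g_s
                parent[s] = n
                heapq.heappush(open_heap, (g_s + h(s), g_s, s))
    return None, float("inf")


# Hypothetical weighted graph and admissible heuristic.
graph = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)],
         "B": [("G", 2)], "G": []}
h_values = {"S": 4, "A": 3, "B": 2, "G": 0}

path, cost = a_star_search("S", "G", lambda n: graph[n], lambda n: h_values[n])
```

The direct edge A to G (cost 6) is generated but never chosen, because the route through B has a lower f value.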

Admissible heuristics

- A heuristic that never overestimates the cost to the goal
- h1 and h2 are admissible heuristics
- Consistency: the estimated cost of reaching the goal from n is no greater than the step cost of getting to n' plus the estimated cost to the goal from n'
- $h(n) \le c(n, a, n') + h(n')$
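
The consistency condition can be tested edge by edge. The sketch below is illustrative only: the function name `is_consistent`, the edge list, and both heuristic tables are hypothetical.

```python
def is_consistent(edges, h):
    """Check h(n) <= c(n, a, n') + h(n') for every edge (n, n', cost)."""
    return all(h[n] <= cost + h[n2] for n, n2, cost in edges)


# Hypothetical edges (n, n', step cost) and two candidate heuristics.
edges = [("S", "A", 1), ("S", "B", 4), ("A", "B", 2),
         ("A", "G", 6), ("B", "G", 2)]
h_good = {"S": 4, "A": 3, "B": 2, "G": 0}
h_bad = {"S": 4, "A": 5, "B": 2, "G": 0}   # h(A) = 5 > c(A, B) + h(B) = 4

good_ok = is_consistent(edges, h_good)
bad_ok = is_consistent(edges, h_bad)
```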

Local Search Algorithms

- Operate using a single current state
- Move only to neighbors of the state
- Paths followed by search are not retained
- Iterative improvement
- Keep a single current state and try to improve it

Problems for hill climbing

When a higher value of the heuristic function is better, we search for maxima (objective functions); when a lower value is better, we search for minima (cost functions).

- Local maxima: A local maximum is a peak that is higher than each of its neighboring states, but lower than the global maximum
- Ridges: a sequence of local maxima
- Plateaux: an area of the state space landscape where the evaluation function is flat

Some solutions

- Stochastic hill-climbing
- Choose at random from among the uphill moves
- First-choice hill climbing
- Generates successors randomly until one is generated that is better than current state
- Random-restart hill climbing
- Keep restarting from randomly generated initial states, stopping when goal is found
- Simulated annealing
- Generate a random move. Accept if improvement. Otherwise accept with continually decreasing probability.
- Local beam search
- Keep track of k states rather than just 1
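
As one example of these variants, here is a minimal simulated annealing sketch for a maximization problem. The function signature, the acceptance rule exp(delta / T), the geometric cooling schedule, and the toy objective are all assumptions chosen for illustration.

```python
import math
import random


def simulated_annealing(start, neighbors, value, schedule, steps, rng):
    """Maximize `value`; worse moves are accepted with probability exp(delta/T)."""
    current = start
    best = current
    for t in range(steps):
        temperature = schedule(t)
        if temperature <= 0:
            break
        candidate = rng.choice(neighbors(current))  # generate a random move
        delta = value(candidate) - value(current)
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if delta > 0 or rng.random() < math.exp(delta / temperature):
            current = candidate
        if value(current) > value(best):
            best = current
    return best


# Hypothetical problem: maximize -(x - 7)^2 over the integers 0..20.
def value(x):
    return -(x - 7) ** 2


def neighbors(x):
    return [max(0, x - 1), min(20, x + 1)]


rng = random.Random(0)  # seeded for repeatability
best = simulated_annealing(0, neighbors, value,
                           schedule=lambda t: 10 * 0.95 ** t,
                           steps=500, rng=rng)
```

Setting the temperature very low from the start makes this behave like stochastic hill climbing, since worse moves are then almost never accepted.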

CSP algorithm

Depth-first search often used

- Initial state: the empty assignment {}; all variables are unassigned
- Successor fn: assign a value to any unassigned variable, provided it does not conflict with the constraints
- All CSP search algorithms generate successors by considering possible assignments for only a single variable at each node in the search tree
- Goal test: the current assignment is complete
- Path cost: a constant cost for every step

Local search

- Complete-state formulation
- Every state is a complete assignment that might or might not satisfy the constraints
- Hill-climbing methods are appropriate

General purpose methods for efficient implementation

- Which variable should be assigned next?
- In what order should its values be tried?
- Can we detect inevitable failure early?
- Can we take advantage of problem structure?

Order

- Choose the most constrained variable first
- The variable with the fewest remaining values
- Minimum Remaining Values (MRV) heuristic
- What if there are >1?
- Tie breaker: Most constraining variable
- Choose the variable with the most constraints on remaining variables

Order on value choice

- Given a variable, choose the least constraining value
- The value that rules out the fewest values in the remaining variables

Forward Checking

- Keep track of remaining legal values for unassigned variables
- Terminate search when any variable has no legal values
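
The pieces above (depth-first assignment, MRV variable ordering, forward checking) can be combined in a short sketch. It assumes a CSP in which every constraint says that two neighboring variables must take different values, as in map coloring. The function name and the map used in the example are illustrative assumptions.

```python
def backtracking_search(domains, neighbors, assignment=None):
    """Depth-first CSP search with MRV ordering and forward checking.

    Assumes every constraint is "neighboring variables differ".
    """
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment                      # goal test: assignment complete
    unassigned = [v for v in domains if v not in assignment]
    # MRV: pick the variable with the fewest remaining legal values.
    var = min(unassigned, key=lambda v: len(domains[v]))
    for value in domains[var]:
        pruned = dict(domains)
        pruned[var] = [value]
        # Forward checking: remove `value` from unassigned neighbors.
        for other in neighbors[var]:
            if other not in assignment:
                pruned[other] = [x for x in domains[other] if x != value]
        # Abandon this value if some variable has no legal values left.
        if all(pruned[v] for v in unassigned):
            result = backtracking_search(
                pruned, neighbors, {**assignment, var: value})
            if result is not None:
                return result
    return None


# Hypothetical map-coloring instance with three colors.
neighbors = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
domains = {v: ["red", "green", "blue"] for v in neighbors}

solution = backtracking_search(domains, neighbors)
```

This sketch breaks MRV ties by dictionary order and tries values in list order; the most-constraining-variable tie breaker and least-constraining-value ordering are left out to keep it short.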

Game Playing

- Minimax
- Alpha-beta pruning
- Evaluation function (what is the difference between a cost function, a utility function, a heuristic function, an evaluation function?)

Bayesian nets

- Example problem
