An Introduction to Markov Decision Processes

Bob Givan, Purdue University
Ron Parr, Duke University

Outline

Markov Decision Processes defined (Bob)
• Objective functions
• Policies

Finding Optimal Solutions (Ron)
• Dynamic programming
• Linear programming

Refinements to the basic model (Bob)
• Partial observability
• Factored representations

Stochastic Automata with Utilities
A Markov Decision Process (MDP) model contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action's effects in each state

We assume the Markov property: the effects of an action taken in a state depend only on that state and not on the prior history.
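The four components above can be captured in a small tabular data structure. A minimal sketch in Python; the concrete state names, actions, and numbers here are illustrative assumptions, not part of the tutorial:

```python
# A minimal MDP container: states S, actions A, reward R(s, a),
# and transition model T(s, a) -> distribution over next states.
# All concrete values below are illustrative only.

S = ["s0", "s1"]
A = ["stay", "go"]

# R[(s, a)]: real-valued reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

# T[(s, a)]: probability distribution over next states.  The Markov
# property is built in: the distribution depends only on s and a,
# not on any earlier history.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.4, "s1": 0.6},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# Sanity check: every transition distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T.values())
```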
Representing Actions

Deterministic actions:
• T : S × A → S. For each state and action we specify a new state.

Stochastic actions:
• T : S × A → Prob(S). For each state and action we specify a probability distribution over next states. Represents the distribution P(s′ | s, a).

[Figure: a stochastic action leading from one state to two next states with probabilities 0.6 and 0.4.]
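One way to realize T : S × A → Prob(S) in code is to sample a next state from a stored distribution. A sketch under illustrative assumptions (the `T` table mirrors a 0.6/0.4 split like the slide's diagram; state names and the helper are made up):

```python
import random

# Transition model for one stochastic action (illustrative values).
T = {("s0", "go"): {"s1": 0.6, "s2": 0.4}}

def sample_next_state(T, s, a, rng=random):
    """Draw s' ~ P(s' | s, a) from the tabular model T."""
    dist = T[(s, a)]
    states = list(dist)
    weights = [dist[s2] for s2 in states]
    return rng.choices(states, weights=weights, k=1)[0]

# A deterministic action is just the special case of a point
# distribution, e.g. {"s1": 1.0}.
next_state = sample_next_state(T, "s0", "go")
assert next_state in ("s1", "s2")
```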
Representing Solutions

A policy π is a mapping from S to A.
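For a finite state set, a mapping from S to A is just a lookup table. A minimal sketch with illustrative state and action names:

```python
# A policy maps each state to an action; for finite S a dict suffices.
# The states and actions here are illustrative assumptions.
policy = {"s0": "go", "s1": "stay"}

def pi(s):
    """The policy as a function: pi(s) is the action to take in s."""
    return policy[s]

assert pi("s0") == "go"
```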
Following a Policy

Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Go to step 1

Assumes full observability: the new state resulting from executing an action will be known to the system.
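The three steps above can be sketched as a loop. Sampling from the transition table stands in for executing the action and observing the resulting state (full observability); the model, policy, and names are illustrative assumptions:

```python
import random

# Illustrative two-state model and policy.
T = {
    ("s0", "go"):   {"s0": 0.4, "s1": 0.6},
    ("s1", "stay"): {"s1": 1.0},
}
policy = {"s0": "go", "s1": "stay"}

def follow_policy(policy, T, s, steps, rng=random.Random(0)):
    """Repeat: observe current state s, execute policy(s), observe s'.

    Full observability: the new state is known after every step."""
    trajectory = [s]
    for _ in range(steps):
        a = policy[s]                       # step 2: execute pi(s)
        dist = T[(s, a)]
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append(s)                # step 1 again: new state
    return trajectory

traj = follow_policy(policy, T, "s0", 5)
assert len(traj) == 6 and traj[0] == "s0"
```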
Evaluating a Policy

How good is a policy π in a state s?

For deterministic actions, just total the rewards obtained... but the result may be infinite.

For stochastic actions, instead take the expected total reward obtained; again, this typically yields an infinite value.

How do we compare policies of infinite value?
Objective Functions

An objective function maps infinite sequences of rewards to single real numbers (representing utility).

Options:
1. Set a finite horizon and just total the reward
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit

Discounting is perhaps the most analytically tractable and most widely studied approach.
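For a concrete reward stream, the three options give three different finite summaries of an infinite sequence. A sketch under illustrative assumptions (a constant reward of 1 per step; the truncation lengths are arbitrary):

```python
# Three ways to turn an infinite reward stream r(0), r(1), ... into
# a single number.  The stream and parameters here are illustrative.

def finite_horizon_total(r, H):
    """Option 1: total reward over the first H steps."""
    return sum(r(t) for t in range(H))

def discounted(r, gamma, n_terms=10_000):
    """Option 2: sum of gamma**t * r(t), truncated for illustration."""
    return sum(gamma**t * r(t) for t in range(n_terms))

def average_rate(r, n_terms=10_000):
    """Option 3: average reward per step, approximating the limit."""
    return sum(r(t) for t in range(n_terms)) / n_terms

r = lambda t: 1.0  # a policy earning reward 1 at every step, forever

print(finite_horizon_total(r, 10))   # 10.0
print(round(discounted(r, 0.9), 3))  # close to 1/(1-0.9) = 10.0
print(average_rate(r))               # 1.0
```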
Discounting

A reward n steps away is discounted by γ^n for discount rate 0 ≤ γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed-out version of limited-horizon lookahead

We use cumulative discounted reward as our objective.

(Max value ≤ M + γ×M + γ²×M + ... = (1/(1−γ)) × M)
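The bound above is the geometric series: if every per-step reward is at most M, the cumulative discounted reward is at most M/(1 − γ). A quick numeric check with illustrative values of M and γ:

```python
# Verify numerically that M + gamma*M + gamma^2*M + ...
# equals M / (1 - gamma).  M and gamma are illustrative.
M, gamma = 5.0, 0.9

# Closed form of the geometric series.
bound = M / (1.0 - gamma)

# Truncated sum approaches the closed form from below.
partial = sum(M * gamma**t for t in range(1000))

assert partial <= bound
assert abs(partial - bound) < 1e-6
print(round(bound, 6))  # 50.0
```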