An Introduction to Markov Decision Processes

Bob Givan, Purdue University
Ron Parr, Duke University

Outline

Markov Decision Processes defined (Bob)
• Objective functions
• Policies

Finding Optimal Solutions (Ron)
• Dynamic programming
• Linear programming

Refinements to the basic model (Bob)
• Partial observability
• Factored representations

Stochastic Automata with Utilities
A Markov Decision Process (MDP) model contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action's effects in each state

We assume the Markov property: the effects of an action taken in a state depend only on that state and not on the prior history.
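The four components above can be captured in a small tabular data structure. A minimal sketch in Python; the concrete state names, actions, and numbers here are illustrative assumptions, not part of the tutorial:

```python
# A minimal MDP container: states S, actions A, reward R(s, a),
# and transition model T(s, a) -> distribution over next states.
# All concrete values below are illustrative only.

S = ["s0", "s1"]
A = ["stay", "go"]

# R[(s, a)]: real-valued reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0, ("s0", "go"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "go"): 0.0,
}

# T[(s, a)]: probability distribution over next states.  The Markov
# property is built in: the distribution depends only on s and a,
# not on any earlier history.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.4, "s1": 0.6},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}

# Sanity check: every transition distribution sums to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in T.values())
```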
Representing Actions

Deterministic actions:
• T : S × A → S. For each state and action we specify a new state.

Stochastic actions:
• T : S × A → Prob(S). For each state and action we specify a probability distribution over next states. Represents the distribution P(s′ | s, a).

[Figure: a stochastic action leading from one state to two next states with probabilities 0.6 and 0.4.]
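One way to realize T : S × A → Prob(S) in code is to sample a next state from a stored distribution. A sketch under illustrative assumptions (the `T` table mirrors a 0.6/0.4 split like the slide's diagram; state names and the helper are made up):

```python
import random

# Transition model for one stochastic action (illustrative values).
T = {("s0", "go"): {"s1": 0.6, "s2": 0.4}}

def sample_next_state(T, s, a, rng=random):
    """Draw s' ~ P(s' | s, a) from the tabular model T."""
    dist = T[(s, a)]
    states = list(dist)
    weights = [dist[s2] for s2 in states]
    return rng.choices(states, weights=weights, k=1)[0]

# A deterministic action is just the special case of a point
# distribution, e.g. {"s1": 1.0}.
next_state = sample_next_state(T, "s0", "go")
assert next_state in ("s1", "s2")
```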
Representing Solutions

A policy π is a mapping from S to A.
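For a finite state set, a mapping from S to A is just a lookup table. A minimal sketch with illustrative state and action names:

```python
# A policy maps each state to an action; for finite S a dict suffices.
# The states and actions here are illustrative assumptions.
policy = {"s0": "go", "s1": "stay"}

def pi(s):
    """The policy as a function: pi(s) is the action to take in s."""
    return policy[s]

assert pi("s0") == "go"
```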
Following a Policy

Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Go to step 1

Assumes full observability: the new state resulting from executing an action will be known to the system.
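The three steps above can be sketched as a loop. Sampling from the transition table stands in for executing the action and observing the resulting state (full observability); the model, policy, and names are illustrative assumptions:

```python
import random

# Illustrative two-state model and policy.
T = {
    ("s0", "go"):   {"s0": 0.4, "s1": 0.6},
    ("s1", "stay"): {"s1": 1.0},
}
policy = {"s0": "go", "s1": "stay"}

def follow_policy(policy, T, s, steps, rng=random.Random(0)):
    """Repeat: observe current state s, execute policy(s), observe s'.

    Full observability: the new state is known after every step."""
    trajectory = [s]
    for _ in range(steps):
        a = policy[s]                       # step 2: execute pi(s)
        dist = T[(s, a)]
        s = rng.choices(list(dist), weights=list(dist.values()))[0]
        trajectory.append(s)                # step 1 again: new state
    return trajectory

traj = follow_policy(policy, T, "s0", 5)
assert len(traj) == 6 and traj[0] == "s0"
```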
Evaluating a Policy

How good is a policy π in a state s?

For deterministic actions, just total the rewards obtained... but the result may be infinite.

For stochastic actions, instead take the expected total reward obtained; again, this typically yields an infinite value.

How do we compare policies of infinite value?
Objective Functions

An objective function maps infinite sequences of rewards to single real numbers (representing utility).

Options:
1. Set a finite horizon and just total the reward
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit

Discounting is perhaps the most analytically tractable and most widely studied approach.
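For a concrete reward stream, the three options give three different finite summaries of an infinite sequence. A sketch under illustrative assumptions (a constant reward of 1 per step; the truncation lengths are arbitrary):

```python
# Three ways to turn an infinite reward stream r(0), r(1), ... into
# a single number.  The stream and parameters here are illustrative.

def finite_horizon_total(r, H):
    """Option 1: total reward over the first H steps."""
    return sum(r(t) for t in range(H))

def discounted(r, gamma, n_terms=10_000):
    """Option 2: sum of gamma**t * r(t), truncated for illustration."""
    return sum(gamma**t * r(t) for t in range(n_terms))

def average_rate(r, n_terms=10_000):
    """Option 3: average reward per step, approximating the limit."""
    return sum(r(t) for t in range(n_terms)) / n_terms

r = lambda t: 1.0  # a policy earning reward 1 at every step, forever

print(finite_horizon_total(r, 10))   # 10.0
print(round(discounted(r, 0.9), 3))  # close to 1/(1-0.9) = 10.0
print(average_rate(r))               # 1.0
```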
Discounting

A reward n steps away is discounted by γ^n for discount rate 0 ≤ γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed-out version of limited-horizon lookahead

We use cumulative discounted reward as our objective.

(Max value ≤ M + γ×M + γ²×M + ... = (1/(1−γ)) × M)
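The bound above is the geometric series: if every per-step reward is at most M, the cumulative discounted reward is at most M/(1 − γ). A quick numeric check with illustrative values of M and γ:

```python
# Verify numerically that M + gamma*M + gamma^2*M + ...
# equals M / (1 - gamma).  M and gamma are illustrative.
M, gamma = 5.0, 0.9

# Closed form of the geometric series.
bound = M / (1.0 - gamma)

# Truncated sum approaches the closed form from below.
partial = sum(M * gamma**t for t in range(1000))

assert partial <= bound
assert abs(partial - bound) < 1e-6
print(round(bound, 6))  # 50.0
```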