rl.tutorial

Nopod

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

5 pages

English

Le téléchargement nécessite un accès à la bibliothèque YouScribe
Tout savoir sur nos offres

A propos
Informations
Extrait

Description

Informations

Publié par	Nopod
Nombre de lectures	19
Langue	English

Extrait

Outline of Tutorial
1. Introduction
An Introduction to Reinforcement 2. Elements of Reinforcement Learning
3. Solution methodsLearning
4. Survey of other subjects
Tim Kovacs
University of Bristol
kovacs@cs.bris.ac.uk
2
1 What is it? 1 Points of View
The learning agent’s point of view:
Reinforcement Learning is learning how to act in order to maximise a • RL is learning from trial and error interaction with the world.
numerical reward signal. • E.g. how much reward will I get if I do this?
RL is:
• a class of tasks “RL as a tool” point of view:
• which require a kind of trial-and-error learning • RL is training by rewards and punishments.
• Train the computer as we might train a dog.
Features of RL:
• learning from numerical rewards Application areas: problems with ongoing interaction e.g.
• interaction with the task; sequences of states, actions and rewards • robotics
• uncertainty and non-deterministic worlds • animal learning
• delayed consequences • scheduling
• the explore/exploit dilemma • games
• the whole problem of goal-directed learning • control systems
3 4
1 Supervised Learning 1 Reinforcement Learning
Step: 1 Step: 1
Teacher: Does picture 1 show a car or a flower? World: You are in state 9. Choose action A or C.
Learner: A flower. Learner: Action A.
Teacher: No, it’s a car. World: Your reward is 100.
Step: 2 Step: 2
Teacher: Does picture 2 show a car or a flower? World: You are in state 32. Choose action B or E.
Learner: A car. Learner: Action B.
Teacher: Yes, it’s a car. World: Your reward is 50.
Step: 3 .... Step: 3 ....
5 6
1g
g
g
£
£
g
g
1 Examples of RL Tasks 2 Policies, Rewards and Goals
Pole balancing policy
• -1 when pole falls, 0 otherwise • defines the agent’s behaviour at a given time
• maps from perceptions to actions
Chess • can be defined by: look-up table, neural net, search algorithm...
• win +1, loose -1 • may be stochastic
• may be non-stationary (indexed by time)
Elevator dispatching
• reward based on mean squared time for elevator to arrive reward function
(minimisation problem) • defines the goal(s) in an RL problem
• maps from states, state-action pairs, or state-action-successor state
Channel allocation for cellular phones triplets to a numerical reward
• lower reward the more calls are blocked • goal of the agent is to maximise the total reward in the long run
• the policy is altered to achieve this goal
Figure from Reinforcement Learning. An Introduction. R. Sutton & A. Barto. MIT Press 1998.
7 8
2 Reward and Return 2 Discounted Return
• The reward function indicates how good things are right The geometrically discounted model of return:
now i.e. at time t T
2 T k
• But the agent wants to maximise reward in the long-term Rt = rt + 1 + rt + 2 + rt + 3 + rT = rt + 1 + k
i.e. over many time steps k =0
• We refer to long-term (multi-step) reward as return
where is the discount rate.0 1
• Several definitions, including simply summing rewards:
T
Used to:
Rt = rt + 1 + rt + 2 + rt + 3 +rT = rk
• bound the infinite sum
k =t +1
• give more weight to earlier rewards (e.g. to give
preference to shorter paths)
where T is the last time step of the world.
9 10
2 Specifying RL Tasks 2 Stochastic Decision Processes
Used to model RL’s stochastic agent-environment interactions
RL tasks are often specified as:
Consists of:
• A set of states S
• A reward function R • A set of actions A
• A transition function P• A model of return
• A Stochastic Decision Process (usually a Markov
P gives the probability of transition from a state s to successor state s’
Decision Process) under a given action a
We consider discrete time processes although continuous time
processes are studied
If the agent cannot distinguish some states we have a partially
11 12observable process
2p
p
p
p
p
2 The Markov Property 2 Optimal Policies
If P is a function only of the current state and action, the learning task • An RL agent adapts its policy in order to increase return
has the Markov Property (or is Markov)
• A policy is at least as good as a policy if its expected1 2
• If the agent can distinguish all states we have a Markov Decision
return is at least as great in each possible initial state
Process (MDP)
• An optimal policy * is at least as good as any other • Otherwise a Partially Observable MDP (or POMDP)
policy
• For the geometrically discounted model of return any If a task is Markov, memory of previous states and actions is not
needed and is no help to the agent MDP has at least one optimal policy
• There is an optimal deterministic policy for any
The theory of RL generally assumes the Markov property completely observable MDP
• Some POMDPs which have no optimal deterministic
Real-world problems are very often non-Markov, but we generally act
policy have an optimal stochastic policy
as though they are
13 14
3 Policy Adaptation Methods 3 Value Functions
• Direct policy search methods A value function maps each state to an estimate of return
under a policy• generate and evaluate policies
*• Given the true value function for and the transition • e.g. evolutionary search, policy gradient methods
function P it is trivial to derive an optimal policy
Value function-based methods
An action-value function maps from state-action pairs to
• learn a value function for the policy
estimates of return
• generate a new policy from the value function
*• Given the true action-value function for it is trivial to
• e.g. Q-learning, Dynamic Programming
act optimally even without the transition function
Learning a value function is referred to as the ‘prediction’
These methods are said to address the ‘control’ problem; problem or ‘policy evaluation’ in the Dynamic
controlling the agent’s behaviour. Programming literature.
15 16
3 Model-free and Model-based Learning 3 Models
We refer to P and R as a model of the world Models are not always available to the learner. E.g.:
• Easy to specify rules of blackjack, harder to determine P
Model-based learners
• Easy to specify rules of chess, but not P because it
• Can learn from the model instead of interacting with the world
depends partly on opponent’s policy (which may be
• Can visit arbitrary parts of the model
unknown)
• Can use model to make full updates to value estimates (fast)
• Also called indirect methods
Models can be learned while interacting with the world• E.g. Dynamic Programming
• We can use the estimated model to estimate the value
functionModel-free learners
• Must sample P and R by interacting with the world • These are called certainty equivalent estimates, since we
• Must follow the trajectory generated by P and their actions treat them as equivalent to the true value
• Must make sample updates to value estimates (slow)
• Also called direct methods
17 18
3p
„
p
a
g
a
a
p
p
g
-
g
p
p
g
p
p
g
p
3 Policy Evaluation 3 Policy Evaluation
Example
A model-based method to approximate the true value
function of
Initialise V(s) = 0 for all s in S
Repeat
Error := 0
For each s in S
v := V(s)
V (s) := P(s, (s), s')[R(s, (s), s') + V (s')]s'
Error := max(Error, |v - V(s)|)
Until Error is small
Figures from Reinforcement Learning
Output V An Introduction. R. Sutton & A. Barto.
19 MIT Press 1998. 20
3 Policy Iteration 3 Value Iteration
A model-based method to optimise policies Policy iteration can be slow because full policy evaluation can be slow
Alternates policy evaluation and policy improvement steps Value iteration combines evaluation and improvement in one update
1. Initialise V(s) := 0 and (s) := random(A) for all s in S Initialise V(s) := 0 for all s in S
Repeat
2. Perform policy evaluation Error := 0
For each s in S
3. policy-stable := true v := V(s)
V (s) := max P(s, a,s')[R(s, a,s') + V (s')]For each s in S a s'
b := (s) Error := max(Error, |v - V(s)|)
(s) := arg max P (s, a, s' )[ R (s, a, s' ) + V (s' )] Until Error is smalla s'
If b ( s ) then policy-stable := false Output a deterministic policy such that
If policy-stable then stop else go to 2 (s) = arg max P(s, a, s')[R(s,a, s') + V (s')]s'a
21 22
3 Notes on Policy and Value Iteration 3 Q-learning
Both converge on the optimal policy in the limit Model-free method
Learns action-values Q(s,a) rather than state-values V(s)
Which converges more quickly depends on the problem Like asynchronous VI, but samples R and P from experience with the
world rather using them directly in update
• Consequently needs learning rate to average successive samplesVI as presented is strictly termed synchronous value iteration as each
iteration updates each V(s) once Converges in limit if all Q(s,a) updated and declined appropriately
Asynchronous VI allows updating arbitrary subsets of states each Initialise all Q(s,a) arbitrarily
iteration Repeat
• Guaranteed to converge if all states have some non-zero probability s := current_state( )
of update on each iteration Choose a based on Q