An Introduction to Reinforcement Learning

Tim Kovacs
University of Bristol
kovacs@cs.bris.ac.uk

Outline of Tutorial
1. Introduction
2. Elements of Reinforcement Learning
3. Solution methods
4. Survey of other subjects
1 What is it?

Reinforcement Learning is learning how to act in order to maximise a numerical reward signal.

RL is:
• a class of tasks
• which require a kind of trial-and-error learning

Features of RL:
• learning from numerical rewards
• interaction with the task; sequences of states, actions and rewards
• uncertainty and non-deterministic worlds
• delayed consequences
• the explore/exploit dilemma
• the whole problem of goal-directed learning

1 Points of View

The learning agent's point of view:
• RL is learning from trial-and-error interaction with the world.
• E.g. how much reward will I get if I do this?

"RL as a tool" point of view:
• RL is training by rewards and punishments.
• Train the computer as we might train a dog.

Application areas: problems with ongoing interaction, e.g.
• robotics
• animal learning
• scheduling
• games
• control systems
1 Supervised Learning 1 Reinforcement Learning
Step: 1 Step: 1
Teacher: Does picture 1 show a car or a flower? World: You are in state 9. Choose action A or C.
Learner: A flower. Learner: Action A.
Teacher: No, it’s a car. World: Your reward is 100.
Step: 2 Step: 2
Teacher: Does picture 2 show a car or a flower? World: You are in state 32. Choose action B or E.
Learner: A car. Learner: Action B.
Teacher: Yes, it’s a car. World: Your reward is 50.
Step: 3 .... Step: 3 ....
5 6
1 Examples of RL Tasks

Pole balancing
• reward −1 when pole falls, 0 otherwise

Chess
• win +1, lose −1

Elevator dispatching
• reward based on mean squared time for elevator to arrive (a minimisation problem)

Channel allocation for cellular phones
• the more calls are blocked, the lower the reward

Figure from Reinforcement Learning: An Introduction. R. Sutton & A. Barto. MIT Press 1998.

2 Policies, Rewards and Goals

policy
• defines the agent's behaviour at a given time
• maps from perceptions to actions
• can be defined by: lookup table, neural net, search algorithm...
• may be stochastic
• may be non-stationary (indexed by time)

reward function
• defines the goal(s) in an RL problem
• maps from states, state-action pairs, or state-action-successor-state triplets to a numerical reward
• the goal of the agent is to maximise the total reward in the long run
• the policy is altered to achieve this goal
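A policy as a lookup table and a reward function over state-action-successor triplets can be sketched in a few lines of Python. All state, action and reward values here are invented for illustration; they are not from the tutorial.

```python
# A policy as a lookup table: maps perceptions (states) to actions.
# State and action names are invented for this toy example.
policy = {"low_battery": "recharge", "clear_path": "move_forward"}

def reward(state, action, next_state):
    """A reward function over state-action-successor triplets (toy values)."""
    return 10.0 if next_state == "goal" else 0.0
```

A stochastic or non-stationary policy would replace the dict with a function of state (and possibly time) returning a distribution over actions.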
2 Reward and Return

• The reward function indicates how good things are right now, i.e. at time t
• But the agent wants to maximise reward in the long term, i.e. over many time steps
• We refer to long-term (multi-step) reward as return
• Several definitions, including simply summing rewards:

  R_t = r_{t+1} + r_{t+2} + r_{t+3} + … + r_T = Σ_{k=t+1}^{T} r_k

where T is the last time step of the world.

2 Discounted Return

The geometrically discounted model of return:

  R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{T} γ^k · r_{t+k+1}

where γ is the discount rate, 0 ≤ γ ≤ 1.

Used to:
• bound the infinite sum
• give more weight to earlier rewards (e.g. to give preference to shorter paths)
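The two models of return above can be sketched directly in Python (function names are mine, not from the tutorial; `rewards` holds r_{t+1}, r_{t+2}, …):

```python
def undiscounted_return(rewards):
    """Simple summed return: r_{t+1} + r_{t+2} + ... + r_T."""
    return sum(rewards)

def discounted_return(rewards, gamma):
    """Geometrically discounted return: sum over k of gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5 the rewards [1, 1, 1] give 1 + 0.5 + 0.25 = 1.75
```

With γ = 1 the discounted return reduces to the simple sum; with γ < 1 an infinite reward stream is still bounded.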
2 Specifying RL Tasks

RL tasks are often specified as:
• A reward function R
• A model of return
• A Stochastic Decision Process (usually a Markov Decision Process)

2 Stochastic Decision Processes

Used to model RL's stochastic agent-environment interactions

Consists of:
• A set of states S
• A set of actions A
• A transition function P

P gives the probability of transition from a state s to successor state s' under a given action a

We consider discrete time processes, although continuous time processes are studied

If the agent cannot distinguish some states we have a partially observable process
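A transition function P for a tiny stochastic decision process could be represented as a nested dict, as in this sketch (state and action names are invented):

```python
# P[(s, a)] maps each successor state s' to its transition probability.
# This toy process has two states and one action.
P = {
    ("s1", "a"): {"s1": 0.2, "s2": 0.8},
    ("s2", "a"): {"s2": 1.0},
}

def transition_prob(s, a, s_next):
    """Probability of moving from s to s_next under action a."""
    return P.get((s, a), {}).get(s_next, 0.0)
```

For a valid process the probabilities out of each (s, a) pair must sum to 1.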
2 The Markov Property

If P is a function only of the current state and action, the learning task has the Markov Property (or is Markov)

• If the agent can distinguish all states we have a Markov Decision Process (MDP)
• Otherwise a Partially Observable MDP (or POMDP)

If a task is Markov, memory of previous states and actions is not needed and is no help to the agent

The theory of RL generally assumes the Markov property

Real-world problems are very often non-Markov, but we generally act as though they are

2 Optimal Policies

• An RL agent adapts its policy in order to increase return
• A policy π₁ is at least as good as a policy π₂ if its expected return is at least as great in each possible initial state
• An optimal policy π* is at least as good as any other policy
• For the geometrically discounted model of return any MDP has at least one optimal policy
• There is an optimal deterministic policy for any completely observable MDP
• Some POMDPs which have no optimal deterministic policy have an optimal stochastic policy
3 Policy Adaptation Methods

• Direct policy search methods
  – generate and evaluate policies
  – e.g. evolutionary search, policy gradient methods

• Value function-based methods
  – learn a value function for the policy
  – generate a new policy from the value function
  – e.g. Q-learning, Dynamic Programming

These methods are said to address the 'control' problem: controlling the agent's behaviour.

3 Value Functions

A value function maps each state to an estimate of return under a policy
• Given the true value function for π* and the transition function P it is trivial to derive an optimal policy

An action-value function maps from state-action pairs to estimates of return
• Given the true action-value function for π* it is trivial to act optimally even without the transition function

Learning a value function is referred to as the 'prediction' problem or 'policy evaluation' in the Dynamic Programming literature.
3 Model-free and Model-based Learning

We refer to P and R as a model of the world

Model-based learners
• Can learn from the model instead of interacting with the world
• Can visit arbitrary parts of the model
• Can use the model to make full updates to value estimates (fast)
• Also called indirect methods
• E.g. Dynamic Programming

Model-free learners
• Must sample P and R by interacting with the world
• Must follow the trajectory generated by P and their actions
• Must make sample updates to value estimates (slow)
• Also called direct methods

3 Models

Models are not always available to the learner. E.g.:
• Easy to specify the rules of blackjack, harder to determine P
• Easy to specify the rules of chess, but not P, because it depends partly on the opponent's policy (which may be unknown)

Models can be learned while interacting with the world
• We can use the estimated model to estimate the value function
• These are called certainty equivalent estimates, since we treat them as equivalent to the true value
3 Policy Evaluation

A model-based method to approximate the true value function of π

Initialise V(s) := 0 for all s in S
Repeat
  Error := 0
  For each s in S
    v := V(s)
    V(s) := Σ_{s'} P(s, π(s), s')·[R(s, π(s), s') + γ·V(s')]
    Error := max(Error, |v − V(s)|)
Until Error is small
Output V

3 Policy Evaluation Example

Figures from Reinforcement Learning: An Introduction. R. Sutton & A. Barto. MIT Press 1998.
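The policy evaluation pseudocode above translates into Python roughly as follows. This is a sketch under assumed data structures (P as a dict of successor-state distributions, R as a function, pi as a lookup table), not the tutorial's own code.

```python
def policy_evaluation(S, P, R, pi, gamma=0.9, tol=1e-6):
    """Iterative policy evaluation for a policy pi.

    S: list of states
    P[(s, a)]: dict mapping successor state s' -> probability (assumed layout)
    R(s, a, s_next): reward function
    pi[s]: action the policy chooses in state s
    """
    V = {s: 0.0 for s in S}
    while True:
        error = 0.0
        for s in S:
            v = V[s]
            a = pi[s]
            # Full backup: expectation over successor states under P
            V[s] = sum(p * (R(s, a, s2) + gamma * V[s2])
                       for s2, p in P[(s, a)].items())
            error = max(error, abs(v - V[s]))
        if error < tol:
            break
    return V
```

On a one-state MDP with reward 1 per step and γ = 0.5 this converges to V = 1/(1 − γ) = 2, as expected from the discounted-return formula.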
3 Policy Iteration

A model-based method to optimise policies
Alternates policy evaluation and policy improvement steps

1. Initialise V(s) := 0 and π(s) := random(A) for all s in S
2. Perform policy evaluation
3. policy-stable := true
   For each s in S
     b := π(s)
     π(s) := arg max_a Σ_{s'} P(s, a, s')·[R(s, a, s') + γ·V(s')]
     If b ≠ π(s) then policy-stable := false
   If policy-stable then stop, else go to 2

3 Value Iteration

Policy iteration can be slow because full policy evaluation can be slow
Value iteration combines evaluation and improvement in one update

Initialise V(s) := 0 for all s in S
Repeat
  Error := 0
  For each s in S
    v := V(s)
    V(s) := max_a Σ_{s'} P(s, a, s')·[R(s, a, s') + γ·V(s')]
    Error := max(Error, |v − V(s)|)
Until Error is small
Output a deterministic policy π such that
  π(s) = arg max_a Σ_{s'} P(s, a, s')·[R(s, a, s') + γ·V(s')]
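Value iteration can be sketched in Python under the same assumed data structures (P as a dict of successor distributions, R as a function); this is an illustrative implementation, not the tutorial's code.

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Value iteration: combines evaluation and improvement in one update.

    Returns the value function V and a greedy deterministic policy pi.
    """
    V = {s: 0.0 for s in S}

    def backup(s, a):
        # Expected one-step return of taking a in s, then following V
        return sum(p * (R(s, a, s2) + gamma * V[s2])
                   for s2, p in P[(s, a)].items())

    while True:
        error = 0.0
        for s in S:
            v = V[s]
            V[s] = max(backup(s, a) for a in A)   # max over actions
            error = max(error, abs(v - V[s]))
        if error < tol:
            break
    # Output a deterministic policy greedy with respect to V
    pi = {s: max(A, key=lambda a: backup(s, a)) for s in S}
    return V, pi
```

Asynchronous variants would update only a subset of states per sweep instead of every s in S.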
3 Notes on Policy and Value Iteration

Both converge on the optimal policy in the limit
Which converges more quickly depends on the problem

VI as presented is strictly termed synchronous value iteration, as each iteration updates each V(s) once

Asynchronous VI allows updating arbitrary subsets of states each iteration
• Guaranteed to converge if all states have some non-zero probability of update on each iteration
• Allows policy adaptation even when the state space is too big for synchronous updates (e.g. backgammon has over 10^20 states)

3 Q-learning

Model-free method
Learns action-values Q(s,a) rather than state-values V(s)
Like asynchronous VI, but samples R and P from experience with the world rather than using them directly in updates
• Consequently needs a learning rate α to average successive samples
Converges in the limit if all Q(s,a) are updated and α is declined appropriately

Initialise all Q(s,a) arbitrarily
Repeat
  s := current_state()
  Choose a based on Q  // policy derived from Q(s,a) online
  Take action a, observe r, s'
  Q(s,a) := Q(s,a) + α·[r + γ·max_{a'} Q(s',a') − Q(s,a)]
  s := s'
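The Q-learning update rule above can be sketched as a single Python function operating on a tabular Q (a dict keyed by (state, action) pairs, an assumed layout):

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update:
    Q(s,a) := Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Note the update uses the max over next actions regardless of which action the agent actually takes next; this is what makes Q-learning an off-policy method.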
3 Explore/Exploit

Q-learning uses sample experience
Hence how we generate sample experience is an issue

In each state* we must balance:
• Exploration of alternative actions (so we can find the best)
• Exploitation of current knowledge (so we behave sensibly, most of the time)

The optimal balance is very difficult to determine

Usually simple exploration control methods are used, e.g. ε-greedy: with probability 1 − ε take the greedy action, otherwise choose at random

*Exploration is actually a sequential issue: we need sequences of actions to reach states

4 Other Subjects in RL

• Eligibility traces
  – Update all action-values each time step, in proportion to their eligibility
  – Increase eligibility when we visit a state-action, decrease it on all other time steps
  – Speeds up RL considerably

• Function approximation
  – Approximate the value function with e.g. a neural network
  – Allows generalisation to previously unvisited states (which speeds up RL)
  – Much more compact than a lookup table
  – Convergence results more limited
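The ε-greedy rule from the Explore/Exploit discussion can be sketched as follows (tabular Q keyed by (state, action) pairs is an assumed layout):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """With probability 1 - epsilon take the greedy action,
    otherwise choose an action uniformly at random."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Setting ε = 0 gives pure exploitation; ε = 1 gives pure random exploration; in practice ε is often decayed over time.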
4 Other Subjects in RL

• Structured models
  – Represent the transition function P compactly with a dynamic Bayes net or factored MDP

• RL with hidden state
  – Learning with POMDPs or k-Markov environments
  – Much more difficult than MDPs

• Direct policy search
  – Forget the value function; search in the space of policies
  – Useful when the Markov property is violated, e.g. by hidden state

4 Hot Topics in RL

• Efficient exploration and optimal learning
• Learning with structured models (e.g. Bayes nets)
• Learning with relational models
• Learning in continuous state and action spaces
• Hierarchical RL
• Learning in processes with hidden state (POMDPs)
• Policy search methods