
A Reinforcement Learning Tutorial for
Game Strategy Acquisition
Thomas Philip Runarsson
March 17, 2005
T.P. Runarsson (tpr@hi.is)
version 0.0.3 March 17, 2005
Introduction
This tutorial is based on:
• the book by Richard S. Sutton and Andrew G. Barto,
Reinforcement Learning: An Introduction. MIT Press, 1998.
Also available on-line here:
http://www.cs.ualberta.ca/~sutton/book/the-book.html
• the computational intelligence course taught at the University
of Iceland.
The n-armed Bandit Problem

You must choose between n different options or actions. Having performed an action a you receive a reward r. The reward is sampled from a stationary probability distribution that depends on the action taken.

Let Q*(a) = E[r] denote the mean (expected) reward received when action a is selected; this is also known as the value of action a. Q_t(a) is the estimated value at the t-th play, that is

$$Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$$

where at the t-th play action a has been chosen k_a times prior to t, yielding rewards r_1, r_2, ..., r_{k_a}.

[Figure: mean reward and % optimal action versus plays (0 to 1000) for ε = 0.1, ε = 0.01 and ε = 0, averaged over the experiments described on the next slide.]

With ε = 0.1 the optimal action is never selected more than 91% of the time, but it is found earlier than with ε = 0.01. In the long run, however, ε = 0.01 performs better, so it may in this case be a good idea to start with a large ε and decrease it over time. The greedy method (ε = 0) performs significantly worse and often gets stuck performing suboptimal actions. On the other hand, if the reward variance were zero, the greedy method would know the true value of each action after trying it once and would work better. The noisier the reward, the more exploration is needed.
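The suggestion above to start with a large ε and decrease it over time is not implemented in this tutorial; a minimal sketch of one possible schedule (the constants eps0 and tau are arbitrary choices, not taken from the experiments below) could look like this:

eps0 = 0.5; % assumed initial exploration probability
tau = 100; % assumed decay constant, in plays
for t = 1:1000, % number of plays
epsilon = eps0/(1 + t/tau); % exploration probability shrinks towards zero over time
% ... select an epsilon-greedy action using the current epsilon ...
end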
Exploration versus exploitation
• greedy action: the action whose estimated value Q(a) is currently the greatest. By choosing it you are exploiting your current knowledge.
• non-greedy action: by choosing a non-greedy action you improve your estimate of its value; we say you are exploring.
Consider the following experiment, repeated 2000 times, for 1000 plays of the “n-armed bandit”:
n = 10; % number of arms
epsilon = 0.01; % probability of selecting a purely random action
for i = 1:2000, % number of experiments
Qstar = randn(1,n); % mean reward
[dummy,astar] = max(Qstar); % optimal action
Q = zeros(1,n); % initial estimated return
ka = zeros(1,n); % number of times action taken
for j = 1:1000, % number of games played
if (rand(1) < epsilon), % epsilon greedy policy
a = ceil(rand(1)*n); % purely random move
else,
Qm = Q./max(ka,ones(1,n)); % mean return (avoid divide by zero)
K = find(Qm == max(Qm)); % find actions that give maximum return
k = ceil(rand(1)*length(K)); % break ties randomly
a = K(k); % greedy move
end
r = Qstar(a) + randn(1); % reward for this action
Q(a) = Q(a) + r; % add to rewards obtained for this action
ka(a) = ka(a) + 1; % increment counter for this action
R(i,j) = r; % keep track of reward obtained for this game
O(i,j) = (astar == a); % was this an optimal move?
end
end
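The curves in the figure above are the column means of R and O over the 2000 experiments. A short way to produce them (assuming R and O are the 2000-by-1000 matrices filled in above; the axis labels are my guess at the original plot labels) is:

figure; plot(mean(R,1)); % average reward at each play, averaged over all experiments
xlabel('plays'); ylabel('mean reward');
figure; plot(100*mean(O,1)); % percentage of experiments in which the optimal action was chosen
xlabel('plays'); ylabel('% optimal action');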
Incremental Implementation

The average of k+1 rewards can be computed incrementally:

$$
\begin{aligned}
Q_{k+1} &= \frac{1}{k+1}\sum_{i=1}^{k+1} r_i \\
        &= \frac{1}{k+1}\left(r_{k+1} + \sum_{i=1}^{k} r_i\right) \\
        &= \frac{1}{k+1}\left(r_{k+1} + kQ_k + Q_k - Q_k\right) \\
        &= \frac{1}{k+1}\left(r_{k+1} + (k+1)Q_k - Q_k\right) \\
        &= Q_k + \frac{1}{k+1}\left(r_{k+1} - Q_k\right)
\end{aligned}
$$

This update rule is commonly used in reinforcement learning. The general form is:

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

Experiment using a fixed step size

The previous experiment is repeated using the incremental update rule with α = 0.1 and ε = 0.1.

alpha = 0.1; % fixed step size
n = 10; % number of arms
epsilon = 0.1; % probability of selecting a purely random action
for i = 1:2000, % number of experiments
Qstar = randn(1,n); % mean reward
[dummy,astar] = max(Qstar); % optimal action
Q = zeros(1,n); % initial estimated return
ka = zeros(1,n); % number of times action taken
for j = 1:1000, % number of games played
if (rand(1) < epsilon), % epsilon greedy policy
a = ceil(rand(1)*n); % purely random move
else,
K = find(Q == max(Q)); % find actions that give maximum return
k = ceil(rand(1)*length(K)); % break ties randomly
a = K(k); % greedy move
end
ka(a) = ka(a) + 1; % increment counter for this action
r = Qstar(a) + randn(1); % reward for this action
% alpha = 1/ka(a); % uncomment this line for sample average
Q(a) = Q(a) + alpha*(r - Q(a)); % incremental implementation
R(i,j) = r; % keep track of reward obtained for this game
O(i,j) = (astar == a); % was this an optimal move?
end
end
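A quick standalone check (not part of the tutorial's experiments) that the incremental rule with step-size 1/(k+1) reproduces the ordinary sample average, which is what the commented-out alpha = 1/ka(a) line above switches back to:

r = randn(1,50); % an arbitrary reward sequence
Q = 0; % initial estimate Q_0
for k = 0:length(r)-1,
Q = Q + (r(k+1) - Q)/(k+1); % incremental update with step-size 1/(k+1)
end
disp([Q mean(r)]); % the two numbers agree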
The red lines in the figure represent the results of the previous experiment using the incremental update with a fixed step-size α.

[Figure: mean reward and % optimal action versus plays (0 to 1000) for ε = 0.1, ε = 0.01 and ε = 0, with red lines for the fixed step-size results.]

Tracking a Non-stationary Problem

When the incremental update rule is modified to be

$$Q_{k+1} = Q_k + \alpha\left(r_{k+1} - Q_k\right)$$

where the step-size α, 0 < α ≤ 1, is constant, this results in a weighted average of past rewards and the initial estimate Q_0, that is:

$$
\begin{aligned}
Q_k &= Q_{k-1} + \alpha\left(r_k - Q_{k-1}\right) \\
    &= \alpha r_k + (1-\alpha)Q_{k-1} \\
    &= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2 Q_{k-2} \\
    &= \alpha r_k + (1-\alpha)\alpha r_{k-1} + (1-\alpha)^2\alpha r_{k-2} + \cdots + (1-\alpha)^{k-1}\alpha r_1 + (1-\alpha)^k Q_0 \\
    &= (1-\alpha)^k Q_0 + \sum_{i=1}^{k}\alpha(1-\alpha)^{k-i} r_i
\end{aligned}
$$

This is an exponential recency-weighted average, since $(1-\alpha)^k + \sum_{i=1}^{k}\alpha(1-\alpha)^{k-i} = 1$. Sometimes it may be convenient to vary the step-size α_k(a) used after the k-th selection of action a. The special case α_k(a) = 1/k results in the sample average method. Conditions: $\sum_{k=1}^{\infty}\alpha_k(a) = \infty$ to overcome initial conditions or random fluctuations, and $\sum_{k=1}^{\infty}\alpha_k^2(a) < \infty$ to assure convergence.
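To see why a constant step-size helps when the problem is non-stationary, the sketch below (a toy example, not one of the tutorial's experiments; the random-walk size of 0.01 is an assumption) lets the true value of a single action drift and tracks it with both the sample-average and the constant-α rule:

sigma = 0.01; % assumed standard deviation of the random walk
alpha = 0.1; % constant step size
qtrue = 0; Qs = 0; Qa = 0; % true value, sample-average estimate, constant-alpha estimate
for k = 1:2000, % number of plays
qtrue = qtrue + sigma*randn(1); % the true action value drifts
r = qtrue + randn(1); % noisy reward for this play
Qs = Qs + (r - Qs)/k; % sample average: weights all past rewards equally
Qa = Qa + alpha*(r - Qa); % constant alpha: exponential recency-weighted average
end
disp([qtrue Qs Qa]) % Qa typically tracks the drifting qtrue more closely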
Agent - Environment Interface
Reinforcement learning (RL) is learning what to do – mapping
situations to actions – so as to maximize a numerical reward
signal. The learner is not instructed which actions to take, but
must discover which actions yield the most reward by trying
them. Actions may not only affect the immediate rewards but
also the next situation and consequently all subsequent rewards.
The elements of RL are:
- Agent – an agent is the learner and decision maker.
- Environment – everything outside the agent is its environment (the thing it interacts with).
- Rewards – special numerical values the environment gives rise to and the agent tries to maximize over time, r_t ∈ ℝ.
- State – a description of the situation, a representation of the environment's state s_t.
- Action – the feasible set of actions a_t ∈ A(s_t) available in state s_t.
- Policy – an agent maps states to probabilities of selecting possible actions, that is, π_t(s,a) is the probability that a_t = a if s_t = s. An alternative notation is a_t = π_t(s_t) when action a_t is selected given state s_t and policy π_t (see the sketch after this list).
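One concrete way to picture π_t(s,a) is as a table of action probabilities, one row per state; the sketch below (the numbers of states and actions are made up for illustration) samples an action for the current state from such a table:

nS = 3; nA = 2; % assumed numbers of states and actions
policy = ones(nS,nA)/nA; % uniform random policy: policy(s,a) = P(a_t = a | s_t = s)
s = 2; % current state
a = find(rand(1) < cumsum(policy(s,:)), 1); % sample an action according to policy(s,:)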
Agent - Environment Interaction
[Diagram: at time t the AGENT, given state s_t and reward r_t, selects action a_t = π(s_t); the ENVIRONMENT responds with reward r_{t+1} and next state s_{t+1}.]
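A minimal sketch of this interaction loop in Matlab; the two-state, two-action environment and the random policy are invented purely for illustration and are not examples from this tutorial:

T = 10; s = 1; % horizon and initial state (both arbitrary)
for t = 1:T,
a = ceil(2*rand(1)); % agent: a_t = pi_t(s_t), here a uniformly random policy over two actions
r = (a == s) + 0.1*randn(1); % environment: reward r_{t+1}, larger when the action matches the state
s = ceil(2*rand(1)); % environment: next state s_{t+1}
end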
Rewards define the goal of learning
In reinforcement learning, the purpose or goal of the agent is
formalized in terms of a special reward signal passing from the
environment to the agent. The agent’s goal is to maximize
the cumulative reward in the long run, not just the immediate
rewards.
Examples of how reward signals are used to formalize the idea of
a goal:
• Robot learning to walk: a reward is given proportional to the
forward motion of the robot.
• Robot escapes a maze: give no reward until it escapes. To
find the shortest path out of the maze give a negative reward
at each discrete time step.
• Robot cleaning up a room: reward given for each piece of
rubbish picked up.
• Playing chess or another game: give a reward of +1 for winning, −1 for losing, and 0 for a draw and all nonterminal positions (sketched below).
One could give rewards for achieving subgoals, but then there
is the danger of the agent finding ways of achieving these
subgoals without achieving the real goal – winning.
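Two of the reward schemes above, written as hypothetical Matlab one-liners just to emphasise that a reward signal is nothing more than a scalar handed back by the environment:

maze_reward = @(escaped) -(~escaped); % -1 at every time step until the robot escapes, 0 on escape
game_reward = @(won,lost) 1*won - 1*lost; % +1 for a win, -1 for a loss, 0 for draws and nonterminal positions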
Returns
The agent’s goal is to maximize the reward it receives in the long
run, that is the return
$$r_{t+1} + r_{t+2} + r_{t+3} + \cdots$$
More precisely we aim to maximize the mean or expected return
E[R_t], where R_t is some function of the reward sequence above.
In the simplest case the return is
$$R_t = r_{t+1} + r_{t+2} + r_{t+3} + \cdots + r_T$$
where T is the final time step. This makes sense for tasks like games that have terminal states. A sequence of agent-environment interactions of this type is known as an episode, and such tasks are known as episodic tasks.
Continuous tasks go on continually without limit. In this case the return may also have no limit, for example if T = ∞ and the reward is +1 at each time step. To overcome this problem we introduce discounting, that is, a discounted return:
$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}$$

where the parameter 0 ≤ γ ≤ 1 is called the discount rate.
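As a concrete check of this formula, the discounted return of a short reward sequence can be computed directly (the reward values and γ = 0.9 are arbitrary choices):

gamma = 0.9; % discount rate
r = [1 0 0 2 5]; % an arbitrary reward sequence r_{t+1}, r_{t+2}, ...
k = 0:length(r)-1; % exponents of gamma
Rt = sum(gamma.^k .* r) % R_t = sum over k of gamma^k * r_{t+k+1}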
