Q1
When we run our experiments with different alpha and epsilon values,
should we run our programs several times with different random sequences?
A1
Yes. Recall a minor complication related to pseudo-random number generators:
you need to initialize your generator to a different starting value
by choosing a new seed
(read "man srand" and "man srandom" for details)
each time you run your program, because each run of your grid-world
simulation needs a different sequence of pseudo-random numbers.
To choose different seed values, you can use the current system time
converted to an integer, as in the 1st assignment:
srand((unsigned int)time(NULL)).
Collect data from different runs of your program and
report data averaged over several runs (to exclude
minor variations due to random noise). Useful metrics/statistics that
characterize a learning algorithm are the average computation time per episode
and the average number of steps per episode before
reaching the goal state, both divided by the distance from the random starting
state to the goal (the distance can be the so-called "Manhattan distance": the number
of horizontal and vertical steps from the initial state to the goal state).
Because each episode starts at a new random state, the number of steps
and the computation time depend on this starting state; dividing the time
and the number of steps by the distance from the start state to the goal
state accounts for this. However, you need
to print an optimal policy only once. Make sure that your optimal
policy is stable, i.e., it does not change if you choose different
random sequences on your computer.
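To make this concrete, here is a minimal sketch of the bookkeeping. It is only an
illustration: runEpisode() and manhattanDistance() are hypothetical helper names,
the run and episode counts are placeholders, and wall states are ignored for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helpers assumed to exist elsewhere in your program:
   runEpisode(s) runs one learning episode from start state s and returns the
   number of steps taken; manhattanDistance(s) is the Manhattan distance from
   s to the goal state. */
extern int runEpisode(int startState);
extern int manhattanDistance(int s);

#define NUM_STATES   100
#define NUM_RUNS     10      /* average over several independent runs */
#define NUM_EPISODES 5000    /* episodes per run                      */

int main(void)
{
    double grandAverage = 0.0;
    for (int run = 0; run < NUM_RUNS; ++run) {
        srand((unsigned int)time(NULL) + run);    /* a new seed for every run */
        double sum = 0.0;
        for (int e = 0; e < NUM_EPISODES; ++e) {
            int s;
            do {                                  /* random start state ...   */
                s = rand() % NUM_STATES;
            } while (manhattanDistance(s) == 0);  /* ... but not the goal     */
            int steps = runEpisode(s);
            sum += (double)steps / manhattanDistance(s);  /* normalize by distance */
        }
        grandAverage += sum / NUM_EPISODES;
    }
    printf("average (steps per episode)/(distance to goal): %f\n",
           grandAverage / NUM_RUNS);
    return 0;
}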
Q2
When we are deciding which action is the greedy action, you said to use
the function Q(s,a). Which Q(s,a) function are we using? The one given
in the algorithm for Sarsa/Q-learning?
A2
There is only one value function for state-action pairs in the textbook;
it is always denoted by Q(s,a).
The value function for states is denoted by V(s) in the book.
In particular, the Q(s,a) function is used in the Sarsa
and Q-learning algorithms. In Chapter 4 on dynamic programming,
this function is characterized by a version of the Bellman equations for
the Q-function. But because in Chapter 6 on TD-learning the agent has
no information about the transition probabilities and rewards
(i.e., the agent does not know p[s][a][ss] and r[s][a][ss]),
it cannot use Bellman's full backup to compute
a better approximation in the iterative computation of the Q(s,a) function.
However, the agent can use sample values (the reward R and the next state S')
to update the current approximation to Q(s,a) (both in Sarsa and in
Q-learning). Because samples are generated by the environment
(which is simulated by the programmer) using the probabilities "p_1" and "p_2",
updates based on samples converge to the correct optimal Q*(s,a)
values (the ones that satisfy the Bellman optimality equation) in the limit, i.e.,
if the agent interacts with the environment for a long time
(and if "epsilon" decreases gradually to 0).
Q3
I have a question about what the agent does in the case of a doorway in
terms of accidental diagonal moves.
Is it possible for the agent to move to the states beside
where it is trying to go or not?
A3
First, note that there is a state S_b just before a doorway,
and there is a state S_a immediately after a doorway.
I did not draw a line between S_a and S_b
on the figure given in the assignment,
so you see a rectangle instead of two squares
representing S_a and S_b, but there are actually
two different states in each doorway.
Second, for the sake of uniformity, it is convenient
to treat the states in each doorway the same way as the other states in
"the free space". Hence, yes, it is possible
for the agent to move to the states S_l and S_r beside
the state S_a (which is located immediately after the doorway)
when the agent is trying to move through the doorway
from S_b to S_a.
Q4
Suppose the environment below, and suppose I want to evaluate the probabilities
of starting at state 0 and moving up.
+------+------+------+
|  0   |  1   |  2   |
+------+------+------+
|  3   |  4   |  5   |
+------+------+------+
|  6   |  7   |  8   |
+------+------+------+

So what is the probability of going from state 0 to state 0 by taking action up?
My reasoning is the following:
Q5
In assignment 3, are the numbers p1 and p2 related, i.e., does p1 = 1 - p2?
In our testing, can we use values of p1 and p2 such that p1 + p2 > 1?
A5
No, p1 + p2 cannot be greater than 1. You can assume that always
p1 + p2 <= 1,
but it can happen that p1 + p2 < 1 (so p1 is not necessarily equal to 1 - p2),
i.e., side moves can be possible.
Q6
When I'm implementing Monte Carlo on-policy control, where can the episodes start?
A6
Assume the agent starts each episode in the randomly chosen initial state
with a random action.
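For example (a minimal sketch only; isFreeCell() is a hypothetical name for
whatever test your program uses to exclude walls and the goal state, and
NUM_STATES/NUM_ACTIONS are assumed constants):

#include <stdlib.h>

#define NUM_STATES  100
#define NUM_ACTIONS 4

/* Hypothetical helper: returns 1 if state s is a cell the agent may start in
   (not a wall and not the goal); replace with your own test. */
extern int isFreeCell(int s);

/* Pick a uniformly random start state and a random first action. */
void randomStart(int *s, int *a)
{
    do {
        *s = rand() % NUM_STATES;      /* uniformly random state ...        */
    } while (!isFreeCell(*s));         /* ... that the agent can start from */
    *a = rand() % NUM_ACTIONS;         /* uniformly random first action     */
}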
Q7
What are we supposed to write in our report?
What kind of comparisons do we have to do?
A7
Describe briefly in your report the results of your experiments and draw a conclusion about
which approach is more efficient and with which values of the learning parameters. For each
set of parameters, collect data from several runs of your program and report data averaged
over many runs (to exclude minor variations due to random noise). Useful statistics
that characterize a learning algorithm are
(a) the average computation time per episode, and
(b) the average number of steps per episode before reaching the goal state,
in both cases divided by the distance from the random starting state to the goal.
The distance between states can be calculated as the Manhattan distance: the number of
horizontal and vertical steps from the initial state to the goal state. Because each episode
starts at a new random state, the number of steps and the computation time depend on this
starting state. You can account for this by dividing the time and the number of steps by
the distance from the start state to the goal state. However, print an optimal policy
only once. Make sure that your optimal policy is stable, i.e., it does not change if
you choose different random sequences on your computer.
Q8
I have 100 states and 4 possible actions mapping to another 100 states.
Hence, I have to consider an array p[100][4][100] with 40000 entries
(where "p" expresses the
probabilities of going from s to s' when the agent takes a stochastic action a).
But you said in class that we need only 400 entries? How can this be?
On the other hand, I understand that the array of state-action values Q
is Q[100][4], so it has 400 entries.
Could you explain to me how you can have an array of 400?
A8
Both in Sarsa and in Q-learning we need to find a successor state SS
whenever the agent takes an action in the current state S.
Note that in both cases the pseudo-code specifies the following:
"Take action a, observe r, s'."
The question is how to implement the instruction "observe s'".
There are different approaches to this implementation.
You are right that an array of transition probabilities "p"
has 40000 entries, but this array may not be the most convenient
way of finding a successor state SS after each transition from a state S.
In principle, you can define this array and then find
the next state SS by doing something like
int nextState = 0;
double totalProbability = 0.0;
while ( nextState < NumberOfStates ) {
    totalProbability = totalProbability + p[s][a][nextState];
    if ( x <= totalProbability )
        return nextState;
    else
        ++nextState;
}

where NumberOfStates = 100 in our case and x is a random number in the interval
between 0 and 1. You can obtain x from rand() or random(), simple pseudo-random
number generators (read "man rand" and "man random" for details), e.g., by
dividing the value returned by rand() by RAND_MAX. But it can be difficult to
define this array "p", and in addition, in the worst case it can take 100
iterations to find the next state after each transition.
Alternatively, you can find the next state SS directly from p1 and p2, without
defining the array "p". Generate a random number x between 0 and 1 and then:

if 0 <= x < p1 then
    SS = the state where the agent wants to go
else if p1 <= x < (p1 + p2) then
    SS = S (the current state)
else if (p1 + p2) <= x <= 1 then
    SS = one of the states beside the state where the agent is trying to go

The conditions above can have minor variations for states near the border or
near a wall and for states at the corners. Perhaps it can be convenient to
represent the states in the grid-world by their (x,y)-coordinates: this may
help to simplify the conditions that your program needs to check to find the
next state SS. Note a minor complication related to pseudo-random number
generators: you need to initialize your generator to a different starting value
by choosing a new seed (read "man srand" and "man srandom" for details) each
time you run your program, because you need a different sequence of
pseudo-random numbers in your simulations of the grid-world. To choose
different seed values, you can use the current system time converted to an
integer.
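For example, here is a rough sketch of such a next-state function using
(col,row) coordinates. It is only an illustration, not the required
implementation: the helper isBlocked(), the action encoding, the grid width
WIDTH, the assumption that a blocked move leaves the agent in place, and the
equal split of the leftover probability 1 - p1 - p2 between the two side moves
are all assumptions you should check against the assignment.

#include <stdlib.h>

#define WIDTH 10   /* grid width; a 10x10 grid gives 100 states */

/* Hypothetical helper: returns 1 if (col,row) is outside the grid or inside a
   wall; in that case the agent is assumed to stay where it is. */
extern int isBlocked(int col, int row);

/* Action encoding assumed here: 0 = up, 1 = down, 2 = left, 3 = right. */
static const int DC[4] = {  0, 0, -1, 1 };
static const int DR[4] = { -1, 1,  0, 0 };

/* Sample the successor of state (col,row) under action a, given p1 and p2. */
int nextState(int col, int row, int a, double p1, double p2)
{
    double x = (double)rand() / RAND_MAX;    /* random number in [0,1]            */
    int dc = DC[a], dr = DR[a];              /* intended displacement             */

    if (x < p1) {
        /* intended move: keep (dc,dr) as is                                      */
    } else if (x < p1 + p2) {
        dc = 0; dr = 0;                      /* the agent stays where it is       */
    } else if (x < p1 + p2 + (1.0 - p1 - p2) / 2.0) {
        int t = dc; dc = dr; dr = t;         /* accidental move to one side       */
    } else {
        int t = dc; dc = -dr; dr = -t;       /* accidental move to the other side */
    }

    int nc = col + dc, nr = row + dr;
    if (isBlocked(nc, nr)) { nc = col; nr = row; }  /* bump into border/wall      */
    return nr * WIDTH + nc;                  /* convert back to a state index     */
}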
Q9
My partner and I have run into a little difficulty with regard to selecting
the greedy action. We are not sure what criteria to base this selection on.
Could you please advise?
A9
To select an epsilon-greedy action using the current approximation to the Q(s,a) function
you need two steps.
First, generate a random number x in the interval from 0 to 1.
Second, check whether x is greater than or equal to
the value (1 - "exploration parameter"), where
"exploration parameter" = epsilon = 0.1 according to the assignment.
If x >= (1 - epsilon), then choose an action randomly
(this corresponds to exploration);
else find an index a=A where the function Q(s,a) has a maximal value
(this corresponds to exploitation, i.e., to greedy selection of an action).
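A minimal sketch of these two steps in C (NUM_ACTIONS and the array Q[s][a] are
assumed to come from your program; epsilon is hard-coded to 0.1 here):

#include <stdlib.h>

#define NUM_ACTIONS 4
#define EPSILON     0.1

/* Epsilon-greedy action selection for state s, using the current Q[s][a]. */
int selectAction(int s, double Q[][NUM_ACTIONS])
{
    double x = (double)rand() / RAND_MAX;      /* random number in [0,1]  */
    if (x >= 1.0 - EPSILON)
        return rand() % NUM_ACTIONS;           /* explore: random action  */

    int best = 0;                              /* exploit: greedy action  */
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}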
Q10
Should we implement some sort of stopping criterion for checking policy stability (like in
earlier assignments), or can we just accept the total number of episodes as user input and
tweak it manually?
A10
You can run first-visit MC, Q-learning and Sarsa over a fixed large number of episodes.
Choose the number of episodes large enough to guarantee convergence.
See the sample Blackjack program in the Handouts for an example of MC control.
Q11
Should we implement first-visit MC or every-visit MC, or both?
A11
Implementing first-visit on-policy MC control with epsilon-soft policies is fine.
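As a rough illustration only (not the handout's Blackjack code), here is a
sketch of the first-visit update for one recorded episode. The episode is
assumed to be stored in arrays states[], actions[], rewards[] of length T
(rewards[t] is the reward received after taking actions[t] in states[t]),
and GAMMA is a discount factor chosen here just for illustration.

#define NUM_STATES  100
#define NUM_ACTIONS 4
#define GAMMA       0.9   /* use the discount factor from the assignment */

double Q[NUM_STATES][NUM_ACTIONS];
double returnsSum[NUM_STATES][NUM_ACTIONS];
int    returnsCount[NUM_STATES][NUM_ACTIONS];

/* First-visit MC update after one recorded episode of length T. */
void updateFromEpisode(const int states[], const int actions[],
                       const double rewards[], int T)
{
    double G = 0.0;
    for (int t = T - 1; t >= 0; --t) {
        G = GAMMA * G + rewards[t];             /* return following time t */

        int first = 1;                          /* first-visit check: is this   */
        for (int k = 0; k < t; ++k)             /* the earliest visit to (s,a)? */
            if (states[k] == states[t] && actions[k] == actions[t]) { first = 0; break; }

        if (first) {
            int s = states[t], a = actions[t];
            returnsSum[s][a]   += G;
            returnsCount[s][a] += 1;
            Q[s][a] = returnsSum[s][a] / returnsCount[s][a];  /* average of returns */
        }
    }
}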