Q1
When we run our experiments with different alpha and epsilon values,
should we run our programs several times with different random sequences?
A1
Yes. Recall a minor complication related to pseudo-random number generators:
you need to initialize your generator to a different starting value
by choosing a new seed
(read "man srand" and "man srandom" for details)
each time you run your program, because each run of your grid-world
simulation needs a different sequence of pseudo-random numbers.
To choose different seed values, you can use the current system time
converted to an integer, as in the 1st assignment:
srand((unsigned int)time(NULL)).
Collect data from different runs of your program and
report data averaged over several runs (to exclude
minor variations due to random noise). Useful metrics/statistics that
characterize a learning algorithm are the average computation time per episode
and the average number of steps per episode before
reaching the goal state, both divided by the distance from the random starting
state to the goal (the distance can be the so-called "Manhattan distance": the number
of horizontal and vertical steps from the initial state to the goal state).
Because each episode starts at a new random state, the number of steps
and the computation time depend on this starting state; dividing the time
and the number of steps by the distance from the start state to the goal
state accounts for this. However, you need
to print an optimal policy only once. Make sure that your optimal
policy is stable, i.e., it does not change if you choose different
random sequences on your computer.
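To make this concrete, here is a minimal sketch of the bookkeeping. It is only an
illustration: runEpisode() and manhattanDistance() are hypothetical helper names,
the run and episode counts are placeholders, and wall states are ignored for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helpers assumed to exist elsewhere in your program:
   runEpisode(s) runs one learning episode from start state s and returns the
   number of steps taken; manhattanDistance(s) is the Manhattan distance from
   s to the goal state. */
extern int runEpisode(int startState);
extern int manhattanDistance(int s);

#define NUM_STATES   100
#define NUM_RUNS     10      /* average over several independent runs */
#define NUM_EPISODES 5000    /* episodes per run                      */

int main(void)
{
    double grandAverage = 0.0;
    for (int run = 0; run < NUM_RUNS; ++run) {
        srand((unsigned int)time(NULL) + run);    /* a new seed for every run */
        double sum = 0.0;
        for (int e = 0; e < NUM_EPISODES; ++e) {
            int s;
            do {                                  /* random start state ...   */
                s = rand() % NUM_STATES;
            } while (manhattanDistance(s) == 0);  /* ... but not the goal     */
            int steps = runEpisode(s);
            sum += (double)steps / manhattanDistance(s);  /* normalize by distance */
        }
        grandAverage += sum / NUM_EPISODES;
    }
    printf("average (steps per episode)/(distance to goal): %f\n",
           grandAverage / NUM_RUNS);
    return 0;
}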
Q2
When we are deciding which action is the greedy action, you said to use
the function Q(s,a). Which Q(s,a) function are we using? The one given
in the algorithm for Sarsa/Q-learning?
A2
There is only one value function for state-action pairs in the textbook;
it is always denoted by Q(s,a).
The value function for states is denoted by V(s) in the book.
In particular, the Q(s,a) function is used in the Sarsa
and Q-learning algorithms. In Chapter 4 on dynamic programming,
this function is characterized by a version of the Bellman equations for
the Q-function. But because in Chapter 6 on TD-learning the agent has
no information about the transition probabilities and rewards
(i.e., the agent does not know p[s][a][ss] and r[s][a][ss]),
it cannot use Bellman's full backup to compute
a better approximation in the iterative computation of the Q(s,a) function.
However, the agent can use sample values (the reward R and the next state S')
to update the current approximation to Q(s,a) (both in Sarsa and in
Q-learning). Because samples are generated by the environment
(which is simulated by the programmer) using the probabilities "p_1" and "p_2",
updates based on samples converge to the correct optimal Q*(s,a)
values (the ones that satisfy the Bellman optimality equation) in the limit, i.e.,
if the agent interacts with the environment for a long time
(and if "epsilon" decreases gradually to 0).
Q3
I have a question about what the agent does in the case of a doorway in
terms of accidental diagonal moves.
Is it possible for the agent to move to the states beside
where it is trying to go or not?
A3
First, note that there is a state S_b just before a doorway,
and there is a state S_a immediately after a doorway.
I did not draw a line between S_a and S_b
on the figure given in the assignment,
so you see a rectangle instead of two squares
representing S_a and S_b, but there are actually
two different states in each doorway.
Second, for the sake of uniformity, it is convenient
to treat the states in each doorway the same way as the other states in
"the free space". Hence, yes, it is possible
for the agent to move to the states S_l and S_r beside
the state S_a (which is located immediately after the doorway)
when the agent is trying to move through the doorway
from S_b to S_a.
Q4
Suppose the environment below, and suppose I want to evaluate the probabilities
of starting at state 0 and moving up.
+------+------+------+
|  0   |  1   |  2   |
+------+------+------+
|  3   |  4   |  5   |
+------+------+------+
|  6   |  7   |  8   |
+------+------+------+

So what is the probability of going from state 0 to state 0 by taking action up?
My reasoning is the following:
Q5
In assignment 3, are the numbers p1 and p2 related, i.e., does p1 = 1 - p2?
In our testing, can we use values of p1 and p2 such that p1 + p2 > 1?
A5
No, p1 + p2 cannot be greater than 1. You can assume that always
p1 + p2 <= 1,
but it can happen that p1 + p2 < 1 (so p1 is not necessarily equal to 1 - p2),
i.e., side moves can be possible.
Q6
When I'm implementing Monte Carlo on-policy control, where can the episodes start?
A6
Assume the agent starts each episode in the randomly chosen initial state
with a random action.
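For example (a minimal sketch only; isFreeCell() is a hypothetical name for
whatever test your program uses to exclude walls and the goal state, and
NUM_STATES/NUM_ACTIONS are assumed constants):

#include <stdlib.h>

#define NUM_STATES  100
#define NUM_ACTIONS 4

/* Hypothetical helper: returns 1 if state s is a cell the agent may start in
   (not a wall and not the goal); replace with your own test. */
extern int isFreeCell(int s);

/* Pick a uniformly random start state and a random first action. */
void randomStart(int *s, int *a)
{
    do {
        *s = rand() % NUM_STATES;      /* uniformly random state ...        */
    } while (!isFreeCell(*s));         /* ... that the agent can start from */
    *a = rand() % NUM_ACTIONS;         /* uniformly random first action     */
}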
Q7
What are we supposed to write in our report?
What kind of comparisons do we have to do?
A7
Describe briefly in your report the results of your experiments and draw a conclusion about
which approach is more efficient and with which values of the learning parameters. For each
set of parameters, collect data from several runs of your program and report data averaged
over many runs (to exclude minor variations due to random noise). Useful statistics
that characterize a learning algorithm are
(a) the average computation time per episode, and
(b) the average number of steps per episode before reaching the goal state,
in both cases divided by the distance from the random starting state to the goal.
The distance between states can be calculated as the Manhattan distance: the number of
horizontal and vertical steps from the initial state to the goal state. Because each episode
starts at a new random state, the number of steps and the computation time depend on this
starting state. You can account for this by dividing the time and the number of steps by
the distance from the start state to the goal state. However, print an optimal policy
only once. Make sure that your optimal policy is stable, i.e., it does not change if
you choose different random sequences on your computer.
Q8
I have 100 states and 4 possible actions mapping to another 100 states.
Hence, I have to consider an array p[100][4][100] with 40000 entries
(where "p" expresses the
probabilities of going from s to s' when the agent takes a stochastic action a).
But you said in class that we need only 400 entries? How can this be?
On the other hand, I understand that the array of state-action values Q
is Q[100][4], so it has 400 entries.
Could you explain to me how you can have an array of 400?
A8
Both in Sarsa and in Q-learning we need to find a successor state SS
whenever the agent takes an action in the current state S.
Note that in both cases the pseudo-code specifies the following:
"Take action a, observe r, s'."
The question is how to implement the instruction "observe s'".
There are different approaches to this implementation.
You are right that an array of transition probabilities "p"
has 40000 entries, but this array may not be the most convenient
way of finding a successor state SS after each transition from a state S.
In principle, you can define this array and then find
the next state SS by doing something like
int nextState = 0;
double totalProbability = 0.0;
while ( nextState < NumberOfStates ) {
    totalProbability = totalProbability + p[s][a][nextState];
    if ( x <= totalProbability )
        return nextState;
    else
        ++nextState;
}

where NumberOfStates = 100 in our case and x is a random number in the interval
between 0 and 1. You can obtain x from rand() or random(), simple pseudo-random
number generators (read "man rand" and "man random" for details), e.g., by
dividing the value returned by rand() by RAND_MAX. But it can be difficult to
define this array "p", and in addition, in the worst case it can take 100
iterations to find the next state after each transition.
Alternatively, you can find the next state SS directly from p1 and p2, without
defining the array "p". Generate a random number x between 0 and 1 and then:

if 0 <= x < p1 then
    SS = the state where the agent wants to go
else if p1 <= x < (p1 + p2) then
    SS = S (the current state)
else if (p1 + p2) <= x <= 1 then
    SS = one of the states beside the state where the agent is trying to go

The conditions above can have minor variations for states near the border or
near a wall and for states at the corners. Perhaps it can be convenient to
represent the states in the grid-world by their (x,y)-coordinates: this may
help to simplify the conditions that your program needs to check to find the
next state SS. Note a minor complication related to pseudo-random number
generators: you need to initialize your generator to a different starting value
by choosing a new seed (read "man srand" and "man srandom" for details) each
time you run your program, because you need a different sequence of
pseudo-random numbers in your simulations of the grid-world. To choose
different seed values, you can use the current system time converted to an
integer.
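For example, here is a rough sketch of such a next-state function using
(col,row) coordinates. It is only an illustration, not the required
implementation: the helper isBlocked(), the action encoding, the grid width
WIDTH, the assumption that a blocked move leaves the agent in place, and the
equal split of the leftover probability 1 - p1 - p2 between the two side moves
are all assumptions you should check against the assignment.

#include <stdlib.h>

#define WIDTH 10   /* grid width; a 10x10 grid gives 100 states */

/* Hypothetical helper: returns 1 if (col,row) is outside the grid or inside a
   wall; in that case the agent is assumed to stay where it is. */
extern int isBlocked(int col, int row);

/* Action encoding assumed here: 0 = up, 1 = down, 2 = left, 3 = right. */
static const int DC[4] = {  0, 0, -1, 1 };
static const int DR[4] = { -1, 1,  0, 0 };

/* Sample the successor of state (col,row) under action a, given p1 and p2. */
int nextState(int col, int row, int a, double p1, double p2)
{
    double x = (double)rand() / RAND_MAX;    /* random number in [0,1]            */
    int dc = DC[a], dr = DR[a];              /* intended displacement             */

    if (x < p1) {
        /* intended move: keep (dc,dr) as is                                      */
    } else if (x < p1 + p2) {
        dc = 0; dr = 0;                      /* the agent stays where it is       */
    } else if (x < p1 + p2 + (1.0 - p1 - p2) / 2.0) {
        int t = dc; dc = dr; dr = t;         /* accidental move to one side       */
    } else {
        int t = dc; dc = -dr; dr = -t;       /* accidental move to the other side */
    }

    int nc = col + dc, nr = row + dr;
    if (isBlocked(nc, nr)) { nc = col; nr = row; }  /* bump into border/wall      */
    return nr * WIDTH + nc;                  /* convert back to a state index     */
}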
Q9
My partner and I have run into a little difficulty with regard to selecting
the greedy action. We are not sure what criteria to base this selection on.
Could you please advise?
A9
To select an epsilon-greedy action using the current approximation to the Q(s,a) function
you need two steps.
First, generate a random number x in the interval from 0 to 1.
Second, check whether x is greater than or equal to
the value (1 - "exploration parameter"), where
"exploration parameter" = epsilon = 0.1 according to the assignment.
If x >= (1 - epsilon), then choose an action randomly
(this corresponds to exploration);
else find an index a=A where the function Q(s,a) has a maximal value
(this corresponds to exploitation, i.e., to greedy selection of an action).
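A minimal sketch of these two steps in C (NUM_ACTIONS and the array Q[s][a] are
assumed to come from your program; epsilon is hard-coded to 0.1 here):

#include <stdlib.h>

#define NUM_ACTIONS 4
#define EPSILON     0.1

/* Epsilon-greedy action selection for state s, using the current Q[s][a]. */
int selectAction(int s, double Q[][NUM_ACTIONS])
{
    double x = (double)rand() / RAND_MAX;      /* random number in [0,1]  */
    if (x >= 1.0 - EPSILON)
        return rand() % NUM_ACTIONS;           /* explore: random action  */

    int best = 0;                              /* exploit: greedy action  */
    for (int a = 1; a < NUM_ACTIONS; ++a)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}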
Q10
Should we implement some sort of stopping criterion for checking policy stability (like in
earlier assignments), or can we just accept the total number of episodes as user input and
tweak it manually?
A10
You can run first-visit MC, Q-learning and Sarsa over a fixed large number of episodes.
Choose the number of episodes large enough to guarantee convergence.
See the sample Blackjack program in the Handouts for an example of MC control.
Q11
Should we implement first-visit MC or every-visit MC, or both?
A11
Implementing first-visit on-policy MC control with epsilon-soft policies is fine.
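As a rough illustration only (not the handout's Blackjack code), here is a
sketch of the first-visit update for one recorded episode. The episode is
assumed to be stored in arrays states[], actions[], rewards[] of length T
(rewards[t] is the reward received after taking actions[t] in states[t]),
and GAMMA is a discount factor chosen here just for illustration.

#define NUM_STATES  100
#define NUM_ACTIONS 4
#define GAMMA       0.9   /* use the discount factor from the assignment */

double Q[NUM_STATES][NUM_ACTIONS];
double returnsSum[NUM_STATES][NUM_ACTIONS];
int    returnsCount[NUM_STATES][NUM_ACTIONS];

/* First-visit MC update after one recorded episode of length T. */
void updateFromEpisode(const int states[], const int actions[],
                       const double rewards[], int T)
{
    double G = 0.0;
    for (int t = T - 1; t >= 0; --t) {
        G = GAMMA * G + rewards[t];             /* return following time t */

        int first = 1;                          /* first-visit check: is this   */
        for (int k = 0; k < t; ++k)             /* the earliest visit to (s,a)? */
            if (states[k] == states[t] && actions[k] == actions[t]) { first = 0; break; }

        if (first) {
            int s = states[t], a = actions[t];
            returnsSum[s][a]   += G;
            returnsCount[s][a] += 1;
            Q[s][a] = returnsSum[s][a] / returnsCount[s][a];  /* average of returns */
        }
    }
}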