Q1
I program in Java. Should I name my programs via.ext and pia.ext
or via.java and pia.java?
A1
Name your programs according to the programming language
that you use; i.e., name them via.java and pia.java if you program
in Java, or via.c and pia.c if you program in C.
For other programming languages, replace "ext"
according to the standard naming convention of your
language. Remember to compile and run your programs on moon under Linux
before you submit them, because the TA will run them there.
Also, make sure that the TA can use his own file for testing.
Q2
Where can I find an example of a C implementation of the policy iteration algorithm?
A2
You can read Example 4.3 (Gambler's Problem) and its solution in
the Handouts.
Q3
I need a bit of clarification.
In the programming part of the assignment, you say:
"In both cases, output the number of
iterations, computation time (for each iteration)
and the optimal policy."
The part that I'm unsure about is what you consider
to be an iteration. In value iteration there is only
one repeated part in the algorithm, so that's fairly
obvious. In policy iteration, however, there is an
inner loop and an outer loop. So, looking at the pseudocode in Section 4.3,
is it the "Loop ... until" in step 2 that counts as the
iteration, or is it the cycle of policy evaluation
and policy improvement?
A3
In the case of policy iteration, we are interested
in iterations of policy improvement (the cycle of policy evaluation
and policy improvement, as you say), i.e., in iterations of
the outer loop. The number of iterations of the inner loop
(the "repeat ... until" loop in part 2 of the pseudocode in Section 4.3)
used to approximate the value function of the current policy
is less important; it is accounted for anyway, because you also
measure the time taken per iteration.
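For example, the outer loop could be organized like this in Java (a sketch only; the routine names policyEvaluation and policyImprovement are placeholders for your own code, with policyImprovement assumed to return true once the policy is stable):

// Outer loop of policy iteration: count and time only these iterations.
static void runPolicyIteration(Runnable policyEvaluation,
                               java.util.function.BooleanSupplier policyImprovement) {
    int iterations = 0;
    boolean policyStable = false;
    while (!policyStable) {
        long start = System.nanoTime();
        policyEvaluation.run();                         // inner "repeat ... until" loop runs to convergence here
        policyStable = policyImprovement.getAsBoolean();
        iterations++;
        double ms = (System.nanoTime() - start) / 1e6;
        System.out.println("Iteration " + iterations + ": " + ms + " ms");
    }
    System.out.println("Number of policy-improvement iterations: " + iterations);
}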
Q4
In Part 3 it seems that we have 4 different types of rewards (r_up,
r_down, r_right, r_left). So, from what I understand, these are given to
the agent depending on where the agent ends up after making a
move, regardless of where it was intending to go. E.g., the agent moved right,
but due to the stochastic nature of the environment it ended up going left;
therefore the reward is going to be r_left. Could you confirm whether this is
correct?
A4
No, the reward is determined by the action that the agent was attempting,
not by the actual outcome of the action.
For example, if the agent was trying to perform the action "up", then
it will get the reward r_up even if in reality it remains
in the same state.
Q5
I assume in this setup there is only one policy which we have to
improve. Or should we make multiple policies and iterate over them?
A5
You are told that the initial policy is the equiprobable random policy,
the same as in Example 4.1. See also Q13/A13 below.
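If you choose to represent the initial policy explicitly as action probabilities (rather than as a single action per state, as described in A7 below), the equiprobable random policy is simply a table of equal probabilities. A minimal sketch in Java, with my own array name piProb (not prescribed by the assignment):

// Equiprobable random initial policy: in every non-terminal state,
// each of the 4 actions (up, down, left, right) has probability 0.25.
double[][] piProb = new double[14][4];
for (int s = 0; s < 14; s++) {
    for (int a = 0; a < 4; a++) {
        piProb[s][a] = 0.25;
    }
}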
Q6
I have read chapters 3 and 4 many times and I am still unsure how to
compute V(s). The formula given is vague. According to the formula
in the book, in order to compute V(s) we have to know V(s').
How do I know what V(s') is?
A6
The values V(s) and V(s') are not quantities you must know in advance:
they are the unknowns of a system of Bellman equations, one equation per state,
and each V(s') that appears on the right-hand side of one equation appears
on the left-hand side of another. You determine them together, either by solving
this linear system directly or by iterative policy evaluation, which starts from
arbitrary initial values and repeatedly applies the update until the values stop changing.
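As an illustration only (a made-up three-state example, not the actual system from the assignment): suppose state s1 always moves to s2, s2 always moves to the terminal state s3 with V(s3) = 0, every transition gives reward -1, and no discounting is used. The Bellman equations are

V(s1) = -1 + V(s2)
V(s2) = -1 + V(s3)
V(s3) = 0

Solving them together gives V(s2) = -1 and V(s1) = -2. In the gridworld the equations are coupled in both directions, so you solve them simultaneously (or approximate the solution iteratively), but the principle is the same: V(s) and V(s') are determined jointly as the solution of one system.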
Q7
In terms of the programming
question, I've read the algorithm and kind of understand it, but being able
to actually apply it in a program is a different matter. Hopefully you
can give me and my partner some insight.
A7
First of all, you need some data structures. If you work with C or Java,
you can use an array v[s] to represent the values of states s. Because
the problem has 14 usual states and 1 terminal state, this array
must have 15 elements. Similarly, you need an array of transition
probabilities p[s][a][s']; remember that we have 4 different actions a in each state.
Also, you need an array of rewards r[a]. In the general case,
one may need an array r[s][a][s'], but because in this problem
the reward depends only on the agent's action, and neither on
the current state nor on the successor state, a 1-dimensional array of rewards
indexed by the action is enough.
Note that the values of the array p[s][a][s'] are defined in the problem:
you can initialize the array from the double numbers p1 and p2
entered by the user. Probabilities of transitions from the
terminal state are 0 no matter what the agent does.
Finally, you need an array to represent a policy pi[s]: elements of
this array can be the numbers 0, 1, 2, 3 or the characters (u, d, l, r),
whichever you find more convenient. This array needs only 14 elements,
because no transitions are made from the terminal state.
Now you have everything you need to convert the pseudo-code given
in the textbook into a working program.
In your program, for every element s of the array v[s],
you combine the values v[s'] with the rewards and the transition probabilities:
for each successor state s' you take the reward r[a] plus v[s'] (discounted,
if your assignment uses a discount factor), multiply the result by p[s][a][s'],
and add these terms up. The sum over s' in the pseudo-code means that you can
use a for-loop over s' to accumulate all the required terms. The max operation
over actions can easily be implemented by another for-loop that compares the
4 resulting numbers and keeps the greatest one.
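To make this concrete, here is a minimal sketch of one such backup in Java (the constant names, the action numbering and the discount factor GAMMA are my own choices, not prescribed by the assignment; set GAMMA to 1.0 for the undiscounted case):

static final int NUM_STATES = 15;      // 14 usual states + 1 terminal state
static final int NUM_ACTIONS = 4;      // 0 = up, 1 = down, 2 = left, 3 = right
static final double GAMMA = 1.0;       // discount factor; adjust to your assignment

// One backup for state s:  v[s] <- max_a  sum_{s'} p[s][a][s'] * (r[a] + GAMMA * v[s'])
// (Skip the terminal state in your sweep; its value stays 0.)
static double backup(int s, double[] v, double[][][] p, double[] r) {
    double best = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < NUM_ACTIONS; a++) {
        double sum = 0.0;                              // expected return of taking action a in s
        for (int next = 0; next < NUM_STATES; next++) {
            sum += p[s][a][next] * (r[a] + GAMMA * v[next]);
        }
        best = Math.max(best, sum);                    // max over the 4 actions
    }
    return best;
}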
Q8
When I compute a new approximation of V(s), should I use one array
v[s] or two arrays? For example, I can keep my previous approximation in v[s]
and compute values of the array new_v[s] and later
update all elements in v[s] by elements from new_v[s]
when I exit from the for-loop over states in the policy evaluation step
of the algorithm (step 2).
A8
A quote from the text-book (Section 4.1):
"To write a sequential computer program to implement iterative policy evaluation,
as given by (4.5), you would have to use two arrays, one for the old values, and one for the new
values. This way the new values can be computed one by one from the old values without
the old values being changed.
Of course it is easier simply to use one array and update the values "in place", that is,
with each new backed-up value immediately overwriting the old one.
Then, depending on the order in which the states are backed up, sometimes new values
are used instead of old ones on the righthand side of (4.5).
This slightly different algorithm also converges to V(s);
in fact, it usually converges faster than the two-array version, as you might expect
since it uses new data as soon as it is available. We think of the backups as
being done in a sweep through the state space.
The order in which states are backed up during the sweep
has a significant influence on the rate of convergence.
We usually have the in-place version in mind when we think of DP algorithms."
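To make the difference concrete, here is a hedged sketch of the two sweep variants in Java (the array layout follows A7, with the terminal state assumed to be the last index; pi[s] is a deterministic policy giving the action for state s, and gamma is the discount factor, if any):

// Two-array sweep: every new value is computed only from the old values.
static void sweepTwoArrays(double[] v, double[][][] p, double[] r, int[] pi, double gamma) {
    double[] newV = new double[v.length];
    for (int s = 0; s < v.length - 1; s++) {          // skip the terminal state (last index)
        newV[s] = expectedUpdate(s, v, p, r, pi, gamma);
    }
    System.arraycopy(newV, 0, v, 0, v.length - 1);    // copy all new values back at once
}

// In-place sweep: new values overwrite old ones immediately and may already be
// used by later states in the same sweep; this usually converges faster.
static void sweepInPlace(double[] v, double[][][] p, double[] r, int[] pi, double gamma) {
    for (int s = 0; s < v.length - 1; s++) {
        v[s] = expectedUpdate(s, v, p, r, pi, gamma);
    }
}

// Expected one-step return of following policy pi from state s.
static double expectedUpdate(int s, double[] v, double[][][] p, double[] r, int[] pi, double gamma) {
    int a = pi[s];
    double sum = 0.0;
    for (int next = 0; next < v.length; next++) {
        sum += p[s][a][next] * (r[a] + gamma * v[next]);
    }
    return sum;
}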
Q9
For question 3, there are 14 states plus 1 terminal. One
terminal is between states 4 and 1 and another terminal is between states 11 and 14.
Then that means there should be 2 terminals and 14 states, so we would have to
make the array of size 16. (We are referring to Figure 4.1.)
A9
According to Example 4.1, there is only 1 terminal state,
all other states are not terminal (i.e., they are not absorbing states).
A quote from the text-book:
"The terminal state is shaded in the figure (although it is
shown in two places it is formally one state)."
This means that there is no need to introduce 2 different terminal states;
it is enough to have just one.
Q10
I would like to ask you this question:
+---+---+---+---+
| 0 | 1 | 2 | 3 |
+---+---+---+---+
| 4 | 5 | 6 | 7 |
+---+---+---+---+
| 8 | 9 |10 |11 |
+---+---+---+---+
|12 |13 |14 |15 |
+---+---+---+---+
If I am in 8 and move UP:
p1 brings me to 4,
p2 brings me to 8,
(1-p1-p2)/2 brings me to 5,
(1-p1-p2)/2 brings me to which state?
A10
prob[8][up][4] = p1 + (1-p1-p2)/2 = (1+p1-p2)/2
prob[8][up][8] = p2
prob[8][up][5] = (1-p1-p2)/2
For all other states s', prob[8][up][s'] = 0.
Similarly,
prob[8][down][12] = p1 + (1-p1-p2)/2
prob[8][down][8] = p2
prob[8][down][13] = (1-p1-p2)/2
For all other states s', prob[8][down][s'] = 0.
This means that the probability of moving in the desired direction is a bit higher
when the agent moves along the edge of the grid than when it moves in "the free space"
and can drift by chance (due to slippage or inaccuracies in the wheels) to one of the
two adjacent states. Note that this minor difference between states inside the grid
and states along the edge of the grid has no effect on the policy if all actions
{up, down, left, right} are deterministic and always move the agent in the direction
it heads for, i.e., if p1 = 1.0 and p2 = 0.0.
A follow-up question: given
prob[8][left][8] = p1 + p2
prob[8][left][?] = (1-p1-p2)/2,
are the adjacent states [?] (5 and 13) or (4 and 12)?
Answer:
prob[8][left][8] = p1 + p2
prob[8][left][4] = (1-p1-p2)/2
prob[8][left][12] = (1-p1-p2)/2
In addition,
prob[12][left][12] = p1 + p2
prob[12][left][8] = (1-p1-p2)/2
prob[12][left][13] = (1-p1-p2)/2
and
prob[12][down][12] = p1 + p2
prob[12][down][8] = (1-p1-p2)/2
prob[12][down][13] = (1-p1-p2)/2
For state 7, the corresponding joint transition and reward probabilities are:
p(7, r_up | 7, up) = p2
p(3, r_up | 7, up) = p1 + (1-p1-p2)/2
p(2, r_up | 7, up) = (1-p1-p2)/2
p(s', r_up | 7, up) = 0, for all other states s'.
p(7, r_right | 7, right) = p1 + p2
p(3, r_right | 7, right) = (1-p1-p2)/2
p(11, r_right | 7, right) = (1-p1-p2)/2
p(s', r_right | 7, right) = 0, for all other states s'.
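A hedged sketch of how these particular entries could be filled in (Java; the action constants and the helper fillState8 are my own naming, and the state numbering 0..15 follows the 4x4 grid in the question):

static final int UP = 0, DOWN = 1, LEFT = 2, RIGHT = 3;

// Fill in the entries discussed above for state 8 (left edge of the third row),
// given the user-supplied probabilities p1 and p2. All other prob[8][a][s']
// entries stay 0.0, the default value of a Java array.
static void fillState8(double[][][] prob, double p1, double p2) {
    double drift = (1.0 - p1 - p2) / 2.0;

    prob[8][UP][4]    = p1 + drift;   // one drift target is off the grid, so (per A10) its probability is added to 4
    prob[8][UP][8]    = p2;
    prob[8][UP][5]    = drift;

    prob[8][DOWN][12] = p1 + drift;
    prob[8][DOWN][8]  = p2;
    prob[8][DOWN][13] = drift;

    prob[8][LEFT][8]  = p1 + p2;      // moving into the wall keeps the agent in place
    prob[8][LEFT][4]  = drift;
    prob[8][LEFT][12] = drift;
}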