CSC2547 – Introduction to Reinforcement Learning (Spring 2021)
计算机cs代考 Your first question is about the dimension of the state and action spaces. Your friend tells you that the agent receives a 100 100
Student Number:
Last (Family) Name(s):
First (Given) Name(s):
- Deadline: Friday, April 2nd, 2021 at 23:59.
Make sure to write your name and student number on the 计算机cs代考
|
You may either (1) print these pages, answer each question directly on the paper, and then scan it and upload it as a PDF file, or (2) type your answers using your favourite word processor using the same order of questions as here. In the latter case, make sure you use the right question numbers and part (e.g., 1(a), 4(e), etc.). If you don’t, you may not get the mark.
|
You should help us having a fair take-home test. You may consult the slides, the lecture notes, or other textbooks. We in fact encourage you to do so. But do not discuss the questions with anyone else (including on Piazza). And do not search for answer to these questions on the Internet.
|
Do not share this take-home test with anyone else, even after the semester ends, as I may reuse some of these questions in the future. 计算机cs代考
|
There will not be an auto-fail in this take-home test. But try hard to do a good job. This will be a good opportunity to practice your skills once more, and get some feedback.
Submission:
You need to submit one PDF file through MarkUs. You can produce the file however you like (e.g. LATEX, Microsoft Word, etc) as long as it is readable. Points will be deducted if we have a hard time reading your
solutions.
|
Late Submission:
10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted. 计算机cs代考
|
Collaboration: The take-home test must be done individually, and you can- not collaborate with others.
Marking Guide
|
No No No No No No No No No
TOTAL: /100
Question 1. Short Questions [14 marks]
These are several short questions that only require one or two lines of answers. 计算机cs代考
Part (a) Write down the Bellman equations for Qπ and Q∗. (2 answers) [4 marks]
Part (b) Write down the update rule for Q-Learning and SARSA. (2 answers) [4 marks]
Part (c) Write down the TD error for policy evaluation of a value function. [2 marks]
Part (d) What is the relation between the TD error and the Bellman Residual?
[2 marks]
Part (e) Does TD error ever go to zero? Explain! [2 marks] 计算机cs代考
Question 2. Policy Iteration [8 marks]
Part (a) Write down the Policy Iteration algorithm [4 marks]
Part (b) Does PI ever converge? If so, under what conditions? [2 marks]
Part (c) Is Fitted Q-Iteration an example of the Policy Iteration algorithm? Explain!
[2 marks]
Question 3. Value Iteration [10 marks]
Part (a) Write down the Value Iteration algorithm for computing Q∗. [4 marks] 计算机cs代考
Part (b) Does VI converge to Q∗ in a finite number of iterations? Explain! [2 marks]
|
Part (c) If VI converge, specify the number of required iterations K such that QK = Q∗. If it does not, provide an error bound on QK Q∗ . In either case, formally prove your claim. [4 marks]
Question 4. Monte Carlo vs. Temporal Difference [12 marks] 计算机cs代考
Part (a) MC Estimate of V π(x): Briefly describe how an MC estimate of V π(x) can be obtained. [2 marks]
Part (b) Is an MC estimate a biased or unbiased estimate? Explain! [2 marks]
Part (c) Write down the TD estimate for V π(x). [2 marks]
Part (d) Is the TD estimate of V π(x) biased or unbiased? Explain! [2 marks]
Part (e) Briefly compare the MC and TD estimates of V π(x). [2 marks]
Part (f ) In what situations do you prefer an MC estimate to a TD estimate? [2 marks] 计算机cs代考
Question 5. Help a Friend Solve an RL Problem [20 marks]
Your friend is trying to train an RL agent to play a computer game, but he has been facing some challenges. You are helping him to figure out the error.
Part (a) State or not? [2 marks]
|
Your first question is about the dimension of the state and action spaces. Your friend tells you that the agent receives a 100 100 grayscale image from the game engine, and has to choose one of the 4 available actions.
You have to first figure out whether the input is a state or not. You ask your friend about it, but he does not know what a state is. Briefly explain what a state is and why it is important to make sure the agent has access to it. 计算机cs代考
Part (b) Making a state from observations [2 marks]
|
After your explanation, your friend tells you that a single 100 100 grayscale frame is not the state of the system. Suggest him how he can construct the state from the image frames.
Part (c) Q-Learning with function approximator [4 marks]
Your next question is about the algorithm. Your friend mentions that he is using the Q-Learning algorithm with a Deep Neural Network as a function approximator. The standard Q-Learning algorithm is presented for a finite state-action MDPs. You can derive it for the case of function approximation too. Write down the update rule for the Q-Learning algorithm with the nonlinear function approximator to make sure that it is the same algorithm that your friend is using (look at Semi-gradient viewpoint in the lecture notes).
Part (d) Advise against Q-Learning with function approximator [2 marks] 计算机cs代考
You advise your friend against using the Q-Learning algorithm with a function ap- proximator. Why?
Part (e) Q-Learning vs. DQN [4 marks]
It turns out that your friend is not using what you derived, but is using the DQN algorithm. Briefly write down the DQN algorithm and describe why it is different from the online Q-Learning algorithms.
Part (f ) The FA is not suitable! [2 marks]
|
Now you ask about the detail of the function approximation. Your friend is a using a NN to represent Qˆ. The architecture is a feedforward NN with 100 100 inputs and 4 outputs with linear units. Explain two reasons that this may not be a good choice for this problem.
(If you are not familiar with DNNs, instead of this question and the next, suggest
a function approximator for this particular problem.)
Part (g) Designing a function approximator [4 marks]
Suggest a DNN architecture to your friend that might be suitable for this problem. Of course, this is just a guess, and at the end of the day, one has to try the architecture and see if it works.
Question 6. LSTD for Finite State Problems [8 marks] 计算机cs代考
The LSTD method was presented in the context of problems with large (e.g., con- tinuous) state spaces and linear function approximator (Sections 5.1.4.2 and 5.2.4 of the Lecture Notes). We can, however, use the same formulation for finite state problems. The goal of this question is to see how LSTD looks like if we apply it to a finite problem. For simplicity, here you only need to focus on the population version of the LSTD. And focus on the value function V and not the action-value function.
Part (a) Choice of basis functions [4 marks]
|
Suppose you have an MDP with N = states. You want to use a linear func- tion approximator V (x) = φT(x)w to represent the value function, with φ(x) = [φ1(x), . . . , φp(x)].
What basis functions do you use in order to represent any value function exactly ?
|
Write them explicitly (you can specify an N p matrix Φ). 计算机cs代考
(Hint: What is the right choice for p? Can you solve this problem if p = 1 and N = 10?)
Part (b) What is the LSTD solution w for your choice of Φ? [4 marks]
Question 7. BRM vs FQI [12 marks]
|
Assume that you are given a dataset Dn = {(Xi, Ai, Ri, Xij)}n . You want to esti-
mate Qπ by a Q ∈ F using either BRM or FQI.
Part (a) Write down the optimization problem for the Bellman Residual Minimiza- tion. [4 marks]
Part (b) Explain why this is not a good objective. [2 marks]
Part (c) Write down the optimization problem for the Fitted Q-Iteration algorithm.
[4 marks]
Part (d) What is the difference between the objective of BRM and FQI? [2 marks]
Question 8. Play Risky or Safe [14 marks] 计算机cs代考
You are in a situation (call it state x1) where you have to decide between choosing a risky action or a safe one. If you choose the safe action (asafe), you go to state x2 and stays there forever. In state x2, you always get a reward of r(x2) = 0. So it is a neutral state. Being there is not bad, but it is not rewarding either.
On the other hand, if you choose action arisky at x1, there is a 0.99 chance that you stay in x1 and 0.01 chance that you go to x3. Being in state x1 is rewarding as for every time step you are there, you get a reward r(x1) = 1. But if you go to x3, you are stuck there and get a negative reward of r(x3) < 0 at each time step until eternity (the value of the reward is not specified yet). This continuing MDP is shown in Figure 1.
Figure 1: Risky or Safe 计算机cs代考
Part (a) What are the policies here? Specify them explicitly. [2 marks]
(Hint: You do not have any actions at states x2 and x3.)
Part (b) Write down the value function for states x2 and x3 (The solution may depend on r = r(x3)). [4 marks]
(Hint: The values of these two states do not depend on the policy.)
Part (c) Write down the value function for state x1 for all policies that you have specified before. [6 marks]
Part (d) What is the optimal policy? It should depend on r = r(x3). Specify the range of r such that one of them is the optimal choice. [2 marks]
Question 9. Course Evaluation [2 marks] 计算机cs代考
Please fill the course evaluation! I will read all your comments and it helps me improve the course in the future. Also feel free to send me an email if you want to provide more detailed comments.
If you did, please specify it here. You get the point if you have filled it.
发表回复
要发表评论,您必须先登录。