
CSC2547 – Introduction to Reinforcement Learning (Spring 2021)

 


Student Number:

Last  (Family) Name(s):                                                                                 

First  (Given) Name(s):                                                                                 

 

 

 

 


  • Deadline: Friday, April 2nd, 2021 at 23:59.


  • Make sure to write your name and student number.

















You may either (1) print these pages, answer each question directly on the paper, and then scan it and upload it as a PDF file, or (2) type your answers using your favourite word processor, following the same order of questions as here. In the latter case, make sure you use the right question numbers and parts (e.g., 1(a), 4(e), etc.). If you don't, you may not get the mark.















You should help us have a fair take-home test. You may consult the slides, the lecture notes, or other textbooks. We in fact encourage you to do so. But do not discuss the questions with anyone else (including on Piazza). And do not search for answers to these questions on the Internet.















Do not share this take-home test with anyone else, even after the semester ends, as I may reuse some of these questions in the future.















There will not be an auto-fail in this take-home test. But try hard to do a good job. This will be a good opportunity to practice your skills once more, and get some feedback.


 



  • Submission:



You need to submit one PDF file through MarkUs. You can produce the file however you like (e.g., LaTeX, Microsoft Word, etc.) as long as it is readable. Points will be deducted if we have a hard time reading your solutions.

 















Late Submission:

10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.















Collaboration: The take-home test must be done individually, and you cannot collaborate with others.

Marking Guide

 































































1:            /14
2:            /8
3:            /10
4:            /12
5:            /20
6:            /8
7:            /12
8:            /14
9:            /2

 


 

TOTAL:             /100

 

 

Question 1. Short Questions [14 marks]

These are several short questions that only require one or two lines of answers.

Part (a) Write down the Bellman equations for Q^π and Q^*. (2 answers) [4 marks]

 

 

 

Part (b) Write down the update rules for Q-Learning and SARSA. (2 answers) [4 marks]

 

 

 

 

 

Part (c)   Write down the TD error for policy evaluation of a value function.                    [2 marks]

 

 

 

 

 

Part (d)    What is the relation between the TD error and the Bellman Residual?

[2 marks]

 

 

 

 

 

Part (e) Does the TD error ever go to zero? Explain! [2 marks]

 

 

Question 2. Policy Iteration [8 marks]

Part (a) Write down the Policy Iteration algorithm. [4 marks]

 

 

 

 

 

Part (b) Does PI ever converge? If so, under what conditions? [2 marks]

 

 

 

 

 

Part (c) Is Fitted Q-Iteration an example of the Policy Iteration algorithm? Explain!

[2 marks]

 

 

Question 3. Value Iteration [10 marks]

Part (a) Write down the Value Iteration algorithm for computing Q^*. [4 marks]

 

 

 

Part (b) Does VI converge to Q^* in a finite number of iterations? Explain! [2 marks]

 

 

 

 













Part (c) If VI converges, specify the number of required iterations K such that Q_K = Q^*. If it does not, provide an error bound on ‖Q_K − Q^*‖_∞. In either case, formally prove your claim. [4 marks]

 

 

Question 4. Monte Carlo vs. Temporal Difference [12 marks]

Part (a) MC Estimate of V^π(x): Briefly describe how an MC estimate of V^π(x) can be obtained. [2 marks]

 

 

 

Part (b) Is an MC estimate a biased or unbiased estimate? Explain! [2 marks]

 

 

 

Part (c) Write down the TD estimate for V^π(x). [2 marks]

 

 

 

Part (d) Is the TD estimate of V^π(x) biased or unbiased? Explain! [2 marks]

 

 

 

 

 

Part (e) Briefly compare the MC and TD estimates of V^π(x). [2 marks]

 

 

 

 

 

Part (f) In what situations do you prefer an MC estimate to a TD estimate? [2 marks]

 

 

Question 5. Help a Friend Solve an RL Problem [20 marks]

Your friend is trying to train an RL agent to play a computer game, but he has been facing some challenges. You are helping him figure out what is going wrong.

Part (a) State or not? [2 marks]














Your first question is about the dimension of the state and action spaces. Your friend tells you that the agent receives a 100 × 100 grayscale image from the game engine, and has to choose one of the 4 available actions.

You have to first figure out whether the input is a state or not. You ask your friend about it, but he does not know what a state is. Briefly explain what a state is and why it is important to make sure the agent has access to it.

 

Part (b) Making a state from observations [2 marks]














After your explanation, your friend tells you that a single 100 × 100 grayscale frame is not the state of the system. Suggest to him how he can construct the state from the image frames.
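(One common construction, sketched below in Python with hypothetical names, is to treat a short history of the most recent frames as the state; this is only an illustration under that assumption, not the unique answer.)

import numpy as np
from collections import deque

# Illustrative sketch only: keep the last k frames and stack them into the state.
# The names `k`, `frames`, and `observe`, and the choice k = 4, are assumptions.
k = 4
frames = deque(maxlen=k)                   # the k most recent 100x100 grayscale frames

def observe(frame):
    frames.append(frame)
    while len(frames) < k:                 # pad with copies at the start of an episode
        frames.append(frame)
    return np.stack(list(frames), axis=0)  # state: an array of shape (k, 100, 100)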

 

Part (c) Q-Learning with function approximator [4 marks]

Your next question is about the algorithm. Your friend mentions that he is using the Q-Learning algorithm with a Deep Neural Network as a function approximator. The standard Q-Learning algorithm is presented for finite state-action MDPs. You can derive it for the case of function approximation too. Write down the update rule for the Q-Learning algorithm with a nonlinear function approximator to make sure that it is the same algorithm that your friend is using (look at the semi-gradient viewpoint in the lecture notes).
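(As a point of reference, a minimal sketch of one such semi-gradient update is given below; the callables q(s, a, theta) and grad_q(s, a, theta), which evaluate the approximator and its gradient with respect to theta, are hypothetical names, and this is only an illustration of the general form.)

def q_learning_update(theta, s, a, r, s_next, actions, gamma, alpha, q, grad_q):
    # TD target: bootstrap with a max over the actions available at the next state.
    target = r + gamma * max(q(s_next, b, theta) for b in actions)
    # Semi-gradient: the bootstrapped target is treated as a constant, and the
    # gradient is taken only through the current estimate q(s, a, theta).
    delta = target - q(s, a, theta)
    return theta + alpha * delta * grad_q(s, a, theta)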

 

Part (d) Advise against Q-Learning with function approximator [2 marks]

You advise your friend against using the Q-Learning algorithm with a function approximator. Why?

 

Part (e) Q-Learning vs. DQN [4 marks]

It turns out that your friend is not using what you derived, but is using the DQN algorithm. Briefly write down the DQN algorithm and describe why it is different from the online Q-Learning algorithms.
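(A compressed sketch of the usual ingredients, experience replay and a periodically refreshed target network, is given below; the environment interface and the helpers epsilon_greedy and train_step are hypothetical, and this is not meant as the unique correct write-up.)

def dqn_sketch(env, q_net, target_net, buffer, num_steps, batch_size, sync_every, gamma):
    # Illustrative outline only; q_net and target_net are assumed to be PyTorch-style modules.
    s = env.reset()
    for t in range(num_steps):
        a = epsilon_greedy(q_net, s)                 # behaviour policy from the online network
        s_next, r, done = env.step(a)
        buffer.add((s, a, r, s_next, done))          # store the transition for replay
        batch = buffer.sample(batch_size)            # minibatch of past transitions (decorrelates updates)
        train_step(q_net, target_net, batch, gamma)  # regress q_net towards r + gamma * max_b target_net(s', b)
        if t % sync_every == 0:
            target_net.load_state_dict(q_net.state_dict())  # refresh the frozen target network
        s = env.reset() if done else s_next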

 

Part (f) The FA is not suitable! [2 marks]














Now you ask about the details of the function approximation. Your friend is using a NN to represent Q̂. The architecture is a feedforward NN with 100 × 100 inputs and 4 outputs with linear units. Explain two reasons why this may not be a good choice for this problem.

(If you are not familiar with DNNs, instead of this question and the next, suggest a function approximator for this particular problem.)

 

Part (g) Designing a function approximator [4 marks]

Suggest a DNN architecture to your friend that might be suitable for this problem. Of course, this is just a guess, and at the end of the day, one has to try the architecture and see if it works.
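(One plausible starting point, sketched in PyTorch under the assumption that 4 stacked 100 × 100 frames form the input, is given below; the layer sizes are guesses to be tuned rather than a prescribed answer.)

import torch.nn as nn

# Hypothetical sketch: 4 stacked 100x100 grayscale frames in, 4 action values out.
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4),   # (4, 100, 100) -> (16, 24, 24)
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),  # (16, 24, 24) -> (32, 11, 11)
    nn.ReLU(),
    nn.Flatten(),                                # 32 * 11 * 11 = 3872 features
    nn.Linear(3872, 256),
    nn.ReLU(),
    nn.Linear(256, 4),                           # one Q-value per available action
)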

 

 

Question 6. LSTD for Finite State Problems [8 marks]

The LSTD method was presented in the context of problems with large (e.g., continuous) state spaces and a linear function approximator (Sections 5.1.4.2 and 5.2.4 of the Lecture Notes). We can, however, use the same formulation for finite state problems. The goal of this question is to see what LSTD looks like if we apply it to a finite problem. For simplicity, you only need to focus on the population version of LSTD here, and on the value function V rather than the action-value function.
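(Recall that, for policy evaluation of a policy π with a linear approximator, the population LSTD solution can be written, possibly with notation slightly different from the notes, as the w solving A w = b with A = E[φ(X)(φ(X) − γφ(X'))^T] and b = E[φ(X) R], where X' is the next state and R the reward when following π.)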

 

 

Part (a) Choice of basis functions [4 marks]














Suppose you have an MDP with |X| = N states. You want to use a linear function approximator V(x) = φ^T(x) w to represent the value function, with φ(x) = [φ_1(x), . . . , φ_p(x)].

What basis functions do you use in order to represent any value function exactly?














Write them explicitly (you can specify an N × p matrix Φ).

(Hint: What is the right choice for p? Can you solve this problem if p = 1 and N = 10?)

Part (b) What is the LSTD solution w for your choice of Φ? [4 marks]

 

 

Question 7. BRM vs FQI [12 marks]














Assume that you are given a dataset D_n = {(X_i, A_i, R_i, X'_i)}_{i=1}^n. You want to estimate Q^π by a Q ∈ F using either BRM or FQI.

Part (a) Write down the optimization problem for the Bellman Residual Minimization. [4 marks]

 

 

Part (b) Explain why this is not a good objective. [2 marks]

 

Part (c) Write down the optimization problem for the Fitted Q-Iteration algorithm.

[4 marks]

 

Part (d)     What is the difference between the objective of BRM and FQI?  [2 marks]

 

 

Question 8. Play Risky or Safe [14 marks]

You are in a situation (call it state x1) where you have to decide between choosing a risky action or a safe one. If you choose the safe action (a_safe), you go to state x2 and stay there forever. In state x2, you always get a reward of r(x2) = 0. So it is a neutral state. Being there is not bad, but it is not rewarding either.

On the other hand, if you choose action a_risky at x1, there is a 0.99 chance that you stay in x1 and a 0.01 chance that you go to x3. Being in state x1 is rewarding: for every time step you are there, you get a reward of r(x1) = 1. But if you go to x3, you are stuck there and get a negative reward of r(x3) < 0 at each time step forever (the value of this reward is not specified yet). This continuing MDP is shown in Figure 1.

Figure 1: Risky or Safe
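(As a reminder, and assuming a discount factor γ < 1: staying forever in a state with a constant per-step reward c yields a discounted return of Σ_{t=0}^∞ γ^t c = c/(1 − γ).)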

 

 

Part (a) What are the policies here? Specify them explicitly. [2 marks]

(Hint: You do not have any actions at states x2 and x3.)

 

 

Part (b)    Write down the value function for states x2 and x3 (The solution may depend on r = r(x3)). [4 marks]

(Hint: The values of these two states do not depend on the policy.)

 

Part (c) Write down the value function for state x1 for all policies that you have specified before. [6 marks]

 

 

Part (d) What is the optimal policy? It should depend on r = r(x3). Specify the range of r for which each policy is the optimal choice. [2 marks]

 

 

Question 9. Course Evaluation [2 marks]

Please fill out the course evaluation! I will read all your comments, and they help me improve the course in the future. Also feel free to send me an email if you want to provide more detailed comments.

If you did, please say so here. You get the point if you have filled it out.
