COMP3702 Artificial Intelligence (Semester 2, 2021)

Assignment 3: DragonGame Reinforcement Learning


Key information:

Due: 4pm, Monday 1 November

This assignment will assess your skills in developing algorithms for solving Reinforcement Learning Problems.

Assignment 3 contributes 20% to your final grade.

This assignment consists of two parts: (1) programming and (2) a report.

This is an individual assignment.

Both code and report are to be submitted via Gradescope (https://www.gradescope.com/).

Your program (Part 1, 60/100) will be graded using the Gradescope code autograder, using testcases similar to those in the support code provided at https://gitlab.com/3702-2021/a3-support.

Your report (Part 2, 40/100) should fit the template provided, be in .pdf format and named according to the format a3-COMP3702-[SID].pdf, where SID is your student ID. Reports will be graded by the teaching team.


The DragonGame AI Environment

“Untitled Dragon Game”, or simply DragonGame, is a 2.5D platformer game in which the player must collect all of the gems in each level and reach the exit portal, making use of a jump-and-glide movement mechanic and avoiding landing on lava tiles. DragonGame is inspired by the “Spyro the Dragon” game series from the original PlayStation. In Assignment 3, actions may again have non-deterministic outcomes, but in addition, the transition probabilities and reward values are unknown.

To solve a level, your AI agent must explore the environment and determine a policy (mapping from states to actions) which collects all gems and reaches the exit while incurring the minimum expected cost, which is equivalent to maximising the expected reward.


DragonGame as a Reinforcement Learning problem

In this assignment, you will write the components of a program to play DragonGame, with the objective of finding a high-quality solution to the problem using various reinforcement learning algorithms. This assignment will test your skills in defining reinforcement learning algorithms for a practical problem and understanding of key algorithm features and parameters.


What is provided to you

We will provide supporting code in Python only, in the form of:

1. A class representing a DragonGame game map and a number of helper functions

2. A parser method to take an input file (testcase) and convert it into a DragonGame map

3. A policy visualiser

4. A simulator script to evaluate the performance of your solution

5. Testcases to test and evaluate your solution

6. A solution file template

The support code can be found at: https://gitlab.com/3702-2021/a3-support. See the README.md for more details. Autograding of code will be done through Gradescope, so that you can test your submission and continue to improve it based on this feedback — you are strongly encouraged to make use of this feedback.


Your assignment task

Your task is to develop two reinforcement learning algorithms for computing paths (series of actions) for the agent (i.e. the Dragon), and to write a report on your algorithms’ performance. You will be graded on both your submitted program (Part 1, 60%) and the report (Part 2, 40%). These percentages will be scaled to the 20% course weighting for this assessment item.

The provided support code provides a generative DragonGame environment, and your task is to submit code implementing both of the following Reinforcement Learning algorithms:

1. Q-learning

2. SARSA

There isn’t an explicit requirement to use a particular learning type for each testcase, but the testcases are designed to make using a specific type advantageous for that testcase. To achieve separation between Q-learning and SARSA results, the total reward received during training is tracked in addition to the reward received during evaluation, with separate reward targets specified for each in the testcases.
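As a point of reference, the two update rules can be sketched for a tabular Q-function as follows. This is a minimal illustration of the standard textbook updates, not the support code's interface: the function names, the dictionary-based Q-table, and the toy states and actions are all assumptions made for this example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha, gamma, done=False):
    """Off-policy update: bootstrap from the best action available in next_state."""
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha, gamma, done=False):
    """On-policy update: bootstrap from the action the agent will actually take next."""
    next_value = 0.0 if done else q[(next_state, next_action)]
    target = reward + gamma * next_value
    q[(state, action)] += alpha * (target - q[(state, action)])


# Toy usage with made-up states and actions (hypothetical, for illustration only).
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "left")] = 1.0
q[("s1", "right")] = 3.0
q_learning_update(q, "s0", "right", -1.0, "s1", actions, alpha=0.5, gamma=0.9)
# Q-learning target: -1 + 0.9 * 3.0 = 1.7, so q[("s0", "right")] moves from 0 to 0.85.
```

The only difference between the two functions is the value used for the next state: Q-learning takes the maximum over actions, while SARSA uses the action that was actually selected, so exploratory moves feed back into SARSA's estimates.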

Once you have implemented and tested the algorithms above, you are to complete the questions listed in the section “Part 2 - The Report” and submit the report to Gradescope.

More detail on what is required for the programming and report parts is given below.


Part 1 — The programming task

Your program will be graded using the Gradescope autograder, using testcases similar to those in the support code provided at https://gitlab.com/3702-2021/a3-support.

Interaction with the testcases and autograder

We now provide you with some details explaining how your code will interact with the testcases and the autograder (with special thanks to Nick Collins for his efforts making this work seamlessly, yet again!).

First, note that the Assignment 3 version of the class GameEnv (in game_env.py) differs from previous assignments in that the transition and reward functions are now randomised and unknown to the agent. The action outcome probabilities (for the glide, supercharge and superjump actions, and the ladder fall probability) and costs/penalties (action_cost, collision_penalty, game_over_penalty) are randomised within some fixed range based on the seed of the filename, and are all stored in private variables. Your agent does not know these values, and therefore must interact with the environment to determine the optimal policy.

Implement your solution using the supplied solution.py template file. You are required to fill in the following method stubs:

__init__()

run_training()

select_action()

You may add to the __init__ method if required, and can add additional helper methods and classes (either in solution.py or in files you create) if you wish. To ensure your code is handled correctly by the autograder, you should avoid using any try-except blocks in your implementation of the above methods (as this can interfere with our time-out handling). Also, unlike in the previous assignments, the autograder no longer allows you to upload your own copy of game_env.py.

Refer to the documentation in solution.py for more details.
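To illustrate how the three stubs might fit together, here is a rough sketch of a tabular Q-learning agent with epsilon-greedy exploration. Everything apart from the three method names is an assumption: ToyChainEnv is a hypothetical stand-in, not the real GameEnv, and the class name SketchAgent, the reset()/step() interface, the constructor parameters and the hyperparameter values are invented for this example. The actual signatures are documented in solution.py.

```python
import random
from collections import defaultdict


class ToyChainEnv:
    """Hypothetical stand-in environment (NOT the real GameEnv): a 5-cell corridor."""
    ACTIONS = ["left", "right"]

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.goal = 4

    def reset(self):
        return 0

    def step(self, state, action):
        # The intended move succeeds with probability 0.8; otherwise the agent stays put.
        move = 1 if action == "right" else -1
        if self.rng.random() < 0.8:
            state = min(max(state + move, 0), self.goal)
        done = state == self.goal
        reward = 10.0 if done else -1.0
        return state, reward, done


class SketchAgent:
    def __init__(self, env, alpha=0.1, gamma=0.95, epsilon=0.2, seed=0):
        self.env = env
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # Q-table keyed by (state, action)

    def _greedy(self, state):
        return max(self.env.ACTIONS, key=lambda a: self.q[(state, a)])

    def run_training(self, episodes=500, max_steps=50):
        for _ in range(episodes):
            state = self.env.reset()
            for _ in range(max_steps):
                # Epsilon-greedy exploration: occasionally try a random action.
                if self.rng.random() < self.epsilon:
                    action = self.rng.choice(self.env.ACTIONS)
                else:
                    action = self._greedy(state)
                next_state, reward, done = self.env.step(state, action)
                # Q-learning target: bootstrap from the greedy next action.
                best_next = 0.0 if done else self.q[(next_state, self._greedy(next_state))]
                target = reward + self.gamma * best_next
                self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
                state = next_state
                if done:
                    break

    def select_action(self, state):
        # After training, act greedily with respect to the learned Q-values.
        return self._greedy(state)


agent = SketchAgent(ToyChainEnv(seed=1))
agent.run_training()
```

In this sketch, run_training() only ever learns from sampled transitions, which mirrors the requirement that the agent discovers the hidden probabilities and costs by interaction. A real solution would also need to respect the testcase time limits, which this example ignores.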

Grading rubric for the programming component (total marks: 60/100)

For marking, we will use five different testcases of ascending level of difficulty to evaluate your solution.

There will be a total of 60 code marks, consisting of:

20 Threshold Marks

– Program runs without errors (+5 marks)

– Program approximately solves at least 1 testcase within 2× the time limit (+7.5 marks)

– Program approximately solves at least 2 testcases within 2× the time limit (+7.5 marks)

40 Testcase Marks

– 5 testcases worth 8 marks each

– A maximum of 8 marks for each testcase, with deductions proportional to the amount by which the time limit is exceeded or the training and evaluation reward targets are missed

– The code used to compute your score is in simulator.py

The program will be terminated after 2× the time limit has elapsed.


Part 2 — The report

The report tests your understanding of Reinforcement Learning and the methods you have used in your code, and contributes 40/100 of your assignment mark.


Question 1. Q-learning is closely related to the Value Iteration algorithm for Markov decision processes.

a) (5 marks) Describe two key similarities between Q-learning and Value Iteration.

b) (5 marks) Give one key difference between Q-learning and Value Iteration.

For Questions 2, 3 and 4, consider testcase a3-t5.txt, and compare Q-learning and SARSA.


Question 2.

a) (5 marks) With reference to Q-learning and SARSA, explain the difference between off-policy and on-policy reinforcement learning algorithms.

b) (5 marks) How does the difference between off-policy and on-policy algorithms affect the way in which Q-learning and SARSA solve testcase a3-t5.txt? Give an example of an expected difference between the way the algorithms learn a policy.

For questions 3 and 4, you are asked to plot the solution quality at each episode, as given by the 50-step moving average reward received by your learning agent. At time step t, the 50-step moving average reward is the average reward earned by your learning agent in the episodes [t − 50, t], including episode restarts. If the Q-values imply a poor quality policy, this value will be low. If the Q-values correspond to a high-value policy, the 50-step moving average reward will be higher. We are using a moving average here because the reward is received only occasionally and there are sources of randomness in the transitions and the exploration strategy.
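One possible way to compute such a trailing average from a list of recorded rewards is sketched below. The function name moving_average, the choice to average over whatever is available during the first 50 entries, and the placeholder data are assumptions made for illustration; how and where you record rewards in your own agent is up to you.

```python
def moving_average(rewards, window=50):
    """Return the trailing mean of `rewards`; early entries average what is available."""
    averages = []
    running_sum = 0.0
    for i, r in enumerate(rewards):
        running_sum += r
        if i >= window:
            # Drop the entry that has just fallen out of the window.
            running_sum -= rewards[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages


# Placeholder data only, not real training results.
episode_rewards = [float(i) for i in range(100)]
smoothed = moving_average(episode_rewards, window=50)
```

The returned list has the same length as the input, so it can be plotted directly against the episode index.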


Question 3.

a) (5 marks) Plot the quality of the policy learned by Q-learning in testcase a3-t5.txt against episode number for three different fixed values of the learning_rate (which is called α in the lecture notes and in many texts and online tutorials), as given by the 50-step moving average reward (i.e. for this question, do not adjust α over time, rather keep it the same value throughout the learning process). Your plot should display the solution quality up to an episode count where the performance stabilises, with a minimum of 2000 episodes (note the policy quality may still be noisy, but the algorithm’s performance will stop increasing and its average quality will level out).

b) (5 marks) With reference to your plot, comment on the effect of varying the learning_rate.


Question 4.

a) (5 marks) Plot the quality of the learned policy against episode number under Q-learning and SARSA in testcase a3-t5.txt, as given by the 50-step moving average reward. Your plot should display the solution quality up to an episode count where the performance of both algorithms stabilises, with a minimum of 2000 episodes.

b) (5 marks) With reference to your plot, compare the learning trajectory of the two algorithms, and their final solution quality. Discuss the way the solution quality of Q-learning and SARSA changes as they learn to solve the testcase, both as they learn and once they have stabilised.


Academic Misconduct

The University defines Academic Misconduct as involving “a range of unethical behaviours that are designed to give a student an unfair and unearned advantage over their peers.” UQ takes Academic Misconduct very seriously and any suspected cases will be investigated through the University’s standard policy (https://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct). If you are found guilty, you may be expelled from the University with no award.

It is your responsibility to ensure that you understand what constitutes Academic Misconduct and that you do not break the rules. If you are unclear about what is required, please ask.

It is also your responsibility to take reasonable precautions to guard against unauthorised access by others to your work, however it is stored and in whatever format, both before and after assessment.

In the coding part of COMP3702 assignments, you are allowed to draw on publicly-accessible resources and provided tutorial solutions, but you must reference or attribute their source, by doing the following:

All blocks of code that you take from public sources must be referenced in adjacent comments in your code.

Please also include a list of references indicating code you have drawn on in your solution.py docstring.

However, you must not show your code to, or share your code with, any other student under any circumstances. You must not post your code to public discussion forums (including Ed Discussion) or save your code in publicly accessible repositories (check your security settings). You must not look at or copy code from any other student.

All submitted files (code and report) will be subject to electronic plagiarism detection and misconduct proceedings will be instituted against students where plagiarism or collusion is suspected. The electronic plagiarism detection can detect similarities in code structure even if comments, variable names, formatting etc. are modified. If you collude to develop your code or answer your report questions, you will be caught.

For more information, please consult the following University web pages:

Information regarding Academic Integrity and Misconduct:

https://my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct

http://ppl.app.uq.edu.au/content/3.60.04-student-integrity-and-misconduct

Information on Student Services:

https://www.uq.edu.au/student-services/


Late submission

Students should not leave assignment preparation until the last minute and must plan their workloads to meet advertised or notified deadlines. It is your responsibility to manage your time effectively.

Late submission of the assignment will not be accepted. Assessment items received after the due date will receive a mark of zero unless you have been approved to submit after the due date.

In the event of exceptional circumstances, you may submit a request for an extension. You can find guidelines on acceptable reasons for an extension here: https://my.uq.edu.au/information-and-services/manage-my-program/exams-and-assessment/applying-extension. All requests for extension must be submitted on the UQ Application for Extension of Progressive Assessment form at least 48 hours prior to the submission deadline.