CPS824/CP8319: Reinforcement Learning

Course Management Form

Instructor:	Mikhail Soutchanski
Email:	mes (at) cs (dot) torontomu (dot) ca (write RL in Subject of your email)
Web page:	www.cs.torontomu.ca/~mes/courses/cps824
Office:	The Centre for Computing and Engineering, ENG275
Office Hours:	Wednesday, 11am-noon (by appointment only); Tuesday, 14:00-14:30 (every 2nd week starting from Jan 14th)
TA:	Reggie McLean (email: reginald.mclean (at) torontomu.ca)
Lectures:

Section	Status	Day	Start Time	End Time	Room
All	Available	Tuesday	15:10	17:00	EPH-142
All	Available	Wednesday	10:10	11:00	VIC-203

Course Description

This course will provide a comprehensive introduction to reinforcement learning, a powerful approach to learning from interaction to achieve goals in stochastic and deterministic environments. Reinforcement learning has adapted key ideas from machine learning, operations research, control theory, psychology, and neuroscience to produce some strikingly successful engineering applications. The focus is on algorithms that learn what actions to take, and when to take them, so as to optimize long-term performance. This may involve sacrificing immediate reward to obtain greater reward in the long term or just to obtain more information about the environment. The course will cover Markov decision processes, dynamic programming, temporal-difference learning, Monte Carlo reinforcement learning methods, function approximation methods, and the integration of learning and planning. The course covers some of the key approaches underlying the success of the modern computer programs that can defeat human professional players in the game of Go and other classic games. A number of applications of reinforcement learning will be discussed as well. The focus is mostly on cases characterized by discrete finite probability distributions, so the course requires only a minimal background in probability theory, which is briefly reviewed at the beginning of the course.
Prerequisites: The course requires the ability to write computer programs in a modern programming language such as C/C++, Java or Python, the basics of data structures (CPS305 or equivalent), and basic probability theory (MTH380 or equivalent). Do not enroll in this course if you cannot write computer programs.
Compulsory Textbook: Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2nd edition, 2018. The students are expected to read sections and chapters from this textbook each week. You might wish to browse the older 1st edition (1998). (Clicking on the link will take you to Professor Richard Sutton's personal Web page.) The Second Edition is also published by the MIT Press, Nov 2018, ISBN 9780262039246. The printed copy is available from Indigo or Amazon.ca for about $130 (CA). However, you can instead use the online edition (linked above) of this book for free.
Extra References (not required)
1. Hector Geffner, Blai Bonet, "A Concise Introduction to Models and Methods for Automated Planning", Chapter 6. Morgan and Claypool Publishers, 2013. Synthesis Lectures on Artificial Intelligence and Machine Learning, Vol. 7, No. 2. June 2013. Available online from the TMU Library.
2. Dimitri Bertsekas, "Reinforcement Learning and Optimal Control". Athena Scientific, 1st edition, 2019.
3. Dimitri Bertsekas, "Dynamic Programming and Optimal Control". Athena Scientific, 4th edition, 2012, volume 2, Chapter 6 Approximate Dynamic Programming, a draft from November 11, 2011. (This book is more advanced than what is required in this course. It is optional reading for graduate students.)
Evaluation: 4 assignments (10% each), worth a total of 40% of the final grade. Midterm: 20%. Final exam: 40%. Graduate students may be asked to do additional work on assignments and tests. In particular, graduate students will be asked to complete a small project as part of their 4th assignment. Undergraduate students can earn bonus marks for doing extra work. To complete the 4th assignment, the students may be asked to prepare slides for a 20-30 minute talk on a topic related to the course, and present their talk in class.
Brief Description: This course focuses on topics related to reinforcement learning. The course will cover the n-armed bandit problem, making multiple-stage decisions under uncertainty, Markov decision processes, dynamic programming, Monte Carlo reinforcement learning methods, temporal-difference learning including Q-learning (off-policy control) and SARSA (on-policy control), eligibility traces, function approximation methods, and the integration of learning and planning including the Dyna architecture, prioritized sweeping, real-time dynamic programming and heuristic search.

Course Policies

To pass the course the following is required:
- At least 50% must be achieved on the theoretical component (the weighted total of the midterm test and final exam marks)
- At least 50% must be achieved on the remaining practical component (the weighted total of the homework assignments and in-class presentation)
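The two requirements above can be checked with a few lines of arithmetic. The following is only a minimal sketch under stated assumptions, not an official grade calculator: it assumes that "weighted total" means the weight-normalized average of the marks in each component, that the in-class presentation is counted inside Assignment 4, and that bonus marks, participation credit and late penalties are ignored. The names WEIGHTS, component_percent and meets_pass_requirements are hypothetical.

```python
# Weights (percent of the final grade) taken from the Evaluation section.
WEIGHTS = {
    "a1": 10, "a2": 10, "a3": 10, "a4": 10,  # four assignments
    "midterm": 20,
    "final": 40,
}
THEORY = ("midterm", "final")          # theoretical component
PRACTICAL = ("a1", "a2", "a3", "a4")   # practical component (assumed to include the presentation)


def component_percent(marks, items):
    """Weight-normalized average (0-100) of the given items.

    `marks` maps an item name to the mark earned on it, in percent (0-100).
    """
    earned = sum(marks[name] * WEIGHTS[name] for name in items)
    possible = sum(WEIGHTS[name] for name in items)
    return earned / possible


def meets_pass_requirements(marks):
    """True only if BOTH components reach at least 50% (assumed interpretation)."""
    return (component_percent(marks, THEORY) >= 50
            and component_percent(marks, PRACTICAL) >= 50)


if __name__ == "__main__":
    example = {"a1": 80, "a2": 70, "a3": 30, "a4": 10, "midterm": 40, "final": 60}
    print(round(component_percent(example, THEORY), 1))     # theoretical component
    print(round(component_percent(example, PRACTICAL), 1))  # practical component
    print(meets_pass_requirements(example))
```

In this hypothetical example the theoretical component is above 50% but the practical component is not, so both conditions are not met even though the overall weighted total exceeds 50%.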
The students are strongly encouraged to take notes in class, and study their notes after class. Learning can be a gradual process that requires time and effort. The students benefit from attending lectures since some important details will be discussed only there. For this reason, attending lectures is mandatory. Some of the announcements and clarifications mentioned in class will not be communicated by any other means. If you miss a class, it is your responsibility to find out what was announced in it.
All course materials posted on D2L and presented in class are copyrighted and protected by law. You cannot share them with anyone. You cannot repost them anywhere on the Web. Please review the Policy about copyrights. Moreover, you cannot post on the Web any of your solutions to homework assignments, since doing this would violate the TMU policies. You can read parts of this policy online related to "Academic misconduct".
Policy for in-person content delivery: The students are expected to pay attention to the lecture and volunteer to answer the instructor's questions during class time. The students might be asked to participate in unannounced polls or quizzes. Turn off your mobile phones and all other electronic devices in class. You can keep your laptop or tablet open only if you use it to take notes in class.
Examinations: The midterm test and the final exam may include short essay and yes/no questions, as well as problem solving (but not programming questions). The duration of these examinations will be 1 hour 30 minutes and 2 hours 30 minutes, respectively. There will be no supplemental examinations. The final exam will be cumulative and will include all the material covered throughout the term. Grades are earned for the demonstration of knowledge.
If you miss a midterm test or a final exam for medical reasons, you have to read Policy 167 Academic Consideration and submit a copy of a completed official Health Certificate to the designated contact person or to the department of Computer Science within 3 working days. Once the student's submitted health documentation has been verified, the instructor will be notified of the verification. Similarly, all documentation related to special accommodation or academic consideration should be submitted online to the designated contact person within the specified time limits.

Assignments should be submitted on or before the deadline specified in the assignment (you are encouraged to submit assignments earlier). Your assignment is considered late if any part of the assignment is late (even if it is just 1 minute late). The penalty for a late assignment is 10% off. No assignments will be accepted if more than 24 hours late. Start solving your assignment on the same day it is posted. Do not procrastinate. There are no make-up assignments.
From time to time, I will hand out exercises. The students are expected to solve the exercises, but they will not be graded. However, working on exercises will improve your understanding of this course (and will help you to get better marks on tests).
Up to 4% extra credit may be assigned for active class participation throughout the term, e.g., a student attends classes and takes notes of the lectures, participates actively by asking/answering questions, solves exercises in class. Class participation marks are earned for active course participation and given at the discretion of the course instructor; they cannot be requested by the students. Unexplained lack of attendance can negatively affect one's grade.
Handouts and assignments will be made available on the Web only. You are responsible for visiting the course Web pages regularly and reading the information related to assignments and tests that is provided on or linked from these Web pages. In particular, Frequently Asked Questions (FAQs) related to homework may be linked from there. These FAQs are considered to be an integral part of the assignment. Before sending your questions by e-mail to the instructor, check whether similar questions have already been answered on these Web pages.
Email communication: you can send email from TMU email addresses only: you can use either your departmental account (preferred) or your university account to send email. Email sent from Google, Bell, Rogers and any other external email providers can be filtered out as spam and might not reach the instructor. Email messages will normally be answered within 24 hours. However, messages sent on the weekend (starting from Friday afternoon) will usually be answered on Monday.
Grades for assignments and tests will normally be posted on the D2L Web site no later than two weeks after the due date (exam date). Marking guides, the assignments and some other course related documents will be posted on D2L only. Feedback will usually be provided to students within two weeks. The students can contact the TA who was responsible for marking, if they have questions about marking, or attend the office hour.

Policy on collaboration in homework assignments
Collaboration in discussing general approaches to problems is allowed only with students in your team. No collaboration is allowed between teams. You may discuss assignments only with other people currently taking the course. However, you should never put your name on anything you do not understand. If challenged, you must be able to reproduce and explain all solutions by yourself, or solve similar exercises. If you cannot explain a solution that you handed in, or if you cannot solve an exercise similar to questions in your homework or in your quiz, this will negatively affect your grade. In particular, you might be asked to solve extra exercises during the office hours, during one of the labs, or in class (as a quiz). These unscheduled tests or evaluations can be given at any time without prior notice. Remember that if you work with partners, you are still expected to know the solutions to all exercises from the homework. Grades are earned for the demonstration of knowledge. In cases when a student fails to demonstrate knowledge about a homework, the grade for the homework can be decreased to 0. The first page of your homework should include the names of all students with whom you discussed any homework problems (even briefly). Otherwise, it is assumed that you did not discuss the homework with anyone except the instructor. Copied work (both original and copies) will be graded as 0. Involvement with plagiarism will be penalized in accordance with Academic Policy 60. Additional penalty for copied work may be assigned as deterrence against plagiarism. More specifically, the additional penalty for a copied assignment (in part or in whole) can be up to 5% of the final course grade.

Contract Cheating Statement
In regard to any and all assessments in this course, the use of Chegg, or any other similar help site/service/tool will be pursued as "contract cheating".

The use of ChatGPT, Copilot, Gemini and similar generative Large Language Models (LLMs) for the purpose of solving homework problems will be pursued as "a breach of Policy 60: Academic Integrity" if the student accessed them before submitting course work and the work is presented as one's own original work without appropriate referencing. Generative LLM tools may only be used for comparison with your own course work that you have already submitted, but not for the creation of submitted work.

In regard to any and all assessments in this course, the use of any third party (e.g., family member, freelancer, room-mate, friend, tutor) to complete work on your behalf will be pursued as "contract cheating" under Policy 60 "Academic Integrity".

Policy 60 Penalty Guidelines for contract cheating (e.g., viewing a solution on Chegg or Discord) that only impacts you: F in course.

Policy 60 Penalty Guidelines for contract cheating that facilitates cheating for others (e.g., posting a question to Chegg): Disciplinary Suspension.

ACADEMIC MISCONDUCT
Committing academic misconduct, such as plagiarism and cheating, will trigger academic penalties including failing grades, suspension and possibly expulsion from the University. As a TMU student, you are responsible for familiarizing yourself with the Student Code of Academic Conduct.

ACADEMIC CONDUCT
The students are expected to pay attention to the lecture and volunteer to answer the instructor's questions during class time. In the case of in-person classes, in order to create an environment conducive to learning and respectful of others' rights, phones and pagers must be silenced during lectures and evaluations. Students should refrain from disrupting the lectures by arriving late and/or leaving before the lecture is finished.

Policy on Non-Academic Conduct: No disruption of instructional activities is allowed. Among many other infractions, the Code specifically refers to the following as a violation: "Disruption of Learning and Teaching - Students shall not behave in disruptive ways that obstruct the learning and teaching environment." In particular, the students can use laptops (and similar electronic devices) in class only for taking notes. In difficult cases, penalties can be imposed by the Student Conduct Officer. You can read the TMU Senate Policy 61 for details.

Remarking Policy

Grades are earned for the demonstration of knowledge.
Read carefully the marking guide for the assignment or test you'd like to be remarked. Your grade may go up, down, or remain the same.
Fill in this remarking form (available online).
Email the form and your assignment/test to the TA who marked your homework.
If you are not satisfied with the TA's remarking, you can appeal to the instructor.
You may not submit a remarking request later than ONE WEEK from the date on which the assignments/tests were returned in class. It's your responsibility to pick up your work ASAP.
Your mark can decrease if the TA sees something that was incorrectly awarded too high a mark.

Tentative Course Calendar (all changes of dates will be announced)

Course Work	Due Date	Grade Value (%)
Assignment 1	February 4	10
Assignment 2	February 18	10
Midterm	Tuesday, February 25, in-class	20
Assignment 3	March 18	10
Assignment 4	March 25	10
Final Exam	April 22, KHE-119, from 3pm	40
Total		100