• Course References, Links and Random Notes

    Title: Probabilistic Reasoning and Reinforcement Learning
    Info: ECE 493 Topic 42 - Technical Electives
    Instructor: Prof. Mark Crowley, ECE Department, UWaterloo

    Website: markcrowley.ca/rlcourse

  • Course Resources

  • Topics

    Primary Textbook: Reinforcement Learning: An Introduction, by Richard S. Sutton and Andrew G. Barto, 2018 [SB]

    Some topics are not covered in the SB textbook or they are covered in much more detail than the lectures. We will continue to update this list with references as the term progresses.

    1. Motivation & Context [SB 1.1, 1.2, 17.6]
    2. Decision Making Under Uncertainty [SB 2.1-2.3, 2.7, 3.1-3.3]
    3. Solving MDPs [SB 3.5, 3.6, 4.1-4.4]
    4. The RL Problem [SB 3.7, 6.4, 6.5]
    5. TD Learning [SB 12.1, 12.2]
    6. Policy Search [SB 13.1, 13.2, 13.5]
    7. State Representation & Value Function Approximation
    8. Basics of Neural Networks
    9. Deep RL
    10. POMDPs, MARL (skipped in 2020)
    11. MCTS, AlphaGo (mentioned briefly in 2020)
  • Primary References for Course

  • Additional Resources

  • Videos to Watch on RL (Current Research)

  • Old Topics Archive

    Other resources connected with previous versions of the course; I’m happy to talk about any of these if people are interested.

  • Course Description :

    Introduction to Reinforcement Learning (RL) theory and algorithms for learning decision-making policies in situations with uncertainty and limited information. Topics include Markov decision processes, classic exact/approximate RL algorithms such as value/policy iteration, Q-learning, State-action-reward-state-action (SARSA), Temporal Difference (TD) methods, policy gradients, actor-critic, and Deep RL such as Deep Q-Learning (DQN), Asynchronous Advantage Actor Critic (A3C), and Deep Deterministic Policy Gradient (DDPG). [Offered: S, first offered Spring 2019]

  • Week 1 - Course Introduction

  • Topic 1 - Basics of Probability

  • Topic 2.1 - Basic Decision Making Models - Multiarmed Bandits

    Week 2
    Textbook Sections: [SB 1.1, 1.2, 17.6]
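    As a concrete illustration of the bandit setting (a toy example of my own, not from the course materials), an epsilon-greedy agent with incremental sample-average value estimates [SB 2.1-2.3] can be sketched as:

```python
import random

def epsilon_greedy_bandit(pull, n_arms, steps, eps=0.1, seed=0):
    """Epsilon-greedy with incremental sample-average value estimates."""
    rng = random.Random(seed)
    q = [0.0] * n_arms  # estimated value of each arm
    n = [0] * n_arms    # number of pulls per arm
    for _ in range(steps):
        if rng.random() < eps:
            a = rng.randrange(n_arms)                   # explore
        else:
            a = max(range(n_arms), key=lambda i: q[i])  # exploit
        r = pull(a)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]  # incremental mean: Q <- Q + (R - Q)/N
    return q

# Hypothetical 3-armed Bernoulli bandit: arm i pays 1 with probability (i + 1) / 4.
bandit_rng = random.Random(42)
pull = lambda a: 1.0 if bandit_rng.random() < (a + 1) / 4 else 0.0
q = epsilon_greedy_bandit(pull, n_arms=3, steps=5000)
```

    With enough steps the estimates settle near each arm's true payout probability, and the highest-paying arm dominates the greedy choice.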

  • Topic 3 - Markov Decision Processes

    Week 3

    Textbook Sections

    • Markov Decision Processes
      [SB 3.0-3.4]
    • Solving MDPs Exactly
      [SB 3.5, 3.6, 3.7]
  • Topic 4 - Dynamic Programming

    Week 4
    Former title: The Reinforcement Learning Problem
    Textbook Sections: [SB 4.1-4.4]

  • Topic 5 - Temporal Difference Learning - Part 1

    Week 5 (June 7-11)

    Textbook Sections: Selections from [SB chap 5], [SB 6.0 - 6.5]

    • Quick intro to Monte-Carlo methods
    • Temporal Difference Updating
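    The second bullet can be made concrete with the tabular TD(0) update, V(s) ← V(s) + α[r + γV(s′) − V(s)]; the two-state chain below is a made-up toy of mine, not from the textbook:

```python
from collections import defaultdict

def td0(episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    V = defaultdict(float)
    for episode in episodes:
        for s, r, s_next in episode:  # each step: state, reward, next state
            v_next = V[s_next] if s_next is not None else 0.0  # terminal value is 0
            V[s] += alpha * (r + gamma * v_next - V[s])        # move toward the TD target
    return dict(V)

# Toy chain A -> B -> terminal, reward 1 on the final step; both values approach 1.
episodes = [[("A", 0.0, "B"), ("B", 1.0, None)]] * 200
V = td0(episodes)
```

    Unlike a Monte-Carlo update, each state is updated from the bootstrapped estimate of its successor rather than from the full return.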
  • Topic 5.1 - TD Learning - Part 2

    Week 6 (June 14-17)

    • SARSA
    • Q-Learning
    • Expected SARSA
    • Double Q-Learning
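    To make the distinction among these algorithms concrete, here is a minimal tabular Q-learning sketch: its target bootstraps off the max over next actions (off-policy), where SARSA would instead use the action actually selected next. The two-state chain domain is my own invention, not course material:

```python
import random
from collections import defaultdict

def q_learning(step, actions, start, episodes=500, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start
        while s is not None:
            if rng.random() < eps:
                a = rng.choice(actions)                       # explore
            else:
                a = max(actions, key=lambda b: Q[(s, b)])     # exploit
            r, s2 = step(s, a)  # environment transition; s2 is None at termination
            best_next = 0.0 if s2 is None else max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # off-policy target
            s = s2
    return Q

# Hypothetical chain: "right" from s0 reaches s1; "right" from s1 terminates with reward 1.
def step(s, a):
    if a == "right":
        return (0.0, "s1") if s == "s0" else (1.0, None)
    return (0.0, s)  # "stay" just loops in place

Q = q_learning(step, ["stay", "right"], "s0")
```

    After training, Q(s1, right) should be near 1 and Q(s0, right) near γ, so the greedy policy walks right.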
  • Part 1 Review

    Week 7 (June 21 - 25)
    Go over any questions or open topics from first 6 weeks.


  • MIDTERM Exam

    Week 7
    Questions on Midterm (June 23-25) can be on any topics up to this point, Weeks 1-6 inclusive.


  • (!SKIP!) Topic 5.2 - N-Step TD and Eligibility Traces

    optional topic
    Textbook Sections: [SB 12.1, 12.2]

    Note: Given the pace that people are watching videos, we will drop this topic. It is less essential in the Deep RL era although very interesting theoretically. Calendar will be updated accordingly.

  • Topic 6 - State Representation & Value Function Approximation

    Week 8 (June 28 - July 2)

  • Topic 7 - Direct Policy Search

    Week 9 (July 5 - 9)
    [SB 13.1, 13.2, 13.5]

    • Policy Gradients
    • Actor-Critic
  • Topic 8 - Basics of Neural Networks and Deep RL as DQN

    Week 10
    Note: If Topic 5.2 is dropped this will be a week earlier.

  • Topic 9 - Using DQN to defeat Atari and Go (MCTS+DQN=AlphaGo)

    Week 11
    Note: If Topic 5.2 is dropped this will be a week earlier.

  • Topic 11 - Deep RL Beyond DQN

    Week 12

    • DDPG
    • A2C
    • PPO
  • Review

    Week 13

  • Bayes Nets (dropped)

  • Conjugate Priors (dropped)

  • Primary References for Probabilistic Reasoning (mostly dropped)

    • ECE 657A Youtube Videos

      Introductory topics on this from my graduate course ECE 657A are available on youtube and mostly applicable to this course as well.

    • ECE 108 YouTube Videos

      For a very fundamental view of probability from another course of Prof. Crowley, you can view the lectures and tutorials for ECE 108.

      ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists

      The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

    • Likelihood, Loss and Risk

      A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, and MAP are all connected.
      https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/

    • Probability Intro Markdown Notes

      From the course website for a previous year. Some of this we won’t need much, but it is all useful to know for Machine Learning methods in general.

      https://compthinking.github.io/RLCourseNotes/

      • Basic probability definitions
      • conditional probability
      • Expectation
      • Inference in Graphical Models
      • Variational Inference
    • Videos

    • Live Lecture/Discussion June 14 4pm

      This will be given as a Live Lecture on June 14, 2021 during the 4pm-5:30pm ET Live Session.

      According to my youtube analytics, very few people have watched the first two lectures on Temporal Difference Learning or Monte Carlo. But there were a fair number looking at SARSA and QLearning (probably because they are the most famous, fair enough).


      [Plot: views per video]
      This plot shows views per video; I removed the even higher-viewed video from the first three weeks.
      The first bar is “Dynamic Programming 1” with 76 views (as of June 11, 2021 5:27pm ET).


      I was planning to record a new video (that isn’t on youtube yet) on the following during the live session on Monday:

      • Expected SARSA and Double Q-Learning - these are modifications of those important algorithms that have their own benefits, so there will be time to review the essentials of SARSA/QL at the same time. If few people attend or there is no other discussion, I will do that.

      But… if lots of people show up, we could also:

      • go over something else from earlier in the course they didn’t quite understand
      • or something that they didn’t get a chance to watch yet

      So let me know here what topic you would want to go over, or redo live:

      • Maybe it’s all the essentials from TD1 and TD2 that you really need for SARSA and QLearning.
      • Or maybe it’s essentials from some of these earlier topics that people seem to have skipped, like Dynamic Programming or Monte-Carlo methods.

      I’ll check this post on Sunday/Monday and see which option it will be.

    • VFA Concept

      A Value Function Approximation (VFA) is a necessary technique whenever the state or action spaces become too large to represent the value function explicitly as a table. In practice, any realistic problem needs a VFA.

    • Video:

      • Lecture on Value Function Approximation approaches - https://youtu.be/7Dg6KiI_0eM

    • Other Resources:

      • How to use a shallow, linear approximation for Atari: https://www.amii.ca/the-success-of-dqn-explained-by-shallow-reinforcement-learning/ - this post explains a paper showing how to achieve the same performance as the Deep RL DQN method for Atari using carefully constructed linear value function approximation.

    • Benefits of VFA

      • Reduce memory need to store the functions (transition, reward, value etc)
      • Reduce computation to look up values
      • Reduce experience needed to find the optimal value or policy (sample efficiency)
      • For continuous state spaces, a coarse coding or tile coding can be effective
    • Types of Function Approximators

      • Linear function approximations (linear combination of features)
      • Neural Networks
      • Decision Trees
      • Nearest Neighbors
      • Fourier/ wavelet bases
    • Finding an Optimal Value Function

      When using a VFA, you can use a Stochastic Gradient Descent (SGD) method to search for the best weights for your value function according to experience.
      This parametric form of the value function is then used to obtain a greedy or epsilon-greedy policy at run-time.

      This is why using a VFA + SGD is still different from a Direct Policy Search approach where you optimize the parameters of the policy directly.
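      A minimal sketch of that VFA + SGD recipe, assuming a linear value function with one-hot features on a small random walk (both the domain and feature map are stand-ins I chose, not course material):

```python
import random

def semi_gradient_td0(n_states=5, episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Linear VFA v_hat(s) = w . x(s), trained with semi-gradient TD(0)."""
    rng = random.Random(seed)

    def x(s):  # one-hot features; real VFAs use features that generalize across states
        f = [0.0] * n_states
        f[s] = 1.0
        return f

    def v_hat(w, s):
        return sum(wi * xi for wi, xi in zip(w, x(s)))

    w = [0.0] * n_states
    for _ in range(episodes):
        s = n_states // 2                       # start each walk in the middle
        while True:
            s2 = s + rng.choice([-1, 1])        # uniform random walk left/right
            done = s2 < 0 or s2 >= n_states
            r = 1.0 if s2 >= n_states else 0.0  # reward 1 only for exiting on the right
            target = r + (0.0 if done else gamma * v_hat(w, s2))
            delta = target - v_hat(w, s)        # TD error
            w = [wi + alpha * delta * xi for wi, xi in zip(w, x(s))]  # SGD step on w
            if done:
                break
            s = s2
    return w

w = semi_gradient_td0()
```

      For this walk the true values are V(i) = (i+1)/6, so the learned weights should land near [0.17, 0.33, 0.50, 0.67, 0.83]; the greedy policy at run-time would then be derived from these estimates, which is what distinguishes this from searching policy parameters directly.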

        {"cards":[{"_id":"5fe4d175553d829d700002aa","treeId":"5fe4d3bb553d829d700002a7","seq":22552050,"position":1.109375,"parentId":null,"content":"# Course References, Links and Random Notes \n**Title:** Probabilistic Reasoning and Reinforcement Learning\n**Info:** ECE 493 Topic 42 - Technical Electives\n**Instructor:** [Prof. Mark Crowley](https://uwaterloo.ca/scholar/mcrowley), [ECE Department](https://uwaterloo.ca/electrical-computer-engineering/), [UWaterloo](https://uwaterloo.ca/)\n\n**Website:** [markcrowley.ca/rlcourse](https://markcrowley.ca/rlcourse/)"},{"_id":"3e7df62d427d5a52c2000341","treeId":"5fe4d3bb553d829d700002a7","seq":22551653,"position":1,"parentId":"5fe4d175553d829d700002aa","content":"## Course Description :\nIntroduction to Reinforcement Learning (RL) theory and algorithms for learning decision-making policies in situations with uncertainty and limited information. Topics include Markov decision processes, classic exact/approximate RL algorithms such as value/policy iteration, Q-learning, State-action-reward-state-action (SARSA), Temporal Difference (TD) methods, policy gradients, actor-critic, and Deep RL such as Deep Q-Learning (DQN), Asynchronous Advantage Actor Critic (A3C), and Deep Deterministic Policy Gradient (DDPG). [Offered: S, first offered Spring 2019]\n\n"},{"_id":"5fadc422b1ba66c81c000052","treeId":"5fe4d3bb553d829d700002a7","seq":22552046,"position":1.21875,"parentId":null,"content":"# Course Resources\n- [Course Website](https://markcrowley.ca/rlcourse/) : contains course outline, grade breakdown, weekly schedule information\n- Notes and slides via the Textbook *(available free online)*:\n - [Reinforcement Learning: An Introduction\nSmall](http://incompleteideas.net/book/the-book-2nd.html) : Richard S. Sutton and Andrew G. 
Barto\nSutton Textbook, 2018\n- Course Youtube Channel : [Reinforcement Learning](https://www.youtube.com/channel/UC6p1AJ7jKNFp6OB2MmAoWvA/featured)\n- See [Additional Resources](#resources) for more online notes and reading.\n"},{"_id":"5fae8ea6b1ba66c81c00004f","treeId":"5fe4d3bb553d829d700002a7","seq":22624061,"position":1.328125,"parentId":null,"content":"# Topics\n\nPrimary Textbook : [Reinforcement Learning: An Introduction\nSmall](http://incompleteideas.net/book/the-book-2nd.html) : Richard S. Sutton and Andrew G. Barto, 2018 [SB]\n\nSome topics are not covered in the SB textbook or they are covered in much more detail than the lectures. We will continue to update this list with references as the term progresses.\n\n1. Motivation & Context [SB 1.1, 1.2, 17.6]\n2. Decision Making Under Uncertainty [SB 2.1-2.3, 2.7, 3.1-3.3]\n3. Solving MDPs [SB 3.5, 3.6, 4.1-4.4]\n4. The RL Problem [SB 3.7, 6.4, 6.5]\n5. TD Learning [SB 12.1, 12.2]\n6. Policy Search [SB 13.1, 13.2, 13.5]\n7. State Representation & Value Function Approximation\n8. Basics of Neural Networks\n9. Deep RL\n10. POMDPs, MARL (skipped in 2020)\n11. MCTS, AlphaGo (mentioned briefly in 2020)\n"},{"_id":"4c3ed7a7fbcdd860c6000094","treeId":"5fe4d3bb553d829d700002a7","seq":20701599,"position":0.125,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Week 1 - Course Introduction"},{"_id":"4c92fec809b26e012800008d","treeId":"5fe4d3bb553d829d700002a7","seq":20902911,"position":0.25,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 1 - Basics of Probability\n"},{"_id":"3bd0358cf7167777b60003f9","treeId":"5fe4d3bb553d829d700002a7","seq":22429401,"position":0.25,"parentId":"4c92fec809b26e012800008d","content":"### ECE 657A Youtube Videos\nIntroductory topics on this from my graduate course ECE 657A are available on youtube and mostly applicable to this course as well. 
\n\n[ ] add link"},{"_id":"4c92fe6909b26e012800008e","treeId":"5fe4d3bb553d829d700002a7","seq":20701608,"position":1.25,"parentId":"4c92fec809b26e012800008d","content":"### ECE 108 YouTube Videos\nFor a very fundamental view of probability from another course of Prof. Crowley you can view the lectures and tutorials for ECE 108\n\nECE 108 Youtube (look at \"future lectures\" and \"future tutorials\" for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists\n\nThe last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory."},{"_id":"5e77880ec94dab71ec000089","treeId":"5fe4d3bb553d829d700002a7","seq":22436001,"position":2,"parentId":"4c92fec809b26e012800008d","content":"### Likelihood, Loss and Risk\nA Good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, MAP are all connected.\nhttps://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/"},{"_id":"4c92f7ec09b26e0128000090","treeId":"5fe4d3bb553d829d700002a7","seq":22574118,"position":2.5,"parentId":"4c92fec809b26e012800008d","content":"### Probability Intro Markdown Notes\nFrom the course website for a previous year. Some of this we won't need so much but they are all useful to know for Machine Learning methods in general. 
\n\nhttps://compthinking.github.io/RLCourseNotes/\n\n- Basic probability definitions\n- conditional probability\n- Expectation\n- Inference in Graphical Models\n- Variational Inference\n\n\n "},{"_id":"5dbdc38da0d422009a0000bd","treeId":"5fe4d3bb553d829d700002a7","seq":22613464,"position":0.5,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 2.1 - Basic Decision Making Models - Multiarmed Bandits\n**Week 2**\n**Textbook Sections:** [SB 1.1, 1.2, 17.6]"},{"_id":"4c3f41e6fbcdd860c6000093","treeId":"5fe4d3bb553d829d700002a7","seq":22574120,"position":0.25,"parentId":"5dbdc38da0d422009a0000bd","content":"### Videos\n-* Part 1 - Live Lecture May 17, 2021 on *Virtual Classroom - [View Live Here](https://bongo-ca.youseeu.com/sync-activity/invite/1747607/73748f05469e35afeaf7ea19d353ced8?lti-scope=d2l-resource-syncmeeting-list)\n- Part 2 - Bandits and Values (the sound is horrible! we'll record a new one) - https://youtu.be/zVIv1ipnubA\n- Part 3 - Regret Minimization, UCB and Thompson Sampling - https://youtu.be/a0OcuuglkHQ"},{"_id":"5dbab04ba0d422009a0000bf","treeId":"5fe4d3bb553d829d700002a7","seq":17959418,"position":0.5,"parentId":"5dbdc38da0d422009a0000bd","content":"### Multiarmed Bandit : Solving it via Reinforcement Learning in Python\n- Quite a good blog post with all the concepts laid out in simple terms in order https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/"},{"_id":"5dbdc2dea0d422009a0000be","treeId":"5fe4d3bb553d829d700002a7","seq":20805936,"position":1,"parentId":"5dbdc38da0d422009a0000bd","content":"### Thompson Sampling\n- Long tutorial on Thompson Sampling with more background and theory. 
Nice charts as well: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf"},{"_id":"4c494fe865cbc6c2910003c6","treeId":"5fe4d3bb553d829d700002a7","seq":22613465,"position":1.875,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 3 - Markov Decision Processes\n**Week 3**\n\n### Textbook Sections\n- Markov Decision Processes\n[SB 3.0-3.4]\n- Solving MDPs Exactly\n[SB 3.5, 3.6, 3.7]"},{"_id":"4bb33d5dd177fa84520000a7","treeId":"5fe4d3bb553d829d700002a7","seq":22613456,"position":1,"parentId":"4c494fe865cbc6c2910003c6","content":"### Playlist:\n- MDPs Chp 3 : https://youtube.com/playlist?list=PLrV5TcaW6bIX_wnVztMoDFk_8ybteeW7Y\n\n### Individual Videos:\n- Markov Decision Processes 3.0-3.1: \nhttps://youtu.be/pGW1wP4jJas\n- Rewards and Returns 3.3-3.4: https://youtu.be/K7ymZkEd0ZA\n- Value Functions 3.5 - 3.6 : https://youtu.be/lNBXDgAthmQ"},{"_id":"4c37bc29c5d471e911000097","treeId":"5fe4d3bb553d829d700002a7","seq":22613463,"position":2.5625,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 4 - Dynamic Programming\n**Week 4**\n*Former title: The Reinforcement Learning Problem*\n**Textbook Sections:**[SB 4.1-4.4]"},{"_id":"4b351e35a5e88a34d000004e","treeId":"5fe4d3bb553d829d700002a7","seq":21076335,"position":2,"parentId":"4c37bc29c5d471e911000097","content":"### Videos:\n- Dynamic Programming 1: https://youtu.be/nhyCQK4v4Cw\n- Dynamic Programming 2 : Policy and Value Iteration: https://youtu.be/NHN02JnGmdQ\n- Dynamic Programming 3 : Generalized Policy Iteration and Asynchronous Value Iteration https://youtu.be/7gfRBYpzhxU"},{"_id":"5c8f6a314d10ff3e9c000098","treeId":"5fe4d3bb553d829d700002a7","seq":22613527,"position":8,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 5 - Temporal Difference Learning - Part 1\n**Week 5** (June 7-11)\n\n**Textbook Sections:** Selections from [SB chap 5], [SB 6.0 - 6.5]\n- Quick intro to Monte-Carlo methods\n- Temporal Difference 
Updating\n\n"},{"_id":"4ace81878043f4de720000a6","treeId":"5fe4d3bb553d829d700002a7","seq":22613481,"position":1,"parentId":"5c8f6a314d10ff3e9c000098","content":"### Videos\n- [Week 5 Youtube Playlist](https://youtube.com/playlist?list=PLrV5TcaW6bIUiMLNDYq7cnHhgq0toIFOk)\n\nParts:\n- Just the MC Lecture part - https://youtu.be/b1C_2x6IUUw\n- Temporal Difference Learning 1 - Introduction https://youtu.be/pJyz6OZiIBo\n- Temporal Difference Learning 2 - Comparison to Monte-Carlo Method on Random Walk\nhttps://youtu.be/NVtoj4XRRZw\n"},{"_id":"485d37021e28d34c2d000171","treeId":"5fe4d3bb553d829d700002a7","seq":22613568,"position":8.4375,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 5.1 - TD Learning - Part 2\n**Week 6** (June 14-17)\n- SARSA\n- Q-Learning\n- Expected SARSA\n- Double Q-Learning"},{"_id":"385a3ffb55a18817e20000db","treeId":"5fe4d3bb553d829d700002a7","seq":22616935,"position":2,"parentId":"485d37021e28d34c2d000171","content":"### Videos\n- [Week 5 Youtube Playlist](https://youtube.com/playlist?list=PLrV5TcaW6bIUiMLNDYq7cnHhgq0toIFOk)\n- Temporal Difference Learning 3 - Sarsa and QLearning Algorithms\nhttps://youtu.be/nEDblNhoL2E\n- Temporal Difference Learning 4 - Expected Sarsa and Double Q-Learning\nhttps://youtu.be/uGFb0mtJW00"},{"_id":"3883e2e71b6da149c50000da","treeId":"5fe4d3bb553d829d700002a7","seq":22613571,"position":3,"parentId":"485d37021e28d34c2d000171","content":"### Live Lecture/Discussion June 14 4pm\nThere will be given as a **Live Lecture** on June 14, 2021 during the *4pm-5:30pm ET Live Session*.\n\nAccording to my youtube analytics, very few people have watched the first two lectures on Temporal Difference Learning or Monte Carlo. But there were a fair number looking at SARSA and QLearning (probably because they are the most famous, fair enough).\n\n---\n![test caption](https://www.filepicker.io/api/file/EEYou1T1SKLZB6Zq2JcX)\n*This plot is views per video. I removed the even higher video from the first three weeks. 
\nThe first bar is \"Dynamic Programming 1\" with 76 views. (as of June 11, 2021 5:27pm ET)*\n\n---\n\nI was planning to record a new video (that isn't on youtube yet) on the following during the live session on Monday:\n- **Expected SARSA and Double Q-Learning** - these are just modifications of those important algorithms, that have their own benefits. So there will be time to review the essentials of SARSA/QL here at the same time. If there are few people attending or no other discussion here, I will do that.\n\n*But...* if lots of people show up, we could also:\n- go over something else from earlier in the course they didn't quite understand\n- or something that they didn't get a chance to watch yet\n\nSo let me know here what topic you would want to go over, or redo live: \n- Maybe it's all the essentials from TD1 and TD2 that you really need for SARSA and QLearning.\n- Or maybe it's essentials from some of these earlier topics that people seemed to have skipped like Dynamic Programming, or Mont-Carlo methods.\n\nI'll check this post on Sunday/Monday and see which option it will be.\n"},{"_id":"38328f59ed3f6703010000e0","treeId":"5fe4d3bb553d829d700002a7","seq":22616809,"position":8.4609375,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Part 1 Review \n**Week 7** (June 21 - 25)\nGo over any questions or open topics from first 6 weeks."},{"_id":"3859914555a18817e20000e0","treeId":"5fe4d3bb553d829d700002a7","seq":22616800,"position":8.484375,"parentId":"5fae8ea6b1ba66c81c00004f","content":"---\n\n## MIDTERM Exam\n**Week 7**\nQuestions on Midterm (June 23-25) can be on any topics up to this point, Weeks 1-6 inclusive."},{"_id":"3859d17555a18817e20000de","treeId":"5fe4d3bb553d829d700002a7","seq":22613539,"position":8.53125,"parentId":"5fae8ea6b1ba66c81c00004f","content":"---"},{"_id":"4ae07ab7cbafb23a0b000244","treeId":"5fe4d3bb553d829d700002a7","seq":22616804,"position":8.625,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## (!SKIP!) 
Topic 5.2 - N-Step TD and Eligibility Traces\n**optional topic**\n**Textbook Sections:** [SB 12.1, 12.2]\n\n**Note:** Given the pace that people are watching videos, we will drop this topic. It is less essential in the Deep RL era although very interesting theoretically. Calendar will be updated accordingly."},{"_id":"5c8f51374d10ff3e9c00009c","treeId":"5fe4d3bb553d829d700002a7","seq":22435999,"position":1,"parentId":"4ae07ab7cbafb23a0b000244","content":"Eligibility traces, in a tabular setting, lead to a significant benefit in training time in additional to the Temporal Difference method. \n\nIn Deep RL it is very common to use **experience replay** to reduce overfitting and bias to recent experiences. However, experience replay makes it very hard to leverage eligibility traces which require a sequence of actions to distribute reward backwards."},{"_id":"3859f08a55a18817e20000dd","treeId":"5fe4d3bb553d829d700002a7","seq":22613526,"position":1.25,"parentId":"4ae07ab7cbafb23a0b000244","content":"### Videos:\n- youtube playlist : https://youtube.com/playlist?list=PLrV5TcaW6bIVtMNt_dZMdMQ9JdtzV5VWS"},{"_id":"3bc0e966278148f4ee000ca3","treeId":"5fe4d3bb553d829d700002a7","seq":22435995,"position":1.5,"parentId":"4ae07ab7cbafb23a0b000244","content":"### Other Resources:"},{"_id":"5c8f67d74d10ff3e9c00009b","treeId":"5fe4d3bb553d829d700002a7","seq":22435996,"position":2,"parentId":"4ae07ab7cbafb23a0b000244","content":"- [Discussion about Incompatibility of Eligibility Traces with Experience Replay](https://stats.stackexchange.com/questions/341027/eligibility-traces-vs-experience-replay/341038)"},{"_id":"5c8f68bf4d10ff3e9c00009a","treeId":"5fe4d3bb553d829d700002a7","seq":22435997,"position":3,"parentId":"4ae07ab7cbafb23a0b000244","content":"- [Efficient Eligibility Traces for Deep Reinforcement Learning - \nBrett Daley, Christopher 
Amato](https://arxiv.org/abs/1810.09967)"},{"_id":"5c8f69eb4d10ff3e9c000099","treeId":"5fe4d3bb553d829d700002a7","seq":22435998,"position":4,"parentId":"4ae07ab7cbafb23a0b000244","content":"- [Investigating Recurrence and Eligibility Traces in Deep Q-Networks -\nJean Harb, Doina Precup](https://arxiv.org/abs/1704.05495)"},{"_id":"5c0cd8c60bd10517f7000071","treeId":"5fe4d3bb553d829d700002a7","seq":22616807,"position":8.65625,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 6 - State Representation & Value Function Approximation\n**Week 8** (June 28- July 2)\n"},{"_id":"606cb6c741b97f0409060d5c","treeId":"5fe4d3bb553d829d700002a7","seq":22436029,"position":0.25,"parentId":"5c0cd8c60bd10517f7000071","content":"### VFA Concept\nA **Value Function Approximation (VFA)**\n is a necessary technique to use whenever the size of the state of action spaces become too large to represent the value function explicitly as a table. In practice, any practical problem needs to use a VFA.\n"},{"_id":"606cb6c741b97f0409060d5d","treeId":"5fe4d3bb553d829d700002a7","seq":22436072,"position":1,"parentId":"606cb6c741b97f0409060d5c","content":"#### Benefits of VFA\n- Reduce memory need to store the functions (transition, reward, value etc)\n- Reduce computation to look up values\n- Reduce experience needed to find the optimal value or policy (sample efficiency)\n- For continuous state spaces, a coarse coding or tile coding can be effective"},{"_id":"606cb6c741b97f0409060d62","treeId":"5fe4d3bb553d829d700002a7","seq":22436053,"position":2,"parentId":"606cb6c741b97f0409060d5c","content":"#### Types of Function Approximators\n- Linear function approximations (linear combination of features)\n- Neural Networks\n- Decision Trees\n- Nearest Neighbors\n- Fourier/ wavelet bases"},{"_id":"606cb6c741b97f0409060d68","treeId":"5fe4d3bb553d829d700002a7","seq":22436034,"position":3,"parentId":"606cb6c741b97f0409060d5c","content":"#### Finding an Optimal Value Function\nWhen using a VFA, you can 
use a Stochastic Gradient Descent (SGD) method to search for the best weights for your value function according to experience. \nThis parametric form the value function will then be used to obtain a *greedy* or *epsilon-greedy* policy at run-time.\n\nThis is why using a VFA + SGD is still different from a Direct Policy Search approach where you optimize the parameters of the policy directly.\n"},{"_id":"49b6caa7543a9264380000a7","treeId":"5fe4d3bb553d829d700002a7","seq":22436003,"position":0.5,"parentId":"5c0cd8c60bd10517f7000071","content":"### Video:\n- Lecture on Value Function Approximation approaches - https://youtu.be/7Dg6KiI_0eM"},{"_id":"3bc0df3d278148f4ee000ca4","treeId":"5fe4d3bb553d829d700002a7","seq":22436007,"position":0.75,"parentId":"5c0cd8c60bd10517f7000071","content":"### Other Resources:"},{"_id":"5c0cd8180bd10517f7000072","treeId":"5fe4d3bb553d829d700002a7","seq":22436075,"position":1,"parentId":"3bc0df3d278148f4ee000ca4","content":"- [How to use a shallow, linear approximation for Atari](https://www.amii.ca/the-success-of-dqn-explained-by-shallow-reinforcement-learning/) - This post explains a paper showing how to achieve the same performance as the Deep RL DQN method for Atari using carefully constructed linear value function approximation."},{"_id":"5bad6d2b44bacb1e5f000074","treeId":"5fe4d3bb553d829d700002a7","seq":22616808,"position":8.6875,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 7 - Direct Policy Search\n**Week 9** (July 5 - 9)\n[SB 13.1, 13.2, 13.5]\n- Policy Gradients\n- Actor-Critic\n"},{"_id":"5f07525027e746044eb71bcb","treeId":"5fe4d3bb553d829d700002a7","seq":22436106,"position":1,"parentId":"5bad6d2b44bacb1e5f000074","content":"### Video:\n- Lecture on Policy Gradient methods - \nhttps://youtu.be/SqulTcLHRnY"},{"_id":"5c2619c2456eb77c410000ec","treeId":"5fe4d3bb553d829d700002a7","seq":22436101,"position":2,"parentId":"5bad6d2b44bacb1e5f000074","content":"### Policy Gradient Algorithms\nSome of the posts used for 
lecture on July 26.\n\n- A good post with all the fundamental math for policy gradients.\nhttps://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c\n- Also a good intro post about Policy gradients vs DQN by great ML blogger Andrej Karpathy (this is the one I showed in class with the Pong example):\nhttp://karpathy.github.io/2016/05/31/rl/\n- The Open-AI page on the PPO algorithm used on their simulator domains of humanoid robots:\nhttps://openai.com/blog/openai-baselines-ppo/\n- Good description of Actor-Critic approach using Sonic the Hedgehog game as example:\nhttps://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/\n- Blog post about how the original Alpha Go solution worked using Policy Gradient RL and Monte-Carlo Tree Search:\nhttps://medium.com/@jonathan_hui/alphago-how-it-works-technically-26ddcc085319"},{"_id":"5c0d1f340bd10517f7000070","treeId":"5fe4d3bb553d829d700002a7","seq":22436100,"position":2.5,"parentId":"5bad6d2b44bacb1e5f000074","content":"### Actor-Critic Algorithm\nVery clear blog post on describing Actor-Critic Algorithms to improve Policy Gradients\nhttps://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/"},{"_id":"5bad534e44bacb1e5f000075","treeId":"5fe4d3bb553d829d700002a7","seq":22436099,"position":3,"parentId":"5bad6d2b44bacb1e5f000074","content":"### Cutting Edge Algorithms\nGoing beyond what we covered in class, here are some exciting trends and new advances in RL research in the past few years to find out more about.\nPG methods are a fast changing area of RL research. 
This post has a number of the successful algorithms in this area from a few years ago:\nhttps://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#actor-critic"},{"_id":"4c37b768c5d471e91100009a","treeId":"5fe4d3bb553d829d700002a7","seq":22613535,"position":10.5,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 8 - Basics of Neural Networks and Deep RL as DQN\n**Week 10**\n**Note:** If *Topic 5.2* is dropped this will be a week earlier."},{"_id":"4c37b6a0c5d471e91100009b","treeId":"5fe4d3bb553d829d700002a7","seq":22613536,"position":10.75,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 9 - Using DQN to defeat Atari and Go (MCTS+DQN=AlphaGo)\n**Week 11**\n**Note:** If *Topic 5.2* is dropped this will be a week earlier."},{"_id":"5fddb2b3553d829d700002ad","treeId":"5fe4d3bb553d829d700002a7","seq":22613513,"position":1,"parentId":"4c37b6a0c5d471e91100009b","content":"### Alpha Go Documentary\nhttps://youtu.be/jGyCsVhtW0M\n"},{"_id":"5fddb43c553d829d700002ac","treeId":"5fe4d3bb553d829d700002a7","seq":22613516,"position":2,"parentId":"4c37b6a0c5d471e91100009b","content":"- `Get Timepoint`: Jump straight to the part of the Alpha Go Documentary where they explain the learning process Alpha Go uses. 
It also is the start of the first moment where the program does a creative move that humans did not expect.\nhttps://youtu.be/jGyCsVhtW0M?t=2834\n"},{"_id":"5fdc5274553d829d700002ae","treeId":"5fe4d3bb553d829d700002a7","seq":22613519,"position":3,"parentId":"4c37b6a0c5d471e91100009b","content":"- Analysis of What Alpha Go was \"thinking\" when it played Sedol Lee\nhttps://www.wired.com/2016/03/googles-ai-viewed-move-no-human-understand/"},{"_id":"4c37b65fc5d471e91100009c","treeId":"5fe4d3bb553d829d700002a7","seq":22613520,"position":10.875,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Topic 11 - Deep RL Beyond DQN\n**Week 12**\n- DDPG\n- A2C\n- PPO\n"},{"_id":"385a07cb55a18817e20000dc","treeId":"5fe4d3bb553d829d700002a7","seq":22613522,"position":10.9375,"parentId":"5fae8ea6b1ba66c81c00004f","content":"## Review\n**Week 13**"},{"_id":"5fae8e69b1ba66c81c000050","treeId":"5fe4d3bb553d829d700002a7","seq":22558481,"position":1.73828125,"parentId":null,"content":"# Primary References for Course"},{"_id":"5fae93a0b1ba66c81c00004e","treeId":"5fe4d3bb553d829d700002a7","seq":22558476,"position":3,"parentId":"5fae8e69b1ba66c81c000050","content":"**[SuttonBarto2018]** - Reinforcement Learning: An Introduction. 
Book, free pdf of draft available.\nhttp://incompleteideas.net/book/the-book-2nd.html"},{"_id":"5c32fe007484cb6591000119","treeId":"5fe4d3bb553d829d700002a7","seq":22558480,"position":1.9296875,"parentId":null,"content":"<h1 id=\"resources\">Additional Resources</h1>\n"},{"_id":"3a17e2e62230ed09fd000649","treeId":"5fe4d3bb553d829d700002a7","seq":22551688,"position":0.125,"parentId":"5c32fe007484cb6591000119","content":"## Other Useful Texts"},{"_id":"5fdbb1a74031e9485600004a","treeId":"5fe4d3bb553d829d700002a7","seq":22551680,"position":0.25,"parentId":"5c32fe007484cb6591000119","content":"**[Dimitrakakis2019]** - Decision Making Under Uncertainty and Reinforcement Learning\n\nhttp://www.cse.chalmers.se/~chrdimi/downloads/book.pdf"},{"_id":"5fdbb1e14031e94856000049","treeId":"5fe4d3bb553d829d700002a7","seq":22551687,"position":0.375,"parentId":"5c32fe007484cb6591000119","content":"**[Ghavamzadeh2016]** - Bayesian Reinforcement Learning: A Survey. Ghavamzadeh et al. 2016.\nhttps://arxiv.org/abs/1609.04436"},{"_id":"3a1556155baaa8a6db000257","treeId":"5fe4d3bb553d829d700002a7","seq":22552035,"position":0.4375,"parentId":"5c32fe007484cb6591000119","content":"- More probability notes online: https://compthinking.github.io/RLCourseNotes/\n"},{"_id":"49d2aa8ba6966f3bd2000245","treeId":"5fe4d3bb553d829d700002a7","seq":22551690,"position":0.5,"parentId":"5c32fe007484cb6591000119","content":"## Open AI Reference Website\nThis website is a great resource. It lays out concepts from start to finish. Once you get through the first half of our course, many of the concepts on this site will be familiar to you.\n\n### Key Papers in Deep RL List\nhttps://spinningup.openai.com/en/latest/spinningup/keypapers.html\n\n### Fundamental RL Concepts Overview \nThe fundamentals of RL are briefly covered here. 
We will go into all this and more in detail in our course.\nhttps://spinningup.openai.com/en/latest/spinningup/rl_intro.html\n\n### Family Tree of Algorithms\nHere is a list of algorithms that were at the cutting edge of RL as of a year or so ago, so it's a good place to find out more. But in such a fast-growing field, it may be somewhat out of date by now.\nhttps://spinningup.openai.com/en/latest/spinningup/rl_intro2.html"},{"_id":"5c32fd867484cb659100011a","treeId":"5fe4d3bb553d829d700002a7","seq":20641665,"position":1,"parentId":"5c32fe007484cb6591000119","content":"## Reinforcement Learning Tutorial with Demo on GitHub\nThis is a thorough collection of slides from a few different texts and courses, laid out with the essentials from basic decision making to Deep RL. There are also code examples for some of their own simple domains.\nhttps://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo#ExperienceReplay"},{"_id":"5c261b33456eb77c410000eb","treeId":"5fe4d3bb553d829d700002a7","seq":18278594,"position":2,"parentId":"5c32fe007484cb6591000119","content":"## Deep Q Network vs Policy Gradients - An Experiment on VizDoom with Keras\nA nice blog post comparing DQN and Policy Gradient algorithms such as A2C.\nhttps://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html"},{"_id":"3ae2f299afedc620250000d5","treeId":"5fe4d3bb553d829d700002a7","seq":22509415,"position":3,"parentId":"5c32fe007484cb6591000119","content":"## Online Courses\n- Coursera/University of Alberta (Martha White): https://www.coursera.org/specializations/reinforcement-learning#courses"},{"_id":"4ca67c14d2c75ce94900016f","treeId":"5fe4d3bb553d829d700002a7","seq":22552042,"position":3.1015625,"parentId":null,"content":"# Videos to Watch on RL (Current Research)"},{"_id":"4ca66d9dd2c75ce949000171","treeId":"5fe4d3bb553d829d700002a7","seq":22551662,"position":1,"parentId":"4ca67c14d2c75ce94900016f","content":"## Conferences 2020\n- Multiple talks at [Canadian AI 
2020](https://www.caiac.ca/en/conferences/canadianai-2020/program) conference.\n - Csaba Szepesvari (U. Alberta)\n- The AAMAS 2021 conference just finished recently and is focused on decision making and planning, with lots of RL papers.\n - See their [Twitter Feed](https://twitter.com/Aamas2020C?ref_src=twsrc%5Etfw%7Ctwcamp%5Eembeddedtimeline%7Ctwterm%5Eprofile%3AAamas2020C&ref_url=https%3A%2F%2Faamas2020.conference.auckland.ac.nz%2F) for links to talks\n\n- ICLR 2020 conference (https://iclr.cc/virtual_2020/index.html)\n"},{"_id":"4c92f99c09b26e012800008f","treeId":"5fe4d3bb553d829d700002a7","seq":22551692,"position":3.21875,"parentId":null,"content":"# Old Topics Archive\nOther resources connected with previous versions of the course; I'm happy to talk about any of these if people are interested."},{"_id":"5fdbfef14031e94856000041","treeId":"5fe4d3bb553d829d700002a7","seq":20650526,"position":1,"parentId":"4c92f99c09b26e012800008f","content":"## Bayes Nets (dropped)"},{"_id":"5fdbfe8e4031e94856000043","treeId":"5fe4d3bb553d829d700002a7","seq":17757718,"position":0.5,"parentId":"5fdbfef14031e94856000041","content":"### Tools"},{"_id":"5f526d57956a4cdb8d000053","treeId":"5fe4d3bb553d829d700002a7","seq":17789849,"position":0.5,"parentId":"5fdbfe8e4031e94856000043","content":"**SamIam Bayesian Network GUI Tool**\n- Java GUI tool for playing with BNs (it's old but it's good)\nhttp://reasoning.cs.ucla.edu/samiam/index.php?h=emodels\n\n**Other Tools**\n- Bayesian Belief Networks Python Package:\nAllows creation of Bayesian Belief Networks and other Graphical Models with pure Python functions. 
Exact inference is used where tractable.\nhttps://github.com/eBay/bayesian-belief-networks\n- Python library for conjugate exponential family BNs and variational inference only\nhttp://www.bayespy.org/intro.html\n- OpenMarkov\nhttp://www.openmarkov.org/\n- OpenGM (C++ library)\nhttp://hciweb2.iwr.uni-heidelberg.de/opengm/"},{"_id":"5f7efed8e052795359000051","treeId":"5fe4d3bb553d829d700002a7","seq":17757725,"position":2,"parentId":"5fdbfef14031e94856000041","content":"### References\n"},{"_id":"5f7efe69e052795359000052","treeId":"5fe4d3bb553d829d700002a7","seq":17757726,"position":1,"parentId":"5f7efed8e052795359000051","content":"Some videos and resources on Bayes Nets, d-separation, the Bayes Ball algorithm, and more:\nhttps://metacademy.org/graphs/concepts/bayes_ball"},{"_id":"5e6a1a13a97a1a2ecb0002c6","treeId":"5fe4d3bb553d829d700002a7","seq":20650529,"position":2,"parentId":"4c92f99c09b26e012800008f","content":"## Conjugate Priors (dropped)"},{"_id":"5e6a19c8a97a1a2ecb0002c7","treeId":"5fe4d3bb553d829d700002a7","seq":17873346,"position":1,"parentId":"5e6a1a13a97a1a2ecb0002c6","content":"https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions"},{"_id":"5fdbb2c74031e94856000047","treeId":"5fe4d3bb553d829d700002a7","seq":22297610,"position":3,"parentId":"4c92f99c09b26e012800008f","content":"## Primary References for Probabilistic Reasoning (mostly dropped)"},{"_id":"5fdbb2814031e94856000048","treeId":"5fe4d3bb553d829d700002a7","seq":17690088,"position":1,"parentId":"5fdbb2c74031e94856000047","content":"**[Ermon2019]** - The first half of the notes are based on Stanford CS 228 (https://ermongroup.github.io/cs228-notes/), which goes into even more detail on PGMs than we will.\n"},{"_id":"5fdbafec4031e9485600004b","treeId":"5fe4d3bb553d829d700002a7","seq":20641709,"position":1.25,"parentId":"5fdbb2c74031e94856000047","content":"**[Cam Davidson 2018]** - Bayesian Methods for Hackers - Probabilistic Programming textbook as a set of Python 
notebooks.\nhttps://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#contents"},{"_id":"5f5269c8956a4cdb8d000054","treeId":"5fe4d3bb553d829d700002a7","seq":17786788,"position":3,"parentId":"5fdbb2c74031e94856000047","content":"**[Koller, Friedman, 2009]** Probabilistic Graphical Models: Principles and Techniques\nThe extensive theoretical book on PGMs.\nhttps://mitpress.mit.edu/books/probabilistic-graphical-models"}],"tree":{"_id":"5fe4d3bb553d829d700002a7","name":"Course 457C - Links For Students","publicUrl":"course-457c-links-for-students"}}