• # Course References, Links and Random Notes

Title: Probabilistic Reasoning and Reinforcement Learning
Info: ECE 457C - Reinforcement Learning
Instructor: Prof. Mark Crowley, ECE Department, UWaterloo

NOTE: Ignore the weekly dates, they are from a previous year

Website: markcrowley.ca/rlcourse

• # Topics

Primary Textbook: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018 [SB]

Some topics are not covered in the SB textbook, or are covered there in much more detail than in the lectures. We will continue to update this list with references as the term progresses.

1. Motivation & Context [SB 1.1, 1.2, 17.6]
2. Decision Making Under Uncertainty [SB 2.1-2.3, 2.7, 3.1-3.3]
3. Solving MDPs [SB 3.5, 3.6, 4.1-4.4]
4. The RL Problem [SB 3.7, 6.4, 6.5]
5. TD Learning [SB 12.1, 12.2]
6. Policy Search [SB 13.1, 13.2, 13.5]
7. State Representation & Value Function Approximation
8. Basics of Neural Networks
9. Deep RL
10. AlphaGo and MCTS
11. Quick Overview of Other Topics:
    1. MARL
    2. Free Energy
    3. Hierarchical RL
    4. Supervised Learning for RL and Curriculum Learning

Skipped Topics:

1. POMDPs (skipped in S22)

• # Old Topics Archive

Other resources connected with previous versions of the course. I’m happy to talk about any of these if people are interested.

• ## Course Description :

Introduction to Reinforcement Learning (RL) theory and algorithms for learning decision-making policies in situations with uncertainty and limited information. Topics include Markov decision processes, classic exact/approximate RL algorithms such as value/policy iteration, Q-learning, State-action-reward-state-action (SARSA), Temporal Difference (TD) methods, policy gradients, actor-critic, and Deep RL such as Deep Q-Learning (DQN), Asynchronous Advantage Actor Critic (A3C), and Deep Deterministic Policy Gradient (DDPG). [Offered: S, first offered Spring 2019]

• ## Basic Decision Making Models - Multiarmed Bandits

Textbook Sections: [SB 1.1, 1.2, 17.6]

• ## Markov Decision Processes

### Textbook Sections

• Markov Decision Processes
[SB 3.0-3.4]
• Solving MDPs Exactly
[SB 3.5, 3.6, 3.7]
• ## Dynamic Programming

Former title: The Reinforcement Learning Problem
Textbook Sections: [SB 4.1-4.4]

• ## Temporal Difference Learning

Textbook Sections: Selections from [SB chap 5], [SB 6.0 - 6.5]

• Quick intro to Monte-Carlo methods
• Temporal Difference Updating
• SARSA
• Q-Learning
• Expected SARSA
• Double Q-Learning
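
The tabular update rules named in the list above differ mainly in the target they bootstrap from. The following is a minimal sketch, not the course's reference code: the function names, the dict-based `Q` table keyed by `(state, action)`, and the default step size and discount are all assumptions made for illustration.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD control: bootstrap from the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy TD control: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Hypothetical two-action example: only "right" has value in state "s1".
actions = ["left", "right"]
Q = defaultdict(float)
Q[("s1", "right")] = 1.0

# SARSA follows the (exploratory) next action "left", so nothing changes.
sarsa_update(Q, "s0", "left", 0.0, "s1", "left")
# Q-learning looks at the best next action regardless of what is taken.
q_learning_update(Q, "s0", "right", 0.0, "s1", actions)
```

Expected SARSA would replace the `max` with an expectation over the policy's action probabilities in `s_next`, and Double Q-Learning would keep two tables, using one to choose the action and the other to evaluate it.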
• ## N-Step TD and Eligibility Traces

Textbook Sections: [SB 12.1, 12.2]

• ## Part 1 Review

Go over any questions or open topics from the first 6 weeks.

• ## Deep Learning Fundamentals

• [SB 13.1, 13.2, 13.5]

• Actor-Critic
• ## Evaluating RL Algorithms and Double DQN

• discussion of evaluation metrics for RL algorithms
• training hyper-parameters vs. algorithm parameters
• Double DQN bringing back the Double-Q-Learning idea and giving it new life to solve optimism bias
• Note: the content listed in LEARN for the S22 offering is updated more frequently and more consistently than this list.
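
The Double DQN idea mentioned above can be sketched by comparing the two bootstrap targets. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the Q-values for the next state are passed in as plain Python lists (one entry per action), and no neural network or replay buffer is shown.

```python
def dqn_target(reward, q_target_next, gamma=0.99, done=False):
    """Standard DQN target: the target network both selects and evaluates
    the next action, which tends to overestimate values."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=False):
    """Double DQN target: the online network selects the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical next-state estimates from the two networks.
q_online_next = [1.0, 2.0]   # online network prefers action 1
q_target_next = [5.0, 0.5]   # target network has an inflated estimate for action 0

y_dqn = dqn_target(0.0, q_target_next, gamma=0.5)
y_double = double_dqn_target(0.0, q_online_next, q_target_next, gamma=0.5)
```

Decoupling selection from evaluation means a single inflated estimate in one network is less likely to be chosen and propagated, which is the same idea as tabular Double Q-Learning.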

• Trust Region Methods
• TRPO
• PPO

• ## Looking Ahead with Tree Search - MCTS and AlphaGo

• Monte-Carlo Tree Search (MCTS)
• How AlphaGo works (combining A2C and MCTS)
• ## RL Next Steps

• An overview of next steps in learning more about RL research and applications
• Going Beyond: MARL, Hierarchical RL, Supervised and Curriculum Learning
• Find out about Big New Ideas: LeCun, DeepMind, OpenAI, Friston
• Get Involved: Competitions and OpenSource
• You can find the slides here: RL Next Steps

Week 13

• # E7 Elevator Pitch

## Defining the MDP

### States

• Elevators : $e_i \in E$, $i \in \{1, \dots, 7\}$
• Floors : $f \in \{1, \dots, 8\}$
• Location : $L(e_i) : E \rightarrow f$ - which floor is the elevator on?
• Outside Button: $B^f_{i,dir} \in \{0,1\}$, $dir \in \{up, down\}$
• Movement: $M(e_i): E \rightarrow \{up, stopped, down\}$
• Doors: $G(e_i,f): E \times f \rightarrow \{closed, closing, opening, open\}$
• Next Floor: $NL(e_i) : E \rightarrow f \cup \{stopped\}$ - the next floor the elevator will arrive at; if the elevator is not currently moving, this returns “stopped”.
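
The state variables above can be sketched as a small data structure. This is only one possible encoding, written under simplifying assumptions that are not part of the original notes: the class and field names are hypothetical, each elevator tracks a single door state (at its current floor) rather than one per floor, outside buttons are keyed by `(floor, direction)` only, and the next floor is assumed to be one floor away in the direction of travel.

```python
from dataclasses import dataclass, field
from enum import Enum

NUM_ELEVATORS = 7  # i in {1, ..., 7}
NUM_FLOORS = 8     # f in {1, ..., 8}


class Movement(Enum):
    UP = 1
    STOPPED = 0
    DOWN = -1


class Door(Enum):
    CLOSED = "closed"
    CLOSING = "closing"
    OPENING = "opening"
    OPEN = "open"


@dataclass
class Elevator:
    location: int = 1                      # L(e_i)
    movement: Movement = Movement.STOPPED  # M(e_i)
    door: Door = Door.CLOSED               # G(e_i, f), current floor only (assumption)

    def next_floor(self):
        """NL(e_i): the next floor reached, or "stopped" if not moving."""
        if self.movement is Movement.STOPPED:
            return "stopped"
        return self.location + self.movement.value


@dataclass
class BuildingState:
    elevators: list = field(
        default_factory=lambda: [Elevator() for _ in range(NUM_ELEVATORS)]
    )
    # Outside buttons: 1 if pressed, 0 otherwise.
    buttons: dict = field(
        default_factory=lambda: {
            (f, d): 0 for f in range(1, NUM_FLOORS + 1) for d in ("up", "down")
        }
    )


state = BuildingState()
state.elevators[0].location = 3
state.elevators[0].movement = Movement.UP
state.buttons[(5, "down")] = 1
```

Writing the state out this way makes its size visible: even this simplified version has a very large number of joint configurations, which is part of why the questions about simulators and exploration later in this section matter.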

### Actions

In general: move the elevators, open/close the doors in order to maximize your objective function

At every moment the system can take any of the following actions. We can assume they only happen one at a time.

• Do nothing

• Open a door/Close a door : set $G(e_i,f)$

• Move an elevator up/down from current floor : set $M(e_i)$

• Stop an elevator at the current floor it is moving towards using $NL(e_i)$

## Dynamics

• Define dynamics
• # Questions

## Does the system need to remember that it just closed a door?

• Should we define actions to be “close door and move to floor f”?

## How would Exploration/Exploitation Work in This Domain?

• how long are you willing to annoy users to get the information you need?
• can we build a simulator for this system?

• ## Primary References for Probabilistic Reasoning (mostly dropped)

• ### ECE 657A Youtube Videos

Introductory topics on this from my graduate course ECE 657A - Data and Knowledge Modeling and Analysis are available on youtube and mostly applicable to this course as well.

Probability and Statistics Review (youtube playlist)

Containing Videos on:

• Conditional Prob and Bayes Theorem
• Comparing Distributions and Random Variables
• Hypothesis Testing
• ### ECE 108 YouTube Videos

For a very fundamental view of probability from another course of Prof. Crowley you can view the lectures and tutorials for ECE 108

ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists

The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

• ### Likelihood, Loss and Risk

A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, and MAP are all connected.
https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/

• ### Probability Intro Markdown Notes

From the course website for a previous year. We won’t need some of these topics much, but they are all useful to know for Machine Learning methods in general.

https://compthinking.github.io/RLCourseNotes/

• Basic probability definitions
• conditional probability
• Expectation
• Inference in Graphical Models
• Variational Inference
• Eligibility traces, in a tabular setting, lead to a significant benefit in training time in addition to the Temporal Difference method.

In Deep RL it is very common to use experience replay to reduce overfitting and bias to recent experiences. However, experience replay makes it very hard to leverage eligibility traces which require a sequence of actions to distribute reward backwards.
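
To make the point about sequences concrete, here is a minimal sketch of tabular TD(λ) with accumulating traces. It is an illustration under assumptions, not the lecture's implementation: the function name, the `(state, reward, next_state, done)` transition format, and the parameter defaults are all hypothetical.

```python
from collections import defaultdict


def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    `episode` must be an ordered list of (state, reward, next_state, done)
    transitions; the order is what lets traces push reward backwards.
    """
    traces = defaultdict(float)
    for s, r, s_next, done in episode:
        bootstrap = 0.0 if done else gamma * V[s_next]
        delta = r + bootstrap - V[s]          # one-step TD error
        traces[s] += 1.0                      # mark the current state as eligible
        for state in list(traces):
            V[state] += alpha * delta * traces[state]
            traces[state] *= gamma * lam      # decay eligibility over time
    return V


# Hypothetical two-step episode: A -> B -> terminal, reward only at the end.
episode = [("A", 0.0, "B", False), ("B", 1.0, "C", True)]

V_traces = td_lambda_episode(defaultdict(float), episode, alpha=0.5, lam=1.0)
V_one_step = td_lambda_episode(defaultdict(float), episode, alpha=0.5, lam=0.0)
```

With `lam=1.0` the final reward updates both `A` and `B` in a single pass, while with `lam=0.0` (plain one-step TD) only `B` changes. Sampling isolated transitions from a replay buffer breaks the ordered sequence that the `traces` dictionary relies on, which is the incompatibility described above.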

• ### VFA Concept

A Value Function Approximation (VFA) is a necessary technique whenever the state or action space becomes too large to represent the value function explicitly as a table. In practice, nearly any practical problem needs to use a VFA.

• ### Deep Learning

• Review, or learn, a bit about Deep Learning
• See videos and content from DKMA Course (ECE 657A)
• This youtube playlist is a targeted “Deep Learning Crash Course” ( #dnn-crashcourse-for-rl ) with just the essentials you’ll need for Deep RL.
• That course also has more detailed videos on Deep Learning which won’t be specifically useful for ECE 493, but which you can refer to if interested.
• ### Lect 9B - Deep Learning Introduction

In this video we go over some of the fundamental concepts that led to neural networks (such as linear regression and logistic regression models), the basic structure and formulation of classic neural networks and the history of their development.

• ### Lect 11A - 1 - Deep Learning Fundamentals

This video goes through a ground level description of logistic neural units, classic neural networks, modern activation functions and the idea of a Neural Network as a Universal Approximator.

• ### Lect 11A - 1.2 - Deep Learning - Gradient Descent

In this video we discuss the nuts and bolts of how training in Neural Networks (Deep or Shallow) works as a process of incremental optimization of weights via gradient descent. Topics discussed: Backpropagation algorithm, gradient descent, modern optimizer methods.

• ### Lect11B - 1 - DeepLearning - Fundamentals II

In this video we go over the fundamentals of Deep Learning from a different angle using the approach from Goodfellow et al.’s Deep Learning Textbook and their network graph notation for neural networks.

We describe the network diagram notation, and how to view neural networks in this way, focussing on the relationship between sets of weights and layers.

Other topics include: gradient descent, loss functions, cross-entropy, network output distribution types, softmax output for classification.

• ### Lect 11B - 2 - Deep Learning - Fundamentals III

This video continues with the approach from Goodfellow et al.’s Deep Learning Textbook and goes into detail about computational methods, efficiency and defining the measure being used for optimization.

Topics covered include: relationship of network depth to generalization power, computation benefits of convolutional network structures, revisiting the meaning of backpropagation, methods for defining loss functions

• ### Lect 11B - 3 - Deep Learning - Regularization

In this lecture I talk about some of the problems that can arise when training neural networks and how they can be mitigated. Topics include : overfitting, model complexity, vanishing gradients, catastrophic forgetting and interpretability.

• ### Lect 11B - 4 - Deep Learning - Data Augmentation and Vanishing Gradients

In this video we give an overview of several approaches for making DNNs more usable when data is limited with respect to the size of the network. Topics include data augmentation, residual network links, vanishing gradients.

• See the RL Next Steps tree for what was discussed in class July 22, 2022.

• ### References

• #### Benefits of VFA

• Reduce memory needed to store the functions (transition, reward, value, etc.)
• Reduce computation to look up values
• Reduce experience needed to find the optimal value or policy (sample efficiency)
• For continuous state spaces, a coarse coding or tile coding can be effective
• #### Types of Function Approximators

• Linear function approximations (linear combination of features)
• Neural Networks
• Decision Trees
• Nearest Neighbors
• Fourier/wavelet bases
• #### Finding an Optimal Value Function

When using a VFA, you can use a Stochastic Gradient Descent (SGD) method to search for the best weights for your value function according to experience.
This parametric form of the value function will then be used to obtain a greedy or epsilon-greedy policy at run-time.

This is why using a VFA + SGD is still different from a Direct Policy Search approach where you optimize the parameters of the policy directly.
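
The distinction above can be sketched in code: the weights being trained belong to the value function, and the policy is only derived from it. The example below is a hedged illustration, assuming a linear action-value function and a semi-gradient Q-learning style update; the function names, the per-action feature vectors, and the step size are hypothetical choices, not taken from the course material.

```python
import random


def q_value(weights, features):
    """Linear VFA: q(s, a) is a dot product of weights and features of (s, a)."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_q_step(weights, features, reward, next_features_per_action,
               alpha=0.1, gamma=0.9, done=False):
    """One semi-gradient SGD step on the TD error for a linear q-function."""
    if done:
        bootstrap = 0.0
    else:
        bootstrap = gamma * max(q_value(weights, f) for f in next_features_per_action)
    td_error = reward + bootstrap - q_value(weights, features)
    # For a linear model the gradient of q with respect to the weights
    # is just the feature vector.
    return [w + alpha * td_error * x for w, x in zip(weights, features)]


def epsilon_greedy(weights, features_per_action, epsilon=0.1, rng=random):
    """Derive the run-time policy from the learned value function."""
    if rng.random() < epsilon:
        return rng.randrange(len(features_per_action))
    values = [q_value(weights, f) for f in features_per_action]
    return max(range(len(values)), key=lambda a: values[a])


# Hypothetical one-hot features for two actions in a single state.
features_per_action = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.0, 0.0]

# Taking action 0 ends the episode with reward 1.
weights = sgd_q_step(weights, features_per_action[0], 1.0, [], alpha=0.5, done=True)
greedy_action = epsilon_greedy(weights, features_per_action, epsilon=0.0)
```

A direct policy search method would instead hold parameters for the action probabilities themselves and adjust those by gradient ascent on expected return, with no `max` over learned values at run-time.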

{"cards":[{"_id":"62a35171410c0a03aa77a6a9","treeId":"259ed199a2ac6763e000041a","seq":23066831,"position":1,"parentId":null,"content":"# Course References, Links and Random Notes \n**Title:** Probabilistic Reasoning and Reinforcement Learning\n**Info:** ECE 457C - Reinforcement Learning\n**Instructor:** [Prof. Mark Crowley](https://uwaterloo.ca/scholar/mcrowley), [ECE Department](https://uwaterloo.ca/electrical-computer-engineering/), [UWaterloo](https://uwaterloo.ca/)\n\n***NOTE:*** **Ignore the weekly dates, they are from a previous year**\n\n**Website:** [markcrowley.ca/rlcourse](https://markcrowley.ca/rlcourse/)\n\nLink to this Gingko Tree: [RL Course Links and Notes](https://gingkoapp.com/rlcourse)"},{"_id":"62a35171410c0a03aa77a6aa","treeId":"259ed199a2ac6763e000041a","seq":23066642,"position":1,"parentId":"62a35171410c0a03aa77a6a9","content":"## Course Description :\nIntroduction to Reinforcement Learning (RL) theory and algorithms for learning decision-making policies in situations with uncertainty and limited information. Topics include Markov decision processes, classic exact/approximate RL algorithms such as value/policy iteration, Q-learning, State-action-reward-state-action (SARSA), Temporal Difference (TD) methods, policy gradients, actor-critic, and Deep RL such as Deep Q-Learning (DQN), Asynchronous Advantage Actor Critic (A3C), and Deep Deterministic Policy Gradient (DDPG). 
[Offered: S, first offered Spring 2019]\n\n"},{"_id":"62b5cf52410c0a03aa977876","treeId":"259ed199a2ac6763e000041a","seq":23081588,"position":1,"parentId":"62a35171410c0a03aa77a6aa","content":"###### Style Sheet\n- *don't mess with it*\n- Style Sheet - see saved main version : https://gingkoapp.com/app#3bf9513db6a011c9e8000239\n\n<style>\nh1 {\n font-size: 3em;\n color: #d55;\n text-align: center;\n border-bottom: 5px solid #eee;\n}\n\nh2 {\n font-size: 2em;\n color: #a55;\n border-bottom: 2px solid #999;\n\n}\n\nh3 {\n font-size: 1.75em;\n color: #722;\n border-bottom: 1px dashed #999;\n\n}\n\nh4 {\n font-size: 1.25em;\n border-bottom: 1px dashed #aaa;\n}\nh5 {\n font-size: 1.25em;\n border-bottom: 1px dashed #aaa;\n}\nh5, h5 + * {\n display: none;\n}\n\nh6, h6 + * {\n font-size: 1em;\n color: #77b;\n}\n\n.fullscreen-overlay .fullscreen-container {\n max-width: 2500px;\n \n height:100%;\n margin:1 auto;\n padding:40px 0;\n}\n\n#migration-notice {\n display: none;\nvisible: false;\n}\n</style>\n"},{"_id":"62a35171410c0a03aa77a6ab","treeId":"259ed199a2ac6763e000041a","seq":23066643,"position":2,"parentId":null,"content":"# Course Resources\n- [Course Website](https://markcrowley.ca/rlcourse/) : contains course outline, grade breakdown, weekly schedule information\n- Notes and slides via the Textbook *(available free online)*:\n - [Reinforcement Learning: An Introduction\nSmall](http://incompleteideas.net/book/the-book-2nd.html) : Richard S. Sutton and Andrew G. Barto\nSutton Textbook, 2018\n- Course Youtube Channel : [Reinforcement Learning](https://www.youtube.com/channel/UC6p1AJ7jKNFp6OB2MmAoWvA/featured)\n- See [Additional Resources](#resources) for more online notes and reading.\n"},{"_id":"62a35171410c0a03aa77a6ac","treeId":"259ed199a2ac6763e000041a","seq":23107486,"position":3,"parentId":null,"content":"# Topics\n\nPrimary Textbook : [Reinforcement Learning: An Introduction\nSmall](http://incompleteideas.net/book/the-book-2nd.html) : Richard S. 
Sutton and Andrew G. Barto, 2018 [SB]\n\nSome topics are not covered in the SB textbook or they are covered in much more detail than the lectures. We will continue to update this list with references as the term progresses.\n\n1. Motivation & Context [SB 1.1, 1.2, 17.6]\n2. Decision Making Under Uncertainty [SB 2.1-2.3, 2.7, 3.1-3.3]\n3. Solving MDPs [SB 3.5, 3.6, 4.1-4.4]\n4. The RL Problem [SB 3.7, 6.4, 6.5]\n5. TD Learning [SB 12.1, 12.2]\n6. Policy Search [SB 13.1, 13.2, 13.5]\n7. State Representation & Value Function Approximation\n8. Basics of Neural Networks\n9. Deep RL\n11. AlphaGo and MCTS\n1. Quick Overview of Other Topics:\n 1. MARL\n 1. Free Energy\n 1. Hierarchical RL\n 1. Supervised Learning for RL and Curriculum Learning\n\nSkipped Topics:\n1. POMDPs (skipped in S22)\n\n"},{"_id":"62a35171410c0a03aa77a6ad","treeId":"259ed199a2ac6763e000041a","seq":23066774,"position":1,"parentId":"62a35171410c0a03aa77a6ac","content":"## Course Introduction\n"},{"_id":"62a35171410c0a03aa77a6ae","treeId":"259ed199a2ac6763e000041a","seq":23066773,"position":2,"parentId":"62a35171410c0a03aa77a6ac","content":"## Basics of Probability\n"},{"_id":"62a35171410c0a03aa77a6af","treeId":"259ed199a2ac6763e000041a","seq":23066647,"position":1,"parentId":"62a35171410c0a03aa77a6ae","content":"### ECE 657A Youtube Videos\nIntroductory topics on this from my graduate course [ECE 657A - Data and Knowledge Modeling and Analysis](https://compthinking.github.io/DKMA/) are available on youtube and mostly applicable to this course as well. 
\n\n**[Probability and Statistics Review](https://youtube.com/playlist?list=PLCILw_sLfhbPo5-xfYrbo2-Cp0meZu-cN)** *(youtube playlist)*\n\nContaining Videos on:\n - Conditional Prob and Bayes Theorem\n - Comparing Distributions and Random Variables\n - Hypothesis Testing\n\n"},{"_id":"62a35171410c0a03aa77a6b0","treeId":"259ed199a2ac6763e000041a","seq":23066648,"position":2,"parentId":"62a35171410c0a03aa77a6ae","content":"### ECE 108 YouTube Videos\nFor a very fundamental view of probability from another course of Prof. Crowley you can view the lectures and tutorials for ECE 108\n\nECE 108 Youtube (look at \"future lectures\" and \"future tutorials\" for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists\n\nThe last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory."},{"_id":"62a35171410c0a03aa77a6b1","treeId":"259ed199a2ac6763e000041a","seq":23066649,"position":3,"parentId":"62a35171410c0a03aa77a6ae","content":"### Likelihood, Loss and Risk\nA Good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, MAP are all connected.\nhttps://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/"},{"_id":"62a35171410c0a03aa77a6b2","treeId":"259ed199a2ac6763e000041a","seq":23066650,"position":4,"parentId":"62a35171410c0a03aa77a6ae","content":"### Probability Intro Markdown Notes\nFrom the course website for a previous year. Some of this we won't need so much but they are all useful to know for Machine Learning methods in general. 
\n\nhttps://compthinking.github.io/RLCourseNotes/\n\n- Basic probability definitions\n- conditional probability\n- Expectation\n- Inference in Graphical Models\n- Variational Inference\n\n\n "},{"_id":"62a35171410c0a03aa77a6b3","treeId":"259ed199a2ac6763e000041a","seq":23066772,"position":3,"parentId":"62a35171410c0a03aa77a6ac","content":"## Basic Decision Making Models - Multiarmed Bandits\n**Textbook Sections:** [SB 1.1, 1.2, 17.6]\n"},{"_id":"62a35171410c0a03aa77a6b4","treeId":"259ed199a2ac6763e000041a","seq":23066652,"position":1,"parentId":"62a35171410c0a03aa77a6b3","content":"### Videos\n-* Part 1 - Live Lecture May 17, 2021 on *Virtual Classroom - [View Live Here](https://bongo-ca.youseeu.com/sync-activity/invite/1747607/73748f05469e35afeaf7ea19d353ced8?lti-scope=d2l-resource-syncmeeting-list)\n- Part 2 - Bandits and Values (the sound is horrible! we'll record a new one) - https://youtu.be/zVIv1ipnubA\n- Part 3 - Regret Minimization, UCB and Thompson Sampling - https://youtu.be/a0OcuuglkHQ"},{"_id":"62a35171410c0a03aa77a6b5","treeId":"259ed199a2ac6763e000041a","seq":23066653,"position":2,"parentId":"62a35171410c0a03aa77a6b3","content":"### Multiarmed Bandit : Solving it via Reinforcement Learning in Python\n- Quite a good blog post with all the concepts laid out in simple terms in order https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/"},{"_id":"62a35171410c0a03aa77a6b6","treeId":"259ed199a2ac6763e000041a","seq":23066654,"position":3,"parentId":"62a35171410c0a03aa77a6b3","content":"### Thompson Sampling\n- Long tutorial on Thompson Sampling with more background and theory. 
Nice charts as well: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf"},{"_id":"62a35171410c0a03aa77a6b7","treeId":"259ed199a2ac6763e000041a","seq":23066771,"position":4,"parentId":"62a35171410c0a03aa77a6ac","content":"## Markov Decision Processes\n\n### Textbook Sections\n- Markov Decision Processes\n[SB 3.0-3.4]\n- Solving MDPs Exactly\n[SB 3.5, 3.6, 3.7]"},{"_id":"62a35171410c0a03aa77a6b8","treeId":"259ed199a2ac6763e000041a","seq":23066656,"position":1,"parentId":"62a35171410c0a03aa77a6b7","content":"### Playlist:\n- MDPs Chp 3 : https://youtube.com/playlist?list=PLrV5TcaW6bIX_wnVztMoDFk_8ybteeW7Y\n\n### Individual Videos:\n- Markov Decision Processes 3.0-3.1: \nhttps://youtu.be/pGW1wP4jJas\n- Rewards and Returns 3.3-3.4: https://youtu.be/K7ymZkEd0ZA\n- Value Functions 3.5 - 3.6 : https://youtu.be/lNBXDgAthmQ"},{"_id":"62a35171410c0a03aa77a6ba","treeId":"259ed199a2ac6763e000041a","seq":23066770,"position":6,"parentId":"62a35171410c0a03aa77a6ac","content":"## Dynamic Programming\n*Former title: The Reinforcement Learning Problem*\n**Textbook Sections:**[SB 4.1-4.4]"},{"_id":"62a35171410c0a03aa77a6bb","treeId":"259ed199a2ac6763e000041a","seq":23066659,"position":1,"parentId":"62a35171410c0a03aa77a6ba","content":"### Videos:\n- Dynamic Programming 1: https://youtu.be/nhyCQK4v4Cw\n- Dynamic Programming 2 : Policy and Value Iteration: https://youtu.be/NHN02JnGmdQ\n- Dynamic Programming 3 : Generalized Policy Iteration and Asynchronous Value Iteration https://youtu.be/7gfRBYpzhxU"},{"_id":"62a35171410c0a03aa77a6c0","treeId":"259ed199a2ac6763e000041a","seq":23066769,"position":9,"parentId":"62a35171410c0a03aa77a6ac","content":"## Temporal Difference Learning\n\n**Textbook Sections:** Selections from [SB chap 5], [SB 6.0 - 6.5]\n- Quick intro to Monte-Carlo methods\n- Temporal Difference Updating\n- SARSA\n- Q-Learning\n- Expected SARSA\n- Double 
Q-Learning"},{"_id":"62a35171410c0a03aa77a6be","treeId":"259ed199a2ac6763e000041a","seq":23066760,"position":0.99999,"parentId":"62a35171410c0a03aa77a6c0","content":"### Videos\n- [Week 5 Youtube Playlist](https://youtube.com/playlist?list=PLrV5TcaW6bIUiMLNDYq7cnHhgq0toIFOk)\n\nParts:\n- Just the MC Lecture part - https://youtu.be/b1C_2x6IUUw\n- Temporal Difference Learning 1 - Introduction https://youtu.be/pJyz6OZiIBo\n- Temporal Difference Learning 2 - Comparison to Monte-Carlo Method on Random Walk\nhttps://youtu.be/NVtoj4XRRZw\n"},{"_id":"62a35171410c0a03aa77a6c1","treeId":"259ed199a2ac6763e000041a","seq":23066665,"position":1,"parentId":"62a35171410c0a03aa77a6c0","content":"### Videos\n- [Week 5 Youtube Playlist](https://youtube.com/playlist?list=PLrV5TcaW6bIUiMLNDYq7cnHhgq0toIFOk)\n- Temporal Difference Learning 3 - Sarsa and QLearning Algorithms\nhttps://youtu.be/nEDblNhoL2E\n- Temporal Difference Learning 4 - Expected Sarsa and Double Q-Learning\nhttps://youtu.be/uGFb0mtJW00"},{"_id":"62a35171410c0a03aa77a6c5","treeId":"259ed199a2ac6763e000041a","seq":23066780,"position":9.5,"parentId":"62a35171410c0a03aa77a6ac","content":"## N-Step TD and Eligibility Traces\n**Textbook Sections:** [SB 12.1, 12.2]"},{"_id":"62a35171410c0a03aa77a6c6","treeId":"259ed199a2ac6763e000041a","seq":23066670,"position":1,"parentId":"62a35171410c0a03aa77a6c5","content":"Eligibility traces, in a tabular setting, lead to a significant benefit in training time in additional to the Temporal Difference method. \n\nIn Deep RL it is very common to use **experience replay** to reduce overfitting and bias to recent experiences. 
However, experience replay makes it very hard to leverage eligibility traces which require a sequence of actions to distribute reward backwards.\n"},{"_id":"62a35171410c0a03aa77a6c7","treeId":"259ed199a2ac6763e000041a","seq":23107637,"position":2,"parentId":"62a35171410c0a03aa77a6c5","content":"### Videos:\n- **ET1** - One Step vs Direct Value Updates\n - https://youtu.be/dRRYdkw-bqE\n- **ET2** - ET2 N Step TD Forward View\n - https://youtu.be/dRRYdkw-bqE\n- **ET3** - N Step TD Backward View\n- **ET4** - Eligibility Traces On Policy\n- **ET5** - Eligibility Traces Off Policy\n- youtube playlist of entire topic ET1-5: https://youtube.com/playlist?list=PLrV5TcaW6bIVtMNt_dZMdMQ9JdtzV5VWS"},{"_id":"62a35171410c0a03aa77a6c8","treeId":"259ed199a2ac6763e000041a","seq":23066672,"position":3,"parentId":"62a35171410c0a03aa77a6c5","content":"### Other Resources:"},{"_id":"62a35171410c0a03aa77a6c9","treeId":"259ed199a2ac6763e000041a","seq":23066673,"position":4,"parentId":"62a35171410c0a03aa77a6c5","content":"- [Discussion about Incompatibility of Eligibility Traces with Experience Replay](https://stats.stackexchange.com/questions/341027/eligibility-traces-vs-experience-replay/341038)"},{"_id":"62a35171410c0a03aa77a6ca","treeId":"259ed199a2ac6763e000041a","seq":23066674,"position":5,"parentId":"62a35171410c0a03aa77a6c5","content":"- [Efficient Eligibility Traces for Deep Reinforcement Learning - \nBrett Daley, Christopher Amato](https://arxiv.org/abs/1810.09967)"},{"_id":"62a35171410c0a03aa77a6cb","treeId":"259ed199a2ac6763e000041a","seq":23066675,"position":6,"parentId":"62a35171410c0a03aa77a6c5","content":"- [Investigating Recurrence and Eligibility Traces in Deep Q-Networks -\nJean Harb, Doina Precup](https://arxiv.org/abs/1704.05495)"},{"_id":"62a35171410c0a03aa77a6c2","treeId":"259ed199a2ac6763e000041a","seq":23066775,"position":10,"parentId":"62a35171410c0a03aa77a6ac","content":"## Part 1 Review \nGo over any questions or open topics from first 6 
weeks."},{"_id":"62a35171410c0a03aa77a6cc","treeId":"259ed199a2ac6763e000041a","seq":23066783,"position":10.5,"parentId":"62a35171410c0a03aa77a6ac","content":"## State Representation & Value Function Approximation\n"},{"_id":"62a35171410c0a03aa77a6cd","treeId":"259ed199a2ac6763e000041a","seq":23066677,"position":1,"parentId":"62a35171410c0a03aa77a6cc","content":"### VFA Concept\nA **Value Function Approximation (VFA)**\n is a necessary technique to use whenever the size of the state of action spaces become too large to represent the value function explicitly as a table. In practice, any practical problem needs to use a VFA.\n"},{"_id":"62a35171410c0a03aa77a6ce","treeId":"259ed199a2ac6763e000041a","seq":23066678,"position":1,"parentId":"62a35171410c0a03aa77a6cd","content":"#### Benefits of VFA\n- Reduce memory need to store the functions (transition, reward, value etc)\n- Reduce computation to look up values\n- Reduce experience needed to find the optimal value or policy (sample efficiency)\n- For continuous state spaces, a coarse coding or tile coding can be effective"},{"_id":"62a35171410c0a03aa77a6cf","treeId":"259ed199a2ac6763e000041a","seq":23066679,"position":2,"parentId":"62a35171410c0a03aa77a6cd","content":"#### Types of Function Approximators\n- Linear function approximations (linear combination of features)\n- Neural Networks\n- Decision Trees\n- Nearest Neighbors\n- Fourier/ wavelet bases"},{"_id":"62a35171410c0a03aa77a6d0","treeId":"259ed199a2ac6763e000041a","seq":23066680,"position":3,"parentId":"62a35171410c0a03aa77a6cd","content":"#### Finding an Optimal Value Function\nWhen using a VFA, you can use a Stochastic Gradient Descent (SGD) method to search for the best weights for your value function according to experience. 
\nThis parametric form the value function will then be used to obtain a *greedy* or *epsilon-greedy* policy at run-time.\n\nThis is why using a VFA + SGD is still different from a Direct Policy Search approach where you optimize the parameters of the policy directly.\n"},{"_id":"62a35171410c0a03aa77a6d1","treeId":"259ed199a2ac6763e000041a","seq":23066681,"position":2,"parentId":"62a35171410c0a03aa77a6cc","content":"### Video:\n- Lecture on Value Function Approximation approaches - https://youtu.be/7Dg6KiI_0eM"},{"_id":"62a35171410c0a03aa77a6d2","treeId":"259ed199a2ac6763e000041a","seq":23066682,"position":3,"parentId":"62a35171410c0a03aa77a6cc","content":"### Other Resources:"},{"_id":"62a35171410c0a03aa77a6d3","treeId":"259ed199a2ac6763e000041a","seq":23066683,"position":1,"parentId":"62a35171410c0a03aa77a6d2","content":"- [How to use a shallow, linear approximation for Atari](https://www.amii.ca/the-success-of-dqn-explained-by-shallow-reinforcement-learning/) - This post explains a paper showing how to achieve the same performance as the Deep RL DQN method for Atari using carefully constructed linear value function approximation."},{"_id":"62a35171410c0a03aa77a6c3","treeId":"259ed199a2ac6763e000041a","seq":23066776,"position":11,"parentId":"62a35171410c0a03aa77a6ac","content":"---\n\n## MIDTERM Exam\n"},{"_id":"62a35171410c0a03aa77a6c4","treeId":"259ed199a2ac6763e000041a","seq":23066668,"position":12,"parentId":"62a35171410c0a03aa77a6ac","content":"---"},{"_id":"62a35171410c0a03aa77a6e1","treeId":"259ed199a2ac6763e000041a","seq":23107531,"position":15.5,"parentId":"62a35171410c0a03aa77a6ac","content":"## Deep Reinforcement Learning\n- Deep RL playlist (https://youtube.com/playlist?list=PLrV5TcaW6bIXkjBAExaFcv8NnnNU-qtzt)\n - DQN - new [youtube lecture](https://youtu.be/Wf7a8eB7DVM) on this topic posted July 26, 2021\n - revised look at Value Function Approximations in light of DQN and Atari 
games"},{"_id":"62a35171410c0a03aa77a6d9","treeId":"259ed199a2ac6763e000041a","seq":23066791,"position":16,"parentId":"62a35171410c0a03aa77a6ac","content":"## Deep Learning Fundamentals"},{"_id":"259e8f20fadf18b6250000cb","treeId":"259ed199a2ac6763e000041a","seq":23066800,"position":0.5,"parentId":"62a35171410c0a03aa77a6d9","content":"### Deep Learning \n- Review, or learn, a *bit* about Deep Learning\n - See videos and content from [DKMA Course (ECE 657A)](https://youtube.com/playlist?list=PLCILw_sLfhbNUH8uQcxAcEPoQETPIsqNs) \n - This youtube playlist is a targeted \"Deep Learning Crash Course\" ( #dnn-crashcourse-for-rl ) with just the essentials you'll need for Deep RL.\n - That course also has more detailed videos on Deep Learning which won't be specifically useful for ECE 493, but which you can refer to if interested."},{"_id":"62a35171410c0a03aa77a6da","treeId":"259ed199a2ac6763e000041a","seq":23066690,"position":1,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 9B - Deep Learning Introduction\n- link - https://youtu.be/eopsPef7rLc\n\nIn this video go over some of the fundamental concepts that led to neural networks (such as linear regression and logistic regression models), the basic structure and formulation of classic neural networks and the history of their development.\n\n\n#### Tags\n#deeplearning #introduction #overview"},{"_id":"62a35171410c0a03aa77a6db","treeId":"259ed199a2ac6763e000041a","seq":23066691,"position":2,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 11A - 1 - Deep Learning Fundamentals\n- link - https://youtu.be/_Pe7eyLN6VY\n\nThis video goes through a ground level description of logistic neural units, classic neural networks, modern activation functions and the idea of a Neural Network as a Universal Approximator.\n\n#### Tags\n#deeplearning #introduction 
#dnn-crashcourse-for-rl"},{"_id":"62a35171410c0a03aa77a6dc","treeId":"259ed199a2ac6763e000041a","seq":23066692,"position":3,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 11A - 1.2 - Deep Learning - Gradient Descent \n- link - https://youtu.be/eWzbLXWEJJ4\n\nIn this video we discuss the nuts and bolts of how training in Neural Networks (Deep or Shallow) works as a process of incremental optimization of weights via gradient descent. Topics discussed: Backpropagation algorithm, gradient descent, modern optimizer methods.\n\n\n#### Tags\n#deeplearning #detail"},{"_id":"62a35171410c0a03aa77a6dd","treeId":"259ed199a2ac6763e000041a","seq":23066693,"position":4,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect11B - 1 - DeepLearning - Fundamentals II\n- link - https://youtu.be/R8PZ7UPKQNM\n\nIn this video we go over the fundamentals of Deep Learning from a different angle using the approach from Goodfellow et. al.'s [Deep Learning Textbook](https://www.deeplearningbook.org/) and their network graph notation for neural networks. \n\nWe describe the network diagram notation, and how to view neural networks in this way, focussing on the relationship between sets of weights and layers.\n\nOther topics include: gradient descent, loss functions, cross-entropy, network output distribution types, softmax output for classification.\n\n\n#### Tags\n#deeplearning #introduction #dnn-crashcourse-for-rl"},{"_id":"62a35171410c0a03aa77a6de","treeId":"259ed199a2ac6763e000041a","seq":23066694,"position":5,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 11B - 2 - Deep Learning - Fundamentals III\n- link - https://youtu.be/c6g0dfMWQ6k\n\nThis video continues with the approach from Goodfellow et. 
al.'s [Deep Learning Textbook](https://www.deeplearningbook.org/) and goes into detail about computational methods, efficiency and defining the measure being used for optimization.\n\n*Topics covered include:* relationship of network depth to generalization power, computation benefits of convolutional network structures, revisiting the meaning of backpropagation, methods for defining loss functions\n\n#### Tags\n#deeplearning #detail"},{"_id":"62a35171410c0a03aa77a6df","treeId":"259ed199a2ac6763e000041a","seq":23066695,"position":6,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 11B - 3 - Deep Learning - Regularization\n- link - https://youtu.be/qkqkY09splc\n\nIn this lecture I talk about some of the problems that can arise when training neural networks and how they can be mitigated. Topics include : overfitting, model complexity, vanishing gradients, catastrophic forgetting and interpretability.\n\n\n#### Tags\n#deeplearning #detail #dnn-crashcourse-for-rl"},{"_id":"62a35171410c0a03aa77a6e0","treeId":"259ed199a2ac6763e000041a","seq":23066696,"position":7,"parentId":"62a35171410c0a03aa77a6d9","content":"### Lect 11B - 4 - Deep Learning - Data Augmentation and Vanishing Gradients\n- link - https://youtu.be/k4DdJ590teM\n\nIn this video we give an overview of several approaches for making DNNs more usable when data is limited with respect to the size of the network. 
Topics include data augmentation, residual network links, vanishing gradients.\n\n#### Tags\n#deeplearning #detail #overview"},{"_id":"62a35171410c0a03aa77a6d4","treeId":"259ed199a2ac6763e000041a","seq":23107539,"position":16.40625,"parentId":"62a35171410c0a03aa77a6ac","content":"## Direct Policy Search\n[SB 13.1, 13.2, 13.5]\n- Policy Gradients\n- Actor-Critic\n\n"},{"_id":"62a35171410c0a03aa77a6d5","treeId":"259ed199a2ac6763e000041a","seq":23066685,"position":1,"parentId":"62a35171410c0a03aa77a6d4","content":"### Video:\n- Lecture on Policy Gradient methods - \nhttps://youtu.be/SqulTcLHRnY"},{"_id":"62a35171410c0a03aa77a6d6","treeId":"259ed199a2ac6763e000041a","seq":23066686,"position":2,"parentId":"62a35171410c0a03aa77a6d4","content":"### Policy Gradient Algorithms\nSome of the posts used for the lecture on July 26.\n\n- A good post with all the fundamental math for policy gradients.\nhttps://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c\n- Also a good intro post about Policy Gradients vs DQN by great ML blogger Andrej Karpathy (this is the one I showed in class with the Pong example):\nhttp://karpathy.github.io/2016/05/31/rl/\n- The OpenAI page on the PPO algorithm used on their simulator domains of humanoid robots:\nhttps://openai.com/blog/openai-baselines-ppo/\n- Good description of the Actor-Critic approach using the Sonic the Hedgehog game as an example:\nhttps://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/\n- Blog post about how the original AlphaGo solution worked using Policy Gradient RL and Monte-Carlo Tree Search:\nhttps://medium.com/@jonathan_hui/alphago-how-it-works-technically-26ddcc085319\n"},{"_id":"62a35171410c0a03aa77a6d7","treeId":"259ed199a2ac6763e000041a","seq":23066687,"position":3,"parentId":"62a35171410c0a03aa77a6d4","content":"### Actor-Critic Algorithm\nA very clear blog post describing how Actor-Critic algorithms improve on Policy 
Gradients\nhttps://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/"},{"_id":"62a35171410c0a03aa77a6d8","treeId":"259ed199a2ac6763e000041a","seq":23066688,"position":4,"parentId":"62a35171410c0a03aa77a6d4","content":"### Cutting Edge Algorithms\nGoing beyond what we covered in class, here are some exciting trends and advances in RL research from the past few years that are worth exploring.\nPG methods are a fast-changing area of RL research. This post covers a number of the successful algorithms in this area from a few years ago:\nhttps://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#actor-critic"},{"_id":"62a35171410c0a03aa77a6e2","treeId":"259ed199a2ac6763e000041a","seq":23066824,"position":5,"parentId":"62a35171410c0a03aa77a6d4","content":"### A3C/A2C Resources\n- Blog from OpenAI introducing their implementation of A3C and an analysis of how a simpler, synchronous version they call A2C is just as good: \n - https://openai.com/blog/baselines-acktr-a2c/\n- The original A3C paper from DeepMind:\n - Mnih, 2016 : https://arxiv.org/pdf/1602.01783.pdf\n- Good summary of these algorithms with cleaned up pseudocode and links:\n - https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#actor-critic"},{"_id":"259e8ad6fadf18b6250000cd","treeId":"259ed199a2ac6763e000041a","seq":23066828,"position":6,"parentId":"62a35171410c0a03aa77a6d4","content":"- A2C - Review of policy gradients and how A2C implements them using Deep Learning - (https://youtu.be/WPs8KsWM8sg)"},{"_id":"2356139edcf22a91d20000db","treeId":"259ed199a2ac6763e000041a","seq":23107530,"position":16.8125,"parentId":"62a35171410c0a03aa77a6ac","content":"## Evaluating RL Algorithms and Double DQN\n- discussion of evaluation metrics for RL algorithms\n- training hyper-parameters vs. 
algorithm parameters\n- Double DQN bringing back the Double Q-Learning idea and giving it new life to reduce overestimation (maximization) bias\n"},{"_id":"23ce106b38a9b1680e0000d4","treeId":"259ed199a2ac6763e000041a","seq":23098712,"position":17.625,"parentId":"62a35171410c0a03aa77a6ac","content":"**Note: the content listed in LEARN for the S22 offering is updated more frequently and consistently than this list.**"},{"_id":"259e8dd9fadf18b6250000cc","treeId":"259ed199a2ac6763e000041a","seq":23107524,"position":17.6875,"parentId":"62a35171410c0a03aa77a6ac","content":"## Advanced Policy Gradient Methods using Trust Regions\n- Trust Region Methods\n- TRPO\n- PPO"},{"_id":"23ce1dcc38a9b1680e0000d2","treeId":"259ed199a2ac6763e000041a","seq":23098707,"position":17.75,"parentId":"62a35171410c0a03aa77a6ac","content":"## DPG, DDPG and SAC"},{"_id":"23ce1cf238a9b1680e0000d3","treeId":"259ed199a2ac6763e000041a","seq":23098709,"position":1,"parentId":"23ce1dcc38a9b1680e0000d2","content":"### Hypothesis: Original DDPG Paper - Lillicrap, ICLR, 2016\n\nhttps://hyp.is/go?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1509.02971.pdf&group=DM67BYBG"},{"_id":"235691dbdcf22a91d20000da","treeId":"259ed199a2ac6763e000041a","seq":23107495,"position":17.875,"parentId":"62a35171410c0a03aa77a6ac","content":"## Looking Ahead with Tree Search - MCTS and AlphaGo\n- Monte-Carlo Tree Search (MCTS)\n- How AlphaGo works (combining A2C and MCTS)"},{"_id":"62a35171410c0a03aa77a6e3","treeId":"259ed199a2ac6763e000041a","seq":23107500,"position":18,"parentId":"62a35171410c0a03aa77a6ac","content":"## RL Next Steps\n- An overview of next steps in learning more about RL research and applications\n - Keep Reading: Conferences\n - Going Beyond: MARL, Hierarchical RL, Supervised and Curriculum Learning\n - Find out about Big New Ideas: LeCun, DeepMind, OpenAI, Friston\n - Get Involved: Competitions and Open Source \n- You can find the slides here: [RL Next 
Steps](https://gingkoapp.com/rl-next-steps)"},{"_id":"62a35171410c0a03aa77a6e4","treeId":"259ed199a2ac6763e000041a","seq":23105777,"position":19,"parentId":"62a35171410c0a03aa77a6ac","content":"## Review and End of Classes\n**Week 13**"},{"_id":"2374a287fb15709a91000337","treeId":"259ed199a2ac6763e000041a","seq":23105819,"position":1,"parentId":"62a35171410c0a03aa77a6e4","content":"See the RL Next Steps tree for what was discussed in class on July 22, 2022."},{"_id":"62a35171410c0a03aa77a6e5","treeId":"259ed199a2ac6763e000041a","seq":23105779,"position":20,"parentId":"62a35171410c0a03aa77a6ac","content":"## Final Exam\nSee LEARN for more information."},{"_id":"259e6db7fadf18b6250001a6","treeId":"259ed199a2ac6763e000041a","seq":23066835,"position":4.5,"parentId":null,"content":"# E7 Elevator Pitch"},{"_id":"259e6d38fadf18b6250001a7","treeId":"259ed199a2ac6763e000041a","seq":23105775,"position":1,"parentId":"259e6db7fadf18b6250001a6","content":"# E7 Elevator Pitch\n\n## Defining the MDP\n### States\n\n- Elevators : $e_i\\in E$ : $i \\in \\{1,\\dots,7\\}$\n- Floors : $f \\in \\{1,\\dots,8\\}$\n- Location : $L(e_i) : E \\rightarrow f$ - which floor is the elevator on?\n- Outside Button: $b\\in B^f_{i,dir} \\in \\{0,1\\}; dir\\in \\{up, down\\}$\n- Movement: $M(e_i): E\\rightarrow \\{up, stopped,down\\}$\n- Doors: $G(e_i,f): E \\times f \\rightarrow \\{closed, closing, opening, open\\}$\n- *Next Floor:* $NL(e_i) : E \\rightarrow f \\cup \\{stopped\\}$ - the next floor the elevator will arrive at; if the elevator is not currently moving, this returns \"stopped\".\n\n### Actions\n\nIn general: move the elevators and open/close the doors in order to maximize your objective function.\n\nAt every moment the system can take any of the following actions; we assume they happen only one at a time.\n\n- Do nothing\n\n- Open a door/Close a door : set $G(e_i,f)$\n\n- Move an elevator up/down from current floor : set $M(e_i)$\n\n- Stop an elevator at the current floor it is 
*moving towards* using $NL(e_i)$\n\n \n\n## Dynamics\n\n- Define dynamics"},{"_id":"2374bf8eb4617be51600032b","treeId":"259ed199a2ac6763e000041a","seq":23105776,"position":1.5,"parentId":"259e6db7fadf18b6250001a6","content":"# Questions\n\n## Does the system need to remember that it just closed a door?\n- Should we define actions to be \"close door and move to floor f\"?\n\n## How would Exploration/Exploitation Work in This Domain?\n- how long are you willing to annoy users to get the information you need?\n- can we build a simulator for this system?"},{"_id":"62a35171410c0a03aa77a6e6","treeId":"259ed199a2ac6763e000041a","seq":23066837,"position":4.75,"parentId":null,"content":"# Primary References for Course"},{"_id":"62a35171410c0a03aa77a6e7","treeId":"259ed199a2ac6763e000041a","seq":23066703,"position":1,"parentId":"62a35171410c0a03aa77a6e6","content":"**[SuttonBarto2018]** - Reinforcement Learning: An Introduction. Book, free pdf of draft available.\nhttp://incompleteideas.net/book/the-book-2nd.html"},{"_id":"62a35171410c0a03aa77a6e8","treeId":"259ed199a2ac6763e000041a","seq":23066704,"position":5,"parentId":null,"content":"<h1 id=\"resources\">Additional Resources</h1>\n"},{"_id":"62a35171410c0a03aa77a6e9","treeId":"259ed199a2ac6763e000041a","seq":23066705,"position":1,"parentId":"62a35171410c0a03aa77a6e8","content":"## Other Useful Textbooks\n**[Dimitrakakis2019]** - Decision Making Under Uncertainty and Reinforcement Learning\n\nhttp://www.cse.chalmers.se/~chrdimi/downloads/book.pdf\n**[Ghavamzadeh2016]** - Bayesian Reinforcement Learning: A Survey. Ghavamzadeh et al. 
2016.\nhttps://arxiv.org/abs/1609.04436\n- More probability notes online: https://compthinking.github.io/RLCourseNotes/"},{"_id":"62a35171410c0a03aa77a6ea","treeId":"259ed199a2ac6763e000041a","seq":23066706,"position":2,"parentId":"62a35171410c0a03aa77a6e8","content":"## Practical Resources\n- The OpenAI page for their standard set of baseline implementations for the major Deep RL algorithms:\nhttps://github.com/openai/baselines/tree/master/baselines\n- This is a very good page with all the fundamental math for many policy gradient based Deep RL algorithms. References to the original papers, mathematical explanations and pseudocode are included:\nhttps://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c"},{"_id":"62a35171410c0a03aa77a6eb","treeId":"259ed199a2ac6763e000041a","seq":23066707,"position":1,"parentId":"62a35171410c0a03aa77a6ea","content":"## Deep Q Network vs Policy Gradients - An Experiment on VizDoom with Keras\nA nice blog post comparing DQN and Policy Gradient algorithms such as A2C.\nhttps://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html"},{"_id":"62a35171410c0a03aa77a6ec","treeId":"259ed199a2ac6763e000041a","seq":23066708,"position":3,"parentId":"62a35171410c0a03aa77a6e8","content":"## OpenAI Reference Website\nThis website is a great resource. It lays out concepts from start to finish. Once you get through the first half of our course, many of the concepts on this site will be familiar to you.\n\n### Key Papers in Deep RL List\nhttps://spinningup.openai.com/en/latest/spinningup/keypapers.html\n\n### Fundamental RL Concepts Overview \nThe fundamentals of RL are briefly covered here. We will go into all this and more in detail in our course.\nhttps://spinningup.openai.com/en/latest/spinningup/rl_intro.html\n\n### Family Tree of Algorithms\nHere is a list of algorithms that were at the cutting edge of RL as of a year or so ago, so it's a good place to find out more. 
But in a fast-growing field, it may be a bit out of date by now.\nhttps://spinningup.openai.com/en/latest/spinningup/rl_intro2.html"},{"_id":"62a35171410c0a03aa77a6ed","treeId":"259ed199a2ac6763e000041a","seq":23066709,"position":4,"parentId":"62a35171410c0a03aa77a6e8","content":"## Reinforcement Learning Tutorial with Demo on GitHub\nThis is a thorough collection of slides from a few different texts and courses laid out with the essentials from basic decision making to Deep RL. There are also code examples for some of their own simple domains.\nhttps://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo#ExperienceReplay"},{"_id":"62a35171410c0a03aa77a6ee","treeId":"259ed199a2ac6763e000041a","seq":23105771,"position":5,"parentId":"62a35171410c0a03aa77a6e8","content":"## Online/Other Courses\n- Coursera/University of Alberta (Martha White): https://www.coursera.org/specializations/reinforcement-learning#courses\n- A great course with notes online that uses Minecraft for assignments and projects to teach RL: https://canvas.eee.uci.edu/courses/34142"},{"_id":"62a35171410c0a03aa77a6ef","treeId":"259ed199a2ac6763e000041a","seq":23066711,"position":6,"parentId":null,"content":"# Videos to Watch on RL (Current Research)"},{"_id":"62a35171410c0a03aa77a6f0","treeId":"259ed199a2ac6763e000041a","seq":23066712,"position":1,"parentId":"62a35171410c0a03aa77a6ef","content":"## Conferences 2020\n- Multiple talks at the [Canadian AI 2020](https://www.caiac.ca/en/conferences/canadianai-2020/program) conference.\n - Csaba Szepesvari (U. 
Alberta)\n- The AAMAS 2021 conference finished recently; it focuses on decision making and planning and includes many RL papers.\n - See their [Twitter Feed](https://twitter.com/Aamas2020C?ref_src=twsrc%5Etfw%7Ctwcamp%5Eembeddedtimeline%7Ctwterm%5Eprofile%3AAamas2020C&ref_url=https%3A%2F%2Faamas2020.conference.auckland.ac.nz%2F) for links to talks\n\n- ICLR 2020 conference (https://iclr.cc/virtual_2020/index.html) \n\n\n"},{"_id":"62a35171410c0a03aa77a6f1","treeId":"259ed199a2ac6763e000041a","seq":23066713,"position":7,"parentId":null,"content":"# Old Topics Archive\nOther resources connected with previous versions of the course. I'm happy to talk about any of these if people are interested."},{"_id":"62a35171410c0a03aa77a6f2","treeId":"259ed199a2ac6763e000041a","seq":23066714,"position":1,"parentId":"62a35171410c0a03aa77a6f1","content":"## Bayes Nets (dropped)"},{"_id":"62a35171410c0a03aa77a6f3","treeId":"259ed199a2ac6763e000041a","seq":23066715,"position":1,"parentId":"62a35171410c0a03aa77a6f2","content":"### Tools"},{"_id":"62a35171410c0a03aa77a6f4","treeId":"259ed199a2ac6763e000041a","seq":23066716,"position":1,"parentId":"62a35171410c0a03aa77a6f3","content":"**SamIam Bayesian Network GUI Tool**\n- Java GUI tool for playing with BNs (it's old but it's good)\nhttp://reasoning.cs.ucla.edu/samiam/index.php?h=emodels\n\n**Other Tools**\n- Bayesian Belief Networks Python Package :\nAllows creation of Bayesian Belief Networks\nand other Graphical Models with pure Python\nfunctions. 
Exact inference is used\nwhere tractable.\nhttps://github.com/eBay/bayesian-belief-networks\n- Python library for conjugate exponential family BNs and variational inference only\nhttp://www.bayespy.org/intro.html\n- Open Markov\nhttp://www.openmarkov.org/\n- Open GM (C++ library)\nhttp://hciweb2.iwr.uni-heidelberg.de/opengm/"},{"_id":"62a35171410c0a03aa77a6f5","treeId":"259ed199a2ac6763e000041a","seq":23066717,"position":2,"parentId":"62a35171410c0a03aa77a6f2","content":"### References\n"},{"_id":"62a35171410c0a03aa77a6f6","treeId":"259ed199a2ac6763e000041a","seq":23066718,"position":1,"parentId":"62a35171410c0a03aa77a6f5","content":"Some videos and resources on Bayes Nets, d-separation, the Bayes Ball Algorithm and more:\nhttps://metacademy.org/graphs/concepts/bayes_ball"},{"_id":"62a35171410c0a03aa77a6f7","treeId":"259ed199a2ac6763e000041a","seq":23066719,"position":2,"parentId":"62a35171410c0a03aa77a6f1","content":"## Conjugate Priors (dropped)"},{"_id":"62a35171410c0a03aa77a6f8","treeId":"259ed199a2ac6763e000041a","seq":23066720,"position":1,"parentId":"62a35171410c0a03aa77a6f7","content":"https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions"},{"_id":"62a35171410c0a03aa77a6f9","treeId":"259ed199a2ac6763e000041a","seq":23066721,"position":3,"parentId":"62a35171410c0a03aa77a6f1","content":"## Primary References for Probabilistic Reasoning (mostly dropped)"},{"_id":"62a35171410c0a03aa77a6fa","treeId":"259ed199a2ac6763e000041a","seq":23066722,"position":1,"parentId":"62a35171410c0a03aa77a6f9","content":"**[Ermon2019]** - First half of notes are based on Stanford CS 228 (https://ermongroup.github.io/cs228-notes/) which goes into even more detail on PGMs than we will.\n"},{"_id":"62a35171410c0a03aa77a6fb","treeId":"259ed199a2ac6763e000041a","seq":23066723,"position":2,"parentId":"62a35171410c0a03aa77a6f9","content":"**[Cam Davidson 2018]** - Bayesian Methods for Hackers - Probabilistic Programming textbook as a set of Python 
notebooks.\nhttps://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#contents"},{"_id":"62a35171410c0a03aa77a6fc","treeId":"259ed199a2ac6763e000041a","seq":23066724,"position":3,"parentId":"62a35171410c0a03aa77a6f9","content":"**[Koller, Friedman, 2009]** Probabilistic Graphical Models : Principles and Techniques \nThe extensive theoretical book on PGMs.\nhttps://mitpress.mit.edu/books/probabilistic-graphical-models"}],"tree":{"_id":"259ed199a2ac6763e000041a","name":"Course 457C Student Links","publicUrl":"rlcourse","latex":true}}