**Title:** Probabilistic Reasoning and Reinforcement Learning
**Info:** ECE 457C - Reinforcement Learning
**Instructor:** Prof. Mark Crowley, ECE Department, UWaterloo

**NOTE:** Ignore the weekly dates; they are from a previous year.

**Website:** markcrowley.ca/rlcourse

Links to this Gingko Tree:

- as dynamic tree : RL Course Links and Notes
- as html : RL Course Links and Notes
- as markdown : RL Course Links and Notes

Introduction to Reinforcement Learning (RL) theory and algorithms for learning decision-making policies in situations with uncertainty and limited information. Topics include Markov decision processes, classic exact/approximate RL algorithms such as value/policy iteration, Q-learning, State-action-reward-state-action (SARSA), Temporal Difference (TD) methods, policy gradients, actor-critic, and Deep RL such as Deep Q-Learning (DQN), Asynchronous Advantage Actor Critic (A3C), and Deep Deterministic Policy Gradient (DDPG). [Offered: S, first offered Spring 2019]


- Course Website : contains course outline, grade breakdown, weekly schedule information
- Notes and slides via the textbook *(available free online)*: Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, 2018.

- Reinforcement Learning: An Introduction
- Course Youtube Channel : Reinforcement Learning
- See Additional Resources for more online notes and reading.

**Primary Textbook:** Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, 2018 [SB]

Some topics are not covered in the SB textbook, or are covered there in much more detail than in the lectures. We will continue to update this list with references as the term progresses.

- Motivation & Context [SB 1.1, 1.2, 17.6]
- Decision Making Under Uncertainty [SB 2.1-2.3, 2.7, 3.1-3.3]
- Solving MDPs [SB 3.5, 3.6, 4.1-4.4]
- The RL Problem [SB 3.7, 6.4, 6.5]
- TD Learning [SB 12.1, 12.2]
- State Representation & Value Function Approximation
- Basics of Neural Networks
- Deep RL
- Policy Search [SB 13.1, 13.2, 13.5]
- AlphaGo and MCTS
- Multi-Agent RL (MARL)
- Hierarchical RL
- Reinforcement Learning with Human Feedback
- Decision Transformers
- Other Possible Topics:
- Free Energy
- Distributional RL
- Supervised Learning for RL and Curriculum Learning
- POMDPs (skipped in S22)

Introductory topics on this from my graduate course ECE 657A - Data and Knowledge Modeling and Analysis are available on youtube and mostly applicable to this course as well.

**Probability and Statistics Review** *(youtube playlist)*

Containing Videos on:

- Conditional Probability and Bayes’ Theorem
- Comparing Distributions and Random Variables
- Hypothesis Testing

For a very fundamental view of probability from another of Prof. Crowley’s courses, you can view the lectures and tutorials for ECE 108.

ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists

The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, and MAP are all connected:

https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/

From the course website for a previous year. Some of these we won’t need much, but they are all useful to know for Machine Learning methods in general.

https://compthinking.github.io/RLCourseNotes/

- Basic probability definitions
- conditional probability
- Expectation
- Inference in Graphical Models
- Variational Inference

**Textbook Sections:** [SB 1.1, 1.2, 17.6]

- Part 1 - Live Lecture May 17, 2021 on Virtual Classroom - View Live Here

- Part 2 - Bandits and Values (the sound is horrible! we’ll record a new one) - https://youtu.be/zVIv1ipnubA
- Part 3 - Regret Minimization, UCB and Thompson Sampling - https://youtu.be/a0OcuuglkHQ

- Quite a good blog post with all the concepts laid out in simple terms, in order: https://www.analyticsvidhya.com/blog/2018/09/reinforcement-multi-armed-bandit-scratch-python/

- Long tutorial on Thompson Sampling with more background and theory. Nice charts as well: https://web.stanford.edu/~bvr/pubs/TS_Tutorial.pdf
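As a concrete sketch of the upper-confidence-bound idea from these tutorials: pull each arm once, then always pull the arm whose empirical mean plus confidence bonus is largest. The exploration constant `c` and the helper names here are my own illustrative choices, not from the tutorials:

```python
import math

def ucb1(counts, values, t, c=2.0):
    """Pick the arm maximizing the UCB1 score: mean + sqrt(c * ln t / n)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play each arm once before trusting the bound
    scores = [values[a] + math.sqrt(c * math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])

def update(counts, values, arm, reward):
    """Incremental mean update for the pulled arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```

Unplayed arms are tried first; after that, arms with high means or few pulls win the argmax, which is exactly the exploration/exploitation trade-off the UCB bound formalizes.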

- Markov Decision Processes [SB 3.0-3.4]
- Solving MDPs Exactly [SB 3.5, 3.6, 3.7]

- Markov Decision Processes 3.0-3.1: https://youtu.be/pGW1wP4jJas
- Rewards and Returns 3.3-3.4: https://youtu.be/K7ymZkEd0ZA
- Value Functions 3.5-3.6: https://youtu.be/lNBXDgAthmQ

*Former title: The Reinforcement Learning Problem*

**Textbook Sections:** [SB 4.1-4.4]

- Dynamic Programming 1: https://youtu.be/nhyCQK4v4Cw
- Dynamic Programming 2 : Policy and Value Iteration: https://youtu.be/NHN02JnGmdQ
- Dynamic Programming 3 : Generalized Policy Iteration and Asynchronous Value Iteration https://youtu.be/7gfRBYpzhxU
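The value iteration idea from these lectures can be condensed into a few lines. This is a minimal sketch over a made-up two-state, two-action MDP; the transition table `P` is invented purely for illustration:

```python
# P[s][a] is a list of (prob, next_state, reward) triples.
# These numbers are made up: action 1 pays off and action 0 does not.
GAMMA = 0.9
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}

def value_iteration(P, gamma=GAMMA, theta=1e-8):
    """Sweep Bellman optimality backups until the largest change < theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in P[s]]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

With $\gamma = 0.9$ this toy converges to $V(1) = 2/(1-0.9) = 20$ and $V(0) = 1 + 0.9 \cdot 20 = 19$.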

**Textbook Sections:** Selections from [SB chap 5], [SB 6.0 - 6.5]

- Quick intro to Monte-Carlo methods
- Temporal Difference Updating
- SARSA
- Q-Learning
- Expected SARSA
- Double Q-Learning

Parts:

- Just the MC Lecture part - https://youtu.be/b1C_2x6IUUw
- Temporal Difference Learning 1 - Introduction - https://youtu.be/pJyz6OZiIBo
- Temporal Difference Learning 2 - Comparison to Monte-Carlo Method on Random Walk - https://youtu.be/NVtoj4XRRZw
- Week 5 Youtube Playlist
- Temporal Difference Learning 3 - Sarsa and Q-Learning Algorithms - https://youtu.be/nEDblNhoL2E
- Temporal Difference Learning 4 - Expected Sarsa and Double Q-Learning - https://youtu.be/uGFb0mtJW00
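The updates covered in these lectures differ mainly in their bootstrap target. A minimal tabular sketch (state and action names are placeholders) contrasting Q-Learning's max over next actions with SARSA's actually-taken next action:

```python
ACTIONS = [0, 1]  # a toy two-action space

def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the best next action."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * max(Q.get((s2, b), 0.0) for b in ACTIONS)
        - Q.get((s, a), 0.0))

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken next."""
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
        r + gamma * Q.get((s2, a2), 0.0) - Q.get((s, a), 0.0))
```

Expected SARSA and Double Q-Learning replace the bootstrap term again: an expectation over the policy's action probabilities, and a second Q-table to decorrelate selection from evaluation, respectively.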

**Textbook Sections:** [SB 12.1, 12.2]

**Eligibility traces in a tabular setting** can significantly reduce training time when added to Temporal Difference methods.

- **ET1** - One Step vs Direct Value Updates
- **ET2** - N Step TD Forward View
- **ET3** - N Step TD Backward View
- **ET4** - Eligibility Traces On Policy
- **ET5** - Eligibility Traces Off Policy
- youtube playlist of entire topic ET1-5: https://youtube.com/playlist?list=PLrV5TcaW6bIVtMNt_dZMdMQ9JdtzV5VWS
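A rough sketch of the backward view from ET3-ET4 (tabular, on-policy): every visited state accumulates an eligibility trace, and each TD error updates all traced states at once. Step sizes and state names below are illustrative only:

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one episode of (state, reward,
    next_state) transitions, updating the value table V in place."""
    e = {}  # accumulating eligibility traces
    for s, r, s2 in episode:
        delta = r + gamma * V.get(s2, 0.0) - V.get(s, 0.0)  # TD error
        e[s] = e.get(s, 0.0) + 1.0  # bump the trace of the current state
        for st in e:
            V[st] = V.get(st, 0.0) + alpha * delta * e[st]
            e[st] *= gamma * lam  # decay all traces each step
    return V
```

A reward seen late in the episode is propagated to earlier states in the same sweep, in proportion to their decayed traces, rather than needing many one-step passes.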

**Eligibility Traces in Deep RL**

In Deep RL it is very common to use **experience replay** to reduce overfitting and bias to recent experiences. However, experience replay makes it very hard to leverage eligibility traces which require a sequence of actions to distribute reward backwards.

There is a fair bit of discussion about Eligibility Traces and Deep RL. See some of the following papers and notes.

*Authors:* Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, Andre Barreto, Diana Borsa
*From:* DeepMind and University College London, UK
*Arxiv Link:* https://arxiv.org/pdf/2007.01839.pdf
*Hypothesis Discussion Link:*

https://hyp.is/go?url=https%3A%2F%2Farxiv.org%2Fpdf%2F2007.01839.pdf&group=__world__

I put some notes up on Hypothesis about this one; it seems quite interesting. It’s more recent, just 2021, after lots of advances on the initial Deep RL algorithms (unlike the “Investigating Recurrence…” paper). And it makes a fairly straightforward argument about **Eligibility Traces** that is similar to **Expected SARSA** in its implementation.

*This could be a good algorithm to consider implementing for #asg4.*

(https://stats.stackexchange.com/questions/341027/eligibility-traces-vs-experience-replay/341038)

Brett Daley, Christopher Amato

(https://arxiv.org/abs/1704.05495)

- Hypothesis Discussion Link: https://hyp.is/go?url=https%3A%2F%2Fwww.cs.mcgill.ca%2F~jmerhe1%2Frnn_nips.pdf&group=__world__

No midterm in Spring 2023 course.

A **Value Function Approximation (VFA)** is necessary whenever the state or action spaces become too large to represent the value function explicitly as a table. In practice, almost any realistic problem needs a VFA.

- Reduce memory need to store the functions (transition, reward, value etc)
- Reduce computation to look up values
- Reduce experience needed to find the optimal value or policy (sample efficiency)
- For continuous state spaces, a coarse coding or tile coding can be effective

- Linear function approximations (linear combination of features)
- Neural Networks
- Decision Trees
- Nearest Neighbors
- Fourier/ wavelet bases

When using a VFA, you can use a Stochastic Gradient Descent (SGD) method to search for the best weights for your value function according to experience.

This parametric form of the value function is then used to obtain a *greedy* or *epsilon-greedy* policy at run-time.

This is why using a VFA + SGD is still different from a Direct Policy Search approach where you optimize the parameters of the policy directly.
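A minimal sketch of this combination, semi-gradient TD(0) with a linear VFA; the feature vectors and step size below are invented for illustration:

```python
def semi_gradient_td0(w, x_s, r, x_s2, alpha=0.05, gamma=0.9):
    """One semi-gradient TD(0) step for v(s) = w . x(s) with linear features.
    Returns the updated weight vector."""
    v_s = sum(wi * xi for wi, xi in zip(w, x_s))
    v_s2 = sum(wi * xi for wi, xi in zip(w, x_s2))
    delta = r + gamma * v_s2 - v_s  # TD error against the bootstrapped target
    # "Semi-gradient": only v(s) is differentiated (the target is held fixed),
    # so the gradient w.r.t. w is just the feature vector x(s).
    return [wi + alpha * delta * xi for wi, xi in zip(w, x_s)]
```

With one-hot features this reduces exactly to the tabular TD(0) update, which is a useful sanity check when debugging a VFA.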

- Lecture from 2020 by Sriram Ganapathi Subramanian, a former TA for the course, on classic Value Function Approximation approaches - https://youtu.be/7Dg6KiI_0eM

- How to use a shallow, linear approximation for Atari - This post explains a paper showing how to achieve the same performance as the Deep RL DQN method for Atari using carefully constructed linear value function approximation.

- Review, or learn, a *bit* about Deep Learning
- See videos and content from DKMA Course (ECE 657A)
- This youtube playlist is a targeted “Deep Learning Crash Course” ( #dnn-crashcourse-for-rl ) with just the essentials you’ll need for Deep RL.
- That course also has more detailed videos on Deep Learning which won’t be specifically useful for ECE 493, but which you can refer to if interested.

- link - https://youtu.be/eopsPef7rLc

In this video we go over some of the fundamental concepts that led to neural networks (such as linear regression and logistic regression models), the basic structure and formulation of classic neural networks, and the history of their development.

- link - https://youtu.be/_Pe7eyLN6VY

This video goes through a ground level description of logistic neural units, classic neural networks, modern activation functions and the idea of a Neural Network as a Universal Approximator.

- link - https://youtu.be/eWzbLXWEJJ4

In this video we discuss the nuts and bolts of how training in Neural Networks (Deep or Shallow) works as a process of incremental optimization of weights via gradient descent. Topics discussed: Backpropagation algorithm, gradient descent, modern optimizer methods.

- link - https://youtu.be/R8PZ7UPKQNM

In this video we go over the fundamentals of Deep Learning from a different angle using the approach from Goodfellow et al.’s Deep Learning Textbook and their network graph notation for neural networks.

We describe the network diagram notation, and how to view neural networks in this way, focussing on the relationship between sets of weights and layers.

Other topics include: gradient descent, loss functions, cross-entropy, network output distribution types, softmax output for classification.

- link - https://youtu.be/c6g0dfMWQ6k

This video continues with the approach from Goodfellow et al.’s Deep Learning Textbook and goes into detail about computational methods, efficiency, and defining the measure being used for optimization.

*Topics covered include:* relationship of network depth to generalization power, computation benefits of convolutional network structures, revisiting the meaning of backpropagation, methods for defining loss functions

- link - https://youtu.be/qkqkY09splc

In this lecture I talk about some of the problems that can arise when training neural networks and how they can be mitigated. Topics include : overfitting, model complexity, vanishing gradients, catastrophic forgetting and interpretability.

- link - https://youtu.be/k4DdJ590teM

In this video we give an overview of several approaches for making DNNs more usable when data is limited with respect to the size of the network. Topics include data augmentation, residual network links, vanishing gradients.

- Deep RL playlist (https://youtube.com/playlist?list=PLrV5TcaW6bIXkjBAExaFcv8NnnNU-qtzt)
- DQN - new youtube lecture on this topic posted July 26, 2021
- revised look at Value Function Approximations in light of DQN and Atari games
- Agent57 - 2020 update by DeepMind to learn how to play all 57 Atari dataset games (huge datausage) - https://www.deepmind.com/blog/agent57-outperforming-the-human-atari-benchmark#:~:text=The%20Atari57%20suite%20of%20games,all%2057%20Atari%202600%20games.
- “Human-level Atari 200x Faster” - Deepmind 2022 - https://arxiv.org/abs/2209.07550

- Also a good intro post about Policy gradients vs DQN by great ML blogger Andrej Karpathy (this is the one I showed in class with the Pong example):

http://karpathy.github.io/2016/05/31/rl/

These resources will be useful for the course in general but especially for assignments 3 and 4.

`CODE :`

**Stable Baselines and Gymnasium**

`StableBaselines3` is a project to maintain a standard repository of core RL algorithms, and even trained models/policies. It uses the API defined in Gymnasium for interacting with RL environments, but SB3 is about policies, value functions, optimization, neural networks, gradients, etc.; it isn’t about the environments themselves.

`Gymnasium` is the successor to OpenAI’s Gym project; it defines a standard set of environments for RL, including an API for interacting with them.
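The Gymnasium interaction loop looks roughly like the sketch below. To keep it self-contained and runnable, a toy stand-in environment replaces the real library; only the API shape (`reset()` returning `(obs, info)`, `step()` returning the 5-tuple `(obs, reward, terminated, truncated, info)`) mirrors Gymnasium, and the dynamics are made up:

```python
class ToyEnv:
    """A tiny stand-in that mimics the Gymnasium API shape.
    The dynamics (a 1-D walk to +/-3) are invented purely to show the loop."""
    def reset(self, seed=None):
        # seed is accepted for API compatibility; this toy is deterministic
        self.pos = 0
        return self.pos, {}

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        terminated = abs(self.pos) >= 3  # episode ends at either boundary
        reward = 1.0 if self.pos >= 3 else 0.0
        return self.pos, reward, terminated, False, {}

def run_episode(env, policy, seed=0):
    """The standard Gymnasium-style interaction loop."""
    obs, info = env.reset(seed=seed)
    total, terminated, truncated = 0.0, False, False
    while not (terminated or truncated):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
    return total
```

With the real library, `env = gym.make("CartPole-v1")` would slot in where `ToyEnv()` is used, and the loop body would stay the same.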

The only way to keep up with the changes in a fast paced field like RL (or any area of Machine Learning in general, these days) is to read the latest papers from relevant conferences or *pre-prints* (unpublished paper drafts) on Arxiv.

The Stable Baselines library has a list of Key Papers in Deep RL to get started, especially the first three sections 1.1, 1.2, and 1.3.

[SB 13.1, 13.2, 13.5]

- Policy Gradients
- REINFORCE
- Actor-Critic

The basic idea of policy gradients is often explained with a simple algorithm that predates Deep RL. There seem to be two main versions of this story, although the result is the same.

- Monte Carlo Policy Gradient method - collects rewards from entire trajectory, computes return $G_t$ and calculates a gradient update
- This tutorial on the Gymnasium website walks through code for defining this algorithm from scratch with DNNs, we’ll go through this in class: https://gymnasium.farama.org/tutorials/training_agents/reinforce_invpend_gym_v26/
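The Monte Carlo policy gradient step above can be sketched directly: compute each return $G_t$ backwards in one pass, then increase the log-probability of the actions taken, weighted by $G_t$. The tabular softmax parameterization here is a toy stand-in for the tutorial's neural network:

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed backwards in one pass."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    """One REINFORCE sweep for a 2-action softmax policy with one
    preference theta[(s, a)] per state-action pair (a toy parameterization)."""
    for s, a, G in zip(states, actions, discounted_returns(rewards, gamma)):
        prefs = [theta.get((s, b), 0.0) for b in (0, 1)]
        m = max(prefs)  # subtract the max for numerical stability
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        probs = [e / z for e in exps]
        # grad of log pi(a|s) w.r.t. theta[(s, b)] is 1{b == a} - pi(b|s)
        for b in (0, 1):
            theta[(s, b)] = theta.get((s, b), 0.0) + alpha * G * ((b == a) - probs[b])
    return theta
```

Running it on an episode where action 1 earned a positive return raises the preference for action 1 and lowers it for action 0.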

The OpenAI Spinning Up documentation has a description of Vanilla Policy Gradients (#VPG). This is almost the same as the #REINFORCE algorithm.

**The Difference Between REINFORCE and VPG**:

The difference is subtle, but is explained well in this stackexchange response

**They do look very similar in objective functions, but they are different.** The way the gradient ascent is performed differs strongly, since in the REINFORCE method the gradient ascent is performed once for each action taken in each episode, and the direction of ascent is taken as

$$

G_t\frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}

$$

so the update becomes

$$

\theta_{t+1} = \theta_{t} + \alpha G_t\frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}

$$

but in the VPG algorithm the gradient ascent is performed once over multiple episodes, with the direction of ascent taken as the average

$$

\frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}} \sum_{t=0}^T R(\tau) \frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}

$$

and gradient ascent step is

$$

\theta_{t+1} = \theta_{t} + \alpha \frac{1}{|\mathcal{T}|}\sum_{\tau\in\mathcal{T}} \sum_{t=0}^T R(\tau) \frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}

$$

which looks a lot like what you have stated as the REINFORCE algorithm.

I admit that some form of mathematical equivalence can be derived between them, since the expectation over the policy and the expectation over trajectories sampled from the policy look practically the same. But the approaches differ at least in the way the ascent is computed.

link to post: https://ai.stackexchange.com/a/34344/73583

author: https://ai.stackexchange.com/users/52494/vl-knd

You can check the OpenAI Introduction to RL series; they explain pretty neatly what Policy Optimization is and how to derive it. Usually when we talk about the REINFORCE algorithm, we mean the one described in Sutton’s book on Reinforcement Learning. It is described as the policy optimization algorithm maximizing the value function $v_{\pi(\theta)}(s) = E[G_t|S_t = s]$ of the initial state of the agent. Here $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$ is the $\gamma$-discounted return from the given state $s$ at time $t$. Put briefly:

$$

J(\theta) = v_{\pi(\theta)}(s_0) = E[G_t|S_t = s_0]\\
\nabla J(\theta) = E_\pi\left[G_t\frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}\right]

$$

But in the RL series from OpenAI, the algorithm described as Vanilla Policy Gradient (if it is the one you are talking about) is optimizing the finite-horizon undiscounted return $E_{\tau \sim \pi} [R(\tau)]$, where $\tau$ ranges over possible trajectories, e.g.

$$

J(\theta) = E_{\tau \sim \pi} [R(\tau)] \\
\nabla J(\theta) = E_{\tau \sim\pi}\left[\sum_{t=0}^T R(\tau) \frac{\nabla \pi (A_t |S_t, \theta)}{\pi (A_t |S_t, \theta)}\right]

$$

- A good post with all the fundamental math for policy gradients.

https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c

- Lecture on Policy Gradient methods -

https://youtu.be/SqulTcLHRnY - new lecture also available on the Teams Stream playlist available in LEARN

Very clear blog post on describing Actor-Critic Algorithms to improve Policy Gradients

https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/

- Blog from OpenAI introducing their implementation of A3C and analysis of how a simpler, non-parallelized version they call A2C is just as good:
- The original A3C paper from DeepMind:
- Mnih, 2016 : https://arxiv.org/pdf/1602.01783.pdf

- Good summary of these algorithms with cleaned up pseudocode and links:
- A2C - Review of policy gradients and adding how A2C implements them using Deep Learning - (https://youtu.be/WPs8KsWM8sg)


Here are some exciting trends and new advances in RL research in the past few years to find out more about.

PG methods are a fast changing area of RL research. This post has a number of the successful algorithms in this area from a few years ago:

https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#actor-critic

**NOTE:** I know the topics and lectures from this point onward have become a bit scattered. There are many resources to share and it’s not always clear which parts of them are essential. Also, the RL textbook has less up-to-date information on the latest algorithms after REINFORCE/Actor-Critic.

So, when in doubt about slides or websites to trust, stick to the high-level understanding available on the Spinning Up Documentation : https://spinningup.openai.com/en/latest/user/algorithms.html

- The #PPO algorithm does better in most cases; it’s a good default starting algorithm
- But even so, it’s not that well understood *why* it works so well

**Blog/Code:** The OpenAI page on the PPO algorithm used on their simulator domains of humanoid robots: https://openai.com/blog/openai-baselines-ppo/
**Original Paper:** https://arxiv.org/abs/1511.05952
**Hypothes.is Discussion:** https://hyp.is/go?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1511.05952.pdf&group=__world__

- #PPO is based on #TRPO, which is hard to implement
- #TRPO is often impractical, which is why PPO does it more efficiently with lots of approximations
- PPO introduces a parameter, $\beta$, in equation (5) of the original paper that isn’t that well understood
- Open AI has their own setting for it, but it’s not well understood
- if you fix $\beta$, then you can’t change anything else and it’s very tricky

- See this paper comparing PPO and TRPO
- (Engstrom, ICLR, 2019) : “Implementation Matters in Deep RL: A Case Study on PPO and TRPO”
- Link to the Paper at Conference: https://openreview.net/forum?id=r1etN1rtPB
- Hypothes.is Discussion Link: https://hyp.is/go?url=https%3A%2F%2Fopenreview.net%2Fpdf%3Fid%3Dr1etN1rtPB&group=__world__
- Conclusion: **all the cases where PPO works better than TRPO are because the hyper-parameters are set just right**, and it’s very sensitive to those parameters

- discussion of evaluation metrics for RL algorithms
- training hyper-parameters vs. algorithm parameters
- Double DQN bringing back the Double-Q-Learning idea and giving it new life to solve optimism bias

*[updated july 14, 2023]*

- Deep Double Q-Learning
- Deep Reinforcement Learning that Matters
- Rainbow Paper
- This famous paper gives a great review of the DQN algorithm a couple years after it changed everything in Deep RL. It compares six different extensions to DQN for Deep Reinforcement Learning, many of which have now become standard additions to DQN and other Deep RL algorithms. It also combines all of them together to produce the “rainbow” algorithm, which outperformed many other models for a while.

This paper introduces the DDPG algorithm, which builds on the existing DPG algorithm from classic RL theory. The main idea is to define a deterministic, or nearly deterministic, policy for situations where the environment is very sensitive to suboptimal actions and one action usually dominates in each state. This showed good performance, but could not beat algorithms such as PPO until ideas from SAC were added. SAC adds an entropy term to the objective, encouraging the policy to stay stochastic where it can. Using this, the deterministic policy gradient approach performs well.

**Public Link:** https://hyp.is/go?url=https%3A%2F%2Farxiv.org%2Fpdf%2F1509.02971.pdf&group=__world__

- Monte-Carlo Tree Search (MCTS)
- How AlphaGo works (combining A2C and MCTS)
- AlphaZero
- Alpha(Everything?)

- Blog post about how the original Alpha Go solution worked using Policy Gradient RL and Monte-Carlo Tree Search:

https://medium.com/@jonathan_hui/alphago-how-it-works-technically-26ddcc085319

- An overview next steps in learning more about RL research and applications
- You can find the Spring 2022 slides on this topic here: RL Next Steps or below this card in the tree

- #Hierarchical-RL
- #MARL : Multi-agent Reinforcement Learning
- #RLHF : #Reinforcement-Learning-with-Human-Feedback
- Supervised and #Curriculum-Learning
- #Imitation-Learning
- #Inverse-Reinforcement-Learning
- #Constrained-RL

Reinforcement Learning is a great framework for training systems to perform actions in a way that makes fewer assumptions than most other optimization and planning methods. But it’s not perfect, and it’s not always the best solution to a particular problem. Here we’ll include descriptions and examples of times when RL fails in a major way.

- A famous example of what can happen if you don’t create an appropriate reward function. This relates to the current hot topic of AI #Alignment too: how do you get an AI to do what you want, or in our case, how do you specify rewards so that when the agent converges to a policy, that policy satisfies what you wanted?

LeCun, DeepMind, OpenAI, Friston

(ECE 457C - Mark Crowley - UWaterloo)

We’ve covered the basics of classic (Tabular) and modern (Deep) Reinforcement Learning.

But it’s a fast changing field, where do you go next with RL?

- Keep Reading: Conferences
- Going Beyond: MARL, Hierarchical RL, Learning Process
- Big New Ideas: LeCun, DeepMind, OpenAI, Friston
- Get Involved: Competitions and OpenSource

AI is more general than ML (Prof. Crowley’s opinion) and RL is a more AI-like pursuit than ML itself. So these conferences often have a broader set of tasks and results.

- AAAI - largest, general Artificial Intelligence conference in North America, annual
- IJCAI - largest, general Artificial Intelligence conference internationally, annual

- RLDM - Reinforcement Learning and Decision Making
- This is a great, small conference held only once every two years. Lots of big ideas. Half the papers are from Neuroscience/Psychology and half are from Engineering/Computer Science.
- So the focus is on understanding how to learn to act in the world *in general*!

- AAMAS - Autonomous Agents and Multiagent Systems (https://www.ifaamas.org/)

- NeurIPS

- The International Conference on Machine Learning (icml.cc) is on now!
- This is a general and very technical ML conference with quite a lot of RL topics often covered.
- See topics this year: https://icml.cc/Conferences/2022/Schedule?q=%22reinforcement+learning%22

- We teach humans by building up ever more complex tasks; why not teach RL agents the same way?
- nice summary here: https://lilianweng.github.io/posts/2020-01-29-curriculum-rl/

Curiosity alone can often lead to good policies, but only when the reward and the curiosity-driven learned dynamics are correlated.

- Yann LeCun and General Artificial Intelligence
- DeepMind
- VPT - a pre-trained model for #Minecraft
- other - https://www.deepmind.com/blog/generally-capable-agents-emerge-from-open-ended-play

- OpenAI
- Free Energy Principle

living systems fight entropy by minimizing free energy, or surprise

**Week 13**

See the RL Next Steps tree for what was discussed in class July 22, 2022.

See LEARN for more information.

- Elevators : $e_i\in E$ : $i \in \mathcal{R} \in[1,7]$
- Floors : $f \in \mathcal{Z} \in [1,8]$
- Location : $L(e_i) : E \rightarrow f$ - which floor is the elevator on?
- Outside Button: $b\in B^f_{i,dir} \in \{0,1\}; dir\in \{up, down\}$
- Movement: $M(e_i): E\rightarrow \{up, stopped,down\}$
- Doors: $G(e_i,f): E \times f \rightarrow \{closed, closing, opening, open\}$
- Next Floor: $NL(e_i) : E \rightarrow f \cup \{stopped\}$ - the next floor the elevator will arrive at; if the elevator is not currently moving, this returns “stopped”.

In general: move the elevators, open/close the doors in order to maximize your objective function

At every moment the system can take any of the following actions; we can assume they only happen one at a time:

- Do nothing
- Open/close a door: set $G(e_i,f)$
- Move an elevator up/down from its current floor: set $M(e_i)$
- Stop an elevator at the floor it is currently *moving towards*, using $NL(e_i)$

- Define dynamics

*(huh? no it’s not short…it’s about elevators)*

- Should we define actions to be “close door and move to floor f”?

- how long are you willing to annoy users to get the information you need?
- can we build a simulator for this system?
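As one possible starting point for the simulator question, here is a minimal state container following the notation above; the field names and encodings are my own illustrative choices, not part of the exercise:

```python
from dataclasses import dataclass

# Mirrors the formalization above: 7 elevators (E), floors 1..8,
# location L, movement M, doors G, and outside call buttons b.
N_ELEVATORS, N_FLOORS = 7, 8

@dataclass
class ElevatorState:
    location: list   # L(e_i): the floor each elevator is on, 1..8
    movement: list   # M(e_i): "up" | "stopped" | "down"
    doors: list      # G(e_i): "closed" | "closing" | "opening" | "open"
    buttons: set     # outside calls as (floor, direction) pairs

def initial_state():
    """All elevators idle at floor 1 with doors closed and no calls."""
    return ElevatorState(
        location=[1] * N_ELEVATORS,
        movement=["stopped"] * N_ELEVATORS,
        doors=["closed"] * N_ELEVATORS,
        buttons=set(),
    )
```

A simulator would then just be a `step(state, action)` function implementing the action list above plus passenger arrival dynamics.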

**[SuttonBarto2018]** - Reinforcement Learning: An Introduction. Book, free pdf of draft available.

http://incompleteideas.net/book/the-book-2nd.html

- The OpenAI page for their standard set of baseline implementations of the major Deep RL algorithms: https://github.com/openai/baselines/tree/master/baselines
- A very good page with all the fundamental math for many policy-gradient-based Deep RL algorithms. References to the original papers, mathematical explanations, and pseudocode included: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#a3c

A nice blog post comparing DQN and Policy Gradient algorithms such as A2C.

https://flyyufelix.github.io/2017/10/12/dqn-vs-pg.html

**[Dimitrakakis2019]** - Decision Making Under Uncertainty and Reinforcement Learning. http://www.cse.chalmers.se/~chrdimi/downloads/book.pdf

**[Ghavamzadeh2016]** - Bayesian Reinforcement Learning: A Survey. Ghavamzadeh et al., 2016. https://arxiv.org/abs/1609.04436

- More probability notes online: https://compthinking.github.io/RLCourseNotes/

This website is a great resource. It lays out concepts from start to finish. Once you get through the first half of our course, many of the concepts on this site will be familiar to you.

https://spinningup.openai.com/en/latest/spinningup/keypapers.html

The fundamentals of RL are briefly covered here. We will go into all this and more in detail in our course.

https://spinningup.openai.com/en/latest/spinningup/rl_intro.html

*(as of 2022)*

Here is a list of algorithms that were at the cutting edge of RL as of a year or so ago, so it’s a good place to find out more. But in a fast-growing field, it may be a bit out of date by now.

https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

This is a thorough collection of slides from a few different texts and courses, laid out with the essentials from basic decision making to Deep RL. There are also code examples for some of their own simple domains.

https://github.com/omerbsezer/Reinforcement_learning_tutorial_with_demo#ExperienceReplay

- Coursera/University of Alberta (Martha White): https://www.coursera.org/specializations/reinforcement-learning#courses
- great course with notes online that uses MineCraft for assignments and projects to teach RL : https://canvas.eee.uci.edu/courses/34142
- Deep Mind RL Fundamentals Lecture Series 2021 - a good resource from lecturers at UCL in collaboration with Google’s DeepMind

- Multiple talks at Canadian AI 2020 conference.
- Csaba Szepesvari (U. Alberta)

AAMAS 2021 conference just finished recently and is focussed on decision making and planning, lots of RL papers.

- See their Twitter Feed for links to talks

ICLR 2020 conference (https://iclr.cc/virtual_2020/index.html)

Other resources connected with previous versions of the course, I’m happy to talk about any of these if people are interested.

**SamIam Bayesian Network GUI Tool**

- Java GUI tool for playing with BNs (it’s old but it’s good)

http://reasoning.cs.ucla.edu/samiam/index.php?h=emodels

**Other Tools**

- Bayesian Belief Networks Python Package: allows creation of Bayesian Belief Networks and other Graphical Models with pure Python functions, where tractable exact inference is used. https://github.com/eBay/bayesian-belief-networks
- BayesPy - Python library for conjugate exponential family BNs and variational inference only: http://www.bayespy.org/intro.html
- Open Markov: http://www.openmarkov.org/
- Open GM (C++ library): http://hciweb2.iwr.uni-heidelberg.de/opengm/

Some videos and resources on Bayes Nets, d-separation, the Bayes Ball Algorithm, and more:

https://metacademy.org/graphs/concepts/bayes_ball

**[Ermon2019]** - First half of notes are based on Stanford CS 228 (https://ermongroup.github.io/cs228-notes/), which goes into even more detail on PGMs than we will.

**[Cam Davidson 2018]** - Bayesian Methods for Hackers - Probabilistic Programming textbook as set of python notebooks.

https://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/#contents

**[Koller, Friedman, 2009]** Probabilistic Graphical Models : Principles and Techniques

The extensive theoretical book on PGMs.

https://mitpress.mit.edu/books/probabilistic-graphical-models