Reinforcement Learning and Decision Making Conference

The fourth iteration of this conference was held in Montreal July 7-10, 2019.

Basic Information

Neuro Intro Tutorial

Melissa Sharpe, UCLA

Associative Tasks

There are well-established theories mapping associative tasks to the brain. She uses them to test computational questions about behaviour.

Associative learning

The main principle is the association between stimuli and rewards, based on Pavlovian conditioning.
Example: rats hear a tone that has been paired with food.

Prediction error:

Two forms of learning (conditioned reinforcement):

Neutral Associations

Rats can learn to associate a sound with a light (a sensory prediction). Later, when they learn that the sound leads to food, they’ll infer that the light leads to food too. But the way they value the light and the sound themselves will differ!


Dopamine Prediction Error


(Schultz 1997 papers) - the discovery through the ’90s is a really interesting story of scientific discovery.

Dopamine neurons encode surprise and cause learning:

Optogenetics is now used to validate this.

There are subtle differences in how dopamine affects associating value with a stimulus versus learning causal relations.

Dynamic Decisions in Humans

Cleotilde Gonzalez, CMU

Irrationality: how should we define it?

Whenever we do not choose the action that maximizes our reward based on our model of the world and our values. Framing bias is relevant here.

Naturalistic Decision Making

C. Klein, Oksana., Klinger

Dynamic Decision Making (DDM)

They take the idea of decision makers who are experts in complex, noisy domains, but study it in lab experiments with more control. This includes closed-loop decision making.

Properties of DDM:

Two Types:

Consequential choice Problems

These are choices which are important and cannot be rolled back (eg. forest fire, finance)

Choice from Sampling

Choices can be tried with no impact before a real consequential choice is made (eg. shopping for clothes)


The goal is to maintain a particular state of a ‘stock’ (e.g. weight, or the temperature of an environment).

Post office / Water Flow Microworld

(Gonzalez, Human Factors, 2004)
They get people to play these very challenging games,
then analyse the strategies and heuristics they use.

Questions Arising from their Work

1. How do memory and intelligence affect this?

2. Does practice under time pressure help?

3. Does more practice help?

Instance Based Learning

(Gonzalez, Lerch and Lebiere, 2003)


(Anderson & Lebiere, 1998)

A model of how memory and symbolic representations combine, and how this happens in the human mind.
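The flavour of instance-based learning can be sketched as follows: store past (option, outcome) instances, and value an option by blending stored outcomes, weighting each instance by an ACT-R style activation that favours frequent and recent experiences. This is a hedged sketch; the function names, decay rate `d`, and temperature `tau` are illustrative, not the authors’ exact formulation.

```python
import math

def activation(times_used, now, d=0.5):
    """ACT-R style base-level activation: frequent and recent uses raise
    activation (d is the decay rate; all timestamps must precede `now`)."""
    return math.log(sum((now - t) ** (-d) for t in times_used))

def blended_value(instances, option, now, tau=0.25):
    """Blend stored outcomes for an option, weighting each instance by a
    softmax over its activation. instances: list of (option, outcome, times_used)."""
    relevant = [(o, out, ts) for (o, out, ts) in instances if o == option]
    acts = [activation(ts, now) for (_, _, ts) in relevant]
    weights = [math.exp(a / tau) for a in acts]
    total = sum(weights)
    return sum(w / total * out for w, (_, out, _) in zip(weights, relevant))
```

A frequently and recently experienced good outcome dominates the blend, so choices track experience rather than described probabilities.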

Description-Experience Gap

(Hertwig, 2004)




Counterfactual RL

Emma Brunskill

Also check out her awesome tutorial on RL for the people, which gives a great top-to-bottom perspective on the current state of RL.

Her focus is on treating learning problems as RL.
She says counterfactual learning is related to batch learning and experience replay, since both learn from old data.
We can’t know what would have happened under different choices.

"The ability to imagine what would have happened is critical to human intelligence." - Judea Pearl

Batch Policy Optimization


Policy Evaluation

Importance Sampling for RL Policy Evaluation

(Precup et al., 2000)

Weight each trajectory by the ratio of the evaluation and behaviour policy probabilities. As we’ll see, this has high variance.
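A minimal sketch of this per-trajectory importance sampling estimator (trajectory format and function names are illustrative):

```python
def importance_sampling_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    """Estimate the value of evaluation policy pi_e from trajectories
    collected under behaviour policy pi_b.

    Each trajectory is a list of (state, action, reward) tuples;
    pi_e(a, s) and pi_b(a, s) return action probabilities.
    """
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # cumulative ratio of policy probabilities along the trajectory
            weight *= pi_e(a, s) / pi_b(a, s)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)  # unbiased, but variance grows with horizon
    return sum(estimates) / len(estimates)
```

The product of probability ratios is what makes the estimator unbiased, and also what makes its variance explode over long horizons.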

Stationary Importance Sampling (Hallak and Mannor, 2017)

If your state representation is too simple for the domain then the problem is no longer Markov. This is because you’ll need to remember past states to compensate for the lack of representation. This is the flip side of the usual idea that everything is Markovian as long as you have rich enough features in your state description.

So, if your models are bad, then picking the MLE for the dynamics isn’t a good idea. Even importance sampling has problems: although it is unbiased, it can have very high variance.

A Big Idea

Unlike in supervised learning, these are really hard in RL:


There are promising methods for dealing with this in non-i.i.d. domains, but it’s hard.

Moving the Goalpost

(Liu, Swaminathan UAI, 2019)

Finding the Best Policy in a class

Distributional Reinforcement Learning

Will Dabney, DeepMind


Distributional TD Learning

Validation with Simple Experimental Tasks with Animals

Count-based Exploration with Successor Representation

Marlos Machado, Michael Bowling

Successor Representation

Counting the number of visits to a state along a trajectory

Updating Existing Algorithms

Function Approximation


Using the norm of the successor representation to encode state visitation counts helps the agent explore faster.
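A rough sketch of the idea with a tabular SR (illustrative names; the actual algorithm learns the SR with function approximation alongside a DQN-style agent):

```python
import numpy as np

def learn_sr(transitions, n_states, alpha=0.1, gamma=0.9):
    """TD learning of the successor representation. psi[s] estimates the
    expected discounted future occupancy of every state starting from s."""
    psi = np.zeros((n_states, n_states))
    for s, s_next in transitions:
        target = np.eye(n_states)[s] + gamma * psi[s_next]
        psi[s] += alpha * (target - psi[s])
    return psi

def exploration_bonus(psi, s, beta=1.0, eps=1e-8):
    # The L1 norm of an SR row grows with visitation, so its inverse
    # behaves like a count-based exploration bonus.
    return beta / (np.linalg.norm(psi[s], ord=1) + eps)
```

Rarely visited states have small SR norms and therefore large bonuses, steering the agent toward them.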

Directions in Distributional Learning

Will Dabney, DeepMind

Another talk on this approach in general.

The Big idea

Treat the return itself as a random variable. This has the same recursive structure as the Bellman equation, except that it relates a distribution over rewards to the value distribution.
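Concretely, the distributional Bellman equation replaces the expectation with an equality in distribution:

$$Z(s,a) \stackrel{D}{=} R(s,a) + \gamma Z(S', A'), \qquad S' \sim P(\cdot \mid s,a), \; A' \sim \pi(\cdot \mid S'),$$

so taking expectations, $Q(s,a) = \mathbb{E}[Z(s,a)]$, recovers the usual Bellman equation.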

Why does Distributional RL Work?

The Virtuous Circle of RL

Further Reading

Substance Use Disorder

Anna Konova

Drug users seem to be risk seeking, so their team modelled ambiguity tolerance separately from risk tolerance to explain people’s varying valuations of money and drugs. They found that ambiguity tolerance explains ongoing drug use better than risk tolerance.

Hyperbolic Discounting

William Fedus, Yoshua Bengio — Google Brain
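For context, a hedged sketch (not the paper’s method): hyperbolic discounting values a reward at delay $t$ by $1/(1+kt)$ rather than $\gamma^t$, which produces preference reversals that exponential discounting cannot:

```python
def exponential_discount(t, gamma=0.95):
    return gamma ** t

def hyperbolic_discount(t, k=1.0):
    # Hyperbolic discounting: the ratio between two delays' discounts
    # changes as both delays shift, producing preference reversals.
    return 1.0 / (1.0 + k * t)

def prefers_smaller_sooner(discount, shift=0):
    # Reward 1 at delay 1 vs reward 2 at delay 10, both shifted into the future.
    return 1.0 * discount(1 + shift) > 2.0 * discount(10 + shift)
```

With hyperbolic discounting the smaller-sooner reward wins now but loses once both rewards are far away; an exponential discounter never reverses. Roughly, their work relates hyperbolic discounting to learning value functions over a family of exponential horizons at once.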

Heart Health with RL

Susan Murphy

Goal: Promote behaviour changes on taking medication, reducing addiction etc.

Existing health support apps take two main forms:

Heart steps app

An advisor app to help encourage sedentary people at risk of heart attacks.
Very customized to personal schedule and context.

They tried to build an RL agent that pushes messages without the person habituating to them.
This problem has interesting challenges:


Can they reuse previous trajectories?

Termination Critic

Anna Harutyunyan, Doina Precup, Remi Munos, et al. (DeepMind)

Temporal Abstraction

Reasoning at multiple timescales

How do we define and measure generalization? How do we encode inductive biases?


option(O) = policy($\pi$) + termination condition $\beta$

Traditionally the policy is the focus, but the termination condition optimizes the same objective. Biases are added to encourage options to last longer.
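As a hedged sketch (all names illustrative), an option is just a policy rolled out until its termination condition fires:

```python
import random

def run_option(env_step, state, policy, beta, max_steps=100):
    """Follow the option's policy until the termination condition
    beta(state) fires (a Bernoulli draw each step). env_step(s, a)
    returns the next state; these names are illustrative."""
    visited = [state]
    for _ in range(max_steps):
        if random.random() < beta(state):  # terminate with probability beta(s)
            break
        state = env_step(state, policy(state))
        visited.append(state)
    return visited
```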

Their idea: a separate termination rule focused entirely on when to end the option.

Option Transition Model

Constraints on Options: the goal is to encourage options to be simpler. Minimize the entropy of the final option model.

Termination critic: they use the actor-critic approach, but add a critic for the termination rule in addition to the policy.

Intrinsic Motivation (a.k.a Curiosity)

(Pierre-Yves Oudeyer, INRIA)

Their lab looks at intrinsic motivation in humans and machines. They explore developmental learning in children and try to apply it to machines.

It is well known that children constantly explore and invent their own goals. So looking only at accuracy statistics is not useful; instead we need to consider the context.

The Learning Progress Hypothesis

“Interestingness” is not just about novelty or surprise.
It is about situations where a high rate of learning is happening. If something becomes mastered, progress slows and the agent/child will lose interest.

IAC Algorithm

(Oudeyer, IEEE TEC, 2007)
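A minimal sketch of the learning-progress idea behind IAC (names and windowing choices are illustrative, not the paper’s exact algorithm): track prediction error per task and prefer the one where error is dropping fastest.

```python
import random

def learning_progress(error_history, window=5):
    """Learning progress = recent decrease in prediction error.
    Flat curves (mastered or impossible tasks) score ~0."""
    if len(error_history) < 2 * window:
        return 0.0
    older = sum(error_history[-2 * window:-window]) / window
    recent = sum(error_history[-window:]) / window
    return max(0.0, older - recent)

def choose_task(histories):
    # Pick the task where learning is progressing fastest; ties broken at random.
    progress = {task: learning_progress(h) for task, h in histories.items()}
    best = max(progress.values())
    return random.choice([t for t, p in progress.items() if p == best])
```

Mastered and impossible tasks both show flat error and get ignored, which is the point: interest follows progress, not novelty.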

They also explore the use of Inverse Reinforcement Learning.

(Forestier et al., 2017) have a great video of robot learning.

Anatomy of a Social Interaction

Luke Chang

They look at real human social interactions to study the complex dynamics of choices among multiple people.

Trust Game

Prediction Error in the Brain

What makes someone trustworthy?

Moral Strategy Model


Theory of Mind

Many existing models of theory of mind are low-dimensional, assuming a few main types of qualities that are static over time.

There is a push to explore this using Inverse Reinforcement Learning and Bayesian Learning

Prediction Game

Guilt Aversion as a Useful Component of Values

Learning to Learn to Communicate

Abhinav Gupta, Joelle Pineau, et al.

Learning of simulated languages between agents, including programming languages and natural or artificial ones.

Interesting work that explicitly builds compositional models of agents learning languages, so that they contain some of the properties of natural languages.

Arbitration between imitation and emulation during human observational learning

Caroline J Charpentier; Kiyohito Iigaya; John O’Doherty

Learning by observing people perform a task; there are two main approaches:

Question: do people alternate between these two strategies, and when?




Computational Models


Can a Game Require Theory of Mind?

Michael Bowling

Game: Hanabi - Michael explained how this card game blends the notion of communication between explicit messages and observation of the actions of other players

Working Memory


Working Memory is fast to use but has limited capacity and is forgotten quickly.

Simple Experiment

Give people a small set (3-7) of images whose positions they must remember, based on a larger image with stimuli in it.

Try to test two aspects of working memory if it is an RL system


People can perform optimally, but once the set size reaches 5 or 6 they start making 20-30% errors. An RL-only model of memory doesn’t show these problems.

Adding a Separate Working Memory Process

RL alone isn’t enough, so we need some kind of mixture model. Once this is added, the model corresponds closely to human performance.
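A hedged sketch of such a mixture (in the spirit of Collins and Frank’s RLWM model; the names, parameters, and exact weighting here are illustrative): choice probability mixes a one-shot but capacity-limited WM lookup with a slowly learned RL softmax.

```python
import math

def mixture_policy_prob(a, s, q_values, wm_store, capacity, set_size,
                        wm_weight=0.9, beta=5.0, n_actions=3):
    """P(a | s) as a mixture of working memory and RL.

    q_values: dict mapping (state, action) -> learned value (slow RL).
    wm_store: dict mapping state -> last correct action (fast, limited WM).
    """
    # RL component: softmax over incrementally learned Q-values
    exps = [math.exp(beta * q_values[(s, b)]) for b in range(n_actions)]
    p_rl = math.exp(beta * q_values[(s, a)]) / sum(exps)
    # WM component: perfect recall if the state is held in memory
    if s in wm_store:
        p_wm = 1.0 if wm_store[s] == a else 0.0
    else:
        p_wm = 1.0 / n_actions
    # WM contributes less as set size exceeds capacity
    w = wm_weight * min(1.0, capacity / set_size)
    return w * p_wm + (1.0 - w) * p_rl
```

As the set size grows past capacity, the WM weight shrinks and performance falls back on the slower RL component, reproducing the error pattern at set sizes 5-6.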

But…Working Memory Interferes with RL

How do WM and RL interact?

So What’s Happening?

WM interferes with RL by contributing to the reward prediction error and reducing it; there is a closed loop between them.

Why is this Important?

Keeping an open mind about how different observed brain systems could contribute to and interact with learning. Example domains help to motivate this:

There has been a lot of success in cognitive science and neuroscience in borrowing useful ideas from CS RL to drive more experiments. She encourages the CS RL community to borrow more ideas from neuroscience and try them out in computational models; working memory, for example, seems to have benefits beyond RL alone.

Reinforcement Learning for Robots

Chelsea Finn, Berkeley, Google Brain, Stanford

A nice aspect of doing RL on robots is that you can’t sidestep the problems of noise, bad reward models, and generalization the way you can in simulation. A major problem right now: we know how to train specialist robots, but they generalize very badly, even to mundane differences. So exploration, use of raw sensory input, and continual improvement without supervision are needed.

Learning Reusable Models from Self-Supervision

Their approach (Visual Foresight) is to do two things

Their state representation is the full image the robot sees, including all the pixels and the view of the environment. So their dynamics model needs to be a recurrent neural network that predicts video frames.
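Planning with such a learned model can be sketched as shooting-based MPC (the real system uses the cross-entropy method over predicted video; the names here are illustrative): sample candidate action sequences, roll each through the model, and execute the first action of the cheapest rollout.

```python
import random

def plan_action(model, cost_fn, state, horizon=5, n_candidates=100, n_actions=4):
    """Shooting-based MPC: sample action sequences, roll each through the
    learned model, and return the first action of the lowest-cost rollout."""
    best_cost, best_first = float("inf"), None
    for _ in range(n_candidates):
        seq = [random.randrange(n_actions) for _ in range(horizon)]
        s, cost = state, 0.0
        for a in seq:
            s = model(s, a)       # learned dynamics prediction
            cost += cost_fn(s)    # e.g. distance of predicted pixels to a goal image
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first
```

Replanning after every executed action keeps the controller robust to model error, since only the first action of each plan is ever trusted.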

How We Know When Not to Think

Fiery Cushman

A standard dichotomy is Model-free (Habitual) vs Model-based (Planning)

How can a little bit of model-based knowledge help with planning?

They show a bunch of experiments highlighting that people are able to separate these two tasks well. They find that the consideration set is generally based on a quick heuristic of pre-cached cases weighted by value. This means we think of options based on value, even if we eventually need to choose from that set according to something other than value maximization (e.g. choose your least favourite food, or an item that satisfies some convoluted constraint).

What does “Possible” mean?

We usually mean practical when we say possible, not merely physically conceivable. They show some interesting experimental results indicating that immoral options are not immediately added to these consideration sets, but only become available under more deliberation. So we don’t even consider those scenarios, which saves time.

One final result is that this indicates that

Play : Interplay of Goals and Subgoals in Mental Development

Rich Sutton

We finally wrapped up with a talk by Rich Sutton himself. He argues for a truly integrated science of mind that applies equally to biological and machine minds.

"Intelligence == Mind" - Rich Sutton

The Reward Hypothesis is great but…

Where are we?

We start with goals :

Next we need to look at subgoals:

These are all subproblems that are not essentially about reward or value. How do we learn them or represent them generally?

He thinks play is an important way to look at it.

Three key open questions about subproblems in RL:

  1. Q1 - What should subproblems be?
  2. Q2 - Where do they come from?
  3. Q3 - How do subproblems help the main problem (i.e. how do subgoals help the main reward-maximization task)?
    • Learning to solve subproblems could help shape better state representations and more coherent behaviour patterns
    • It also allows high-level planning, because now you have a model of what happens after you’ve achieved the subgoal

Some settled issues:



That’s All, it was fun. See you in two years at Brown!