The fourth iteration of this conference was held in Montreal July 7-10, 2019.
Melissa Sharpe, UCLA
(…)
Cleotilde Gonzalez, CMU
Will Dabney, DeepMind
Marlos Machado, Michael Bowling
Will Dabney, DeepMind
Anna Konova
William Fedus, Yoshua Bengio (Google Brain)
Susan Murphy
Anna Harutyunyan, Doina Precup, Remi Munos, et al. (DeepMind)
(Pierre-Yves Oudeyer, INRIA)
Luke Chang
Abhinav Gupta, Joelle Pineau, et al.
Caroline J Charpentier; Kiyohito Iigaya; John O’Doherty
Michael Bowling
speaker
Chelsea Finn, Berkeley, Google Brain, Stanford
Fiery Cushman
Rich Sutton
These are well-established theories of the associations that tasks create in the brain. She uses them to test computational questions about behaviour.
(Schultz 1997 papers) - the discovery of this through the 90s is a really interesting story of scientific discovery
Dopamine neurons encode surprise and cause learning:
Optogenetics is now used to validate this
There are subtle differences in how dopamine affects associating value with a stimulus versus learning causal relations
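For concreteness, the reward prediction error these dopamine results are usually mapped onto is the standard temporal-difference error:

$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$

Dopamine bursts are read as positive δ (better than expected) and dips as negative δ (worse than expected).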
Whenever we do not choose the action that maximizes our reward based on our model of the world's values. Framing bias is relevant here.
They study decision makers who are experts in complex, noisy domains, but in lab experiments with more control. This includes closed-loop decision making.
Properties of Dynamic Decision Making (DDM):
Two Types:
The goal is to maintain a particular state of a ‘stock’ (e.g. weight, temperature in an environment)
(Gonzalez, Human Factors, 2004)
They get people to play these very challenging games
then analyse the strategies and heuristics they use
(Gonzalez, Lerch and Lebiere, 2003)
(Anderson and Lebiere, 1998)
A model for combining Memory and Symbolic representations and how it happens in the human mind.
Also check out her awesome tutorial on RL for the people, which gives a great top-to-bottom perspective on the current state of RL.
How do we treat these learning concepts as RL?
She says counterfactual learning is related to batch learning and experience replay, since both involve learning from old data.
We can’t know what would have happened with different choices
“The ability to imagine what would have happened is critical to human intelligence.” - Judea Pearl
Using the norm of the successor representation encodes state visitation counts and helps the agent explore faster.
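A rough tabular sketch of this idea (my own illustration, not the authors' code; the bonus scale beta and the TD update are simplified):

```python
import numpy as np

# Sketch: learn the successor representation (SR) with TD updates and use the
# inverse of its L1 norm as an exploration bonus. The intuition from the talk:
# rarely visited states have a small SR norm, so they get a larger bonus.

n_states = 10
gamma, alpha, beta = 0.95, 0.1, 0.05
psi = np.zeros((n_states, n_states))  # psi[s] ~ expected discounted future occupancy from s

def sr_td_update(s, s_next):
    """One-step TD update of the successor representation."""
    onehot = np.eye(n_states)[s]
    psi[s] += alpha * (onehot + gamma * psi[s_next] - psi[s])

def exploration_bonus(s):
    """Smaller SR norm (roughly: fewer visits) -> larger bonus."""
    return beta / max(np.linalg.norm(psi[s], ord=1), 1e-6)

# In an agent loop, add exploration_bonus(s_next) to the environment reward
# before the usual Q-learning / Sarsa update.
```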
Another talk on this approach in general.
Treat the return itself as a random variable. This has the same recursive structure as the Bellman equation, except it relates a distribution over rewards to the value distribution.
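Written out, the distributional Bellman equation being referred to is (equality in distribution, with S' and A' the next state and action):

$$Z(s,a) \overset{D}{=} R(s,a) + \gamma Z(S', A')$$

Taking expectations of both sides recovers the usual Bellman equation for Q(s,a) = E[Z(s,a)].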
Drug users seem to be risk seeking, so their team modelled ambiguity tolerance separately from risk tolerance to explain people's varying valuations of money and drugs. They found this explains ongoing drug use better than risk tolerance does.
Goal: promote behaviour changes such as taking medication, reducing addiction, etc.
Existing health support apps take two main forms:
Can they reuse previous trajectories?
option = policy π + termination condition β(s)
Traditionally the policy is the focus, but the termination condition is trained to optimize the same objective. Biases are added in to encourage options to last longer.
Their idea: a separate termination objective, focused entirely on when an option should end.
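As a rough sketch of that setup (purely illustrative, assuming a gym-style env.step interface; not the authors' code):

```python
import random
from dataclasses import dataclass
from typing import Callable

# An option bundles an intra-option policy with a termination condition
# beta(s) giving the probability of stopping in state s.

@dataclass
class Option:
    policy: Callable[[int], int]         # state -> action
    termination: Callable[[int], float]  # state -> probability of terminating here

def run_option(env, s, option):
    """Call-and-return execution: follow the option's policy until it terminates."""
    while True:
        a = option.policy(s)
        s, reward, done, _ = env.step(a)
        if done or random.random() < option.termination(s):
            return s
```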
Their lab looks at intrinsic motivation in humans and machines. They explore developmental learning in children and try to apply the same principles to machines.
It is well known that children constantly explore and invent their own goals. So looking only at accuracy statistics is not useful; instead we need to consider the context.
“Interestingness” is not just about novelty or surprise.
It is about situations where a high rate of learning is happening. Once something becomes partially mastered, progress slows and the agent/child will lose interest.
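A minimal sketch of this learning-progress notion of interestingness (my own simplification of the idea in the talk; the window size is arbitrary):

```python
from collections import deque

# The intrinsic reward for a goal/region is how much prediction error has
# *decreased* recently, so both fully mastered and unlearnable situations
# score low and lose the agent's interest.

class LearningProgress:
    def __init__(self, window=20):
        self.window = window
        self.errors = deque(maxlen=2 * window)

    def update(self, prediction_error):
        self.errors.append(prediction_error)

    def intrinsic_reward(self):
        if len(self.errors) < 2 * self.window:
            return 0.0
        errs = list(self.errors)
        old, new = errs[: self.window], errs[self.window:]
        # progress = drop in average error from the older window to the newer one
        return max(0.0, sum(old) / self.window - sum(new) / self.window)
```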
They look at actual human social interactions to study the complex dynamics of choices amongst multiple people.
Many existing models of theory of mind are low dimensional, with a few main types of qualities, and assume these are static over time.
There is a push to explore this using Inverse Reinforcement Learning and Bayesian Learning
Learning of simulated languages between agents, including programming languages and natural or artificial ones.
Interesting work that explicitly builds compositional models of agents learning languages so that they contain some of the properties of natural languages.
Learning by observing people perform a task; there are two main approaches:
Question: do people alternate between these two strategies, and when?
Experiment:
Hypothesis:
Results:
Game: Hanabi - Michael explained how this card game blends communication through explicit messages with inference from observing the actions of other players
Working Memory is fast to use but has limited capacity and is forgotten quickly.
Give people a small set (3-7) of images whose positions they must remember, based on a larger image with stimuli in it.
They test two aspects of working memory to see if it behaves like an RL system
How do WM and RL interact?
A nice aspect of doing RL on robots is that you can't get around the problems of noise, bad reward models, and generalization the way you can in simulation. A major problem right now is that we know how to train specialist robots, but they generalize very badly even across mundane differences. So exploration, using raw sensory input, and continual improvement without supervision are needed.
A standard dichotomy is Model-free (Habitual) vs Model-based (Planning)
How can a little bit of model-based knowledge help with planning?
They show a bunch of experiments highlighting that people are able to separate these two tasks well. They find that the consideration set is generally based on a quick heuristic of pre-cached cases weighted by value. This means we think of options based on value, even if we eventually need to choose amongst that set according to something other than value maximization (e.g. choose your least favourite food, or choose an item that satisfies some convoluted constraint).
We finally wrapped up with a talk by Rich Sutton himself. He argues for a truly Integrated Science of Mind that applies equally to biological and machine minds.
Intelligence == Mind - Rich Sutton
The main principle is the association between stimuli and rewards, based on Pavlovian conditioning.
Example: rats hear a tone that becomes associated with food
Prediction error:
Rats can learn to associate sound + light (sensory prediction). Later, when they learn that the sound leads to food, they'll infer that the light leads to food too. But the way they value the light and the sound themselves will differ:
They will happily press a button to hear the bell even though they know they are full and don’t want food. They might even “enjoy” hearing the bell because it reminds them of food.
But the light, which is also associated by inference to food arriving, doesn’t hold any value. They won’t press the button to see the light.
The order they learn about the light+sound and about the food, makes a difference here too!
C. Klein, Oksana., Klinger
These are choices which are important and cannot be rolled back (e.g. forest fires, finance)
Choices can be tried with no impact before a real consequential choice is made (e.g. shopping for clothes)
1. How do memory and intelligence affect this?
2. Does practice under time pressure help?
3. Does more practice help?
(Hertwig, 2004)
An advisor app to help encourage sedentary people at risk of heart attacks to be more active.
Very customized to personal schedule and context.
They tried to build an RL agent to push messages without the person getting used to them.
This problem has interesting challenges:
Reasoning at multiple timescales
How do we define and measure generalization? How do we encode inductive biases?
(Oudeyer, IEEE TEC 2017)
They also explore the use of Inverse Reinforcement Learning
(Forestier et al. 2017) have a great video of robot learning.
People can perform optimally, but once the set size gets to 5 or 6 they start making 20-30% errors. An RL-only model of memory, however, doesn't show these problems.
RL alone isn't enough, so we need some kind of mixture model. Once this is added, the model corresponds closely to human performance.
WM blocks RL by contributing to the reward prediction error and helping improve it; there is a closed loop between them.
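A hedged sketch of what such a WM + RL mixture could look like (parameter names and the capacity/set-size weighting are my own guess at the flavour of such a model, not the speaker's exact equations):

```python
import numpy as np

def softmax(x, beta=5.0):
    z = np.exp(beta * (x - np.max(x)))
    return z / z.sum()

def mixture_policy(q_rl, q_wm, set_size, capacity=4, rho=0.9):
    """P(action) = w * WM policy + (1 - w) * RL policy, with w limited by WM capacity."""
    w = rho * min(1.0, capacity / set_size)
    return w * softmax(q_wm) + (1 - w) * softmax(q_rl)

# q_wm would hold the most recently observed reward for each action for this
# stimulus (fast but capacity-limited and quickly forgotten), while q_rl is
# updated slowly from reward prediction errors.
```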
Keeping an open mind about how different observed brain systems could contribute to and interact with learning. Example domains help to motivate this:
There has been a lot of success in cognitive science and neuroscience in grabbing useful ideas from CS RL to design more experiments. She encourages the CS RL community to grab more ideas from neuroscience and try them out in computational models; working memory, for example, seems to offer benefits beyond what RL alone provides.
Their approach (Visual Foresight) is to do two things:
Their state representation is the full images that the robot sees: all the pixels and the view of the environment. So their dynamics model needs to be a recurrent neural network that predicts video frames.
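Schematically, the planning loop on top of such a video-prediction model might look like this (function and attribute names here are placeholders, not the real Visual Foresight API):

```python
import numpy as np

# Sample candidate action sequences, roll them through the learned video
# predictor, score the predicted frames against a goal image, then execute
# the best first action and replan (model-predictive control).

def plan_action(video_model, current_frames, goal_image,
                horizon=10, n_candidates=200, action_dim=4):
    candidates = np.random.uniform(-1, 1, size=(n_candidates, horizon, action_dim))
    best_cost, best_seq = np.inf, candidates[0]
    for actions in candidates:
        predicted = video_model.predict(current_frames, actions)  # assumed interface
        cost = np.mean((predicted[-1] - goal_image) ** 2)         # distance of final frame to goal
        if cost < best_cost:
            best_cost, best_seq = cost, actions
    return best_seq[0]  # execute the first action, then replan
```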
We usually mean "practical" when we say "possible", not "physically conceivable". They show some interesting experimental results indicating that immoral options are not immediately added to these consideration sets, but only become available under more deliberation. So we don't even consider these scenarios, which saves time.
One final result is that this indicates that
We start with goals:
Next we need to look at subgoals:
These are all subproblems that are not essentially about reward or value. How do we learn them or represent them generally?
He thinks play is an important way to look at it.
Three key open questions about subproblems in RL:
Some settled issues:
That’s all, it was fun. See you in two years at Brown!
TODO: Importance Sampling for RL Policy Evaluation
(Precup, 2000)
weighting the trajectories by the ratio of the target and behaviour policy probabilities
as you can see, this has high variance
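For reference, the plain per-trajectory importance sampling estimator being described is:

$$\hat{V}^{\pi} = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=0}^{T_i - 1} \frac{\pi(a_t^i \mid s_t^i)}{\mu(a_t^i \mid s_t^i)} \right) G_i$$

where μ is the behaviour policy that generated the data, π the target policy, and G_i the return of trajectory i. It is unbiased, but the product of ratios makes the variance explode with the horizon.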
Stationary Importance Sampling (Hallak and Mannor, 2017)
If your state representation is too simple for the domain then the problem is no longer Markov. This is because you’ll need to remember past states to compensate for the lack of representation. This is the flip side of the usual idea that everything is Markovian as long as you have rich enough features in your state description.
So, if your models are bad then picking the MLE for the dynamics isn’t a good idea
Even using importance sampling has problems: it can have very high variance even though it is unbiased.
Unlike in supervised learning these are really hard in RL:
structural risk minimization
cross validation
There are promising methods for dealing with this in non-i.i.d. domains, but it's hard.
Constraints on Options: The goal is to encourage options to be simpler. Minimize the entropy of the final option model.
Termination critic: They use the Actor-Critic approach but have a critic for the termination rule in addition to the policy.
(Liu, Swaminathan UAI, 2019)