• Reinforcement Learning and Decision Making Conference

    The fourth iteration of this conference was held in Montreal, July 7-10, 2019.

    • Conference website: http://rldm.org/
    • Conference brochure: http://otto.lab.mcgill.ca/temp/RLDM2019ProgramBrochure.pdf
    • PDF of all accepted abstracts: http://rldm.org/papers/abstracts.pdf

  • Neuro Intro Tutorial

    Melissa Sharpe, UCLA

  • Dopamine Prediction Error

    (…)

  • Dynamic Decisions in Humans

    Cleotilde Gonzalez, CMU

  • Counterfactual RL

    Emma Brunskill

  • Distributional Reinforcement Learning

    Will Dabney, DeepMind

  • Count-based Exploration with Successor Representation

    Marlos Machado, Michael Bowling

  • Directions in Distributional Learning

    Will Dabney, DeepMind

  • Substance Use Disorder

    Anna Konova

  • Hyperbolic Discounting

    William Fedus, Yoshua Bengio — Google Brain

  • Heart Health with RL

    Susan Murphy

  • Termination Critic

    Anna Harutyunyan, Doina Precup, Rémi Munos, et al. (DeepMind)

  • Intrinsic Motivation (a.k.a Curiosity)

    (Pierre-Yves Oudeyer, INRIA)

  • Anatomy of a Social Interaction

    Luke Chang

  • Learning to Learn to Communicate

    Abhinav Gupta, Joelle Pineau, et al.

  • Arbitration between imitation and emulation during human observational learning

    Caroline J Charpentier; Kiyohito Iigaya; John O’Doherty

  • Can a Game Require Theory of Mind?

    Michael Bowling

  • Working Memory

    speaker

  • Reinforcement Learning for Robots

    Chelsea Finn (Berkeley, Google Brain, Stanford)

  • How We Know When Not to Think

    Fiery Cushman

  • Play: Interplay of Goals and Subgoals in Mental Development

    Rich Sutton

  • Associative Tasks

    These are well-established theories that link associative tasks to brain structures. She uses them to test computational questions about behaviour.

  • Associative learning

  • Two forms of learning (conditioned reinforcement):

    • value of the tone itself: like watching a cooking show
    • value of a causal outcome of the reward (food):
      • this shows up after the signal (tone)
      • This explains some of the irrational behaviour of drug users, where the signals themselves acquire value
  • Neutral Associations

  • Questions

    • so are there completely different kinds of learning (conditioning)?
      Some types of learning associate value with the stimulus itself and others are just about prediction and causation?
    • are these results arising from particular brain structures or from an algorithm? Is there even a difference?
  • (Schultz, 1997 papers) - the discovery of this through the 90's is a really interesting story of scientific discovery

  • Dopamine neurons encode surprise and cause learning:

    • once the association is learned, dopamine fires when the cue predicts the reward, not just when the reward arrives
    • if the predicted reward does not arrive, the dip in dopamine encodes disappointment
    • They linked this to TD(0) learning
    • So dopamine activity looks like a temporal-difference prediction error
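
    A minimal tabular TD(0) sketch of this prediction-error view (illustrative only; the toy "cue then food" setup and all parameter values are assumptions, not from the talk):

    ```python
    # Tabular TD(0): the prediction error delta plays the role attributed to dopamine.
    alpha, gamma = 0.1, 0.95           # learning rate and discount (assumed values)
    V = {"cue": 0.0, "food": 0.0}      # value estimates for a toy cue -> food episode

    def td_update(s, r, s_next):
        """One TD(0) step: delta > 0 is 'surprise', delta < 0 is 'disappointment'."""
        delta = r + gamma * V.get(s_next, 0.0) - V[s]
        V[s] += alpha * delta
        return delta

    # Early in training the large positive delta happens at the food step; once the
    # reward is fully predicted by the cue, that delta shrinks toward zero.
    for _ in range(50):
        td_update("cue", 0.0, "food")
        td_update("food", 1.0, None)   # terminal step: no successor value
    ```
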
  • Optogenetics are used now to validate this

    • they can explicitly send TD errors into dopamine neurons and see the effect
    • it's like adding surprise even when they aren't surprised, or increasing the reward even though it's the same food
    • if they kill off dopamine entirely the rats do still learn, but it's reduced
    • they also show it encourages sensory-specific situations, so Q(s, a)
  • There are subtle differences in how dopamine affects associating value to a stimulus versus learning causal relations

    • the timing of when the dopamine arrives seems to matter
    • so small changes in when and how much dopamine or reward is received can lead to huge differences.
  • Irrationality: how should we define it?

    Whenever we do not choose the action that maximizes our reward based on our model of the world and our values. Framing bias is relevant.

  • Naturalistic Decision Making

  • Dynamic Decision Making (DDM)

    They study decision makers who are experts in complex, noisy domains, but in lab experiments with more control. This includes closed-loop decision making.

    Properties of DDM:

    • utility can be dependent on the
    • decisions over time are interdependent
    • limited time and cognitive resources
    • delayed feedback

    Two Types:

    • choice - maximize total reward
    • control - maintain system balance
  • Control

    The goal is to maintain a particular state of a ‘stock’ (eg. weight, temperature in environment)

  • Post office / Water Flow Microworld

    (Gonzalez, Human Factors, 2004)
    They get people to play these very challenging games
    and then analyse the strategies and heuristics they use

  • Questions Arising from their Work

  • Instance Based Learning

    (slide photo: 20190707_160550)
    (Gonzalez, Lerch and Lebiere, 2003)

  • ACT-R

    (Anderson and Lebiere, 1998)

    A model for combining Memory and Symbolic representations and how it happens in the human mind.

  • IBLT

    • create new meta states, “instances”, to evaluate based on multiple memorized events that are similar to the current situation
    • they have a Python library to define their models
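
    For context, a rough sketch of the instance-based blending mechanism as commonly described (not their actual Python library; the decay and noise parameters here are assumptions):

    ```python
    import math, random

    # Instances are [outcome, [timestamps]] stored per option. Activation follows an
    # ACT-R-style recency/frequency form; a blended value weights each remembered
    # outcome by its retrieval probability.
    d, tau = 0.5, 0.25                  # decay and noise temperature (assumed)
    memory = {}                         # option -> list of [outcome, [times]]

    def activation(times, now):
        return math.log(sum((now - t) ** (-d) for t in times)) + random.gauss(0, 0.1)

    def blended_value(option, now):
        instances = memory.get(option, [])
        if not instances:
            return 0.0
        acts = [activation(times, now) for _, times in instances]
        weights = [math.exp(a / tau) for a in acts]
        z = sum(weights)
        return sum(w / z * outcome for w, (outcome, _) in zip(weights, instances))

    def record(option, outcome, now):
        for inst in memory.setdefault(option, []):
            if inst[0] == outcome:       # same outcome observed again: add a timestamp
                inst[1].append(now)
                return
        memory[option].append([outcome, [now]])
    ```
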
  • Also check out her awesome tutorial on RL for the people, which gives a great top-to-bottom perspective on the current state of RL.

  • How do we focus on treating learning concepts as RL?
    She says Counterfactual learning is related to Batch learning and experience replay since both are learning based on old data.
    We can't know what would have happened with different choices

     "The ability to imagine what would have happened is critical to human intelligence." - Judea Pearl
  • History

    • discovering TD learning for AI methods
    • then discovering that this looks similar to what happens in the brain with dopamine neurons
    • estimates of value at all states updated in the direction of improving prediction error
  • Distributional TD Learning

    • traditional TD learning updates for all states (or neurons) with the same scale
    • but in DTD they weight the updates using the local distribution of rewards somehow
    • switch from mean value update to distributional value update
    • they find that it seems like learning the distribution helps to learn a better representation
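
    One concrete way to make "weight the updates by the local reward distribution" precise (a sketch in the spirit of the asymmetric-learning-rate account from this line of work; details and parameters are my assumptions) is to give each value channel different learning rates for positive and negative prediction errors, so the population of channels spans the return distribution:

    ```python
    import random

    # Each "channel" (e.g. a neuron) keeps its own value estimate and applies a
    # different learning rate to positive vs. negative prediction errors, so it
    # settles at a different expectile of the reward distribution.
    channels = [{"V": 0.0, "a_plus": ap, "a_minus": 1.0 - ap}
                for ap in (0.1, 0.3, 0.5, 0.7, 0.9)]

    def update(reward):
        for ch in channels:
            delta = reward - ch["V"]
            rate = ch["a_plus"] if delta > 0 else ch["a_minus"]
            ch["V"] += 0.05 * rate * delta       # 0.05 is a base step size (assumed)

    # With a bimodal reward source, optimistic channels settle at high values and
    # pessimistic ones at low values, together describing the distribution.
    for _ in range(5000):
        update(random.choice([0.0, 1.0]))
    print([round(ch["V"], 2) for ch in channels])
    ```
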
  • Validation with Simple Experimental Tasks with Animals

    • animals receive one of seven amounts of food, with a prob distribution
    • some animals get a signal associated with each case
    • traditional TD learning: if the reward is above average then positive learning happens
    • but what about the distribution for each neuron / state?
    • looking at experimental data it seems to align with what we’d expect from a distributional model rather than the old mean approach
  • Successor Representation

    • Function approximation requires that we really see examples of the different states we want values for. If we never see them then we can't learn them.
    • one simple way to bootstrap this is using proximity between states.
    • but proximity can break in spatial domains or complex state spaces.
    • what we really want is to talk about how many steps it would take to get between two ‘nearby’ states rather than their Euclidean distance
  • Counting the number of visits to a state along a trajectory

    • this can be estimated with TD learning
    • there is a good way to do function approximation on this as well
    • The Successor Representation (SR) naturally comes out of the dual approach to dynamic programming for RL
    • there is also some evidence that it matches some of what is happening in the hippocampus
    • this can be seen as an alternative to optimism under uncertainty used in R-Max and others
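
    For reference, the tabular successor representation and its TD update (standard definitions, not specific to this talk):

    ```latex
    \psi(s, s') = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}\{S_t = s'\} \,\middle|\, S_0 = s \right],
    \qquad
    \psi(s_t, \cdot) \leftarrow \psi(s_t, \cdot) + \alpha \left( \mathbb{1}_{s_t} + \gamma\, \psi(s_{t+1}, \cdot) - \psi(s_t, \cdot) \right)
    ```
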
  • Updating Existing Algorithms

    • add the L2 norm of the SR as an exploration bonus to standard SARSA
    • intuition: if some state has not been visited much before it will get a bonus to encourage exploring it
    • huge improvement on SARSA
    • also works to add it to model-based algorithms like E^3, R-MAX, etc.
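
    A minimal sketch of what this can look like (tabular SARSA with an SR learned by TD; the environment, parameters, and exact bonus form are illustrative assumptions: the bonus here is inversely proportional to the SR norm, so that rarely visited states, whose norm is still small, get a larger bonus):

    ```python
    import numpy as np

    # Tabular SARSA with an exploration bonus derived from the norm of a TD-learned
    # successor representation (SR). With psi initialized at zero, ||psi[s]|| grows
    # with visits to s, so the reciprocal gives a larger bonus to rarely visited states.
    n_states, n_actions = 10, 2
    alpha, alpha_sr, gamma, beta, eps = 0.1, 0.1, 0.95, 0.1, 1e-6
    Q = np.zeros((n_states, n_actions))
    psi = np.zeros((n_states, n_states))          # successor representation

    def sr_bonus(s):
        return beta / (np.linalg.norm(psi[s]) + eps)

    def step_update(s, a, r, s2, a2, done):
        # TD update of the SR row for the visited state
        onehot = np.eye(n_states)[s]
        target_psi = onehot + (0.0 if done else gamma) * psi[s2]
        psi[s] += alpha_sr * (target_psi - psi[s])
        # SARSA update on the reward augmented with the exploration bonus
        r_aug = r + sr_bonus(s2)
        target_q = r_aug + (0.0 if done else gamma * Q[s2, a2])
        Q[s, a] += alpha * (target_q - Q[s, a])
    ```
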
  • Function Approximation

    • adding this idea to DQN seems to help as well, especially for domains where random exploration doesn't work well
  • Result

    Using the norm of the successor representation encodes state visitation counts and helps to explore faster.

  • Another talk on this approach in general.

    • Distributional RL says we should learn the true distribution of the values.
    • The means can be used directly to update value estimates using the Bellman equation.
    • But this doesn't work if you aren't using the mean (moments), because there may be multiple distributions that are consistent with that mean.
    • So the big question is how to best impute the right distribution to explain the experiences.
    • The way they've approached it is to fix the representation or projection to a consistent estimator that preserves the mean, even though it's not necessarily the best one.
  • The Big idea

    Treat the return itself as a random variable. This has the same recursive structure as the Bellman equation, except it relates the reward distribution to the value (return) distribution.
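
    Written out, this is the distributional Bellman equation from the distributional RL literature (Bellemare, Dabney, Munos, 2017), an equality in distribution:

    ```latex
    Z(s, a) \overset{D}{=} R(s, a) + \gamma\, Z(S', A'),
    \qquad S' \sim P(\cdot \mid s, a),\; A' \sim \pi(\cdot \mid S')
    ```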

  • Drug users seem to be risk seeking, so their team modelled ambiguity tolerance separately from risk tolerance to explain people's varying valuations of money and drugs. They found that ambiguity tolerance explains ongoing drug use better than risk tolerance.

    • The standard γ^t discount factor is an exponential discounting
    • 1/(1 + kt) is a hyperbolic discount
    • (Sozou, 1998) uses survival s(t) rather than γ^t
      • the probability of surviving until timestep t
      • we can derive the standard γ^t from this for a domain with a fixed risk of dying at each step
    • but if the hazard rate depends on the state we get other discount functions
    • they simplify the use of this by training the agent on multiple time horizons as an auxiliary task
    • this is part of a larger discussion in RL that the common assumptions about the discount don't always hold
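
    The underlying relationship (a standard reproduction of Sozou's argument, included here for reference): a known constant hazard rate λ gives exponential discounting, while an uncertain, exponentially distributed hazard rate gives the hyperbolic form:

    ```latex
    s(t) = e^{-\lambda t} = \gamma^{t} \quad (\gamma = e^{-\lambda}),
    \qquad\text{vs.}\qquad
    s(t) = \int_{0}^{\infty} \tfrac{1}{k}\, e^{-\lambda/k}\, e^{-\lambda t}\, d\lambda = \frac{1}{1 + k t}
    ```
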
  • Goal: Promote behaviour changes such as taking medication, reducing addiction, etc.

    Existing health support apps take two main forms:

    • pull: just info, depends on the user
    • push: deliver intervention when needed
  • Heart steps app

  • Questions:

    Can they reuse previous trajectories?

  • Temporal Abstraction

  • Options

    option(O) = policy(π) + termination condition (β)

    Traditionally, the policy is the focus, but the termination condition optimizes the same objective. Biases are added in to encourage options to last longer.

    Their idea: a separate termination rule focused on when to end the option entirely.

  • Their lab looks at intrinsic motivation in humans and machines. They explore developmental learning in children and try to apply it to

    • building robots
    • developing better education methods

    It is well known that children always explore and invent their own goals. So looking only at the accuracy statistics is not useful, instead we need to consider the context.

  • The Learning Progress Hypothesis

    “Interestingness” is not just about novelty or surprise.
    It is about situations where a high level of learning is happening. If something becomes partially mastered then progress will slow and agent/child should/will lose interest.

  • IAC Algorithm

  • They look at actual human social interactions for complex dynamics between choices amongst multiple people.

  • Trust Game

    • sequence of choices: join or don't join the interaction
    • share or don't share: information
  • Moral Strategy Model

    (slide photo: 20190709_144501)

  • Theory of Mind

    Many existing models of theory of mind are low dimensional, with a few main types of qualities, and assume these qualities are static over time.

    There is a push to explore this using Inverse Reinforcement Learning and Bayesian Learning

  • Guilt Aversion as a Useful Component of Values

    • It's important to include guilt or theory of mind about others' disappointment or pain
    • It is a real effect in human decision making and it is robust across culture and does not conform to the standard economics idea of expected utility maximization
    • Advice being given for medical and other safety-critical domains needs to consider this.
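
    One standard way psychological game theory formalizes this (the simple-guilt utility of Battigalli and Dufwenberg, shown here as background; the talk's exact model may differ): player i's utility is the material payoff minus a penalty for letting the other player j down relative to j's expected payoff,

    ```latex
    u_i = m_i - \theta_i \cdot \max\!\big(0,\; \mathbb{E}_j[m_j] - m_j\big), \qquad \theta_i \ge 0 \ \text{(guilt sensitivity)}
    ```
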
  • Learning of simulated languages between agents, including programming languages and natural or artificial ones.

  • Interesting work that explicitly builds compositional models of agents learning languages so that they contain some of the properties of natural languages.

  • Learning by observing people perform a task; there are two main approaches:

    • imitation learning
    • learning by inferring their goals and preferences (emulation)

    Question: do people alternate between these two strategies, and when?

    Experiment:

    • bandit task to choose which arm to pull, with some features on the machines to identify them
    • you get to watch another player follow a strategy and you know they are a good player
    • you also know that one of the features (tokens) is perfectly correlated with high reward

    Hypothesis:

    • imitation is slower to learn, but better when the system has lots of uncertainty
    • emulation will be favoured for highly volatile domains

    Results:

    • people use both approaches
  • Computational Models

    • they build a computational model for each strategy and an arbitration model which weights a tradeoff between the two strategies
    • they show the arbitration model performs better
    • then they test whether it explains the human behaviour better, and they find it does very closely
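
    A rough sketch of what an arbitration scheme like this can look like (my own illustration under the assumption that the arbitration weight tracks each strategy's recent predictive reliability; the paper's actual model may differ):

    ```python
    import math

    # Two strategy "experts" (imitation, emulation) each output a probability for the
    # observed player's next choice. Each expert's reliability is an exponential
    # average of its log-likelihood on observed choices; arbitration weights are a
    # softmax over reliabilities. Parameter values are illustrative assumptions.
    decay, temp = 0.9, 3.0
    reliability = {"imitation": 0.0, "emulation": 0.0}

    def arbitration_weights():
        z = sum(math.exp(temp * r) for r in reliability.values())
        return {k: math.exp(temp * r) / z for k, r in reliability.items()}

    def observe(choice, predictions):
        """predictions: dict strategy -> dict(action -> probability)."""
        for name, p in predictions.items():
            ll = math.log(max(p.get(choice, 0.0), 1e-6))
            reliability[name] = decay * reliability[name] + (1 - decay) * ll

    def combined_prediction(predictions, actions):
        w = arbitration_weights()
        return {a: sum(w[k] * predictions[k].get(a, 0.0) for k in predictions)
                for a in actions}
    ```
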
  • Implications

    • they perform fMRI scans and show which parts of the brain correlate with activity for each of the two strategies and the joint arbitration signal too
  • Game: Hanabi - Michael explained how this card game blends the notion of communication between explicit messages and observation of the actions of other players

  • Working Memory is fast to use but has limited capacity and is forgotten quickly.

  • Simple Experiment

    Give people a small set (3-7) of images and ask them to remember each image's position within a larger image containing the stimuli.

    Try to test two aspects of working memory, treating it as if it were an RL system:

    • time limiting factors, how long it lasts
    • size of memory, how many different elements can be remembered
  • But…Working Memory Interferes with RL

    How do WM and RL interact?

    • EEG studies show that RL reward prediction or reward history are correlated with the set size (the number of things trying to be remembered). So they are not independent. So WM is somehow blocking?
    • EEG studies also show that the Q-value drops faster (improves faster) for small sets, so somehow WM is helping?
    • Long term associations are learned better (this is harder) when the set of images is larger.
  • Why is this Important?

  • A nice aspect of doing RL on robots is that you can't get around the problems of noise, bad reward models, and generalization the way you can in simulation. A major problem right now: we know how to train specialist robots, but they generalize very badly even to mundane differences. So exploration, using raw sensory input, and continual improvement without supervision are needed.

  • Learning Reusable Models from Self-Supervision

  • A standard dichotomy is Model-free (Habitual) vs Model-based (Planning)

    How can a little bit of model-based knowledge help with planning?

    • this is important because in reality we have an infinity of choices and yet we only consider a small subset; how does that happen?
    • consideration set - things we usually want to consider for this task, feasible options, but very restricted
    • choice - the standard online value-based estimation using a model to pick the best thing for the context

    They show a bunch of experiments highlighting that people are able to separate these two tasks well. They find that the consideration set is generally based on a quick heuristic of pre-cached cases weighted by value. This means we think of things based on value, even if we need to eventually choose amongst that set according to something other than value maximization (e.g. choose your least favourite food, choose an item that satisfies some convoluted constraint).

  • What does “Possible” mean?

  • We finally wrapped up with a talk by Rich Sutton himself. He argues for a truly integrated science of mind that applies equally to biological and machine minds.

    "Intelligence == Mind" - Rich Sutton
  • The Reward Hypothesis is great but…

    • it reduces the importance of subgoals
    • it seems to be something that can’t change over time
  • Where are we?

    • The main principle of association between actions and rewards based on Pavlovian conditioning.
      Example: rats hear a tone that becomes associated with food

    • Prediction error:

      • if they already associate a sound with food, then when a light is added they don't learn about it (blocking)
      • they only learn when there are errors in prediction
      • this leads to causal learning; correlation isn't enough
    • Rats can learn to associate sound + light (sensory prediction). Later when they learn that sound leads to food then they’ll infer that light leads to food too. But the way they value the light and the sound themselves will differ!:

      • They will happily press a button to hear the bell even though they know they are full and don’t want food. They might even “enjoy” hearing the bell because it reminds them of food.

      • But the light, which is also associated by inference to food arriving, doesn’t hold any value. They won’t press the button to see the light.

      • The order they learn about the light+sound and about the food, makes a difference here too!

    • C. Klein, Oksana., Klinger

      • human factors and ergonomics
        • how do people really make decisions
        • expert decision makers, so quite rational, but with lots of knowledge built up ahead of time; time sensitive, huge messy domains
      • Some conclusions:
        • expert decision makers often know what to do; they don't feel they really make a decision choice in the classical sense
        • if experience doesn't provide a solution then they give up on it
          and run a simulator forward in their head based on experience
          • tree search? MCTS? UCB?
    • Consequential choice Problems

      These are choices which are important and cannot be rolled back (eg. forest fire, finance)

    • Choice from Sampling

      Choices can be tried with no impact before a real consequential choice is made (eg. shopping for clothes)

    • 1. How do memory and intelligence affect this?

      • highly skilled people leave regardless
      • low skilled people give up and rely on the advised heuristic as time goes on
    • 2. Does practice under time pressure help?

      • No. People who learn with no time pressure learn better and perform better under time pressure
      • learn slow, play fast
      • learners who had more time are more willing to ignore simple heuristic advice once they master it
    • 3. Does more practice help?

      • Practice doesn’t help under time pressure but in slow learning it helps a lot to be robust.
    • Description-Experience Gap

      (Hertwig, 2004)

      • the experiment is like a simplified Multi-armed Bandit task carried out on people
      • They discovered an interesting effect in human decision making
    • Batch Policy Optimization

      (slide photo: 20190707_165441)

    • Policy Evaluation

    • A Big Idea

    • Moving the Goalpost

    • Why does Distributional RL Work?

      • helps maintain stability in deep RL for complex domains
      • aids representation learning by providing a stronger signal about the structure of the domain
      • it helps improve generalization error; that is, learning on some states can work well in very different unseen states, which can be shown to really help performance in RL.
    • The Virtuous Circle of RL

      • see the “Neuroscience-Inspired Artificial Intelligence” article by Demis Hassabis and colleagues at DeepMind.
      • The communication between cognitive science, psychology, neuroscience and CS/AI helps us all to learn useful things and contribute to the overall truth
    • Further Reading

      • Marc G. Bellemare, Will Dabney, Rémi Munos. 2017. https://arxiv.org/abs/1707.06887
      • DeepMind 2019: DRL algorithms can be decomposed as the combination of some statistical estimator and a method for imputing a return distribution consistent with that set of statistics. https://deepmind.com/research/publications/statistics-and-samples-distributional-reinforcement-learning/

    • An advisor app to help encourage sedentary people at risk of heart attacks.
      Very customized to personal schedule and context.

    • They tried to build an RL agent to push messages but not have the person get used to it.
      This problem has interesting challenges:

      • very noisy data, unknown variables, complex rewards
      • delayed penalties (over-sensitization)
      • in the short term all actions look positive
    • Reasoning at multiple timescales

      • We can remember and reason about different levels of detail over time.
      • why do we do it? The hope is that high level plans are more reusable.
    • How do we define and measure generalization? How do we encode inductive biases?

    • Option Transition Model

      • (Oudeyer, IEEE TEC 2017)

        • Build an interestingness metric using the change in gradient (learning progress) during learning for many points in state-action space.
        • Choose actions to try where this metric has high values.
        • Hierarchically divide the space into distinct regions by clustering on the metric.
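
        A rough sketch of learning-progress-based sampling in this spirit (my own simplification; the region splitting/clustering step is omitted and all parameters are assumptions):

        ```python
        import random

        # Each region keeps a history of prediction errors; "learning progress" is the
        # recent decrease in average error, and regions are sampled in proportion to it.
        window = 20
        errors = {"region_a": [], "region_b": []}   # hypothetical regions of state-action space

        def learning_progress(errs):
            if len(errs) < 2 * window:
                return 1.0                           # optimistic default for unexplored regions
            older = sum(errs[-2 * window:-window]) / window
            recent = sum(errs[-window:]) / window
            return max(older - recent, 0.0)          # how much the error dropped recently

        def choose_region():
            lp = {r: learning_progress(e) for r, e in errors.items()}
            total = sum(lp.values()) or 1.0
            pick = random.random() * total
            for region, value in lp.items():
                pick -= value
                if pick <= 0:
                    return region
            return region

        def report_error(region, prediction_error):
            errors[region].append(prediction_error)
        ```
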
      • They also explore the use of Inverse Reinforcement Learning

      • (Forestier et al. 2017) have a great video of robot learning.

      • Prediction Error in the Brain

        • they found correlations between activations in part of the brain related to prediction error
        • they performed a user study on the trust game where people play with agents who have trustworthiness probabilities rather than actual people
        • they scanned people’s brains during playing the game to see what lights up in the brain
        • result: values are higher when interacting with someone you trust
      • What makes someone trustworthy?

        • Recent study on Trust around the world, lost wallet game Cohn et al, 2019, Science
          • that study found that people are more likely to return the wallet if there was a lot of money in it.
          • Economists think there is no existing theory to explain this
        • Psychological Game Theory can explain this
          • one agent has second order theory about other player and they converge on a solution
          • this involves thinking about the disappointment the other person is going to experience, and this is partially valid. If the money is higher, obviously this weight is higher.
          • TODO: see image
          • They also find support for this by looking at brain scans
      • Prediction Game

        • the goal is for the player to learn the likely strategy of another player from observed actions
        • people can predict very quickly what the players they are watching will do
        • an RL agent optimizing directly doesn't do well here
        • but IRL does better than RL here
        • IRL does not perform as well as humans
        • Also, this domain gave the IRL learner a lot of state representation information which humans don’t have
      • Results

        People can perform optimally, but once the set size gets to 5 or 6 they start making 20-30% errors. But an RL-only model of memory doesn't show these problems.

      • Adding a Separate Working Memory Process

        RL alone isn’t enough, so we need some kind of mixture model. Once this is added then the model corresponds closely to human performance.

      • So What’s Happening?

        WM blocks RL by contributing to the reward prediction error and helping improve it; there is a closed loop between them.
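
        A toy version of such a mixture (a simplified sketch in the spirit of WM+RL models like Collins and Frank's; the capacity and weighting choices here are my assumptions):

        ```python
        import math

        # Working memory (WM) stores the last rewarded action for up to `capacity`
        # stimuli; RL learns Q-values slowly; choices mix the two policies, with less
        # weight on WM as the set size exceeds capacity.
        capacity, alpha, beta = 3, 0.1, 5.0
        Q, wm = {}, {}

        def choice_probs(stimulus, actions, set_size):
            w = min(1.0, capacity / set_size)                 # reliance on WM
            q = [Q.get((stimulus, a), 0.0) for a in actions]
            m = max(q)
            rl = [math.exp(beta * (v - m)) for v in q]
            rl = [v / sum(rl) for v in rl]                    # softmax over Q-values
            if stimulus in wm:
                wm_p = [1.0 if wm[stimulus] == a else 0.0 for a in actions]
            else:
                wm_p = [1.0 / len(actions)] * len(actions)
            return [w * pw + (1 - w) * pr for pw, pr in zip(wm_p, rl)]

        def update(stimulus, action, reward):
            q = Q.get((stimulus, action), 0.0)
            Q[(stimulus, action)] = q + alpha * (reward - q)  # slow RL update
            if reward > 0:
                wm[stimulus] = action                         # fast one-shot WM storage
                while len(wm) > capacity:                     # crude capacity limit
                    wm.pop(next(iter(wm)))
        ```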

      • Keeping an open mind about how different observed brain systems could contribute to and interact with learning. Example domains help to motivate this:

        • Schizophrenia
          • They can show that learning is impaired in schizophrenia patients and that this is due largely to WM problems separate from RL; this is only visible if WM is modelled explicitly.
        • Age related learning
          • they found that learning rate seems to increase with age to compensate for a decrease in working memory
      • There has been a lot of success in cognitive science and neuroscience in grabbing useful ideas from CS RL to do more experiments. She encourages the CS RL community to grab more ideas from neuroscience and try them out in computational models; working memory, for example, seems to have benefits beyond RL alone.

      • Their approach (Visual Foresight) is to do two things

        • Learn general policies through unsupervised exploration
        • Learn fundamental physics and dynamics from pre-existing videos
      • Their state representation is the full images that the robot sees; this includes all the pixels and the view of the environment. So their dynamics prediction model needs to be a Recurrent Neural Network that predicts video images.

      • We usually mean practical when we say possible, not whether it is physically conceivable. They show some interesting experimental results showing that immoral options are not immediately added to these consideration sets, but only become available under more deliberation. So we don't even consider these scenarios, to save time.

      • One final result is that this indicates that

      • We start with goals :

        • first we learn value functions based on expected future reward
        • we learn policies
      • Next we need to look at subgoals:

        • learn about state - eg. state representations
        • skills - eg. options
        • models -
      • These are all subproblems that are not essentially about reward or value. How do we learn them or represent them generally?

      • He thinks play is an important way to look at it.

      • Three key open questions about subproblems in RL:

        1. Q1 - what should subproblems be
        2. Q2 - where do they come from
        3. Q3 - how do subproblems help the main problem (ie. how to subgoals help the main reward maximization task)
          • Learning to solve subproblems could help with shaping better state representations, behaviour patterns that are more coherent
          • It also allows high level planning because now you have a model of what happens after you've achieved the subgoal
      • Some settled issues:

        • subproblems are a reward in themselves and may be terminal, where planning stops
        • solving a subproblem can be done with an option, a separate subpolicy
      • (slide photo: 20190710_113310)

      • (slide photo: 20190710_114403)

      • That’s All, it was fun. See you in two years at Brown!

      • Description:

        • you tell them the probabilities
        • people overweight unlikely situations, Prospect Theory
      • Experience:

        • when people build their estimates from actual experience
        • then they don't behave that way because they under-weight unlikely situations
        • So Prospect Theory only seems to apply to how people use probabilities that are described to them
      • TODO: Importance Sampling for RL Policy Evaluation

      • (Precup, 2000) ⇒

      • weighting the trajectories using just a product of the policy probability ratios;
        this has high variance
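
        Concretely, the ordinary trajectory-level importance sampling estimator (standard form, written out for reference) weights each observed return by a product of policy-probability ratios, which is what makes the variance blow up with the horizon:

        ```latex
        \hat{V}(\pi_e) = \frac{1}{n} \sum_{i=1}^{n} \left( \prod_{t=0}^{T_i - 1} \frac{\pi_e(a^{i}_{t} \mid s^{i}_{t})}{\pi_b(a^{i}_{t} \mid s^{i}_{t})} \right) G_i,
        \qquad G_i = \sum_{t=0}^{T_i - 1} \gamma^{t}\, r^{i}_{t}
        ```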

      • Stationary Importance Sampling (Hallak and Mannor, 2017)

        • This is a new method that has lower variance than original approach
          but is still hard to estimate.
      • If your state representation is too simple for the domain then the problem is no longer Markov. This is because you’ll need to remember past states to compensate for the lack of representation. This is the flip side of the usual idea that everything is Markovian as long as you have rich enough features in your state description.

      • So, if your models are bad then picking the MLE for the dynamics isn't a good idea.
        Even using importance sampling has problems because it can have very high variance even though it isn't biased.

      • Unlike in supervised learning these are really hard in RL:

        • structural risk minimization

        • cross validation

      • (slide photo: 20190707_174155)

      • There are promising methods for dealing with this in non i. i. d. domains but it’s hard.

      • Finding the Best Policy in a class

        • they have a result for doing this using an advantage function
        • restricted to domains with a single “when to act” decision (eg. when to start a drug treatment, when to sell a stock)
        • one advantage of this is interpretability since policies are more related to the actual human experience.
        • this is an MDP where we take options instead of actions
        • provide a distribution over all states you could end up in
        • however still need to learn the per-step Beta parameter
        • they show how you can define the value function and Bellman equation for Beta then they can solve this with policy gradients.
      • Constraints on Options: The goal is to encourage options to be simpler. Minimize the entropy of the final option model.

      • Termination critic: They use the Actor-Critic approach but have a critic for the termination rule in addition to the policy.

        • doing policy gradients while using old data
        • traditional approach for this leads to very high variance
        • use importance sampling to reweight old trajectories and still converge
      • (slide photo: 20190707_174918)
        (Liu, Swaminathan UAI, 2019)

      {"cards":[{"_id":"5d2aab5d81e4cc0495d6978f","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18263014,"position":1.5,"parentId":null,"content":"# Reinforcement Learning and Decision Making Conference\nThe fourth iteration of this conference was held in Montreal July 7-10, 2019."},{"_id":"5d2aab5d81e4cc0495d69790","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18263059,"position":1,"parentId":"5d2aab5d81e4cc0495d6978f","content":"## Basic Information\n- Conference Website: http://rldm.org/\n- Conference Brochure: http://otto.lab.mcgill.ca/temp/RLDM2019ProgramBrochure.pdf\n- A PDF of all abstracts accepted: http://rldm.org/papers/abstracts.pdf"},{"_id":"5d2aab5d81e4cc0495d69793","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262913,"position":2,"parentId":null,"content":"# Neuro Into Tutorial\n\n*Melissa Sharpe , UCLA*"},{"_id":"5d2aab5d81e4cc0495d69794","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262707,"position":1,"parentId":"5d2aab5d81e4cc0495d69793","content":"##Associative Tasks\n\nThese are well established theories on associations by tasks to brain. She uses it to test computational questions about behaviour."},{"_id":"5d2aab5d81e4cc0495d69795","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262708,"position":2,"parentId":"5d2aab5d81e4cc0495d69793","content":"##Associative learning"},{"_id":"5d2aab5d81e4cc0495d69796","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262709,"position":1,"parentId":"5d2aab5d81e4cc0495d69795","content":"The main principle of association between actions and rewards based on Pavlovian conditioning.\n**Example: Rats** sound tone associated with food"},{"_id":"5d2aab5d81e4cc0495d69797","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262710,"position":2,"parentId":"5d2aab5d81e4cc0495d69795","content":"**Prediction error:**"},{"_id":"5d2aab5d81e4cc0495d69798","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262711,"position":3,"parentId":"5d2aab5d81e4cc0495d69795","content":"- if they associate soured with food turds, add light they don't learn it\n- they only learn when their are error, in predictions\n- leads to causal learning, correlation isn't enough"},{"_id":"5d2aab5d81e4cc0495d69799","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262712,"position":3,"parentId":"5d2aab5d81e4cc0495d69793","content":"## Two forms of learning (conditional reinforcement):\n\n- value of tone itself : like watching cooking show\n- value of a *causal outcome* of the reward (food):\n - this shows up after the signal (tone)\n - This explains some of the irrational behaviour of drug users. Signals"},{"_id":"5d2aab5d81e4cc0495d6979a","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262713,"position":4,"parentId":"5d2aab5d81e4cc0495d69793","content":"##Neutral Associations"},{"_id":"5d2aab5d81e4cc0495d6979b","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262714,"position":1,"parentId":"5d2aab5d81e4cc0495d6979a","content":"Rats can learn to associate sound + light (sensory prediction). Later when they learn that sound leads to food then they'll *infer that light leads to food too*. But the way they *value the light and the sound themselves* will differ!:"},{"_id":"5d2aab5d81e4cc0495d6979c","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262715,"position":2,"parentId":"5d2aab5d81e4cc0495d6979a","content":"- They will happily press a button to hear the bell even though they know they are full and don't want food. They might even \"enjoy\" hearing the bell because it reminds them of food.\n\n- But the light, which is also associated by inference to food arriving, doesn't hold any value. 
They won't press the button to see the light.\n\n- The order they learn about the light+sound and about the food, makes a difference here too!"},{"_id":"5d2aab5d81e4cc0495d6979d","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262716,"position":5,"parentId":"5d2aab5d81e4cc0495d69793","content":"## Questions\n\n- so are there completely diffenet learning (conditioning)\nsome types of learning associate value in itself and others are just about prediction and causation?\n- are these results arising from particular brain structures or an algorithm? *Is there even a difference?*"},{"_id":"5d2aab5d81e4cc0495d6979e","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262915,"position":3.125,"parentId":null,"content":"# Dopamine Prediction Error\n*(...)*"},{"_id":"5d2aab5d81e4cc0495d6979f","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262718,"position":1,"parentId":"5d2aab5d81e4cc0495d6979e","content":"(Schultz 1997 papers)* - the discovery of it is really interesting through the 90's as a story of scientific discovery"},{"_id":"5d2aab5d81e4cc0495d697a0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262719,"position":2,"parentId":"5d2aab5d81e4cc0495d6979e","content":"Dopamine neurons encode suprise and cause learning:"},{"_id":"5d2aab5d81e4cc0495d697a1","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262720,"position":3,"parentId":"5d2aab5d81e4cc0495d6979e","content":"- once the association is learned then when it predicts reward the dopamine fires anyways. \n- If no reward also encodes dissapointment\n- They linked this to TD(0) learning\n- So dopamine is temporal difference value"},{"_id":"5d2aab5d81e4cc0495d697a2","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262721,"position":4,"parentId":"5d2aab5d81e4cc0495d6979e","content":"Optogenetics are used now to validate this"},{"_id":"5d2aab5d81e4cc0495d697a3","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262722,"position":5,"parentId":"5d2aab5d81e4cc0495d6979e","content":"- then can explicitly send TD errors into dopamine neurons and see effect\n- it's like adding suprise even when they aren't suprised , or increasing the reward even though its the same food\n- if they kill off dopamine entirely the rats do still learn, but it's reduced\n- they also show it encourages sensory specific situations, so Q(s, a)"},{"_id":"5d2aab5d81e4cc0495d697a4","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262723,"position":6,"parentId":"5d2aab5d81e4cc0495d6979e","content":"There are subtle differences in how dopamine effects associating value to stimulus and learning causal relations"},{"_id":"5d2aab5d81e4cc0495d697a5","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262724,"position":7,"parentId":"5d2aab5d81e4cc0495d6979e","content":"- the timing of when the dopamine armies seems to matter\n- so small changes in when and how much d. p or reward is recieved can lead to huge differences."},{"_id":"5d2aab5d81e4cc0495d697a6","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262893,"position":4.5,"parentId":null,"content":"# Dynamic Decisions in Humans\n\n*Cleotilde Gonzalez , CMU*"},{"_id":"5d2aab5d81e4cc0495d697a7","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262726,"position":1,"parentId":"5d2aab5d81e4cc0495d697a6","content":"## Irrationality : how should we define it?\n\nWhenever we do not choose the action that maximizes our reward based on or model of the world-values. 
Framing bias is relevant."},{"_id":"5d2aab5d81e4cc0495d697a8","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262727,"position":2,"parentId":"5d2aab5d81e4cc0495d697a6","content":"## Naturalistic Decision Making"},{"_id":"5d2aab5d81e4cc0495d697a9","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262728,"position":1,"parentId":"5d2aab5d81e4cc0495d697a8","content":"*C. Klein, Oksana., Klinger*"},{"_id":"5d2aab5d81e4cc0495d697aa","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262729,"position":2,"parentId":"5d2aab5d81e4cc0495d697a8","content":"- human factors and ergonomics\n - how-do people really make decisions\n - expert decision makers, so quite rational but lots of knowledge ahead\n time sensitive, huge messy domains\n- Some conclusions:\n - expert decision amkers often *know* what to do, they don't feel they really make a decision choice in the classical sense\n - if experience doesn't provide solution then they give up\n and run forward a simulator in their head based on experience\n - tree search? MCTS? UCB?"},{"_id":"5d2aab5d81e4cc0495d697ab","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262730,"position":3,"parentId":"5d2aab5d81e4cc0495d697a6","content":"## Dynamic Decision Making (DDM)\n\nThey use the idea of decision makers who are experts in complex noisy domains but carried out in lab experiments with more control. This includes closed loop Decision making\n\n**Properties of DDM:**\n\n- utility can be dependent on the\n- decisions overtime are interdependent\n- limited time and cognitive resources\n- delayed feedback\n\n**Two Types:**\n\n- choice - maximize total reward\n- control-maintain system balance"},{"_id":"5d2aab5d81e4cc0495d697ac","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262731,"position":1,"parentId":"5d2aab5d81e4cc0495d697ab","content":"### Consequential choice Problems\n\nThese are choices which are important and cannot be rolled back (eg. forest fire, finance)"},{"_id":"5d2aab5d81e4cc0495d697ad","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262732,"position":2,"parentId":"5d2aab5d81e4cc0495d697ab","content":"### Choice from Sampling\n\nChoices can be tried with no impact before a real consequential choice is made (eg. shopping for clothes)"},{"_id":"5d2aab5d81e4cc0495d697ae","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262733,"position":4,"parentId":"5d2aab5d81e4cc0495d697a6","content":"##Control\n\nThe goal is to maintain a particular state of a 'stock' (eg. weight, temperature in environment)"},{"_id":"5d2aab5d81e4cc0495d697af","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262734,"position":5,"parentId":"5d2aab5d81e4cc0495d697a6","content":"##Post office / Water Flow Microworld\n\n*(Gonzales, Hunan Factors, 2004)*\nThey get people to play these very challenging games\nthen analyse their strategies and heuristics they use"},{"_id":"5d2aab5d81e4cc0495d697b0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262735,"position":6,"parentId":"5d2aab5d81e4cc0495d697a6","content":"## Questions Arising from their Work"},{"_id":"5d2aab5d81e4cc0495d697b1","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262736,"position":1,"parentId":"5d2aab5d81e4cc0495d697b0","content":"**1. 
How do memory and intelligence affect this?**"},{"_id":"5d2aab5d81e4cc0495d697b2","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262737,"position":2,"parentId":"5d2aab5d81e4cc0495d697b0","content":"- highly skilled people leave regardless\n- low skilled people give up an rely on 'advised heuristic as time goes on"},{"_id":"5d2aab5d81e4cc0495d697b3","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262738,"position":3,"parentId":"5d2aab5d81e4cc0495d697b0","content":"**2. Does practice under time pressure help?**"},{"_id":"5d2aab5d81e4cc0495d697b4","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262739,"position":4,"parentId":"5d2aab5d81e4cc0495d697b0","content":"- No . People who learn with no time pressure learn better and perform better under time pressure\n- learn slow, play fast\n- learner who had more time are more willing to ignore simple heuristic advice once they master it"},{"_id":"5d2aab5d81e4cc0495d697b5","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262740,"position":5,"parentId":"5d2aab5d81e4cc0495d697b0","content":"**3. Does more practice help?**"},{"_id":"5d2aab5d81e4cc0495d697b6","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262741,"position":6,"parentId":"5d2aab5d81e4cc0495d697b0","content":"- Practice doesn't help under time pressure but in slow learning it helps a lot to be robust."},{"_id":"5d2aab5d81e4cc0495d697b7","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18263233,"position":7,"parentId":"5d2aab5d81e4cc0495d697a6","content":"##Instance Based Learning\n\n![20190707_160550](/Users/mcrowley/Downloads/rldm2019/20190707_160550.jpg)\n*(Gordon, Lerch and Le biene, 2003)*"},{"_id":"5d2aab5d81e4cc0495d697b8","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262743,"position":8,"parentId":"5d2aab5d81e4cc0495d697a6","content":"## ACT-R \n\n*(Andersons Lebiere, 1998)* \n\nA model for combining Memory and Symbolic representations and how it happens in the human mind."},{"_id":"5d2aab5d81e4cc0495d697b9","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262744,"position":1,"parentId":"5d2aab5d81e4cc0495d697b8","content":"### Description-Experience Gap\n\n*(Hertwig, 2004)*\n\n- the experiment is like a simplified Multi-armed Bandit task carried out on people\n- Theydiscovered an interesting effect in human decision making"},{"_id":"5d2aab5d81e4cc0495d697ba","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262745,"position":1,"parentId":"5d2aab5d81e4cc0495d697b9","content":"#### Description:\n\n- you tell them the probabilities\n- people overweight unlikely situations, Prospect Theory"},{"_id":"5d2aab5d81e4cc0495d697bb","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262746,"position":2,"parentId":"5d2aab5d81e4cc0495d697b9","content":"#### Experience:\n\n- when people build their estimates from actual experience\n- then they don't behave that way because ' under weight unlikely situations\n- So Prospect Theory only seems to apply to how people apply probabities that are described to them"},{"_id":"5d2aab5d81e4cc0495d697bc","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262747,"position":9,"parentId":"5d2aab5d81e4cc0495d697a6","content":"##IBLT\n\n- create new meta states, \"instances\", to evaluate based on multiple memorized events that are similar to the current situation\n- they have a Python library to define their models"},{"_id":"5d2aab5d81e4cc0495d697bd","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262924,"position":5.5,"parentId":null,"content":"# Counterfactual RL\n*[Emma 
Brunskill](https://cs.stanford.edu/people/ebrun/)*"},{"_id":"5d2aab5d81e4cc0495d697bf","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262750,"position":2,"parentId":"5d2aab5d81e4cc0495d697bd","content":"*Also Check out her [awesome tutorial](awesome tutorial) on RL for the people which gives a great top-to-bottom perspective on the current state of RL.*"},{"_id":"5d2aab5d81e4cc0495d697c0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262918,"position":3,"parentId":"5d2aab5d81e4cc0495d697bd","content":"How do we focus is on treating learning concepts as RL?\nShe says **Counterfactual learning** is related to Batch learning and experience replay since both are learning based on old data.\nWe can't know what would have happened with different choices\n\n```\n \"The ability to imagine what would have hopped is critical to human intelligence.\" - Judea Pearl\n```"},{"_id":"5d2aab5d81e4cc0495d697c1","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262919,"position":1,"parentId":"5d2aab5d81e4cc0495d697c0","content":"## Batch Policy Optimization\n\n![20190707_165441](/Users/mcrowley/Downloads/rldm2019/20190707_165441.jpg)"},{"_id":"5d2aab5d81e4cc0495d697c2","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262753,"position":2,"parentId":"5d2aab5d81e4cc0495d697c0","content":"## Policy Evaluation"},{"_id":"5d2aab5d81e4cc0495d697c3","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262754,"position":1,"parentId":"5d2aab5d81e4cc0495d697c2","content":"TODO: Importune Sylyfr RL Policy Eva#"},{"_id":"5d2aab5d81e4cc0495d697c4","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262755,"position":2,"parentId":"5d2aab5d81e4cc0495d697c2","content":"(Preap, 2. ⇒"},{"_id":"5d2aab5d81e4cc0495d697c5","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262756,"position":3,"parentId":"5d2aab5d81e4cc0495d697c2","content":"weighting the trajectories using just a combination of the policy probabilities\nsee this, high variance"},{"_id":"5d2aab5d81e4cc0495d697c6","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262757,"position":4,"parentId":"5d2aab5d81e4cc0495d697c2","content":"*Stationary Importance Sampling (Challah and Manornor 2017)*"},{"_id":"5d2aab5d81e4cc0495d697c7","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262758,"position":5,"parentId":"5d2aab5d81e4cc0495d697c2","content":"- This is a new method that has lower variance than original approach\n but is still hard to estimate."},{"_id":"5d2aab5d81e4cc0495d697c8","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262759,"position":3,"parentId":"5d2aab5d81e4cc0495d697c0","content":"##Interesting Idea Related to Policy Optimization"},{"_id":"5d2aab5d81e4cc0495d697c9","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262760,"position":1,"parentId":"5d2aab5d81e4cc0495d697c8","content":"If your state representation is *too simple* for the domain then the problem is *no longer Markov*. This is because you'll need to remember past states to compensate for the lack of representation. 
This is the flip side of the usual idea that *everything is Markovian* as long as you have rich enough features in your state description."},{"_id":"5d2aab5d81e4cc0495d697ca","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262761,"position":2,"parentId":"5d2aab5d81e4cc0495d697c8","content":"So, if your models are bad then picking the MLE for the dynamics isn't a good idea\neven using importance sampling has problems because it can have very high variance even though it isn't biased."},{"_id":"5d2aab5d81e4cc0495d697cb","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262762,"position":4,"parentId":"5d2aab5d81e4cc0495d697c0","content":"##A Big Idea"},{"_id":"5d2aab5d81e4cc0495d697cc","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262763,"position":1,"parentId":"5d2aab5d81e4cc0495d697cb","content":"Unlike in supervised learning these are really hard in RL:"},{"_id":"5d2aab5d81e4cc0495d697cd","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262764,"position":2,"parentId":"5d2aab5d81e4cc0495d697cb","content":"- structural risk minimization\n\n- cross validation"},{"_id":"5d2aab5d81e4cc0495d697ce","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262765,"position":3,"parentId":"5d2aab5d81e4cc0495d697cb","content":"![20190707_174155](/Users/mcrowley/Downloads/rldm2019/20190707_174155.jpg)"},{"_id":"5d2aab5d81e4cc0495d697cf","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262766,"position":4,"parentId":"5d2aab5d81e4cc0495d697cb","content":"There are promising methods for dealing with this in non i. i. d. domains but it's hard."},{"_id":"5d2aab5d81e4cc0495d697d0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262767,"position":5,"parentId":"5d2aab5d81e4cc0495d697c0","content":"## Moving the Goalpost"},{"_id":"5d2aab5d81e4cc0495d697d1","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262768,"position":1,"parentId":"5d2aab5d81e4cc0495d697d0","content":"### Direct Batch Policy Search"},{"_id":"5d2aab5d81e4cc0495d697d2","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262769,"position":1,"parentId":"5d2aab5d81e4cc0495d697d1","content":"- doing policy gradients while using old data\n- traditional approach for this leads to very high variance\n- use importance sampling to reweight old trajectories and still converge"},{"_id":"5d2aab5d81e4cc0495d697d3","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262770,"position":2,"parentId":"5d2aab5d81e4cc0495d697d1","content":"![20190707_174918](/Users/mcrowley/Downloads/rldm2019/20190707_174918.jpg)\n *(Liu, Swaminathan UAI, 2019)*"},{"_id":"5d2aab5d81e4cc0495d697d4","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262771,"position":2,"parentId":"5d2aab5d81e4cc0495d697d0","content":"###Finding the Best Policy in a class\n\n- they have a result for doing this using an advantage function\n- restricted to domains with a single \"when to act = decision (eg. 
when to start a drug treatment, when to sell a stock)\n- one advantage of this is interpretability since plicies are more related to the actual human experience."},{"_id":"5d2aab5d81e4cc0495d697d5","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262875,"position":7.5,"parentId":null,"content":"# Distributional Reinforcement Learning\n\n*Will Dabney, DeepMind*"},{"_id":"5d2aab5d81e4cc0495d697d6","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262773,"position":1,"parentId":"5d2aab5d81e4cc0495d697d5","content":"## History\n\n- discovering TD leraning for AI methods\n- then discoering that this looks similar to what happens in the brain with dopamine neurons\n- estaimtes of value at all states udpated in direction of improving prediction error"},{"_id":"5d2aab5d81e4cc0495d697d7","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262774,"position":2,"parentId":"5d2aab5d81e4cc0495d697d5","content":"## Distributional TD Learning\n\n- traditional TD learning updates for all states (or neurons) with the same scale\n- but in DTD they weight the updates using the local distribution of rewards somehow\n- switch from mean value update to distributional value update\n- they find that it seems like learning the distribution helps to learn a better representation"},{"_id":"5d2aab5d81e4cc0495d697d8","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262775,"position":3,"parentId":"5d2aab5d81e4cc0495d697d5","content":"## Validation with Simple Experimental Tasks with Animals\n\n- animals receive one of seven amounts of food, with a prob distribution\n- some animals get a signal associated with each case\n- traditional TD learning: if reward is above average the positive learning happens\n- but what about the distribution for each neuron / state?\n- looking at experimental data it seems to align with what we'd expect from a distributional model rather than the old mean approach"},{"_id":"5d2aab5d81e4cc0495d697d9","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262776,"position":8,"parentId":null,"content":"# Count-based Exploration with Successor Representation\n\n*Marlos Machado, Michael Bowling*"},{"_id":"5d2aab5d81e4cc0495d697da","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262777,"position":1,"parentId":"5d2aab5d81e4cc0495d697d9","content":"## Successor Representation\n\n- Function approximatoin requires that we really see examples of the different state we want values of. 
If we never see then then we can't learn it.\n- one simple wayt o bootstrap this is useing proximity between states.\n- but proximity can break in spatial domains or complex state spaces.\n- what we really want is to talk about how many steps it would take to get between two 'nearby' states rahter than their euclidean distance"},{"_id":"5d2aab5d81e4cc0495d697db","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262778,"position":2,"parentId":"5d2aab5d81e4cc0495d697d9","content":"## Counting the number of visits to a state along a trajectory\n\n- this can be estimated with TD learning\n- there is a good way to do function approximation on this as well\n- The Success Representation (SR) naturally comes out of a the dual approach to dynamic programing for RL\n- there is also some evidence taht it matches some of what is happening in the hippocampus\n- this can be seen as an alternative to optimism under uncertainty used in R-Max and others"},{"_id":"5d2aab5d81e4cc0495d697dc","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262911,"position":3,"parentId":"5d2aab5d81e4cc0495d697d9","content":"## Updating Existing Algorithms\n\n- add the L2 norm of the SR as an exploration bonus to standard SARSA\n- **intuition**: if some state has not been visited much before it will get a bonus to encourage exploring it\n- huge improvement on SARSA\n- also works to add it to model based algorithms like $E^3$, R-MAX etc"},{"_id":"5d2aab5d81e4cc0495d697dd","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262780,"position":4,"parentId":"5d2aab5d81e4cc0495d697d9","content":"## Function Approximation\n\n- adding this idea to DQN seems to help as well especially for domains for random exploration doesn't work well"},{"_id":"5d2aab5d81e4cc0495d697de","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262781,"position":5,"parentId":"5d2aab5d81e4cc0495d697d9","content":"## Result\n\nUsing the norm of the successor representation encodes state visitition counts and helps to explore faster."},{"_id":"5d2aab5d81e4cc0495d697df","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262927,"position":9,"parentId":null,"content":"# Directions in Distributional Learning\n\n*Will Dabney, Deep Mind*"},{"_id":"5c5541e80815f3bfd700017b","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262930,"position":2,"parentId":"5d2aab5d81e4cc0495d697df","content":"Another talk on this approach in general.\n\n- Distributed RL says we should learn the true distribution of the values.\n- The means can be used directely to update value estimates using the bellman equation.\n- But this doesn't work if you aren't using the mean (moments) because there may be multiple distributions that are consistent with that mean.\n- So the big question is how to best impute the right distribution to explain the experiences. \n- The way they've approached it is to fix the representation or projection to a consistent estimator and preserves the mean even though it's not necessarily the best one."},{"_id":"5d2aab5d81e4cc0495d697e0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262931,"position":3,"parentId":"5d2aab5d81e4cc0495d697df","content":"## The Big idea \n\nTreat the return itself as a random variable. 
this will have the same recursive structure as the bellman equation except it is the relationship between a distribution over rewards and the valeu distribtuion."},{"_id":"5d2aab5d81e4cc0495d697e1","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262784,"position":1,"parentId":"5d2aab5d81e4cc0495d697e0","content":"### Why does Distributional RL Work?\n\n- helps maintain stability for complex domains for deep RL\n- aids representation learning by providing a stronger signal about the structure of the domain\n- it helps with improving generalization error, that is learning on some states can work well in very different unseen states, which can be shown to really help with improving performance in RL."},{"_id":"5d2aab5d81e4cc0495d697e2","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262785,"position":2,"parentId":"5d2aab5d81e4cc0495d697e0","content":"### The Virtuous Circle of RL\n\n- see *\"Neuro-science Inspired AI\"* article by *Demis Hassabis of Deep Mind*.\n- The communication between cognitive science, psychology, neuroscience and CS/AI helps us all to learn useful things and contribute to the overall truth"},{"_id":"5d2aab5d81e4cc0495d697e3","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262786,"position":3,"parentId":"5d2aab5d81e4cc0495d697e0","content":"### Further Reading \n\n- *Marc G. Bellemare, Will Dabney, Rémi Munos. 2017.*\n https://arxiv.org/abs/1707.06887\n- Deep Mind 2019 Arxiv:\n - DRL algorithms can be decomposed as the combination of some statistical estimator and a method for imputing a return distribution consistent with that set of statistics\n - https://deepmind.com/research/publications/statistics-and-samples-distributional-reinforcement-learning/"},{"_id":"5d2aab5d81e4cc0495d697e4","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262934,"position":10,"parentId":null,"content":"# Substance Use Disorder\n*Anna Konova*"},{"_id":"5d2aab5d81e4cc0495d697e6","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262789,"position":2,"parentId":"5d2aab5d81e4cc0495d697e4","content":"Drug users seem to be risk seeking so their team modelled ambiguity as *risk tolerance* separately to explain people's varying values for money and drugs ambiguity tolerance. 
They founds this explains ongoing drug use better than risk tolerance."},{"_id":"5d2aab5d81e4cc0495d697e7","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262940,"position":11,"parentId":null,"content":"# Hyperbolic Discounting\n*William Fedus, Yoshua Bengio — Google Brain*"},{"_id":"5d2aab5d81e4cc0495d697e9","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262912,"position":2,"parentId":"5d2aab5d81e4cc0495d697e7","content":"- The standard $\\gamma^t$ discount factor is an exponential discounting\n- $\\frac{1}{1+kt}$ is a **hyperbolic discount**\n- *(Souzo, 1998)* use survival (t) rather than $\\gamma^t$ \n - the probability of surviving until timestep t\n - we can derive standard $\\gamma^t$ from this for a domain with a fixed risk of dying at each step\n- but if the hazard rate depends a state we get other discount functions\n- they simply the use of this by training the agent on multiple time horizons as an\n auxiliary task\n- this is part of a larger discussion in RL that the common assumptions about the du't work"},{"_id":"5d2aab5d81e4cc0495d697ea","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262943,"position":12,"parentId":null,"content":"# Heart Health with RL\n\n*Susan Murphy*"},{"_id":"5c5540cc0815f3bfd700017c","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262949,"position":0.5,"parentId":"5d2aab5d81e4cc0495d697ea","content":"**Goal:** Promote behaviour changes on taking medication, reducing addiction etc.\n\nExisting health support apps take two main forms:\n\n- *pull* : just info, depends on user\n- *push*: deliver intervention when needed"},{"_id":"5d2aab5d81e4cc0495d697eb","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262794,"position":1,"parentId":"5d2aab5d81e4cc0495d697ea","content":"## Heart steps app"},{"_id":"5d2aab5d81e4cc0495d697ec","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262795,"position":1,"parentId":"5d2aab5d81e4cc0495d697eb","content":"An advisor app to help encourage sedentary people at risk of heart attacks.\nVery customized to personal schedule and context."},{"_id":"5d2aab5d81e4cc0495d697ed","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262796,"position":2,"parentId":"5d2aab5d81e4cc0495d697eb","content":"They tried to build an RL agent to push messages but not have the person get used to it.\nThis problem has interesting challenges:"},{"_id":"5d2aab5d81e4cc0495d697ee","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262797,"position":3,"parentId":"5d2aab5d81e4cc0495d697eb","content":"- very noisy data, unknown variables, complex rewards\n- delayed penalties (over seasitivation)\n- immediately all actions are positive"},{"_id":"5d2aab5d81e4cc0495d697ef","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262798,"position":2,"parentId":"5d2aab5d81e4cc0495d697ea","content":"## Questions:\n\nCan they reuse previous trajectories?"},{"_id":"5d2aab5d81e4cc0495d697f0","treeId":"5cb8f20afc73c7a1eb0002aa","seq":18262799,"position":13,"parentId":null,"content":"# Termination Critic\n\n*Anna Harutyunyan, Doina Recep, Remi Munos, et. al. 
# Heart Health with RL

*Susan Murphy*

**Goal:** promote behaviour change around taking medication, reducing addiction, etc.

Existing health support apps take two main forms:

- *pull*: just information, depends on the user
- *push*: deliver an intervention when needed

## Heart Steps app

An advisor app to help encourage sedentary people at risk of heart attacks. Very customized to the person's schedule and context.

They tried to build an RL agent to push messages without the person getting used to them. This problem has interesting challenges:

- very noisy data, unknown variables, complex rewards
- delayed penalties (over-sensitization)
- immediately, all actions look positive

## Questions

Can they reuse previous trajectories?

# Termination Critic

*Anna Harutyunyan, Doina Precup, Rémi Munos, et al. (DeepMind)*

## Temporal Abstraction

Reasoning at multiple timescales.

- We can remember and reason about different levels of detail over time.
- Why do we do it? The hope is that high-level plans are more reusable.

How do we define and measure generalization? How do we encode inductive biases?

## Options

**option ($O$) = policy ($\pi$) + termination condition ($\beta$)**

Traditionally the policy is the focus, but the termination condition is optimizing the same objective. Biases are added in to encourage options to last longer.

**Their idea:** a separate termination rule focused on when to end the option entirely.

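For readers less familiar with the options framework, here is a minimal sketch of what "policy + termination condition" means operationally; this is generic call-and-return execution under an assumed `env`/`option` interface, not the paper's algorithm:

```python
import random

def run_option(env, state, option, gamma=0.99):
    """Follow the option's policy until its termination condition beta fires.

    `option.policy(s)` returns an action, `option.beta(s)` a termination
    probability in [0, 1]; `env.step(a)` -> (next_state, reward, done) is a
    hypothetical environment interface.
    """
    total_reward, discount, duration = 0.0, 1.0, 0
    done = False
    while not done:
        action = option.policy(state)
        state, reward, done = env.step(action)
        total_reward += discount * reward
        discount *= gamma
        duration += 1
        if random.random() < option.beta(state):  # termination condition fires
            break
    return state, total_reward, duration
```

The contribution described below is about how `beta` itself should be learned, rather than treating it as a fixed or hand-biased part of the option.
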
### Option Transition Model

- this is an MDP where we take options instead of actions
- it provides a distribution over all the states you could end up in
- however, we still need to learn the per-step $\beta$ parameter
- they show how to define the value function and Bellman equation for $\beta$, which they can then solve with policy gradients

**Constraints on options:** the goal is to encourage options to be simpler; minimize the entropy of the final option model.

**Termination critic:** they use the *actor-critic* approach but add a critic for the termination rule in addition to the policy.

# Intrinsic Motivation (a.k.a. Curiosity)

*(Pierre-Yves Oudeyer, INRIA)*

Their lab looks at intrinsic motivation in humans and machines. They explore developmental learning in children and try to apply it to

- building robots
- developing better education methods

It is well known that children constantly explore and invent their own goals. So looking only at accuracy statistics is not useful; instead we need to consider the context.

## The Learning Progress Hypothesis

"Interestingness" is not just about novelty or surprise. It is about situations where a high level of learning is happening. If something becomes partially mastered then progress will slow and the agent/child should (and will) lose interest.

## IAC Algorithm

*(Oudeyer, IEEE TEC 2017)*

- build an "interestingness" metric using the change (gradient) in prediction error during learning for many points in state-action space (a minimal sketch follows this list)
- choose actions to try where this metric is highest
- hierarchically divide the space into distinct regions by clustering on the metric

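A minimal sketch of the learning-progress signal, assuming a fixed partition into regions for simplicity (IAC actually builds the regions adaptively); all names here are illustrative:

```python
import numpy as np

class Region:
    """Tracks recent prediction errors for one region of state-action space."""
    def __init__(self, window=20):
        self.errors = []
        self.window = window

    def record(self, error):
        self.errors.append(error)
        self.errors = self.errors[-2 * self.window:]    # keep two windows of history

    def learning_progress(self):
        if len(self.errors) < 2 * self.window:
            return 0.0
        older = np.mean(self.errors[:self.window])      # error a while ago
        recent = np.mean(self.errors[-self.window:])    # error now
        return older - recent                           # positive = still improving

def pick_region(regions, epsilon=0.1):
    """Mostly sample the region that is improving fastest; explore occasionally."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(regions))
    return int(np.argmax([r.learning_progress() for r in regions]))
```

The key property is the one stated in the Learning Progress Hypothesis above: a region that is fully mastered (error flat and low) and a region that is unlearnable (error flat and high) both score low, so attention moves to wherever progress is currently being made.
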
They also explore the use of *Inverse Reinforcement Learning*.

*(Forestier et al. 2017)* have a great video of robot learning.

# Anatomy of a Social Interaction

*Luke Chang*

They look at actual human social interactions to get at the complex dynamics between the choices of multiple people.

## Trust Game

- a sequence of choices: join or don't join the interaction
- share or don't share: information

### Prediction Error in the Brain

- they found correlations between activations in a part of the brain related to prediction error
- they ran a user study on the trust game where people play with agents that have trustworthiness probabilities, rather than with actual people
- they scanned people's brains while playing the game to see what lights up
- **result**: values are higher when interacting with someone you trust

### What makes someone trustworthy?

- Recent study on trust around the world, the lost wallet game [Cohn et al., 2019, Science](https://www.nytimes.com/2019/06/20/science/lost-wallet-what-to-do.html)
  - that study found that people are more likely to return the wallet if there was a lot of money in it
  - economists think there is no existing theory to explain this
- Psychological Game Theory can explain this
  - one agent has a *second-order* theory about the other player and they converge on a solution (a generic version of this is sketched after this list)
  - this involves thinking about the disappointment the other person is going to experience, and this is partially valid; if the money is higher, obviously this weight is higher
  - TODO: see image
  - they also find support for this by looking at brain scans

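One generic way to write the disappointment idea down, borrowed from the guilt-aversion literature in psychological game theory (a textbook-style form, not necessarily the exact model from the talk):

$$
u_i(a) \;=\; m_i(a) \;-\; \theta_i \cdot \max\!\big(0,\; \mathbb{E}_j[m_j] - m_j(a)\big)
$$

where $m_i$ is player $i$'s material payoff, $\mathbb{E}_j[m_j]$ is what $i$ believes player $j$ expects to get (a second-order belief), and $\theta_i$ weights guilt. The more money is at stake, the larger the potential shortfall term, which is consistent with the lost-wallet result above.
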
## Moral Strategy Model

![20190709_144501](/Users/mcrowley/Downloads/rldm2019/Photos/20190709_144501.jpg)

## Theory of Mind

Many existing models of theory of mind are low dimensional, with a few main types of qualities, and assume these are static over time.

There is a push to explore this using *Inverse Reinforcement Learning* and *Bayesian Learning*.

### Prediction Game

- the goal is for a player to learn the likely strategy of another player from observed actions
- people can predict very quickly what the players they are watching will do
- RL trained to optimize the task doesn't do well
- but IRL does better than RL here
- IRL still does not perform as well as humans
- also, this domain gave the IRL learner a lot of state representation information which humans don't have

## Guilt Aversion as a Useful Component of Values

- it's important to include guilt, or theory of mind about others' disappointment or pain
- it is a real effect in human decision making, it is robust across cultures, and it does not conform to the standard economics idea of expected utility maximization
- advice being given for medical and other safety-critical domains needs to consider this

# Learning to Learn to Communicate

*Abhinav Gupta, Joelle Pineau, et al.*

Learning of simulated languages between agents, including programming languages, natural languages, or artificial ones.

Interesting work that explicitly builds compositional models of agents learning languages so that they contain some of the properties of natural languages.

# Arbitration between imitation and emulation during human observational learning

*Caroline J Charpentier; Kiyohito Iigaya; John O’Doherty*

Learning by observing people perform a task, there are two main approaches:

- imitation learning
- learning by inferring their goals and preferences

**Question:** do people alternate between these two strategies, and when?

**Experiment:**

- a bandit task: choose which arm to pull, with some features about the machines to identify them
- you get to watch another player follow a strategy, and you know they are a good player
- you also know that one of the features (tokens) is perfectly correlated with high reward

**Hypothesis:**

- imitation is slower to learn, but better when the system has lots of uncertainty
- emulation will be favoured in highly volatile domains

**Results:**

- people use both approaches

## Computational Models

- they build a computational model for each strategy and an arbitration model which weights a tradeoff between the two strategies
- they show the arbitration model performs better
- then they test whether it explains the human behaviour better, and they find it does, very closely

## Implications

- they perform fMRI scans and show which parts of the brain correlate with activity for each of the two strategies, and with the joint arbitration signal too

# Can a Game Require Theory of Mind?

*Michael Bowling*

Game: ***Hanabi*** - Michael explained how this card game blends communication through explicit messages with observation of the actions of other players.

# Working Memory

*speaker*

Working memory is fast to use but has limited capacity and is forgotten quickly.

## Simple Experiment

Give people a small set (3-7) of images to remember the position of, based on another larger image with stimuli in it.

Try to test two aspects of working memory, if it is an RL system:

- time limiting factors: how long it lasts
- size of memory: how many different elements can be remembered

### Results

People can perform optimally, but once the set size gets to 5 or 6 they start making 20-30% errors. An RL model of memory alone doesn't have these problems.

### Adding a Separate Working Memory Process

RL alone isn't enough, so we need some kind of mixture model. Once this is added, the model corresponds closely to human performance.

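A minimal sketch of one standard way such a mixture is written in the cognitive modelling literature (the talk's exact parameterization may differ): the choice policy is a capacity-limited working-memory policy mixed with an incremental RL policy,

$$
P(a \mid s) \;=\; w \, P_{\mathrm{WM}}(a \mid s) \;+\; (1 - w)\, P_{\mathrm{RL}}(a \mid s),
\qquad w \;=\; \rho \cdot \min\!\left(1, \frac{K}{n_s}\right)
$$

where $K$ is working-memory capacity, $n_s$ is the set size, and $\rho$ is overall reliance on WM. When the set size exceeds $K$, behaviour leans on the slower RL policy, which matches the drop in performance at set sizes of 5 or 6 described above.
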
## But... Working Memory Interferes with RL

How do WM and RL interact?

- EEG studies show that RL reward prediction (or reward history) is correlated with the set size (the number of things being remembered). So they are not independent. So WM is somehow blocking?
- EEG studies also show that the Q-value improves faster for small sets, so somehow WM is helping?
- Long-term associations are learned better (this is harder) when the set of images is larger.

### So What's Happening?

WM blocks RL by contributing to the reward prediction error and helping improve it; there is a closed loop between them.

## Why is this Important?

Keeping an open mind about how different observed brain systems could contribute to and interact with learning. Example domains help to motivate this:

- Schizophrenia
  - they can show that working memory is impaired in schizophrenia patients and that this is due largely to WM problems separate from RL; this is only visible if WM is modelled explicitly
- Age-related learning
  - they found that the learning rate seems to increase with age to compensate for the decrease in working memory

There has been a lot of success in cognitive science and neuroscience in grabbing useful ideas from CS RL to run more experiments. She encourages the CS RL community to grab more ideas from neuroscience and try them out in computational models; working memory, for example, seems to have benefits beyond RL alone.

# Reinforcement Learning for Robots

*Chelsea Finn, Berkeley, Google Brain, Stanford*

A nice aspect of doing RL on robots is that you can't get around the problems of noise, bad reward models and generalization the way you can in simulation. A major problem right now is that training specialist robots is well understood, but they generalize very badly, even across mundane differences.
So **exploration**, using **raw sensory input**, and **continual improvement without supervision** are needed.

## Learning Reusable Models from Self-Supervision

Their approach (**Visual Foresight**) is to do two things:

- learn general policies through unsupervised exploration
- learn fundamental physics and dynamics from pre-existing videos

Their state representation is the full images that the robot sees; this includes all the pixels and the view of the environment. So their dynamics prediction needs to be a recurrent neural network that predicts video images.

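A rough sketch of how planning with such a learned video-prediction model typically works; the interface names are placeholders and the real system is more sophisticated (e.g. it refines the sampling distribution rather than using pure random shooting), but the loop is the same: predict, score against a goal, act, replan.

```python
import numpy as np

def plan_action(video_model, context_frames, goal_image,
                horizon=10, n_candidates=200, action_dim=4):
    """Model-predictive control with a learned video-prediction model.

    `video_model.predict(frames, actions)` is an assumed interface that returns
    predicted future frames for one candidate action sequence.
    """
    best_score, best_actions = -np.inf, None
    for _ in range(n_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        predicted = video_model.predict(context_frames, actions)  # (horizon, H, W, C)
        score = -np.mean((predicted[-1] - goal_image) ** 2)       # closeness to the goal image
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions[0]  # execute one action, then replan at the next step
```
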
# How We Know When Not to Think

*Fiery Cushman*

A standard dichotomy is **model-free (habitual) vs model-based (planning)**.

How can a little bit of model-based knowledge help with planning?

- this is important because in reality we have an infinity of choices and yet we only consider a small subset; how does that happen?
- **consideration set**: the things we usually want to consider for this task, feasible options, but very restricted
- **choice**: the standard online, value-based estimation using a model to pick the best thing for the context

They show a bunch of experiments highlighting that people are able to separate these two tasks well. They find that the consideration set is generally based on a quick heuristic of pre-cached cases weighted by value. This means we think of things based on value, even if we eventually need to choose among that set according to something other than value maximization (e.g. choose your least favourite food, choose an item that satisfies some convoluted constraint).

## What does "Possible" mean?

We usually mean practical when we say possible, not whether it is physically conceivable. They show some interesting experimental results indicating that immoral options are not immediately added to these consideration sets, but only become available under more deliberation. So we don't even consider these scenarios, which saves time.

One final result is that this indicates that ...

# Play: Interplay of Goals and Subgoals in Mental Development

*Rich Sutton*

We finally wrapped up with a talk by Rich Sutton himself. He argues for a truly integrated science of mind that applies equally to *biological* and *machine* minds.

```
Intelligence == Mind  - Rich Sutton
```

## The Reward Hypothesis is great but...

- it reduces the importance of subgoals
- it seems to be something that can't change over time

## Where are we?

We start with goals:

- first we learn value functions based on expected future reward
- we learn policies

Next we need to look at subgoals:

- learning about state, e.g. state representations
- skills, e.g. options
- models

These are all *subproblems* that are not essentially about reward or value. How do we learn them or represent them generally?

He thinks **play** is an important way to look at it.

Three key open questions about subproblems in RL:

1. Q1 - what should subproblems be?
2. Q2 - where do they come from?
3. Q3 - how do subproblems help the main problem (i.e. how do subgoals help the main reward-maximization task)?
   - learning to solve subproblems could help with shaping better *state representations* and more coherent *behaviour patterns*
   - it also allows *high-level planning*, because now you have a model of what happens after you've achieved the subgoal

Some settled issues:

- subproblems are a reward in themselves and may be terminal (planning stops)
- solving a subproblem can be done with an option, i.e. a separate subpolicy

![20190710_113310](20190710_113310.jpg)

![20190710_114403](20190710_114403.jpg)

That's all, it was fun. See you in two years at Brown!