• ECE 657A

    Data and Knowledge Modelling and Analysis

    A graduate course in the
    Electrical and Computer Engineering Department at
    The University of Waterloo
    taught by
    Prof. Mark Crowley
    in

    Winter 2021 (Jan-April, 2021)

    These notes are dynamic and will be updated over the term


  • Notes on Fundamentals



  • Notes on Data Analysis and Machine Learning Concepts



  • Notes on Particular Algorithms



  • Where is the next Frontier of AI/ML?


    While no one can predict the future (and, looking at history, AI/ML researchers seem particularly bad at this), there are current areas and trends taking up a lot of attention. Rather than indicating what will be big in five years, this more likely means that in a few years there will be solid, or at least well-accepted, approaches or even solutions to these problems.


  • Fun

    You are allowed to have fun…sometimes.


  • Course Textbook(s)

    The course has no required textbook itself. There are a number of resources listed on the course website and here that are useful. The fact is that the pace and nature of how the field changes these days make it very hard for any physical textbook to simultaneously cover the fundamentals as well as the latest relevant trends and advances that are required for a course like this. So the web is full of blogs, info-sites, corporate demonstration pages and framework documentation sites that provide fantastic descriptions of all of the concepts in this course, with the latest technology and approaches. Finding the best ones is hard, of course, so on this humble gingko tree I will make my best attempt to curate a list of resources relevant to the topic of this course as I come across them in my own, never-ending-mad-rush-to-stay-up-to-date.

    :)

  • Aside - Fundamentals vs. The Bleeding Edge

  • Universal References

    One thing we sometimes think we want is a universal solution to a problem.

    • Murphy book
      • buy it: amazon link
      • library: If you are familiar with the idea of a Library, then for my actual course ECE657A that this gingko is primarily created for, this book is on hold at Davis Library for short term use.
        • If you are not familiar with libraries, then this information is not useful to you until you obtain the required polarity reversing phase transmogrifier from level 42 of the OASIS.
    • Deep Learning Book
    • “A Course in Machine Learning” by Hal Daumé III
      • url: http://ciml.info/
      • Comment: I’ve only recently discovered this book but it seems like a solid, simple approach to explaining the fundamentals and methods in many of the same areas as this course. It’s free online.
  • Data Types

  • Probability and Statistics Fundamentals Review

  • Experimental Methodology

  • History of AI/ML

    The histories of Artificial Intelligence and Machine Learning are tightly intertwined, but there are as many different perspectives on the important moments as there are researchers and interested parties.

  • For everyone who took ECE606 last term

    XKCD comic number 2407, a very funny comparison of standard tree search algorithms, such as depth-first and breadth-first, as well as some lesser-known ones: Brepth-first, Deadth-first and Bread-First Search, which skips the tree entirely and jumps directly to a loaf of bread.

    The fundamentals of data cleaning and preparation, the types of different data, how to normalize it, how to extract features for different purposes, how to visualize and ask the right questions of our data; these are all critical skills. No number of fancy software frameworks will help you get useful results if you don’t have these skills in the first place.

    At the same time, some algorithms and methodologies have become essentially irrelevant because better ones have been discovered. So there is no point learning how to use them in detail if they will never be used in industry or even in cutting-edge research. But when that happens in under 10 years, it’s very hard for a textbook to remain relevant.

    Some people would disagree with this, and in a sense, from a research point of view, every method that was useful at one stage is still worthy of study, if only to understand how solutions can be found without fully understanding all of the tradeoffs. For example, SIFT features are incredibly powerful summarizations of context in images and revolutionized image recognition tasks before CNNs had been fully developed. Now they are simply one type of feature that a CNN could learn directly from data.

  • Data Types

    In summary, nominal variables are used to “name,” or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, Ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.

    https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/

  • Notes From My Other Courses

  • On Entropy and Security

    Some fun thoughts that tie information entropy, random search, sampling and security to the never-ending challenge of picking a new password.
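
    The back-of-the-envelope entropy calculation behind such thoughts: a password of L symbols drawn independently and uniformly from an alphabet of size S carries L·log2(S) bits of entropy. A quick sketch (the numbers below are our own illustrative choices):

```python
import math

def entropy_bits(alphabet_size, length):
    """Entropy of a uniformly random string: length * log2(alphabet_size)."""
    return length * math.log2(alphabet_size)

# 8 random printable-ASCII characters vs. 4 random Diceware words (7776-word list)
chars = entropy_bits(95, 8)       # ~52.6 bits
diceware = entropy_bits(7776, 4)  # ~51.7 bits
```

    Roughly the same entropy, but one of them you can actually remember.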

  • References and Further Reading

  • Online Pre-recorded Lecture

  • Ablation Studies

    Once you have a trained model that gives you some kind of response, how do you figure out why it is working?

  • What is Parameter Estimation?

    Parameter estimation is literally the task of guessing the parameters of a function. We can do this through iterative improvement, checking how well our settings work, just as we do when setting the right angle for the tap in the shower.

    Or if we think we know enough about the distribution of the data we can get fancy and do some calculus on an approximation of that distribution (MAP, MLE, EM).

    At the most fundamental level, though, we are taking the data we have, using it to build an estimator, and testing how well it works. Often we’ll improve it through multiple scans of the data, or steps down a gradient, as we do implicitly by setting the gradient to zero in MLE.

    We could say at the end of all this that we have learned the best parameters for our model. If we take the algorithm that does the tuning and the estimation together, then the machine itself did the learning; we weren’t needed at all, except to choose the right libraries and data formats.

    In this sense, the rest of the course is looking at methods of Machine Learning that, in much more complex ways, still always estimate some parameters for a model.
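
    As a concrete instance of “doing calculus on the distribution”: for a univariate Gaussian, setting the gradient of the log-likelihood to zero gives closed-form estimates, namely the sample mean and the biased sample variance. A few lines of illustrative Python (our own sketch, no library required):

```python
def gaussian_mle(data):
    """Closed-form MLE for a univariate Gaussian: setting the gradient of the
    log-likelihood to zero yields the sample mean and biased sample variance."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n  # note: divides by n, not n - 1
    return mu, var

mu, var = gaussian_mle([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# mu = 5.0, var = 4.0
```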

  • Using Classification for Anomaly Detection

    • Any effective classification model will provide correct labels or predictions for new data.
    • So if a classification model has been trained to predict one of the available features, it could be used to score datapoints as normal or abnormal.
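
    A minimal sketch of that idea (our own toy example, not a standard API): fit a simple regressor to predict one feature from another, then score each datapoint by how badly the model predicts it:

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ a*x + b: the 'model' trained to predict feature y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def anomaly_scores(xs, ys):
    """Score each point by the absolute error of the fitted prediction."""
    a, b = fit_line(xs, ys)
    return [abs(y - (a * x + b)) for x, y in zip(xs, ys)]

# y = 2x everywhere except the fourth point, which is the planted anomaly
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0, 4.0, 6.0, 20.0, 10.0, 12.0, 14.0]
scores = anomaly_scores(xs, ys)
```

    The planted point gets by far the largest score; in practice you would threshold these scores to flag anomalies.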
  • Using Clustering for Anomaly Detection

    • Consider how DBScan could be used for Anomaly Detection
    • k-means is a common, and very scalable, way to group datapoints together; the resulting cluster structure can then be used to argue that a point is an anomaly
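
    One minimal way to turn k-means output into an anomaly score (a plain-Python sketch; `kmeans` and `anomaly_score` are our own illustrative names): run Lloyd’s algorithm, then score each point by its distance to the nearest centroid:

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; initial centroids are simply the first k points."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def anomaly_score(p, centroids):
    """Distance to the nearest centroid: large means far from every cluster."""
    return min(math.dist(p, c) for c in centroids)

# two tight clusters plus one far-away point
pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
       (5.0, 5.0), (5.5, 5.0), (5.0, 5.5),
       (10.0, 0.0)]
cents = kmeans(pts, 2)
scores = [anomaly_score(p, cents) for p in pts]
```

    The far-away point ends up with the largest distance-to-centroid score.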
  • Dedicated Anomaly Detection Algorithms

    • Local Outlier Factor (LOF)
    • One-Class SVM
    • Oversampling Principal Component Analysis (osPCA)
    • Isolation Forest
    • iMondrian Forest
  • Local Outlier Factor (LOF)
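
    A compact (and deliberately naive, O(n²)) sketch of the LOF computation, following the standard definitions of k-distance, reachability distance and local reachability density; all function names here are our own:

```python
import math

def knn(points, i, k):
    """The k nearest neighbours of point i, as a list of (distance, index)."""
    dists = sorted((math.dist(points[i], points[j]), j)
                   for j in range(len(points)) if j != i)
    return dists[:k]

def k_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbour."""
    return knn(points, i, k)[-1][0]

def lrd(points, i, k):
    """Local reachability density: inverse of the mean reachability distance."""
    neigh = knn(points, i, k)
    reach = [max(k_distance(points, j, k), d) for d, j in neigh]
    return len(neigh) / sum(reach)

def lof(points, i, k):
    """Local Outlier Factor: about 1 for inliers, much larger for outliers."""
    neigh = knn(points, i, k)
    return sum(lrd(points, j, k) for _, j in neigh) / (len(neigh) * lrd(points, i, k))

# four points in a tight cluster plus one far-away point
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
scores = [lof(pts, i, 2) for i in range(len(pts))]
```

    The cluster points score almost exactly 1, while the isolated point scores far above 1: LOF measures how much sparser a point’s neighbourhood is than its neighbours’ neighbourhoods.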

  • One-Class SVM

  • References and Further Reading

  • A More General View of Ensembles

    Now that we know about ensembles of decision trees such as Random Forests and AdaBoost, we can take a more general view.

  • A More General View of Ensembles

    People realized that the very successful Boosting method was in essence a very general meta-algorithm for optimization of the mapping function from input variables to output target variables.

    This algorithm chooses multiple weak functions that are combined together, just as the ensemble of decision trees are for Random Forests.

  • What is the Gradient Though?

    • One can imagine that this combined function can have a gradient
    • In this case this is the infinitesimal increase in each of the function parameters that would strengthen the current response.

    We’ve already used them:

    • In an ensemble of decision trees these parameters are all of the split points, for each data dimension, in each tree.
    • In Random Forests the gradient is not used.
    • In AdaBoost it is used implicitly, in a very simple way:
      • each new decision-tree weak learner
      • is optimized relative to the negative of this gradient,
      • since it tries to do well on what the existing model does badly on.
  • Doing Better

    This idea can then be generalized so that each new weak learner is explicitly treated as a function that points directly away from the gradient of the current combined function.

  • Gradient Tree Boosting

    Given some tree-based ensemble model, represented as a function

    • after adding some weak learners, we find that the “perfect” next weak learner would be
    • one that fills in the gap of what the existing model got wrong.
      • This is because then the new combined model perfectly matches the training data:
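
    In symbols (our own notation, assuming a squared-error loss), the ensemble after m rounds and the “perfect” next weak learner are:

```latex
F_m(x) = F_{m-1}(x) + h_m(x),
\qquad
h_m(x_i) = y_i - F_{m-1}(x_i)
\quad\Rightarrow\quad
F_m(x_i) = y_i \ \text{for every training point } (x_i, y_i).
```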
  • Gradient Tree Boosting

    • In practice we need to be satisfied with merely approaching this perfect update, using a functional gradient descent approach where at each step we fit an approximation of the true residual (the negative gradient of the loss function).

    • In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree

  • Gradient Tree Boosting explicitly uses the gradient of the loss function of each tree to fit a new tree

    and add it to the ensemble.

    There is also further optimization of weighting functions for each tree and various regularization methods.

  • This algorithm is implemented in the popular XGBoost package.
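
    The core loop is small enough to sketch from scratch (squared-error loss, depth-1 stumps, plain Python; this is an illustration of the idea, not the XGBoost API):

```python
# Minimal gradient boosting for regression: each round fits a decision stump
# to the residuals (the negative gradient of the squared-error loss).

def fit_stump(xs, residuals):
    """Pick the threshold split on a single feature minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Add one stump per round, scaled by the learning rate."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + sum(lr * s(x) for s in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

# toy data: a step function
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = gradient_boost(xs, ys)
```

    Real implementations like XGBoost add deeper trees, per-tree weights, regularization and much cleverer split finding, but the additive structure is the same.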

  • ECE 108 YouTube Videos

    For a very fundamental view of probability from another of Prof. Crowley’s courses, you can view the lectures and tutorials for ECE 108.

    ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists

    The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

  • Probability Intro Markdown Notes

    From last year’s course website for ECE 493 T25 “Reinforcement Learning”.

    Some of this we won’t need so much, but it is all useful to know for Machine Learning methods in general.

    https://rateldajer.github.io/ECE493T25S19/preliminaries/probabilityreview/

    Topics

    • Elements of Probability
    • Conditional Probability Rules
    • Random Variables
    • Probability Functions - Cumulative/Mass/Density
    • Expectation
    • Variance
    • Multi-variable distributions
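
    As a quick self-check on two of these topics, here are expectation and variance computed directly from their definitions for a discrete random variable (a fair die; exact arithmetic via the standard library):

```python
from fractions import Fraction

# A fair six-sided die: P(X = k) = 1/6 for k = 1..6.
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

expectation = sum(k * p for k, p in pmf.items())                    # E[X]
variance = sum((k - expectation) ** 2 * p for k, p in pmf.items())  # E[(X - E[X])^2]
# expectation = 7/2, variance = 35/12
```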
  • Definitions

  • Motivation for Ablation Studies

  • Automated Parameter Tuning

  • What Ablation Is and Isn’t

  • An Example

  • References and Further Reading

    #otherreading

  • Transfer Learning Definition

    Transfer Learning: Attempting to improve performance on a learning task B by using a neural network that is pre-trained on some other task A.

    • each task, A and B, could correspond to classification, prediction, summarization, etc on a particular dataset
    • The idea mostly arose out of Image Classification for CNNs, but it can be applied anywhere
  • When to Use Transfer Learning

    • If the data domain is very large and you do not have time to train a model from scratch
      • Corollary : this only helps if there already is a pre-trained model for your domain of interest
    • You have reason to believe you can “learn the fundamentals” of the domain separately and then train on specific tasks afterwards
      • This is clearly reasonable in natural images and language where the shared structure common to all data is the truly complex part
  • Relation to Inception and ResNet

    • these models are often used as the “pre-trained” model that people do transfer learning with
    • A very, very, very common type of paper a few years ago was

      “We take ResNet-50 (or ResNet-16) trained in ImageNet and fine-tune the last X (3-10) layers for our (FANCY IMAGE CLASSIFICATION) task and demonstrate SotA performance.” - PublishMePlease

  • Relation to NLP

    (see #methodTransferLearningNLP)

    • In fact, we’ve seen transfer learning already, when we talked about Document Classification
    • GloVe and Word2Vec do this by training on large document corpora and then making the models available for others to use.
    • GPT and BERT are following in the footsteps of this
  • Bring it All Together

    An exciting unifying paper in 2018 from OpenAI brings many of these threads together for language understanding

    Where they use the idea of Transfer Learning and merge it with two other recent, related, advances:

  • Resources

  • Ablation - the removal, especially of organs, abnormal growths, or harmful substances, from the body by mechanical means, as by surgery. — Dictionary.com

  • Definition from [Fawcett and Hoos, 2013]:

    Our use of the term ablation follows that of Aghaeepour and Hoos (2013) and loosely echoes its meaning in medicine, where it refers to the surgical removal of organs, organ parts or tissues. We ablate (i.e., remove) changes in the settings of algorithm parameters to better understand the contribution of those changes to observed differences in algorithm performance.

  • As one person puts it (see this twitter thread by @fchollet)

    • how do you determine ‘causality’ between which parts of your system are responsible for the performance?
    • Advice: “Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.”
  • These tools and other algorithm configuration tools help to set the many complex parameters needed to achieve optimal, or at least maximal, performance.
    But they spit out the parameters without any explanation.

    So in [Fawcett and Hoos, 2013 and 2016] they propose ways to:

    help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance.

  • It Is…

    • a good way to determine what parts of your model are useful, which are necessary and which may be unnecessary
    • an approach to help you understand and explain your model to others by showing how each part contributes to your state-of-the-art performance
    • essentially a method for improving your model selection/design process
    • therefore : all ablation analysis should be done on
      1. The Test Dataset?
      2. The Training Dataset?
      3. A Validation Training Dataset?
    • Answer: 3. If the goal is to use ablation to improve the model design, then such analysis must happen on a held out validation dataset, not the final testing dataset.
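
    The mechanics can be sketched in a few lines (our own toy, not a standard tool): evaluate the full model on the held-out validation data, then re-evaluate with each named component removed and report the increase in error:

```python
def mse(predict, val_data):
    """Mean squared error of a predictor on held-out validation data."""
    return sum((predict(x) - y) ** 2 for x, y in val_data) / len(val_data)

def ablation_study(components, val_data):
    """Increase in validation error attributable to each named component."""
    def predict_with(parts):
        return lambda x: sum(f(x) for f in parts.values())
    full_err = mse(predict_with(components), val_data)
    return {
        name: mse(predict_with({k: f for k, f in components.items() if k != name}),
                  val_data) - full_err
        for name in components
    }

# toy additive "model" for y = 2x + 1, with one useless component
components = {
    "slope": lambda x: 2.0 * x,
    "bias": lambda x: 1.0,
    "extra": lambda x: 0.0,   # contributes nothing
}
val_data = [(x, 2.0 * x + 1.0) for x in range(5)]
drops = ablation_study(components, val_data)
```

    Here the ablation cleanly ranks the components: removing the slope hurts most, removing the bias a little, and removing the useless component not at all.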
  • It Is Not…

    • a regularization method
      • why not?
      • While exploring
    • a way to push your testing numbers (accuracy, recall, confidence) higher
    • a way to fill in the space of your paper with more experiments and graphs
      • it will do this, but that is not the purpose. If you cannot fill a 6-9 page paper with your own background, theories, data, methodology and results then adding two pages of ablation studies will not save you.
  • A nice example explained here: https://stats.stackexchange.com/questions/380040/what-is-an-ablation-study-and-is-there-a-systematic-way-to-perform-it

    As an example, Girshick and colleagues (2014) describe an object detection system that consists of three “modules”: The first proposes regions of an image within which to search for an object using the Selective Search algorithm (Uijlings and colleagues 2012), which feeds in to a large convolutional neural network (with 5 convolutional layers and 2 fully connected layers) that performs feature extraction, which in turn feeds into a set of support vector machines for classification. In order to better understand the system, the authors performed an ablation study where different parts of the system were removed - for instance removing one or both of the fully connected layers of the CNN resulted in surprisingly little performance loss, which allowed the authors to conclude

  • This question and answer on StackExchange give a great overview of the recent history of the term in machine learning, with links to further reading: https://stats.stackexchange.com/a/380233

  • References

    • Newell, Allen (1975). A Tutorial on Speech Understanding Systems. In Speech Recognition: Invited Papers Presented at the 1974 IEEE Symposium. New York: Academic. p. 43.
    • [Fawcett and Hoos, 2013] Chris Fawcett and Holger H. Hoos. Analysing differences between algorithm configurations through ablation.
      Proceedings of the 10th Metaheuristics International Conference (MIC 2013), pp. 123-132, 2013. PDF
  • Transfer Learning Definition

    • Obviously if the two datasets are similar then you would expect transfer learning to work quite well
      • essentially, the pre-training on A is just additional training data you don’t need to spend time on
  • A Pleasant Surprise

    What is surprising is that even if A and B are quite different this is still often a useful approach to take.

    We think this is because:

    • The underlying structure of the world is persistent
    • So lessons learned from one problem carry over to problems in the same domain.

    Example:

    1. First, train an object recognition model on a large image dataset involving people, cars, fruit, bikes, trees, buildings
    2. Then use that pretrained CNN model, but fine-tune the classification layers to train it to identify cats and dogs in the world
      • Now, most of the features automatically learned by the CNN are still relevant (edge detection, texture, colour gradients, …)

    Go to https://www.tensorflow.org/tutorials/images/transfer_learning and try it out for yourself!

    • pretraining: MobileNet V2 using 1.4M images and 1000 classes
  • Pre-training

    • Pre-training is really just training on a task.

    • It is usually supervised training and commonly on image processing or text classification tasks.

    • If you’re lucky, then someone has already pre-trained a model on a data domain you wish to use

      • photographic images
      • brain scans
      • natural language text documents
      • pedestrians
      • human faces
    • In this case, you don’t need to do the training at all!

  • Fine Tuning

    • Question: Is a neural network restricted to being trained once only, on a batch of data, and then frozen for all time?
    • Answer: No. A trained network’s weights can be reloaded and training continued on new data; this is exactly what fine-tuning does.

    A Standard CNN for Image Classification

    (Input + 3x ConvWithScaling + 3x FullyConnected + 1x Softmax)

    What part can be kept frozen after pre-training and what part can we fine tune?
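
    A toy illustration of that split (our own sketch, nothing like a real CNN): treat `phi` as the frozen “pretrained” layers and refit only a linear head, here by least squares, on a small task-B dataset:

```python
def phi(x):
    """Frozen feature extractor (pretend it was pretrained on a big task A)."""
    return (1.0, x, x * x)

def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (M[r][3] - sum(M[r][c] * x[c] for c in range(r + 1, 3))) / M[r][r]
    return x

def fine_tune_head(xs, ys):
    """Least-squares fit of the linear head on frozen features: the only
    part that gets (re)trained, i.e. the 'fine-tuning' step."""
    F = [phi(x) for x in xs]
    XtX = [[sum(f[i] * f[j] for f in F) for j in range(3)] for i in range(3)]
    Xty = [sum(f[i] * y for f, y in zip(F, ys)) for i in range(3)]
    return solve3(XtX, Xty)

# small task-B dataset: y = x^2 (easy for the head, given these frozen features)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
w = fine_tune_head(xs, [x * x for x in xs])
predict = lambda x: sum(wi * fi for wi, fi in zip(w, phi(x)))
```

    The point of the sketch: because the frozen features already capture the structure of the domain, fitting the small head needs very little task-B data, which is the whole appeal of fine-tuning.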

  • When not to use Transfer Learning?

    • Imagine a structurally simple set of sensor readings: just a set of numbers coming from heat, light and energy sensors.
    • There is no known prior data for this domain, nor any reason to think there is a common pattern that could be learned elsewhere.
    • You may just need to train from scratch, treating your task as a supervised learning problem.
  • Caveat

    Given what we know about Human Learning, it is very likely that some kind of transfer learning is always a good idea.

    • In other words, starting from scratch may not be worth it.
    • BUT whether you can carry out transfer learning in your domain depends on the data, compute power and many other factors.
    • They use:

      • Uijlings et al., “Selective Search for Object Recognition.”
        • SIFT feature extraction
        • SVM supervised training
        • loop and strengthen hypotheses
      • then feed the output into a network with 5 convolutional and 2 fully connected layers:
      • Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.”
    • What they found:

      • investigate which parts are needed and which aren’t
      • they found that the SIFT features were not as critical if there was a high-capacity CNN to localize objects
      • they also found that the CNN could be pre-trained on a large, unrelated dataset of images and then fine tuned for the specific problem. This worked better than specialized computer vision methods, such as SIFT.

      They would only have found this through ablation experiments.

    • A twitter thread by François Chollet (@fchollet) that was part of the “recent” surge in popularity of Ablation Studies in Machine Learning.

      Ablation studies are crucial for deep learning research — can’t stress this enough.

      Understanding causality in your system is the most straightforward way to generate reliable knowledge (the goal of any research). And ablation is a very low-effort way to look into causality.

      If you take any complicated deep learning experimental setup, chances are you can remove a few modules (or replace some trained features with random ones) with no loss of performance. Get rid of the noise in the research process: do ablation studies. (Source: https://threader.app/thread/1012721582148550662)

      Can’t fully understand your system? Many moving parts? Want to make sure the reason it’s working is really related to your hypothesis? Try removing stuff. Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.

      See the whole twitter thread here.

    {"cards":[{"_id":"40b12b8035202dd3150000de","treeId":"5282babaf95bac7e40000146","seq":21996626,"position":0.1875,"parentId":null,"content":"# ECE 657A\n## Data and Knowledge Modelling and Analysis\nA graduate course in the \nElectrical and Computer Engineering Department at\nThe University of Waterloo\ntaught by\nProf. Mark Crowley\nin\nWinter 2021 (Jan-April, 2021)\n---\n\nThese notes are dynamic and will be updated over the term"},{"_id":"40978d39ebd4fe2a9500025e","treeId":"5282babaf95bac7e40000146","seq":22008368,"position":1,"parentId":"40b12b8035202dd3150000de","content":"## Course Textbook(s)\nThe course has no required textbook itself. There are a number of resources listed on the course website and here that are useful. The fact is that the pace and nature of how the field changes these days makes it very hard for any physical textbook to simultaneously cover the *fundamentals* as well the *latest relevant trends and advances* that are required for a course like this. So the web is full of blogs, info-sites, corporate demonstration pages and framework documentation sites that provide fantastic description of all of the concepts in this course, with the latest technology and approaches. Finding the best ones is hard, of course, so on this humble gingko tree I will make my best attempt to curate a list of resources relevant to the topic of this course as I come across them in my own, *never-ending-mad-rush-to-stay-up-to-date*. \n\n:)"},{"_id":"40977cc3ebd4fe2a9500025f","treeId":"5282babaf95bac7e40000146","seq":22008376,"position":1.25,"parentId":"40b12b8035202dd3150000de","content":"### Aside - Fundamentals vs. The Bleeding Edge"},{"_id":"40976c47ebd4fe2a95000261","treeId":"5282babaf95bac7e40000146","seq":22008380,"position":1,"parentId":"40977cc3ebd4fe2a9500025f","content":"- [ ] todo - main idea, we need both. 
\n\nThe fundamentals of data cleaning and preparation, the types of different data, how to normalize it, how to extract features for different purposes, how to visualize and ask the right questions of our data; these are all critical skills. No number of fancy software frameworks will help you get useful results if you don't have these skills in the first place.\n\nAt the same time, some algorithms and methodologies have become essentially irrelevant because better ones have been discovered. So, there is no point learned how to use them in detail if they will never be used in industry or even cutting edge research. But when that happens in under 10 years, it's very hard for a textbook to remain relevant.\n\nSome people would disagree with this, and in a sense, from a research point of view every method that was useful at one stage is still worthy of study. If only to understand how solutions can be found without fully understanding all of the tradeoffs. For example, SIFT features are incredibly powerful summarizations of context in images and revolutionized image recognition tasks before CNNs had been fully developed. Now they are simply one type of feature that a CNN could learn directly from data. "},{"_id":"40977ca6ebd4fe2a95000260","treeId":"5282babaf95bac7e40000146","seq":22008381,"position":1.5,"parentId":"40b12b8035202dd3150000de","content":"### Universal References\nOne thing we sometimes think we want, is a *universal solution* to a problem. \n\n- Murphy book\n - *buy it*: amazon link\n - *library*: If you are familiar with the idea of a Library, then for my actual course ECE657A that this gingko is primarily created for, this book is ***on hold at Davis Library*** for short term use. 
\n - If you are *not* familiar with libraries, then this information is not useful to you until you obtain the required polarity reversing phase transmogrifier from level 42 of the OASIS.\n- Deep Learning Book\n- \"A Course in Machine Learning\" by Hal Daumé III\n - **url:** http://ciml.info/\n - **Comment:** I've only recently discovered this book but it seems like a solid, simple approach at explaining the fundamentals and methods in many of the same areas as this course. It's free online."},{"_id":"40ce68a54768cfe90800007f","treeId":"5282babaf95bac7e40000146","seq":21989034,"position":0.25,"parentId":null,"content":"---\n\n# Notes on Fundamentals\n\n---"},{"_id":"52827cc512fe52ddb000005f","treeId":"5282babaf95bac7e40000146","seq":21995099,"position":1,"parentId":"40ce68a54768cfe90800007f","content":"# Data Types"},{"_id":"52827c8512fe52ddb0000060","treeId":"5282babaf95bac7e40000146","seq":19738305,"position":1,"parentId":"52827cc512fe52ddb000005f","content":"# Data Types\nIn summary, nominal variables are used to “name,” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values + the ability to quantify the difference between each one. 
Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.\n \nhttps://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/"},{"_id":"40bf11cad74684698900021e","treeId":"5282babaf95bac7e40000146","seq":21995136,"position":1.5,"parentId":"40ce68a54768cfe90800007f","content":"# Probability and Statistics Fundamentals Review"},{"_id":"5fece53216869f0374a37c32","treeId":"5282babaf95bac7e40000146","seq":21995137,"position":1.5,"parentId":"40bf11cad74684698900021e","content":"## Notes From My Other Courses"},{"_id":"5fece53216869f0374a37c34","treeId":"5282babaf95bac7e40000146","seq":21995139,"position":0.5,"parentId":"5fece53216869f0374a37c32","content":"## ECE 108 YouTube Videos\nFor a very fundamental view of probability from another course of Prof. Crowley you can view the lectures and tutorials for ECE 108\n\nECE 108 Youtube (look at \"future lectures\" and \"future tutorials\" for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists\n\nThe last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory."},{"_id":"5fece53216869f0374a37c33","treeId":"5282babaf95bac7e40000146","seq":21996301,"position":1,"parentId":"5fece53216869f0374a37c32","content":"## Probability Intro Markdown Notes\nFrom the course website for last year for ECE 493 T25 \"Reinforcement Learning\". \n\nSome of this we won't need so much but they are all useful to know for Machine Learning methods in general. 
\n\nhttps://rateldajer.github.io/ECE493T25S19/preliminaries/probabilityreview/\n\n**Topics**\n- Elements of Probability \n- Conditional Probability Rules\n- Random Variables\n- Probability Functions - Cumulative/Mass/Density\n- Expectation\n- Variance\n- Multi-variable distributions"},{"_id":"4065abb4f5d3655a490000e8","treeId":"5282babaf95bac7e40000146","seq":22024373,"position":1.625,"parentId":"40bf11cad74684698900021e","content":"## On Entropy and Security\nSome fund thoughts that tie information entropy, random search, sampling and security to the never-ending challenge of *picking a new password*."},{"_id":"4065a53ff5d3655a490000ea","treeId":"5282babaf95bac7e40000146","seq":22024387,"position":0.5,"parentId":"4065abb4f5d3655a490000e8","content":"### Remembering Complex Passwords\n![](https://imgs.xkcd.com/comics/password_strength.png) \n- from [XKCD/936](https://xkcd.com/936/)"},{"_id":"4065a8e8f5d3655a490000e9","treeId":"5282babaf95bac7e40000146","seq":22024384,"position":1,"parentId":"4065abb4f5d3655a490000e8","content":"### Using Dice to pick your password : https://theworld.com/~reinhold/dicewarefaq.html"},{"_id":"40bf08b4d7468469890002c1","treeId":"5282babaf95bac7e40000146","seq":22024364,"position":1.75,"parentId":"40bf11cad74684698900021e","content":"## References and Further Reading"},{"_id":"5fece53216869f0374a37c36","treeId":"5282babaf95bac7e40000146","seq":21995133,"position":1,"parentId":"40bf08b4d7468469890002c1","content":"### Likelihood, Loss and Risk\nA Good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, MAP are all connected.\nhttps://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/"},{"_id":"40a5f4a12c83debb0f0000de","treeId":"5282babaf95bac7e40000146","seq":22024363,"position":1.875,"parentId":"40bf11cad74684698900021e","content":"## Online Pre-recorded Lecture\n- [ ] post youtube 
video"},{"_id":"40b15d4535202dd3150000dd","treeId":"5282babaf95bac7e40000146","seq":21996594,"position":1.625,"parentId":"40ce68a54768cfe90800007f","content":"# Experimental Methodology"},{"_id":"5fee435416869f0374a54e5c","treeId":"5282babaf95bac7e40000146","seq":21996622,"position":2,"parentId":"40b15d4535202dd3150000dd","content":"## Ablation Studies\nOnce you have a trained model that gives you some kind of response, how do you figure out **why** it is working?"},{"_id":"5fee435416869f0374a54e5d","treeId":"5282babaf95bac7e40000146","seq":21996565,"position":1,"parentId":"5fee435416869f0374a54e5c","content":"## Definitions"},{"_id":"5fee435416869f0374a54e5e","treeId":"5282babaf95bac7e40000146","seq":21996566,"position":1,"parentId":"5fee435416869f0374a54e5d","content":"**Ablation** - the removal, especially of organs, abnormal growths, or harmful substances, from the body by mechanical means, as by surgery. -- Dictionary.com"},{"_id":"5fee435416869f0374a54e5f","treeId":"5282babaf95bac7e40000146","seq":21996567,"position":2,"parentId":"5fee435416869f0374a54e5d","content":"Definition from [Fawcett and Hoos, 2013]:\n> Our use of the term ablation follows that of Aghaeepour and Hoos (2013) and loosely echoes its meaning in medicine, where it refers to the surgical removal of organs, organ parts or tissues. 
We ablate (i.e., remove) changes in the settings of algorithm parameters to better understand the contribution of those changes to observed differences in algorithm performance.\n"},{"_id":"5fee435416869f0374a54e60","treeId":"5282babaf95bac7e40000146","seq":21996568,"position":2,"parentId":"5fee435416869f0374a54e5c","content":"## Motivation for Ablation Studies"},{"_id":"5fee435416869f0374a54e61","treeId":"5282babaf95bac7e40000146","seq":21996569,"position":1,"parentId":"5fee435416869f0374a54e60","content":"As one person puts it (see this [twitter thread by @fchollet](https://threader.app/thread/1012721582148550662)) \n- how do you determine 'causality' between which parts of your system are responsible for the performance?\n- Advice: \"Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.\""},{"_id":"5fee435416869f0374a54e62","treeId":"5282babaf95bac7e40000146","seq":21996570,"position":3,"parentId":"5fee435416869f0374a54e5c","content":"## Automated Parameter Tuning"},{"_id":"5fee435416869f0374a54e63","treeId":"5282babaf95bac7e40000146","seq":21996571,"position":1,"parentId":"5fee435416869f0374a54e62","content":"These tools and other algorithm configuration tools help to set the many complex parameters needed to achieve optimal, or at least maximal, performance. \nBut they spit out the parameters without any explanation. 
\n\nSo in [Fawcett and Hoos, 2013 and 2016] they propose ways to:\n> help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance."},{"_id":"5fee435416869f0374a54e64","treeId":"5282babaf95bac7e40000146","seq":21996572,"position":4,"parentId":"5fee435416869f0374a54e5c","content":"## What Ablation Is and Isn't"},{"_id":"5fee435416869f0374a54e65","treeId":"5282babaf95bac7e40000146","seq":21996573,"position":1,"parentId":"5fee435416869f0374a54e64","content":"### It Is...\n- a good way to determine what parts of your model are *useful*, which are *necessary* and which *may be unnecessary*\n- an approach to help you *understand* and *explain* your model to others by showing how each part contributes to your state-of-the-art performance\n- essentially a method for improving your model selection/design process\n- **therefore** : all ablation analysis should be done on \n 1. The Test Dataset?\n 2. The Training Dataset?\n 3. A Validation Training Dataset?\n- ***Answer:*** 3. If the goal is to use ablation to improve the model design, then such analysis must happen on a held out validation dataset, not the final testing dataset."},{"_id":"5fee435416869f0374a54e66","treeId":"5282babaf95bac7e40000146","seq":21996574,"position":2,"parentId":"5fee435416869f0374a54e64","content":"### It Is Not...\n- a *regularization method*\n - why not?\n - While exploring \n- a way to push your testing numbers (accuracy, recall, confidence) higher\n- a way to fill in the space of your paper with more experiments and graphs\n - it *will* do this, but that is not the purpose. 
If you cannot fill a 6-9 page paper with your own background, theories, data, methodology and results then adding two pages of ablation studies will not save you."},{"_id":"5fee435416869f0374a54e67","treeId":"5282babaf95bac7e40000146","seq":21996575,"position":5,"parentId":"5fee435416869f0374a54e5c","content":"## An Example\n\n"},{"_id":"5fee435416869f0374a54e68","treeId":"5282babaf95bac7e40000146","seq":21996576,"position":1,"parentId":"5fee435416869f0374a54e67","content":"A nice example explained here: https://stats.stackexchange.com/questions/380040/what-is-an-ablation-study-and-is-there-a-systematic-way-to-perform-it\n\n> As an example, Girshick and colleagues (2014) describe an **object detection system** that consists of three “modules”: The first proposes regions of an image within which to search for an object using the Selective Search algorithm (Uijlings and colleagues 2012), which feeds in to a large convolutional neural network (with 5 convolutional layers and 2 fully connected layers) that performs feature extraction, which in turn feeds into a set of support vector machines for classification. 
In order to better understand the system, the authors performed an ablation study where different parts of the system were removed - for instance removing one or both of the fully connected layers of the CNN resulted in surprisingly little performance loss, which allowed the authors to conclude\n"},{"_id":"5fee435416869f0374a54e69","treeId":"5282babaf95bac7e40000146","seq":21996577,"position":1,"parentId":"5fee435416869f0374a54e68","content":"They use\n- Uijlings et al., “Selective Search for Object Recognition.”\n - SIFT feature extraction\n - SVM supervised training\n - loop and strengthen hypotheses.\nthen feed the output into a 5CNN+2FC network:\n- Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.”"},{"_id":"5fee435416869f0374a54e6a","treeId":"5282babaf95bac7e40000146","seq":21996578,"position":2,"parentId":"5fee435416869f0374a54e68","content":"What they found:\n- investigate which parts are needed and which aren't\n- they found that the SIFT features were not as critical if there was a *high-capacity CNN* to localize objects\n- they also found that the CNN could be *pre-trained* on a large, unrelated dataset of images and then fine tuned for the specific problem. 
This worked better than specialized computer vision methods, such as SIFT.\n\nThey would only have found this through ablation experiments."},{"_id":"5fee435416869f0374a54e6b","treeId":"5282babaf95bac7e40000146","seq":21996579,"position":6,"parentId":"5fee435416869f0374a54e5c","content":"## References and Further Reading\n#otherreading"},{"_id":"5fee435416869f0374a54e6c","treeId":"5282babaf95bac7e40000146","seq":21996580,"position":1,"parentId":"5fee435416869f0374a54e6b","content":"This question and answer on StackExchange provide a great overview of the recent history of the term in machine learning, with links to further reading: https://stats.stackexchange.com/a/380233"},{"_id":"5fee435416869f0374a54e6d","treeId":"5282babaf95bac7e40000146","seq":21996581,"position":1,"parentId":"5fee435416869f0374a54e6c","content":"A twitter thread by François Chollet (@fchollet) that was part of the \"recent\" surge in popularity of Ablation Studies in Machine Learning.\n\n> Ablation studies are crucial for deep learning research -- can't stress this enough.\n\n> Understanding causality in your system is the most straightforward way to generate reliable knowledge (the goal of any research). And ablation is a very low-effort way to look into causality.\n\n> If you take any complicated deep learning experimental setup, chances are you can remove a few modules (or replace some trained features with random ones) with no loss of performance. Get rid of the noise in the research process: do ablation studies. (Source: https://threader.app/thread/1012721582148550662)\n\n> Can't fully understand your system? Many moving parts? Want to make sure the reason it's working is really related to your hypothesis? Try removing stuff. 
Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.\n\nSee the [whole twitter thread](https://threader.app/thread/1012721582148550662) here."},{"_id":"5fee435416869f0374a54e6e","treeId":"5282babaf95bac7e40000146","seq":21996582,"position":2,"parentId":"5fee435416869f0374a54e6b","content":"### References\n- Newell, Allen (1975). A Tutorial on Speech Understanding Systems. In Speech Recognition: Invited Papers Presented at the 1974 IEEE Symposium. New York: Academic. p. 43.\n- [Fawcett and Hoos, 2013] Chris Fawcett and Holger H. Hoos. Analysing differences between algorithm configurations through ablation.\nProceedings of the 10th Metaheuristics International Conference (MIC 2013), pp. 123-132, 2013. PDF \n"},{"_id":"40ce97414768cfe908000071","treeId":"5282babaf95bac7e40000146","seq":21989162,"position":2,"parentId":"40ce68a54768cfe90800007f","content":"# History of AI/ML\nThe histories of Artificial Intelligence and Machine Learning are tightly intertwined, but there are as many different perspectives on the *important moments* as there are researchers and interested parties."},{"_id":"40ce937c4768cfe908000072","treeId":"5282babaf95bac7e40000146","seq":21989163,"position":1,"parentId":"40ce97414768cfe908000071","content":"## The Past\n- From Online Magazine of AAAI Conference - https://aitopics.org/misc/brief-history\n- Wikipedia on ML - https://en.wikipedia.org/wiki/Timeline_of_machine_learning\n- Wikipedia on AI (better list) - https://en.wikipedia.org/wiki/Timeline_of_artificial_intelligence#2010s\n"},{"_id":"40ce49314768cfe9080000d6","treeId":"5282babaf95bac7e40000146","seq":22001798,"position":0.5,"parentId":null,"content":"---\n\n# Notes on Data Analysis and Machine Learning Concepts\n\n---\n"},{"_id":"40a5f2572c83debb0f0000df","treeId":"5282babaf95bac7e40000146","seq":22122658,"position":0.5,"parentId":"40ce49314768cfe9080000d6","content":"# Parameter Estimation\n#topicparameterestimation\n#methodMLE #methodMAP 
#methodEM #methodNaiveBayes"},{"_id":"40a5f1942c83debb0f0000e0","treeId":"5282babaf95bac7e40000146","seq":22001802,"position":1,"parentId":"40a5f2572c83debb0f0000df","content":"## What is Parameter Estimation?\n**Parameter estimation** is literally the task of guessing the parameters of a function. We can do this through iterative improvement, checking how well our settings work. Just as we do when setting the right angle for the tap in the shower. \n\nOr if we think we know enough about the distribution of the data we can get fancy and do some calculus on an approximation of that distribution (**MAP, MLE, EM**).\n\nAt the most fundamental level though, we are taking the data we have, using it to build an estimator and testing how well it works. Often, we'll improve it through multiple scans of the data, or steps down a gradient, as we do implicitly by setting the gradient to zero in MLE. \n\nWe could say at the end of all this that we have *learned the best parameters* for our model. If we take the algorithm that does the tuning and the estimation together then the machine itself did the learning; we weren't needed at all except to choose the right libraries and format.\n\nIn this sense, the rest of the course is looking at methods of **Machine Learning** that, in much more complex ways, still always estimate some parameters for a model."},{"_id":"40ce489e4768cfe9080000d7","treeId":"5282babaf95bac7e40000146","seq":22122656,"position":1,"parentId":"40ce49314768cfe9080000d6","content":"# Unsupervised Learning\n#topicUnsupervisedLearning"},{"_id":"40ce46e34768cfe9080000d8","treeId":"5282babaf95bac7e40000146","seq":22122657,"position":1,"parentId":"40ce489e4768cfe9080000d7","content":"## Improving model robustness and power with *unsupervised* 
pre-training\n#topicUnsupervisedPretraining\nhttps://openai.com/blog/language-unsupervised/"},{"_id":"40befd9ad7468469890002c2","treeId":"5282babaf95bac7e40000146","seq":21995784,"position":2,"parentId":"40ce49314768cfe9080000d6","content":"# Anomaly Detection\n#topicAnomalyDetection"},{"_id":"40bac318d7468469890002c8","treeId":"5282babaf95bac7e40000146","seq":21995828,"position":0.25,"parentId":"40befd9ad7468469890002c2","content":"## Using Classification for Anomaly Detection\n- Any effective classification model will provide correct labels or predictions for new data. \n- So if a classification model has been trained to *predict* one of the available features, it *could* be used to score datapoints as **normal** or **abnormal**."},{"_id":"40bac2ced7468469890002c9","treeId":"5282babaf95bac7e40000146","seq":21995826,"position":0.375,"parentId":"40befd9ad7468469890002c2","content":"## Using Clustering for Anomaly Detection\n- Consider how DBScan could be used for Anomaly Detection\n- k-means is a common, and very scalable, solution for grouping datapoints together and then finding some patterns to use to argue for an anomaly"},{"_id":"40bab0c9d746846989000328","treeId":"5282babaf95bac7e40000146","seq":21995825,"position":0.4375,"parentId":"40befd9ad7468469890002c2","content":"## Dedicated Anomaly Detection Algorithms\n- Local Outlier Factor (LOF)\n- One-Class SVM\n- Oversampling Principal Component Analysis (osPCA)\n- Isolation Forest\n- iMondrian Forest\n"},{"_id":"40bac38ed7468469890002c6","treeId":"5282babaf95bac7e40000146","seq":21995816,"position":0.5,"parentId":"40befd9ad7468469890002c2","content":"## Local Outlier Factor (LOF)"},{"_id":"40baddcdd7468469890002c3","treeId":"5282babaf95bac7e40000146","seq":21995815,"position":1,"parentId":"40bac38ed7468469890002c6","content":"### References and Further Reading\n- See 
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html"},{"_id":"40bac364d7468469890002c7","treeId":"5282babaf95bac7e40000146","seq":21995792,"position":0.75,"parentId":"40befd9ad7468469890002c2","content":"## One-Class SVM"},{"_id":"5fed535216869f0374a3e52f","treeId":"5282babaf95bac7e40000146","seq":21996550,"position":1.5,"parentId":"40befd9ad7468469890002c2","content":"## References and Further Reading"},{"_id":"40badabfd7468469890002c4","treeId":"5282babaf95bac7e40000146","seq":22122654,"position":1,"parentId":"5fed535216869f0374a3e52f","content":"- This post has a fairly good overview of simple statistical methods and some more complex algorithms that can be used for Anomaly Detection (https://towardsdatascience.com/detecting-real-time-and-unsupervised-anomalies-in-streaming-data-a-starting-point-760a4bacbdf8)"},{"_id":"40badab3d7468469890002c5","treeId":"5282babaf95bac7e40000146","seq":22122655,"position":2,"parentId":"5fed535216869f0374a3e52f","content":"- A post describing a solution architecture by Google using their cloud services. It uses *very* simple clustering (**k-means**) and feature extraction methods, but it has the advantage that it is *extremely* fast and scalable! (https://cloud.google.com/blog/products/data-analytics/anomaly-detection-using-streaming-analytics-and-ai)"},{"_id":"40b296a0d74684698900032b","treeId":"5282babaf95bac7e40000146","seq":21996345,"position":3,"parentId":"40ce49314768cfe9080000d6","content":"## Optimization by Stochastic Search\nThis is a technical but critical part of the success of Deep Learning."},{"_id":"40b294cbd74684698900032c","treeId":"5282babaf95bac7e40000146","seq":21996346,"position":1,"parentId":"40b296a0d74684698900032b","content":"## Adam Optimizer\nNow seen as basically the default gradient descent approach to updating weights during #backpropagation. 
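The update itself is short; here is a single Adam step in plain Python, a sketch of the standard algorithm with its usual default hyperparameters (not code from the course):

```python
# One Adam update step: exponentially-decayed first/second moment
# estimates (m, v), bias correction, then a scaled gradient step.
def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = [b1 * mi + (1 - b1) * g for mi, g in zip(m, grad)]
    v = [b2 * vi + (1 - b2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - b1 ** t) for mi in m]  # bias correction, t starts at 1
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    theta = [p - lr * mh / (vh ** 0.5 + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

theta, m, v = [1.0], [0.0], [0.0]
theta, m, v = adam_step(theta, grad=[2.0], m=m, v=v, t=1)
print(theta)  # first step moves by roughly lr, against the gradient
```

Note how the bias correction makes the very first step have magnitude close to the learning rate regardless of the gradient's scale.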
\n\n- Good Overview with links: https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/"},{"_id":"40ce69b44768cfe90800007e","treeId":"5282babaf95bac7e40000146","seq":21989033,"position":0.75,"parentId":null,"content":"---\n# Notes on Particular Algorithms\n\n---"},{"_id":"5e26209fdf5197044c05b309","treeId":"5282babaf95bac7e40000146","seq":21989141,"position":1,"parentId":"40ce69b44768cfe90800007e","content":"# Gradient Tree Boosting\n#methodGradientTreeBoosting"},{"_id":"5e26209fdf5197044c05b30a","treeId":"5282babaf95bac7e40000146","seq":21995782,"position":1.5,"parentId":"5e26209fdf5197044c05b309","content":"## A More General View of Ensembles\nNow that we know about \n- [DecisionTrees](DecisionTrees)\n- [Boosting](Boosting)\n- [RandomForests](RandomForests) \nwe are ready to learn about a powerful combination of all these concepts with *gradient search* - **Gradient Tree Boosting**.\n"},{"_id":"5e26209fdf5197044c05b30b","treeId":"5282babaf95bac7e40000146","seq":21983618,"position":2,"parentId":"5e26209fdf5197044c05b309","content":"## A More General View of Ensembles\n\nPeople realized that the very successful [Boosting](Boosting) method was in essence a very general meta-algorithm for optimization of the mapping function from input variables to output target variables. \n\nThis algorithm chooses *multiple weak functions* that are combined together, just as the ensemble of decision trees are for [Random Forests](Random Forests)."},{"_id":"5e26209fdf5197044c05b30c","treeId":"5282babaf95bac7e40000146","seq":19738229,"position":3,"parentId":"5e26209fdf5197044c05b309","content":"## What is the Gradient Though? 
\n- One can imagine that this combined function can have a **gradient** \n- In this case this is the *infinitesimal increase* in each of the **function parameters** that would **strengthen** the current response.\n\n### We've already used them\n- In an ensemble of decision trees these parameters are **all of the split points** in each tree, for each data dimension.\n- In Random Forests *gradient is not used*\n- In AdaBoost it is used *implicitly* in a very simple way\n - each new decision tree weak learner \n - is optimized relative to the negative of this gradient\n - since it tries to do *well* on what the existing model does *badly* on."},{"_id":"5e26209fdf5197044c05b30f","treeId":"5282babaf95bac7e40000146","seq":19738240,"position":6,"parentId":"5e26209fdf5197044c05b309","content":"## Doing Better\nThis idea can then be generalized so that each new weak learner is *explicitly* treated as a function that points directly away from the gradient of the current combined function."},{"_id":"5e26209fdf5197044c05b310","treeId":"5282babaf95bac7e40000146","seq":19738250,"position":7,"parentId":"5e26209fdf5197044c05b309","content":"## Gradient Tree Boosting\nGiven some tree-based ensemble model then, represented as a function \n\n$$T_i(X)\\rightarrow Y$$\n\n- after adding $i$ weak learners already we find that the \"perfect\" function for the $i+1^{th}$ weak learner would be\n$$h(x)=Y - T_i(x)$$ \n- this fills in the gap of what the existing models got wrong. 
\n - This is because then the new combined model *perfectly matches the training data*: \n$$T_{(i+1)}(x) = T_i(x) + h(x) = Y$$"},{"_id":"52829ccc12fe52ddb000005d","treeId":"5282babaf95bac7e40000146","seq":21983628,"position":7.5,"parentId":"5e26209fdf5197044c05b309","content":"## Gradient Tree Boosting\n- In practice we need to be satisfied with merely *approaching* this perfect update by following a **functional gradient descent** approach where we use an *approximation of the true residual* (also called the [loss function](lossfunction)) at each step. \n\n- In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree \n\n$$L(Y, T(X)) = \\sum_i (Y-T_i(X))$$"},{"_id":"5fea831b16869f0374a130c3","treeId":"5282babaf95bac7e40000146","seq":21983638,"position":7.75,"parentId":"5e26209fdf5197044c05b309","content":"Gradient Tree Boosting explicitly uses the gradient $\\nabla_{T_i} L(Y,T_i(X))$ of the loss function at each step to fit a new tree \n\n$$h(X)= -\\nabla_{T_i}L(Y,T_i(X))$$\n\nand add it to the ensemble. \n\nThere is also further optimization of weighting functions for each tree and various regularization methods."},{"_id":"5e26209fdf5197044c05b311","treeId":"5282babaf95bac7e40000146","seq":21983607,"position":8,"parentId":"5e26209fdf5197044c05b309","content":"This algorithm is implemented in the popular [XGBoost](https://github.com/dmlc/xgboost) package."},{"_id":"40ce6f0d4768cfe90800007d","treeId":"5282babaf95bac7e40000146","seq":21989024,"position":1.125,"parentId":null,"content":"---\n\n# Where is the next Frontier of AI/ML?\n\n---\n\nWhile no one can predict the future (and, looking at history, AI/ML researchers seem *particularly* bad at this), there are current areas and trends taking up a lot of attention. 
Rather than predicting what will be big in five years, this more likely means that in a few years there will be solid, or at least *well-accepted*, approaches or even solutions to these problems. "},{"_id":"40ce37284768cfe9080000d9","treeId":"5282babaf95bac7e40000146","seq":21989166,"position":0.5,"parentId":"40ce6f0d4768cfe90800007d","content":"# New Advances in Theory"},{"_id":"40ce33744768cfe9080000db","treeId":"5282babaf95bac7e40000146","seq":21989175,"position":0.5,"parentId":"40ce37284768cfe9080000d9","content":"## How Does Learning *Really* Work?\n#topicSequenceModellingWithConvolutions"},{"_id":"40ce31384768cfe9080000dd","treeId":"5282babaf95bac7e40000146","seq":21989183,"position":0.25,"parentId":"40ce33744768cfe9080000db","content":"> “An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”. https://arxiv.org/pdf/1803.01271.pdf"},{"_id":"40ce31844768cfe9080000dc","treeId":"5282babaf95bac7e40000146","seq":21989188,"position":0.5,"parentId":"40ce33744768cfe9080000db","content":"The main point of this paper is to empirically compare the use of **recurrent** and **convolutional** approaches to modelling sequential data. \n\nThey tried this because some people have found that Convolutional architectures sometimes perform better at audio synthesis and translation. But why is that, and is it true in general? 
"},{"_id":"40ce2b414768cfe9080000df","treeId":"5282babaf95bac7e40000146","seq":21989189,"position":0.75,"parentId":"40ce33744768cfe9080000db","content":"### Quotes and Further Reading"},{"_id":"40ce36634768cfe9080000da","treeId":"5282babaf95bac7e40000146","seq":21989179,"position":1,"parentId":"40ce33744768cfe9080000db","content":"> While being highly empirical and using known approaches, it opens the door to uncovering new ones since it proves that the one that is usually regarded as optimal is in fact not\n- https://www.forbes.com/sites/quora/2019/01/15/what-were-the-most-significant-machine-learning-advances-of-2018/?sh=100a63366ddc"},{"_id":"40ce2fd24768cfe9080000de","treeId":"5282babaf95bac7e40000146","seq":21989199,"position":0.75,"parentId":"40ce6f0d4768cfe90800007d","content":"# New - Protein Folding!"},{"_id":"40ce74f44768cfe90800007b","treeId":"5282babaf95bac7e40000146","seq":21989213,"position":0.875,"parentId":"40ce6f0d4768cfe90800007d","content":"## What's New?\n- See this good summary of [ML Advances in 2020](https://www.forbes.com/sites/quora/2019/01/15/what-were-the-most-significant-machine-learning-advances-of-2018/?sh=100a63366ddc).\n- A Review of the biggest results at the biggest ML conference of 2020 NeurIPS : https://towardsdatascience.com/neurips-2020-10-essentials-you-shouldnt-miss-845723f3add6"},{"_id":"40ce6f914768cfe90800007c","treeId":"5282babaf95bac7e40000146","seq":21989165,"position":1,"parentId":"40ce6f0d4768cfe90800007d","content":"# Explainability and Interpretability\n#topicExplainability"},{"_id":"40ce89e74768cfe908000073","treeId":"5282babaf95bac7e40000146","seq":21991058,"position":2,"parentId":"40ce6f0d4768cfe90800007d","content":"# Transfer Learning\n#topicTransferLearning\n#current"},{"_id":"40cddff20b4362e2d1000098","treeId":"5282babaf95bac7e40000146","seq":21991059,"position":2,"parentId":"40ce89e74768cfe908000073","content":"# Transfer Learning 
Fundamentals\n#lecture"},{"_id":"40cb12171da2c809600000a0","treeId":"5282babaf95bac7e40000146","seq":21991022,"position":2.5,"parentId":"40cddff20b4362e2d1000098","content":"# Transfer Learning Definition\n**Transfer Learning:** Attempting to improve performance on a learning task **B** by using a neural network that is *pre-trained* on some other task **A**.\n\n- each task, **A** and **B**, could correspond to classification, prediction, summarization, etc. on a particular dataset\n- The idea mostly arose out of *Image Classification* for CNNs, but it can be applied anywhere\n\n"},{"_id":"40cb02471da2c809600000a1","treeId":"5282babaf95bac7e40000146","seq":21991020,"position":1,"parentId":"40cb12171da2c809600000a0","content":"## Transfer Learning Definition\n- Obviously if the two datasets are *similar* then you would expect transfer learning to work quite well\n - essentially, the pre-training on **A** is just additional training data you don't need to spend time on"},{"_id":"40c942351da2c809600000a5","treeId":"5282babaf95bac7e40000146","seq":21991019,"position":2,"parentId":"40cb12171da2c809600000a0","content":"## A Pleasant Surprise\nWhat is *surprising* is that even if **A** and **B** are *quite different* this is still often a useful approach to take.\n\nWe think this is because:\n- The underlying structure of the world is *persistent*\n- So lessons learned from one problem *carry over* to problems in the same domain.\n\n#### Example:\n1. First, train an object recognition model on a large image dataset involving people, cars, fruit, bikes, trees, buildings\n2. 
Then use that pretrained CNN model but **fine-tune** the classification layers to train it to identify **cats** and **dogs** in the world\n - Now, most of the **features** automatically learned by the CNN are still relevant (edge detection, texture, colour gradients,...)\n\nGo to https://www.tensorflow.org/tutorials/images/transfer_learning and try it out for yourself!\n- **pretraining:** MobileNet V2 using 1.4M images and 1000 classes"},{"_id":"40c92a451da2c809600000a7","treeId":"5282babaf95bac7e40000146","seq":21991025,"position":2.5,"parentId":"40cb12171da2c809600000a0","content":"## Pre-training\n- Pre-training is really just training on a task.\n\n- It is usually *supervised training* and commonly on *image processing* or *text classification* tasks.\n\n- If you're lucky, then someone has already pre-trained a model on a data domain you wish to use\n - photographic images\n - brain scans\n - natural language text documents\n - pedestrians\n - human faces\n\n- In this case, you don't need to do the training at all!"},{"_id":"40c92ac71da2c809600000a6","treeId":"5282babaf95bac7e40000146","seq":21991029,"position":3,"parentId":"40cb12171da2c809600000a0","content":"## Fine Tuning\n- **Question:** Is a neural network restricted to being trained once only, on a batch of data and then frozen for all time?\n- **Answer:** \n\n### A Standard CNN for Image Classification\n(Input + 3x ConvWithScaling + 3x FullyConnected + 1x Softmax)\n\nWhat part can be kept frozen after pre-training and what part can we fine tune?"},{"_id":"40c916321da2c809600000a8","treeId":"5282babaf95bac7e40000146","seq":21991035,"position":2.75,"parentId":"40cddff20b4362e2d1000098","content":"## When to Use Transfer Learning\n- If the data domain is very large and you do not have time to train a model from scratch\n - Corollary: this only helps if there *already is* a pre-trained model for your domain of interest\n- You have reason to believe you can \"learn the fundamentals\" of the domain 
separately and then train on specific tasks afterwards\n - This is clearly reasonable in natural images and language where the **shared structure common to all data** is the truly complex part"},{"_id":"40c906531da2c809600000aa","treeId":"5282babaf95bac7e40000146","seq":21991037,"position":1,"parentId":"40c916321da2c809600000a8","content":"## When not to use Transfer Learning?\n- Imagine a structurally simple set of sensor readings, just a set of numbers coming from heat, light, energy sensors. \n- There is no known prior data for this domain, nor any sense that there is a common pattern that could be learnt elsewhere\n- You may just need to train from scratch, as a supervised learning problem to solve your task."},{"_id":"40c906c31da2c809600000a9","treeId":"5282babaf95bac7e40000146","seq":21991038,"position":2,"parentId":"40c916321da2c809600000a8","content":"## Caveat\nGiven what we know about Human Learning, it is very likely that some kind of transfer learning is *always* a good idea. \n- In other words, starting from scratch may not be worth it.\n- **BUT** whether you can carry out transfer learning in your domain depends on the data, compute power and many other factors."},{"_id":"40cdddaa1da2c8096000009b","treeId":"5282babaf95bac7e40000146","seq":21991050,"position":3,"parentId":"40cddff20b4362e2d1000098","content":"## Relation to Inception and ResNet\n- [ ] (merge existing slides in here)\n- these models are often used as the \"pre-trained\" model that people do transfer learning with\n- A *very, very, very common* type of paper a few years ago was \n>\"We take ResNet-50 (or ResNet-16) trained on ImageNet and fine-tune the last X (3-10) layers for our (FANCY IMAGE CLASSIFICATION) task and demonstrate SotA performance.\" - PublishMePlease"},{"_id":"40cddc4e1da2c8096000009d","treeId":"5282babaf95bac7e40000146","seq":21991045,"position":5,"parentId":"40cddff20b4362e2d1000098","content":"### Relation to NLP\n(see #methodTransferLearningNLP)\n- In fact, we've seen transfer 
learning already, when we talked about **Document Classification**\n- GloVe and Word2Vec do this by training on large document corpora and then making the models available for others to use.\n- GPT and BERT are following in the footsteps of this"},{"_id":"40cddbe21da2c8096000009f","treeId":"5282babaf95bac7e40000146","seq":21991055,"position":7,"parentId":"40cddff20b4362e2d1000098","content":"## Bring it All Together\nAn exciting unifying paper in 2018 from OpenAI brings many of these threads together for language understanding\n- Blog: https://openai.com/blog/language-unsupervised/\n- Paper: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf\n\nThey use the idea of **Transfer Learning** and merge it with two other recent, related, advances:\n- **Transformer Networks** (https://arxiv.org/abs/1706.03762) and\n- **Semi-supervised Sequence Training** for natural language using a kind of **autoencoder** (https://arxiv.org/abs/1511.01432)\n\n- [ ] todo: refresh this idea in your mind before you discuss it, it's complex but it all fits together"},{"_id":"40cddf301da2c80960000099","treeId":"5282babaf95bac7e40000146","seq":21990508,"position":8,"parentId":"40cddff20b4362e2d1000098","content":"### Resources\n- Google's Tensorflow Documentation has a good introduction where you can play with code: https://www.tensorflow.org/tutorials/images/transfer_learning"},{"_id":"40cddc231da2c8096000009e","treeId":"5282babaf95bac7e40000146","seq":21991066,"position":2.5,"parentId":"40ce89e74768cfe908000073","content":"## Transformer Networks\n- [ ] used for BERT, give a brief highlight of what it really is"},{"_id":"40ce87c84768cfe908000075","treeId":"5282babaf95bac7e40000146","seq":21991044,"position":3,"parentId":"40ce89e74768cfe908000073","content":"## Transfer Learning in 
NLP\n#methodNLP\n#methodTransferLearningNLP"},{"_id":"40ce57094768cfe9080000aa","treeId":"5282babaf95bac7e40000146","seq":21989146,"position":1,"parentId":"40ce87c84768cfe908000075","content":"### New or Updated ML NLP Methods in 2020\n#readNLPNews2020\n\nSee this good summary of [ML Advances in 2020](https://www.forbes.com/sites/quora/2019/01/15/what-were-the-most-significant-machine-learning-advances-of-2018/?sh=100a63366ddc).\n- Duplex dialog system (google) #applicationDuplex\n- Smart Compose (google) - #applicationSmartCompose https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html?m=1\n- ELMO - #methodELMO\n- Transformer Networks (OpenAI?) #methodTransformerNetwork\n- BERT (google) #methodBERT"},{"_id":"40ce899f4768cfe908000074","treeId":"5282babaf95bac7e40000146","seq":21989122,"position":1.5,"parentId":"40ce87c84768cfe908000075","content":"### Some Quotes and Further Reading\n> These models have been described as the “Imagenet moment for NLP” since they show the practicality of transfer learning in the language domain by providing ready-to-use pre-trained and general models that can be also fine-tuned for specific tasks.\n- https://www.forbes.com/sites/quora/2019/01/15/what-were-the-most-significant-machine-learning-advances-of-2018/?sh=100a63366ddc"},{"_id":"40ce79754768cfe908000076","treeId":"5282babaf95bac7e40000146","seq":21989108,"position":2,"parentId":"40ce87c84768cfe908000075","content":"### Duplex dialog system (google)\n#applicationDuplex\n"},{"_id":"40ce794b4768cfe908000077","treeId":"5282babaf95bac7e40000146","seq":21989109,"position":3,"parentId":"40ce87c84768cfe908000075","content":"### Smart Compose (google)\n#applicationSmartCompose \nhttps://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html?m=1"},{"_id":"40ce79294768cfe908000078","treeId":"5282babaf95bac7e40000146","seq":21989110,"position":4,"parentId":"40ce87c84768cfe908000075","content":"### 
ELMO\n#methodELMO\n"},{"_id":"40ce79044768cfe908000079","treeId":"5282babaf95bac7e40000146","seq":21989147,"position":5,"parentId":"40ce87c84768cfe908000075","content":"### Transformer Networks \n(OpenAI?)\n#methodTransformerNetwork\n"},{"_id":"40ce78f44768cfe90800007a","treeId":"5282babaf95bac7e40000146","seq":21989148,"position":6,"parentId":"40ce87c84768cfe908000075","content":"### BERT (google)\n#methodBERT\n#methodTransformerNetwork\n"},{"_id":"40ce67844768cfe908000080","treeId":"5282babaf95bac7e40000146","seq":21989143,"position":3,"parentId":"40ce6f0d4768cfe90800007d","content":"# Natural Language Processing\n#topicNLP\n\n## Methods\n#topicTransferLearningNLP\n#methodTransformerNetwork\n\n#applicationDuplex\n#applicationSmartCompose \n#methodELMO\n#methodTransformerNetwork\n#methodBERT\n\n#readNLPNews2020\n\n\n"},{"_id":"4065c65b8131f16c6500015c","treeId":"5282babaf95bac7e40000146","seq":22024303,"position":1.484375,"parentId":null,"content":"---\n\n# Fun\n\nYou are allowed to have fun...sometimes.\n\n---"},{"_id":"4065c5548131f16c6500015d","treeId":"5282babaf95bac7e40000146","seq":22024357,"position":1,"parentId":"4065c65b8131f16c6500015c","content":"## For everyone who took ECE606 last term\n[![XKCD comic number 2407, a very funny comparison of standard tree search algorithms, such as depth-first and breadth-first, as well as some lesser known ones: Brepth-first, Deadth-first and Bread-First Search, which skips the tree entirely and jumps directly to a loaf of bread.](https://www.filepicker.io/api/file/PMfGcHQGTNm50NeG9Y28 \"XKCD/2407\")](https://xkcd.com/2407)\n- from [XKCD:2407](https://xkcd.com/2407/)\n\n"}],"tree":{"_id":"5282babaf95bac7e40000146","name":"DKMA","publicUrl":"dkma","latex":true}}