A graduate course in the
Electrical and Computer Engineering Department at
The University of Waterloo
taught by
Prof. Mark Crowley
You are allowed to have fun…sometimes.
The course has no required textbook itself. There are a number of resources listed on the course website and here that are useful. The fact is that the pace and nature of how the field changes these days makes it very hard for any physical textbook to simultaneously cover the fundamentals as well as the latest relevant trends and advances that are required for a course like this. So the web is full of blogs, info-sites, corporate demonstration pages and framework documentation sites that provide fantastic descriptions of all of the concepts in this course, with the latest technology and approaches. Finding the best ones is hard, of course, so on this humble gingko tree I will make my best attempt to curate a list of resources relevant to the topic of this course as I come across them in my own, never-ending-mad-rush-to-stay-up-to-date.
:)
One thing we sometimes think we want is a universal solution to a problem.
The history of Artificial Intelligence and Machine Learning are tightly intertwined, but there are as many different perspectives on the important moments as there are researchers and interested parties.
#topicparameterestimation
#methodMLE #methodMAP #methodEM #methodNaiveBayes
This is a technical, but critical part of the success of Deep Learning.
No one can predict the future, and looking at history, AI/ML researchers seem particularly bad at it. Still, there are current areas and trends taking up a lot of attention. Rather than telling us what will be big in five years, this more likely means that in a few years there will be solid, or at least well-accepted, approaches or even solutions to these problems.
In summary, nominal variables are used to “name,” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values + the ability to quantify the difference between each one. Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.
https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
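A tiny illustration of the four scales (the example values below are my own, not from the linked article):

```python
# Made-up examples of the four measurement scales.
measurement_scales = {
    "nominal":  ["red", "green", "blue"],           # labels only; no order
    "ordinal":  ["poor", "fair", "good", "great"],  # ordered, but the gaps are not comparable
    "interval": [10.0, 20.0, 30.0],                 # e.g. temperature in Celsius: equal gaps, no true zero
    "ratio":    [0.0, 1.8, 3.6],                    # e.g. distance in metres: true zero, ratios make sense
}

# On a ratio scale 3.6 m really is twice 1.8 m;
# on an interval scale 20 C is not "twice as hot" as 10 C.
```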
Some fun thoughts that tie information entropy, random search, sampling and security to the never-ending challenge of picking a new password.
Once you have a trained model that gives you some kind of response, how do you figure out why it is working?
The fundamentals of data cleaning and preparation, the types of different data, how to normalize it, how to extract features for different purposes, how to visualize and ask the right questions of our data; these are all critical skills. No number of fancy software frameworks will help you get useful results if you don’t have these skills in the first place.
At the same time, some algorithms and methodologies have become essentially irrelevant because better ones have been discovered. So, there is no point learning how to use them in detail if they will never be used in industry or even cutting edge research. But when that happens in under 10 years, it’s very hard for a textbook to remain relevant.
Some people would disagree with this, and in a sense, from a research point of view, every method that was useful at one stage is still worthy of study, if only to understand how solutions can be found without fully understanding all of the tradeoffs. For example, SIFT features are incredibly powerful summarizations of context in images and revolutionized image recognition tasks before CNNs had been fully developed. Now they are simply one type of feature that a CNN could learn directly from data.
Parameter estimation is literally the task of guessing the parameters to a function. We can do this through iterative improvement, checking how well our settings work. Just as we do when setting the right angle for the tap in the shower.
Or if we think we know enough about the distribution of the data we can get fancy and do some calculus on an approximation of that distribution (MAP, MLE, EM).
At the most fundamental level though, we are taking the data we have, using it to build an estimator and test how well it works. Often, we’ll improve it through multiple scans of the data, or steps down a gradient, as we do implicitly by setting the gradient to zero in MLE.
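As the simplest worked example of “setting the gradient to zero”, maximum likelihood estimation of the mean of a Gaussian (with known variance) gives back the sample mean:

```latex
\log L(\mu) = \sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, \sigma^2)
            = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 + \text{const}

\frac{\partial \log L}{\partial \mu}
            = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu}_{\mathrm{MLE}} = \frac{1}{n} \sum_{i=1}^{n} x_i
```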
We could say at the end of all this that we have learned the best parameters for our model. If we take the algorithm that does the tuning and the estimation together, then the machine itself did the learning; we weren’t needed at all except to choose the right libraries and format.
In this sense, the rest of the course is looking at methods of Machine Learning that, in much more complex ways, still always estimate some parameters for a model.
#topicUnsupervisedPretraining
https://openai.com/blog/language-unsupervised/
Now seen as basically the default gradient descent approach to updating weights during #backpropagation.
Now that we know about
People realized that the very successful Boosting method was in essence a very general meta-algorithm for optimization of the mapping function from input variables to output target variables.
This algorithm combines multiple weak functions, just as Random Forests combine an ensemble of decision trees.
This idea can then be generalized so that each new weak learner is explicitly treated as a function that points in the direction of the negative gradient of the loss of the current combined function.
Given some tree-based ensemble model, represented as a function
In practice we need to be satisfied with merely approaching this perfect update by taking a functional gradient descent approach, where at each step we fit the new learner to an approximation of the true gradient of the loss function (the so-called pseudo-residuals).
In our case this approximation is simply the wrong answers (i.e. the residuals) left over after summing the predictions of the weak learner decision trees built so far.
Gradient Tree Boosting explicitly uses the gradient of the loss function of the current ensemble to fit a new tree
and add it to the ensemble.
There is also further optimization of weighting functions for each tree and various regularization methods.
This algorithm is implemented in the popular XGBoost package.
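A minimal sketch of that loop, assuming scikit-learn’s DecisionTreeRegressor as the weak learner and squared-error loss (for which the negative gradient is exactly the residual). The hyperparameter values are illustrative, and this is not XGBoost’s actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=3):
    """Fit F(x) = f0 + learning_rate * sum_m h_m(x) by repeatedly fitting a
    small tree to the current residuals (the negative gradient of squared-error loss)."""
    f0 = float(np.mean(y))                      # initial constant prediction
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                    # negative gradient of 0.5 * (y - F(x))^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                  # the new weak learner approximates the gradient step
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gradient_boosting(X, f0, trees, learning_rate=0.1):
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```

XGBoost builds on this basic loop with second-order gradient information, per-leaf weight optimization and the regularization mentioned above.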
For a very fundamental view of probability from another of Prof. Crowley’s courses, you can view the lectures and tutorials for ECE 108.
ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20): https://www.youtube.com/channel/UCHqrRl12d0WtIyS-sECwkRQ/playlists
The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.
From last year’s course website for ECE 493 T25 “Reinforcement Learning”.
Some of this we won’t need so much but they are all useful to know for Machine Learning methods in general.
https://rateldajer.github.io/ECE493T25S19/preliminaries/probabilityreview/
Topics
A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE, and MAP are all connected.
https://quantivity.wordpress.com/2011/05/23/why-minimize-negative-log-likelihood/
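One piece of that connection written out explicitly: under a Gaussian noise model with fixed variance, minimizing the negative log-likelihood is exactly the same as minimizing squared-error loss.

```latex
-\log p(y \mid x; \theta)
   = \frac{\bigl(y - f_\theta(x)\bigr)^2}{2\sigma^2} + \log\!\bigl(\sigma \sqrt{2\pi}\bigr)

% the second term is constant in \theta, so
\hat{\theta}_{\mathrm{MLE}}
   = \arg\min_\theta \sum_i \bigl(y_i - f_\theta(x_i)\bigr)^2
```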
“An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”. https://arxiv.org/pdf/1803.01271.pdf
The main point of this paper is to empirically compare the use of recurrent and convolutional approaches to modelling sequential data.
They tried this because some people have found that Convolutional architectures sometimes perform better at audio synthesis and translation. But why is that and is it true in general?
While highly empirical and based on known approaches, it opens the door to uncovering new ones, since it shows that the approach usually regarded as optimal is in fact not always the best choice.
Ablation - the removal, especially of organs, abnormal growths, or harmful substances, from the body by mechanical means, as by surgery. — Dictionary.com
Definition from [Fawcett and Hoos, 2013]:
Our use of the term ablation follows that of Aghaeepour and Hoos (2013) and loosely echoes its meaning in medicine, where it refers to the surgical removal of organs, organ parts or tissues. We ablate (i.e., remove) changes in the settings of algorithm parameters to better understand the contribution of those changes to observed differences in algorithm performance.
As one person puts it (see this twitter thread by @fchollet)
These tools and other algorithm configuration tools help to set the many complex parameters needed to achieve optimal, or at least maximal, performance.
But they spit out the parameters without any explanation.
So in [Fawcett and Hoos, 2013 and 2016] they propose ways to:
help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance.
A nice example explained here: https://stats.stackexchange.com/questions/380040/what-is-an-ablation-study-and-is-there-a-systematic-way-to-perform-it
As an example, Girshick and colleagues (2014) describe an object detection system that consists of three “modules”: The first proposes regions of an image within which to search for an object using the Selective Search algorithm (Uijlings and colleagues 2012), which feeds in to a large convolutional neural network (with 5 convolutional layers and 2 fully connected layers) that performs feature extraction, which in turn feeds into a set of support vector machines for classification. In order to better understand the system, the authors performed an ablation study where different parts of the system were removed - for instance removing one or both of the fully connected layers of the CNN resulted in surprisingly little performance loss, which allowed the authors to conclude
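A minimal, self-contained sketch of what such a study looks like in code, using Keras and MNIST as stand-ins (an assumed toy setup, not the R-CNN pipeline from the example above): train two variants that differ only in whether the fully connected layers are present, then compare validation accuracy.

```python
import tensorflow as tf

def build_variant(use_fc_layers: bool) -> tf.keras.Model:
    """Small CNN; the two Dense layers are the component being ablated."""
    layers = [
        tf.keras.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
    ]
    if use_fc_layers:
        layers += [tf.keras.layers.Dense(128, activation="relu"),
                   tf.keras.layers.Dense(64, activation="relu")]
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train, x_val = x_train[..., None] / 255.0, x_val[..., None] / 255.0

for name, use_fc in [("with FC layers", True), ("ablated: no FC layers", False)]:
    model = build_variant(use_fc)
    model.fit(x_train, y_train, epochs=1, batch_size=128, verbose=0)
    _, acc = model.evaluate(x_val, y_val, verbose=0)
    print(f"{name}: validation accuracy = {acc:.3f}")

# If the ablated variant scores about the same, the FC layers are not
# what makes this particular system work.
```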
This question and answer on StackExchange provide a great overview of the recent history of the term in machine learning and links to further reading: https://stats.stackexchange.com/a/380233
Transfer Learning: Attempting to improve performance on a learning task B by using a neural network that is pre-trained on some other task A.
“We take ResNet-50 (or ResNet-16) trained in ImageNet and fine-tune the last X (3-10) layers for our (FANCY IMAGE CLASSIFICATION) task and demonstrate SotA performance.” - PublishMePlease
(see #methodTransferLearningNLP)
An exciting unifying paper in 2018 from OpenAI brings many of these threads together for language understanding
Where they use the idea of Transfer Learning and merge it with two other recent, related advances:
See this good summary of ML Advances in 2020.
These models have been described as the “Imagenet moment for NLP” since they show the practicality of transfer learning in the language domain by providing ready-to-use pre-trained and general models that can be also fine-tuned for specific tasks.
(OpenAI?)
#methodTransformerNetwork
They use
What they found:
They would only have found this through ablation experiments.
A twitter thread by François Chollet (@fchollet) that was part of the “recent” surge in popularity of Ablation Studies in Machine Learning.
Ablation studies are crucial for deep learning research — can’t stress this enough.
Understanding causality in your system is the most straightforward way to generate reliable knowledge (the goal of any research). And ablation is a very low-effort way to look into causality.
If you take any complicated deep learning experimental setup, chances are you can remove a few modules (or replace some trained features with random ones) with no loss of performance. Get rid of the noise in the research process: do ablation studies. (Source: https://threader.app/thread/1012721582148550662)
Can’t fully understand your system? Many moving parts? Want to make sure the reason it’s working is really related to your hypothesis? Try removing stuff. Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.
See the whole twitter thread here.
What is surprising is that even if A and B are quite different this is still often a useful approach to take.
We think this is because:
Go to https://www.tensorflow.org/tutorials/images/transfer_learning and try it out for yourself!
Pre-training is really just training on a task.
It is usually supervised training and commonly on image processing or text classification tasks.
If you’re lucky, then someone has already pre-trained a model on a data domain you wish to use
In this case, you don’t need to do the training at all!
(Input + 3x ConvWithScaling + 3x FullyConnected + 1x Softmax)
What part can be kept frozen after pre-training and what part can we fine tune?
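A minimal sketch of one common answer, loosely following the TensorFlow tutorial linked above: freeze the pre-trained convolutional base (MobileNetV2, as in the tutorial), train only a new head, then optionally unfreeze the last few layers of the base for fine-tuning at a much lower learning rate. The new head, the image size and the “last 10 layers” cutoff are illustrative choices, not a definitive recipe.

```python
import tensorflow as tf

# Pre-trained convolutional base (trained on ImageNet), without its classifier head.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights="imagenet")
base.trainable = False                      # keep the pre-trained features frozen

# New, task-specific head trained from scratch on our own data.
inputs = tf.keras.Input(shape=(160, 160, 3))
x = tf.keras.applications.mobilenet_v2.preprocess_input(inputs)
x = base(x, training=False)                 # keep BatchNorm layers in inference mode
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(1)(x)       # e.g. a binary classification task
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # your own datasets

# Optional fine-tuning: unfreeze only the last few layers of the base
# and continue training with a much smaller learning rate.
base.trainable = True
for layer in base.layers[:-10]:             # "last 10 layers" is an illustrative choice
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)
```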
Given what we know about Human Learning, it is very likely that some kind of transfer learning is always a good idea.