ECE 657A

Data and Knowledge Modelling and Analysis

A graduate course in the
Electrical and Computer Engineering Department at
The University of Waterloo
taught by
Prof. Mark Crowley

Winter 2021 (Jan-April, 2021)

These notes are dynamic and will be updated over the term

Course Textbook(s)

The course itself has no required textbook. A number of useful resources are listed on the course website and here. The pace and nature of change in the field make it very hard for any physical textbook to cover the fundamentals as well as the latest relevant trends and advances that a course like this requires. Meanwhile, the web is full of blogs, info-sites, corporate demonstration pages and framework documentation sites that provide fantastic descriptions of all of the concepts in this course, with the latest technology and approaches. Finding the best ones is hard, of course, so on this humble gingko tree I will make my best attempt to curate a list of resources relevant to the topic of this course as I come across them in my own never-ending mad rush to stay up to date.


Aside - Fundamentals vs. The Bleeding Edge

The fundamentals of data cleaning and preparation, the types of different data, how to normalize it, how to extract features for different purposes, how to visualize and ask the right questions of our data; these are all critical skills. No number of fancy software frameworks will help you get useful results if you don’t have these skills in the first place.

At the same time, some algorithms and methodologies have become essentially irrelevant because better ones have been discovered. So, there is no point learning how to use them in detail if they will never be used in industry or even cutting-edge research. But when that happens in under 10 years, it’s very hard for a textbook to remain relevant.

Some people would disagree with this and, in a sense, from a research point of view every method that was useful at one stage is still worthy of study, if only to understand how solutions can be found without fully understanding all of the tradeoffs. For example, SIFT features are incredibly powerful summarizations of context in images and revolutionized image recognition tasks before CNNs had been fully developed. Now they are simply one type of feature that a CNN could learn directly from data.

Universal References

One thing we sometimes think we want is a universal solution to a problem.

  • Murphy book
    • buy it: amazon link
    • library: If you are familiar with the idea of a Library, then for my actual course ECE657A that this gingko is primarily created for, this book is on hold at Davis Library for short term use.
      • If you are not familiar with libraries, then this information is not useful to you until you obtain the required polarity reversing phase transmogrifier from level 42 of the OASIS.
  • Deep Learning Book
  • “A Course in Machine Learning” by Hal Daumé III
    • url:
    • Comment: I’ve only recently discovered this book but it seems like a solid, simple approach at explaining the fundamentals and methods in many of the same areas as this course. It’s free online.

Notes on Fundamentals

Data Types

In summary, nominal variables are used to “name,” or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.
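As a small illustration of the four scales, the sketch below pairs each one with an operation that is meaningful for it. The example data (colours, survey answers, temperatures, weights) and the ordering of the survey levels are made up for illustration.

```python
from statistics import median, mode

# Nominal: labels only, so counting and the mode are meaningful.
colours = ["red", "blue", "red", "green"]
most_common_colour = mode(colours)

# Ordinal: ordered categories, so a median rank is meaningful,
# but the size of the gap between "ok" and "good" is not defined.
levels = ["bad", "ok", "good"]              # assumed ordering, worst to best
answers = ["ok", "good", "bad", "good", "ok"]
ranks = sorted(levels.index(a) for a in answers)
median_answer = levels[int(median(ranks))]

# Interval: differences are meaningful, ratios are not (no true zero).
temps_c = [10.0, 20.0]
temp_difference = temps_c[1] - temps_c[0]   # "10 degrees warmer" is fine
# temps_c[1] / temps_c[0] == 2 does NOT mean "twice as hot".

# Ratio: a true zero exists, so ratios are meaningful as well.
weights_kg = [2.0, 4.0]
weight_ratio = weights_kg[1] / weights_kg[0]  # "twice as heavy" is fine

print(most_common_colour, median_answer, temp_difference, weight_ratio)
```

Each scale also supports everything the scales before it support: the weights could be ranked (ordinal) or subtracted (interval) just as well.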

Probability and Statistics Fundamentals Review

Notes From My Other Courses

ECE 108 YouTube Videos

For a very fundamental view of probability from another course of Prof. Crowley you can view the lectures and tutorials for ECE 108

ECE 108 Youtube (look at “future lectures” and “future tutorials” for S20):

The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

Probability Intro Markdown Notes

From the course website for last year for ECE 493 T25 “Reinforcement Learning”.

We won’t need all of these topics in depth, but they are all useful to know for Machine Learning methods in general.


  • Elements of Probability
  • Conditional Probability Rules
  • Random Variables
  • Probability Functions - Cumulative/Mass/Density
  • Expectation
  • Variance
  • Multi-variable distributions
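To make a few of these topics concrete, here is a minimal sketch for a single fair six-sided die (an example chosen here, not taken from the linked notes). It computes the expectation, variance and cumulative distribution directly from the probability mass function, using exact fractions.

```python
from fractions import Fraction

# Probability mass function of one fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def cdf(pmf, x):
    """Cumulative distribution function: P(X <= x)."""
    return sum(p for value, p in pmf.items() if value <= x)

print(expectation(pmf))  # 7/2
print(variance(pmf))     # 35/12
print(cdf(pmf, 2))       # 1/3
```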

On Entropy and Security

Some fun thoughts that tie information entropy, random search, sampling and security to the never-ending challenge of picking a new password.

Remembering Complex Passwords

Using Dice to pick your password :
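As a rough sketch of the entropy argument, the snippet below compares a passphrase of dice-selected words with a short random character password. It assumes a Diceware-style word list indexed by five dice rolls (so 6^5 = 7776 words) and that every pick is independent and uniform; the function names are made up for this example.

```python
import math
import secrets

def entropy_bits(num_choices, num_picks):
    """Entropy in bits of num_picks independent, uniform picks
    from num_choices equally likely options."""
    return num_picks * math.log2(num_choices)

# Assumption: a Diceware-style list indexed by five dice rolls.
WORDS_IN_LIST = 6 ** 5  # 7776

def roll_word_index(num_dice=5):
    """Simulate rolling num_dice dice and turn them into a list index."""
    index = 0
    for _ in range(num_dice):
        index = index * 6 + secrets.randbelow(6)  # one die, encoded as 0..5
    return index

six_words = entropy_bits(WORDS_IN_LIST, 6)    # six dice-selected words
eight_chars = entropy_bits(26 + 26 + 10, 8)   # 8 random letters/digits
print(f"{six_words:.1f} bits vs {eight_chars:.1f} bits")
```

Under these assumptions six memorable words carry roughly 30 more bits of entropy than eight random alphanumeric characters, which is the point of picking words with dice.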

References and Further Reading

Likelihood, Loss and Risk

A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE and MAP are all connected.
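One of these connections can be written in a couple of lines. This is a sketch in notation chosen here (not taken from the article), assuming i.i.d. discrete data $x_1,\dots,x_N$ with empirical distribution $\hat{p}$, a model $p_\theta$ and a prior $p(\theta)$. Taking the negative log-likelihood as the loss, the empirical risk is the average loss over the data, and minimizing it is the same as maximizing the likelihood:

$$\hat{\theta}_{MLE} = \arg\max_\theta \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n) = \arg\min_\theta \frac{1}{N}\sum_{n=1}^{N} -\log p_\theta(x_n)$$

The KL divergence from the model to the empirical distribution differs from this risk only by a term that does not depend on $\theta$:

$$\mathrm{KL}(\hat{p}\,\|\,p_\theta) = \mathbb{E}_{\hat{p}}[\log \hat{p}(x)] - \mathbb{E}_{\hat{p}}[\log p_\theta(x)] \quad\Rightarrow\quad \hat{\theta}_{MLE} = \arg\min_\theta \mathrm{KL}(\hat{p}\,\|\,p_\theta)$$

MAP estimation adds the log prior to the same objective:

$$\hat{\theta}_{MAP} = \arg\max_\theta \left[\sum_{n=1}^{N} \log p_\theta(x_n) + \log p(\theta)\right]$$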

Online Pre-recorded Lecture

Experimental Methodology

Ablation Studies

Once you have a trained model that gives you some kind of response, how do you figure out why it is working?


Ablation: the removal, especially of organs, abnormal growths, or harmful substances, from the body by mechanical means, as by surgery.

Definition from [Fawcett and Hoos, 2013]:

Our use of the term ablation follows that of Aghaeepour and Hoos (2013) and loosely echoes its meaning in medicine, where it refers to the surgical removal of organs, organ parts or tissues. We ablate (i.e., remove) changes in the settings of algorithm parameters to better understand the contribution of those changes to observed differences in algorithm performance.

Motivation for Ablation Studies

As one person puts it (see this twitter thread by @fchollet)

  • how do you determine ‘causality’ between which parts of your system are responsible for the performance?
  • Advice: “Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.”

Automated Parameter Tuning

Automated parameter tuning tools, and other algorithm configuration tools, help to set the many complex parameters needed to achieve optimal, or at least near-optimal, performance.
But they spit out the parameters without any explanation.

So in [Fawcett and Hoos, 2013 and 2016] the authors propose ways to:

help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance.

What Ablation Is and Isn’t

It Is…

  • a good way to determine what parts of your model are useful, which are necessary and which may be unnecessary
  • an approach to help you understand and explain your model to others by showing how each part contributes to your state-of-the-art performance
  • essentially a method for improving your model selection/design process
  • therefore : all ablation analysis should be done on
    1. The Test Dataset?
    2. The Training Dataset?
    3. A Validation Training Dataset?
  • Answer: 3. If the goal is to use ablation to improve the model design, then such analysis must happen on a held out validation dataset, not the final testing dataset.
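A minimal sketch of what such an analysis can look like, under assumptions made only for this example: a synthetic two-class dataset where feature 0 is informative and feature 1 is pure noise, a nearest-centroid classifier as the "model", and ablation by retraining without one feature. All scores are computed on a held-out validation set, never on the test set.

```python
import random

random.seed(0)

def make_data(n):
    """Two classes; feature 0 is informative, feature 1 is pure noise."""
    data = []
    for _ in range(n):
        label = random.choice([0, 1])
        informative = random.gauss(2.0 if label else -2.0, 1.0)
        noise = random.gauss(0.0, 1.0)
        data.append(([informative, noise], label))
    return data

def fit_centroids(data, keep):
    """Nearest-centroid model using only the feature indices in keep."""
    centroids = {}
    for cls in (0, 1):
        rows = [x for x, y in data if y == cls]
        centroids[cls] = [sum(r[j] for r in rows) / len(rows) for j in keep]
    return centroids

def accuracy(centroids, data, keep):
    """Fraction of datapoints assigned to the correct class centroid."""
    correct = 0
    for x, y in data:
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(keep, mu))
                for c, mu in centroids.items()}
        correct += min(dist, key=dist.get) == y
    return correct / len(data)

train, validation = make_data(400), make_data(400)

full = [0, 1]
baseline = accuracy(fit_centroids(train, full), validation, full)
drops = {}
for removed in full:
    keep = [j for j in full if j != removed]
    score = accuracy(fit_centroids(train, keep), validation, keep)
    drops[removed] = baseline - score  # validation accuracy lost by ablating it
print(baseline, drops)
```

Removing the informative feature should cost far more validation accuracy than removing the noise feature, which is the kind of evidence an ablation study is meant to provide about each part of a system.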

It Is Not…

  • a regularization method
    • why not?
    • While exploring
  • a way to push your testing numbers (accuracy, recall, confidence) higher
  • a way to fill in the space of your paper with more experiments and graphs
    • it will do this, but that is not the purpose. If you cannot fill a 6-9 page paper with your own background, theories, data, methodology and results then adding two pages of ablation studies will not save you.

An Example

A nice example explained here:

As an example, Girshick and colleagues (2014) describe an object detection system that consists of three “modules”: The first proposes regions of an image within which to search for an object using the Selective Search algorithm (Uijlings and colleagues 2012), which feeds in to a large convolutional neural network (with 5 convolutional layers and 2 fully connected layers) that performs feature extraction, which in turn feeds into a set of support vector machines for classification. In order to better understand the system, the authors performed an ablation study where different parts of the system were removed - for instance removing one or both of the fully connected layers of the CNN resulted in surprisingly little performance loss, which allowed the authors to conclude

They use

  • Uijlings et al., “Selective Search for Object Recognition.”
    • SIFT feature extraction
    • SVM supervised training
    • loop and strengthen hypotheses.
      then feed the output into a 5CNN+2FC network:
  • Girshick et al., “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.”

What they found:

  • investigate which parts are needed and which aren’t
  • they found that the SIFT features were not as critical if there was a high-capacity CNN to localize objects
  • they also found that the CNN could be pre-trained on a large, unrelated dataset of images and then fine tuned for the specific problem. This worked better than specialized computer vision methods, such as SIFT.

They would only have found this through ablation experiments.

References and Further Reading


This question and answer on StackExchange provide a great summary of the recent history of the term machine learning and links to further reading :

A twitter thread by François Chollet (@fchollet) that was part of the “recent” surge in popularity of Ablation Studies in Machine Learning.

Ablation studies are crucial for deep learning research — can’t stress this enough.

Understanding causality in your system is the most straightforward way to generate reliable knowledge (the goal of any research). And ablation is a very low-effort way to look into causality.

If you take any complicated deep learning experimental setup, chances are you can remove a few modules (or replace some trained features with random ones) with no loss of performance. Get rid of the noise in the research process: do ablation studies. (Source:

Can’t fully understand your system? Many moving parts? Want to make sure the reason it’s working is really related to your hypothesis? Try removing stuff. Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.

See the whole twitter thread here.


  • Newell, Allen (1975). A Tutorial on Speech Understanding Systems. In Speech Recognition: Invited Papers Presented at the 1974 IEEE Symposium. New York: Academic. p. 43.
  • [Fawcett and Hoos, 2013] Chris Fawcett and Holger H. Hoos. Analysing differences between algorithm configurations through ablation.
    Proceedings of the 10th Metaheuristics International Conference (MIC 2013), pp. 123-132, 2013. PDF

History of AI/ML

The histories of Artificial Intelligence and Machine Learning are tightly intertwined, but there are as many different perspectives on the important moments as there are researchers and interested parties.

Notes on Data Analysis and Machine Learning Concepts

What is Parameter Estimation?

Parameter estimation is literally the task of guessing the parameters to a function. We can do this through iterative improvement, checking how well our settings work, just as we do when setting the right angle for the tap in the shower.

Or if we think we know enough about the distribution of the data we can get fancy and do some calculus on an approximation of that distribution (MAP, MLE, EM).

At the most fundamental level though, we are taking the data we have, using it to build an estimator and testing how well it works. Often, we’ll improve it through multiple scans of the data, or steps down a gradient, as we do implicitly by setting the gradient to zero in MLE.
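Both routes can be seen side by side in a small sketch. It assumes (for illustration only) data drawn from a Gaussian with unknown mean and known unit variance, and estimates the mean first by repeated gradient steps and then by the closed form obtained from setting the gradient to zero.

```python
import random
from statistics import fmean

random.seed(1)
# Assumed setup: Gaussian data with unknown mean and known variance 1.
data = [random.gauss(3.0, 1.0) for _ in range(500)]

def neg_log_likelihood_grad(mu, xs):
    """Gradient of the average negative log-likelihood with respect to mu.
    For unit variance, NLL(mu) = mean((x - mu)^2) / 2 + constant."""
    return -fmean(x - mu for x in xs)

# Iterative improvement: steps down the gradient from a poor first guess.
mu = 0.0
learning_rate = 0.1
for _ in range(200):
    mu -= learning_rate * neg_log_likelihood_grad(mu, data)

# Closed form: setting the gradient to zero gives the sample mean.
mu_closed_form = fmean(data)
print(mu, mu_closed_form)
```

The two estimates agree to many decimal places here because this loss is a simple quadratic; for most models in the course only the iterative route is available.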

We could say at the end of all this that we have learned the best parameters for our model. If we take the algorithm that does the tuning and the estimation together, then the machine itself did the learning; we weren’t needed at all except to choose the right libraries and format.

In this sense, the rest of the course is looking at methods of Machine Learning that, in much more complex ways, still always estimate some parameters for a model.

Unsupervised Learning


Improving model robustness and power with unsupervised pre-training


Anomaly Detection


Using Classification for Anomaly Detection

  • Any effective classification model will provide correct labels or predictions for new data.
  • So if a classification model has been trained to predict one of the available features, it could be used to score datapoints as normal or abnormal.
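A minimal sketch of this idea, using a hypothetical sensor log and a deliberately simple nearest-class-mean classifier (both are assumptions for illustration). The classifier learns to predict one feature, the heater state, from another, the temperature. A new datapoint is flagged when its recorded state disagrees with the prediction.

```python
from statistics import fmean

# Hypothetical sensor log: (room temperature in C, heater state).
# Normally the heater is "on" when it is cold and "off" when it is warm.
history = [(15.0, "on"), (16.5, "on"), (17.0, "on"), (18.0, "on"),
           (22.0, "off"), (23.5, "off"), (24.0, "off"), (25.0, "off")]

def train(rows):
    """Nearest-class-mean classifier: mean temperature for each state."""
    return {state: fmean(t for t, s in rows if s == state)
            for state in {s for _, s in rows}}

def predict(model, temperature):
    """Predict the state whose mean temperature is closest."""
    return min(model, key=lambda state: abs(model[state] - temperature))

def is_anomalous(model, row):
    """Flag a datapoint whose recorded feature disagrees with the prediction."""
    temperature, recorded_state = row
    return predict(model, temperature) != recorded_state

model = train(history)
new_rows = [(16.0, "on"), (24.5, "off"), (26.0, "on")]
flags = [is_anomalous(model, row) for row in new_rows]
print(flags)  # [False, False, True]
```

With a probabilistic classifier the same idea gives a graded score: the lower the predicted probability of the recorded value, the more abnormal the datapoint.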

Using Clustering for Anomaly Detection

  • Consider how DBSCAN could be used for Anomaly Detection
  • k-means is a common, and very scalable, solution for grouping datapoints together and then finding some patterns to use to argue for an anomaly
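One simple pattern of this kind is the distance from a datapoint to its nearest cluster centroid. The sketch below is a plain k-means written for illustration on a tiny hand-made 2-D dataset. The farthest-point initialisation and the choice of score are assumptions of this example, not a standard library API.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means with a simple farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        centroids = [tuple(sum(axis) / len(members) for axis in zip(*members))
                     if members else centroids[i]
                     for i, members in enumerate(clusters)]
    return centroids

def anomaly_score(point, centroids):
    """Distance to the nearest centroid: large means far from every group."""
    return min(math.dist(point, c) for c in centroids)

points = [(0, 0), (1, 0), (0, 1), (1, 1),          # one tight group
          (10, 10), (11, 10), (10, 11), (11, 11),  # another tight group
          (5, 4)]                                  # a point in neither group
centroids = kmeans(points, k=2)
most_anomalous = max(points, key=lambda p: anomaly_score(p, centroids))
print(most_anomalous)  # (5, 4)
```

Note that the outlier still pulls its assigned centroid towards itself, which is one reason density-based methods that label such points as noise are worth considering instead.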

Dedicated Anomaly Detection Algorithms

  • Local Outlier Factor (LOF)
  • One-Class SVM
  • Oversampling Principal Component Analysis (osPCA)
  • Isolation Forest
  • iMondrian Forest

Local Outlier Factor (LOF)

One-Class SVM

References and Further Reading

This is a technical, but critical part of the success of Deep Learning.

Notes on Particular Algorithms

Gradient Tree Boosting


A More General View of Ensembles

Now that we know about

A More General View of Ensembles

People realized that the very successful Boosting method was in essence a very general meta-algorithm for optimization of the mapping function from input variables to output target variables.

This algorithm chooses multiple weak functions that are combined together, just as the ensemble of decision trees is in Random Forests.

What is the Gradient Though?

  • One can imagine that this combined function can have a gradient
  • In this case this is the infinitesimal increase in each of the function parameters that would strengthen the current response.

We’ve already used them

  • In an ensemble of decision trees these parameters are all of the split points in each tree for each data dimension.
  • In Random Forests the gradient is not used
  • In AdaBoost it is used implicitly in a very simple way
    • each new decision tree weak learner
    • is optimized relative to the negative of this gradient
    • since it tries to do well on what the existing model does badly on.

Doing Better

This idea can then be generalized so that each new weak learner is explicitly treated as a function that points directly away from the gradient of the current combined function.

Gradient Tree Boosting

Given some tree based ensemble model then, represented as a function

$$T_i(X)\rightarrow Y$$

  • after adding $i$ weak learners already we find that the “perfect” function for the $(i+1)^{th}$ weak learner would be

    $$h(x)=Y - T_i(x)$$

  • this fills in the gap of what the existing models got wrong.
    • This is because then the new combined model perfectly matches the training data:

      $$T_{(i+1)}(x) = T_i(x) + h(x) = Y$$

  • In practice we need to be satisfied with merely approaching this perfect update by following a functional gradient descent approach, where at each step we fit the new tree to an approximation of the true residual, as measured by a loss function.

  • In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree

$$L(Y, T(X)) = \sum_i \left(Y-T_i(X)\right)$$

Gradient Tree Boosting explicitly uses the gradient $\nabla L(Y,T_i(X))$ of the loss function of each tree to fit a new tree

$$h(X)= T_i(X) - \sum_i \nabla_{T_i}L(Y,T_i(X))$$

and add it to the ensemble.

There is also further optimization of weighting functions for each tree and various regularization methods.

This algorithm is implemented in the popular XGBoost package.
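The core loop can be sketched in a few lines. This is a simplified illustration, not how XGBoost is implemented. It assumes a squared-error loss (for which the negative gradient is exactly the residual $Y - T_i(x)$), one-dimensional inputs, single-split "stumps" as the weak learners and a fixed learning rate; all function names are made up for this example.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump (a very weak tree) for 1-D inputs."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value

def boost(xs, ys, rounds=50, learning_rate=0.5):
    """Each round fits a new stump to the residuals of the current ensemble,
    i.e. to the negative gradient of the squared-error loss."""
    trees = []

    def predict(x):
        return sum(learning_rate * tree(x) for tree in trees)

    mse_history = []
    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        mse_history.append(sum(r * r for r in residuals) / len(residuals))
        trees.append(fit_stump(xs, residuals))
    return predict, mse_history

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]          # toy target function, chosen for illustration
model, mse_history = boost(xs, ys)
print(mse_history[0], mse_history[-1])
```

The training error recorded before each round never increases, because every new stump is the least-squares fit to what the current ensemble still gets wrong.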

Where is the next Frontier of AI/ML?

No one can predict the future and, looking at history, AI/ML researchers seem to be particularly bad at it. Still, there are current areas and trends taking up a lot of attention. Rather than indicating what will be big in five years, this more likely means that in a few years there will be solid, or at least well accepted, approaches or even solutions to these problems.

New Advances in Theory

How Does Learning Really Work?


“An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”.

The main point of this paper is to empirically compare the use of recurrent and convolutional approaches to modelling sequential data.

They tried this because some people have found that Convolutional architectures sometimes perform better at audio synthesis and translation. But why is that and is it true in general?

Quotes and Further Reading

While being highly empirical and using known approaches, it opens the door to uncovering new ones since it proves that the one that is usually regarded as optimal is in fact not

New - Protein Folding!

What’s New?

Explainability and Interpretability


Transfer Learning Fundamentals


Transfer Learning Definition

Transfer Learning: Attempting to improve performance on a learning task B by using a neural network that is pre-trained on some other task A.

  • each task, A and B, could correspond to classification, prediction, summarization, etc on a particular dataset
  • The idea mostly arose out of Image Classification for CNNs, but it can be applied anywhere

  • Obviously if the two datasets are similar then you would expect transfer learning to work quite well
    • essentially, the pre-training on A is just additional training data you don’t need to spend time on

A Pleasant Surprise

What is surprising is that even if A and B are quite different this is still often a useful approach to take.

We think this is because:

  • The underlying structure of the world is persistent
  • So lessons learned from one problem carry over to other problems in the same domain.


  1. First, train an object recognition model on a large image dataset involving people, cars, fruit, bike, trees, buildings
  2. Then use that pretrained CNN model, but fine-tune the classification layers to identify cats and dogs in the world
    • Now, most of the features automatically learned by the CNN are still relevant (edge detection, texture, colour gradients,…)
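The two steps above can be sketched without any deep learning framework. In this toy version the "pre-trained network" is a fixed linear feature layer with hypothetical weights, and only a new logistic-regression head is trained on task B. A real setup would instead load a pre-trained CNN and mark its feature layers as non-trainable; everything named here is an assumption for illustration.

```python
import math
import random

random.seed(0)

# Stand-in for a network pre-trained on task A: a fixed linear feature layer.
# The weights are hypothetical; in practice they come from real pre-training.
PRETRAINED_WEIGHTS = [[1.0, 1.0], [1.0, -1.0]]

def frozen_features(x):
    """Frozen layers: used as-is, never updated during fine-tuning."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in PRETRAINED_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fine_tune_head(data, epochs=50, learning_rate=0.1):
    """Train only the new classification head (logistic regression) on task B."""
    head, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            f = frozen_features(x)
            error = sigmoid(sum(h * fi for h, fi in zip(head, f)) + bias) - y
            head = [h - learning_rate * error * fi for h, fi in zip(head, f)]
            bias -= learning_rate * error
    return head, bias

def predict(head, bias, x):
    f = frozen_features(x)
    return int(sum(h * fi for h, fi in zip(head, f)) + bias > 0)

# Toy task B: the label depends on x0 + x1, which the frozen layer already computes.
points = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
task_b = [(x, int(x[0] + x[1] > 0)) for x in points]
head, bias = fine_tune_head(task_b)
accuracy = sum(predict(head, bias, x) == y for x, y in task_b) / len(task_b)
print(accuracy)
```

Only `head` and `bias` change during training. The features learned elsewhere are reused untouched, which is why so little task B data and compute are needed.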

Go to and try it out for yourself!

  • pretraining: MobileNet V2 using 1.4M images and 1000 classes


  • Pre-training is really just training on a task.

  • It is usually supervised training and commonly on image processing or text classification tasks.

  • If you’re lucky, then someone has already pre-trained a model on a data domain you wish to use

    • photographic images
    • brain scans
    • natural language text documents
    • pedestrians
    • human faces
  • In this case, you don’t need to do the training at all!

Fine Tuning

  • Question: Is a neural network restricted to being trained once only, on a batch of data and then frozen for all time?
  • Answer:

A Standard CNN for Image Classification

(Input + 3x ConvWithScaling + 3x FullyConnected + 1x Softmax)

What part can be kept frozen after pre-training and what part can we fine tune?

When to Use Transfer Learning

  • If the data domain is very large and you do not have time to train a model from scratch
    • Corollary : this only helps if there already is a pre-trained model for your domain of interest
  • You have reason to believe you can “learn the fundamentals” of the domain separately and then train on specific tasks afterwards
    • This is clearly reasonable in natural images and language where the shared structure common to all data is the truly complex part

When not to use Transfer Learning?

  • Imagine a structurally simple set of sensor readings, just a set of numbers coming from heat, light, energy sensors.
  • There is no known prior data for this domain, nor any sense that there is a common pattern that could be learnt elsewhere
  • You may just need to train from scratch, as a supervised learning problem to solve your task.


Given what we know about Human Learning, it is very likely that some kind of transfer learning is always a good idea.

  • In other words, starting from scratch may not be worth it.
  • BUT whether you can carry out transfer learning in your domain depends on the data, compute power and many other factors.

Relation to Inception and ResNet

  • these models are often used as the “pre-trained” model that people do transfer learning with
  • A very, very, very common type of paper a few years ago was

    “We take ResNet-50 (or ResNet-16) trained on ImageNet and fine-tune the last X (3-10) layers for our (FANCY IMAGE CLASSIFICATION) task and demonstrate SotA performance.” - PublishMePlease

Relation to NLP

(see #methodTransferLearningNLP)

  • In fact, we’ve seen transfer learning already, when we talked about Document Classification
  • GloVe and Word2Vec do this by training on large document corpora and then making the models available for others to use.
  • GPT and BERT are following in the footsteps of this

Bring it All Together

An exciting unifying paper in 2018 from OpenAI brings many of these threads together for language understanding.

They use the idea of Transfer Learning and merge it with two other recent, related advances:


Transformer Networks

Transfer Learning in NLP


New or Updated ML NLP Methods in 2020


See this good summary of ML Advances in 2020.

Some Quotes and Further Reading

These models have been described as the “Imagenet moment for NLP” since they show the practicality of transfer learning in the language domain by providing ready-to-use pre-trained and general models that can be also fine-tuned for specific tasks.

Duplex dialog system (google)


Transformer Networks



You are allowed to have fun…sometimes.

For everyone who took ECE606 last term

XKCD comic number 2407, a very funny comparison of standard tree search algorithms, such as depth-first and breadth-first, as well as some lesser known ones: Brepth-first, Deadth-first and Bread-First Search, which skips the tree entirely and jumps directly to a loaf of bread.