ECE 657A - Data and Knowledge Modelling and Analysis

A graduate course in the
Electrical and Computer Engineering Department at
The University of Waterloo
taught by
Prof. Mark Crowley

NOTE: These notes are dynamic and will be updated over the term. They were initially created for the Winter 2020 term, so some content may be out of date.

Course Links:

Course Textbook(s)

The course has no required textbook itself. There are a number of useful resources listed on the course website and here. The pace and nature of how the field changes these days makes it very hard for any physical textbook to simultaneously cover the fundamentals as well as the latest relevant trends and advances that are required for a course like this. Meanwhile, the web is full of blogs, info-sites, corporate demonstration pages and framework documentation sites that provide fantastic descriptions of all of the concepts in this course, with the latest technology and approaches. Finding the best ones is hard, of course, so on this humble gingko tree I will make my best attempt to curate a list of resources relevant to the topic of this course as I come across them in my own never-ending mad rush to stay up to date.


Universal References

One thing we sometimes think we want, is a universal solution to a problem.

Notes on Fundamentals

History of AI/ML

The histories of Artificial Intelligence and Machine Learning are tightly intertwined, but there are as many different perspectives on the important moments as there are researchers and interested parties.

The Past

Data Types


In summary, nominal variables are used to “name,” or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.
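To make the four scales concrete, here is a small Python sketch of which summary operations are meaningful on each one. The example variables (colours, a satisfaction survey, Celsius temperatures, masses) are illustrations assumed for this sketch, not data from the course.

```python
from statistics import mode

# Nominal: values are just labels, so counting (and the mode) is meaningful.
colours = ["red", "blue", "red", "green"]
most_common_colour = mode(colours)

# Ordinal: the order is meaningful, but the gaps between levels are not.
# We encode the order explicitly and take a median by rank.
SATISFACTION_ORDER = ["poor", "fair", "good", "excellent"]
responses = ["good", "poor", "excellent", "good", "fair"]
ranks = sorted(SATISFACTION_ORDER.index(r) for r in responses)
median_response = SATISFACTION_ORDER[ranks[len(ranks) // 2]]

# Interval: differences are meaningful, ratios are not (no true zero).
temps_celsius = [10.0, 20.0]
difference = temps_celsius[1] - temps_celsius[0]
celsius_ratio = temps_celsius[1] / temps_celsius[0]  # 2.0, but not physically meaningful
kelvin_ratio = (temps_celsius[1] + 273.15) / (temps_celsius[0] + 273.15)  # about 1.035

# Ratio: a true zero exists, so "twice as heavy" is a valid statement.
masses_kg = [2.0, 4.0]
mass_ratio = masses_kg[1] / masses_kg[0]

print(most_common_colour, median_response, difference, mass_ratio)
```

Note how the Celsius "ratio" changes completely when the same temperatures are expressed in Kelvin, which is exactly why ratios need a true zero.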

Probability and Statistics Fundamentals Review

Notes From My Other Courses

ECE 108 YouTube Videos

For a very fundamental view of probability from another of Prof. Crowley’s courses, you can view the lectures and tutorials for ECE 108.

ECE 108 YouTube (look at “future lectures” and “future tutorials” for S20):

The last few lectures and tutorials are on probability definitions as seen from the perspective of discrete math and set theory.

Probability Intro Markdown Notes

From the course website for last year for ECE 493 T25 “Reinforcement Learning”.

Some of this we won’t need so much but they are all useful to know for Machine Learning methods in general.


On Entropy and Security

Some fun thoughts that tie information entropy, random search, sampling and security to the never-ending challenge of picking a new password.

Remembering Complex Passwords

Using dice to pick your password:
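As a rough illustration of the entropy argument, the sketch below assumes a Diceware-style scheme in which each word is chosen by rolling five six-sided dice, so the word list has 6^5 = 7776 entries. The function names are hypothetical, and no actual word list is included; the rolled digits would be used as a lookup key into whichever list you choose.

```python
import math
import secrets

DICE_PER_WORD = 5  # assumption: each word is indexed by five six-sided dice


def roll_word_key(dice_per_word=DICE_PER_WORD):
    """Simulate the dice rolls for one word and return them as a key such as '41326'."""
    return "".join(str(secrets.randbelow(6) + 1) for _ in range(dice_per_word))


def passphrase_entropy_bits(num_words, dice_per_word=DICE_PER_WORD):
    """Entropy in bits when each word is drawn uniformly and independently."""
    list_size = 6 ** dice_per_word
    return num_words * math.log2(list_size)


keys = [roll_word_key() for _ in range(6)]
print("Dice keys to look up in a word list:", keys)
print("Entropy of a 6-word passphrase (bits):", round(passphrase_entropy_bits(6), 1))
```

Each word contributes log2(7776) bits, just under 13, so the entropy grows linearly with the number of words while the passphrase stays memorable.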

References and Further Reading

Likelihood, Loss and Risk

A good article summarizing how likelihood, loss functions, risk, KL divergence, MLE and MAP are all connected.

Online Pre-recorded Lecture

Experimental Methodology

Ablation Studies

Once you have a trained model that gives you some kind of response, how do you figure out why it is working?


Ablation - the removal, especially of organs, abnormal growths, or harmful substances, from the body by mechanical means, as by surgery. —

Definition from [Fawcett and Hoos, 2013]:

Our use of the term ablation follows that of Aghaeepour and Hoos (2013) and loosely echoes its meaning in medicine, where it refers to the surgical removal of organs, organ parts or tissues. We ablate (i.e., remove) changes in the settings of algorithm parameters to better understand the contribution of those changes to observed differences in algorithm performance.

Motivation for Ablation Studies

As one person puts it (see this twitter thread by @fchollet)

Automated Parameter Tuning

These tools and other algorithm configuration tools help to set the many complex parameters needed to achieve optimal, or at least maximal, performance.
But they spit out the parameters without any explanation.

So in [Fawcett and Hoos, 2013 and 2016] they propose ways to:

help these algorithm developers answer questions about the high-quality configurations produced by these tools, specifically about which parameter changes contribute most to improved performance.
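The idea of asking which parameter changes contribute most can be sketched as a greedy walk from a default configuration to a tuned one, flipping one parameter at a time and always taking the flip that helps most. This is only a loose sketch in the spirit of the quote above, not the authors' implementation; the configurations and the `toy_score` function are hypothetical stand-ins for a real algorithm and benchmark.

```python
def greedy_ablation_path(source, target, evaluate):
    """Move from source to target one parameter at a time, best improvement first.

    Returns a list of (parameter_name, change_in_score) in the order chosen.
    """
    current = dict(source)
    remaining = [name for name in target if target[name] != source[name]]
    score = evaluate(current)
    path = []
    while remaining:
        candidates = []
        for name in remaining:
            trial = dict(current)
            trial[name] = target[name]
            candidates.append((evaluate(trial), name))
        best_score, best_name = max(candidates)
        path.append((best_name, best_score - score))
        current[best_name] = target[best_name]
        score = best_score
        remaining.remove(best_name)
    return path


def toy_score(config):
    """Hypothetical performance measure: higher is better."""
    score = 0.5
    if config["learning_rate"] == 0.01:
        score += 0.3
    if config["depth"] == 8:
        score += 0.05
    return score  # dropout has no effect in this toy example


default_config = {"learning_rate": 0.1, "depth": 4, "dropout": 0.0}
tuned_config = {"learning_rate": 0.01, "depth": 8, "dropout": 0.5}

for name, gain in greedy_ablation_path(default_config, tuned_config, toy_score):
    print(f"{name}: {gain:+.2f}")
```

In this toy setup almost all of the improvement comes from a single parameter, which is the kind of explanation a tuner alone does not give you.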

What Ablation Is and Isn’t

It Is…

It Is Not…

An Example

A nice example explained here:

As an example, Girshick and colleagues (2014) describe an object detection system that consists of three “modules”: The first proposes regions of an image within which to search for an object using the Selective Search algorithm (Uijlings and colleagues 2012), which feeds in to a large convolutional neural network (with 5 convolutional layers and 2 fully connected layers) that performs feature extraction, which in turn feeds into a set of support vector machines for classification. In order to better understand the system, the authors performed an ablation study where different parts of the system were removed - for instance removing one or both of the fully connected layers of the CNN resulted in surprisingly little performance loss, which allowed the authors to conclude

They use

What they found:

They would only have found this through ablation experiments.

References and Further Reading


This question and answer on StackExchange provide a great account of the recent history of the term in machine learning, and links to further reading:

A twitter thread by François Chollet (@fchollet) that was part of the “recent” surge in popularity of Ablation Studies in Machine Learning.

Ablation studies are crucial for deep learning research — can’t stress this enough.

Understanding causality in your system is the most straightforward way to generate reliable knowledge (the goal of any research). And ablation is a very low-effort way to look into causality.

If you take any complicated deep learning experimental setup, chances are you can remove a few modules (or replace some trained features with random ones) with no loss of performance. Get rid of the noise in the research process: do ablation studies. (Source:

Can’t fully understand your system? Many moving parts? Want to make sure the reason it’s working is really related to your hypothesis? Try removing stuff. Spend at least ~10% of your experimentation time on an honest effort to disprove your thesis.

See the whole twitter thread here.


Aside - Fundamentals vs. The Bleeding Edge

The fundamentals of data cleaning and preparation, the types of different data, how to normalize it, how to extract features for different purposes, how to visualize and ask the right questions of our data; these are all critical skills. No number of fancy software frameworks will help you get useful results if you don’t have these skills in the first place.

At the same time, some algorithms and methodologies have become essentially irrelevant because better ones have been discovered. So there is no point learning how to use them in detail if they will never be used in industry or even cutting-edge research. But when that happens in under 10 years, it’s very hard for a textbook to remain relevant.

Some people would disagree with this, and in a sense, from a research point of view, every method that was useful at one stage is still worthy of study, if only to understand how solutions can be found without fully understanding all of the tradeoffs. For example, SIFT features are incredibly powerful summarizations of context in images and revolutionized image recognition tasks before CNNs had been fully developed. Now they are simply one type of feature that a CNN could learn directly from data.

Notes on Data Analysis and Machine Learning Concepts

Parameter Estimation

#methodMLE #methodMAP #methodEM #methodNaiveBayes

What is Parameter Estimation?

Parameter estimation is literally the task of guessing the parameters to a function. We can do this through iterative improvement, checking how well our settings work, just as we do when setting the right angle for the tap in the shower.

Or if we think we know enough about the distribution of the data we can get fancy and do some calculus on an approximation of that distribution (MAP, MLE, EM).

At the most fundamental level though, we are taking the data we have, using it to build an estimator, and testing how well it works. Often, we’ll improve it through multiple scans of the data, or steps down a gradient, as we do implicitly by setting the gradient to zero in MLE.

We could say at the end of all this that we have learned the best parameters for our model. If we take the algorithm that does the tuning and the estimation together, then the machine itself did the learning; we weren’t needed at all except to choose the right libraries and format.

In this sense, the rest of the course is looking at methods of Machine Learning that, in much more complex ways, still always estimate some parameters for a model.
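The two routes described above, "setting the gradient to zero" and "steps down a gradient", can be compared on the simplest possible case. The sketch below assumes Gaussian data with known unit variance, so the only parameter is the mean; the data, learning rate and step count are hypothetical choices for illustration.

```python
import random


def mle_mean_closed_form(data):
    """Setting the gradient of the Gaussian log-likelihood to zero gives the sample mean."""
    return sum(data) / len(data)


def mle_mean_gradient_ascent(data, learning_rate=0.1, steps=200, start=0.0):
    """Iteratively climb the (averaged) log-likelihood instead of solving for zero."""
    mu = start
    n = len(data)
    for _ in range(steps):
        # d/d(mu) of the average log-likelihood with unit variance is mean(x - mu)
        gradient = sum(x - mu for x in data) / n
        mu += learning_rate * gradient
    return mu


random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(500)]

closed_form = mle_mean_closed_form(data)
iterative = mle_mean_gradient_ascent(data)
print("closed form:", closed_form)
print("gradient ascent:", iterative)
```

Both routes land on the same estimate; the iterative one is the version that still works when no closed form exists, which is the situation for most of the models later in the course.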

Unsupervised Learning


Improving model robustness and power with unsupervised pre-training


Anomaly Detection


Using Classification for Anomaly Detection

Using Clustering for Anomaly Detection

Dedicated Anomaly Detection Algorithms

Local Outlier Factor (LOF)

References and Further Reading

One-Class SVM

References and Further Reading

This is a technical, but critical part of the success of Deep Learning.

Adam Optimizer

Now seen as basically the default gradient descent approach to updating weights during #backpropagation.
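As a rough sketch of what the Adam update does, the code below keeps running averages of the gradient and of its square, corrects their startup bias, and scales each step accordingly. The `beta1`, `beta2` and `eps` defaults are the commonly quoted ones; the learning rate, step count and the quadratic test function are hypothetical choices for this example.

```python
import math


def adam_minimize(grad_fn, theta, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a function given its gradient, using the Adam update rule."""
    theta = list(theta)
    m = [0.0] * len(theta)  # running average of gradients (first moment)
    v = [0.0] * len(theta)  # running average of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction for the zero start
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta


def quadratic_gradient(theta):
    """Gradient of (x - 3)^2 + (y + 1)^2, whose minimum is at (3, -1)."""
    x, y = theta
    return [2 * (x - 3), 2 * (y + 1)]


solution = adam_minimize(quadratic_gradient, [0.0, 0.0])
print("Adam solution:", solution)
```

In a neural network the same update is applied to every weight, with the gradients coming from backpropagation rather than from a hand-written function.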

Notes on Particular Algorithms

Gradient Tree Boosting


A More General View of Ensembles

Now that we know about

People realized that the very successful Boosting method was in essence a very general meta-algorithm for optimization of the mapping function from input variables to output target variables.

This algorithm chooses multiple weak functions that are combined together, just as the ensemble of decision trees is for Random Forests.

What is the Gradient Though?

We’ve already used them

Doing Better

This idea can then be generalized so that each new weak learner is explicitly treated as a function that points directly away from the gradient of the current combined function.

Gradient Tree Boosting

Given some tree-based ensemble model, represented as a function

$$T_i(X)\rightarrow Y$$


$$L(Y, T(X)) = \sum_i \left(Y - T_i(X)\right)$$

Gradient Tree Boosting explicitly uses the gradient $\nabla L(Y,T_i(X))$ of the loss function of each tree to fit a new tree

$$h(X) = T_i(X) - \sum_i \nabla_{T_i} L(Y,T_i(X))$$

and add it to the ensemble.

There is also further optimization of weighting functions for each tree and various regularization methods.

This algorithm is implemented in the popular XGBoost package.
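To see the mechanics, here is a minimal sketch of gradient boosting under some loudly stated assumptions: the loss is squared error (so the negative gradient is simply the residual), the weak learners are one-split regression stumps on a single input variable, and a fixed learning rate shrinks each tree's contribution. It is not how XGBoost is implemented; all names and data here are hypothetical.

```python
import math


def fit_stump(xs, residuals):
    """Fit the best single-split regression stump (least squares) to the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # exclude the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Build an additive model: constant start plus a sum of shrunken stumps."""
    base = sum(ys) / len(ys)
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        # For the loss 0.5 * (y - F(x))^2 the negative gradient w.r.t. F(x) is y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        predictions = [p + learning_rate * tree(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def mean_squared_error(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(50)]
ys = [math.sin(x) for x in xs]

constant_model = gradient_boost(xs, ys, n_trees=0)
boosted_model = gradient_boost(xs, ys, n_trees=50)
print("MSE with no trees:", mean_squared_error(constant_model, xs, ys))
print("MSE with 50 trees:", mean_squared_error(boosted_model, xs, ys))
```

Each new stump is fit to what the current ensemble still gets wrong, so the training error can only go down as trees are added; the learning rate trades speed of fitting against overfitting.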

Where is the next Frontier of AI/ML?

No one can predict the future, and history suggests that AI/ML researchers are particularly bad at it. Still, there are current areas and trends taking up a lot of attention. Rather than being what will be big in five years, this more likely means that in a few years there will be solid, or at least well-accepted, approaches or even solutions to these problems.

New Advances in Theory

How Does Learning Really Work?


“An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling”.

The main point of this paper is to empirically compare the use of recurrent and convolutional approaches to modelling sequential data.

They tried this because some people have found that Convolutional architectures sometimes perform better at audio synthesis and translation. But why is that and is it true in general?

Quotes and Further Reading

While being highly empirical and using known approaches, it opens the door to uncovering new ones since it proves that the one that is usually regarded as optimal is in fact not

What’s New? (as of 2020)

Ongoing Hot Topics

Generative AIs

Explainability and Interpretability


Natural Language Processing






Transfer Learning


Transfer Learning Fundamentals


Transfer Learning Definition

Transfer Learning: Attempting to improve performance on a learning task B by using a neural network that is pre-trained on some other task A.


A Pleasant Surprise

What is surprising is that even if A and B are quite different this is still often a useful approach to take.

We think this is because:


  1. First, train an object recognition model on a large image dataset involving people, cars, fruit, bikes, trees, buildings
  2. Then use that pretrained CNN model but fine-tune the classification layers to train it to identify cats and dogs in the world
    • Now, most of the features automatically learned by the CNN are still relevant (edge detection, texture, colour gradients,…)
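The two steps above can be mimicked in a toy setting. In the sketch below, `frozen_features` is a hypothetical stand-in for the pretrained convolutional layers from task A (its weights are hard-coded and never updated), and only a small logistic "classification head" is trained on the new task B. The data, weights and hyperparameters are all assumptions for illustration; a real example would load an actual pretrained CNN.

```python
import math
import random

# Hypothetical "pretrained" weights from task A. They stay frozen during fine-tuning.
PRETRAINED_WEIGHTS = ((0.8, -0.3), (0.2, 0.9))


def frozen_features(x):
    """Stand-in for pretrained layers: a fixed nonlinear feature extractor."""
    return [math.tanh(w[0] * x[0] + w[1] * x[1]) for w in PRETRAINED_WEIGHTS]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(examples, epochs=200, learning_rate=0.5):
    """Train only the new classification head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in examples:
            f = frozen_features(x)
            error = sigmoid(w[0] * f[0] + w[1] * f[1] + b) - y
            w[0] -= learning_rate * error * f[0]
            w[1] -= learning_rate * error * f[1]
            b -= learning_rate * error
    return w, b


def predict(head, x):
    w, b = head
    f = frozen_features(x)
    return 1 if sigmoid(w[0] * f[0] + w[1] * f[1] + b) >= 0.5 else 0


# Task B (hypothetical): label a point by the sign of x0 + x1.
random.seed(0)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
task_b = [(p, 1 if p[0] + p[1] > 0 else 0) for p in points]

head = train_head(task_b)
accuracy = sum(predict(head, x) == y for x, y in task_b) / len(task_b)
print("Task B training accuracy with frozen features:", accuracy)
```

Only three numbers are learned for task B, yet the frozen features are general enough to do most of the work, which is the same reason the edge and texture detectors of a pretrained CNN remain useful for cats and dogs.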

Go to and try it out for yourself!


Fine Tuning

A Standard CNN for Image Classification

(Input + 3x ConvWithScaling + 3x FullyConnected + 1x Softmax)

What part can be kept frozen after pre-training and what part can we fine tune?

When to Use Transfer Learning

When not to use Transfer Learning?


Given what we know about Human Learning, it is very likely that some kind of transfer learning is always a good idea.

Relation to Inception and ResNet

Relation to NLP

(see #methodTransferLearningNLP)

Bring it All Together

An exciting unifying paper in 2018 from OpenAI brings many of these threads together for language understanding.

They use the idea of Transfer Learning and merge it with two other recent, related advances:


Transformer Networks

Transfer Learning in NLP


New or Updated ML NLP Methods in 2020


See this good summary of ML Advances in 2020.

Some Quotes and Further Reading

These models have been described as the “Imagenet moment for NLP” since they show the practicality of transfer learning in the language domain by providing ready-to-use pre-trained and general models that can be also fine-tuned for specific tasks.

Duplex dialog system (Google)


Smart Compose (Google)




Transformer Networks


BERT (Google)


Protein Folding!


You are allowed to have fun…sometimes.

For everyone who took ECE606 last term

XKCD comic number 2407, a very funny comparison of standard tree search algorithms, such as depth-first and breadth-first, as well as some lesser-known ones: Brepth-first, Deadth-first and Bread-First Search, which skips the tree entirely and jumps directly to a loaf of bread.
