Your browser doesn't support the features required by impress.js, so you are presented with a simplified version of this presentation.

For the best experience please use the latest Chrome, Safari or Firefox browser.

Data Types

Data Types

In summary, nominal variables are used to “name,” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values + the ability to quantify the difference between each one. Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.

Gradient Tree Boosting

A More General View of Ensembles

Now that we have know about

A More General View of Ensembles

People realized that the very successful Boosting method was in essence
Boosting = a very general meta-algorithm for optimization of the mapping function from input variables to output target variables.

This algorithm chooses multiple weak functions that are combined together, just as the ensemble of decision trees are for Random Forests.

What is the Gradient Though?

  • One can imagine that this combined function can have a gradient
  • In this case this is the infinistesimal increase in each of the function parameters that would strengthen the current response.

We’ve already used them

  • In an ensemble of decision trees these parameters are all of the split points in each for each data dimension.
  • In Random Forests gradient is not used
  • In AdaBoost it is used implicitly in a very simple way
    • each new decision tree weak learner
    • is optimized relative to the negative of this gradient
    • since it tries to do well on what the existing model does badly on.

Doing Better

This idea can then be generalized so that each new weak learner is explicitely treated as a function that points directly away from the gradient of the current combined function.

Gradient Tree Boosting

Given some tree based ensemble model then, represented as a function

$$T_i(X)\rightarrow Y$$

  • after adding $i$ weak learners already we find that the “perfect” function for the $i+1^{th}$ weak learner would be

    $$h(x)=T_i(x) - Y$$

  • this fills in the gap of what the existing models got wrong.
    • This is because then the new combined model perfectly matches the training data:

      $$T_{(i+1)}(x) = T_i(x) + h(x) = Y$$

Gradient Tree Boosting

  • In practice we need to be satisfied with merely approaching this perfect update by fitting a functional gradient descent approach where we use an approximation of the true residual (also called the loss function) each step.

  • In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree

$$L(Y, T(X)) = \sum_i Y-T_i(X)$$

Gradient Tree Boosting

Gradient Tree Boosting explicitly uses the gradient

$$\nabla L(Y,Ti(X))=[ \nabla{w_i} L(Y,T^{w_i}_i (X))]$$

of the loss function of each tree to fit a new tree

$$h(X)= Ti(X) - \sum_i \nabla{T_i}L(Y,T_i(X))$$

and then add it to the ensemble.

There is also further optimization of weighting functions for each tree and various regularization methods which can be done.

The popular algorithm XGBoost\cite{xgboost} implements approach.