• Data Types

In summary, nominal variables are used to “name” or label a series of values. Ordinal scales provide information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.

https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
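The four scales above can be summarized as a nested hierarchy of meaningful operations. A minimal sketch (the names and examples are illustrative, not from the article):

```python
# Which comparisons are meaningful at each level of measurement.
# Each scale supports everything the one before it does, plus one more.
SCALES = {
    "nominal":  {"equality"},                                  # e.g. blood type
    "ordinal":  {"equality", "order"},                         # e.g. satisfaction rating
    "interval": {"equality", "order", "difference"},           # e.g. temperature in Celsius
    "ratio":    {"equality", "order", "difference", "ratio"},  # e.g. weight (true zero)
}

def valid_operations(scale):
    """Return the set of meaningful comparisons for a measurement scale."""
    return SCALES[scale]
```

For example, `valid_operations("interval")` excludes `"ratio"`: 20 °C is not “twice as hot” as 10 °C, because Celsius has no true zero.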

• A More General View of Ensembles

Now that we know about
• Decision Trees
• Boosting
• Random Forests
we are ready to learn about a powerful combination of all these concepts with gradient search: Gradient Tree Boosting.

• A More General View of Ensembles

People realized that the very successful Boosting method was in essence:
Boosting = a very general meta-algorithm for optimizing the mapping function from input variables to output target variables.

This algorithm chooses multiple weak functions that are combined together, just as an ensemble of decision trees is combined in Random Forests.
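The “combined function” idea can be sketched in a few lines: the ensemble’s prediction is just the sum of the outputs of its weak functions. The one-line lambdas below stand in for real weak learners and are purely illustrative:

```python
def ensemble_predict(weak_learners, x):
    """Sum the outputs of all weak learners for input x."""
    return sum(h(x) for h in weak_learners)

# Three hypothetical weak functions whose contributions add up
# to the desired mapping x -> x.
learners = [lambda x: 0.5 * x, lambda x: 0.25 * x, lambda x: 0.25 * x]
```

Each learner on its own is a poor model of the target, but their sum recovers it exactly; boosting is about how to choose such functions one at a time.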

• What is the Gradient Though?

• One can imagine that this combined function can have a gradient
• In this case the gradient is the infinitesimal increase in each of the function parameters that would strengthen the current response.

• We've already used them

• In an ensemble of decision trees these parameters are all of the split points for each data dimension.
• In Random Forests the gradient is not used.
• In AdaBoost it is used implicitly in a very simple way:
  • each new decision tree weak learner
  • is optimized relative to the negative of this gradient,
  • since it tries to do well on what the existing model does badly on.
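AdaBoost’s implicit use of the gradient shows up as example re-weighting: points the current model gets wrong are up-weighted, so the next weak learner concentrates on them. A hedged sketch of the classic re-weighting step (labels and predictions are illustrative ±1 values, `alpha` is the new learner’s weight):

```python
import math

def reweight(weights, labels, predictions, alpha):
    """Up-weight mistakes by exp(alpha), down-weight correct answers by
    exp(-alpha), then renormalize so the weights sum to 1."""
    new = [w * math.exp(-alpha * y * p)
           for w, y, p in zip(weights, labels, predictions)]
    total = sum(new)
    return [w / total for w in new]
```

After one call, a misclassified example carries more weight than a correctly classified one, which is exactly “doing well on what the existing model does badly on.”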
• Doing Better

This idea can then be generalized so that each new weak learner is explicitly treated as a function that points directly opposite to the gradient of the loss of the current combined function.

Given some tree-based ensemble model, represented as a function

$$T_i(X) \rightarrow Y$$

• after already adding $i$ weak learners, we find that the “perfect” function for the $i+1^{th}$ weak learner would be

$$h(x) = Y - T_i(x)$$
• this fills in the gap of what the existing models got wrong.
• This is because the new combined model then perfectly matches the training data:

$$T_{i+1}(x) = T_i(x) + h(x) = Y$$
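A worked numeric check of this identity, with toy numbers (not from the text): the “perfect” next weak learner is the residual $y - T_i(x)$, and adding it back reproduces the targets exactly.

```python
# Targets and the current ensemble's (imperfect) predictions.
y   = [3.0, -1.0, 2.0]
T_i = [2.5,  0.0, 2.0]

# The "perfect" next weak learner: the residual on each training point.
h = [yi - ti for yi, ti in zip(y, T_i)]

# Adding h to the current model matches the training data exactly.
T_next = [ti + hi for ti, hi in zip(T_i, h)]
```

Of course no real weak learner can output the residual exactly, which is why the next section settles for approximating it.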

• In practice we must be satisfied with merely approaching this perfect update, by taking a functional gradient descent approach in which each step fits an approximation of the true residual, as measured by a loss function.

• In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree:

$$L(Y, T(X)) = \sum_i \left( Y - T_i(X) \right)$$
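The whole loop can be sketched as a minimal functional gradient descent for squared loss, with hand-written depth-1 stumps as the weak learners. This is a toy illustration under those assumptions, not a real library implementation:

```python
def fit_stump(xs, residuals):
    """Find the single split minimizing squared error against the residuals,
    predicting the mean residual on each side."""
    best = None
    for split in xs:
        left  = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x < split else rmean

def boost(xs, ys, rounds=50, lr=0.5):
    """Each round fits a stump to the current residuals (the negative
    gradient of squared loss) and adds it, shrunk by a learning rate."""
    stumps = []
    preds = [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)
```

On a tiny step-function dataset the ensemble's predictions approach the targets geometrically, never reaching them exactly in finite rounds: each stump only removes a fraction of the remaining residual.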

• Gradient Tree Boosting

Gradient Tree Boosting explicitly uses the gradient

$$\nabla L(Y, T_i(X)) = \left[ \nabla_{w_i} L(Y, T^{w_i}_i(X)) \right]$$

of the loss function of each tree to fit a new tree

$$h(X) = T_i(X) - \sum_i \nabla_{T_i} L(Y, T_i(X))$$

and then add it to the ensemble.

There is also further optimization of weighting functions for each tree, and various regularization methods which can be applied.

The popular algorithm XGBoost\cite{xgboost} implements this approach.