# Data Types

In summary, nominal variables are used to "name," or label, a series of values. Ordinal scales provide information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the most: order, interval values, plus the ability to calculate ratios, since a "true zero" can be defined.

https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/

# Gradient Tree Boosting

## A More General View of Ensembles

Now that we know about

- [DecisionTrees](DecisionTrees)
- [Boosting](Boosting)
- [RandomForests](RandomForests)

we are ready to learn about a powerful combination of all these concepts with *gradient search*: **Gradient Tree Boosting**.

People realized that the very successful [Boosting](Boosting) method was in essence a very general meta-algorithm for optimizing the mapping function from input variables to output target variables. This algorithm chooses *multiple weak functions* that are combined together, just as the ensemble of decision trees is for [Random Forests](Random Forests).

## What is the Gradient, Though?

- One can imagine that this combined function has a **gradient**.
- In this case the gradient is the *infinitesimal increase* in each of the **function parameters** that would **strengthen** the current response.

### We've already used them

- In an ensemble of decision trees these parameters are **all of the split points** for each data dimension.
- In Random Forests the *gradient is not used*.
- In AdaBoost it is used *implicitly* in a very simple way: each new decision-tree weak learner is optimized relative to the negative of this gradient, since it tries to do *well* on what the existing model does *badly* on.
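The idea in the last bullet, training each new weak learner on what the existing model does badly on, can be sketched as a single residual-fitting step. This is a minimal numpy sketch: `fit_stump` is a hypothetical helper implementing a depth-1 regression tree, not part of any library.

```python
import numpy as np

# Toy 1-D regression data with an asymmetric target.
x = np.linspace(-1.0, 1.0, 200)
y = np.exp(x)

def fit_stump(x, targets):
    """Fit a depth-1 regression tree (a 'stump'): choose the split point
    that minimizes squared error, predicting the mean on each side."""
    best = None
    for t in x:
        left, right = targets[x <= t], targets[x > t]
        if left.size == 0 or right.size == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q, t=t, lv=lv, rv=rv: np.where(q <= t, lv, rv)

T1 = fit_stump(x, y)              # the current (one-tree) model
residual = y - T1(x)              # what the current model gets wrong
h = fit_stump(x, residual)        # new weak learner trained on the mistakes

err_before = np.mean((y - T1(x)) ** 2)
err_after = np.mean((y - (T1(x) + h(x))) ** 2)
assert err_after < err_before     # adding h strengthens the combined model
```

Because the second stump is fit to the residuals of the first, it concentrates exactly where the existing model is worst.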
## Doing Better

This idea can then be generalized so that each new weak learner is *explicitly* treated as a function that points directly away from the gradient of the current combined function.

## Gradient Tree Boosting

Given some tree-based ensemble model, represented as a function

$$T_i(X) \rightarrow Y$$

after $i$ weak learners have been added, the "perfect" function for the $i+1^{th}$ weak learner would be

$$h(x) = Y - T_i(x)$$

- This fills in the gap of what the existing models got wrong.
- The new combined model then *perfectly matches the training data*:

$$T_{i+1}(x) = T_i(x) + h(x) = Y$$

In practice we have to be satisfied with merely *approaching* this perfect update, using a **functional gradient descent** approach: at each step we fit an *approximation of the true residual*, as measured by a [loss function](lossfunction). In our case this loss is simply the sum of squared residuals (the wrong answers) of the current ensemble:

$$L(Y, T_i(X)) = \frac{1}{2}\sum_j \left(Y_j - T_i(X_j)\right)^2$$

Gradient Tree Boosting explicitly uses the gradient of this loss with respect to the current model's predictions. For the squared loss the negative gradient is exactly the residual, so each new tree is fit to

$$h(X) = -\nabla_{T_i} L(Y, T_i(X)) = Y - T_i(X)$$

and is then added to the ensemble:

$$T_{i+1}(X) = T_i(X) + h(X)$$

There is also further optimization of per-tree weighting functions, and various regularization methods can be applied. The popular algorithm **XGBoost**\cite{xgboost} implements this approach.
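The whole functional gradient descent loop can be sketched in a few lines. This is a minimal numpy sketch under the squared-error loss, where the negative gradient with respect to the current predictions is exactly the residual; `fit_stump` is a toy depth-1 tree and `lr` is a hypothetical shrinkage parameter, not XGBoost's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.sort(rng.uniform(-3, 3, 300))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

def fit_stump(x, targets):
    """Depth-1 regression tree fit to `targets` with a squared-error split."""
    best = None
    for t in x[::5]:                     # subsample candidate thresholds
        left, right = targets[x <= t], targets[x > t]
        if left.size == 0 or right.size == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda q, t=t, lv=lv, rv=rv: np.where(q <= t, lv, rv)

# Functional gradient descent with squared loss L = 1/2 * sum (y - T(x))^2:
# the negative gradient w.r.t. the current predictions is the residual.
pred = np.zeros_like(y)
learners, lr = [], 0.5
for _ in range(50):
    neg_grad = y - pred                  # -dL/dT(x), i.e. the residual
    h = fit_stump(x, neg_grad)           # new tree points along the negative gradient
    learners.append(h)
    pred += lr * h(x)                    # add the (shrunken) tree to the ensemble

mse = np.mean((y - pred) ** 2)
assert mse < np.mean((y - y.mean()) ** 2)  # far better than predicting the mean
```

Real implementations such as XGBoost build on this loop with second-order gradient information, per-tree regularization, and subsampling, but the core update is the same.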