# Data Types
In summary, nominal variables are used to “name” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.
https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
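As a quick illustration of how these scales show up in practice, here is a minimal sketch using pandas (the variable names and values are made up for the example):

```python
import pandas as pd

# Nominal: labels with no inherent order.
color = pd.Categorical(["red", "blue", "red"], ordered=False)

# Ordinal: ordered categories, but the distances between them are undefined.
satisfaction = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Interval/ratio: plain numeric columns; ratio data has a true zero,
# so statements like "twice as much" are meaningful.
temperature_c = pd.Series([20.0, 25.0, 30.0])   # interval (no true zero)
weight_kg = pd.Series([60.0, 75.0, 90.0])       # ratio (true zero exists)

print(satisfaction.min(), satisfaction.max())   # order is meaningful for ordinals
```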
# Gradient Tree Boosting
## A More General View of Ensembles
Now that we know about
- [DecisionTrees](DecisionTrees)
- [Boosting](Boosting)
- [RandomForests](RandomForests)
we are ready to learn about a powerful combination of all these concepts with *gradient search*: **Gradient Tree Boosting**.
## A More General View of Ensembles
People realized that the very successful [Boosting](Boosting) method is, in essence, a very general meta-algorithm for optimizing the mapping function from input variables to output target variables.
This algorithm chooses *multiple weak functions* that are combined together, just as an ensemble of decision trees is combined for [RandomForests](RandomForests).
## What is the Gradient, Though?
- One can imagine that this combined function can have a **gradient**
- In this case this is the *infinitesimal* increase in each of the **function parameters** that would **strengthen** the current response.
### We've already used them
- In an ensemble of decision trees these parameters are **all of the split points** for each data dimension in each tree.
- In Random Forests *gradient is not used*
- In AdaBoost it is used *implicitly* in a very simple way:
- each new decision tree weak learner is optimized relative to the negative of this gradient,
- since it tries to do *well* on what the existing model does *badly* on (a sketch follows this list).
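A toy sketch of this implicit behaviour, assuming binary labels in $\{-1, +1\}$ and scikit-learn decision stumps as the weak learners (a simplified discrete AdaBoost, not a library-faithful implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """Toy AdaBoost; y must be an array of -1/+1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)                   # uniform sample weights to start
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak learner
        # Up-weight the examples the current ensemble handles badly --
        # the next stump is therefore fit to what the model got wrong.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.sign(scores)
```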
## Doing Better
This idea can then be generalized so that each new weak learner is *explicitly* treated as a function that points directly away from the gradient (i.e. along the negative gradient) of the current combined function's loss.
## Gradient Tree Boosting
Given some tree-based ensemble model, represented as a function
$$T_i(X) \rightarrow Y$$
- after having already added $i$ weak learners, we find that the "perfect" function for the $(i+1)^{\text{th}}$ weak learner would be
$$h(x) = Y - T_i(x)$$
- this fills in exactly the gap of what the existing model got wrong.
- This is because the new combined model then *perfectly matches the training data* (a numeric check follows below):
$$T_{i+1}(x) = T_i(x) + h(x) = Y$$
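A tiny numeric check of this identity, using made-up targets and an arbitrary current model:

```python
import numpy as np

Y = np.array([3.0, -1.0, 2.5])      # training targets
T_i = np.array([2.0, 0.5, 2.0])     # predictions of the current ensemble

h = Y - T_i                          # the "perfect" next weak learner
T_next = T_i + h

print(np.allclose(T_next, Y))        # True: the training data is matched exactly
```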
## Gradient Tree Boosting
- In practice we need to be satisfied with merely *approaching* this perfect update, using a **functional gradient descent** approach in which each step uses an *approximation of the true residual*, expressed as a [loss function](lossfunction).
- In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak-learner decision tree (a short derivation of the gradient connection follows):
$$L(Y, T(X)) = \sum_i \left( Y - T_i(X) \right)$$
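To see why these residuals behave like a (negative) gradient, consider for illustration a squared-error loss (an assumption for this sketch, not the only possible choice):
$$L(Y, T(X)) = \tfrac{1}{2}\left(Y - T(X)\right)^2$$
Its derivative with respect to the model output is
$$\frac{\partial L}{\partial T(X)} = -\left(Y - T(X)\right),$$
so stepping *against* the gradient means adding the residual $Y - T(X)$ to the model.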
## Gradient Tree Boosting
Gradient Tree Boosting explicitly uses the gradient
$$\nabla L(Y, T_i(X)) = \left[\, \nabla_{w} L\!\left(Y, T_i^{w}(X)\right) \,\right]$$
of the loss function of *each tree*, taken with respect to its parameters $w$ (the split points mentioned above), to fit a new tree that approximates the negative of this gradient,
$$h(X) \approx -\nabla_{T_i} L\!\left(Y, T_i(X)\right),$$
and then adds it to the ensemble, $T_{i+1}(X) = T_i(X) + h(X)$.
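A minimal from-scratch sketch of this loop, assuming a squared-error loss (so the negative gradient is simply the residual) and scikit-learn regression trees as the weak learners; the learning rate `lr` is an extra shrinkage factor not discussed above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boost(X, y, n_rounds=100, lr=0.1, max_depth=3):
    """Toy gradient tree boosting for regression with squared-error loss."""
    baseline = float(np.mean(y))               # initial constant model
    pred = np.full(len(y), baseline)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of (1/2)(y - pred)^2 w.r.t. pred is the residual.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        pred = pred + lr * tree.predict(X)      # step along the negative gradient
    return baseline, trees

def boost_predict(X, baseline, trees, lr=0.1):
    """Evaluate the additive model: baseline + lr * sum of tree predictions."""
    return baseline + lr * sum(tree.predict(X) for tree in trees)
```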
There is also room for further optimization, such as fitting a weight for each tree and applying various regularization methods.
The popular algorithm **XGBoost**\cite{xgboost} implements this approach.
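A hedged usage sketch of the xgboost Python package on toy synthetic data; the hyperparameter values here are arbitrary, not recommendations:

```python
import numpy as np
import xgboost as xgb

# Toy regression data (purely for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=200)

# Gradient tree boosting with shrinkage (learning_rate) and regularization.
model = xgb.XGBRegressor(
    n_estimators=200,     # number of boosted trees (weak learners)
    learning_rate=0.1,    # shrinkage applied to each tree's contribution
    max_depth=4,          # depth of each weak learner
    reg_lambda=1.0,       # L2 regularization on leaf weights
)
model.fit(X, y)
print(model.predict(X[:5]))
```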