[
{
"content": "# Data Types",
"children": [
{
"content": "# Data Types\nIn summary, nominal variables are used to “name,” or label a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values + the ability to quantify the difference between each one. Finally, Ratio scales give us the ultimate–order, interval values, plus the ability to calculate ratios since a “true zero” can be defined.\n \nhttps://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/"
}
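,
{
"content": "# Data Types: a Quick Sketch in Code\nA minimal sketch, assuming pandas is available; the labels and values below are made up purely for illustration.\n\n```python\nimport pandas as pd\n\n# Nominal: labels with no inherent order (e.g. colour)\ncolour = pd.Categorical([\"red\", \"blue\", \"red\", \"green\"])\n\n# Ordinal: labels with a meaningful order (e.g. a satisfaction survey)\nsatisfaction = pd.Categorical(\n    [\"low\", \"high\", \"medium\", \"high\"],\n    categories=[\"low\", \"medium\", \"high\"],\n    ordered=True,\n)\nprint(satisfaction.max())  # order comparisons are meaningful: 'high'\n\n# Interval/ratio data are plain numbers; only ratio scales have a true zero,\n# so statements like \"twice as tall\" are meaningful here\nheights_cm = pd.Series([160.0, 175.5, 182.3])   # ratio scale\nprint(heights_cm.iloc[1] / heights_cm.iloc[0])  # a meaningful ratio\n```"
}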
]
},
{
"content": "# Gradient Tree Boosting",
"children": [
{
"content": "## A More General View of Ensembles\nNow that we have know about \n- [DecisionTrees](DecisionTrees)\n- [Boosting](Boosting)\n- [RandomForests](RandomForests) \nwe are ready to learn about a powerful combination of all these concepts with *gradient search* - **Gradient Tree Boosting**.\n"
},
{
"content": "## A More General View of Ensembles\n\nPeople realized that the very successful [Boosting](Boosting) method was in essence \n Boosting = a very general meta-algorithm for optimization of the mapping function from input variables to output target variables. \n\nThis algorithm chooses *multiple weak functions* that are combined together, just as the ensemble of decision trees are for [Random Forests](Random Forests)."
},
{
"content": "## What is the Gradient Though? \n- One can imagine that this combined function can have a **gradient** \n- In this case this is the *infinistesimal increase* in each of the **function parameters** that would **strengthen** the current response.\n\n### We've already used them\n- In an ensemble of decision trees these parameters are **all of the split points** in each for each data dimension.\n- In Random Forests *gradient is not used*\n- In AdaBoost it is used *implicitly* in a very simple way\n - each new decision tree weak learner \n - is optimized relative to the negative of this gradient\n - since it tries to do *well* on what the existing model does *badly* on."
},
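{
"content": "## We've Already Used Them: a Sketch\nA minimal illustration (not from the original text) of the AdaBoost-style idea that each new weak learner is pushed towards what the current ensemble gets *wrong*; the simple doubling of weights below stands in for AdaBoost's actual exponential re-weighting.\n\n```python\nimport numpy as np\nfrom sklearn.tree import DecisionTreeClassifier\n\nrng = np.random.default_rng(0)\nX = rng.normal(size=(200, 2))\ny = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy labels\n\nweights = np.full(len(y), 1.0 / len(y))    # start with uniform sample weights\nensemble = []\n\nfor _ in range(5):\n    # Fit a weak learner (a decision stump) on the *weighted* data\n    stump = DecisionTreeClassifier(max_depth=1)\n    stump.fit(X, y, sample_weight=weights)\n    ensemble.append(stump)\n\n    # Up-weight the samples the current learner got wrong, so the next\n    # weak learner implicitly works against the current model's errors\n    wrong = stump.predict(X) != y\n    weights[wrong] *= 2.0\n    weights /= weights.sum()\n```"
},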
{
"content": "## Doing Better\nThis idea can then be generalized so that each new weak learner is *explicitely* treated as a function that points directly away from the gradient of the current combined function."
},
{
"content": "## Gradient Tree Boosting\nGiven some tree based ensemble model then, represented as a function \n\n$$T_i(X)\\rightarrow Y$$\n\n- after adding $i$ weak learners already we find that the \"perfect\" function for the $i+1^{th}$ weak learner would be\n$$h(x)=T_i(x) - Y$$ \n- this fills in the gap of what the existing models got wrong. \n - This is because then the new combined model *perfectly matches the training data*: \n$$T_{(i+1)}(x) = T_i(x) + h(x) = Y$$"
},
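{
"content": "## Gradient Tree Boosting: the \"Perfect\" Update in Numbers\nA tiny numeric sketch (the values are made up) of the previous card: if the next weak learner outputs exactly the residuals, the combined model reproduces the training targets.\n\n```python\nimport numpy as np\n\n# Toy training targets and the current ensemble's predictions T_i(x)\ny   = np.array([3.0, -1.0, 2.5, 0.0])\nT_i = np.array([2.0,  0.5, 3.0, 0.5])\n\n# The \"perfect\" next weak learner would output exactly the residuals\nh = y - T_i\n\n# Adding it makes the new combined model match the training data exactly\nT_next = T_i + h\nassert np.allclose(T_next, y)\n```"
},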
{
"content": "## Gradient Tree Boosting\n- In practice we need to be satisfied with merely *approaching* this perfect update by fitting a **functional gradient descent** approach where we use an *approximation of the true residual* (also called the [loss function](lossfunction)) each step. \n\n- In our case this approximation is simply the sum of the wrong answers (i.e. the residuals) from each weak learner decision tree \n\n$$L(Y, T(X)) = \\sum_i Y-T_i(X)$$"
},
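{
"content": "## Approaching the Perfect Update: a Sketch\nA minimal sketch, assuming scikit-learn, of fitting one shallow tree to the residuals of another; the data and tree depths are arbitrary choices for illustration.\n\n```python\nimport numpy as np\nfrom sklearn.tree import DecisionTreeRegressor\n\nrng = np.random.default_rng(1)\nX = rng.uniform(-3, 3, size=(300, 1))\ny = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)\n\n# Current model: a single shallow tree (the first weak learner)\nT_0 = DecisionTreeRegressor(max_depth=2).fit(X, y)\n\n# We can only *approach* the perfect update: fit another shallow tree\n# to the residuals y - T_0(X) rather than matching them exactly\nresiduals = y - T_0.predict(X)\nh = DecisionTreeRegressor(max_depth=2).fit(X, residuals)\n\n# The two-tree ensemble leaves a smaller total residual than T_0 alone\nprint(np.abs(y - T_0.predict(X)).sum())\nprint(np.abs(y - (T_0.predict(X) + h.predict(X))).sum())\n```"
},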
{
"content": "## Gradient Tree Boosting\nGradient Tree Boosting explicitly uses the gradient \n\n$$\\nabla L(Y,T_i(X))=[ \\nabla_{w_i} L(Y,T^{w_i}_i (X))]$$\n\nof the loss function of *each tree* to fit a new tree \n$$h(X)= T_i(X) - \\sum_i \\nabla_{T_i}L(Y,T_i(X))$$ \n\nand then add it to the ensemble. \n\nThere is also further optimization of weighting functions for each tree and various regularization methods which can be done. \n\nThe popular algorithm **XGBoost**\\cite{xgboost} implements approach."
}
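,
{
"content": "## Gradient Tree Boosting: a Bare-Bones Loop\nA minimal sketch of the whole procedure, assuming a squared-error loss so that the negative gradient is just the residual; the function names and the learning rate are my own choices, and real libraries such as XGBoost add second-order information, per-tree weights and regularization on top of this.\n\n```python\nimport numpy as np\nfrom sklearn.tree import DecisionTreeRegressor\n\ndef gradient_tree_boost(X, y, n_trees=50, learning_rate=0.1, max_depth=2):\n    \"\"\"Bare-bones gradient tree boosting for a squared-error loss.\"\"\"\n    trees = []\n    prediction = np.full(len(y), y.mean())      # start from a constant model\n\n    for _ in range(n_trees):\n        negative_gradient = y - prediction      # pseudo-residuals\n        tree = DecisionTreeRegressor(max_depth=max_depth)\n        tree.fit(X, negative_gradient)          # fit h(X) to the negative gradient\n        prediction += learning_rate * tree.predict(X)   # shrunken additive update\n        trees.append(tree)\n\n    return y.mean(), trees\n\ndef boosted_predict(base, trees, X, learning_rate=0.1):\n    out = np.full(X.shape[0], base)\n    for tree in trees:\n        out += learning_rate * tree.predict(X)\n    return out\n```\n\nThe learning rate shrinks each tree's contribution, trading more trees for better generalization."
}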
]
}
]