From a4a0ebb85d07287fda67a26aedac6be398f34e64 Mon Sep 17 00:00:00 2001
From: Randall Britten
Date: Thu, 30 Dec 2021 04:05:54 +1300
Subject: [PATCH] [doc] Lowercase omega for per tree complexity (#7532)

As suggested on issue #7480
---
 doc/tutorials/model.rst | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/doc/tutorials/model.rst b/doc/tutorials/model.rst
index 701fc9943..f58a06b86 100644
--- a/doc/tutorials/model.rst
+++ b/doc/tutorials/model.rst
@@ -97,11 +97,13 @@ Mathematically, we can write our model in the form
 
   \hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F}
 
-where :math:`K` is the number of trees, :math:`f` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by
+where :math:`K` is the number of trees, :math:`f_k` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by
 
 .. math::
 
-  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)
+  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \omega(f_k)
+
+where :math:`\omega(f_k)` is the complexity of the tree :math:`f_k`, defined in detail later.
 
 Now here comes a trick question: what is the *model* used in random forests? Tree ensembles! So random forests and boosted trees are really the same models; the
 difference arises from how we train them. This means that, if you write a predictive service for tree ensembles, you only need to write one and it should work
@@ -117,7 +119,7 @@ Let the following be the objective function (remember it always needs to contain
 
 .. math::
 
-  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)
+  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i)
 
 Additive Training
 =================
@@ -141,15 +143,15 @@ It remains to ask: which tree do we want at each step? A natural thing is to ad
 
 .. math::
 
-  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
-    & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i) \\
+    & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \omega(f_t) + \mathrm{constant}
 
 If we consider using mean squared error (MSE) as our loss function, the objective becomes
 
 .. math::
 
-  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
-    & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\omega(f_i) \\
+    & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \omega(f_t) + \mathrm{constant}
 
 The form of MSE is friendly, with a first order term (usually called the residual) and a quadratic term.
 For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form.
@@ -157,7 +159,7 @@ So in the general case, we take the *Taylor expansion of the loss function up to the second order*:
 
 .. math::
 
-  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t) + \mathrm{constant}
 
 where the :math:`g_i` and :math:`h_i` are defined as
 
@@ -170,7 +172,7 @@ After we remove all the constants, the specific objective at step :math:`t` beco
 
 .. math::
 
-  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
+  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t)
 
 This becomes our optimization goal for the new tree. One important advantage of this definition is that
 the value of the objective function only depends on :math:`g_i` and :math:`h_i`. This is how XGBoost supports custom loss functions.
@@ -180,7 +182,7 @@ the same solver that takes :math:`g_i` and :math:`h_i` as input!
 Model Complexity
 ================
 We have introduced the training step, but wait, there is one important thing, the **regularization term**!
-We need to define the complexity of the tree :math:`\Omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as
+We need to define the complexity of the tree :math:`\omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as
 
 .. math::
 
@@ -191,7 +193,7 @@ In XGBoost, we define the complexity as
 
 .. math::
 
-  \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
+  \omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
 
 Of course, there is more than one way to define the complexity, but this one works well in practice. The regularization is one part most tree packages treat less carefully, or simply ignore.
 This was because the traditional treatment of tree learning only emphasized improving impurity, while the complexity control was left to heuristics.
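Note (not part of the patch): the hunks above stress that the step-:math:`t` objective depends on the loss only through :math:`g_i` and :math:`h_i`, which is how XGBoost supports custom loss functions. A minimal sketch of what that looks like through the Python ``obj`` callback of ``xgb.train``; the dataset and hyperparameters here are made up purely for illustration:

.. code-block:: python

   import numpy as np
   import xgboost as xgb

   # Custom squared-error objective: return g_i and h_i for l = 1/2 (y - yhat)^2.
   def squared_error_obj(preds, dtrain):
       labels = dtrain.get_label()
       grad = preds - labels         # g_i: first derivative of the loss w.r.t. the prediction
       hess = np.ones_like(preds)    # h_i: second derivative of the loss w.r.t. the prediction
       return grad, hess

   # Synthetic data, for illustration only.
   X = np.random.rand(100, 5)
   y = np.random.rand(100)
   dtrain = xgb.DMatrix(X, label=y)

   # The same solver consumes g_i and h_i regardless of which loss produced them.
   booster = xgb.train({"max_depth": 2, "eta": 0.1}, dtrain,
                       num_boost_round=10, obj=squared_error_obj)

Swapping in a different loss only means changing how ``grad`` and ``hess`` are computed; the tree-growing machinery stays the same.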