[doc] Lowercase omega for per tree complexity (#7532)

As suggested in issue #7480
This commit is contained in:
Randall Britten 2021-12-30 04:05:54 +13:00 committed by GitHub
parent 3886c3dd8f
commit a4a0ebb85d


@@ -97,11 +97,13 @@ Mathematically, we can write our model in the form
 
   \hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F}
 
-where :math:`K` is the number of trees, :math:`f` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by
+where :math:`K` is the number of trees, :math:`f_k` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by
 
 .. math::
 
-  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)
+  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \omega(f_k)
+
+where :math:`\omega(f_k)` is the complexity of the tree :math:`f_k`, defined in detail later.
 
 Now here comes a trick question: what is the *model* used in random forests? Tree ensembles! So random forests and boosted trees are really the same models; the
 difference arises from how we train them. This means that, if you write a predictive service for tree ensembles, you only need to write one and it should work
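To make the ensemble model in this hunk concrete, here is a minimal Python sketch; the toy `trees` and `ensemble_predict` are illustrative stand-ins for real CARTs and are not XGBoost API:

```python
# Toy illustration of y_hat_i = sum_{k=1}^K f_k(x_i): each "tree" is a plain
# function standing in for a real CART, mapping an input to a leaf score.
def ensemble_predict(trees, x):
    """Sum the per-tree outputs for a single input x."""
    return sum(f(x) for f in trees)

# Two hypothetical single-split trees.
trees = [
    lambda x: 2.0 if x < 5 else -1.0,
    lambda x: 0.5 if x < 10 else 1.5,
]

print(ensemble_predict(trees, 3.0))   # 2.0 + 0.5 = 2.5
```

The same sum is what a prediction service for either random forests or boosted trees has to compute, which is why one implementation serves both.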
@@ -117,7 +119,7 @@ Let the following be the objective function (remember it always needs to contain
 
 .. math::
 
-  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)
+  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i)
 
 Additive Training
 =================
@@ -141,15 +143,15 @@ It remains to ask: which tree do we want at each step? A natural thing is to ad
 
 .. math::
 
-  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
-            & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i) \\
+            & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \omega(f_t) + \mathrm{constant}
 
 If we consider using mean squared error (MSE) as our loss function, the objective becomes
 
 .. math::
 
-  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
-            & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\omega(f_i) \\
+            & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \omega(f_t) + \mathrm{constant}
 
 The form of MSE is friendly, with a first order term (usually called the residual) and a quadratic term.
 For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form.
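The MSE expansion in this hunk is easy to verify numerically; a small sketch with arbitrary example values:

```python
# Check that (y - (y_hat + f))^2 equals 2*(y_hat - y)*f + f^2 plus a term
# that does not depend on the new tree's output f: the constant (y - y_hat)^2.
y, y_hat, f = 3.0, 2.5, 0.7           # arbitrary label, previous prediction, new tree output
lhs = (y - (y_hat + f)) ** 2
constant = (y - y_hat) ** 2           # independent of f, so it drops out of the optimization
rhs = 2.0 * (y_hat - y) * f + f ** 2 + constant
print(abs(lhs - rhs))                 # ~0: the two forms agree
```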
@@ -157,7 +159,7 @@ So in the general case, we take the *Taylor expansion of the loss function up to
 
 .. math::
 
-  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t) + \mathrm{constant}
 
 where the :math:`g_i` and :math:`h_i` are defined as
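For squared-error loss these statistics have a simple closed form; a hedged sketch (the function name is made up for illustration):

```python
# Gradient and hessian statistics of squared-error loss l = (y - y_hat)^2,
# taken with respect to the prediction and evaluated at the previous round:
#   g_i = d l / d y_hat     = 2 * (y_hat_i - y_i)
#   h_i = d^2 l / d y_hat^2 = 2
def grad_hess_squared_error(y, y_hat):
    g = [2.0 * (p - t) for t, p in zip(y, y_hat)]
    h = [2.0] * len(y)
    return g, h

g, h = grad_hess_squared_error([1.0, 0.0], [0.5, 0.25])
print(g, h)   # [-1.0, 0.5] [2.0, 2.0]
```

Note that `g_i` matches the `2(y_hat - y)` residual term in the MSE expansion above.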
@@ -170,7 +172,7 @@ After we remove all the constants, the specific objective at step :math:`t` beco
 
 .. math::
 
-  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
+  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t)
 
 This becomes our optimization goal for the new tree. One important advantage of this definition is that
 the value of the objective function only depends on :math:`g_i` and :math:`h_i`. This is how XGBoost supports custom loss functions.
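As an illustration of plugging a different loss into the same solver, here is a sketch of `g_i` and `h_i` for logistic loss on raw scores (the function name is hypothetical; a real custom objective would be wired into the training call, and the formulas below are the standard derivatives of log loss):

```python
import math

# For logistic loss with raw score s and label y in {0, 1}:
#   p = sigmoid(s);  g = p - y;  h = p * (1 - p)
def grad_hess_logistic(y, scores):
    p = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    g = [pi - yi for yi, pi in zip(y, p)]
    h = [pi * (1.0 - pi) for pi in p]
    return g, h

g, h = grad_hess_logistic([1.0, 0.0], [0.0, 0.0])
print(g, h)   # [-0.5, 0.5] [0.25, 0.25]
```

Nothing else about the solver changes: only the per-example `g_i` and `h_i` differ from the squared-error case.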
@@ -180,7 +182,7 @@ the same solver that takes :math:`g_i` and :math:`h_i` as input!
 
 Model Complexity
 ================
 We have introduced the training step, but wait, there is one important thing, the **regularization term**!
-We need to define the complexity of the tree :math:`\Omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as
+We need to define the complexity of the tree :math:`\omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as
 
 .. math::
@@ -191,7 +193,7 @@ In XGBoost, we define the complexity as
 
 .. math::
 
-  \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
+  \omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
 
 Of course, there is more than one way to define the complexity, but this one works well in practice. The regularization is one part most tree packages treat
 less carefully, or simply ignore. This was because the traditional treatment of tree learning only emphasized improving impurity, while the complexity control was left to heuristics.
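The complexity formula in this hunk can be computed directly from a tree's leaf weights; a minimal sketch, with arbitrary example values for the weights, gamma, and lambda:

```python
# omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2, where T is the number
# of leaves and w_j are the leaf scores.
def tree_complexity(leaf_weights, gamma, lam):
    T = len(leaf_weights)
    return gamma * T + 0.5 * lam * sum(w * w for w in leaf_weights)

# Three leaves with scores 2, 0.1, -1: omega = 1*3 + 0.5 * 5.01 = 5.505
print(tree_complexity([2.0, 0.1, -1.0], gamma=1.0, lam=1.0))
```

The `gamma` term penalizes the number of leaves and the `lam` term shrinks the leaf scores, which is exactly the complexity control the surrounding text says many tree packages leave to heuristics.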