[doc] Lowercase omega for per tree complexity (#7532)
As suggested on issue #7480
parent 3886c3dd8f
commit a4a0ebb85d
@@ -97,11 +97,13 @@ Mathematically, we can write our model in the form

.. math::

  \hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F}

-where :math:`K` is the number of trees, :math:`f` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by
+where :math:`K` is the number of trees, :math:`f_k` is a function in the functional space :math:`\mathcal{F}`, and :math:`\mathcal{F}` is the set of all possible CARTs. The objective function to be optimized is given by

.. math::

-  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k)
+  \text{obj}(\theta) = \sum_i^n l(y_i, \hat{y}_i) + \sum_{k=1}^K \omega(f_k)
+
+where :math:`\omega(f_k)` is the complexity of the tree :math:`f_k`, defined in detail later.

Now here comes a trick question: what is the *model* used in random forests? Tree ensembles! So random forests and boosted trees are really the same models; the
difference arises from how we train them. This means that, if you write a predictive service for tree ensembles, you only need to write one and it should work
@@ -117,7 +119,7 @@ Let the following be the objective function (remember it always needs to contain

.. math::

-  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i)
+  \text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i)

Additive Training
=================
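For readers following the derivation, the additive setup that this step-:math:`t` objective plugs into is the usual one: the prediction is built one tree at a time, so each round only adds a single new function. A sketch in the document's notation (the surrounding unchanged lines are not part of this hunk):

.. math::

  \hat{y}_i^{(0)} &= 0\\
  \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)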
@@ -141,15 +143,15 @@ It remains to ask: which tree do we want at each step? A natural thing is to ad

.. math::

-  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\Omega(f_i) \\
-            & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t\omega(f_i) \\
+            & = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \omega(f_t) + \mathrm{constant}

If we consider using mean squared error (MSE) as our loss function, the objective becomes

.. math::

-  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\Omega(f_i) \\
-            & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} & = \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t\omega(f_i) \\
+            & = \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2] + \omega(f_t) + \mathrm{constant}

The form of MSE is friendly, with a first order term (usually called the residual) and a quadratic term.
For other losses of interest (for example, logistic loss), it is not so easy to get such a nice form.
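The second line of the MSE objective above follows from expanding the square and folding every term that does not involve :math:`f_t` into the constant; written out for one summand:

.. math::

  (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 = (y_i - \hat{y}_i^{(t-1)})^2 + 2(\hat{y}_i^{(t-1)} - y_i)f_t(x_i) + f_t(x_i)^2

where :math:`(y_i - \hat{y}_i^{(t-1)})^2` does not depend on :math:`f_t` and is absorbed into the constant.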
@@ -157,7 +159,7 @@ So in the general case, we take the *Taylor expansion of the loss function up to

.. math::

-  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}
+  \text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t) + \mathrm{constant}

where the :math:`g_i` and :math:`h_i` are defined as
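The definitions referenced by the last context line sit in unchanged lines outside this hunk; they are the standard first and second derivatives of the loss with respect to the previous round's prediction:

.. math::

  g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\
  h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})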
@@ -170,7 +172,7 @@ After we remove all the constants, the specific objective at step :math:`t` beco

.. math::

-  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)
+  \sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \omega(f_t)

This becomes our optimization goal for the new tree. One important advantage of this definition is that
the value of the objective function only depends on :math:`g_i` and :math:`h_i`. This is how XGBoost supports custom loss functions.
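Because the solver only consumes :math:`g_i` and :math:`h_i`, a custom loss in the Python package amounts to a callback that returns those two arrays. A minimal sketch, assuming the ``obj`` callback of ``xgboost.train`` and hypothetical toy data:

.. code-block:: python

  import numpy as np
  import xgboost as xgb

  def squared_error_obj(predt: np.ndarray, dtrain: xgb.DMatrix):
      """Return per-row gradient g_i and hessian h_i for 0.5 * (predt - y)^2."""
      y = dtrain.get_label()
      grad = predt - y             # g_i: first derivative w.r.t. the prediction
      hess = np.ones_like(predt)   # h_i: second derivative, constant 1 here
      return grad, hess

  # Hypothetical toy data; any numeric features and labels would do.
  X = np.random.rand(100, 5)
  y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
  dtrain = xgb.DMatrix(X, label=y)

  booster = xgb.train({"tree_method": "hist"}, dtrain,
                      num_boost_round=10, obj=squared_error_obj)

Swapping in a different loss only changes how ``grad`` and ``hess`` are computed; the tree-building solver itself stays the same.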
@@ -180,7 +182,7 @@ the same solver that takes :math:`g_i` and :math:`h_i` as input!

Model Complexity
================
We have introduced the training step, but wait, there is one important thing, the **regularization term**!
-We need to define the complexity of the tree :math:`\Omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as
+We need to define the complexity of the tree :math:`\omega(f)`. In order to do so, let us first refine the definition of the tree :math:`f(x)` as

.. math::
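The refined definition following that ``.. math::`` directive is unchanged by this commit, so it is not shown in the hunk; in the usual leaf-weight view of a tree it reads

.. math::

  f_t(x) = w_{q(x)},\ w \in \mathbb{R}^T,\ q: \mathbb{R}^d \rightarrow \{1, 2, \dots, T\}

where :math:`w` is the vector of scores on leaves, :math:`q` maps a data point to its leaf, and :math:`T` is the number of leaves. These are the :math:`T` and :math:`w_j` that appear in the complexity term below.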
@@ -191,7 +193,7 @@ In XGBoost, we define the complexity as

.. math::

-  \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2
+  \omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2

Of course, there is more than one way to define the complexity, but this one works well in practice. The regularization is one part most tree packages treat
less carefully, or simply ignore. This was because the traditional treatment of tree learning only emphasized improving impurity, while the complexity control was left to heuristics.
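The :math:`\gamma` and :math:`\lambda` in this complexity term are exposed as the ``gamma`` (alias ``min_split_loss``) and ``lambda`` (alias ``reg_lambda``) training parameters, so the strength of the per-tree penalty is user-tunable. A small sketch with hypothetical data and values:

.. code-block:: python

  import numpy as np
  import xgboost as xgb

  # Hypothetical toy data.
  X = np.random.rand(200, 4)
  y = np.random.rand(200)
  dtrain = xgb.DMatrix(X, label=y)

  params = {
      "objective": "reg:squarederror",
      "gamma": 1.0,      # contributes gamma * T: cost per additional leaf
      "lambda": 1.5,     # contributes 0.5 * lambda * sum(w_j^2): L2 penalty on leaf weights
      "max_depth": 3,
  }
  booster = xgb.train(params, dtrain, num_boost_round=20)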