Support categorical data for hist. (#7695)

* Extract partitioner from hist.
* Implement categorical data support by passing the gradient index directly into the partitioner.
* Organize/update document.
* Remove code for negative hessian.
2022-02-25 03:47:14 +08:00
parent f60d95b0ba
commit 83a66b4994
15 changed files with 402 additions and 498 deletions
--- a/doc/parameter.rst
+++ b/doc/parameter.rst
@@ -244,9 +244,6 @@ Additional parameters for ``hist``, ``gpu_hist`` and ``approx`` tree method

  - Use single precision to build histograms instead of double precision.

-Additional parameters for ``approx`` and ``gpu_hist`` tree method
-=================================================================
-
 * ``max_cat_to_onehot``

  .. versionadded:: 1.6
@@ -256,8 +253,8 @@ Additional parameters for ``approx`` and ``gpu_hist`` tree method
  - A threshold for deciding whether XGBoost should use one-hot encoding based split for
    categorical data.  When number of categories is lesser than the threshold then one-hot
    encoding is chosen, otherwise the categories will be partitioned into children nodes.
-    Only relevant for regression and binary classification. Also, `approx` or `gpu_hist`
-    tree method is required.
+    Only relevant for regression and binary classification. Also, ``exact`` tree method is
+    not supported.

 Additional parameters for Dart Booster (``booster=dart``)
 =========================================================
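
Note on the ``max_cat_to_onehot`` hunk above and the "Optimal Partitioning" section in the ``categorical.rst`` diff: the rule they describe can be sketched in plain Python. This is a minimal illustration under stated assumptions, not XGBoost's implementation. The helper names (``score``, ``split_gain``, ``best_categorical_split``, ``brute_force_gain``), the per-category gradient/hessian dictionaries, and the sort key ``grad / (hess + lam)`` are all hypothetical choices made for the example. The gain is the usual second-order structure score without the constant factor and complexity penalty.

```python
from itertools import combinations


def score(g, h, lam):
    """Structure score G^2 / (H + lambda) for summed gradient g and hessian h."""
    return g * g / (h + lam)


def split_gain(left, grad, hess, lam):
    """Gain from sending the categories in `left` to one child and the rest to the other."""
    g_all, h_all = sum(grad.values()), sum(hess.values())
    g_l = sum(grad[c] for c in left)
    h_l = sum(hess[c] for c in left)
    return (score(g_l, h_l, lam)
            + score(g_all - g_l, h_all - h_l, lam)
            - score(g_all, h_all, lam))


def best_categorical_split(grad, hess, max_cat_to_onehot, lam=0.0):
    """Pick the candidate set following the documented threshold rule.

    Fewer categories than the threshold: one-hot style, one category vs. the rest.
    Otherwise: sort categories by an assumed statistic and try contiguous prefixes.
    Requires at least two categories with positive hessians.
    """
    cats = sorted(grad)
    if len(cats) < max_cat_to_onehot:
        candidates = [frozenset([c]) for c in cats]
    else:
        # Assumption: order categories by gradient/hessian ratio.
        order = sorted(cats, key=lambda c: grad[c] / (hess[c] + lam))
        candidates = [frozenset(order[:i]) for i in range(1, len(order))]
    best = max(candidates, key=lambda left: split_gain(left, grad, hess, lam))
    return best, split_gain(best, grad, hess, lam)


def brute_force_gain(grad, hess, lam=0.0):
    """Best gain over every non-empty proper subset, for comparison only."""
    cats = sorted(grad)
    subsets = (frozenset(s)
               for r in range(1, len(cats))
               for s in combinations(cats, r))
    return max(split_gain(s, grad, hess, lam) for s in subsets)


# Made-up per-category statistics for one node.
grad = {"a": -4.0, "b": 1.0, "c": 3.5, "d": -0.5, "e": 2.0}
hess = {"a": 2.0, "b": 1.0, "c": 2.0, "d": 1.0, "e": 1.0}

onehot_left, onehot_gain = best_categorical_split(grad, hess, max_cat_to_onehot=10)
part_left, part_gain = best_categorical_split(grad, hess, max_cat_to_onehot=1)
print(sorted(onehot_left), sorted(part_left))
```

With ``lam=0.0`` the sorted scan only evaluates ``n - 1`` candidates yet matches the exhaustive search over all subsets on this data, which is the point of the sorted-partition argument.
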
--- a/doc/tutorials/categorical.rst
+++ b/doc/tutorials/categorical.rst
@@ -4,16 +4,16 @@ Categorical Data

 .. note::

-   As of XGBoost 1.6, the feature is highly experimental and has limited features
+   As of XGBoost 1.6, this feature is experimental and has limited functionality.

 Starting from version 1.5, XGBoost has experimental support for categorical data available
-for public testing.  At the moment, the support is implemented as one-hot encoding based
-categorical tree splits.  For numerical data, the split condition is defined as
-:math:`value < threshold`, while for categorical data the split is defined as :math:`value
-== category` and ``category`` is a discrete value.  More advanced categorical split
-strategy is planned for future releases and this tutorial details how to inform XGBoost
-about the data type.  Also, the current support for training is limited to ``gpu_hist``
-tree method.
+for public testing. For numerical data, the split condition is defined as :math:`value <
+threshold`, while for categorical data the split depends on whether partitioning or
+one-hot encoding is used. For partition-based splits, the condition is specified as
+:math:`value \in categories`, where ``categories`` is a set of categories from one
+feature. If one-hot encoding is used instead, the split is defined as
+:math:`value == category`. More advanced categorical split strategies are planned for
+future releases. This tutorial details how to inform XGBoost about the data type.

 ************************************
 Training with scikit-learn Interface
@@ -35,13 +35,13 @@ parameter ``enable_categorical``:

 .. code:: python

-  # Only gpu_hist is supported for categorical data as mentioned previously
+  # Supported tree methods are `gpu_hist`, `approx`, and `hist`.
  clf = xgb.XGBClassifier(
      tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False
  )
  # X is the dataframe we created in previous snippet
  clf.fit(X, y)
-  # Must use JSON for serialization, otherwise the information is lost
+  # Must use JSON/UBJSON for serialization, otherwise the information is lost.
  clf.save_model("categorical-model.json")


@@ -60,11 +60,37 @@ can plot the model and calculate the global feature importance:


 The ``scikit-learn`` interface from dask is similar to single node version.  The basic
-idea is create dataframe with category feature type, and tell XGBoost to use ``gpu_hist``
-with parameter ``enable_categorical``.  See :ref:`sphx_glr_python_examples_categorical.py`
-for a worked example of using categorical data with ``scikit-learn`` interface.  A
-comparison between using one-hot encoded data and XGBoost's categorical data support can
-be found :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
+idea is to create a dataframe with category feature type and tell XGBoost to use it by
+setting the ``enable_categorical`` parameter.  See :ref:`sphx_glr_python_examples_categorical.py`
+for a worked example of using categorical data with the ``scikit-learn`` interface and
+one-hot encoding.  A comparison between using one-hot encoded data and XGBoost's
+categorical data support can be found in :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
+
+
+********************
+Optimal Partitioning
+********************
+
+.. versionadded:: 1.6
+
+Optimal partitioning is a technique for partitioning the categorical predictors for each
+node split. The proof of optimality for numerical objectives like ``RMSE`` was first
+introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
+regression and binary classification tasks `[2] <#references>`__. Later, LightGBM `[3]
+<#references>`__ brought it to the context of gradient boosting trees, and it is now also
+adopted in XGBoost as an optional feature for handling categorical splits. More
+specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
+partition a set of discrete values into groups based on the distances between a measure of
+these values, one only needs to look at sorted partitions instead of enumerating all
+possible permutations. In the context of decision trees, the discrete values are
+categories, and the measure is the output leaf value.  Intuitively, we want to group the
+categories that output similar leaf values. During split finding, we first sort the
+gradient histogram to prepare the contiguous partitions, then enumerate the splits
+according to these sorted values. One of the related parameters for XGBoost is
+``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
+used for each feature; see :doc:`/parameter` for details.  When the objective is not
+regression or binary classification, XGBoost will fall back to using one-hot encoding
+instead.


 **********************
@@ -82,7 +108,7 @@ categorical data, we need to pass the similar parameter to :class:`DMatrix

  # X is a dataframe we created in previous snippet
  Xy = xgb.DMatrix(X, y, enable_categorical=True)
-  booster = xgb.train({"tree_method": "gpu_hist"}, Xy)
+  booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
  # Must use JSON for serialization, otherwise the information is lost
  booster.save_model("categorical-model.json")

@@ -109,30 +135,7 @@ types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatr

 For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
 feature it's specified as ``"c"``.  The Dask module in XGBoost has the same interface so
-:class:`dask.Array <dask.Array>` can also be used as categorical data.
-
-********************
-Optimal Partitioning
-********************
-
-.. versionadded:: 1.6
-
-Optimal partitioning is a technique for partitioning the categorical predictors for each
-node split, the proof of optimality for numerical objectives like ``RMSE`` was first
-introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
-regression and binary classification tasks `[2] <#references>`__, later LightGBM `[3]
-<#references>`__ brought it to the context of gradient boosting trees and now is also
-adopted in XGBoost as an optional feature for handling categorical splits. More
-specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
-partition a set of discrete values into groups based on the distances between a measure of
-these values, one only needs to look at sorted partitions instead of enumerating all
-possible permutations. In the context of decision trees, the discrete values are
-categories, and the measure is the output leaf value.  Intuitively, we want to group the
-categories that output similar leaf values. During split finding, we first sort the
-gradient histogram to prepare the contiguous partitions then enumerate the splits
-according to these sorted values. One of the related parameters for XGBoost is
-``max_cat_to_one_hot``, which controls whether one-hot encoding or partitioning should be
-used for each feature, see :doc:`/parameter` for details.
+:class:`dask.Array <dask.Array>` can also be used for categorical data.

 *************
 Miscellaneous