Support categorical data for hist. (#7695)

* Extract partitioner from hist.
* Implement categorical data support by passing the gradient index directly into the partitioner.
* Organize/update document.
* Remove code for negative hessian.
This commit is contained in:
Jiaming Yuan
2022-02-25 03:47:14 +08:00
committed by GitHub
parent f60d95b0ba
commit 83a66b4994
15 changed files with 402 additions and 498 deletions


@@ -244,9 +244,6 @@ Additional parameters for ``hist``, ``gpu_hist`` and ``approx`` tree method
- Use single precision to build histograms instead of double precision.
Additional parameters for ``approx`` and ``gpu_hist`` tree method
=================================================================
* ``max_cat_to_onehot``
.. versionadded:: 1.6
@@ -256,8 +253,8 @@ Additional parameters for ``approx`` and ``gpu_hist`` tree method
- A threshold for deciding whether XGBoost should use one-hot encoding based split for
categorical data. When the number of categories is less than the threshold, one-hot
encoding is chosen; otherwise the categories will be partitioned into children nodes.
Only relevant for regression and binary classification. Also, `approx` or `gpu_hist`
tree method is required.
Only relevant for regression and binary classification. Also, the ``exact`` tree method
is not supported.
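
The threshold rule can be illustrated with a minimal sketch. The function name and the
``partition_supported`` flag are hypothetical and not part of the XGBoost API; the flag
stands in for "the objective is regression or binary classification", and the sketch only
restates the rule described above.

.. code:: python

    def choose_categorical_split(n_categories, max_cat_to_onehot, partition_supported=True):
        """Return which split strategy the documented rule selects for one feature."""
        # Assumption: partitioning is only considered when the objective supports it.
        if not partition_supported:
            return "onehot"
        # Fewer categories than the threshold: one-hot encoding based split.
        if n_categories < max_cat_to_onehot:
            return "onehot"
        # Otherwise the categories are partitioned into children nodes.
        return "partition"


    few = choose_categorical_split(n_categories=3, max_cat_to_onehot=5)
    many = choose_categorical_split(n_categories=12, max_cat_to_onehot=5)
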
Additional parameters for Dart Booster (``booster=dart``)
=========================================================


@@ -4,16 +4,16 @@ Categorical Data
.. note::
As of XGBoost 1.6, the feature is highly experimental and has limited features
As of XGBoost 1.6, this feature is experimental and has limited functionality.
Starting from version 1.5, XGBoost has experimental support for categorical data available
for public testing. At the moment, the support is implemented as one-hot encoding based
categorical tree splits. For numerical data, the split condition is defined as
:math:`value < threshold`, while for categorical data the split is defined as :math:`value
== category` and ``category`` is a discrete value. More advanced categorical split
strategy is planned for future releases and this tutorial details how to inform XGBoost
about the data type. Also, the current support for training is limited to ``gpu_hist``
tree method.
for public testing. For numerical data, the split condition is defined as :math:`value <
threshold`, while for categorical data the split condition depends on whether partitioning
or one-hot encoding is used. For partition-based splits, the condition is :math:`value \in
categories`, where ``categories`` is a subset of the categories in one feature. For
one-hot encoding based splits, the condition is :math:`value == category`. More advanced
categorical split strategies are planned for future releases. This tutorial details how to
inform XGBoost about the data type.
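
The three split conditions can be written down as plain predicates. The following is
only an illustrative sketch: the function names are hypothetical and not part of
XGBoost, and which child node a matching row is sent to is deliberately left out.

.. code:: python

    def numerical_split(value, threshold):
        """Numerical feature: value < threshold."""
        return value < threshold


    def onehot_split(value, category):
        """One-hot encoding based split: value == category."""
        return value == category


    def partition_split(value, categories):
        """Partition-based split: value in categories (a set of category codes)."""
        return value in categories


    # Category codes of one feature for four rows.
    codes = [0, 1, 2, 3]
    onehot_result = [onehot_split(c, 2) for c in codes]
    partition_result = [partition_split(c, frozenset({1, 3})) for c in codes]

A one-hot split isolates a single category per node, while a partition-based split can
separate several categories at once.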
************************************
Training with scikit-learn Interface
@@ -35,13 +35,13 @@ parameter ``enable_categorical``:
.. code:: python
# Only gpu_hist is supported for categorical data as mentioned previously
# Supported tree methods are `gpu_hist`, `approx`, and `hist`.
clf = xgb.XGBClassifier(
tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False
)
# X is the dataframe we created in previous snippet
clf.fit(X, y)
# Must use JSON for serialization, otherwise the information is lost
# Must use JSON/UBJSON for serialization, otherwise the information is lost.
clf.save_model("categorical-model.json")
@@ -60,11 +60,37 @@ can plot the model and calculate the global feature importance:
The ``scikit-learn`` interface from dask is similar to the single node version. The basic
idea is create dataframe with category feature type, and tell XGBoost to use ``gpu_hist``
with parameter ``enable_categorical``. See :ref:`sphx_glr_python_examples_categorical.py`
for a worked example of using categorical data with ``scikit-learn`` interface. A
comparison between using one-hot encoded data and XGBoost's categorical data support can
be found :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
idea is to create a dataframe with the category feature type and tell XGBoost to use it by
setting the ``enable_categorical`` parameter. See
:ref:`sphx_glr_python_examples_categorical.py` for a worked example of using categorical
data with the ``scikit-learn`` interface with one-hot encoding. A comparison between using
one-hot encoded data and XGBoost's categorical data support can be found in
:ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
********************
Optimal Partitioning
********************
.. versionadded:: 1.6
Optimal partitioning is a technique for partitioning the categorical predictors for each
node split. The proof of optimality for numerical objectives like ``RMSE`` was first
introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
regression and binary classification tasks `[2] <#references>`__. Later, LightGBM `[3]
<#references>`__ brought it to the context of gradient boosting trees, and it is now also
adopted in XGBoost as an optional feature for handling categorical splits. More
specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
partition a set of discrete values into groups based on the distances between a measure of
these values, one only needs to look at sorted partitions instead of enumerating all
possible permutations. In the context of decision trees, the discrete values are
categories, and the measure is the output leaf value. Intuitively, we want to group the
categories that output similar leaf values. During split finding, we first sort the
gradient histogram to prepare the contiguous partitions, then enumerate the splits
according to these sorted values. One of the related parameters for XGBoost is
``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
used for each feature; see :doc:`/parameter` for details. When the objective is not
regression or binary classification, XGBoost will fall back to using one-hot encoding
instead.
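
The sorted-partition idea can be sketched in a few lines of pure Python. This is not
XGBoost's implementation: it assumes one gradient sum and one strictly positive hessian
sum per category, ignores regularization, and uses a simplified score
(:math:`G^2 / H` summed over both children). All function names are hypothetical. The
exhaustive search is included only to compare against the sorted scan on a small input.

.. code:: python

    from itertools import combinations


    def split_score(g_left, h_left, g_total, h_total):
        """Simplified split score: G^2 / H summed over the two children."""
        g_right = g_total - g_left
        h_right = h_total - h_left
        return g_left ** 2 / h_left + g_right ** 2 / h_right


    def best_sorted_partition(grad, hess):
        """Scan only contiguous partitions of the categories sorted by grad / hess."""
        order = sorted(range(len(grad)), key=lambda c: grad[c] / hess[c])
        g_total, h_total = sum(grad), sum(hess)
        best_score, best_left = float("-inf"), frozenset()
        g_left = h_left = 0.0
        # The last category is skipped so that the right child is never empty.
        for i, c in enumerate(order[:-1]):
            g_left += grad[c]
            h_left += hess[c]
            score = split_score(g_left, h_left, g_total, h_total)
            if score > best_score:
                best_score, best_left = score, frozenset(order[: i + 1])
        return best_left, best_score


    def best_exhaustive_score(grad, hess):
        """Best score over every non-empty proper subset of categories."""
        g_total, h_total = sum(grad), sum(hess)
        best = float("-inf")
        for size in range(1, len(grad)):
            for left in combinations(range(len(grad)), size):
                g_left = sum(grad[c] for c in left)
                h_left = sum(hess[c] for c in left)
                best = max(best, split_score(g_left, h_left, g_total, h_total))
        return best


    # Per-category gradient and hessian sums for a feature with four categories.
    grad = [-4.0, 1.0, -3.5, 2.0]
    hess = [2.0, 1.0, 2.0, 1.0]
    left, score = best_sorted_partition(grad, hess)
    exhaustive = best_exhaustive_score(grad, hess)

The sorted scan evaluates a number of candidate splits that is linear in the number of
categories, whereas the exhaustive search grows exponentially.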
**********************
@@ -82,7 +108,7 @@ categorical data, we need to pass the similar parameter to :class:`DMatrix
# X is a dataframe we created in previous snippet
Xy = xgb.DMatrix(X, y, enable_categorical=True)
booster = xgb.train({"tree_method": "gpu_hist"}, Xy)
booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
# Must use JSON for serialization, otherwise the information is lost
booster.save_model("categorical-model.json")
@@ -109,30 +135,7 @@ types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatr
For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface so
:class:`dask.Array <dask.Array>` can also be used as categorical data.
********************
Optimal Partitioning
********************
.. versionadded:: 1.6
Optimal partitioning is a technique for partitioning the categorical predictors for each
node split, the proof of optimality for numerical objectives like ``RMSE`` was first
introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
regression and binary classification tasks `[2] <#references>`__, later LightGBM `[3]
<#references>`__ brought it to the context of gradient boosting trees and now is also
adopted in XGBoost as an optional feature for handling categorical splits. More
specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
partition a set of discrete values into groups based on the distances between a measure of
these values, one only needs to look at sorted partitions instead of enumerating all
possible permutations. In the context of decision trees, the discrete values are
categories, and the measure is the output leaf value. Intuitively, we want to group the
categories that output similar leaf values. During split finding, we first sort the
gradient histogram to prepare the contiguous partitions then enumerate the splits
according to these sorted values. One of the related parameters for XGBoost is
``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
used for each feature, see :doc:`/parameter` for details.
:class:`dask.Array <dask.Array>` can also be used for categorical data.
*************
Miscellaneous