Support categorical data for hist. (#7695)
* Extract partitioner from hist.
* Implement categorical data support by passing the gradient index directly into the partitioner.
* Organize/update document.
* Remove code for negative hessian.
@@ -244,9 +244,6 @@ Additional parameters for ``hist``, ``gpu_hist`` and ``approx`` tree method

- Use single precision to build histograms instead of double precision.

Additional parameters for ``approx`` and ``gpu_hist`` tree method
=================================================================

* ``max_cat_to_onehot``

  .. versionadded:: 1.6

@@ -256,8 +253,8 @@ Additional parameters for ``approx`` and ``gpu_hist`` tree method

- A threshold for deciding whether XGBoost should use one-hot encoding based split for
  categorical data. When the number of categories is less than the threshold, one-hot
  encoding is chosen; otherwise the categories are partitioned into children nodes.
  Only relevant for regression and binary classification. Also, `approx` or `gpu_hist`
  tree method is required.
  Only relevant for regression and binary classification. Also, the ``exact`` tree method is
  not supported.

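The threshold rule described above can be sketched in a few lines of pure Python. This is an illustrative stand-in, not XGBoost's implementation; the function name is hypothetical:

```python
def split_strategy(n_categories: int, max_cat_to_onehot: int) -> str:
    """Mimic the documented rule: one-hot encoding based splits are used when
    the number of categories is below the threshold; otherwise the categories
    are partitioned into the children nodes."""
    return "onehot" if n_categories < max_cat_to_onehot else "partition"
```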
Additional parameters for Dart Booster (``booster=dart``)
=========================================================

@@ -4,16 +4,16 @@ Categorical Data

.. note::

  As of XGBoost 1.6, the feature is highly experimental and has limited features
  As of XGBoost 1.6, the feature is experimental and has limited features

Starting from version 1.5, XGBoost has experimental support for categorical data available
for public testing. At the moment, the support is implemented as one-hot encoding based
categorical tree splits. For numerical data, the split condition is defined as
:math:`value < threshold`, while for categorical data the split is defined as :math:`value
== category` and ``category`` is a discrete value. More advanced categorical split
strategy is planned for future releases and this tutorial details how to inform XGBoost
about the data type. Also, the current support for training is limited to ``gpu_hist``
tree method.
for public testing. For numerical data, the split condition is defined as :math:`value <
threshold`, while for categorical data the split is defined depending on whether
partitioning or onehot encoding is used. For partition-based splits, the splits are
specified as :math:`value \in categories`, where ``categories`` is the set of categories
in one feature. If onehot encoding is used instead, then the split is defined as
:math:`value == category`. More advanced categorical split strategy is planned for future
releases and this tutorial details how to inform XGBoost about the data type.

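The two kinds of categorical split conditions described above can be sketched as simple predicates (hypothetical helper names, shown only to make the semantics concrete):

```python
def numerical_goes_left(value: float, threshold: float) -> bool:
    # Numerical split: value < threshold
    return value < threshold

def onehot_goes_left(value: int, category: int) -> bool:
    # One-hot encoding based categorical split: value == category
    return value == category

def partition_goes_left(value: int, categories: set) -> bool:
    # Partition-based categorical split: value is in the chosen set of categories
    return value in categories
```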
************************************
Training with scikit-learn Interface
************************************

@@ -35,13 +35,13 @@ parameter ``enable_categorical``:

.. code:: python

  # Only gpu_hist is supported for categorical data as mentioned previously
  # Supported tree methods are `gpu_hist`, `approx`, and `hist`.
  clf = xgb.XGBClassifier(
      tree_method="gpu_hist", enable_categorical=True, use_label_encoder=False
  )
  # X is the dataframe we created in previous snippet
  clf.fit(X, y)
  # Must use JSON for serialization, otherwise the information is lost
  # Must use JSON/UBJSON for serialization, otherwise the information is lost.
  clf.save_model("categorical-model.json")

@@ -60,11 +60,37 @@ can plot the model and calculate the global feature importance:

The ``scikit-learn`` interface from dask is similar to the single node version. The basic
idea is to create a dataframe with category feature type, and tell XGBoost to use ``gpu_hist``
with parameter ``enable_categorical``. See :ref:`sphx_glr_python_examples_categorical.py`
for a worked example of using categorical data with the ``scikit-learn`` interface. A
comparison between using one-hot encoded data and XGBoost's categorical data support can
be found in :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.
idea is to create a dataframe with category feature type, and tell XGBoost to use it by setting
the ``enable_categorical`` parameter. See :ref:`sphx_glr_python_examples_categorical.py`
for a worked example of using categorical data with the ``scikit-learn`` interface with
one-hot encoding. A comparison between using one-hot encoded data and XGBoost's
categorical data support can be found in :ref:`sphx_glr_python_examples_cat_in_the_dat.py`.

********************

Optimal Partitioning
********************

.. versionadded:: 1.6

Optimal partitioning is a technique for partitioning the categorical predictors for each
node split; the proof of optimality for numerical objectives like ``RMSE`` was first
introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
regression and binary classification tasks `[2] <#references>`__; later LightGBM `[3]
<#references>`__ brought it to the context of gradient boosting trees, and it is now also
adopted in XGBoost as an optional feature for handling categorical splits. More
specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
partition a set of discrete values into groups based on the distances between a measure of
these values, one only needs to look at sorted partitions instead of enumerating all
possible permutations. In the context of decision trees, the discrete values are
categories, and the measure is the output leaf value. Intuitively, we want to group the
categories that output similar leaf values. During split finding, we first sort the
gradient histogram to prepare the contiguous partitions, then enumerate the splits
according to these sorted values. One of the related parameters for XGBoost is
``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
used for each feature; see :doc:`/parameter` for details. When the objective is not
regression or binary classification, XGBoost will fall back to using onehot encoding
instead.

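The sorted-partition idea can be illustrated with a small pure-Python sketch. This is not XGBoost's actual code: it assumes per-category gradient/hessian sums are already accumulated, sorts categories by mean gradient as the stand-in for the leaf-value measure, and scans only prefix splits of that ordering instead of all :math:`2^k` subsets. The gain uses the standard :math:`G^2/H` form with regularization omitted:

```python
def best_sorted_partition(grad_sum, hess_sum):
    """Find the best binary partition of categories by scanning only sorted
    prefixes (Fisher's result), instead of enumerating every subset.

    grad_sum, hess_sum: per-category sums of gradients and hessians.
    Returns (set of categories sent left, gain of that split).
    """
    k = len(grad_sum)
    # Sort categories by mean gradient, a proxy for the leaf value -G/H.
    order = sorted(range(k), key=lambda c: grad_sum[c] / hess_sum[c])
    G, H = sum(grad_sum), sum(hess_sum)
    best_gain, best_left = float("-inf"), None
    gl = hl = 0.0
    for i in range(k - 1):  # the first i+1 sorted categories go left
        c = order[i]
        gl += grad_sum[c]
        hl += hess_sum[c]
        gr, hr = G - gl, H - hl
        # Split gain: left score + right score - unsplit score (lambda omitted).
        gain = gl * gl / hl + gr * gr / hr - G * G / H
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain
```

For example, with category gradient sums ``[10, -10, 9, -9]`` and unit hessians, the scan groups the two negative-gradient categories together, which is what the intuition of "group categories that output similar leaf values" predicts.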
**********************
@@ -82,7 +108,7 @@ categorical data, we need to pass the similar parameter to :class:`DMatrix

# X is a dataframe we created in previous snippet
Xy = xgb.DMatrix(X, y, enable_categorical=True)
booster = xgb.train({"tree_method": "gpu_hist"}, Xy)
booster = xgb.train({"tree_method": "hist", "max_cat_to_onehot": 5}, Xy)
# Must use JSON for serialization, otherwise the information is lost
booster.save_model("categorical-model.json")

@@ -109,30 +135,7 @@ types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatr

For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface so
:class:`dask.Array <dask.Array>` can also be used as categorical data.

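As a minimal sketch of how such a per-column type list might be consumed (a hypothetical helper, restricted to the three type strings named above, not part of the XGBoost API):

```python
def categorical_column_indices(feature_types):
    """Given per-column type strings as described above ("q"/"float" for
    numerical features, "c" for categorical), return the indices of the
    categorical columns."""
    for t in feature_types:
        if t not in ("q", "float", "c"):
            raise ValueError(f"unknown feature type: {t!r}")
    return [i for i, t in enumerate(feature_types) if t == "c"]
```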
********************

Optimal Partitioning
********************

.. versionadded:: 1.6

Optimal partitioning is a technique for partitioning the categorical predictors for each
node split; the proof of optimality for numerical objectives like ``RMSE`` was first
introduced by `[1] <#references>`__. The algorithm is used in decision trees for handling
regression and binary classification tasks `[2] <#references>`__; later LightGBM `[3]
<#references>`__ brought it to the context of gradient boosting trees, and it is now also
adopted in XGBoost as an optional feature for handling categorical splits. More
specifically, the proof by Fisher `[1] <#references>`__ states that, when trying to
partition a set of discrete values into groups based on the distances between a measure of
these values, one only needs to look at sorted partitions instead of enumerating all
possible permutations. In the context of decision trees, the discrete values are
categories, and the measure is the output leaf value. Intuitively, we want to group the
categories that output similar leaf values. During split finding, we first sort the
gradient histogram to prepare the contiguous partitions, then enumerate the splits
according to these sorted values. One of the related parameters for XGBoost is
``max_cat_to_onehot``, which controls whether one-hot encoding or partitioning should be
used for each feature; see :doc:`/parameter` for details.
:class:`dask.Array <dask.Array>` can also be used for categorical data.

*************
Miscellaneous