Support optimal partitioning for GPU hist. (#7652)

* Implement `MaxCategory` in quantile.
* Implement partition-based split for GPU evaluation.  Currently, it's based on the existing evaluation function.
* Extract an evaluator from GPU Hist to store the needed states.
* Added some CUDA stream/event utilities.
* Update document with references.
* Fixed a bug in approx evaluator where the number of data points is less than the number of categories.
This commit is contained in:
Jiaming Yuan
2022-02-15 03:03:12 +08:00
committed by GitHub
parent 2369d55e9a
commit 0d0abe1845
26 changed files with 1088 additions and 528 deletions

View File

@@ -581,10 +581,10 @@ class DMatrix: # pylint: disable=too-many-instance-attributes
.. versionadded:: 1.3.0
Experimental support of specializing for categorical features. Do not set to
True unless you are interested in development. Currently it's only available
for `gpu_hist` tree method with 1 vs rest (one hot) categorical split. Also,
JSON serialization format is required.
Experimental support of specializing for categorical features. Do not set
to True unless you are interested in development. Currently it's only
available for `gpu_hist` and `approx` tree methods. Also, JSON/UBJSON
serialization format is required. (XGBoost 1.6 for approx)
"""
if group is not None and qid is not None:

View File

@@ -207,7 +207,9 @@ __model_doc = f'''
.. versionadded:: 1.5.0
Experimental support for categorical data. Do not set to true unless you are
interested in development. Only valid when `gpu_hist` and dataframe are used.
interested in development. Only valid when `gpu_hist` or `approx` is used along
with dataframe as input. Also, JSON/UBJSON serialization format is
required. (XGBoost 1.6 for approx)
max_cat_to_onehot : Optional[int]
@@ -216,10 +218,11 @@ __model_doc = f'''
.. note:: This parameter is experimental
A threshold for deciding whether XGBoost should use one-hot encoding based split
for categorical data. When number of categories is lesser than the threshold then
one-hot encoding is chosen, otherwise the categories will be partitioned into
children nodes. Only relevant for regression and binary classification and
`approx` tree method.
for categorical data. When number of categories is lesser than the threshold
then one-hot encoding is chosen, otherwise the categories will be partitioned
into children nodes. Only relevant for regression and binary
classification. Also, ``approx`` or ``gpu_hist`` tree method is required. See
:doc:`Categorical Data </tutorials/categorical>` for details.
eval_metric : Optional[Union[str, List[str], Callable]]