Support optimal partitioning for GPU hist. (#7652)

* Implement `MaxCategory` in quantile.
* Implement partition-based split for GPU evaluation. Currently, it's based on the existing evaluation function.
* Extract an evaluator from GPU Hist to store the needed states.
* Added some CUDA stream/event utilities.
* Update document with references.
* Fixed a bug in approx evaluator where the number of data points is less than the number of categories.
2022-02-15 03:03:12 +08:00
parent 2369d55e9a
commit 0d0abe1845
26 changed files with 1088 additions and 528 deletions
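
For readers unfamiliar with the two categorical split styles named in this commit, the following is a minimal pure-Python sketch, not the CUDA evaluator added here. It assumes the rule stated in the `max_cat_to_onehot` docstring (one-hot when the number of categories is less than the threshold, partitioning otherwise) and illustrates partitioning with the common sort-then-scan approach: order categories by a gradient-based weight, then evaluate each prefix as the left child. The function names `use_onehot` and `best_partition`, the `reg_lambda` default, and the gain formula are illustrative assumptions, not identifiers or exact formulas from the XGBoost source.

```python
from typing import Dict, List, Tuple


def use_onehot(n_categories: int, max_cat_to_onehot: int) -> bool:
    """Hypothetical helper: one-hot split when below the threshold."""
    return n_categories < max_cat_to_onehot


def best_partition(
    grad: Dict[int, float],
    hess: Dict[int, float],
    reg_lambda: float = 1.0,  # assumed L2 regularization term
) -> Tuple[List[int], float]:
    """Sort categories by weight, then scan prefixes for the best split.

    `grad` and `hess` map each category to the sum of first- and
    second-order gradients of the rows holding that category.
    Returns (categories sent to the left child, gain).
    """
    def score(g: float, h: float) -> float:
        return g * g / (h + reg_lambda)

    # Order categories by their regularized gradient/hessian ratio.
    cats = sorted(grad, key=lambda c: grad[c] / (hess[c] + reg_lambda))
    g_total = sum(grad.values())
    h_total = sum(hess.values())
    parent = score(g_total, h_total)

    best_gain, best_left = 0.0, []
    g_left = h_left = 0.0
    # The last prefix would leave the right child empty, so skip it.
    for i, c in enumerate(cats[:-1]):
        g_left += grad[c]
        h_left += hess[c]
        gain = (
            score(g_left, h_left)
            + score(g_total - g_left, h_total - h_left)
            - parent
        )
        if gain > best_gain:
            best_gain, best_left = gain, cats[: i + 1]
    return best_left, best_gain


if __name__ == "__main__":
    grad = {0: -4.0, 1: 3.0, 2: -3.5, 3: 2.5}
    hess = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
    if use_onehot(len(grad), max_cat_to_onehot=4):
        print("one-hot split")
    else:
        print("partition split:", best_partition(grad, hess))
```

Sorting first reduces the search from every subset of categories to a linear scan over prefixes, which is why the partition-based approach stays tractable for features with many categories.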
--- a/python-package/xgboost/core.py
+++ b/python-package/xgboost/core.py
@@ -581,10 +581,10 @@ class DMatrix:  # pylint: disable=too-many-instance-attributes

            .. versionadded:: 1.3.0

-            Experimental support of specializing for categorical features.  Do not set to
-            True unless you are interested in development.  Currently it's only available
-            for `gpu_hist` tree method with 1 vs rest (one hot) categorical split.  Also,
-            JSON serialization format is required.
+            Experimental support of specializing for categorical features.  Do not set
+            to True unless you are interested in development.  Currently it's only
+            available for `gpu_hist` and `approx` tree methods. Also, JSON/UBJSON
+            serialization format is required. (XGBoost 1.6 for approx)

        """
        if group is not None and qid is not None:
--- a/python-package/xgboost/sklearn.py
+++ b/python-package/xgboost/sklearn.py
@@ -207,7 +207,9 @@ __model_doc = f'''
        .. versionadded:: 1.5.0

        Experimental support for categorical data.  Do not set to true unless you are
-        interested in development. Only valid when `gpu_hist` and dataframe are used.
+        interested in development. Only valid when `gpu_hist` or `approx` is used along
+        with dataframe as input.  Also, JSON/UBJSON serialization format is
+        required. (XGBoost 1.6 for approx)

    max_cat_to_onehot : Optional[int]

@@ -216,10 +218,11 @@ __model_doc = f'''
        .. note:: This parameter is experimental

        A threshold for deciding whether XGBoost should use one-hot encoding based split
-        for categorical data.  When number of categories is lesser than the threshold then
-        one-hot encoding is chosen, otherwise the categories will be partitioned into
-        children nodes.  Only relevant for regression and binary classification and
-        `approx` tree method.
+        for categorical data.  When number of categories is lesser than the threshold
+        then one-hot encoding is chosen, otherwise the categories will be partitioned
+        into children nodes.  Only relevant for regression and binary
+        classification. Also, ``approx`` or ``gpu_hist`` tree method is required.  See
+        :doc:`Categorical Data </tutorials/categorical>` for details.

    eval_metric : Optional[Union[str, List[str], Callable]]