Document for device ordinal. (#9398)

- Rewrite GPU demos; the notebook is converted to a script to avoid committing additional PNG plots.
- Add GPU demos into the sphinx gallery.
- Add RMM demos into the sphinx gallery.
- Test for firing threads with different device ordinals.
Jiaming Yuan
2023-07-22 15:26:29 +08:00 (committed by GitHub)
parent 22b0a55a04
commit 275da176ba
32 changed files with 351 additions and 398 deletions

doc/.gitignore

@@ -6,3 +6,5 @@ doxygen
parser.py
*.pyc
web-data
# generated by doxygen
tmp


@@ -19,7 +19,6 @@ import sys
import tarfile
import urllib.request
import warnings
from subprocess import call
from urllib.error import HTTPError
from sh.contrib import git
@@ -148,12 +147,20 @@ extensions = [
sphinx_gallery_conf = {
# path to your example scripts
"examples_dirs": [
"../demo/guide-python",
"../demo/dask",
"../demo/aft_survival",
"../demo/gpu_acceleration",
"../demo/rmm_plugin"
],
# path to where to save gallery generated output
"gallery_dirs": [
"python/examples",
"python/dask-examples",
"python/survival-examples",
"python/gpu-examples",
"python/rmm-examples",
],
"matplotlib_animations": True,
}


@@ -23,20 +23,19 @@ The GPU algorithms currently work with CLI, Python, R, and JVM packages. See :do
:caption: Python example
params = dict()
params["device"] = "cuda"
params["tree_method"] = "hist"
Xy = xgboost.QuantileDMatrix(X, y)
xgboost.train(params, Xy)
.. code-block:: python
:caption: With the Scikit-Learn interface
XGBRegressor(tree_method="hist", device="cuda")
GPU-Accelerated SHAP values
=============================
XGBoost makes use of `GPUTreeShap <https://github.com/rapidsai/gputreeshap>`_ as a backend for computing SHAP values when the GPU is used.
.. code-block:: python
@@ -44,12 +43,12 @@ XGBoost makes use of `GPUTreeShap <https://github.com/rapidsai/gputreeshap>`_ as
shap_values = booster.predict(dtrain, pred_contribs=True)
shap_interaction_values = booster.predict(dtrain, pred_interactions=True)
See :ref:`sphx_glr_python_gpu-examples_tree_shap.py` for a worked example.
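As a rough illustration of the shape of the ``pred_contribs`` output (an assumption-based sketch in plain Python, not using XGBoost itself): each row holds one contribution per feature plus a final bias column, and the row sum recovers the model's raw margin prediction for that row.

```python
# Hypothetical SHAP contributions for two rows: three feature columns
# plus the bias term in the last column, mirroring the layout returned
# by ``predict(..., pred_contribs=True)``. The numbers are made up.
contribs = [
    [0.2, -0.1, 0.4, 0.5],   # row 0: features f0..f2, then bias
    [-0.3, 0.0, 0.1, 0.5],   # row 1
]
# Summing each row recovers the raw margin prediction for that row.
margins = [round(sum(row), 6) for row in contribs]
print(margins)  # [1.0, 0.3]
```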
Multi-node Multi-GPU Training
=============================
XGBoost supports fully distributed GPU training using `Dask <https://dask.org/>`_, ``Spark`` and ``PySpark``. To get started with Dask, see our tutorial :doc:`/tutorials/dask` and the worked examples :doc:`/python/dask-examples/index`; see also the Python documentation :ref:`dask_api` for a complete reference. For usage with ``Spark`` using Scala, see :doc:`/jvm/xgboost4j_spark_gpu_tutorial`. Lastly, for distributed GPU training with ``PySpark``, see :doc:`/tutorials/spark_estimator`.
Memory usage
@@ -67,7 +66,8 @@ If you are getting out-of-memory errors on a big dataset, try the or :py:class:`
CPU-GPU Interoperability
========================
The model can be used on any device regardless of the one used to train it. For instance, a model trained using GPU can still work on a CPU-only machine and vice versa. For more information about model serialization, see :doc:`/tutorials/saving_model`.
Developer notes


@@ -189,7 +189,7 @@ This will check out the latest stable version from the Maven Central.
For the latest release version number, please check `release page <https://github.com/dmlc/xgboost/releases>`_.
To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).
.. note:: Windows not supported in the JVM package
@@ -325,4 +325,4 @@ The SNAPSHOT JARs are hosted by the XGBoost project. Every commit in the ``maste
You can browse the file listing of the Maven repository at https://s3-us-west-2.amazonaws.com/xgboost-maven-repo/list.html.
To enable the GPU algorithm (``device='cuda'``), use artifacts ``xgboost4j-gpu_2.12`` and ``xgboost4j-spark-gpu_2.12`` instead (note the ``gpu`` suffix).


@@ -34,27 +34,6 @@ General Parameters
- Which booster to use. Can be ``gbtree``, ``gblinear`` or ``dart``; ``gbtree`` and ``dart`` use tree based models while ``gblinear`` uses linear functions.
* ``device`` [default= ``cpu``]
.. versionadded:: 2.0.0
@@ -67,6 +46,29 @@ General Parameters
+ ``gpu``: Default GPU device selection from the list of available and supported devices. Only ``cuda`` devices are supported currently.
+ ``gpu:<ordinal>``: Default GPU device selection from the list of available and supported devices. Only ``cuda`` devices are supported currently.
For more information about GPU acceleration, see :doc:`/gpu/index`.
* ``verbosity`` [default=1]
- Verbosity of printing messages. Valid values are 0 (silent), 1 (warning), 2 (info), 3
(debug). Sometimes XGBoost tries to change configurations based on heuristics, which
is displayed as warning message. If there's unexpected behaviour, please try to
increase value of verbosity.
* ``validate_parameters`` [default to ``false``, except for Python, R and CLI interface]
- When set to True, XGBoost will perform validation of input parameters to check whether
a parameter is used or not. A warning is emitted when there is an unknown parameter.
* ``nthread`` [default to maximum number of threads available if not set]
- Number of parallel threads used to run XGBoost. When choosing it, please keep thread
contention and hyperthreading in mind.
* ``disable_default_eval_metric`` [default= ``false``]
- Flag to disable default metric. Set to 1 or ``true`` to disable.
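The accepted forms of the ``device`` parameter described above (``cpu``, ``cuda``, ``cuda:<ordinal>``, ``gpu``, ``gpu:<ordinal>``) can be sketched with a small hypothetical parser; this is an illustration only, not XGBoost's actual implementation, which lives in C++:

```python
def parse_device(device):
    # Split a device string such as "cuda:1" into (type, ordinal).
    # Hypothetical helper for illustration only.
    kind, sep, ordinal = device.partition(":")
    if kind not in {"cpu", "cuda", "gpu"}:
        raise ValueError(f"unknown device type: {kind!r}")
    if not sep:
        # No ordinal given, e.g. "cuda": let the library pick a default.
        return kind, None
    return kind, int(ordinal)

print(parse_device("cuda:1"))  # ('cuda', 1)
```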
Parameters for Tree Booster
===========================
* ``eta`` [default=0.3, alias: ``learning_rate``]
@@ -160,7 +162,7 @@ Parameters for Tree Booster
- ``grow_colmaker``: non-distributed column-based construction of trees.
- ``grow_histmaker``: distributed tree construction with row-based data splitting based on global proposal of histogram counting.
- ``grow_quantile_histmaker``: Grow tree using quantized histogram.
- ``grow_gpu_hist``: Grow tree with GPU. Enabled when ``tree_method`` is set to ``hist`` along with ``device=cuda``.
- ``sync``: synchronizes trees in all distributed nodes.
- ``refresh``: refreshes tree's statistics and/or leaf values based on the current data. Note that no random subsampling of data rows is performed.
- ``prune``: prunes the splits where loss < min_split_loss (or gamma) and nodes that have depth greater than ``max_depth``.


@@ -1,3 +1,5 @@
examples
dask-examples
survival-examples
gpu-examples
rmm-examples


@@ -17,3 +17,5 @@ Contents
examples/index
dask-examples/index
survival-examples/index
gpu-examples/index
rmm-examples/index


@@ -124,7 +124,7 @@ Following table summarizes some differences in supported features between 4 tree
`T` means supported while `F` means unsupported.
+------------------+-----------+---------------------+---------------------+------------------------+
| | Exact | Approx | Hist | Hist (GPU) |
+==================+===========+=====================+=====================+========================+
| grow_policy | Depthwise | depthwise/lossguide | depthwise/lossguide | depthwise/lossguide |
+------------------+-----------+---------------------+---------------------+------------------------+
@@ -141,5 +141,5 @@ Following table summarizes some differences in supported features between 4 tree
Features/parameters that are not mentioned here are universally supported for all 4 tree
methods (for instance, column sampling and constraints). The `P` in external memory means
partially supported. Please note that both categorical data and external memory are
experimental.


@@ -35,8 +35,8 @@ parameter ``enable_categorical``:
.. code:: python
# Supported tree methods are `approx` and `hist`.
clf = xgb.XGBClassifier(tree_method="hist", enable_categorical=True, device="cuda")
# X is the dataframe we created in previous snippet
clf.fit(X, y)
# Must use JSON/UBJSON for serialization, otherwise the information is lost.
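The categorical support relies on the dataframe's category encoding: each distinct category is mapped to an integer code. A rough pure-Python illustration of such a coding (simplified and hypothetical, not how pandas or XGBoost actually implement it):

```python
def encode_categories(values):
    # Map each distinct category to an integer code in order of first
    # appearance (simplified; real category encodings may order differently).
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)
        encoded.append(codes[v])
    return encoded, codes

encoded, mapping = encode_categories(["a", "b", "a", "c"])
print(encoded)  # [0, 1, 0, 2]
```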


@@ -81,7 +81,7 @@ constructor.
it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
Xy = xgboost.DMatrix(it)
# The ``approx`` tree method also works, but with lower performance. The GPU
# implementation is different from the CPU one, as noted in the following sections.
booster = xgboost.train({"tree_method": "hist"}, Xy)
@@ -118,15 +118,15 @@ to reduce the overhead of file reading.
GPU Version (GPU Hist tree method)
**********************************
External memory is supported by GPU algorithms (i.e. when ``device`` is set to
``cuda``). However, the algorithm used for GPU is different from the one used for
CPU. When training on a CPU, the tree method iterates through all batches from external
memory for each step of the tree construction algorithm. On the other hand, the GPU
algorithm uses a hybrid approach. It iterates through the data at the beginning of
each iteration and concatenates all batches into one in GPU memory for performance
reasons. To reduce overall memory usage, users can utilize subsampling. The GPU hist tree
method supports `gradient-based sampling`, enabling users to set a low sampling rate
without compromising accuracy.
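The hybrid approach described above can be sketched in plain Python: batches are pulled from the external-memory iterator once and concatenated into a single in-memory block (a simplified illustration, not the actual CUDA implementation):

```python
def concatenate_batches(batch_iter):
    # Pull every batch from the external-memory iterator and concatenate
    # them into one flat list, mimicking the GPU algorithm's single
    # concatenated copy kept in device memory.
    data = []
    for batch in batch_iter:
        data.extend(batch)
    return data

batches = iter([[1, 2], [3, 4], [5]])
print(concatenate_batches(batches))  # [1, 2, 3, 4, 5]
```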
.. code-block:: python


@@ -83,13 +83,14 @@ Some other examples:
- ``(0,-1)``: No constraint on the first predictor and a decreasing constraint on the second.
.. note::
**Note for the 'hist' tree construction algorithm**. If ``tree_method`` is set to
either ``hist`` or ``approx``, enabling monotonic constraints may produce unnecessarily
shallow trees. This is because the ``hist`` method reduces the number of candidate
splits to be considered at each split. Monotonic constraints may wipe out all available
split candidates, in which case no split is made. To reduce the effect, you may want to
increase the ``max_bin`` parameter to consider more split candidates.
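The effect described in the note can be sketched in plain Python: a split candidate survives only if its resulting leaf values respect the requested direction, so with few candidates (a small ``max_bin``) a constraint can eliminate all of them and no split is made. This is a hypothetical simplified filter, not XGBoost's actual split-finding code:

```python
def feasible_splits(candidates, direction):
    # Keep a (left_leaf, right_leaf) candidate only when it respects the
    # constraint: +1 requires right >= left, -1 requires right <= left.
    if direction == 1:
        return [c for c in candidates if c[1] >= c[0]]
    return [c for c in candidates if c[1] <= c[0]]

# An increasing constraint wipes out every candidate here, so no split
# is made and the tree stays shallow.
candidates = [(0.5, 0.2), (0.8, 0.1)]
print(feasible_splits(candidates, 1))  # []
```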
*******************


@@ -38,10 +38,6 @@ There are in general two ways that you can control overfitting in XGBoost:
- This includes ``subsample`` and ``colsample_bytree``.
- You can also reduce stepsize ``eta``. Remember to increase ``num_round`` when you do so.
***************************
Faster training performance
***************************
There's a parameter called ``tree_method``; set it to ``hist`` or ``gpu_hist`` for faster computation.
*************************
Handle Imbalanced Dataset


@@ -50,13 +50,14 @@ Here is a sample parameter dictionary for training a random forest on a GPU usin
xgboost::
params = {
"colsample_bynode": 0.8,
"learning_rate": 1,
"max_depth": 5,
"num_parallel_tree": 100,
"objective": "binary:logistic",
"subsample": 0.8,
"tree_method": "hist",
"device": "cuda",
}
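As a quick sanity check of the configuration above: a random forest keeps ``learning_rate`` at 1 (no shrinkage) and grows all of its trees in parallel via ``num_parallel_tree``. A hypothetical pure-Python validation helper, not part of XGBoost:

```python
params = {
    "colsample_bynode": 0.8,
    "learning_rate": 1,
    "max_depth": 5,
    "num_parallel_tree": 100,
    "objective": "binary:logistic",
    "subsample": 0.8,
    "tree_method": "hist",
    "device": "cuda",
}

def is_random_forest_config(p):
    # A random forest uses no shrinkage and more than one parallel tree.
    return p.get("learning_rate") == 1 and p.get("num_parallel_tree", 1) > 1

print(is_random_forest_config(params))  # True
```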
A random forest model can then be trained as follows::


@@ -174,7 +174,7 @@ Will print out something similar to (not actual output as it's too long for demo
"gbtree_train_param": {
"num_parallel_tree": "1",
"process_type": "default",
"tree_method": "hist",
"updater": "grow_gpu_hist",
"updater_seq": "grow_gpu_hist"
},