Enable distributed GPU training over Rabit (#7930)

Rong Ou
2022-05-30 13:09:45 -07:00
committed by GitHub
parent 6275cdc486
commit 80339c3427
9 changed files with 458 additions and 129 deletions


@@ -136,9 +136,9 @@ From the command line on Linux starting from the XGBoost directory:
To speed up compilation, the compute version specific to your GPU could be passed to cmake as, e.g., ``-DGPU_COMPUTE_VER=50``. A quick explanation and numbers for some architectures can be found `in this page <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`_.
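For concreteness, a minimal sketch of such an invocation follows; the ``build`` directory layout and the ``50`` value are illustrative, not prescriptive (``USE_CUDA`` and ``GPU_COMPUTE_VER`` are the CMake options named in the surrounding docs):

.. code-block:: bash

   # Sketch: configure a GPU build targeting compute capability 5.0.
   # Adjust GPU_COMPUTE_VER to match your card.
   mkdir build && cd build
   cmake .. -DUSE_CUDA=ON -DGPU_COMPUTE_VER=50
   make -j$(nproc)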
-.. note:: Enabling distributed GPU training
+.. note:: Faster distributed GPU training with NCCL
-  By default, distributed GPU training is disabled and only a single GPU will be used. To enable distributed GPU training, set the option ``USE_NCCL=ON``. Distributed GPU training depends on NCCL2, available at `this link <https://developer.nvidia.com/nccl>`_. Since NCCL2 is only available for Linux machines, **distributed GPU training is available only for Linux**.
+  By default, distributed GPU training is enabled and uses Rabit for communication. For faster training, set the option ``USE_NCCL=ON``. Faster distributed GPU training depends on NCCL2, available at `this link <https://developer.nvidia.com/nccl>`_. Since NCCL2 is only available for Linux machines, **faster distributed GPU training is available only for Linux**.
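As a hedged sketch of the NCCL-enabled build this note describes (directory layout illustrative; ``USE_NCCL`` is the option named above, and NCCL2 is assumed to already be installed on the system):

.. code-block:: bash

   # Sketch: enable NCCL for faster multi-GPU communication (Linux only).
   # Without -DUSE_NCCL=ON, distributed GPU training falls back to Rabit.
   mkdir build && cd build
   cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
   make -j$(nproc)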
.. code-block:: bash