[doc] [dask] Troubleshooting NCCL errors. (#8943)

This commit is contained in:
Jiaming Yuan
2023-03-22 22:17:26 +08:00
committed by GitHub
parent a551bed803
commit ea04d4c46c
3 changed files with 44 additions and 20 deletions

@@ -190,9 +190,9 @@ Scikit-Learn wrapper object:
booster = cls.get_booster()
**********************
Scikit-Learn interface
**********************
********************************
Scikit-Learn Estimator Interface
********************************
As mentioned previously, there's another interface that mimics the scikit-learn estimators
with a higher level of abstraction. The interface is easier to use compared to the
@@ -488,12 +488,13 @@ with dask and optuna.
Troubleshooting
***************
.. versionadded:: 1.6.0
In some environments XGBoost might fail to resolve the IP address of the scheduler, a
symptom is user receiving ``OSError: [Errno 99] Cannot assign requested address`` error
during training. A quick workaround is to specify the address explicitly. To do that
dask config is used:
- In some environments XGBoost might fail to resolve the IP address of the scheduler. A
symptom is an ``OSError: [Errno 99] Cannot assign requested address`` error raised
during training. A quick workaround is to specify the address explicitly through the
dask config:
.. versionadded:: 1.6.0
.. code-block:: python
@@ -511,10 +512,20 @@ dask config is used:
reg = dxgb.DaskXGBRegressor()
Please note that XGBoost requires a different port than dask. By default, on a unix-like
system XGBoost uses the port 0 to find available ports, which may fail if a user is
running in a restricted docker environment. In this case, please open additional ports in
the container and specify it as in the above snippet.
- Please note that XGBoost requires a different port than dask. By default, on a Unix-like
system XGBoost uses port 0 to find an available port, which may fail if a user is
running in a restricted Docker environment. In this case, please open additional ports
in the container and specify one of them as in the above snippet.
- If you encounter an NCCL system error while training with GPU enabled, which usually
includes the error message ``NCCL failure: unhandled system error``, you can specify the
NCCL network configuration using one of the environment variables listed in the `NCCL
document <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`__, such as
``NCCL_SOCKET_IFNAME``. In addition, you can use ``NCCL_DEBUG`` to obtain debug
logs.
- MIG (Multi-Instance GPU) is not yet supported by NCCL. You will receive an error message
that includes ``Multiple processes within a communication group ...`` upon initialization.
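The NCCL environment variables mentioned above can be collected in one place before the
workers start. The following is a minimal sketch, not part of XGBoost: ``nccl_env`` and
``apply_env`` are hypothetical helpers, and the interface name ``eth0`` is an assumption
that must be replaced with an interface that exists on your machines. NCCL reads these
variables when it initializes, so they have to be present in the environment of each
worker process, not only in the client process.

```python
import os


def nccl_env(ifname=None, debug="INFO"):
    """Build a dict of NCCL environment variables (hypothetical helper).

    ifname: network interface for NCCL_SOCKET_IFNAME, e.g. "eth0" (assumed name).
    debug:  value for NCCL_DEBUG, e.g. "WARN" or "INFO"; None leaves it unset.
    """
    env = {}
    if ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = ifname
    if debug is not None:
        env["NCCL_DEBUG"] = debug
    return env


def apply_env(env, target=None):
    """Copy the settings into a mapping, os.environ by default."""
    target = os.environ if target is None else target
    target.update(env)
    return target


# "eth0" is only an illustration; list the real interfaces on the host first.
settings = nccl_env(ifname="eth0", debug="INFO")
apply_env(settings)
print(settings)
```

Calling ``apply_env`` in the client only affects the current process. For a real cluster,
one option is to export the same variables in the shell or job script that launches each
dask worker.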
************
IPv6 Support