[doc] [dask] Troubleshooting NCCL errors. (#8943)
@@ -190,9 +190,9 @@ Scikit-Learn wrapper object:
     booster = cls.get_booster()


-**********************
-Scikit-Learn interface
-**********************
+********************************
+Scikit-Learn Estimator Interface
+********************************

 As mentioned previously, there's another interface that mimics the scikit-learn estimators
 with a higher level of abstraction. The interface is easier to use compared to the
@@ -488,12 +488,13 @@ with dask and optuna.
 Troubleshooting
 ***************

-.. versionadded:: 1.6.0
-
-In some environments XGBoost might fail to resolve the IP address of the scheduler, a
-symptom is user receiving ``OSError: [Errno 99] Cannot assign requested address`` error
-during training. A quick workaround is to specify the address explicitly. To do that
-dask config is used:
+- In some environments XGBoost might fail to resolve the IP address of the scheduler. A
+  typical symptom is an ``OSError: [Errno 99] Cannot assign requested address`` error
+  during training. A quick workaround is to specify the address explicitly through the
+  dask config:
+
+.. versionadded:: 1.6.0

 .. code-block:: python

@@ -511,10 +512,20 @@ dask config is used:
     reg = dxgb.DaskXGBRegressor()


-Please note that XGBoost requires a different port than dask. By default, on a unix-like
-system XGBoost uses the port 0 to find available ports, which may fail if a user is
-running in a restricted docker environment. In this case, please open additional ports in
-the container and specify it as in the above snippet.
+- Please note that XGBoost requires a different port than dask. By default, on a Unix-like
+  system XGBoost uses port 0 to find an available port, which may fail if a user is
+  running in a restricted docker environment. In this case, please open additional ports
+  in the container and specify the port as in the above snippet.
+
+- If you encounter an NCCL system error while training with GPU enabled, which usually
+  includes the error message ``NCCL failure: unhandled system error``, you can specify its
+  network configuration using one of the environment variables listed in the `NCCL
+  document <https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html>`__ such as
+  ``NCCL_SOCKET_IFNAME``. In addition, you can use ``NCCL_DEBUG`` to obtain debug
+  logs.
+
+- MIG (Multi-Instance GPU) is not yet supported by NCCL. You will receive an error message
+  that includes ``Multiple processes within a communication group ...`` upon initialization.

 ************
 IPv6 Support
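The NCCL bullet added in this commit suggests setting ``NCCL_SOCKET_IFNAME`` and ``NCCL_DEBUG``. The following is a minimal sketch of one way to do that from Python, not part of the commit itself. The helper names ``nccl_env`` and ``apply_env`` are hypothetical, and the interface name ``eth0`` is only a placeholder for whichever interface your workers should use.

```python
import os


def nccl_env(ifname, debug_level="INFO"):
    """Build the two NCCL environment variables named in the bullet above.

    `ifname` is the network interface NCCL should use for socket traffic.
    `debug_level` is passed through as the value of NCCL_DEBUG.
    """
    return {"NCCL_SOCKET_IFNAME": ifname, "NCCL_DEBUG": debug_level}


def apply_env(env):
    """Export the variables in the current process (hypothetical helper)."""
    os.environ.update(env)


# "eth0" is an assumption; check the interface names on your own machines.
env = nccl_env("eth0")
apply_env(env)
```

With dask, the variables need to be visible in every worker process before NCCL is initialized. One option is to export them in the shell that launches the workers. Another, assuming the workers are already running and training has not started yet, is to call ``client.run(apply_env, env)`` on a ``distributed.Client``, which executes the function on each worker.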