Use dlopen to load NCCL. (#9796)

This PR adds optional support for loading NCCL with `dlopen` as an alternative to compile-time linking. This addresses the size bloat issue with the PyPI binary release.
- Add CMake option to load `nccl` at runtime.
- Add an NCCL stub.

After this change, `nccl` is fetched from PyPI when XGBoost is installed with pip, either directly by a user or via `pyproject.toml`. Users who want to link NCCL at compile time can continue to do so without any change.

At the moment, this is Linux-only, since we support MNMG (multi-node multi-GPU) only on Linux.
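As a rough sketch, building with the new option might look like the following. `USE_DLOPEN_NCCL` is the flag referenced in the documentation below; `USE_CUDA` and `USE_NCCL` are assumed to be the usual companion CMake options, and the exact invocation may differ:

```shell
# Configure XGBoost with runtime (dlopen) loading of NCCL instead of
# compile-time linking. Linux only.
cmake -B build -S . -DUSE_CUDA=ON -DUSE_NCCL=ON -DUSE_DLOPEN_NCCL=ON
cmake --build build
```

With this configuration the resulting binary carries no bundled NCCL; the library is resolved from the environment at initialization time.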
This commit is contained in:
Jiaming Yuan
2023-11-22 19:27:31 +08:00
committed by GitHub
parent fedd9674c8
commit 0715ab3c10
45 changed files with 658 additions and 268 deletions


@@ -536,6 +536,37 @@ Troubleshooting
- MIG (Multi-Instance GPU) is not yet supported by NCCL. You will receive an error
  message that includes ``Multiple processes within a communication group ...`` upon
  initialization.

.. _nccl-load:

- Starting from version 2.1.0, to reduce the size of the binary wheel, the XGBoost
  package installed using pip loads NCCL from the environment instead of bundling it
  directly. This means that if you encounter an error message like
  "Failed to load nccl ...", NCCL is not installed or not properly configured in your
  environment. To resolve this issue, you can install NCCL using pip:

  .. code-block:: sh

    pip install nvidia-nccl-cu12  # (or with any compatible CUDA version)

  The default conda installation of XGBoost should not encounter this error. If you are
  using a customized XGBoost, please make sure one of the following is true:

  + XGBoost is NOT compiled with the ``USE_DLOPEN_NCCL`` flag.
  + The ``dmlc_nccl_path`` parameter is set to the full NCCL path when initializing the
    collective.

  Here are some additional tips for troubleshooting NCCL dependency issues:

  + Check the NCCL installation path and verify that it's installed correctly. When
    XGBoost is installed using pip, we try to find NCCL by running
    ``from nvidia.nccl import lib`` in Python.
  + Ensure that you have a compatible CUDA version installed. NCCL requires a
    compatible CUDA version to function properly.
  + If you see this error but are not using distributed training with XGBoost, please
    open an issue on GitHub.
  + If you continue to encounter NCCL dependency issues, please open an issue on GitHub.
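The pip-based lookup described above can be sketched as follows. `find_nccl_path` is a hypothetical helper for illustration, not XGBoost's actual implementation; it only assumes the `from nvidia.nccl import lib` import mentioned in the tips:

```python
import os


def find_nccl_path():
    """Return the path to libnccl.so.2 from the pip-installed
    nvidia-nccl-cu* package, or None if it cannot be found."""
    try:
        # Provided by the PyPI package, e.g. nvidia-nccl-cu12.
        from nvidia.nccl import lib
    except ImportError:
        return None
    # `lib` may be a namespace package, so use __path__ rather than __file__.
    libdir = next(iter(lib.__path__))
    path = os.path.join(libdir, "libnccl.so.2")
    return path if os.path.exists(path) else None
```

If this helper returns ``None`` in your environment, installing ``nvidia-nccl-cu12`` with pip or pointing ``dmlc_nccl_path`` at an existing NCCL installation should resolve the load failure.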
************
IPv6 Support
************