From 7f20eaed93d1844f76a68898143a496e5d11b070 Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Wed, 31 May 2023 05:00:02 +0800
Subject: [PATCH] [doc] Troubleshoot nccl shared memory. [skip ci] (#9206)

---
 doc/tutorials/dask.rst | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index 888683975..3562015e2 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -519,6 +519,9 @@ Troubleshooting
   the ``NCCL_SOCKET_IFNAME``. In addition, you can use ``NCCL_DEBUG`` to obtain debug logs.
 
+- If NCCL fails to initialize in a container environment, it might be caused by limited
+  system shared memory. With docker, one can try the flag: `--shm-size=4g`.
+
 - MIG (Multi-Instance GPU) is not yet supported by NCCL. You will receive an error message
   that includes `Multiple processes within a communication group ...` upon initialization.
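
Reviewer note: one way to confirm that limited shared memory is the likely culprit is to check the size of the shared-memory mount from inside the container before launching the workers. The sketch below is not part of the patch. The helper names are hypothetical, it assumes the mount lives at ``/dev/shm`` (the usual location on Linux), and the 4 GiB threshold simply mirrors the ``--shm-size=4g`` example in the patch rather than any documented NCCL requirement.

```python
import shutil


def shm_total_gib(path="/dev/shm"):
    """Return the total size, in GiB, of the filesystem mounted at `path`.

    `/dev/shm` is assumed to be the shared-memory mount; adjust if your
    container image places it elsewhere.
    """
    return shutil.disk_usage(path).total / 2**30


def shm_is_sufficient(total_gib, required_gib=4.0):
    """Compare a measured size against an illustrative threshold.

    The 4 GiB default only echoes the `--shm-size=4g` example and is an
    assumption, not an NCCL-mandated minimum.
    """
    return total_gib >= required_gib


if __name__ == "__main__":
    total = shm_total_gib()
    print(f"/dev/shm total: {total:.2f} GiB")
    if not shm_is_sufficient(total):
        print("Shared memory looks small; consider restarting the container "
              "with a larger --shm-size.")
```

Docker's default shared-memory size is 64 MB, so a container started without the flag would normally report a value well below the threshold used here.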