xgboost/tests/distributed-gpu/runtests-gpu.sh
Matthew Jones 92b7577c62 [REVIEW] Enable Multi-Node Multi-GPU functionality (#4095)
* Initial commit to support multi-node multi-gpu xgboost using dask

* Fixed NCCL initialization by not ignoring the opg parameter.

- it now crashes on NCCL initialization, but at least we're attempting it properly

* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
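
A minimal sketch of that sync, assuming a plain (grad, hess) pair and the standard rabit sum reduction; the function name and call site are illustrative, not the patch's:

    #include <rabit/rabit.h>

    // Sum the root node's gradient pair across every rabit worker so each
    // worker starts tree construction from the same global statistics.
    void AllreduceRootGradient(double *grad, double *hess) {
      double sums[2] = {*grad, *hess};
      rabit::Allreduce<rabit::op::Sum>(sums, 2);  // element-wise sum across workers
      *grad = sums[0];
      *hess = sums[1];
    }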

* Synchronizing in a couple of more places.

- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places

* Added another missing max-allreduce operation inside BuildHistLeftRight

* Removed unnecessary collective operations.

* Simplified rabit::Allreduce() sync of gradient sums.

* Removed unnecessary rabit syncs around ncclAllReduce.

- this improves performance _significantly_ (7x faster for overall training,
  20x faster for xgboost proper)
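
After the cleanup the pattern is roughly a device-side reduction with no host barriers around it; a sketch with placeholder buffer and communicator names:

    #include <cuda_runtime.h>
    #include <nccl.h>

    // Sum a per-GPU histogram across all GPUs in the communicator, in place.
    // No rabit barrier is needed before or after; only the CUDA stream has to
    // be synchronized before the result is consumed.
    void AllReduceHistogram(float *d_hist, size_t n,
                            ncclComm_t comm, cudaStream_t stream) {
      ncclAllReduce(d_hist, d_hist, n, ncclFloat, ncclSum, comm, stream);
      cudaStreamSynchronize(stream);
    }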

* pulling in latest xgboost

* removing changes to updater_quantile_hist.cc

* changing use_nccl_opg initialization, removing unnecessary if statements

* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId

* placing struct definition in guard to avoid duplicate code errors
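
For context, NCCL's ncclUniqueId is just an opaque 128-byte struct; a guarded stand-in so that non-NCCL builds still compile could look roughly like this (the actual guard macro used in the patch may differ):

    #ifndef XGBOOST_USE_NCCL
    #define NCCL_UNIQUE_ID_BYTES 128
    typedef struct {
      char internal[NCCL_UNIQUE_ID_BYTES];  // opaque payload filled by ncclGetUniqueId
    } ncclUniqueId;
    #endif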

* addressing linting errors

* removing

* removing additional arguments to AllReducer initialization

* removing distributed flag

* making comm init symmetric

* removing distributed flag

* changing ncclCommInit to support multiple modalities

* fix indenting

* updating ncclCommInitRank block with necessary group calls
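
Creating the per-device communicators inside a group is the usual pattern when one process drives several GPUs; a sketch where nccl_world, nccl_rank_offset, unique_id, and device_ordinals are assumed names, not necessarily the patch's:

    // One communicator per local GPU; the group start/end lets the blocking
    // ncclCommInitRank calls for different devices proceed without deadlocking.
    std::vector<ncclComm_t> comms(device_ordinals.size());
    ncclGroupStart();
    for (size_t i = 0; i < device_ordinals.size(); ++i) {
      cudaSetDevice(device_ordinals[i]);
      ncclCommInitRank(&comms[i], nccl_world, unique_id, nccl_rank_offset + i);
    }
    ncclGroupEnd();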

* fix indenting

* adding print statement, and updating accessor in vector

* improving print statement to end-line

* generalizing nccl_rank construction using rabit

* assume device_ordinals is the same for every node

* test, assume device_ordinals is identical for all nodes

* test, assume device_ordinals is unique for all nodes
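
Under the assumption above (every node exposes the same set of device ordinals), the global NCCL rank can be derived purely from the rabit rank; a sketch with illustrative names:

    int n_local_gpus = static_cast<int>(device_ordinals.size());
    // Total GPUs across the job, and the first global rank owned by this worker.
    int nccl_world = rabit::GetWorldSize() * n_local_gpus;
    int nccl_rank_offset = rabit::GetRank() * n_local_gpus;
    // Local device i then participates as global rank nccl_rank_offset + i.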

* changing names of offset variable to be more descriptive, editing indenting

* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
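
A wrapped GetUniqueId() typically has rank 0 create the id and every other worker receive the same bytes over rabit; a minimal sketch along those lines, with error handling omitted:

    // Rank 0 generates the opaque id; broadcasting it over rabit guarantees all
    // workers' ncclCommInitRank calls join the same clique.
    ncclUniqueId GetUniqueId() {
      static const int kRootRank = 0;
      ncclUniqueId id;
      if (rabit::GetRank() == kRootRank) {
        ncclGetUniqueId(&id);
      }
      rabit::Broadcast(&id, sizeof(ncclUniqueId), kRootRank);
      return id;
    }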

* adding synchronization, and tests for distributed

* adding  to tests

* fixing broken #endif

* fixing initialization of gpu histograms, correcting errors in tests

* adding to contributors list

* adding distributed tests to jenkins

* fixing bad path in distributed test

* debugging

* adding kubernetes for distributed tests

* adding proper import for OrderedDict

* adding urllib3==1.22 to address ordered_dict import error

* added sleep to allow workers to save their models for comparison

* adding name to GPU contributors under docs
2019-03-02 10:03:22 +13:00

#!/bin/bash
# Remove model files left over from previous runs so each test starts clean.
rm -f ./*.model*
echo -e "\n ====== 1. Basic distributed-gpu test with Python: 4 workers; 1 GPU per worker ====== \n"
PYTHONPATH=../../python-package/ python ../../dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=4 \
python test_gpu_basic_1x4.py
echo -e "\n ====== 2. Basic distributed-gpu test with Python: 2 workers; 2 GPUs per worker ====== \n"
PYTHONPATH=../../python-package/ python ../../dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=2 \
python test_gpu_basic_2x2.py
echo -e "\n ====== 3. Basic distributed-gpu test with Python: 2 workers; Rank 0: 1 GPU, Rank 1: 3 GPUs ====== \n"
PYTHONPATH=../../python-package/ python ../../dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=2 \
python test_gpu_basic_asym.py
echo -e "\n ====== 4. Basic distributed-gpu test with Python: 1 worker; 4 GPUs per worker ====== \n"
PYTHONPATH=../../python-package/ python ../../dmlc-core/tracker/dmlc-submit --cluster=local --num-workers=1 \
python test_gpu_basic_4x1.py