* [CI] Migrate to rockylinux8 / manylinux_2_28_x86_64
* Scrub all references to CentOS 7
* Fix
* Remove use of yum
* Use gcc-10 in cpu
* Temporarily disable -Werror
* Use GCC 9 for now
* Roll back gRPC
* Scrub all references to manylinux2014_x86_64
* Revise rename_whl.py to handle no-op rename
* Change JDK_VERSION back to 8
* Reviewer's comment
* Use GCC 10
* Use Spark 3.5.1, same as in pom.xml
* Fix JAR install
* [CI] Add nightly CI job to test against dev version of deps
* Update build-containers.sh
* Add build step
* Wait for build artifact
* Try pinning dask
* Address reviewers' comments
* Fix unbound variable error
* Specify dev version exactly
* Pin dask=2024.1.1
This PR adds optional support for loading nccl with `dlopen` as an alternative of compile time linking. This is to address the size bloat issue with the PyPI binary release.
- Add CMake option to load `nccl` at runtime.
- Add an NCCL stub.
After this, `nccl` will be fetched from PyPI when using pip to install XGBoost, either by a user or by `pyproject.toml`. Others who want to link the nccl at compile time can continue to do so without any change.
At the moment, this is Linux only since we only support MNMG on Linux.
* [CI] Update images that are not related to the binary release.
- Update clang-tidy, prefer tools from the Ubuntu repository.
- Update GPU image to 22.04.
- Small cleanup to the tidy script.
- Remove gpu_jvm, which seems to be unused.
* Update to C++17
* Turn off unity build
* Update CMake to 3.18
* Use MSVC 2022 + CUDA 11.8
* Re-create stack for worker images
* Allocate more disk space for Windows
* Tempiorarily disable clang-tidy
* RAPIDS now requires Python 3.10+
* Unpin cuda-python
* Use latest NCCL
* Use Ubuntu 20.04 in RMM image
* Mark failing mgpu test as xfail
* Reduce clutter in log of Python test
* Set up BuildKite test analytics
* Add separate step for building containers
* Enable incremental update of CI stack; custom agent IAM policy
* [CI] Use RAPIDS 22.10
* Store CUDA and RAPIDS versions in one place
* Fix
* Add missing #include
* Update gputreeshap submodule
* Fix
* Remove outdated distributed tests
* Support categorical data for dask functional interface and DQM.
* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.
* Get newer cupy.
* [CI] Upgrade cuDF and RMM to 0.18 nightlies
* Modify RMM plugin to be compatible with RMM 0.18
* Update src/common/device_helpers.cuh
Co-authored-by: Mark Harris <mharris@nvidia.com>
Co-authored-by: Mark Harris <mharris@nvidia.com>
* [CI] Add RMM as an optional dependency
* Replace caching allocator with pool allocator from RMM
* Revert "Replace caching allocator with pool allocator from RMM"
This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.
* Use rmm::mr::get_default_resource()
* Try setting default resource (doesn't work yet)
* Allocate pool_mr in the heap
* Prevent leaking pool_mr handle
* Separate EXPECT_DEATH() in separate test suite suffixed DeathTest
* Turn off death tests for RMM
* Address reviewer's feedback
* Prevent leaking of cuda_mr
* Fix Jenkinsfile syntax
* Remove unnecessary function in Jenkinsfile
* [CI] Install NCCL into RMM container
* Run Python tests
* Try building with RMM, CUDA 10.0
* Do not use RMM for CUDA 10.0 target
* Actually test for test_rmm flag
* Fix TestPythonGPU
* Use CNMeM allocator, since pool allocator doesn't yet support multiGPU
* Use 10.0 container to build RMM-enabled XGBoost
* Revert "Use 10.0 container to build RMM-enabled XGBoost"
This reverts commit 789021fa31112e25b683aef39fff375403060141.
* Fix Jenkinsfile
* [CI] Assign larger /dev/shm to NCCL
* Use 10.2 artifact to run multi-GPU Python tests
* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
* Rename Conda env rmm_test -> gpu_test
* Use env var to opt into CNMeM pool for C++ tests
* Use identical CUDA version for RMM builds and tests
* Use Pytest fixtures to enable RMM pool in Python tests
* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM
* Use per-device MR; use command arg in gtest
* Set CMake prefix path to use Conda env
* Use 0.15 nightly version of RMM
* Remove unnecessary header
* Fix a unit test when cudf is missing
* Add RMM demos
* Remove print()
* Use HostDeviceVector in GPU predictor
* Simplify pytest setup; use LocalCUDACluster fixture
* Address reviewers' commments
Co-authored-by: Hyunsu Cho <chohyu01@cs.wasshington.edu>
* Use hypothesis
* Allow int64 array interface for groups
* Add packages to Windows CI
* Add to travis
* Make sure device index is set correctly
* Fix dask-cudf test
* appveyor
* Add inplace prediction for dask-cudf.
* Remove Dockerfile.release, since it's not used anywhere
* Use Conda exclusively in CUDF and GPU containers
* Improve cupy memory copying.
* Add skip marks to tests.
* Add mgpu-cudf category on the CI to run all distributed tests.
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
This makes GPU Hist robust in distributed environment as some workers might not
be associated with any data in either training or evaluation.
* Disable rabit mock test for now: See #5012 .
* Disable dask-cudf test at prediction for now: See #5003
* Launch dask job for all workers despite they might not have any data.
* Check 0 rows in elementwise evaluation metrics.
Using AUC and AUC-PR still throws an error. See #4663 for a robust fix.
* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.
* Sync number of columns in DMatrix.
As num_feature is required to be the same across all workers in data split
mode.
* Filtering in dask interface now by default syncs all booster that's not
empty, instead of using rank 0.
* Fix Jenkins' GPU tests.
* Install dask-cuda from source in Jenkins' test.
Now all tests are actually running.
* Restore GPU Hist tree synchronization test.
* Check UUID of running devices.
The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field.
* Fix CMake policy and project variables.
Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.
* Fix copying data to CPU
* Fix race condition in cpu predictor.
* Fix duplicated DMatrix construction.
* Don't download extra nccl in CI script.
* All Linux tests are now in Jenkins CI
* Tests are now de-coupled from builds. We can now build XGBoost with one version of CUDA/JDK and test it with another version of CUDA/JDK
* Builds (compilation) are significantly faster because 1) They use C5 instances with faster CPU cores; and 2) build environment setup is cached using Docker containers
* Refactor CMake scripts.
* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.
* Revert machine.conf.
* Move CLI test to gpu.
* Small cleanup.
* Support using XGBoost as submodule.
* Fix windows
* Fix cpp tests on Windows
* Remove duplicated find_package.
* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
- it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
* Synchronizing in a couple of more places.
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
- this improves performance _significantly_ (7x faster for overall training,
20x faster for xgboost proper)
* pulling in latest xgboost
* removing changes to updater_quantile_hist.cc
* changing use_nccl_opg initialization, removing unnecessary if statements
* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* placing struct defintion in guard to avoid duplicate code errors
* addressing linting errors
* removing
* removing additional arguments to AllReduer initialization
* removing distributed flag
* making comm init symmetric
* removing distributed flag
* changing ncclCommInit to support multiple modalities
* fix indenting
* updating ncclCommInitRank block with necessary group calls
* fix indenting
* adding print statement, and updating accessor in vector
* improving print statement to end-line
* generalizing nccl_rank construction using rabit
* assume device_ordinals is the same for every node
* test, assume device_ordinals is identical for all nodes
* test, assume device_ordinals is unique for all nodes
* changing names of offset variable to be more descriptive, editing indenting
* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* adding synchronization, and tests for distributed
* adding to tests
* fixing broken #endif
* fixing initialization of gpu histograms, correcting errors in tests
* adding to contributors list
* adding distributed tests to jenkins
* fixing bad path in distributed test
* debugging
* adding kubernetes for distributed tests
* adding proper import for OrderedDict
* adding urllib3==1.22 to address ordered_dict import error
* added sleep to allow workers to save their models for comparison
* adding name to GPU contributors under docs
* Upgrading to NCCL2
* Part - II of NCCL2 upgradation
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated FindNccl package to have env-var NCCL_ROOT to take precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
Includes:
- Dockerfile changes
- Dockerfile clean up
- Fix execution privileges of files used from Dockerfile.
- New Dockerfile entrypoint to replace with_user script
- Defined a placeholders for CPU testing (script and Dockerfile)
- Jenkinsfile
- Jenkins file milestone defined
- Single source code checkout and propagation via stash/unstash
- Bash needs to be explicitly used in launching make build, since we need
access to environment
- Jenkinsfile build factory for cmake and make style of jobs
- Archivation of artifacts (*.so, *.whl, *.egg) produced by cmake build
Missing:
- CPU testing
- Python3 env build and testing