35 Commits

Philip Hyunsu Cho
ca0547bb65
[CI] Use RAPIDS 22.10 (#8298)
* [CI] Use RAPIDS 22.10

* Store CUDA and RAPIDS versions in one place

* Fix

* Add missing #include

* Update gputreeshap submodule

* Fix

* Remove outdated distributed tests
2022-10-03 23:18:07 -08:00
Philip Hyunsu Cho
e888eb2fa9
[CI] Migrate CI pipelines from Jenkins to BuildKite (#8142)
* [CI] Migrate CI pipelines from Jenkins to BuildKite

* Require manual approval

* Less verbose output when pulling Docker

* Remove us-east-2 from metadata.py

* Add documentation

* Add missing underscore

* Add missing punctuation

* More specific instruction

* Better paragraph structure
2022-09-07 16:29:25 -08:00
WeichenXu
176fec8789
PySpark XGBoost integration (#8020)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2022-07-13 13:11:18 +08:00
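
The entry point this PR added is an sklearn-style estimator under `xgboost.spark`. A minimal sketch, assuming a running Spark session and a vector-assembled "features" column (input path and column names are illustrative):

```python
# Hedged sketch of the xgboost.spark estimator API added in #8020.
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.getOrCreate()
train_df = spark.read.parquet("train.parquet")  # hypothetical input path

clf = SparkXGBClassifier(
    features_col="features",  # vector-assembled feature column
    label_col="label",
    num_workers=2,            # number of Spark tasks used for training
)
model = clf.fit(train_df)
pred_df = model.transform(train_df)  # appends prediction columns
```
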
Philip Hyunsu Cho
2070afea02
[CI] Rotate package repository keys (#7943) 2022-05-26 17:06:46 -07:00
Jiaming Yuan
50d854e02e
[CI] Test with latest RAPIDS. (#7816) 2022-04-30 11:55:10 -07:00
Jiaming Yuan
ca17f8a5fc
Dispatch thrust versions and upgrade rmm. (#7254)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-09-25 03:43:23 +08:00
Jiaming Yuan
86715e4cd4
Support categorical data for dask functional interface and DQM. (#7043)
* Support categorical data for dask functional interface and DQM.

* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.

* Get newer cupy.
2021-06-18 13:06:52 +08:00
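
As a sketch of what this enables (not code from the PR): a dask_cudf column with a "category" dtype can now flow through the dask DeviceQuantileDMatrix (DQM) into the functional interface. The file and column names below are illustrative, and `enable_categorical` is assumed to be accepted by the DQM constructor as it is by `DMatrix`:

```python
# Hedged sketch: categorical dask_cudf data through DaskDeviceQuantileDMatrix.
import xgboost as xgb
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

with LocalCUDACluster(n_workers=1) as cluster, Client(cluster) as client:
    ddf = dask_cudf.read_csv("train.csv")              # hypothetical input
    ddf["cat_feat"] = ddf["cat_feat"].astype("category")
    X, y = ddf.drop(columns=["label"]), ddf["label"]

    # enable_categorical asks XGBoost to treat "category" dtypes natively.
    dtrain = xgb.dask.DaskDeviceQuantileDMatrix(
        client, X, y, enable_categorical=True)
    out = xgb.dask.train(client, {"tree_method": "gpu_hist"},
                         dtrain, num_boost_round=10)
```
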
Jiaming Yuan
dcd84b3979
[CI] Configure RAPIDS, dask, modin (#7033)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-06-18 10:27:51 +08:00
Philip Hyunsu Cho
05db6a6c29
[CI] Upgrade cuDF and RMM to 21.06 nightly (#7012)
* [CI] Upgrade cuDF and RMM to 21.06 nightly

* Trim outdated test cases

* Pin Dask version to 2021.05.0 for now
2021-06-02 11:59:30 -07:00
Philip Hyunsu Cho
bf6cfe3b99
[Breaking] Upgrade cuDF and RMM to 0.18 nightlies; require RMM 0.18+ for RMM plugin (#6510)
* [CI] Upgrade cuDF and RMM to 0.18 nightlies

* Modify RMM plugin to be compatible with RMM 0.18

* Update src/common/device_helpers.cuh

Co-authored-by: Mark Harris <mharris@nvidia.com>
2020-12-16 10:07:52 -08:00
Philip Hyunsu Cho
4dbbeb635d
[CI] Upgrade cuDF and RMM to 0.17 nightlies (#6434) 2020-11-24 13:21:41 -08:00
Philip Hyunsu Cho
f121f2738f
[CI] Fix Docker build for CUDA 11 (#6202) 2020-10-05 17:54:14 -07:00
Philip Hyunsu Cho
678ea40b24
[CI] Upgrade cuDF and RMM to 0.16 nightlies; upgrade to Ubuntu 18.04 (#6157)
* [CI] Upgrade cuDF and RMM to 0.16 nightlies

* Use Ubuntu 18.04 in RMM test, since RMM needs GCC 7+
2020-09-23 19:48:44 -07:00
Philip Hyunsu Cho
9adb812a0a
RMM integration plugin (#5873)
* [CI] Add RMM as an optional dependency

* Replace caching allocator with pool allocator from RMM

* Revert "Replace caching allocator with pool allocator from RMM"

This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.

* Use rmm::mr::get_default_resource()

* Try setting default resource (doesn't work yet)

* Allocate pool_mr in the heap

* Prevent leaking pool_mr handle

* Move EXPECT_DEATH() checks into a separate test suite suffixed DeathTest

* Turn off death tests for RMM

* Address reviewer's feedback

* Prevent leaking of cuda_mr

* Fix Jenkinsfile syntax

* Remove unnecessary function in Jenkinsfile

* [CI] Install NCCL into RMM container

* Run Python tests

* Try building with RMM, CUDA 10.0

* Do not use RMM for CUDA 10.0 target

* Actually test for test_rmm flag

* Fix TestPythonGPU

* Use CNMeM allocator, since the pool allocator doesn't yet support multi-GPU

* Use 10.0 container to build RMM-enabled XGBoost

* Revert "Use 10.0 container to build RMM-enabled XGBoost"

This reverts commit 789021fa31112e25b683aef39fff375403060141.

* Fix Jenkinsfile

* [CI] Assign larger /dev/shm to NCCL

* Use 10.2 artifact to run multi-GPU Python tests

* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target

* Rename Conda env rmm_test -> gpu_test

* Use env var to opt into CNMeM pool for C++ tests

* Use identical CUDA version for RMM builds and tests

* Use Pytest fixtures to enable RMM pool in Python tests

* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM

* Use per-device MR; use command arg in gtest

* Set CMake prefix path to use Conda env

* Use 0.15 nightly version of RMM

* Remove unnecessary header

* Fix a unit test when cudf is missing

* Add RMM demos

* Remove print()

* Use HostDeviceVector in GPU predictor

* Simplify pytest setup; use LocalCUDACluster fixture

* Address reviewers' comments

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-12 01:26:02 -07:00
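
The commit ships RMM demos; their gist, as a hedged sketch, is below. It requires an XGBoost build configured with the PLUGIN_RMM option mentioned in the commit, and synthetic data stands in for a real workload:

```python
# Route XGBoost's device allocations through an RMM pool. The pool must
# be set up before any CUDA memory is touched.
import rmm
rmm.reinitialize(pool_allocator=True)

import numpy as np
import xgboost as xgb

X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)
# With the plugin compiled in, device memory for training comes from the
# RMM pool instead of XGBoost's own caching allocator.
booster = xgb.train({"tree_method": "gpu_hist",
                     "objective": "binary:logistic"},
                    dtrain, num_boost_round=10)
```
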
Philip Hyunsu Cho
a6d9a06b7b
[CI] Fix cuDF install; merge 'gpu' and 'cudf' test suite (#5814) 2020-06-19 16:42:57 +08:00
Rory Mitchell
b47b5ac771
Use hypothesis (#5759)
* Use hypothesis

* Allow int64 array interface for groups

* Add packages to Windows CI

* Add to travis

* Make sure device index is set correctly

* Fix dask-cudf test

* appveyor
2020-06-16 12:45:59 +12:00
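
For readers unfamiliar with the library: hypothesis generates parameter combinations instead of hand-picked cases. A self-contained sketch in the spirit of the migrated tests (the strategies and the property below are illustrative, not the repo's own):

```python
import numpy as np
import xgboost as xgb
from hypothesis import given, settings, strategies as st

@given(max_depth=st.integers(1, 8),
       eta=st.floats(0.01, 1.0),
       num_rounds=st.integers(1, 20))
@settings(deadline=None, max_examples=20)
def test_training_error_does_not_increase(max_depth, eta, num_rounds):
    rng = np.random.RandomState(0)
    dtrain = xgb.DMatrix(rng.rand(100, 5), label=rng.rand(100))
    history = {}
    xgb.train({"max_depth": max_depth, "eta": eta}, dtrain,
              num_boost_round=num_rounds,
              evals=[(dtrain, "train")], evals_result=history,
              verbose_eval=False)
    rmse = history["train"]["rmse"]
    assert rmse[-1] <= rmse[0]  # boosting should not hurt training error
```
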
Jiaming Yuan
8b04736b81
[dask] dask cudf inplace prediction. (#5512)
* Add inplace prediction for dask-cudf.

* Remove Dockerfile.release, since it's not used anywhere

* Use Conda exclusively in CUDF and GPU containers

* Improve cupy memory copying.

* Add skip marks to tests.

* Add mgpu-cudf category on the CI to run all distributed tests.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-15 18:15:51 +08:00
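
A sketch of the difference this adds (synthetic CPU data for brevity, though the PR's focus was the dask-cudf path): `inplace_predict` consumes worker-resident partitions directly, skipping the intermediate DaskDMatrix at prediction time:

```python
import xgboost as xgb
import dask.array as da
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((10_000, 20), chunks=(1_000, 20))
    y = da.random.random(10_000, chunks=1_000)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    out = xgb.dask.train(client, {"objective": "reg:squarederror"},
                         dtrain, num_boost_round=10)
    # Prediction without building another DaskDMatrix: partitions are
    # consumed in place on the workers.
    preds = xgb.dask.inplace_predict(client, out, X)
    print(preds.compute()[:5])
```
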
Jiaming Yuan
7663de956c
Run training with empty DMatrix. (#4990)
This makes GPU Hist robust in distributed environments, as some workers might not
be associated with any data in either training or evaluation.

* Disable rabit mock test for now: see #5012.

* Disable dask-cudf test at prediction for now: see #5003.

* Launch the dask job on all workers, even those that might not have any data.
* Check for 0 rows in elementwise evaluation metrics.

   Using AUC and AUC-PR still throws an error.  See #4663 for a robust fix.

* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero-sized grids.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.

* Sync number of columns in DMatrix.

  num_feature is required to be the same across all workers in data-split
  mode.

* Filtering in the dask interface now by default syncs every non-empty booster,
instead of using only rank 0.

* Fix Jenkins' GPU tests.

* Install dask-cuda from source in Jenkins' test.

  Now all tests are actually running.

* Restore GPU Hist tree synchronization test.

* Check UUID of running devices.

  The check is only performed on CUDA versions >= 10.x, as 9.x doesn't have the UUID field.

* Fix CMake policy and project variables.

  Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.

* Fix copying data to CPU

* Fix race condition in cpu predictor.

* Fix duplicated DMatrix construction.

* Don't download extra nccl in CI script.
2019-11-06 16:13:13 +08:00
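
The edge case at the heart of this PR, as a single-process sketch (shapes illustrative): a zero-row DMatrix is valid, and with this change a distributed GPU Hist worker holding one participates in the collective steps instead of crashing:

```python
import numpy as np
import xgboost as xgb

# A worker with no local data sees a DMatrix like this one.
empty = xgb.DMatrix(np.empty((0, 4)), label=np.empty(0))
assert empty.num_row() == 0 and empty.num_col() == 4
```
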
Philip Hyunsu Cho
95295ce026
[CI] Use latest dask (#4973)
* Remove version spec, to use latest dask always
2019-10-22 07:00:13 -04:00
Philip Hyunsu Cho
166def9f75
[CI] Fix broken installation of Pandas (#4722)
* [CI] Fix broken installation of Pandas

* Update Dockerfile.gpu
2019-07-30 22:03:11 -07:00
Philip Hyunsu Cho
a30176907f
Support Dask 2.0 (#4617) 2019-06-27 20:42:35 -07:00
Rory Mitchell
09b90d9329
Add native support for Dask (#4473)
* Add native support for Dask

* Add multi-GPU demo

* Add sklearn example
2019-05-27 13:29:28 +12:00
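
The modern shape of the interface this PR introduced, sketched with the sklearn-style dask wrapper that later grew out of it (synthetic data; `DaskXGBRegressor` itself postdates #4473):

```python
import xgboost as xgb
import dask.array as da
from dask.distributed import Client, LocalCluster

with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
    X = da.random.random((5_000, 10), chunks=(500, 10))
    y = da.random.random(5_000, chunks=500)
    reg = xgb.dask.DaskXGBRegressor(n_estimators=20, max_depth=4)
    reg.client = client     # explicit; defaults to the active client
    reg.fit(X, y)
    preds = reg.predict(X)  # a lazy dask array
```
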
Philip Hyunsu Cho
ea850ecd20
[CI] Refactor Jenkins CI pipeline + migrate all Linux tests to Jenkins (#4401)
* All Linux tests are now in Jenkins CI
* Tests are now decoupled from builds: we can build XGBoost with one version of CUDA/JDK and test it with another
* Builds (compilation) are significantly faster because (1) they use C5 instances with faster CPU cores, and (2) the build environment setup is cached using Docker containers
2019-04-26 18:39:12 -07:00
Jiaming Yuan
207f058711
Refactor CMake scripts. (#4323)
* Refactor CMake scripts.

* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.

* Revert machine.conf.

* Move CLI test to gpu.

* Small cleanup.

* Support using XGBoost as submodule.

* Fix windows

* Fix cpp tests on Windows

* Remove duplicated find_package.
2019-04-15 10:08:12 -07:00
Philip Hyunsu Cho
7aed8f3d48
[CI] Upgrade to GCC 5.3.1, CMake 3.6.0 (#4306)
* Upgrade to GCC 5.3.1, CMake 3.6.0

* <regex> is now okay
2019-03-28 00:21:21 -07:00
Rong Ou
5aa42b5f11
jenkins build for cuda 10.0 (#4281)
* jenkins build for cuda 10.0

* yum install nccl2 for cuda 10.0
2019-03-22 22:35:18 -07:00
Matthew Jones
92b7577c62
[REVIEW] Enable Multi-Node Multi-GPU functionality (#4095)
* Initial commit to support multi-node multi-gpu xgboost using dask

* Fixed NCCL initialization by not ignoring the opg parameter.

- it now crashes on NCCL initialization, but at least we're attempting it properly

* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers

* Synchronizing in a couple of more places.

- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places

* Added another missing max-allreduce operation inside BuildHistLeftRight

* Removed unnecessary collective operations.

* Simplified rabit::Allreduce() sync of gradient sums.

* Removed unnecessary rabit syncs around ncclAllReduce.

- this improves performance _significantly_ (7x faster for overall training,
  20x faster for xgboost proper)

* pulling in latest xgboost

* removing changes to updater_quantile_hist.cc

* changing use_nccl_opg initialization, removing unnecessary if statements

* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId

* placing struct definition in guard to avoid duplicate code errors

* addressing linting errors

* removing

* removing additional arguments to AllReducer initialization

* removing distributed flag

* making comm init symmetric

* removing distributed flag

* changing ncclCommInit to support multiple modalities

* fix indenting

* updating ncclCommInitRank block with necessary group calls

* fix indenting

* adding print statement, and updating accessor in vector

* improving print statement to end with a newline

* generalizing nccl_rank construction using rabit

* assume device_ordinals is the same for every node

* test, assume device_ordinals is identical for all nodes

* test, assume device_ordinals is unique for all nodes

* changing names of offset variable to be more descriptive, editing indenting

* wrapping ncclUniqueId GetUniqueId() and aesthetic changes

* adding synchronization, and tests for distributed

* adding  to tests

* fixing broken #endif

* fixing initialization of gpu histograms, correcting errors in tests

* adding to contributors list

* adding distributed tests to jenkins

* fixing bad path in distributed test

* debugging

* adding kubernetes for distributed tests

* adding proper import for OrderedDict

* adding urllib3==1.22 to address ordered_dict import error

* added sleep to allow workers to save their models for comparison

* adding name to GPU contributors under docs
2019-03-02 10:03:22 +13:00
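
How this multi-node multi-GPU path is typically driven today, as a hedged sketch (the PR wired up the NCCL/rabit plumbing underneath; dask-cuda and the synthetic data here are illustrative):

```python
import xgboost as xgb
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One dask-cuda worker per visible GPU.
with LocalCUDACluster() as cluster, Client(cluster) as client:
    X = da.random.random((100_000, 20), chunks=(10_000, 20))
    y = da.random.random(100_000, chunks=10_000)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    # Each worker builds histograms on its shard; the NCCL allreduce set up
    # in this PR merges them across GPUs/nodes.
    out = xgb.dask.train(client, {"tree_method": "gpu_hist"},
                         dtrain, num_boost_round=50)
```
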
Philip Hyunsu Cho
7a652a8c64
Speed up Jenkins by not compiling CMake (#4099) 2019-02-03 00:08:14 -08:00
Jiaming Yuan
2ea0f887c1
Refactor Python tests. (#3897)
* Deprecate nose tests.
* Format python tests.
2018-11-15 13:56:33 +13:00
Philip Hyunsu Cho
411df9f878
Test wheels on CUDA 10.0 container for compatibility (#3838) 2018-11-01 08:34:47 -07:00
Philip Hyunsu Cho
abf2f661be
Fix #3708: Use dmlc::TemporaryDirectory to handle temporaries in a cross-platform way (#3783)
* Fix #3708: Use dmlc::TemporaryDirectory to handle temporaries in a cross-platform way

Also install git inside NVIDIA GPU container

* Update dmlc-core
2018-10-18 10:16:04 -07:00
Thejaswi
2200939416
Upgrading to NCCL2 (#3404)
* Upgrading to NCCL2

* Part II of the NCCL2 upgrade

 - Doc updates to build with nccl2
 - Dockerfile.gpu update for a correct CI build with nccl2
 - Updated FindNccl package to have env-var NCCL_ROOT to take precedence

* Upgrading to CUDA 9.2 for the CI workflow, since it has the nccl2 binaries available

* Added NCCL2 license; copy the nccl binaries into /usr so the FindNccl module can find them

* Set the LD_LIBRARY_PATH variable to pick up the nccl2 binary at runtime

* Need the nccl2 library download instructions inside Dockerfile.release as well

* Use NCCL2 as a static library
2018-07-10 00:42:15 -07:00
Rory Mitchell
f8b7686719
Add cuda 8/9.1 centos 6 builds, test GPU wheel on CPU-only container. (#3309)
* Add cuda 8/9.1 centos 6 builds, test GPU wheel on CPU-only container.

* Add Google test
2018-05-17 10:57:01 +12:00
Michal Malohlava
33ee7d1615
[BUILD] Dockerfile and Jenkinsfile revisited (#2514)
Includes:
  - Dockerfile changes
    - Dockerfile clean up
    - Fix execution privileges of files used from Dockerfile.
    - New Dockerfile entrypoint to replace with_user script
    - Defined placeholders for CPU testing (script and Dockerfile)
  - Jenkinsfile
    - Jenkinsfile milestone defined
    - Single source-code checkout and propagation via stash/unstash
    - Bash must be used explicitly when launching the make build, since we need
      access to the environment
    - Jenkinsfile build factory for cmake- and make-style jobs
    - Archiving of artifacts (*.so, *.whl, *.egg) produced by the cmake build

Missing:
  - CPU testing
  - Python3 env build and testing
2017-07-13 17:51:47 +12:00
Rory Mitchell
1899f9e744
[GPU-Plugin] Add basic continuous integration for GPU plugin. (#2431) 2017-06-22 10:15:28 -04:00