* [CI] Add RMM as an optional dependency
* Replace caching allocator with pool allocator from RMM
* Revert "Replace caching allocator with pool allocator from RMM"
This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.
* Use rmm::mr::get_default_resource()
* Try setting default resource (doesn't work yet)
* Allocate pool_mr in the heap
* Prevent leaking pool_mr handle
* Separate EXPECT_DEATH() in separate test suite suffixed DeathTest
* Turn off death tests for RMM
* Address reviewer's feedback
* Prevent leaking of cuda_mr
* Fix Jenkinsfile syntax
* Remove unnecessary function in Jenkinsfile
* [CI] Install NCCL into RMM container
* Run Python tests
* Try building with RMM, CUDA 10.0
* Do not use RMM for CUDA 10.0 target
* Actually test for test_rmm flag
* Fix TestPythonGPU
* Use CNMeM allocator, since pool allocator doesn't yet support multiGPU
* Use 10.0 container to build RMM-enabled XGBoost
* Revert "Use 10.0 container to build RMM-enabled XGBoost"
This reverts commit 789021fa31112e25b683aef39fff375403060141.
* Fix Jenkinsfile
* [CI] Assign larger /dev/shm to NCCL
* Use 10.2 artifact to run multi-GPU Python tests
* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
* Rename Conda env rmm_test -> gpu_test
* Use env var to opt into CNMeM pool for C++ tests
* Use identical CUDA version for RMM builds and tests
* Use Pytest fixtures to enable RMM pool in Python tests
* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM
* Use per-device MR; use command arg in gtest
* Set CMake prefix path to use Conda env
* Use 0.15 nightly version of RMM
* Remove unnecessary header
* Fix a unit test when cudf is missing
* Add RMM demos
* Remove print()
* Use HostDeviceVector in GPU predictor
* Simplify pytest setup; use LocalCUDACluster fixture
* Address reviewers' commments
Co-authored-by: Hyunsu Cho <chohyu01@cs.wasshington.edu>
* Implement GK sketching on GPU.
* Strong tests on quantile building.
* Handle sparse dataset by binary searching the column index.
* Hypothesis test on dask.
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
used for metric computation
- it also wraps all the bald device pointers into span.
* - implementation of map ranking algorithm
- also effected necessary suggestions mentioned in the earlier ranking pr's
- made some performance improvements to the ndcg algo as well
This makes GPU Hist robust in distributed environment as some workers might not
be associated with any data in either training or evaluation.
* Disable rabit mock test for now: See #5012 .
* Disable dask-cudf test at prediction for now: See #5003
* Launch dask job for all workers despite they might not have any data.
* Check 0 rows in elementwise evaluation metrics.
Using AUC and AUC-PR still throws an error. See #4663 for a robust fix.
* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.
* Sync number of columns in DMatrix.
As num_feature is required to be the same across all workers in data split
mode.
* Filtering in dask interface now by default syncs all booster that's not
empty, instead of using rank 0.
* Fix Jenkins' GPU tests.
* Install dask-cuda from source in Jenkins' test.
Now all tests are actually running.
* Restore GPU Hist tree synchronization test.
* Check UUID of running devices.
The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field.
* Fix CMake policy and project variables.
Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.
* Fix copying data to CPU
* Fix race condition in cpu predictor.
* Fix duplicated DMatrix construction.
* Don't download extra nccl in CI script.
* - pairwise ranking objective implementation on gpu
- there are couple of more algorithms (ndcg and map) for which support will be added
as follow-up pr's
- with no label groups defined, get gradient is 90x faster on gpu (120m instance
mortgage dataset)
- it can perform by an order of magnitude faster with ~ 10 groups (and adequate cores
for the cpu implementation)
* Add JSON config to rank obj.
* Move get transpose into cc.
* Clean up headers in host device vector, remove thrust dependency.
* Move span and host device vector into public.
* Install c++ headers.
* Short notes for c and c++.
Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Initial support for cudf integration.
* Add two C APIs for consuming data and metainfo.
* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.
* Add FromDeviceColumnar for consuming device data.
* Add new MetaInfo::SetInfo for consuming label, weight etc.
* - set the appropriate device before freeing device memory...
- pr #4532 added a global memory tracker/logger to keep track of number of (de)allocations
and peak memory usage on a per device basis.
- this pr adds the appropriate check to make sure that the (de)allocation counts and memory usages
makes sense for the device. since verbosity is typically increased on debug/non-retail builds.
* - pre-create cub allocators and reuse them
- create them once and not resize them dynamically. we need to ensure that these allocators
are created and destroyed exactly once so that the appropriate device id's are set
* Combine thread launches into single launch per tree for gpu_hist
algorithm.
* Address deprecation warning
* Add manual column sampler constructor
* Turn off omp dynamic to get a guaranteed number of threads
* Enable openmp in cuda code
* Optimisations for gpu_hist.
* Use streams to overlap operations.
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.
* Reformat.
* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
- it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
* Synchronizing in a couple of more places.
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
- this improves performance _significantly_ (7x faster for overall training,
20x faster for xgboost proper)
* pulling in latest xgboost
* removing changes to updater_quantile_hist.cc
* changing use_nccl_opg initialization, removing unnecessary if statements
* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* placing struct defintion in guard to avoid duplicate code errors
* addressing linting errors
* removing
* removing additional arguments to AllReduer initialization
* removing distributed flag
* making comm init symmetric
* removing distributed flag
* changing ncclCommInit to support multiple modalities
* fix indenting
* updating ncclCommInitRank block with necessary group calls
* fix indenting
* adding print statement, and updating accessor in vector
* improving print statement to end-line
* generalizing nccl_rank construction using rabit
* assume device_ordinals is the same for every node
* test, assume device_ordinals is identical for all nodes
* test, assume device_ordinals is unique for all nodes
* changing names of offset variable to be more descriptive, editing indenting
* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* adding synchronization, and tests for distributed
* adding to tests
* fixing broken #endif
* fixing initialization of gpu histograms, correcting errors in tests
* adding to contributors list
* adding distributed tests to jenkins
* fixing bad path in distributed test
* debugging
* adding kubernetes for distributed tests
* adding proper import for OrderedDict
* adding urllib3==1.22 to address ordered_dict import error
* added sleep to allow workers to save their models for comparison
* adding name to GPU contributors under docs
* Remove GHistRow, GHistEntry, GHistIndexRow.
* Remove kSimpleStats.
* Remove CheckInfo, SetLeafVec in GradStats and in SKStats.
* Clean up the GradStats.
* Cleanup calcgain.
* Move LossChangeMissing out of common.
* Remove [] operator from GHistIndexBlock.