* [CI] Add RMM as an optional dependency
* Replace caching allocator with pool allocator from RMM
* Revert "Replace caching allocator with pool allocator from RMM"
This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.
* Use rmm::mr::get_default_resource()
* Try setting default resource (doesn't work yet)
* Allocate pool_mr in the heap
* Prevent leaking pool_mr handle
* Move EXPECT_DEATH() into a separate test suite suffixed with DeathTest
* Turn off death tests for RMM
* Address reviewer's feedback
* Prevent leaking of cuda_mr
* Fix Jenkinsfile syntax
* Remove unnecessary function in Jenkinsfile
* [CI] Install NCCL into RMM container
* Run Python tests
* Try building with RMM, CUDA 10.0
* Do not use RMM for CUDA 10.0 target
* Actually test for test_rmm flag
* Fix TestPythonGPU
* Use CNMeM allocator, since the pool allocator doesn't yet support multi-GPU
* Use 10.0 container to build RMM-enabled XGBoost
* Revert "Use 10.0 container to build RMM-enabled XGBoost"
This reverts commit 789021fa31112e25b683aef39fff375403060141.
* Fix Jenkinsfile
* [CI] Assign larger /dev/shm to NCCL
* Use 10.2 artifact to run multi-GPU Python tests
* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
* Rename Conda env rmm_test -> gpu_test
* Use env var to opt into CNMeM pool for C++ tests
* Use identical CUDA version for RMM builds and tests
* Use Pytest fixtures to enable RMM pool in Python tests
* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM
* Use per-device MR; use command arg in gtest
* Set CMake prefix path to use Conda env
* Use 0.15 nightly version of RMM
* Remove unnecessary header
* Fix a unit test when cudf is missing
* Add RMM demos
* Remove print()
* Use HostDeviceVector in GPU predictor
* Simplify pytest setup; use LocalCUDACluster fixture
* Address reviewers' comments
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
* Publish artifacts only on the master and release branches
* Build CUDA only for Compute Capability 7.5 when building PRs
* Run all Windows jobs in a single worker image
* Build nightly XGBoost4J SNAPSHOT JARs with Scala 2.12 only
* Show skipped Python tests on Windows
* Make Graphviz optional for Python tests
* Add back C++ tests
* Unstash xgboost_cpp_tests
* Fix label to CUDA 10.1
* Install cuPy for CUDA 10.1
* Install jsonschema
* Address reviewer's feedback
* Implement GK sketching on GPU.
* Stronger tests for quantile building.
* Handle sparse dataset by binary searching the column index.
* Hypothesis test on dask.
* Add cupy to Windows CI
* Update Jenkinsfile-win64
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Update Jenkinsfile-win64
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Update tests/python-gpu/test_gpu_prediction.py
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
* Use hypothesis
* Allow int64 array interface for groups
* Add packages to Windows CI
* Add to Travis
* Make sure device index is set correctly
* Fix dask-cudf test
* AppVeyor
* Group aware GPU weighted sketching.
* Distribute group weights to each data point.
* Relax the test.
* Validate input meta info.
* Fix MetaInfo copy constructor.
* Add inplace prediction for dask-cudf.
* Remove Dockerfile.release, since it's not used anywhere
* Use Conda exclusively in CUDF and GPU containers
* Improve cupy memory copying.
* Add skip marks to tests.
* Add mgpu-cudf category on the CI to run all distributed tests.
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
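Two of the bullets above, binary searching the column index for sparse data and distributing group weights to each data point, can be illustrated with a small CPU-side sketch. It assumes a CSR-like layout whose entries are already sorted by column index, and a `group_ptr` array of query-group boundaries; the function names are hypothetical and this is not XGBoost's CUDA implementation.

```python
from bisect import bisect_left, bisect_right


def column_segments(sorted_columns, n_features):
    """Offsets such that the entries of feature f are
    sorted_columns[offsets[f]:offsets[f + 1]].

    Assumes the sparse entries are sorted by column index, so every column
    boundary can be found by binary search; empty columns get empty segments.
    """
    return [bisect_left(sorted_columns, f) for f in range(n_features + 1)]


def row_weight(group_ptr, group_weights, row):
    """Weight of the query group that contains `row`.

    Rows group_ptr[g] .. group_ptr[g + 1] - 1 belong to group g.
    bisect_right skips over empty groups that share the same boundary.
    """
    group = bisect_right(group_ptr, row) - 1
    return group_weights[group]


def expand_group_weights(group_ptr, group_weights):
    # Each row looks up its own group independently of the others.
    return [row_weight(group_ptr, group_weights, r) for r in range(group_ptr[-1])]


print(column_segments([0, 0, 2, 2, 2, 3], 4))    # [0, 2, 2, 5, 6]
print(expand_group_weights([0, 2, 5], [0.5, 2.0]))  # [0.5, 0.5, 2.0, 2.0, 2.0]
```

Because each row resolves its group with its own binary search, the lookup has no dependency between rows, which is the property that would make a one-thread-per-row GPU mapping straightforward.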
Normal prediction with DMatrix is now thread-safe via locks. The newly added in-place prediction is thread-safe without locks.
When data is on device (cupy, cudf), the returned data is also on device.
* Implementation for numpy, csr, cudf and cupy.
* Implementation for dask.
* Remove sync in simple dmatrix.
* Move thread local entry into Learner.
This is an attempt to work around a CUDA context issue with static variables, where
the CUDA context can be released before the device vector.
* Add PredictionEntry to thread local entry.
This eliminates one copy of prediction vector.
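The locking scheme described above can be sketched with a toy model: prediction through a shared per-dataset cache takes a lock, while in-place prediction only reads the model and writes into a per-thread buffer, so it needs no lock. `ToyLearner` and its linear scoring are hypothetical stand-ins, not XGBoost's `Learner` or `PredictionEntry`.

```python
import threading


class ToyLearner:
    """Hypothetical linear model used only to illustrate the locking scheme."""

    def __init__(self, weights):
        self.weights = list(weights)
        self._cache = {}                 # shared: cached predictions per dataset id
        self._lock = threading.Lock()    # guards self._cache
        self._local = threading.local()  # per-thread scratch buffer

    def _score(self, row):
        return sum(w * x for w, x in zip(self.weights, row))

    def predict(self, dataset_id, rows):
        # DMatrix-style prediction touches the shared cache, so take the lock.
        with self._lock:
            if dataset_id not in self._cache:
                self._cache[dataset_id] = [self._score(r) for r in rows]
            return list(self._cache[dataset_id])

    def inplace_predict(self, rows):
        # Only reads self.weights and writes to a thread-local buffer: no lock.
        buf = getattr(self._local, "buf", None)
        if buf is None:
            buf = self._local.buf = []
        buf.clear()
        buf.extend(self._score(r) for r in rows)
        return list(buf)  # return a copy; the buffer is reused by this thread
```

Keeping the scratch buffer in thread-local storage lets each thread reuse its own allocation across calls, which loosely mirrors the idea of storing a prediction entry in a thread-local entry to avoid an extra copy.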
* Don't define CUDA C API in a namespace.
* Use pre-rounding based method to obtain reproducible floating point
summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
* Fix syncing DMatrix columns.
* Add notes for tree method.
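The pre-rounding idea behind the reproducible summation bullets can be sketched in plain Python over doubles. Each value is first rounded onto a fixed grid derived from a power-of-two factor chosen from the sum of absolute values and the element count; after that, every partial sum is exactly representable, so the total is identical in any order. The factor formula and the helper names below are assumptions modelled on the general pre-rounding technique, not XGBoost's actual GPU Hist code.

```python
import math
import sys


def rounding_factor(values):
    """Power of two R with R >= sum(|x|) / (1 - 2 * n * eps)."""
    n = len(values)
    eps = sys.float_info.epsilon
    total = math.fsum(abs(v) for v in values)  # correctly rounded, order-independent
    delta = total / (1.0 - 2 * n * eps)
    _, exp = math.frexp(delta)  # delta = m * 2**exp with 0.5 <= m < 1
    return math.ldexp(1.0, exp)  # 2**exp >= delta


def truncate(factor, x):
    # Adding and subtracting the large factor rounds x onto a fixed grid
    # (multiples of factor * 2**-53), discarding its low-order bits.
    return (factor + x) - factor


def reproducible_sum(values):
    factor = rounding_factor(values)
    acc = 0.0
    for v in values:
        # Every partial sum is a bounded multiple of the grid spacing, so each
        # addition is exact and the result does not depend on the order.
        acc += truncate(factor, v)
    return acc
```

The price is a small, bounded loss of precision (each value loses the bits below the grid), traded for bit-by-bit identical results regardless of how the parallel reduction orders the additions.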
* Enable feature validation for all interfaces except for JVM.
* Better tests for boosting from predictions.
* Disable validation on JVM.
* Disable parameter validation for now.
Scikit-Learn passes all parameters down to XGBoost, whether they are used or
not.
* Add option `validate_parameters`.
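Making parameter validation opt-in can be sketched as follows: unknown keys are reported only when `validate_parameters` is set, so wrappers that forward every argument (such as the scikit-learn interface) do not trigger warnings by default. The `KNOWN` registry, the warning text, and `check_parameters` are hypothetical; only the `validate_parameters` option name comes from the bullets above.

```python
import warnings

# Hypothetical registry of parameter names that some component consumes.
KNOWN = {"max_depth", "eta", "objective", "tree_method", "validate_parameters"}


def check_parameters(params, known=KNOWN):
    """Return (and warn about) parameter names no component would consume.

    The check only runs when `validate_parameters` is truthy.
    """
    if not params.get("validate_parameters", False):
        return []
    unused = sorted(k for k in params if k not in known)
    if unused:
        warnings.warn("unused parameters: " + ", ".join(unused))
    return unused
```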
* - Implementation of the MAP ranking algorithm.
- Also applied the necessary suggestions mentioned in earlier ranking PRs.
- Made some performance improvements to the NDCG algorithm as well.
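As a reference point for the MAP bullet, mean average precision over query groups can be written in a few lines: within each group, items are sorted by descending score, precision@k is averaged over the ranks k that hold a relevant item, and the per-group results are averaged. This is a textbook sketch with binary relevance (label > 0); it assumes a group with no relevant items scores 0.0 and that ties keep their input order, neither of which is taken from XGBoost.

```python
def average_precision(labels, scores):
    """AP of one query group: mean of precision@k over relevant ranks k."""
    # Rank items by descending score; sorted() is stable, so ties keep input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] > 0:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


def mean_average_precision(groups):
    """groups: list of (labels, scores) pairs, one per query."""
    return sum(average_precision(l, s) for l, s in groups) / len(groups)


# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```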
* Remove `learning_rates`.
It has been deprecated in favour of callbacks.
* Set `before_iteration` of `reset_learning_rate` to False to preserve
the initial learning rate and comply with the term "reset".
Closes #4709.
* Tests for various `tree_method`.
* Extract interaction constraints from split evaluator.
The main reason is model IO, where num_feature and interaction_constraints are copied into the split evaluator. Also, interaction constraints are by themselves a feature selector, acting like a column sampler, so it is inefficient to bury them deep in the evaluator chain. Lastly, removing one more copied parameter is a win.
* Enable inc for approx tree method.
Now that the implementation is split out from the evaluator class, it is also enabled for the approx method.
* Removing obsoleted code in colmaker.
They were never documented nor actually used in the real world, and there isn't a single test for those code blocks.
* Unifying the types used for row and column.
As input datasets grow toward billions of rows, incorrect use of int is prone to overflow, and signed integer overflow is undefined behaviour. This PR starts the process of unifying the index type to unsigned integers. Some compiler optimizations exploit this undefined behaviour, but after some testing I don't see them benefiting XGBoost.
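Treating interaction constraints as a feature selector, as described above, can be sketched as a function that returns the features a node may split on given the features already used on its path: a feature is allowed only if it lies in some constraint set that contains every feature on the path. This reflects our reading of the constraint semantics; the function name is hypothetical, and features that appear in no constraint set are not modelled here.

```python
def allowed_features(constraints, path_features):
    """Features a node may split on under interaction constraints.

    constraints: iterable of feature-index collections allowed to interact.
    path_features: features already used on the path from the root.
    """
    path = set(path_features)
    allowed = set()
    for group in constraints:
        group = set(group)
        if path <= group:  # every feature on the path belongs to this set
            allowed |= group
    return allowed


constraints = [[0, 1], [1, 2, 3]]
# Like a column sampler, the result can simply be intersected with a sample.
column_sample = {0, 2, 3}
print(allowed_features(constraints, [0]) & column_sample)  # {0}
```

Because the selector only produces a candidate feature set, it composes with column sampling by set intersection instead of being evaluated inside the split evaluator chain.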
This makes GPU Hist robust in distributed environment as some workers might not
be associated with any data in either training or evaluation.
* Disable rabit mock test for now: See #5012.
* Disable dask-cudf test at prediction for now: See #5003
* Launch dask jobs for all workers, even though some might not have any data.
* Check 0 rows in elementwise evaluation metrics.
Using AUC and AUC-PR still throws an error. See #4663 for a robust fix.
* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.
* Sync number of columns in DMatrix.
This is needed because num_feature must be the same across all workers in data split
mode.
* Filtering in the dask interface now, by default, syncs all boosters that are not
empty, instead of using rank 0.
* Fix Jenkins' GPU tests.
* Install dask-cuda from source in Jenkins' test.
Now all tests are actually running.
* Restore GPU Hist tree synchronization test.
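One way to keep elementwise metrics well-defined when some workers hold 0 rows is to have every worker report partial sums, which are simply (0, 0) for an empty worker, and divide only after aggregating them. The sketch below uses RMSE; `allreduce_sum` is a hypothetical in-process stand-in for a real collective, and none of these names are XGBoost's.

```python
import math


def local_rmse_stats(labels, preds, weights=None):
    """Per-worker partial sums; valid even when the worker holds 0 rows."""
    if weights is None:
        weights = [1.0] * len(labels)
    sq = sum(w * (y - p) ** 2 for y, p, w in zip(labels, preds, weights))
    return sq, sum(weights)


def allreduce_sum(stats):
    """Hypothetical stand-in for an allreduce: elementwise sum over workers."""
    return tuple(sum(col) for col in zip(*stats))


def global_rmse(stats):
    sq, wsum = allreduce_sum(stats)
    # Divide only after aggregation; NaN if no worker had any data at all.
    return math.sqrt(sq / wsum) if wsum > 0 else float("nan")
```

The empty worker still takes part in the reduction, which matches the idea of launching jobs on all workers even when some have no data.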
* Check UUID of running devices.
The check is only performed on CUDA >= 10.x, as 9.x doesn't have the UUID field.
* Fix CMake policy and project variables.
Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.
* Fix copying data to CPU
* Fix race condition in cpu predictor.
* Fix duplicated DMatrix construction.
* Don't download extra nccl in CI script.
* Don't set_params at the end of set_state.
* Also fix another issue found in dask prediction.
* Add note about prediction.
Other prediction modes are not supported at the moment.