* Combine thread launches into single launch per tree for gpu_hist
algorithm.
* Address deprecation warning
* Add manual column sampler constructor
* Turn off omp dynamic to get a guaranteed number of threads
* Enable openmp in cuda code
* Add multi-GPU unit test environment
* Better assertion message
* Temporarily disable failing test
* Distinguish between multi-GPU and single-GPU CPP tests
* Consolidate Python tests. Use attributes to distinguish multi-GPU Python tests from single-CPU counterparts
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
- Implement colsampling, subsampling for gpu_hist_experimental
- Optimised multi-GPU implementation for gpu_hist_experimental
- Make nccl optional
- Add Volta architecture flag
- Optimise RegLossObj
- Add timing utilities for debug verbose mode
- Bump required cuda version to 8.0