* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefullness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
- Implement colsampling, subsampling for gpu_hist_experimental
- Optimised multi-GPU implementation for gpu_hist_experimental
- Make nccl optional
- Add Volta architecture flag
- Optimise RegLossObj
- Add timing utilities for debug verbose mode
- Bump required cuda version to 8.0
* Fatal error if GPU algorithm selected without GPU support compiled
* Resolve type conversion warnings
* Fix gpu unit test failure
* Fix compressed iterator edge case
* Fix python unit test failures due to flake8 update on pip