* - training with external memory part 1 of 2
- this pr focuses on computing the quantiles using multiple gpus on a
dataset that uses the external cache capabilities
- there will a follow-up pr soon after this that will support creation
of histogram indices on large dataset as well
- both of these changes are required to support training with external memory
- the sparse pages in dmatrix are taken in batches and the the cut matrices
are incrementally built
- also snuck in some (perf) changes related to sketches aggregation amongst multiple
features across multiple sparse page batches. instead of aggregating the summary
inside each device and merged later, it is aggregated in-place when the device
is working on different rows but the same feature
* Only define `gpu_id` and `n_gpus` in `LearnerTrainParam`
* Pass LearnerTrainParam through XGBoost vid factory method.
* Disable all GPU usage when GPU related parameters are not specified (fixes XGBoost choosing GPU over aggressively).
* Test learner train param io.
* Fix gpu pickling.
* Combine thread launches into single launch per tree for gpu_hist
algorithm.
* Address deprecation warning
* Add manual column sampler constructor
* Turn off omp dynamic to get a guaranteed number of threads
* Enable openmp in cuda code
* Refactor CMake scripts.
* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.
* Revert machine.conf.
* Move CLI test to gpu.
* Small cleanup.
* Support using XGBoost as submodule.
* Fix windows
* Fix cpp tests on Windows
* Remove duplicated find_package.
* Fix Histogram allocation.
nidx_map is cleared after `Reset`, but histogram data size isn't changed hence
histogram recycling is used in later iterations. After a reset(building new
tree), newly allocated node will start from 0, while recycling always choose
the node with smallest index, which happens to be our newly allocated node 0.
* Optimisations for gpu_hist.
* Use streams to overlap operations.
* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.
* Reformat.
* Initial commit to support multi-node multi-gpu xgboost using dask
* Fixed NCCL initialization by not ignoring the opg parameter.
- it now crashes on NCCL initialization, but at least we're attempting it properly
* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers
* Synchronizing in a couple of more places.
- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places
* Added another missing max-allreduce operation inside BuildHistLeftRight
* Removed unnecessary collective operations.
* Simplified rabit::Allreduce() sync of gradient sums.
* Removed unnecessary rabit syncs around ncclAllReduce.
- this improves performance _significantly_ (7x faster for overall training,
20x faster for xgboost proper)
* pulling in latest xgboost
* removing changes to updater_quantile_hist.cc
* changing use_nccl_opg initialization, removing unnecessary if statements
* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId
* placing struct defintion in guard to avoid duplicate code errors
* addressing linting errors
* removing
* removing additional arguments to AllReduer initialization
* removing distributed flag
* making comm init symmetric
* removing distributed flag
* changing ncclCommInit to support multiple modalities
* fix indenting
* updating ncclCommInitRank block with necessary group calls
* fix indenting
* adding print statement, and updating accessor in vector
* improving print statement to end-line
* generalizing nccl_rank construction using rabit
* assume device_ordinals is the same for every node
* test, assume device_ordinals is identical for all nodes
* test, assume device_ordinals is unique for all nodes
* changing names of offset variable to be more descriptive, editing indenting
* wrapping ncclUniqueId GetUniqueId() and aesthetic changes
* adding synchronization, and tests for distributed
* adding to tests
* fixing broken #endif
* fixing initialization of gpu histograms, correcting errors in tests
* adding to contributors list
* adding distributed tests to jenkins
* fixing bad path in distributed test
* debugging
* adding kubernetes for distributed tests
* adding proper import for OrderedDict
* adding urllib3==1.22 to address ordered_dict import error
* added sleep to allow workers to save their models for comparison
* adding name to GPU contributors under docs
* Remove GHistRow, GHistEntry, GHistIndexRow.
* Remove kSimpleStats.
* Remove CheckInfo, SetLeafVec in GradStats and in SKStats.
* Clean up the GradStats.
* Cleanup calcgain.
* Move LossChangeMissing out of common.
* Remove [] operator from GHistIndexBlock.
* Unify logging facilities.
* Enhance `ConsoleLogger` to handle different verbosity.
* Override macros from `dmlc`.
* Don't use specialized gamma when building with GPU.
* Remove verbosity cache in monitor.
* Test monitor.
* Deprecate `silent`.
* Fix doc and messages.
* Fix python test.
* Fix silent tests.
- Improved GPU performance logging
- Only use one execute shards function
- Revert performance regression on multi-GPU
- Use threads to launch NCCL AllReduce
* Split building histogram into separated class.
* Extract `InitCompressedRow` definition.
* Basic tests for gpu-hist.
* Document the code more verbosely.
* Removed `HistCutUnit`.
* Removed some duplicated copies in `GPUHistMaker`.
* Implement LCG and use it in tests.
* DMatrix refactor 2
* Remove buffered rowset usage where possible
* Transition to c++11 style iterators for row access
* Transition column iterators to C++ 11
* Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage.
- added distributions to HostDeviceVector
- using HostDeviceVector for labels, weights and base margings in MetaInfo
- using HostDeviceVector for offset and data in SparsePage
- other necessary refactoring
* Added const version of HostDeviceVector API calls.
- const versions added to calls that can trigger data transfers, e.g. DevicePointer()
- updated the code that uses HostDeviceVector
- objective functions now accept const HostDeviceVector<bst_float>& for predictions
* Updated src/linear/updater_gpu_coordinate.cu.
* Added read-only state for HostDeviceVector sync.
- this means no copies are performed if both host and devices access
the HostDeviceVector read-only
* Fixed linter and test errors.
- updated the lz4 plugin
- added ConstDeviceSpan to HostDeviceVector
- using device % dh::NVisibleDevices() for the physical device number,
e.g. in calls to cudaSetDevice()
* Fixed explicit template instantiation errors for HostDeviceVector.
- replaced HostDeviceVector<unsigned int> with HostDeviceVector<int>
* Fixed HostDeviceVector tests that require multiple GPUs.
- added a mock set device handler; when set, it is called instead of cudaSetDevice()
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
* Upgrading to NCCL2
* Part - II of NCCL2 upgradation
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated FindNccl package to have env-var NCCL_ROOT to take precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefullness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
* Multi-GPU HostDeviceVector.
- HostDeviceVector instances can now span multiple devices, defined by GPUSet struct
- the interface of HostDeviceVector has been modified accordingly
- GPU objective functions are now multi-GPU
- GPU predicting from cache is now multi-GPU
- avoiding omp_set_num_threads() calls
- other minor changes
* Replaced std::vector-based interfaces with HostDeviceVector-based interfaces.
- replacement was performed in the learner, boosters, predictors,
updaters, and objective functions
- only interfaces used in training were replaced;
interfaces like PredictInstance() still use std::vector
- refactoring necessary for replacement of interfaces was also performed,
such as using HostDeviceVector in prediction cache
* HostDeviceVector-based interfaces for custom objective function example plugin.