* Implement Transform class.
* Add tests for softmax.
* Use Transform in regression, softmax and hinge objectives, except for Cox.
* Mark old gpu objective functions deprecated.
* static_assert for softmax.
* Split up multi-gpu tests.
* DMatrix refactor 2
* Remove buffered rowset usage where possible
* Transition to C++11 style iterators for row access
* Transition column iterators to C++11
- previously, vec_ in DeviceShard wasn't updated on copy; as a result,
the shards continued to refer to the old HostDeviceVectorImpl object,
which resulted in a dangling pointer once that object was deallocated
* Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage.
- added distributions to HostDeviceVector
- using HostDeviceVector for labels, weights and base margins in MetaInfo
- using HostDeviceVector for offset and data in SparsePage
- other necessary refactoring
* Added const version of HostDeviceVector API calls.
- const versions added to calls that can trigger data transfers, e.g. DevicePointer()
- updated the code that uses HostDeviceVector
- objective functions now accept const HostDeviceVector<bst_float>& for predictions
* Updated src/linear/updater_gpu_coordinate.cu.
* Added read-only state for HostDeviceVector sync.
- this means no copies are performed if both host and devices access
the HostDeviceVector read-only
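A toy model of the read-only idea, not the actual HostDeviceVector code: the class name `SyncedVector`, its methods, and the copy counter are all hypothetical. It assumes one host copy and one device copy, each with a validity flag, and shows that read-only access from both sides costs a single transfer, while a mutable access invalidates the other side.

```python
class SyncedVector:
    """Hypothetical stand-in for a host/device vector; counts simulated copies."""

    def __init__(self, data):
        self.host = list(data)
        self.device = []          # stands in for GPU memory
        self.host_valid = True
        self.device_valid = False
        self.copies = 0           # number of simulated host<->device transfers

    def _sync_device(self):
        if not self.device_valid:
            self.device = list(self.host)
            self.device_valid = True
            self.copies += 1

    def _sync_host(self):
        if not self.host_valid:
            self.host = list(self.device)
            self.host_valid = True
            self.copies += 1

    def const_device(self):
        # Read-only device access: the host copy stays valid.
        self._sync_device()
        return tuple(self.device)

    def const_host(self):
        # Read-only host access: the device copy stays valid.
        self._sync_host()
        return tuple(self.host)

    def mutable_host(self):
        # Writable host access: the device copy becomes stale.
        self._sync_host()
        self.device_valid = False
        return self.host

    def mutable_device(self):
        # Writable device access: the host copy becomes stale.
        self._sync_device()
        self.host_valid = False
        return self.device


vec = SyncedVector([1.0, 2.0, 3.0])
vec.const_device()
vec.const_host()
vec.const_device()
copies_read_only = vec.copies      # only the first device access copied

vec.mutable_host()[0] = 9.0        # a host write invalidates the device copy
after_write = vec.const_device()   # triggers one more transfer
```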
* Fixed linter and test errors.
- updated the lz4 plugin
- added ConstDeviceSpan to HostDeviceVector
- using device % dh::NVisibleDevices() for the physical device number,
e.g. in calls to cudaSetDevice()
* Fixed explicit template instantiation errors for HostDeviceVector.
- replaced HostDeviceVector<unsigned int> with HostDeviceVector<int>
* Fixed HostDeviceVector tests that require multiple GPUs.
- added a mock set device handler; when set, it is called instead of cudaSetDevice()
* Add basic Span class based on ISO C++20.
* Use Span<Entry const> instead of Inst in SparsePage.
* Add DeviceSpan in HostDeviceVector, use it in regression obj.
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)
* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.
* Fix appveyor R test
* Added finding quantiles on GPU.
- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
as those found by the old one, test thresholds in
tests/python-gpu/test_gpu_updaters.py have been adjusted.
* Adjustments and improved testing for finding quantiles on the GPU.
- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
* Fail GPU CI after test failure
* Fix GPU linear tests
* Reduced number of GPU tests to speed up CI
* Remove static allocations of device memory
* Resolve illegal memory access for updater_fast_hist.cc
* Fix broken r tests dependency
* Update python install documentation for GPU
* Upgrading to NCCL2
* Part II of the NCCL2 upgrade
- Doc updates to build with nccl2
- Dockerfile.gpu update for a correct CI build with nccl2
- Updated the FindNccl package so that the NCCL_ROOT environment variable takes precedence
* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available
* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find
* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime
* Need the nccl2 library download instructions inside Dockerfile.release as well
* Use NCCL2 as a static library
* Use sparse page as singular CSR matrix representation
* Simplify dmatrix methods
* Reduce statefulness of batch iterators
* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
* GPU binning and compression.
- binning and index compression are done inside the DeviceShard constructor
- in case of a DMatrix with multiple row batches, it is first converted into a single row batch
* Multi-GPU HostDeviceVector.
- HostDeviceVector instances can now span multiple devices, defined by GPUSet struct
- the interface of HostDeviceVector has been modified accordingly
- GPU objective functions are now multi-GPU
- GPU predicting from cache is now multi-GPU
- avoiding omp_set_num_threads() calls
- other minor changes
* Replaced std::vector-based interfaces with HostDeviceVector-based interfaces.
- replacement was performed in the learner, boosters, predictors,
updaters, and objective functions
- only interfaces used in training were replaced;
interfaces like PredictInstance() still use std::vector
- refactoring necessary for replacement of interfaces was also performed,
such as using HostDeviceVector in prediction cache
* HostDeviceVector-based interfaces for custom objective function example plugin.
- thrust::copy() called from dvec::copy() for gpairs invoked a GPU kernel instead of
cudaMemcpy()
- this resulted in illegal memory access if the GPU running the kernel could not access
the data being copied
- new version of dvec::copy() for thrust::device_ptr iterators calls cudaMemcpy(),
avoiding the problem.
* Added GPU objective function and no-copy interface.
- xgboost::HostDeviceVector<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
directly on GPU
- Implement colsampling, subsampling for gpu_hist_experimental
- Optimised multi-GPU implementation for gpu_hist_experimental
- Make nccl optional
- Add Volta architecture flag
- Optimise RegLossObj
- Add timing utilities for debug verbose mode
- Bump required cuda version to 8.0
* Fatal error if GPU algorithm selected without GPU support compiled
* Resolve type conversion warnings
* Fix gpu unit test failure
* Fix compressed iterator edge case
* Fix python unit test failures due to flake8 update on pip
Problem:
Fast histogram updater crashes whenever subsampling picks zero rows
Diagnosis:
Row set data structure uses "nullptr" internally to indicate a non-existent
row set. Since you cannot take the address of the first element of an empty
vector, a valid row set ends up getting "nullptr" as well.
Fix:
Use an arbitrary value (not equal to "nullptr") to bypass nullptr check.
* Patch to improve multithreaded performance scaling
Change the parallel strategy for histogram construction.
Rather than partitioning data rows among multiple threads, partition feature
columns. Useful heuristics for assigning partitions have been adopted
from the LightGBM project.
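A minimal sketch of column-wise partitioning, run sequentially to keep it self-contained. The function name and the round-robin assignment of features to workers are assumptions for illustration; the LightGBM-derived heuristics mentioned above are not reproduced here. The point it shows is that each feature's histogram is written by exactly one worker, so no cross-thread reduction is needed.

```python
def build_hist_by_feature(bin_index, grad, n_bins, n_workers):
    """Accumulate per-feature gradient histograms, one owner per feature.

    bin_index[r][f] is the bin of feature f in row r; grad[r] is the
    gradient of row r. Workers are simulated by an outer loop
    (hypothetical round-robin ownership: worker t owns features
    t, t + n_workers, ...).
    """
    n_features = len(bin_index[0])
    hist = [[0.0] * n_bins for _ in range(n_features)]
    for worker in range(n_workers):
        for f in range(worker, n_features, n_workers):
            # Only this worker ever touches hist[f].
            for row, g in zip(bin_index, grad):
                hist[f][row[f]] += g
    return hist


bins = [[0, 1], [1, 1], [0, 0]]   # 3 rows, 2 features
grads = [1.0, 2.0, 4.0]
hist_one = build_hist_by_feature(bins, grads, n_bins=2, n_workers=1)
hist_two = build_hist_by_feature(bins, grads, n_bins=2, n_workers=2)
```

The result does not depend on the number of workers, because ownership of a feature never overlaps.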
* Add missing header to satisfy MSVC
* Restore max_bin and related parameters to TrainParam
* Fix lint error
* inline functions do not require static keyword
* Feature grouping algorithm accepting FastHistParam
The feature grouping algorithm accepts many parameters (3+), and passing them
one by one is cumbersome. Instead, simply pass a reference to FastHistParam. The
definition of FastHistParam has been moved to a separate header file to
accommodate this change.
Reported in issue #2165. Dynamic scheduling of OpenMP loops involves
implicit synchronization. To implement synchronization, libgomp uses a futex
(fast userspace mutex), whereas MinGW uses a kernel-space mutex, which is more
costly. With a chunk size of 1, synchronization overhead may become prohibitive
on Windows machines.
Solution: use 'guided' schedule to minimize the number of syncs
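A simplified model of why a guided schedule needs fewer synchronizations. It assumes each chunk handout costs one synchronization and that a guided chunk is the number of remaining iterations divided by the thread count, rounded up; real OpenMP runtimes choose guided chunk sizes in an implementation-defined way, so the counts here are only indicative. The function names are hypothetical.

```python
import math


def dynamic_handouts(n_iters, chunk=1):
    """Number of chunk handouts for schedule(dynamic, chunk)."""
    return math.ceil(n_iters / chunk)


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Chunk sizes under a simplified guided schedule (assumed formula)."""
    remaining = n_iters
    chunks = []
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / n_threads))
        size = min(size, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


chunks = guided_chunks(1000, n_threads=4)
n_dynamic = dynamic_handouts(1000, chunk=1)
n_guided = len(chunks)
```

Under this model the guided schedule starts with large chunks and shrinks them toward the end, so the number of handouts grows much more slowly than the iteration count.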
* Add UpdatePredictionCache() option to updaters
Some updaters (e.g. fast_hist) have enough information to quickly compute
the prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.
* Respond to code review
* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale
* Improve multi-threaded performance
* Use columnwise accessor to accelerate ApplySplit() step,
with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()
* Add missing #if macro
* Respond to code review
* Use wrapper to enable parallel sort on Linux
* Fix C++ compatibility issues
* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword
* Fix lint issues
* Respond to code review
* Fix bug in ApplySplitSparseData()
* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows has missing values
* Fix training continuation bug
Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.
* Add regression test for fast_hist
* Respond to code review
* Add back old version of ApplySplitSparseData
* Support histogram-based algorithm + multiple tree growing strategy
* Add a brand new updater to support the histogram-based algorithm, which buckets
continuous features into discrete bins to speed up training. To use it, set
`tree_method = fast_hist` in the configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
* `grow_policy=depthwise` (default): favor splitting at nodes closest to the
root, i.e. grow depth-wise.
* `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
* Unroll critical loops
* Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`
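The two growing policies can be pictured as two orderings of one priority queue. The sketch below is an illustration under assumptions, not the updater's code: the function name and the `(node_id, depth, loss_change)` tuple layout are hypothetical, and the candidates are fixed up front rather than generated as the tree grows. An insertion counter breaks ties so that equal keys pop in a consistent order.

```python
import heapq
from itertools import count


def expansion_order(candidates, grow_policy):
    """Return node ids in the order they would be expanded.

    candidates: list of (node_id, depth, loss_change) tuples (hypothetical layout).
    """
    ticket = count()   # insertion order, used only to break ties
    heap = []
    for node_id, depth, loss_change in candidates:
        if grow_policy == "depthwise":
            key = depth            # nodes closest to the root first
        elif grow_policy == "lossguide":
            key = -loss_change     # highest loss change first
        else:
            raise ValueError("unknown grow_policy: %r" % grow_policy)
        heapq.heappush(heap, (key, next(ticket), node_id))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


nodes = [(1, 1, 0.5), (2, 1, 2.0), (3, 2, 3.0)]
depthwise_order = expansion_order(nodes, "depthwise")
lossguide_order = expansion_order(nodes, "lossguide")
```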
* Adding a small test for hist method
* Fix memory error in row_set.h
When std::vector is resized, a reference to one of its element may become
stale. Any such reference must be updated as well.
* Resolve cross-platform compilation issues
* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
initializer and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
(which use `using` to declare type aliases). So always use `typedef`.
* Fix a host of CI issues
* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging
* Enable tree_method=hist in R
* Renaming HistMaker to GHistBuilder to avoid confusion
* Fix R integration
* Respond to style comments
* Consistent tie-breaking for priority queue using timestamps
* Last-minute style fixes
* Fix issuecomment-271977647
The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assigns both 0's and 1's to the same single bin.
Why? gmat contains only the smallest value (0) and an upper bound (2), which is
twice the maximum value (1). Add the maximum value itself to gmat to fix the issue.
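A small sketch of the collision described above. It assumes that a value's bin is found with an upper-bound search over the sorted cut values, which this log does not state; the helper `bin_of` and the cut lists are illustrative only.

```python
from bisect import bisect_right


def bin_of(value, cuts):
    """Bin index = position of the first cut strictly greater than value (assumed rule)."""
    return bisect_right(cuts, value)


old_cuts = [0.0, 2.0]        # smallest value and the upper bound only
new_cuts = [0.0, 1.0, 2.0]   # maximum value (1) added as well

old_bins = (bin_of(0.0, old_cuts), bin_of(1.0, old_cuts))
new_bins = (bin_of(0.0, new_cuts), bin_of(1.0, new_cuts))
```

With the old cuts both values land in the same bin; with the maximum value added they are separated.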
* Fix issuecomment-272266358
* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe
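One way to picture midpoint boundaries for categorical or ordinal values is shown below. The helper `midpoint_cuts` is hypothetical and only illustrates the idea of placing each boundary halfway between adjacent distinct values; it is not taken from the quantile sketch code.

```python
from bisect import bisect_right


def midpoint_cuts(values):
    """Boundaries halfway between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


cuts = midpoint_cuts([0, 1, 1, 0, 2])
bins = [bisect_right(cuts, v) for v in (0, 1, 2)]
```

Because every boundary sits strictly between two observed values, no observed value falls exactly on a boundary, and each distinct value gets its own bin.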
* Fix CI issue -- do not use xrange(*)
* Fix corner case in quantile sketch
Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>
* Adding a test for an edge case in quantile sketcher
max_bin=2 used to cause an exception.
* Fix fast_hist test
The test used to require a non-decreasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See the log below.)
Solution: do not require monotonic increase for this particular example.
[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
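The blip in the log above can be checked mechanically. The sketch below copies the test-auc values from that log; the helper name `non_decreasing` is an assumption for illustration, and the relaxed check (final AUC equals 1) is only one possible replacement for the monotonicity requirement.

```python
def non_decreasing(values):
    """True if no value is smaller than the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# test-auc per iteration, copied from the log above
test_auc = [0.999497, 0.999749, 0.999749, 0.999749, 0.999749,
            0.999497, 1.0, 1.0, 1.0, 1.0]

is_monotone = non_decreasing(test_auc)   # fails because of iteration 5
reaches_one = test_auc[-1] == 1.0        # a relaxed check still passes
```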