xgboost

Author	SHA1	Message	Date
Rory Mitchell	fbbae3386a	Smarter choice of histogram construction for distributed gpu_hist (#4519 ) * Smarter choice of histogram construction for distributed gpu_hist * Limit omp team size in ExecuteShards	2019-05-31 14:11:34 +12:00
sriramch	fed665ae8a	- training with external memory part 1 of 2 (#4486 ) * - training with external memory part 1 of 2 - this pr focuses on computing the quantiles using multiple gpus on a dataset that uses the external cache capabilities - there will a follow-up pr soon after this that will support creation of histogram indices on large dataset as well - both of these changes are required to support training with external memory - the sparse pages in dmatrix are taken in batches and the the cut matrices are incrementally built - also snuck in some (perf) changes related to sketches aggregation amongst multiple features across multiple sparse page batches. instead of aggregating the summary inside each device and merged later, it is aggregated in-place when the device is working on different rows but the same feature	2019-05-30 08:18:34 +12:00
Jiaming Yuan	c589eff941	De-duplicate GPU parameters. (#4454 ) * Only define `gpu_id` and `n_gpus` in `LearnerTrainParam` * Pass LearnerTrainParam through XGBoost vid factory method. * Disable all GPU usage when GPU related parameters are not specified (fixes XGBoost choosing GPU over aggressively). * Test learner train param io. * Fix gpu pickling.	2019-05-29 11:55:57 +08:00
Rong Ou	eaab364a63	More explict sharding methods for device memory (#4396 ) * Rename the Reshard method to Shard * Add a new Reshard method for sharding a vector that's already sharded	2019-05-01 11:47:22 +12:00
Rory Mitchell	5e582b0fa7	Combine thread launches into single launch per tree for gpu_hist (#4343 ) * Combine thread launches into single launch per tree for gpu_hist algorithm. * Address deprecation warning * Add manual column sampler constructor * Turn off omp dynamic to get a guaranteed number of threads * Enable openmp in cuda code	2019-04-29 09:58:34 +12:00
Jiaming Yuan	207f058711	Refactor CMake scripts. (#4323 ) * Refactor CMake scripts. * Remove CMake CUDA wrapper. * Bump CMake version for CUDA. * Use CMake to handle Doxygen. * Split up CMakeList. * Export install target. * Use modern CMake. * Remove build.sh * Workaround for gpu_hist test. * Use cmake 3.12. * Revert machine.conf. * Move CLI test to gpu. * Small cleanup. * Support using XGBoost as submodule. * Fix windows * Fix cpp tests on Windows * Remove duplicated find_package.	2019-04-15 10:08:12 -07:00
Jiaming Yuan	5c2575535f	Fix Histogram allocation. (#4347 ) * Fix Histogram allocation. nidx_map is cleared after `Reset`, but histogram data size isn't changed hence histogram recycling is used in later iterations. After a reset(building new tree), newly allocated node will start from 0, while recycling always choose the node with smallest index, which happens to be our newly allocated node 0.	2019-04-10 19:21:26 +08:00
Rory Mitchell	3f312e30db	Retire DVec class in favour of c++20 style span for device memory. (#4293 )	2019-03-28 13:59:58 +13:00
Rory Mitchell	6d5b34d824	Further optimisations for gpu_hist. (#4283 ) - Fuse final update position functions into a single more efficient kernel - Refactor gpu_hist with a more explicit ellpack matrix representation	2019-03-24 17:17:22 +13:00
Rory Mitchell	00465d243d	Optimisations for gpu_hist. (#4248 ) * Optimisations for gpu_hist. * Use streams to overlap operations. * ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.	2019-03-20 13:30:06 +13:00
Jiaming Yuan	7b9043cf71	Fix clang-tidy warnings. (#4149 ) * Upgrade gtest for clang-tidy. * Use CMake to install GTest instead of mv. * Don't enforce clang-tidy to return 0 due to errors in thrust. * Add a small test for tidy itself. * Reformat.	2019-03-13 02:25:51 +08:00
Rory Mitchell	4eeeded7d1	Remove various synchronisations from cuda API calls, instrument monitor (#4205 ) * Remove various synchronisations from cuda API calls, instrument monitor with nvtx profiler ranges.	2019-03-10 15:01:23 +13:00
Rong Ou	0944360416	minor fix: log InitDataOnce() only when it is actually called (#4206 )	2019-03-08 10:53:09 +13:00
Matthew Jones	92b7577c62	[REVIEW] Enable Multi-Node Multi-GPU functionality (#4095 ) * Initial commit to support multi-node multi-gpu xgboost using dask * Fixed NCCL initialization by not ignoring the opg parameter. - it now crashes on NCCL initialization, but at least we're attempting it properly * At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers * Synchronizing in a couple of more places. - now the workers don't go down, but just hang - no more "wild" values of gradients - probably needs syncing in more places * Added another missing max-allreduce operation inside BuildHistLeftRight * Removed unnecessary collective operations. * Simplified rabit::Allreduce() sync of gradient sums. * Removed unnecessary rabit syncs around ncclAllReduce. - this improves performance _significantly_ (7x faster for overall training, 20x faster for xgboost proper) * pulling in latest xgboost * removing changes to updater_quantile_hist.cc * changing use_nccl_opg initialization, removing unnecessary if statements * added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId * placing struct defintion in guard to avoid duplicate code errors * addressing linting errors * removing * removing additional arguments to AllReduer initialization * removing distributed flag * making comm init symmetric * removing distributed flag * changing ncclCommInit to support multiple modalities * fix indenting * updating ncclCommInitRank block with necessary group calls * fix indenting * adding print statement, and updating accessor in vector * improving print statement to end-line * generalizing nccl_rank construction using rabit * assume device_ordinals is the same for every node * test, assume device_ordinals is identical for all nodes * test, assume device_ordinals is unique for all nodes * changing names of offset variable to be more descriptive, editing indenting * wrapping ncclUniqueId GetUniqueId() and aesthetic changes * adding synchronization, and tests for distributed * adding to tests * fixing broken #endif * fixing initialization of gpu histograms, correcting errors in tests * adding to contributors list * adding distributed tests to jenkins * fixing bad path in distributed test * debugging * adding kubernetes for distributed tests * adding proper import for OrderedDict * adding urllib3==1.22 to address ordered_dict import error * added sleep to allow workers to save their models for comparison * adding name to GPU contributors under docs	2019-03-02 10:03:22 +13:00
Rory Mitchell	c8c472f39a	Fix incorrect device in multi-GPU algorithm (#4161 )	2019-02-20 09:23:15 +13:00
Jiaming Yuan	f8ca2960fc	Use nccl group calls to prevent from dead lock. (#4113 ) * launch all reduce sequentially. * Fix gpu_exact test memory leak.	2019-02-08 06:12:39 +08:00
Jiaming Yuan	017c97b8ce	Clean up training code. (#3825 ) * Remove GHistRow, GHistEntry, GHistIndexRow. * Remove kSimpleStats. * Remove CheckInfo, SetLeafVec in GradStats and in SKStats. * Clean up the GradStats. * Cleanup calcgain. * Move LossChangeMissing out of common. * Remove [] operator from GHistIndexBlock.	2019-02-07 14:22:13 +08:00
Rory Mitchell	1fc37e4749	Require leaf statistics when expanding tree (#4015 ) * Cache left and right gradient sums * Require leaf statistics when expanding tree	2019-01-17 21:12:20 -08:00
Jiaming Yuan	9897b5042f	Use Span in GPU exact updater. (#4020 ) * Use Span in GPU exact updater. * Add a small test.	2018-12-26 12:44:46 +08:00
Rory Mitchell	f75a21af25	Reduce tree expand boilerplate code (#4008 )	2018-12-20 15:52:28 +13:00
Jiaming Yuan	c8c7b9649c	Fix and optimize logger (#4002 ) * Fix logging switch statement. * Remove debug_verbose_ in AllReducer. * Don't construct the stream when not needed. * Make default constructor deleted. * Remove redundant IsVerbose.	2018-12-17 19:23:05 +08:00
Andy Adinets	42bf90eb8f	Column sampling at individual nodes (splits). (#3971 ) * Column sampling at individual nodes (splits). * Documented colsample_bynode parameter. - also updated documentation for colsample_by* parameters * Updated documentation. * GetFeatureSet() returns shared pointer to std::vector. * Sync sampled columns across multiple processes.	2018-12-14 22:37:35 +08:00
Jiaming Yuan	e0a279114e	Unify logging facilities. (#3982 ) * Unify logging facilities. * Enhance `ConsoleLogger` to handle different verbosity. * Override macros from `dmlc`. * Don't use specialized gamma when building with GPU. * Remove verbosity cache in monitor. * Test monitor. * Deprecate `silent`. * Fix doc and messages. * Fix python test. * Fix silent tests.	2018-12-14 19:29:58 +08:00
Rory Mitchell	93f9ce9ef9	Single precision histograms on GPU (#3965 ) * Allow single precision histogram summation in gpu_hist * Add python test, reduce run-time of gpu_hist tests * Update documentation	2018-12-10 10:55:30 +13:00
Rory Mitchell	a9d684db18	GPU performance logging/improvements (#3945 ) - Improved GPU performance logging - Only use one execute shards function - Revert performance regression on multi-GPU - Use threads to launch NCCL AllReduce	2018-11-29 14:36:51 +13:00
Rory Mitchell	7af0946ac1	Improve update position function for gpu_hist (#3895 )	2018-11-14 19:33:29 +13:00
Rory Mitchell	926eb651fe	Minor refactor of split evaluation in gpu_hist (#3889 ) * Refactor evaluate split into shard * Use span in evaluate split * Update google tests	2018-11-14 00:11:20 +13:00
Jiaming Yuan	f1275f52c1	Fix specifying gpu_id, add tests. (#3851 ) * Rewrite gpu_id related code. * Remove normalised/unnormalised operatios. * Address difference between `Index' and `Device ID'. * Modify doc for `gpu_id'. * Better LOG for GPUSet. * Check specified n_gpus. * Remove inappropriate `device_idx' term. * Clarify GpuIdType and size_t.	2018-11-06 18:17:53 +13:00
trivialfis	516457fadc	Add basic unittests for gpu-hist method. (#3785 ) * Split building histogram into separated class. * Extract `InitCompressedRow` definition. * Basic tests for gpu-hist. * Document the code more verbosely. * Removed `HistCutUnit`. * Removed some duplicated copies in `GPUHistMaker`. * Implement LCG and use it in tests.	2018-10-15 15:47:00 +13:00
Rory Mitchell	70d208d68c	Dmatrix refactor stage 2 (#3395 ) * DMatrix refactor 2 * Remove buffered rowset usage where possible * Transition to c++11 style iterators for row access * Transition column iterators to C++ 11	2018-10-01 01:29:03 +13:00
trivialfis	5a7f7e7d49	Implement devices to devices reshard. (#3721 ) * Force clearing device memory before Reshard. * Remove calculating row_segments for gpu_hist and gpu_sketch. * Guard against changing device.	2018-09-28 17:40:23 +12:00
Andy Adinets	0f99cdfe0e	Fixed an uninitialized pointer. (#3703 )	2018-09-16 18:02:31 +12:00
Andy Adinets	f606cb8ef4	Fixed the performance regression within EvaluateSplits(). (#3680 ) - it turns out creating an std::vector on every call is faster than cudaMallocHost()/cudaFreeHost()	2018-09-08 14:48:45 +12:00
Andy Adinets	72cd1517d6	Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage. (#3446 ) * Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage. - added distributions to HostDeviceVector - using HostDeviceVector for labels, weights and base margings in MetaInfo - using HostDeviceVector for offset and data in SparsePage - other necessary refactoring * Added const version of HostDeviceVector API calls. - const versions added to calls that can trigger data transfers, e.g. DevicePointer() - updated the code that uses HostDeviceVector - objective functions now accept const HostDeviceVector<bst_float>& for predictions * Updated src/linear/updater_gpu_coordinate.cu. * Added read-only state for HostDeviceVector sync. - this means no copies are performed if both host and devices access the HostDeviceVector read-only * Fixed linter and test errors. - updated the lz4 plugin - added ConstDeviceSpan to HostDeviceVector - using device % dh::NVisibleDevices() for the physical device number, e.g. in calls to cudaSetDevice() * Fixed explicit template instantiation errors for HostDeviceVector. - replaced HostDeviceVector<unsigned int> with HostDeviceVector<int> * Fixed HostDeviceVector tests that require multiple GPUs. - added a mock set device handler; when set, it is called instead of cudaSetDevice()	2018-08-30 14:28:47 +12:00
Rory Mitchell	686e990ffc	GPU memory usage fixes + column sampling refactor (#3635 ) * Remove thrust copy calls * Fix histogram memory usage * Cap extreme histogram memory usage * More efficient column sampling * Use column sampler across updaters * More efficient split evaluation on GPU with column sampling	2018-08-27 16:26:46 +12:00
trivialfis	60787ecebc	Merge generic device helper functions into gpu set. (#3626 ) * Remove the use of old NDevices* functions. * Use GPUSet in timer.h.	2018-08-26 18:14:23 +12:00
Rory Mitchell	07ff52d54c	Dynamically allocate GPU histogram memory (#3519 ) * Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15) * Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic. * Fix appveyor R test	2018-07-28 21:22:41 +12:00
Andy Adinets	cc6a5a3666	Added finding quantiles on GPU. (#3393 ) * Added finding quantiles on GPU. - this includes datasets where weights are assigned to data rows - as the quantiles found by the new algorithm are not the same as those found by the old one, test thresholds in tests/python-gpu/test_gpu_updaters.py have been adjusted. * Adjustments and improved testing for finding quantiles on the GPU. - added C++ tests for the DeviceSketch() function - reduced one of the thresholds in test_gpu_updaters.py - adjusted the cuts found by the find_cuts_k kernel	2018-07-27 14:03:16 +12:00
Rory Mitchell	0f145a0365	Resolve GPU bug on large files (#3472 ) Remove calls to thrust copy, fix indexing bug	2018-07-16 20:43:45 +12:00
Thejaswi	2200939416	Upgrading to NCCL2 (#3404 ) * Upgrading to NCCL2 * Part - II of NCCL2 upgradation - Doc updates to build with nccl2 - Dockerfile.gpu update for a correct CI build with nccl2 - Updated FindNccl package to have env-var NCCL_ROOT to take precedence * Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available * Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find * Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime * Need the nccl2 library download instructions inside Dockerfile.release as well * Use NCCL2 as a static library	2018-07-10 00:42:15 -07:00
Thejaswi	0e78034607	Shared memory atomics while building histogram (#3384 ) * Use shared memory atomics for building histograms, whenever possible	2018-06-19 16:03:09 +12:00
Rory Mitchell	a96039141a	Dmatrix refactor stage 1 (#3301 ) * Use sparse page as singular CSR matrix representation * Simplify dmatrix methods * Reduce statefullness of batch iterators * BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.	2018-06-07 10:25:58 +12:00
Andy Adinets	286dccb8e8	GPU binning and compression. (#3319 ) * GPU binning and compression. - binning and index compression are done inside the DeviceShard constructor - in case of a DMatrix with multiple row batches, it is first converted into a single row batch	2018-06-05 17:15:13 +12:00
Andrew V. Adinetz	b8a0d66fe6	Multi-GPU HostDeviceVector. (#3287 ) * Multi-GPU HostDeviceVector. - HostDeviceVector instances can now span multiple devices, defined by GPUSet struct - the interface of HostDeviceVector has been modified accordingly - GPU objective functions are now multi-GPU - GPU predicting from cache is now multi-GPU - avoiding omp_set_num_threads() calls - other minor changes	2018-05-05 08:00:05 +12:00
Thejaswi	c80d51ccb3	Fix issue #3264 , accuracy issues on k80 GPUs. (#3293 )	2018-05-04 13:14:08 +12:00
Rory Mitchell	ccf80703ef	Clang-tidy static analysis (#3222 ) * Clang-tidy static analysis * Modernise checks * Google coding standard checks * Identifier renaming according to Google style	2018-04-19 18:57:13 +12:00
Rory Mitchell	a1ec7b1716	Change reduce operation from thrust to cub. Fix for cuda 9.1 error (#3218 ) * Change reduce operation from thrust to cub. Fix for cuda 9.1 runtime error * Unit test sum reduce	2018-04-04 14:21:48 +12:00
Andrew V. Adinetz	a1b48afa41	Added back UpdatePredictionCache() in updater_gpu_hist.cu. (#3120 ) * Added back UpdatePredictionCache() in updater_gpu_hist.cu. - it had been there before, but wasn't ported to the new version of updater_gpu_hist.cu	2018-03-09 15:06:45 +13:00
Andrew V. Adinetz	d5992dd881	Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. (#3116 ) * Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. - replacement was performed in the learner, boosters, predictors, updaters, and objective functions - only interfaces used in training were replaced; interfaces like PredictInstance() still use std::vector - refactoring necessary for replacement of interfaces was also performed, such as using HostDeviceVector in prediction cache * HostDeviceVector-based interfaces for custom objective function example plugin.	2018-02-28 13:00:04 +13:00
Rory Mitchell	f87802f00c	Fix GPU bugs (#3051 ) * Change uint to unsigned int * Fix no root predictions bug * Remove redundant splitting due to numerical instability	2018-01-23 13:14:15 +13:00

1 2 3 4

161 Commits