119 Commits

Author SHA1 Message Date
Andy Adinets
73142041b9
For histograms, opting into maximum shared memory available per block. (#5491) 2020-04-21 14:56:42 +12:00
Rory Mitchell
d6d1035950
gpu_hist performance fixes (#5558)
* Remove unnecessary cuda API calls

* Fix histogram memory growth
2020-04-19 12:21:13 +12:00
Rory Mitchell
e268fb0093
Use thrust functions instead of custom functions (#5544) 2020-04-16 21:41:16 +12:00
Rory Mitchell
ca4e05660e
Purge device_helpers.cuh (#5534)
* Simplifications with caching_device_vector

* Purge device helpers
2020-04-15 21:51:56 +12:00
Jiaming Yuan
0012f2ef93
Upgrade clang-tidy on CI. (#5469)
* Correct all clang-tidy errors.
* Upgrade clang-tidy to 10 on CI.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-05 04:42:29 +08:00
Rory Mitchell
13b10a6370
Device dmatrix (#5420) 2020-03-28 14:42:21 +13:00
Rory Mitchell
b745b7acce
Fix memory usage of device sketching (#5407) 2020-03-14 13:43:24 +13:00
sriramch
b81f8cbbc0
Move segment sorter to common (#5378)
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
- it also wraps all the bald device pointers into span.
2020-02-29 15:42:07 +08:00
Jiaming Yuan
29eeea709a
Pass shared pointer instead of raw pointer to Learner. (#5302)
Extracted from https://github.com/dmlc/xgboost/pull/5220 .
2020-02-11 14:16:38 +08:00
Rong Ou
e4b74c4d22
Gradient based sampling for GPU Hist (#5093)
* Implement gradient based sampling for GPU Hist tree method.
* Add samplers and handle compacted page in GPU Hist.
2020-02-04 10:31:27 +08:00
Rory Mitchell
87ebfc1315
Implement cudf construction with adapters. (#5189) 2020-01-09 20:23:06 +13:00
sriramch
ee81ba8e1f implementation of map ranking algorithm on gpu (#5129)
* - implementation of map ranking algorithm
  - also effected necessary suggestions mentioned in the earlier ranking pr's
  - made some performance improvements to the ndcg algo as well
2019-12-27 12:05:37 +13:00
Jiaming Yuan
7663de956c
Run training with empty DMatrix. (#4990)
This makes GPU Hist robust in distributed environment as some workers might not
be associated with any data in either training or evaluation.

* Disable rabit mock test for now: See #5012 .

* Disable dask-cudf test at prediction for now: See #5003

* Launch dask job for all workers despite they might not have any data.
* Check 0 rows in elementwise evaluation metrics.

   Using AUC and AUC-PR still throws an error.  See #4663 for a robust fix.

* Add tests for edge cases.
* Add `LaunchKernel` wrapper handling zero sized grid.
* Move some parts of allreducer into a cu file.
* Don't validate feature names when the booster is empty.

* Sync number of columns in DMatrix.

  As num_feature is required to be the same across all workers in data split
  mode.

* Filtering in dask interface now by default syncs all booster that's not
empty, instead of using rank 0.

* Fix Jenkins' GPU tests.

* Install dask-cuda from source in Jenkins' test.

  Now all tests are actually running.

* Restore GPU Hist tree synchronization test.

* Check UUID of running devices.

  The check is only performed on CUDA version >= 10.x, as 9.x doesn't have UUID field.

* Fix CMake policy and project variables.

  Use xgboost_SOURCE_DIR uniformly, add policy for CMake >= 3.13.

* Fix copying data to CPU

* Fix race condition in cpu predictor.

* Fix duplicated DMatrix construction.

* Don't download extra nccl in CI script.
2019-11-06 16:13:13 +08:00
Rong Ou
5b1715d97c Write ELLPACK pages to disk (#4879)
* add ellpack source
* add batch param
* extract function to parse cache info
* construct ellpack info separately
* push batch to ellpack page
* write ellpack page.
* make sparse page source reusable
2019-10-22 23:44:32 -04:00
sriramch
310fe60b35 Pairwise ranking objective implementation on gpu (#4873)
* - pairwise ranking objective implementation on gpu
   - there are couple of more algorithms (ndcg and map) for which support will be added
     as follow-up pr's
   - with no label groups defined, get gradient is 90x faster on gpu (120m instance
     mortgage dataset)
   - it can perform by an order of magnitude faster with ~ 10 groups (and adequate cores
     for the cpu implementation)

* Add JSON config to rank obj.
2019-10-22 23:40:07 -04:00
Jiaming Yuan
b61d534472
Span: use size_t' for index_type, add front' and `back'. (#4935)
* Use `size_t' for index_type.  Add `front' and `back'.

* Remove a batch of `static_cast'.
2019-10-14 09:13:33 -04:00
Rory Mitchell
aefb1e5c2f
Resolve dask performance issues (#4914)
* Set dask client.map as impure function

* Remove nrows

* Remove slow check in verbose mode
2019-10-10 16:01:30 +13:00
Jiaming Yuan
095de3bf5f
Export c++ headers in CMake installation. (#4897)
* Move get transpose into cc.

* Clean up headers in host device vector, remove thrust dependency.

* Move span and host device vector into public.

* Install c++ headers.

* Short notes for c and c++.

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2019-10-06 23:53:09 -04:00
Jiaming Yuan
5374f52531
Complete cudf support. (#4850)
* Handles missing value.
* Accept all floating point and integer types.
* Move to cudf 9.0 API.
* Remove requirement on `null_count`.
* Arbitrary column types support.
2019-09-16 23:52:00 -04:00
Rong Ou
733ed24dd9 further cleanup of single process multi-GPU code (#4810)
* use subspan in gpu predictor instead of copying
* Revise `HostDeviceVector`
2019-08-30 05:27:23 -04:00
Rong Ou
38ab79f889 Make HostDeviceVector single gpu only (#4773)
* Make HostDeviceVector single gpu only
2019-08-26 09:51:13 +12:00
Jiaming Yuan
9700776597 Cudf support. (#4745)
* Initial support for cudf integration.

* Add two C APIs for consuming data and metainfo.

* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.

* Add FromDeviceColumnar for consuming device data.

* Add new MetaInfo::SetInfo for consuming label, weight etc.
2019-08-19 16:51:40 +12:00
sriramch
7a388cbf8b Modify caching allocator/vector and fix issues relating to inability to train large datasets (#4615) 2019-07-09 18:33:27 +12:00
Jiaming Yuan
d9a47794a5 Fix CPU hist init for sparse dataset. (#4625)
* Fix CPU hist init for sparse dataset.

* Implement sparse histogram cut.
* Allow empty features.

* Fix windows build, don't use sparse in distributed environment.

* Comments.

* Smaller threshold.

* Fix windows omp.

* Fix msvc lambda capture.

* Fix MSVC macro.

* Fix MSVC initialization list.

* Fix MSVC initialization list x2.

* Preserve categorical feature behavior.

* Rename matrix to sparse cuts.
* Reuse UseGroup.
* Check for categorical data when adding cut.

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Sanity check.

* Fix comments.

* Fix comment.
2019-07-04 16:27:03 -07:00
Rory Mitchell
221e163185
Refactor out row partitioning logic from gpu_hist, introduce caching device vectors (#4554) 2019-06-20 18:24:09 +12:00
sriramch
90f683b25b Set the appropriate device before freeing device memory... (#4566)
* - set the appropriate device before freeing device memory...
   - pr #4532 added a global memory tracker/logger to keep track of number of (de)allocations
     and peak memory usage on a per device basis.
   - this pr adds the appropriate check to make sure that the (de)allocation counts and memory usages
     makes sense for the device. since verbosity is typically increased on debug/non-retail builds.  
* - pre-create cub allocators and reuse them
   - create them once and not resize them dynamically. we need to ensure that these allocators
     are created and destroyed exactly once so that the appropriate device id's are set
2019-06-18 14:58:05 +12:00
Rory Mitchell
9683fd433e
Overload device memory allocation (#4532)
* Group source files, include headers in source files

* Overload device memory allocation
2019-06-10 11:35:13 +12:00
Rory Mitchell
fbbae3386a
Smarter choice of histogram construction for distributed gpu_hist (#4519)
* Smarter choice of histogram construction for distributed gpu_hist

* Limit omp team size in ExecuteShards
2019-05-31 14:11:34 +12:00
Rong Ou
df2cdaca50 add cuda 10.1 support (#4468) 2019-05-14 18:30:58 +00:00
Rory Mitchell
5e582b0fa7
Combine thread launches into single launch per tree for gpu_hist (#4343)
* Combine thread launches into single launch per tree for gpu_hist
algorithm.

* Address deprecation warning

* Add manual column sampler constructor

* Turn off omp dynamic to get a guaranteed number of threads

* Enable openmp in cuda code
2019-04-29 09:58:34 +12:00
Rory Mitchell
3f312e30db
Retire DVec class in favour of c++20 style span for device memory. (#4293) 2019-03-28 13:59:58 +13:00
Rory Mitchell
00465d243d
Optimisations for gpu_hist. (#4248)
* Optimisations for gpu_hist.

* Use streams to overlap operations.

* ColumnSampler now uses HostDeviceVector to prevent repeatedly copying feature vectors to the device.
2019-03-20 13:30:06 +13:00
Jiaming Yuan
cf8d5b9b76
Mark CUDA 10.1 as unsupported. (#4265) 2019-03-17 16:59:15 +08:00
Jiaming Yuan
7b9043cf71
Fix clang-tidy warnings. (#4149)
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.

* Reformat.
2019-03-13 02:25:51 +08:00
Rory Mitchell
4eeeded7d1
Remove various synchronisations from cuda API calls, instrument monitor (#4205)
* Remove various synchronisations from cuda API calls, instrument monitor
with nvtx profiler ranges.
2019-03-10 15:01:23 +13:00
Matthew Jones
92b7577c62 [REVIEW] Enable Multi-Node Multi-GPU functionality (#4095)
* Initial commit to support multi-node multi-gpu xgboost using dask

* Fixed NCCL initialization by not ignoring the opg parameter.

- it now crashes on NCCL initialization, but at least we're attempting it properly

* At the root node, perform a rabit::Allreduce to get initial sum_gradient across workers

* Synchronizing in a couple of more places.

- now the workers don't go down, but just hang
- no more "wild" values of gradients
- probably needs syncing in more places

* Added another missing max-allreduce operation inside BuildHistLeftRight

* Removed unnecessary collective operations.

* Simplified rabit::Allreduce() sync of gradient sums.

* Removed unnecessary rabit syncs around ncclAllReduce.

- this improves performance _significantly_ (7x faster for overall training,
  20x faster for xgboost proper)

* pulling in latest xgboost

* removing changes to updater_quantile_hist.cc

* changing use_nccl_opg initialization, removing unnecessary if statements

* added definition for opaque ncclUniqueId struct to properly encapsulate GetUniqueId

* placing struct defintion in guard to avoid duplicate code errors

* addressing linting errors

* removing

* removing additional arguments to AllReduer initialization

* removing distributed flag

* making comm init symmetric

* removing distributed flag

* changing ncclCommInit to support multiple modalities

* fix indenting

* updating ncclCommInitRank block with necessary group calls

* fix indenting

* adding print statement, and updating accessor in vector

* improving print statement to end-line

* generalizing nccl_rank construction using rabit

* assume device_ordinals is the same for every node

* test, assume device_ordinals is identical for all nodes

* test, assume device_ordinals is unique for all nodes

* changing names of offset variable to be more descriptive, editing indenting

* wrapping ncclUniqueId GetUniqueId() and aesthetic changes

* adding synchronization, and tests for distributed

* adding  to tests

* fixing broken #endif

* fixing initialization of gpu histograms, correcting errors in tests

* adding to contributors list

* adding distributed tests to jenkins

* fixing bad path in distributed test

* debugging

* adding kubernetes for distributed tests

* adding proper import for OrderedDict

* adding urllib3==1.22 to address ordered_dict import error

* added sleep to allow workers to save their models for comparison

* adding name to GPU contributors under docs
2019-03-02 10:03:22 +13:00
Rory Mitchell
71a604fae3
Fix for windows compilation (#4139) 2019-02-17 19:42:32 +13:00
Jiaming Yuan
f8ca2960fc
Use nccl group calls to prevent from dead lock. (#4113)
* launch all reduce sequentially.
* Fix gpu_exact test memory leak.
2019-02-08 06:12:39 +08:00
Jiaming Yuan
017c97b8ce
Clean up training code. (#3825)
* Remove GHistRow, GHistEntry, GHistIndexRow.
* Remove kSimpleStats.
* Remove CheckInfo, SetLeafVec in GradStats and in SKStats.
* Clean up the GradStats.
* Cleanup calcgain.
* Move LossChangeMissing out of common.
* Remove [] operator from GHistIndexBlock.
2019-02-07 14:22:13 +08:00
Jiaming Yuan
9897b5042f
Use Span in GPU exact updater. (#4020)
* Use Span in GPU exact updater.

* Add a small test.
2018-12-26 12:44:46 +08:00
Jiaming Yuan
c8c7b9649c
Fix and optimize logger (#4002)
* Fix logging switch statement.

* Remove debug_verbose_ in AllReducer.

* Don't construct the stream when not needed.

* Make default constructor deleted.

* Remove redundant IsVerbose.
2018-12-17 19:23:05 +08:00
Jiaming Yuan
e0a279114e
Unify logging facilities. (#3982)
* Unify logging facilities.

* Enhance `ConsoleLogger` to handle different verbosity.
* Override macros from `dmlc`.
* Don't use specialized gamma when building with GPU.
* Remove verbosity cache in monitor.
* Test monitor.
* Deprecate `silent`.
* Fix doc and messages.
* Fix python test.
* Fix silent tests.
2018-12-14 19:29:58 +08:00
Andy Adinets
4be5edaf92 Initialized AllReducer counters to 0. (#3987) 2018-12-12 09:09:20 +13:00
Rory Mitchell
93f9ce9ef9
Single precision histograms on GPU (#3965)
* Allow single precision histogram summation in gpu_hist

* Add python test, reduce run-time of gpu_hist tests

* Update documentation
2018-12-10 10:55:30 +13:00
Rory Mitchell
a9d684db18
GPU performance logging/improvements (#3945)
- Improved GPU performance logging

- Only use one execute shards function

- Revert performance regression on multi-GPU

- Use threads to launch NCCL AllReduce
2018-11-29 14:36:51 +13:00
Rory Mitchell
7af0946ac1
Improve update position function for gpu_hist (#3895) 2018-11-14 19:33:29 +13:00
Rory Mitchell
926eb651fe
Minor refactor of split evaluation in gpu_hist (#3889)
* Refactor evaluate split into shard

* Use span in evaluate split

* Update google tests
2018-11-14 00:11:20 +13:00
Jiaming Yuan
f1275f52c1
Fix specifying gpu_id, add tests. (#3851)
* Rewrite gpu_id related code.

* Remove normalised/unnormalised operatios.
* Address difference between `Index' and `Device ID'.
* Modify doc for `gpu_id'.
* Better LOG for GPUSet.
* Check specified n_gpus.
* Remove inappropriate `device_idx' term.
* Clarify GpuIdType and size_t.
2018-11-06 18:17:53 +13:00
trivialfis
516457fadc Add basic unittests for gpu-hist method. (#3785)
* Split building histogram into separated class.
* Extract `InitCompressedRow` definition.
* Basic tests for gpu-hist.
* Document the code more verbosely.
* Removed `HistCutUnit`.
* Removed some duplicated copies in `GPUHistMaker`.
* Implement LCG and use it in tests.
2018-10-15 15:47:00 +13:00
trivialfis
c6b5df67f6 Catch dmlc::Error. (#3751)
Fix #3643.
2018-10-04 16:51:38 +13:00