141 Commits

Author SHA1 Message Date
Jiaming Yuan
582ea104b5
[EM] Enable prediction cache for GPU. (#10707)
- Use `UpdatePosition` for all nodes and skip `FinalizePosition` when external memory is used.
- Create `encode`/`decode` for node positions; this is just a refactor (a sketch follows this entry).
- Reuse code between update position and finalization.
2024-08-15 21:41:59 +08:00
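For illustration, a minimal sketch of what an encode/decode pair for node positions might look like. The names and the sign-bit layout are assumptions for illustration, not the PR's actual scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical encoding: pack the tree-node index and a "row is still
// valid" flag into one 32-bit word, so position update and finalization
// can share a single representation.  Invalid rows (e.g. sampled out)
// carry a bitwise-negated node index, which is always negative.
constexpr std::int32_t EncodePosition(std::int32_t nidx, bool valid) {
  return valid ? nidx : ~nidx;
}

constexpr std::pair<std::int32_t, bool> DecodePosition(std::int32_t code) {
  bool valid = code >= 0;
  return {valid ? code : ~code, valid};
}

int main() {
  auto code = EncodePosition(42, /*valid=*/false);
  auto [nidx, valid] = DecodePosition(code);
  assert(nidx == 42 && !valid);
  return 0;
}
```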
Jiaming Yuan
d414fdf2e7
[EM] Add GPU version of the external memory QDM. (#10689) 2024-08-10 10:49:43 +08:00
Jiaming Yuan
292bb677e5
[EM] Support mmap backed ellpack. (#10602)
- Support resource view in ellpack.
- Define the CUDA version of the MMAP resource (a POSIX sketch of the idea follows this entry).
- Define the CUDA version of the malloc resource.
- Refactor CUDA runtime API wrappers and add memory-access-related wrappers.
- Gather Windows macros into a single header.
2024-07-18 08:20:21 +08:00
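For illustration, a minimal POSIX sketch of an mmap-backed resource. The class name and interface are hypothetical; the PR's CUDA variants (e.g. registering the mapped range for device access) are omitted here.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Illustrative mmap resource: map a cache file read-only so pages are
// faulted in on demand instead of being read up front.
class MmapResource {
  void* ptr_{nullptr};
  std::size_t size_{0};

 public:
  explicit MmapResource(std::string const& path) {
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) {
      throw std::runtime_error("Failed to open: " + path);
    }
    struct stat st;
    fstat(fd, &st);
    size_ = static_cast<std::size_t>(st.st_size);
    ptr_ = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // The mapping keeps the file alive.
    if (ptr_ == MAP_FAILED) {
      throw std::runtime_error("Failed to mmap: " + path);
    }
  }
  ~MmapResource() {
    if (ptr_ && ptr_ != MAP_FAILED) {
      munmap(ptr_, size_);
    }
  }
  void const* Data() const { return ptr_; }
  std::size_t Size() const { return size_; }
};
```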
Jiaming Yuan
e9fbce9791
Refactor DeviceUVector. (#10595)
Create a wrapper instead of using inheritance to avoid an inconsistent class interface.
2024-07-18 03:33:01 +08:00
Jiaming Yuan
1ca4bfd20e
Avoid thrust vector initialization. (#10544)
* Avoid thrust vector initialization.

- Add a wrapper for the RMM device uvector (a sketch follows this entry).
- Split up the `Resize` method for HDV.
2024-07-11 17:29:27 +08:00
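For context, `thrust::device_vector` launches a kernel to value-initialize its elements on construction and on resize, while `rmm::device_uvector` leaves them uninitialized. A minimal sketch of the wrapper idea, assuming RMM is available; the wrapper's interface here is hypothetical.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstddef>

// Hypothetical wrapper around rmm::device_uvector, which (unlike
// thrust::device_vector) does not launch a kernel to value-initialize
// its elements on construction or resize.
template <typename T>
class DeviceUVector {
  rmm::device_uvector<T> data_;

 public:
  DeviceUVector(std::size_t n, rmm::cuda_stream_view stream) : data_{n, stream} {}
  // Grow or shrink without touching the payload; new elements are
  // left uninitialized.
  void Resize(std::size_t n, rmm::cuda_stream_view stream) { data_.resize(n, stream); }
  T* Data() { return data_.data(); }
  std::size_t Size() const { return data_.size(); }
};
```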
Jiaming Yuan
e8a962575a
[EM] Allow staging ellpack on host for GPU external memory. (#10488)
- New parameter `on_host`.
- Abstract format creation and stream creation into policy classes.
2024-06-28 04:42:18 +08:00
Jiaming Yuan
3f64b4fde3
[coll] Add global functions. (#10203) 2024-04-19 03:17:23 +08:00
Jiaming Yuan
1022909bbe
Fix global config for external memory. (#10173)
Pass the thread-local configuration between threads.
2024-04-11 01:29:28 +08:00
Jiaming Yuan
f0a138f33a
Fix pyspark with verbosity=3. (#10172) 2024-04-09 23:18:56 +08:00
Jiaming Yuan
53fc17578f
Use std::uint64_t for row index. (#10120)
- Use std::uint64_t instead of size_t to avoid an implementation-defined type (see the alias sketch after this entry).
- Rename it to bst_idx_t to account for other types of indexing.
- Small cleanup of the base header.
2024-03-15 18:43:49 +08:00
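A minimal sketch of the renamed alias, following the commit message; the exact definition in the base header may differ.

```cpp
#include <cstdint>

// std::uint64_t has a fixed width on every platform, unlike size_t,
// whose width is implementation-defined.
using bst_idx_t = std::uint64_t;
```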
Jiaming Yuan
39c637ee19
Use array interface in Python prediction return. (#9855) 2023-12-08 03:42:14 +08:00
Jiaming Yuan
8fe1a2213c
Cleanup code for distributed training. (#9805)
* Cleanup code for distributed training.

- Merge `GetNcclResult` into nccl stub.
- Split up utilities from the main dask module.
- Let Channel return `Result` to accommodate the NCCL channel.
- Remove old `use_label_encoder` parameter.
2023-11-25 09:10:56 +08:00
Jiaming Yuan
0715ab3c10
Use dlopen to load NCCL. (#9796)
This PR adds optional support for loading NCCL with `dlopen` as an alternative to compile-time linking, to address the size bloat of the PyPI binary release.
- Add a CMake option to load `nccl` at runtime.
- Add an NCCL stub (the general technique is sketched after this entry).

After this, `nccl` will be fetched from PyPI when XGBoost is installed with pip, either by a user or via `pyproject.toml`. Others who want to link NCCL at compile time can continue to do so without any change.

At the moment, this is Linux-only, since we only support MNMG on Linux.
2023-11-22 19:27:31 +08:00
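For illustration, the general `dlopen` stub technique, assuming a Linux host with `libnccl.so.2` on the loader path. The class name and error handling are hypothetical, not the PR's actual stub; only one symbol is shown.

```cpp
#include <dlfcn.h>
#include <nccl.h>

#include <stdexcept>

// Hypothetical stub: resolve an NCCL symbol at runtime instead of
// linking libnccl at build time, keeping the binary wheel small.
class NcclStub {
  void* handle_{nullptr};

 public:
  // Filled in by dlsym with a pointer to the real ncclAllReduce.
  decltype(&ncclAllReduce) AllReduce{nullptr};

  NcclStub() {
    handle_ = dlopen("libnccl.so.2", RTLD_LAZY | RTLD_LOCAL);
    if (!handle_) {
      throw std::runtime_error("Failed to load libnccl.so.2.");
    }
    AllReduce = reinterpret_cast<decltype(&ncclAllReduce)>(dlsym(handle_, "ncclAllReduce"));
    if (!AllReduce) {
      throw std::runtime_error("Failed to resolve ncclAllReduce.");
    }
  }
  ~NcclStub() {
    if (handle_) {
      dlclose(handle_);
    }
  }
};
```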
Jiaming Yuan
fedd9674c8
Implement column sampler in CUDA. (#9785)
- CUDA implementation.
- Extract the broadcasting logic; we will need the context parameter after revamping the collective implementation.
- Some changes to the event loop to fix a deadlock in CI.
- Move argsort into algorithms.cuh and add support for CUDA streams.
2023-11-17 04:29:08 +08:00
Jiaming Yuan
06bdc15e9b
[coll] Pass context to various functions. (#9772)
* [coll] Pass context to various functions.

In the future, the `Context` object will be required for collective operations; this PR
passes the context object to the relevant functions to prepare for swapping out the
implementation.
2023-11-08 09:54:05 +08:00
Jiaming Yuan
6755179e77
[coll] Add nccl. (#9726) 2023-10-28 16:33:58 +08:00
Rong Ou
7579905e18
Retry switching to per-thread default stream (#9416) 2023-07-26 07:09:12 +08:00
Jiaming Yuan
3a9996173e
Revert "Switch to per-thread default stream (#9396)" (#9413)
This reverts commit f7f673b00c15458fb4dd74a2a0d2ba80369c5faf.
2023-07-24 12:03:28 -07:00
Rong Ou
f7f673b00c
Switch to per-thread default stream (#9396) 2023-07-20 08:21:00 +08:00
Jiaming Yuan
9fbde21e9d
Rework the precision metric. (#9222)
- Rework the precision metric for both CPU and GPU.
- Mention it in the document.
- Cleanup old support code for GPU ranking metric.
- Deterministic GPU implementation.

* Drop support for classification.
2023-06-02 20:49:43 +08:00
Jiaming Yuan
85988a3178
Wait for data CUDA stream instead of sync. (#9144)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2023-05-09 09:52:21 +08:00
Jiaming Yuan
ea04d4c46c
[doc] [dask] Troubleshooting NCCL errors. (#8943) 2023-03-22 22:17:26 +08:00
Jiaming Yuan
4d665b3fb0
Restore clang tidy test. (#8861) 2023-03-03 13:47:04 -08:00
Jiaming Yuan
31d3ec07af
Extract device algorithms. (#8789) 2023-02-13 20:53:53 +08:00
Rong Ou
15a88ceef0
Fix deprecated CUB calls in CUDA 12.0 (#8578) 2022-12-12 17:02:30 +08:00
Rong Ou
668b8a0ea4
[Breaking] Switch from rabit to the collective communicator (#8257)
* Switch from rabit to the collective communicator

* fix size_t specialization

* fix lint errors

* remove rabit includes

* fix pylint error

* return dict from communicator context

* fix communicator shutdown

* fix dask test

* reset communicator mocklist

* fix distributed tests

* do not save device communicator

* fix jvm gpu tests

* add python test for federated communicator

* Update gputreeshap submodule

Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
2022-10-05 14:39:01 -08:00
Philip Hyunsu Cho
ca0547bb65
[CI] Use RAPIDS 22.10 (#8298)
* [CI] Use RAPIDS 22.10

* Store CUDA and RAPIDS versions in one place

* Add missing #include

* Update gputreeshap submodule

* Remove outdated distributed tests
2022-10-03 23:18:07 -08:00
Rory Mitchell
8f77677193
Use quantised gradients in gpu_hist histograms (#8246) 2022-09-26 17:35:35 +02:00
Jiaming Yuan
441ffc017a
Copy data from Ellpack to GHist. (#8215) 2022-09-06 23:05:49 +08:00
Jiaming Yuan
bcc8679a05
Update CUDA docker image and NCCL. (#8139) 2022-08-07 16:32:41 +08:00
Rory Mitchell
1be09848a7
Refactor split valuation kernel (#8073) 2022-07-21 15:41:50 +02:00
Rory Mitchell
bc4f802b17
Batch UpdatePosition using cudaMemcpy (#7964) 2022-06-30 17:52:40 +02:00
Rong Ou
80339c3427
Enable distributed GPU training over Rabit (#7930) 2022-05-31 04:09:45 +08:00
Jiaming Yuan
fdf533f2b9
[POC] Experimental support for l1 error. (#7812)
Support adaptive trees, a feature supported by both sklearn and LightGBM. The tree leaves are recomputed after construction, based on the residuals between labels and predictions.

For l1 error, the optimal value is the median (50th percentile); a sketch of the recomputation follows this entry.

This is marked as experimental support for the following reasons:
- The value is not well defined for distributed training, where we might have empty leaves on local workers. Right now I just use the original leaf value for computing the average with other workers, which might cause significant errors.
- Some follow-ups are required for exact, the pruner, and optimization of the quantile function. Also, we need to calculate the initial estimation.
2022-04-26 21:41:55 +08:00
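A minimal single-worker sketch of the leaf recomputation for l1, assuming the rows belonging to a leaf are already known; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Single-worker sketch: set a leaf to the median of the residuals
// (label - prediction) of the rows that landed in it.  Distributed
// training would need a distributed quantile computation instead.
double MedianResidual(std::vector<double> residuals) {
  assert(!residuals.empty());
  std::size_t mid = residuals.size() / 2;
  std::nth_element(residuals.begin(), residuals.begin() + mid, residuals.end());
  if (residuals.size() % 2 != 0) {
    return residuals[mid];
  }
  // Even count: average the two middle elements.
  double lo = *std::max_element(residuals.begin(), residuals.begin() + mid);
  return (lo + residuals[mid]) / 2.0;
}
```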
Jiaming Yuan
0d0abe1845
Support optimal partitioning for GPU hist. (#7652)
* Implement `MaxCategory` in quantile.
* Implement partition-based split for GPU evaluation.  Currently, it's based on the existing evaluation function.
* Extract an evaluator from GPU Hist to store the needed states.
* Added some CUDA stream/event utilities.
* Update document with references.
* Fixed a bug in the approx evaluator when the number of data points is less than the number of categories.
2022-02-15 03:03:12 +08:00
Jiaming Yuan
5b1161bb64
Convert labels into tensor. (#7456)
* Add a new tensor ctor for `initializer_list`.
* Change labels from host device vector to tensor.
* Rename the field from `labels_` to `labels` since it's a public member.
2021-12-17 00:58:35 +08:00
Jiaming Yuan
55ee272ea8
Extend array interface to handle ndarray. (#7434)
* Extend array interface to handle ndarray.

The `ArrayInterface` class is extended to support multi-dim array inputs. Previously this
class handled only 2-dim arrays (a vector is also a matrix). This PR specifies the expected
dimension at compile time, and the array interface can perform various checks automatically
on input data. Also, adapters like CSR are more rigorous about their input. Lastly, row
vectors and column vectors are handled without intervention from the caller. (A toy sketch
of the compile-time-dimension idea follows this entry.)
2021-11-16 09:52:15 +08:00
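A toy sketch of the compile-time-dimension idea, with hypothetical names; the real class parses `__array_interface__`/`__cuda_array_interface__` JSON, which is omitted here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy sketch: fixing the dimension D at compile time lets the class
// validate shapes automatically.  Not the actual xgboost::ArrayInterface.
template <std::int32_t D>
struct ArrayInterface {
  static_assert(D > 0, "Invalid dimension.");
  std::array<std::size_t, D> shape;

  std::size_t Size() const {
    std::size_t n = 1;
    for (auto d : shape) {
      n *= d;
    }
    return n;
  }
};

// A row vector (1, n) or a column vector (n, 1) collapses to a 1-dim
// array without intervention from the caller.
inline ArrayInterface<1> AsVector(ArrayInterface<2> const& m) {
  return {{m.shape[0] == 1 ? m.shape[1] : m.shape[0]}};
}
```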
Jiaming Yuan
32e673d8c4
Support building with CTK11.5. (#7379)
* Support building with CTK11.5.

* Require system cub installation for CTK11.4+.
* Check thrust version for segmented sort.
2021-11-02 16:22:26 +08:00
Jiaming Yuan
6295dc3b67
Fix span reverse iterator. (#7387)
* Fix span reverse iterator.

* Disable `rbegin` in device code to avoid calling a host function.
* Add `trbegin` and friends.
2021-11-02 13:35:59 +08:00
Jiaming Yuan
ca17f8a5fc
Dispatch thrust versions and upgrade rmm. (#7254)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-09-25 03:43:23 +08:00
Jiaming Yuan
c311a8c1d8
Enable compiling with system cub. (#7232)
- Tested with all CUDA 11.x.
- Work around the CUB scan by using a discard iterator in AUC.
- Limit the size of Argsort when compiled with CUDA cub.
2021-09-17 14:28:18 +08:00
Jiaming Yuan
ba69244a94
Restore the custom double atomic add. (#7198) 2021-08-28 18:30:42 +08:00
Jiaming Yuan
7a1d67f9cb
[breaking] Use integer atomic for GPU histogram. (#7180)
On GPU we use a rounding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to a fixed-point number with the exponent aligned with the rounding factor (an illustrative sketch follows this entry).

    [breaking] Drop the non-deterministic histogram.
    Use fixed point for shared memory.

This PR improves the performance of GPU Hist.

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
2021-08-28 05:17:05 +08:00
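An illustrative sketch of the fixed-point conversion, assuming a precomputed power-of-two rounding factor that bounds the absolute gradient sum; the names and the 2^62 scale are assumptions, not the PR's actual constants.

```cpp
#include <cstdint>

// With a power-of-two rounding factor r chosen so that any partial sum
// of gradients stays below r in magnitude, each gradient maps to an
// integer on a grid whose exponent is aligned with r.  Integer addition
// is associative, so atomic accumulation becomes order-independent and
// the histogram is bit-for-bit deterministic.
std::int64_t ToFixedPoint(float grad, float rounding_factor) {
  // One unit of the fixed-point grid (the 2^62 scale is illustrative).
  double unit = static_cast<double>(rounding_factor) / static_cast<double>(1ULL << 62);
  return static_cast<std::int64_t>(grad / unit);
}

float ToFloatingPoint(std::int64_t fixed, float rounding_factor) {
  double unit = static_cast<double>(rounding_factor) / static_cast<double>(1ULL << 62);
  return static_cast<float>(fixed * unit);
}
```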
Jiaming Yuan
e7d7ab6bc3
Better error message for ncclUnhandledCudaError. (#7190) 2021-08-27 10:29:22 +08:00
Robert Maynard
1a75f43304
Allow compilation with nvcc 11.4 (#7131)
* Use type aliases for discard iterators

* Update to include host_vector, as Thrust 1.12 no longer brings it in as a side effect

* cub::DispatchRadixSort requires signed offset types
2021-07-27 20:05:33 +08:00
Jiaming Yuan
1c8fdf2218
Remove use of device_idx in dh::LaunchN. (#7063)
It's an unused parameter; removing it makes the CI log more readable.
2021-06-29 11:37:26 +08:00
Jiaming Yuan
86715e4cd4
Support categorical data for dask functional interface and DQM. (#7043)
* Support categorical data for dask functional interface and DQM.

* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.

* Get newer cupy.
2021-06-18 13:06:52 +08:00
Andrew Ziem
3e7e426b36
Fix spelling in documents (#6948)
* Update roxygen2 doc.

Co-authored-by: fis <jm.yuan@outlook.com>
2021-05-11 20:44:36 +08:00
Jiaming Yuan
1b26a2a561
Copy output data for argsort. (#6866)
Fix GPU AUC.
2021-04-16 21:05:01 +08:00
Jiaming Yuan
f294c4e023
Use constexpr in dh::CopyIf. (#6828) 2021-04-08 07:37:47 +08:00