Hui Liu
968dbf25fb
merge latest changes
2024-03-12 09:13:09 -07:00
Jiaming Yuan
0ce4372bd4
Use UBJSON for serializing splits for vertical data split. (#10059)
2024-02-25 00:18:23 +08:00
Hui Liu
2d7ffbdf3d
merge latest changes
2023-12-13 21:06:28 -08:00
Jiaming Yuan
8fe1a2213c
Cleanup code for distributed training. (#9805)
- Merge `GetNcclResult` into nccl stub.
- Split up utilities from the main dask module.
- Let Channel return `Result` to accommodate nccl channel.
- Remove old `use_label_encoder` parameter.
2023-11-25 09:10:56 +08:00
Jiaming Yuan
0715ab3c10
Use dlopen to load NCCL. (#9796)
This PR adds optional support for loading NCCL with `dlopen` as an alternative to compile-time linking. This addresses the size bloat of the PyPI binary release.
- Add CMake option to load `nccl` at runtime.
- Add an NCCL stub.
After this, `nccl` will be fetched from PyPI when XGBoost is installed with pip, either by a user or through `pyproject.toml`. Those who want to link NCCL at compile time can continue to do so without any change.
At the moment, this is Linux only since we only support MNMG on Linux.
2023-11-22 19:27:31 +08:00
Jiaming Yuan
ada377c57e
[coll] Reduce the scope of lock in the event loop. (#9784)
2023-11-15 14:16:19 +08:00
Jiaming Yuan
6fd4a30667
[coll] Increase timeout for allgather test. (#9777)
2023-11-09 05:26:40 +08:00
Jiaming Yuan
44099f585d
[coll] Add C API for the tracker. (#9773)
2023-11-08 18:17:14 +08:00
Jiaming Yuan
6c0a190f6d
[coll] Add comm group. (#9759)
- Implement `CommGroup` for double dispatching.
- Small cleanup to tracker for handling abort.
2023-11-07 11:12:31 +08:00
Hui Liu
3af5dfd546
Merge branch 'master'
2023-11-02 09:05:31 -07:00
Jiaming Yuan
4da4e092b5
[coll] Improvements and fixes for tracker and allreduce. (#9745)
- Allow the tracker to wait.
- Fix allreduce type cast.
- Return args from the federated tracker.
2023-11-02 04:06:46 +08:00
Hui Liu
8fab17ae8f
rm hip.h files
2023-10-30 21:20:28 -07:00
Hui Liu
02f5464fa6
enable coll and comm
2023-10-30 15:15:05 -07:00
Hui Liu
d7f1235b7d
Merge branch 'master' into sync-condition-2023Oct11
2023-10-30 13:19:33 -07:00
Jiaming Yuan
6755179e77
[coll] Add nccl. (#9726)
2023-10-28 16:33:58 +08:00
Hui Liu
3752b06550
Merge branch 'master' into sync-condition-2023Oct11
2023-10-24 10:46:38 -07:00
Jiaming Yuan
7a02facc9d
Serialize expand entry for allgather. (#9702)
2023-10-24 14:33:28 +08:00
Hui Liu
6ba66463b6
fix uuid and Clear/SetValid
2023-10-23 16:32:26 -07:00
Hui Liu
55994b1ac7
enable ROCm on latest XGBoost
2023-10-23 11:15:04 -07:00
Hui Liu
15421e40d9
enable ROCm on latest XGBoost
2023-10-23 11:07:08 -07:00
Philip Hyunsu Cho
5e6cb63a56
[CI] Set up CI for Mac M1 (#9699)
2023-10-22 23:33:19 -07:00
Jiaming Yuan
b771f58453
[coll] Define interface for bridging. (#9695)
* Define the basic interface that will be shared by nccl, federated and native.
2023-10-20 16:20:48 +08:00
Jiaming Yuan
5d1bcde719
[coll] allgatherv. (#9688)
2023-10-19 03:13:50 +08:00
Jiaming Yuan
4c0e4422d0
[coll] allgather. (#9681)
2023-10-18 10:22:18 +08:00
Jiaming Yuan
48ac9b6cbe
[coll] Allreduce. (#9679)
2023-10-17 13:57:14 +08:00
Jiaming Yuan
53049b16b8
[coll] Broadcast. (#9659)
2023-10-14 09:34:37 +08:00
Your Name
ea19555474
temp merge, disable 1 line, SetValid
2023-10-12 16:16:44 -07:00
Rong Ou
e164d51c43
Improve allgather functions (#9649)
2023-10-12 23:31:43 +08:00
Jiaming Yuan
946ae1c440
[coll] Implement a new tracker and a communicator. (#9650)
The new tracker and communicators communicate through JSON documents. In addition, communicators are aware of each other.
2023-10-12 12:49:16 +08:00
Jiaming Yuan
b14e535e78
[Coll] Implement get host address in libxgboost. (#9644)
- Port `xgboost.tracker.get_host_ip` in C++.
2023-10-10 10:01:14 +08:00
Jiaming Yuan
8c676c889d
Remove internal use of gpu_id. (#9568)
2023-09-20 23:29:51 +08:00
Jiaming Yuan
38ac52dd87
Build a simple event loop for collective. (#9593)
2023-09-20 02:09:07 +08:00
Jiaming Yuan
b438d684d2
Utilities and cleanups for socket. (#9576)
- Use C++17 nodiscard and nested namespaces.
- Add bind method to socket.
- Remove rabit parameters.
2023-09-14 01:41:42 +08:00
Jiaming Yuan
ccfc90e4c6
[rabit] Improved connection handling. (#9531)
- Enable timeout.
- Report connection error from the system.
- Handle retry for both tracker connection and peer connection.
2023-08-30 13:00:04 +08:00
Rong Ou
c2b85ab68a
Clean up MGPU C++ tests (#9430)
2023-08-02 14:31:18 +08:00
Rong Ou
15ca12a77e
Fix NCCL test hang (#9367)
2023-07-07 11:21:35 +08:00
Rong Ou
f90771eec6
Fix device communicator dependency (#9346)
2023-06-29 10:34:30 +08:00
Rong Ou
d8beb517ed
Support bitwise allreduce in NCCL communicator (#9300)
2023-06-17 01:56:50 +08:00
amdsc21
5ca7daaa13
merge latest changes
2023-06-15 21:39:14 +02:00
Rong Ou
e70810be8a
Refactor device communicator to make allreduce more flexible (#9295)
2023-06-14 03:53:03 +08:00
amdsc21
8cad8c693c
sync up May15 2023
2023-05-15 18:59:18 +02:00
Rong Ou
52311dcec9
Fix multi-threaded gtests (#9148)
2023-05-10 19:15:32 +08:00
amdsc21
06d9b998ce
fix CAPI BuildInfo
2023-03-28 00:14:18 +02:00
amdsc21
7fbc561e17
initial merge
2023-03-25 04:31:55 +01:00
Jiaming Yuan
ea04d4c46c
[doc] [dask] Troubleshooting NCCL errors. (#8943)
2023-03-22 22:17:26 +08:00
amdsc21
332f6a89a9
more tests
2023-03-11 01:33:48 +01:00
amdsc21
c51a1c9aae
rename hip.cc to hip
2023-03-07 05:39:53 +01:00
amdsc21
6039a71e6c
add hip structure
2023-03-07 02:17:19 +01:00
Rong Ou
cbf98cb9c6
Add Allgather to collective communicator (#8765)
2023-02-09 11:31:22 +08:00
Rong Ou
78396f8a6e
Initial support for column-split cpu predictor (#8676)
2023-01-18 06:33:13 +08:00