26 Commits

Author SHA1 Message Date
Philip Hyunsu Cho
5e6cb63a56
[CI] Set up CI for Mac M1 (#9699) 2023-10-22 23:33:19 -07:00
Jiaming Yuan
b771f58453
[coll] Define interface for bridging. (#9695)
* Define the basic interface that will be shared by nccl, federated and native.
2023-10-20 16:20:48 +08:00
Jiaming Yuan
5d1bcde719
[coll] allgatherv. (#9688) 2023-10-19 03:13:50 +08:00
Jiaming Yuan
4c0e4422d0
[coll] allgather. (#9681) 2023-10-18 10:22:18 +08:00
Jiaming Yuan
48ac9b6cbe
[coll] Allreduce. (#9679) 2023-10-17 13:57:14 +08:00
Jiaming Yuan
53049b16b8
[coll] Broadcast. (#9659) 2023-10-14 09:34:37 +08:00
Rong Ou
e164d51c43
Improve allgather functions (#9649) 2023-10-12 23:31:43 +08:00
Jiaming Yuan
946ae1c440
[coll] Implement a new tracker and a communicator. (#9650)
The new tracker and communicators exchange JSON documents, and the communicators are
aware of each other.
2023-10-12 12:49:16 +08:00
Jiaming Yuan
b14e535e78
[Coll] Implement get host address in libxgboost. (#9644)
- Port `xgboost.tracker.get_host_ip` to C++.
2023-10-10 10:01:14 +08:00
Jiaming Yuan
8c676c889d
Remove internal use of gpu_id. (#9568) 2023-09-20 23:29:51 +08:00
Jiaming Yuan
38ac52dd87
Build a simple event loop for collective. (#9593) 2023-09-20 02:09:07 +08:00
Jiaming Yuan
b438d684d2
Utilities and cleanups for socket. (#9576)
- Use C++17 nodiscard and nested namespaces.
- Add bind method to socket.
- Remove rabit parameters.
2023-09-14 01:41:42 +08:00
Jiaming Yuan
ccfc90e4c6
[rabit] Improved connection handling. (#9531)
- Enable timeout.
- Report connection error from the system.
- Handle retry for both tracker connection and peer connection.
2023-08-30 13:00:04 +08:00
Rong Ou
c2b85ab68a
Clean up MGPU C++ tests (#9430) 2023-08-02 14:31:18 +08:00
Rong Ou
15ca12a77e
Fix NCCL test hang (#9367) 2023-07-07 11:21:35 +08:00
Rong Ou
f90771eec6
Fix device communicator dependency (#9346) 2023-06-29 10:34:30 +08:00
Rong Ou
d8beb517ed
Support bitwise allreduce in NCCL communicator (#9300) 2023-06-17 01:56:50 +08:00
Rong Ou
e70810be8a
Refactor device communicator to make allreduce more flexible (#9295) 2023-06-14 03:53:03 +08:00
Rong Ou
52311dcec9
Fix multi-threaded gtests (#9148) 2023-05-10 19:15:32 +08:00
Jiaming Yuan
ea04d4c46c
[doc] [dask] Troubleshooting NCCL errors. (#8943) 2023-03-22 22:17:26 +08:00
Rong Ou
cbf98cb9c6
Add Allgather to collective communicator (#8765)
2023-02-09 11:31:22 +08:00
Rong Ou
78396f8a6e
Initial support for column-split cpu predictor (#8676) 2023-01-18 06:33:13 +08:00
Rong Ou
77b069c25d
Support bitwise allreduce operations in the communicator (#8623) 2022-12-25 06:40:05 +08:00
Rong Ou
a8255ea678
Add an in-memory collective communicator (#8494) 2022-12-01 00:24:12 +08:00
Jiaming Yuan
b791446623
Initial support for IPv6 (#8225)
- Merge rabit socket into XGBoost.
- Dask interface support.
- Add tests for the socket.
2022-09-21 18:06:50 +08:00
Rong Ou
a2686543a9
Common interface for collective communication (#8057)
* implement broadcast for federated communicator
* implement allreduce
* add communicator factory
* add device adapter
* add device communicator to factory
* add rabit communicator
* add rabit communicator to the factory
* add nccl device communicator
* add synchronize to device communicator
* add back print and getprocessorname
* add python wrapper and c api
* clean up types
* fix non-gpu build
* try to fix ci
* fix std::size_t
* portable string compare ignore case
* c style size_t
* fix lint errors
* cross platform setenv
* fix memory leak
* fix lint errors
* address review feedback
* add python test for rabit communicator
* fix failing gtest
* use json to configure communicators
* fix lint error
* get rid of factories
* fix cpu build
* fix include
* fix python import
* don't export collective.py yet
* skip collective communicator pytest on windows
* add review feedback
* update documentation
* remove mpi communicator type
* fix tests
* shutdown the communicator separately

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2022-09-12 15:21:12 -07:00