Hui Liu
1e1e8be3a5
merge latest, Jan 12 2024
2024-01-12 09:57:11 -08:00
Hui Liu
2d7ffbdf3d
merge latest changes
2023-12-13 21:06:28 -08:00
Jiaming Yuan
b3700bbb3f
Flexible find protobuf. ( #9867 )
2023-12-12 07:34:01 +08:00
Jiaming Yuan
0715ab3c10
Use dlopen to load NCCL. ( #9796 )
...
This PR adds optional support for loading nccl with `dlopen` as an alternative of compile time linking. This is to address the size bloat issue with the PyPI binary release.
- Add CMake option to load `nccl` at runtime.
- Add an NCCL stub.
After this, `nccl` will be fetched from PyPI when using pip to install XGBoost, either by a user or by `pyproject.toml`. Others who want to link the nccl at compile time can continue to do so without any change.
At the moment, this is Linux only since we only support MNMG on Linux.
2023-11-22 19:27:31 +08:00
Jiaming Yuan
06bdc15e9b
[coll] Pass context to various functions. ( #9772 )
...
* [coll] Pass context to various functions.
In the future, the `Context` object would be required for collective operations, this PR
passes the context object to some required functions to prepare for swapping out the
implementation.
2023-11-08 09:54:05 +08:00
Jiaming Yuan
6c0a190f6d
[coll] Add comm group. ( #9759 )
...
- Implement `CommGroup` for double dispatching.
- Small cleanup to tracker for handling abort.
2023-11-07 11:12:31 +08:00
Hui Liu
3af5dfd546
Merge branch 'master'
2023-11-02 09:05:31 -07:00
Jiaming Yuan
4da4e092b5
[coll] Improvements and fixes for tracker and allreduce. ( #9745 )
...
- Allow the tracker to wait.
- Fix allreduce type cast
- Return args from the federated tracker.
2023-11-02 04:06:46 +08:00
Hui Liu
129bb76941
enable federated
2023-10-31 16:31:56 -07:00
Jiaming Yuan
bc995a4865
[coll] Add federated coll. ( #9738 )
...
- Define a new data type, the proto file is copied for now.
- Merge client and communicator into `FederatedColl`.
- Define CUDA variant.
- Migrate tests for CPU, add tests for CUDA.
2023-11-01 04:06:46 +08:00
Jiaming Yuan
80390e6cb6
[coll] Federated comm. ( #9732 )
2023-10-31 02:39:55 +08:00
Rong Ou
e164d51c43
Improve allgather functions ( #9649 )
2023-10-12 23:31:43 +08:00
Jiaming Yuan
b438d684d2
Utilities and cleanups for socket. ( #9576 )
...
- Use c++-17 nodiscard and nested ns.
- Add bind method to socket.
- Remove rabit parameters.
2023-09-14 01:41:42 +08:00
Jiaming Yuan
c1b2cff874
[CI] Check compiler warnings. ( #9444 )
2023-08-08 12:02:45 -07:00
Philip Hyunsu Cho
a5cd2412de
Replace setup.py with pyproject.toml ( #9021 )
...
* Create pyproject.toml
* Implement a custom build backend (see below) in packager directory. Build logic from setup.py has been refactored and migrated into the new backend.
* Tested: pip wheel . (build wheel), python -m build --sdist . (source distribution)
2023-04-20 13:51:39 -07:00
Jiaming Yuan
c5c8f643f2
Remove the cub submodule. ( #8888 )
...
XGBoost now uses CTK-11.8 for binary packages, there's no need to maintain a cub
submodule anymore.
2023-03-09 19:43:02 -08:00
Rong Ou
cbf98cb9c6
Add Allgather to collective communicator ( #8765 )
...
* Add Allgather to collective communicator
2023-02-09 11:31:22 +08:00
Rong Ou
77b069c25d
Support bitwise allreduce operations in the communicator ( #8623 )
2022-12-25 06:40:05 +08:00
Rong Ou
a8255ea678
Add an in-memory collective communicator ( #8494 )
2022-12-01 00:24:12 +08:00
Rong Ou
4449e30184
Always link federated proto statically ( #8442 )
2022-11-09 07:47:38 +08:00
Rong Ou
521086d56b
Make federated client more robust ( #8351 )
2022-10-18 13:52:44 +08:00
Rong Ou
8f3dee58be
Speed up tests with federated learning enabled ( #8350 )
...
* Speed up tests with federated learning enabled
* Re-enable timeouts
Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
2022-10-17 15:17:04 -07:00
Philip Hyunsu Cho
2faa744aba
[CI] Test federated learning plugin in the CI ( #8325 )
2022-10-12 13:57:39 -07:00
Rong Ou
39afdac3be
Better error message when world size and rank are set as strings ( #8316 )
...
Co-authored-by: jiamingy <jm.yuan@outlook.com>
2022-10-12 15:53:25 +08:00
Rong Ou
8d4038da57
Don't split input data in federated mode ( #8279 )
...
Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
2022-10-05 18:19:28 -08:00
Rong Ou
668b8a0ea4
[Breaking] Switch from rabit to the collective communicator ( #8257 )
...
* Switch from rabit to the collective communicator
* fix size_t specialization
* really fix size_t
* try again
* add include
* more include
* fix lint errors
* remove rabit includes
* fix pylint error
* return dict from communicator context
* fix communicator shutdown
* fix dask test
* reset communicator mocklist
* fix distributed tests
* do not save device communicator
* fix jvm gpu tests
* add python test for federated communicator
* Update gputreeshap submodule
Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>
2022-10-05 14:39:01 -08:00
Rong Ou
a2686543a9
Common interface for collective communication ( #8057 )
...
* implement broadcast for federated communicator
* implement allreduce
* add communicator factory
* add device adapter
* add device communicator to factory
* add rabit communicator
* add rabit communicator to the factory
* add nccl device communicator
* add synchronize to device communicator
* add back print and getprocessorname
* add python wrapper and c api
* clean up types
* fix non-gpu build
* try to fix ci
* fix std::size_t
* portable string compare ignore case
* c style size_t
* fix lint errors
* cross platform setenv
* fix memory leak
* fix lint errors
* address review feedback
* add python test for rabit communicator
* fix failing gtest
* use json to configure communicators
* fix lint error
* get rid of factories
* fix cpu build
* fix include
* fix python import
* don't export collective.py yet
* skip collective communicator pytest on windows
* add review feedback
* update documentation
* remove mpi communicator type
* fix tests
* shutdown the communicator separately
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2022-09-12 15:21:12 -07:00
Rong Ou
d6e2013c5f
Set max message size in insecure gRPC ( #8203 )
2022-08-26 16:33:51 +08:00
Rong Ou
ad3bc0edee
Allow insecure gRPC connections for federated learning ( #8181 )
...
* Allow insecure gRPC connections for federated learning
* format
2022-08-19 12:16:14 +08:00
Rong Ou
45dc1f818a
Make federated plugin work with cmake 3.16.3 ( #8029 )
2022-06-27 17:26:41 +08:00
Rong Ou
0725fd6081
fix federated learning plugin ( #8027 )
2022-06-24 08:41:07 +08:00
Rong Ou
e5ec546da5
[Breaking] Remove rabit support for custom reductions and grow_local_histmaker updater ( #7992 )
2022-06-21 15:08:23 +08:00
Rong Ou
31e6902e43
Support GPU training in the NVFlare demo ( #7965 )
2022-06-02 21:52:36 +08:00
Rong Ou
d3429f2ff6
Increase gRPC max receive message size for federated learning ( #7958 )
2022-06-01 13:21:54 +08:00
Rong Ou
af907e2d0d
Demo of federated learning using NVFlare ( #7879 )
...
Co-authored-by: jiamingy <jm.yuan@outlook.com>
2022-05-14 22:45:41 +08:00
Rong Ou
14ef38b834
Initial support for federated learning ( #7831 )
...
Federated learning plugin for xgboost:
* A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers.
* A Rabit engine for the federated environment.
* Integration test to simulate federated learning.
Additional followups are needed to address GPU support, better security, and privacy, etc.
2022-05-05 21:49:22 +08:00