xgboost

Author	SHA1	Message	Date
Jiaming Yuan	a39fef2c67	[fed] Fixes for the encrypted GRPC backend. (#10503 )	2024-07-02 15:15:12 +08:00
Jiaming Yuan	26eb68859f	Consistently report error in tests. (#10453 )	2024-06-21 14:35:22 +08:00
Jiaming Yuan	c9f5fcaf21	[col] Small cleanup to federated comm. (#10397 )	2024-06-07 21:19:04 +08:00
Jiaming Yuan	7354955cbb	Test federated plugin using GitHub action. (#10336 ) Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>	2024-05-29 02:28:14 +08:00
Jiaming Yuan	a5a58102e5	Revamp the rabit implementation. (#10112 ) This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new one features: - Federated learning for both CPU and GPU. - NCCL. - More data types. - A unified interface for all the underlying implementations. - Improved timeout handling for both tracker and workers. - Exhaustive tests with metrics (fixed a couple of bugs along the way). - A reusable tracker for Python and JVM packages.	2024-05-20 11:56:23 +08:00
Jiaming Yuan	3fbb221fec	[coll] Implement shutdown for tracker and comm. (#10208 ) - Force shutdown the tracker. - Implement shutdown notice for error handling thread in comm.	2024-04-20 04:08:17 +08:00
Jiaming Yuan	8bad677c2f	Update collective implementation. (#10152 ) - Clean up resources during `Finalize` to avoid handling threads in the destructor. - Calculate the size for allgather automatically. - Use simple allgather for small (smaller than the number of workers) allreduce.	2024-03-30 18:57:31 +08:00
Jiaming Yuan	b3700bbb3f	Flexible find protobuf. (#9867 )	2023-12-12 07:34:01 +08:00
Jiaming Yuan	0715ab3c10	Use `dlopen` to load NCCL. (#9796 ) This PR adds optional support for loading nccl with `dlopen` as an alternative to compile-time linking. This is to address the size bloat issue with the PyPI binary release. - Add CMake option to load `nccl` at runtime. - Add an NCCL stub. After this, `nccl` will be fetched from PyPI when using pip to install XGBoost, either by a user or by `pyproject.toml`. Others who want to link nccl at compile time can continue to do so without any change. At the moment, this is Linux only since we only support MNMG on Linux.	2023-11-22 19:27:31 +08:00
Jiaming Yuan	06bdc15e9b	[coll] Pass context to various functions. (#9772 ) In the future, the `Context` object will be required for collective operations. This PR passes the context object to some required functions to prepare for swapping out the implementation.	2023-11-08 09:54:05 +08:00
Jiaming Yuan	6c0a190f6d	[coll] Add comm group. (#9759 ) - Implement `CommGroup` for double dispatching. - Small cleanup to tracker for handling abort.	2023-11-07 11:12:31 +08:00
Jiaming Yuan	4da4e092b5	[coll] Improvements and fixes for tracker and allreduce. (#9745 ) - Allow the tracker to wait. - Fix allreduce type cast - Return args from the federated tracker.	2023-11-02 04:06:46 +08:00
Jiaming Yuan	bc995a4865	[coll] Add federated coll. (#9738 ) - Define a new data type, the proto file is copied for now. - Merge client and communicator into `FederatedColl`. - Define CUDA variant. - Migrate tests for CPU, add tests for CUDA.	2023-11-01 04:06:46 +08:00
Jiaming Yuan	80390e6cb6	[coll] Federated comm. (#9732 )	2023-10-31 02:39:55 +08:00
Rong Ou	e164d51c43	Improve allgather functions (#9649 )	2023-10-12 23:31:43 +08:00
Jiaming Yuan	b438d684d2	Utilities and cleanups for socket. (#9576 ) - Use c++-17 nodiscard and nested ns. - Add bind method to socket. - Remove rabit parameters.	2023-09-14 01:41:42 +08:00
Jiaming Yuan	c1b2cff874	[CI] Check compiler warnings. (#9444 )	2023-08-08 12:02:45 -07:00
Philip Hyunsu Cho	a5cd2412de	Replace setup.py with pyproject.toml (#9021 ) * Create pyproject.toml * Implement a custom build backend (see below) in packager directory. Build logic from setup.py has been refactored and migrated into the new backend. * Tested: pip wheel . (build wheel), python -m build --sdist . (source distribution)	2023-04-20 13:51:39 -07:00
Jiaming Yuan	c5c8f643f2	Remove the cub submodule. (#8888 ) XGBoost now uses CTK-11.8 for binary packages, so there's no need to maintain a cub submodule anymore.	2023-03-09 19:43:02 -08:00
Rong Ou	cbf98cb9c6	Add Allgather to collective communicator (#8765 )	2023-02-09 11:31:22 +08:00
Rong Ou	77b069c25d	Support bitwise allreduce operations in the communicator (#8623 )	2022-12-25 06:40:05 +08:00
Rong Ou	a8255ea678	Add an in-memory collective communicator (#8494 )	2022-12-01 00:24:12 +08:00
Rong Ou	4449e30184	Always link federated proto statically (#8442 )	2022-11-09 07:47:38 +08:00
Rong Ou	521086d56b	Make federated client more robust (#8351 )	2022-10-18 13:52:44 +08:00
Rong Ou	8f3dee58be	Speed up tests with federated learning enabled (#8350 ) * Re-enable timeouts Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>	2022-10-17 15:17:04 -07:00
Philip Hyunsu Cho	2faa744aba	[CI] Test federated learning plugin in the CI (#8325 )	2022-10-12 13:57:39 -07:00
Rong Ou	39afdac3be	Better error message when world size and rank are set as strings (#8316 ) Co-authored-by: jiamingy <jm.yuan@outlook.com>	2022-10-12 15:53:25 +08:00
Rong Ou	8d4038da57	Don't split input data in federated mode (#8279 ) Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>	2022-10-05 18:19:28 -08:00
Rong Ou	668b8a0ea4	[Breaking] Switch from rabit to the collective communicator (#8257 ) * Switch from rabit to the collective communicator * fix size_t specialization * really fix size_t * try again * add include * more include * fix lint errors * remove rabit includes * fix pylint error * return dict from communicator context * fix communicator shutdown * fix dask test * reset communicator mocklist * fix distributed tests * do not save device communicator * fix jvm gpu tests * add python test for federated communicator * Update gputreeshap submodule Co-authored-by: Hyunsu Philip Cho <chohyu01@cs.washington.edu>	2022-10-05 14:39:01 -08:00
Rong Ou	a2686543a9	Common interface for collective communication (#8057 ) * implement broadcast for federated communicator * implement allreduce * add communicator factory * add device adapter * add device communicator to factory * add rabit communicator * add rabit communicator to the factory * add nccl device communicator * add synchronize to device communicator * add back print and getprocessorname * add python wrapper and c api * clean up types * fix non-gpu build * try to fix ci * fix std::size_t * portable string compare ignore case * c style size_t * fix lint errors * cross platform setenv * fix memory leak * fix lint errors * address review feedback * add python test for rabit communicator * fix failing gtest * use json to configure communicators * fix lint error * get rid of factories * fix cpu build * fix include * fix python import * don't export collective.py yet * skip collective communicator pytest on windows * add review feedback * update documentation * remove mpi communicator type * fix tests * shutdown the communicator separately Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>	2022-09-12 15:21:12 -07:00
Rong Ou	d6e2013c5f	Set max message size in insecure gRPC (#8203 )	2022-08-26 16:33:51 +08:00
Rong Ou	ad3bc0edee	Allow insecure gRPC connections for federated learning (#8181 ) * format	2022-08-19 12:16:14 +08:00
Rong Ou	45dc1f818a	Make federated plugin work with cmake 3.16.3 (#8029 )	2022-06-27 17:26:41 +08:00
Rong Ou	0725fd6081	fix federated learning plugin (#8027 )	2022-06-24 08:41:07 +08:00
Rong Ou	e5ec546da5	[Breaking] Remove rabit support for custom reductions and `grow_local_histmaker` updater (#7992 )	2022-06-21 15:08:23 +08:00
Rong Ou	31e6902e43	Support GPU training in the NVFlare demo (#7965 )	2022-06-02 21:52:36 +08:00
Rong Ou	d3429f2ff6	Increase gRPC max receive message size for federated learning (#7958 )	2022-06-01 13:21:54 +08:00
Rong Ou	af907e2d0d	Demo of federated learning using NVFlare (#7879 ) Co-authored-by: jiamingy <jm.yuan@outlook.com>	2022-05-14 22:45:41 +08:00
Rong Ou	14ef38b834	Initial support for federated learning (#7831 ) Federated learning plugin for xgboost: * A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers. * A Rabit engine for the federated environment. * Integration test to simulate federated learning. Additional followups are needed to address GPU support, better security, and privacy, etc.	2022-05-05 21:49:22 +08:00

39 Commits