This PR replaces the original RABIT implementation with a new one, which has already been partially merged into XGBoost. The new one features:
- Federated learning for both CPU and GPU.
- NCCL.
- More data types.
- A unified interface for all the underlying implementations.
- Improved timeout handling for both tracker and workers.
- Exhausted tests with metrics (fixed a couple of bugs along the way).
- A reusable tracker for Python and JVM packages.
* Update collective implementation.
- Cleanup resource during `Finalize` to avoid handling threads in destructor.
- Calculate the size for allgather automatically.
- Use simple allgather for small (smaller than the number of worker) allreduce.
- Use std::uint64_t instead of size_t to avoid implementation-defined type.
- Rename to bst_idx_t, to account for other types of indexing.
- Small cleanup to the base header.
* [coll] Pass context to various functions.
In the future, the `Context` object would be required for collective operations, this PR
passes the context object to some required functions to prepare for swapping out the
implementation.
- Define a new data type, the proto file is copied for now.
- Merge client and communicator into `FederatedColl`.
- Define CUDA variant.
- Migrate tests for CPU, add tests for CUDA.
Added some more tests for the learner and fit_stump, for both column-wise distributed learning and vertical federated learning.
Also moved the `IsRowSplit` and `IsColumnSplit` methods from the `DMatrix` to the `MetaInfo` since in some places we only have access to the `MetaInfo`. Added a new convenience method `IsVerticalFederatedLearning`.
Some refactoring of the testing fixtures.
* Prepare for improving Windows networking compatibility.
* Include dmlc filesystem indirectly as dmlc/filesystem.h includes windows.h, which
conflicts with winsock2.h
* Define `NOMINMAX` conditionally.
* Link the winsock library when mysys32 is used.
* Add config file for read the doc.
Federated learning plugin for xgboost:
* A gRPC server to aggregate MPI-style requests (allgather, allreduce, broadcast) from federated workers.
* A Rabit engine for the federated environment.
* Integration test to simulate federated learning.
Additional followups are needed to address GPU support, better security, and privacy, etc.
* Change DefaultEvalMetric of classification from error to logloss
* Change default binary metric in plugin/example/custom_obj.cc
* Set old error metric in python tests
* Set old error metric in R tests
* Fix missed eval metrics and typos in R tests
* Fix setting eval_metric twice in R tests
* Add warning for empty eval_metric for classification
* Fix Dask tests
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>