# Experimental Support of Horizontal Federated XGBoost using NVFlare
This directory contains a demo of Horizontal Federated Learning using NVFlare.
## Training with CPU only
To run the demo, first build XGBoost with the federated learning plugin enabled (see the README).
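If you have not built it yet, here is a minimal sketch of a from-source build; it assumes XGBoost's `PLUGIN_FEDERATED` CMake option and that CMake and a C++ toolchain are installed (see the README for the authoritative steps):

```shell
# From the XGBoost source root: configure with the federated plugin enabled, then build.
mkdir -p build && cd build
cmake .. -DPLUGIN_FEDERATED=ON
make -j$(nproc)
```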
Install NVFlare (note that currently NVFlare only supports Python 3.8):

```shell
pip install nvflare
```
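If your default interpreter is newer than 3.8, installing into a dedicated Python 3.8 virtual environment avoids version conflicts; a minimal sketch, assuming `python3.8` is on your `PATH`:

```shell
# Create and activate an isolated Python 3.8 environment, then install NVFlare into it.
python3.8 -m venv nvflare-env
source nvflare-env/bin/activate
pip install nvflare
```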
Prepare the data:

```shell
./prepare_data.sh
```
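The startup scripts used below live in NVFlare's proof-of-concept (POC) workspace. If your checkout's `prepare_data.sh` does not already provision it under `/tmp/nvflare/poc`, NVFlare's `poc` helper can generate one; a sketch, assuming NVFlare 2.1's CLI (answer `y` when prompted):

```shell
poc -n 2                  # creates ./poc with server, site-1, site-2, and admin folders
mkdir -p /tmp/nvflare
cp -r poc /tmp/nvflare/   # yields /tmp/nvflare/poc as used below
```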
Start the NVFlare federated server:

```shell
/tmp/nvflare/poc/server/startup/start.sh
```
In another terminal, start the first worker:

```shell
/tmp/nvflare/poc/site-1/startup/start.sh
```
And the second worker:

```shell
/tmp/nvflare/poc/site-2/startup/start.sh
```
Then start the admin CLI (in POC mode, log in with `admin` as both the username and password):

```shell
/tmp/nvflare/poc/admin/startup/fl_admin.sh
```
In the admin CLI, run the following command:

```shell
submit_job horizontal-xgboost
```
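While the job runs, the admin CLI can report progress; a couple of standard NVFlare admin commands (assumed to be available in your installed NVFlare version):

```shell
check_status server   # server state and the list of connected clients
check_status client   # per-client state
list_jobs             # submitted jobs and their current status
```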
Make a note of the job ID:

```console
Submitted job: 28309e77-a7c5-45e6-b2bc-c2e3655122d8
```
On both workers, you should see the train and eval losses printed:

```console
[10:45:41] [0] eval-logloss:0.22646 train-logloss:0.23316
[10:45:41] [1] eval-logloss:0.13776 train-logloss:0.13654
[10:45:41] [2] eval-logloss:0.08036 train-logloss:0.08243
[10:45:41] [3] eval-logloss:0.05830 train-logloss:0.05645
[10:45:41] [4] eval-logloss:0.03825 train-logloss:0.04148
[10:45:41] [5] eval-logloss:0.02660 train-logloss:0.02958
[10:45:41] [6] eval-logloss:0.01386 train-logloss:0.01918
[10:45:41] [7] eval-logloss:0.01018 train-logloss:0.01331
[10:45:41] [8] eval-logloss:0.00847 train-logloss:0.01112
[10:45:41] [9] eval-logloss:0.00691 train-logloss:0.00662
[10:45:41] [10] eval-logloss:0.00543 train-logloss:0.00503
[10:45:41] [11] eval-logloss:0.00445 train-logloss:0.00420
[10:45:41] [12] eval-logloss:0.00336 train-logloss:0.00355
[10:45:41] [13] eval-logloss:0.00277 train-logloss:0.00280
[10:45:41] [14] eval-logloss:0.00252 train-logloss:0.00244
[10:45:41] [15] eval-logloss:0.00177 train-logloss:0.00193
[10:45:41] [16] eval-logloss:0.00156 train-logloss:0.00161
[10:45:41] [17] eval-logloss:0.00135 train-logloss:0.00142
[10:45:41] [18] eval-logloss:0.00123 train-logloss:0.00125
[10:45:41] [19] eval-logloss:0.00106 train-logloss:0.00107
```
Once training finishes, the model file should be written to
`/tmp/nvflare/poc/site-1/${job_id}/test.model.json` and `/tmp/nvflare/poc/site-2/${job_id}/test.model.json`
respectively, where `job_id` is the UUID printed when we ran `submit_job`.
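Because horizontal federated training synchronizes gradient histograms across sites, both workers should end up with identical models; a quick sanity check (substituting your actual job ID):

```shell
# The two checksums should match.
md5sum /tmp/nvflare/poc/site-1/${job_id}/test.model.json \
       /tmp/nvflare/poc/site-2/${job_id}/test.model.json
```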
Finally, shut down everything from the admin CLI, using `admin` as the password:

```shell
shutdown client
shutdown server
```
## Training with GPUs
To run the demo with GPUs, make sure your machine has at least 2 GPUs. Build XGBoost with the federated learning plugin enabled along with CUDA support (see the README).
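This mirrors the CPU build; a sketch, assuming XGBoost's standard `USE_CUDA` CMake option on top of the federated plugin:

```shell
# Configure with both the federated plugin and CUDA support, then build.
cmake .. -DPLUGIN_FEDERATED=ON -DUSE_CUDA=ON
make -j$(nproc)
```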
Modify `../config/config_fed_client.json` and set `use_gpus` to `true`, then repeat the steps above.
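One way to flip the flag from the shell, assuming the key currently appears as `"use_gpus": false` in the JSON:

```shell
# Edit the client config in place to enable GPU training.
sed -i 's/"use_gpus": false/"use_gpus": true/' ../config/config_fed_client.json
```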