Change allreduce lib to rabit library; xgboost now runs with rabit
@@ -4,20 +4,16 @@ This folder contains information about experimental version of distributed xgboo

Build
=====
* In the root folder, run ```make mpi```; this will give you xgboost-mpi
* In the root folder, run ```make```; this will give you xgboost, which uses rabit allreduce
  - this version of xgboost should eventually be fault tolerant
* Alternatively, run ```make mpi```; this will give you xgboost-mpi
  - You will need to have MPI to build xgboost-mpi
* Alternatively, you can run ```make```; this will give you xgboost, which uses a beta built-in allreduce
  - You do not need MPI to build this; you can modify [submit_job_tcp.py](submit_job_tcp.py) to submit the job with any job scheduler you like

Design Choice
=====
* Must distributed xgboost rely on the MPI library?
  - No, XGBoost relies on the MPI protocols that provide Broadcast and AllReduce
  - The dependency is isolated in the [sync module](../src/sync/sync.h)
  - All other parts of the code use the interface defined in sync.h
  - [sync_mpi.cpp](../src/sync/sync_mpi.cpp) is an implementation of the sync interface using the standard MPI library; to use xgboost-mpi, you need an MPI library
  - If other platforms or frameworks implement these protocols, xgboost should naturally extend to them
  - As an example, [sync_tcp.cpp](../src/sync/sync_tcp.cpp) is an implementation of the interface using TCP, and is linked with xgboost by default
* XGBoost relies on the [Rabit Library](https://github.com/tqchen/rabit)
* Rabit is a fault-tolerant and portable allreduce library that provides Allreduce and Broadcast
* Since rabit is compatible with MPI, xgboost can be compiled using the MPI backend

* How is the data distributed?
  - There are two solvers in distributed xgboost
@@ -27,12 +23,10 @@ Design Choice
    it uses an approximate histogram count algorithm, and will only examine a subset of
    potential split points as opposed to all split points.


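To make the Allreduce and Broadcast primitives named under Design Choice concrete, here is a minimal single-process sketch. It only simulates the semantics with plain Python lists and does not call Rabit or MPI. The function names `allreduce_sum` and `broadcast` and the per-worker histogram values are hypothetical, chosen for illustration.

```python
# Single-process simulation of the two collective operations; no Rabit or MPI calls.
# All names and numbers here are illustrative assumptions.

def allreduce_sum(local_buffers):
    """Return what every worker would hold after a sum-Allreduce.

    local_buffers: one equal-length list of numbers per simulated worker.
    """
    total = [sum(values) for values in zip(*local_buffers)]
    # Every worker ends up with its own copy of the same reduced buffer.
    return [list(total) for _ in local_buffers]


def broadcast(local_values, root):
    """Return what every worker would hold after a Broadcast from `root`."""
    return [local_values[root] for _ in local_values]


# Hypothetical per-worker histogram counts over three candidate split bins.
worker_histograms = [
    [4, 1, 0],  # worker 0, computed from its own rows
    [2, 3, 5],  # worker 1
    [0, 6, 1],  # worker 2
]

reduced = allreduce_sum(worker_histograms)
for rank, hist in enumerate(reduced):
    print("worker %d sees global histogram %s" % (rank, hist))
```

In a row split setting, summing local statistics in this way is the kind of step that would let each process evaluate the same candidate split points without reading the other processes' rows.
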
Usage
====
* You will need a network filesystem, or you must copy the data to the local file system before running the code
* xgboost-mpi runs in an MPI environment
* xgboost can be used together with [submit_job_tcp.py](submit_job_tcp.py) on other types of job schedulers
* xgboost can be used together with the submission scripts provided in Rabit on different types of job schedulers
* ***Note*** The distributed version is still optimized for multi-threading.
  You should run one process per node that takes most of the available CPU;
  this will reduce the communication overhead and improve the performance.
@@ -1,12 +1,9 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col.sh <n-mpi-process>```
* run ```bash mushroom-col-rabit.sh <n-process>```
  - mushroom-col-rabit.sh starts an xgboost job using rabit's allreduce
* run ```bash mushroom-col-mpi.sh <n-mpi-process>```
  - mushroom-col-mpi.sh starts an xgboost-mpi job
* run ```bash mushroom-col-tcp.sh <n-process>```
  - mushroom-col-tcp.sh starts an xgboost job using xgboost's built-in allreduce
* run ```bash mushroom-col-python.sh <n-process>```
  - mushroom-col-python.sh starts an xgboost python job using xgboost's built-in allreduce
  - see mushroom-col.py

How to Use
====
@@ -16,7 +13,7 @@ How to Use

Notes
====
* The code is multi-threaded, so you want to run one xgboost-mpi per node
* The code is multi-threaded, so you want to run one process per node
* The code will work correctly as long as the union of the column subsets contains all the columns we are interested in.
  - The column subsets can overlap with each other.
* It uses exactly the same algorithm as the single node version, examining all potential split points.
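The coverage condition in the notes above can be checked before launching a job. The following is a minimal sketch, assuming each process's column subset is available as a Python set of feature indices. The helper name `covers_all_columns` and the example subsets are hypothetical and not part of xgboost.

```python
def covers_all_columns(column_subsets, wanted_columns):
    """Check that the union of the per-process subsets contains every wanted column.

    Overlap between subsets is allowed; only missing columns are a problem.
    Returns (ok, missing) where `missing` is the sorted list of uncovered columns.
    """
    covered = set()
    for subset in column_subsets:
        covered |= set(subset)
    missing = sorted(set(wanted_columns) - covered)
    return (not missing, missing)


# Hypothetical split of 6 features across 3 processes; feature 2 is shared.
subsets = [{0, 1, 2}, {2, 3}, {4, 5}]
print(covers_all_columns(subsets, range(6)))      # (True, [])
print(covers_all_columns(subsets[:2], range(6)))  # (False, [4, 5])
```
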
@@ -17,6 +17,6 @@ k=$1
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost mpi
../submit_job_tcp.py $k python mushroom-col.py
../../rabit/tracker/rabit_mpi.py $k local python mushroom-col.py

cat dump.nice.$k.txt
@@ -16,13 +16,13 @@ k=$1
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# run xgboost mpi
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col

# the model can be directly loaded by the single machine xgboost solver, as usual
../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt

# run for one round, and continue training
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col num_round=1
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model

cat dump.nice.$k.txt
cat dump.nice.$k.txt
@@ -1,6 +1,10 @@
import os
import sys
sys.path.append(os.path.dirname(__file__)+'/../wrapper')
path = os.path.dirname(__file__)
if path == '':
    path = '.'
sys.path.append(path+'/../../wrapper')

import xgboost as xgb
# this is an example script for running distributed xgboost using python

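The path setup in the hunk above guards against an empty directory name. As a minimal sketch of why that matters: `os.path.dirname` returns an empty string for a bare file name, so plain concatenation would yield a path rooted at `/`. The helper name `wrapper_dir` and the example directory `multi-node/col-split` are hypothetical; only `os.path.dirname` is a real library call.

```python
import os


def wrapper_dir(script_path, relative='/../../wrapper'):
    """Build the wrapper directory path relative to the script location."""
    path = os.path.dirname(script_path)
    if path == '':
        # A bare file name has no directory part; fall back to the current directory.
        path = '.'
    return path + relative


# Without the guard, a bare file name produces a path rooted at the filesystem root.
print(os.path.dirname('mushroom-col.py') + '/../../wrapper')  # /../../wrapper
print(wrapper_dir('mushroom-col.py'))                         # ./../../wrapper
print(wrapper_dir('multi-node/col-split/mushroom-col.py'))    # multi-node/col-split/../../wrapper
```
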
@@ -1,10 +1,10 @@
Distributed XGBoost: Row Split Version
====
* Mushroom: run ```bash mushroom-row.sh <n-mpi-process>```
* Machine: run ```bash machine-row.sh <n-mpi-process>```
* Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
  - machine-row-rabit.sh starts an xgboost job using rabit
* Mushroom: run ```bash mushroom-row-mpi.sh <n-mpi-process>```
* Machine: run ```bash machine-row-mpi.sh <n-mpi-process>```
  - The Machine case also includes an example of continuing training from an existing model
* Machine TCP: run ```bash machine-row-tcp.sh <n-mpi-process>```
  - machine-row-tcp.sh starts an xgboost job using xgboost's built-in allreduce

How to Use
====

@@ -1,24 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -

# split the lib svm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k

# run xgboost mpi
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=3

# run xgboost-mpi save model 0001, continue to run from existing model
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=1
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=2 model_in=0001.model
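The "split the lib svm file into k subfiles" step in the script above is what gives each process its own rows. Here is a minimal sketch of such a split, assuming a round-robin assignment of lines. This is not the actual `splitrows.py`, whose assignment strategy is not shown in this diff. The function name `split_rows` is hypothetical, and the `<prefix>.row<i>` output naming is an assumption guessed from the `train-machine.row*` cleanup pattern.

```python
import os
import tempfile


def split_rows(src_path, out_prefix, k):
    """Write line i of src_path to part (i mod k); return the part file names."""
    names = ['%s.row%d' % (out_prefix, i) for i in range(k)]
    parts = [open(name, 'w') for name in names]
    try:
        with open(src_path) as src:
            for i, line in enumerate(src):
                parts[i % k].write(line)
    finally:
        for part in parts:
            part.close()
    return names


# Tiny made-up libsvm-style input, just to exercise the function.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'toy.txt')
with open(src, 'w') as f:
    f.write('1 0:1.0\n0 1:2.0\n1 2:3.0\n0 0:4.0\n1 1:5.0\n')

for name in split_rows(src, os.path.join(workdir, 'train-toy'), 2):
    with open(name) as f:
        print(os.path.basename(name), len(f.readlines()))
```

Each process would then read only its own part file, which is why the usage notes ask for a network filesystem or a local copy of the data.
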
@@ -1,36 +0,0 @@
#!/usr/bin/python
"""
This is an example script to create a customized job submit
script using xgboost sync_tcp mode
"""
import sys
import os
import subprocess
# import the tcp_master.py
# add path to sync
sys.path.append(os.path.dirname(__file__)+'/../src/sync/')
import tcp_master as master

#
# Note: this submit script is only used for example purpose
# It does not have to be mpirun, it can be any job submission script that starts the job, qsub, hadoop streaming etc.
#
def mpi_submit(nslave, args):
    """
      customized submit script, that submit nslave jobs, each must contain args as parameter
      note this can be a lambda function containing additional parameters in input
      Parameters
         nslave number of slave process to start up
         args arguments to launch each job
              this usually includes the parameters of master_uri and parameters passed into submit
    """
    cmd = ' '.join(['mpirun -n %d' % nslave] + args)
    print cmd
    subprocess.check_call(cmd, shell = True)

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print 'Usage: <nslave> <cmd>'
        exit(0)
    # call submit, with nslave, the commands to run each job and submit function
    master.submit(int(sys.argv[1]), sys.argv[2:], fun_submit= mpi_submit)
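The `mpi_submit` function in the removed script only builds a launcher command and runs it. As a minimal sketch of that idea in Python 3, the following builds the command without executing it, so a different launcher prefix can be swapped in. The names `build_submit_cmd` and `launcher` are hypothetical, and quoting arguments with `shlex.quote` is a choice made here, not something the original script did.

```python
import shlex


def build_submit_cmd(nslave, args, launcher='mpirun -n %d'):
    """Return the shell command that would start `nslave` copies of a job.

    launcher: format string with one %d slot for the process count (assumed shape).
    args: command and arguments for each job, as a list of strings.
    """
    quoted = [shlex.quote(arg) for arg in args]
    return ' '.join([launcher % nslave] + quoted)


cmd = build_submit_cmd(3, ['../../xgboost', 'machine-row.conf', 'dsplit=row'])
print(cmd)  # mpirun -n 3 ../../xgboost machine-row.conf dsplit=row
```

The resulting string could then be passed to `subprocess.check_call(cmd, shell=True)`, as the original script did with its own command.
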