make wrapper ok
multi-node/README.md
@@ -4,17 +4,21 @@ This folder contains information about the experimental version of distributed xgboost
Build
=====
* In the root folder, run ```make mpi```; this will give you xgboost-mpi
  - You will need MPI to build xgboost-mpi
* Alternatively, you can run ```make```; this will give you xgboost, which uses a beta built-in allreduce
  - You do not need MPI to build this; you can modify [submit_job_tcp.py](submit_job_tcp.py) to use any job scheduler you like to submit the job

Design Choice
=====
* Must distributed xgboost rely on an MPI library?
  - No, xgboost relies on an MPI-style protocol that provides Broadcast and AllReduce
  - The dependency is isolated in the [sync module](../src/sync/sync.h)
  - All other parts of the code use the interface defined in sync.h
  - [sync_mpi.cpp](../src/sync/sync_mpi.cpp) is an implementation of the sync interface using a standard MPI library; to use xgboost-mpi, you need an MPI library
  - If a platform/framework implements these protocols, xgboost naturally extends to that platform
  - As an example, [sync_tcp.cpp](../src/sync/sync_tcp.cpp) is an implementation of the interface using TCP, and is linked with xgboost by default; a minimal sketch of this contract follows below
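
Since the whole sync dependency boils down to these two collective operations, a small Python sketch of the contract may help. This is an illustration only: the names below are hypothetical, and the real interface is the C++ one declared in [sync.h](../src/sync/sync.h).

```python
# Illustrative sketch (not the actual API) of the contract behind sync.h:
# any backend that provides these two collectives can drive distributed xgboost.
class SyncBackend:
    def broadcast(self, payload, root):
        """Deliver `payload` from node `root` to all nodes; every node returns it."""
        raise NotImplementedError

    def allreduce(self, values, op):
        """Reduce `values` element-wise across all nodes with `op` (e.g. sum, max);
        every node receives the same combined result."""
        raise NotImplementedError

class SingleNodeBackend(SyncBackend):
    """Degenerate one-node backend: with a single node both collectives are identities."""
    def broadcast(self, payload, root):
        return payload

    def allreduce(self, values, op):
        return list(values)
```

In these terms, sync_mpi.cpp and sync_tcp.cpp are two such backends: one delegates to an MPI library, the other implements the collectives over raw TCP.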

* How is the data distributed?
  - There are two solvers in distributed xgboost
  - The column-based solver splits the data by column; each node works on a subset of columns,
@@ -26,10 +30,11 @@ Design Choice

Usage
====
* You will need a network filesystem, or copy the data to the local file system, before running the code
* xgboost-mpi runs in an MPI environment
* xgboost can be used together with [submit_job_tcp.py](submit_job_tcp.py) on other types of job schedulers
* ***Note*** The distributed version is still multi-threading optimized;
  you should run one process per node that takes most of the available CPU,
  as this will reduce the communication overhead and improve performance.
  - One way to do that is to limit the MPI slots on each machine to 1, or to reserve nthread processors for each process.
* Examples:

multi-node/col-split/README.md
@@ -1,6 +1,11 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col.sh <n-mpi-process>```
* run ```bash mushroom-col-tcp.sh <n-process>```
  - mushroom-col-tcp.sh starts the xgboost job using xgboost's built-in allreduce
* run ```bash mushroom-col-python.sh <n-process>```
  - mushroom-col-python.sh starts the xgboost python job using xgboost's built-in allreduce
  - see mushroom-col.py

How to Use
====

22 multi-node/col-split/mushroom-col-python.sh Executable file
@@ -0,0 +1,22 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: <nprocess>"
    exit 1
fi

#
# This script is the same as mushroom-col.sh, except that it uses the xgboost python module
#
# xgboost uses the built-in TCP-based allreduce module, so it can run in more environments,
# as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k column-wise subfiles (an illustrative splitter sketch follows this file)
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# submit the distributed xgboost job through the TCP-based job submitter
../submit_job_tcp.py $k python mushroom-col.py

cat dump.nice.$k.txt
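
The split step above relies on splitsvm.py, whose source is not part of this diff. For illustration only, here is a minimal sketch of how such a column-wise libsvm splitter could work; the real splitsvm.py may assign features to parts differently:

```python
# splitsvm_sketch.py -- illustrative stand-in for splitsvm.py (not the actual
# script): split a libsvm file into k column-wise parts named <dst>.col0..col{k-1}.
# Feature index f goes to part f % k; the label is replicated into every part.
import sys

def split_columns(src, dst_prefix, k):
    outs = [open('%s.col%d' % (dst_prefix, i), 'w') for i in range(k)]
    with open(src) as fin:
        for line in fin:
            fields = line.split()
            label, feats = fields[0], fields[1:]
            parts = [[] for _ in range(k)]
            for kv in feats:
                findex = int(kv.split(':', 1)[0])
                parts[findex % k].append(kv)
            for i, out in enumerate(outs):
                out.write(' '.join([label] + parts[i]) + '\n')
    for out in outs:
        out.close()

if __name__ == '__main__':
    split_columns(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

This matches the file naming the demo expects: mushroom-col.py on rank r reads train.col{r}.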

29 multi-node/col-split/mushroom-col.py Normal file
@@ -0,0 +1,29 @@
import os
import sys
# the python wrapper lives in <repo-root>/wrapper, two directory levels above this
# script (cf. the ../../demo paths below); `or '.'` keeps the path valid when the
# script is launched from its own directory, as ../submit_job_tcp.py does
sys.path.append(os.path.join(os.path.dirname(__file__) or '.', '..', '..', 'wrapper'))
import xgboost as xgb
# this is an example script of running distributed xgboost using python

# call this additional function to initialize the xgboost sync module
# in distributed mode
xgb.sync_init(sys.argv)
rank = xgb.sync_get_rank()
# read in the dataset: each node loads the column subset produced by splitsvm.py
dtrain = xgb.DMatrix('train.col%d' % rank)
param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
# tell the learner that the data is split by column
param['dsplit'] = 'col'
nround = 3

if rank == 0:
    # the master node also evaluates on the train/test sets during training
    dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
    model = xgb.train(param, dtrain, nround, [(dtrain, 'train'), (dtest, 'test')])
else:
    # if it is a slave node, do not run evaluation
    model = xgb.train(param, dtrain, nround)

if rank == 0:
    model.save_model('%04d.model' % nround)
    # dump the model with the feature map
    model.dump_model('dump.nice.%d.txt' % xgb.sync_get_world_size(), '../../demo/data/featmap.txt')
# shutdown the synchronization module
xgb.sync_finalize()
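
As a small follow-up (not part of this commit), the model saved by rank 0 can be loaded back with the ordinary single-machine python wrapper for prediction. A sketch, assuming the standard wrapper API and the 0003.model name produced by nround = 3:

```python
# predict_sketch.py -- illustrative follow-up (not part of this commit):
# load the model saved by the master node and run single-machine prediction.
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__) or '.', '..', '..', 'wrapper'))
import xgboost as xgb

bst = xgb.Booster(model_file='0003.model')  # '%04d.model' % 3 from mushroom-col.py
dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
preds = bst.predict(dtest)
print('first predictions: %s' % str(preds[:5]))
```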

multi-node/submit_job_tcp.py
@@ -11,6 +11,10 @@ import subprocess
sys.path.append(os.path.dirname(__file__) + '/../src/sync/')
import tcp_master as master

#
# Note: this submit script is only used for example purposes
# It does not have to be mpirun; it can be any job submission program that starts the jobs: qsub, hadoop streaming, etc.
#
def mpi_submit(nslave, args):
    """
    Customized submit function: submits nslave jobs, each of which must take args as its parameters
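
The diff cuts off here. To make the "does not have to be mpirun" point concrete, here is a hypothetical alternative submit function (not part of this commit) that simply starts nslave worker processes on the local machine:

```python
# Hypothetical alternative to mpi_submit (illustration only): launch the
# nslave workers as plain local processes instead of going through mpirun.
import subprocess

def local_submit(nslave, args):
    """Start nslave copies of the job locally and wait for all of them."""
    procs = [subprocess.Popen(args) for _ in range(nslave)]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError('a worker process exited with a nonzero status')
```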