cleanup multi-node
This commit is contained in:
parent b762231b02
commit b1f89f29b8
@@ -1,38 +1,36 @@
 Distributed XGBoost
 ======
-This folder contains information about the experimental version of distributed xgboost.
+This folder contains information about Distributed XGBoost.
 
-Build
-=====
-* In the root folder, run ```make```; this will give you xgboost, which uses rabit allreduce
-  - this version of xgboost should eventually be fault tolerant
-* Alternatively, run ```make mpi```; this will give you xgboost-mpi
-  - You will need MPI to build xgboost-mpi
-
-Design Choice
-=====
-* XGBoost relies on the [Rabit Library](https://github.com/tqchen/rabit)
-* Rabit is a fault-tolerant and portable allreduce library that provides Allreduce and Broadcast
-* Since rabit is compatible with MPI, xgboost can be compiled with an MPI backend
-
-* How is the data distributed?
-  - There are two solvers in distributed xgboost
-  - The column-based solver splits data by column; each node works on a subset of columns,
-    using exactly the same algorithm as the single-node version.
-  - The row-based solver splits data by row; each node works on a subset of rows,
-    using an approximate histogram count algorithm that will only examine a subset of
-    potential split points, as opposed to all split points.
-  - The Hadoop version can run on an existing hadoop platform;
-    it uses Rabit to submit jobs as map-reduce tasks.
+* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/tqchen/rabit)
+  - Rabit is a portable library that provides fault tolerance for Allreduce calls in distributed machine learning
+  - This makes xgboost portable and fault tolerant against node failures
+* You can run Distributed XGBoost on the common platforms that rabit ports to,
+  including Hadoop (see the [hadoop folder](hadoop)) and MPI
 
 Usage
-====
-* You will need a network filesystem, or copy data to the local file system before running the code
-* xgboost can be used together with the submission scripts provided in Rabit for different types of job scheduler
-* ***Note*** The distributed version is still multi-threading optimized.
-  You should run one process per node that takes most of the available CPU;
-  this will reduce the communication overhead and improve the performance.
-  - One way to do that is to limit the mpi slots on each machine to 1, or to reserve nthread processors for each process.
-* Examples:
-  - [Column-based version](col-split)
-  - [Row-based version](row-split)
+=====
+* In the root folder, run ```./build.sh```; this will give you xgboost, which uses rabit allreduce
+
+Notes
+====
+* Rabit handles all the fault tolerance and communication efficiently; we only use platform-specific commands to start programs
+  - The Hadoop version does not rely on MapReduce to do iterations
+  - You can expect xgboost not to suffer the drawbacks of iterative MapReduce programs
+* This design choice was made because Allreduce is very natural and efficient for distributed tree building
+  - In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code
+* The multi-threading nature of xgboost is inherited in distributed mode
+  - This means xgboost efficiently uses all the threads on one machine, and communicates only between machines
+  - Remember to run one xgboost process per machine; this will give you the maximum speedup
+* For more information about rabit and how it works, see the [tutorial](https://github.com/tqchen/rabit/tree/master/guide)
+
+Solvers
+=====
+There are two solvers in distributed xgboost; for local demos of the two solvers, see [row-split](row-split) and [col-split](col-split)
+* The column-based solver splits data by column; each node works on a subset of columns,
+  using exactly the same algorithm as the single-node version.
+* The row-based solver splits data by row; each node works on a subset of rows,
+  using an approximate histogram count algorithm that will only examine a subset of
+  potential split points, as opposed to all split points.
+  - This is the mode used by the current hadoop version, since data is usually stored by rows in many industry systems
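To make the data-distribution idea in the new "Solvers" section concrete, here is a minimal sketch of how a LIBSVM training file could be partitioned by row into ```k``` shards, in the spirit of the ```splitrows.py```/```splitsvm.py``` helpers invoked by the demo scripts later in this diff. The function name, round-robin assignment, and the exact ```<prefix>.row<i>``` naming are illustrative assumptions, not the shipped implementation.

```python
# Minimal sketch (not the shipped splitrows.py): shard a LIBSVM file by row so
# that rank i of a k-process distributed xgboost job can read "<prefix>.row<i>",
# matching conf entries like data = "train.row%d".
import sys

def split_rows(src_path, dst_prefix, k):
    # one output shard per worker rank
    shards = [open('%s.row%d' % (dst_prefix, i), 'w') for i in range(k)]
    try:
        with open(src_path) as fin:
            for lineno, line in enumerate(fin):
                # round-robin: example j goes to shard j % k
                shards[lineno % k].write(line)
    finally:
        for f in shards:
            f.close()

if __name__ == '__main__':
    # usage mirrors the demo scripts: python split_rows.py <libsvm-file> <prefix> <k>
    split_rows(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```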
@@ -4,8 +4,6 @@ Distributed XGBoost: Column Split Version
   - mushroom-col-rabit.sh starts the xgboost job using rabit's allreduce
 * run ```bash mushroom-col-rabit-mock.sh <n-process>```
   - mushroom-col-rabit-mock.sh starts the xgboost job using rabit's allreduce, inserts a suicide signal at a certain point and tests recovery
-* run ```bash mushroom-col-mpi.sh <n-mpi-process>```
-  - mushroom-col.sh starts the xgboost-mpi job
 
 How to Use
 ====
@@ -1,24 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train.col* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col
-
-# the model can be directly loaded by the single-machine xgboost solver, as usual
-../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-
-# run for one round, then continue training from the saved model
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col num_round=1
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col model_in=0001.model
-
-cat dump.nice.$k.txt
@@ -1,22 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-#
-# This script is the same as mushroom-col except that we use the xgboost python module
-#
-# xgboost uses the built-in tcp-based allreduce module, and can run in more environments, so long as we know how to start the job by modifying ../submit_job_tcp.py
-#
-rm -rf train.col* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run distributed xgboost via the rabit mpi tracker
-../../rabit/tracker/rabit_mpi.py $k local python mushroom-col.py
-
-cat dump.nice.$k.txt
@@ -1,33 +0,0 @@
-import os
-import sys
-path = os.path.dirname(__file__)
-if path == '':
-    path = '.'
-sys.path.append(path + '/../../wrapper')
-
-import xgboost as xgb
-# this is an example script of running distributed xgboost using python
-
-# call this additional function to initialize the xgboost sync module
-# in distributed mode
-xgb.sync_init(sys.argv)
-rank = xgb.sync_get_rank()
-# read in the dataset: each rank loads its own column shard
-dtrain = xgb.DMatrix('train.col%d' % rank)
-param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
-param['dsplit'] = 'col'
-nround = 3
-
-if rank == 0:
-    dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
-    model = xgb.train(param, dtrain, nround, [(dtrain, 'train'), (dtest, 'test')])
-else:
-    # if it is a slave node, do not run evaluation
-    model = xgb.train(param, dtrain, nround)
-
-if rank == 0:
-    model.save_model('%04d.model' % nround)
-    # dump the model with the feature map
-    model.dump_model('dump.nice.%d.txt' % xgb.sync_get_world_size(), '../../demo/data/featmap.txt')
-# shut down the synchronization module
-xgb.sync_finalize()
@@ -1,10 +1,8 @@
 Distributed XGBoost: Row Split Version
 ====
+* You might be interested to check out the [Hadoop example](../hadoop)
 * Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
   - machine-row-rabit.sh starts the xgboost job using rabit
-* Mushroom: run ```bash mushroom-row-mpi.sh <n-mpi-process>```
-* Machine: run ```bash machine-row-mpi.sh <n-mpi-process>```
-  - The machine case also includes an example of continuing training from an existing model
 
 How to Use
 ====
@@ -1,20 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train-machine.row* *.model
-k=$1
-# make the machine data
-cd ../../demo/regression/
-python mapfeat.py
-python mknfold.py machine.txt 1
-cd -
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/regression/machine.txt.train train-machine $k
-
-# run xgboost, taking data from stdin through map.sh
-../submit_job_tcp.py $k "bash map.sh train-machine.row ../../xgboost machine-row.conf dsplit=row num_round=3 data=stdin"
@@ -1,24 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train-machine.row* *.model
-k=$1
-# make the machine data
-cd ../../demo/regression/
-python mapfeat.py
-python mknfold.py machine.txt 1
-cd -
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/regression/machine.txt.train train-machine $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=3
-
-# run xgboost-mpi for one round, save model 0001, then continue training from the existing model
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=1
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=2 model_in=0001.model
@@ -1,3 +0,0 @@
-# a simple script to simulate a mapreduce mapper
-echo "cat $1$OMPI_COMM_WORLD_RANK | ${@:2}"
-cat $1$OMPI_COMM_WORLD_RANK | ${@:2}
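The removed ```map.sh``` above simulates a MapReduce mapper: it pipes the row shard for this process's rank into the wrapped xgboost command, which reads it via ```data=stdin```. For readers more comfortable with python, here is an equivalent sketch; the ```OMPI_COMM_WORLD_RANK``` environment variable and the piping behaviour mirror the shell script, while the helper itself (name, argument layout) is illustrative only.

```python
# Sketch: python equivalent of map.sh -- pipe the shard for this rank into the
# wrapped command's stdin (as used with `dsplit=row data=stdin`).
import os
import subprocess
import sys

def run_with_shard(prefix, cmd):
    rank = os.environ.get('OMPI_COMM_WORLD_RANK', '0')  # rank assigned by the MPI launcher
    shard = prefix + rank                               # e.g. train-machine.row3
    with open(shard, 'rb') as fin:
        # equivalent of: cat <shard> | <cmd...>
        subprocess.check_call(cmd, stdin=fin)

if __name__ == '__main__':
    # usage mirrors map.sh: python map_sketch.py <shard-prefix> <command> [args...]
    run_with_shard(sys.argv[1], sys.argv[2:])
```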
@@ -1,19 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train.row* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi mushroom-row.conf dsplit=row nthread=1
-
-# the model can be directly loaded by the single-machine xgboost solver, as usual
-../../xgboost mushroom-row.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-cat dump.nice.$k.txt
@@ -1,35 +0,0 @@
-# General Parameters, see the comment for each definition
-# choose the booster, can be gbtree or gblinear
-booster = gbtree
-# choose the logistic regression loss function for binary classification
-objective = binary:logistic
-
-# Tree Booster Parameters
-# step size shrinkage
-eta = 1.0
-# minimum loss reduction required to make a further partition
-gamma = 1.0
-# minimum sum of instance weight (hessian) needed in a child
-min_child_weight = 1
-# maximum depth of a tree
-max_depth = 3
-
-# Task Parameters
-# the number of rounds of boosting
-num_round = 2
-# 0 means do not save any model except the final-round model
-save_period = 0
-use_buffer = 0
-
-# The path of the training data; %d is the wildcard for the rank of the process
-# The idea is that each process takes a feature matrix with a subset of the rows
-#
-data = "train.row%d"
-
-# The path of the validation data, used to monitor the training process; here [test] sets the name of the validation set
-eval[test] = "../../demo/data/agaricus.txt.test"
-# evaluate on the training data as well each round
-eval_train = 1
-
-# The path of the test data; it needs the full test data, so try not to use it, or keep a subsampled version
-test:data = "../../demo/data/agaricus.txt.test"
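For reference, the training setup described by this removed conf file can also be expressed through the python wrapper shown elsewhere in this commit. The sketch below maps the conf keys onto the wrapper's ```param``` dict in the style of the removed ```mushroom-col.py``` demo; whether every key (e.g. ```booster``` and ```dsplit```) is accepted exactly this way follows that demo and is an assumption, not a documented API.

```python
# Sketch: the removed mushroom-row.conf expressed as python wrapper parameters,
# following the pattern of the (also removed) mushroom-col.py demo above.
import sys
sys.path.append('../../wrapper')        # path assumption, as in the demo script
import xgboost as xgb

param = {
    'booster': 'gbtree',                # choose the booster
    'objective': 'binary:logistic',     # logistic regression for binary classification
    'eta': 1.0,                         # step size shrinkage
    'gamma': 1.0,                       # minimum loss reduction to make a further partition
    'min_child_weight': 1,              # minimum sum of instance weight (hessian) in a child
    'max_depth': 3,                     # maximum depth of a tree
    'dsplit': 'row',                    # row-based data split, as in the conf
}
num_round = 2                           # number of boosting rounds

# each rank would load its own shard, e.g. 'train.row%d' % rank
dtrain = xgb.DMatrix('train.row0')
dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')  # validation set named [test] in the conf
model = xgb.train(param, dtrain, num_round, [(dtrain, 'train'), (dtest, 'test')])
```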
@@ -33,10 +33,6 @@ xglib.XGBoosterCreate.restype = ctypes.c_void_p
 xglib.XGBoosterPredict.restype = ctypes.POINTER(ctypes.c_float)
 xglib.XGBoosterEvalOneIter.restype = ctypes.c_char_p
 xglib.XGBoosterDumpModel.restype = ctypes.POINTER(ctypes.c_char_p)
-# sync function
-xglib.XGSyncGetRank.restype = ctypes.c_int
-xglib.XGSyncGetWorldSize.restype = ctypes.c_int
-# initialize communication module
 
 def ctypes2numpy(cptr, length, dtype):
     """convert a ctypes pointer array to numpy array """
@@ -557,17 +553,3 @@ def cv(params, dtrain, num_boost_round = 10, nfold=3, metrics=[], \
         results.append(res)
     return results
 
-# synchronization module
-def sync_init(args = sys.argv):
-    arr = (ctypes.c_char_p * len(args))()
-    arr[:] = args
-    xglib.XGSyncInit(len(args), arr)
-
-def sync_finalize():
-    xglib.XGSyncFinalize()
-
-def sync_get_rank():
-    return int(xglib.XGSyncGetRank())
-
-def sync_get_world_size():
-    return int(xglib.XGSyncGetWorldSize())
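The ```sync_*``` helpers removed above were the small synchronization surface that distributed demos such as the deleted ```mushroom-col.py``` relied on. For context, here is a minimal usage sketch; it assumes the pre-cleanup wrapper (with ```sync_init```/```sync_finalize``` still present) is importable from the path shown.

```python
# Sketch of how the removed sync_* helpers were used (pre-cleanup wrapper assumed).
import sys
sys.path.append('../../wrapper')   # path assumption, matching the demo scripts
import xgboost as xgb

xgb.sync_init(sys.argv)            # initialize the synchronization module (wraps XGSyncInit)
rank = xgb.sync_get_rank()         # this process's rank in the distributed job
size = xgb.sync_get_world_size()   # total number of processes

# each rank trains on its own column shard, exactly as in mushroom-col.py
dtrain = xgb.DMatrix('train.col%d' % rank)
param = {'max_depth': 3, 'eta': 1, 'silent': 1,
         'objective': 'binary:logistic', 'dsplit': 'col'}
model = xgb.train(param, dtrain, 3)

if rank == 0:
    print('trained on %d processes' % size)
xgb.sync_finalize()                # shut down the synchronization module (wraps XGSyncFinalize)
```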
@@ -82,23 +82,6 @@ class Booster: public learner::BoostLearner {
 using namespace xgboost::wrapper;
 
 extern "C"{
-  void XGSyncInit(int argc, char *argv[]) {
-    rabit::Init(argc, argv);
-    if (rabit::GetWorldSize() != 1) {
-      std::string pname = rabit::GetProcessorName();
-      utils::Printf("distributed job start %s:%d\n", pname.c_str(), rabit::GetRank());
-    }
-  }
-  void XGSyncFinalize(void) {
-    rabit::Finalize();
-  }
-  int XGSyncGetRank(void) {
-    int rank = rabit::GetRank();
-    return rank;
-  }
-  int XGSyncGetWorldSize(void) {
-    return rabit::GetWorldSize();
-  }
  void* XGDMatrixCreateFromFile(const char *fname, int silent) {
    return LoadDataMatrix(fname, silent != 0, false);
  }
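These removed C entry points were what the python wrapper's ```sync_*``` helpers (deleted in the hunk above) reached through ctypes. The sketch below shows that binding pattern in isolation, using only the restype declarations and calls that appear in this commit; the shared-library name and path are assumptions for illustration, and the argv handling follows the python 2 era wrapper code.

```python
# Sketch: how the removed XGSync* C entry points were driven from python via ctypes.
import ctypes
import sys

# assumed library name/path; the real wrapper loads its own shared library
xglib = ctypes.cdll.LoadLibrary('./libxgboostwrapper.so')

# restype declarations as in the (removed) wrapper code
xglib.XGSyncGetRank.restype = ctypes.c_int
xglib.XGSyncGetWorldSize.restype = ctypes.c_int

def sync_init(args=sys.argv):
    # pass argv through so rabit::Init can pick up its arguments (python 2 style strings)
    arr = (ctypes.c_char_p * len(args))()
    arr[:] = args
    xglib.XGSyncInit(len(args), arr)

sync_init()
print('rank %d of %d' % (int(xglib.XGSyncGetRank()),
                         int(xglib.XGSyncGetWorldSize())))
xglib.XGSyncFinalize()
```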