cleanup multi-node
This commit is contained in:
parent b762231b02
commit b1f89f29b8
@@ -1,38 +1,36 @@
 Distributed XGBoost
 ======
-This folder contains information about the experimental version of distributed xgboost.
+This folder contains information about Distributed XGBoost.
 
-Build
-=====
-* In the root folder, run ```make```; this will give you xgboost, which uses rabit allreduce
-  - this version of xgboost should eventually be fault tolerant
-* Alternatively, run ```make mpi```; this will give you xgboost-mpi
-  - You will need MPI to build xgboost-mpi
-
-Design Choice
-=====
-* XGBoost relies on the [Rabit Library](https://github.com/tqchen/rabit)
-* Rabit is a fault-tolerant and portable allreduce library that provides Allreduce and Broadcast
-* Since rabit is compatible with MPI, xgboost can be compiled with an MPI backend
-
-* How is the data distributed?
-  - There are two solvers in distributed xgboost
-  - The column-based solver splits data by column; each node works on a subset of columns,
-    using exactly the same algorithm as the single-node version.
-  - The row-based solver splits data by row; each node works on a subset of rows,
-    using an approximate histogram count algorithm that will only examine a subset of
-    potential split points, as opposed to all split points.
-  - The Hadoop version can run on an existing hadoop platform;
-    it uses Rabit to submit jobs as map-reduce tasks.
+* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/tqchen/rabit)
+  - Rabit is a portable library that provides fault tolerance for Allreduce calls in distributed machine learning
+  - This makes xgboost portable and fault tolerant against node failures
+* You can run Distributed XGBoost on the common platforms that rabit ports to,
+  including Hadoop (see the [hadoop folder](hadoop)) and MPI
 
 Usage
-====
-* You will need a network filesystem, or copy data to the local file system before running the code
-* xgboost can be used together with the submission scripts provided in Rabit for different types of job scheduler
-* ***Note*** The distributed version is still multi-threading optimized.
-  You should run one process per node that takes most of the available CPU;
-  this will reduce the communication overhead and improve the performance.
-  - One way to do that is to limit the mpi slots on each machine to 1, or to reserve nthread processors for each process.
-* Examples:
-  - [Column-based version](col-split)
-  - [Row-based version](row-split)
+=====
+* In the root folder, run ```./build.sh```; this will give you xgboost, which uses rabit allreduce
+
+Notes
+====
+* Rabit handles all the fault tolerance and communication efficiently; we only use platform-specific commands to start programs
+  - The Hadoop version does not rely on MapReduce to do iterations
+  - You can expect xgboost not to suffer the drawbacks of iterative MapReduce programs
+* This design choice was made because Allreduce is very natural and efficient for distributed tree building
+  - In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code
+* The multi-threading nature of xgboost is inherited in distributed mode
+  - This means xgboost efficiently uses all the threads on one machine, and communicates only between machines
+  - Remember to run one xgboost process per machine; this will give you the maximum speedup
+* For more information about rabit and how it works, see the [tutorial](https://github.com/tqchen/rabit/tree/master/guide)
+
+Solvers
+=====
+There are two solvers in distributed xgboost; for local demos of the two solvers, see [row-split](row-split) and [col-split](col-split)
+* The column-based solver splits data by column; each node works on a subset of columns,
+  using exactly the same algorithm as the single-node version.
+* The row-based solver splits data by row; each node works on a subset of rows,
+  using an approximate histogram count algorithm that will only examine a subset of
+  potential split points, as opposed to all split points.
+  - This is the mode used by the current hadoop version, since data is usually stored by rows in many industry systems
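To make the data-distribution idea in the new "Solvers" section concrete, here is a minimal sketch of how a LIBSVM training file could be partitioned by row into ```k``` shards, in the spirit of the ```splitrows.py```/```splitsvm.py``` helpers invoked by the demo scripts later in this diff. The function name, round-robin assignment, and the exact ```<prefix>.row<i>``` naming are illustrative assumptions, not the shipped implementation.

```python
# Minimal sketch (not the shipped splitrows.py): shard a LIBSVM file by row so
# that rank i of a k-process distributed xgboost job can read "<prefix>.row<i>",
# matching conf entries like data = "train.row%d".
import sys

def split_rows(src_path, dst_prefix, k):
    # one output shard per worker rank
    shards = [open('%s.row%d' % (dst_prefix, i), 'w') for i in range(k)]
    try:
        with open(src_path) as fin:
            for lineno, line in enumerate(fin):
                # round-robin: example j goes to shard j % k
                shards[lineno % k].write(line)
    finally:
        for f in shards:
            f.close()

if __name__ == '__main__':
    # usage mirrors the demo scripts: python split_rows.py <libsvm-file> <prefix> <k>
    split_rows(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```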
@@ -4,8 +4,6 @@ Distributed XGBoost: Column Split Version
   - mushroom-col-rabit.sh starts the xgboost job using rabit's allreduce
 * run ```bash mushroom-col-rabit-mock.sh <n-process>```
   - mushroom-col-rabit-mock.sh starts the xgboost job using rabit's allreduce, inserts a suicide signal at a certain point and tests recovery
-* run ```bash mushroom-col-mpi.sh <n-mpi-process>```
-  - mushroom-col.sh starts the xgboost-mpi job
 
 How to Use
 ====
@@ -1,24 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train.col* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col
-
-# the model can be directly loaded by the single-machine xgboost solver, as usual
-../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-
-# run for one round, then continue training from the saved model
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col num_round=1
-mpirun -n $k ../../xgboost-mpi mushroom-col.conf dsplit=col model_in=0001.model
-
-cat dump.nice.$k.txt
@@ -1,22 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-#
-# This script is the same as mushroom-col except that we use the xgboost python module
-#
-# xgboost uses the built-in tcp-based allreduce module, and can run in more environments, so long as we know how to start the job by modifying ../submit_job_tcp.py
-#
-rm -rf train.col* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitsvm.py ../../demo/data/agaricus.txt.train train $k
-
-# run distributed xgboost via the rabit mpi tracker
-../../rabit/tracker/rabit_mpi.py $k local python mushroom-col.py
-
-cat dump.nice.$k.txt
@@ -1,33 +0,0 @@
-import os
-import sys
-path = os.path.dirname(__file__)
-if path == '':
-    path = '.'
-sys.path.append(path + '/../../wrapper')
-
-import xgboost as xgb
-# this is an example script of running distributed xgboost using python
-
-# call this additional function to initialize the xgboost sync module
-# in distributed mode
-xgb.sync_init(sys.argv)
-rank = xgb.sync_get_rank()
-# read in the dataset: each rank loads its own column shard
-dtrain = xgb.DMatrix('train.col%d' % rank)
-param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
-param['dsplit'] = 'col'
-nround = 3
-
-if rank == 0:
-    dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
-    model = xgb.train(param, dtrain, nround, [(dtrain, 'train'), (dtest, 'test')])
-else:
-    # if it is a slave node, do not run evaluation
-    model = xgb.train(param, dtrain, nround)
-
-if rank == 0:
-    model.save_model('%04d.model' % nround)
-    # dump the model with the feature map
-    model.dump_model('dump.nice.%d.txt' % xgb.sync_get_world_size(), '../../demo/data/featmap.txt')
-# shut down the synchronization module
-xgb.sync_finalize()
@@ -1,10 +1,8 @@
 Distributed XGBoost: Row Split Version
 ====
+* You might be interested to check out the [Hadoop example](../hadoop)
 * Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
   - machine-row-rabit.sh starts the xgboost job using rabit
-* Mushroom: run ```bash mushroom-row-mpi.sh <n-mpi-process>```
-* Machine: run ```bash machine-row-mpi.sh <n-mpi-process>```
-  - The machine case also includes an example of continuing training from an existing model
 
 How to Use
 ====
@@ -1,20 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train-machine.row* *.model
-k=$1
-# make the machine data
-cd ../../demo/regression/
-python mapfeat.py
-python mknfold.py machine.txt 1
-cd -
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/regression/machine.txt.train train-machine $k
-
-# run xgboost, taking data from stdin through map.sh
-../submit_job_tcp.py $k "bash map.sh train-machine.row ../../xgboost machine-row.conf dsplit=row num_round=3 data=stdin"
@@ -1,24 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train-machine.row* *.model
-k=$1
-# make the machine data
-cd ../../demo/regression/
-python mapfeat.py
-python mknfold.py machine.txt 1
-cd -
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/regression/machine.txt.train train-machine $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=3
-
-# run xgboost-mpi for one round, save model 0001, then continue training from the existing model
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=1
-mpirun -n $k ../../xgboost-mpi machine-row.conf dsplit=row num_round=2 model_in=0001.model
@@ -1,3 +0,0 @@
-# a simple script to simulate a mapreduce mapper
-echo "cat $1$OMPI_COMM_WORLD_RANK | ${@:2}"
-cat $1$OMPI_COMM_WORLD_RANK | ${@:2}
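The removed ```map.sh``` above simulates a MapReduce mapper: it pipes the row shard for this process's rank into the wrapped xgboost command, which reads it via ```data=stdin```. For readers more comfortable with python, here is an equivalent sketch; the ```OMPI_COMM_WORLD_RANK``` environment variable and the piping behaviour mirror the shell script, while the helper itself (name, argument layout) is illustrative only.

```python
# Sketch: python equivalent of map.sh -- pipe the shard for this rank into the
# wrapped command's stdin (as used with `dsplit=row data=stdin`).
import os
import subprocess
import sys

def run_with_shard(prefix, cmd):
    rank = os.environ.get('OMPI_COMM_WORLD_RANK', '0')  # rank assigned by the MPI launcher
    shard = prefix + rank                               # e.g. train-machine.row3
    with open(shard, 'rb') as fin:
        # equivalent of: cat <shard> | <cmd...>
        subprocess.check_call(cmd, stdin=fin)

if __name__ == '__main__':
    # usage mirrors map.sh: python map_sketch.py <shard-prefix> <command> [args...]
    run_with_shard(sys.argv[1], sys.argv[2:])
```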
@@ -1,19 +0,0 @@
-#!/bin/bash
-if [[ $# -ne 1 ]]
-then
-    echo "Usage: nprocess"
-    exit -1
-fi
-
-rm -rf train.row* *.model
-k=$1
-
-# split the libsvm file into k subfiles
-python splitrows.py ../../demo/data/agaricus.txt.train train $k
-
-# run xgboost-mpi
-mpirun -n $k ../../xgboost-mpi mushroom-row.conf dsplit=row nthread=1
-
-# the model can be directly loaded by the single-machine xgboost solver, as usual
-../../xgboost mushroom-row.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
-cat dump.nice.$k.txt
@@ -1,35 +0,0 @@
-# General Parameters, see the comment for each definition
-# choose the booster, can be gbtree or gblinear
-booster = gbtree
-# choose the logistic regression loss function for binary classification
-objective = binary:logistic
-
-# Tree Booster Parameters
-# step size shrinkage
-eta = 1.0
-# minimum loss reduction required to make a further partition
-gamma = 1.0
-# minimum sum of instance weight (hessian) needed in a child
-min_child_weight = 1
-# maximum depth of a tree
-max_depth = 3
-
-# Task Parameters
-# the number of rounds of boosting
-num_round = 2
-# 0 means do not save any model except the final-round model
-save_period = 0
-use_buffer = 0
-
-# The path of the training data; %d is the wildcard for the rank of the process
-# The idea is that each process takes a feature matrix with a subset of the rows
-#
-data = "train.row%d"
-
-# The path of the validation data, used to monitor the training process; here [test] sets the name of the validation set
-eval[test] = "../../demo/data/agaricus.txt.test"
-# evaluate on the training data as well each round
-eval_train = 1
-
-# The path of the test data; it needs the full test data, so try not to use it, or keep a subsampled version
-test:data = "../../demo/data/agaricus.txt.test"
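For reference, the training setup described by this removed conf file can also be expressed through the python wrapper shown elsewhere in this commit. The sketch below maps the conf keys onto the wrapper's ```param``` dict in the style of the removed ```mushroom-col.py``` demo; whether every key (e.g. ```booster``` and ```dsplit```) is accepted exactly this way follows that demo and is an assumption, not a documented API.

```python
# Sketch: the removed mushroom-row.conf expressed as python wrapper parameters,
# following the pattern of the (also removed) mushroom-col.py demo above.
import sys
sys.path.append('../../wrapper')        # path assumption, as in the demo script
import xgboost as xgb

param = {
    'booster': 'gbtree',                # choose the booster
    'objective': 'binary:logistic',     # logistic regression for binary classification
    'eta': 1.0,                         # step size shrinkage
    'gamma': 1.0,                       # minimum loss reduction to make a further partition
    'min_child_weight': 1,              # minimum sum of instance weight (hessian) in a child
    'max_depth': 3,                     # maximum depth of a tree
    'dsplit': 'row',                    # row-based data split, as in the conf
}
num_round = 2                           # number of boosting rounds

# each rank would load its own shard, e.g. 'train.row%d' % rank
dtrain = xgb.DMatrix('train.row0')
dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')  # validation set named [test] in the conf
model = xgb.train(param, dtrain, num_round, [(dtrain, 'train'), (dtest, 'test')])
```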
@@ -33,10 +33,6 @@ xglib.XGBoosterCreate.restype = ctypes.c_void_p
 xglib.XGBoosterPredict.restype = ctypes.POINTER(ctypes.c_float)
 xglib.XGBoosterEvalOneIter.restype = ctypes.c_char_p
 xglib.XGBoosterDumpModel.restype = ctypes.POINTER(ctypes.c_char_p)
-# sync function
-xglib.XGSyncGetRank.restype = ctypes.c_int
-xglib.XGSyncGetWorldSize.restype = ctypes.c_int
-# initialize communication module
 
 def ctypes2numpy(cptr, length, dtype):
     """convert a ctypes pointer array to numpy array """
@@ -557,17 +553,3 @@ def cv(params, dtrain, num_boost_round = 10, nfold=3, metrics=[], \
         results.append(res)
     return results
 
-# synchronization module
-def sync_init(args = sys.argv):
-    arr = (ctypes.c_char_p * len(args))()
-    arr[:] = args
-    xglib.XGSyncInit(len(args), arr)
-
-def sync_finalize():
-    xglib.XGSyncFinalize()
-
-def sync_get_rank():
-    return int(xglib.XGSyncGetRank())
-
-def sync_get_world_size():
-    return int(xglib.XGSyncGetWorldSize())
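The ```sync_*``` helpers removed above were the small synchronization surface that distributed demos such as the deleted ```mushroom-col.py``` relied on. For context, here is a minimal usage sketch; it assumes the pre-cleanup wrapper (with ```sync_init```/```sync_finalize``` still present) is importable from the path shown.

```python
# Sketch of how the removed sync_* helpers were used (pre-cleanup wrapper assumed).
import sys
sys.path.append('../../wrapper')   # path assumption, matching the demo scripts
import xgboost as xgb

xgb.sync_init(sys.argv)            # initialize the synchronization module (wraps XGSyncInit)
rank = xgb.sync_get_rank()         # this process's rank in the distributed job
size = xgb.sync_get_world_size()   # total number of processes

# each rank trains on its own column shard, exactly as in mushroom-col.py
dtrain = xgb.DMatrix('train.col%d' % rank)
param = {'max_depth': 3, 'eta': 1, 'silent': 1,
         'objective': 'binary:logistic', 'dsplit': 'col'}
model = xgb.train(param, dtrain, 3)

if rank == 0:
    print('trained on %d processes' % size)
xgb.sync_finalize()                # shut down the synchronization module (wraps XGSyncFinalize)
```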
@@ -82,23 +82,6 @@ class Booster: public learner::BoostLearner {
 using namespace xgboost::wrapper;
 
 extern "C"{
-  void XGSyncInit(int argc, char *argv[]) {
-    rabit::Init(argc, argv);
-    if (rabit::GetWorldSize() != 1) {
-      std::string pname = rabit::GetProcessorName();
-      utils::Printf("distributed job start %s:%d\n", pname.c_str(), rabit::GetRank());
-    }
-  }
-  void XGSyncFinalize(void) {
-    rabit::Finalize();
-  }
-  int XGSyncGetRank(void) {
-    int rank = rabit::GetRank();
-    return rank;
-  }
-  int XGSyncGetWorldSize(void) {
-    return rabit::GetWorldSize();
-  }
  void* XGDMatrixCreateFromFile(const char *fname, int silent) {
    return LoadDataMatrix(fname, silent != 0, false);
  }
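These removed C entry points were what the python wrapper's ```sync_*``` helpers (deleted in the hunk above) reached through ctypes. The sketch below shows that binding pattern in isolation, using only the restype declarations and calls that appear in this commit; the shared-library name and path are assumptions for illustration, and the argv handling follows the python 2 era wrapper code.

```python
# Sketch: how the removed XGSync* C entry points were driven from python via ctypes.
import ctypes
import sys

# assumed library name/path; the real wrapper loads its own shared library
xglib = ctypes.cdll.LoadLibrary('./libxgboostwrapper.so')

# restype declarations as in the (removed) wrapper code
xglib.XGSyncGetRank.restype = ctypes.c_int
xglib.XGSyncGetWorldSize.restype = ctypes.c_int

def sync_init(args=sys.argv):
    # pass argv through so rabit::Init can pick up its arguments (python 2 style strings)
    arr = (ctypes.c_char_p * len(args))()
    arr[:] = args
    xglib.XGSyncInit(len(args), arr)

sync_init()
print('rank %d of %d' % (int(xglib.XGSyncGetRank()),
                         int(xglib.XGSyncGetWorldSize())))
xglib.XGSyncFinalize()
```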