change allreduce lib to rabit library, xgboost now runs with rabit

This commit is contained in:
tqchen
2014-12-20 00:17:09 -08:00
parent 5ae99372d6
commit 8e16cc4617
28 changed files with 105 additions and 1206 deletions

View File

@@ -4,20 +4,16 @@ This folder contains information about experimental version of distributed xgboo
Build
=====
* In the root folder, run ```make mpi```, this will give you xgboost-mpi
* In the root folder, run ```make```, this will give you xgboost, which uses rabit allreduce
  - this version of xgboost will eventually be fault tolerant
* Alternatively, run ```make mpi```, this will give you xgboost-mpi
  - You will need MPI to build xgboost-mpi
* Alternatively, you can run ```make```, this will give you xgboost, which uses a beta built-in allreduce
  - You do not need MPI to build this; you can modify [submit_job_tcp.py](submit_job_tcp.py) to use any job scheduler you like to submit the job
Design Choice
=====
* Must distributed xgboost rely on an MPI library?
  - No, XGBoost relies only on an MPI-like protocol that provides Broadcast and AllReduce
  - The dependency is isolated in the [sync module](../src/sync/sync.h)
  - All other parts of the code use the interface defined in sync.h
  - [sync_mpi.cpp](../src/sync/sync_mpi.cpp) is an implementation of the sync interface using a standard MPI library; to use xgboost-mpi, you need an MPI library
  - If a platform/framework implements this protocol, xgboost naturally extends to that platform
  - As an example, [sync_tcp.cpp](../src/sync/sync_tcp.cpp) is an implementation of the interface using TCP, and is linked with xgboost by default
* XGBoost relies on the [Rabit Library](https://github.com/tqchen/rabit); see the sketch at the end of this section
* Rabit is a fault-tolerant and portable allreduce library that provides Allreduce and Broadcast
* Since rabit is compatible with MPI, xgboost can also be compiled with the MPI backend
* How is the data distributed?
  - There are two solvers in distributed xgboost
@@ -27,12 +23,10 @@ Design Choice
it uses an approximate histogram counting algorithm, and will only examine a subset of
the potential split points, as opposed to all split points.
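To make this concrete, below is a minimal sketch of the Allreduce/Broadcast interface that rabit exposes, based on rabit's public C++ API; the reduced values are fabricated for illustration and this is not code from the xgboost source.

```cpp
// illustrative sketch of the rabit collectives xgboost builds on
#include <rabit.h>
#include <string>

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);  // attach this process to the tracker (or MPI backend)
  // Allreduce: every process contributes one value; all end up with the global max
  float value = static_cast<float>(rabit::GetRank());
  rabit::Allreduce<rabit::op::Max>(&value, 1);
  // Broadcast: rank 0's payload reaches every other process
  std::string msg;
  if (rabit::GetRank() == 0) msg = "hello from rank 0";
  rabit::Broadcast(&msg, 0);
  rabit::Finalize();
  return 0;
}
```

Because only these collective calls cross process boundaries, the same binary can be launched by rabit's tracker scripts or by mpirun.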
Usage
====
* You will need a network filesystem, or to copy the data to the local file system, before running the code
* xgboost-mpi runs in an MPI environment
* xgboost can be used together with [submit_job_tcp.py](submit_job_tcp.py) on other types of job schedulers
* xgboost can be used together with the submission scripts provided in Rabit on various types of job schedulers
* ***Note*** The distributed version is still optimized for multi-threading.
  You should run one process per node, using most of the available CPUs;
  this will reduce the communication overhead and improve performance.

View File

@@ -1,12 +1,9 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col.sh <n-mpi-process>```
* run ```bash mushroom-col-rabit.sh <n-process>```
  - mushroom-col-rabit.sh starts the xgboost job using rabit's allreduce
* run ```bash mushroom-col-mpi.sh <n-mpi-process>```
  - mushroom-col-mpi.sh starts the xgboost-mpi job
* run ```bash mushroom-col-tcp.sh <n-process>```
  - mushroom-col-tcp.sh starts the xgboost job using xgboost's built-in allreduce
* run ```bash mushroom-col-python.sh <n-process>```
  - mushroom-col-python.sh starts the xgboost python job using xgboost's built-in allreduce
  - see mushroom-col.py
How to Use
====
@@ -16,7 +13,7 @@ How to Use
Notes
====
* The code is multi-threaded, so you want to run one xgboost-mpi per node
* The code is multi-threaded, so you want to run one process per node
* The code will work correctly as long as the union of the column subsets covers all the columns we are interested in.
  - The column subsets can overlap with each other.
* It uses exactly the same algorithm as the single-node version, examining all potential split points; the sketch below illustrates why.
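To see why the column-split result matches the single-node result, note that the global split selection reduces to a max-Allreduce over candidate gains: each process scans only its own columns, and the best candidate wins everywhere. Below is a hedged sketch; the gain values are fabricated and this is not xgboost's actual internal representation.

```cpp
// illustrative sketch: agreeing on the globally best split under column split
#include <rabit.h>

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);
  // pretend each process found this candidate gain over its column subset
  float best_gain = 1.0f + 0.5f * static_cast<float>(rabit::GetRank());
  // after the max-Allreduce every process agrees on the global best,
  // so all of them make the same decision the single-node version would
  rabit::Allreduce<rabit::op::Max>(&best_gain, 1);
  rabit::TrackerPrintf("rank %d sees global best gain %f\n",
                       rabit::GetRank(), best_gain);
  rabit::Finalize();
  return 0;
}
```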

View File

@@ -17,6 +17,6 @@ k=$1
python splitsvm.py ../../demo/data/agaricus.txt.train train $k
# run xgboost mpi
../submit_job_tcp.py $k python mushroom-col.py
../../rabit/tracker/rabit_mpi.py $k local python mushroom-col.py
cat dump.nice.$k.txt

View File

@@ -16,13 +16,13 @@ k=$1
python splitsvm.py ../../demo/data/agaricus.txt.train train $k
# run xgboost mpi
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col
# the model can be directly loaded by the single-machine xgboost solver, as usual
../../xgboost mushroom-col.conf task=dump model_in=0002.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.$k.txt
# run for one round, and continue training
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col num_round=1
../submit_job_tcp.py $k ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col num_round=1
../../rabit/tracker/rabit_mpi.py $k local ../../xgboost mushroom-col.conf dsplit=col model_in=0001.model
cat dump.nice.$k.txt

View File

@@ -1,6 +1,10 @@
import os
import sys
sys.path.append(os.path.dirname(__file__)+'/../wrapper')
path = os.path.dirname(__file__)
if path == '':
    path = '.'
sys.path.append(path+'/../../wrapper')
import xgboost as xgb
# this is an example script of running distributed xgboost using python

View File

@@ -1,10 +1,10 @@
Distributed XGBoost: Row Split Version
====
* Mushroom: run ```bash mushroom-row.sh <n-mpi-process>```
* Machine: run ```bash machine-row.sh <n-mpi-process>```
* Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
  - machine-row-rabit.sh starts the xgboost job using rabit
* Mushroom: run ```bash mushroom-row-mpi.sh <n-mpi-process>```
* Machine: run ```bash machine-row-mpi.sh <n-mpi-process>```
  - The machine case also includes an example of continuing training from an existing model
* Machine TCP: run ```bash machine-row-tcp.sh <n-mpi-process>```
  - machine-row-tcp.sh starts the xgboost job using xgboost's built-in allreduce (see the sketch below)
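In contrast to the column-split demos, the row-split solver aggregates statistics rather than split candidates: each process builds gradient histograms over its own rows, and a sum-Allreduce combines them into the global histogram. Below is a hedged sketch; the histogram size and bin values are fabricated for illustration.

```cpp
// illustrative sketch: combining per-process gradient histograms under row split
#include <rabit.h>
#include <vector>

int main(int argc, char *argv[]) {
  rabit::Init(argc, argv);
  // each process fills a local histogram from its own rows (values fabricated)
  std::vector<double> hist(16, static_cast<double>(rabit::GetRank() + 1));
  // sum-Allreduce: afterwards every process holds the identical global histogram
  rabit::Allreduce<rabit::op::Sum>(&hist[0], hist.size());
  rabit::Finalize();
  return 0;
}
```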
How to Use
====

View File

@@ -1,24 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
echo "Usage: nprocess"
exit -1
fi
rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -
# split the lib svm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k
# run xgboost mpi
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=3
# run xgboost, save model 0001, then continue training from the existing model
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=1
../submit_job_tcp.py $k ../../xgboost machine-row.conf dsplit=row num_round=2 model_in=0001.model

View File

@@ -1,36 +0,0 @@
#!/usr/bin/python
"""
This is an example script to create a customized job submit
script using xgboost sync_tcp mode
"""
import sys
import os
import subprocess
# import the tcp_master.py
# add path to sync
sys.path.append(os.path.dirname(__file__)+'/../src/sync/')
import tcp_master as master
#
# Note: this submit script is only used for example purposes.
# It does not have to be mpirun; it can be any job submission script that starts the job: qsub, hadoop streaming, etc.
#
def mpi_submit(nslave, args):
"""
customized submit script, that submit nslave jobs, each must contain args as parameter
note this can be a lambda function containing additional parameters in input
Parameters
nslave number of slave process to start up
args arguments to launch each job
this usually includes the parameters of master_uri and parameters passed into submit
"""
cmd = ' '.join(['mpirun -n %d' % nslave] + args)
print cmd
subprocess.check_call(cmd, shell = True)
if __name__ == '__main__':
if len(sys.argv) < 2:
print 'Usage: <nslave> <cmd>'
exit(0)
# call submit, with nslave, the commands to run each job and submit function
master.submit(int(sys.argv[1]), sys.argv[2:], fun_submit= mpi_submit)