make wrapper ok
multi-node/README.md
@@ -4,17 +4,21 @@ This folder contains information about the experimental version of distributed xgboost
Build
=====
* In the root folder, run ```make mpi```; this will give you xgboost-mpi
  - You will need MPI to build xgboost-mpi
* Alternatively, you can run ```make```; this will give you xgboost, which uses a beta built-in allreduce
  - You do not need MPI to build this; you can modify [submit_job_tcp.py](submit_job_tcp.py) to use any job scheduler you like to submit the job

Design Choice
=====
* Must distributed xgboost rely on an MPI library?
  - No, xgboost relies on an MPI-style protocol that provides Broadcast and AllReduce
  - The dependency is isolated in the [sync module](../src/sync/sync.h)
  - All other parts of the code use the interface defined in sync.h
  - [sync_mpi.cpp](../src/sync/sync_mpi.cpp) is an implementation of the sync interface using a standard MPI library; to use xgboost-mpi, you need an MPI library
  - If a platform/framework implements these protocols, xgboost naturally extends to that platform
  - As an example, [sync_tcp.cpp](../src/sync/sync_tcp.cpp) is an implementation of the interface using TCP, and is linked with xgboost by default; a minimal sketch of this contract follows below
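
Since the whole sync dependency boils down to these two collective operations, a small Python sketch of the contract may help. This is an illustration only: the names below are hypothetical, and the real interface is the C++ one declared in [sync.h](../src/sync/sync.h).

```python
# Illustrative sketch (not the actual API) of the contract behind sync.h:
# any backend that provides these two collectives can drive distributed xgboost.
class SyncBackend:
    def broadcast(self, payload, root):
        """Deliver `payload` from node `root` to all nodes; every node returns it."""
        raise NotImplementedError

    def allreduce(self, values, op):
        """Reduce `values` element-wise across all nodes with `op` (e.g. sum, max);
        every node receives the same combined result."""
        raise NotImplementedError

class SingleNodeBackend(SyncBackend):
    """Degenerate one-node backend: with a single node both collectives are identities."""
    def broadcast(self, payload, root):
        return payload

    def allreduce(self, values, op):
        return list(values)
```

In these terms, sync_mpi.cpp and sync_tcp.cpp are two such backends: one delegates to an MPI library, the other implements the collectives over raw TCP.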

* How is the data distributed?
  - There are two solvers in distributed xgboost
  - The column-based solver splits the data by column; each node works on a subset of columns,
@@ -26,10 +30,11 @@ Design Choice

Usage
====
* You will need a network filesystem, or copy the data to the local file system, before running the code
* xgboost-mpi runs in an MPI environment
* xgboost can be used together with [submit_job_tcp.py](submit_job_tcp.py) on other types of job schedulers
* ***Note*** The distributed version is still multi-threading optimized;
  you should run one process per node that takes most of the available CPU,
  as this will reduce the communication overhead and improve performance.
  - One way to do that is to limit the MPI slots on each machine to 1, or to reserve nthread processors for each process.
* Examples:

multi-node/col-split/README.md
@@ -1,6 +1,11 @@
Distributed XGBoost: Column Split Version
====
* run ```bash mushroom-col.sh <n-mpi-process>```
* run ```bash mushroom-col-tcp.sh <n-process>```
  - mushroom-col-tcp.sh starts the xgboost job using xgboost's built-in allreduce
* run ```bash mushroom-col-python.sh <n-process>```
  - mushroom-col-python.sh starts the xgboost python job using xgboost's built-in allreduce
  - see mushroom-col.py

How to Use
====

22 multi-node/col-split/mushroom-col-python.sh Executable file
@@ -0,0 +1,22 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: <nprocess>"
    exit 1
fi

#
# This script is the same as mushroom-col.sh, except that it uses the xgboost python module
#
# xgboost uses the built-in TCP-based allreduce module, so it can run in more environments,
# as long as we know how to start the job by modifying ../submit_job_tcp.py
#
rm -rf train.col* *.model
k=$1

# split the libsvm file into k column-wise subfiles (an illustrative splitter sketch follows this file)
python splitsvm.py ../../demo/data/agaricus.txt.train train $k

# submit the distributed xgboost job through the TCP-based job submitter
../submit_job_tcp.py $k python mushroom-col.py

cat dump.nice.$k.txt
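
The split step above relies on splitsvm.py, whose source is not part of this diff. For illustration only, here is a minimal sketch of how such a column-wise libsvm splitter could work; the real splitsvm.py may assign features to parts differently:

```python
# splitsvm_sketch.py -- illustrative stand-in for splitsvm.py (not the actual
# script): split a libsvm file into k column-wise parts named <dst>.col0..col{k-1}.
# Feature index f goes to part f % k; the label is replicated into every part.
import sys

def split_columns(src, dst_prefix, k):
    outs = [open('%s.col%d' % (dst_prefix, i), 'w') for i in range(k)]
    with open(src) as fin:
        for line in fin:
            fields = line.split()
            label, feats = fields[0], fields[1:]
            parts = [[] for _ in range(k)]
            for kv in feats:
                findex = int(kv.split(':', 1)[0])
                parts[findex % k].append(kv)
            for i, out in enumerate(outs):
                out.write(' '.join([label] + parts[i]) + '\n')
    for out in outs:
        out.close()

if __name__ == '__main__':
    split_columns(sys.argv[1], sys.argv[2], int(sys.argv[3]))
```

This matches the file naming the demo expects: mushroom-col.py on rank r reads train.col{r}.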

29 multi-node/col-split/mushroom-col.py Normal file
@@ -0,0 +1,29 @@
import os
import sys
# the python wrapper lives in <repo-root>/wrapper, two directory levels above this
# script (cf. the ../../demo paths below); `or '.'` keeps the path valid when the
# script is launched from its own directory, as ../submit_job_tcp.py does
sys.path.append(os.path.join(os.path.dirname(__file__) or '.', '..', '..', 'wrapper'))
import xgboost as xgb
# this is an example script of running distributed xgboost using python

# call this additional function to initialize the xgboost sync module
# in distributed mode
xgb.sync_init(sys.argv)
rank = xgb.sync_get_rank()
# read in the dataset: each node loads the column subset produced by splitsvm.py
dtrain = xgb.DMatrix('train.col%d' % rank)
param = {'max_depth': 3, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
# tell the learner that the data is split by column
param['dsplit'] = 'col'
nround = 3

if rank == 0:
    # the master node also evaluates on the train/test sets during training
    dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
    model = xgb.train(param, dtrain, nround, [(dtrain, 'train'), (dtest, 'test')])
else:
    # if it is a slave node, do not run evaluation
    model = xgb.train(param, dtrain, nround)

if rank == 0:
    model.save_model('%04d.model' % nround)
    # dump the model with the feature map
    model.dump_model('dump.nice.%d.txt' % xgb.sync_get_world_size(), '../../demo/data/featmap.txt')
# shutdown the synchronization module
xgb.sync_finalize()
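
As a small follow-up (not part of this commit), the model saved by rank 0 can be loaded back with the ordinary single-machine python wrapper for prediction. A sketch, assuming the standard wrapper API and the 0003.model name produced by nround = 3:

```python
# predict_sketch.py -- illustrative follow-up (not part of this commit):
# load the model saved by the master node and run single-machine prediction.
import os
import sys
sys.path.append(os.path.join(os.path.dirname(__file__) or '.', '..', '..', 'wrapper'))
import xgboost as xgb

bst = xgb.Booster(model_file='0003.model')  # '%04d.model' % 3 from mushroom-col.py
dtest = xgb.DMatrix('../../demo/data/agaricus.txt.test')
preds = bst.predict(dtest)
print('first predictions: %s' % str(preds[:5]))
```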

multi-node/submit_job_tcp.py
@@ -11,6 +11,10 @@ import subprocess
sys.path.append(os.path.dirname(__file__) + '/../src/sync/')
import tcp_master as master

#
# Note: this submit script is only used for example purposes
# It does not have to be mpirun; it can be any job submission program that starts the jobs: qsub, hadoop streaming, etc.
#
def mpi_submit(nslave, args):
    """
    Customized submit function: submits nslave jobs, each of which must take args as its parameters
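
The diff cuts off here. To make the "does not have to be mpirun" point concrete, here is a hypothetical alternative submit function (not part of this commit) that simply starts nslave worker processes on the local machine:

```python
# Hypothetical alternative to mpi_submit (illustration only): launch the
# nslave workers as plain local processes instead of going through mpirun.
import subprocess

def local_submit(nslave, args):
    """Start nslave copies of the job locally and wait for all of them."""
    procs = [subprocess.Popen(args) for _ in range(nslave)]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError('a worker process exited with a nonzero status')
```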