move distributed xgboost to wormhole
parent 421f5c6570
commit 65abc26797
@@ -27,7 +27,7 @@ Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washin
What's New
==========
* XGBoost now supports HDFS and S3
-* [Distributed XGBoost now runs on YARN](multi-node/hadoop)!
+* [Distributed XGBoost now runs on YARN](https://github.com/dmlc/wormhole/learn/xgboost)!
* [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes, sharing your experience on xgboost
* [Distributed XGBoost](multi-node) is now available!!
* New features in the latest changes :)
@@ -37,7 +37,6 @@ What's New
* XGBoost wins [Tradeshift Text Classification](https://kaggle2.blob.core.windows.net/forum-message-attachments/60041/1813/TradeshiftTextClassification.pdf?sv=2012-02-12&se=2015-01-02T13%3A55%3A16Z&sr=b&sp=r&sig=5MHvyjCLESLexYcvbSRFumGQXCS7MVmfdBIY3y01tMk%3D)
* XGBoost wins [HEP meets ML Award in Higgs Boson Challenge](http://atlas.ch/news/2014/machine-learning-wins-the-higgs-challenge.html)
-

Features
========
* Sparse feature format:
@@ -1,17 +1,10 @@
Distributed XGBoost
======
-This folder contains information of Distributed XGBoost (Distributed GBDT).
-
+Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole/learn/xgboost).
+See [Wormhole](https://github.com/dmlc/wormhole/learn/xgboost) for usage examples, build instructions, and job submission.
* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit)
  - Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
  - This makes xgboost portable and fault-tolerant against node failures
-* You can run Distributed XGBoost on platforms including Hadoop (see [hadoop folder](hadoop)) and MPI
-  - Rabit only relies on a platform to start the programs, so it should be easy to port xgboost to most platforms
-
-Build
-=====
-* In the root folder, type ```make```
-  - If you have a C++11 compiler, it is recommended to use ```make cxx11=1```
-

Notes
====
@@ -27,11 +20,9 @@ Notes

Solvers
=====
-There are two solvers in distributed xgboost. You can check the local demos of the two solvers, see [row-split](row-split) and [col-split](col-split)
* Column-based solver splits data by column; each node works on a subset of columns,
  it uses exactly the same algorithm as the single node version.
* Row-based solver splits data by row; each node works on a subset of rows,
  it uses an approximate histogram count algorithm, and will only examine a subset of
  potential split points as opposed to all split points.
  - This is the mode used by the current hadoop version, since data is usually stored by rows in many industry systems
-
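To make the two modes concrete, here is a hedged command-line sketch based on the demo scripts that appear later in this diff. The worker count of 4 is illustrative; the column-mode line assumes the value is ```dsplit=col``` (mirroring ```dsplit=row```) and a hypothetical ```machine-col.conf```, neither of which appears in this diff:

```bash
# row mode, as used by the row-split demo scripts: each worker trains on its own row shard
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-row.conf dsplit=row num_round=3

# column mode (hypothetical mirror): same single-node algorithm, data partitioned by column
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-col.conf dsplit=col num_round=3
```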
@@ -1,40 +0,0 @@
Distributed XGBoost: Hadoop Yarn Version
====
* The script in this folder shows an example of how to run distributed xgboost on the hadoop platform with YARN
* It relies on the [Rabit Library](https://github.com/dmlc/rabit) (Reliable Allreduce and Broadcast Interface) and YARN. Rabit provides an interface to aggregate gradient values and split statistics, which allows xgboost to run reliably on hadoop. You do not need to worry about how to update the model in each iteration; just use the script ```rabit_yarn.py```. For those who want to know exactly how it works, please refer to the main page of [Rabit](https://github.com/dmlc/rabit).
* Quick start: run ```bash run_mushroom.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>``` (see the example below)
  - This is the hadoop version of the binary classification example in the demo folder.
  - For more information on the usage of xgboost, refer to the [wiki page](https://github.com/dmlc/xgboost/wiki)

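For instance, a quick-start invocation could look like the following; the worker count, thread count, and HDFS working directory are illustrative values, not from the original script:

```bash
# 4 YARN workers, 8 threads per worker, staging data and the model under /user/$USER/xgboost-demo on HDFS
bash run_mushroom.sh 4 8 /user/$USER/xgboost-demo
```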
Before you run the script
====
* Make sure you have set up the hadoop environment.
  - Check that the variable $HADOOP_PREFIX exists (e.g. run ```echo $HADOOP_PREFIX```)
  - Compile xgboost with HDFS support

How to Use
====
* Input data format: LIBSVM format. The example here uses the generated data in the demo/data folder.
* Put the training data in HDFS (hadoop distributed file system).
* Use the rabit script ```rabit_yarn.py``` to submit the training task to YARN (see the sketch below).
* Get the final model file from HDFS, then run prediction as well as visualization of the model locally.

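A minimal sketch of these four steps, modeled on the ```run_mushroom.sh``` script that appears later in this diff; the ```WORKDIR``` HDFS path and the worker/vcore counts are illustrative assumptions:

```bash
WORKDIR=/tmp/xgboost-demo   # assumed HDFS working directory

# 1. put the LIBSVM training data into HDFS
hadoop fs -mkdir $WORKDIR/data
hadoop fs -put ../../demo/data/agaricus.txt.train $WORKDIR/data

# 2. submit the training job to YARN through the rabit tracker (2 workers, 4 vcores each)
../../subtree/rabit/tracker/rabit_yarn.py -n 2 --vcores 4 ../../xgboost mushroom.hadoop.conf nthread=4 \
    data=hdfs://$WORKDIR/data/agaricus.txt.train \
    model_out=hdfs://$WORKDIR/mushroom.final.model

# 3. fetch the final model and dump the boosters locally
hadoop fs -get $WORKDIR/mushroom.final.model final.model
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
```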
Single machine vs Hadoop version
====
If you have used xgboost (single machine version) before, this section shows how to run xgboost on hadoop with a slight modification of the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you want to access
* File cache: ```rabit_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file (see the sketch below)
  - ```rabit_yarn.py``` will automatically cache files given on the command line. For example, ```rabit_yarn.py -n 3 $localPath/xgboost mushroom.hadoop.conf``` will cache "xgboost" and "mushroom.hadoop.conf".
  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` or ```-f file1#file2``` (use "#" to split file names).
  - The local path of cached files in the command is "./".
  - Since the cached files will be packaged and delivered to hadoop slave nodes, they should not be large.
* The hadoop version also supports evaluating each training round. You just need to set the parameter "eval_train".
* More details of submission can be found in the usage of ```rabit_yarn.py```.
* The model saved by the hadoop version is compatible with the single machine version.

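To make the caching rules concrete, here is a hedged sketch of a submission that caches an extra feature-map file alongside the automatically cached binary and conf file; ```$localPath``` follows the example above, while ```$hdfsPath``` and the choice of ```featmap.txt``` are illustrative assumptions:

```bash
# xgboost and mushroom.hadoop.conf are cached automatically because they appear on the command line;
# featmap.txt is cached explicitly via -f and is visible to the workers as ./featmap.txt
../../subtree/rabit/tracker/rabit_yarn.py -n 3 -f ../../demo/data/featmap.txt \
    $localPath/xgboost mushroom.hadoop.conf \
    data=hdfs://$hdfsPath/data/agaricus.txt.train \
    model_out=hdfs://$hdfsPath/mushroom.final.model
```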
Notes
====
* The code has been tested on YARN.
* The code is optimized with multi-threading, so you will want to run one xgboost per node/worker for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.
* It is also possible to submit the job with hadoop streaming; however, YARN is highly recommended for efficiency reasons.
@@ -1,36 +0,0 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# evaluate on training data as well each round
# eval_train = 1
# The path of validation data, used to monitor training process; here [test] sets the name of the validation set
# eval[test] = "agaricus.txt.test"

# Please do not modify the following parameters
# The path of training data, with hdfs prefix
#data = hdfs:/data/
# The path of the model file
#model_out =
# split pattern of xgboost
dsplit = row
# evaluate on training data as well each round
eval_train = 1
@@ -1,28 +0,0 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
    exit -1
fi

# put the local training files into HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../../demo/data/agaricus.txt.train $3/data
hadoop fs -put ../../demo/data/agaricus.txt.test $3/data

# run rabit, passing the HDFS addresses of the data and the model
../../subtree/rabit/tracker/rabit_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model

# get the final model file
hadoop fs -get $3/mushroom.final.model final.model

# output prediction with task=pred
../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../../demo/data/agaricus.txt.test
# print the boosters of final.model into dump.raw.txt
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt
@@ -1,18 +0,0 @@
Distributed XGBoost: Row Split Version
====
* You might be interested in checking out the [Hadoop example](../hadoop)
* Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
  - machine-row-rabit.sh starts the xgboost job using rabit

How to Use
====
* First split the data by rows
* In the config, specify the data file with a wildcard %d, where %d is the rank of the node; each node will load its own part of the data
* Enable row split mode by ```dsplit=row``` (see the sketch below)

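As a concrete sketch of these steps, mirroring ```splitrows.py```, ```machine-row.conf```, and ```machine-row-rabit.sh``` shown later in this diff; the choice of 4 shards is illustrative:

```bash
# split the LIBSVM training file into 4 row shards: train-machine.row0 .. train-machine.row3
python splitrows.py ../../demo/regression/machine.txt.train train-machine 4

# machine-row.conf sets data = "train-machine.row%d", so the worker of rank i loads train-machine.row<i>;
# dsplit=row selects the row-based (approximate histogram) solver
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-row.conf dsplit=row num_round=3
```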
Notes
====
* The code is multi-threaded, so you want to run one xgboost-mpi per node
* Row-based solver splits data by row; each node works on a subset of rows. It uses an approximate histogram count algorithm,
and will only examine a subset of potential split points as opposed to all split points.
@@ -1,20 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -

# split the libsvm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k

# run xgboost with rabit (mock version, injecting simulated failures to exercise fault tolerance)
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock machine-row.conf dsplit=row num_round=3 mock=1,1,1,0 mock=0,0,3,0 mock=2,2,3,0
@@ -1,24 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -

# split the libsvm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k

# run xgboost with rabit, evaluating on the training set each round
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=3 eval_train=1

# run xgboost again: save model 0001, then continue training from the existing model
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=1
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=2 model_in=0001.model
@@ -1,30 +0,0 @@
# General Parameters, see comment for each definition
# choose the tree booster, can also change to gblinear
booster = gbtree
# this is the only difference with classification: use reg:linear to do linear regression
# when labels are in [0,1] we can also use reg:logistic
objective = reg:linear

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
use_buffer = 0

# The path of training data; %d is replaced by the rank of each worker
data = "train-machine.row%d"
# The path of validation data, used to monitor training process; here [test] sets the name of the validation set
eval[test] = "../../demo/regression/machine.txt.test"
# The path of test data
test:data = "../../demo/regression/machine.txt.test"
@@ -1,24 +0,0 @@
#!/usr/bin/python
import sys
import random

# randomly split a libsvm file into k row-wise subfiles
if len(sys.argv) < 4:
    print ('Usage: <fin> <fo> k')
    exit(0)

random.seed(10)

k = int(sys.argv[3])
fi = open(sys.argv[1], 'r')
fos = []

# open one output shard per worker: <fo>.row0 .. <fo>.row{k-1}
for i in range(k):
    fos.append(open(sys.argv[2] + '.row%d' % i, 'w'))

# assign each input line to a randomly chosen shard
for l in open(sys.argv[1]):
    i = random.randint(0, k-1)
    fos[i].write(l)

for f in fos:
    f.close()