move distributed xgboost to wormhole
parent 421f5c6570
commit 65abc26797
@@ -27,7 +27,7 @@ Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washin
What's New
==========
* XGBoost now supports HDFS and S3
-* [Distributed XGBoost now runs on YARN](multi-node/hadoop)!
+* [Distributed XGBoost now runs on YARN](https://github.com/dmlc/wormhole/learn/xgboost)!
* [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes, sharing your experience on xgboost
* [Distributed XGBoost](multi-node) is now available!!
* New features in the latest changes :)
@@ -37,7 +37,6 @@ What's New
* XGBoost wins [Tradeshift Text Classification](https://kaggle2.blob.core.windows.net/forum-message-attachments/60041/1813/TradeshiftTextClassification.pdf?sv=2012-02-12&se=2015-01-02T13%3A55%3A16Z&sr=b&sp=r&sig=5MHvyjCLESLexYcvbSRFumGQXCS7MVmfdBIY3y01tMk%3D)
* XGBoost wins [HEP meets ML Award in Higgs Boson Challenge](http://atlas.ch/news/2014/machine-learning-wins-the-higgs-challenge.html)
-

Features
========
* Sparse feature format:
@@ -1,17 +1,10 @@
Distributed XGBoost
======
-This folder contains information of Distributed XGBoost (Distributed GBDT).
-
+Distributed XGBoost is now part of [Wormhole](https://github.com/dmlc/wormhole/learn/xgboost).
+See [Wormhole](https://github.com/dmlc/wormhole/learn/xgboost) for usage examples, build instructions, and job submission.
* The distributed version is built on Rabit: [Reliable Allreduce and Broadcast Library](https://github.com/dmlc/rabit)
  - Rabit is a portable library that provides fault-tolerance for Allreduce calls for distributed machine learning
  - This makes xgboost portable and fault-tolerant against node failures
-* You can run Distributed XGBoost on platforms including Hadoop (see [hadoop folder](hadoop)) and MPI
-  - Rabit only relies on a platform to start the programs, so it should be easy to port xgboost to most platforms
-
-Build
-=====
-* In the root folder, type ```make```
-  - If you have a C++11 compiler, it is recommended to use ```make cxx11=1```
-

Notes
====
@@ -27,11 +20,9 @@ Notes

Solvers
=====
-There are two solvers in distributed xgboost. You can check the local demos of the two solvers, see [row-split](row-split) and [col-split](col-split)
* Column-based solver splits data by column; each node works on a subset of columns,
  it uses exactly the same algorithm as the single node version.
* Row-based solver splits data by row; each node works on a subset of rows,
  it uses an approximate histogram count algorithm, and will only examine a subset of
  potential split points as opposed to all split points.
  - This is the mode used by the current hadoop version, since data is usually stored by rows in many industry systems
-
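To make the two modes concrete, here is a hedged command-line sketch based on the demo scripts that appear later in this diff. The worker count of 4 is illustrative; the column-mode line assumes the value is ```dsplit=col``` (mirroring ```dsplit=row```) and a hypothetical ```machine-col.conf```, neither of which appears in this diff:

```bash
# row mode, as used by the row-split demo scripts: each worker trains on its own row shard
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-row.conf dsplit=row num_round=3

# column mode (hypothetical mirror): same single-node algorithm, data partitioned by column
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-col.conf dsplit=col num_round=3
```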
@@ -1,40 +0,0 @@
Distributed XGBoost: Hadoop Yarn Version
====
* The script in this folder shows an example of how to run distributed xgboost on the hadoop platform with YARN
* It relies on the [Rabit Library](https://github.com/dmlc/rabit) (Reliable Allreduce and Broadcast Interface) and YARN. Rabit provides an interface to aggregate gradient values and split statistics, which allows xgboost to run reliably on hadoop. You do not need to worry about how to update the model in each iteration; just use the script ```rabit_yarn.py```. For those who want to know exactly how it works, please refer to the main page of [Rabit](https://github.com/dmlc/rabit).
* Quick start: run ```bash run_mushroom.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>``` (see the example below)
  - This is the hadoop version of the binary classification example in the demo folder.
  - For more information on the usage of xgboost, refer to the [wiki page](https://github.com/dmlc/xgboost/wiki)

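For instance, a quick-start invocation could look like the following; the worker count, thread count, and HDFS working directory are illustrative values, not from the original script:

```bash
# 4 YARN workers, 8 threads per worker, staging data and the model under /user/$USER/xgboost-demo on HDFS
bash run_mushroom.sh 4 8 /user/$USER/xgboost-demo
```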
Before you run the script
====
* Make sure you have set up the hadoop environment.
  - Check that the variable $HADOOP_PREFIX exists (e.g. run ```echo $HADOOP_PREFIX```)
  - Compile xgboost with HDFS support

How to Use
====
* Input data format: LIBSVM format. The example here uses the generated data in the demo/data folder.
* Put the training data in HDFS (hadoop distributed file system).
* Use the rabit script ```rabit_yarn.py``` to submit the training task to YARN (see the sketch below).
* Get the final model file from HDFS, then run prediction as well as visualization of the model locally.

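A minimal sketch of these four steps, modeled on the ```run_mushroom.sh``` script that appears later in this diff; the ```WORKDIR``` HDFS path and the worker/vcore counts are illustrative assumptions:

```bash
WORKDIR=/tmp/xgboost-demo   # assumed HDFS working directory

# 1. put the LIBSVM training data into HDFS
hadoop fs -mkdir $WORKDIR/data
hadoop fs -put ../../demo/data/agaricus.txt.train $WORKDIR/data

# 2. submit the training job to YARN through the rabit tracker (2 workers, 4 vcores each)
../../subtree/rabit/tracker/rabit_yarn.py -n 2 --vcores 4 ../../xgboost mushroom.hadoop.conf nthread=4 \
    data=hdfs://$WORKDIR/data/agaricus.txt.train \
    model_out=hdfs://$WORKDIR/mushroom.final.model

# 3. fetch the final model and dump the boosters locally
hadoop fs -get $WORKDIR/mushroom.final.model final.model
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
```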
Single machine vs Hadoop version
====
If you have used xgboost (single machine version) before, this section shows how to run xgboost on hadoop with a slight modification of the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you want to access
* File cache: ```rabit_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file (see the sketch below)
  - ```rabit_yarn.py``` will automatically cache files given on the command line. For example, ```rabit_yarn.py -n 3 $localPath/xgboost mushroom.hadoop.conf``` will cache "xgboost" and "mushroom.hadoop.conf".
  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` or ```-f file1#file2``` (use "#" to split file names).
  - The local path of cached files in the command is "./".
  - Since the cached files will be packaged and delivered to hadoop slave nodes, they should not be large.
* The hadoop version also supports evaluating each training round. You just need to set the parameter "eval_train".
* More details of submission can be found in the usage of ```rabit_yarn.py```.
* The model saved by the hadoop version is compatible with the single machine version.

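To make the caching rules concrete, here is a hedged sketch of a submission that caches an extra feature-map file alongside the automatically cached binary and conf file; ```$localPath``` follows the example above, while ```$hdfsPath``` and the choice of ```featmap.txt``` are illustrative assumptions:

```bash
# xgboost and mushroom.hadoop.conf are cached automatically because they appear on the command line;
# featmap.txt is cached explicitly via -f and is visible to the workers as ./featmap.txt
../../subtree/rabit/tracker/rabit_yarn.py -n 3 -f ../../demo/data/featmap.txt \
    $localPath/xgboost mushroom.hadoop.conf \
    data=hdfs://$hdfsPath/data/agaricus.txt.train \
    model_out=hdfs://$hdfsPath/mushroom.final.model
```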
Notes
====
* The code has been tested on YARN.
* The code is optimized with multi-threading, so you will want to run one xgboost per node/worker for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.
* It is also possible to submit the job with hadoop streaming; however, YARN is highly recommended for efficiency reasons.
@@ -1,36 +0,0 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# evaluate on training data as well each round
# eval_train = 1
# The path of validation data, used to monitor training process; here [test] sets the name of the validation set
# eval[test] = "agaricus.txt.test"

# Please do not modify the following parameters
# The path of training data, with hdfs prefix
#data = hdfs:/data/
# The path of the model file
#model_out =
# split pattern of xgboost
dsplit = row
# evaluate on training data as well each round
eval_train = 1
@@ -1,28 +0,0 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
    exit -1
fi

# put the local training files into HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../../demo/data/agaricus.txt.train $3/data
hadoop fs -put ../../demo/data/agaricus.txt.test $3/data

# run rabit, passing the HDFS addresses of the data and the model
../../subtree/rabit/tracker/rabit_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model

# get the final model file
hadoop fs -get $3/mushroom.final.model final.model

# output prediction with task=pred
../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../../demo/data/agaricus.txt.test
# print the boosters of final.model into dump.raw.txt
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt
@@ -1,18 +0,0 @@
Distributed XGBoost: Row Split Version
====
* You might be interested in checking out the [Hadoop example](../hadoop)
* Machine Rabit: run ```bash machine-row-rabit.sh <n-mpi-process>```
  - machine-row-rabit.sh starts the xgboost job using rabit

How to Use
====
* First split the data by rows
* In the config, specify the data file with a wildcard %d, where %d is the rank of the node; each node will load its own part of the data
* Enable row split mode by ```dsplit=row``` (see the sketch below)

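As a concrete sketch of these steps, mirroring ```splitrows.py```, ```machine-row.conf```, and ```machine-row-rabit.sh``` shown later in this diff; the choice of 4 shards is illustrative:

```bash
# split the LIBSVM training file into 4 row shards: train-machine.row0 .. train-machine.row3
python splitrows.py ../../demo/regression/machine.txt.train train-machine 4

# machine-row.conf sets data = "train-machine.row%d", so the worker of rank i loads train-machine.row<i>;
# dsplit=row selects the row-based (approximate histogram) solver
../../subtree/rabit/tracker/rabit_demo.py -n 4 ../../xgboost machine-row.conf dsplit=row num_round=3
```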
Notes
====
* The code is multi-threaded, so you want to run one xgboost-mpi per node
* Row-based solver splits data by row; each node works on a subset of rows. It uses an approximate histogram count algorithm,
and will only examine a subset of potential split points as opposed to all split points.
@@ -1,20 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -

# split the libsvm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k

# run xgboost with rabit (mock version, injecting simulated failures to exercise fault tolerance)
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost.mock machine-row.conf dsplit=row num_round=3 mock=1,1,1,0 mock=0,0,3,0 mock=2,2,3,0
@@ -1,24 +0,0 @@
#!/bin/bash
if [[ $# -ne 1 ]]
then
    echo "Usage: nprocess"
    exit -1
fi

rm -rf train-machine.row* *.model
k=$1
# make machine data
cd ../../demo/regression/
python mapfeat.py
python mknfold.py machine.txt 1
cd -

# split the libsvm file into k subfiles
python splitrows.py ../../demo/regression/machine.txt.train train-machine $k

# run xgboost with rabit, evaluating on the training set each round
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=3 eval_train=1

# run xgboost again: save model 0001, then continue training from the existing model
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=1
../../subtree/rabit/tracker/rabit_demo.py -n $k ../../xgboost machine-row.conf dsplit=row num_round=2 model_in=0001.model
@@ -1,30 +0,0 @@
# General Parameters, see comment for each definition
# choose the tree booster, can also change to gblinear
booster = gbtree
# this is the only difference with classification: use reg:linear to do linear regression
# when labels are in [0,1] we can also use reg:logistic
objective = reg:linear

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
use_buffer = 0

# The path of training data; %d is replaced by the rank of each worker
data = "train-machine.row%d"
# The path of validation data, used to monitor training process; here [test] sets the name of the validation set
eval[test] = "../../demo/regression/machine.txt.test"
# The path of test data
test:data = "../../demo/regression/machine.txt.test"
@@ -1,24 +0,0 @@
#!/usr/bin/python
import sys
import random

# randomly split a libsvm file into k row-wise subfiles
if len(sys.argv) < 4:
    print ('Usage: <fin> <fo> k')
    exit(0)

random.seed(10)

k = int(sys.argv[3])
fi = open(sys.argv[1], 'r')
fos = []

# open one output shard per worker: <fo>.row0 .. <fo>.row{k-1}
for i in range(k):
    fos.append(open(sys.argv[2] + '.row%d' % i, 'w'))

# assign each input line to a randomly chosen shard
for l in open(sys.argv[1]):
    i = random.randint(0, k-1)
    fos[i].write(l)

for f in fos:
    f.close()