From 525c1594e5f7130d0baa34cce7b8347cca19f8f4 Mon Sep 17 00:00:00 2001
From: Boliang Chen
Date: Sun, 11 Jan 2015 16:06:19 +0800
Subject: [PATCH 1/4] revise the script

---
 multi-node/hadoop/run_hadoop_mushroom.sh | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/multi-node/hadoop/run_hadoop_mushroom.sh b/multi-node/hadoop/run_hadoop_mushroom.sh
index 2f095ff25..1e7c9a1d0 100755
--- a/multi-node/hadoop/run_hadoop_mushroom.sh
+++ b/multi-node/hadoop/run_hadoop_mushroom.sh
@@ -11,19 +11,15 @@
 hadoop fs -mkdir $2/data
 hadoop fs -put ../../demo/data/agaricus.txt.train $2/data
 
 # training and output the final model file
-../../rabit/tracker/rabit_hadoop.py -n $1 -i $2/data/agaricus.txt.train \
-    -o $2/model -f ../../demo/data/agaricus.txt.test \
-    ../../xgboost mushroom.hadoop.conf dsplit=row
+../../rabit/tracker/rabit_hadoop.py -n $1 -i $2/data/agaricus.txt.train -o $2/mushroom.final.model ../../xgboost mushroom.hadoop.conf
 
 # get the final model file
-hadoop fs -get $2/model/part-00000 ./final.model
+hadoop fs -get $2/mushroom.final.model/part-00000 ./mushroom.final.model
-# output prediction task=pred
-../../xgboost mushroom.hadoop.conf task=pred model_in=final.model \
-    test:data=../../demo/data/agaricus.txt.test
+# output prediction on the test data (task=pred)
+../../xgboost mushroom.hadoop.conf task=pred model_in=mushroom.final.model test:data=../../demo/data/agaricus.txt.test
 # print the boosters of final.model in dump.raw.txt
-../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
+../../xgboost mushroom.hadoop.conf task=dump model_in=mushroom.final.model name_dump=dump.raw.txt
 # use the feature map in printing for better visualization
-../../xgboost mushroom.hadoop.conf task=dump model_in=final.model \
-    fmap=../../demo/data/featmap.txt name_dump=dump.nice.txt
+../../xgboost mushroom.hadoop.conf task=dump model_in=mushroom.final.model fmap=../../demo/data/featmap.txt name_dump=dump.nice.txt
 cat dump.nice.txt

From ef2518364c62a15362628d43aee4566f06267336 Mon Sep 17 00:00:00 2001
From: Boliang Chen
Date: Sun, 11 Jan 2015 16:07:00 +0800
Subject: [PATCH 2/4] change to minimal setting

---
 multi-node/hadoop/mushroom.hadoop.conf | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/multi-node/hadoop/mushroom.hadoop.conf b/multi-node/hadoop/mushroom.hadoop.conf
index 305b82dd3..15e05f2da 100644
--- a/multi-node/hadoop/mushroom.hadoop.conf
+++ b/multi-node/hadoop/mushroom.hadoop.conf
@@ -19,12 +19,16 @@ max_depth = 3
 num_round = 2
 # 0 means do not save any model except the final round model
 save_period = 0
+# evaluate on training data as well each round
+# eval_train = 1
+# The path of validation data, used to monitor the training process; here [test] sets the name of the validation set
+# eval[test] = "agaricus.txt.test"
+
+# Please do not modify the following parameters
 # The path of training data
 data = stdin
 # The path of model file
 model_out = stdout
+# data split mode of xgboost
+dsplit = row
 
-# The path of validation data, used to monitor training process, here [test] sets name of the validation set
-eval[test] = "agaricus.txt.test"
-# evaluate on training data as well each round
-eval_train = 1
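For reference, after these two patches the revised script takes the number of hadoop workers as its first argument and an HDFS working directory as its second. A minimal sketch of a typical invocation follows; the HDFS path /user/yourname/xgboost-demo is a hypothetical placeholder, not part of the patches:

```
# invoke the revised script: $1 = number of workers, $2 = HDFS working directory
# (assumes hadoop is configured and xgboost was built; the HDFS path is hypothetical)
cd multi-node/hadoop
bash run_hadoop_mushroom.sh 3 /user/yourname/xgboost-demo

# afterwards the script leaves these files locally:
#   mushroom.final.model  - trained model fetched from HDFS
#   dump.raw.txt          - raw dump of the boosters
#   dump.nice.txt         - dump using the feature map, printed via cat
```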
From fdbca6013d7ae2d8417a2c95f149f4213dfcda07 Mon Sep 17 00:00:00 2001
From: Boliang Chen
Date: Sun, 11 Jan 2015 17:57:41 +0800
Subject: [PATCH 3/4] modify

---
 multi-node/hadoop/README.md | 32 +++++++++++++++++++++++++++++---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/multi-node/hadoop/README.md b/multi-node/hadoop/README.md
index adfacdb8b..aecee38e0 100644
--- a/multi-node/hadoop/README.md
+++ b/multi-node/hadoop/README.md
@@ -1,15 +1,41 @@
 Distributed XGBoost: Hadoop Version
 ====
-* Hadoop version: run ```bash run_binary_classification.sh ```
+* The script in this folder shows an example of how to run distributed xgboost on the hadoop platform.
+* It relies on the [Rabit Library](https://github.com/tqchen/rabit) and Hadoop Streaming.
+* Quick start: run ```bash run_binary_classification.sh```
   - This is the hadoop version of binary classification example in the demo folder.
+  - More information on the binary classification task can be found at https://github.com/tqchen/xgboost/wiki/Binary-Classification.
+
+Before you run the script
+====
+* Make sure you have set up the hadoop environment; otherwise, run the single machine examples in the demo folder instead.
+* Build: run ```bash build.sh``` in the root folder; it will automatically download rabit and build xgboost.
+* Check whether the environment variable $HADOOP_HOME exists (e.g. run ```echo $HADOOP_HOME```). If not, please set the hadoop-streaming.jar path in rabit_hadoop.py.
 
 How to Use
 ====
-* Check whether environment variable $HADOOP_HOME exists (e.g. run ```echo $HADOOP_HOME```). If not, plz set up hadoop-streaming.jar path in rabit_hadoop.py.
+* Input data format: LIBSVM. The example here uses the generated data in the demo/data folder.
+* Put the training data on HDFS (the hadoop distributed file system).
+* Use rabit's ```rabit_hadoop.py``` to submit the training task to hadoop and save the final model file.
+* Get the final model file from HDFS, then do prediction and model visualization locally.
+
+XGBoost: Single machine version VS Hadoop version
+====
+If you have used xgboost (single machine version) before, this section shows how to run xgboost on hadoop with slight modifications to the conf file.
+* IO: instead of reading and writing files locally, the hadoop version uses "stdin" to read the training file and "stdout" to store the final model file. Therefore, you should change the parameters "data" and "model_out" in the conf file to ```data = stdin; model_out = stdout```.
+* File cache: ```rabit_hadoop.py``` also provides several ways to cache necessary files, including the binary file (xgboost), the conf file, small datasets used for evaluation during training, and so on.
+  - Any file used in the config file, excluding stdin, should be cached in the script. ```rabit_hadoop.py``` will automatically cache files that appear in the command line. For example, ```rabit_hadoop.py -n 3 -i $hdfsPath/agaricus.txt.train -o $hdfsPath/mushroom.final.model $localPath/xgboost mushroom.hadoop.conf``` will cache "xgboost" and "mushroom.hadoop.conf".
+  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` or ```-f file1#file2```.
+* Test locally
+*
+*
+
+Usage of rabit_hadoop.py
+====
 
 Notes
 ====
 * The code has been tested on MapReduce 1 (MRv1), it should be ok to run on MapReduce 2 (MRv2, YARN).
 * The code is multi-threaded, so you want to run one xgboost per node/worker, which means the parameter should be less than the number of slaves/workers.
-* The hadoop version now can only save the final model and evaluate test data locally after the training process.
+* The hadoop version can currently only save the final model.
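The workflow this README revision describes can be spelled out command by command. The sketch below assembles the steps from the run_hadoop_mushroom.sh script in patch 1; the hdfs_dir variable is a hypothetical stand-in for the script's $2 argument:

```
# sketch of the README workflow, assuming $HADOOP_HOME is set and xgboost
# was built with build.sh; hdfs_dir is a hypothetical HDFS prefix
hdfs_dir=/user/yourname/demo

# 1. put the training data in HDFS
hadoop fs -mkdir $hdfs_dir/data
hadoop fs -put ../../demo/data/agaricus.txt.train $hdfs_dir/data

# 2. submit the training task to hadoop with rabit_hadoop.py (-n = number of workers)
../../rabit/tracker/rabit_hadoop.py -n 3 -i $hdfs_dir/data/agaricus.txt.train \
    -o $hdfs_dir/mushroom.final.model ../../xgboost mushroom.hadoop.conf

# 3. get the final model from HDFS, then predict and visualize locally
hadoop fs -get $hdfs_dir/mushroom.final.model/part-00000 ./mushroom.final.model
../../xgboost mushroom.hadoop.conf task=pred model_in=mushroom.final.model \
    test:data=../../demo/data/agaricus.txt.test
```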
From df3f87c182cc12ccc9ac1f9cafbe01ea7ebf0ac4 Mon Sep 17 00:00:00 2001
From: Boliang Chen
Date: Sun, 11 Jan 2015 18:20:16 +0800
Subject: [PATCH 4/4] add more details

---
 multi-node/hadoop/README.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/multi-node/hadoop/README.md b/multi-node/hadoop/README.md
index aecee38e0..7ff7c5da7 100644
--- a/multi-node/hadoop/README.md
+++ b/multi-node/hadoop/README.md
@@ -22,20 +22,20 @@ How to Use
 XGBoost: Single machine version VS Hadoop version
 ====
 If you have used xgboost (single machine version) before, this section shows how to run xgboost on hadoop with slight modifications to the conf file.
+* The hadoop version needs to know up front how many slave nodes/workers you would like to use (the -n argument of ```rabit_hadoop.py```).
 * IO: instead of reading and writing files locally, the hadoop version uses "stdin" to read the training file and "stdout" to store the final model file. Therefore, you should change the parameters "data" and "model_out" in the conf file to ```data = stdin; model_out = stdout```.
 * File cache: ```rabit_hadoop.py``` also provides several ways to cache necessary files, including the binary file (xgboost), the conf file, small datasets used for evaluation during training, and so on.
   - Any file used in the config file, excluding stdin, should be cached in the script. ```rabit_hadoop.py``` will automatically cache files that appear in the command line. For example, ```rabit_hadoop.py -n 3 -i $hdfsPath/agaricus.txt.train -o $hdfsPath/mushroom.final.model $localPath/xgboost mushroom.hadoop.conf``` will cache "xgboost" and "mushroom.hadoop.conf".
-  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` or ```-f file1#file2```.
-* Test locally
-*
-*
-
-Usage of rabit_hadoop.py
-====
+  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2``` or ```-f file1#file2``` (use "#" to split file names).
+  - The local path of cached files in the command is "./".
+  - Since the cached files will be packaged and delivered to the hadoop slave nodes, they should not be large; for instance, trying to cache files of GB size may hurt performance.
+* The hadoop version also supports evaluation in each training round. You just need to modify the parameters "eval_train" and "eval[test]" in the conf file and cache the evaluation file.
+* The hadoop version can currently only save the final model.
+* Predict locally. Although the hadoop version distributes the training process, you should do prediction locally, the same as with the single machine version.
+* For more details on the hadoop version, refer to the usage of ```rabit_hadoop.py```.
 
 Notes
 ====
 * The code has been tested on MapReduce 1 (MRv1), it should be ok to run on MapReduce 2 (MRv2, YARN).
 * The code is multi-threaded, so you want to run one xgboost per node/worker, which means the parameter should be less than the number of slaves/workers.
-* The hadoop version can currently only save the final model.
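To make the evaluation point above concrete: enabling per-round evaluation takes two steps, uncommenting the eval parameters that patch 2 left in the conf file, and caching the evaluation file with -f so each worker can read it from "./". A sketch, with a hypothetical $hdfsPath:

```
# in mushroom.hadoop.conf, uncomment the two evaluation lines:
#   eval_train = 1
#   eval[test] = "agaricus.txt.test"
# then cache the evaluation file when submitting, so each worker finds it
# in its local working directory "./"; $hdfsPath is hypothetical
../../rabit/tracker/rabit_hadoop.py -n 3 -i $hdfsPath/agaricus.txt.train \
    -o $hdfsPath/mushroom.final.model -f ../../demo/data/agaricus.txt.test \
    ../../xgboost mushroom.hadoop.conf
```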