Distributed XGBoost: Hadoop Version
- The script in this folder shows an example of how to run distributed XGBoost on the Hadoop platform.
- It relies on the Rabit library (Reliable Allreduce and Broadcast Interface) and Hadoop Streaming.
- Quick start: run
  ```
  bash run_binary_classification.sh <n_hadoop_workers> <path_in_HDFS>
  ```
- This is the Hadoop version of the binary classification example in the demo folder.
- More information about the binary classification task can be found at https://github.com/tqchen/xgboost/wiki/Binary-Classification.
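For instance, a concrete run might look like the sketch below; the worker count and HDFS path are hypothetical and should be adapted to your cluster:
```
# hypothetical example: 2 Hadoop workers, output under /user/$USER/xgboost-demo in HDFS
bash run_binary_classification.sh 2 /user/$USER/xgboost-demo
```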
Before you run the script
- Make sure you have set up the Hadoop environment. Otherwise, you should run the single-machine examples in the demo folder.
- Build: run `bash build.sh` in the root folder; it will automatically download Rabit and build xgboost.
- Check whether the environment variable `$HADOOP_HOME` exists (e.g. run `echo $HADOOP_HOME`). If not, please set the hadoop-streaming.jar path in rabit_hadoop.py.
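If `$HADOOP_HOME` is set, the sketch below is one way to locate the streaming jar; the exact jar location varies across Hadoop distributions, so treat the path it prints as cluster-specific:
```
# print the Hadoop home, then search it for the streaming jar
echo $HADOOP_HOME
find $HADOOP_HOME -name "hadoop-streaming*.jar"
```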
How to Use
- Input data format: LIBSVM format. The example here uses generated data in the demo/data folder.
- Put the training data in HDFS (the Hadoop Distributed File System).
- Use Rabit's `rabit_hadoop.py` to submit the training task to Hadoop and save the final model file.
- Get the final model file from HDFS, then run prediction and model visualization locally. An end-to-end sketch of these steps follows this list.
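The commands below sketch the full workflow under stated assumptions: the HDFS paths are hypothetical, the file names follow the mushroom example used elsewhere in this document, and the local prediction step reuses the single-machine conf from the demo folder:
```
# 1) put the LIBSVM training data into HDFS (path is hypothetical)
hadoop fs -mkdir -p /user/$USER/xgboost-demo
hadoop fs -put demo/data/agaricus.txt.train /user/$USER/xgboost-demo/

# 2) submit the training task to Hadoop
rabit_hadoop.py -n 3 \
  -i /user/$USER/xgboost-demo/agaricus.txt.train \
  -o /user/$USER/xgboost-demo/mushroom.final.model \
  ./xgboost mushroom.hadoop.conf

# 3) fetch the final model from HDFS
hadoop fs -get /user/$USER/xgboost-demo/mushroom.final.model .

# 4) predict locally, reusing the single-machine demo conf
./xgboost mushroom.conf task=pred model_in=mushroom.final.model
```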
XGBoost: Single machine version vs. Hadoop version
If you have used xgboost (single machine version) before, this section will show you how to run xgboost on Hadoop with a slight modification to the conf file.
- IO: instead of reading and writing files locally, the Hadoop version uses stdin to read the training file and stdout to store the final model file. Therefore, you should change the parameters `data` and `model_out` in the conf file to `data = stdin; model_out = stdout` (a sketch of a full conf file appears after this list).
- File cache: `rabit_hadoop.py` also provides several ways to cache necessary files, including the binary file (xgboost), the conf file, small datasets used for evaluation during training, and so on.
- Any file used in the config file, excluding stdin, should be cached in the script. `rabit_hadoop.py` will automatically cache files given on the command line. For example, `rabit_hadoop.py -n 3 -i $hdfsPath/agaricus.txt.train -o $hdfsPath/mushroom.final.model $localPath/xgboost mushroom.hadoop.conf` will cache "xgboost" and "mushroom.hadoop.conf".
- You can also use `-f` to manually cache one or more files, e.g. `-f file1 -f file2` or `-f file1#file2`.
- Test locally before submitting the job to Hadoop.
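To make the IO change concrete, here is a minimal sketch of what a `mushroom.hadoop.conf` might contain. The `data`/`model_out` lines are the Hadoop-specific changes described above; the learning parameters are assumptions, copied in spirit from the single-machine mushroom example in the demo folder:
```
# Hadoop-specific IO: read training data from stdin, write the model to stdout
data = stdin
model_out = stdout

# learning parameters (assumed; adapt from demo/binary_classification/mushroom.conf)
booster = gbtree
objective = binary:logistic
eta = 1.0
max_depth = 3
num_round = 2
```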
Usage of rabit_hadoop.py
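Based only on the flags that appear above (`-n` for the number of workers, `-i` for the HDFS input path, `-o` for the HDFS output path, `-f` for extra cached files), a typical submission looks like the following sketch; the paths are hypothetical, and `-f` here caches a small test set for evaluation during training:
```
rabit_hadoop.py -n 3 \
  -i $hdfsPath/agaricus.txt.train \
  -o $hdfsPath/mushroom.final.model \
  -f demo/data/agaricus.txt.test \
  $localPath/xgboost mushroom.hadoop.conf
```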
Notes
- The code has been tested on MapReduce 1 (MRv1); it should also run on MapReduce 2 (MRv2, YARN).
- The code is multi-threaded, so you should run one xgboost process per node/worker, which means the parameter <n_workers> should be less than the number of slaves/workers.
- The Hadoop version currently can only save the final model.