[DOC] cleanup distributed training

tqchen
2016-01-16 11:00:06 -08:00
parent df7c7930d0
commit e7d8ed71d6
11 changed files with 155 additions and 237 deletions


@@ -44,8 +44,15 @@ However, the parameter settings can be applied to all versions
* [Multiclass classification](multiclass_classification)
* [Regression](regression)
* [Learning to Rank](rank)
* [Distributed Training](distributed-training)
Benchmarks
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
Machine Learning Challenge Winning Solutions
--------------------------------------------
* XGBoost helps Vlad Mironov, Alexander Guschin to win the [CERN LHCb experiment Flavour of Physics competition](https://www.kaggle.com/c/flavours-of-physics). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/11/30/flavour-of-physics-technical-write-up-1st-place-go-polar-bears/).
* XGBoost helps Mario Filho, Josef Feigl, Lucas, Gilberto to win the [Caterpillar Tube Pricing competition](https://www.kaggle.com/c/caterpillar-tube-pricing). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/09/22/caterpillar-winners-interview-1st-place-gilberto-josef-leustagos-mario/).
* XGBoost helps Halla Yang to win the [Recruit Coupon Purchase Prediction Challenge](https://www.kaggle.com/c/coupon-purchase-prediction). Check out the [interview from Kaggle](http://blog.kaggle.com/2015/10/21/recruit-coupon-purchase-winners-interview-2nd-place-halla-yang/).


@@ -0,0 +1,52 @@
Distributed XGBoost Training
============================
This is a tutorial on distributed XGBoost training.
Currently, xgboost supports distributed training via the CLI program with a configuration file.
There are also plans to add distributed Python and other language bindings; please open an issue
if you are interested in contributing.
Build XGBoost with Distributed Filesystem Support
-------------------------------------------------
To use distributed xgboost, you only need to turn on the options to build
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.
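A minimal sketch of the relevant section, assuming the flag names used by the shipped ```config.mk``` (verify against your copy):
```
# in xgboost/make/config.mk -- flag names assumed, check your copy
# whether to build with HDFS support
USE_HDFS = 1
# whether to build with AWS S3 support
USE_S3 = 1
```
Rebuild xgboost after turning the flags on.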
How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS)
* Use the tracker script in dmlc-core/tracker to submit the jobs (a full sketch follows this list)
* Like all other DMLC tools, xgboost supports taking a path to a folder as the input argument
- All the files in the folder will be used as input
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```
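Putting these together, a minimal sketch of a manual submission, assuming a hypothetical HDFS path ```/user/me/xgboost-demo```:
```
# upload the demo data to HDFS (path is hypothetical)
hadoop fs -mkdir -p /user/me/xgboost-demo/data
hadoop fs -put ../data/agaricus.txt.train /user/me/xgboost-demo/data
# submit 2 workers with 4 vcores each through the dmlc-core tracker;
# passing the folder means every file inside it is used as input
../../dmlc-core/tracker/dmlc_yarn.py -n 2 --vcores 4 ../../xgboost mushroom.hadoop.conf \
    nthread=4 data=hdfs:///user/me/xgboost-demo/data
```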
Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.
Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section shows how to run xgboost on hadoop with only slight modifications to the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; add the ```hdfs://``` prefix to the address of every file you want to access
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary (xgboost) and the conf file
  - ```dmlc_yarn.py``` will automatically cache files that appear on the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, as in ```-f file1 -f file2``` (see the sketch after this list)
  - The local path of a cached file inside the command is "./"
* For more details on submission, see the usage of ```dmlc_yarn.py```.
* The model saved by the hadoop version is compatible with the single machine version.
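For example, a sketch of the conf change with hypothetical paths; the single machine conf and its hadoop counterpart differ only in the file addresses:
```
# single machine conf
data = "agaricus.txt.train"
model_out = "mushroom.model"

# the same entries for the hadoop version
data = "hdfs:///user/me/data/agaricus.txt.train"
model_out = "hdfs:///user/me/mushroom.model"
```
To cache an extra file, e.g. a feature map, append something like ```-f ../data/featmap.txt``` to the ```dmlc_yarn.py``` command; it then shows up under ```./``` on each worker.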
Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with multiple vcores for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine, as in the sketch below.
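For instance, a small sketch assuming homogeneous worker machines (the HDFS path is hypothetical):
```
# match <n_thread_per_worker> to the core count of each worker machine
NCORES=$(nproc)
bash run_yarn.sh 4 "$NCORES" /user/me/xgboost-demo
```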
External Memory Version
-----------------------
XGBoost supports external memory, which makes each process cache data on local disk during computation instead of holding the whole dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the external memory syntax.
You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This will make xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.
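The same cache suffix should also work for the evaluation set; a sketch with hypothetical paths and cache names:
```
data = "hdfs:///path-to-my-data/agaricus.txt.train#dtrain.cache"
eval[test] = "hdfs:///path-to-my-data/agaricus.txt.test#dtest.cache"
```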


@@ -0,0 +1,33 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
exit -1
fi
# put the local training file to HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data
# run through the rabit tracker, passing HDFS addresses for the data and the output model
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model
# get the final model file
hadoop fs -get $3/mushroom.final.model final.model
# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
# output predictions with task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# dump the boosters of final.model into dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map when dumping for a more readable output
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt