[DIST] Add Distributed XGBoost on AWS Tutorial
@@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support
To use distributed xgboost, you only need to turn on the options to build
with distributed filesystem support (HDFS or S3) in ```xgboost/make/config.mk```.

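The relevant switches in ```xgboost/make/config.mk``` look roughly like this (a sketch; the ```USE_HDFS```/```USE_S3``` flag names are assumed from the config template, so verify them against your checkout):

```
# whether to build with HDFS support
USE_HDFS = 1
# whether to build with AWS S3 support
USE_S3 = 1
```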
How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into a distributed filesystem (S3 or HDFS)
* Use the tracker script in dmlc-core/tracker to submit the jobs
* Like all other DMLC tools, xgboost supports taking a path to a folder as an input argument
  - All the files in the folder will be used as input
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```

Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit a job to Hadoop via YARN.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section will show you how to run xgboost on hadoop with a slight modification to the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; add the ```hdfs://``` prefix to the address of any file you want to access
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file
  - ```dmlc_yarn.py``` will automatically cache files given on the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```
  - The local path of cached files in the command is "./".
* For more details on submission, see the usage of ```dmlc_yarn.py```.
* The model saved by the hadoop version is compatible with the single machine version.

Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
  - You will want to set <n_thread_per_worker> to the number of cores you have on each machine.

Step by Step Tutorial on AWS
----------------------------
Check out [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost.

External Memory Version
-----------------------
XGBoost supports external memory. This makes each process cache data to local disk during computation, instead of taking up all the memory for storing the data.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.

You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This will make xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.

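The part before ```#``` is the real data path, and the part after it is only the prefix used for the local cache files. A minimal sketch of how the spec decomposes (the helper ```split_cache_spec``` is hypothetical, for illustration only, not an xgboost API):

```python
def split_cache_spec(uri):
    """Split an input spec of the form 'path#cacheprefix'.

    Returns (data_path, cache_prefix); cache_prefix is None when no
    external-memory cache is requested.
    """
    path, sep, cache = uri.partition("#")
    return (path, cache if sep else None)

# The example from the text: data lives on HDFS, cache files are
# written locally with the prefix 'dtrain.cache'.
print(split_cache_spec("hdfs:///path-to-my-data/#dtrain.cache"))
# Without a '#', the path is used as-is and no cache is created.
print(split_cache_spec("hdfs:///path-to-my-data/"))
```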
Model Analysis
--------------
XGBoost models are exchangeable across all bindings and platforms.
This means you can use Python or R to analyze the learned model and make predictions.
For example, you can use [plot_model.ipynb](plot_model.ipynb) to visualize the learned model.

27
demo/distributed-training/mushroom.aws.conf
Normal file
@@ -0,0 +1,27 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight (hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds of boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "s3://mybucket/xgb-demo/train"
# The path of validation data, used to monitor the training process; here [test] sets the name of the validation set
# evaluate on training data as well each round
eval_train = 1

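The conf file above uses a simple ```key = value``` format with ```#``` comments. A minimal illustrative parser sketch (```parse_conf``` is hypothetical, not part of xgboost; the real CLI also accepts overrides as ```key=value``` command-line arguments, as run_aws.sh shows):

```python
def parse_conf(text):
    """Parse a simple 'key = value' conf into a dict of strings.

    Drops '#' comments and blank lines; strips optional double quotes
    around values. Sketch only: does not handle '#' inside values.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # remove comment and whitespace
        if not line:
            continue
        key, _, value = line.partition("=")
        params[key.strip()] = value.strip().strip('"')
    return params

conf = """
# choose the booster
booster = gbtree
num_round = 2
data = "s3://mybucket/xgb-demo/train"
"""
print(parse_conf(conf))
```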
107
demo/distributed-training/plot_model.ipynb
Normal file
@@ -0,0 +1,107 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XGBoost Model Analysis\n",
    "\n",
    "This notebook can be used to load and analyze models learned from all xgboost bindings, including distributed training. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "%matplotlib inline "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Please change ```pkg_path``` and ```model_file``` to the correct paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pkg_path = '../../python-package/'\n",
    "model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n",
    "sys.path.insert(0, pkg_path)\n",
    "import xgboost as xgb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plot the Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# load the model and plot feature importance.\n",
    "bst = xgb.Booster(model_file=model_file)\n",
    "xgb.plot_importance(bst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plot the First Tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tree_id = 0\n",
    "xgb.to_graphviz(bst, tree_id)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
11
demo/distributed-training/run_aws.sh
Normal file
@@ -0,0 +1,11 @@
# This is the example script to run distributed xgboost on AWS.
# Change the following line for configuration

export BUCKET=mybucket

# submit the job to YARN
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2 \
    ../../xgboost mushroom.aws.conf nthread=2 \
    data=s3://${BUCKET}/xgb-demo/train \
    eval[test]=s3://${BUCKET}/xgb-demo/test \
    model_dir=s3://${BUCKET}/xgb-demo/model

@@ -1,33 +0,0 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
    exit -1
fi

# put the local training file to HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data

# running rabit, pass address in hdfs
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2 \
    data=hdfs://$3/data/agaricus.txt.train \
    eval[test]=hdfs://$3/data/agaricus.txt.test \
    model_out=hdfs://$3/mushroom.final.model

# get the final model file
hadoop fs -get $3/mushroom.final.model final.model

# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env

# output prediction task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model in dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt