[DIST] Add Distributed XGBoost on AWS Tutorial
parent 61d9edcaa4
commit a71ba04109
@@ -20,6 +20,7 @@ The same code runs on major distributed environment(Hadoop, SGE, MPI) and can so
 
 What's New
 ----------
+* [Distributed XGBoost on AWS with YARN](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html)
 * [XGBoost brick](NEWS.md) Release
 
@@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support
 
 To use distributed xgboost, you only need to turn the options on to build
 with distributed filesystems(HDFS or S3) in ```xgboost/make/config.mk```.
 
-How to Use
-----------
-* Input data format: LIBSVM format. The example here uses generated data in ../data folder.
-* Put the data into some distribute filesytem (S3 or HDFS)
-* Use tracker script in dmlc-core/tracker to submit the jobs
-* Like all other DMLC tools, xgboost support taking a path to a folder as input argument
-  - All the files in the folder will be used as input
-* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```
-
-Example
--------
-* [run_yarn.sh](run_yarn.sh) shows how to submit job to Hadoop via YARN.
+Step by Step Tutorial on AWS
+----------------------------
+Check out [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost.
 
-Single machine vs Distributed Version
--------------------------------------
-If you have used xgboost (single machine version) before, this section will show you how to run xgboost on hadoop with a slight modification on conf file.
-* IO: instead of reading and writing file locally, we now use HDFS, put ```hdfs://``` prefix to the address of file you like to access
-* File cache: ```dmlc_yarn.py``` also provide several ways to cache necesary files, including binary file (xgboost), conf file
-  - ```dmlc_yarn.py``` will automatically cache files in the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
-  - You could also use "-f" to manually cache one or more files, like ```-f file1 -f file2```
-  - The local path of cached files in command is "./".
-* More details of submission can be referred to the usage of ```dmlc_yarn.py```.
-* The model saved by hadoop version is compatible with single machine version.
-
-Notes
------
-* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
-  - You will want to set <n_thread_per_worker> to be number of cores you have on each machine.
-
-External Memory Version
------------------------
-XGBoost supports external memory, this will make each process cache data into local disk during computation, without taking up all the memory for storing the data.
-See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for syntax using external memory.
-
-You only need to add cacheprefix to the input file to enable external memory mode. For example set training data as
-```
-data=hdfs:///path-to-my-data/#dtrain.cache
-```
-This will make xgboost more memory efficient, allows you to run xgboost on larger-scale dataset.
+Model Analysis
+--------------
+XGBoost is exchangeable across all bindings and platforms.
+This means you can use python or R to analyze the learnt model and do prediction.
+For example, you can use the [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.
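As a concrete illustration of the "turn the options on in ```xgboost/make/config.mk```" note in the README above, here is a minimal sketch of enabling distributed filesystem support. It mirrors the build commands in the AWS tutorial added by this commit; the `USE_S3`/`USE_HDFS` flags come from there, while the fresh clone and `make -j4` are just an assumed starting point.

```bash
# Sketch: build XGBoost with S3 and/or HDFS support enabled.
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/config.mk config.mk
echo "USE_S3=1"   >> config.mk   # enable reading/writing s3:// paths
echo "USE_HDFS=1" >> config.mk   # enable reading/writing hdfs:// paths
make -j4
```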
demo/distributed-training/mushroom.aws.conf (new file, 27 lines)
@@ -0,0 +1,27 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic

# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3

# Task Parameters
# the number of rounds to do boosting
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "s3://mybucket/xgb-demo/train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
# evaluate on training data as well each round
eval_train = 1
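For readers trying out this configuration file: the xgboost CLI combines it with command-line arguments, and any `key=value` pair given after the conf file overrides the value in the file — this is exactly how the `run_aws.sh` script added later in this commit passes `data`, `eval[test]` and `model_dir`. A minimal sketch, assuming it is run from `demo/distributed-training` and that the data paths in the conf are reachable:

```bash
# Sketch: run the CLI with mushroom.aws.conf and override parameters on the command line.
# key=value pairs after the conf file take precedence over the values in the file.
../../xgboost mushroom.aws.conf max_depth=6 num_round=10
```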
demo/distributed-training/plot_model.ipynb (new file, 107 lines)
@@ -0,0 +1,107 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XGBoost Model Analysis\n",
    "\n",
    "This notebook can be used to load and analyze models learnt from all xgboost bindings, including distributed training. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "%matplotlib inline "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Please change the ```pkg_path``` and ```model_file``` to the correct paths"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "pkg_path = '../../python-package/'\n",
    "model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n",
    "sys.path.insert(0, pkg_path)\n",
    "import xgboost as xgb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plot the Feature Importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# load the model and plot the feature importance.\n",
    "bst = xgb.Booster(model_file=model_file)\n",
    "xgb.plot_importance(bst)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Plot the First Tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "tree_id = 0\n",
    "xgb.to_graphviz(bst, tree_id)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
demo/distributed-training/run_aws.sh (new file, 11 lines)
@@ -0,0 +1,11 @@
# This is the example script to run distributed xgboost on AWS.
# Change the following line for configuration

export BUCKET=mybucket

# submit the job to YARN
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
    ../../xgboost mushroom.aws.conf nthread=2\
    data=s3://${BUCKET}/xgb-demo/train\
    eval[test]=s3://${BUCKET}/xgb-demo/test\
    model_dir=s3://${BUCKET}/xgb-demo/model
@@ -1,33 +0,0 @@
-#!/bin/bash
-if [ "$#" -lt 3 ];
-then
-    echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
-    exit -1
-fi
-
-# put the local training file to HDFS
-hadoop fs -mkdir $3/data
-hadoop fs -put ../data/agaricus.txt.train $3/data
-hadoop fs -put ../data/agaricus.txt.test $3/data
-
-# running rabit, pass address in hdfs
-../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2\
-    data=hdfs://$3/data/agaricus.txt.train\
-    eval[test]=hdfs://$3/data/agaricus.txt.test\
-    model_out=hdfs://$3/mushroom.final.model
-
-# get the final model file
-hadoop fs -get $3/mushroom.final.model final.model
-
-# use dmlc-core/yarn/run_hdfs_prog.py to setup approperiate env
-
-# output prediction task=pred
-#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
-# print the boosters of final.model in dump.raw.txt
-#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
-# use the feature map in printing for better visualization
-#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
-../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
-cat dump.nice.txt
@@ -1 +1 @@
-Subproject commit 0f8fd38bf94e6666aa367be80195b1f2da87428c
+Subproject commit 38ee75d95ff23e4e1febacc89e08975d9b6c6c3a
@@ -23,7 +23,7 @@ This section contains users guides that are general across languages.
 
 * [Installation Guide](build.md)
 * [Introduction to Boosted Trees](model.md)
-* [Distributed Training](../demo/distributed-training)
+* [Distributed Training Tutorial](tutorial/aws_yarn.md)
 * [Frequently Asked Questions](faq.md)
 * [External Memory Version](external_memory.md)
 * [Learning to use XGBoost by Example](../demo)
doc/tutorial/aws_yarn.md (new file, 187 lines)
@@ -0,0 +1,187 @@
Distributed XGBoost YARN on AWS
===============================
This is a step-by-step tutorial on how to set up and run distributed [XGBoost](https://github.com/dmlc/xgboost)
on an AWS EC2 cluster. Distributed XGBoost runs on various platforms such as MPI, SGE and Hadoop YARN.
In this tutorial, we use YARN as an example since it is a widely used solution for distributed computing.

Prerequisites
-------------
We need an [AWS key-pair](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
to access the AWS services. Let us assume that we are using a key ```mykey``` and the corresponding permission file ```mypem.pem```.

We also need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html),
which include an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`.

Finally, we will need an S3 bucket to host the data and the model, say ```s3://mybucket/```.
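If the bucket does not exist yet, one way to create it (not spelled out in the tutorial itself) is with [s3cmd](http://s3tools.org/s3cmd), the same tool the tutorial uses later to upload data. The bucket name is a placeholder you should replace; a minimal sketch:

```bash
# Sketch: create the S3 bucket used throughout this tutorial (name is a placeholder).
s3cmd --configure        # one-time setup, asks for the AWS access keys
s3cmd mb s3://mybucket/  # make the bucket that will hold the data and models
```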
Setup a Hadoop YARN Cluster
---------------------------
This section shows how to start a Hadoop YARN cluster from scratch.
You can skip this step if you already have one.
We will be using [yarn-ec2](https://github.com/tqchen/yarn-ec2) to start the cluster.

We can first clone the yarn-ec2 script with the following command.
```bash
git clone https://github.com/tqchen/yarn-ec2
```

To use the script, we must set the environment variables `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` properly. This can be done by adding the following two lines to
`~/.bashrc` (replacing the strings with the correct ones).

```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

Now we can launch a master machine of the cluster from EC2.
```bash
./yarn-ec2 -k mykey -i mypem.pem launch xgboost
```
Wait a few minutes until the master machine is up.

After the master machine is up, we can query its public DNS using the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem get-master xgboost
```
It will show the public DNS of the master machine, like ```ec2-xx-xx-xx.us-west-2.compute.amazonaws.com```.
Now we can open a browser and type the following (replace the DNS with the master DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
```
This will show the job tracker of the YARN cluster. Note that we may need to wait a few minutes before the master finishes bootstrapping and starts the
job tracker.

After the master machine is up, we can freely add more slave machines to the cluster.
The following command adds two m3.xlarge instances to the cluster.
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addslave xgboost
```
We can also choose to add two spot instances instead.
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addspot xgboost
```
The slave machines will start up, bootstrap and report to the master.
You can check whether the slave machines are connected by clicking the Nodes link on the job tracker,
or by simply typing the following URL (replace the DNS with the master DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/cluster/nodes
```

One thing we should note is that not all the links in the job tracker work.
This is because many of them use the private IP addresses of AWS, which can only be accessed from within EC2.
We can use an SSH proxy to access these pages.
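The tutorial does not spell the proxy out; one common approach is an SSH SOCKS tunnel to the master, after which the browser is configured to use `localhost:9999` as a SOCKS proxy. The login user `ubuntu`, the local port `9999` and the DNS are assumptions to replace with your own. A hedged sketch:

```bash
# Sketch: open a SOCKS proxy through the master so the private links in the job tracker resolve.
# -D sets up dynamic port forwarding, -N skips running a remote command.
ssh -i mypem.pem -N -D 9999 ubuntu@ec2-xx-xx-xx.us-west-2.compute.amazonaws.com
```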
Now that we have set up a cluster with one master and two slaves, we are ready to run the experiment.

Build XGBoost with S3
---------------------
We can log into the master machine with the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem login xgboost
```

We will be using S3 to host the data and the resulting model, so the data won't get lost after the cluster shuts down.
To do so, we need to build xgboost with S3 support. The only thing we need to do is to set the ```USE_S3```
variable to true. This can be achieved by the following commands.

```bash
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/config.mk config.mk
echo "USE_S3=1" >> config.mk
make -j4
```
Now we have built XGBoost with S3 support. You can also enable HDFS support if you plan to store data on HDFS, by turning on the ```USE_HDFS``` option.

XGBoost also relies on the environment variables to access S3, so you will need to add the following lines to `~/.bashrc` (replacing the strings with the correct ones)
on the master machine as well.

```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export BUCKET=mybucket
```
Host the Data on S3
-------------------
In this example, we will copy the example dataset in xgboost to the S3 bucket as input.
In normal use cases, the dataset is usually created by an existing distributed processing pipeline.
We can use [s3cmd](http://s3tools.org/s3cmd) to copy the data into mybucket (replace ${BUCKET} with the real bucket name).

```bash
cd xgboost
s3cmd put demo/data/agaricus.txt.train s3://${BUCKET}/xgb-demo/train/
s3cmd put demo/data/agaricus.txt.test s3://${BUCKET}/xgb-demo/test/
```
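An optional sanity check, not part of the original tutorial, is to list the uploaded objects before submitting the job so that a typo in the bucket or prefix shows up early:

```bash
# Sketch: verify the training and test files landed where the configuration expects them.
s3cmd ls s3://${BUCKET}/xgb-demo/train/
s3cmd ls s3://${BUCKET}/xgb-demo/test/
```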
Submit the Jobs
---------------
Now everything is ready, and we can submit the xgboost distributed job to the YARN cluster.
We will use the [dmlc-submit](https://github.com/dmlc/dmlc-core/tree/master/tracker) script to submit the job.

Now we can run the following script in the distributed-training folder (replace ${BUCKET} with the real bucket name)
```bash
cd xgboost/demo/distributed-training
# Use dmlc-submit to submit the job.
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
    ../../xgboost mushroom.aws.conf nthread=2\
    data=s3://${BUCKET}/xgb-demo/train\
    eval[test]=s3://${BUCKET}/xgb-demo/test\
    model_dir=s3://${BUCKET}/xgb-demo/model
```
All the configurations such as ```data``` and ```model_dir``` can also be written directly into the configuration file.
Note that we only specified the folder path, instead of a single file name.
XGBoost will read in all the files in that folder as the training and evaluation data.

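As a sketch of what "written directly into the configuration file" could look like, the entries below could go into ```mushroom.aws.conf```; they are simply the same parameters that the command line above passes, and the bucket path is a placeholder.

```
# The path of training data (the whole folder will be read)
data = "s3://mybucket/xgb-demo/train"
# The path of validation data, here [test] sets the name of the validation set
eval[test] = "s3://mybucket/xgb-demo/test"
# Directory where the trained model(s) are saved
model_dir = "s3://mybucket/xgb-demo/model"
```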
In this command, we are using two workers, and each worker uses two running threads.
XGBoost can benefit from using multiple cores in each worker.
A common choice of worker cores ranges from 4 to 8.
The trained model will be saved into the specified model folder. You can browse the model folder with
```
s3cmd ls s3://${BUCKET}/xgb-demo/model/
```

The following is an example output from distributed training.
```
16/02/26 05:41:59 INFO dmlc.Client: jobname=DMLC[nworker=2]:xgboost,username=ubuntu
16/02/26 05:41:59 INFO dmlc.Client: Submitting application application_1456461717456_0015
16/02/26 05:41:59 INFO impl.YarnClientImpl: Submitted application application_1456461717456_0015
2016-02-26 05:42:05,230 INFO @tracker All of 2 nodes getting started
2016-02-26 05:42:14,027 INFO [05:42:14] [0] test-error:0.016139 train-error:0.014433
2016-02-26 05:42:14,186 INFO [05:42:14] [1] test-error:0.000000 train-error:0.001228
2016-02-26 05:42:14,947 INFO @tracker All nodes finishes job
2016-02-26 05:42:14,948 INFO @tracker 9.71754479408 secs between node start and job finish
Application application_1456461717456_0015 finished with state FINISHED at 1456465335961
```
Analyze the Model
-----------------
After the model is trained, we can analyze the learnt model and use it for future prediction tasks.
XGBoost is a portable framework: the model is ***exchangeable*** across all platforms.
This means we can load the trained model in python/R/Julia and take advantage of the data science pipelines
in these languages to do model analysis and prediction.

For example, you can use [this ipython notebook](https://github.com/dmlc/xgboost/tree/master/demo/distributed-training/plot_model.ipynb)
to plot the feature importance and visualize the learnt model.
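Beyond the notebook, here is a minimal Python sketch of the same idea. It assumes you have copied one of the saved models (for example ```0002.model```) and the demo test file to the local machine, and that the xgboost python-package is importable; the file names are placeholders.

```python
import xgboost as xgb

# Load the model produced by distributed training
# (fetched locally first, e.g. with: s3cmd get s3://mybucket/xgb-demo/model/0002.model).
bst = xgb.Booster(model_file='0002.model')

# Inspect which features the boosted trees rely on most.
xgb.plot_importance(bst)

# The same model can be used for prediction on new data.
dtest = xgb.DMatrix('agaricus.txt.test')
preds = bst.predict(dtest)
print(preds[:10])
```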
Troubleshooting
---------------

If you encounter a problem, the best way to debug is often to use the following command
to get the stdout and stderr logs of the containers and check what caused the problem.
```
yarn logs -applicationId yourAppId
```
Future Directions
-----------------
You have learnt how to use distributed XGBoost on YARN in this tutorial.
XGBoost is a portable and scalable framework for gradient boosting.
You can check out more examples and resources on the [resources page](https://github.com/dmlc/xgboost/blob/master/demo/README.md).

The project goal is to make the best scalable machine learning solution available on all platforms.
The API is designed to be portable, and the same code can also run on other platforms such as MPI and SGE.
XGBoost is actively evolving and we are working on even more exciting features
such as a distributed xgboost python/R package. Check out the [RoadMap](https://github.com/dmlc/xgboost/issues/873) for
more details, and you are more than welcome to contribute to the project.
@@ -880,11 +880,9 @@ class Booster(object):
         fname : string or a memory buffer
             Input file name or memory buffer(see also save_raw)
         """
-        if isinstance(fname, STRING_TYPES):  # assume file name
-            if os.path.exists(fname):
-                _LIB.XGBoosterLoadModel(self.handle, c_str(fname))
-            else:
-                raise ValueError("No such file: {0}".format(fname))
+        if isinstance(fname, STRING_TYPES):
+            # assume file name, cannot use os.path.exist to check, file can be from URL.
+            _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))
         else:
             buf = fname
             length = ctypes.c_ulong(len(buf))
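The point of this change is that ```fname``` no longer has to be a local path: an XGBoost build with S3/HDFS support can load the model straight from the distributed filesystem, which an `os.path.exists` check would have rejected. A hedged usage sketch, where the bucket path is a placeholder and a `USE_S3=1` build is assumed:

```python
import xgboost as xgb

bst = xgb.Booster()
# With an S3-enabled build, the model can be loaded directly from the bucket;
# the old os.path.exists() check on this string would have raised ValueError.
bst.load_model('s3://mybucket/xgb-demo/model/0002.model')
```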