[DIST] Add Distributed XGBoost on AWS Tutorial

This commit is contained in:
tqchen 2016-02-25 20:42:16 -08:00
parent 61d9edcaa4
commit a71ba04109
11 changed files with 355 additions and 86 deletions

View File

@ -20,6 +20,7 @@ The same code runs on major distributed environment(Hadoop, SGE, MPI) and can so
What's New
----------
* [Distributed XGBoost on AWS with YARN](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html)
* [XGBoost brick](NEWS.md) Release

View File

@ -10,43 +10,14 @@ Build XGBoost with Distributed Filesystem Support
To use distributed xgboost, you only need to turn the options on to build
with distributed filesystems (HDFS or S3) in ```xgboost/make/config.mk```.

Removed in this commit:

How to Use
----------
* Input data format: LIBSVM format. The example here uses generated data in the ../data folder.
* Put the data into some distributed filesystem (S3 or HDFS).
* Use the tracker script in dmlc-core/tracker to submit the jobs.
* Like all other DMLC tools, xgboost supports taking a path to a folder as the input argument.
  - All the files in the folder will be used as input.
* Quick start in Hadoop YARN: run ```bash run_yarn.sh <n_hadoop_workers> <n_thread_per_worker> <path_in_HDFS>```

Example
-------
* [run_yarn.sh](run_yarn.sh) shows how to submit the job to Hadoop via YARN.

Single machine vs Distributed Version
-------------------------------------
If you have used xgboost (single machine version) before, this section shows how to run xgboost on Hadoop with a slight modification of the conf file.
* IO: instead of reading and writing files locally, we now use HDFS; put the ```hdfs://``` prefix on the address of any file you want to access.
* File cache: ```dmlc_yarn.py``` also provides several ways to cache necessary files, including the binary file (xgboost) and the conf file.
  - ```dmlc_yarn.py``` automatically caches files given on the command line. For example, ```dmlc_yarn.py -n 3 $localPath/xgboost.dmlc mushroom.hadoop.conf``` will cache "xgboost.dmlc" and "mushroom.hadoop.conf".
  - You can also use "-f" to manually cache one or more files, like ```-f file1 -f file2```.
  - The local path of cached files in the command is "./".
* For more details of submission, refer to the usage of ```dmlc_yarn.py```.
* The model saved by the Hadoop version is compatible with the single machine version.

Notes
-----
* The code is optimized with multi-threading, so you will want to run xgboost with more vcores for best performance.
  - Set <n_thread_per_worker> to the number of cores you have on each machine.

External Memory Version
-----------------------
XGBoost supports external memory: each process caches data on local disk during computation instead of keeping the whole dataset in memory.
See [external memory](https://github.com/dmlc/xgboost/tree/master/doc/external_memory.md) for the syntax of using external memory.
You only need to add a cache prefix to the input file to enable external memory mode. For example, set the training data as
```
data=hdfs:///path-to-my-data/#dtrain.cache
```
This makes xgboost more memory efficient and allows you to run xgboost on larger-scale datasets.

Added in this commit:

Step by Step Tutorial on AWS
----------------------------
Checkout [this tutorial](https://xgboost.readthedocs.org/en/latest/tutorial/aws_yarn.html) for running distributed xgboost.

Model Analysis
--------------
XGBoost is exchangeable across all bindings and platforms.
This means you can use python or R to analyze the learnt model and do prediction.
For example, you can use [plot_model.ipynb](plot_model.ipynb) to visualize the learnt model.

View File

@ -0,0 +1,27 @@
# General Parameters, see comment for each definition
# choose the booster, can be gbtree or gblinear
booster = gbtree
# choose logistic regression loss function for binary classification
objective = binary:logistic
# Tree Booster Parameters
# step size shrinkage
eta = 1.0
# minimum loss reduction required to make a further partition
gamma = 1.0
# minimum sum of instance weight(hessian) needed in a child
min_child_weight = 1
# maximum depth of a tree
max_depth = 3
# Task Parameters
# the number of boosting rounds
num_round = 2
# 0 means do not save any model except the final round model
save_period = 0
# The path of training data
data = "s3://mybucket/xgb-demo/train"
# The path of validation data, used to monitor training process, here [test] sets name of the validation set
# evaluate on training data as well each round
eval_train = 1
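Every key in this file can also be overridden as a ```key=value``` argument on the xgboost command line; the tutorial below relies on this to pass ```data```, ```eval[test]```, ```model_dir``` and ```nthread``` at submission time. A minimal sketch, where the paths, worker counts and the ```max_depth``` override are illustrative:

```bash
# sketch: override config values at submission time (paths and numbers are examples)
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2 \
    ../../xgboost mushroom.aws.conf nthread=2 max_depth=6 \
    data=s3://mybucket/xgb-demo/train \
    eval[test]=s3://mybucket/xgb-demo/test \
    model_dir=s3://mybucket/xgb-demo/model
```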

View File

@ -0,0 +1,107 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XGBoost Model Analysis\n",
"\n",
"This notebook can be used to load and anlysis model learnt from all xgboost bindings, including distributed training. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Please change the ```pkg_path``` and ```model_file``` to be correct path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pkg_path = '../../python-package/'\n",
"model_file = 's3://my-bucket/xgb-demo/model/0002.model'\n",
"sys.path.insert(0, pkg_path)\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the Feature Importance"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# plot the first two trees.\n",
"bst = xgb.Booster(model_file=model_file)\n",
"xgb.plot_importance(bst)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Plot the First Tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"tree_id = 0\n",
"xgb.to_graphviz(bst, tree_id)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
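Reading the model straight from an ```s3://``` path, as the cell above does, assumes the underlying XGBoost library was built with S3 support (```USE_S3=1```). A hedged alternative, if your local build lacks it, is to copy the model down first and point ```model_file``` at the local copy; the bucket and file names are the ones assumed by this notebook.

```bash
# fetch the trained model locally, then set model_file = './0002.model' in the cell above
s3cmd get s3://my-bucket/xgb-demo/model/0002.model ./0002.model
```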

View File

@ -0,0 +1,11 @@
# This is the example script to run distributed xgboost on AWS.
# Change the configuration below (at least BUCKET) to match your setup.
export BUCKET=mybucket
# submit the job to YARN
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
../../xgboost mushroom.aws.conf nthread=2\
data=s3://${BUCKET}/xgb-demo/train\
eval[test]=s3://${BUCKET}/xgb-demo/test\
model_dir=s3://${BUCKET}/xgb-demo/model
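A possible way to use this script, assuming it is run from ```xgboost/demo/distributed-training``` on the master node of the YARN cluster with the AWS credential variables already exported:

```bash
# edit the BUCKET line above to point at your own bucket, then submit
bash run_aws.sh
# watch the job in the YARN job tracker at <master-dns>:8088
```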

View File

@ -1,33 +0,0 @@
#!/bin/bash
if [ "$#" -lt 3 ];
then
echo "Usage: <nworkers> <nthreads> <path_in_HDFS>"
exit -1
fi
# put the local training file to HDFS
hadoop fs -mkdir $3/data
hadoop fs -put ../data/agaricus.txt.train $3/data
hadoop fs -put ../data/agaricus.txt.test $3/data
# running rabit, pass address in hdfs
../../dmlc-core/tracker/dmlc_yarn.py -n $1 --vcores $2 ../../xgboost mushroom.hadoop.conf nthread=$2\
data=hdfs://$3/data/agaricus.txt.train\
eval[test]=hdfs://$3/data/agaricus.txt.test\
model_out=hdfs://$3/mushroom.final.model
# get the final model file
hadoop fs -get $3/mushroom.final.model final.model
# use dmlc-core/yarn/run_hdfs_prog.py to set up the appropriate env
# output prediction task=pred
#../../xgboost.dmlc mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=pred model_in=final.model test:data=../data/agaricus.txt.test
# print the boosters of final.model in dump.raw.txt
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model name_dump=dump.raw.txt
# use the feature map in printing for better visualization
#../../xgboost.dmlc mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
../../dmlc-core/yarn/run_hdfs_prog.py ../../xgboost mushroom.hadoop.conf task=dump model_in=final.model fmap=../data/featmap.txt name_dump=dump.nice.txt
cat dump.nice.txt

@ -1 +1 @@
Subproject commit 0f8fd38bf94e6666aa367be80195b1f2da87428c (old)
Subproject commit 38ee75d95ff23e4e1febacc89e08975d9b6c6c3a (new)

View File

@ -23,7 +23,7 @@ This section contains users guides that are general across languages.
* [Installation Guide](build.md)
* [Introduction to Boosted Trees](model.md)
* [Distributed Training](../demo/distributed-training) → * [Distributed Training Tutorial](tutorial/aws_yarn.md)
* [Frequently Asked Questions](faq.md)
* [External Memory Version](external_memory.md)
* [Learning to use XGBoost by Example](../demo)

doc/tutorial/aws_yarn.md (new file, 187 lines)
View File

@ -0,0 +1,187 @@
Distributed XGBoost YARN on AWS
===============================
This is a step-by-step tutorial on how to set up and run distributed [XGBoost](https://github.com/dmlc/xgboost)
on an AWS EC2 cluster. Distributed XGBoost runs on various platforms such as MPI, SGE and Hadoop YARN.
In this tutorial, we use YARN as the example since it is a widely used solution for distributed computing.
Prerequisite
------------
We need an [AWS key-pair](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html)
to access the AWS services. Let us assume that we are using a key ```mykey``` and the corresponding permission file ```mypem.pem```.
We also need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html),
which include an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`.
Finally, we need an S3 bucket to host the data and the model, e.g. ```s3://mybucket/```; one way to create it is sketched below.
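If the bucket does not exist yet, one possible way to create it is with [s3cmd](http://s3tools.org/s3cmd), the same tool this tutorial later uses to copy data; the bucket name here is an example.

```bash
# one-time s3cmd setup: enter the ACCESS_KEY_ID and SECRET_ACCESS_KEY when prompted
s3cmd --configure
# create the bucket used throughout this tutorial (the name is an example)
s3cmd mb s3://mybucket
```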
Setup a Hadoop YARN Cluster
---------------------------
This section shows how to start a Hadoop YARN cluster from scratch.
You can skip this step if you already have one.
We will be using [yarn-ec2](https://github.com/tqchen/yarn-ec2) to start the cluster.
We first clone the yarn-ec2 script with the following command.
```bash
git clone https://github.com/tqchen/yarn-ec2
```
To use the script, we must set the environment variables `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` properly. This can be done by adding the following two lines to
`~/.bashrc` (replacing the strings with the correct ones)
```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
Now we can launch a master machine of the cluster from EC2
```bash
./yarn-ec2 -k mykey -i mypem.pem launch xgboost
```
Wait a few minutes until the master machine is up.
Once the master machine is up, we can query its public DNS with the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem get-master xgboost
```
It will show the public DNS of the master machine, e.g. ```ec2-xx-xx-xx.us-west-2.compute.amazonaws.com```
Now we can open a browser and type (replace the DNS with your master's DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
```
This will show the job tracker of the YARN cluster. Note that it may take a few minutes before the master finishes bootstrapping and starts the
job tracker; a quick command-line check is sketched below.
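If you prefer the command line, the ResourceManager exposes a REST endpoint on the same port, so a probe like the following sketch (the DNS name is a placeholder) tells you when it is ready.

```bash
# returns cluster info as JSON once the ResourceManager has started
# (replace the placeholder DNS with your master's DNS)
curl http://ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/ws/v1/cluster/info
```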
After the master machine is up, we can freely add more slave machines to the cluster.
The following command adds two m3.xlarge instances to the cluster.
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addslave xgboost
```
We can also choose to add two spot instances
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addspot xgboost
```
The slave machines will start up, bootstrap and report to the master.
You can check whether the slave machines are connected by clicking the Nodes link on the job tracker,
or simply typing the following URL (replace the DNS with your master's DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/cluster/nodes
```
One thing to note is that not all the links on the job tracker work.
This is because many of them use the private IP addresses of AWS, which can only be reached from inside EC2.
We can use an ssh proxy to access these pages; one possible setup is sketched below.
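A hedged sketch of an SSH SOCKS proxy through the master machine; the login user name depends on the AMI used by yarn-ec2, so treat ```ubuntu```, the port and the DNS below as assumptions to adjust.

```bash
# open a SOCKS proxy on local port 2001 through the master machine
ssh -i mypem.pem -N -D 2001 ubuntu@ec2-xx-xx-xx.us-west-2.compute.amazonaws.com
# then configure the browser to use the SOCKS5 proxy localhost:2001
```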
Now that we have set up a cluster with one master and two slaves, we are ready to run the experiment.
Build XGBoost with S3
---------------------
We can log into the master machine with the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem login xgboost
```
We will be using S3 to host the data and the resulting model, so the data won't be lost after the cluster shuts down.
To do so, we need to build xgboost with S3 support. The only thing we need to do is set the ```USE_S3```
variable to true. This can be achieved with the following commands.
```bash
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/config.mk config.mk
echo "USE_S3=1" >> config.mk
make -j4
```
Now we have built XGBoost with S3 support. You can also enable HDFS support if you plan to store data on HDFS, by turning on the ```USE_HDFS``` option.
XGBoost relies on environment variables to access S3, so you also need to add the following lines to `~/.bashrc` (replacing the strings with the correct ones)
on the master machine.
```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
export BUCKET=mybucket
```
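The new variables only take effect in shells that re-read the file, so after editing you may want to reload it; a minimal check:

```bash
# make the exported variables visible in the current shell session
source ~/.bashrc
# quick sanity check that the key is set
echo "$AWS_ACCESS_KEY_ID"
```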
Host the Data on S3
-------------------
In this example, we will copy the example dataset shipped with xgboost to the S3 bucket as input.
In normal use cases, the dataset is usually produced by an existing distributed processing pipeline.
We can use [s3cmd](http://s3tools.org/s3cmd) to copy the data into mybucket (replace ${BUCKET} with the real bucket name); a quick way to verify the upload is shown after the copy commands.
```bash
cd xgboost
s3cmd put demo/data/agaricus.txt.train s3://${BUCKET}/xgb-demo/train/
s3cmd put demo/data/agaricus.txt.test s3://${BUCKET}/xgb-demo/test/
```
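To verify that the two files landed in the expected folders before submitting the job, you can list them (bucket name as above):

```bash
# confirm that the training and test files are in place
s3cmd ls s3://${BUCKET}/xgb-demo/train/
s3cmd ls s3://${BUCKET}/xgb-demo/test/
```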
Submit the Jobs
---------------
Now that everything is ready, we can submit the xgboost distributed job to the YARN cluster.
We will use the [dmlc-submit](https://github.com/dmlc/dmlc-core/tree/master/tracker) script to submit the job.
Run the following script in the distributed-training folder (replace ${BUCKET} with the real bucket name).
```bash
cd xgboost/demo/distributed-training
# Use dmlc-submit to submit the job.
../../dmlc-core/tracker/dmlc-submit --cluster=yarn --num-workers=2 --worker-cores=2\
../../xgboost mushroom.aws.conf nthread=2\
data=s3://${BUCKET}/xgb-demo/train\
eval[test]=s3://${BUCKET}/xgb-demo/test\
model_dir=s3://${BUCKET}/xgb-demo/model
```
All the configurations, such as ```data``` and ```model_dir```, can also be written directly into the configuration file.
Note that we only specified the folder path, not a file name:
XGBoost will read all the files in that folder as the training and evaluation data.
In this command we are using two workers, and each worker uses two threads.
XGBoost can benefit from using multiple cores in each worker; a common choice of worker cores ranges from 4 to 8.
The trained model will be saved into the specified model folder. You can browse the model folder.
```
s3cmd ls s3://${BUCKET}/xgb-demo/model/
```
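To use the trained model outside the cluster, you can copy it down with s3cmd; the file name below assumes the two-round run from this tutorial, where the final model is saved as ```0002.model```.

```bash
# copy the final-round model to local disk (file name assumes num_round=2)
s3cmd get s3://${BUCKET}/xgb-demo/model/0002.model ./0002.model
```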
The following is an example output from distributed training.
```
16/02/26 05:41:59 INFO dmlc.Client: jobname=DMLC[nworker=2]:xgboost,username=ubuntu
16/02/26 05:41:59 INFO dmlc.Client: Submitting application application_1456461717456_0015
16/02/26 05:41:59 INFO impl.YarnClientImpl: Submitted application application_1456461717456_0015
2016-02-26 05:42:05,230 INFO @tracker All of 2 nodes getting started
2016-02-26 05:42:14,027 INFO [05:42:14] [0] test-error:0.016139 train-error:0.014433
2016-02-26 05:42:14,186 INFO [05:42:14] [1] test-error:0.000000 train-error:0.001228
2016-02-26 05:42:14,947 INFO @tracker All nodes finishes job
2016-02-26 05:42:14,948 INFO @tracker 9.71754479408 secs between node start and job finish
Application application_1456461717456_0015 finished with state FINISHED at 1456465335961
```
Analyze the Model
-----------------
After the model is trained, we can analyze the learnt model and use it for future prediction tasks.
XGBoost is a portable framework: the model is ***exchangeable*** across all platforms.
This means we can load the trained model in Python/R/Julia and take advantage of the data science pipelines
in these languages to do model analysis and prediction.
For example, you can use [this ipython notebook](https://github.com/dmlc/xgboost/tree/master/demo/distributed-training/plot_model.ipynb)
to plot the feature importance and visualize the learnt model.
Troubleshooting
---------------
If you encounter a problem, the best way to debug is to use the following command
to get the stdout and stderr logs of the containers and check what caused the problem.
```
yarn logs -applicationId yourAppId
```
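If you do not remember the application id, it is printed by dmlc-submit at submission time (see the example output above); you can also list it from the cluster, for example with the Hadoop 2.x yarn CLI:

```bash
# list recent YARN applications to find the application id
yarn application -list -appStates ALL
```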
Future Directions
-----------------
In this tutorial, you have learned how to run distributed XGBoost on YARN.
XGBoost is a portable and scalable framework for gradient boosting.
You can check out more examples and resources on the [resources page](https://github.com/dmlc/xgboost/blob/master/demo/README.md).
The project goal is to make the best scalable machine learning solution available on all platforms.
The API is designed to be portable, and the same code can also run on other platforms such as MPI and SGE.
XGBoost is actively evolving and we are working on even more exciting features,
such as a distributed xgboost python/R package. Check out the [roadmap](https://github.com/dmlc/xgboost/issues/873) for
more details, and you are more than welcome to contribute to the project.

View File

@ -880,11 +880,9 @@ class Booster(object):
        fname : string or a memory buffer
            Input file name or memory buffer(see also save_raw)
        """

Removed:
        if isinstance(fname, STRING_TYPES):  # assume file name
            if os.path.exists(fname):
                _LIB.XGBoosterLoadModel(self.handle, c_str(fname))
            else:
                raise ValueError("No such file: {0}".format(fname))

Added:
        if isinstance(fname, STRING_TYPES):
            # assume file name; cannot use os.path.exists to check, since the file can come from a URL such as an S3 path
            _check_call(_LIB.XGBoosterLoadModel(self.handle, c_str(fname)))

Unchanged:
        else:
            buf = fname
            length = ctypes.c_ulong(len(buf))