Correcting small typos in documentation. (#1901)

This commit is contained in:
Kyle Willett
2016-12-31 04:47:51 -08:00
committed by Tianqi Chen
parent f5c85836bf
commit 7e07b2b93d
5 changed files with 58 additions and 58 deletions


@@ -1,8 +1,8 @@
Distributed XGBoost YARN on AWS
===============================
This is a step-by-step tutorial on how to setup and run distributed [XGBoost](https://github.com/dmlc/xgboost)
on an AWS EC2 cluster. Distributed XGBoost runs on various platforms such as MPI, SGE and Hadoop YARN.
In this tutorial, we use YARN as an example since this is a widely used solution for distributed computing.
Prerequisite
------------
@@ -38,21 +38,21 @@ Now we can launch a master machine of the cluster from EC2
```bash
./yarn-ec2 -k mykey -i mypem.pem launch xgboost
```
Wait a few minutes until the master machine gets up.
After the master machine gets up, we can query the public DNS of the master machine using the following command.
```bash
./yarn-ec2 -k mykey -i mypem.pem get-master xgboost
```
It will show the public DNS of the master machine like ```ec2-xx-xx-xx.us-west-2.compute.amazonaws.com```
Now we can open the browser, and type (replace the DNS with the master DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088
```
This will show the job tracker of the YARN cluster. Note that we may have to wait a few minutes before the master finishes bootstrapping and starts the
job tracker.
After the master machine gets up, we can freely add more slave machines to the cluster.
The following command adds two m3.xlarge instances to the cluster.
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addslave xgboost
@@ -61,17 +61,17 @@ We can also choose to add two spot instances
```bash
./yarn-ec2 -k mykey -i mypem.pem -t m3.xlarge -s 2 addspot xgboost
```
The slave machines will start up, bootstrap and report to the master.
You can check if the slave machines are connected by clicking on the Nodes link on the job tracker.
Or simply type the following URL (replace DNS with the master DNS)
```
ec2-xx-xx-xx.us-west-2.compute.amazonaws.com:8088/cluster/nodes
```
One thing we should note is that not all the links in the job tracker work.
This is because many of them use private AWS IPs, which can only be accessed from within EC2.
We can use an SSH proxy to access these pages.
Now that we have set up a cluster with one master and two slaves, we are ready to run the experiment.
Build XGBoost with S3
@@ -92,7 +92,7 @@ cp make/config.mk config.mk
echo "USE_S3=1" >> config.mk
make -j4
```
Now we have built XGBoost with S3 support. You can also enable HDFS support if you plan to store data on HDFS by turning on the ```USE_HDFS``` option.
XGBoost also relies on environment variables to access S3, so you will need to add the following two lines to `~/.bashrc` (replacing the strings with the correct ones)
on the master machine as well.
@@ -107,7 +107,7 @@ Host the Data on S3
-------------------
In this example, we will copy the example dataset in xgboost to the S3 bucket as input.
In normal use cases, the dataset is usually created from an existing distributed processing pipeline.
We can use [s3cmd](http://s3tools.org/s3cmd) to copy the data into mybucket (replace ${BUCKET} with the real bucket name).
```bash
cd xgboost
@@ -120,7 +120,7 @@ Submit the Jobs
Now that everything is ready, we can submit the XGBoost distributed job to the YARN cluster.
We will use the [dmlc-submit](https://github.com/dmlc/dmlc-core/tree/master/tracker) script to submit the job.
Now we can run the following script in the distributed training folder (replace ${BUCKET} with the real bucket name)
```bash
cd xgboost/demo/distributed-training
# Use dmlc-submit to submit the job.
@@ -134,7 +134,7 @@ All the configurations such as ```data``` and ```model_dir``` can also be direct
Note that we only specified the folder path to the file, instead of the file name.
XGBoost will read in all the files in that folder as the training and evaluation data.
In this command, we are using two workers, and each worker uses two running threads.
XGBoost can benefit from using multiple cores in each worker.
A common choice of working cores can range from 4 to 8.
The trained model will be saved into the specified model folder, which you can browse.
@@ -157,28 +157,28 @@ Application application_1456461717456_0015 finished with state FINISHED at 14564
Analyze the Model
-----------------
After the model is trained, we can analyse the learnt model and use it for future prediction tasks.
XGBoost is a portable framework, meaning models trained on any platform are ***exchangeable***.
This means we can load the trained model in Python/R/Julia and take advantage of data science pipelines
in these languages to do model analysis and prediction.
For example, you can use [this IPython notebook](https://github.com/dmlc/xgboost/tree/master/demo/distributed-training/plot_model.ipynb)
to plot feature importance and visualize the learnt model.
Troubleshooting
----------------
If you encounter a problem, the best way might be to use the following command
to get logs of stdout and stderr of the containers and check what causes the problem.
```
yarn logs -applicationId yourAppId
```
Future Directions
-----------------
You have learned to use distributed XGBoost on YARN in this tutorial.
XGBoost is a portable and scalable framework for gradient boosting.
You can check out more examples and resources in the [resources page](https://github.com/dmlc/xgboost/blob/master/demo/README.md).
The project goal is to make the best scalable machine learning solution available to all platforms.
The API is designed to be portable, and the same code can also run on other platforms such as MPI and SGE.


@@ -1,9 +1,9 @@
DART booster
============
[XGBoost](https://github.com/dmlc/xgboost) mostly combines a huge number of regression trees with a small learning rate.
In this situation, trees added early are significant and trees added late are unimportant.
Rashmi et al. proposed a new method to add dropout techniques from the deep neural net community to boosted trees, and reported better results in some situations.
This is an introduction to the new tree booster `dart`.
@@ -16,15 +16,15 @@ Features
- Drop trees in order to address over-fitting.
- Trivial trees (to correct trivial errors) may be prevented.
Because of the randomness introduced in the training, expect the following few differences:
- Training can be slower than `gbtree` because the random dropout prevents usage of the prediction buffer.
- Early stopping might not be stable, due to the randomness.
How it works
------------
- In the ``$ m $``-th training round, suppose ``$ k $`` trees are selected to be dropped.
- Let ``$ D = \sum_{i \in \mathbf{K}} F_i $`` be the leaf scores of the dropped trees and ``$ F_m = \eta \tilde{F}_m $`` be the leaf scores of a new tree.
- The objective function is as follows:
```math
\mathrm{Obj}
= \sum_{j=1}^n L \left( y_j, \hat{y}_j^{m-1} - D_j + \tilde{F}_m \right)


@@ -1,7 +1,7 @@
# XGBoost Tutorials
This section contains official tutorials inside the XGBoost package.
See [Awesome XGBoost](https://github.com/dmlc/xgboost/tree/master/demo) for links to more resources.
## Contents
- [Introduction to Boosted Trees](../model.md)


@@ -1,7 +1,7 @@
Monotonic Constraints
=====================
It is often the case in a modeling problem or project that the functional form of an acceptable model is constrained in some way. This may happen due to business considerations, or because of the type of scientific question being investigated. In some cases, where there is a very strong prior belief that the true relationship has some quality, constraints can be used to improve the predictive performance of the model.
A common type of constraint in this situation is that certain features bear a *monotonic* relationship to the predicted response:
@@ -9,13 +9,13 @@ A common type of constraint in this situation is that certain features bear a *m
f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \leq f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)
```
whenever ``$ x \leq x' $`` (an *increasing constraint*); or
```math
f(x_1, x_2, \ldots, x, \ldots, x_{n-1}, x_n) \geq f(x_1, x_2, \ldots, x', \ldots, x_{n-1}, x_n)
```
whenever ``$ x \leq x' $`` (a *decreasing constraint*).
XGBoost has the ability to enforce monotonicity constraints on any features used in a boosted model.
@@ -38,7 +38,7 @@ Let's fit a boosted tree model to this data without imposing any monotonic const
![Fit of Model with No Constraint](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/monotonic/two.feature.no.constraint.png)
The black curve shows the trend inferred from the model for each feature. To make these plots, the distinguished feature ``$x_i$`` is fed to the model over a one-dimensional grid of values, while all the other features (in this case only one other feature) are set to their average values. We see that the model does a good job of capturing the general trend with the oscillatory wave superimposed.
Here is the same model, but fit with monotonicity constraints:
@@ -70,7 +70,7 @@ model_with_constraints = xgb.train(params_constrained, dtrain,
early_stopping_rounds = 10)
```
In this example the training data ```X``` has two columns, and by using the parameter values ```(1,-1)``` we are telling XGBoost to impose an increasing constraint on the first predictor and a decreasing constraint on the second.
Some other examples: