Merge pull request #495 from aeeilllmrx/master
minor spelling and grammar changes
commit bad4a27b9f
@@ -160,7 +160,7 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth
#### xgb.DMatrix
-**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be usefull for the most advanced features we will discover later.
+**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.
```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
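# Hedged illustration (not part of this diff): the "other meta data" mentioned
# above can be attached to the same xgb.DMatrix with setinfo()/getinfo(); the
# weight vector below is made up purely for the example.
w <- rep(1, nrow(train$data))        # one weight per training row
setinfo(dtrain, "weight", w)         # store the weights alongside data and label
head(getinfo(dtrain, "label"))       # any stored field can be read back the same way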
@@ -169,7 +169,7 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround
#### Verbose option
-**XGBoost** has severa features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
+**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).
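As a hedged sketch of the `verbose` levels described here (reusing `dtrain` and the parameters from the surrounding chunks; output omitted):

```r
# verbose = 0: silent training
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
               objective = "binary:logistic", verbose = 0)
# verbose = 1: print the evaluation metric after each round
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
               objective = "binary:logistic", verbose = 1)
# verbose = 2: additionally print information about the trees being built
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
               objective = "binary:logistic", verbose = 2)
```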
@@ -194,7 +194,7 @@ Basic prediction using XGBoost
Perform the prediction
----------------------
-The pupose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
+The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)
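# Hedged continuation (not part of this diff): pred holds probabilities for the
# binary:logistic objective, so a 0.5 cutoff (chosen here for illustration) turns
# them into class predictions and gives a simple test error.
prediction <- as.numeric(pred > 0.5)
err <- mean(prediction != test$label)
print(paste("test-error =", err))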
@@ -267,7 +267,7 @@ Measure learning progress with xgb.train
Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.
-One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following technics will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
+One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following techniques will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
One way to measure progress in learning of a model is to provide to **XGBoost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
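A hedged sketch of how that second, already classified dataset is supplied (a `dtest` matrix is assumed to wrap `test$data` the same way `dtrain` wraps the training data; the call mirrors the one shown in the next hunk):

```r
dtest <- xgb.DMatrix(data = test$data, label = test$label)
# the watchlist tells xgb.train which datasets to evaluate after each round
watchlist <- list(train = dtrain, test = dtest)
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
                 watchlist = watchlist, objective = "binary:logistic")
```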
@@ -285,7 +285,7 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli
Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
-If with your own dataset you have not such results, you should think about how you did to divide your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/splitting.html).
+If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/splitting.html).
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.
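A hedged sketch of using specific and multiple evaluation metrics (the R wrapper accepts repeated `eval.metric` arguments; the metrics chosen here are illustrative):

```r
# report both classification error and log-loss on every dataset in the watchlist
bst <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
                 watchlist = watchlist,
                 eval.metric = "error", eval.metric = "logloss",
                 objective = "binary:logistic")
```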
@@ -306,7 +306,7 @@ bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nr
In this specific case, *linear boosting* gets sligtly better performance metrics than decision trees based algorithm.
-In simple cases, it will happem because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
+In simple cases, it will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
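A hedged sketch of the "check both algorithms" advice (the object names `bst_tree` and `bst_linear` are made up for the comparison; parameters follow the calls shown in the hunk headers):

```r
bst_tree   <- xgb.train(data = dtrain, booster = "gbtree", max.depth = 2, eta = 1,
                        nthread = 2, nround = 2, watchlist = watchlist,
                        objective = "binary:logistic")
bst_linear <- xgb.train(data = dtrain, booster = "gblinear", nthread = 2, nround = 2,
                        watchlist = watchlist, objective = "binary:logistic")
# compare simple test errors of the two boosters
mean(as.numeric(predict(bst_tree, test$data) > 0.5) != test$label)
mean(as.numeric(predict(bst_linear, test$data) > 0.5) != test$label)
```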
Manipulating xgb.DMatrix
------------------------
@@ -368,7 +368,7 @@ xgb.plot.tree(model = bst)
Save and load models
--------------------
-May be your dataset is big, and it takes time to train a model on it? May be you are not a big fan of loosing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
+Maybe your dataset is big, and it takes time to train a model on it? May be you are not a big fan of losing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
Hopefully for you, **XGBoost** implements such functions.
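A hedged sketch of the save/load round trip this section and the next hunks describe (the file name is illustrative; `xgb.save` and `xgb.load` are the functions the vignette relies on):

```r
xgb.save(bst, "xgboost.model")        # write the model to a binary local file
bst2  <- xgb.load("xgboost.model")    # load it back
pred2 <- predict(bst2, test$data)
print(all.equal(pred, pred2))         # the reloaded model should predict identically
```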
@@ -379,7 +379,7 @@ xgb.save(bst, "xgboost.model")
> `xgb.save` function should return `r TRUE` if everything goes well and crashes otherwise.
-An interesting test to see how identic is our saved model with the original one would be to compare the two predictions.
+An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
```{r loadModel, message=F, warning=F}
# load binary model to R
doc/faq.md
@@ -1,6 +1,6 @@
-Frequent Asked Questions
+Frequently Asked Questions
========================
-This document contains the frequent asked question to xgboost.
+This document contains frequently asked questions about xgboost.
How to tune parameters
----------------------
@@ -13,7 +13,7 @@ See [Introduction to Boosted Trees](model.md)
I have a big dataset
--------------------
-XGBoost is designed to be memory efficient. Usually it could handle problems as long as the data fit into your memory
+XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory
(This usually means millions of instances).
If you are running out of memory, checkout [external memory version](external_memory.md) or
[distributed version](https://github.com/dmlc/wormhole/tree/master/learn/xgboost) of xgboost.
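As a loosely hedged illustration only: external_memory.md describes a cache-file syntax in which a libsvm file path is suffixed with `#` and a cache prefix. Assuming a build with external memory support, a call might look like the sketch below (the file names are made up):

```r
# hypothetical: stream a libsvm-format file through an on-disk cache instead of RAM
dtrain <- xgb.DMatrix("train.libsvm#dtrain.cache")
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2,
               objective = "binary:logistic")
```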
@@ -23,30 +23,30 @@ Running xgboost on Platform X (Hadoop/Yarn, Mesos)
--------------------------------------------------
The distributed version of XGBoost is designed to be portable to various environment.
Distributed XGBoost can be ported to any platform that supports [rabit](https://github.com/dmlc/rabit).
-You can directly run xgboost on Yarn. In theory Mesos and other resource allocation engine can be easily supported as well.
+You can directly run xgboost on Yarn. In theory Mesos and other resource allocation engines can be easily supported as well.
Why not implement distributed xgboost on top of X (Spark, Hadoop)
-----------------------------------------------------------------
The first fact we need to know is going distributed does not necessarily solve all the problems.
-Instead, it creates more problems such as more communication over head and fault tolerance.
-The ultimate question will still come back into how to push the limit of each computation node
+Instead, it creates more problems such as more communication overhead and fault tolerance.
+The ultimate question will still come back to how to push the limit of each computation node
and use less resources to complete the task (thus with less communication and chance of failure).
To achieve these, we decide to reuse the optimizations in the single node xgboost and build distributed version on top of it.
-The demand of communication in machine learning is rather simple, in a sense that we can depend on a limited set of API (in our case rabit).
-Such design allows us to reuse most of the code, and being portable to major platforms such as Hadoop/Yarn, MPI, SGE.
-Most importantly, pushs the limit of the computation resources we can use.
+The demand of communication in machine learning is rather simple, in the sense that we can depend on a limited set of API (in our case rabit).
+Such design allows us to reuse most of the code, while being portable to major platforms such as Hadoop/Yarn, MPI, SGE.
+Most importantly, it pushes the limit of the computation resources we can use.
How can I port the model to my own system
-----------------------------------------
-The model and data format of XGBoost is exchangable.
-Which means the model trained by one langauge can be loaded in another.
+The model and data format of XGBoost is exchangable,
+which means the model trained by one language can be loaded in another.
This means you can train the model using R, while running prediction using
-Java or C++, which are more common in production system.
-You can also train the model using distributed version,
-and load them in from python to do some interactive analysis.
+Java or C++, which are more common in production systems.
+You can also train the model using distributed versions,
+and load them in from Python to do some interactive analysis.
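A hedged sketch of the exchange described here, from the R side (file names illustrative): the binary file written by `xgb.save` is what the other language bindings load.

```r
# train in R, then hand the binary model file to a Python/Java/C++ deployment
xgb.save(bst, "xgboost.model")
# xgb.dump() additionally writes a human-readable text dump, useful if you want to
# inspect or re-implement the trees in your own system
xgb.dump(bst, "xgboost.dump.txt", with.stats = TRUE)
```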
Do you support LambdaMART
doc/model.md
@@ -2,29 +2,29 @@ Introduction to Boosted Trees
=============================
XGBoost is short for "Extreme Gradient Boosting", where the term "Gradient Boosting" is proposed in the paper _Greedy Function Approximation: A Gradient Boosting Machine_, Friedman. Based on this original model. This is a tutorial on boosted trees, most of content are based on this [slide](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf) by the author of xgboost.
-The GBM(boosted trees) has been around for really a while, and there are a lot of materials on the topic. This tutorial tries to explain boosted trees in a self-contained and principled way of supervised learning. We think this explaination is cleaner, more formal, and motivates the variant used in xgboost.
+The GBM(boosted trees) has been around for really a while, and there are a lot of materials on the topic. This tutorial tries to explain boosted trees in a self-contained and principled way of supervised learning. We think this explanation is cleaner, more formal, and motivates the variant used in xgboost.
Elements of Supervised Learning
-------------------------------
XGBoost is used for supervised learning problems, where we use the training data ``$ x_i $`` to predict a target variable ``$ y_i $``.
-Before we get dived into trees, let us start from reviwing the basic elements in supervised learning.
+Before we dive into trees, let us start by reviewing the basic elements in supervised learning.
### Model and Parameters
The ***model*** in supervised learning usually refers to the mathematical structure on how to given the prediction ``$ y_i $`` given ``$ x_i $``.
For example, a common model is *linear model*, where the prediction is given by ``$ \hat{y}_i = \sum_j w_j x_{ij} $``, a linear combination of weighted input features.
The prediction value can have different interpretations, depending on the task.
-For example, it can be logistic transformed to get the probability of postitive class in logistic regression, it can also be used as ranking score when we want to rank the outputs.
+For example, it can be logistic transformed to get the probability of positive class in logistic regression, and it can also be used as ranking score when we want to rank the outputs.
The ***parameters*** are the undermined part that we need to learn from data. In linear regression problem, the parameters are the co-efficients ``$ w $``.
Usually we will use ``$ \Theta $`` to denote the parameters.
-### Object Function : Training Loss + Regularization
+### Objective Function : Training Loss + Regularization
Based on different understanding or assumption of ``$ y_i $``, we can have different problems as regression, classification, ordering, etc.
We need to find a way to find the best parameters given the training data. In order to do so, we need to define a so called ***objective function***,
to measure the performance of the model under certain set of parameters.
-A very important about objective functions, is they ***must always*** contains two parts: training loss and regularization.
+A very important fact about objective functions, is they ***must always*** contains two parts: training loss and regularization.
```math
Obj(\Theta) = L(\Theta) + \Omega(\Theta)
@@ -42,8 +42,8 @@ Another commonly used loss function is logistic loss for logistic regression
L(\theta) = \sum_i[ y_i\ln (1+e^{-\hat{y}_i}) + (1-y_i)\ln (1+e^{\hat{y}_i})]
```
-The ***regularization term*** is usually people forget to add. The regularization term controls the complexity of the model, this helps us to avoid overfitting.
-This sounds a bit abstract, let us consider the following problem in the following picture. You are asked to *fit* visually a step function given the input data points
+The ***regularization term*** is what people usually forget to add. The regularization term controls the complexity of the model, which helps us to avoid overfitting.
+This sounds a bit abstract, so let us consider the following problem in the following picture. You are asked to *fit* visually a step function given the input data points
on the upper left corner of the image, which solution among the tree you think is the best fit?

@@ -55,12 +55,12 @@ The tradeoff between the two is also referred as bias-variance tradeoff in machi
### Why introduce the general principle
The elements introduced in above forms the basic elements of supervised learning, and they are naturally the building blocks of machine learning toolkits.
For example, you should be able to answer what is the difference and common parts between boosted trees and random forest.
-Understanding the process in a formalized way also helps us to understand the objective what we are learning and getting the reason behind the heurestics such as
+Understanding the process in a formalized way also helps us to understand the objective that we are learning and the reason behind the heurestics such as
pruning and smoothing.
Tree Ensemble
-------------
-Now we have introduce the elements of supervised learning, let us getting started with real trees.
+Now that we have introduced the elements of supervised learning, let us get started with real trees.
To begin with, let us first learn what is the ***model*** of xgboost: tree ensembles.
The tree ensemble model is a set of classification and regression trees (CART). Here's a simple example of a CART
that classifies is someone will like computer games.
@@ -69,17 +69,17 @@ that classifies is someone will like computer games.
We classify the members in thie family into different leaves, and assign them the score on corresponding leaf.
A CART is a bit different from decision trees, where the leaf only contain decision values. In CART, a real score
-is associated with each of the leaves, this allows gives us richer interpretations that go beyond classification.
+is associated with each of the leaves, which gives us richer interpretations that go beyond classification.
This also makes the unified optimization step easier, as we will see in later part of this tutorial.
Usually, a single tree is not so strong enough to be used in practice. What is actually used is the so called
-tree ensemble model, that sumes the prediction of multiple trees together.
+tree ensemble model, that sums the prediction of multiple trees together.

Here is an example of tree ensemble of two trees. The prediction scores of each individual tree are summed up to get the final score.
-If you look at the example, an important fact is that the two trees tries to *complement* each other.
-Mathematically, we can write our model into the form
+If you look at the example, an important fact is that the two trees try to *complement* each other.
+Mathematically, we can write our model in the form
```math
\hat{y}_i = \sum_{k=1}^K f_k(x_i), f_k \in \mathcal{F}
@@ -219,7 +219,7 @@ This formula can be decomposited as 1) the score on the new left leaf 2) the sco
We can find an important fact here: if the gain is smaller than ``$\gamma$``, we would better not to add that branch. This is exactly the ***prunning*** techniques in tree based
models! By using the principles of supervised learning, we can naturally comes up with the reason these techniques :)
-For real valued data, we usually want to search for an optimal split. To efficiently doing so, we place all the instances in a sorted way, like the following picture.
+For real valued data, we usually want to search for an optimal split. To efficiently do so, we place all the instances in a sorted way, like the following picture.

Then a left to right scan is sufficient to calculate the structure score of all possible split solutions, and we can find the best split efficiently.
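For reference, the gain formula this passage decomposes and that the scan evaluates at each candidate split (reproduced here as a hedged aid; ``$ G $`` and ``$ H $`` are the sums of first- and second-order gradient statistics on each side of the split):

```math
Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma
```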