[DOC] Update R doc
parent e7d8ed71d6
commit 8e7f2679d5
@ -3,6 +3,12 @@ R package for xgboost
|
||||
|
||||
[](http://cran.r-project.org/web/packages/xgboost)
|
||||
[](http://cran.rstudio.com/web/packages/xgboost/index.html)
|
||||
[](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
|
||||
|
||||
Resources
|
||||
---------
|
||||
* [XGBoost R Package Online Documentation](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
|
||||
- Check this out for detailed documents, examples and tutorials.
|
||||
|
||||
Installation
|
||||
------------
|
||||
@ -24,21 +30,3 @@ Examples
|
||||
|
||||
* Please visit [walk through example](demo).
|
||||
* See also the [example scripts](../demo/kaggle-higgs) for Kaggle Higgs Challenge, including [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset and the one related to [Otto challenge](../demo/kaggle-otto), including a [RMarkdown documentation](../demo/kaggle-otto/understandingXGBoostModel.Rmd).
|
||||
|
||||
Notes
|
||||
-----
|
||||
|
||||
If you face an issue installing the package using ```devtools::install_github```, with an error like this (even after updating libxml and RCurl as many forums suggest) -
|
||||
|
||||
```
|
||||
devtools::install_github('dmlc/xgboost',subdir='R-package')
|
||||
Downloading github repo dmlc/xgboost@master
|
||||
Error in function (type, msg, asError = TRUE) :
|
||||
Peer certificate cannot be authenticated with given CA certificates
|
||||
```
|
||||
To get around this you can build the package locally as mentioned [here](https://github.com/dmlc/xgboost/issues/347) -
|
||||
```
|
||||
1. Clone the current repository and set your workspace to xgboost/R-package/
|
||||
2. Run R CMD INSTALL --build . in terminal to get the tarball.
|
||||
3. Run install.packages('path_to_the_tarball', repos=NULL) in R to install.
|
||||
```
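For illustration only, step 3 in an R session could look like the following (the tarball name is hypothetical; use whatever file `R CMD INSTALL --build .` produced):

```r
# Hypothetical tarball name; replace with the file produced by `R CMD INSTALL --build .`
install.packages('xgboost_0.4-0.tar.gz', repos = NULL)
```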
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
---
|
||||
title: "Understand your dataset with Xgboost"
|
||||
output:
|
||||
rmarkdown::html_vignette:
|
||||
css: vignette.css
|
||||
number_sections: yes
|
||||
@ -12,8 +12,11 @@ vignette: >
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
Understand your dataset with XGBoost
|
||||
====================================
|
||||
|
||||
Introduction
|
||||
============
|
||||
------------
|
||||
|
||||
The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
|
||||
|
||||
@ -25,16 +28,16 @@ Pacakge loading:
|
||||
require(xgboost)
|
||||
require(Matrix)
|
||||
require(data.table)
|
||||
if (!require('vcd')) install.packages('vcd')
|
||||
```
|
||||
|
||||
> **VCD** package is used for one of its embedded dataset only.
|
||||
|
||||
Preparation of the dataset
|
||||
==========================
|
||||
--------------------------
|
||||
|
||||
### Numeric VS categorical variables
|
||||
|
||||
Numeric VS categorical variables
|
||||
--------------------------------
|
||||
|
||||
**Xgboost** manages only `numeric` vectors.
|
||||
|
||||
@ -48,10 +51,9 @@ A *categorical* variable has a fixed number of different values. For instance, i
|
||||
|
||||
To answer the question above we will convert *categorical* variables to `numeric` one.
|
||||
|
||||
Conversion from categorical to numeric variables
|
||||
------------------------------------------------
|
||||
### Conversion from categorical to numeric variables
|
||||
|
||||
### Looking at the raw data
|
||||
#### Looking at the raw data
|
||||
|
||||
In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
|
||||
|
||||
@ -85,11 +87,11 @@ str(df)
|
||||
> * can take a limited number of values (like `factor`) ;
|
||||
> * these values are ordered (unlike `factor`). Here these ordered values are: `Marked > Some > None`
|
||||
|
||||
### Creation of new features based on old ones
|
||||
#### Creation of new features based on old ones
|
||||
|
||||
We will add some new *categorical* features to see if it helps.
|
||||
|
||||
#### Grouping per 10 years
|
||||
##### Grouping per 10 years
|
||||
|
||||
For the first feature we create groups of age by rounding the real age.
|
||||
|
||||
@ -101,7 +103,7 @@ Therefore, 20 is not closer to 30 than 60. To make it short, the distance betwee
|
||||
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
|
||||
```
|
||||
|
||||
#### Random split in two groups
|
||||
##### Random split in two groups
|
||||
|
||||
Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).
|
||||
|
||||
@ -109,15 +111,15 @@ Following is an even stronger simplification of the real age with an arbitrary s
|
||||
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
|
||||
```
|
||||
|
||||
#### Risks in adding correlated features
|
||||
##### Risks in adding correlated features
|
||||
|
||||
These new features are highly correlated to the `Age` feature because they are simple transformations of this feature.
|
||||
|
||||
For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make prediction less accurate, and most of the time make interpretation of the model almost impossible. GLM, for instance, assumes that the features are uncorrelated.
|
||||
|
||||
Fortunately, decision tree algorithms (including boosted trees) are very robust to these features. Therefore we have nothing to do to manage this situation.
|
||||
|
||||
#### Cleaning data
|
||||
##### Cleaning data
|
||||
|
||||
We remove ID as there is nothing to learn from this feature (it would just add some noise).
|
||||
|
||||
@ -132,7 +134,7 @@ levels(df[,Treatment])
|
||||
```
|
||||
|
||||
|
||||
### One-hot encoding
|
||||
#### One-hot encoding
|
||||
|
||||
Next step, we will transform the categorical data to dummy variables.
|
||||
This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.
|
||||
@ -156,12 +158,12 @@ Create the output `numeric` vector (not as a sparse `Matrix`):
|
||||
output_vector = df[,Improved] == "Marked"
|
||||
```
|
||||
|
||||
1. set `Y` vector to `0`;
2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ;
|
||||
3. return `Y` vector.
|
||||
|
||||
Build the model
|
||||
===============
|
||||
---------------
|
||||
|
||||
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
|
||||
|
||||
@ -173,17 +175,17 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
|
||||
|
||||
You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.
|
||||
|
||||
A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and won't be as good at predicting the future).
|
||||
|
||||
> Here you can see the numbers decrease until line 7 and then increase.
|
||||
>
|
||||
> It probably means we are overfitting. To fix that I should reduce the number of rounds to `nround = 4`. I will let things like that because I don't really care for the purpose of this example :-)
|
||||
|
||||
Feature importance
|
||||
==================
|
||||
------------------
|
||||
|
||||
## Measure feature importance
|
||||
|
||||
Measure feature importance
|
||||
--------------------------
|
||||
|
||||
### Build the feature importance data.table
|
||||
|
||||
@ -204,7 +206,7 @@ head(importance)
|
||||
|
||||
`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
|
||||
|
||||
### Improvement in the interpretability of feature importance data.table
|
||||
#### Improvement in the interpretability of feature importance data.table
|
||||
|
||||
We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features counts to predict if the illness will go or not. But we don't yet know the role of these features. For instance, one of the question we may want to answer would be: does receiving a placebo treatment helps to recover from the illness?
|
||||
|
||||
@ -233,8 +235,8 @@ Therefore, according to our findings, getting a placebo doesn't seem to help but
|
||||
|
||||
> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.
|
||||
|
||||
Plotting the feature importance
|
||||
-------------------------------
|
||||
### Plotting the feature importance
|
||||
|
||||
|
||||
All these things are nice, but it would be even better to plot the results.
|
||||
|
||||
@ -250,11 +252,11 @@ According to the plot above, the most important features in this dataset to pred
|
||||
|
||||
* the Age ;
|
||||
* having received a placebo or not ;
|
||||
* the sex is third but already included in the not interesting features group ;
|
||||
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.
|
||||
|
||||
Do these results make sense?
|
||||
------------------------------
|
||||
### Do these results make sense?
|
||||
|
||||
|
||||
Let's check some **Chi2** between each of these features and the label.
|
||||
|
||||
@ -279,18 +281,18 @@ c2 <- chisq.test(df$AgeCat, output_vector)
|
||||
print(c2)
|
||||
```
|
||||
|
||||
The perfectly random split I did between young and old at 30 years old has a low correlation of **`r round(c2$statistic, 2)`**. It's a result we could expect: maybe in my mind being over 30 is being old (I am 32 and starting to feel old, which may explain it), but for the illness we are studying, the vulnerable age is not the same.
|
||||
|
||||
Moral of the story: don't let your *gut* lower the quality of your model.
|
||||
|
||||
In *data science* expression, there is the word *science* :-)
|
||||
|
||||
Conclusion
|
||||
==========
|
||||
----------
|
||||
|
||||
As you can see, in general *destroying information by simplifying it won't improve your model*. **Chi2** just demonstrates that.
|
||||
|
||||
But in more complex cases, creating a new feature from an existing one that makes the link with the outcome more obvious may help the algorithm and improve the model.
|
||||
|
||||
The case studied here is not complex enough to show that. Check the [Kaggle website](http://www.kaggle.com/) for some challenging datasets. However, it almost always hurts when you add arbitrary rules.
|
||||
|
||||
@ -299,7 +301,7 @@ Moreover, you can notice that even if we have added some not useful new features
|
||||
Linear model may not be that smart in this scenario.
|
||||
|
||||
Special Note: What about Random Forests™?
|
||||
==========================================
|
||||
-----------------------------------------
|
||||
|
||||
As you may know, [Random Forests™](http://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.
|
||||
|
||||
@ -313,7 +315,7 @@ However, in Random Forests™ this random choice will be done for each tree, bec
|
||||
|
||||
In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature have an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them.
|
||||
|
||||
If you want to try Random Forests™ algorithm, you can tweak Xgboost parameters!
|
||||
|
||||
**Warning**: this is still an experimental parameter.
|
||||
|
||||
|
||||
@ -13,8 +13,11 @@ vignette: >
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
Introduction
|
||||
============
|
||||
XGBoost R Tutorial
|
||||
==================
|
||||
|
||||
## Introduction
|
||||
|
||||
|
||||
**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.
|
||||
|
||||
@ -40,16 +43,16 @@ It has several features:
|
||||
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
|
||||
* Customization: it supports customized objective functions and evaluation functions.
|
||||
|
||||
Installation
|
||||
============
|
||||
## Installation
|
||||
|
||||
|
||||
### Github version
|
||||
|
||||
Github version
|
||||
--------------
|
||||
|
||||
For up-to-date version (highly recommended), install from *Github*:
|
||||
|
||||
```{r installGithub, eval=FALSE}
|
||||
devtools::install_github('dmlc/xgboost', subdir='R-package')
|
||||
devtools::install_git('git://github.com/dmlc/xgboost', subdir='R-package')
|
||||
```
|
||||
|
||||
> *Windows* user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
|
||||
@ -61,8 +64,8 @@ As of 2015-03-13, ‘xgboost’ was removed from the CRAN repository.
|
||||
|
||||
Formerly available versions can be obtained from the CRAN [archive](http://cran.r-project.org/src/contrib/Archive/xgboost)
|
||||
|
||||
Learning
|
||||
========
|
||||
## Learning
|
||||
|
||||
|
||||
For the purpose of this tutorial we will load **XGBoost** package.
|
||||
|
||||
@ -70,15 +73,15 @@ For the purpose of this tutorial we will load **XGBoost** package.
|
||||
require(xgboost)
|
||||
```
|
||||
|
||||
Dataset presentation
|
||||
--------------------
|
||||
### Dataset presentation
|
||||
|
||||
|
||||
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as what you will use in your everyday life :-).
|
||||
|
||||
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
|
||||
|
||||
Dataset loading
|
||||
---------------
|
||||
### Dataset loading
|
||||
|
||||
|
||||
We will load the `agaricus` datasets embedded with the package and will link them to variables.
|
||||
|
||||
@ -124,12 +127,12 @@ class(train$data)[1]
|
||||
class(train$label)
|
||||
```
|
||||
|
||||
Basic Training using XGBoost
|
||||
----------------------------
|
||||
### Basic Training using XGBoost
|
||||
|
||||
|
||||
This step is the most critical part of the process for the quality of our model.
|
||||
|
||||
### Basic training
|
||||
#### Basic training
|
||||
|
||||
We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.
|
||||
|
||||
@ -148,9 +151,9 @@ bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta
|
||||
|
||||
> The more complex the relationship between your features and your `label` is, the more passes you need.
|
||||
|
||||
### Parameter variations
|
||||
#### Parameter variations
|
||||
|
||||
#### Dense matrix
|
||||
##### Dense matrix
|
||||
|
||||
Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.
|
||||
|
||||
@ -158,7 +161,7 @@ Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R**
|
||||
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
#### xgb.DMatrix
|
||||
##### xgb.DMatrix
|
||||
|
||||
**XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.
|
||||
|
||||
@ -167,7 +170,7 @@ dtrain <- xgb.DMatrix(data = train$data, label = train$label)
|
||||
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
#### Verbose option
|
||||
##### Verbose option
|
||||
|
||||
**XGBoost** has several features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which is key to the quality of your model.
|
||||
|
||||
@ -188,11 +191,11 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, o
|
||||
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
|
||||
```
|
||||
|
||||
Basic prediction using XGBoost
|
||||
==============================
|
||||
## Basic prediction using XGBoost
|
||||
|
||||
|
||||
## Perform the prediction
|
||||
|
||||
Perform the prediction
|
||||
----------------------
|
||||
|
||||
The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.
|
||||
|
||||
@ -208,8 +211,8 @@ print(head(pred))
|
||||
|
||||
These numbers don't look like *binary classification* `{0,1}`. We need to perform a simple transformation before we can use these results.
|
||||
|
||||
Transform the regression in a binary classification
|
||||
---------------------------------------------------
|
||||
## Transform the regression in a binary classification
|
||||
|
||||
|
||||
The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model.
|
||||
|
||||
@ -222,8 +225,8 @@ prediction <- as.numeric(pred > 0.5)
|
||||
print(head(prediction))
|
||||
```
|
||||
|
||||
Measuring model performance
|
||||
---------------------------
|
||||
## Measuring model performance
|
||||
|
||||
|
||||
To measure the model performance, we will compute a simple metric, the *average error*.
|
||||
|
||||
@ -246,14 +249,14 @@ The most important thing to remember is that **to do a classification, you just
|
||||
|
||||
This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
|
||||
|
||||
Advanced features
|
||||
=================
|
||||
## Advanced features
|
||||
|
||||
|
||||
Most of the features below have been implemented to help you to improve your model by offering a better understanding of its content.
|
||||
|
||||
|
||||
Dataset preparation
|
||||
-------------------
|
||||
### Dataset preparation
|
||||
|
||||
|
||||
For the following advanced features, we need to put data in `xgb.DMatrix` as explained above.
|
||||
|
||||
@ -262,8 +265,8 @@ dtrain <- xgb.DMatrix(data = train$data, label=train$label)
|
||||
dtest <- xgb.DMatrix(data = test$data, label=test$label)
|
||||
```
|
||||
|
||||
Measure learning progress with xgb.train
|
||||
----------------------------------------
|
||||
### Measure learning progress with xgb.train
|
||||
|
||||
|
||||
Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.
|
||||
|
||||
@ -295,8 +298,8 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli
|
||||
|
||||
> `eval.metric` allows us to monitor two new metrics for each round, `logloss` and `error`.
|
||||
|
||||
Linear boosting
|
||||
---------------
|
||||
### Linear boosting
|
||||
|
||||
|
||||
Until now, all the learning we have done was based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and removing the `eta` parameter).
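The corresponding call falls outside the lines shown in this hunk; as a rough sketch (assuming the `dtrain`, `dtest` and `watchlist` objects defined earlier in the vignette), it only swaps the booster:

```r
# Linear boosting: same setup as the tree-based call, but booster = "gblinear" and no eta
bst <- xgb.train(data = dtrain, booster = "gblinear", nthread = 2, nround = 2,
                 watchlist = watchlist, eval.metric = "error",
                 eval.metric = "logloss", objective = "binary:logistic")
```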
|
||||
|
||||
@ -308,10 +311,10 @@ In this specific case, *linear boosting* gets sligtly better performance metrics
|
||||
|
||||
In simple cases, it will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.
|
||||
|
||||
Manipulating xgb.DMatrix
|
||||
------------------------
|
||||
### Manipulating xgb.DMatrix
|
||||
|
||||
### Save / Load
|
||||
|
||||
#### Save / Load
|
||||
|
||||
Like saving models, `xgb.DMatrix` object (which groups both dataset and outcome) can also be saved using `xgb.DMatrix.save` function.
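The round trip itself is elided from this hunk; a minimal sketch, consistent with the `dtrain2` and `file.remove("dtrain.buffer")` lines shown in the next hunk:

```r
xgb.DMatrix.save(dtrain, "dtrain.buffer")  # write the xgb.DMatrix to disk
dtrain2 <- xgb.DMatrix("dtrain.buffer")    # reload it from the buffer file
```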
|
||||
|
||||
@ -326,7 +329,7 @@ bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nthread = 2, nround=2, watchl
|
||||
file.remove("dtrain.buffer")
|
||||
```
|
||||
|
||||
### Information extraction
|
||||
#### Information extraction
|
||||
|
||||
Information can be extracted from `xgb.DMatrix` using `getinfo` function. Hereafter we will extract `label` data.
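The call itself is not shown in this hunk; it is a one-liner (a sketch consistent with the `label` object used in the error computation below):

```r
label <- getinfo(dtest, "label")  # extract the label vector stored in the xgb.DMatrix
```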
|
||||
|
||||
@ -337,8 +340,8 @@ err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
|
||||
print(paste("test-error=", err))
|
||||
```
|
||||
|
||||
View feature importance/influence from the learnt model
|
||||
-------------------------------------------------------
|
||||
### View feature importance/influence from the learnt model
|
||||
|
||||
|
||||
Feature importance is similar to R gbm package's relative influence (rel.inf).
|
||||
|
||||
@ -348,8 +351,8 @@ print(importance_matrix)
|
||||
xgb.plot.importance(importance_matrix = importance_matrix)
|
||||
```
|
||||
|
||||
View the trees from a model
|
||||
---------------------------
|
||||
#### View the trees from a model
|
||||
|
||||
|
||||
You can dump the tree you learned using `xgb.dump` into a text file.
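As a sketch, the dump call looks like this (the `fname` argument is the one mentioned in the note a few lines below):

```r
xgb.dump(bst, fname = "xgb.model.dump")  # write a text description of the trees
```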
|
||||
|
||||
@ -365,8 +368,8 @@ xgb.plot.tree(model = bst)
|
||||
|
||||
> if you provide a path to `fname` parameter you can save the trees to your hard drive.
|
||||
|
||||
Save and load models
|
||||
--------------------
|
||||
#### Save and load models
|
||||
|
||||
|
||||
Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
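The save/load code is not shown in the hunks here; a minimal sketch, consistent with the `pred2` comparison in the last hunk:

```r
xgb.save(bst, "xgboost.model")      # save the booster to disk
bst2  <- xgb.load("xgboost.model")  # load it back
pred2 <- predict(bst2, test$data)   # predictions from the reloaded model
```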
|
||||
|
||||
@ -416,5 +419,4 @@ print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred))))
|
||||
|
||||
> Again `0`? It seems that `XGBoost` works pretty well!
|
||||
|
||||
References
|
||||
==========
|
||||
## References
|
||||
|
||||
@ -17,8 +17,8 @@ Contents
|
||||
--------
|
||||
* [Documentation and Tutorials](https://xgboost.readthedocs.org)
|
||||
* [Code Examples](demo)
|
||||
* [Build Instruction](doc/build.md)
|
||||
* [Committers and Contributors](CONTRIBUTORS.md)
|
||||
* [Installation](doc/build.md)
|
||||
* [Contribute to XGBoost](http://xgboost.readthedocs.org/en/latest/dev-guide/contribute.html)
|
||||
|
||||
What's New
|
||||
----------
|
||||
|
||||
1
doc/.gitignore
vendored
@ -5,3 +5,4 @@ _*
|
||||
doxygen
|
||||
parser.py
|
||||
*.pyc
|
||||
web-data
|
||||
|
||||
1
doc/R-package/.gitignore
vendored
Normal file
@ -0,0 +1 @@
|
||||
*~
|
||||
14
doc/R-package/Makefile
Normal file
@ -0,0 +1,14 @@
|
||||
# This is the makefile for compiling Rmarkdown files into the md file with results.
|
||||
PKGROOT=../../R-package
|
||||
|
||||
# ADD The Markdown to be built here, with suffix md
|
||||
discoverYourData.md: $(PKGROOT)/vignettes/discoverYourData.Rmd
|
||||
xgboostPresentation.md: $(PKGROOT)/vignettes/xgboostPresentation.Rmd
|
||||
|
||||
# General Rules for build rmarkdowns, need knitr
|
||||
%.md:
|
||||
Rscript -e \
|
||||
"require(knitr);"\
|
||||
"knitr::opts_knit\$$set(root.dir=\".\");"\
|
||||
"knitr::opts_chunk\$$set(fig.path=\"../web-data/xgboost/knitr/$(basename $@)-\");"\
|
||||
"knitr::knit(\"$+\")"
|
||||
484
doc/R-package/discoverYourData.md
Normal file
@ -0,0 +1,484 @@
|
||||
---
|
||||
title: "Understand your dataset with Xgboost"
|
||||
output:
|
||||
rmarkdown::html_vignette:
|
||||
css: vignette.css
|
||||
number_sections: yes
|
||||
toc: yes
|
||||
author: Tianqi Chen, Tong He, Michaël Benesty
|
||||
vignette: >
|
||||
%\VignetteIndexEntry{Discover your data}
|
||||
%\VignetteEngine{knitr::rmarkdown}
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
Understand your dataset with XGBoost
|
||||
====================================
|
||||
|
||||
Introduction
|
||||
------------
|
||||
|
||||
The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
|
||||
|
||||
This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
|
||||
|
||||
Package loading:
|
||||
|
||||
|
||||
```r
|
||||
require(xgboost)
|
||||
require(Matrix)
|
||||
require(data.table)
|
||||
if (!require('vcd')) install.packages('vcd')
|
||||
```
|
||||
|
||||
> **VCD** package is used for one of its embedded dataset only.
|
||||
|
||||
Preparation of the dataset
|
||||
--------------------------
|
||||
|
||||
### Numeric VS categorical variables
|
||||
|
||||
|
||||
**Xgboost** manages only `numeric` vectors.
|
||||
|
||||
What to do when you have *categorical* data?
|
||||
|
||||
A *categorical* variable has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.
|
||||
|
||||
> In **R**, a *categorical* variable is called `factor`.
|
||||
>
|
||||
> Type `?factor` in the console for more information.
|
||||
|
||||
To answer the question above we will convert *categorical* variables to `numeric` one.
|
||||
|
||||
### Conversion from categorical to numeric variables
|
||||
|
||||
#### Looking at the raw data
|
||||
|
||||
In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
|
||||
|
||||
The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
|
||||
|
||||
The first step is to load `Arthritis` dataset in memory and wrap it with `data.table` package.
|
||||
|
||||
|
||||
```r
|
||||
data(Arthritis)
|
||||
df <- data.table(Arthritis, keep.rownames = F)
|
||||
```
|
||||
|
||||
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large datasets is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of the **Xgboost** **R** package use `data.table`.
|
||||
|
||||
The first thing we want to do is to have a look to the first lines of the `data.table`:
|
||||
|
||||
|
||||
```r
|
||||
head(df)
|
||||
```
|
||||
|
||||
```
|
||||
## ID Treatment Sex Age Improved
|
||||
## 1: 57 Treated Male 27 Some
|
||||
## 2: 46 Treated Male 29 None
|
||||
## 3: 77 Treated Male 30 None
|
||||
## 4: 17 Treated Male 32 Marked
|
||||
## 5: 36 Treated Male 46 Marked
|
||||
## 6: 23 Treated Male 58 Marked
|
||||
```
|
||||
|
||||
Now we will check the format of each column.
|
||||
|
||||
|
||||
```r
|
||||
str(df)
|
||||
```
|
||||
|
||||
```
|
||||
## Classes 'data.table' and 'data.frame': 84 obs. of 5 variables:
|
||||
## $ ID : int 57 46 77 17 36 23 75 39 33 55 ...
|
||||
## $ Treatment: Factor w/ 2 levels "Placebo","Treated": 2 2 2 2 2 2 2 2 2 2 ...
|
||||
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
|
||||
## $ Age : int 27 29 30 32 46 58 59 59 63 63 ...
|
||||
## $ Improved : Ord.factor w/ 3 levels "None"<"Some"<..: 2 1 1 3 3 3 1 3 1 1 ...
|
||||
## - attr(*, ".internal.selfref")=<externalptr>
|
||||
```
|
||||
|
||||
2 columns have `factor` type, one has `ordinal` type.
|
||||
|
||||
> `ordinal` variable :
|
||||
>
|
||||
> * can take a limited number of values (like `factor`) ;
|
||||
> * these values are ordered (unlike `factor`). Here these ordered values are: `Marked > Some > None`
|
||||
|
||||
#### Creation of new features based on old ones
|
||||
|
||||
We will add some new *categorical* features to see if it helps.
|
||||
|
||||
##### Grouping per 10 years
|
||||
|
||||
For the first feature we create groups of age by rounding the real age.
|
||||
|
||||
Note that we transform it to `factor` so the algorithm treats these age groups as independent values.
|
||||
|
||||
Therefore, 20 is not closer to 30 than 60. To make it short, the distance between ages is lost in this transformation.
|
||||
|
||||
|
||||
```r
|
||||
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
|
||||
```
|
||||
|
||||
```
|
||||
## ID Treatment Sex Age Improved AgeDiscret
|
||||
## 1: 57 Treated Male 27 Some 3
|
||||
## 2: 46 Treated Male 29 None 3
|
||||
## 3: 77 Treated Male 30 None 3
|
||||
## 4: 17 Treated Male 32 Marked 3
|
||||
## 5: 36 Treated Male 46 Marked 5
|
||||
## 6: 23 Treated Male 58 Marked 6
|
||||
```
|
||||
|
||||
##### Random split in two groups
|
||||
|
||||
Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).
|
||||
|
||||
|
||||
```r
|
||||
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
|
||||
```
|
||||
|
||||
```
|
||||
## ID Treatment Sex Age Improved AgeDiscret AgeCat
|
||||
## 1: 57 Treated Male 27 Some 3 Young
|
||||
## 2: 46 Treated Male 29 None 3 Young
|
||||
## 3: 77 Treated Male 30 None 3 Young
|
||||
## 4: 17 Treated Male 32 Marked 3 Old
|
||||
## 5: 36 Treated Male 46 Marked 5 Old
|
||||
## 6: 23 Treated Male 58 Marked 6 Old
|
||||
```
|
||||
|
||||
##### Risks in adding correlated features
|
||||
|
||||
These new features are highly correlated to the `Age` feature because they are simple transformations of this feature.
|
||||
|
||||
For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make prediction less accurate, and most of the time make interpretation of the model almost impossible. GLM, for instance, assumes that the features are uncorrelated.
|
||||
|
||||
Fortunately, decision tree algorithms (including boosted trees) are very robust to these features. Therefore we have nothing to do to manage this situation.
|
||||
|
||||
##### Cleaning data
|
||||
|
||||
We remove ID as there is nothing to learn from this feature (it would just add some noise).
|
||||
|
||||
|
||||
```r
|
||||
df[,ID:=NULL]
|
||||
```
|
||||
|
||||
We will list the different values for the column `Treatment`:
|
||||
|
||||
|
||||
```r
|
||||
levels(df[,Treatment])
|
||||
```
|
||||
|
||||
```
|
||||
## [1] "Placebo" "Treated"
|
||||
```
|
||||
|
||||
|
||||
#### One-hot encoding
|
||||
|
||||
Next step, we will transform the categorical data to dummy variables.
|
||||
This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.
|
||||
|
||||
The purpose is to transform each value of each *categorical* feature in a *binary* feature `{0, 1}`.
|
||||
|
||||
For example, the column `Treatment` will be replaced by two columns, `Placebo`, and `Treated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will have after the transformation the value `1` in the new column `Placebo` and the value `0` in the new column `Treated`. The column `Treatment` will disappear during the one-hot encoding.
|
||||
|
||||
Column `Improved` is excluded because it will be our `label` column, the one we want to predict.
|
||||
|
||||
|
||||
```r
|
||||
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
|
||||
head(sparse_matrix)
|
||||
```
|
||||
|
||||
```
|
||||
## 6 x 10 sparse Matrix of class "dgCMatrix"
|
||||
##
|
||||
## 1 . 1 1 27 1 . . . . 1
|
||||
## 2 . 1 1 29 1 . . . . 1
|
||||
## 3 . 1 1 30 1 . . . . 1
|
||||
## 4 . 1 1 32 1 . . . . .
|
||||
## 5 . 1 1 46 . . 1 . . .
|
||||
## 6 . 1 1 58 . . . 1 . .
|
||||
```
|
||||
|
||||
> Formulae `Improved~.-1` used above means transform all *categorical* features but column `Improved` to binary values. The `-1` is here to remove the first column which is full of `1` (this column is generated by the conversion). For more information, you can type `?sparse.model.matrix` in the console.
|
||||
|
||||
Create the output `numeric` vector (not as a sparse `Matrix`):
|
||||
|
||||
|
||||
```r
|
||||
output_vector = df[,Improved] == "Marked"
|
||||
```
|
||||
|
||||
1. set `Y` vector to `0`;
|
||||
2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ;
|
||||
3. return `Y` vector.
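As a sketch (not part of the original vignette), the one-liner above does essentially the same as building the vector explicitly:

```r
Y <- rep(0, nrow(df))               # 1. set Y vector to 0
Y[df[, Improved] == "Marked"] <- 1  # 2. set Y to 1 where Improved == Marked
output_vector <- Y                  # 3. return Y vector
```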
|
||||
|
||||
Build the model
|
||||
---------------
|
||||
|
||||
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
|
||||
|
||||
|
||||
```r
|
||||
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
|
||||
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## [0] train-error:0.202381
|
||||
## [1] train-error:0.166667
|
||||
## [2] train-error:0.166667
|
||||
## [3] train-error:0.166667
|
||||
## [4] train-error:0.154762
|
||||
## [5] train-error:0.154762
|
||||
## [6] train-error:0.154762
|
||||
## [7] train-error:0.166667
|
||||
## [8] train-error:0.166667
|
||||
## [9] train-error:0.166667
|
||||
```
|
||||
|
||||
You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.
|
||||
|
||||
A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and won't be as good at predicting the future).
|
||||
|
||||
> Here you can see the numbers decrease until line 7 and then increase.
|
||||
>
|
||||
> It probably means we are overfitting. To fix that I should reduce the number of rounds to `nround = 4`. I will leave it as is because it doesn't really matter for the purpose of this example :-)
|
||||
|
||||
Feature importance
|
||||
------------------
|
||||
|
||||
## Measure feature importance
|
||||
|
||||
|
||||
### Build the feature importance data.table
|
||||
|
||||
In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature).
|
||||
|
||||
|
||||
```r
|
||||
importance <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst)
|
||||
head(importance)
|
||||
```
|
||||
|
||||
```
|
||||
## Feature Gain Cover Frequency
|
||||
## 1: Age 0.622031651 0.67251706 0.67241379
|
||||
## 2: TreatmentPlacebo 0.285750607 0.11916656 0.10344828
|
||||
## 3: SexMale 0.048744054 0.04522027 0.08620690
|
||||
## 4: AgeDiscret6 0.016604647 0.04784637 0.05172414
|
||||
## 5: AgeDiscret3 0.016373791 0.08028939 0.05172414
|
||||
## 6: AgeDiscret4 0.009270558 0.02858801 0.01724138
|
||||
```
|
||||
|
||||
> The column `Gain` provide the information we are looking for.
|
||||
>
|
||||
> As you can see, features are classified by `Gain`.
|
||||
|
||||
`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite).
|
||||
|
||||
`Cover` measures the relative quantity of observations concerned by a feature.
|
||||
|
||||
`Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
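As a quick, illustrative check (not in the original vignette), you can reorder the table by `Frequency` and compare it with the ranking by `Gain`:

```r
# data.table syntax: sort the importance table by decreasing Frequency
importance[order(-Frequency)]
```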
|
||||
|
||||
#### Improvement in the interpretability of feature importance data.table
|
||||
|
||||
We can go deeper into the analysis of the model. In the `data.table` above, we have discovered which features count when predicting whether the illness will go away or not. But we don't yet know the role of these features. For instance, one of the questions we may want to answer is: does receiving a placebo treatment help to recover from the illness?
|
||||
|
||||
One simple solution is to count the co-occurrences of a feature and a class of the classification.
|
||||
|
||||
For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.
|
||||
|
||||
|
||||
```r
|
||||
importanceRaw <- xgb.importance(feature_names = sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
|
||||
|
||||
# Cleaning for better display
|
||||
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequency=NULL)]
|
||||
|
||||
head(importanceClean)
|
||||
```
|
||||
|
||||
```
|
||||
## Feature Split Gain RealCover RealCover %
|
||||
## 1: TreatmentPlacebo -1.00136e-05 0.28575061 7 0.2500000
|
||||
## 2: Age 61.5 0.16374034 12 0.4285714
|
||||
## 3: Age 39 0.08705750 8 0.2857143
|
||||
## 4: Age 57.5 0.06947553 11 0.3928571
|
||||
## 5: SexMale -1.00136e-05 0.04874405 4 0.1428571
|
||||
## 6: Age 53.5 0.04620627 10 0.3571429
|
||||
```
|
||||
|
||||
> In the table above we have removed two unneeded columns and kept only the first lines.
|
||||
|
||||
The first thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Each split is present, therefore a feature can appear several times in this table. Here we can see that the feature `Age` is used several times with different splits.
|
||||
|
||||
How is the split applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years whose illness went away after the treatment.
|
||||
|
||||
The two other new columns are `RealCover` and `RealCover %`. The first one measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.
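To make this concrete, here is an illustrative check (assuming the `df` and `output_vector` objects built earlier) that reproduces `RealCover` for the `Age` split at 61.5 by hand:

```r
# Observations under the 61.5 split whose label is 1; should match RealCover of that row
sum(df$Age < 61.5 & output_vector)
```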
|
||||
|
||||
Therefore, according to our findings, getting a placebo doesn't seem to help but being younger than 61 years may help (seems logic).
|
||||
|
||||
> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.
|
||||
|
||||
### Plotting the feature importance
|
||||
|
||||
|
||||
All these things are nice, but it would be even better to plot the results.
|
||||
|
||||
|
||||
```r
|
||||
xgb.plot.importance(importance_matrix = importanceRaw)
|
||||
```
|
||||
|
||||
```
|
||||
## Error in xgb.plot.importance(importance_matrix = importanceRaw): Importance matrix is not correct (column names issue)
|
||||
```
|
||||
|
||||
Features have automatically been divided into 2 clusters: the interesting features... and the others.
|
||||
|
||||
> Depending on the dataset and the learning parameters, you may have more than two clusters. The default is to limit them to `10`, but you can increase this limit. Look at the function documentation for more information.
|
||||
|
||||
According to the plot above, the most important features in this dataset to predict if the treatment will work are :
|
||||
|
||||
* the Age ;
|
||||
* having received a placebo or not ;
|
||||
* the sex is third but already included in the not interesting features group ;
|
||||
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.
|
||||
|
||||
### Do these results make sense?
|
||||
|
||||
|
||||
Let's check some **Chi2** between each of these features and the label.
|
||||
|
||||
Higher **Chi2** means better correlation.
|
||||
|
||||
|
||||
```r
|
||||
c2 <- chisq.test(df$Age, output_vector)
|
||||
print(c2)
|
||||
```
|
||||
|
||||
```
|
||||
##
|
||||
## Pearson's Chi-squared test
|
||||
##
|
||||
## data: df$Age and output_vector
|
||||
## X-squared = 35.475, df = 35, p-value = 0.4458
|
||||
```
|
||||
|
||||
The Pearson correlation between Age and the illness disappearing is **35.48**.
|
||||
|
||||
|
||||
```r
|
||||
c2 <- chisq.test(df$AgeDiscret, output_vector)
|
||||
print(c2)
|
||||
```
|
||||
|
||||
```
|
||||
##
|
||||
## Pearson's Chi-squared test
|
||||
##
|
||||
## data: df$AgeDiscret and output_vector
|
||||
## X-squared = 8.2554, df = 5, p-value = 0.1427
|
||||
```
|
||||
|
||||
Our first simplification of Age gives a Pearson correlation of **8.26**.
|
||||
|
||||
|
||||
```r
|
||||
c2 <- chisq.test(df$AgeCat, output_vector)
|
||||
print(c2)
|
||||
```
|
||||
|
||||
```
|
||||
##
|
||||
## Pearson's Chi-squared test with Yates' continuity correction
|
||||
##
|
||||
## data: df$AgeCat and output_vector
|
||||
## X-squared = 2.3571, df = 1, p-value = 0.1247
|
||||
```
|
||||
|
||||
The perfectly random split I did between young and old at 30 years old has a low correlation of **2.36**. It's a result we could expect: maybe in my mind being over 30 is being old (I am 32 and starting to feel old, which may explain it), but for the illness we are studying, the vulnerable age is not the same.
|
||||
|
||||
Moral of the story: don't let your *gut* lower the quality of your model.
|
||||
|
||||
In the expression *data science*, there is the word *science* :-)
|
||||
|
||||
Conclusion
|
||||
----------
|
||||
|
||||
As you can see, in general *destroying information by simplifying it won't improve your model*. **Chi2** just demonstrates that.
|
||||
|
||||
But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model.
|
||||
|
||||
The case studied here is not complex enough to show that. Check the [Kaggle website](http://www.kaggle.com/) for some challenging datasets. However, it almost always hurts when you add arbitrary rules.
|
||||
|
||||
Moreover, you can notice that even though we have added some useless new features highly correlated with other features, the boosted tree algorithm has still been able to choose the best one, which in this case is Age.
|
||||
|
||||
Linear model may not be that smart in this scenario.
|
||||
|
||||
Special Note: What about Random Forests™?
|
||||
-----------------------------------------
|
||||
|
||||
As you may know, the [Random Forests™](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting, and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.
|
||||
|
||||
Both train several decision trees for one dataset. The *main* difference is that in Random Forests™ the trees are independent, while in boosting tree `N+1` focuses its learning on the loss (that is, on what has not been well modeled by tree `N`).
|
||||
|
||||
This difference has an impact on a corner case in feature importance analysis: the *correlated features*.
|
||||
|
||||
Imagine two features perfectly correlated, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests™).
|
||||
|
||||
However, in Random Forests™ this random choice is made for each tree, because each tree is independent from the others. Therefore, approximately, depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. You won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...
|
||||
|
||||
In boosting, when a specific link between a feature and the outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated to the one detected as important, if you need to know all of them.
|
||||
|
||||
If you want to try Random Forests™ algorithm, you can tweak Xgboost parameters!
|
||||
|
||||
**Warning**: this is still an experimental parameter.
|
||||
|
||||
For instance, to compute a model with 1000 trees, with a 0.5 factor on sampling rows and columns:
|
||||
|
||||
|
||||
```r
|
||||
data(agaricus.train, package='xgboost')
|
||||
data(agaricus.test, package='xgboost')
|
||||
train <- agaricus.train
|
||||
test <- agaricus.test
|
||||
|
||||
#Random Forest™ - 1000 trees
|
||||
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## [0] train-error:0.002150
|
||||
```
|
||||
|
||||
```r
|
||||
#Boosting - 3 rounds
|
||||
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## [0] train-error:0.006142
|
||||
## [1] train-error:0.006756
|
||||
## [2] train-error:0.001228
|
||||
```
|
||||
|
||||
> Note that the parameter `nround` is set to `1`.
|
||||
|
||||
> [**Random Forests™**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
|
||||
17
doc/R-package/index.md
Normal file
@ -0,0 +1,17 @@
|
||||
XGBoost R Package
|
||||
=================
|
||||
[](http://cran.r-project.org/web/packages/xgboost)
|
||||
[](http://cran.rstudio.com/web/packages/xgboost/index.html)
|
||||
|
||||
|
||||
You have found the XGBoost R Package!
|
||||
|
||||
Get Started
|
||||
-----------
|
||||
* Check out the [Installation Guide](../build.md) for instructions to install xgboost, and the [Tutorials](#tutorials) for examples of how to use xgboost for various tasks.
|
||||
* Please visit [walk through example](demo).
|
||||
|
||||
Tutorials
|
||||
---------
|
||||
- [Introduction to XGBoost in R](xgboostPresentation.md)
|
||||
- [Discover your data with XGBoost in R](discoverYourData.md)
|
||||
590
doc/R-package/xgboostPresentation.md
Normal file
@ -0,0 +1,590 @@
|
||||
---
|
||||
title: "Xgboost presentation"
|
||||
output:
|
||||
rmarkdown::html_vignette:
|
||||
css: vignette.css
|
||||
number_sections: yes
|
||||
toc: yes
|
||||
bibliography: xgboost.bib
|
||||
author: Tianqi Chen, Tong He, Michaël Benesty
|
||||
vignette: >
|
||||
%\VignetteIndexEntry{Xgboost presentation}
|
||||
%\VignetteEngine{knitr::rmarkdown}
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
XGBoost R Tutorial
|
||||
==================
|
||||
|
||||
## Introduction
|
||||
|
||||
|
||||
**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.
|
||||
|
||||
The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions.
|
||||
|
||||
It is an efficient and scalable implementation of gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
|
||||
|
||||
- *linear* model ;
|
||||
- *tree learning* algorithm.
|
||||
|
||||
It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objective functions easily.
|
||||
|
||||
It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.
|
||||
|
||||
It has several features:
|
||||
|
||||
* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
|
||||
* Input Type: it takes several types of input data:
|
||||
* *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
|
||||
* *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
|
||||
* Data File: local data files ;
|
||||
* `xgb.DMatrix`: its own class (recommended).
|
||||
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
|
||||
* Customization: it supports customized objective functions and evaluation functions.
|
||||
|
||||
## Installation
|
||||
|
||||
|
||||
### Github version
|
||||
|
||||
|
||||
For up-to-date version (highly recommended), install from *Github*:
|
||||
|
||||
|
||||
```r
|
||||
devtools::install_git('git://github.com/dmlc/xgboost', subdir='R-package')
|
||||
```
|
||||
|
||||
> *Windows* user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
|
||||
|
||||
Cran version
|
||||
------------
|
||||
|
||||
As of 2015-03-13, ‘xgboost’ was removed from the CRAN repository.
|
||||
|
||||
Formerly available versions can be obtained from the CRAN [archive](http://cran.r-project.org/src/contrib/Archive/xgboost)
|
||||
|
||||
## Learning
|
||||
|
||||
|
||||
For the purpose of this tutorial we will load **XGBoost** package.
|
||||
|
||||
|
||||
```r
|
||||
require(xgboost)
|
||||
```
|
||||
|
||||
### Dataset presentation
|
||||
|
||||
|
||||
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as what you will use in your everyday life :-).
|
||||
|
||||
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
|
||||
|
||||
### Dataset loading
|
||||
|
||||
|
||||
We will load the `agaricus` datasets embedded with the package and will link them to variables.
|
||||
|
||||
The datasets are already split in:
|
||||
|
||||
* `train`: will be used to build the model ;
|
||||
* `test`: will be used to assess the quality of our model.
|
||||
|
||||
Why *split* the dataset in two parts?
|
||||
|
||||
In the first part we will build our model. In the second part we will test it and assess its quality. Without dividing the dataset, we would test the model on data which the algorithm has already seen.
|
||||
|
||||
|
||||
```r
|
||||
data(agaricus.train, package='xgboost')
|
||||
data(agaricus.test, package='xgboost')
|
||||
train <- agaricus.train
|
||||
test <- agaricus.test
|
||||
```
|
||||
|
||||
> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is out of the purpose of this article, however `caret` package may [help](http://topepo.github.io/caret/splitting.html).
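A minimal sketch of such a split with `caret` (the `full_data` and `full_label` objects are hypothetical placeholders for a complete, unsplit dataset):

```r
library(caret)
set.seed(42)
idx         <- createDataPartition(full_label, p = 0.8, list = FALSE)  # 80/20 split
train_data  <- full_data[idx, ];  train_label <- full_label[idx]
test_data   <- full_data[-idx, ]; test_label  <- full_label[-idx]
```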
|
||||
|
||||
Each variable is a `list` containing two things, `label` and `data`:
|
||||
|
||||
|
||||
```r
|
||||
str(train)
|
||||
```
|
||||
|
||||
```
|
||||
## List of 2
|
||||
## $ data :
|
||||
```
|
||||
|
||||
```
|
||||
## Error in str.default(obj, ...): could not find function "is"
|
||||
```
|
||||
|
||||
`label` is the outcome of our dataset meaning it is the binary *classification* we will try to predict.
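For instance (illustrative, not part of the original tutorial), you can look at the class balance of the label:

```r
table(train$label)              # counts of 0s and 1s in the training label
prop.table(table(train$label))  # the same, as proportions
```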
|
||||
|
||||
Let's discover the dimensionality of our datasets.
|
||||
|
||||
|
||||
```r
|
||||
dim(train$data)
|
||||
```
|
||||
|
||||
```
|
||||
## [1] 6513 126
|
||||
```
|
||||
|
||||
```r
|
||||
dim(test$data)
|
||||
```
|
||||
|
||||
```
|
||||
## [1] 1611 126
|
||||
```
|
||||
|
||||
This dataset is kept very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.
|
||||
|
||||
As seen below, the `data` are stored in a `dgCMatrix` which is a *sparse* matrix and `label` vector is a `numeric` vector (`{0,1}`):
|
||||
|
||||
|
||||
```r
|
||||
class(train$data)[1]
|
||||
```
|
||||
|
||||
```
|
||||
## [1] "dgCMatrix"
|
||||
```
|
||||
|
||||
```r
|
||||
class(train$label)
|
||||
```
|
||||
|
||||
```
|
||||
## [1] "numeric"
|
||||
```
|
||||
|
||||
### Basic Training using XGBoost
|
||||
|
||||
|
||||
This step is the most critical part of the process for the quality of our model.
|
||||
|
||||
#### Basic training
|
||||
|
||||
We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.
|
||||
|
||||
In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.
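As a rough illustration (not in the original tutorial), you can compare the memory footprint of a dense and a sparse representation of the same mostly-zero matrix:

```r
library(Matrix)
m <- matrix(0, nrow = 1000, ncol = 1000)
m[1, 1] <- 1
object.size(m)                         # dense storage keeps every cell
object.size(Matrix(m, sparse = TRUE))  # sparse storage keeps only non-zero cells
```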
|
||||
|
||||
We will train decision tree model using the following parameters:
|
||||
|
||||
* `objective = "binary:logistic"`: we will train a binary classification model ;
|
||||
* `max.depth = 2`: the trees won't be deep, because our case is very simple ;
|
||||
* `nthread = 2`: the number of cpu threads we are going to use;
|
||||
* `nround = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.
|
||||
|
||||
|
||||
```r
|
||||
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## [0] train-error:0.046522
|
||||
## [1] train-error:0.022263
|
||||
```
|
||||
|
||||
> The more complex the relationship between your features and your `label` is, the more passes you need.
|
||||
|
||||
#### Parameter variations
|
||||
|
||||
##### Dense matrix
|
||||
|
||||
Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.
|
||||
|
||||
|
||||
```r
|
||||
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## Error in as.vector(data): no method for coercing this S4 class to a vector
|
||||
```
|
||||
|
||||
##### xgb.DMatrix
|
||||
|
||||
**XGBoost** offers a way to group them in an `xgb.DMatrix`. You can even add other metadata to it. This will be useful for the more advanced features we will discover later.
|
||||
|
||||
|
||||
```r
|
||||
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
|
||||
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
```
|
||||
## [0] train-error:0.046522
|
||||
## [1] train-error:0.022263
|
||||
```
|
||||
|
||||
##### Verbose option
|
||||
|
||||
**XGBoost** has several features to help you to view how the learning progress internally. The purpose is to help you to set the best parameters, which is the key of your model quality.
|
||||
|
||||
One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).
|
||||
|
||||
|
||||
```r
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 0)
```

```r
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 1)
```

```
## [0] train-error:0.046522
## [1] train-error:0.022263
```

```r
# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
```

```
## [11:43:20] ../..//amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 6 extra nodes, 0 pruned nodes, max_depth=2
## [0] train-error:0.046522
## [11:43:20] ../..//amalgamation/../src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 4 extra nodes, 0 pruned nodes, max_depth=2
## [1] train-error:0.022263
```

## Basic prediction using XGBoost

### Perform the prediction

The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

```r
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))
```

```
## [1] 1611
```

```r
# limit display of predictions to the first few
print(head(pred))
```

```
## [1] 0.28583017 0.92392391 0.28583017 0.28583017 0.05169873 0.92392391
```

These numbers do not look like a *binary classification* `{0,1}`. We need to perform a simple transformation before being able to use these results.

## Transform the regression in a binary classification

The only thing that **XGBoost** does is a *regression*. **XGBoost** uses the `label` vector to build its *regression* model.

How can we use a *regression* model to perform a binary classification?

If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observation is classified as `1` (and `0` otherwise).

```r
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
```

```
## [1] 0 1 0 0 0 1
```

## Measuring model performance

To measure the model performance, we will compute a simple metric, the *average error*.

```r
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

```
## [1] "test-error= 0.0217256362507759"
```

> Note that the algorithm has not seen the `test` data during the model construction.

Steps explanation:

1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1` and `0` otherwise;
2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between true data and computed probabilities;
3. `mean(vectorOfErrors)` computes the *average error* itself.

The most important thing to remember is that **to do a classification, you just do a regression on the** `label` **and then apply a threshold**.

*Multiclass* classification works in a similar way (see the sketch at the end of this section).

This metric is **0.02** and is pretty low: our yummy mushroom model works well!

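Below is a hedged sketch of the multiclass case, which is not part of the original walkthrough. It uses the built-in `iris` data and assumes the `multi:softmax` objective together with a `num_class` parameter, with labels coded from `0` to `num_class - 1`.

```r
# Sketch only: multiclass classification with the same interface.
data(iris)
iris_label <- as.numeric(iris$Species) - 1      # classes must be coded 0, 1, 2
iris_feat  <- as.matrix(iris[, 1:4])

mbst <- xgboost(data = iris_feat, label = iris_label, max.depth = 2, eta = 1,
                nthread = 2, nround = 5, objective = "multi:softmax",
                num_class = 3, verbose = 0)

head(predict(mbst, iris_feat))                  # predictions are class indices
```
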
## Advanced features

Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.

### Dataset preparation

For the following advanced features, we need to put the data in an `xgb.DMatrix`, as explained above.

```r
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
```

### Measure learning progress with xgb.train

Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting and stop the learning as early as possible, which also saves training time.

One way to measure progress in the learning of a model is to provide **XGBoost** with a second dataset that is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

> In some way it is similar to what we have done above with the average error. The main difference is that above the error was computed after building the model, whereas here it is measured during the construction.

For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```r
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

```
## [0] train-error:0.046522 test-error:0.042831
## [1] train-error:0.022263 test-error:0.021726
```

**XGBoost** has computed at each round the same average error metric seen above (we set `nround` to 2, which is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training- and test-error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. There may be something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/splitting.html).

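If you prefer not to depend on `caret`, here is a minimal sketch, not in the original document, of a random split done in base **R**; the 80/20 ratio is an arbitrary choice:

```r
# Illustration only: random 80/20 split of the training data.
set.seed(123)
n   <- nrow(train$data)
idx <- sample(n, size = floor(0.8 * n))

dtrain_part <- xgb.DMatrix(data = train$data[idx, ],  label = train$label[idx])
dvalid_part <- xgb.DMatrix(data = train$data[-idx, ], label = train$label[-idx])
```
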
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

```r
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

```
## [0] train-error:0.046522 train-logloss:0.233376 test-error:0.042831 test-logloss:0.226686
## [1] train-error:0.022263 train-logloss:0.136658 test-error:0.021726 test-logloss:0.137874
```

> `eval.metric` allows us to monitor two new metrics for each round, `logloss` and `error`.

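Going one step further, here is a hedged sketch, not taken from the original text, of two related options, assuming your version of the package supports them: a custom evaluation function passed through the `feval` argument (it receives the predictions and the `xgb.DMatrix` and returns a named metric), and an `early.stop.round` argument that stops training when the watched metric has not improved for that many rounds.

```r
# Hedged sketch: custom evaluation metric plus early stopping (argument names may
# differ across versions).
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  # note: depending on the version and objective, `preds` may be probabilities
  # or raw margin scores; adjust the threshold accordingly.
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "custom-error", value = err)
}

bstCustom <- xgb.train(data = dtrain, max.depth = 2, eta = 1, nthread = 2,
                       nround = 10, watchlist = watchlist,
                       feval = evalerror, maximize = FALSE,
                       early.stop.round = 3,
                       objective = "binary:logistic")
```
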
### Linear boosting

Until now, all the learning we have performed was based on boosted trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```r
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nthread = 2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

```
## [0] train-error:0.019499 train-logloss:0.176561 test-error:0.018001 test-logloss:0.173835
## [1] train-error:0.004760 train-logloss:0.068214 test-error:0.003104 test-logloss:0.065493
```

In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree based algorithm.

In simple cases this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own dataset to get an idea of what to use.

### Manipulating xgb.DMatrix

#### Save / Load

Like saving models, the `xgb.DMatrix` object (which groups both dataset and outcome) can also be saved, using the `xgb.DMatrix.save` function.

```r
xgb.DMatrix.save(dtrain, "dtrain.buffer")
```

```
## [1] TRUE
```

```r
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
```

```
## [11:43:20] 6513x126 matrix with 143286 entries loaded from dtrain.buffer
```

```r
bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nthread = 2, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

```
## [0] train-error:0.046522 test-error:0.042831
## [1] train-error:0.022263 test-error:0.021726
```

#### Information extraction

Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.

```r
label = getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
```

```
## [1] "test-error= 0.0217256362507759"
```

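The counterpart, assuming your version exposes it, is `setinfo`, which attaches or overwrites a piece of information such as `label` or per-row `weight` on an existing `xgb.DMatrix`. A small sketch, not from the original text:

```r
# Sketch only: attach observation weights to the training DMatrix.
weights <- rep(1, nrow(train$data))   # uniform weights, purely illustrative
setinfo(dtrain, "weight", weights)

# read the information back
head(getinfo(dtrain, "weight"))
```
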
### View feature importance/influence from the learnt model

Feature importance is similar to the R **gbm** package's relative influence (`rel.inf`).

```r
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
```

#### View the trees from a model

You can dump the tree you learned using `xgb.dump` into a text file.

```r
xgb.dump(bst, with.stats = T)
```

```
## [1] "booster[0]"
## [2] "0:[f28<-1.00136e-05] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25"
## [3] "1:[f55<-1.00136e-05] yes=3,no=4,missing=3,gain=1158.21,cover=924.5"
## [4] "3:leaf=1.71218,cover=812"
## [5] "4:leaf=-1.70044,cover=112.5"
## [6] "2:[f108<-1.00136e-05] yes=5,no=6,missing=5,gain=198.174,cover=703.75"
## [7] "5:leaf=-1.94071,cover=690.5"
## [8] "6:leaf=1.85965,cover=13.25"
## [9] "booster[1]"
## [10] "0:[f59<-1.00136e-05] yes=1,no=2,missing=1,gain=832.545,cover=788.852"
## [11] "1:[f28<-1.00136e-05] yes=3,no=4,missing=3,gain=569.725,cover=768.39"
## [12] "3:leaf=0.784718,cover=458.937"
## [13] "4:leaf=-0.96853,cover=309.453"
## [14] "2:leaf=-6.23624,cover=20.4624"
```

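To write this dump to disk rather than print it, you can, assuming the `fname` argument of `xgb.dump` is available in your version, pass a file path:

```r
# Sketch: write the dump to a text file instead of printing it.
xgb.dump(bst, fname = "xgb.model.dump", with.stats = TRUE)
```
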
You can plot the trees from your model using `xgb.plot.tree`:

```r
xgb.plot.tree(model = bst)
```

> If you provide a path to the `fname` parameter you can save the trees to your hard drive.

#### Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

Luckily for you, **XGBoost** implements such functions.

```r
# save model to binary local file
xgb.save(bst, "xgboost.model")
```

```
## [1] TRUE
```

> The `xgb.save` function should return TRUE if everything goes well, and crashes otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

```r
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```

```
## [1] "sum(abs(pred2-pred))= 0"
```

> The result is `0`? We are good!

In some very specific cases, for example when you want to pilot **XGBoost** from the `caret` package, you will want to save the model as an *R* binary vector. See below how to do it.

```r
# save model to R's raw vector
rawVec <- xgb.save.raw(bst)

# print class
print(class(rawVec))
```

```
## [1] "raw"
```

```r
# load binary model to R
bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

```
## [1] "sum(abs(pred3-pred))= 0"
```

> Again `0`? It seems that `XGBoost` works pretty well!

## References

@ -26,7 +26,7 @@ from sphinx_util import MarkdownParser, AutoStructify

# -- mock out modules
import mock
MOCK_MODULES = ['numpy', 'scipy', 'scipy.sparse', 'sklearn', 'matplotlib']
MOCK_MODULES = ['numpy', 'scipy', 'scipy.sparse', 'sklearn', 'matplotlib', 'pandas', 'graphviz']
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = mock.Mock()

@ -120,6 +120,7 @@ todo_include_todos = False

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
# html_theme = 'alabaster'
html_theme = 'sphinx_rtd_theme'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,

@ -1,12 +1,145 @@

Developer Guide
===============
This page contains guide for developers of xgboost. XGBoost has been developed and used by a group of active community.
Everyone is more than welcomed to is a great way to make the project better.
The project is maintained by a committee of [committers](../../CONTRIBUTORS.md#comitters) who will review and merge pull requests from contributors.
Contribute to XGBoost
=====================
XGBoost has been developed and used by a group of active community members.
Everyone is more than welcome to contribute. It is a way to make the project better and more accessible to more users.

Contributing Code
-----------------
* The C++ code follows Google C++ style
* We follow numpy style to document our python module
* Tools to precheck codestyle
  - type ```make lint``` and fix possible errors.
- Please add your name to [CONTRIBUTORS.md](../CONTRIBUTORS.md) after your patch has been merged.
- Please also update [NEWS.md](../NEWS.md) to add a note on your changes to the API or on newly added documents.

Guidelines
----------
* [Submit Pull Request](#submit-pull-request)
* [Git Workflow Howtos](#git-workflow-howtos)
  - [How to resolve conflict with master](#how-to-resolve-conflict-with-master)
  - [How to combine multiple commits into one](#how-to-combine-multiple-commits-into-one)
  - [What is the consequence of force push](#what-is-the-consequence-of-force-push)
* [Document](#document)
* [Testcases](#testcases)
* [Examples](#examples)
* [Core Library](#core-library)
* [Python Package](#python-package)
* [R Package](#r-package)

Submit Pull Request
-------------------
* Before submitting, please rebase your code on the most recent version of master. You can do it by
  ```bash
  git remote add upstream https://github.com/dmlc/xgboost
  git fetch upstream
  git rebase upstream/master
  ```
* If you have multiple small commits,
  it might be good to merge them together (use git rebase then squash) into more meaningful groups.
* Send the pull request!
  - Fix the problems reported by automatic checks
  - If you are contributing a new module, consider adding a testcase in [tests](../tests)

Git Workflow Howtos
-------------------

### How to resolve conflict with master
- First rebase to the most recent master
  ```bash
  # The first two steps can be skipped after you do it once.
  git remote add upstream https://github.com/dmlc/xgboost
  git fetch upstream
  git rebase upstream/master
  ```
- git may show some conflicts it cannot merge, say ```conflicted.py```.
  - Manually modify the file to resolve the conflict.
  - After you have resolved the conflict, mark it as resolved by
  ```bash
  git add conflicted.py
  ```
- Then you can continue the rebase by
  ```bash
  git rebase --continue
  ```
- Finally push to your fork; you may need to force push here.
  ```bash
  git push --force
  ```

### How to combine multiple commits into one
Sometimes we want to combine multiple commits, especially when later commits are only fixes to previous ones,
to create a PR with a set of meaningful commits. You can do it by the following steps.
- Before doing so, configure the default editor of git if you haven't done so before.
  ```bash
  git config core.editor the-editor-you-like
  ```
- Assume we want to merge the last 3 commits; type the following commands
  ```bash
  git rebase -i HEAD~3
  ```
- It will pop up a text editor. Set the first commit as ```pick```, and change later ones to ```squash```.
- After you save the file, it will pop up another text editor asking you to modify the combined commit message.
- Push the changes to your fork; you need to force push.
  ```bash
  git push --force
  ```

### What is the consequence of force push
The previous two tips require a force push because we altered the path of the commits.
It is fine to force push to your own fork, as long as the commits changed are only yours.

Documents
---------
* The document is created using sphinx and [recommonmark](http://recommonmark.readthedocs.org/en/latest/)
* You can build the document locally to see the effect.

Testcases
---------
* All the testcases are in [tests](../tests)
* We use python nose for python test cases.

Examples
--------
* Usecases and examples will be in [demo](../demo)
* We are super excited to hear about your story. If you have blogposts,
  tutorials or code solutions using xgboost, please tell us and we will add
  a link in the example pages.

Core Library
------------
- Follow Google C style for C++.
- We use doxygen to document all the interface code.
- You can reproduce the linter checks by typing ```make lint```

Python Package
--------------
- Always add docstrings to new functions in numpydoc format.
- You can reproduce the linter checks by typing ```make lint```

R Package
---------
### Code Style
- We follow Google's C++ Style guide on C++ code.
  - This is mainly to be consistent with the rest of the project.
  - Another reason is we will be able to check style automatically with a linter.
- You can check the style of the code by typing the following command at the root folder.
  ```bash
  make rcpplint
  ```
- When needed, you can disable the linter warning on a certain line with ```// NOLINT(*)``` comments.

### Rmarkdown Vignettes
Rmarkdown vignettes are placed in [R-package/vignettes](../R-package/vignettes).
These Rmarkdown files are not compiled. We host the compiled version on [doc/R-package](R-package).

The following steps are followed to add a new Rmarkdown vignette:
- Add the original rmarkdown to ```R-package/vignettes```
- Modify ```doc/R-package/Makefile``` to add the markdown files to be built
- Clone the [dmlc/web-data](https://github.com/dmlc/web-data) repo to folder ```doc```
- Now type the following command on ```doc/R-package```
  ```bash
  make the-markdown-to-make.md
  ```
- This will generate the markdown, as well as the figures, into ```doc/web-data/xgboost/knitr```
- Modify ```doc/R-package/index.md``` to point to the generated markdown.
- Add the generated figures to the ```dmlc/web-data``` repo.
  - If you already cloned the repo to doc, this means a ```git add```
- Create a PR for both the markdown and ```dmlc/web-data```
- You can also build the document locally by typing the following command at ```doc```
  ```bash
  make html
  ```
The reason we do this is to avoid an exploding repo size due to generated image sizes.

doc/index.md

@ -5,23 +5,26 @@ XGBoost is short for eXtreme gradient boosting. This is a library that is design
The goal of this library is to push the extreme of the computation limits of machines to provide a ***scalable***, ***portable*** and ***accurate***
for large scale tree boosting.

This document is hosted at http://xgboost.readthedocs.org/. You can also browse most of the documents in github directly.

How to Get Started
------------------
The best way to get started to learn xgboost is by the examples. There are three types of examples you can find in xgboost.
* [Tutorials](#tutorials) are self-contained tutorials on complete data science tasks.
* [XGBoost Code Examples](../demo/) are collections of code and benchmarks of xgboost.
  - There is a walkthrough section in this to walk you through specific API features.
* [Highlight Solutions](#highlight-solutions) are presentations using xgboost to solve real world problems.
  - These examples are usually more advanced. You can usually find state-of-art solutions to many problems and challenges in here.

After you gets familiar with the interface, checkout the following additional resources
User Guide
----------
* [Installation Guide](build.md)
* [Introduction to Boosted Trees](model.md)
* [Python Package Document](python/index.md)
* [R Package Document](R-package/index.md)
* [XGBoost.jl Julia Package](https://github.com/dmlc/XGBoost.jl)
* [Distributed Training](../demo/distributed-training)
* [Frequently Asked Questions](faq.md)
* [Learning what is in Behind: Introduction to Boosted Trees](model.md)
* [User Guide](#user-guide) contains comprehensive list of documents of xgboost.
* [Developer Guide](dev-guide/contribute.md)
* [External Memory Version](external_memory.md)
* [Learning to use XGBoost by Example](../demo)
* [Parameters](parameter.md)
* [Text input format](input_format.md)
* [Notes on Parameter Tunning](param_tuning.md)

Developer Guide
---------------
* [Contributor Guide](dev-guide/contribute.md)

Tutorials
---------

@ -31,14 +34,13 @@ are great resources to learn xgboost by real examples. If you think you have som
  - This tutorial introduces the basic usage of CLI version of xgboost
* [Introduction of XGBoost in Python](python/python_intro.md) (python)
  - This tutorial introduces the python package of xgboost
* [Introduction to XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd) (R package)
* [Introduction to XGBoost in R](R-package/xgboostPresentation.md) (R package)
  - This is a general presentation about xgboost in R.
* [Discover your data with XGBoost in R](../R-package/vignettes/discoverYourData.Rmd) (R package)
* [Discover your data with XGBoost in R](R-package/discoverYourData.md) (R package)
  - This tutorial explaining feature analysis in xgboost.
* [Understanding XGBoost Model on Otto Dataset](../demo/kaggle-otto/understandingXGBoostModel.Rmd) (R package)
  - This tutorial teaches you how to use xgboost to compete kaggle otto challenge.

Highlight Solutions
-------------------
This section is about blogposts, presentation and videos discussing how to use xgboost to solve your interesting problem. If you think something belongs to here, send a pull request.

@ -49,23 +51,11 @@ This section is about blogposts, presentation and videos discussing how to use x
* Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
* [Winning solution of Kaggle Higgs competition: what a single model can do](http://no2147483647.wordpress.com/2014/09/17/winning-solution-of-kaggle-higgs-competition-what-a-single-model-can-do/)

User Guide
----------
* [Frequently Asked Questions](faq.md)
* [Introduction to Boosted Trees](model.md)
* [Using XGBoost in Python](python/python_intro.md)
* [Using XGBoost in R](../R-package/vignettes/xgboostPresentation.Rmd)
* [Learning to use XGBoost by Example](../demo)
* [External Memory Version](external_memory.md)
* [Text input format](input_format.md)
* [Build Instruction](build.md)
* [Parameters](parameter.md)
* [Notes on Parameter Tunning](param_tuning.md)
Indices and tables
------------------

Developer Guide
---------------
* [Developer Guide](dev-guide/contribute.md)

API Reference
-------------
* [Python API Reference](python/python_api.rst)
```eval_rst
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
```

doc/python/index.md (new file)

@ -0,0 +1,10 @@
XGBoost Python Package
======================
This page contains links to all the python related documents on the python package.
To install the package, checkout [Build and Installation Instruction](../build.md).

Contents
--------
* [Python Overview Tutorial](python_intro.md)
* [Learning to use XGBoost by Example](../../demo)
* [Python API Reference](python_api.rst)

@ -5,11 +5,24 @@ import os
import docutils
import subprocess

if os.environ.get('READTHEDOCS', None) == 'True':
READTHEDOCS_BUILD = (os.environ.get('READTHEDOCS', None) is not None)

if not os.path.exists('../recommonmark'):
    subprocess.call('cd ..; rm -rf recommonmark;' +
                    'git clone https://github.com/tqchen/recommonmark', shell=True)
                    'git clone https://github.com/tqchen/recommonmark', shell = True)
else:
    subprocess.call('cd ../recommonmark/; git pull', shell=True)

if not os.path.exists('web-data'):
    subprocess.call('rm -rf web-data;' +
                    'git clone https://github.com/dmlc/web-data', shell = True)
else:
    subprocess.call('cd web-data; git pull', shell=True)


sys.path.insert(0, os.path.abspath('../recommonmark/'))
sys.stderr.write('READTHEDOCS=%s\n' % (READTHEDOCS_BUILD))


from recommonmark import parser, transform

@ -14,7 +14,7 @@ try:
    from .sklearn import XGBModel, XGBClassifier, XGBRegressor
    from .plotting import plot_importance, plot_tree, to_graphviz
except ImportError:
    print('Error when loading sklearn/plotting. Please install scikit-learn')
    pass

VERSION_FILE = os.path.join(os.path.dirname(__file__), 'VERSION')
__version__ = open(VERSION_FILE).read().strip()