[DOC] Update R doc

2016-01-16 11:46:23 -08:00
parent e7d8ed71d6
commit 8e7f2679d5
16 changed files with 1402 additions and 156 deletions
--- a/R-package/README.md
+++ b/R-package/README.md
@@ -3,6 +3,12 @@ R package for xgboost

 [![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
 [![CRAN Downloads](http://cranlogs.r-pkg.org/badges/xgboost)](http://cran.rstudio.com/web/packages/xgboost/index.html)
+[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
+
+Resources
+---------
+* [XGBoost R Package Online Documentation](http://xgboost.readthedocs.org/en/latest/R-package/index.html)
+  - Check this out for detailed documents, examples and tutorials.

 Installation
 ------------
@@ -24,21 +30,3 @@ Examples

 * Please visit [walk through example](demo).
 * See also the [example scripts](../demo/kaggle-higgs) for Kaggle Higgs Challenge, including [speedtest script](../demo/kaggle-higgs/speedtest.R) on this dataset and the one related to [Otto challenge](../demo/kaggle-otto), including a [RMarkdown documentation](../demo/kaggle-otto/understandingXGBoostModel.Rmd).
-
-Notes
-----
-
-If you face an issue installing the package using  ```devtools::install_github```, something like this (even after updating libxml and RCurl as lot of forums say) -
-
-```
-devtools::install_github('dmlc/xgboost',subdir='R-package')
-Downloading github repo dmlc/xgboost@master
-Error in function (type, msg, asError = TRUE)  :
-  Peer certificate cannot be authenticated with given CA certificates
-```
-To get around this you can build the package locally as mentioned [here](https://github.com/dmlc/xgboost/issues/347) -
-```
-1. Clone the current repository and set your workspace to xgboost/R-package/
-2. Run R CMD INSTALL --build . in terminal to get the tarball.
-3. Run install.packages('path_to_the_tarball',repo=NULL) in R to install.
-```
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -1,6 +1,6 @@
 ---
 title: "Understand your dataset with Xgboost"
-output: 
+output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
@@ -12,8 +12,11 @@ vignette: >
  \usepackage[utf8]{inputenc}
 ---

+Understand your dataset with XGBoost
+====================================
+
 Introduction
-============
+------------

 The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.

@@ -25,16 +28,16 @@ Pacakge loading:
 require(xgboost)
 require(Matrix)
 require(data.table)
-if (!require('vcd')) install.packages('vcd') 
+if (!require('vcd')) install.packages('vcd')
 ```

 > The **VCD** package is used only for one of its embedded datasets.

 Preparation of the dataset
-==========================
+--------------------------
+
+### Numeric VS categorical variables

-Numeric VS categorical variables
--------------------------------

 **Xgboost** manages only `numeric` vectors.

@@ -48,10 +51,9 @@ A *categorical* variable has a fixed number of different values. For instance, i

 To answer the question above we will convert *categorical* variables to `numeric` ones.

-Conversion from categorical to numeric variables
------------------------------------------------
+### Conversion from categorical to numeric variables

-### Looking at the raw data
+#### Looking at the raw data

 In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.

@@ -85,11 +87,11 @@ str(df)
 > * can take a limited number of values (like `factor`) ;
 > * these values are ordered (unlike `factor`). Here these ordered values are: `Marked > Some > None`

-### Creation of new features based on old ones
+#### Creation of new features based on old ones

 We will add some new *categorical* features to see if it helps.

-#### Grouping per 10 years
+##### Grouping per 10 years

 For the first feature we create groups of age by rounding the real age.

@@ -101,7 +103,7 @@ Therefore, 20 is not closer to 30 than 60. To make it short, the distance betwee
 head(df[,AgeDiscret := as.factor(round(Age/10,0))])
 ```

-#### Random split in two groups
+##### Random split in two groups

 Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).

@@ -109,15 +111,15 @@ Following is an even stronger simplification of the real age with an arbitrary s
 head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
 ```
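
The two age simplifications above can be tried in isolation. The sketch below is only an illustration in base R: the `age` vector is made up (it is not the vignette's dataset), and plain vectors are used instead of the `data.table` syntax.

```r
# Hypothetical ages, not taken from the vignette's dataset
age <- c(23, 30, 31, 47, 68)

# Grouping per 10 years: round(age / 10) gives the nearest decade index
age_discret <- as.factor(round(age / 10, 0))

# Arbitrary split at 30 years old, as in the vignette
age_cat <- as.factor(ifelse(age > 30, "Old", "Young"))

print(data.frame(age, age_discret, age_cat))
```

Note that an age of exactly 30 falls in the "Young" group because the comparison is strict (`>`).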

-#### Risks in adding correlated features
+##### Risks in adding correlated features

-These new features are highly correlated to the `Age` feature because they are simple transformations of this feature. 
+These new features are highly correlated to the `Age` feature because they are simple transformations of this feature.

 For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make prediction less accurate, and most of the time make interpretation of the model almost impossible. GLM, for instance, assumes that the features are uncorrelated.

 Fortunately, decision tree algorithms (including boosted trees) are very robust to these features. Therefore we have nothing to do to manage this situation.

-#### Cleaning data
+##### Cleaning data

 We remove ID as there is nothing to learn from this feature (it would just add some noise).

@@ -132,7 +134,7 @@ levels(df[,Treatment])
 ```


-### One-hot encoding
+#### One-hot encoding

 Next step, we will transform the categorical data to dummy variables.
 This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.
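
As a rough illustration of what one-hot encoding produces, here is a minimal sketch using base R's `model.matrix` on a made-up `toy` data frame (the data and variable names are assumptions for this example). It builds a dense matrix; the **Matrix** package offers `sparse.model.matrix` for the sparse equivalent.

```r
# Hypothetical toy data, only for illustration
toy <- data.frame(
  Treatment = factor(c("Placebo", "Treated", "Treated")),
  Sex = factor(c("Female", "Male", "Female")),
  Age = c(27, 42, 35)
)

# "~ . - 1" uses every column as a predictor and drops the intercept,
# so the first factor gets one indicator column per level
dummies <- model.matrix(~ . - 1, data = toy)

print(colnames(dummies))
print(dummies)
```

Numeric columns such as `Age` pass through unchanged; only the factors are expanded into 0/1 indicator columns.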
@@ -156,12 +158,12 @@ Create the output `numeric` vector (not as a sparse `Matrix`):
 output_vector = df[,Improved] == "Marked"
 ```

-1. set `Y` vector to `0`; 
-2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ; 
+1. set `Y` vector to `0`;
+2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ;
 3. return `Y` vector.

 Build the model
-===============
+---------------

 The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).

@@ -173,17 +175,17 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,

 You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.

-A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future). 
+A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and won't be that good at predicting the future).

-> Here you can see the numbers decrease until line 7 and then increase. 
+> Here you can see the numbers decrease until line 7 and then increase.
 >
 > It probably means we are overfitting. To fix that I should reduce the number of rounds to `nround = 4`. I will leave things as they are because it doesn't really matter for the purpose of this example :-)

 Feature importance
-==================
+------------------
+
+## Measure feature importance

-Measure feature importance
--------------------------

 ### Build the feature importance data.table

@@ -204,7 +206,7 @@ head(importance)

 `Frequency` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

-### Improvement in the interpretability of feature importance data.table
+#### Improvement in the interpretability of feature importance data.table

 We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features count to predict if the illness will go or not. But we don't yet know the role of these features. For instance, one of the questions we may want to answer would be: does receiving a placebo treatment help to recover from the illness?

@@ -233,8 +235,8 @@ Therefore, according to our findings, getting a placebo doesn't seem to help but

 > You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.

-Plotting the feature importance
-------------------------------
+### Plotting the feature importance
+

 All these things are nice, but it would be even better to plot the results.

@@ -250,11 +252,11 @@ According to the plot above, the most important features in this dataset to pred

 * the Age ;
 * having received a placebo or not ;
-* the sex is third but already included in the not interesting features group ; 
+* the sex is third but already included in the not interesting features group ;
 * then we see our generated features (AgeDiscret). We can see that their contribution is very low.

-Do these results make sense?
------------------------------
+### Do these results make sense?
+

 Let's check some **Chi2** between each of these features and the label.

@@ -279,18 +281,18 @@ c2 <- chisq.test(df$AgeCat, output_vector)
 print(c2)
 ```
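
To get a feel for how the **Chi2** statistic behaves, the sketch below runs `chisq.test` on made-up data: one feature perfectly linked with the label and one unrelated to it. All vectors here are assumptions for illustration, not the vignette's data.

```r
# Hypothetical label: 20 TRUE followed by 20 FALSE
label <- rep(c(TRUE, FALSE), each = 20)

# A feature that mirrors the label exactly
linked <- factor(ifelse(label, "A", "B"))

# A feature that alternates regardless of the label
unrelated <- factor(rep(c("A", "B"), times = 20))

c_linked <- chisq.test(linked, label)
c_unrelated <- chisq.test(unrelated, label)

print(c_linked$statistic)
print(c_unrelated$statistic)
```

A larger statistic (and a smaller p-value) suggests a stronger association between the feature and the label.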

-The perfectly random split I did between young and old at 30 years old have a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. 
+The perfectly random split I did between young and old at 30 years old has a low correlation of **`r round(c2$statistic, 2)`**. It's a result we might expect: maybe in my mind being over 30 means being old (I am 32 and starting to feel old, which may explain that), but for the illness we are studying, the age of vulnerability is not the same.

-Morality: don't let your *gut* lower the quality of your model. 
+Moral: don't let your *gut* lower the quality of your model.

 In *data science* expression, there is the word *science* :-)

 Conclusion
-==========
+----------

-As you can see, in general *destroying information by simplifying it won't improve your model*. **Chi2** just demonstrates that. 
+As you can see, in general *destroying information by simplifying it won't improve your model*. **Chi2** just demonstrates that.

-But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. 
+But in more complex cases, creating a new feature from existing ones that makes the link with the outcome more obvious may help the algorithm and improve the model.

 The case studied here is not complex enough to show that. Check [Kaggle website](http://www.kaggle.com/) for some challenging datasets. However it's almost always worse when you add some arbitrary rules.

@@ -299,7 +301,7 @@ Moreover, you can notice that even if we have added some not useful new features
 Linear model may not be that smart in this scenario.

 Special Note: What about Random Forests™?
-==========================================
+-----------------------------------------

 As you may know, the [Random Forests™](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.

@@ -313,7 +315,7 @@ However, in Random Forests™ this random choice will be done for each tree, bec

 In boosting, when a specific link between feature and outcome has been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated to the one detected as important if you need to know all of them.

-If you want to try Random Forests™ algorithm, you can tweak Xgboost parameters! 
+If you want to try Random Forests™ algorithm, you can tweak Xgboost parameters!

 **Warning**: this is still an experimental parameter.

--- a/R-package/vignettes/xgboostPresentation.Rmd
+++ b/R-package/vignettes/xgboostPresentation.Rmd
@@ -13,8 +13,11 @@ vignette: >
  \usepackage[utf8]{inputenc}
 ---

-Introduction
-============
+XGBoost R Tutorial
+==================
+
+## Introduction
+

 **Xgboost** is short for e**X**treme **G**radient **Boost**ing package.

@@ -40,16 +43,16 @@ It has several features:
 * Sparsity: it accepts *sparse* input for both *tree booster*  and *linear booster*, and is optimized for *sparse* input ;
 * Customization: it supports customized objective functions and evaluation functions.

-Installation
-============
+## Installation
+
+
+### Github version

-Github version
--------------

 For the up-to-date version (highly recommended), install from *Github*:

 ```{r installGithub, eval=FALSE}
-devtools::install_github('dmlc/xgboost', subdir='R-package')
+devtools::install_git('git://github.com/dmlc/xgboost', subdir='R-package')
 ```

 > *Windows* users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
@@ -61,8 +64,8 @@ As of 2015-03-13, ‘xgboost’ was removed from the CRAN repository.

 Formerly available versions can be obtained from the CRAN [archive](http://cran.r-project.org/src/contrib/Archive/xgboost)

-Learning
-========
+## Learning
+

 For the purpose of this tutorial we will load **XGBoost** package.

@@ -70,15 +73,15 @@ For the purpose of this tutorial we will load **XGBoost** package.
 require(xgboost)
 ```

-Dataset presentation
--------------------
+### Dataset presentation
+

 In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the same as you will use in your everyday life :-).

 Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.

-Dataset loading
---------------
+### Dataset loading
+

 We will load the `agaricus` datasets embedded with the package and will link them to variables.

@@ -124,12 +127,12 @@ class(train$data)[1]
 class(train$label)
 ```

-Basic Training using XGBoost
----------------------------
+### Basic Training using XGBoost
+

 This step is the most critical part of the process for the quality of our model.

-### Basic training
+#### Basic training

 We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

@@ -148,9 +151,9 @@ bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta

 > The more complex the relationship between your features and your `label` is, the more passes you need.

-### Parameter variations
+#### Parameter variations

-#### Dense matrix
+##### Dense matrix

 Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

@@ -158,7 +161,7 @@ Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R**
 bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
 ```

-#### xgb.DMatrix
+##### xgb.DMatrix

 **XGBoost** offers a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be useful for the most advanced features we will discover later.

@@ -167,7 +170,7 @@ dtrain <- xgb.DMatrix(data = train$data, label = train$label)
 bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic")
 ```

-#### Verbose option
+##### Verbose option

 **XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model quality.

@@ -188,11 +191,11 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, o
 bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nthread = 2, nround = 2, objective = "binary:logistic", verbose = 2)
 ```

-Basic prediction using XGBoost
-==============================
+## Basic prediction using XGBoost
+
+
+## Perform the prediction

-Perform the prediction
----------------------

 The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

@@ -208,8 +211,8 @@ print(head(pred))

 These numbers don't look like *binary classification* `{0,1}`. We need to perform a simple transformation before being able to use these results.

-Transform the regression in a binary classification
---------------------------------------------------
+## Transform the regression in a binary classification
+

 The only thing that **XGBoost** does is a *regression*. **XGBoost** is using `label` vector to build its *regression* model.

@@ -222,8 +225,8 @@ prediction <- as.numeric(pred > 0.5)
 print(head(prediction))
 ```
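
The thresholding step can be checked by hand on a few values. In the sketch below, `pred` and `label` are made-up numbers (not output from the mushroom model); it applies the same `> 0.5` rule and then counts the share of mismatches.

```r
# Hypothetical predicted probabilities and true labels
pred <- c(0.92, 0.13, 0.55, 0.48, 0.07)
label <- c(1, 0, 0, 0, 0)

# Same rule as above: probabilities above 0.5 become class 1
prediction <- as.numeric(pred > 0.5)

# Share of observations where the predicted class differs from the label
err <- mean(prediction != label)

print(prediction)
print(paste("error=", err))
```

Here only the third observation is misclassified (0.55 is turned into `1` while its label is `0`), so one prediction out of five is wrong.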

-Measuring model performance
---------------------------
+## Measuring model performance
+

 To measure the model performance, we will compute a simple metric, the *average error*.

@@ -246,14 +249,14 @@ The most important thing to remember is that **to do a classification, you just

 This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!

-Advanced features
-=================
+## Advanced features
+

 Most of the features below have been implemented to help you to improve your model by offering a better understanding of its content.


-Dataset preparation
-------------------
+### Dataset preparation
+

 For the following advanced features, we need to put data in `xgb.DMatrix` as explained above.

@@ -262,8 +265,8 @@ dtrain <- xgb.DMatrix(data = train$data, label=train$label)
 dtest <- xgb.DMatrix(data = test$data, label=test$label)
 ```

-Measure learning progress with xgb.train
----------------------------------------
+### Measure learning progress with xgb.train
+

 Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

@@ -295,8 +298,8 @@ bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nthread = 2, nround=2, watchli

 > `eval.metric` allows us to monitor two new metrics for each round, `logloss` and `error`.

-Linear boosting
---------------
+### Linear boosting
+

 Until now, all the learnings we have performed were based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference with previous command is `booster = "gblinear"` parameter (and removing `eta` parameter).

@@ -308,10 +311,10 @@ In this specific case, *linear boosting* gets sligtly better performance metrics

 In simple cases, it will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to have an idea of what to use.

-Manipulating xgb.DMatrix
------------------------
+### Manipulating xgb.DMatrix

-### Save / Load
+
+#### Save / Load

 Like saving models, `xgb.DMatrix` object (which groups both dataset and outcome) can also be saved using `xgb.DMatrix.save` function.

@@ -326,7 +329,7 @@ bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nthread = 2, nround=2, watchl
 file.remove("dtrain.buffer")
 ```

-### Information extraction
+#### Information extraction

 Information can be extracted from `xgb.DMatrix` using `getinfo` function. Hereafter we will extract `label` data.

@@ -337,8 +340,8 @@ err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
 print(paste("test-error=", err))
 ```

-View feature importance/influence from the learnt model
-------------------------------------------------------
+### View feature importance/influence from the learnt model
+

 Feature importance is similar to R gbm package's relative influence (rel.inf).

@@ -348,8 +351,8 @@ print(importance_matrix)
 xgb.plot.importance(importance_matrix = importance_matrix)
 ```

-View the trees from a model
---------------------------
+#### View the trees from a model
+

 You can dump the tree you learned using `xgb.dump` into a text file.

@@ -365,8 +368,8 @@ xgb.plot.tree(model = bst)

 > if you provide a path to `fname` parameter you can save the trees to your hard drive.

-Save and load models
--------------------
+#### Save and load models
+

 Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

@@ -416,5 +419,4 @@ print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred))))

 > Again `0`? It seems that `XGBoost` works pretty well!

-References
-==========
+## References