---
title: "Xgboost presentation"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
bibliography: xgboost.bib
author: Tianqi Chen, Tong He, Michaël Benesty
vignette: >
  %\VignetteIndexEntry{Xgboost presentation}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

Introduction
============
**Xgboost** is short for e**X**treme **G**radient **B**oosting.

The purpose of this vignette is to show you how to use **Xgboost** to build a model and make predictions.

It is an efficient and scalable implementation of the gradient boosting framework by @friedman2001greedy. Two solvers are included:

- *linear* model;
- *tree learning* algorithm.

It supports various objective functions, including *regression*, *classification* and *ranking*. The package is designed to be extensible, so users can also easily define their own objective functions.

It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.

It has several features:

* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
* Input Type: it takes several types of input data:
    * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix`;
    * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix`;
    * Data File: local data files;
    * `xgb.DMatrix`: its own class (recommended).
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input;
* Customization: it supports customized objective functions and evaluation functions.

Installation
============
GitHub version
--------------

For the up-to-date version (highly recommended), install from *GitHub*:

```{r installGithub, eval=FALSE}
devtools::install_github('tqchen/xgboost', subdir='R-package')
```
> *Windows* users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.

CRAN version
------------

For the stable version from *CRAN*, run:

```{r installCran, eval=FALSE}
install.packages('xgboost')
```
Learning
========

For the purpose of this tutorial, we will load the **Xgboost** package.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
```
Dataset presentation
--------------------

In this example, we aim to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as you will use in your everyday life :-).

The mushroom data come from the UCI Machine Learning Repository [@Bache+Lichman:2013].

Dataset loading
---------------
We will load the `agaricus` datasets embedded in the package and link them to variables.

The datasets are already split into:

* `train`: will be used to build the model;
* `test`: will be used to assess the quality of our model.

Why *split* the dataset in two parts?

In the first part we will build our model. In the second part we will test it and assess its quality. Without dividing the dataset, we would test the model on data the algorithm has already seen.

```{r datasetLoading, results='hold', message=F, warning=F}
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
```

> In the real world, it would be up to you to make this division between `train` and `test` data. How to do so is beyond the scope of this article; however, the `caret` package may [help](http://topepo.github.io/caret/splitting.html).
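> For instance, a minimal sketch using `caret::createDataPartition` (assuming the `caret` package is installed; the object names below are just placeholders, not part of this vignette's data) could look like this:

```{r caretSplit, eval=FALSE}
library(caret)

# hypothetical full dataset: `fullData` (features) and `fullLabel` (outcome)
inTrain <- createDataPartition(y = fullLabel, p = 0.8, list = FALSE)

trainData  <- fullData[inTrain, ]
trainLabel <- fullLabel[inTrain]
testData   <- fullData[-inTrain, ]
testLabel  <- fullLabel[-inTrain]
```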
Each variable is a `list` containing two things, `label` and `data`:

```{r dataList, message=F, warning=F}
str(train)
```

`label` is the outcome of our dataset, i.e. the binary *classification* target we will try to predict.
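As a quick optional check (not part of the original flow), you can look at how the two classes are distributed:

```{r labelDistribution, eval=FALSE}
# count how many observations carry each label (0 or 1)
table(train$label)
```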
Let's discover the dimensionality of our datasets.

```{r dataSize, message=F, warning=F}
dim(train$data)
dim(test$data)
```

This dataset is kept very small so that the **R** package does not become too heavy; however, **Xgboost** is built to manage huge datasets very efficiently.

As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and the `label` vector is a `numeric` vector (`{0,1}`):

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```
Basic Training using Xgboost
----------------------------

This step is the most critical part of the process for the quality of our model.

### Basic training

We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. Such datasets are very common.
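If you are curious, the optional sketch below (not run, and not part of the original example) compares the memory footprint of the sparse training matrix with its dense equivalent:

```{r sparseSizeCheck, eval=FALSE}
# compare the memory footprint of the sparse matrix and its dense equivalent
print(object.size(train$data), units = "Mb")
print(object.size(as.matrix(train$data)), units = "Mb")
```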
We will train a decision tree model using the following parameters:

* `objective = "binary:logistic"`: we will train a binary classification model;
* `max.depth = 2`: the trees won't be deep, because our case is very simple;
* `nround = 2`: there will be two passes over the data, the second one focusing on what was not correctly learned in the first pass.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```
> The more complex the relationship between your features and your `label` is, the more passes you need.

### Parameter variations

#### Dense matrix

Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

```{r trainingDense, message=F, warning=F}
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

#### xgb.DMatrix

**Xgboost** offers a way to group `data` and `label` in an `xgb.DMatrix`. You can even add other metadata to it. This will be useful for the more advanced features we will discover later.

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```
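As an illustration of attaching extra metadata, the sketch below (not run; the weights are arbitrary placeholders) adds per-observation weights to the `xgb.DMatrix` with `setinfo`:

```{r DMatrixWeights, eval=FALSE}
# attach arbitrary per-observation weights to the xgb.DMatrix
w <- rep(1, nrow(train$data))
setinfo(dtrain, "weight", w)
```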
#### Verbose option

**Xgboost** has several features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which is key to your model's quality.

One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).

```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 0)
```

```{r trainingVerbose1, message=T, warning=F}
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 1)
```

```{r trainingVerbose2, message=T, warning=F}
# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 2)
```
Basic prediction using Xgboost
==============================

The main use of **Xgboost** is to make predictions on new data. For that purpose we will use the `test` dataset.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first 10
print(pred[1:10])
```
The only thing **Xgboost** does is regression, yet we are dealing with a binary classification problem. The regression results it returns can be read as the probability of each observation being classified as `1`.

Therefore, we will set the following rule: if the probability is `> 0.5` then the observation is classified as `1`, and as `0` otherwise.

```{r predictingTest, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

> We remind you that the algorithm has never seen the `test` data before.

Here, we have just computed a simple metric, the average error (the sketch after the list below spells out the same steps):

1. `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression result) is over `0.5` the observation is classified as `1`, and as `0` otherwise;
2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the thresholded predictions;
3. `mean(vectorOfErrors)` computes the average error itself.
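Step by step, the computation above is equivalent to this sketch (not run again here, since it gives the same result):

```{r errorStepByStep, eval=FALSE}
# the same error computation, spelled out step by step
predictedClass <- as.numeric(pred > 0.5)        # 1. threshold the probabilities
vectorOfErrors <- predictedClass != test$label  # 2. TRUE where the prediction is wrong
mean(vectorOfErrors)                            # 3. proportion of wrong predictions
```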
The most important thing to remember is that **to perform a classification, you basically just do a regression and then apply a threshold**.

Multiclass classification works in a very similar way.
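For instance, a hypothetical multiclass sketch (not run, and using the built-in `iris` dataset rather than our mushroom data) only needs the `multi:softmax` objective and a `num_class` parameter, with labels coded as integers starting at `0`:

```{r multiclassSketch, eval=FALSE}
# labels must be integers in 0..(num_class - 1)
bstMulti <- xgboost(data = as.matrix(iris[, -5]),
                    label = as.numeric(iris$Species) - 1,
                    max.depth = 2, eta = 1, nround = 2,
                    objective = "multi:softmax", num_class = 3)
```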
This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!

Advanced features
=================

Most of the features below have been created to help you improve your model by offering a better understanding of its content.

Dataset preparation
-------------------

For the following advanced features, we need to put the data in an `xgb.DMatrix`, as explained above.

```{r DMatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
```
Measure learning progress with xgb.train
----------------------------------------

Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting and optimize the learning time by stopping the training as soon as possible.
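In the same spirit, a small cross-validation sketch with `xgb.cv` (not run here; it reuses the parameters from this vignette, with argument names in the style used elsewhere in this document) computes the evaluation metric on each of 5 folds:

```{r cvSketch, eval=FALSE}
# 5-fold cross-validation with the same parameters as above
xgb.cv(data = dtrain, max.depth = 2, eta = 1, nround = 2, nfold = 5,
       objective = "binary:logistic")
```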
One way to measure progress in the learning of a model is to provide **Xgboost** with a second dataset that is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```
**Xgboost** has computed at each round the same average error metric seen above (we set `nround` to 2, which is why we have two lines of metrics here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. There may be something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/splitting.html).

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

```{r watchlist2, message=F, warning=F}
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

> `eval.metric` allows us to monitor two new metrics for each round, `logloss` and `error`.

Until now, all the learning we have performed was based on boosting trees. **Xgboost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and removing the `eta` parameter).
```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

In this specific case, linear boosting gets slightly better performance metrics than the decision-tree-based algorithm. In simple cases this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Check both implementations with your own dataset to get an idea of what to use.

Manipulating xgb.DMatrix
------------------------
### Save / Load

Like models, an `xgb.DMatrix` object (which groups both a dataset and its outcome) can be saved using the `xgb.DMatrix.save` function.

```{r DMatrixSave, message=F, warning=F}
xgb.DMatrix.save(dtrain, "dtrain.buffer")
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

```{r DMatrixDel, include=FALSE}
file.remove("dtrain.buffer")
```
### Information extraction

Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.

```{r getinfo, message=F, warning=F}
label = getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
```
View the trees from a model
---------------------------

You can dump the trees you learned using `xgb.dump` into a text file.

```{r dump, message=T, warning=F}
xgb.dump(bst, with.stats = T)
```

> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.
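> For example, the sketch below (not run; the file name is just a placeholder) writes the same dump to a local text file:

```{r dumpToFile, eval=FALSE}
# write the dump to a local text file instead of printing it
xgb.dump(bst, fname = "xgb.model.dump.txt", with.stats = T)
```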
Save and load models
--------------------

Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it back when required.

Fortunately, **Xgboost** implements such functions.

```{r saveModel, message=F, warning=F}
# save model to binary local file
xgb.save(bst, "xgboost.model")
```
> The `xgb.save` function should return `r TRUE` if everything goes well, and crashes otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

```{r loadModel, message=F, warning=F}
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```
```{r clean, include=FALSE}
# delete the created model
file.remove("./xgboost.model")
```

> Is the result `0`? We are good!

In some very specific cases, like when you want to pilot **Xgboost** from the `caret` package, you will want to save the model as an *R* raw vector. See below how to do it.
```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
rawVec <- xgb.save.raw(bst)

# print class
print(class(rawVec))

# load binary model to R
bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

> Again `0`? It seems that `Xgboost` works pretty well!

References
==========