---
title: "XGBoost presentation"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
bibliography: xgboost.bib
author: Tianqi Chen, Tong He, Michaël Benesty, David Cortes
vignette: >
  %\VignetteIndexEntry{XGBoost presentation}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

XGBoost R Tutorial
==================

## Introduction

**XGBoost** is short for e**X**treme **G**radient **Boost**ing package.

The purpose of this vignette is to show you how to use **XGBoost** to build a model and make predictions.

It is an efficient and scalable implementation of the gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:

- *tree learning* algorithm (in different varieties) ;
- *linear* model.

It supports various objective functions, including *regression*, *classification* (binary and multi-class) and *ranking*. The package is made to be extensible, so that users can easily define their own objective functions.

It has been [used](https://github.com/dmlc/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.

It has several features:

* Speed: it can automatically do parallel computations with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
* Input Type: it takes several types of input data:
    * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
    * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
    * Data File: local data files ;
    * Data frames (class `data.frame` and sub-classes of it such as `data.table`), taking both numeric and categorical (factor) features ;
    * `xgb.DMatrix`: its own class (recommended, also supporting numeric and categorical features).
* Customization: it supports customized objective functions and evaluation functions.

## Installation

The package can be easily installed from CRAN:

```{r, eval=FALSE}
install.packages("xgboost")
```

For the development version, see the [GitHub page](https://github.com/dmlc/xgboost) and the [installation docs](https://xgboost.readthedocs.io/en/stable/install.html) for further instructions.

## Learning

For the purpose of this tutorial, we will load the **XGBoost** package.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
```

### Dataset presentation

In this example, we aim to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as the data you will use in your everyday life :-).

The mushroom data are taken from the UCI Machine Learning Repository. @Bache+Lichman:2013.

### Dataset loading

We will load the `agaricus` datasets embedded with the package and link them to variables.

The datasets are already split in:

* `train`: will be used to build the model ;
* `test`: will be used to assess the quality of our model.

Why *split* the dataset in two parts?

In the first part we will build our model. In the second part we will want to test it and assess its quality. Without dividing the dataset we would test the model on the data which the algorithm has already seen.

```{r datasetLoading, results='hold', message=F, warning=F}
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
```

> In the real world, it would be up to you to make this division between `train` and `test` data.
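As a small illustration of how such a split could be made on your own data, here is a sketch using base R's `sample()`; `my_data` and `my_labels` are hypothetical stand-ins for your own feature matrix and outcome vector, so the chunk is not evaluated:

```{r manualSplitSketch, eval=FALSE}
# Sketch: random 80/20 train/test split on a hypothetical dataset
set.seed(123)
n <- nrow(my_data)
train_idx <- sample(n, size = floor(0.8 * n))

my_train <- list(data = my_data[train_idx, , drop = FALSE], label = my_labels[train_idx])
my_test  <- list(data = my_data[-train_idx, , drop = FALSE], label = my_labels[-train_idx])
```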
Each variable is a `list` containing two things, `label` and `data`:

```{r dataList, message=F, warning=F}
str(train)
```

`label` is the outcome of our dataset, meaning it is the binary *classification* target we will try to predict.

Let's discover the dimensionality of our datasets.

```{r dataSize, message=F, warning=F}
dim(train$data)
dim(test$data)
```

This dataset is very small so as not to make the **R** package too heavy; however, **XGBoost** is built to manage huge datasets very efficiently.

As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and the `label` is a `numeric` vector (`{0,1}`):

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```

### Basic Training using XGBoost

This step is the most critical part of the process for the quality of our model.

#### Basic training

We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.

In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`s, memory size is reduced. It is very common to have such a dataset.

We will train a decision tree model using the following parameters:

* `objective = "binary:logistic"`: we will train a binary classification model (note that this is set automatically when `y` is a `factor`) ;
* `max_depth = 2`: the trees won't be deep, because our case is very simple ;
* `nthread = 2`: the number of CPU threads we are going to use ;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(
    x = train$data
    , y = factor(train$label, levels = c(0, 1))
    , objective = "binary:logistic"
    , max_depth = 2
    , eta = 1
    , nrounds = 2
    , nthread = 2
)
```

Note that, while the R function `xgboost()` follows typical R idioms for statistical modeling packages, such as an x/y division and having those as first arguments, it also offers a more flexible `xgb.train()` interface which is more consistent across different language bindings (e.g. arguments are the same as in the Python XGBoost library) and which exposes some additional functionalities. The `xgb.train()` interface uses XGBoost's own DMatrix class to pass data to it, and accepts the model parameters as a named list:

```{r}
bstTrInterface <- xgb.train(
    data = xgb.DMatrix(train$data, label = train$label, nthread = 1)
    , params = list(
        objective = "binary:logistic"
        , max_depth = 2
        , eta = 1
    )
    , nthread = 2
    , nrounds = 2
)
```

For the rest of this tutorial, we'll nevertheless be using the `xgboost()` interface, which will be more familiar to users of packages such as GLMNET or Ranger.

> The more complex the relationship between your features and your `label` is, the more passes you need.
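As a quick, optional sanity check (a sketch rather than part of the original workflow), one can verify that the two interfaces, trained with the same parameters, lead to essentially the same predictions on the training data; the chunk is not evaluated here:

```{r interfaceCheckSketch, eval=FALSE}
# Sketch: both models used the same parameters, so their predicted
# probabilities on the training data are expected to match closely.
p_xgboost  <- predict(bstSparse, train$data)
p_xgbtrain <- predict(bstTrInterface, train$data)
summary(abs(p_xgboost - p_xgbtrain))
```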
#### Parameter variations

##### Dense matrix

Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

```{r trainingDense, message=F, warning=F}
bstDense <- xgboost(
    x = as.matrix(train$data),
    y = factor(train$label, levels = c(0, 1)),
    max_depth = 2,
    eta = 1,
    nthread = 2,
    nrounds = 2
)
```

##### Data frame

As another alternative, XGBoost will also accept `data.frame` objects, from which it can use numeric, integer and factor columns:

```{r}
df_train <- as.data.frame(as.matrix(train$data))
bstDF <- xgboost(
    x = df_train,
    y = factor(train$label, levels = c(0, 1)),
    max_depth = 2,
    eta = 1,
    nthread = 2,
    nrounds = 2
)
```

##### Verbosity levels

**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is key to your model's quality.

Note that when using the `xgb.train()` interface, one can also use a separate evaluation dataset (e.g. a different subset of the data than the training dataset) on which to monitor metrics of interest, and it also offers an `xgb.cv()` function which automatically splits the data to create evaluation subsets for you.

One of the simplest ways to see the training progress is to set the `verbosity` option:

```{r trainingVerbose1, message=T, warning=F}
# verbosity = 1, print evaluation metric
bst <- xgboost(
    x = train$data,
    y = factor(train$label, levels = c(0, 1)),
    max_depth = 2,
    eta = 1,
    nthread = 2,
    objective = "binary:logistic",
    nrounds = 5,
    verbosity = 1
)
```

## Basic prediction using XGBoost

### Perform the prediction

The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first 10
print(head(pred, 10))
```

These numbers reflect the predicted probabilities of belonging to the class '1' in the 'y' data. Tautologically, the probability of belonging to the class '0' is then $P(y=0) = 1 - P(y=1)$. This implies: if the number is greater than 0.5, then according to the model it is more likely that an observation will be of class '1', whereas if the number is lower than 0.5, it is more likely that the observation will be of class '0':

```{r predictingTest, message=F, warning=F}
prediction <- as.numeric(pred > 0.5)
print(head(prediction))
```

Note that one can also control the prediction type directly to obtain classes instead of probabilities.

## Measuring model performance

To measure the model performance, we will compute a simple metric, the *accuracy rate*.

```{r predictingAverageError, message=F, warning=F}
acc <- mean(as.numeric(pred > 0.5) == test$label)
print(paste("test-acc=", acc))
```

> Note that the algorithm has not seen the `test` data during the model construction.

Steps explanation:

1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1`, and `0` otherwise ;
2. `probabilityVectorPreviouslyComputed == test$label` checks whether the predicted class matches the real data ;
3. `mean(vectorOfMatches)` computes the *accuracy rate* itself.

The most important thing to remember is that **to obtain the predicted class of an observation, a threshold needs to be applied on the predicted probabilities**.

*Multiclass* classification works in a similar way.

This metric is **`r round(acc, 2)`** and is pretty high: our yummy mushroom model works well!
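Beyond a single accuracy number, a quick cross-tabulation of predicted against actual classes can show where the few errors occur. Here is a small sketch using base R's `table()` (not part of the original tutorial):

```{r confusionSketch, eval=FALSE}
# Sketch: confusion matrix of thresholded predictions vs. true labels
table(predicted = as.numeric(pred > 0.5), actual = test$label)
```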
## Advanced features

Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.

### Dataset preparation for xgb.train

For the following advanced features, we'll be using the `xgb.train()` interface instead of the `xgboost()` interface, so we need to put the data in an `xgb.DMatrix` as explained earlier:

```{r DMatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label, nthread = 2)
dtest <- xgb.DMatrix(data = test$data, label = test$label, nthread = 2)
```

### Measure learning progress with xgb.train

Both the `xgboost()` (simple) and `xgb.train()` (advanced) functions train models.

One of the special features of `xgb.train()` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you avoid overfitting, or optimize the learning time by stopping it as soon as possible.

One way to measure progress in the learning of a model is to provide **XGBoost** with a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

> In some way it is similar to what we have done above with the prediction accuracy. The main difference is that above we measured the quality of predictions after building the model, whereas here we measure it during the construction.

For the purpose of this example, we use the `evals` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r evals, message=F, warning=F}
evals <- list(train = dtrain, test = dtest)

bst <- xgb.train(
    data = dtrain
    , params = list(
        max_depth = 2
        , eta = 1
        , nthread = 2
        , objective = "binary:logistic"
    )
    , nrounds = 2
    , evals = evals
)
```

At each round, **XGBoost** has computed, on both datasets, the same average log-loss (negative logarithm of the Bernoulli likelihood) that it uses as the optimization objective to minimize. Obviously, the `train_logloss` number relates to the training dataset (the one the algorithm learns from) and the `test_logloss` number to the test dataset.

Both training- and test-related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If with your own dataset you do not get such results, you should think about how you divided your dataset into training and test parts. Maybe there is something to fix.

For a better understanding of the learning progression, you may want to have some specific metric, or even use multiple evaluation metrics.

```{r evals2, message=F, warning=F}
bst <- xgb.train(
    data = dtrain
    , params = list(
        max_depth = 2
        , eta = 1
        , nthread = 2
        , objective = "binary:logistic"
        , eval_metric = "error"
        , eval_metric = "logloss"
    )
    , nrounds = 2
    , evals = evals
)
```

> `eval_metric` allows us to monitor two new metrics for each round, `error` and `logloss`.
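One common way to act on these monitored metrics is early stopping: halting training once the evaluation metric stops improving. The following is a sketch only, assuming your version of `xgb.train()` supports the `early_stopping_rounds` argument; it is not evaluated here:

```{r earlyStoppingSketch, eval=FALSE}
# Sketch: train for up to 50 rounds, but stop if the evaluation logloss
# has not improved for 3 consecutive rounds.
bstEarly <- xgb.train(
    data = dtrain
    , params = list(
        max_depth = 2
        , eta = 1
        , nthread = 2
        , objective = "binary:logistic"
        , eval_metric = "logloss"
    )
    , nrounds = 50
    , evals = evals
    , early_stopping_rounds = 3
)
```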
### Linear boosting

Until now, all the learning we have performed was based on boosting trees. **XGBoost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(
    data = dtrain
    , params = list(
        booster = "gblinear"
        , nthread = 2
        , objective = "binary:logistic"
        , eval_metric = "error"
        , eval_metric = "logloss"
    )
    , nrounds = 2
    , evals = evals
)
```

In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree-based algorithm.

In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

### Manipulating xgb.DMatrix

#### Save / Load

Like models, an `xgb.DMatrix` object (which groups both dataset and outcome) can also be saved, using the `xgb.DMatrix.save` function.

```{r DMatrixSave, message=F, warning=F}
fname <- file.path(tempdir(), "dtrain.buffer")
xgb.DMatrix.save(dtrain, fname)

# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix(fname)

bst <- xgb.train(
    data = dtrain2
    , params = list(
        max_depth = 2
        , eta = 1
        , nthread = 2
        , objective = "binary:logistic"
    )
    , nrounds = 2
    , evals = evals
)
```

```{r DMatrixDel, include=FALSE}
file.remove(fname)
```

#### Information extraction

Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.

```{r getinfo, message=F, warning=F}
label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
```

### View feature importance/influence from the fitted model

Feature importance is similar to the R gbm package's relative influence (rel.inf).

```{r}
importance_matrix <- xgb.importance(model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)
```

#### View the trees from a model

XGBoost can output the trees it fitted in a standard tabular format:

```{r}
xgb.model.dt.tree(bst)
```

You can plot the trees from your model using `xgb.plot.tree`:

```{r}
xgb.plot.tree(model = bst)
```

> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.

#### Save and load models

Maybe your dataset is big, and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.

XGBoost models can be saved through R functions such as `save` and `saveRDS`, but in addition, XGBoost offers its own serialization format, which might have better compatibility guarantees across versions of XGBoost and which can also be loaded into other language bindings:

```{r saveModel, message=F, warning=F}
# save model to binary local file
fname <- file.path(tempdir(), "xgb_model.ubj")
xgb.save(bst, fname)
```

> The `xgb.save` function should return `r TRUE` if everything goes well, and it raises an error otherwise.
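As mentioned above, base R serialization also works. The following is a small sketch of that alternative (the file name is arbitrary; keep in mind that XGBoost's own format may be more portable across XGBoost versions and language bindings):

```{r saveRDSSketch, eval=FALSE}
# Sketch: save and restore the fitted model with base R serialization
rds_path <- file.path(tempdir(), "xgb_model.rds")
saveRDS(bst, rds_path)
bst_rds <- readRDS(rds_path)
```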
An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.

```{r loadModel, message=F, warning=F}
# load binary model to R
# Note that the number of threads for 'xgb.load' is taken from the global config,
# which can be modified like this: RhpcBLASctl::omp_set_num_threads(1)
bst2 <- xgb.load(fname)
xgb.parameters(bst2) <- list(nthread = 2)
pred2 <- predict(bst2, test$data)

# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
```

```{r clean, include=FALSE}
# delete the created model
file.remove(fname)
```

> The result is `0`? We are good!

In some very specific cases, you will want to save the model as an *R* raw vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
rawVec <- xgb.save.raw(bst)

# print class
print(class(rawVec))

# load binary model to R
bst3 <- xgb.load.raw(rawVec)
xgb.parameters(bst3) <- list(nthread = 2)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3 - pred))))
```

> Again `0`? It seems that `XGBoost` works pretty well!

## References