---
title: "Understand your dataset with Xgboost"
output:
  rmarkdown::html_vignette:
    css: vignette.css
    number_sections: yes
    toc: yes
author: Tianqi Chen, Tong He, Michaël Benesty
vignette: >
  %\VignetteIndexEntry{Discover your data}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

Introduction
============

The purpose of this vignette is to show you how to use **Xgboost** to discover and better understand your own dataset.

This vignette is not about showing you how to predict anything (see [Xgboost presentation](www.somewhere.org) for that). Its purpose is to explain how to use **Xgboost** to understand the *link* between the *features* of your data and an *outcome*.

For the purpose of this tutorial we will first load the required packages.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
```

> The **VCD** package is used only for one of its embedded datasets (not for its own functions).

Preparation of the dataset
==========================

**Xgboost** works only on `numeric` variables.

Sometimes the dataset we have to work on contains *categorical* data.

A *categorical* variable is one that can take only a fixed number of different values. For example, if for each observation a variable called *Colour* can only be *red*, *blue* or *green*, it is a *categorical* variable.

> In *R*, a *categorical* variable is called a `factor`.
>
> Type `?factor` in the console for more information.
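
The snippet below is a small illustration only, not part of the Arthritis analysis: it builds a `factor` from a toy `Colour` vector so you can see how *R* stores the distinct values as levels.

```{r, eval=FALSE}
# Toy example (hypothetical data, not the Arthritis dataset):
colour <- factor(c("red", "blue", "green", "blue"))
levels(colour)  # the fixed set of possible values: "blue" "green" "red"
```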

In this demo we will see how to transform a dense dataframe (dense = few zeroes in the matrix) with *categorical* variables into a very sparse matrix (sparse = lots of zeroes in the matrix) of `numeric` features before analyzing the data with **Xgboost**.

The method we are going to use is usually called [one hot encoding](http://en.wikipedia.org/wiki/One-hot).
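
As a quick preview, here is a minimal sketch of one-hot encoding on a toy factor using base *R*'s `model.matrix` (the `Colour` column is hypothetical; the Arthritis data itself is encoded later with `sparse.model.matrix`):

```{r, eval=FALSE}
# Each level of the factor becomes its own binary {0, 1} column.
toy <- data.frame(Colour = factor(c("red", "blue", "green")))
model.matrix(~ Colour - 1, data = toy)
```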

The first step is to load the Arthritis dataset in memory and wrap it with the `data.table` package (`data.table` is 100% compliant with *R* dataframes, but its syntax is more consistent and its performance is really good).

```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
```

Let's have a look at the first 10 lines of the `data.table`:

```{r}
print(df[1:10])
```

Now we will check the format of each column.

```{r}
str(df)
```

> 2 columns have `factor` type, one has `ordinal` type.
>
> An `ordinal` variable can take a limited number of values, and these values can be ordered.
>
> `Marked > Some > None`
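
For illustration only (this snippet is not used in the rest of the analysis), an ordered factor can be built explicitly in *R* like this:

```{r, eval=FALSE}
# The levels argument defines the ordering None < Some < Marked.
factor(c("None", "Some", "Marked"), levels = c("None", "Some", "Marked"), ordered = TRUE)
```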

Let's add some new *categorical* features to see if they help.

Of course these features are highly correlated with the Age feature. Usually that is not a good thing in Machine Learning. Fortunately, tree algorithms (including boosted trees) are very robust in this specific case.

```{r}
df[,AgeDiscret:= as.factor(round(Age/10,0))][1:10]
```

> For the first feature we create groups of age by rounding the real age.
> Note that we transform it to a `factor` so the algorithm treats these age groups as independent values.
> Therefore, 20 is not closer to 30 than to 60. In short, the distance between ages is lost in this transformation.

Next is an even stronger simplification of the real age, with an arbitrary split at 30 years old. I chose this value **based on nothing**. We will see later whether simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).

```{r}
df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))][1:10]
```

We remove ID as there is nothing to learn from this feature (it would just add some noise, as the dataset is small).

```{r, results='hide'}
df[,ID:=NULL]
```

Let's list the different values of the column Treatment.

```{r}
print(levels(df[,Treatment]))
```

Next step: we will transform the categorical data into dummy variables.
This is the [one hot encoding](http://en.wikipedia.org/wiki/One-hot) part.

The purpose is to transform each value of each *categorical* feature into a binary feature `{0, 1}`.

For example, the column Treatment will be replaced by two columns, Placebo and Treated. Each of them will be *binary*. Therefore, an observation which had the value Placebo in the column Treatment before the transformation will, after the transformation, have the value `1` in the new column Placebo and the value `0` in the new column Treated.

The column Improved is excluded because it will be our output column, the one we want to predict.

```{r, warning=FALSE,message=FALSE}
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
print(sparse_matrix[1:10,])
```

> The formula `Improved~.-1` used above means: transform all *categorical* features except the column Improved to binary values.

Create the output `numeric` vector (not as a sparse `Matrix`):

1. set, for all rows, the field in the Y column to `0`;
2. set Y to `1` when Improved == Marked;
3. return the Y column.

```{r}
output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y]
```

Build the model
===============

The code below is very standard. For more information, you can look at the documentation of the `xgboost` function (or at the vignette [Xgboost presentation](www.somewhere.org)).

```{r}
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
               eta = 1, nround = 10, objective = "binary:logistic")
```

You can see plenty of `train-error: 0.XXXXX` lines; the error decreases from round to round. Each line shows how well the model explains your data. Lower is better.

A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and is not that good at predicting the future).

> Here you can see the numbers decrease until line 7 and then increase.
> It probably means I am overfitting. To fix that, I could reduce the number of rounds to `nround = 4`.
> I will leave things as they are because I don't really care for the purpose of this example :-)
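
If you did want to apply that fix, a minimal sketch (not evaluated here, and not part of the original analysis) would simply refit with fewer rounds:

```{r, eval=FALSE}
# Same call as above, with nround lowered to 4 to stop before the error starts rising again.
bst_fewer_rounds <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
                            eta = 1, nround = 4, objective = "binary:logistic")
```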

Feature importance
==================

Measure feature importance
--------------------------

In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, one binary column == one value of one *categorical* feature).

```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
print(importance)
```

> The column `Gain` provides the information we are looking for.
>
> As you can see, features are ranked by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to a branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying: if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite, both new branches being more accurate than the single branch before the split).

`Cover` measures the relative quantity of observations concerned by a feature.

`Frequence` is a simpler way to measure `Gain`. It just counts the number of times a feature is used across all generated trees. You should not use it (unless you know why you want to use it).
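
Since `xgb.importance` returns a `data.table`, you can keep just the measure discussed above. This is an optional sketch, assuming the `importance` object created earlier:

```{r, eval=FALSE}
# Keep only the feature names and their Gain.
importance[, .(Feature, Gain)]
```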

Plotting the feature importance
-------------------------------

All these things are nice, but it would be even better to plot the results. Fortunately, such a function already exists.

```{r, fig.width=8, fig.height=5, fig.align='center'}
xgb.plot.importance(importance_matrix = importance)
```

Features have automatically been divided into 2 clusters: the interesting features... and the others.

> Depending on the dataset and the learning parameters you may have more than two clusters.
> The default is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.

According to the plot above, the most important features in this dataset for predicting whether the treatment will work are:

* the Age;
* having received a placebo or not;
* the Sex comes third, but it is already in the cluster of uninteresting features;
* then come our generated features (AgeDiscret). We can see that their contribution is very low.

Do these results make sense?
------------------------------

Let's check the **Chi2** statistic between each of these features and the outcome.

A higher **Chi2** means a stronger association.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$Age, df$Y)
print(c2)
```

The Pearson chi-squared statistic between Age and the disappearance of the illness is **`r round(c2$statistic, 2 )`**.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeDiscret, df$Y)
print(c2)
```

Our first simplification of Age gives a chi-squared statistic of **`r round(c2$statistic, 2)`**.

```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeCat, df$Y)
print(c2)
```

The perfectly arbitrary split I made between young and old at 30 years old has a low chi-squared statistic of **`r round(c2$statistic, 2)`**. This is a result we could expect: maybe in my mind being over 30 means being old (I am 32 and starting to feel old, which may explain it), but for the illness we are studying, the age at which one is vulnerable is not the same.

Morality: don't let your *gut* lower the quality of your model.

In the expression *data science*, there is the word *science* :-)

Conclusion
==========

As you can see, in general *destroying information by simplifying it won't improve your model*. The **Chi2** statistics just demonstrated that.

But in more complex cases, creating a new feature from existing ones which makes the link with the outcome more obvious may help the algorithm and improve the model.

The case studied here is not complex enough to show that. Check the Kaggle forum for some challenging datasets. However, adding arbitrary rules almost always makes things worse.

Moreover, you can notice that even though we added some useless new features highly correlated with other features, the boosted tree algorithm was still able to choose the best one, which in this case is the Age.

A linear model may not be that robust in this scenario.

```{r, fig.align='center', include=FALSE}
#xgb.plot.tree(sparse_matrix@Dimnames[[2]], model = bst, n_first_tree = 1, width = 1200, height = 800)
```

Special Note: What about Random Forests?
=======================================

As you may know, the [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting; both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.

Both train several decision trees on one dataset. The *main* difference is that in a Random Forest the trees are independent, while in boosting, tree N+1 focuses its learning on the loss (= what has not been well modeled by tree N).

This difference has an impact on feature importance analysis: the case of *correlated features*.

Imagine two perfectly correlated features, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (this is true in both boosting and Random Forest).

However, in a Random Forest this random choice is made for each tree, because each tree is independent of the others. Therefore, approximately (depending on your parameters), 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the **importance** of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. You won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...

In boosting, when a specific link between a feature and the outcome has been learned by the algorithm, it will try not to refocus on it (in theory this is what happens; reality is never that simple). Therefore, all the importance will be on `A` or on `B`. You will know that one feature plays an important role in the link between your dataset and the outcome. It is still up to you to search for the features correlated with the one detected as important, if you need all of them.
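
If you want to see this dilution effect for yourself, here is a hedged, not-evaluated sketch (it is not part of the original analysis): it duplicates one binary column of `sparse_matrix` to create a perfectly correlated copy, refits the boosted model, and lets you inspect how the importance is shared between the original column and its copy. The column name `TreatmentPlacebo` comes from the one-hot encoding above; adjust it if your column names differ.

```{r, eval=FALSE}
# Build a perfectly correlated duplicate of an existing binary feature.
copy <- sparse_matrix[, "TreatmentPlacebo", drop = FALSE]
colnames(copy) <- "TreatmentPlaceboCopy"
correlated <- cbind(sparse_matrix, copy)

# Refit with the same parameters as before and look at where the importance goes.
bst_corr <- xgboost(data = correlated, label = output_vector, max.depth = 4,
                    eta = 1, nround = 10, objective = "binary:logistic")
xgb.importance(correlated@Dimnames[[2]], model = bst_corr)
```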