improved vignette text

parent a30635e0b4
commit 423c3e6a8d
@@ -36,8 +36,8 @@ Sometimes the dataset we have to work on has *categorical* data.

A *categorical* variable is one which has a fixed number of different values. For example, if for each observation a variable called *Colour* can only take *red*, *blue* or *green* as its value, it is a *categorical* variable.

In **R**, a *categorical* variable is called a `factor`.
Type `?factor` in the console for more information.

> In **R**, a *categorical* variable is called a `factor`.
> Type `?factor` in the console for more information.
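
To make this concrete, here is a small illustration (not part of the original vignette) of how **R** stores a *categorical* variable as a `factor`:

```{r}
# Hypothetical Colour variable: the distinct values become the levels and each
# observation is stored as an index into them.
colour <- factor(c("red", "blue", "green", "red"))
levels(colour)
str(colour)
```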
In this demo we will see how to transform a dense data frame with *categorical* variables into a sparse matrix before analyzing it in **Xgboost**.

@@ -62,18 +62,21 @@ Now we will check the format of each column.
str(df)
```

Two columns have the `factor` type and one has the `ordinal` type (an `ordinal` variable is a categorical variable whose values can be ordered, here: `None` > `Some` > `Marked`).

> Two columns have the `factor` type and one has the `ordinal` type.
> An `ordinal` variable is a categorical variable whose values can be ordered.
> Here: `None` > `Some` > `Marked`.
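
As a side illustration (not from the original vignette), an ordered factor can be built by hand; the order of the `levels` argument is what defines the ordering:

```{r}
# Standalone example: an ordered factor. Comparisons and sort() use the level order.
improvement <- factor(c("Some", "None", "Marked", "None"),
                      levels = c("None", "Some", "Marked"),
                      ordered = TRUE)
improvement
improvement > "None"
```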
Let's add some new categorical features to see if it helps.

Of course these features are highly correlated to the Age feature. Usually it's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in the case of highly correlated features.

For the first feature we create groups of age by rounding the real age. Note that we transform it to `factor` so the algorithm treats them as independent values.

```{r}
# Discretise Age into roughly ten-year buckets and store the result as a factor
df[, AgeDiscret := as.factor(round(Age / 10, 0))][1:10]
```

> For the first feature we create groups of age by rounding the real age.
> Note that we transform it to `factor` so the algorithm treats them as independent values.

The following is an even stronger simplification of the real age, with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).

```{r}
@@ -99,7 +102,7 @@ The purpose is to transform each value of each *categorical* feature in a binary

For example, the column Treatment will be replaced by two columns, Placebo and Treated. Each of them will be *binary*. An observation which had the value Placebo in the column Treatment before the transformation will have, after the transformation, the value 1 in the new column Placebo and the value 0 in the new column Treated.

The formula `Improved~.-1` used below means: transform all *categorical* features but the column Improved to binary values.

> The formula `Improved~.-1` used below means: transform all *categorical* features but the column Improved to binary values.

Column Improved is excluded because it will be our output column, the one we want to predict.
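
As a rough sketch of this step (the names `df` and `sparse_matrix` follow the vignette's chunks; the exact call in the vignette may differ slightly), the transformation boils down to one call to `sparse.model.matrix` from the **Matrix** package:

```{r}
# Assumes `df` from the previous chunks. Builds the one-hot encoded sparse matrix,
# dropping the intercept (-1) and leaving Improved out of the predictors.
library(Matrix)
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
head(sparse_matrix)
```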
@@ -133,8 +136,9 @@ You can see plenty of `train-error: 0.XXXXX` lines followed by a number. It decr

A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and is not that good at predicting the future).

Here you can see the numbers decrease until line 7 and then increase. It probably means I am overfitting. To fix that I could reduce the number of rounds to `nround = 4`. I will leave things as they are because I don't really care for the purpose of this example :-)

> Here you can see the numbers decrease until line 7 and then increase.
> It probably means I am overfitting. To fix that I could reduce the number of rounds to `nround = 4`.
> I will leave things as they are because I don't really care for the purpose of this example :-)
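
If you did want to fix it, a minimal sketch would be to retrain with fewer rounds; the variable names and the other parameters below are assumed to match the training chunk shown earlier in the vignette:

```{r}
# Sketch only: retrain with fewer boosting rounds to reduce overfitting.
# `sparse_matrix`, `output_vector` and the parameters are assumed from the
# earlier training chunk; xgboost is assumed to be already loaded.
bst4 <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
                eta = 1, nthread = 2, nround = 4, objective = "binary:logistic")
```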
Feature importance
==================
@@ -149,7 +153,8 @@ importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
print(importance)
```

The column `Gain` provides the information we are looking for.

> The column `Gain` provides the information we are looking for.
> As you can see, features are classified by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to a branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite), both new branches being more accurate than the branch before the insertion of the feature.

@@ -157,8 +162,6 @@ The column `Gain` provides the information we are looking for.

`Frequence` is a simpler alternative to `Gain`: it just counts the number of times a feature is used across all generated trees. You should not use it (unless you know why you want to use it).

As you can see, features are classified by `Gain`.

Plotting the feature importance
-------------------------------
@@ -170,6 +173,9 @@ xgb.plot.importance(importance_matrix = importance)

Features have been automatically divided into 2 clusters: the interesting features... and the others.

> Depending on the case you may have more than two clusters.
> The default value is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.

According to the plot above, the most important feature in this dataset to predict if the treatment will work is:

* the Age;
@@ -177,8 +183,6 @@ According to the plot above, the most important feature in this dataset to predi
* the sex is third but already included in the not interesting features;
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.

*Note: Depending on the case you may have more than two clusters. The default value is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.*

Do these results make sense?
------------------------------
@@ -224,4 +228,17 @@ Linear models may not be that strong in these scenarios.
#xgb.plot.tree(sparse_matrix@Dimnames[[2]], model = bst, n_first_tree = 1, width = 1200, height = 800)
```

Special Note: What about Random Forest?
=======================================
As you may know, the [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting; both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.

Both train several decision trees on one dataset. The *main* difference is that in Random Forest the trees are independent, while in boosting tree N+1 focuses its learning on what has not been well modelled by tree N (and so on...).

This difference has an impact on a corner case in feature importance analysis: the *correlated features*.

Imagine two perfectly correlated features, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (this is true in both boosting and Random Forest).

However, in Random Forest this choice will be made plenty of times, because the trees are independent. Therefore the **importance** of that information is diluted between features `A` and `B`, and you won't easily see how important they are for predicting your target.

In boosting, once an aspect of your dataset has been learned by the algorithm, there is no need to refocus on it. Therefore, all the importance will be on `A` or on `B`. You will know that one of them is important; it is up to you to search for correlated features.
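
Here is a small, self-contained sketch of this effect (not part of the original vignette; the data is synthetic and the `xgboost()` call mirrors the parameter style used earlier in this vignette): duplicate a feature, train a boosted model, and look at where the importance lands.

```{r}
# Two perfectly correlated features A and B (B is an exact copy of A).
# With boosting, the importance tends to concentrate on only one of them.
set.seed(1)
n <- 1000
A <- rnorm(n)
B <- A
y <- as.numeric(A + rnorm(n, sd = 0.1) > 0)
X <- as.matrix(data.frame(A = A, B = B))
bst_demo <- xgboost(data = X, label = y, max.depth = 2, eta = 1,
                    nthread = 2, nround = 10, objective = "binary:logistic")
xgb.importance(colnames(X), model = bst_demo)
```

You should typically see the `Gain` concentrated on only one of the two columns.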
@@ -130,13 +130,16 @@ aside {
  width: 390px;
}
blockquote {
  border-left:.5em solid #eee;
  padding: 0 1em;
  margin-left:0;
  max-width: 476px;
  font-size:14px;
  border-left:.5em solid #606AAA;
  background: #f5f5f5;
  color:#bfbfbf;
  padding: 5px;
  margin-left:25px;
  max-width: 500px;
}
blockquote cite {
  /* font-size:14px; */
  font-size:14px;
  line-height:20px;
  color:#bfbfbf;
}
@@ -146,7 +149,6 @@ blockquote cite:before {

blockquote p {
  color: #666;
  max-width: 460px;
}
hr {
  /* width: 540px; */