Vignette text
@ -49,6 +49,8 @@ A *categorical* variable is one which have a fixed number of different values. F
Conversion from categorical to numeric variables
------------------------------------------------

### Looking at the raw data

In this demo we will see how to transform a *dense* dataframe (*dense* = few zeroes in the matrix) with *categorical* variables into a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.

The method we are going to use is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
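To make the *dense* / *sparse* distinction concrete, here is a tiny illustration (not part of the original data or code; it assumes the Matrix package, which provides the sparse matrices used in this vignette):

```{r, eval=FALSE}
# Tiny illustration: the same values stored densely and sparsely.
# In the sparse representation only the non-zero entries are kept,
# which is why one-hot encoded data stays cheap to store.
library(Matrix)
m <- matrix(c(1, 0, 0, 0, 0, 3), nrow = 2)
m                         # dense storage: every cell is kept
Matrix(m, sparse = TRUE)  # sparse storage (dgCMatrix)
```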
@ -80,9 +82,11 @@ str(df)
>
> `Marked > Some > None`
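As a quick aside (this snippet is not part of the original vignette), this is how an *ordered* factor encodes a ranking such as `Marked > Some > None` in R:

```{r, eval=FALSE}
# Ordered factors keep both the set of levels and their ranking,
# so comparisons between values are meaningful.
x <- factor(c("None", "Some", "Marked"),
            levels = c("None", "Some", "Marked"), ordered = TRUE)
x[3] > x[1]  # TRUE: "Marked" ranks above "None"
```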
### Creation of new features based on old ones
We will add some new *categorical* features to see if they help.

These features will be highly correlated with the `Age` feature. Usually that is not a good thing in machine learning. Fortunately, decision tree algorithms (including boosted trees) are robust to correlated features.
```{r}
head(df[,AgeDiscret:= as.factor(round(Age/10,0))])
@ -93,7 +97,7 @@ head(df[,AgeDiscret:= as.factor(round(Age/10,0))])
> Note that we transform it to `factor` so the algorithm treats these age groups as independent values.
> Therefore, 20 is not closer to 30 than to 60. To make it short, the distance between ages is lost in this transformation.

Following is an even stronger simplification of the real age, with an arbitrary split at 30 years old. We choose this value **based on nothing**. We will see later whether simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).

```{r}
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
@ -112,12 +116,14 @@ levels(df[,Treatment])
```
### One-hot encoding
Next step, we will transform the categorical data into dummy variables.
This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.

The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.

For example, the column `Treatment` will be replaced by two columns, `Placebo` and `Treated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will, after the transformation, have the value `1` in the new column `Placebo` and the value `0` in the new column `Treated`. The column `Treatment` itself will disappear during the one-hot encoding.

Column `Improved` is excluded because it will be our `label` column, the one we want to predict.
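As a self-contained illustration of what this encoding produces (the toy `Treatment` column below is made up for the example and is not the vignette's data), base R's `model.matrix` with `-1` in the formula yields one binary column per level:

```{r, eval=FALSE}
# Toy illustration of one-hot encoding: one categorical column becomes
# one binary column per level.
toy <- data.frame(Treatment = factor(c("Placebo", "Treated", "Placebo")))
model.matrix(~ Treatment - 1, data = toy)
#   TreatmentPlacebo TreatmentTreated
# 1                1                0
# 2                0                1
# 3                1                0
```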
@ -130,14 +136,14 @@ head(sparse_matrix)
Create the output `numeric` vector (not as a sparse `Matrix`):

```{r}
output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y]
```

1. set `Y` vector to `0`;
2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE`;
3. return `Y` vector.
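For readers less familiar with `data.table` chaining, an equivalent way to build the same label vector (a sketch, not part of the original vignette) is:

```{r, eval=FALSE}
# Same result without data.table syntax: 1 when the improvement is "Marked",
# 0 otherwise.
output_vector <- as.numeric(df$Improved == "Marked")
```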
Build the model
===============
@ -155,7 +161,7 @@ A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitti
> Here you can see the numbers decrease until line 7 and then increase.
>
> It probably means we are overfitting. To fix that, we should reduce the number of rounds to `nround = 4`. We will leave it as is, because we don't really care for the purpose of this example :-)
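If we did want to act on that, reducing the number of rounds is a one-line change. The sketch below is only an illustration: the other parameter values are assumptions meant to mirror the earlier training call, not a copy of it.

```{r, eval=FALSE}
# Re-train with fewer boosting rounds to limit overfitting (sketch only;
# max_depth, eta and objective are assumed values, adjust to the actual call).
bst4 <- xgboost(data = sparse_matrix, label = output_vector,
                max_depth = 4, eta = 1, nrounds = 4,
                objective = "binary:logistic")
```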
Feature importance
==================
@ -163,26 +169,26 @@ Feature importance
Measure feature importance
--------------------------

In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature).

```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
head(importance)
```
> The column `Gain` provides the information we are looking for.
>
> As you can see, features are classified by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite).

`Cover` measures the relative quantity of observations concerned by a feature.

`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
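If you are curious where such a count could come from, the xgboost package can dump the individual trees as a `data.table` with `xgb.model.dt.tree`, and you can tally the splits per feature yourself. A sketch (the `Feature` column name may differ across xgboost versions):

```{r, eval=FALSE}
# Dump the trees of `bst` and count how many splits use each feature.
trees <- xgb.model.dt.tree(sparse_matrix@Dimnames[[2]], model = bst)
trees[Feature != "Leaf", .N, by = Feature]
```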
We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features count to predict whether the illness will go away or not. But we don't yet know the role of these features. For instance, one of the questions we may want to answer would be: does receiving a placebo treatment help to recover from the illness?

One simple solution is to count the co-occurrences of a feature and a class of the classification.

For that purpose we will execute the same function as above but using two more parameters, `data` and `label`.
@ -192,14 +198,14 @@ importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data =
# Cleaning for better display
importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)]

head(importanceClean)
```

> In the table above we have removed two unneeded columns and kept only the first lines.
The first thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Each split is present, therefore a feature can appear several times in this table. Here we can see the feature `Age` is used several times with different splits.

How is the split applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.

The two other new columns are `RealCover` and `RealCover %`. The first column measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.
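You can verify such a co-occurrence count by hand directly on the data. A sketch (the `61.5` split value is the one quoted above; `df` is the `data.table` built earlier):

```{r, eval=FALSE}
# Count the observations matching the split (Age < 61.5) whose improvement
# is "Marked", i.e. the quantity summarised by RealCover for that line.
sum(df$Age < 61.5 & df$Improved == "Marked")
```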
@ -210,7 +216,7 @@ Therefore, according to our findings, getting a placebo doesn't seem to help but
Plotting the feature importance
-------------------------------

All these things are nice, but it would be even better to plot the results.

```{r, fig.width=8, fig.height=5, fig.align='center'}
xgb.plot.importance(importance_matrix = importanceRaw)
@ -221,9 +227,9 @@ Feature have automatically been divided in 2 clusters: the interesting features.
> Depending on the dataset and the learning parameters you may have more than two clusters.
> The default is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.

According to the plot above, the most important features in this dataset to predict if the treatment will work are:

* the Age;
* having received a placebo or not;
* the sex is third, but it is already included in the non-interesting features;
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.
@ -231,7 +237,7 @@ According to the plot above, the most important feature in this dataset to predi
Do these results make sense?
------------------------------

Let's check some **Chi2** between each of these features and the label.

Higher **Chi2** means better correlation.
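The test itself is one line of base R. A sketch (assuming the `df` and `output_vector` objects built earlier; the chunks that actually run the tests fall outside this excerpt):

```{r, eval=FALSE}
# Chi-squared test of association between a feature and the binary label.
chisq.test(df$Age, output_vector)
chisq.test(df$AgeDiscret, output_vector)
```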
@ -267,27 +273,23 @@ As you can see, in general *destroying information by simplifying it won't impro
But in more complex cases, creating a new feature from an existing one that makes the link with the outcome more obvious may help the algorithm and improve the model.

The case studied here is not complex enough to show that. Check the [Kaggle website](http://www.kaggle.com/) for some challenging datasets. However, it's almost always worse when you add some arbitrary rules.
Moreover, you can notice that even though we have added some new features which are not useful and are highly correlated with other features, the boosting tree algorithm has still been able to choose the best one, which in this case is the Age.

A linear model may not be that smart in this scenario.
Special Note: What about Random Forest?
=======================================

As you may know, the [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting; both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.

Both train several decision trees for one dataset. The *main* difference is that in Random Forest, trees are independent, while in boosting, tree `N+1` focuses its learning on the loss (<=> what has not been well modeled by tree `N`).
This difference has an impact on a corner case of feature importance analysis: the *correlated features*.

Imagine two perfectly correlated features, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forest).

However, in Random Forest this random choice will be made for each tree, because each tree is independent from the others. Therefore, approximately, and depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the **importance** of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. You won't easily know this information is important to predict what you want to predict! It is even worse when you have 10 correlated features...
In boosting, when a specific link between feature and outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated to the one detected as important, if you need to know all of them.
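To see this effect for yourself, a small experiment can be sketched. Everything below is an illustration written for this note, not code from the original vignette; it assumes the `df`, `output_vector` and training parameters used earlier, and the exact split of importance you observe may vary:

```{r, eval=FALSE}
# Sketch: add a feature perfectly correlated with Age, re-train, and check
# whether the importance concentrates on one of the two copies rather than
# being split between them (parameter values are assumptions).
library(Matrix)
df2 <- copy(df)
df2[, Y := NULL]       # drop the helper label column added earlier
df2[, AgeCopy := Age]  # perfectly correlated with Age
m2 <- sparse.model.matrix(Improved ~ . - 1, data = df2)
bst2 <- xgboost(data = m2, label = output_vector, max_depth = 4, eta = 1,
                nrounds = 10, objective = "binary:logistic")
xgb.importance(colnames(m2), model = bst2)
```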