vignette text

This commit is contained in:
El Potaeto 2015-02-18 13:13:27 +01:00
parent 1cfa810edb
commit 8fd546ab3c


`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
We can go deeper in the analysis. In the table above, we have discovered which features count to predict if the illness will go away or not. But we don't yet know the role of these features. For instance, one of the questions we will try to answer is: does receiving a placebo help to recover from the illness?
One simple way to see this role is to count the co-occurrences of a feature and a class of the classification.

For that purpose we will execute the same function as above, but using two more parameters, `data` and `label`.
```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
# Cleaning for better display
importance <- importance[,`:=`(Cover=NULL, Frequence=NULL)][1:10,]
print(importance)
```
> In the table above we have removed two columns that are not needed and kept only the first 10 lines.
The first thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. Each split is listed, therefore a feature can appear several times in this table. Here we can see the feature `Age` is used several times with different splits.
How is the split applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61 years whose illness was gone after the treatment.
The two other new columns are `RealCover` and `RealCover %`. The first column measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.
Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help (which seems logical).
> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one-hot-encoded categorical observations validating the rule `< 1.00001` is just like looking for `1` for this feature.
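To make this concrete, here is a small sketch (the `Treatment` column and its values are made up for illustration) showing that a one-hot-encoded sparse matrix only stores the nonzero entries, all equal to `1`, so every stored value satisfies the rule `< 1.00001`:

```{r}
library(Matrix)

# Hypothetical categorical column, one-hot encoded into a sparse matrix
df <- data.frame(Treatment = c("Placebo", "Drug", "Placebo"))
m <- sparse.model.matrix(~ Treatment - 1, data = df)

# Only the nonzero entries are physically stored; for one-hot-encoded
# features they are all 1, hence `< 1.00001` selects exactly those entries.
print(m@x)
```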
Plotting the feature importance
-------------------------------