Merge remote-tracking branch 'origin/master'
Conflicts: R-package/vignettes/discoverYourData.Rmd R-package/vignettes/vignette.css
commit fe4f73920b
@ -39,6 +39,7 @@ Sometimes the dataset we have to work on have *categorical* data.
A *categorical* variable is one that has a fixed number of different values. For example, if for each observation a variable called *Colour* can only take the value *red*, *blue* or *green*, it is a *categorical* variable.

> In *R*, a *categorical* variable is called a `factor`.
>
> Type `?factor` in the console for more information.

In this demo we will see how to transform a dense data frame (dense = few zeros in the matrix) with *categorical* variables into a very sparse matrix (sparse = many zeros in the matrix) of `numeric` features before analyzing the data with **Xgboost**.
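
As a minimal sketch of the transformation described above, assuming a small hypothetical data frame (the `Age` and `Colour` columns and their values are made up for illustration), one way to turn factors into a sparse matrix of numeric features is `sparse.model.matrix` from the **Matrix** package:

```r
library(Matrix)  # recommended package shipped with R

# Hypothetical toy data: one numeric column and one factor column.
df <- data.frame(
  Age    = c(27, 45, 61),
  Colour = factor(c("red", "blue", "green"))
)

# One-hot encode the factor levels; "- 1" drops the intercept column.
sparse_df <- sparse.model.matrix(~ . - 1, data = df)

print(colnames(sparse_df))
print(dim(sparse_df))
```

Each level of `Colour` becomes its own 0/1 column, and only the non-zero entries are stored.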
@ -65,8 +66,10 @@ str(df)
```

> 2 columns have `factor` type, one has `ordinal` type.
-> `ordinal` variable is a categorical variable with values which can be ordered
-> Here: `None` > `Some` > `Marked`.
+>
+> An `ordinal` variable can take a limited number of values, and these values can be ordered.
+>
+> `Marked > Some > None`

Let's add some new *categorical* features to see if it helps.
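
As a rough sketch of both ideas, an `ordinal` variable and a new *categorical* feature can be built in base R as follows. The values, the `age` vector and the threshold of 30 are hypothetical and chosen only for illustration:

```r
# Hypothetical values, for illustration only.
improved <- factor(
  c("None", "Marked", "Some", "None"),
  levels  = c("None", "Some", "Marked"),  # lowest to highest
  ordered = TRUE
)

# Comparisons are meaningful for ordered factors.
marked_beats_some <- improved[2] > improved[3]

# Derive a new categorical feature from a hypothetical numeric age vector.
age     <- c(27, 45, 61, 70)
age_cat <- factor(ifelse(age > 30, "Old", "Young"))

print(marked_beats_some)
print(table(age_cat))
```

The order of `levels` is what defines `Marked > Some > None`; without `ordered = TRUE` the comparison would not be meaningful.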
@ -158,6 +161,7 @@ print(importance)
```

> The column `Gain` provides the information we are looking for.
>
> As you can see, features are ranked by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before a new split on a feature X was added to the branch, some elements were wrongly classified. After the split on this feature, there are two new branches, and each of them is more accurate than the branch before the split: one branch says that an observation on it should be classified as 1, and the other says the exact opposite.
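
The intuition can be illustrated with a small base R sketch on hypothetical labels and a hypothetical binary feature `x`. It only mirrors the accuracy-based explanation given here; the value reported by the library is derived from its training objective and is not computed this way.

```r
# Hypothetical labels (1 = improved) and a hypothetical binary feature x.
label <- c(1, 1, 1, 0, 0, 0, 1, 0)
x     <- c(1, 1, 1, 0, 0, 0, 0, 1)

# Accuracy of a branch that predicts its majority class.
leaf_accuracy <- function(y) max(mean(y == 1), mean(y == 0))

# Before the split: a single branch holding every observation.
acc_before <- leaf_accuracy(label)

# After the split on x: two branches, weighted by their size.
left  <- label[x == 1]
right <- label[x == 0]
acc_after <- (length(left) * leaf_accuracy(left) +
              length(right) * leaf_accuracy(right)) / length(label)

improvement <- acc_after - acc_before
print(c(before = acc_before, after = acc_after, improvement = improvement))
```

Here the single branch is right half of the time, while each of the two new branches is right three times out of four.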
@ -166,6 +170,7 @@ print(importance)

`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

<<<<<<< HEAD
We can go deeper in the analysis. In the table above, we have discovered which features matter for predicting whether the illness will go away or not. But we don't yet know the role of these features.

One simple way to see this role is to count the co-occurrences. For that purpose we will execute the same function but with more arguments.
@ -190,6 +195,8 @@ The two other new columns are `RealCover` and `RealCover %`. In the first column
Therefore, according to our findings, getting a Placebo doesn't seem to help but being less than 61 years old may help.

> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix` there is no 0, so looking for categorical observations that satisfy the rule `< 1.00001` is like looking for `1` for this feature.
=======
>>>>>>> origin/master

Plotting the feature importance
-------------------------------
@ -126,7 +126,7 @@ pre {
}

code {
-font-family: Consolas, Monaco, Andale Mono, monospace;
+font-family: Consolas, Monaco, Andale Mono, monospace, courrier new;
line-height: 1.5;
font-size: 15px;
background: #F8F8F8;
@ -137,14 +137,22 @@ code {
white-space: pre-wrap;
}

<<<<<<< HEAD
p code {
=======
blockquote code {
>>>>>>> origin/master
background: #CDCDCD;
color: #606AAA;
}

code.r, code.cpp {
display: block;
<<<<<<< HEAD
word-wrap: break-word;
=======
word-wrap: break-word;
>>>>>>> origin/master
border: 1px solid #606AAA;
}