Merge remote-tracking branch 'origin/master'

Conflicts:
	R-package/vignettes/discoverYourData.Rmd
	R-package/vignettes/vignette.css
El Potaeto 2015-02-17 23:35:52 +01:00
commit fe4f73920b
2 changed files with 18 additions and 3 deletions

R-package/vignettes/discoverYourData.Rmd

@@ -39,6 +39,7 @@ Sometimes the dataset we have to work on has *categorical* data.
A *categorical* variable is one which has a fixed number of different values. For example, if for each observation a variable called *Colour* can only take the value *red*, *blue* or *green*, it is a *categorical* variable.
> In *R*, a *categorical* variable is called a `factor`.
>
> Type `?factor` in the console for more information.
In this demo we will see how to transform a dense dataframe (dense = few zeros in the matrix) with *categorical* variables into a very sparse matrix (sparse = lots of zeros in the matrix) of `numeric` features before analyzing these data in **Xgboost**.
@@ -65,8 +66,10 @@ str(df)
```
> 2 columns have `factor` type, one has `ordinal` type.
> An `ordinal` variable is a categorical variable with values which can be ordered.
> Here: `None` > `Some` > `Marked`.
>
> An `ordinal` variable can take a limited number of values and these values can be ordered.
>
> `Marked > Some > None`
Let's add some new *categorical* features to see if it helps.
@@ -158,6 +161,7 @@ print(importance)
```
> The column `Gain` provides the information we are looking for.
>
> As you can see, features are sorted by `Gain`.
`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite, both new branches being more accurate than the one before the split).
@@ -166,6 +170,7 @@ print(importance)
`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
<<<<<<< HEAD
We can go deeper in the analysis. In the table above, we have discovered which features matter in predicting whether the illness will go away or not. But we don't yet know the role of these features.
One simple way to see this role is to count the co-occurrences. For that purpose we will execute the same function but with more arguments.
@@ -190,6 +195,8 @@ The two other new columns are `RealCover` and `RealCover %`. In the first column
Therefore, according to our findings, getting a Placebo doesn't seem to help but being less than 61 years old may help.
> You may wonder how to interpret the `< 1.00001 ` on the first line. Basically, in a sparse `Matrix` there is no 0, so looking for categorical observations validating the rule `< 1.00001` is like looking for `1` for this feature.
=======
>>>>>>> origin/master
Plotting the feature importance
-------------------------------

R-package/vignettes/vignette.css

@@ -126,7 +126,7 @@ pre {
}
code {
font-family: Consolas, Monaco, Andale Mono, monospace;
font-family: Consolas, Monaco, Andale Mono, monospace, courrier new;
line-height: 1.5;
font-size: 15px;
background: #F8F8F8;
@@ -137,14 +137,22 @@ code {
white-space: pre-wrap;
}
<<<<<<< HEAD
p code {
=======
blockquote code {
>>>>>>> origin/master
background: #CDCDCD;
color: #606AAA;
}
code.r, code.cpp {
display: block;
<<<<<<< HEAD
word-wrap: break-word;
=======
word-wrap: break-word;
>>>>>>> origin/master
border: 1px solid #606AAA;
}