text vignette

El Potaeto
2015-02-12 17:36:10 +01:00
parent 7f71cc12f4
commit ba36c495be
2 changed files with 17 additions and 14 deletions

@@ -37,14 +37,14 @@ Sometimes the dataset we have to work on have *categorical* data.
A *categorical* variable is one that has a fixed number of different values. For example, if for each observation a variable called *Colour* can only take the value *red*, *blue* or *green*, it is a *categorical* variable.
> In **R**, a *categorical* variable is called a `factor`.
> In *R*, a *categorical* variable is called a `factor`.
> Type `?factor` in console for more information.
In this demo we will see how to transform a dense dataframe (dense = few zeros in the matrix) with *categorical* variables into a very sparse matrix (sparse = lots of zeros in the matrix) of `numeric` features before analyzing these data in **Xgboost**.
The method we are going to see is usually called [one hot encoding](http://en.wikipedia.org/wiki/One-hot).
The first step is to load the Arthritis dataset in memory and wrap it with the `data.table` package (`data.table` is 100% compliant with the **R** dataframe, but its syntax is a lot more consistent and its performance is really good).
The first step is to load the Arthritis dataset in memory and wrap it with the `data.table` package (`data.table` is 100% compliant with the *R* dataframe, but its syntax is a lot more consistent and its performance is really good).
```{r, results='hide'}
data(Arthritis)