R maintenance Feb2017 (#2045)
* [R] better argument check in xgb.DMatrix; fixes #1480 * [R] showsd was a dummy; fixes #2044 * [R] better categorical encoding explanation in vignette; fixes #1989 * [R] new roxygen version docs update
This commit is contained in:
committed by
Tianqi Chen
parent
63aec12a13
commit
b4d97d3cb8
@@ -134,23 +134,24 @@ levels(df[,Treatment])
|
||||
```
|
||||
|
||||
|
||||
#### One-hot encoding
|
||||
#### Encoding categorical features
|
||||
|
||||
Next step, we will transform the categorical data to dummy variables.
|
||||
This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.
|
||||
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
|
||||
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it producess "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
|
||||
|
||||
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
|
||||
|
||||
For example, the column `Treatment` will be replaced by two columns, `Placebo`, and `Treated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will have after the transformation the value `1` in the new column `Placebo` and the value `0` in the new column `Treated`. The column `Treatment` will disappear during the one-hot encoding.
|
||||
For example, the column `Treatment` will be replaced by two columns, `TreatmentPlacebo`, and `TreatmentTreated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will have after the transformation the value `1` in the new column `TreatmentPlacebo` and the value `0` in the new column `TreatmentTreated`. The column `TreatmentPlacebo` will disappear during the contrast encoding, as it would be absorbed into a common constant intercept column.
|
||||
|
||||
Column `Improved` is excluded because it will be our `label` column, the one we want to predict.
|
||||
|
||||
```{r, warning=FALSE,message=FALSE}
|
||||
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
|
||||
sparse_matrix <- sparse.model.matrix(Improved ~ ., data = df)[,-1]
|
||||
head(sparse_matrix)
|
||||
```
|
||||
|
||||
> Formulae `Improved~.-1` used above means transform all *categorical* features but column `Improved` to binary values. The `-1` is here to remove the first column which is full of `1` (this column is generated by the conversion). For more information, you can type `?sparse.model.matrix` in the console.
|
||||
> Formula `Improved ~ .` used above means transform all *categorical* features but column `Improved` to binary values. The `-1` column selection removes the intercept column which is full of `1` (this column is generated by the conversion). For more information, you can type `?sparse.model.matrix` in the console.
|
||||
|
||||
Create the output `numeric` vector (not as a sparse `Matrix`):
|
||||
|
||||
|
||||
Reference in New Issue
Block a user