Update discoverYourData.Rmd (#1482)

* Fixed some typos
* RF is not experimental anymore
Yuan (Terry) Tang 2016-08-19 00:46:45 -05:00 committed by GitHub
parent 669a387c99
commit d5178231cb


@@ -5,7 +5,7 @@ output:
css: vignette.css
number_sections: yes
toc: yes
-author: Tianqi Chen, Tong He, Michaël Benesty
+author: Tianqi Chen, Tong He, Michaël Benesty, Yuan Tang
vignette: >
%\VignetteIndexEntry{Discover your data}
%\VignetteEngine{knitr::rmarkdown}
@@ -18,9 +18,9 @@ Understand your dataset with XGBoost
Introduction
------------
-The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
+The purpose of this vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
-This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
+This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
Package loading:
@@ -36,7 +36,7 @@ if (!require('vcd')) install.packages('vcd')
Preparation of the dataset
--------------------------
-### Numeric VS categorical variables
+### Numeric vs. categorical variables
**Xgboost** manages only `numeric` vectors.
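As a quick illustration of this constraint (a tiny hypothetical example, not taken from the vignette), a `factor` column has to be re-encoded into numeric indicator columns before **Xgboost** can consume it:

```{r}
# A hypothetical factor column: Xgboost cannot ingest it as-is.
x <- data.frame(Sex = factor(c("Male", "Female", "Male")))
# model.matrix() expands the factor into 0/1 indicator columns:
model.matrix(~ Sex - 1, data = x)
```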
@@ -68,7 +68,7 @@ df <- data.table(Arthritis, keep.rownames = F)
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large datasets is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of the **Xgboost** **R** package use `data.table`.
-The first thing we want to do is to have a look to the first lines of the `data.table`:
+The first thing we want to do is to have a look at the first few lines of the `data.table`:
```{r}
head(df)
@@ -103,9 +103,9 @@ Therefore, 20 is not closer to 30 than 60. To make it short, the distance betwe
head(df[,AgeDiscret := as.factor(round(Age/10,0))])
```
-##### Random split in two groups
+##### Random split into two groups
-Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).
+Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. We choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).
```{r}
head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
@@ -139,7 +139,7 @@ levels(df[,Treatment])
Next, we will transform the categorical data into dummy variables.
This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.
-The purpose is to transform each value of each *categorical* feature in a *binary* feature `{0, 1}`.
+The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
For example, the column `Treatment` will be replaced by two columns, `Placebo` and `Treated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will, after the transformation, have the value `1` in the new column `Placebo` and the value `0` in the new column `Treated`. The column `Treatment` will disappear during the one-hot encoding.
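Concretely, this step can be sketched with `sparse.model.matrix()` from the **Matrix** package (assuming the `df` built above, with `Improved` taken as the outcome column and left out of the encoding):

```{r}
library(Matrix)
# "Improved ~ . - 1" one-hot encodes every column except the outcome;
# the "- 1" drops the intercept so each factor level gets its own 0/1 column.
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
head(sparse_matrix)
```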
@@ -317,8 +317,6 @@ In boosting, when a specific link between feature and outcome have been learned
If you want to try the Random Forests™ algorithm, you can tweak Xgboost parameters!
-**Warning**: this is still an experimental parameter.
For instance, to compute a model with 1000 trees, with a 0.5 factor on sampling rows and columns:
```{r, warning=FALSE, message=FALSE}
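# A sketch of the call described above, not the vignette's verbatim code.
# Assumption: the agaricus example data shipped with xgboost serves as input.
# num_parallel_tree grows 1000 trees inside a single boosting round, and
# subsample / colsample_bytree apply the 0.5 sampling factor to rows and
# columns, which together mimic a Random Forests(TM)-style ensemble.
data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               num_parallel_tree = 1000, subsample = 0.5,
               colsample_bytree = 0.5, nrounds = 1,
               objective = "binary:logistic")
```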