Update discoverYourData.Rmd
parent 48deb49ba1
commit c62583bb0f

@@ -15,9 +15,9 @@ vignette: >

Introduction
============

-The purpose of this Vignette is to show you how to use **Xgboost** to discover and better understand your own dataset.
+The purpose of this Vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.

-This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/tqchen/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and an *outcome*.
+This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/tqchen/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.

Package loading:

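A minimal sketch of the kind of loading chunk this step refers to (assuming the vignette relies on `xgboost`, `Matrix`, `data.table` and `vcd`, which provides the `Arthritis` data used later; the exact chunk in the file may differ):

```r
# Packages assumed for the rest of the vignette:
# xgboost for the model, Matrix for the sparse one-hot encoding,
# data.table for fast data manipulation, vcd for the Arthritis dataset.
library(xgboost)
library(Matrix)
library(data.table)
library(vcd)
```
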
@@ -40,7 +40,7 @@ Numeric VS categorical variables

What to do when you have *categorical* data?

-A *categorical* variable is one which have a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, *Colour* is a *categorical* variable.
+A *categorical* variable has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.

> In **R**, a *categorical* variable is called `factor`.
>

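A tiny illustration of the *Colour* example as an R `factor` (mine, not part of the vignette):

```r
# A categorical variable in R: a factor with a fixed set of possible values (levels).
colour <- factor(c("red", "blue", "green", "blue"),
                 levels = c("red", "blue", "green"))
levels(colour)   # "red" "blue" "green"
table(colour)    # counts per level
```
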
@@ -53,9 +53,9 @@ Conversion from categorical to numeric variables

### Looking at the raw data

-In this Vignette we will see how to transform a *dense* dataframe (*dense* = few zero in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
+In this Vignette we will see how to transform a *dense* dataframe (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.

-The method we are going to see is usually called [one hot encoding](http://en.wikipedia.org/wiki/One-hot).
+The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).

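As a rough illustration of what one-hot encoding produces, here is a sketch using `Matrix::sparse.model.matrix`, one common way to do it in R (the vignette's own code may differ):

```r
library(Matrix)

# A small dense data frame with one categorical column.
df <- data.frame(Colour = factor(c("red", "blue", "green", "blue")),
                 Size   = c(1.2, 3.4, 2.2, 0.5))

# One-hot encode: each level of Colour becomes its own 0/1 column.
# The "-1" drops the intercept so every level gets a column of its own.
sparse_mat <- sparse.model.matrix(~ . - 1, data = df)
print(sparse_mat)
```
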
The first step is to load the `Arthritis` dataset in memory and wrap it with the `data.table` package.

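A sketch of what that first step could look like (assuming `Arthritis` comes from the `vcd` package, as is usual):

```r
library(vcd)          # the Arthritis dataset ships with vcd
library(data.table)

data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)

str(df)               # inspect the column types (several factors)
head(df)
```
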
@@ -297,4 +297,4 @@ Imagine two features perfectly correlated, feature `A` and feature `B`. For one

However, in Random Forest this random choice is made for each tree, because each tree is independent of the others. Therefore, approximately, depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. So you won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...

In boosting, when a specific link between a feature and the outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated with the one detected as important, if you need to know all of them.

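To make the contrast concrete, here is a small experiment sketch (mine, not part of the vignette, using the classic `xgboost()` matrix interface): duplicate a feature and look at where the importance ends up.

```r
library(xgboost)

set.seed(1)
n <- 1000
A <- rnorm(n)
B <- A                                  # B is a perfect copy of A
noise <- rnorm(n)
y <- as.numeric(A + noise > 0)          # outcome driven by A (and therefore B)

X <- as.matrix(data.frame(A = A, B = B, noise = noise))
bst <- xgboost(data = X, label = y, nrounds = 10,
               objective = "binary:logistic", verbose = 0)

# With boosting, the gain tends to concentrate on A or on B rather than being
# split roughly evenly between them, as it would be across a Random Forest's trees.
xgb.importance(model = bst)
```
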