From 5a59c0b26c6e8dd872327d6855ce4bb68374b36d Mon Sep 17 00:00:00 2001
From: El Potaeto
Date: Sun, 8 Mar 2015 00:02:14 +0100
Subject: [PATCH 1/3] df spell

---
 R-package/vignettes/discoverYourData.Rmd | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index 9f0280ffc..a0e86601d 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -53,7 +53,7 @@ Conversion from categorical to numeric variables
 
 ### Looking at the raw data
 
-In this Vignette we will see how to transform a *dense* dataframe (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
+In this vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables into a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.
 
 The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
 
@@ -64,7 +64,7 @@ data(Arthritis)
 df <- data.table(Arthritis, keep.rownames = F)
 ```
 
-> `data.table` is 100% compliant with **R** dataframe but its syntax is very consistent and its performance is really good.
+> `data.table` is 100% compliant with **R** `data.frame` but its syntax is very consistent and its performance is really good.
 
 The first thing we want to do is to have a look to the first lines of the `data.table`:

From 05dbc401862b885d1136c4dbcfa8b9067bf4deea Mon Sep 17 00:00:00 2001
From: El Potaeto
Date: Sun, 8 Mar 2015 00:03:40 +0100
Subject: [PATCH 2/3] space

---
 R-package/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/R-package/README.md b/R-package/README.md
index db55b9eb7..51dbbe942 100644
--- a/R-package/README.md
+++ b/R-package/README.md
@@ -2,7 +2,7 @@
 
 ## Installation
 
-For up-to-date version(which is recommended), please install from github. Windows user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
+For the up-to-date version (which is recommended), please install from GitHub. Windows users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
 
 ```r
 devtools::install_github('tqchen/xgboost',subdir='R-package')

From 09e466764e40ce6ea0e00aa169062348bf4743a5 Mon Sep 17 00:00:00 2001
From: El Potaeto
Date: Sun, 8 Mar 2015 00:38:22 +0100
Subject: [PATCH 3/3] Vignette text

---
 R-package/vignettes/discoverYourData.Rmd | 36 ++++++++++++++++--------
 R-package/vignettes/vignette.css         |  2 +-
 2 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index a0e86601d..c9060f012 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -64,7 +64,7 @@ data(Arthritis)
 df <- data.table(Arthritis, keep.rownames = F)
 ```
 
-> `data.table` is 100% compliant with **R** `data.frame` but its syntax is very consistent and its performance is really good.
+> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large datasets is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of the **Xgboost** **R** package use `data.table`.
 
 The first thing we want to do is to have a look to the first lines of the `data.table`:
 
@@ -78,26 +78,30 @@ Now we will check the format of each column.
 str(df)
 ```
 
-> 2 columns have `factor` type, one has `ordinal` type.
+Two columns have `factor` type and one has `ordinal` type.
+
+> An `ordinal` variable:
 >
-> `ordinal` variable can take a limited number of values and these values can be ordered.
->
-> `Marked > Some > None`
+> * can take a limited number of values (like a `factor`);
+> * has values that are ordered (unlike a `factor`). Here the ordered values are: `Marked > Some > None`.
 
 ### Creation of new features based on old ones
 
 We will add some new *categorical* features to see if it helps.
 
-These feature will be highly correlated to the `Age` feature. Usually it's not a good thing in machine learning. Fortunately, decision tree algorithms (including boosted trees) are robust to correlated features.
+#### Grouping per 10 years
+
+For the first feature we create groups of age by rounding the real age.
+
+Note that we transform it to `factor` so the algorithm treats these age groups as independent values.
+
+Therefore, 20 is not closer to 30 than it is to 60. In short, the distance between ages is lost in this transformation.
 
 ```{r}
-head(df[,AgeDiscret:= as.factor(round(Age/10,0))])
+head(df[,AgeDiscret := as.factor(round(Age/10,0))])
 ```
 
-> For the first feature we create groups of age by rounding the real age.
->
-> Note that we transform it to `factor` so the algorithm treat these age groups as independent values.
-> Therefore, 20 is not closer to 30 than 60. To make it short, the distance between ages is lost in this transformation.
+#### Random split into two groups
 
 Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).
 
@@ -105,6 +109,16 @@
 head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
 ```
 
+#### Risks in adding correlated features
+
+These new features are highly correlated with the `Age` feature because they are simple transformations of it.
+
+For many machine learning algorithms, using correlated features is not a good idea. It may sometimes make predictions less accurate, and most of the time it makes interpretation of the model almost impossible. GLM, for instance, assumes that the features are uncorrelated.
+
+Fortunately, decision tree algorithms (including boosted trees) are very robust to correlated features. Therefore we have nothing special to do to manage this situation.
+
+#### Cleaning data
+
 We remove ID as there is nothing to learn from this feature (it would just add some noise).
 
 ```{r, results='hide'}
diff --git a/R-package/vignettes/vignette.css b/R-package/vignettes/vignette.css
index 51908da28..59dfcd85c 100644
--- a/R-package/vignettes/vignette.css
+++ b/R-package/vignettes/vignette.css
@@ -169,7 +169,7 @@ blockquote cite:before {
   /content: '\2014 \00A0';
 }
 
-blockquote p {
+blockquote p, blockquote li {
   color: #666;
 }
 hr {
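
The vignette text these patches edit builds toward one-hot encoding the `Arthritis` features and feeding them to **xgboost**. For anyone trying that flow end to end, here is a minimal R sketch, assuming the `vcd`, `data.table`, `Matrix` and `xgboost` packages are installed; the `sparse.model.matrix()` call and the training parameters are illustrative choices, not code taken from these commits.

```r
library(data.table)
library(vcd)     # provides the Arthritis dataset
library(Matrix)  # provides sparse.model.matrix()
library(xgboost)

data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)

# Recreate the derived features described in the vignette text
df[, AgeDiscret := as.factor(round(Age / 10, 0))]
df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))]
df[, ID := NULL]  # nothing to learn from an identifier

# One-hot encode the categorical columns into a sparse matrix;
# the "-1" drops the intercept so each factor level gets its own 0/1 column
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)

# Binary label: did the treatment lead to a marked improvement?
output_vector <- as.numeric(df[, Improved] == "Marked")

# Train a small boosted-tree model (parameters are illustrative)
bst <- xgboost(data = sparse_matrix, label = output_vector,
               max_depth = 4, eta = 1, nrounds = 10,
               objective = "binary:logistic")
```

The resulting `sparse_matrix` is exactly the kind of very *sparse* `numeric` matrix described in the first commit's wording: one 0/1 column per level of each categorical feature.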