diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index b478e8662..18b6b68a9 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -17,9 +17,9 @@ Introduction

The purpose of this Vignette is to show you how to use **Xgboost** to discover and better understand your own dataset.

-This Vignette is not about showing you how to predict anything (see [Xgboost presentation](www.somewhere.org)). The purpose of this document is to explain how to use **Xgboost** to understand the *link* between the *features* of your data and an *outcome*.
+This Vignette is not about predicting anything (see [Xgboost presentation](https://github.com/tqchen/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and an *outcome*.

-For the purpose of this tutorial we will first load the required packages.
+Package loading:

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
@@ -27,36 +27,49 @@ require(Matrix)
require(data.table)
if (!require('vcd')) install.packages('vcd')
```
-> **VCD** package is used for one of its embedded dataset only (and not for its own functions).
+
+> The **VCD** package is used only for one of its embedded datasets.

Preparation of the dataset
==========================

-**Xgboost** works only on `numeric` variables.
+Numeric vs. categorical variables
+----------------------------------

-Sometimes the dataset we have to work on have *categorical* data.
+**Xgboost** manages only `numeric` vectors.

-A *categorical* variable is one which have a fixed number of different values. By exemple, if for each observation a variable called *Colour* can have only *red*, *blue* or *green* as value, it is a *categorical* variable.
+What to do when you have *categorical* data?

-> In *R*, *categorical* variable is called `factor`.
+A *categorical* variable is one which has a fixed number of different values. For instance, if a variable called *Colour* can have only one of these three values, *red*, *blue* or *green*, then *Colour* is a *categorical* variable.
+
+> In **R**, a *categorical* variable is called a `factor`.
>
-> Type `?factor` in console for more information.
+> Type `?factor` in the console for more information.

-In this demo we will see how to transform a dense dataframe (dense = few zero in the matrix) with *categorical* variables to a very sparse matrix (sparse = lots of zero in the matrix) of `numeric` features before analyzing these data in **Xgboost**.
+To answer the question above, we will convert *categorical* variables to `numeric` ones.
+
+Conversion from categorical to numeric variables
+------------------------------------------------
+
+### Looking at the raw data
+
+In this Vignette we will see how to transform a *dense* dataframe (*dense* = few zeroes in the matrix) with *categorical* variables into a very *sparse* matrix (*sparse* = lots of zeroes in the matrix) of `numeric` features.

The method we are going to see is usually called [one hot encoding](http://en.wikipedia.org/wiki/One-hot).

-The first step is to load Arthritis dataset in memory and wrap the dataset with `data.table` package (`data.table` is 100% compliant with *R* dataframe but its syntax is a lot more consistent and its performance are really good).
+The first step is to load the `Arthritis` dataset in memory and wrap it with the `data.table` package.
```{r, results='hide'}
data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F)
```

-Let's have a look to the 10 first lines of the `data.table`:
+> `data.table` is 100% compliant with **R** dataframes, but its syntax is more consistent and its performance is really good.
+
+The first thing we want to do is to have a look at the first lines of the `data.table`:

```{r}
-print(df[1:10])
+head(df)
```

Now we will check the format of each column.
@@ -71,67 +84,72 @@ str(df)
>
> `Marked > Some > None`

-Let's add some new *categorical* features to see if it helps.
+### Creation of new features based on old ones

-Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in Machine Learning. Fortunately, tree algorithms (including boosted trees) are very robust in this specific case.
+We will add some new *categorical* features to see if they help.
+
+These features will be highly correlated to the `Age` feature. Usually that's not a good thing in machine learning. Fortunately, decision tree algorithms (including boosted trees) are robust to correlated features.

```{r}
-df[,AgeDiscret:= as.factor(round(Age/10,0))][1:10]
+head(df[,AgeDiscret:= as.factor(round(Age/10,0))])
```

> For the first feature we create groups of age by rounding the real age.
+>
> Note that we transform it to a `factor` so the algorithm treats these age groups as independent values.
> Therefore, 20 is not closer to 30 than to 60. To make it short, the distance between ages is lost in this transformation.

-Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
+The following is an even stronger simplification of the real age, with an arbitrary split at 30 years old. I chose this value **based on nothing**. We will see later whether simplifying the information based on arbitrary values is a good strategy (you may already have an idea of how well it will work...).

```{r}
-df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))][1:10]
+head(df[,AgeCat:= as.factor(ifelse(Age > 30, "Old", "Young"))])
```

-We remove ID as there is nothing to learn from this feature (it will just add some noise as the dataset is small).
+We remove ID as there is nothing to learn from this feature (it would just add some noise).

```{r, results='hide'}
df[,ID:=NULL]
```

-Let's list the different values for the column Treatment.
+We will list the different values of the column `Treatment`:

```{r}
-print(levels(df[,Treatment]))
+levels(df[,Treatment])
```

+### One-hot encoding
+
Next step, we will transform the categorical data to dummy variables.
-This is the [one hot encoding](http://en.wikipedia.org/wiki/One-hot) part.
+This is the [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) step.

-The purpose is to transform each value of each *categorical* feature in a binary feature `{0, 1}`.
+The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.

-For example, the column Treatment will be replaced by two columns, Placebo, and Treated. Each of them will be *binary*. Therefore, an observation which has the value Placebo in column Treatment before the transformation will have after the transformation the value `1` in the new column Placebo and the value `0` in the new column Treated.
+For example, the column `Treatment` will be replaced by two columns, `Placebo`, and `Treated`. Each of them will be *binary*. Therefore, an observation which has the value `Placebo` in column `Treatment` before the transformation will have after the transformation the value `1` in the new column `Placebo` and the value `0` in the new column `Treated`. The column `Treatment` will disappear during the one-hot encoding. -Column Improved is excluded because it will be our output column, the one we want to predict. +Column `Improved` is excluded because it will be our `label` column, the one we want to predict. ```{r, warning=FALSE,message=FALSE} sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df) -print(sparse_matrix[1:10,]) +head(sparse_matrix) ``` -> Formulae `Improved~.-1` used above means transform all *categorical* features but column Improved to binary values. +> Formulae `Improved~.-1` used above means transform all *categorical* features but column `Improved` to binary values. The `-1` is here to remove the first column which is full of `1` (this column is generated by the conversion). For more information, you can type `?sparse.model.matrix` in the console. Create the output `numeric` vector (not as a sparse `Matrix`): -1. Set, for all rows, field in Y column to `0`; -2. set Y to `1` when Improved == Marked; -3. Return Y column. - ```{r} output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y] ``` +1. set `Y` vector to `0`; +2. set `Y` to `1` for rows where `Improved == Marked` is `TRUE` ; +3. return `Y` vector. + Build the model =============== -The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or to the vignette [Xgboost presentation](www.somewhere.org)). +The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/tqchen/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). ```{r} bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4, @@ -139,13 +157,13 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4, ``` -You can see plenty of `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well your model explains your data. Lower is better. +You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better. -A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy paste too much the past, and is not that good to predict the future). +A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future). > Here you can see the numbers decrease until line 7 and then increase. -> It probably means I am overfitting. To fix that I may reduce the number of rounds to `nround = 4`. -> I will let things like that because I don't really care for the purpose of this example :-) +> +> It probably means we are overfitting. To fix that I should reduce the number of rounds to `nround = 4`. I will let things like that because I don't really care for the purpose of this example :-) Feature importance ================== @@ -153,67 +171,70 @@ Feature importance Measure feature importance -------------------------- -In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. 
These names are the original values of the feature (remember, one binary column == one value of one *categorical* feature).
+### Build the feature importance data.table
+
+In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the features (remember, each binary column == one value of one *categorical* feature).

```{r}
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
-print(importance)
+head(importance)
```

> The column `Gain` provides the information we are looking for.
>
> As you can see, features are classified by `Gain`.

-`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite, both new branches being more accurate than the one before the split).
+`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite).

`Cover` measures the relative quantity of observations concerned by a feature.

`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

-We can go deeper in the analysis. In the table above, we have discovered which features counts to predict if the illness will go or not. But we don't yet know the role of these features. For instance, one of the question we will try to answer will be: does receiving a placebo helps to recover from the illness?
+### Improvement in the interpretability of the feature importance data.table

-One simple solution is to count the co-occurrences of a feature and a class of the classification.
+We can go deeper in the analysis of the model. In the `data.table` above, we have discovered which features count when predicting whether the illness will go away or not. But we don't yet know the role of these features. For instance, one of the questions we may want to answer would be: does receiving a placebo treatment help to recover from the illness?
+
+One simple solution is to count the co-occurrences of a feature and a class of the classification. For that purpose we will execute the same function as above, but using two more parameters, `data` and `label`.

```{r}
-importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)
+importanceRaw <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst, data = sparse_matrix, label = output_vector)

# Cleaning for better display
-importance <- importance[,`:=`(Cover=NULL, Frequence=NULL)][1:10,]
+importanceClean <- importanceRaw[,`:=`(Cover=NULL, Frequence=NULL)]

-print(importance)
+head(importanceClean)
```

-> In the table above we have removed two not needed columns and select only the first 10 lines.
+> In the table above we have removed two unneeded columns and kept only the first lines.

The first thing you notice is the new column `Split`. It is the split applied to the feature on a branch of one of the trees. As each split is present, a feature can appear several times in this table. Here we can see that the feature `Age` is used several times with different splits.

-How the split is applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61 years with the illness gone after the treatment.
+How is the split applied to count the co-occurrences? It is always `<`. For instance, in the second line, we measure the number of persons under 61.5 years with the illness gone after the treatment.

The two other new columns are `RealCover` and `RealCover %`. The first one measures the number of observations in the dataset where the split is respected and the label is marked as `1`. The second column is the percentage of the whole population that `RealCover` represents.

Therefore, according to our findings, getting a placebo doesn't seem to help, but being younger than 61 years may help (which seems logical).

-> You may wonder how to interpret the `< 1.00001 ` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one hot-encoded categorical observations validating the rule `< 1.00001` is like just looking for `1` for this feature.
+> You may wonder how to interpret the `< 1.00001` on the first line. Basically, in a sparse `Matrix`, there is no `0`, therefore, looking for one-hot-encoded categorical observations validating the rule `< 1.00001` is just like looking for `1` for this feature.

Plotting the feature importance
-------------------------------

-All these things are nice, but it would be even better to plot the result. Fortunately, such function already exists.
+All these things are nice, but it would be even better to plot the results.

```{r, fig.width=8, fig.height=5, fig.align='center'}
-xgb.plot.importance(importance_matrix = importance)
+xgb.plot.importance(importance_matrix = importanceRaw)
```

Features have automatically been divided into 2 clusters: the interesting features... and the others.

-> Depending of the dataset and the learning parameters you may have more than two clusters.
-> Default value is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.
+> Depending on the dataset and the learning parameters, you may have more than two clusters. The default is to limit them to `10`, but you can increase this limit. Look at the function documentation for more information.

-According to the plot above, the most important feature in this dataset to predict if the treatment will work is :
+According to the plot above, the most important features in this dataset to predict whether the treatment will work are :

-* the Age;
+* the Age ;
* having received a placebo or not ;
* the sex, which comes third but is already part of the non-interesting features cluster ;
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.

@@ -221,7 +242,7 @@ According to the plot above, the most important feature in this dataset to predi
Do these results make sense?
------------------------------

-Let's check some **Chi2** between each of these features and the outcome.
+Let's check some **Chi2** between each of these features and the label.

Higher **Chi2** means better correlation.
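Before reading the output below, here is a small, purely illustrative sketch of what such a test looks like (this chunk is a suggestion and is not part of the original vignette): the same `chisq.test` function used in the next chunks can also be applied to the `Treatment` feature and the label `Y` created earlier. The object name `cTreatment` is hypothetical.

```{r}
# Illustrative chi-squared test between the Treatment feature and the
# binary label Y; both columns were created earlier in this vignette.
cTreatment <- chisq.test(df$Treatment, df$Y)
print(cTreatment)
```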
@@ -244,7 +265,7 @@ c2 <- chisq.test(df$AgeCat, df$Y) print(c2) ``` -The perfectly random split I did between young and old at 30 years old have a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. +The perfectly random split I did between young and old at 30 years old have a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. Morality: don't let your *gut* lower the quality of your model. @@ -257,27 +278,23 @@ As you can see, in general *destroying information by simplifying it won't impro But in more complex cases, creating a new feature based on existing one which makes link with the outcome more obvious may help the algorithm and improve the model. -The case studied here is not enough complex to show that. Check Kaggle forum for some challenging datasets. However it's almost always worse when you add some arbitrary rules. +The case studied here is not enough complex to show that. Check [Kaggle website](http://www.kaggle.com/) for some challenging datasets. However it's almost always worse when you add some arbitrary rules. Moreover, you can notice that even if we have added some not useful new features highly correlated with other features, the boosting tree algorithm have been able to choose the best one, which in this case is the Age. -Linear model may not be that strong in these scenario. - -```{r, fig.align='center', include=FALSE} -#xgb.plot.tree(sparse_matrix@Dimnames[[2]], model = bst, n_first_tree = 1, width = 1200, height = 800) -``` +Linear model may not be that smart in this scenario. Special Note: What about Random forest? ======================================= As you may know, [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family. -Both trains several decision trees for one dataset. The *main* difference is that in Random Forest, trees are independent and in boosting tree N+1 focus its learning on the loss (= what has no been well modeled by tree N). +Both trains several decision trees for one dataset. The *main* difference is that in Random Forest, trees are independent and in boosting, the tree `N+1` focus its learning on the loss (<=> what has not been well modeled by the tree `N`). -This difference have an impact on feature importance analysis: the *correlated features*. +This difference have an impact on a corner case in feature importance analysis: the *correlated features*. Imagine two features perfectly correlated, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and random forest). -However, in Random Forest this random choice will be done for each tree, because each tree is independent from the others. Therefore, approximatively, depending of your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the **importance** of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted in `A` and `B`. 
So you won't easily know this information is important to predict what you want to predict! It is even worse when you have 10 correlated features... +However, in Random Forest this random choice will be done for each tree, because each tree is independent from the others. Therefore, approximatively, depending of your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted in `A` and `B`. So you won't easily know this information is important to predict what you want to predict! It is even worse when you have 10 correlated features... -In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is never that simple). Therefore, all the importance will be on `A` or on `B`. You will know that one feature have an important role in the link between your dataset and the outcome. It is still up to you to search for the correlated features to the one detected as important if you need all of them. +In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature have an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them. \ No newline at end of file diff --git a/R-package/vignettes/vignette.css b/R-package/vignettes/vignette.css index b9967535c..10454b3cf 100644 --- a/R-package/vignettes/vignette.css +++ b/R-package/vignettes/vignette.css @@ -60,28 +60,29 @@ h1 { } h2 { - font-size:130% + font-size:130%; / margin: 24px 0 6px; } h3 { - font-size:110% + font-size:110%; text-decoration: underline; - font-style: italic; } -h4 { - font-size:100% - font-variant:small-caps; +h4 { + font-size:100%; + font-style: italic; + font-variant:small-caps; } + h5 { - font-size:100% + font-size:100%; font-weight: 100; font-style: italic; } h6 { - font-size:100% + font-size:100%; font-weight: 100; color:red; font-variant:small-caps; @@ -160,7 +161,6 @@ aside { } blockquote { - font-size:14px; border-left:.5em solid #606AAA; background: #F8F8F8; padding: 0em 1em 0em 1em; @@ -169,7 +169,6 @@ blockquote { } blockquote cite { - font-size:14px; line-height:10px; color:#bfbfbf; } diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd index 63954f18a..d1dbfaa05 100644 --- a/R-package/vignettes/xgboostPresentation.Rmd +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -16,10 +16,10 @@ vignette: > Introduction ============ -This is an introductory document for using the \verb@xgboost@ package in *R*. - **Xgboost** is short for e**X**treme **G**radient **B**oosting package. +The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions. + It is an efficient and scalable implementation of gradient boosting framework by @friedman2001greedy. Two solvers are included: - *linear* model ; @@ -38,43 +38,47 @@ It has several features: * Data File: local data files ; * `xgb.DMatrix`: its own class (recommended). 
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ; -* Customization: it supports customized objective functions and evaluation functions ; -* Performance: it has better performance on several different datasets. - -The purpose of this Vignette is to show you how to use **Xgboost** to make predictions from a model based on your dataset. +* Customization: it supports customized objective functions and evaluation functions. Installation ============ -The first step is to install the package. +Github version +-------------- -For up-to-date version (which is *highly* recommended), install from *Github*: +For up-to-date version (highly recommended), install from *Github*: ```{r installGithub, eval=FALSE} -devtools::install_github('tqchen/xgboost',subdir='R-package') +devtools::install_github('tqchen/xgboost', subdir='R-package') ``` > *Windows* user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first. +Cran version +------------ + For stable version on *CRAN*, run: ```{r installCran, eval=FALSE} install.packages('xgboost') ``` +Learning +======== + For the purpose of this tutorial we will load **Xgboost** package. ```{r libLoading, results='hold', message=F, warning=F} require(xgboost) ``` -In this example, we are aiming to predict whether a mushroom can be eaten or not (yeah I know, like many tutorials, example data are the the same as you will use on in your every day life :-). +Dataset presentation +-------------------- + +In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the the same as you will use on in your every day life :-). Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013. -Learning -======== - Dataset loading --------------- @@ -85,7 +89,9 @@ The datasets are already split in: * `train`: will be used to build the model ; * `test`: will be used to assess the quality of our model. -Without dividing the dataset we would test the model on data the algorithm have already seen. As you may imagine, it's not the best methodology to check the performance of a prediction (can it even be called a *prediction*?). +Why *split* the dataset in two parts? + +In a first part we will build our model. In a second part we will want to test it and assess its quality. Without dividing the dataset we would test the model on data the algorithm have already seen. ```{r datasetLoading, results='hold', message=F, warning=F} data(agaricus.train, package='xgboost') @@ -96,11 +102,14 @@ test <- agaricus.test > In the real world, it would be up to you to make this division between `train` and `test` data. The way you should do it is out of the purpose of this article, however `caret` package may [help](http://topepo.github.io/caret/splitting.html). -Each variable is a `list` containing both label and data. +Each variable is a `list`, each containing two things, `label` and `data`: + ```{r dataList, message=F, warning=F} str(train) ``` +`label` is the outcome of our dataset meaning it is the binary *classification* we will try to predict. + Let's discover the dimensionality of our datasets. ```{r dataSize, message=F, warning=F} @@ -108,50 +117,62 @@ dim(train$data) dim(test$data) ``` -Clearly, we have here a small dataset, however **Xgboost** can manage huge one very efficiently. 
+This dataset is very small so as not to make the **R** package too heavy, however **Xgboost** is built to manage huge datasets very efficiently.

-The loaded `data` are stored in `dgCMatrix` which is a *sparse* matrix type and `label` is a `numeric` vector in `{0,1}`.
+As seen below, the `data` are stored in a `dgCMatrix`, which is a *sparse* matrix, and `label` is a `numeric` vector (`{0,1}`):

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```

-`label` is the outcome of our dataset meaning it is the binary *classification* we want to predict in future data.
-
Basic Training using Xgboost
----------------------------

-The most critical part of the process is the training one.
+This step is the most critical part of the process for the quality of our model.

-We are using the `train` data. As explained above, both `data` and `label` are in a variable.
+### Basic training

-In *sparse* matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, memory size is optimized. It is very usual to have such dataset. **Xgboost** can manage both *dense* and *sparse* matrix.
+We are using the `train` data. As explained above, both `data` and `label` are stored in a `list`.
+
+In a *sparse* matrix, cells containing `0` are not stored in memory. Therefore, in a dataset mainly made of `0`, memory size is reduced. It is very common to have such a dataset.
+
+We will train a decision tree model using the following parameters:
+
+* `objective = "binary:logistic"`: we will train a binary classification model ;
+* `max.depth = 2`: the trees won't be deep, because our case is very simple ;
+* `nround = 2`: there will be two passes over the data, the second one focusing on the data not correctly learned during the first pass.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

-> To reach the value of a variable in a `list` use the `$` character followed by the name.
+> The more complex the link between your features and your `label` is, the more passes you need.

-Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic *R* matrix.
+### Parameter variations
+
+#### Dense matrix
+
+Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R** matrix.

```{r trainingDense, message=F, warning=F}
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

-Above, data and label are not stored together.
+#### xgb.DMatrix

-**Xgboost** offer a way to group them in a `xgb.DMatrix`. You can even add other meta data in it. It will be usefull for the most advanced features we will discover later.
+**Xgboost** offers a way to group `data` and `label` in a `xgb.DMatrix`. You can even add other metadata to it. This will be useful for the most advanced features we will discover later.

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

-**Xgboost** have plenty of features to help you to view how the learning progress internally. The obvious purpose is to help you to set the best parameters, which is the key in model quality you are building.
+#### Verbose option

-One of the most simple way to see the training progress is to set the `verbose` option.
+**Xgboost** has several features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which are the key to your model's quality.
+
+One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).

```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
@@ -169,9 +190,12 @@ bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "b
```

Basic prediction using Xgboost
-------------------------------
+==============================

-The main use of **Xgboost** is to predict data. For that purpose we will use the `test` dataset.
+Perform the prediction
+----------------------
+
+The purpose of the model we have built is to classify new data. As explained before, we will use the `test` dataset for this step.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first few
-print(pred[1:10])
+print(head(pred))
```

-The only thing **Xgboost** do is a regression. But we are in a classification problem. If we think about this regression results, they are just kind of probabilities being classified as `1`.
+These numbers don't look like *binary classification* results `{0,1}`. We need to perform a simple transformation before being able to use them.

-Therefore, we will set the rule if the probability is `> 5` then the observation is classified as `1` and is classified `0` otherwise.
+Transform the regression into a binary classification
+---------------------------------------------------
+
+The only thing that **Xgboost** does is a *regression*. **Xgboost** uses the `label` vector to build its *regression* model.
+
+How can we use a *regression* model to perform a binary classification?
+
+If we think about the meaning of a regression applied to our data, the numbers we get are probabilities that a datum will be classified as `1`. Therefore, we will set the rule that if this probability for a specific datum is `> 0.5` then the observation is classified as `1` (or `0` otherwise).

```{r predictingTest, message=F, warning=F}
+prediction <- as.numeric(pred > 0.5)
+print(head(prediction))
+```
+
+Measuring model performance
+---------------------------
+
+To measure the model performance, we will compute a simple metric, the *average error*.
+
+```{r predictingAverageError, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

-> We remind you that the algorithm has never seen the `test` data before.
+> Note that the algorithm has not seen the `test` data during the model construction.

-Here, we have just computed a simple metric, the average error.
+Step-by-step explanation:

-1. `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression) is over `0.5` the observation is classified as `1` and `0` otherwise ;
+1. `as.numeric(pred > 0.5)` applies our rule that when the probability (<=> regression <=> prediction) is `> 0.5` the observation is classified as `1` and `0` otherwise ;
2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true data and the computed probabilities ;
-3. `mean(vectorOfErrors)` computes the average error itself.
+3. `mean(vectorOfErrors)` computes the *average error* itself.
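Beyond this single number, a quick cross-tabulation of the predicted classes against the true labels shows where the errors are. This small check is only a suggestion (it is not part of the original vignette) and simply reuses the `prediction` vector and `test$label` defined above:

```{r}
# Optional check: confusion table between predicted classes and true labels.
# The off-diagonal cells are the errors counted by the average error above.
table(predicted = prediction, actual = test$label)
```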
-The most important thing to remember is that **to do a classification basically, you just do a regression and then apply a threeshold**.
+The most important thing to remember is that **to do a classification, you just do a regression on the** `label` **and then apply a threshold**.

-Multiclass classification works in a very similar way.
+*Multiclass* classification works in a similar way.

-This metrix is **`r round(err, 2)`** and is pretty low: our yummly mushroom model works well!
-
-Save and load models
--------------------
-
-May be your dataset is big, and it takes time to train a model on it? May be you are not a big fan of loosing time in redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
-
-Hopefully for you, **Xgboost** implements such functions.
-
-```{r saveModel, message=F, warning=F}
-# save model to binary local file
-xgb.save(bst, "xgboost.model")
-```
-
-> `xgb.save` function should return `r TRUE` if everything goes well and crashes otherwise.
-
-An interesting test to see how identic to the original one our saved model is would be to compare the two predictions.
-
-```{r loadModel, message=F, warning=F}
-# load binary model to R
-bst2 <- xgb.load("xgboost.model")
-pred2 <- predict(bst2, test$data)
-
-# And now the test
-print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
-```
-
-```{r clean, include=FALSE}
-# delete the created model
-file.remove("./xgboost.model")
-```
-
-> result is `0`? We are good!
-
-In some very specific cases, like when you want to pilot **Xgboost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it.
-
-```{r saveLoadRBinVectorModel, message=F, warning=F}
-# save model to R's raw vector
-rawVec <- xgb.save.raw(bst)
-
-# print class
-print(class(rawVec))
-
-# load binary model to R
-bst3 <- xgb.load(rawVec)
-pred3 <- predict(bst3, test$data)
-
-# pred2 should be identical to pred
-print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred))))
-```
-
-> Again `0`? It seems that `Xgboost` works prety well!
+This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!

Advanced features
=================

-Most of the features below have been created to help you to improve your model by offering a better understanding of its content.
+Most of the features below have been implemented to help you improve your model by offering a better understanding of its content.


Dataset preparation
@@ -278,9 +268,11 @@ Measure learning progress with xgb.train

Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

-One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following features will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.
+One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a point where having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you to avoid overfitting and to optimize the learning time by stopping it as soon as possible.
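One of these techniques is sketched below: cross-validation with `xgb.cv`, which estimates the error on held-out folds at every round and therefore suggests a reasonable number of rounds. This chunk is only a suggestion and is not part of the original vignette; the exact argument names may vary between package versions, so check `?xgb.cv` before running it (it reuses the `dtrain` object built earlier and is kept as `eval=FALSE`).

```{r, eval=FALSE}
# Illustrative sketch: 5-fold cross-validation on the same training data.
# The test error reported at each round helps to choose a sensible nround value.
cvResult <- xgb.cv(data = dtrain, nround = 10, nfold = 5,
                   max.depth = 2, eta = 1, objective = "binary:logistic")
```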
-One way to measure progress in learning of a model is to provide to the **Xgboost** a second dataset already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
+One way to measure the learning progress of a model is to provide **Xgboost** with a second dataset that is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
+
+> In some way it is similar to what we have done above with the average error. The main difference is that above we measured the error after building the model, whereas here we measure it during the construction.

For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

-**Xgboost** has computed at each round the same average error metric than seen above (we set `nround` to 2, that is why we have two lines of metric here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
+**Xgboost** has computed at each round the same average error metric as seen above (we set `nround` to 2, that is why we have two lines). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

@@ -302,7 +294,10 @@ For a better understanding of the learning progression, you may want to have som

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

-> `eval.metric` allows us to monitor two new metrics for each round, logloss and error.
+> `eval.metric` allows us to monitor two new metrics for each round, `logloss` and `error`.
+
+Linear boosting
+---------------

Until now, all the learning we have performed was based on boosted trees. **Xgboost** implements a second algorithm, based on linear boosting. The only difference with the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

-In this specific case, linear boosting gets sligtly better performance metrics than decision trees based algorithm. In simple case, it will happem because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better to catch a non linear link between predictors and outcome. Check both implementations with your own dataset to have an idea of what to use.
+In this specific case, *linear boosting* gets slightly better performance metrics than the decision-tree-based algorithm.
+In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link.
However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.

Manipulating xgb.DMatrix
------------------------
@@ -353,5 +349,56 @@ xgb.dump(bst, with.stats = T)

> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.

+Save and load models
+--------------------
+
+Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these very rare cases, you will want to save your model and load it when required.
+
+Fortunately for you, **Xgboost** implements such functions.
+
+```{r saveModel, message=F, warning=F}
+# save model to binary local file
+xgb.save(bst, "xgboost.model")
+```
+
+> The `xgb.save` function should return `r TRUE` if everything goes well, and crashes otherwise.
+
+An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
+
+```{r loadModel, message=F, warning=F}
+# load binary model to R
+bst2 <- xgb.load("xgboost.model")
+pred2 <- predict(bst2, test$data)
+
+# And now the test
+print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
+```
+
+```{r clean, include=FALSE}
+# delete the created model
+file.remove("./xgboost.model")
+```
+
+> The result is `0`? We are good!
+
+In some very specific cases, like when you want to pilot **Xgboost** from the `caret` package, you will want to save the model as an **R** binary vector. See below how to do it.
+
+```{r saveLoadRBinVectorModel, message=F, warning=F}
+# save model to R's raw vector
+rawVec <- xgb.save.raw(bst)
+
+# print class
+print(class(rawVec))
+
+# load binary model to R
+bst3 <- xgb.load(rawVec)
+pred3 <- predict(bst3, test$data)
+
+# pred3 should be identical to pred
+print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
+```
+
+> Again `0`? It seems that `Xgboost` works pretty well!
+
References
==========
\ No newline at end of file