From 423c3e6a8d8ed8a53595e33e84a792aad726f7a4 Mon Sep 17 00:00:00 2001 From: El Potaeto Date: Tue, 10 Feb 2015 13:54:30 +0100 Subject: [PATCH 1/5] improved vignette text --- R-package/vignettes/discoverYourData.Rmd | 43 +++++++++++++++++------- R-package/vignettes/vignette.css | 14 ++++---- 2 files changed, 38 insertions(+), 19 deletions(-) diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd index 899efba11..210930dd6 100644 --- a/R-package/vignettes/discoverYourData.Rmd +++ b/R-package/vignettes/discoverYourData.Rmd @@ -36,8 +36,8 @@ Sometimes the dataset we have to work on have *categorical* data. A *categorical* variable is one which have a fixed number of different values. By exemple, if for each observation a variable called *Colour* can have only *red*, *blue* or *green* as value, it is a *categorical* variable. -In **R**, *categorical* variable is called `factor`. -Type `?factor` in console for more information. +> In **R**, *categorical* variable is called `factor`. +> Type `?factor` in console for more information. In this demo we will see how to transform a dense dataframe with *categorical* variables to a sparse matrix before analyzing it in **Xgboost**. @@ -62,18 +62,21 @@ Now we will check the format of each column. str(df) ``` -2 columns have `factor` type, one has `ordinal` type (`ordinal` variable is a categorical variable with values wich can be ordered, here: `None` > `Some` > `Marked`). +> 2 columns have `factor` type, one has `ordinal` type. +> `ordinal` variable is a categorical variable with values wich can be ordered +> Here: `None` > `Some` > `Marked`. Let's add some new categorical features to see if it helps. Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features. -For the first feature we create groups of age by rounding the real age. Note that we transform it to `factor` so the algorithm treat them as independant values. - ```{r} df[,AgeDiscret:= as.factor(round(Age/10,0))][1:10] ``` +> For the first feature we create groups of age by rounding the real age. +> Note that we transform it to `factor` so the algorithm treat them as independant values. + Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I choose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!). ```{r} @@ -99,7 +102,7 @@ The purpose is to transform each value of each *categorical* feature in a binary For example, the column Treatment will be replaced by two columns, Placebo, and Treated. Each of them will be *binary*. For example an observation which had the value Placebo in column Treatment before the transformation will have, after the transformation, the value 1 in the new column Placebo and the value 0 in the new column Treated. -Formulae `Improved~.-1` used below means transform all *categorical* features but column Improved to binary values. +> Formulae `Improved~.-1` used below means transform all *categorical* features but column Improved to binary values. Column Improved is excluded because it will be our output column, the one we want to predict. @@ -133,8 +136,9 @@ You can see plenty of `train-error: 0.XXXXX` lines followed by a number. 
It decreases. A model which fits the training data too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and is not that good at predicting the future).

-Here you can see the numbers decrease until line 7 and then increase. It probably means I am overfitting. To fix that I may reduce the number of rounds to `nround = 4`. I will let things like that because I don't really care for the purpose of this example :-)
-
+> Here you can see the numbers decrease until line 7 and then increase.
+> It probably means I am overfitting. To fix that, I could reduce the number of rounds to `nround = 4`.
+> I will leave things as they are, because I don't really mind for the purpose of this example :-)

Feature importance
==================

@@ -149,7 +153,8 @@ importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
print(importance)
```

-The column `Gain` provide the information we are looking for.
+> The column `Gain` provides the information we are looking for.
+> As you can see, features are ranked by `Gain`.

`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to a branch, some elements were wrongly classified; after adding the split on this feature, there are two new branches, and each of them is more accurate (one branch saying "if your observation is on this branch then it should be classified as 1", the other branch saying the exact opposite, both new branches being more accurate than the single branch before the split was inserted).

`Frequence` is a simpler way to measure the `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).

-As you can see, features are classified by `Gain`.
-
Plotting the feature importance
-------------------------------

@@ -170,6 +173,9 @@ xgb.plot.importance(importance_matrix = importance)
```

Features have been automatically divided into 2 clusters: the interesting features... and the others.

+> Depending on the case, you may have more than two clusters.
+> The default limits them to 10, but you can increase this limit. Look at the function documentation for more information.
+
According to the plot above, the most important features in this dataset to predict whether the treatment will work are:

* the Age;
* the sex comes third, but it is already in the group of not-interesting features;
* then come our generated features (AgeDiscret). We can see that their contribution is very low.

-*Note: Depending of the case you may have more than two clusters. Default value is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.*
-
Do these results make sense?
----------------------------

@@ -224,4 +228,17 @@ Linear model may not be that strong in these scenario.
#xgb.plot.tree(sparse_matrix@Dimnames[[2]], model = bst, n_first_tree = 1, width = 1200, height = 800)
```

Special Note: What about Random Forest?
=======================================

As you may know, the [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting, and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.

Both train several decision trees for one dataset.
The *main* difference is that in Random Forest, trees are independant and in boosting tree N+1 focus its learning on what has no been well modeled by tree N (and so on...). + +This difference have an impact on a corner case in feature importance analysis: the *correlated features*. + +Imagine two features perfectly correlated, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and random forest). + +However, in Random Forest this choice will be done plenty of times, because trees are independant. So the **importance** of a specific feature is diluted among features `A` and `B`. So you won't easily know they are important to predict what you want to predict. + +In boosting, when as aspect of your dataset have been learned by the algorithm, there is no more need to refocus on it. Therefore, all the importace will be on `A` or `B`. You will know that one of them is important, it is up to you to search for correlated features. diff --git a/R-package/vignettes/vignette.css b/R-package/vignettes/vignette.css index c09bf1813..678accf77 100644 --- a/R-package/vignettes/vignette.css +++ b/R-package/vignettes/vignette.css @@ -130,13 +130,16 @@ aside { width: 390px; } blockquote { - border-left:.5em solid #eee; - padding: 0 1em; - margin-left:0; - max-width: 476px; + font-size:14px; + border-left:.5em solid #606AAA; + background: #f5f5f5; + color:#bfbfbf; + padding: 5px; + margin-left:25px; + max-width: 500px; } blockquote cite { - / font-size:14px; + font-size:14px; line-height:20px; color:#bfbfbf; } @@ -146,7 +149,6 @@ blockquote cite:before { blockquote p { color: #666; - max-width: 460px; } hr { / width: 540px; From c0d8ae3781bf02c61662d09e5a5c564ae947adf4 Mon Sep 17 00:00:00 2001 From: El Potaeto Date: Tue, 10 Feb 2015 13:59:13 +0100 Subject: [PATCH 2/5] text change --- R-package/vignettes/discoverYourData.Rmd | 1 - R-package/vignettes/xgboostPresentation.Rmd | 113 ++++++++++++++++++++ 2 files changed, 113 insertions(+), 1 deletion(-) create mode 100644 R-package/vignettes/xgboostPresentation.Rmd diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd index 210930dd6..e6a99bd56 100644 --- a/R-package/vignettes/discoverYourData.Rmd +++ b/R-package/vignettes/discoverYourData.Rmd @@ -5,7 +5,6 @@ output: css: vignette.css number_sections: yes toc: yes -date: "Wednesday, January 28, 2015" --- Introduction diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd new file mode 100644 index 000000000..9dcd12f39 --- /dev/null +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -0,0 +1,113 @@ +--- +title: "Xgboost presentation" +output: + html_document: + css: vignette.css + number_sections: yes + toc: yes +--- + +require(xgboost) +require(methods) +# we load in the agaricus dataset +# In this example, we are aiming to predict whether a mushroom can be eated +data(agaricus.train, package='xgboost') +data(agaricus.test, package='xgboost') +train <- agaricus.train +test <- agaricus.test +# the loaded data is stored in sparseMatrix, and label is a numeric vector in {0,1} +class(train$label) +class(train$data) + +#-------------Basic Training using XGBoost----------------- +# this is the basic usage of xgboost you can put matrix in data field +# note: we are puting in sparse matrix here, xgboost naturally handles sparse input +# use sparse matrix when your feature is sparse(e.g. 
when you using one-hot encoding vector) +print("training xgboost with sparseMatrix") +bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic") +# alternatively, you can put in dense matrix, i.e. basic R-matrix +print("training xgboost with Matrix") +bst <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic") + +# you can also put in xgb.DMatrix object, stores label, data and other meta datas needed for advanced features +print("training xgboost with xgb.DMatrix") +dtrain <- xgb.DMatrix(data = train$data, label = train$label) +bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") + +# Verbose = 0,1,2 +print ('train xgboost with verbose 0, no message') +bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic", verbose = 0) +print ('train xgboost with verbose 1, print evaluation metric') +bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic", verbose = 1) +print ('train xgboost with verbose 2, also print information about tree') +bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic", verbose = 2) + +# you can also specify data as file path to a LibSVM format input +# since we do not have this file with us, the following line is just for illustration +# bst <- xgboost(data = 'agaricus.train.svm', max.depth = 2, eta = 1, nround = 2,objective = "binary:logistic") + +#--------------------basic prediction using xgboost-------------- +# you can do prediction using the following line +# you can put in Matrix, sparseMatrix, or xgb.DMatrix +pred <- predict(bst, test$data) +err <- mean(as.numeric(pred > 0.5) != test$label) +print(paste("test-error=", err)) + +#-------------------save and load models------------------------- +# save model to binary local file +xgb.save(bst, "xgboost.model") +# load binary model to R +bst2 <- xgb.load("xgboost.model") +pred2 <- predict(bst2, test$data) +# pred2 should be identical to pred +print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred)))) + +# save model to R's raw vector +raw = xgb.save.raw(bst) +# load binary model to R +bst3 <- xgb.load(raw) +pred3 <- predict(bst3, test$data) +# pred2 should be identical to pred +print(paste("sum(abs(pred3-pred))=", sum(abs(pred2-pred)))) + +#----------------Advanced features -------------- +# to use advanced features, we need to put data in xgb.DMatrix +dtrain <- xgb.DMatrix(data = train$data, label=train$label) +dtest <- xgb.DMatrix(data = test$data, label=test$label) +#---------------Using watchlist---------------- +# watchlist is a list of xgb.DMatrix, each of them tagged with name +watchlist <- list(train=dtrain, test=dtest) +# to train with watchlist, use xgb.train, which contains more advanced features +# watchlist allows us to monitor the evaluation result on all data in the list +print ('train xgboost using xgb.train with watchlist') +bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, + objective = "binary:logistic") +# we can change evaluation metrics, or use multiple evaluation metrics +print ('train xgboost using xgb.train with watchlist, watch logloss and error') +bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, + eval.metric = "error", eval.metric = "logloss", + objective = "binary:logistic") + +# xgb.DMatrix can also be saved using 
xgb.DMatrix.save +xgb.DMatrix.save(dtrain, "dtrain.buffer") +# to load it in, simply call xgb.DMatrix +dtrain2 <- xgb.DMatrix("dtrain.buffer") +bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist, + objective = "binary:logistic") +# information can be extracted from xgb.DMatrix using getinfo +label = getinfo(dtest, "label") +pred <- predict(bst, dtest) +err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label) +print(paste("test-error=", err)) + +# You can dump the tree you learned using xgb.dump into a text file +xgb.dump(bst, "dump.raw.txt", with.stats = T) + +# Finally, you can check which features are the most important. +print("Most important features (look at column Gain):") +print(xgb.importance(feature_names = train$data@Dimnames[[2]], filename_dump = "dump.raw.txt")) From cefd55ef00f17f301e05ae3af713f2ed4ff650ce Mon Sep 17 00:00:00 2001 From: El Potaeto Date: Tue, 10 Feb 2015 17:09:21 +0100 Subject: [PATCH 3/5] Vignettes improvement --- R-package/vignettes/discoverYourData.Rmd | 4 +- R-package/vignettes/xgboostPresentation.Rmd | 84 ++++++++++++++++----- 2 files changed, 69 insertions(+), 19 deletions(-) diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd index e6a99bd56..d8f0e62e6 100644 --- a/R-package/vignettes/discoverYourData.Rmd +++ b/R-package/vignettes/discoverYourData.Rmd @@ -12,7 +12,7 @@ Introduction The purpose of this Vignette is to show you how to use **Xgboost** to discover and better understand your own dataset. -You may know **Xgboost** as a state of the art tool to build some kind of Machine learning models. It has been used to win several [Kaggle](http://www.kaggle.com/) competition ([more information](https://github.com/tqchen/xgboost)). +You may know **Xgboost** as a state of the art tool to build some kind of Machine learning models. It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competition. During these competition, the purpose is to make prediction. This Vignette is not about showing you how to predict anything. The purpose of this document is to explain *how to use **Xgboost** to understand the link between the features of your data and an outcome*. @@ -24,7 +24,7 @@ require(Matrix) require(data.table) if (!require(vcd)) install.packages('vcd') ``` -*Note that **VCD** is used for one of its embedded dataset only (and not for its own functions).* +> **VCD** is used for one of its embedded dataset only (and not for its own functions). Preparation of the dataset ========================== diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd index 9dcd12f39..77a250a2c 100644 --- a/R-package/vignettes/xgboostPresentation.Rmd +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -7,34 +7,84 @@ output: toc: yes --- +Introduction +============ + +The purpose of this Vignette is to show you how to use **Xgboost** to make prediction from a model based on your own dataset. + +You may know **Xgboost** as a state of the art tool to build some kind of Machine learning models. It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competition. + +For the purpose of this tutorial we will first load the required packages. 
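> If **xgboost** is not installed on your machine yet, the chunk below (not evaluated) sketches one possible way to install the development version straight from this repository. Using `devtools` and this exact `install_github` call are assumptions on my part, not an official instruction from the package.

```{r installSketch, eval=FALSE}
# assumption: the R package lives under 'R-package' in the tqchen/xgboost repository
install.packages("devtools")
devtools::install_github("tqchen/xgboost", subdir = "R-package")
```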
+ +```{r libLoading, results='hold', message=F, warning=F} require(xgboost) require(methods) -# we load in the agaricus dataset -# In this example, we are aiming to predict whether a mushroom can be eated +``` + +In this example, we are aiming to predict whether a mushroom can be eated. + +Learning +======== + +Dataset loading +--------------- + +We load the `agaricus` datasets and link it to variables. + +The dataset is already separated in `train` and `test` data. + +As their names imply, the train part will be used to build the model and the test part to check how well our model works. Without separation we would test the model on data the algorithm have already seen, as you may imagine, it's not the best methodology to check the performance of a prediction (would it even be a prediction?). + +```{r datasetLoading, results='hold', message=F, warning=F} data(agaricus.train, package='xgboost') data(agaricus.test, package='xgboost') train <- agaricus.train test <- agaricus.test -# the loaded data is stored in sparseMatrix, and label is a numeric vector in {0,1} +``` + +> In the reality, it would be up to you to make this division between `train` and `test` data. + +> Each variable is a S3 object containing both label and data. + +The loaded data is stored in `dgCMatrix` which is a **sparse matrix** type. + +Label is a `numeric` vector in `{0,1}`. + +```{r dataClass, message=F, warning=F} +class(train$data)[1] class(train$label) -class(train$data) +``` -#-------------Basic Training using XGBoost----------------- -# this is the basic usage of xgboost you can put matrix in data field -# note: we are puting in sparse matrix here, xgboost naturally handles sparse input -# use sparse matrix when your feature is sparse(e.g. when you using one-hot encoding vector) -print("training xgboost with sparseMatrix") -bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, - objective = "binary:logistic") -# alternatively, you can put in dense matrix, i.e. basic R-matrix -print("training xgboost with Matrix") -bst <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, - objective = "binary:logistic") +Basic Training using XGBoost +---------------------------- -# you can also put in xgb.DMatrix object, stores label, data and other meta datas needed for advanced features -print("training xgboost with xgb.DMatrix") +The most critical part of the process is the training. + +We are using the train data. Both `data` and `label` are in each data (explained above). To access to the field of a `S3` object we use the `$` character in **R**. + +> label is the outcome of our dataset. It is the classification we want to predict. For these data we already have it, but when our model is built, that is this column we want to guess. + +In sparse matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, dataset size is optimized. It is very usual to have such dataset. Xgboost can manage both dense and sparse matrix. + +```{r trainingSparse, message=F, warning=F} +bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") +``` + +Alternatively, you can put your dataset in a dense matrix, i.e. a basic R-matrix. 
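> Before training on the dense version, you may want to peek at what the conversion actually produces. This small illustration (not evaluated, and not required for the rest of the tutorial) assumes the `train` object loaded above.

```{r denseGlimpse, eval=FALSE}
# convert the sparse dgCMatrix to a plain R matrix and look at a small corner of it
denseTrain <- as.matrix(train$data)
dim(denseTrain)       # same dimensions as the sparse version
denseTrain[1:5, 1:5]  # mostly zeros, which the sparse format simply does not store
```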
+ +```{r trainingDense, message=F, warning=F} +bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, + objective = "binary:logistic") +``` + +Above, data and label are not stored together. + +Xgboost offer a way to group them in a `xgb.DMatrix`. You can even add other meta data. It will be usefull for the most advanced features. + +```{r trainingDense, message=F, warning=F} dtrain <- xgb.DMatrix(data = train$data, label = train$label) bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") +``` # Verbose = 0,1,2 print ('train xgboost with verbose 0, no message') From d7ba5c15119c9119323c57b4045a6ef0692c12a7 Mon Sep 17 00:00:00 2001 From: El Potaeto Date: Tue, 10 Feb 2015 19:46:39 +0100 Subject: [PATCH 4/5] text vignette --- R-package/vignettes/xgboostPresentation.Rmd | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd index 77a250a2c..29b8e40f1 100644 --- a/R-package/vignettes/xgboostPresentation.Rmd +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -83,23 +83,24 @@ Xgboost offer a way to group them in a `xgb.DMatrix`. You can even add other met ```{r trainingDense, message=F, warning=F} dtrain <- xgb.DMatrix(data = train$data, label = train$label) -bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") +bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") ``` -# Verbose = 0,1,2 +Below is a demonstration of the effect of verbose parameter. + +```{r trainingVerbose, message=T, warning=F} print ('train xgboost with verbose 0, no message') bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 0) + print ('train xgboost with verbose 1, print evaluation metric') bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 1) + print ('train xgboost with verbose 2, also print information about tree') bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 2) - -# you can also specify data as file path to a LibSVM format input -# since we do not have this file with us, the following line is just for illustration -# bst <- xgboost(data = 'agaricus.train.svm', max.depth = 2, eta = 1, nround = 2,objective = "binary:logistic") +``` #--------------------basic prediction using xgboost-------------- # you can do prediction using the following line From dc9e4905e4a402cf86a4995d0bb00e9fb82a0f4b Mon Sep 17 00:00:00 2001 From: pommedeterresautee Date: Tue, 10 Feb 2015 22:48:16 +0100 Subject: [PATCH 5/5] Vignette text --- R-package/vignettes/xgboostPresentation.Rmd | 133 +++++++++++++++----- 1 file changed, 105 insertions(+), 28 deletions(-) diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd index 29b8e40f1..90bb49d9c 100644 --- a/R-package/vignettes/xgboostPresentation.Rmd +++ b/R-package/vignettes/xgboostPresentation.Rmd @@ -42,10 +42,10 @@ train <- agaricus.train test <- agaricus.test ``` -> In the reality, it would be up to you to make this division between `train` and `test` data. - > Each variable is a S3 object containing both label and data. +> In the real world, it would be up to you to make this division between `train` and `test` data. 
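> A common way to make that division yourself is a random split. The sketch below is illustrative only: `myData` is a hypothetical data frame and the 80/20 ratio is an arbitrary choice.

```{r manualSplitSketch, eval=FALSE}
set.seed(1234)                                   # for a reproducible split
trainIndex <- sample(seq_len(nrow(myData)),      # draw row indices at random
                     size = floor(0.8 * nrow(myData)))
myTrain <- myData[trainIndex, ]                  # 80% of the rows
myTest  <- myData[-trainIndex, ]                 # the remaining 20%
```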
+ The loaded data is stored in `dgCMatrix` which is a **sparse matrix** type. Label is a `numeric` vector in `{0,1}`. @@ -64,7 +64,7 @@ We are using the train data. Both `data` and `label` are in each data (explained > label is the outcome of our dataset. It is the classification we want to predict. For these data we already have it, but when our model is built, that is this column we want to guess. -In sparse matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, dataset size is optimized. It is very usual to have such dataset. Xgboost can manage both dense and sparse matrix. +In sparse matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, dataset size is optimized. It is very usual to have such dataset. **Xgboost** can manage both dense and sparse matrix. ```{r trainingSparse, message=F, warning=F} bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") @@ -79,9 +79,9 @@ bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth Above, data and label are not stored together. -Xgboost offer a way to group them in a `xgb.DMatrix`. You can even add other meta data. It will be usefull for the most advanced features. +**Xgboost** offer a way to group them in a `xgb.DMatrix`. You can even add other meta data. It will be usefull for the most advanced features. -```{r trainingDense, message=F, warning=F} +```{r trainingDmatrix, message=F, warning=F} dtrain <- xgb.DMatrix(data = train$data, label = train$label) bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic") ``` @@ -89,76 +89,153 @@ bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objecti Below is a demonstration of the effect of verbose parameter. ```{r trainingVerbose, message=T, warning=F} -print ('train xgboost with verbose 0, no message') +# verbose 0, no message bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 0) -print ('train xgboost with verbose 1, print evaluation metric') +# verbose 1, print evaluation metric bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 1) -print ('train xgboost with verbose 2, also print information about tree') +# verbose 2, also print information about tree bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 2) ``` -#--------------------basic prediction using xgboost-------------- -# you can do prediction using the following line -# you can put in Matrix, sparseMatrix, or xgb.DMatrix +Basic prediction using Xgboost +------------------------------ + +The main use of **Xgboost** is to predict data. For that purpose we will use the test dataset. We remind you that the algorithm has never seen these data. + +```{r predicting, message=F, warning=F} pred <- predict(bst, test$data) err <- mean(as.numeric(pred > 0.5) != test$label) print(paste("test-error=", err)) +``` -#-------------------save and load models------------------------- +> You can put data in Matrix, sparseMatrix, or xgb.DMatrix + +Save and load models +-------------------- + +When your dataset is big, it may takes time to build a model. Or may be you are not a big fan of loosing time in redoing the same thing again and again. In these cases, you will want to save your model and load it when required. 
Fortunately for you, **Xgboost** implements such functions.

```{r saveLoadModel, message=F, warning=F}
# save model to a binary local file
xgb.save(bst, "xgboost.model")

# load the binary model into R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```

In some very specific cases, for instance when you want to pilot **Xgboost** from `caret`, you will want to save the model as an **R** binary (raw) vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
raw <- xgb.save.raw(bst)

# load the binary model into R
bst3 <- xgb.load(raw)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

Advanced features
=================

Most of the features below have been created to help you improve your model by giving you a better understanding of its content.

Dataset preparation
-------------------

For the following advanced features, we need to put the data in an `xgb.DMatrix`, as explained above.

```{r DMatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
dtest <- xgb.DMatrix(data = test$data, label = test$label)
```

Using xgb.train
---------------

`xgb.train` is a powerful way to follow the learning progress on one or more datasets.

One way to measure progress in the learning of a model is to provide **Xgboost** with a second dataset which is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

For that purpose, you will use the `watchlist` parameter. It is a list of `xgb.DMatrix`, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
                 objective = "binary:logistic")
```

> To train with a watchlist, we use `xgb.train`, which contains more advanced features than the `xgboost` function.

For a better understanding, you may want to monitor some specific metrics, or even several of them at the same time. The `eval.metric` parameter allows us to do that. Hereafter we will watch two new metrics, logloss and error.

```{r watchlist2, message=F, warning=F}
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
                 eval.metric = "error", eval.metric = "logloss",
                 objective = "binary:logistic")
```

Manipulating xgb.DMatrix
------------------------

### Save / Load

Like models, an `xgb.DMatrix` object (which groups both a dataset and its outcome) can also be saved, using the `xgb.DMatrix.save` function.
```{r DMatrixSave, message=F, warning=F}
xgb.DMatrix.save(dtrain, "dtrain.buffer")
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist,
                 objective = "binary:logistic")
```

### Information extraction

Information can be extracted from an `xgb.DMatrix` using the `getinfo` function. Hereafter we will extract the `label` data.

```{r getinfo, message=F, warning=F}
label <- getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err))
```

View the trees from a model
---------------------------

You can dump the trees the model has learned using `xgb.dump` (pass a file name if you want to write them to a text file).

```{r dump, message=T, warning=F}
xgb.dump(bst, with.stats = TRUE)
```

Feature importance
------------------

Finally, you can check which features are the most important.

```{r featureImportance, message=T, warning=F}
importance_matrix <- xgb.importance(feature_names = train$data@Dimnames[[2]], model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix)
```
\ No newline at end of file
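A natural follow-up, once you have looked at feature importance, is to check how stable the test error is across folds. The sketch below is a suggestion only, reusing the `dtrain` and parameters from above; `nfold = 5` is an arbitrary choice and the chunk is not evaluated.

```{r crossValidationSketch, eval=FALSE}
# k-fold cross-validation on the same DMatrix and parameters used earlier
cv <- xgb.cv(data = dtrain, max.depth = 2, eta = 1, nrounds = 2,
             nfold = 5, objective = "binary:logistic")
print(cv)
```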