From 276b68b9845f6bab0aaf1dac98e4f5d607df2b0f Mon Sep 17 00:00:00 2001
From: pommedeterresautee
Date: Thu, 12 Feb 2015 22:22:00 +0100
Subject: [PATCH] Vignette text

---
 R-package/vignettes/xgboostPresentation.Rmd | 151 ++++++++++++--------
 1 file changed, 92 insertions(+), 59 deletions(-)

diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd
index 1325c24e1..34e1055c3 100644
--- a/R-package/vignettes/xgboostPresentation.Rmd
+++ b/R-package/vignettes/xgboostPresentation.Rmd
@@ -28,13 +28,13 @@ It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](ht
It has several features:

-* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than `gbm`.
+* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
* Input Type: it takes several types of input data:
-    * Dense Matrix: *R*'s dense matrix, i.e. `matrix` ;
-    * Sparse Matrix: *R*'s sparse matrix, i.e. `Matrix::dgCMatrix` ;
+    * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
+    * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
    * Data File: local data files ;
-    * `xgb.DMatrix`: it's own class (recommended) ;
-* Sparsity: it accepts sparse input for both *tree booster* and *linear booster*, and is optimized for sparse input ;
+    * `xgb.DMatrix`: its own class (recommended).
+* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
* Customization: it supports customized objective function and evaluation function ;
* Performance: it has better performance on several different datasets.

@@ -43,7 +43,9 @@ The purpose of this Vignette is to show you how to use **Xgboost** to make predi
Installation
============

-For up-to-date version (which is *highly* recommended), please install from Github:
+The first step is of course to install the package.
+
+For the up-to-date version (which is *highly* recommended), install it from Github:

```{r installGithub, eval=FALSE}
devtools::install_github('tqchen/xgboost',subdir='R-package')
```
@@ -51,19 +53,19 @@ devtools::install_github('tqchen/xgboost',subdir='R-package')

> *Windows* user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.

-For stable version on CRAN, please run
+For the stable version on CRAN, run:

```{r installCran, eval=FALSE}
install.packages('xgboost')
```

-For the purpose of this tutorial we will load the required package.
+For the purpose of this tutorial we will load the **Xgboost** package.

```{r libLoading, results='hold', message=F, warning=F}
require(xgboost)
```

-In this example, we are aiming to predict whether a mushroom can be eated (yeah, as always, example data are the exact one you will work on in your every day life :-).
+In this example, we are aiming to predict whether a mushroom can be eaten or not (yeah I know, as in many tutorials, the example data are exactly the data you will work on in your everyday life :-).

Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.

@@ -90,16 +92,25 @@ test <- agaricus.test
# Each variable is a S3 object containing both label and data.
```

-> In the real world, it would be up to you to make this division between `train` and `test` data. The way you should do it is out of the purpose of this article, however `caret` package may help.
+> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is outside the scope of this article, however the `caret` package may [help](http://topepo.github.io/caret/splitting.html).

-The loaded `data` are stored in `dgCMatrix` which is a *sparse matrix* type and `label` is a `numeric` vector in `{0,1}`.
+Let's discover the dimensionality of our datasets.
+
+```{r dataSize, message=F, warning=F}
+dim(train$data)
+dim(test$data)
+```
+
+> Clearly, we have here a small dataset, however **Xgboost** can manage huge ones very efficiently.
+
+The loaded `data` is stored in a `dgCMatrix`, which is a *sparse* matrix type, and `label` is a `numeric` vector in `{0,1}`.

```{r dataClass, message=F, warning=F}
class(train$data)[1]
class(train$label)
```

-`label` is the outcome of our dataset. It is the binary classification we want to predict in future data.
+`label` is the outcome of our dataset, meaning it is the binary *classification* target we want to predict on future data.

Basic Training using Xgboost
----------------------------

@@ -108,7 +119,7 @@ The most critical part of the process is the training one.

We are using the `train` data. As explained above, both `data` and `label` are in a variable.

-In sparse matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, memory size is optimized. It is very usual to have such dataset. **Xgboost** can manage both dense and sparse matrix.
+In a *sparse* matrix, cells containing `0` are not encoded. Therefore, in a dataset where there are plenty of `0`s, memory size is optimized. It is very common to have such a dataset. **Xgboost** can manage both *dense* and *sparse* matrices.

```{r trainingSparse, message=F, warning=F}
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```
@@ -116,11 +127,10 @@ bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta

> To reach the value of a `S3` object field we use the `$` character.

-Alternatively, you can put your dataset in a dense matrix, i.e. a basic R-matrix.
+Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic *R* matrix.

```{r trainingDense, message=F, warning=F}
-bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2,
-               objective = "binary:logistic")
+bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

Above, data and label are not stored together.
@@ -132,22 +142,23 @@ dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```

-**Xgboost** have plenty of features to help you to view how the learning progress internally. The obvious purpose is to help you to set the best parameters, which is the key of the quality of the model you are building.
+**Xgboost** has plenty of features to help you to view how the learning progresses internally. The obvious purpose is to help you to set the best parameters, which is key to the quality of the model you are building.

-One of the most simple way to see the progress is to set the `verbose` option. Look below of the effect of this parameter.
+One of the simplest ways to see the training progress is to set the `verbose` option.

-```{r trainingVerbose, message=T, warning=F}
-# verbose 0, no message
-bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
-               objective = "binary:logistic", verbose = 0)
+```{r trainingVerbose0, message=T, warning=F}
+# verbose = 0, no message
+bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 0)
+```

-# verbose 1, print evaluation metric
-bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
-               objective = "binary:logistic", verbose = 1)
+```{r trainingVerbose1, message=T, warning=F}
+# verbose = 1, print evaluation metric
+bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 1)
+```

-# verbose 2, also print information about tree
-bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2,
-               objective = "binary:logistic", verbose = 2)
+```{r trainingVerbose2, message=T, warning=F}
+# verbose = 2, also print information about tree
+bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 2)
```

Basic prediction using Xgboost
@@ -157,11 +168,35 @@ The main use of **Xgboost** is to predict data. For that purpose we will use the

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)
+
+# size of the prediction vector
+print(length(pred))
+
+# limit display of predictions to the first 10
+print(pred[1:10])
+```
+
+The only thing **Xgboost** does here is a regression. But we are in a classification problem. If we think about these regression results, they are just the probabilities of being classified as `1`.
+
+Therefore, we will set the rule that if the probability is `> 0.5` then the observation is classified as `1`, and as `0` otherwise.
+
+```{r predictingTest, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

-> We remind you that the algorithm has never seen the `test` data.
+> We remind you that the algorithm has never seen the `test` data before.
+
+Here, we have just computed a simple metric, the average error:
+
+* `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression) is over `0.5` the observation is classified as `1`, and as `0` otherwise ;
+* `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the computed predictions ;
+* `mean(vectorOfErrors)` computes the average error itself.
+
+> The most important thing to remember is that, to do a classification, you basically just do a regression and then apply a threshold.
+> Multiclass classification works in a very similar way.
+
+This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!

Save and load models
--------------------

@@ -175,30 +210,37 @@ Hopefully for you, **Xgboost** implements such functions.

```{r saveModel, message=F, warning=F}
xgb.save(bst, "xgboost.model")
```

-An interesting test to see how identic our saved model is to the original one by comparing the two predictions.
+> The `xgb.save` function should return `r TRUE` if everything goes well, and would raise an error otherwise.
+
+An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
```{r loadModel, message=F, warning=F}
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

-# delete the created model (because cleaning is always better than dirtyness)
-rm("xgboost.model")
-
# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```

+```{r clean, include=FALSE}
+# delete the model file created above (file.remove deletes the file on disk)
+file.remove("xgboost.model")
+```
+
> result is `0`? We are good!

-In some very specific cases, like when you want to pilot **Xgboost** from `caret`, you will want to save the model as a *R* binary vector. See below how to do it.
+In some very specific cases, like when you want to pilot **Xgboost** from the `caret` package, you will want to save the model as an *R* `raw` vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
-raw = xgb.save.raw(bst)
+rawVec <- xgb.save.raw(bst)
+
+# print class
+print(class(rawVec))

# load binary model to R
-bst3 <- xgb.load(raw)
+bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred2 should be identical to pred
@@ -223,33 +265,36 @@
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
```

-Measure learning progress xgb.train
------------------------------------
+Measure learning progress with xgb.train
+----------------------------------------

Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special feature of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a time when having too many rounds lead to an overfitting. You can see this feature as a cousin of cross-validation method. The following features will help you to avoid overfitting or optimizing the learning time in stopping it as soon as possible.

-One way to measure progress in learning of a model is to provide to the **Xgboost** a second dataset already classified. Therefore it can learn on the real dataset and test its model on the second one. Some metrics are measured after each round during the learning.
+One way to measure progress in the learning of a model is to provide **Xgboost** with a second dataset which is already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.
+
+For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

-bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
-               objective = "binary:logistic")
+bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

-> For the purpose of this example, we use `watchlist` parameter. It is a list of `xgb.DMatrix`, each of them tagged with a name.
+> **Xgboost** has computed at each round the same average error metric as seen above (we set `nround` to 2, that is why we have two lines of metric here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.
+
+Both training- and test-error metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
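+If you need to build such a train/test split for your own data, below is a minimal sketch of one way to do it with the `caret` package mentioned earlier. The `data.frame` called `df`, its `label` column, the 80/20 ratio and the seed are only illustrative assumptions, not something coming from **Xgboost** itself.
+
+```{r caretSplit, eval=FALSE}
+# Illustrative sketch only (not run): split a hypothetical data.frame `df`
+# into 80% training / 20% test observations with caret.
+require(caret)
+
+set.seed(1234)
+
+# indexes of the rows that will go to the training set,
+# sampled in a stratified way on the outcome column `label`
+inTrain <- createDataPartition(y = df$label, p = 0.8, list = FALSE)
+
+myTrain <- df[inTrain, ]
+myTest  <- df[-inTrain, ]
+```
+
+> Whatever the tool you use for the split, the important point is that the test observations must never be seen by the algorithm during training.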
+If with your own dataset you do not have such results, you should think about how you divided your dataset into training and test. There may be something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/splitting.html).

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

```{r watchlist2, message=F, warning=F}
-bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist,
-               eval.metric = "error", eval.metric = "logloss",
-               objective = "binary:logistic")
+bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

-> `eval.metric` allows us to monitor the evaluation of several metrics at a time. Hereafter we will watch two new metrics, logloss and error.
+> `eval.metric` allows us to monitor two new metrics for each round: logloss and error.

Manipulating xgb.DMatrix
------------------------

Like saving models, `xgb.DMatrix` object (which groups both dataset and outcome) can also be saved using `xgb.DMatrix.save` function.

@@ -262,8 +307,7 @@
```{r DMatrixSave, message=F, warning=F}
xgb.DMatrix.save(dtrain, "dtrain.buffer")
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
-bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist,
-               objective = "binary:logistic")
+bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

### Information extraction

Information can be extracted from `xgb.DMatrix` using the `getinfo` function. Hopefully it will help you to manage your dataset.

```{r getinfo, message=F, warning=F}
label = getinfo(dtest, "label")
pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label))/length(label)
print(paste("test-error=", err))
```

@@ -288,16 +332,5 @@
You can dump the tree you learned using `xgb.dump` into a text file.

```{r dump, message=T, warning=F}
xgb.dump(bst, with.stats = T)
```

> if you provide a path to `fname` parameter you can save the trees to your hard drive.

-Feature importance
-------------------
-
-Finally, you can check which features are the most important and plot the result (more information in vignette [Discover your data with **Xgboost**](www.somewhere.com)).
-
-```{r featureImportance, message=T, warning=F, fig.width=8, fig.height=5, fig.align='center'}
-importance_matrix <- xgb.importance(feature_names = train$data@Dimnames[[2]], model = bst)
-print(importance_matrix)
-xgb.plot.importance(importance_matrix)
-```
-
References
==========