diff --git a/R-package/R/xgb.importance.R b/R-package/R/xgb.importance.R
index d9f70c510..4e22efd19 100644
--- a/R-package/R/xgb.importance.R
+++ b/R-package/R/xgb.importance.R
@@ -9,6 +9,7 @@
 #' @importFrom magrittr %>%
 #' @importFrom Matrix colSums
 #' @importFrom Matrix cBind
+#' @importFrom Matrix sparseVector
 #'
 #' @param feature_names names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
 #'
@@ -82,6 +83,10 @@ xgb.importance <- function(feature_names = NULL, filename_dump = NULL, model = N
     stop("data/label: Provide the two arguments if you want co-occurence computation or none of them if you are not interested but not one of them only.")
   }
 
+  if(class(label) == "numeric"){
+    if(sum(label == 0) / length(label) > 0.5) label <- as(label, "sparseVector")
+  }
+
   if(is.null(model)){
     text <- readLines(filename_dump)
   } else {
diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index 9d1ce1f5e..b478e8662 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -164,7 +164,7 @@ print(importance)
 >
 > As you can see, features are classified by `Gain`.
 
-`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite, both new branches being more accurate than the one before the split).
+`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as `1`, and the other branch saying the exact opposite, both new branches being more accurate than the one before the split).
 
 `Cover` measures the relative quantity of observations concerned by a feature.
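The check added to `xgb.importance` in the first hunk above stores a mostly-zero numeric `label` vector as a `Matrix::sparseVector` before the co-occurrence computation, so that only the non-zero entries are kept in memory. Below is a minimal sketch of what that conversion does on its own; the `label` values are made up for illustration, and the two nested `if` statements from the patch are folded into one condition:

```r
library(Matrix)

# A numeric label vector that is mostly zeros, as in a rare-event classification task.
label <- c(0, 0, 0, 1, 0, 0, 0, 1, 0, 0)

# Same rule as the patch: convert only when more than half of the entries are zero.
if (class(label) == "numeric" && sum(label == 0) / length(label) > 0.5) {
  label <- as(label, "sparseVector")
}

class(label)  # now a sparse vector class from Matrix (e.g. "dsparseVector")
str(label)    # only the positions and values of the non-zero entries are stored
```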
diff --git a/R-package/vignettes/vignette.css b/R-package/vignettes/vignette.css
index 7d370f2f2..b9967535c 100644
--- a/R-package/vignettes/vignette.css
+++ b/R-package/vignettes/vignette.css
@@ -25,7 +25,7 @@ body{
   line-height: 1;
   max-width: 800px;
-  padding: 20px;
+  padding: 10px;
   font-size: 17px;
   text-align: justify;
   text-justify: inter-word;
@@ -33,9 +33,10 @@ body{
 
 p {
-  line-height: 150%;
+  line-height: 140%;
 /*  max-width: 540px; */
   max-width: 960px;
+  margin-bottom: 5px;
   font-weight: 400;
   /* color: #333333 */
 }
@@ -46,7 +47,7 @@ h1, h2, h3, h4 {
   font-weight: 400;
 }
 
-h2, h3, h4, h5, p {
+h2, h3, h4, h5 {
   margin-bottom: 20px;
   padding: 0;
 }
@@ -86,6 +87,7 @@ h6 {
   font-variant:small-caps;
   font-style: italic;
 }
+
 a {
   color: #606AAA;
   margin: 0;
@@ -101,6 +103,7 @@ a:hover {
 a:visited {
   color: gray;
 }
+
 ul, ol {
   padding: 0;
   margin: 0px 0px 0px 50px;
@@ -138,9 +141,10 @@ code {
 }
 
-p code {
+li code, p code {
   background: #CDCDCD;
   color: #606AAA;
+  padding: 0px 5px 0px 5px;
 }
 
 code.r, code.cpp {
diff --git a/R-package/vignettes/xgboostPresentation.Rmd b/R-package/vignettes/xgboostPresentation.Rmd
index 02125ac46..63954f18a 100644
--- a/R-package/vignettes/xgboostPresentation.Rmd
+++ b/R-package/vignettes/xgboostPresentation.Rmd
@@ -16,13 +16,16 @@ vignette: >
 Introduction
 ============
 
-This is an introductory document of using the \verb@xgboost@ package in *R*.
+This is an introductory document for using the \verb@xgboost@ package in *R*.
 
 **Xgboost** is short for e**X**treme **G**radient **B**oosting package.
 
-It is an efficient and scalable implementation of gradient boosting framework by @friedman2001greedy.
+It is an efficient and scalable implementation of the gradient boosting framework by @friedman2001greedy. Two solvers are included:
 
-The package includes efficient *linear model* solver and *tree learning* algorithm. It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objectives easily.
+- *linear* model ;
+- *tree learning* algorithm.
+
+It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objective function easily.
 
 It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.
 
@@ -33,19 +36,19 @@ It has several features:
   * *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
   * *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
   * Data File: local data files ;
-  * `xgb.DMatrix`: it's own class (recommended).
+  * `xgb.DMatrix`: its own class (recommended).
 * Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
-* Customization: it supports customized objective function and evaluation function ;
+* Customization: it supports customized objective functions and evaluation functions ;
 * Performance: it has better performance on several different datasets.
 
-The purpose of this Vignette is to show you how to use **Xgboost** to make prediction from a model based on your own dataset.
+The purpose of this Vignette is to show you how to use **Xgboost** to make predictions from a model based on your dataset.
 
 Installation
 ============
 
-The first step is of course to install the package.
+The first step is to install the package.
-For up-to-date version (which is *highly* recommended), install from Github:
+For the up-to-date version (which is *highly* recommended), install from *Github*:
 
 ```{r installGithub, eval=FALSE}
 devtools::install_github('tqchen/xgboost',subdir='R-package')
@@ -53,7 +56,7 @@ devtools::install_github('tqchen/xgboost',subdir='R-package')
 
 > *Windows* user will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
 
-For stable version on CRAN, run:
+For the stable version on *CRAN*, run:
 
 ```{r installCran, eval=FALSE}
 install.packages('xgboost')
@@ -65,7 +68,7 @@ For the purpose of this tutorial we will load **Xgboost** package.
 require(xgboost)
 ```
 
-In this example, we are aiming to predict whether a mushroom can be eated or not (yeah I know, like many tutorial, example data are the exact one you will work on in your every day life :-).
+In this example, we are aiming to predict whether a mushroom can be eaten or not (yeah I know, like many tutorials, the example data are the same as the ones you will use in your everyday life :-).
 
 Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
 
@@ -77,10 +80,10 @@ Dataset loading
 
 We will load the `agaricus` datasets embedded with the package and will link them to variables.
 
-The datasets are already separated in `train` and `test` data:
+The datasets are already split into:
 
-* As their names imply, the `train` part will be used to build the model ;
-* `test` will be used to check how well our model is.
+* `train`: will be used to build the model ;
+* `test`: will be used to assess the quality of our model.
 
 Without dividing the dataset we would test the model on data the algorithm have already seen. As you may imagine, it's not the best methodology to check the performance of a prediction (can it even be called a *prediction*?).
 
@@ -191,11 +194,11 @@ print(paste("test-error=", err))
 
 > We remind you that the algorithm has never seen the `test` data before.
 
-Here, we have just computed a simple metric: the average error:
+Here, we have just computed a simple metric, the average error.
 
-* `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression) is over `0.5` the observation is classified as `1` and `0` otherwise ;
-* `probabilityVectorPreviouslyComputed != test$label` computes the vector of error between true data and computed probabilities ;
-* `mean(vectorOfErrors)` computes the average error itself.
+1. `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression) is over `0.5` the observation is classified as `1`, and as `0` otherwise ;
+2. `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the predicted classes ;
+3. `mean(vectorOfErrors)` computes the average error itself.
 
 The most important thing to remember is that **to do a classification basically, you just do a regression and then apply a threeshold**.
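To make the three numbered steps in the last hunk above concrete, here is a minimal, self-contained sketch of the same computation on made-up numbers (not the actual `agaricus` output), reusing the `pred`, `vectorOfErrors` and `err` names from the vignette:

```r
pred  <- c(0.1, 0.8, 0.6, 0.3)         # made-up predicted probabilities
label <- c(0,   1,   0,   0)           # made-up true labels

prediction <- as.numeric(pred > 0.5)   # step 1: threshold at 0.5 -> 0 1 1 0
vectorOfErrors <- prediction != label  # step 2: TRUE where the predicted class is wrong
err <- mean(vectorOfErrors)            # step 3: proportion of wrong predictions

print(paste("test-error=", err))       # "test-error= 0.25"
```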