[R] various R code maintenance (#1964)

* [R] xgb.save must work when handle is nil but raw exists
* [R] print.xgb.Booster should still print other info when handle is nil
* [R] rename internal function xgb.Booster to xgb.Booster.handle to make its intent clear
* [R] rename xgb.Booster.check to xgb.Booster.complete and make it visible; more docs
* [R] storing evaluation_log should depend only on watchlist, not on verbose
* [R] reduce the excessive chattiness of unit tests
* [R] only disable some tests in Windows when it's not 64-bit
* [R] clean-up xgb.DMatrix
* [R] test xgb.DMatrix loading from libsvm text file
* [R] store feature_names in xgb.Booster, use them from utility functions
* [R] remove non-functional co-occurrence computation from xgb.importance
* [R] verbose=0 is enough without a callback
* [R] added forgotten xgb.Booster.complete.Rd; cran check fixes
* [R] update installation instructions
2017-01-21 13:22:46 -06:00
parent a073a2c3d4
commit 2b5b96d760
27 changed files with 561 additions and 327 deletions
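The "store feature_names in xgb.Booster, use them from utility functions" item can be pictured with a small base-R sketch. The helper name `resolve_feature_names` and the plain list standing in for an `xgb.Booster` are assumptions made for illustration; neither is part of the package.

```r
# Hypothetical helper (not part of xgboost): mirrors the lookup order that
# xgb.importance uses in this commit.
resolve_feature_names <- function(feature_names = NULL, model = list()) {
  # Fall back to the names stored in the model only when none are supplied.
  if (is.null(feature_names) && !is.null(model$feature_names))
    feature_names <- model$feature_names
  # Anything other than NULL or a character vector is rejected.
  if (!is.null(feature_names) && !is.character(feature_names))
    stop("feature_names: has to be a character vector")
  feature_names
}

# A plain list stands in for an xgb.Booster here (assumption for illustration).
fake_model <- list(feature_names = c("f_a", "f_b"))

from_model <- resolve_feature_names(model = fake_model)               # names stored in the model
overridden <- resolve_feature_names(c("x", "y"), model = fake_model)  # explicit override wins
missing_nm <- resolve_feature_names(model = list())                   # NULL: indices are used instead
```

An explicit argument always takes precedence, so callers can still relabel features without modifying the stored model.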
--- a/R-package/R/xgb.importance.R
+++ b/R-package/R/xgb.importance.R
@@ -1,102 +1,92 @@
-#' Show importance of features in a model
+#' Importance of features in a model.
 #' 
-#' Create a \code{data.table} of the most important features of a model. 
+#' Creates a \code{data.table} of feature importances in a model.
 #' 
-#' @param feature_names names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.
-#' @param model generated by the \code{xgb.train} function.
-#' @param data the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.
-#' @param label the label vector used for the training step. Will be used with \code{data} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.
-#' @param target a function which returns \code{TRUE} or \code{1} when an observation should be count as a co-occurence and \code{FALSE} or \code{0} otherwise. Default function is provided for computing co-occurences in a binary classification. The \code{target} function should have only one parameter. This parameter will be used to provide each important feature vector after having applied the split condition, therefore these vector will be only made of 0 and 1 only, whatever was the information before. More information in \code{Detail} part. This parameter is optional.
-#'
-#' @return A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model.
+#' @param feature_names character vector of feature names. If the model already
+#'       contains feature names, those are used when \code{feature_names=NULL} (default value).
+#'       Non-null \code{feature_names} can be provided to override those in the model.
+#' @param model object of class \code{xgb.Booster}.
+#' @param data deprecated.
+#' @param label deprecated.
+#' @param target deprecated.
 #'
 #' @details 
-#' This function is for both linear and tree models.
 #' 
-#' \code{data.table} is returned by the function. 
-#' The columns are:
+#' This function works for both linear and tree models.
+#' 
+#' For linear models, the importance is the absolute magnitude of linear coefficients.
+#' For that reason, in order to obtain a meaningful ranking by importance for a linear model,
+#' the features need to be on the same scale (which is also advisable when using either
+#' L1 or L2 regularization).
+#' 
+#' @return
+#' 
+#' For a tree model, a \code{data.table} with the following columns:
 #' \itemize{
-#'   \item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
-#'   \item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training (only available for tree models);
-#'   \item \code{Cover} metric of the number of observation related to this feature (only available for tree models);
-#'   \item \code{Weight} percentage representing the relative number of times a feature have been taken into trees.
+#'   \item \code{Features} names of the features used in the model;
+#'   \item \code{Gain} represents fractional contribution of each feature to the model based on
+#'        the total gain of this feature's splits. A higher value means a more important
+#'        predictive feature;
+#'   \item \code{Cover} metric of the number of observations related to this feature;
+#'   \item \code{Frequency} fraction representing the relative number of times
+#'        a feature has been used in trees.
 #' }
 #' 
-#' If you don't provide \code{feature_names}, index of the features will be used instead.
+#' A linear model's importance \code{data.table} has only two columns:
+#' \itemize{
+#'   \item \code{Features} names of the features used in the model;
+#'   \item \code{Weight} the linear coefficient of this feature.
+#' }
 #' 
-#' Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (usual in C++) instead of 1 (usual in R).
-#' 
-#' Co-occurence count
-#' ------------------
-#' 
-#' The gain gives you indication about the information of how a feature is important in making a branch of a decision tree more pure. However, with this information only, you can't know if this feature has to be present or not to get a specific classification. In the example code, you may wonder if odor=none should be \code{TRUE} to not eat a mushroom.
-#' 
-#' Co-occurence computation is here to help in understanding this relation between a predictor and a specific class. It will count how many observations are returned as \code{TRUE} by the \code{target} function (see parameters). When you execute the example below, there are 92 times only over the 3140 observations of the train dataset where a mushroom have no odor and can be eaten safely.
-#' 
-#' If you need to remember only one thing: unless you want to leave us early, don't eat a mushroom which has no odor :-)
+#' If \code{feature_names} is not provided and \code{model} doesn't contain feature names,
+#' the index of the features will be used instead. Because the index is extracted from the model dump
+#' (based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (usual in R).
 #' 
 #' @examples
+#' 
 #' data(agaricus.train, package='xgboost')
 #' 
 #' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2, 
 #'                eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
 #' 
-#' xgb.importance(colnames(agaricus.train$data), model = bst)
-#' 
-#' # Same thing with co-occurence computation this time
-#' xgb.importance(colnames(agaricus.train$data), model = bst, 
-#'                data = agaricus.train$data, label = agaricus.train$label)
+#' xgb.importance(model = bst)
 #' 
 #' @export
-xgb.importance <- function(feature_names = NULL, model = NULL, data = NULL, label = NULL, target = function(x) ( (x + label) == 2)){
-  if (!class(feature_names) %in% c("character", "NULL")) {
-    stop("feature_names: Has to be a vector of character or NULL if the model already contains feature name. Look at this function documentation to see where to get feature names.")
-  }
-
-  if (class(model) != "xgb.Booster") {
-    stop("model: Has to be an object of class xgb.Booster model generaged by the xgb.train function.")
-  }
-
-  if((is.null(data) & !is.null(label)) | (!is.null(data) & is.null(label))) {
-    stop("data/label: Provide the two arguments if you want co-occurence computation or none of them if you are not interested but not one of them only.")
-  }
-
-  if(class(label) == "numeric"){
-    if(sum(label == 0) / length(label) > 0.5) label <- as(label, "sparseVector")
-  }
+xgb.importance <- function(feature_names = NULL, model = NULL, 
+                           data = NULL, label = NULL, target = NULL){
  
-  treeDump <- function(feature_names, text, keepDetail){
-    if(keepDetail) groupBy <- c("Feature", "Split", "MissingNo") else groupBy <- "Feature"
-    xgb.model.dt.tree(feature_names = feature_names, text = text)[,"MissingNo" := Missing == No ][Feature != "Leaf",.(Gain = sum(Quality), Cover = sum(Cover), Frequency = .N), by = groupBy, with = T][,`:=`(Gain = Gain / sum(Gain), Cover = Cover / sum(Cover), Frequency = Frequency / sum(Frequency))][order(Gain, decreasing = T)]
-  }
+  if (!(is.null(data) && is.null(label) && is.null(target)))
+    warning("xgb.importance: parameters 'data', 'label' and 'target' are deprecated")
  
-  linearDump <- function(feature_names, text){
-    weights <- which(text == "weight:") %>% {a =. + 1; text[a:length(text)]} %>% as.numeric
-    if(is.null(feature_names)) feature_names <- seq(to = length(weights))
-    data.table(Feature = feature_names, Weight = weights)
-  }
-
-  model.text.dump <- xgb.dump(model = model, with_stats = T)
+  if (!inherits(model, "xgb.Booster"))
+    stop("'model' has to be an object of class xgb.Booster")
  
-  if(model.text.dump[2] == "bias:"){
-    result <- model.text.dump %>% linearDump(feature_names, .)
-    if(!is.null(data) | !is.null(label)) warning("data/label: these parameters should only be provided with decision tree based models.")
-  }  else {
-    result <- treeDump(feature_names, text = model.text.dump, keepDetail = !is.null(data))
+  if (is.null(feature_names) && !is.null(model$feature_names))
+    feature_names <- model$feature_names
+  
+  if (!(is.null(feature_names) || is.character(feature_names)))
+    stop("feature_names: Has to be a character vector")

-    # Co-occurence computation
-    if(!is.null(data) & !is.null(label) & nrow(result) > 0) {
-      # Take care of missing column
-      a <- data[, result[MissingNo == T,Feature], drop=FALSE] != 0
-      # Bind the two Matrix and reorder columns
-      c <- data[, result[MissingNo == F,Feature], drop=FALSE] %>% cBind(a,.) %>% .[,result[,Feature]]
-      rm(a)
-      # Apply split
-      d <- data[, result[,Feature], drop=FALSE] < as.numeric(result[,Split])
-      apply(c & d, 2, . %>% target %>% sum) -> vec
-
-      result <- result[, "RealCover" := as.numeric(vec), with = F][, "RealCover %" := RealCover / sum(label)][, MissingNo := NULL]
-    }
+  model_text_dump <- xgb.dump(model = model, with_stats = TRUE)
+  
+  # linear model
+  if(model_text_dump[2] == "bias:"){
+    weights <- which(model_text_dump == "weight:") %>%
+               {model_text_dump[(. + 1):length(model_text_dump)]} %>%
+               as.numeric
+    if(is.null(feature_names)) 
+      feature_names <- seq(to = length(weights))
+    result <- data.table(Feature = feature_names, Weight = weights)[order(-abs(Weight))]
+  } else { 
+  # tree model
+    result <- xgb.model.dt.tree(feature_names = feature_names, text = model_text_dump)[
+      Feature != "Leaf", .(Gain = sum(Quality), 
+                           Cover = sum(Cover), 
+                           Frequency = .N), by = Feature][
+      ,`:=`(Gain = Gain / sum(Gain), 
+            Cover = Cover / sum(Cover),
+            Frequency = Frequency / sum(Frequency))][
+      order(Gain, decreasing = TRUE)]
  }
  result
 }
@@ -104,4 +94,4 @@ xgb.importance <- function(feature_names = NULL, model = NULL, data = NULL, labe
 # Avoid error messages during CRAN check.
 # The reason is that these variables are never declared
 # They are mainly column names inferred by Data.table...
-globalVariables(c(".", ".N", "Gain", "Frequency", "Feature", "Split", "No", "Missing", "MissingNo", "RealCover"))
+globalVariables(c(".", ".N", "Gain", "Cover", "Frequency", "Feature"))
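As a rough illustration of what the new tree-model branch computes, the following base-R sketch reproduces the same aggregation without `data.table`. The `splits` table is a made-up stand-in for the output of `xgb.model.dt.tree()`, and its values are hypothetical.

```r
# Hypothetical per-node table; the numbers are invented for illustration only.
splits <- data.frame(
  Feature = c("odor=none", "stalk-root=club", "odor=none", "Leaf", "Leaf"),
  Quality = c(4000, 1000, 1000, 0.5, -0.5),
  Cover   = c(1600, 900, 700, 800, 800),
  stringsAsFactors = FALSE
)

# Drop leaves: only split nodes contribute to importance.
nodes <- splits[splits$Feature != "Leaf", ]
features <- unique(nodes$Feature)

# Sum gain and cover, and count splits, per feature.
imp <- data.frame(
  Feature   = features,
  Gain      = vapply(features, function(f) sum(nodes$Quality[nodes$Feature == f]), numeric(1)),
  Cover     = vapply(features, function(f) sum(nodes$Cover[nodes$Feature == f]), numeric(1)),
  Frequency = vapply(features, function(f) as.numeric(sum(nodes$Feature == f)), numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Normalize each column so it sums to 1, then rank by gain.
imp$Gain      <- imp$Gain / sum(imp$Gain)
imp$Cover     <- imp$Cover / sum(imp$Cover)
imp$Frequency <- imp$Frequency / sum(imp$Frequency)
imp <- imp[order(imp$Gain, decreasing = TRUE), ]
```

Because every column is divided by its own total, the three measures are fractions that are comparable across features within one model but not across models.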