Add co-occurrence computation to the feature importance function + related documentation

2015-02-15 17:15:47 +01:00
parent d75194303b
commit def2674dd1
2 changed files with 62 additions and 17 deletions
--- a/R-package/man/xgb.importance.Rd
+++ b/R-package/man/xgb.importance.Rd
@@ -4,7 +4,8 @@
 \alias{xgb.importance}
 \title{Show importance of features in a model}
 \usage{
-xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL)
+xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL,
+  data = NULL, label = NULL, target = function(x) ((x + label) == 2))
 }
 \arguments{
 \item{feature_names}{names of each feature as a character vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
@@ -12,6 +13,12 @@ xgb.importance(feature_names = NULL, filename_dump = NULL, model = NULL)
 \item{filename_dump}{the path to the text file storing the model. Model dump must include the gain per feature and per tree (\code{with.stats = T} in function \code{xgb.dump}).}

 \item{model}{generated by the \code{xgb.train} function. Avoid the creation of a dump file.}
+
+\item{data}{the dataset used for the training step. Used together with the \code{label} parameter for co-occurrence computation. More information in the \code{Details} section. This parameter is optional.}
+
+\item{label}{the label vector used for the training step. Used together with the \code{data} parameter for co-occurrence computation. More information in the \code{Details} section. This parameter is optional.}
+
+\item{target}{a function which returns \code{TRUE} or \code{1} when an observation should be counted as a co-occurrence and \code{FALSE} or \code{0} otherwise. The default function computes the co-occurrence between a one-hot encoded categorical feature and a binary classification label. The \code{target} function should take exactly one parameter, which receives in turn each feature vector listed as an important feature. More information in the \code{Details} section. This parameter is optional.}
 }
 \value{
 A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model.
@@ -33,21 +40,30 @@ There are 3 columns :
  \item \code{Cover} metric of the number of observations related to this feature (only available for tree models) ;
  \item \code{Weight} percentage representing the relative number of times a feature has been used in trees. \code{Gain} should be preferred when searching for the most important feature. For boosted linear models, this column has no meaning.
 }
+
+Co-occurrence count
+
+The gain indicates how much a feature contributes to making a branch of a decision tree purer. By itself, however, it does not tell you whether the feature has to be present or absent to get a specific classification. In the example code, you may wonder whether odor=none should be \code{TRUE} for a mushroom to be one you should not eat.
+
+Co-occurrence computation helps in understanding this relation. It counts how many observations make the \code{target} function return \code{TRUE}. In our example, there are only 92 cases among the 3140 observations of the train dataset where a mushroom has no odor and can be eaten safely.
+
+If you need to remember one thing from all of this: unless you want to leave us early, don't eat a mushroom which has no odor :-)
 }
 \examples{
 data(agaricus.train, package='xgboost')
-data(agaricus.test, package='xgboost')

-#Both dataset are list with two items, a sparse matrix and labels
-#(labels = outcome column which will be learned).
-#Each column of the sparse Matrix is a feature in one hot encoding format.
+# The dataset is a list with two items: a sparse matrix and labels
+# (labels = outcome column which will be learned).
+# Each column of the sparse matrix is a feature in one-hot encoding format.
 train <- agaricus.train
-test <- agaricus.test

 bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
               eta = 1, nround = 2,objective = "binary:logistic")

-#agaricus.test$data@Dimnames[[2]] represents the column names of the sparse matrix.
-xgb.importance(agaricus.test$data@Dimnames[[2]], model = bst)
+# train$data@Dimnames[[2]] represents the column names of the sparse matrix.
+xgb.importance(train$data@Dimnames[[2]], model = bst)
+
+# Same thing with co-occurrence computation this time
+xgb.importance(train$data@Dimnames[[2]], model = bst, data = train$data, label = train$label)
 }
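
The co-occurrence count described in the added documentation can be illustrated with a minimal R sketch. It is not taken from the commit and does not reproduce the internals of xgb.importance: the toy matrix, the label vector and the `cooccurrence` variable are made up for illustration. Only the shape of the `target` function follows the default shown in the new signature.

```r
# Hypothetical toy data: 6 observations, 2 one-hot encoded features.
# matrix() fills column by column, so the first 6 values are "odor=none".
data <- matrix(c(1, 0, 1, 1, 0, 0,
                 0, 1, 0, 0, 1, 1),
               ncol = 2,
               dimnames = list(NULL, c("odor=none", "odor=foul")))

# Hypothetical binary labels, one per observation.
label <- c(1, 0, 1, 1, 0, 1)

# Same shape as the default `target` in the new signature:
# TRUE when both the feature value and the label are 1.
target <- function(x) ((x + label) == 2)

# For each feature column, count the observations where `target` is TRUE.
# This is an illustrative computation, not the package implementation.
cooccurrence <- apply(data, 2, function(col) sum(target(col)))
print(cooccurrence)
```

With this toy data, "odor=none" co-occurs with a positive label in 3 observations and "odor=foul" in 1. A custom `target` would follow the same pattern: it receives one feature vector and returns a logical (or 0/1) vector of the same length.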