[R] various R code maintenance (#1964)
* [R] xgb.save must work when handle is nil but raw exists
* [R] print.xgb.Booster should still print other info when handle is nil
* [R] rename internal function xgb.Booster to xgb.Booster.handle to make its intent clear
* [R] rename xgb.Booster.check to xgb.Booster.complete and make it visible; more docs
* [R] storing evaluation_log should depend only on watchlist, not on verbose
* [R] reduce the excessive chattiness of unit tests
* [R] only disable some tests in windows when it's not 64-bit
* [R] clean-up xgb.DMatrix
* [R] test xgb.DMatrix loading from libsvm text file
* [R] store feature_names in xgb.Booster, use them from utility functions
* [R] remove non-functional co-occurence computation from xgb.importance
* [R] verbose=0 is enough without a callback
* [R] added forgotten xgb.Booster.complete.Rd; cran check fixes
* [R] update installation instructions
Committed by Tianqi Chen
parent a073a2c3d4
commit 2b5b96d760
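A minimal R sketch (not part of the commit) illustrating the first bullet above; it assumes only the stated behavior, namely that xgb.save can fall back to the raw memory dump when the handle is nil:

library(xgboost)
data(agaricus.train, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")
saveRDS(bst, "bst.rds")
bst2 <- readRDS("bst.rds")   # the handle is nil after the round trip, but raw data survives
xgb.save(bst2, "xgb.model")  # per the first bullet, this must now succeed anyway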
R-package/man/xgb.Booster.complete.Rd (new file, 49 lines)
@@ -0,0 +1,49 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/xgb.Booster.R
+\name{xgb.Booster.complete}
+\alias{xgb.Booster.complete}
+\title{Restore missing parts of an incomplete xgb.Booster object.}
+\usage{
+xgb.Booster.complete(object, saveraw = TRUE)
+}
+\arguments{
+\item{object}{object of class \code{xgb.Booster}}
+
+\item{saveraw}{a flag indicating whether to append \code{raw} Booster memory dump data
+when it doesn't already exist.}
+}
+\value{
+An object of \code{xgb.Booster} class.
+}
+\description{
+It attempts to complete an \code{xgb.Booster} object by restoring either its missing
+raw model memory dump (when it has no \code{raw} data but its \code{xgb.Booster.handle} is valid)
+or its missing internal handle (when its \code{xgb.Booster.handle} is not valid
+but it has a raw Booster memory dump).
+}
+\details{
+While this method is primarily for internal use, it might be useful in some practical situations.
+
+E.g., when an \code{xgb.Booster} model is saved as an R object and then is loaded as an R object,
+its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods
+should still work for such a model object, since those methods would be using
+\code{xgb.Booster.complete} internally. However, one might find it more efficient to call the
+\code{xgb.Booster.complete} function once after loading a model as an R object, which would
+prevent potentially repeated reconstruction of the internal booster model.
+}
+\examples{
+
+data(agaricus.train, package='xgboost')
+bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
+  eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
+saveRDS(bst, "xgb.model.rds")
+
+bst1 <- readRDS("xgb.model.rds")
+# the handle is invalid:
+print(bst1$handle)
+bst1 <- xgb.Booster.complete(bst1)
+# now the handle points to a valid internal booster model:
+print(bst1$handle)
+
+}
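A sketch (not part of the .Rd) continuing the example above; the lazy-completion behavior it relies on is exactly what the \details paragraph describes:

bst1 <- readRDS("xgb.model.rds")
# predict() works even though the handle is nil, because xgboost methods
# call xgb.Booster.complete() internally when needed:
pred <- predict(bst1, agaricus.train$data)
# an explicit one-time completion just avoids repeated reconstruction:
bst1 <- xgb.Booster.complete(bst1)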
R-package/man/xgb.dump.Rd
@@ -2,7 +2,7 @@
% Please edit documentation in R/xgb.dump.R
\name{xgb.dump}
\alias{xgb.dump}
-\title{Save xgboost model to text file}
+\title{Dump an xgboost model in text format.}
\usage{
xgb.dump(model = NULL, fname = NULL, fmap = "", with_stats = FALSE,
  dump_format = c("text", "json"), ...)
@@ -10,17 +10,18 @@ xgb.dump(model = NULL, fname = NULL, fmap = "", with_stats = FALSE,
\arguments{
\item{model}{the model object.}

-\item{fname}{the name of the text file where to save the model text dump. If not provided or set to \code{NULL} the function will return the model as a \code{character} vector.}
+\item{fname}{the name of the text file to which to save the model text dump.
+If not provided or set to \code{NULL}, the model is returned as a \code{character} vector.}

-\item{fmap}{feature map file representing the type of feature.
+\item{fmap}{feature map file representing feature types.
A detailed description can be found at
\url{https://github.com/dmlc/xgboost/wiki/Binary-Classification#dump-model}.
See demo/ for a walkthrough example in R, and
\url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt}
for an example of the format.}

-\item{with_stats}{whether dump statistics of splits
-When this option is on, the model dump comes with two additional statistics:
+\item{with_stats}{whether to dump some additional statistics about the splits.
+When this option is on, the model dump contains two additional values:
gain is the approximate loss function gain we get in each split;
cover is the sum of second order gradient in each node.}

@@ -29,10 +30,11 @@ cover is the sum of second order gradient in each node.}
\item{...}{currently not used}
}
\value{
-if fname is not provided or set to \code{NULL} the function will return the model as a \code{character} vector. Otherwise it will return \code{TRUE}.
+If fname is not provided or set to \code{NULL}, the function will return the model
+as a \code{character} vector. Otherwise it will return \code{TRUE}.
}
\description{
-Save a xgboost model to text file. Could be parsed later.
+Dump an xgboost model in text format.
}
\examples{
data(agaricus.train, package='xgboost')
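A short sketch of the with_stats option documented above (not part of the diff; continues the bst model from the example, and the exact dump strings will vary):

# without a file name, xgb.dump() returns the dump as a character vector:
d <- xgb.dump(bst, with_stats = TRUE)
head(d)  # split lines now carry gain=... and cover=... values
# the same information in JSON format:
d_json <- xgb.dump(bst, with_stats = TRUE, dump_format = "json")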
R-package/man/xgb.importance.Rd
@@ -2,64 +2,65 @@
% Please edit documentation in R/xgb.importance.R
\name{xgb.importance}
\alias{xgb.importance}
-\title{Show importance of features in a model}
+\title{Importance of features in a model.}
\usage{
xgb.importance(feature_names = NULL, model = NULL, data = NULL,
-  label = NULL, target = function(x) ((x + label) == 2))
+  label = NULL, target = NULL)
}
\arguments{
-\item{feature_names}{names of each feature as a \code{character} vector. Can be extracted from a sparse matrix (see example). If model dump already contains feature names, this argument should be \code{NULL}.}
+\item{feature_names}{character vector of feature names. If the model already
+contains feature names, those would be used when \code{feature_names=NULL} (the default).
+A non-null \code{feature_names} could be provided to override those in the model.}

-\item{model}{generated by the \code{xgb.train} function.}
+\item{model}{object of class \code{xgb.Booster}.}

-\item{data}{the dataset used for the training step. Will be used with \code{label} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.}
+\item{data}{deprecated.}

-\item{label}{the label vector used for the training step. Will be used with \code{data} parameter for co-occurence computation. More information in \code{Detail} part. This parameter is optional.}
+\item{label}{deprecated.}

-\item{target}{a function which returns \code{TRUE} or \code{1} when an observation should be count as a co-occurence and \code{FALSE} or \code{0} otherwise. Default function is provided for computing co-occurences in a binary classification. The \code{target} function should have only one parameter. This parameter will be used to provide each important feature vector after having applied the split condition, therefore these vector will be only made of 0 and 1 only, whatever was the information before. More information in \code{Detail} part. This parameter is optional.}
+\item{target}{deprecated.}
}
\value{
-A \code{data.table} of the features used in the model with their average gain (and their weight for boosted tree model) in the model.
+For a tree model, a \code{data.table} with the following columns:
+\itemize{
+\item \code{Features} names of the features used in the model;
+\item \code{Gain} represents the fractional contribution of each feature to the model, based on
+the total gain of this feature's splits. A higher percentage means a more important
+predictive feature.
+\item \code{Cover} a metric of the number of observations related to this feature;
+\item \code{Frequency} the percentage representing the relative number of times
+a feature has been used in trees.
+}
+
+A linear model's importance \code{data.table} has only two columns:
+\itemize{
+\item \code{Features} names of the features used in the model;
+\item \code{Weight} the linear coefficient of this feature.
+}
+
+If \code{feature_names} is not provided and the \code{model} doesn't have them either,
+indices of the features will be used instead. Because the index is extracted from the model dump
+(based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (as usual in R).
}
\description{
-Create a \code{data.table} of the most important features of a model.
+Creates a \code{data.table} of feature importances in a model.
}
\details{
-This function is for both linear and tree models.
+This function works for both linear and tree models.

-\code{data.table} is returned by the function.
-The columns are:
-\itemize{
-\item \code{Features} name of the features as provided in \code{feature_names} or already present in the model dump;
-\item \code{Gain} contribution of each feature to the model. For boosted tree model, each gain of each feature of each tree is taken into account, then average per feature to give a vision of the entire model. Highest percentage means important feature to predict the \code{label} used for the training (only available for tree models);
-\item \code{Cover} metric of the number of observation related to this feature (only available for tree models);
-\item \code{Weight} percentage representing the relative number of times a feature have been taken into trees.
-}
-
-If you don't provide \code{feature_names}, index of the features will be used instead.
-
-Because the index is extracted from the model dump (made on the C++ side), it starts at 0 (usual in C++) instead of 1 (usual in R).
-
-Co-occurence count
-------------------
-
-The gain gives you indication about the information of how a feature is important in making a branch of a decision tree more pure. However, with this information only, you can't know if this feature has to be present or not to get a specific classification. In the example code, you may wonder if odor=none should be \code{TRUE} to not eat a mushroom.
-
-Co-occurence computation is here to help in understanding this relation between a predictor and a specific class. It will count how many observations are returned as \code{TRUE} by the \code{target} function (see parameters). When you execute the example below, there are 92 times only over the 3140 observations of the train dataset where a mushroom have no odor and can be eaten safely.
-
-If you need to remember only one thing: unless you want to leave us early, don't eat a mushroom which has no odor :-)
+For linear models, the importance is the absolute magnitude of the linear coefficients.
+For that reason, in order to obtain a meaningful ranking by importance for a linear model,
+the features need to be on the same scale (which you would also want to do when using either
+L1 or L2 regularization).
}
\examples{

data(agaricus.train, package='xgboost')

bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
  eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")

-xgb.importance(colnames(agaricus.train$data), model = bst)
-
-# Same thing with co-occurence computation this time
-xgb.importance(colnames(agaricus.train$data), model = bst,
-  data = agaricus.train$data, label = agaricus.train$label)
+xgb.importance(model = bst)

}
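A small sketch of reading the importance table described in the \value section above (not part of the diff; continues the example's bst):

imp <- xgb.importance(model = bst)  # feature names are taken from the model
print(imp)  # for a tree model: a data.table with Gain, Cover and Frequency per feature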
R-package/man/xgb.load.Rd
@@ -7,10 +7,22 @@
xgb.load(modelfile)
}
\arguments{
-\item{modelfile}{the name of the binary file.}
+\item{modelfile}{the name of the binary input file.}
}
+\value{
+An object of \code{xgb.Booster} class.
+}
\description{
-Load xgboost model from the binary model file
+Load xgboost model from the binary model file.
}
+\details{
+The input file is expected to contain a model saved in an xgboost-internal binary format
+using either \code{\link{xgb.save}} or \code{\link{cb.save.model}} in R, or using some
+appropriate methods from other xgboost interfaces. E.g., a model trained in Python and
+saved from there in xgboost format could be loaded in R.
+
+Note: a model saved as an R object has to be loaded using corresponding R methods,
+not \code{xgb.load}.
+}
\examples{
data(agaricus.train, package='xgboost')
@@ -23,4 +35,7 @@ xgb.save(bst, 'xgb.model')
bst <- xgb.load('xgb.model')
pred <- predict(bst, test$data)
}
+\seealso{
+\code{\link{xgb.save}}, \code{\link{xgb.Booster.complete}}.
+}
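A hedged sketch of the note above (not part of the diff): the two persistence formats are not interchangeable.

bst_bin <- xgb.load('xgb.model')      # file written by xgb.save()
bst_rds <- readRDS('xgb.model.rds')   # file written by saveRDS(); xgb.load() cannot read it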
R-package/man/xgb.model.dt.tree.Rd
@@ -9,17 +9,19 @@ xgb.model.dt.tree(feature_names = NULL, model = NULL, text = NULL,
}
\arguments{
\item{feature_names}{character vector of feature names. If the model already
-contains feature names, this argument should be \code{NULL} (default value)}
+contains feature names, those would be used when \code{feature_names=NULL} (the default).
+A non-null \code{feature_names} could be provided to override those in the model.}

\item{model}{object of class \code{xgb.Booster}}

\item{text}{\code{character} vector previously generated by the \code{xgb.dump}
-function (where parameter \code{with_stats = TRUE} should have been set).}
+function (where the parameter \code{with_stats = TRUE} should have been set).
+\code{text} takes precedence over \code{model}.}

\item{trees}{an integer vector of tree indices that should be parsed.
If set to \code{NULL}, all trees of the model are parsed.
It could be useful, e.g., in multiclass classification to get only
-the trees of one certain class. IMPORTANT: the tree index in xgboost model
+the trees of one certain class. IMPORTANT: the tree index in xgboost models
is zero-based (e.g., use \code{trees = 0:4} for the first 5 trees).}

\item{...}{currently not used.}
@@ -56,7 +58,9 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")

(dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst))

+# This bst has feature_names stored in it, so those would be used when
+# the feature_names parameter is not provided:
+(dt <- xgb.model.dt.tree(model = bst))

# How to match feature names of splits that are following a current 'Yes' branch:
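A tiny sketch of the zero-based trees argument noted above (not part of the diff; continues the example's bst):

dt0 <- xgb.model.dt.tree(model = bst, trees = 0)    # only the first tree
dt01 <- xgb.model.dt.tree(model = bst, trees = 0:1) # the first two trees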
R-package/man/xgb.save.Rd
@@ -7,12 +7,22 @@
xgb.save(model, fname)
}
\arguments{
-\item{model}{the model object.}
+\item{model}{model object of \code{xgb.Booster} class.}

-\item{fname}{the name of the file to write.}
+\item{fname}{name of the file to write.}
}
\description{
-Save xgboost model from xgboost or xgb.train
+Save xgboost model to a file in binary format.
}
+\details{
+This method allows saving a model in an xgboost-internal binary format which is universal
+among the various xgboost interfaces. In R, the saved model file could be read in later
+using either the \code{\link{xgb.load}} function or the \code{xgb_model} parameter
+of \code{\link{xgb.train}}.
+
+Note: a model can also be saved as an R object (e.g., by using \code{\link[base]{saveRDS}}
+or \code{\link[base]{save}}). However, it would then only be compatible with R, and the
+corresponding R methods would need to be used to load it.
+}
\examples{
data(agaricus.train, package='xgboost')
@@ -25,4 +35,7 @@ xgb.save(bst, 'xgb.model')
bst <- xgb.load('xgb.model')
pred <- predict(bst, test$data)
}
+\seealso{
+\code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}.
+}
R-package/man/xgb.train.Rd
@@ -23,8 +23,7 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
1. General Parameters

\itemize{
-\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}
-\item \code{silent} 0 means printing running messages, 1 means silent mode. Default: 0
+\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
}

2. Booster Parameters
@@ -68,16 +67,19 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
\item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective (rmse for regression, error for classification, and mean average precision for ranking). The list is provided in the details section.
}}

-\item{data}{input dataset. \code{xgb.train} takes only an \code{xgb.DMatrix} as the input.
-\code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or local data file.}
+\item{data}{training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input.
+\code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or the name of a local data file.}

-\item{nrounds}{the max number of iterations}
+\item{nrounds}{max number of boosting iterations.}

-\item{watchlist}{what information should be printed when \code{verbose=1} or
-\code{verbose=2}. Watchlist is used to specify validation set monitoring
-during training. For example user can specify
-watchlist=list(validation1=mat1, validation2=mat2) to watch
-the performance of each round's model on mat1 and mat2}
+\item{watchlist}{named list of xgb.DMatrix datasets to use for evaluating model performance.
+Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
+of these datasets during each boosting iteration, and stored in the end as a field named
+\code{evaluation_log} in the resulting object. When either \code{verbose>=1} or the
+\code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
+printed out during the training.
+E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows tracking
+the performance of each round's model on mat1 and mat2.}

\item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain.}
@@ -86,10 +88,10 @@ gradient with given prediction and dtrain.}
\code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.}

-\item{verbose}{If 0, xgboost will stay silent. If 1, xgboost will print
-information of performance. If 2, xgboost will print some additional information.
-Setting \code{verbose > 0} automatically engages the \code{\link{cb.evaluation.log}} and
-\code{\link{cb.print.evaluation}} callback functions.}
+\item{verbose}{If 0, xgboost will stay silent. If 1, it will print information about performance.
+If 2, some additional information will be printed out.
+Note that setting \code{verbose > 0} automatically engages the
+\code{cb.print.evaluation(period=1)} callback function.}

\item{print_every_n}{Print evaluation messages at every n-th iteration when \code{verbose>0}.
Default is 1, which means all messages are printed. This parameter is passed to the
@@ -151,17 +153,20 @@ An object of class \code{xgb.Booster} with the following elements:
(only available with early stopping).
\item \code{best_score} the best evaluation metric value during early stopping
(only available with early stopping).
+\item \code{feature_names} names of the training dataset features
+(only when column names were defined in the training data).
}
}
\description{
-\code{xgb.train} is an advanced interface for training an xgboost model. The \code{xgboost} function provides a simpler interface.
+\code{xgb.train} is an advanced interface for training an xgboost model.
+The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
}
\details{
These are the training functions for \code{xgboost}.

The \code{xgb.train} interface supports advanced features such as \code{watchlist},
customized objective and evaluation metric functions, therefore it is more flexible
-than the \code{\link{xgboost}} interface.
+than the \code{xgboost} interface.

Parallelization is automatically enabled if \code{OpenMP} is present.
The number of threads can also be manually specified via the \code{nthread} parameter.
@@ -187,7 +192,7 @@ The following callbacks are automatically created when certain parameters are set:
\itemize{
\item \code{cb.print.evaluation} is turned on when \code{verbose > 0},
and the \code{print_every_n} parameter is passed to it.
-\item \code{cb.evaluation.log} is on when \code{verbose > 0} and \code{watchlist} is present.
+\item \code{cb.evaluation.log} is on when \code{watchlist} is present.
\item \code{cb.early.stop}: when \code{early_stopping_rounds} is set.
\item \code{cb.save.model}: when \code{save_period > 0} is set.
}
@@ -198,7 +203,7 @@ data(agaricus.test, package='xgboost')

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
-watchlist <- list(eval = dtest, train = dtrain)
+watchlist <- list(train = dtrain, eval = dtest)

## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
@@ -237,17 +242,15 @@ bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,


## An xgb.train example of using variable learning rates at each iteration:
-param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2)
+param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
+              objective = "binary:logistic", eval_metric = "auc")
my_etas <- list(eta = c(0.5, 0.1))
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 callbacks = list(cb.reset.parameters(my_etas)))


## Explicit use of the cb.evaluation.log callback allows running
## xgb.train silently but still storing the evaluation results:
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 verbose = 0, callbacks = list(cb.evaluation.log()))
print(bst$evaluation_log)
## Early stopping:
bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
                 early_stopping_rounds = 3)

## An 'xgboost' interface example:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
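To make the feval/watchlist interplay documented above concrete, a hedged sketch (not part of the diff; reuses param, dtrain and watchlist from the examples, and the metric name "my_error" is made up for illustration):

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  # error rate at a 0.5 probability cutoff:
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "my_error", value = err)
}
# the custom metric is evaluated on every watchlist entry at each iteration
# and ends up in bst$evaluation_log:
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, feval = evalerror)
print(bst$evaluation_log)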