[R] Use new predict function. (#6819)

* Call new C prediction API. * Add `strict_shape`. * Add `iterationrange`. * Update document.
2021-06-11 13:03:29 +08:00
parent 25514e104a
commit b56614e9b8
18 changed files with 293 additions and 160 deletions
--- a/R-package/man/cb.early.stop.Rd
+++ b/R-package/man/cb.early.stop.Rd
@@ -38,10 +38,7 @@ The following additional fields are assigned to the model's R object:
 \itemize{
 \item \code{best_score} the evaluation score at the best iteration
 \item \code{best_iteration} at which boosting iteration the best score has occurred (1-based index)
-\item \code{best_ntreelimit} to use with the \code{ntreelimit} parameter in \code{predict}.
-     It differs from \code{best_iteration} in multiclass or random forest settings.
 }
-
 The Same values are also stored as xgb-attributes:
 \itemize{
 \item \code{best_iteration} is stored as a 0-based iteration index (for interoperability of binary models)
--- a/R-package/man/predict.xgb.Booster.Rd
+++ b/R-package/man/predict.xgb.Booster.Rd
@@ -17,6 +17,8 @@
  predinteraction = FALSE,
  reshape = FALSE,
  training = FALSE,
+  iterationrange = NULL,
+  strict_shape = FALSE,
  ...
 )

@@ -34,8 +36,7 @@ missing values in data (e.g., sometimes 0 or some other extreme value is used).}
 sum of predictions from boosting iterations' results. E.g., setting \code{outputmargin=TRUE} for
 logistic regression would result in predictions for log-odds instead of probabilities.}

-\item{ntreelimit}{limit the number of model's trees or boosting iterations used in prediction (see Details).
-It will use all the trees by default (\code{NULL} value).}
+\item{ntreelimit}{Deprecated, use \code{iterationrange} instead.}

 \item{predleaf}{whether predict leaf index.}

@@ -52,10 +53,20 @@ or predinteraction flags is TRUE.}
 \item{training}{whether is the prediction result used for training.  For dart booster,
 training predicting will perform dropout.}

+\item{iterationrange}{Specifies which layer of trees are used in prediction.  For
+example, if a random forest is trained with 100 rounds.  Specifying
+`iteration_range=(1, 21)`, then only the forests built during [1, 21) (half open set)
+rounds are used in this prediction.  It's 1-based index just like R vector.  When set
+to \code{c(1, 1)} XGBoost will use all trees.}
+
+\item{strict_shape}{Default is \code{FALSE}. When it's set to \code{TRUE}, output
+type and shape of prediction are invariant to model type.}
+
 \item{...}{Parameters passed to \code{predict.xgb.Booster}}
 }
 \value{
-For regression or binary classification, it returns a vector of length \code{nrows(newdata)}.
+The return type is different depending whether \code{strict_shape} is set to \code{TRUE}.  By default,
+for regression or binary classification, it returns a vector of length \code{nrows(newdata)}.
 For multiclass classification, either a \code{num_class * nrows(newdata)} vector or
 a \code{(nrows(newdata), num_class)} dimension matrix is returned, depending on
 the \code{reshape} value.
@@ -76,18 +87,19 @@ two dimensions. The "+ 1" columns corresponds to bias. Summing this array along
 produce practically the same result as predict with \code{predcontrib = TRUE}.
 For a multiclass case, a list of \code{num_class} elements is returned, where each element is
 such an array.
+
+When \code{strict_shape} is set to \code{TRUE}, the output is always an array.  For
+normal prediction, the output is a 2-dimension array \code{(num_class, nrow(newdata))}.
+
+For \code{predcontrib = TRUE}, output is \code{(ncol(newdata) + 1, num_class, nrow(newdata))}
+For \code{predinteraction = TRUE}, output is \code{(ncol(newdata) + 1, ncol(newdata) + 1, num_class, nrow(newdata))}
+For \code{predleaf = TRUE}, output is \code{(n_trees_in_forest, num_class, n_iterations, nrow(newdata))}
 }
 \description{
 Predicted values based on either xgboost model or model handle object.
 }
 \details{
-Note that \code{ntreelimit} is not necessarily equal to the number of boosting iterations
-and it is not necessarily equal to the number of trees in a model.
-E.g., in a random forest-like model, \code{ntreelimit} would limit the number of trees.
-But for multiclass classification, while there are multiple trees per iteration,
-\code{ntreelimit} limits the number of boosting iterations.
-
-Also note that \code{ntreelimit} would currently do nothing for predictions from gblinear,
+Note that \code{iterationrange} would currently do nothing for predictions from gblinear,
 since gblinear doesn't keep its boosting history.

 One possible practical applications of the \code{predleaf} option is to use the model
@@ -120,7 +132,7 @@ bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
 # use all trees by default
 pred <- predict(bst, test$data)
 # use only the 1st tree
-pred1 <- predict(bst, test$data, ntreelimit = 1)
+pred1 <- predict(bst, test$data, iterationrange = c(1, 2))

 # Predicting tree leafs:
 # the result is an nsamples X ntrees matrix
@@ -172,25 +184,9 @@ str(pred)
 all.equal(pred, pred_labels)
 # prediction from using only 5 iterations should result
 # in the same error as seen in iteration 5:
-pred5 <- predict(bst, as.matrix(iris[, -5]), ntreelimit=5)
+pred5 <- predict(bst, as.matrix(iris[, -5]), iterationrange=c(1, 6))
 sum(pred5 != lb)/length(lb)

-
-## random forest-like model of 25 trees for binary classification:
-
-set.seed(11)
-bst <- xgboost(data = train$data, label = train$label, max_depth = 5,
-               nthread = 2, nrounds = 1, objective = "binary:logistic",
-               num_parallel_tree = 25, subsample = 0.6, colsample_bytree = 0.1)
-# Inspect the prediction error vs number of trees:
-lb <- test$label
-dtest <- xgb.DMatrix(test$data, label=lb)
-err <- sapply(1:25, function(n) {
-  pred <- predict(bst, dtest, ntreelimit=n)
-  sum((pred > 0.5) != lb)/length(lb)
-})
-plot(err, type='l', ylim=c(0,0.1), xlab='#trees')
-
 }
 \references{
 Scott M. Lundberg, Su-In Lee, "A Unified Approach to Interpreting Model Predictions", NIPS Proceedings 2017, \url{https://arxiv.org/abs/1705.07874}
--- a/R-package/man/xgb.cv.Rd
+++ b/R-package/man/xgb.cv.Rd
@@ -135,9 +135,7 @@ An object of class \code{xgb.cv.synchronous} with the following elements:
        parameter or randomly generated.
  \item \code{best_iteration} iteration number with the best evaluation metric value
        (only available with early stopping).
-  \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
-        which could further be used in \code{predict} method
-        (only available with early stopping).
+  \item \code{best_ntreelimit} and the \code{ntreelimit} Deprecated attributes, use \code{best_iteration} instead.
  \item \code{pred} CV prediction values available when \code{prediction} is set.
        It is either vector or matrix (see \code{\link{cb.cv.predict}}).
  \item \code{models} a list of the CV folds' models. It is only available with the explicit
--- a/R-package/man/xgb.train.Rd
+++ b/R-package/man/xgb.train.Rd
@@ -187,9 +187,6 @@ An object of class \code{xgb.Booster} with the following elements:
        explicitly passed.
  \item \code{best_iteration} iteration number with the best evaluation metric value
        (only available with early stopping).
-  \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
-        which could further be used in \code{predict} method
-        (only available with early stopping).
  \item \code{best_score} the best evaluation metric value during early stopping.
        (only available with early stopping).
  \item \code{feature_names} names of the training dataset features