[R] remove 'reshape' argument, let shapes be handled by core cpp library (#10330)

2024-08-18 17:31:38 +02:00
parent fd365c147e
commit caabee2135
13 changed files with 239 additions and 248 deletions
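
The memory-layout point behind the new `avoid_transpose` argument documented in this change can be illustrated with a minimal base-R sketch. It does not call XGBoost: `buffer` and `buffer3d` are hypothetical stand-ins for the flat, row-major prediction buffer, and the sizes are made up for illustration.

```r
# Hypothetical sizes, chosen only for illustration.
nrows <- 3
ngroups <- 2

# Stand-in for a flat buffer written in row-major order:
# row 1 holds (11, 12), row 2 holds (21, 22), row 3 holds (31, 32).
buffer <- c(11, 12, 21, 22, 31, 32)

# R fills arrays in column-major order, so reading the buffer as-is only
# makes sense with the dimensions reversed: [ngroups, nrows].
reversed <- array(buffer, dim = c(ngroups, nrows))

# Transposing gives the [nrows, ngroups] layout, at the cost of a copy.
expected <- t(reversed)

# The same idea for a 3D result, e.g. [nrows, ngroups, nfeats+1].
nfeats_plus_1 <- 4
buffer3d <- seq_len(nrows * ngroups * nfeats_plus_1)
reversed3d <- array(buffer3d, dim = c(nfeats_plus_1, ngroups, nrows))

# aperm() generalises t() to arrays: reverse the order of all dimensions.
restored3d <- aperm(reversed3d, c(3, 2, 1))

print(dim(reversed3d))
print(dim(restored3d))
```

Under these assumptions, `reversed` and `reversed3d` mirror the kind of shape described for `avoid_transpose = TRUE`, while `expected` and `restored3d` mirror the default shapes obtained after transposing.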
--- a/R-package/man/predict.xgb.Booster.Rd
+++ b/R-package/man/predict.xgb.Booster.Rd
@@ -13,10 +13,10 @@
  predcontrib = FALSE,
  approxcontrib = FALSE,
  predinteraction = FALSE,
-  reshape = FALSE,
  training = FALSE,
  iterationrange = NULL,
  strict_shape = FALSE,
+  avoid_transpose = FALSE,
  validate_features = FALSE,
  base_margin = NULL,
  ...
@@ -66,10 +66,6 @@ logistic regression would return log-odds instead of probabilities.}

 \item{predinteraction}{Whether to return contributions of feature interactions to individual predictions (see Details).}

-\item{reshape}{Whether to reshape the vector of predictions to matrix form when there are several
-prediction outputs per case. No effect if \code{predleaf}, \code{predcontrib},
-or \code{predinteraction} is \code{TRUE}.}
-
 \item{training}{Whether the prediction result is used for training. For dart booster,
 training predicting will perform dropout.}

@@ -86,8 +82,27 @@ base-1 indexing, and inclusive of both ends).
   If passing "all", will use all of the rounds regardless of whether the model had early stopping or not.
 }\if{html}{\out{</div>}}}

-\item{strict_shape}{Default is \code{FALSE}. When set to \code{TRUE}, the output
-type and shape of predictions are invariant to the model type.}
+\item{strict_shape}{Whether to always return an array with the same dimensions for the given prediction mode
+regardless of the model type - meaning that, for example, both a multi-class and a binary classification
+model would generate output arrays with the same number of dimensions, with the 'class' dimension having
+size equal to '1' for the binary model.
+
+\if{html}{\out{<div class="sourceCode">}}\preformatted{   If passing `FALSE` (the default), dimensions will be simplified according to the model type, so that a
+   binary classification model for example would not have a redundant dimension for 'class'.
+
+   See documentation for the return type for the exact shape of the output arrays for each prediction mode.
+}\if{html}{\out{</div>}}}
+
+\item{avoid_transpose}{Whether to output the resulting predictions in the same memory layout in which they
+are generated by the core XGBoost library, without transposing them to match the expected output shape.
+
+\if{html}{\out{<div class="sourceCode">}}\preformatted{   Internally, XGBoost uses row-major order for the predictions it generates, while R arrays use column-major
+   order, hence the result needs to be transposed in order to have the expected shape when represented as
+   an R array or matrix, which might be a slow operation.
+
+   If passing `TRUE`, then the result will have dimensions in reverse order - for example, rows
+   will be the last dimension instead of the first dimension.
+}\if{html}{\out{</div>}}}

 \item{validate_features}{When \code{TRUE}, validate that the Booster's and newdata's feature_names
 match (only applicable when both \code{object} and \code{newdata} have feature names).
@@ -116,32 +131,46 @@ match (only applicable when both \code{object} and \code{newdata} have feature n
 \item{...}{Not used.}
 }
 \value{
-The return type depends on \code{strict_shape}. If \code{FALSE} (default):
-\itemize{
-\item For regression or binary classification: A vector of length \code{nrows(newdata)}.
-\item For multiclass classification: A vector of length \code{num_class * nrows(newdata)} or
-a \verb{(nrows(newdata), num_class)} matrix, depending on the \code{reshape} value.
-\item When \code{predleaf = TRUE}: A matrix with one column per tree.
-\item When \code{predcontrib = TRUE}: When not multiclass, a matrix with
-\code{ num_features + 1} columns. The last "+ 1" column corresponds to the baseline value.
-In the multiclass case, a list of \code{num_class} such matrices.
-The contribution values are on the scale of untransformed margin
-(e.g., for binary classification, the values are log-odds deviations from the baseline).
-\item When \code{predinteraction = TRUE}: When not multiclass, the output is a 3d array of
-dimension \code{c(nrow, num_features + 1, num_features + 1)}. The off-diagonal (in the last two dimensions)
-elements represent different feature interaction contributions. The array is symmetric WRT the last
-two dimensions. The "+ 1" columns corresponds to the baselines. Summing this array along the last dimension should
-produce practically the same result as \code{predcontrib = TRUE}.
-In the multiclass case, a list of \code{num_class} such arrays.
+A numeric vector or array, with corresponding dimensions depending on the prediction mode and on
+parameter \code{strict_shape} as follows:
+
+If passing \code{strict_shape=FALSE}:\itemize{
+\item For regression or binary classification: a vector of length \code{nrows}.
+\item For multi-class and multi-target objectives: a matrix of dimensions \verb{[nrows, ngroups]}.
+
+Note that the objective variant \code{multi:softmax} defaults to predicting the most likely class
+(a vector of length \code{nrows}) instead of per-class probabilities.
+\item For \code{predleaf}: a matrix with one column per tree.
+
+For multi-class / multi-target, they will be arranged so that columns in the output will have
+the leaves from one group followed by the leaves of the next group (e.g. order will be \code{group1:feat1},
+\code{group1:feat2}, ..., \code{group2:feat1}, \code{group2:feat2}, ...).
+\item For \code{predcontrib}: when not multi-class / multi-target, a matrix with dimensions
+\verb{[nrows, nfeats+1]}. The last "+ 1" column corresponds to the baseline value.
+
+For multi-class and multi-target objectives, the output will be an array with dimensions \verb{[nrows, ngroups, nfeats+1]}.
+
+The contribution values are on the scale of untransformed margin (e.g., for binary classification,
+the values are log-odds deviations from the baseline).
+\item For \code{predinteraction}: when not multi-class / multi-target, the output is a 3D array of
+dimensions \verb{[nrows, nfeats+1, nfeats+1]}. The off-diagonal (in the last two dimensions)
+elements represent different feature interaction contributions. The array is symmetric w.r.t. the last
+two dimensions. The "+ 1" columns correspond to the baselines. Summing this array along the last
+dimension should produce practically the same result as \code{predcontrib = TRUE}.
+
+For multi-class and multi-target, the output will be a 4D array with dimensions \verb{[nrows, ngroups, nfeats+1, nfeats+1]}.
 }

-When \code{strict_shape = TRUE}, the output is always an array:
-\itemize{
-\item For normal predictions, the output has dimension \verb{(num_class, nrow(newdata))}.
-\item For \code{predcontrib = TRUE}, the dimension is \verb{(ncol(newdata) + 1, num_class, nrow(newdata))}.
-\item For \code{predinteraction = TRUE}, the dimension is \verb{(ncol(newdata) + 1, ncol(newdata) + 1, num_class, nrow(newdata))}.
-\item For \code{predleaf = TRUE}, the dimension is \verb{(n_trees_in_forest, num_class, n_iterations, nrow(newdata))}.
+If passing \code{strict_shape=TRUE}, the result is always an array:\itemize{
+\item For normal predictions, the dimension is \verb{[nrows, ngroups]}.
+\item For \code{predcontrib=TRUE}, the dimension is \verb{[nrows, ngroups, nfeats+1]}.
+\item For \code{predinteraction=TRUE}, the dimension is \verb{[nrows, ngroups, nfeats+1, nfeats+1]}.
+\item For \code{predleaf=TRUE}, the dimension is \verb{[nrows, niter, ngroups, num_parallel_tree]}.
 }
+
+If passing \code{avoid_transpose=TRUE}, then the dimensions in all cases will be in reverse order - for
+example, for \code{predinteraction}, they will be \verb{[nfeats+1, nfeats+1, ngroups, nrows]}
+instead of \verb{[nrows, ngroups, nfeats+1, nfeats+1]}.
 }
 \description{
 Predict values on data based on xgboost model.
@@ -241,8 +270,6 @@ bst <- xgb.train(
 # predict for softmax returns num_class probability numbers per case:
 pred <- predict(bst, as.matrix(iris[, -5]))
 str(pred)
-# reshape it to a num_class-columns matrix
-pred <- matrix(pred, ncol = num_class, byrow = TRUE)
 # convert the probabilities to softmax labels
 pred_labels <- max.col(pred) - 1
 # the following should result in the same error as seen in the last iteration