[R] Document handling of indexes (#10019)

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2024-02-01 22:39:09 +01:00
parent 4dfbe2a893
commit 662854c7d7
5 changed files with 61 additions and 5 deletions
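The documentation added in this commit asks for zero-based integer encoding of classification labels and of manually specified categorical features, with the same encoding reused at prediction time. A minimal base-R sketch of that idea follows. The vectors `species`, `color_train`, and `color_new` are hypothetical illustration data, not part of the commit, and the snippet deliberately stops short of calling any xgboost function.

```r
# Hypothetical class labels for a three-class problem.
species <- c("setosa", "virginica", "versicolor", "setosa")

# Fix the level order explicitly, then convert to integers.
# as.integer() on a factor is 1-based, so subtract 1 to start at zero.
train_levels <- sort(unique(species))
label <- as.integer(factor(species, levels = train_levels)) - 1L

# Hypothetical categorical feature seen at training time.
color_train <- factor(c("red", "green", "blue", "green"))
color_levels <- levels(color_train)              # "blue" "green" "red"
color_train_codes <- as.integer(color_train) - 1L

# At prediction time, rebuild the factor with the TRAINING levels so that
# each category maps to the same integer code as before.
color_new <- factor(c("green", "red"), levels = color_levels)
color_new_codes <- as.integer(color_new) - 1L

# Without the explicit levels, the codes would silently shift.
color_naive_codes <- as.integer(factor(c("green", "red"))) - 1L
```

Here `label` is the kind of vector the `label` parameter describes, and `color_new_codes` keeps "green" and "red" on the codes they had during training, whereas `color_naive_codes` does not. Storing `color_levels` alongside the model is one way to keep the mapping available, since the docs note that the encoding is not saved.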
--- a/R-package/R/xgb.DMatrix.R
+++ b/R-package/R/xgb.DMatrix.R
@@ -33,7 +33,8 @@
 #' \item Binary files generated by \link{xgb.DMatrix.save},  passed as a path to the file. These are
 #' \bold{not} supported for xgb.QuantileDMatrix'.
 #' }
-#' @param label Label of the training data.
+#' @param label Label of the training data. For classification problems, should be passed encoded as
+#' integers with numeration starting at zero.
 #' @param weight Weight for each instance.
 #'
 #' Note that, for ranking task, weights are per-group.  In ranking task, one weight
@@ -69,6 +70,11 @@
 #' Note that, while categorical types are treated differently from the rest for model fitting
 #' purposes, the other types do not influence the generated model, but have effects in other
 #' functionalities such as feature importances.
+#'
+#' \bold{Important}: categorical features, if specified manually through `feature_types`, must
+#' be encoded as integers with numeration starting at zero, and the same encoding needs to be
+#' applied when passing data to `predict`. Even if passing `factor` types, the encoding will
+#' not be saved, so make sure that `factor` columns passed to `predict` have the same `levels`.
 #' @param nthread Number of threads used for creating DMatrix.
 #' @param group Group size for all ranking group.
 #' @param qid Query ID for data samples, used for ranking.
--- a/R-package/man/xgb.DMatrix.Rd
+++ b/R-package/man/xgb.DMatrix.Rd
@@ -66,7 +66,8 @@ supported for xgb.QuantileDMatrix'.
 \bold{not} supported for xgb.QuantileDMatrix'.
 }}

-\item{label}{Label of the training data.}
+\item{label}{Label of the training data. For classification problems, should be passed encoded as
+integers with numeration starting at zero.}

 \item{weight}{Weight for each instance.

@@ -109,7 +110,12 @@ with the following possible values:\itemize{

 Note that, while categorical types are treated differently from the rest for model fitting
 purposes, the other types do not influence the generated model, but have effects in other
-functionalities such as feature importances.}
+functionalities such as feature importances.
+
+\bold{Important}: categorical features, if specified manually through \code{feature_types}, must
+be encoded as integers with numeration starting at zero, and the same encoding needs to be
+applied when passing data to \code{predict}. Even if passing \code{factor} types, the encoding will
+not be saved, so make sure that \code{factor} columns passed to \code{predict} have the same \code{levels}.}

 \item{nthread}{Number of threads used for creating DMatrix.}

--- a/R-package/man/xgb.DataBatch.Rd
+++ b/R-package/man/xgb.DataBatch.Rd
@@ -33,7 +33,8 @@ conversions applied to it. See the documentation for parameter \code{data} in
 \item CSR matrices, as class \code{dgRMatrix} from package \code{Matrix}.
 }}

-\item{label}{Label of the training data.}
+\item{label}{Label of the training data. For classification problems, should be passed encoded as
+integers with numeration starting at zero.}

 \item{weight}{Weight for each instance.

@@ -69,7 +70,12 @@ with the following possible values:\itemize{

 Note that, while categorical types are treated differently from the rest for model fitting
 purposes, the other types do not influence the generated model, but have effects in other
-functionalities such as feature importances.}
+functionalities such as feature importances.
+
+\bold{Important}: categorical features, if specified manually through \code{feature_types}, must
+be encoded as integers with numeration starting at zero, and the same encoding needs to be
+applied when passing data to \code{predict}. Even if passing \code{factor} types, the encoding will
+not be saved, so make sure that \code{factor} columns passed to \code{predict} have the same \code{levels}.}

 \item{group}{Group size for all ranking group.}