% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.cv.R
\name{xgb.cv}
\alias{xgb.cv}
\title{Cross Validation}
\usage{
xgb.cv(
  params = list(),
  data,
  nrounds,
  nfold,
  prediction = FALSE,
  showsd = TRUE,
  metrics = list(),
  obj = NULL,
  feval = NULL,
  stratified = "auto",
  folds = NULL,
  train_folds = NULL,
  verbose = TRUE,
  print_every_n = 1L,
  early_stopping_rounds = NULL,
  maximize = NULL,
  callbacks = list(),
  ...
)
}
\arguments{
\item{params}{The list of parameters. The complete list of parameters is available in the
\href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}.
Below is a shorter summary:
\itemize{
\item \code{objective}: Objective function. Common ones are:
\itemize{
\item \code{reg:squarederror}: Regression with squared loss.
\item \code{binary:logistic}: Logistic regression for classification.
}

See \code{\link[=xgb.train]{xgb.train()}} for the complete list of objectives.
\item \code{eta}: Step size of each boosting step.
\item \code{max_depth}: Maximum depth of the tree.
\item \code{nthread}: Number of threads used in training. If not set, all threads are used.
}

See \code{\link[=xgb.train]{xgb.train()}} for further details.
See also the demo for a walkthrough example in R.

Note that, while \code{params} accepts a \code{seed} entry and will use it for model training if
supplied, this seed is not used for the creation of train-test splits, which instead rely on R's own RNG
system - thus, for reproducible results, one needs to call the \code{\link[=set.seed]{set.seed()}} function beforehand.}

\item{data}{An \code{xgb.DMatrix} object, with corresponding fields like \code{label} or bounds as required
for model training by the objective.
Note that only the basic \code{xgb.DMatrix} class is supported - variants such as \code{xgb.QuantileDMatrix}
or \code{xgb.ExtMemDMatrix} are not supported here.}

\item{nrounds}{The maximum number of boosting iterations.}

\item{nfold}{The original dataset is randomly partitioned into \code{nfold} equal size subsamples.}

\item{prediction}{A logical value indicating whether to return the test fold predictions
from each CV model. This parameter engages the \code{\link[=xgb.cb.cv.predict]{xgb.cb.cv.predict()}} callback.}

\item{showsd}{Logical value indicating whether to show the standard deviation of the cross-validation metrics.}

\item{metrics}{List of evaluation metrics to be used in cross validation.
When it is not specified, the evaluation metric is chosen according to the objective function.
Possible options are:
\itemize{
\item \code{error}: Binary classification error rate
\item \code{rmse}: Root mean square error
\item \code{logloss}: Negative log-likelihood function
\item \code{mae}: Mean absolute error
\item \code{mape}: Mean absolute percentage error
\item \code{auc}: Area under curve
\item \code{aucpr}: Area under PR curve
\item \code{merror}: Exact matching error used to evaluate multi-class classification
}}

\item{obj}{Customized objective function. Returns the gradient and second-order gradient
given the predictions and dtrain.}

\item{feval}{Customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} given the predictions and dtrain.}

\item{stratified}{Logical flag indicating whether sampling of folds should be stratified
by the values of outcome labels. For real-valued labels in regression objectives,
stratification will be done by discretizing the labels into up to 5 buckets beforehand.

If passing "auto", will be set to \code{TRUE} if the objective in \code{params} is a classification
objective (from XGBoost's built-in objectives, doesn't apply to custom ones), and to
\code{FALSE} otherwise.
This parameter is ignored when \code{data} has a \code{group} field - in that case, the splitting
will be based on whole groups (note that this might make the folds have different sizes).

Value \code{TRUE} here is \strong{not} supported for custom objectives.}

\item{folds}{List with pre-defined CV folds (each element must be a vector of test fold's indices).
When folds are supplied, the \code{nfold} and \code{stratified} parameters are ignored.

If \code{data} has a \code{group} field and the objective requires this field, each fold (list element)
must additionally have two attributes (retrievable through \code{attributes}) named \code{group_test}
and \code{group_train}, which should hold the \code{group} to assign through
\code{\link[=setinfo.xgb.DMatrix]{setinfo.xgb.DMatrix()}} to the resulting DMatrices.}

\item{train_folds}{List specifying which indices to use for training. If \code{NULL}
(the default) all indices not specified in \code{folds} will be used for training.

This is not supported when \code{data} has a \code{group} field.}

\item{verbose}{Logical flag. Should statistics be printed during the process?}

\item{print_every_n}{Print evaluation messages every n-th iteration when \code{verbose > 0}.
Default is 1, which means all messages are printed. This parameter is passed to the
\code{\link[=xgb.cb.print.evaluation]{xgb.cb.print.evaluation()}} callback.}

\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}} callback.}

\item{maximize}{If \code{feval} and \code{early_stopping_rounds} are set,
then this parameter must be set as well.
When it is \code{TRUE}, it means the larger the evaluation score the better.
This parameter is passed to the \code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}} callback.}

\item{callbacks}{A list of callback functions to perform various tasks during boosting.
See \code{\link[=xgb.Callback]{xgb.Callback()}}. Some of the callbacks are automatically created depending on the
parameters' values. Users can provide either existing or their own callback methods in order
to customize the training process.}

\item{...}{Other parameters to pass to \code{params}.}
}
\value{
An object of class 'xgb.cv.synchronous' with the following elements:
\itemize{
\item \code{call}: Function call.
\item \code{params}: Parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link[=xgb.cb.reset.parameters]{xgb.cb.reset.parameters()}} callback.
\item \code{evaluation_log}: Evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the \code{\link[=xgb.cb.evaluation.log]{xgb.cb.evaluation.log()}} callback.
\item \code{niter}: Number of boosting iterations.
\item \code{nfeatures}: Number of features in training data.
\item \code{folds}: The list of CV folds' indices - either those passed through the \code{folds}
parameter or randomly generated.
\item \code{best_iteration}: Iteration number with the best evaluation metric value
(only available with early stopping).
}

Plus other potential elements that are the result of callbacks, such as a list \code{cv_predict} with
a sub-element \code{pred} when passing \code{prediction = TRUE}, which is added by the
\code{\link[=xgb.cb.cv.predict]{xgb.cb.cv.predict()}} callback (note that one can also pass it manually
under \code{callbacks} with different settings, such as also saving the models created during
cross validation); or a list \code{early_stop} which will contain elements such as
\code{best_iteration} when using the early stopping callback (\code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}}).
}
\description{
The cross validation function of xgboost.
}
\details{
The original sample is randomly partitioned into \code{nfold} equal size subsamples.

Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model,
and the remaining \code{nfold - 1} subsamples are used as training data.

The cross-validation process is then repeated \code{nfold} times, with each of the
\code{nfold} subsamples used exactly once as the validation data. Each of these models
is trained for up to \code{nrounds} boosting iterations.

All observations are used for both training and validation.

Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
}
\examples{
data(agaricus.train, package = "xgboost")

dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

# 5-fold cross validation, tracking two evaluation metrics
cv <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nthread = 2,
  nfold = 5,
  metrics = list("rmse", "auc"),
  max_depth = 3,
  eta = 1,
  objective = "binary:logistic"
)
print(cv)
print(cv, verbose = TRUE)
}
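\section{Reproducible folds (illustrative sketch)}{
The note under \code{params} says that fold creation relies on R's own RNG rather than on the
\code{seed} entry. The following is a minimal sketch of what that implies in practice, assuming the
\code{agaricus.train} data used in the examples; the seed value \code{123} and the object names
\code{cv1}, \code{cv2} and \code{cv3} are arbitrary choices for illustration. It seeds the RNG before
each call and then reuses the stored \code{folds} element in a later call.

\preformatted{
library(xgboost)
data(agaricus.train, package = "xgboost")
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

params <- list(objective = "binary:logistic", max_depth = 3, eta = 1, nthread = 2)

# Seed R's RNG immediately before each call so that both runs draw their folds
# from the same RNG state (the 'seed' entry in 'params' does not affect this).
set.seed(123)
cv1 <- xgb.cv(params = params, data = dtrain, nrounds = 3, nfold = 5, verbose = FALSE)
set.seed(123)
cv2 <- xgb.cv(params = params, data = dtrain, nrounds = 3, nfold = 5, verbose = FALSE)

# Compare the fold indices stored in the two results.
identical(cv1$folds, cv2$folds)

# Alternatively, pass the stored folds explicitly; 'nfold' is then ignored.
cv3 <- xgb.cv(params = params, data = dtrain, nrounds = 3, nfold = 5,
              folds = cv1$folds, verbose = FALSE)
cv3$evaluation_log
}

Passing \code{folds} explicitly is the more robust option when several configurations should be
compared on identical splits, since it does not depend on the RNG state at call time.
}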