% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.cv.R
\name{xgb.cv}
\alias{xgb.cv}
\title{Cross Validation}
\usage{
xgb.cv(
  params = list(),
  data,
  nrounds,
  nfold,
  prediction = FALSE,
  showsd = TRUE,
  metrics = list(),
  obj = NULL,
  feval = NULL,
  stratified = "auto",
  folds = NULL,
  train_folds = NULL,
  verbose = TRUE,
  print_every_n = 1L,
  early_stopping_rounds = NULL,
  maximize = NULL,
  callbacks = list(),
  ...
)
}
\arguments{
\item{params}{The list of parameters. The complete list of parameters is available in the
\href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}.
Below is a shorter summary:
\itemize{
\item \code{objective}: Objective function. Common ones are:
\itemize{
\item \code{reg:squarederror}: Regression with squared loss.
\item \code{binary:logistic}: Logistic regression for classification.
}

See \code{\link[=xgb.train]{xgb.train()}} for the complete list of objectives.
\item \code{eta}: Step size of each boosting step.
\item \code{max_depth}: Maximum depth of the tree.
\item \code{nthread}: Number of threads used in training. If not set, all threads are used.
}

See \code{\link[=xgb.train]{xgb.train()}} for further details.
See also the demo for a walkthrough example in R.

Note that, while \code{params} accepts a \code{seed} entry and will use it for model training if
supplied, this seed is not used for the creation of train-test splits, which instead rely on R's own RNG
system. Thus, for reproducible results, call \code{\link[=set.seed]{set.seed()}} beforehand (see the
reproducibility sketch in the examples).}

\item{data}{An \code{xgb.DMatrix} object, with corresponding fields like \code{label} or bounds as required
for model training by the objective.

Note that only the basic \code{xgb.DMatrix} class is supported - variants such as \code{xgb.QuantileDMatrix}
or \code{xgb.ExtMemDMatrix} are not supported here.}

\item{nrounds}{The maximum number of boosting iterations.}

\item{nfold}{The original dataset is randomly partitioned into \code{nfold} equal-size subsamples.}

\item{prediction}{A logical value indicating whether to return the test fold predictions
from each CV model. This parameter engages the \code{\link[=xgb.cb.cv.predict]{xgb.cb.cv.predict()}} callback.}

\item{showsd}{A logical value indicating whether to show the standard deviation of cross validation results.}

\item{metrics}{List of evaluation metrics to be used in cross validation.
When not specified, the evaluation metric is chosen according to the objective function.
Possible options are:
\itemize{
\item \code{error}: Binary classification error rate
\item \code{rmse}: Root mean square error
\item \code{logloss}: Negative log-likelihood function
\item \code{mae}: Mean absolute error
\item \code{mape}: Mean absolute percentage error
\item \code{auc}: Area under curve
\item \code{aucpr}: Area under PR curve
\item \code{merror}: Exact matching error used to evaluate multi-class classification
}}

\item{obj}{Customized objective function. Returns the gradient and second order
gradient (Hessian) for the given predictions and \code{dtrain}
(see the custom objective sketch in the examples).}

\item{feval}{Customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} for the given predictions and \code{dtrain}
(see the custom metric sketch in the examples).}

\item{stratified}{Logical flag indicating whether sampling of folds should be stratified
by the values of outcome labels. For real-valued labels in regression objectives,
stratification will be done by discretizing the labels into up to 5 buckets beforehand.

If passing "auto", it will be set to \code{TRUE} if the objective in \code{params} is a classification
objective (from XGBoost's built-in objectives; this does not apply to custom ones), and to
\code{FALSE} otherwise.

This parameter is ignored when \code{data} has a \code{group} field - in such a case, the splitting
will be based on whole groups (note that this might make the folds have different sizes).

Value \code{TRUE} here is \strong{not} supported for custom objectives.}

\item{folds}{List with pre-defined CV folds (each element must be a vector of test fold indices).
When folds are supplied, the \code{nfold} and \code{stratified} parameters are ignored
(see the pre-defined folds sketch in the examples).

If \code{data} has a \code{group} field and the objective requires this field, each fold (list element)
must additionally have two attributes (retrievable through \code{attributes}) named \code{group_test}
and \code{group_train}, which should hold the \code{group} to assign through \code{\link[=setinfo.xgb.DMatrix]{setinfo.xgb.DMatrix()}} to
the resulting DMatrices.}

\item{train_folds}{List specifying which indices to use for training. If \code{NULL}
(the default), all indices not specified in \code{folds} will be used for training.

This is not supported when \code{data} has a \code{group} field.}

\item{verbose}{Logical flag. Should statistics be printed during the process?}

\item{print_every_n}{Print evaluation messages on every nth iteration when \code{verbose > 0}.
The default is 1, which means all messages are printed. This parameter is passed to the
\code{\link[=xgb.cb.print.evaluation]{xgb.cb.print.evaluation()}} callback.}

\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}} callback.}

\item{maximize}{If \code{feval} and \code{early_stopping_rounds} are set,
then this parameter must be set as well.
When it is \code{TRUE}, it means that the larger the evaluation score, the better.
This parameter is passed to the \code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}} callback.}

\item{callbacks}{A list of callback functions to perform various tasks during boosting.
See \code{\link[=xgb.Callback]{xgb.Callback()}}. Some of the callbacks are automatically created depending on the
parameters' values. Users can provide either existing or their own callback methods in order
to customize the training process.}

\item{...}{Other parameters to pass to \code{params}.}
}
\value{
An object of class \code{xgb.cv.synchronous} with the following elements:
\itemize{
\item \code{call}: Function call.
\item \code{params}: Parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link[=xgb.cb.reset.parameters]{xgb.cb.reset.parameters()}} callback.
\item \code{evaluation_log}: Evaluation history stored as a \code{data.table} with the
first column corresponding to the iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the \code{\link[=xgb.cb.evaluation.log]{xgb.cb.evaluation.log()}} callback.
\item \code{niter}: Number of boosting iterations.
\item \code{nfeatures}: Number of features in the training data.
\item \code{folds}: The list of CV fold indices - either those passed through the \code{folds}
parameter or randomly generated.
\item \code{best_iteration}: Iteration number with the best evaluation metric value
(only available with early stopping).
}

Plus other potential elements that are the result of callbacks, such as a list \code{cv_predict} with
a sub-element \code{pred} when passing \code{prediction = TRUE}, which is added by the \code{\link[=xgb.cb.cv.predict]{xgb.cb.cv.predict()}}
callback (note that one can also pass it manually under \code{callbacks} with different settings,
such as also saving the models created during cross validation); or a list \code{early_stop} which
will contain elements such as \code{best_iteration} when using the early stopping callback (\code{\link[=xgb.cb.early.stop]{xgb.cb.early.stop()}}).
}
\description{
The cross validation function of xgboost.
}
\details{
The original sample is randomly partitioned into \code{nfold} equal size subsamples.

Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model,
and the remaining \code{nfold - 1} subsamples are used as training data.

The cross-validation process is then repeated \code{nfold} times, with each of the
\code{nfold} subsamples used exactly once as the validation data; each CV model is grown for
\code{nrounds} boosting iterations.

All observations are used for both training and validation.

Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
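
As a rough conceptual sketch only (not the package's internal implementation), an unstratified
assignment of rows to \code{nfold} test folds could look like this:

\preformatted{n <- 100                                       # number of rows in the data
nfold <- 5
fold_id <- sample(rep_len(seq_len(nfold), n))  # random fold label for each row
folds <- split(seq_len(n), fold_id)            # list of test-index vectors, one per fold
}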
}
\examples{
data(agaricus.train, package = "xgboost")

dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))

cv <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nthread = 2,
  nfold = 5,
  metrics = list("rmse", "auc"),
  max_depth = 3,
  eta = 1,
  objective = "binary:logistic"
)
print(cv)
print(cv, verbose = TRUE)
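
# Reproducibility sketch (referenced from the 'params' documentation): the train-test
# splits rely on R's own RNG, so calling set.seed() beforehand makes the fold assignment -
# and hence the CV results - reproducible. This run also shows early stopping and
# per-fold test predictions; it is an illustrative sketch, not a prescribed workflow.
set.seed(123)
cv2 <- xgb.cv(
  data = dtrain,
  nrounds = 10,
  nthread = 2,
  nfold = 5,
  max_depth = 3,
  eta = 1,
  objective = "binary:logistic",
  early_stopping_rounds = 3,
  prediction = TRUE
)
# Out-of-fold predictions are returned under 'cv_predict$pred' (see the 'Value' section):
str(cv2$cv_predict$pred)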
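
# Custom evaluation metric sketch (referenced from the 'feval' documentation):
# 'feval' receives the current predictions and the DMatrix, and must return
# list(metric = ..., value = ...). A rank-based AUC is used here because it is
# invariant to whether the predictions arrive as raw margins or as probabilities.
my_auc <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  r <- rank(preds)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  auc <- (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  list(metric = "custom-auc", value = auc)
}
cv_feval <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nthread = 2,
  nfold = 5,
  max_depth = 3,
  eta = 1,
  objective = "binary:logistic",
  feval = my_auc
)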
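
# Custom objective sketch (referenced from the 'obj' documentation): the function
# returns the gradient and second order gradient (Hessian) of the loss with respect
# to the raw predictions. This re-implements the logistic loss for illustration and
# is paired with the custom metric defined above.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  p <- 1 / (1 + exp(-preds))  # predictions are raw margins for custom objectives
  list(grad = p - labels, hess = p * (1 - p))
}
cv_obj <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nthread = 2,
  nfold = 5,
  max_depth = 3,
  eta = 1,
  obj = logregobj,
  feval = my_auc
)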
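
# Pre-defined folds sketch (referenced from the 'folds' documentation): each list
# element holds the test indices of one fold. 'nfold' and 'stratified' are ignored
# when 'folds' is supplied. Illustrative only.
n <- length(agaricus.train$label)
my_folds <- split(seq_len(n), sample(rep_len(1:5, n)))
cv_folds <- xgb.cv(
  data = dtrain,
  nrounds = 3,
  nthread = 2,
  nfold = 5,  # ignored here because 'folds' is supplied
  folds = my_folds,
  max_depth = 3,
  eta = 1,
  objective = "binary:logistic"
)
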
}