fixed typos in R package docs (#4345)

* fixed typos in R package docs

* updated verbosity parameter in xgb.train docs
James Lamb 2019-04-21 02:54:11 -05:00 committed by Jiaming Yuan
parent 65db8d0626
commit 5e97de6a41
30 changed files with 414 additions and 413 deletions

View File

@ -1,26 +1,26 @@
#' Callback closures for booster training.
#'
#' These are used to perform various service tasks either during boosting iterations or at the end.
#' This approach helps to modularize many such tasks without bloating the main training methods.
#'
#' @details
#' By default, a callback function is run after each boosting iteration.
#' An R-attribute \code{is_pre_iteration} could be set for a callback to define a pre-iteration function.
#'
#' When a callback function has a \code{finalize} parameter, its finalizer part will also be run after
#' the boosting is completed.
#'
#' WARNING: side-effects!!! Be aware that these callback functions access and modify things in
#' the environment from which they are called, which is a fairly uncommon thing to do in R.
#'
#' To write a custom callback closure, make sure you first understand the main concepts about R environments.
#' Check either R documentation on \code{\link[base]{environment}} or the
#' \href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R"
#' book by Hadley Wickham. Further, the best option is to read the code of some of the existing callbacks -
#' choose ones that do something similar to what you want to achieve. Also, you would need to get familiar
#' with the objects available inside of the \code{xgb.train} and \code{xgb.cv} internal environments.
#'
#' @seealso
#' \code{\link{cb.print.evaluation}},
#' \code{\link{cb.evaluation.log}},
@ -30,42 +30,42 @@
#' \code{\link{cb.cv.predict}},
#' \code{\link{xgb.train}},
#' \code{\link{xgb.cv}}
#'
#' @name callbacks
NULL
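# Editorial sketch (not part of the package source): a minimal custom callback closure
# following the same pattern as the callbacks defined below. The attribute names
# ('call', 'is_pre_iteration') mirror what the training loop inspects; the function name
# and the message text are made up for illustration.
cb.print.iteration <- function() {
  callback <- function(env = parent.frame()) {
    # 'iteration' is one of the variables the training loop keeps in its calling frame
    cat("starting iteration ", env$iteration, "\n", sep = "")
  }
  attr(callback, 'call') <- match.call()
  # mark it to run before (rather than after) each boosting iteration
  attr(callback, 'is_pre_iteration') <- TRUE
  callback
}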
#
# Callbacks -------------------------------------------------------------------
#
#' Callback closure for printing the result of evaluation
#'
#' @param period results would be printed every \code{period} iterations
#' @param showsd whether standard deviations should be printed (when available)
#'
#' @details
#' The callback function prints the result of evaluation at every \code{period} iterations.
#' The initial and the last iteration's evaluations are always printed.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst_evaluation} (also \code{bst_evaluation_err} when available),
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.print.evaluation <- function(period = 1, showsd = TRUE) {
callback <- function(env = parent.frame()) {
if (length(env$bst_evaluation) == 0 ||
period == 0 ||
NVL(env$rank, 0) != 0 )
return()
i <- env$iteration
if ((i-1) %% period == 0 ||
i == env$begin_iteration ||
i == env$end_iteration) {
@ -81,48 +81,48 @@ cb.print.evaluation <- function(period = 1, showsd = TRUE) {
#' Callback closure for logging the evaluation history
#'
#' @details
#' This callback function appends the current iteration evaluation results \code{bst_evaluation}
#' available in the calling parent frame to the \code{evaluation_log} list in a calling frame.
#'
#' The finalizer callback (called with \code{finalize = TRUE} in the end) converts
#' the \code{evaluation_log} list into a final data.table.
#'
#' The iteration evaluation result \code{bst_evaluation} must be a named numeric vector.
#'
#' Note: in the column names of the final data.table, the dash '-' character is replaced with
#' the underscore '_' in order to make the column names more like regular R identifiers.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{evaluation_log},
#' \code{bst_evaluation},
#' \code{iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.evaluation.log <- function() {
mnames <- NULL
init <- function(env) {
if (!is.list(env$evaluation_log))
stop("'evaluation_log' has to be a list")
mnames <<- names(env$bst_evaluation)
if (is.null(mnames) || any(mnames == ""))
stop("bst_evaluation must have non-empty names")
mnames <<- gsub('-', '_', names(env$bst_evaluation))
if(!is.null(env$bst_evaluation_err))
mnames <<- c(paste0(mnames, '_mean'), paste0(mnames, '_std'))
}
finalizer <- function(env) {
env$evaluation_log <- as.data.table(t(simplify2array(env$evaluation_log)))
setnames(env$evaluation_log, c('iter', mnames))
if(!is.null(env$bst_evaluation_err)) {
# rearrange col order from _mean,_mean,...,_std,_std,...
# to be _mean,_std,_mean,_std,...
@ -135,18 +135,18 @@ cb.evaluation.log <- function() {
env$evaluation_log <- env$evaluation_log[, c('iter', cnames), with = FALSE]
}
}
callback <- function(env = parent.frame(), finalize = FALSE) {
if (is.null(mnames))
init(env)
if (finalize)
return(finalizer(env))
ev <- env$bst_evaluation
if(!is.null(env$bst_evaluation_err))
ev <- c(ev, env$bst_evaluation_err)
env$evaluation_log <- c(env$evaluation_log,
list(c(iter = env$iteration, ev)))
}
attr(callback, 'call') <- match.call()
@ -154,21 +154,21 @@ cb.evaluation.log <- function() {
callback
}
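# Editorial sketch (not part of the package source): what the history logged by this callback
# looks like from the user side, using the same agaricus data as the package examples.
# dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# bst <- xgb.train(list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2),
#                  dtrain, nrounds = 5, watchlist = list(train = dtrain))
# bst$evaluation_log  # a data.table: an 'iter' column plus one column per watchlist metric,
#                     # with dashes in metric names replaced by underscores (e.g. 'train_error')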
#' Callback closure for resetting the booster's parameters at each iteration.
#'
#' @param new_params a list where each element corresponds to a parameter that needs to be reset.
#' Each element's value must be either a vector of values of length \code{nrounds}
#' to be set at each iteration,
#' or a function of two parameters \code{learning_rates(iteration, nrounds)}
#' which returns a new parameter value by using the current iteration number
#' and the total number of boosting rounds.
#'
#' @details
#' This is a "pre-iteration" callback function used to reset booster's parameters
#' at the beginning of each iteration.
#'
#' Note that when training is resumed from some previous model, and a function is used to
#' reset a parameter value, the \code{nrounds} argument in this function would be the
#' number of boosting rounds in the current training.
#'
#' Callback function expects the following values to be set in its calling frame:
@ -176,32 +176,32 @@ cb.evaluation.log <- function() {
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.reset.parameters <- function(new_params) {
if (typeof(new_params) != "list")
stop("'new_params' must be a list")
pnames <- gsub("\\.", "_", names(new_params))
nrounds <- NULL
# run some checks in the beginning
init <- function(env) {
nrounds <<- env$end_iteration - env$begin_iteration + 1
if (is.null(env$bst) && is.null(env$bst_folds))
stop("Parent frame has neither 'bst' nor 'bst_folds'")
# Some parameters are not allowed to be changed,
# since changing them would simply wreak havoc
not_allowed <- pnames %in%
c('num_class', 'num_output_group', 'size_leaf_vector', 'updater_seq')
if (any(not_allowed))
stop('Parameters ', paste(pnames[not_allowed]), " cannot be changed during boosting.")
for (n in pnames) {
p <- new_params[[n]]
if (is.function(p)) {
@ -215,18 +215,18 @@ cb.reset.parameters <- function(new_params) {
}
}
}
callback <- function(env = parent.frame()) {
if (is.null(nrounds))
init(env)
i <- env$iteration
pars <- lapply(new_params, function(p) {
if (is.function(p))
return(p(i, nrounds))
p[i]
})
if (!is.null(env$bst)) {
xgb.parameters(env$bst$handle) <- pars
} else {
@ -242,23 +242,23 @@ cb.reset.parameters <- function(new_params) {
#' Callback closure to activate the early stopping.
#'
#' @param stopping_rounds The number of rounds with no improvement in
#' the evaluation metric in order to stop the training.
#' @param maximize whether to maximize the evaluation metric
#' @param metric_name the name of an evaluation column to use as a criteria for early
#' stopping. If not set, the last column would be used.
#' Let's say the test data in \code{watchlist} was labelled as \code{dtest},
#' and one wants to use the AUC in test data for early stopping regardless of where
#' it is in the \code{watchlist}, then one of the following would need to be set:
#' \code{metric_name='dtest-auc'} or \code{metric_name='dtest_auc'}.
#' All dash '-' characters in metric names are considered equivalent to '_'.
#' @param verbose whether to print the early stopping information.
#'
#' @details
#' This callback function determines the condition for early stopping
#' by setting the \code{stop_condition = TRUE} flag in its calling frame.
#'
#' The following additional fields are assigned to the model's R object:
#' \itemize{
#' \item \code{best_score} the evaluation score at the best iteration
@ -266,13 +266,13 @@ cb.reset.parameters <- function(new_params) {
#' \item \code{best_ntreelimit} to use with the \code{ntreelimit} parameter in \code{predict}.
#' It differs from \code{best_iteration} in multiclass or random forest settings.
#' }
#'
#' The same values are also stored as xgb-attributes:
#' \itemize{
#' \item \code{best_iteration} is stored as a 0-based iteration index (for interoperability of binary models)
#' \item \code{best_msg} message string is also stored.
#' }
#'
#' At least one data element is required in the evaluation watchlist for early stopping to work.
#'
#' Callback function expects the following values to be set in its calling frame:
@ -284,13 +284,13 @@ cb.reset.parameters <- function(new_params) {
#' \code{begin_iteration},
#' \code{end_iteration},
#' \code{num_parallel_tree}.
#'
#' @seealso
#' \code{\link{callbacks}},
#' \code{\link{xgb.attr}}
#'
#' @export
cb.early.stop <- function(stopping_rounds, maximize = FALSE,
metric_name = NULL, verbose = TRUE) {
# state variables
best_iteration <- -1
@ -298,11 +298,11 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
best_score <- Inf
best_msg <- NULL
metric_idx <- 1
init <- function(env) {
if (length(env$bst_evaluation) == 0)
stop("For early stopping, watchlist must have at least one element")
eval_names <- gsub('-', '_', names(env$bst_evaluation))
if (!is.null(metric_name)) {
metric_idx <<- which(gsub('-', '_', metric_name) == eval_names)
@ -314,25 +314,25 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
length(env$bst_evaluation) > 1) {
metric_idx <<- length(eval_names)
if (verbose)
cat('Multiple eval metrics are present. Will use ',
eval_names[metric_idx], ' for early stopping.\n', sep = '')
}
metric_name <<- eval_names[metric_idx]
# maximize is usually NULL when not set in xgb.train and built-in metrics
if (is.null(maximize))
maximize <<- grepl('(_auc|_map|_ndcg)', metric_name)
if (verbose && NVL(env$rank, 0) == 0)
cat("Will train until ", metric_name, " hasn't improved in ",
cat("Will train until ", metric_name, " hasn't improved in ",
stopping_rounds, " rounds.\n\n", sep = '')
best_iteration <<- 1
if (maximize) best_score <<- -Inf
env$stop_condition <- FALSE
if (!is.null(env$bst)) {
if (!inherits(env$bst, 'xgb.Booster'))
stop("'bst' in the parent frame must be an 'xgb.Booster'")
@ -348,7 +348,7 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
stop("Parent frame has neither 'bst' nor ('bst_folds' and 'basket')")
}
}
finalizer <- function(env) {
if (!is.null(env$bst)) {
attr_best_score = as.numeric(xgb.attr(env$bst$handle, 'best_score'))
@ -367,16 +367,16 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
callback <- function(env = parent.frame(), finalize = FALSE) {
if (best_iteration < 0)
init(env)
if (finalize)
return(finalizer(env))
i <- env$iteration
score = env$bst_evaluation[metric_idx]
if (( maximize && score > best_score) ||
(!maximize && score < best_score)) {
best_msg <<- format.eval.string(i, env$bst_evaluation, env$bst_evaluation_err)
best_score <<- score
best_iteration <<- i
@ -403,37 +403,37 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
#' Callback closure for saving a model file.
#'
#' @param save_period save the model to disk after every
#' \code{save_period} iterations; 0 means save the model at the end.
#' @param save_name the name or path for the saved model file.
#' It can contain a \code{\link[base]{sprintf}} formatting specifier
#' to include the integer iteration number in the file name.
#' E.g., with \code{save_name} = 'xgboost_%04d.model',
#' the file saved at iteration 50 would be named "xgboost_0050.model".
#'
#' @details
#' This callback function allows saving an xgb-model file, either periodically after every \code{save_period} iterations or at the end.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst},
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
if (save_period < 0)
stop("'save_period' cannot be negative")
callback <- function(env = parent.frame()) {
if (is.null(env$bst))
stop("'save_model' callback requires the 'bst' booster object in its calling frame")
if ((save_period > 0 && (env$iteration - env$begin_iteration) %% save_period == 0) ||
(save_period == 0 && env$iteration == env$end_iteration))
xgb.save(env$bst, sprintf(save_name, env$iteration))
@ -445,16 +445,16 @@ cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
#' Callback closure for returning cross-validation based predictions.
#'
#' @param save_models a flag for whether to save the folds' models.
#'
#' @details
#' This callback function saves predictions for all of the test folds,
#' and also allows saving the folds' models.
#'
#' It is a "finalizer" callback and it uses early stopping information whenever it is available,
#' thus it must be run after the early stopping callback if the early stopping is used.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst_folds},
#' \code{basket},
@ -463,36 +463,36 @@ cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
#' \code{params},
#' \code{num_parallel_tree},
#' \code{num_class}.
#'
#' @return
#' Predictions are returned inside of the \code{pred} element, which is either a vector or a matrix,
#' depending on the number of prediction outputs per data row. The order of predictions corresponds
#' to the order of rows in the original dataset. Note that when a custom \code{folds} list is
#' provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a
#' non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be
#' meaningful when user-provided folds have overlapping indices as in, e.g., random sampling splits.
#' When some of the indices in the training dataset are not included into user-provided \code{folds},
#' their prediction value would be \code{NA}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.cv.predict <- function(save_models = FALSE) {
finalizer <- function(env) {
if (is.null(env$basket) || is.null(env$bst_folds))
stop("'cb.cv.predict' callback requires 'basket' and 'bst_folds' lists in its calling frame")
N <- nrow(env$data)
pred <-
if (env$num_class > 1) {
matrix(NA_real_, N, env$num_class)
} else {
rep(NA_real_, N)
}
ntreelimit <- NVL(env$basket$best_ntreelimit,
env$end_iteration * env$num_parallel_tree)
if (NVL(env$params[['booster']], '') == 'gblinear') {
ntreelimit <- 0 # must be 0 for gblinear
@ -569,7 +569,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' # Extract the coefficients' path and plot them vs boosting iteration number:
#' coef_path <- xgb.gblinear.history(bst)
#' matplot(coef_path, type = 'l')
#'
#' # With the deterministic coordinate descent updater, it is safer to use higher learning rates.
#' # Will try the classical componentwise boosting which selects a single best feature per round:
#' bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 200, eta = 0.8,
@ -586,7 +586,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' # coefficients in the CV fold #3
#' xgb.gblinear.history(bst)[[3]] %>% matplot(type = 'l')
#'
#'
#' #### Multiclass classification:
#' #
#' dtrain <- xgb.DMatrix(scale(x), label = as.numeric(iris$Species) - 1)
@ -681,9 +681,9 @@ cb.gblinear.history <- function(sparse=FALSE) {
#' using the \code{cb.gblinear.history()} callback.
#' @param class_index zero-based class index to extract the coefficients for only that
#' specific class in a multinomial multiclass model. When it is NULL, all the
#' coefficients are returned. Has no effect in non-multiclass models.
#'
#' @return
#' For an \code{xgb.train} result, a matrix (either dense or sparse) with the columns
#' corresponding to iteration's coefficients (in the order as \code{xgb.dump()} would
#' return) and the rows corresponding to boosting iterations.
@ -731,7 +731,7 @@ xgb.gblinear.history <- function(model, class_index = NULL) {
coef_path <- environment(model$callbacks$cb.gblinear.history)[["coefs"]]
if (!is.null(class_index) && num_class > 1) {
coef_path <- if (is.list(coef_path)) {
lapply(coef_path,
function(x) x[, seq(1 + class_index, by=num_class, length.out=num_feat)])
} else {
coef_path <- coef_path[, seq(1 + class_index, by=num_class, length.out=num_feat)]
@ -743,7 +743,7 @@ xgb.gblinear.history <- function(model, class_index = NULL) {
#
# Internal utility functions for callbacks ------------------------------------
#
# Format the evaluation metric string
format.eval.string <- function(iter, eval_res, eval_err = NULL) {
@ -773,7 +773,7 @@ callback.calls <- function(cb_list) {
unlist(lapply(cb_list, function(x) attr(x, 'call')))
}
# Add a callback cb to the list and make sure that
# cb.early.stop and cb.cv.predict are at the end of the list
# with cb.cv.predict being the last (when present)
add.cb <- function(cb_list, cb) {
@ -782,11 +782,11 @@ add.cb <- function(cb_list, cb) {
if ('cb.early.stop' %in% names(cb_list)) {
cb_list <- c(cb_list, cb_list['cb.early.stop'])
# this removes only the first one
cb_list['cb.early.stop'] <- NULL
}
if ('cb.cv.predict' %in% names(cb_list)) {
cb_list <- c(cb_list, cb_list['cb.cv.predict'])
cb_list['cb.cv.predict'] <- NULL
}
cb_list
}
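# Editorial sketch (not part of the package source): the effect of the reordering above.
# add.cb() is internal, so the `:::` access below is purely illustrative, and the resulting
# element names depend on internals not shown in this hunk.
# cbs <- xgboost:::add.cb(list(), xgboost::cb.cv.predict())
# cbs <- xgboost:::add.cb(cbs, xgboost::cb.print.evaluation(period = 5))
# names(cbs)  # 'cb.cv.predict' ends up last, regardless of the order it was added in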
@ -796,7 +796,7 @@ categorize.callbacks <- function(cb_list) {
list(
pre_iter = Filter(function(x) {
pre <- attr(x, 'is_pre_iteration')
!is.null(pre) && pre
}, cb_list),
post_iter = Filter(function(x) {
pre <- attr(x, 'is_pre_iteration')

View File

@ -81,7 +81,7 @@ xgb.get.handle <- function(object) {
#' its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods
#' should still work for such a model object since those methods would be using
#' \code{xgb.Booster.complete} internally. However, one might find it to be more efficient to call the
#' \code{xgb.Booster.complete} function explicitly once after loading a model as an R-object.
#' That would prevent further repeated implicit reconstruction of an internal booster model.
#'
#' @return
@ -162,7 +162,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#'
#' With \code{predinteraction = TRUE}, SHAP values of contributions of interaction of each pair of features
#' are computed. Note that this operation might be rather expensive in terms of compute and memory.
#' Since it quadratically depends on the number of features, it is recommended to perform selection
#' of the most important features first. See below about the format of the returned results.
#'
#' @return
@ -190,7 +190,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#'
#' @seealso
#' \code{\link{xgb.train}}.
#'
#' @references
#'
#' Scott M. Lundberg, Su-In Lee, "A Unified Approach to Interpreting Model Predictions", NIPS Proceedings 2017, \url{https://arxiv.org/abs/1705.07874}
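# Editorial sketch (not part of the package source): calling xgb.Booster.complete() explicitly
# once after restoring a model saved as an R object, as the paragraph above recommends.
# The file name "xgb_model.rds" is hypothetical.
# bst <- readRDS("xgb_model.rds")
# bst <- xgb.Booster.complete(bst)  # rebuild the internal handle once, up front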

View File

@ -1,18 +1,18 @@
#' Construct xgb.DMatrix object
#'
#' Construct xgb.DMatrix object from either a dense matrix, a sparse matrix, or a local file.
#' Supported input file formats are either a libsvm text file or a binary file that was created previously by
#' \code{\link{xgb.DMatrix.save}}.
#'
#' @param data a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
#' string representing a filename.
#' @param info a named list of additional information to store in the \code{xgb.DMatrix} object.
#' See \code{\link{setinfo}} for the specific allowed kinds of information.
#' @param missing a float value to represent missing values in data (used only when input is a dense matrix).
#' It is useful when a 0 or some other extreme value represents missing values in data.
#' @param silent whether to suppress printing an informational message after loading from a file.
#' @param ... the \code{info} data could be passed directly as parameters, without creating an \code{info} list.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
@ -78,23 +78,23 @@ xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
#' Dimensions of xgb.DMatrix
#'
#' Returns a vector of numbers of rows and of columns in an \code{xgb.DMatrix}.
#' @param x Object of class \code{xgb.DMatrix}
#'
#' @details
#' Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also
#' be directly used with an \code{xgb.DMatrix} object.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' stopifnot(nrow(dtrain) == nrow(train$data))
#' stopifnot(ncol(dtrain) == ncol(train$data))
#' stopifnot(all(dim(dtrain) == dim(train$data)))
#'
#' @export
dim.xgb.DMatrix <- function(x) {
c(.Call(XGDMatrixNumRow_R, x), .Call(XGDMatrixNumCol_R, x))
@ -102,14 +102,14 @@ dim.xgb.DMatrix <- function(x) {
#' Handling of column names of \code{xgb.DMatrix}
#'
#' Only column names are supported for \code{xgb.DMatrix}, thus setting of
#' row names would have no effect and returned row names would be NULL.
#'
#' @param x object of class \code{xgb.DMatrix}
#' @param value a list of two elements: the first one is ignored
#' and the second one is column names
#'
#' @details
#' Generic \code{dimnames} methods are used by \code{colnames}.
#' Since row names are irrelevant, it is recommended to use \code{colnames} directly.
@ -122,7 +122,7 @@ dim.xgb.DMatrix <- function(x) {
#' colnames(dtrain)
#' colnames(dtrain) <- make.names(1:ncol(train$data))
#' print(dtrain, verbose=TRUE)
#'
#' @rdname dimnames.xgb.DMatrix
#' @export
dimnames.xgb.DMatrix <- function(x) {
@ -140,8 +140,8 @@ dimnames.xgb.DMatrix <- function(x) {
attr(x, '.Dimnames') <- NULL
return(x)
}
if (ncol(x) != length(value[[2]]))
stop("can't assign ", length(value[[2]]), " colnames to a ",
ncol(x), " column xgb.DMatrix")
attr(x, '.Dimnames') <- value
x
@ -149,33 +149,33 @@ dimnames.xgb.DMatrix <- function(x) {
#' Get information of an xgb.DMatrix object
#'
#' Get information of an xgb.DMatrix object
#' @param object Object of class \code{xgb.DMatrix}
#' @param name the name of the information field to get (see details)
#' @param ... other parameters
#'
#' @details
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#' \item \code{label}: label Xgboost learns from;
#' \item \code{weight}: to do a weight rescale;
#' \item \code{base_margin}: base margin is the base prediction Xgboost will boost from;
#' \item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
#'
#' }
#'
#' \code{group} can be set up by \code{setinfo} but can't be retrieved by \code{getinfo}.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
#'
#' labels2 <- getinfo(dtrain, 'label')
#' stopifnot(all(labels2 == 1-labels))
#' @rdname getinfo
@ -202,9 +202,9 @@ getinfo.xgb.DMatrix <- function(object, name, ...) {
#' Set information of an xgb.DMatrix object
#'
#' Set information of an xgb.DMatrix object
#'
#' @param object Object of class "xgb.DMatrix"
#' @param name the name of the field to set
#' @param info the specific field of information to set
@ -212,19 +212,19 @@ getinfo.xgb.DMatrix <- function(object, name, ...) {
#'
#' @details
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#' \item \code{label}: label Xgboost learns from;
#' \item \code{weight}: to do a weight rescale;
#' \item \code{base_margin}: base margin is the base prediction Xgboost will boost from;
#' \item \code{group}: number of rows in each group (to use with \code{rank:pairwise} objective).
#' }
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
#' labels2 <- getinfo(dtrain, 'label')
@ -266,27 +266,27 @@ setinfo.xgb.DMatrix <- function(object, name, info, ...) {
#' Get a new DMatrix containing the specified rows of
#' original xgb.DMatrix object
#'
#' Get a new DMatrix containing the specified rows of
#' original xgb.DMatrix object
#'
#' @param object Object of class "xgb.DMatrix"
#' @param idxset an integer vector of indices of rows needed
#' @param colset currently not used (columns subsetting is not available)
#' @param ... other parameters (currently not used)
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' dsub <- slice(dtrain, 1:42)
#' labels1 <- getinfo(dsub, 'label')
#' dsub <- dtrain[1:42, ]
#' labels2 <- getinfo(dsub, 'label')
#' all.equal(labels1, labels2)
#'
#' @rdname slice.xgb.DMatrix
#' @export
slice <- function(object, ...) UseMethod("slice")
@ -325,22 +325,22 @@ slice.xgb.DMatrix <- function(object, idxset, ...) {
#' Print xgb.DMatrix
#'
#' Print information about xgb.DMatrix.
#' Currently it displays dimensions and presence of info-fields and colnames.
#'
#' @param x an xgb.DMatrix object
#' @param verbose whether to print colnames (when present)
#' @param ... not currently used
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' dtrain
#' print(dtrain, verbose=TRUE)
#'
#' @method print xgb.DMatrix
#' @export
print.xgb.DMatrix <- function(x, verbose = FALSE, ...) {

View File

@ -39,7 +39,7 @@
#' }
#' @param obj customized objective function. Returns gradient and second order
#' gradient with given prediction and dtrain.
#' @param feval customized evaluation function. Returns
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain.
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
@ -84,7 +84,7 @@
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitly passed.
#' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to the
#' CV-based evaluation means and standard deviations for the training and test CV-sets.
#' It is created by the \code{\link{cb.evaluation.log}} callback.
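# Editorial sketch (not part of the package source): reading the CV history and out-of-fold
# predictions described above; `prediction = TRUE` is assumed to engage the cb.cv.predict callback.
# dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# cv <- xgb.cv(params = list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2),
#              data = dtrain, nrounds = 10, nfold = 5, prediction = TRUE)
# cv$evaluation_log   # per-iteration train/test means and standard deviations
# head(cv$pred)       # out-of-fold predictions in the original row order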

View File

@ -5,16 +5,16 @@
#'
#' @param importance_matrix a \code{data.table} returned by \code{\link{xgb.importance}}.
#' @param top_n maximal number of top features to include into the plot.
#' @param measure the name of importance measure to plot.
#' When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.
#' @param rel_to_first whether importance values should be represented as relative to the highest ranked feature.
#' See Details.
#' @param left_margin (base R barplot) allows to adjust the left margin size to fit feature names.
#' When it is NULL, the existing \code{par('mar')} is used.
#' @param cex (base R barplot) passed as \code{cex.names} parameter to \code{barplot}.
#' @param plot (base R barplot) whether a barplot should be produced.
#' If FALSE, only a data.table is returned.
#' @param n_clusters (ggplot only) a \code{numeric} vector containing the min and the max range
#' of the possible number of clusters of bars.
#' @param ... other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).
#'
@ -22,27 +22,27 @@
#' The graph represents each feature as a horizontal bar of length proportional to the importance of a feature.
#' Features are shown ranked in a decreasing importance order.
#' It works for importances from both \code{gblinear} and \code{gbtree} models.
#'
#' When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}.
#' For gbtree model, that would mean being normalized to the total of 1
#' ("what is feature's importance contribution relative to the whole model?").
#' For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients.
#' Setting \code{rel_to_first = TRUE} allows one to see the picture from the perspective of
#' "what is feature's importance contribution relative to the most important feature?"
#'
#' The ggplot-backend method also performs 1-D clustering of the importance values,
#' with bar colors corresponding to different clusters that have somewhat similar importance values.
#'
#' @return
#' The \code{xgb.plot.importance} function creates a \code{barplot} (when \code{plot=TRUE})
#' and silently returns a processed data.table with \code{n_top} features sorted by importance.
#'
#' The \code{xgb.ggplot.importance} function returns a ggplot graph which could be customized afterwards.
#' E.g., to change the title of the graph, add \code{+ ggtitle("A GRAPH NAME")} to the result.
#'
#' @seealso
#' \code{\link[graphics]{barplot}}.
#'
#' @examples
#' data(agaricus.train)
#'
@ -50,15 +50,15 @@
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#'
#' importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
#'
#' xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
#'
#' (gg <- xgb.ggplot.importance(importance_matrix, measure = "Frequency", rel_to_first = TRUE))
#' gg + ggplot2::ylab("Frequency")
#'
#' @rdname xgb.plot.importance
#' @export
xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure = NULL,
rel_to_first = FALSE, left_margin = 10, cex = NULL, plot = TRUE, ...) {
check.deprecation(...)
if (!is.data.table(importance_matrix)) {
@ -80,13 +80,13 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
if (!"Feature" %in% imp_names)
stop("Importance matrix column names are not as expected!")
}
# also aggregate, just in case when the values were not yet summed up by feature
importance_matrix <- importance_matrix[, Importance := sum(get(measure)), by = Feature]
# make sure it's ordered
importance_matrix <- importance_matrix[order(-abs(Importance))]
if (!is.null(top_n)) {
top_n <- min(top_n, nrow(importance_matrix))
importance_matrix <- head(importance_matrix, top_n)
@ -97,14 +97,14 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
if (is.null(cex)) {
cex <- 2.5/log2(1 + nrow(importance_matrix))
}
if (plot) {
op <- par(no.readonly = TRUE)
mar <- op$mar
if (!is.null(left_margin))
mar[2] <- left_margin
par(mar = mar)
# reverse the order of rows to have the highest ranked at the top
importance_matrix[nrow(importance_matrix):1,
barplot(Importance, horiz = TRUE, border = NA, cex.names = cex,
@ -115,7 +115,7 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
barplot(Importance, horiz = TRUE, border = NA, add = TRUE)]
par(op)
}
invisible(importance_matrix)
}
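# Editorial sketch (not part of the package source): the documented top_n and plot arguments
# can be used to get the processed importance table without drawing anything; this mirrors
# the example given above.
# bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
#                eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
# importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
# top5 <- xgb.plot.importance(importance_matrix, top_n = 5, plot = FALSE)
# top5  # a data.table of the 5 highest-ranked features, with an aggregated 'Importance' column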

View File

@ -1,9 +1,9 @@
#' SHAP contribution dependency plots
#'
#' Visualizing the SHAP feature contribution to prediction dependencies on feature value.
#'
#' @param data data as a \code{matrix} or \code{dgCMatrix}.
#' @param shap_contrib a matrix of SHAP contributions that was computed earlier for the above
#' \code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.
#' @param features a vector of either column indices or of feature names to plot. When it is NULL,
#' feature importance is calculated, and \code{top_n} high ranked features are taken.
@ -31,32 +31,32 @@
#' @param plot_loess whether to plot loess-smoothed curves. The smoothing is only done for features with
#' more than 5 distinct values.
#' @param col_loess a color to use for the loess curves.
#' @param span_loess the \code{span} parameter in \code{\link[stats]{loess}}'s call.
#' @param which whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.
#' @param plot whether a plot should be drawn. If FALSE, only a list of matrices is returned.
#' @param ... other parameters passed to \code{plot}.
#'
#' @details
#'
#' These scatterplots represent how SHAP feature contributions depend on feature values.
#' The similarity to partial dependency plots is that they also give an idea for how feature values
#' affect predictions. However, in partial dependency plots, we usually see marginal dependencies
#' of model prediction on feature value, while SHAP contribution dependency plots display the estimated
#' contributions of a feature to model prediction for each individual case.
#'
#' When \code{plot_loess = TRUE} is set, feature values are rounded to 3 significant digits and
#' weighted LOESS is computed and plotted, where weights are the numbers of data points
#' at each rounded value.
#'
#' Note: SHAP contributions are shown on the scale of model margin. E.g., for a logistic binomial objective,
#' the margin is prediction before a sigmoidal transform into probability-like values.
#' Also, since SHAP stands for "SHapley Additive exPlanation" (model prediction = sum of SHAP
#' contributions for all features + bias), depending on the objective used, transforming SHAP
#' contributions for a feature from the marginal to the prediction space is not necessarily
#' a meaningful thing to do.
#'
#' @return
#'
#' In addition to producing plots (when \code{plot=TRUE}), it silently returns a list of two matrices:
#' \itemize{
#' \item \code{data} the values of selected features;
@ -70,11 +70,11 @@
#' Scott M. Lundberg, Su-In Lee, "Consistent feature attribution for tree ensembles", \url{https://arxiv.org/abs/1706.06060}
#'
#' @examples
#'
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#'
#' bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
#' eta = 0.1, max_depth = 3, subsample = .5,
#' method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0)
#'
@ -99,7 +99,7 @@
#' n_col = 2, col = col, pch = 16, pch_NA = 17)
#' xgb.plot.shap(x, model = mbst, trees = trees0 + 2, target_class = 2, top_n = 4,
#' n_col = 2, col = col, pch = 16, pch_NA = 17)
#'
#' @rdname xgb.plot.shap
#' @export
xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1, model = NULL,
@ -109,7 +109,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
plot_NA = TRUE, col_NA = rgb(0.7, 0, 1, 0.6), pch_NA = '.', pos_NA = 1.07,
plot_loess = TRUE, col_loess = 2, span_loess = 0.5,
which = c("1d", "2d"), plot = TRUE, ...) {
if (!is.matrix(data) && !inherits(data, "dgCMatrix"))
stop("data: must be either matrix or dgCMatrix")
@ -122,7 +122,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
if (!is.null(shap_contrib) &&
(!is.matrix(shap_contrib) || nrow(shap_contrib) != nrow(data) || ncol(shap_contrib) != ncol(data) + 1))
stop("shap_contrib is not compatible with the provided data")
nsample <- if (is.null(subsample)) min(100000, nrow(data)) else as.integer(subsample * nrow(data))
idx <- sample(1:nrow(data), nsample)
data <- data[idx,]
@ -144,13 +144,13 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
stop("top_n: must be an integer within [1, 100]")
features <- imp$Feature[1:min(top_n, NROW(imp))]
}
if (is.character(features)) {
if (is.null(colnames(data)))
stop("Either provide `data` with column names or provide `features` as column indices")
features <- match(features, colnames(data))
}
if (n_col > length(features)) n_col <- length(features)
if (is.list(shap_contrib)) { # multiclass: either choose a class or merge
@ -165,7 +165,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
if (is.null(cols)) cols <- paste0('X', 1:ncol(data))
colnames(data) <- cols
colnames(shap_contrib) <- cols
if (plot && which == "1d") {
op <- par(mfrow = c(ceiling(length(features) / n_col), n_col),
oma = c(0,0,0,0) + 0.2,

View File

@ -1,44 +1,44 @@
#' eXtreme Gradient Boosting Training
#'
#' \code{xgb.train} is an advanced interface for training an xgboost model.
#' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
#'
#' @param params the list of parameters.
#' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
#' Below is a shorter summary:
#'
#' 1. General Parameters
#'
#' \itemize{
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
#' }
#'
#' 2. Booster Parameters
#'
#' 2.1. Parameter for Tree Booster
#'
#' \itemize{
#' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
#' \item \code{monotone_constraints} A numerical vector consists of \code{1}, \code{0} and \code{-1} with its length equals to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' \item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
#' }
#'
#' 2.2. Parameter for Linear Booster
#'
#' \itemize{
#' \item \code{lambda} L2 regularization term on weights. Default: 0
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' }
#'
#' 3. Task Parameters
#'
#' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \itemize{
@ -54,32 +54,32 @@
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
#' }
#'
#' @param data training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input.
#' \code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or name of a local data file.
#' @param nrounds max number of boosting iterations.
#' @param watchlist named list of xgb.DMatrix datasets to use for evaluating model performance.
#' Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
#' of these datasets during each boosting iteration, and stored in the end as a field named
#' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or the
#' \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
#' printed out during the training.
#' E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows to track
#' the performance of each round's model on mat1 and mat2.
#' @param obj customized objective function. Returns gradient and second order
#' gradient with given prediction and dtrain.
#' @param feval customized evaluation function. Returns
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain.
#' @param verbose If 0, xgboost will stay silent. If 1, it will print information about performance.
#' If 2, some additional information will be printed out.
#' Note that setting \code{verbose > 0} automatically engages the
#' \code{cb.print.evaluation(period=1)} callback function.
#' @param print_every_n Print each n-th iteration evaluation messages when \code{verbose>0}.
#' Default is 1 which means all messages are printed. This parameter is passed to the
#' \code{\link{cb.print.evaluation}} callback.
#' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance
#' doesn't improve for \code{k} rounds.
#' Setting this parameter engages the \code{\link{cb.early.stop}} callback.
#' @param maximize If \code{feval} and \code{early_stopping_rounds} are set,
@ -90,35 +90,35 @@
#' 0 means save at the end. The saving is handled by the \code{\link{cb.save.model}} callback.
#' @param save_name the name or path for periodically saved model file.
#' @param xgb_model a previously built model to continue the training from.
#' Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a
#' file with a previously saved model.
#' @param callbacks a list of callback functions to perform various task during boosting.
#' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
#' parameters' values. User can provide either existing or their own callback methods in order
#' to customize the training process.
#' @param ... other parameters to pass to \code{params}.
#' @param label vector of response values. Should not be provided when data is
#' @param label vector of response values. Should not be provided when data is
#' @param missing by default is set to NA, which means that NA values should be considered as 'missing'
#' by the algorithm. Sometimes, 0 or other extreme value might be used to represent missing values.
#' This parameter is only used when input is a dense matrix.
#' @param weight a vector indicating the weight for each row of the input.
#'
#' @details
#' These are the training functions for \code{xgboost}.
#'
#' The \code{xgb.train} interface supports advanced features such as \code{watchlist},
#' customized objective and evaluation metric functions, therefore it is more flexible
#' than the \code{xgboost} interface.
#'
#' Parallelization is automatically enabled if \code{OpenMP} is present.
#' Number of threads can also be manually specified via \code{nthread} parameter.
#'
#' The evaluation metric is chosen automatically by Xgboost (according to the objective)
#' when the \code{eval_metric} parameter is not provided.
#' User may set one or several \code{eval_metric} parameters.
#' Note that when using a customized metric, only this single metric can be used.
#' The following is the list of built-in metrics for which Xgboost provides optimized implementation:
#' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
@ -131,7 +131,7 @@
#' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
#' }
#'
#' The following callbacks are automatically created when certain parameters are set:
#' \itemize{
#' \item \code{cb.print.evaluation} is turned on when \code{verbose > 0};
@ -140,38 +140,38 @@
#' \item \code{cb.early.stop}: when \code{early_stopping_rounds} is set.
#' \item \code{cb.save.model}: when \code{save_period > 0} is set.
#' }
#'
#' @return
#' An object of class \code{xgb.Booster} with the following elements:
#' \itemize{
#' \item \code{handle} a handle (pointer) to the xgboost model in memory.
#' \item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type.
#' \item \code{niter} number of boosting iterations.
#' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to evaluation
#' metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback.
#' \item \code{call} a function call.
#' \item \code{params} parameters that were passed to the xgboost library. Note that it does not
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitly passed.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{best_score} the best evaluation metric value during early stopping.
#' (only available with early stopping).
#' \item \code{feature_names} names of the training dataset features
#' (only when column names were defined in training data).
#' \item \code{nfeatures} number of features in training data.
#' }
#'
#' @seealso
#' \code{\link{callbacks}},
#' \code{\link{predict.xgb.Booster}},
#' \code{\link{xgb.cv}}
#'
#' @references
#'
#' Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System",
@ -180,17 +180,17 @@
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#'
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
#' watchlist <- list(train = dtrain, eval = dtest)
#'
#' ## A simple xgb.train example:
#' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = "binary:logistic", eval_metric = "auc")
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#'
#'
#' ## An xgb.train example where custom objective and evaluation metric are used:
#' logregobj <- function(preds, dtrain) {
#' labels <- getinfo(dtrain, "label")
@ -204,58 +204,58 @@
#' err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
#' return(list(metric = "error", value = err))
#' }
#'
#' # These functions could be used by passing them either:
#' # as 'objective' and 'eval_metric' parameters in the params list:
#' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = logregobj, eval_metric = evalerror)
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#'
#' # or through the ... arguments:
#' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' objective = logregobj, eval_metric = evalerror)
#'
#'
#' # or as dedicated 'obj' and 'feval' parameters of xgb.train:
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' obj = logregobj, feval = evalerror)
#'
#'
#'
#'
#' ## An xgb.train example of using variable learning rates at each iteration:
#' param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2,
#' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = "binary:logistic", eval_metric = "auc")
#' my_etas <- list(eta = c(0.5, 0.1))
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' callbacks = list(cb.reset.parameters(my_etas)))
#'
#'
#' ## Early stopping:
#' bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
#' early_stopping_rounds = 3)
#'
#'
#' ## An 'xgboost' interface example:
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
#' max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
#' max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
#' objective = "binary:logistic")
#' pred <- predict(bst, agaricus.test$data)
#'
#'
#' @rdname xgb.train
#' @export
xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
obj = NULL, feval = NULL, verbose = 1, print_every_n = 1L,
early_stopping_rounds = NULL, maximize = NULL,
save_period = NULL, save_name = "xgboost.model",
save_period = NULL, save_name = "xgboost.model",
xgb_model = NULL, callbacks = list(), ...) {
check.deprecation(...)
params <- check.booster.params(params, ...)
check.custom.obj()
check.custom.eval()
# data & watchlist checks
dtrain <- data
if (!inherits(dtrain, "xgb.DMatrix"))
if (!inherits(dtrain, "xgb.DMatrix"))
stop("second argument dtrain must be xgb.DMatrix")
if (length(watchlist) > 0) {
if (typeof(watchlist) != "list" ||
@ -288,7 +288,7 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
stop_condition <- FALSE
if (!is.null(early_stopping_rounds) &&
!has.callbacks(callbacks, 'cb.early.stop')) {
callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
maximize = maximize, verbose = verbose))
}
# Sort the callbacks into categories
@ -318,22 +318,22 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
# TODO: distributed code
rank <- 0
niter_skip <- ifelse(is_update, 0, niter_init)
begin_iteration <- niter_skip + 1
end_iteration <- niter_skip + nrounds
# the main loop for boosting iterations
for (iteration in begin_iteration:end_iteration) {
for (f in cb$pre_iter) f()
xgb.iter.update(bst$handle, dtrain, iteration - 1, obj)
bst_evaluation <- numeric(0)
if (length(watchlist) > 0)
bst_evaluation <- xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval)
xgb.attr(bst$handle, 'niter') <- iteration - 1
for (f in cb$post_iter) f()
@ -341,9 +341,9 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
if (stop_condition) break
}
for (f in cb$finalize) f(finalize = TRUE)
bst <- xgb.Booster.complete(bst, saveraw = TRUE)
# store the total number of boosting iterations
bst$niter = end_iteration
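
(Editor's note: the following short sketch is not part of the package sources; it only illustrates how the fields listed in the return-value section above are typically consumed after training with early stopping.)

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

bst <- xgb.train(list(max_depth = 2, eta = 0.3, nthread = 2,
                      objective = "binary:logistic", eval_metric = "auc"),
                 dtrain, nrounds = 50,
                 watchlist = list(train = dtrain, eval = dtest),
                 early_stopping_rounds = 3, verbose = 0)

bst$niter                 # total number of boosting iterations
head(bst$evaluation_log)  # data.table built by the cb.evaluation.log callback
bst$best_iteration        # present because early stopping was engaged
bst$best_score
# predict with the number of trees that gave the best validation score
pred <- predict(bst, dtest, ntreelimit = bst$best_ntreelimit)
```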
@ -5,24 +5,24 @@
\title{Callback closures for booster training.}
\description{
These are used to perform various service tasks either during boosting iterations or at the end.
This approach helps to modularize many of such tasks without bloating the main training methods,
This approach helps to modularize many of such tasks without bloating the main training methods,
and it offers .
}
\details{
By default, a callback function is run after each boosting iteration.
An R-attribute \code{is_pre_iteration} could be set for a callback to define a pre-iteration function.
When a callback function has \code{finalize} parameter, its finalizer part will also be run after
When a callback function has \code{finalize} parameter, its finalizer part will also be run after
the boosting is completed.
WARNING: side-effects!!! Be aware that these callback functions access and modify things in
WARNING: side-effects!!! Be aware that these callback functions access and modify things in
the environment from which they are called from, which is a fairly uncommon thing to do in R.
To write a custom callback closure, make sure you first understand the main concepts about R envoronments.
Check either R documentation on \code{\link[base]{environment}} or the
\href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R"
To write a custom callback closure, make sure you first understand the main concepts about R environments.
Check either R documentation on \code{\link[base]{environment}} or the
\href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R"
book by Hadley Wickham. Further, the best option is to read the code of some of the existing callbacks -
choose ones that do something similar to what you want to achieve. Also, you would need to get familiar
choose ones that do something similar to what you want to achieve. Also, you would need to get familiar
with the objects available inside of the \code{xgb.train} and \code{xgb.cv} internal environments.
}
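
(Editor's sketch, not from the package docs: a minimal custom callback written in the closure style described above. The attribute handling mirrors the pattern of the built-in callbacks; `param`, `dtrain` and `watchlist` are assumed to be defined as in the package examples.)

```r
# minimal post-iteration callback: report the current iteration and its
# evaluation results, which xgb.train keeps in its own calling frame
cb.log.iteration <- function() {
  callback <- function(env = parent.frame()) {
    cat(sprintf("iter %d: %s\n", env$iteration,
                paste(names(env$bst_evaluation),
                      round(env$bst_evaluation, 4),
                      sep = "=", collapse = "  ")))
  }
  attr(callback, 'call') <- match.call()
  attr(callback, 'name') <- 'cb.log.iteration'
  callback
}

bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
                 callbacks = list(cb.log.iteration()))
```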
\seealso{
@ -11,11 +11,11 @@ cb.cv.predict(save_models = FALSE)
}
\value{
Predictions are returned inside of the \code{pred} element, which is either a vector or a matrix,
depending on the number of prediction outputs per data row. The order of predictions corresponds
to the order of rows in the original dataset. Note that when a custom \code{folds} list is
provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a
non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be
meaningful when user-profided folds have overlapping indices as in, e.g., random sampling splits.
depending on the number of prediction outputs per data row. The order of predictions corresponds
to the order of rows in the original dataset. Note that when a custom \code{folds} list is
provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a
non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be
meaningful when user-provided folds have overlapping indices as in, e.g., random sampling splits.
When some of the indices in the training dataset are not included into user-provided \code{folds},
their prediction value would be \code{NA}.
}
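
(Editor's sketch illustrating the value described above; not part of the original .Rd file.) Requesting out-of-fold predictions from \code{xgb.cv} engages this callback, and \code{pred} then lines up with the rows of the training data:

```r
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

cv <- xgb.cv(params = list(objective = "binary:logistic", max_depth = 2,
                           eta = 1, nthread = 2),
             data = dtrain, nrounds = 5, nfold = 4,
             prediction = TRUE, verbose = 0)

length(cv$pred) == nrow(agaricus.train$data)  # one out-of-fold prediction per row
head(cv$pred)
```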
@ -8,15 +8,15 @@ cb.early.stop(stopping_rounds, maximize = FALSE, metric_name = NULL,
verbose = TRUE)
}
\arguments{
\item{stopping_rounds}{The number of rounds with no improvement in
\item{stopping_rounds}{The number of rounds with no improvement in
the evaluation metric in order to stop the training.}
\item{maximize}{whether to maximize the evaluation metric}
\item{metric_name}{the name of an evaluation column to use as a criteria for early
stopping. If not set, the last column would be used.
Let's say the test data in \code{watchlist} was labelled as \code{dtest},
and one wants to use the AUC in test data for early stopping regardless of where
Let's say the test data in \code{watchlist} was labelled as \code{dtest},
and one wants to use the AUC in test data for early stopping regardless of where
it is in the \code{watchlist}, then one of the following would need to be set:
\code{metric_name='dtest-auc'} or \code{metric_name='dtest_auc'}.
All dash '-' characters in metric names are considered equivalent to '_'.}
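
(Editor's sketch, assuming \code{dtrain} and \code{dtest} as in the package examples.) With the test DMatrix labelled \code{dtest} in the watchlist, early stopping on its AUC regardless of its position could look like:

```r
watchlist <- list(train = dtrain, dtest = dtest)
param <- list(max_depth = 2, eta = 1, nthread = 2,
              objective = "binary:logistic",
              eval_metric = "error", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 50, watchlist, verbose = 0,
                 callbacks = list(cb.early.stop(stopping_rounds = 3,
                                                maximize = TRUE,
                                                metric_name = "dtest-auc")))
bst$best_iteration
```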
@ -27,7 +27,7 @@ All dash '-' characters in metric names are considered equivalent to '_'.}
Callback closure to activate the early stopping.
}
\details{
This callback function determines the condition for early stopping
This callback function determines the condition for early stopping
by setting the \code{stop_condition = TRUE} flag in its calling frame.
The following additional fields are assigned to the model's R object:
@ -13,12 +13,12 @@ Callback closure for logging the evaluation history
This callback function appends the current iteration evaluation results \code{bst_evaluation}
available in the calling parent frame to the \code{evaluation_log} list in a calling frame.
The finalizer callback (called with \code{finalize = TRUE} in the end) converts
The finalizer callback (called with \code{finalize = TRUE} in the end) converts
the \code{evaluation_log} list into a final data.table.
The iteration evaluation result \code{bst_evaluation} must be a named numeric vector.
The iteration evaluation result \code{bst_evaluation} must be a named numeric vector.
Note: in the column names of the final data.table, the dash '-' character is replaced with
Note: in the column names of the final data.table, the dash '-' character is replaced with
the underscore '_' in order to make the column names more like regular R identifiers.
Callback function expects the following values to be set in its calling frame:
@ -2,27 +2,27 @@
% Please edit documentation in R/callbacks.R
\name{cb.reset.parameters}
\alias{cb.reset.parameters}
\title{Callback closure for restetting the booster's parameters at each iteration.}
\title{Callback closure for resetting the booster's parameters at each iteration.}
\usage{
cb.reset.parameters(new_params)
}
\arguments{
\item{new_params}{a list where each element corresponds to a parameter that needs to be reset.
Each element's value must be either a vector of values of length \code{nrounds}
to be set at each iteration,
or a function of two parameters \code{learning_rates(iteration, nrounds)}
which returns a new parameter value by using the current iteration number
Each element's value must be either a vector of values of length \code{nrounds}
to be set at each iteration,
or a function of two parameters \code{learning_rates(iteration, nrounds)}
which returns a new parameter value by using the current iteration number
and the total number of boosting rounds.}
}
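
(Editor's sketch of the two accepted forms for \code{new_params}; \code{param}, \code{dtrain} and \code{watchlist} are assumed to be defined as in the package examples.)

```r
# a vector of length nrounds: one eta value per boosting iteration
bst <- xgb.train(param, dtrain, nrounds = 5, watchlist,
                 callbacks = list(cb.reset.parameters(
                   list(eta = c(0.5, 0.3, 0.2, 0.1, 0.05)))))

# a function of (iteration, nrounds), e.g. a simple decay schedule
decayed_eta <- function(iteration, nrounds) 0.5 / iteration
bst <- xgb.train(param, dtrain, nrounds = 5, watchlist,
                 callbacks = list(cb.reset.parameters(list(eta = decayed_eta))))
```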
\description{
Callback closure for restetting the booster's parameters at each iteration.
Callback closure for resetting the booster's parameters at each iteration.
}
\details{
This is a "pre-iteration" callback function used to reset booster's parameters
at the beginning of each iteration.
Note that when training is resumed from some previous model, and a function is used to
reset a parameter value, the \code{nrounds} argument in this function would be the
Note that when training is resumed from some previous model, and a function is used to
reset a parameter value, the \code{nrounds} argument in this function would be the
number of boosting rounds in the current training.
Callback function expects the following values to be set in its calling frame:
@ -7,13 +7,13 @@
cb.save.model(save_period = 0, save_name = "xgboost.model")
}
\arguments{
\item{save_period}{save the model to disk after every
\item{save_period}{save the model to disk after every
\code{save_period} iterations; 0 means save the model at the end.}
\item{save_name}{the name or path for the saved model file.
It can contain a \code{\link[base]{sprintf}} formatting specifier
It can contain a \code{\link[base]{sprintf}} formatting specifier
to include the integer iteration number in the file name.
E.g., with \code{save_name} = 'xgboost_%04d.model',
E.g., with \code{save_name} = 'xgboost_%04d.model',
the file saved at iteration 50 would be named "xgboost_0050.model".}
}
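
(Editor's sketch; assumes \code{param}, \code{dtrain} and \code{watchlist} as in the package examples.) A periodic snapshot with the iteration number substituted into the file name:

```r
bst <- xgb.train(param, dtrain, nrounds = 30, watchlist,
                 callbacks = list(cb.save.model(save_period = 10,
                                                save_name = "xgboost_%04d.model")))
# the same callback is created automatically when xgb.train is given
# the save_period / save_name arguments directly
bst <- xgb.train(param, dtrain, nrounds = 30, watchlist,
                 save_period = 10, save_name = "xgboost_%04d.model")
```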
\description{
@ -13,7 +13,7 @@
Returns a vector of numbers of rows and of columns in an \code{xgb.DMatrix}.
}
\details{
Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also
Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also
be directly used with an \code{xgb.DMatrix} object.
}
\examples{
@ -16,8 +16,8 @@
and the second one is column names}
}
\description{
Only column names are supported for \code{xgb.DMatrix}, thus setting of
row names would have no effect and returnten row names would be NULL.
Only column names are supported for \code{xgb.DMatrix}, thus setting of
row names would have no effect and returned row names would be NULL.
}
\details{
Generic \code{dimnames} methods are used by \code{colnames}.
@ -27,7 +27,7 @@ The \code{name} field can be one of the following:
\item \code{weight}: to do a weight rescale;
\item \code{base_margin}: base margin is the base prediction Xgboost will boost from;
\item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
}
\code{group} can be set up by \code{setinfo} but can't be retrieved by \code{getinfo}.
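
(Editor's sketch illustrating the fields listed above; the group sizes are an arbitrary split of the example data.)

```r
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

setinfo(dtrain, "weight", rep(1, nrow(dtrain)))  # per-row weights
head(getinfo(dtrain, "weight"))
getinfo(dtrain, "nrow")                          # same as nrow(dtrain)

# 'group' (e.g. for ranking objectives) can be set but not read back
setinfo(dtrain, "group", c(3000, 3513))
```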
@ -91,7 +91,7 @@ in \url{http://blog.datadive.net/interpreting-random-forests/}.
With \code{predinteraction = TRUE}, SHAP values of contributions of interaction of each pair of features
are computed. Note that this operation might be rather expensive in terms of compute and memory.
Since it quadratically depends on the number of features, it is recommended to perfom selection
Since it quadratically depends on the number of features, it is recommended to perform selection
of the most important features first. See below about the format of the returned results.
}
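
(Editor's sketch, assuming a fitted booster \code{bst} such as the one in the examples below.) Because the result grows quadratically with the number of features, it is usually computed on a small slice of the data:

```r
shap_int <- predict(bst, agaricus.test$data[1:50, ], predinteraction = TRUE)
dim(shap_int)  # one matrix of pairwise contributions per row (plus a bias term)
```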
\examples{
@ -14,7 +14,7 @@
\item{...}{not currently used}
}
\description{
Print information about xgb.DMatrix.
Print information about xgb.DMatrix.
Currently it displays dimensions and presence of info-fields and colnames.
}
\examples{
@ -17,7 +17,7 @@
Prints formatted results of \code{xgb.cv}.
}
\details{
When not verbose, it would only print the evaluation results,
When not verbose, it would only print the evaluation results,
including the best iteration (when available).
}
\examples{
@ -5,7 +5,7 @@
\alias{slice.xgb.DMatrix}
\alias{[.xgb.DMatrix}
\title{Get a new DMatrix containing the specified rows of
orginal xgb.DMatrix object}
original xgb.DMatrix object}
\usage{
slice(object, ...)
@ -24,7 +24,7 @@ slice(object, ...)
}
\description{
Get a new DMatrix containing the specified rows of
orginal xgb.DMatrix object
original xgb.DMatrix object
}
\examples{
data(agaricus.train, package='xgboost')
@ -28,7 +28,7 @@ E.g., when an \code{xgb.Booster} model is saved as an R object and then is loade
its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods
should still work for such a model object since those methods would be using
\code{xgb.Booster.complete} internally. However, one might find it to be more efficient to call the
\code{xgb.Booster.complete} function explicitely once after loading a model as an R-object.
\code{xgb.Booster.complete} function explicitly once after loading a model as an R-object.
That would prevent further repeated implicit reconstruction of an internal booster model.
}
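
(Editor's sketch; \code{bst} and \code{agaricus.test} are assumed as in the package examples.) Restoring the handle once after loading avoids paying the reconstruction cost on every later call:

```r
saveRDS(bst, "bst.rds")             # persist the model as an R object
bst2 <- readRDS("bst.rds")          # its internal handle is now invalid
bst2 <- xgb.Booster.complete(bst2)  # rebuild the handle once, explicitly
pred <- predict(bst2, agaricus.test$data)
```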
\examples{
@ -7,7 +7,7 @@
xgb.DMatrix(data, info = list(), missing = NA, silent = FALSE, ...)
}
\arguments{
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
string representing a filename.}
\item{info}{a named list of additional information to store in the \code{xgb.DMatrix} object.
@ -16,7 +16,7 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
\itemize{
\item \code{objective} objective function, common ones are
\itemize{
\item \code{reg:squarederror} Regression with squared loss.
\item \code{reg:squarederror} Regression with squared loss
\item \code{binary:logistic} logistic regression for classification
}
\item \code{eta} step size of each boosting step
@ -35,11 +35,11 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
\item{label}{vector of response values. Should be provided only when data is an R-matrix.}
\item{missing}{is only used when input is a dense matrix. By default is set to NA, which means
that NA values should be considered as 'missing' by the algorithm.
\item{missing}{is only used when input is a dense matrix. By default is set to NA, which means
that NA values should be considered as 'missing' by the algorithm.
Sometimes, 0 or other extreme value might be used to represent missing values.}
\item{prediction}{A logical value indicating whether to return the test fold predictions
\item{prediction}{A logical value indicating whether to return the test fold predictions
from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callback.}
\item{showsd}{\code{boolean}, whether to show standard deviation of cross validation}
@ -56,28 +56,28 @@ from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callb
\item \code{merror} Exact matching error, used to evaluate multi-class classification
}}
\item{obj}{customized objective function. Returns gradient and second order
\item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain.}
\item{feval}{custimized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given
\item{feval}{customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.}
\item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified
\item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified
by the values of outcome labels.}
\item{folds}{\code{list} provides a possibility to use a list of pre-defined CV folds
(each element must be a vector of test fold's indices). When folds are supplied,
(each element must be a vector of test fold's indices). When folds are supplied,
the \code{nfold} and \code{stratified} parameters are ignored.}
\item{verbose}{\code{boolean}, print the statistics during the process}
\item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}.
Default is 1 which means all messages are printed. This parameter is passed to the
Default is 1 which means all messages are printed. This parameter is passed to the
\code{\link{cb.print.evaluation}} callback.}
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link{cb.early.stop}} callback.}
@ -87,8 +87,8 @@ When it is \code{TRUE}, it means the larger the evaluation score the better.
This parameter is passed to the \code{\link{cb.early.stop}} callback.}
\item{callbacks}{a list of callback functions to perform various task during boosting.
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. User can provide either existing or their own callback methods in order
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. User can provide either existing or their own callback methods in order
to customize the training process.}
\item{...}{other parameters to pass to \code{params}.}
@ -97,26 +97,26 @@ to customize the training process.}
An object of class \code{xgb.cv.synchronous} with the following elements:
\itemize{
\item \code{call} a function call.
\item \code{params} parameters that were passed to the xgboost library. Note that it does not
\item \code{params} parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
\item \code{callbacks} callback functions that were either automatically assigned or
\item \code{callbacks} callback functions that were either automatically assigned or
explicitly passed.
\item \code{evaluation_log} evaluation history storead as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to the
\item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the \code{\link{cb.evaluation.log}} callback.
\item \code{niter} number of boosting iterations.
\item \code{nfeatures} number of features in training data.
\item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
\item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
parameter or randomly generated.
\item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method
(only available with early stopping).
\item \code{pred} CV prediction values available when \code{prediction} is set.
\item \code{pred} CV prediction values available when \code{prediction} is set.
It is either vector or matrix (see \code{\link{cb.cv.predict}}).
\item \code{models} a list of the CV folds' models. It is only available with the explicit
\item \code{models} a list of the CV folds' models. It is only available with the explicit
setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
}
}
@ -124,9 +124,9 @@ An object of class \code{xgb.cv.synchronous} with the following elements:
The cross validation function of xgboost
}
\details{
The original sample is randomly partitioned into \code{nfold} equal size subsamples.
The original sample is randomly partitioned into \code{nfold} equal size subsamples.
Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
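
(Editor's sketch of a plain k-fold run and of supplying pre-defined folds; \code{dtrain} is assumed as above.)

```r
cv <- xgb.cv(params = list(objective = "binary:logistic", max_depth = 3,
                           eta = 0.3, nthread = 2),
             data = dtrain, nrounds = 10, nfold = 5,
             metrics = "error", verbose = 0)
print(cv)

# pre-defined, non-overlapping folds: 'nfold' and 'stratified' are then ignored
idx <- sample(rep(1:5, length.out = nrow(dtrain)))
my_folds <- split(seq_len(nrow(dtrain)), idx)
cv2 <- xgb.cv(params = list(objective = "binary:logistic", nthread = 2),
              data = dtrain, nrounds = 10, folds = my_folds, verbose = 0)
```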
@ -12,7 +12,7 @@ using the \code{cb.gblinear.history()} callback.}
\item{class_index}{zero-based class index to extract the coefficients for only that
specific class in a multinomial multiclass model. When it is NULL, all the
coeffients are returned. Has no effect in non-multiclass models.}
coefficients are returned. Has no effect in non-multiclass models.}
}
\value{
For an \code{xgb.train} result, a matrix (either dense or sparse) with the columns
@ -17,13 +17,13 @@ xgb.plot.importance(importance_matrix = NULL, top_n = NULL,
\item{top_n}{maximal number of top features to include into the plot.}
\item{measure}{the name of importance measure to plot.
\item{measure}{the name of importance measure to plot.
When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.}
\item{rel_to_first}{whether importance values should be represented as relative to the highest ranked feature.
See Details.}
\item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range
\item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range
of the possible number of clusters of bars.}
\item{...}{other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).}
@ -33,7 +33,7 @@ When it is NULL, the existing \code{par('mar')} is used.}
\item{cex}{(base R barplot) passed as \code{cex.names} parameter to \code{barplot}.}
\item{plot}{(base R barplot) whether a barplot should be produced.
\item{plot}{(base R barplot) whether a barplot should be produced.
If FALSE, only a data.table is returned.}
}
\value{
@ -53,14 +53,14 @@ Features are shown ranked in a decreasing importance order.
It works for importances from both \code{gblinear} and \code{gbtree} models.
When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}.
For gbtree model, that would mean being normalized to the total of 1
For gbtree model, that would mean being normalized to the total of 1
("what is feature's importance contribution relative to the whole model?").
For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients.
Setting \code{rel_to_first = TRUE} allows to see the picture from the perspective of
Setting \code{rel_to_first = TRUE} allows to see the picture from the perspective of
"what is feature's importance contribution relative to the most important feature?"
The ggplot-backend method also performs 1-D custering of the importance values,
with bar colors coresponding to different clusters that have somewhat similar importance values.
The ggplot-backend method also performs 1-D clustering of the importance values,
with bar colors corresponding to different clusters that have somewhat similar importance values.
}
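
(Editor's sketch showing the effect of \code{rel_to_first}; assumes a fitted gbtree booster \code{bst}.)

```r
imp <- xgb.importance(model = bst)
# absolute Gain values (for gbtree these sum to 1 over all features)
xgb.plot.importance(imp, top_n = 10, rel_to_first = FALSE)
# the same features, rescaled relative to the top-ranked one
xgb.plot.importance(imp, top_n = 10, rel_to_first = TRUE,
                    xlab = "Importance relative to the top feature")
```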
\examples{
data(agaricus.train)
@ -15,7 +15,7 @@ xgb.plot.shap(data, shap_contrib = NULL, features = NULL, top_n = 1,
\arguments{
\item{data}{data as a \code{matrix} or \code{dgCMatrix}.}
\item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above
\item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above
\code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.}
\item{features}{a vector of either column indices or of feature names to plot. When it is NULL,
@ -63,7 +63,7 @@ more than 5 distinct values.}
\item{col_loess}{a color to use for the loess curves.}
\item{span_loess}{the \code{span} paramerer in \code{\link[stats]{loess}}'s call.}
\item{span_loess}{the \code{span} parameter in \code{\link[stats]{loess}}'s call.}
\item{which}{whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.}
@ -104,7 +104,7 @@ a meaningful thing to do.
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
eta = 0.1, max_depth = 3, subsample = .5,
method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0)
@ -18,7 +18,7 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
...)
}
\arguments{
\item{params}{the list of parameters.
\item{params}{the list of parameters.
The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
Below is a shorter summary:
@ -27,31 +27,32 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
\itemize{
\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
}
2. Booster Parameters
2.1. Parameter for Tree Booster
\itemize{
\item \code{eta} controls the learning rate: it scales the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} value makes the model more robust to overfitting but slower to compute. Default: 0.3
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
\item \code{max_depth} maximum depth of a tree. Default: 6
\item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
\item \code{monotone_constraints} A numerical vector consists of \code{1}, \code{0} and \code{-1} with its length equals to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
\item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
}
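
(Editor's sketch of the \code{monotone_constraints} and \code{interaction_constraints} parameters described just above; \code{train_matrix} and \code{train_label} are placeholder names for a numeric feature matrix and its response.)

```r
# first feature constrained to an increasing effect, second to a decreasing one
mono <- c(1, -1, rep(0, ncol(train_matrix) - 2))
# features 0-2 may interact with each other, and features 3-4 only with each other
inter <- list(c(0, 1, 2), c(3, 4))

bst <- xgboost(data = train_matrix, label = train_label,
               nrounds = 10, max_depth = 3, eta = 0.3,
               objective = "reg:squarederror",
               monotone_constraints = mono,
               interaction_constraints = inter)
```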
2.2. Parameter for Linear Booster
\itemize{
\item \code{lambda} L2 regularization term on weights. Default: 0
\item \code{lambda_bias} L2 regularization term on bias. Default: 0
\item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
}
3. Task Parameters
3. Task Parameters
\itemize{
\item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
@ -76,31 +77,31 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
\item{watchlist}{named list of xgb.DMatrix datasets to use for evaluating model performance.
Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
of these datasets during each boosting iteration, and stored in the end as a field named
\code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
of these datasets during each boosting iteration, and stored in the end as a field named
\code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
\code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
printed out during the training.
printed out during the training.
E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows to track
the performance of each round's model on mat1 and mat2.}
\item{obj}{customized objective function. Returns gradient and second order
\item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain.}
\item{feval}{custimized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given
\item{feval}{customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.}
\item{verbose}{If 0, xgboost will stay silent. If 1, it will print information about performance.
If 2, some additional information will be printed out.
Note that setting \code{verbose > 0} automatically engages the
Note that setting \code{verbose > 0} automatically engages the
\code{cb.print.evaluation(period=1)} callback function.}
\item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}.
Default is 1 which means all messages are printed. This parameter is passed to the
Default is 1 which means all messages are printed. This parameter is passed to the
\code{\link{cb.print.evaluation}} callback.}
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link{cb.early.stop}} callback.}
@ -115,17 +116,17 @@ This parameter is passed to the \code{\link{cb.early.stop}} callback.}
\item{save_name}{the name or path for periodically saved model file.}
\item{xgb_model}{a previously built model to continue the training from.
Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a
Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a
file with a previously saved model.}
\item{callbacks}{a list of callback functions to perform various task during boosting.
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. User can provide either existing or their own callback methods in order
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. User can provide either existing or their own callback methods in order
to customize the training process.}
\item{...}{other parameters to pass to \code{params}.}
\item{label}{vector of response values. Should not be provided when data is
\item{label}{vector of response values. Should not be provided when data is
a local data file name or an \code{xgb.DMatrix}.}
\item{missing}{by default is set to NA, which means that NA values should be considered as 'missing'
@ -140,23 +141,23 @@ An object of class \code{xgb.Booster} with the following elements:
\item \code{handle} a handle (pointer) to the xgboost model in memory.
\item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type.
\item \code{niter} number of boosting iterations.
\item \code{evaluation_log} evaluation history storead as a \code{data.table} with the
\item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to evaluation
metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback.
\item \code{call} a function call.
\item \code{params} parameters that were passed to the xgboost library. Note that it does not
\item \code{params} parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
\item \code{callbacks} callback functions that were either automatically assigned or
explicitely passed.
\item \code{callbacks} callback functions that were either automatically assigned or
explicitly passed.
\item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method
(only available with early stopping).
\item \code{best_score} the best evaluation metric value during early stopping.
(only available with early stopping).
\item \code{feature_names} names of the training dataset features
(only when comun names were defined in training data).
(only when column names were defined in training data).
\item \code{nfeatures} number of features in training data.
}
}
@ -165,20 +166,20 @@ An object of class \code{xgb.Booster} with the following elements:
The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
}
\details{
These are the training functions for \code{xgboost}.
These are the training functions for \code{xgboost}.
The \code{xgb.train} interface supports advanced features such as \code{watchlist},
customized objective and evaluation metric functions, therefore it is more flexible
The \code{xgb.train} interface supports advanced features such as \code{watchlist},
customized objective and evaluation metric functions, therefore it is more flexible
than the \code{xgboost} interface.
Parallelization is automatically enabled if \code{OpenMP} is present.
Parallelization is automatically enabled if \code{OpenMP} is present.
Number of threads can also be manually specified via \code{nthread} parameter.
The evaluation metric is chosen automatically by Xgboost (according to the objective)
when the \code{eval_metric} parameter is not provided.
User may set one or several \code{eval_metric} parameters.
User may set one or several \code{eval_metric} parameters.
Note that when using a customized metric, only this single metric can be used.
The folloiwing is the list of built-in metrics for which Xgboost provides optimized implementation:
The following is the list of built-in metrics for which Xgboost provides optimized implementation:
\itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
@ -210,7 +211,7 @@ dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest)
## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2,
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
@ -231,12 +232,12 @@ evalerror <- function(preds, dtrain) {
# These functions could be used by passing them either:
# as 'objective' and 'eval_metric' parameters in the params list:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2,
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = logregobj, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
# or through the ... arguments:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2)
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
objective = logregobj, eval_metric = evalerror)
@ -246,7 +247,7 @@ bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
## An xgb.train example of using variable learning rates at each iteration:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2,
param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc")
my_etas <- list(eta = c(0.5, 0.1))
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
@ -257,8 +258,8 @@ bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
early_stopping_rounds = 3)
## An 'xgboost' interface example:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
objective = "binary:logistic")
pred <- predict(bst, agaricus.test$data)
@ -10,7 +10,7 @@ The deprecated parameters would be removed in the next release.
\details{
To see all the current deprecated and new parameters, check the \code{xgboost:::depr_par_lut} table.
A deprecation warning is shown when any of the deprecated parameters is used in a call.
An additional warning is shown when there was a partial match to a deprecated parameter
A deprecation warning is shown when any of the deprecated parameters is used in a call.
An additional warning is shown when there was a partial match to a deprecated parameter
(as R is able to partially match parameter names).
}
@ -138,7 +138,7 @@ levels(df[,Treatment])
Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it producess "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
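
(Editor's sketch; `df` and its outcome column `Improved` are assumed from this vignette's earlier chunks.) The whole encoding can be done in one call with `Matrix::sparse.model.matrix`:

```r
library(Matrix)
# '-1' removes the intercept column; categorical columns are expanded
# into 0/1 indicator columns suitable for xgboost
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
head(sparse_matrix)
```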
@ -268,7 +268,7 @@ c2 <- chisq.test(df$Age, output_vector)
print(c2)
```
Pearson correlation between Age and illness disapearing is **`r round(c2$statistic, 2 )`**.
Pearson correlation between Age and illness disappearing is **`r round(c2$statistic, 2 )`**.
```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeDiscret, output_vector)
@ -313,7 +313,7 @@ Until now, all the learnings we have performed were based on boosting trees. **X
bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
```
In this specific case, *linear boosting* gets sligtly better performance metrics than decision trees based algorithm.
In this specific case, *linear boosting* gets slightly better performance metrics than decision trees based algorithm.
In simple cases this happens because there is nothing better than a linear algorithm for catching a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of which one to use.