fixed typos in R package docs (#4345)

* fixed typos in R package docs

* updated verbosity parameter in xgb.train docs
James Lamb 2019-04-21 02:54:11 -05:00 committed by Jiaming Yuan
parent 65db8d0626
commit 5e97de6a41
30 changed files with 414 additions and 413 deletions

View File

@ -1,26 +1,26 @@
#' Callback closures for booster training.
#'
#' These are used to perform various service tasks either during boosting iterations or at the end.
#' This approach helps to modularize many of such tasks without bloating the main training methods,
#' and it offers .
#'
#' @details
#' By default, a callback function is run after each boosting iteration.
#' An R-attribute \code{is_pre_iteration} could be set for a callback to define a pre-iteration function.
#'
#' When a callback function has \code{finalize} parameter, its finalizer part will also be run after
#' the boosting is completed.
#'
#' WARNING: side-effects!!! Be aware that these callback functions access and modify things in
#' the environment from which they are called from, which is a fairly uncommon thing to do in R.
#'
-#' To write a custom callback closure, make sure you first understand the main concepts about R envoronments.
+#' To write a custom callback closure, make sure you first understand the main concepts about R environments.
#' Check either R documentation on \code{\link[base]{environment}} or the
#' \href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R"
#' book by Hadley Wickham. Further, the best option is to read the code of some of the existing callbacks -
#' choose ones that do something similar to what you want to achieve. Also, you would need to get familiar
#' with the objects available inside of the \code{xgb.train} and \code{xgb.cv} internal environments.
#'
#' @seealso
#' \code{\link{cb.print.evaluation}},
#' \code{\link{cb.evaluation.log}},
@ -30,42 +30,42 @@
#' \code{\link{cb.cv.predict}},
#' \code{\link{xgb.train}},
#' \code{\link{xgb.cv}}
#'
#' @name callbacks
NULL
#
# Callbacks -------------------------------------------------------------------
#
#' Callback closure for printing the result of evaluation
#'
#' @param period results would be printed every number of periods
#' @param showsd whether standard deviations should be printed (when available)
#'
#' @details
#' The callback function prints the result of evaluation at every \code{period} iterations.
#' The initial and the last iteration's evaluations are always printed.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst_evaluation} (also \code{bst_evaluation_err} when available),
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.print.evaluation <- function(period = 1, showsd = TRUE) {
  callback <- function(env = parent.frame()) {
    if (length(env$bst_evaluation) == 0 ||
        period == 0 ||
        NVL(env$rank, 0) != 0 )
      return()
    i <- env$iteration
    if ((i-1) %% period == 0 ||
        i == env$begin_iteration ||
        i == env$end_iteration) {
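A minimal usage sketch for this callback (an illustration added for context, not part of this commit's diff; data and parameter values are arbitrary). Passing the callback explicitly, as below, matches what xgb.train's print_every_n argument sets up internally:

data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
param <- list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2)
# print the evaluation result only at every 5th iteration
bst <- xgb.train(param, dtrain, nrounds = 20, watchlist = list(train = dtrain),
                 callbacks = list(cb.print.evaluation(period = 5)))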
@ -81,48 +81,48 @@ cb.print.evaluation <- function(period = 1, showsd = TRUE) {
#' Callback closure for logging the evaluation history
#'
#' @details
#' This callback function appends the current iteration evaluation results \code{bst_evaluation}
#' available in the calling parent frame to the \code{evaluation_log} list in a calling frame.
#'
#' The finalizer callback (called with \code{finalize = TURE} in the end) converts
#' the \code{evaluation_log} list into a final data.table.
#'
#' The iteration evaluation result \code{bst_evaluation} must be a named numeric vector.
#'
#' Note: in the column names of the final data.table, the dash '-' character is replaced with
#' the underscore '_' in order to make the column names more like regular R identifiers.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{evaluation_log},
#' \code{bst_evaluation},
#' \code{iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.evaluation.log <- function() {
  mnames <- NULL
  init <- function(env) {
    if (!is.list(env$evaluation_log))
      stop("'evaluation_log' has to be a list")
    mnames <<- names(env$bst_evaluation)
    if (is.null(mnames) || any(mnames == ""))
      stop("bst_evaluation must have non-empty names")
    mnames <<- gsub('-', '_', names(env$bst_evaluation))
    if(!is.null(env$bst_evaluation_err))
      mnames <<- c(paste0(mnames, '_mean'), paste0(mnames, '_std'))
  }
  finalizer <- function(env) {
    env$evaluation_log <- as.data.table(t(simplify2array(env$evaluation_log)))
    setnames(env$evaluation_log, c('iter', mnames))
    if(!is.null(env$bst_evaluation_err)) {
      # rearrange col order from _mean,_mean,...,_std,_std,...
      # to be _mean,_std,_mean,_std,...
@ -135,18 +135,18 @@ cb.evaluation.log <- function() {
      env$evaluation_log <- env$evaluation_log[, c('iter', cnames), with = FALSE]
    }
  }
  callback <- function(env = parent.frame(), finalize = FALSE) {
    if (is.null(mnames))
      init(env)
    if (finalize)
      return(finalizer(env))
    ev <- env$bst_evaluation
    if(!is.null(env$bst_evaluation_err))
      ev <- c(ev, env$bst_evaluation_err)
    env$evaluation_log <- c(env$evaluation_log,
                            list(c(iter = env$iteration, ev)))
  }
  attr(callback, 'call') <- match.call()
@ -154,21 +154,21 @@ cb.evaluation.log <- function() {
  callback
}
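This callback is attached automatically when a watchlist is supplied, so in practice the log is read off the returned model. A short sketch, reusing dtrain and param from the sketch above:

bst <- xgb.train(param, dtrain, nrounds = 10, watchlist = list(train = dtrain))
# a data.table with one row per iteration; 'train-error' becomes column 'train_error'
head(bst$evaluation_log)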
-#' Callback closure for restetting the booster's parameters at each iteration.
+#' Callback closure for resetting the booster's parameters at each iteration.
#'
#' @param new_params a list where each element corresponds to a parameter that needs to be reset.
#' Each element's value must be either a vector of values of length \code{nrounds}
#' to be set at each iteration,
#' or a function of two parameters \code{learning_rates(iteration, nrounds)}
#' which returns a new parameter value by using the current iteration number
#' and the total number of boosting rounds.
#'
#' @details
#' This is a "pre-iteration" callback function used to reset booster's parameters
#' at the beginning of each iteration.
#'
#' Note that when training is resumed from some previous model, and a function is used to
#' reset a parameter value, the \code{nrounds} argument in this function would be the
#' the number of boosting rounds in the current training.
#'
#' Callback function expects the following values to be set in its calling frame:
@ -176,32 +176,32 @@ cb.evaluation.log <- function() {
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.reset.parameters <- function(new_params) {
  if (typeof(new_params) != "list")
    stop("'new_params' must be a list")
  pnames <- gsub("\\.", "_", names(new_params))
  nrounds <- NULL
  # run some checks in the begining
  init <- function(env) {
    nrounds <<- env$end_iteration - env$begin_iteration + 1
    if (is.null(env$bst) && is.null(env$bst_folds))
      stop("Parent frame has neither 'bst' nor 'bst_folds'")
    # Some parameters are not allowed to be changed,
    # since changing them would simply wreck some chaos
    not_allowed <- pnames %in%
      c('num_class', 'num_output_group', 'size_leaf_vector', 'updater_seq')
    if (any(not_allowed))
      stop('Parameters ', paste(pnames[not_allowed]), " cannot be changed during boosting.")
    for (n in pnames) {
      p <- new_params[[n]]
      if (is.function(p)) {
@ -215,18 +215,18 @@ cb.reset.parameters <- function(new_params) {
      }
    }
  }
  callback <- function(env = parent.frame()) {
    if (is.null(nrounds))
      init(env)
    i <- env$iteration
    pars <- lapply(new_params, function(p) {
      if (is.function(p))
        return(p(i, nrounds))
      p[i]
    })
    if (!is.null(env$bst)) {
      xgb.parameters(env$bst$handle) <- pars
    } else {
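A sketch of resetting the learning rate every round with this callback (the decay schedule is illustrative; a vector of length nrounds would work in place of the function):

# decay eta from 0.5 by 2% per boosting round
my_etas <- list(eta = function(iteration, nrounds) 0.5 * 0.98^(iteration - 1))
bst <- xgb.train(param, dtrain, nrounds = 50, watchlist = list(train = dtrain),
                 callbacks = list(cb.reset.parameters(my_etas)))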
@ -242,23 +242,23 @@ cb.reset.parameters <- function(new_params) {
#' Callback closure to activate the early stopping.
#'
#' @param stopping_rounds The number of rounds with no improvement in
#' the evaluation metric in order to stop the training.
#' @param maximize whether to maximize the evaluation metric
#' @param metric_name the name of an evaluation column to use as a criteria for early
#' stopping. If not set, the last column would be used.
#' Let's say the test data in \code{watchlist} was labelled as \code{dtest},
#' and one wants to use the AUC in test data for early stopping regardless of where
#' it is in the \code{watchlist}, then one of the following would need to be set:
#' \code{metric_name='dtest-auc'} or \code{metric_name='dtest_auc'}.
#' All dash '-' characters in metric names are considered equivalent to '_'.
#' @param verbose whether to print the early stopping information.
#'
#' @details
#' This callback function determines the condition for early stopping
#' by setting the \code{stop_condition = TRUE} flag in its calling frame.
#'
#' The following additional fields are assigned to the model's R object:
#' \itemize{
#' \item \code{best_score} the evaluation score at the best iteration
@ -266,13 +266,13 @@ cb.reset.parameters <- function(new_params) {
#' \item \code{best_ntreelimit} to use with the \code{ntreelimit} parameter in \code{predict}.
#' It differs from \code{best_iteration} in multiclass or random forest settings.
#' }
#'
#' The Same values are also stored as xgb-attributes:
#' \itemize{
#' \item \code{best_iteration} is stored as a 0-based iteration index (for interoperability of binary models)
#' \item \code{best_msg} message string is also stored.
#' }
#'
#' At least one data element is required in the evaluation watchlist for early stopping to work.
#'
#' Callback function expects the following values to be set in its calling frame:
@ -284,13 +284,13 @@ cb.reset.parameters <- function(new_params) {
#' \code{begin_iteration},
#' \code{end_iteration},
#' \code{num_parallel_tree}.
#'
#' @seealso
#' \code{\link{callbacks}},
#' \code{\link{xgb.attr}}
#'
#' @export
cb.early.stop <- function(stopping_rounds, maximize = FALSE,
                          metric_name = NULL, verbose = TRUE) {
  # state variables
  best_iteration <- -1
@ -298,11 +298,11 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
  best_score <- Inf
  best_msg <- NULL
  metric_idx <- 1
  init <- function(env) {
    if (length(env$bst_evaluation) == 0)
      stop("For early stopping, watchlist must have at least one element")
    eval_names <- gsub('-', '_', names(env$bst_evaluation))
    if (!is.null(metric_name)) {
      metric_idx <<- which(gsub('-', '_', metric_name) == eval_names)
@ -314,25 +314,25 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
        length(env$bst_evaluation) > 1) {
      metric_idx <<- length(eval_names)
      if (verbose)
        cat('Multiple eval metrics are present. Will use ',
            eval_names[metric_idx], ' for early stopping.\n', sep = '')
    }
    metric_name <<- eval_names[metric_idx]
    # maximize is usually NULL when not set in xgb.train and built-in metrics
    if (is.null(maximize))
      maximize <<- grepl('(_auc|_map|_ndcg)', metric_name)
    if (verbose && NVL(env$rank, 0) == 0)
      cat("Will train until ", metric_name, " hasn't improved in ",
          stopping_rounds, " rounds.\n\n", sep = '')
    best_iteration <<- 1
    if (maximize) best_score <<- -Inf
    env$stop_condition <- FALSE
    if (!is.null(env$bst)) {
      if (!inherits(env$bst, 'xgb.Booster'))
        stop("'bst' in the parent frame must be an 'xgb.Booster'")
@ -348,7 +348,7 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
stop("Parent frame has neither 'bst' nor ('bst_folds' and 'basket')") stop("Parent frame has neither 'bst' nor ('bst_folds' and 'basket')")
} }
} }
finalizer <- function(env) { finalizer <- function(env) {
if (!is.null(env$bst)) { if (!is.null(env$bst)) {
attr_best_score = as.numeric(xgb.attr(env$bst$handle, 'best_score')) attr_best_score = as.numeric(xgb.attr(env$bst$handle, 'best_score'))
@ -367,16 +367,16 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
  callback <- function(env = parent.frame(), finalize = FALSE) {
    if (best_iteration < 0)
      init(env)
    if (finalize)
      return(finalizer(env))
    i <- env$iteration
    score = env$bst_evaluation[metric_idx]
    if (( maximize && score > best_score) ||
        (!maximize && score < best_score)) {
      best_msg <<- format.eval.string(i, env$bst_evaluation, env$bst_evaluation_err)
      best_score <<- score
      best_iteration <<- i
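In ordinary use this callback is attached via xgb.train's early_stopping_rounds argument rather than constructed by hand. A sketch, with the agaricus test set standing in for held-out data:

data(agaricus.test, package = 'xgboost')
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
bst <- xgb.train(param, dtrain, nrounds = 100,
                 watchlist = list(train = dtrain, test = dtest),
                 early_stopping_rounds = 3)
# fields assigned by the callback, as documented above:
bst$best_iteration
bst$best_score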
@ -403,37 +403,37 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
#' Callback closure for saving a model file.
#'
#' @param save_period save the model to disk after every
#' \code{save_period} iterations; 0 means save the model at the end.
#' @param save_name the name or path for the saved model file.
#' It can contain a \code{\link[base]{sprintf}} formatting specifier
#' to include the integer iteration number in the file name.
#' E.g., with \code{save_name} = 'xgboost_%04d.model',
#' the file saved at iteration 50 would be named "xgboost_0050.model".
#'
#' @details
#' This callback function allows to save an xgb-model file, either periodically after each \code{save_period}'s or at the end.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst},
#' \code{iteration},
#' \code{begin_iteration},
#' \code{end_iteration}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
  if (save_period < 0)
    stop("'save_period' cannot be negative")
  callback <- function(env = parent.frame()) {
    if (is.null(env$bst))
      stop("'save_model' callback requires the 'bst' booster object in its calling frame")
    if ((save_period > 0 && (env$iteration - env$begin_iteration) %% save_period == 0) ||
        (save_period == 0 && env$iteration == env$end_iteration))
      xgb.save(env$bst, sprintf(save_name, env$iteration))
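A sketch of the usual entry point (xgb.train's save_period and save_name arguments attach this callback; file names follow the sprintf convention documented above):

# periodically save the model; the %04d specifier embeds the iteration number
bst <- xgb.train(param, dtrain, nrounds = 20, watchlist = list(train = dtrain),
                 save_period = 10, save_name = "xgboost_%04d.model")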
@ -445,16 +445,16 @@ cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
#' Callback closure for returning cross-validation based predictions.
#'
#' @param save_models a flag for whether to save the folds' models.
#'
#' @details
#' This callback function saves predictions for all of the test folds,
#' and also allows to save the folds' models.
#'
#' It is a "finalizer" callback and it uses early stopping information whenever it is available,
#' thus it must be run after the early stopping callback if the early stopping is used.
#'
#' Callback function expects the following values to be set in its calling frame:
#' \code{bst_folds},
#' \code{basket},
@ -463,36 +463,36 @@ cb.save.model <- function(save_period = 0, save_name = "xgboost.model") {
#' \code{params},
#' \code{num_parallel_tree},
#' \code{num_class}.
#'
#' @return
#' Predictions are returned inside of the \code{pred} element, which is either a vector or a matrix,
#' depending on the number of prediction outputs per data row. The order of predictions corresponds
#' to the order of rows in the original dataset. Note that when a custom \code{folds} list is
#' provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a
#' non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be
-#' meaningful when user-profided folds have overlapping indices as in, e.g., random sampling splits.
+#' meaningful when user-provided folds have overlapping indices as in, e.g., random sampling splits.
#' When some of the indices in the training dataset are not included into user-provided \code{folds},
#' their prediction value would be \code{NA}.
#'
#' @seealso
#' \code{\link{callbacks}}
#'
#' @export
cb.cv.predict <- function(save_models = FALSE) {
  finalizer <- function(env) {
    if (is.null(env$basket) || is.null(env$bst_folds))
      stop("'cb.cv.predict' callback requires 'basket' and 'bst_folds' lists in its calling frame")
    N <- nrow(env$data)
    pred <-
      if (env$num_class > 1) {
        matrix(NA_real_, N, env$num_class)
      } else {
        rep(NA_real_, N)
      }
    ntreelimit <- NVL(env$basket$best_ntreelimit,
                      env$end_iteration * env$num_parallel_tree)
    if (NVL(env$params[['booster']], '') == 'gblinear') {
      ntreelimit <- 0 # must be 0 for gblinear
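A sketch of the usual entry point (xgb.cv attaches this callback when prediction = TRUE; the fold count is illustrative):

cv <- xgb.cv(param, dtrain, nrounds = 10, nfold = 5, prediction = TRUE)
# out-of-fold predictions, in the row order of dtrain
str(cv$pred)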
@ -569,7 +569,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' # Extract the coefficients' path and plot them vs boosting iteration number:
#' coef_path <- xgb.gblinear.history(bst)
#' matplot(coef_path, type = 'l')
#'
#' # With the deterministic coordinate descent updater, it is safer to use higher learning rates.
#' # Will try the classical componentwise boosting which selects a single best feature per round:
#' bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 200, eta = 0.8,
@ -586,7 +586,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' # coefficients in the CV fold #3
#' xgb.gblinear.history(bst)[[3]] %>% matplot(type = 'l')
#'
#'
#' #### Multiclass classification:
#' #
#' dtrain <- xgb.DMatrix(scale(x), label = as.numeric(iris$Species) - 1)
@ -681,9 +681,9 @@ cb.gblinear.history <- function(sparse=FALSE) {
#' using the \code{cb.gblinear.history()} callback.
#' @param class_index zero-based class index to extract the coefficients for only that
#' specific class in a multinomial multiclass model. When it is NULL, all the
-#' coeffients are returned. Has no effect in non-multiclass models.
+#' coefficients are returned. Has no effect in non-multiclass models.
#'
#' @return
#' For an \code{xgb.train} result, a matrix (either dense or sparse) with the columns
#' corresponding to iteration's coefficients (in the order as \code{xgb.dump()} would
#' return) and the rows corresponding to boosting iterations.
@ -731,7 +731,7 @@ xgb.gblinear.history <- function(model, class_index = NULL) {
  coef_path <- environment(model$callbacks$cb.gblinear.history)[["coefs"]]
  if (!is.null(class_index) && num_class > 1) {
    coef_path <- if (is.list(coef_path)) {
      lapply(coef_path,
             function(x) x[, seq(1 + class_index, by=num_class, length.out=num_feat)])
    } else {
      coef_path <- coef_path[, seq(1 + class_index, by=num_class, length.out=num_feat)]
@ -743,7 +743,7 @@ xgb.gblinear.history <- function(model, class_index = NULL) {
#
# Internal utility functions for callbacks ------------------------------------
#
# Format the evaluation metric string
format.eval.string <- function(iter, eval_res, eval_err = NULL) {
@ -773,7 +773,7 @@ callback.calls <- function(cb_list) {
  unlist(lapply(cb_list, function(x) attr(x, 'call')))
}
# Add a callback cb to the list and make sure that
# cb.early.stop and cb.cv.predict are at the end of the list
# with cb.cv.predict being the last (when present)
add.cb <- function(cb_list, cb) {
@ -782,11 +782,11 @@ add.cb <- function(cb_list, cb) {
  if ('cb.early.stop' %in% names(cb_list)) {
    cb_list <- c(cb_list, cb_list['cb.early.stop'])
    # this removes only the first one
    cb_list['cb.early.stop'] <- NULL
  }
  if ('cb.cv.predict' %in% names(cb_list)) {
    cb_list <- c(cb_list, cb_list['cb.cv.predict'])
    cb_list['cb.cv.predict'] <- NULL
  }
  cb_list
}
@ -796,7 +796,7 @@ categorize.callbacks <- function(cb_list) {
  list(
    pre_iter = Filter(function(x) {
      pre <- attr(x, 'is_pre_iteration')
      !is.null(pre) && pre
    }, cb_list),
    post_iter = Filter(function(x) {
      pre <- attr(x, 'is_pre_iteration')

View File

@ -81,7 +81,7 @@ xgb.get.handle <- function(object) {
#' its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods
#' should still work for such a model object since those methods would be using
#' \code{xgb.Booster.complete} internally. However, one might find it to be more efficient to call the
-#' \code{xgb.Booster.complete} function explicitely once after loading a model as an R-object.
+#' \code{xgb.Booster.complete} function explicitly once after loading a model as an R-object.
#' That would prevent further repeated implicit reconstruction of an internal booster model.
#'
#' @return
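A sketch of the explicit call recommended here (the file name is hypothetical; this assumes the model was serialized with saveRDS):

bst <- readRDS("my_model.rds")    # hypothetical path to a saved xgb.Booster
bst <- xgb.Booster.complete(bst)  # restore the internal handle once, up front
# subsequent calls like predict() then skip the implicit reconstruction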
@ -162,7 +162,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#'
#' With \code{predinteraction = TRUE}, SHAP values of contributions of interaction of each pair of features
#' are computed. Note that this operation might be rather expensive in terms of compute and memory.
-#' Since it quadratically depends on the number of features, it is recommended to perfom selection
+#' Since it quadratically depends on the number of features, it is recommended to perform selection
#' of the most important features first. See below about the format of the returned results.
#'
#' @return
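A sketch of the interaction computation described here, assuming bst is a trained booster as in the earlier sketches (note the output grows quadratically in the number of features):

# one (nfeatures+1) x (nfeatures+1) SHAP-interaction slice per row (incl. bias)
shap_int <- predict(bst, agaricus.train$data, predinteraction = TRUE)
dim(shap_int)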
@ -190,7 +190,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#'
#' @seealso
#' \code{\link{xgb.train}}.
#'
#' @references
#'
#' Scott M. Lundberg, Su-In Lee, "A Unified Approach to Interpreting Model Predictions", NIPS Proceedings 2017, \url{https://arxiv.org/abs/1705.07874}

View File

@ -1,18 +1,18 @@
#' Construct xgb.DMatrix object
#'
#' Construct xgb.DMatrix object from either a dense matrix, a sparse matrix, or a local file.
#' Supported input file formats are either a libsvm text file or a binary file that was created previously by
#' \code{\link{xgb.DMatrix.save}}).
#'
#' @param data a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
#' string representing a filename.
#' @param info a named list of additional information to store in the \code{xgb.DMatrix} object.
#' See \code{\link{setinfo}} for the specific allowed kinds of
#' @param missing a float value to represents missing values in data (used only when input is a dense matrix).
#' It is useful when a 0 or some other extreme value represents missing values in data.
#' @param silent whether to suppress printing an informational message after loading from a file.
#' @param ... the \code{info} data could be passed directly as parameters, without creating an \code{info} list.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
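A sketch of the missing-value handling described for dense input above (a toy matrix in which zeros encode missingness):

m <- matrix(c(1, 0, 3, 0, 5, 6), nrow = 2)
dmat <- xgb.DMatrix(m, label = c(0, 1), missing = 0)  # treat 0 as missing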
@ -78,23 +78,23 @@ xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
#' Dimensions of xgb.DMatrix
#'
#' Returns a vector of numbers of rows and of columns in an \code{xgb.DMatrix}.
#' @param x Object of class \code{xgb.DMatrix}
#'
#' @details
#' Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also
#' be directly used with an \code{xgb.DMatrix} object.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' stopifnot(nrow(dtrain) == nrow(train$data))
#' stopifnot(ncol(dtrain) == ncol(train$data))
#' stopifnot(all(dim(dtrain) == dim(train$data)))
#'
#' @export
dim.xgb.DMatrix <- function(x) {
  c(.Call(XGDMatrixNumRow_R, x), .Call(XGDMatrixNumCol_R, x))
@ -102,14 +102,14 @@ dim.xgb.DMatrix <- function(x) {
#' Handling of column names of \code{xgb.DMatrix}
#'
#' Only column names are supported for \code{xgb.DMatrix}, thus setting of
-#' row names would have no effect and returnten row names would be NULL.
+#' row names would have no effect and returned row names would be NULL.
#'
#' @param x object of class \code{xgb.DMatrix}
#' @param value a list of two elements: the first one is ignored
#' and the second one is column names
#'
#' @details
#' Generic \code{dimnames} methods are used by \code{colnames}.
#' Since row names are irrelevant, it is recommended to use \code{colnames} directly.
@ -122,7 +122,7 @@ dim.xgb.DMatrix <- function(x) {
#' colnames(dtrain)
#' colnames(dtrain) <- make.names(1:ncol(train$data))
#' print(dtrain, verbose=TRUE)
#'
#' @rdname dimnames.xgb.DMatrix
#' @export
dimnames.xgb.DMatrix <- function(x) {
@ -140,8 +140,8 @@ dimnames.xgb.DMatrix <- function(x) {
    attr(x, '.Dimnames') <- NULL
    return(x)
  }
  if (ncol(x) != length(value[[2]]))
    stop("can't assign ", length(value[[2]]), " colnames to a ",
         ncol(x), " column xgb.DMatrix")
  attr(x, '.Dimnames') <- value
  x
@ -149,33 +149,33 @@ dimnames.xgb.DMatrix <- function(x) {
#' Get information of an xgb.DMatrix object
#'
#' Get information of an xgb.DMatrix object
#' @param object Object of class \code{xgb.DMatrix}
#' @param name the name of the information field to get (see details)
#' @param ... other parameters
#'
#' @details
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#'     \item \code{label}: label Xgboost learn from ;
#'     \item \code{weight}: to do a weight rescale ;
#'     \item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
#'     \item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
#'
#' }
#'
#' \code{group} can be setup by \code{setinfo} but can't be retrieved by \code{getinfo}.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
#'
#' labels2 <- getinfo(dtrain, 'label')
#' stopifnot(all(labels2 == 1-labels))
#' @rdname getinfo
@ -202,9 +202,9 @@ getinfo.xgb.DMatrix <- function(object, name, ...) {
#' Set information of an xgb.DMatrix object
#'
#' Set information of an xgb.DMatrix object
#'
#' @param object Object of class "xgb.DMatrix"
#' @param name the name of the field to get
#' @param info the specific field of information to set
@ -212,19 +212,19 @@ getinfo.xgb.DMatrix <- function(object, name, ...) {
#'
#' @details
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#'     \item \code{label}: label Xgboost learn from ;
#'     \item \code{weight}: to do a weight rescale ;
#'     \item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
#'     \item \code{group}: number of rows in each group (to use with \code{rank:pairwise} objective).
#' }
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
#' labels2 <- getinfo(dtrain, 'label')
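A sketch of the group field mentioned above for ranking tasks (the query sizes are illustrative; group sizes must sum to the number of rows):

dsmall <- slice(dtrain, 1:7)
setinfo(dsmall, 'group', c(3, 4))  # rows 1-3 form query 1, rows 4-7 form query 2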
@ -266,27 +266,27 @@ setinfo.xgb.DMatrix <- function(object, name, info, ...) {
#' Get a new DMatrix containing the specified rows of
-#' orginal xgb.DMatrix object
+#' original xgb.DMatrix object
#'
#' Get a new DMatrix containing the specified rows of
-#' orginal xgb.DMatrix object
+#' original xgb.DMatrix object
#'
#' @param object Object of class "xgb.DMatrix"
#' @param idxset a integer vector of indices of rows needed
#' @param colset currently not used (columns subsetting is not available)
#' @param ... other parameters (currently not used)
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' dsub <- slice(dtrain, 1:42)
#' labels1 <- getinfo(dsub, 'label')
#' dsub <- dtrain[1:42, ]
#' labels2 <- getinfo(dsub, 'label')
#' all.equal(labels1, labels2)
#'
#' @rdname slice.xgb.DMatrix
#' @export
slice <- function(object, ...) UseMethod("slice")
@ -325,22 +325,22 @@ slice.xgb.DMatrix <- function(object, idxset, ...) {
#' Print xgb.DMatrix
#'
#' Print information about xgb.DMatrix.
#' Currently it displays dimensions and presence of info-fields and colnames.
#'
#' @param x an xgb.DMatrix object
#' @param verbose whether to print colnames (when present)
#' @param ... not currently used
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#'
#' dtrain
#' print(dtrain, verbose=TRUE)
#'
#' @method print xgb.DMatrix
#' @export
print.xgb.DMatrix <- function(x, verbose = FALSE, ...) {

View File

@ -39,7 +39,7 @@
#' }
#' @param obj customized objective function. Returns gradient and second order
#' gradient with given prediction and dtrain.
-#' @param feval custimized evaluation function. Returns
+#' @param feval customized evaluation function. Returns
#' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain.
#' @param stratified a \code{boolean} indicating whether sampling of folds should be stratified
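A sketch of a customized evaluation function with the return shape documented above, reusing param and dtrain from the earlier sketches (this assumes predictions arrive as probabilities; with a custom objective they would be raw margin scores instead):

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "my_error", value = err)
}
cv <- xgb.cv(param, dtrain, nrounds = 5, nfold = 3, feval = evalerror)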
@ -84,7 +84,7 @@
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitly passed.
-#' \item \code{evaluation_log} evaluation history storead as a \code{data.table} with the
+#' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to the
#' CV-based evaluation means and standard deviations for the training and test CV-sets.
#' It is created by the \code{\link{cb.evaluation.log}} callback.
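For concreteness, a sketch of what that log looks like for a binary-error run (column names follow the mean/std naming scheme described above):

cv <- xgb.cv(param, dtrain, nrounds = 10, nfold = 5)
# columns: iter, train_error_mean, train_error_std, test_error_mean, test_error_std
cv$evaluation_log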

View File

@ -5,16 +5,16 @@
#'
#' @param importance_matrix a \code{data.table} returned by \code{\link{xgb.importance}}.
#' @param top_n maximal number of top features to include into the plot.
#' @param measure the name of importance measure to plot.
#' When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.
#' @param rel_to_first whether importance values should be represented as relative to the highest ranked feature.
#' See Details.
#' @param left_margin (base R barplot) allows to adjust the left margin size to fit feature names.
#' When it is NULL, the existing \code{par('mar')} is used.
#' @param cex (base R barplot) passed as \code{cex.names} parameter to \code{barplot}.
#' @param plot (base R barplot) whether a barplot should be produced.
#' If FALSE, only a data.table is returned.
#' @param n_clusters (ggplot only) a \code{numeric} vector containing the min and the max range
#' of the possible number of clusters of bars.
#' @param ... other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).
#'
@ -22,27 +22,27 @@
#' The graph represents each feature as a horizontal bar of length proportional to the importance of a feature.
#' Features are shown ranked in a decreasing importance order.
#' It works for importances from both \code{gblinear} and \code{gbtree} models.
#'
#' When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}.
#' For gbtree model, that would mean being normalized to the total of 1
#' ("what is feature's importance contribution relative to the whole model?").
#' For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients.
#' Setting \code{rel_to_first = TRUE} allows to see the picture from the perspective of
#' "what is feature's importance contribution relative to the most important feature?"
#'
-#' The ggplot-backend method also performs 1-D custering of the importance values,
-#' with bar colors coresponding to different clusters that have somewhat similar importance values.
+#' The ggplot-backend method also performs 1-D clustering of the importance values,
+#' with bar colors corresponding to different clusters that have somewhat similar importance values.
#'
#' @return
#' The \code{xgb.plot.importance} function creates a \code{barplot} (when \code{plot=TRUE})
#' and silently returns a processed data.table with \code{n_top} features sorted by importance.
#'
#' The \code{xgb.ggplot.importance} function returns a ggplot graph which could be customized afterwards.
#' E.g., to change the title of the graph, add \code{+ ggtitle("A GRAPH NAME")} to the result.
#'
#' @seealso
#' \code{\link[graphics]{barplot}}.
#'
#' @examples
#' data(agaricus.train)
#'
@ -50,15 +50,15 @@
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") #' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#' #'
#' importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst) #' importance_matrix <- xgb.importance(colnames(agaricus.train$data), model = bst)
#' #'
#' xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance") #' xgb.plot.importance(importance_matrix, rel_to_first = TRUE, xlab = "Relative importance")
#' #'
#' (gg <- xgb.ggplot.importance(importance_matrix, measure = "Frequency", rel_to_first = TRUE)) #' (gg <- xgb.ggplot.importance(importance_matrix, measure = "Frequency", rel_to_first = TRUE))
#' gg + ggplot2::ylab("Frequency") #' gg + ggplot2::ylab("Frequency")
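#'
#' # A hedged variation using the top_n filter documented above:
#' xgb.plot.importance(importance_matrix, top_n = 5)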
#' #'
#' @rdname xgb.plot.importance #' @rdname xgb.plot.importance
#' @export #' @export
xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure = NULL, xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure = NULL,
rel_to_first = FALSE, left_margin = 10, cex = NULL, plot = TRUE, ...) { rel_to_first = FALSE, left_margin = 10, cex = NULL, plot = TRUE, ...) {
check.deprecation(...) check.deprecation(...)
if (!is.data.table(importance_matrix)) { if (!is.data.table(importance_matrix)) {
@ -80,13 +80,13 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
if (!"Feature" %in% imp_names) if (!"Feature" %in% imp_names)
stop("Importance matrix column names are not as expected!") stop("Importance matrix column names are not as expected!")
} }
# also aggregate, just in case when the values were not yet summed up by feature # also aggregate, just in case when the values were not yet summed up by feature
importance_matrix <- importance_matrix[, Importance := sum(get(measure)), by = Feature] importance_matrix <- importance_matrix[, Importance := sum(get(measure)), by = Feature]
# make sure it's ordered # make sure it's ordered
importance_matrix <- importance_matrix[order(-abs(Importance))] importance_matrix <- importance_matrix[order(-abs(Importance))]
if (!is.null(top_n)) { if (!is.null(top_n)) {
top_n <- min(top_n, nrow(importance_matrix)) top_n <- min(top_n, nrow(importance_matrix))
importance_matrix <- head(importance_matrix, top_n) importance_matrix <- head(importance_matrix, top_n)
@ -97,14 +97,14 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
if (is.null(cex)) { if (is.null(cex)) {
cex <- 2.5/log2(1 + nrow(importance_matrix)) cex <- 2.5/log2(1 + nrow(importance_matrix))
} }
if (plot) { if (plot) {
op <- par(no.readonly = TRUE) op <- par(no.readonly = TRUE)
mar <- op$mar mar <- op$mar
if (!is.null(left_margin)) if (!is.null(left_margin))
mar[2] <- left_margin mar[2] <- left_margin
par(mar = mar) par(mar = mar)
# reverse the order of rows to have the highest ranked at the top # reverse the order of rows to have the highest ranked at the top
importance_matrix[nrow(importance_matrix):1, importance_matrix[nrow(importance_matrix):1,
barplot(Importance, horiz = TRUE, border = NA, cex.names = cex, barplot(Importance, horiz = TRUE, border = NA, cex.names = cex,
@ -115,7 +115,7 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
barplot(Importance, horiz = TRUE, border = NA, add = TRUE)] barplot(Importance, horiz = TRUE, border = NA, add = TRUE)]
par(op) par(op)
} }
invisible(importance_matrix) invisible(importance_matrix)
} }
View File
@ -1,9 +1,9 @@
#' SHAP contribution dependency plots #' SHAP contribution dependency plots
#' #'
#' Visualizing the SHAP feature contribution to prediction dependencies on feature value. #' Visualizing the SHAP feature contribution to prediction dependencies on feature value.
#' #'
#' @param data data as a \code{matrix} or \code{dgCMatrix}. #' @param data data as a \code{matrix} or \code{dgCMatrix}.
#' @param shap_contrib a matrix of SHAP contributions that was computed earlier for the above #' @param shap_contrib a matrix of SHAP contributions that was computed earlier for the above
#' \code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}. #' \code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.
#' @param features a vector of either column indices or of feature names to plot. When it is NULL, #' @param features a vector of either column indices or of feature names to plot. When it is NULL,
#' feature importance is calculated, and \code{top_n} highest-ranked features are taken. #' feature importance is calculated, and \code{top_n} highest-ranked features are taken.
@ -31,32 +31,32 @@
#' @param plot_loess whether to plot loess-smoothed curves. The smoothing is only done for features with #' @param plot_loess whether to plot loess-smoothed curves. The smoothing is only done for features with
#' more than 5 distinct values. #' more than 5 distinct values.
#' @param col_loess a color to use for the loess curves. #' @param col_loess a color to use for the loess curves.
#' @param span_loess the \code{span} paramerer in \code{\link[stats]{loess}}'s call. #' @param span_loess the \code{span} parameter in \code{\link[stats]{loess}}'s call.
#' @param which whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far. #' @param which whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.
#' @param plot whether a plot should be drawn. If FALSE, only a list of matrices is returned. #' @param plot whether a plot should be drawn. If FALSE, only a list of matrices is returned.
#' @param ... other parameters passed to \code{plot}. #' @param ... other parameters passed to \code{plot}.
#' #'
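#' A hedged sketch of where a hand-supplied \code{shap_contrib} matrix can come from,
#' using the \code{predcontrib} option of \code{predict.xgb.Booster}:
#'
#' # one row per observation, one column per feature plus a final BIAS column
#' shap_contrib <- predict(model, data, predcontrib = TRUE)
#' xgb.plot.shap(data, shap_contrib = shap_contrib, top_n = 2)
#'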
#' @details #' @details
#' #'
#' These scatterplots represent how SHAP feature contributions depend on feature values. #' These scatterplots represent how SHAP feature contributions depend on feature values.
#' The similarity to partial dependency plots is that they also give an idea of how feature values #' The similarity to partial dependency plots is that they also give an idea of how feature values
#' affect predictions. However, in partial dependency plots, we usually see marginal dependencies #' affect predictions. However, in partial dependency plots, we usually see marginal dependencies
#' of model prediction on feature value, while SHAP contribution dependency plots display the estimated #' of model prediction on feature value, while SHAP contribution dependency plots display the estimated
#' contributions of a feature to model prediction for each individual case. #' contributions of a feature to model prediction for each individual case.
#' #'
#' When \code{plot_loess = TRUE} is set, feature values are rounded to 3 significant digits and #' When \code{plot_loess = TRUE} is set, feature values are rounded to 3 significant digits and
#' weighted LOESS is computed and plotted, where weights are the numbers of data points #' weighted LOESS is computed and plotted, where weights are the numbers of data points
#' at each rounded value. #' at each rounded value.
#' #'
#' Note: SHAP contributions are shown on the scale of model margin. E.g., for a logistic binomial objective, #' Note: SHAP contributions are shown on the scale of model margin. E.g., for a logistic binomial objective,
#' the margin is the prediction before a sigmoidal transform into probability-like values. #' the margin is the prediction before a sigmoidal transform into probability-like values.
#' Also, since SHAP stands for "SHapley Additive exPlanation" (model prediction = sum of SHAP #' Also, since SHAP stands for "SHapley Additive exPlanation" (model prediction = sum of SHAP
#' contributions for all features + bias), depending on the objective used, transforming SHAP #' contributions for all features + bias), depending on the objective used, transforming SHAP
#' contributions for a feature from the marginal to the prediction space is not necessarily #' contributions for a feature from the marginal to the prediction space is not necessarily
#' a meaningful thing to do. #' a meaningful thing to do.
#' #'
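#' As an illustrative identity behind the margin note above (binary:logistic case,
#' with \code{shap_contrib} computed as in the sketch higher up):
#'
#' margin <- rowSums(shap_contrib)   # SHAP contributions plus the BIAS column
#' prob <- 1 / (1 + exp(-margin))    # sigmoidal transform back to probabilities
#'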
#' @return #' @return
#' #'
#' In addition to producing plots (when \code{plot=TRUE}), it silently returns a list of two matrices: #' In addition to producing plots (when \code{plot=TRUE}), it silently returns a list of two matrices:
#' \itemize{ #' \itemize{
#' \item \code{data} the values of selected features; #' \item \code{data} the values of selected features;
@ -70,11 +70,11 @@
#' Scott M. Lundberg, Su-In Lee, "Consistent feature attribution for tree ensembles", \url{https://arxiv.org/abs/1706.06060} #' Scott M. Lundberg, Su-In Lee, "Consistent feature attribution for tree ensembles", \url{https://arxiv.org/abs/1706.06060}
#' #'
#' @examples #' @examples
#' #'
#' data(agaricus.train, package='xgboost') #' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost') #' data(agaricus.test, package='xgboost')
#' #'
#' bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50, #' bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
#' eta = 0.1, max_depth = 3, subsample = .5, #' eta = 0.1, max_depth = 3, subsample = .5,
#' method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0) #' method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0)
#' #'
@ -99,7 +99,7 @@
#' n_col = 2, col = col, pch = 16, pch_NA = 17) #' n_col = 2, col = col, pch = 16, pch_NA = 17)
#' xgb.plot.shap(x, model = mbst, trees = trees0 + 2, target_class = 2, top_n = 4, #' xgb.plot.shap(x, model = mbst, trees = trees0 + 2, target_class = 2, top_n = 4,
#' n_col = 2, col = col, pch = 16, pch_NA = 17) #' n_col = 2, col = col, pch = 16, pch_NA = 17)
#' #'
#' @rdname xgb.plot.shap #' @rdname xgb.plot.shap
#' @export #' @export
xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1, model = NULL, xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1, model = NULL,
@ -109,7 +109,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
plot_NA = TRUE, col_NA = rgb(0.7, 0, 1, 0.6), pch_NA = '.', pos_NA = 1.07, plot_NA = TRUE, col_NA = rgb(0.7, 0, 1, 0.6), pch_NA = '.', pos_NA = 1.07,
plot_loess = TRUE, col_loess = 2, span_loess = 0.5, plot_loess = TRUE, col_loess = 2, span_loess = 0.5,
which = c("1d", "2d"), plot = TRUE, ...) { which = c("1d", "2d"), plot = TRUE, ...) {
if (!is.matrix(data) && !inherits(data, "dgCMatrix")) if (!is.matrix(data) && !inherits(data, "dgCMatrix"))
stop("data: must be either matrix or dgCMatrix") stop("data: must be either matrix or dgCMatrix")
@ -122,7 +122,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
if (!is.null(shap_contrib) && if (!is.null(shap_contrib) &&
(!is.matrix(shap_contrib) || nrow(shap_contrib) != nrow(data) || ncol(shap_contrib) != ncol(data) + 1)) (!is.matrix(shap_contrib) || nrow(shap_contrib) != nrow(data) || ncol(shap_contrib) != ncol(data) + 1))
stop("shap_contrib is not compatible with the provided data") stop("shap_contrib is not compatible with the provided data")
nsample <- if (is.null(subsample)) min(100000, nrow(data)) else as.integer(subsample * nrow(data)) nsample <- if (is.null(subsample)) min(100000, nrow(data)) else as.integer(subsample * nrow(data))
idx <- sample(1:nrow(data), nsample) idx <- sample(1:nrow(data), nsample)
data <- data[idx,] data <- data[idx,]
@ -144,13 +144,13 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
stop("top_n: must be an integer within [1, 100]") stop("top_n: must be an integer within [1, 100]")
features <- imp$Feature[1:min(top_n, NROW(imp))] features <- imp$Feature[1:min(top_n, NROW(imp))]
} }
if (is.character(features)) { if (is.character(features)) {
if (is.null(colnames(data))) if (is.null(colnames(data)))
stop("Either provide `data` with column names or provide `features` as column indices") stop("Either provide `data` with column names or provide `features` as column indices")
features <- match(features, colnames(data)) features <- match(features, colnames(data))
} }
if (n_col > length(features)) n_col <- length(features) if (n_col > length(features)) n_col <- length(features)
if (is.list(shap_contrib)) { # multiclass: either choose a class or merge if (is.list(shap_contrib)) { # multiclass: either choose a class or merge
@ -165,7 +165,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
if (is.null(cols)) cols <- paste0('X', 1:ncol(data)) if (is.null(cols)) cols <- paste0('X', 1:ncol(data))
colnames(data) <- cols colnames(data) <- cols
colnames(shap_contrib) <- cols colnames(shap_contrib) <- cols
if (plot && which == "1d") { if (plot && which == "1d") {
op <- par(mfrow = c(ceiling(length(features) / n_col), n_col), op <- par(mfrow = c(ceiling(length(features) / n_col), n_col),
oma = c(0,0,0,0) + 0.2, oma = c(0,0,0,0) + 0.2,
View File
@ -1,44 +1,44 @@
#' eXtreme Gradient Boosting Training #' eXtreme Gradient Boosting Training
#' #'
#' \code{xgb.train} is an advanced interface for training an xgboost model. #' \code{xgb.train} is an advanced interface for training an xgboost model.
#' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}. #' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
#' #'
#' @param params the list of parameters. #' @param params the list of parameters.
#' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}. #' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
#' Below is a shorter summary: #' Below is a shorter summary:
#' #'
#' 1. General Parameters #' 1. General Parameters
#' #'
#' \itemize{ #' \itemize{
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}. #' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
#' } #' }
#' #'
#' 2. Booster Parameters #' 2. Booster Parameters
#' #'
#' 2.1. Parameter for Tree Booster #' 2.1. Parameter for Tree Booster
#' #'
#' \itemize{ #' \itemize{
#' \item \code{eta} controls the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} value makes the model more robust to overfitting but slower to compute. Default: 0.3 #' \item \code{eta} controls the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} value makes the model more robust to overfitting but slower to compute. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be. #' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6 #' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1 #' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1 #' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1 #' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1} accordingly). Default: 1 #' \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1} accordingly). Default: 1
#' \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1} with its length equal to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint. #' \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1} with its length equal to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' \item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints. #' \item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
#' } #' }
#' #'
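#' As a hedged illustration, a tree-booster parameter list touching the fields above
#' (all values are arbitrary):
#'
#' param <- list(booster = "gbtree", eta = 0.1, max_depth = 4,
#'               min_child_weight = 1, subsample = 0.8, colsample_bytree = 0.8)
#'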
#' 2.2. Parameter for Linear Booster #' 2.2. Parameter for Linear Booster
#' #'
#' \itemize{ #' \itemize{
#' \item \code{lambda} L2 regularization term on weights. Default: 0 #' \item \code{lambda} L2 regularization term on weights. Default: 0
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0 #' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0 #' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' } #' }
#' #'
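#' And a matching hedged sketch for the linear booster (values again arbitrary):
#'
#' param <- list(booster = "gblinear", lambda = 0.01, lambda_bias = 0, alpha = 0.001)
#'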
#' 3. Task Parameters #' 3. Task Parameters
#' #'
#' \itemize{ #' \itemize{
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below: #' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \itemize{ #' \itemize{
@ -54,32 +54,32 @@
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5 #' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking). The list is provided in the details section. #' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to the objective (rmse for regression, error for classification, and mean average precision for ranking). The list is provided in the details section.
#' } #' }
#' #'
#' @param data training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input. #' @param data training dataset. \code{xgb.train} accepts only an \code{xgb.DMatrix} as the input.
#' \code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or name of a local data file. #' \code{xgboost}, in addition, also accepts \code{matrix}, \code{dgCMatrix}, or name of a local data file.
#' @param nrounds max number of boosting iterations. #' @param nrounds max number of boosting iterations.
#' @param watchlist named list of xgb.DMatrix datasets to use for evaluating model performance. #' @param watchlist named list of xgb.DMatrix datasets to use for evaluating model performance.
#' Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each #' Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
#' of these datasets during each boosting iteration, and stored in the end as a field named #' of these datasets during each boosting iteration, and stored in the end as a field named
#' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or #' \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
#' \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously #' \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
#' printed out during the training. #' printed out during the training.
#' E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows one to track #' E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows one to track
#' the performance of each round's model on mat1 and mat2. #' the performance of each round's model on mat1 and mat2.
#' @param obj customized objective function. Returns gradient and second order #' @param obj customized objective function. Returns gradient and second order
#' gradient with given prediction and dtrain. #' gradient with given prediction and dtrain.
#' @param feval custimized evaluation function. Returns #' @param feval customized evaluation function. Returns
#' \code{list(metric='metric-name', value='metric-value')} with given #' \code{list(metric='metric-name', value='metric-value')} with given
#' prediction and dtrain. #' prediction and dtrain.
#' @param verbose If 0, xgboost will stay silent. If 1, it will print information about performance. #' @param verbose If 0, xgboost will stay silent. If 1, it will print information about performance.
#' If 2, some additional information will be printed out. #' If 2, some additional information will be printed out.
#' Note that setting \code{verbose > 0} automatically engages the #' Note that setting \code{verbose > 0} automatically engages the
#' \code{cb.print.evaluation(period=1)} callback function. #' \code{cb.print.evaluation(period=1)} callback function.
#' @param print_every_n Print each n-th iteration evaluation messages when \code{verbose>0}. #' @param print_every_n Print each n-th iteration evaluation messages when \code{verbose>0}.
#' Default is 1 which means all messages are printed. This parameter is passed to the #' Default is 1 which means all messages are printed. This parameter is passed to the
#' \code{\link{cb.print.evaluation}} callback. #' \code{\link{cb.print.evaluation}} callback.
#' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered. #' @param early_stopping_rounds If \code{NULL}, the early stopping function is not triggered.
#' If set to an integer \code{k}, training with a validation set will stop if the performance #' If set to an integer \code{k}, training with a validation set will stop if the performance
#' doesn't improve for \code{k} rounds. #' doesn't improve for \code{k} rounds.
#' Setting this parameter engages the \code{\link{cb.early.stop}} callback. #' Setting this parameter engages the \code{\link{cb.early.stop}} callback.
#' @param maximize If \code{feval} and \code{early_stopping_rounds} are set, #' @param maximize If \code{feval} and \code{early_stopping_rounds} are set,
@ -90,35 +90,35 @@
#' 0 means save at the end. The saving is handled by the \code{\link{cb.save.model}} callback. #' 0 means save at the end. The saving is handled by the \code{\link{cb.save.model}} callback.
#' @param save_name the name or path for periodically saved model file. #' @param save_name the name or path for periodically saved model file.
#' @param xgb_model a previously built model to continue the training from. #' @param xgb_model a previously built model to continue the training from.
#' Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a #' Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a
#' file with a previously saved model. #' file with a previously saved model.
#' @param callbacks a list of callback functions to perform various tasks during boosting. #' @param callbacks a list of callback functions to perform various tasks during boosting.
#' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the #' See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
#' parameters' values. Users can provide either existing or their own callback methods in order #' parameters' values. Users can provide either existing or their own callback methods in order
#' to customize the training process. #' to customize the training process.
#' @param ... other parameters to pass to \code{params}. #' @param ... other parameters to pass to \code{params}.
#' @param label vector of response values. Should not be provided when data is #' @param label vector of response values. Should not be provided when data is
#' a local data file name or an \code{xgb.DMatrix}. #' a local data file name or an \code{xgb.DMatrix}.
#' @param missing by default is set to NA, which means that NA values should be considered as 'missing' #' @param missing by default is set to NA, which means that NA values should be considered as 'missing'
#' by the algorithm. Sometimes, 0 or other extreme value might be used to represent missing values. #' by the algorithm. Sometimes, 0 or other extreme value might be used to represent missing values.
#' This parameter is only used when input is a dense matrix. #' This parameter is only used when input is a dense matrix.
#' @param weight a vector indicating the weight for each row of the input. #' @param weight a vector indicating the weight for each row of the input.
#' #'
#' @details #' @details
#' These are the training functions for \code{xgboost}. #' These are the training functions for \code{xgboost}.
#' #'
#' The \code{xgb.train} interface supports advanced features such as \code{watchlist}, #' The \code{xgb.train} interface supports advanced features such as \code{watchlist},
#' customized objective and evaluation metric functions, therefore it is more flexible #' customized objective and evaluation metric functions, therefore it is more flexible
#' than the \code{xgboost} interface. #' than the \code{xgboost} interface.
#' #'
#' Parallelization is automatically enabled if \code{OpenMP} is present. #' Parallelization is automatically enabled if \code{OpenMP} is present.
#' The number of threads can also be manually specified via the \code{nthread} parameter. #' The number of threads can also be manually specified via the \code{nthread} parameter.
#' #'
#' The evaluation metric is chosen automatically by Xgboost (according to the objective) #' The evaluation metric is chosen automatically by Xgboost (according to the objective)
#' when the \code{eval_metric} parameter is not provided. #' when the \code{eval_metric} parameter is not provided.
#' User may set one or several \code{eval_metric} parameters. #' User may set one or several \code{eval_metric} parameters.
#' Note that when using a customized metric, only this single metric can be used. #' Note that when using a customized metric, only this single metric can be used.
#' The folloiwing is the list of built-in metrics for which Xgboost provides optimized implementation: #' The following is the list of built-in metrics for which Xgboost provides optimized implementation:
#' \itemize{ #' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} #' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} #' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
@ -131,7 +131,7 @@
#' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation. #' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG} #' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
#' } #' }
#' #'
#' The following callbacks are automatically created when certain parameters are set: #' The following callbacks are automatically created when certain parameters are set:
#' \itemize{ #' \itemize{
#' \item \code{cb.print.evaluation} is turned on when \code{verbose > 0}; #' \item \code{cb.print.evaluation} is turned on when \code{verbose > 0};
@ -140,38 +140,38 @@
#' \item \code{cb.early.stop}: when \code{early_stopping_rounds} is set. #' \item \code{cb.early.stop}: when \code{early_stopping_rounds} is set.
#' \item \code{cb.save.model}: when \code{save_period > 0} is set. #' \item \code{cb.save.model}: when \code{save_period > 0} is set.
#' } #' }
#' #'
#' @return #' @return
#' An object of class \code{xgb.Booster} with the following elements: #' An object of class \code{xgb.Booster} with the following elements:
#' \itemize{ #' \itemize{
#' \item \code{handle} a handle (pointer) to the xgboost model in memory. #' \item \code{handle} a handle (pointer) to the xgboost model in memory.
#' \item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type. #' \item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type.
#' \item \code{niter} number of boosting iterations. #' \item \code{niter} number of boosting iterations.
#' \item \code{evaluation_log} evaluation history storead as a \code{data.table} with the #' \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
#' first column corresponding to iteration number and the rest corresponding to evaluation #' first column corresponding to iteration number and the rest corresponding to evaluation
#' metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback. #' metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback.
#' \item \code{call} a function call. #' \item \code{call} a function call.
#' \item \code{params} parameters that were passed to the xgboost library. Note that it does not #' \item \code{params} parameters that were passed to the xgboost library. Note that it does not
#' capture parameters changed by the \code{\link{cb.reset.parameters}} callback. #' capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
#' \item \code{callbacks} callback functions that were either automatically assigned or #' \item \code{callbacks} callback functions that were either automatically assigned or
#' explicitely passed. #' explicitly passed.
#' \item \code{best_iteration} iteration number with the best evaluation metric value #' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping). #' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration, #' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method #' which could further be used in \code{predict} method
#' (only available with early stopping). #' (only available with early stopping).
#' \item \code{best_score} the best evaluation metric value during early stopping #' \item \code{best_score} the best evaluation metric value during early stopping
#' (only available with early stopping). #' (only available with early stopping).
#' \item \code{feature_names} names of the training dataset features #' \item \code{feature_names} names of the training dataset features
#' (only when comun names were defined in training data). #' (only when column names were defined in training data).
#' \item \code{nfeatures} number of features in training data. #' \item \code{nfeatures} number of features in training data.
#' } #' }
#' #'
#' @seealso #' @seealso
#' \code{\link{callbacks}}, #' \code{\link{callbacks}},
#' \code{\link{predict.xgb.Booster}}, #' \code{\link{predict.xgb.Booster}},
#' \code{\link{xgb.cv}} #' \code{\link{xgb.cv}}
#' #'
#' @references #' @references
#' #'
#' Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System", #' Tianqi Chen and Carlos Guestrin, "XGBoost: A Scalable Tree Boosting System",
@ -180,17 +180,17 @@
#' @examples #' @examples
#' data(agaricus.train, package='xgboost') #' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost') #' data(agaricus.test, package='xgboost')
#' #'
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label) #' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label) #' dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
#' watchlist <- list(train = dtrain, eval = dtest) #' watchlist <- list(train = dtrain, eval = dtest)
#' #'
#' ## A simple xgb.train example: #' ## A simple xgb.train example:
#' param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, #' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = "binary:logistic", eval_metric = "auc") #' objective = "binary:logistic", eval_metric = "auc")
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist) #' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#' #'
#' #'
#' ## An xgb.train example where custom objective and evaluation metric are used: #' ## An xgb.train example where custom objective and evaluation metric are used:
#' logregobj <- function(preds, dtrain) { #' logregobj <- function(preds, dtrain) {
#' labels <- getinfo(dtrain, "label") #' labels <- getinfo(dtrain, "label")
@ -204,58 +204,58 @@
#' err <- as.numeric(sum(labels != (preds > 0)))/length(labels) #' err <- as.numeric(sum(labels != (preds > 0)))/length(labels)
#' return(list(metric = "error", value = err)) #' return(list(metric = "error", value = err))
#' } #' }
#' #'
#' # These functions could be used by passing them either: #' # These functions could be used by passing them either:
#' # as 'objective' and 'eval_metric' parameters in the params list: #' # as 'objective' and 'eval_metric' parameters in the params list:
#' param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, #' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = logregobj, eval_metric = evalerror) #' objective = logregobj, eval_metric = evalerror)
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist) #' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
#' #'
#' # or through the ... arguments: #' # or through the ... arguments:
#' param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2) #' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, #' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' objective = logregobj, eval_metric = evalerror) #' objective = logregobj, eval_metric = evalerror)
#' #'
#' # or as dedicated 'obj' and 'feval' parameters of xgb.train: #' # or as dedicated 'obj' and 'feval' parameters of xgb.train:
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, #' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' obj = logregobj, feval = evalerror) #' obj = logregobj, feval = evalerror)
#' #'
#' #'
#' ## An xgb.train example of using variable learning rates at each iteration: #' ## An xgb.train example of using variable learning rates at each iteration:
#' param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 2, #' param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
#' objective = "binary:logistic", eval_metric = "auc") #' objective = "binary:logistic", eval_metric = "auc")
#' my_etas <- list(eta = c(0.5, 0.1)) #' my_etas <- list(eta = c(0.5, 0.1))
#' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, #' bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
#' callbacks = list(cb.reset.parameters(my_etas))) #' callbacks = list(cb.reset.parameters(my_etas)))
#' #'
#' ## Early stopping: #' ## Early stopping:
#' bst <- xgb.train(param, dtrain, nrounds = 25, watchlist, #' bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
#' early_stopping_rounds = 3) #' early_stopping_rounds = 3)
#' #'
#' ## An 'xgboost' interface example: #' ## An 'xgboost' interface example:
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, #' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
#' max_depth = 2, eta = 1, nthread = 2, nrounds = 2, #' max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
#' objective = "binary:logistic") #' objective = "binary:logistic")
#' pred <- predict(bst, agaricus.test$data) #' pred <- predict(bst, agaricus.test$data)
#' #'
#' @rdname xgb.train #' @rdname xgb.train
#' @export #' @export
xgb.train <- function(params = list(), data, nrounds, watchlist = list(), xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
obj = NULL, feval = NULL, verbose = 1, print_every_n = 1L, obj = NULL, feval = NULL, verbose = 1, print_every_n = 1L,
early_stopping_rounds = NULL, maximize = NULL, early_stopping_rounds = NULL, maximize = NULL,
save_period = NULL, save_name = "xgboost.model", save_period = NULL, save_name = "xgboost.model",
xgb_model = NULL, callbacks = list(), ...) { xgb_model = NULL, callbacks = list(), ...) {
check.deprecation(...) check.deprecation(...)
params <- check.booster.params(params, ...) params <- check.booster.params(params, ...)
check.custom.obj() check.custom.obj()
check.custom.eval() check.custom.eval()
# data & watchlist checks # data & watchlist checks
dtrain <- data dtrain <- data
if (!inherits(dtrain, "xgb.DMatrix")) if (!inherits(dtrain, "xgb.DMatrix"))
stop("second argument dtrain must be xgb.DMatrix") stop("second argument dtrain must be xgb.DMatrix")
if (length(watchlist) > 0) { if (length(watchlist) > 0) {
if (typeof(watchlist) != "list" || if (typeof(watchlist) != "list" ||
@ -288,7 +288,7 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
stop_condition <- FALSE stop_condition <- FALSE
if (!is.null(early_stopping_rounds) && if (!is.null(early_stopping_rounds) &&
!has.callbacks(callbacks, 'cb.early.stop')) { !has.callbacks(callbacks, 'cb.early.stop')) {
callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
maximize = maximize, verbose = verbose)) maximize = maximize, verbose = verbose))
} }
# Sort the callbacks into categories # Sort the callbacks into categories
@ -318,22 +318,22 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
# TODO: distributed code # TODO: distributed code
rank <- 0 rank <- 0
niter_skip <- ifelse(is_update, 0, niter_init) niter_skip <- ifelse(is_update, 0, niter_init)
begin_iteration <- niter_skip + 1 begin_iteration <- niter_skip + 1
end_iteration <- niter_skip + nrounds end_iteration <- niter_skip + nrounds
# the main loop for boosting iterations # the main loop for boosting iterations
for (iteration in begin_iteration:end_iteration) { for (iteration in begin_iteration:end_iteration) {
for (f in cb$pre_iter) f() for (f in cb$pre_iter) f()
xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) xgb.iter.update(bst$handle, dtrain, iteration - 1, obj)
bst_evaluation <- numeric(0) bst_evaluation <- numeric(0)
if (length(watchlist) > 0) if (length(watchlist) > 0)
bst_evaluation <- xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval) bst_evaluation <- xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval)
xgb.attr(bst$handle, 'niter') <- iteration - 1 xgb.attr(bst$handle, 'niter') <- iteration - 1
for (f in cb$post_iter) f() for (f in cb$post_iter) f()
@ -341,9 +341,9 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
if (stop_condition) break if (stop_condition) break
} }
for (f in cb$finalize) f(finalize = TRUE) for (f in cb$finalize) f(finalize = TRUE)
bst <- xgb.Booster.complete(bst, saveraw = TRUE) bst <- xgb.Booster.complete(bst, saveraw = TRUE)
# store the total number of boosting iterations # store the total number of boosting iterations
bst$niter = end_iteration bst$niter = end_iteration
View File
@ -5,24 +5,24 @@
\title{Callback closures for booster training.} \title{Callback closures for booster training.}
\description{ \description{
These are used to perform various service tasks either during boosting iterations or at the end. These are used to perform various service tasks either during boosting iterations or at the end.
This approach helps to modularize many such tasks without bloating the main training methods. This approach helps to modularize many such tasks without bloating the main training methods.
} }
\details{ \details{
By default, a callback function is run after each boosting iteration. By default, a callback function is run after each boosting iteration.
An R-attribute \code{is_pre_iteration} could be set for a callback to define a pre-iteration function. An R-attribute \code{is_pre_iteration} could be set for a callback to define a pre-iteration function.
When a callback function has \code{finalize} parameter, its finalizer part will also be run after When a callback function has \code{finalize} parameter, its finalizer part will also be run after
the boosting is completed. the boosting is completed.
WARNING: side-effects!!! Be aware that these callback functions access and modify things in WARNING: side-effects!!! Be aware that these callback functions access and modify things in
the environment from which they are called, which is a fairly uncommon thing to do in R. the environment from which they are called, which is a fairly uncommon thing to do in R.
To write a custom callback closure, make sure you first understand the main concepts about R envoronments. To write a custom callback closure, make sure you first understand the main concepts about R environments.
Check either R documentation on \code{\link[base]{environment}} or the Check either R documentation on \code{\link[base]{environment}} or the
\href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R" \href{http://adv-r.had.co.nz/Environments.html}{Environments chapter} from the "Advanced R"
book by Hadley Wickham. Further, the best option is to read the code of some of the existing callbacks - book by Hadley Wickham. Further, the best option is to read the code of some of the existing callbacks -
choose ones that do something similar to what you want to achieve. Also, you would need to get familiar choose ones that do something similar to what you want to achieve. Also, you would need to get familiar
with the objects available inside of the \code{xgb.train} and \code{xgb.cv} internal environments. with the objects available inside of the \code{xgb.train} and \code{xgb.cv} internal environments.
} }
\seealso{ \seealso{
View File
@ -11,11 +11,11 @@ cb.cv.predict(save_models = FALSE)
} }
\value{ \value{
Predictions are returned inside of the \code{pred} element, which is either a vector or a matrix, Predictions are returned inside of the \code{pred} element, which is either a vector or a matrix,
depending on the number of prediction outputs per data row. The order of predictions corresponds depending on the number of prediction outputs per data row. The order of predictions corresponds
to the order of rows in the original dataset. Note that when a custom \code{folds} list is to the order of rows in the original dataset. Note that when a custom \code{folds} list is
provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a provided in \code{xgb.cv}, the predictions would only be returned properly when this list is a
non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be non-overlapping list of k sets of indices, as in a standard k-fold CV. The predictions would not be
meaningful when user-profided folds have overlapping indices as in, e.g., random sampling splits. meaningful when user-provided folds have overlapping indices as in, e.g., random sampling splits.
When some of the indices in the training dataset are not included in user-provided \code{folds}, When some of the indices in the training dataset are not included in user-provided \code{folds},
their prediction value would be \code{NA}. their prediction value would be \code{NA}.
} }
View File
@ -8,15 +8,15 @@ cb.early.stop(stopping_rounds, maximize = FALSE, metric_name = NULL,
verbose = TRUE) verbose = TRUE)
} }
\arguments{ \arguments{
\item{stopping_rounds}{The number of rounds with no improvement in \item{stopping_rounds}{The number of rounds with no improvement in
the evaluation metric in order to stop the training.} the evaluation metric in order to stop the training.}
\item{maximize}{whether to maximize the evaluation metric} \item{maximize}{whether to maximize the evaluation metric}
\item{metric_name}{the name of an evaluation column to use as a criterion for early \item{metric_name}{the name of an evaluation column to use as a criterion for early
stopping. If not set, the last column would be used. stopping. If not set, the last column would be used.
Let's say the test data in \code{watchlist} was labelled as \code{dtest}, Let's say the test data in \code{watchlist} was labelled as \code{dtest},
and one wants to use the AUC in test data for early stopping regardless of where and one wants to use the AUC in test data for early stopping regardless of where
it is in the \code{watchlist}, then one of the following would need to be set: it is in the \code{watchlist}, then one of the following would need to be set:
\code{metric_name='dtest-auc'} or \code{metric_name='dtest_auc'}. \code{metric_name='dtest-auc'} or \code{metric_name='dtest_auc'}.
All dash '-' characters in metric names are considered equivalent to '_'.} All dash '-' characters in metric names are considered equivalent to '_'.}
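
A hedged usage sketch of the naming rule above (watchlist entry labelled dtest;
param, dtrain and dtest as in the xgb.train examples):

bst <- xgb.train(param, dtrain, nrounds = 50,
                 watchlist = list(train = dtrain, dtest = dtest),
                 callbacks = list(cb.early.stop(stopping_rounds = 3,
                                                metric_name = "dtest-auc")))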
@ -27,7 +27,7 @@ All dash '-' characters in metric names are considered equivalent to '_'.}
Callback closure to activate the early stopping. Callback closure to activate the early stopping.
} }
\details{ \details{
This callback function determines the condition for early stopping This callback function determines the condition for early stopping
by setting the \code{stop_condition = TRUE} flag in its calling frame. by setting the \code{stop_condition = TRUE} flag in its calling frame.
The following additional fields are assigned to the model's R object: The following additional fields are assigned to the model's R object:
View File
@ -13,12 +13,12 @@ Callback closure for logging the evaluation history
This callback function appends the current iteration evaluation results \code{bst_evaluation} This callback function appends the current iteration evaluation results \code{bst_evaluation}
available in the calling parent frame to the \code{evaluation_log} list in the calling frame. available in the calling parent frame to the \code{evaluation_log} list in the calling frame.
The finalizer callback (called with \code{finalize = TRUE} at the end) converts The finalizer callback (called with \code{finalize = TRUE} at the end) converts
the \code{evaluation_log} list into a final data.table. the \code{evaluation_log} list into a final data.table.
The iteration evaluation result \code{bst_evaluation} must be a named numeric vector. The iteration evaluation result \code{bst_evaluation} must be a named numeric vector.
Note: in the column names of the final data.table, the dash '-' character is replaced with Note: in the column names of the final data.table, the dash '-' character is replaced with
the underscore '_' in order to make the column names more like regular R identifiers. the underscore '_' in order to make the column names more like regular R identifiers.
Callback function expects the following values to be set in its calling frame: Callback function expects the following values to be set in its calling frame:
View File
@ -2,27 +2,27 @@
% Please edit documentation in R/callbacks.R % Please edit documentation in R/callbacks.R
\name{cb.reset.parameters} \name{cb.reset.parameters}
\alias{cb.reset.parameters} \alias{cb.reset.parameters}
\title{Callback closure for restetting the booster's parameters at each iteration.} \title{Callback closure for resetting the booster's parameters at each iteration.}
\usage{ \usage{
cb.reset.parameters(new_params) cb.reset.parameters(new_params)
} }
\arguments{ \arguments{
\item{new_params}{a list where each element corresponds to a parameter that needs to be reset. \item{new_params}{a list where each element corresponds to a parameter that needs to be reset.
Each element's value must be either a vector of values of length \code{nrounds} Each element's value must be either a vector of values of length \code{nrounds}
to be set at each iteration, to be set at each iteration,
or a function of two parameters \code{learning_rates(iteration, nrounds)} or a function of two parameters \code{learning_rates(iteration, nrounds)}
which returns a new parameter value by using the current iteration number which returns a new parameter value by using the current iteration number
and the total number of boosting rounds.} and the total number of boosting rounds.}
} }
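
A hedged sketch of the two accepted forms for an element of \code{new_params}:

# vector form: one eta value per boosting round (here nrounds = 2)
cb.reset.parameters(list(eta = c(0.5, 0.1)))
# function form: eta decays with the iteration number
cb.reset.parameters(list(eta = function(iteration, nrounds) 0.5 * 0.9^iteration))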
\description{ \description{
Callback closure for restetting the booster's parameters at each iteration. Callback closure for resetting the booster's parameters at each iteration.
} }
\details{ \details{
This is a "pre-iteration" callback function used to reset booster's parameters This is a "pre-iteration" callback function used to reset booster's parameters
at the beginning of each iteration. at the beginning of each iteration.
Note that when training is resumed from some previous model, and a function is used to Note that when training is resumed from some previous model, and a function is used to
reset a parameter value, the \code{nrounds} argument in this function would be reset a parameter value, the \code{nrounds} argument in this function would be
the number of boosting rounds in the current training. the number of boosting rounds in the current training.
Callback function expects the following values to be set in its calling frame: Callback function expects the following values to be set in its calling frame:
View File
@ -7,13 +7,13 @@
cb.save.model(save_period = 0, save_name = "xgboost.model") cb.save.model(save_period = 0, save_name = "xgboost.model")
} }
\arguments{ \arguments{
\item{save_period}{save the model to disk after every \item{save_period}{save the model to disk after every
\code{save_period} iterations; 0 means save the model at the end.} \code{save_period} iterations; 0 means save the model at the end.}
\item{save_name}{the name or path for the saved model file. \item{save_name}{the name or path for the saved model file.
It can contain a \code{\link[base]{sprintf}} formatting specifier It can contain a \code{\link[base]{sprintf}} formatting specifier
to include the integer iteration number in the file name. to include the integer iteration number in the file name.
E.g., with \code{save_name} = 'xgboost_%04d.model', E.g., with \code{save_name} = 'xgboost_%04d.model',
the file saved at iteration 50 would be named "xgboost_0050.model".} the file saved at iteration 50 would be named "xgboost_0050.model".}
} }
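
A concrete call matching the \code{sprintf} note above (file name illustrative):

cb.save.model(save_period = 10, save_name = "xgboost_%04d.model")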
\description{ \description{
View File
@ -13,7 +13,7 @@
Returns a vector of numbers of rows and of columns in an \code{xgb.DMatrix}. Returns a vector of numbers of rows and of columns in an \code{xgb.DMatrix}.
} }
\details{ \details{
Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also Note: since \code{nrow} and \code{ncol} internally use \code{dim}, they can also
be directly used with an \code{xgb.DMatrix} object. be directly used with an \code{xgb.DMatrix} object.
} }
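
A minimal sketch of the equivalence described above (dtrain being any xgb.DMatrix):

dim(dtrain)    # c(number of rows, number of columns)
nrow(dtrain)   # same as dim(dtrain)[1]
ncol(dtrain)   # same as dim(dtrain)[2]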
\examples{ \examples{
View File
@ -16,8 +16,8 @@
and the second one is column names} and the second one is column names}
} }
\description{ \description{
Only column names are supported for \code{xgb.DMatrix}, thus setting of Only column names are supported for \code{xgb.DMatrix}, thus setting of
row names would have no effect and returnten row names would be NULL. row names would have no effect and returned row names would be NULL.
} }
\details{ \details{
Generic \code{dimnames} methods are used by \code{colnames}. Generic \code{dimnames} methods are used by \code{colnames}.
View File
@ -27,7 +27,7 @@ The \code{name} field can be one of the following:
\item \code{weight}: to do a weight rescale; \item \code{weight}: to do a weight rescale;
\item \code{base_margin}: base margin is the base prediction Xgboost will boost from; \item \code{base_margin}: base margin is the base prediction Xgboost will boost from;
\item \code{nrow}: number of rows of the \code{xgb.DMatrix}. \item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
} }
\code{group} can be set by \code{setinfo} but can't be retrieved by \code{getinfo}. \code{group} can be set by \code{setinfo} but can't be retrieved by \code{getinfo}.
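
A hedged sketch of the asymmetry noted above (group_sizes is illustrative):

setinfo(dtrain, "group", group_sizes)   # allowed
labels <- getinfo(dtrain, "label")      # allowed; getinfo(dtrain, "group") is not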
View File
@ -91,7 +91,7 @@ in \url{http://blog.datadive.net/interpreting-random-forests/}.
With \code{predinteraction = TRUE}, SHAP values of contributions of interaction of each pair of features With \code{predinteraction = TRUE}, SHAP values of contributions of interaction of each pair of features
are computed. Note that this operation might be rather expensive in terms of compute and memory. are computed. Note that this operation might be rather expensive in terms of compute and memory.
Since it quadratically depends on the number of features, it is recommended to perfom selection Since it quadratically depends on the number of features, it is recommended to perform selection
of the most important features first. See below about the format of the returned results. of the most important features first. See below about the format of the returned results.
} }
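
A hedged sketch of the call described above, with \code{predinteraction} as
documented for \code{predict.xgb.Booster} (bst and x are illustrative):

inter <- predict(bst, x, predinteraction = TRUE)
dim(inter)   # observations x (features + 1) x (features + 1) for a binary model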
\examples{ \examples{
View File
@ -14,7 +14,7 @@
\item{...}{not currently used} \item{...}{not currently used}
} }
\description{ \description{
Print information about xgb.DMatrix. Print information about xgb.DMatrix.
Currently it displays dimensions and presence of info-fields and colnames. Currently it displays dimensions and presence of info-fields and colnames.
} }
\examples{ \examples{
View File
@ -17,7 +17,7 @@
Prints formatted results of \code{xgb.cv}. Prints formatted results of \code{xgb.cv}.
} }
\details{ \details{
When not verbose, it would only print the evaluation results, When not verbose, it would only print the evaluation results,
including the best iteration (when available). including the best iteration (when available).
} }
\examples{ \examples{
View File
@ -5,7 +5,7 @@
\alias{slice.xgb.DMatrix} \alias{slice.xgb.DMatrix}
\alias{[.xgb.DMatrix} \alias{[.xgb.DMatrix}
\title{Get a new DMatrix containing the specified rows of \title{Get a new DMatrix containing the specified rows of
orginal xgb.DMatrix object} original xgb.DMatrix object}
\usage{ \usage{
slice(object, ...) slice(object, ...)
@ -24,7 +24,7 @@ slice(object, ...)
} }
\description{ \description{
Get a new DMatrix containing the specified rows of Get a new DMatrix containing the specified rows of
orginal xgb.DMatrix object original xgb.DMatrix object
} }
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
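Continuing that example, a minimal sketch of the two equivalent calling forms:

```r
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dsub <- slice(dtrain, 1:42)  # functional form
dsub <- dtrain[1:42, ]       # `[` method, same result
```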
@ -28,7 +28,7 @@ E.g., when an \code{xgb.Booster} model is saved as an R object and then is loade
its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods its handle (pointer) to an internal xgboost model would be invalid. The majority of xgboost methods
should still work for such a model object since those methods would be using should still work for such a model object since those methods would be using
\code{xgb.Booster.complete} internally. However, one might find it to be more efficient to call the \code{xgb.Booster.complete} internally. However, one might find it to be more efficient to call the
\code{xgb.Booster.complete} function explicitely once after loading a model as an R-object. \code{xgb.Booster.complete} function explicitly once after loading a model as an R-object.
That would prevent further repeated implicit reconstruction of an internal booster model. That would prevent further repeated implicit reconstruction of an internal booster model.
} }
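For illustration, a sketch of the explicit call after deserialization (the file name is hypothetical):

```r
library(xgboost)
data(agaricus.test, package = 'xgboost')

bst <- readRDS('xgb_model.rds')   # handle (pointer) is invalid at this point
bst <- xgb.Booster.complete(bst)  # rebuild the internal booster once
pred <- predict(bst, agaricus.test$data)  # no repeated implicit reconstruction
```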
\examples{ \examples{
@ -7,7 +7,7 @@
xgb.DMatrix(data, info = list(), missing = NA, silent = FALSE, ...) xgb.DMatrix(data, info = list(), missing = NA, silent = FALSE, ...)
} }
\arguments{ \arguments{
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character \item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
string representing a filename.} string representing a filename.}
\item{info}{a named list of additional information to store in the \code{xgb.DMatrix} object. \item{info}{a named list of additional information to store in the \code{xgb.DMatrix} object.
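A brief sketch of the documented construction routes (the file path is hypothetical):

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')

# from a dense or sparse matrix, storing the label via `info`
dtrain <- xgb.DMatrix(agaricus.train$data,
                      info = list(label = agaricus.train$label))

# from a text file in libsvm format
# dtrain <- xgb.DMatrix('train.libsvm.txt')
```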
@ -16,7 +16,7 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
\itemize{ \itemize{
\item \code{objective} objective function, common ones are \item \code{objective} objective function, common ones are
\itemize{ \itemize{
\item \code{reg:squarederror} Regression with squared loss. \item \code{reg:squarederror} Regression with squared loss
\item \code{binary:logistic} logistic regression for classification \item \code{binary:logistic} logistic regression for classification
} }
\item \code{eta} step size of each boosting step \item \code{eta} step size of each boosting step
@ -35,11 +35,11 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
\item{label}{vector of response values. Should be provided only when data is an R-matrix.} \item{label}{vector of response values. Should be provided only when data is an R-matrix.}
\item{missing}{is only used when input is a dense matrix. By default is set to NA, which means \item{missing}{is only used when input is a dense matrix. By default is set to NA, which means
that NA values should be considered as 'missing' by the algorithm. that NA values should be considered as 'missing' by the algorithm.
Sometimes, 0 or other extreme value might be used to represent missing values.} Sometimes, 0 or other extreme value might be used to represent missing values.}
\item{prediction}{A logical value indicating whether to return the test fold predictions \item{prediction}{A logical value indicating whether to return the test fold predictions
from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callback.} from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callback.}
\item{showsd}{\code{boolean}, whether to show standard deviation of cross validation} \item{showsd}{\code{boolean}, whether to show standard deviation of cross validation}
@ -56,28 +56,28 @@ from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callb
\item \code{merror} Exact matching error, used to evaluate multi-class classification \item \code{merror} Exact matching error, used to evaluate multi-class classification
}} }}
\item{obj}{customized objective function. Returns gradient and second order \item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain.} gradient with given prediction and dtrain.}
\item{feval}{custimized evaluation function. Returns \item{feval}{customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given \code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.} prediction and dtrain.}
\item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified \item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified
by the values of outcome labels.} by the values of outcome labels.}
\item{folds}{a \code{list} of pre-defined CV folds \item{folds}{a \code{list} of pre-defined CV folds
(each element must be a vector of test fold's indices). When folds are supplied, (each element must be a vector of test fold's indices). When folds are supplied,
the \code{nfold} and \code{stratified} parameters are ignored.} the \code{nfold} and \code{stratified} parameters are ignored.}
\item{verbose}{\code{boolean}, print the statistics during the process} \item{verbose}{\code{boolean}, print the statistics during the process}
\item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}. \item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}.
Default is 1 which means all messages are printed. This parameter is passed to the Default is 1 which means all messages are printed. This parameter is passed to the
\code{\link{cb.print.evaluation}} callback.} \code{\link{cb.print.evaluation}} callback.}
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered. \item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds. doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link{cb.early.stop}} callback.} Setting this parameter engages the \code{\link{cb.early.stop}} callback.}
@ -87,8 +87,8 @@ When it is \code{TRUE}, it means the larger the evaluation score the better.
This parameter is passed to the \code{\link{cb.early.stop}} callback.} This parameter is passed to the \code{\link{cb.early.stop}} callback.}
\item{callbacks}{a list of callback functions to perform various tasks during boosting. \item{callbacks}{a list of callback functions to perform various tasks during boosting.
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. Users can provide either existing or their own callback methods in order parameters' values. Users can provide either existing or their own callback methods in order
to customize the training process.} to customize the training process.}
\item{...}{other parameters to pass to \code{params}.} \item{...}{other parameters to pass to \code{params}.}
@ -97,26 +97,26 @@ to customize the training process.}
An object of class \code{xgb.cv.synchronous} with the following elements: An object of class \code{xgb.cv.synchronous} with the following elements:
\itemize{ \itemize{
\item \code{call} a function call. \item \code{call} a function call.
\item \code{params} parameters that were passed to the xgboost library. Note that it does not \item \code{params} parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link{cb.reset.parameters}} callback. capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
\item \code{callbacks} callback functions that were either automatically assigned or \item \code{callbacks} callback functions that were either automatically assigned or
explicitly passed. explicitly passed.
\item \code{evaluation_log} evaluation history storead as a \code{data.table} with the \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to the first column corresponding to iteration number and the rest corresponding to the
CV-based evaluation means and standard deviations for the training and test CV-sets. CV-based evaluation means and standard deviations for the training and test CV-sets.
It is created by the \code{\link{cb.evaluation.log}} callback. It is created by the \code{\link{cb.evaluation.log}} callback.
\item \code{niter} number of boosting iterations. \item \code{niter} number of boosting iterations.
\item \code{nfeatures} number of features in training data. \item \code{nfeatures} number of features in training data.
\item \code{folds} the list of CV folds' indices - either those passed through the \code{folds} \item \code{folds} the list of CV folds' indices - either those passed through the \code{folds}
parameter or randomly generated. parameter or randomly generated.
\item \code{best_iteration} iteration number with the best evaluation metric value \item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping). (only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration, \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method which could further be used in \code{predict} method
(only available with early stopping). (only available with early stopping).
\item \code{pred} CV prediction values available when \code{prediction} is set. \item \code{pred} CV prediction values available when \code{prediction} is set.
It is either vector or matrix (see \code{\link{cb.cv.predict}}). It is either vector or matrix (see \code{\link{cb.cv.predict}}).
\item \code{models} a list of the CV folds' models. It is only available with the explicit \item \code{models} a list of the CV folds' models. It is only available with the explicit
setting of the \code{cb.cv.predict(save_models = TRUE)} callback. setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
} }
} }
@ -124,9 +124,9 @@ An object of class \code{xgb.cv.synchronous} with the following elements:
The cross validation function of xgboost The cross validation function of xgboost
} }
\details{ \details{
The original sample is randomly partitioned into \code{nfold} equal size subsamples. The original sample is randomly partitioned into \code{nfold} equal size subsamples.
Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data. Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data. The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
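Putting the arguments above together, a hedged sketch of a typical call (assuming `dtrain` carries labels):

```r
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
cv <- xgb.cv(params = list(objective = "binary:logistic", eta = 0.3,
                           max_depth = 3, nthread = 2),
             data = dtrain, nrounds = 20, nfold = 5,
             prediction = TRUE,          # engages cb.cv.predict
             early_stopping_rounds = 3)  # engages cb.early.stop
cv$evaluation_log  # data.table built by cb.evaluation.log
cv$best_iteration  # available because early stopping was enabled
```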
@ -12,7 +12,7 @@ using the \code{cb.gblinear.history()} callback.}
\item{class_index}{zero-based class index to extract the coefficients for only that \item{class_index}{zero-based class index to extract the coefficients for only that
specific class in a multinomial multiclass model. When it is NULL, all the specific class in a multinomial multiclass model. When it is NULL, all the
coeffients are returned. Has no effect in non-multiclass models.} coefficients are returned. Has no effect in non-multiclass models.}
} }
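To ground the arguments, a sketch of collecting and extracting a gblinear coefficient path (binomial objective, so `class_index` stays NULL; parameter values are illustrative):

```r
param <- list(booster = "gblinear", objective = "binary:logistic",
              eta = 0.5, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 10, list(train = dtrain),
                 callbacks = list(cb.gblinear.history()))
coef_path <- xgb.gblinear.history(bst)  # one row per boosting round
matplot(coef_path, type = 'l')          # trace each coefficient over rounds
```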
\value{ \value{
For an \code{xgb.train} result, a matrix (either dense or sparse) with the columns For an \code{xgb.train} result, a matrix (either dense or sparse) with the columns
@ -17,13 +17,13 @@ xgb.plot.importance(importance_matrix = NULL, top_n = NULL,
\item{top_n}{maximal number of top features to include into the plot.} \item{top_n}{maximal number of top features to include into the plot.}
\item{measure}{the name of importance measure to plot. \item{measure}{the name of importance measure to plot.
When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.} When \code{NULL}, 'Gain' would be used for trees and 'Weight' would be used for gblinear.}
\item{rel_to_first}{whether importance values should be represented as relative to the highest ranked feature. \item{rel_to_first}{whether importance values should be represented as relative to the highest ranked feature.
See Details.} See Details.}
\item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range \item{n_clusters}{(ggplot only) a \code{numeric} vector containing the min and the max range
of the possible number of clusters of bars.} of the possible number of clusters of bars.}
\item{...}{other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).} \item{...}{other parameters passed to \code{barplot} (except horiz, border, cex.names, names.arg, and las).}
@ -33,7 +33,7 @@ When it is NULL, the existing \code{par('mar')} is used.}
\item{cex}{(base R barplot) passed as \code{cex.names} parameter to \code{barplot}.} \item{cex}{(base R barplot) passed as \code{cex.names} parameter to \code{barplot}.}
\item{plot}{(base R barplot) whether a barplot should be produced. \item{plot}{(base R barplot) whether a barplot should be produced.
If FALSE, only a data.table is returned.} If FALSE, only a data.table is returned.}
} }
\value{ \value{
@ -53,14 +53,14 @@ Features are shown ranked in a decreasing importance order.
It works for importances from both \code{gblinear} and \code{gbtree} models. It works for importances from both \code{gblinear} and \code{gbtree} models.
When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}. When \code{rel_to_first = FALSE}, the values would be plotted as they were in \code{importance_matrix}.
For gbtree model, that would mean being normalized to the total of 1 For gbtree model, that would mean being normalized to the total of 1
("what is feature's importance contribution relative to the whole model?"). ("what is feature's importance contribution relative to the whole model?").
For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients. For linear models, \code{rel_to_first = FALSE} would show actual values of the coefficients.
Setting \code{rel_to_first = TRUE} allows one to see the picture from the perspective of Setting \code{rel_to_first = TRUE} allows one to see the picture from the perspective of
"what is feature's importance contribution relative to the most important feature?" "what is feature's importance contribution relative to the most important feature?"
The ggplot-backend method also performs 1-D custering of the importance values, The ggplot-backend method also performs 1-D clustering of the importance values,
with bar colors coresponding to different clusters that have somewhat similar importance values. with bar colors corresponding to different clusters that have somewhat similar importance values.
} }
\examples{ \examples{
data(agaricus.train) data(agaricus.train)
@ -15,7 +15,7 @@ xgb.plot.shap(data, shap_contrib = NULL, features = NULL, top_n = 1,
\arguments{ \arguments{
\item{data}{data as a \code{matrix} or \code{dgCMatrix}.} \item{data}{data as a \code{matrix} or \code{dgCMatrix}.}
\item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above \item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above
\code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.} \code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.}
\item{features}{a vector of either column indices or of feature names to plot. When it is NULL, \item{features}{a vector of either column indices or of feature names to plot. When it is NULL,
@ -63,7 +63,7 @@ more than 5 distinct values.}
\item{col_loess}{a color to use for the loess curves.} \item{col_loess}{a color to use for the loess curves.}
\item{span_loess}{the \code{span} paramerer in \code{\link[stats]{loess}}'s call.} \item{span_loess}{the \code{span} parameter in \code{\link[stats]{loess}}'s call.}
\item{which}{whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.} \item{which}{whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.}
@ -104,7 +104,7 @@ a meaningful thing to do.
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost') data(agaricus.test, package='xgboost')
bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50, bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
eta = 0.1, max_depth = 3, subsample = .5, eta = 0.1, max_depth = 3, subsample = .5,
method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0) method = "hist", objective = "binary:logistic", nthread = 2, verbose = 0)
@ -18,7 +18,7 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
...) ...)
} }
\arguments{ \arguments{
\item{params}{the list of parameters. \item{params}{the list of parameters.
The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}. The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}.
Below is a shorter summary: Below is a shorter summary:
@ -27,31 +27,32 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
\itemize{ \itemize{
\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}. \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}.
} }
2. Booster Parameters 2. Booster Parameters
2.1. Parameter for Tree Booster 2.1. Parameter for Tree Booster
\itemize{ \itemize{
\item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3 \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be. \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
\item \code{max_depth} maximum depth of a tree. Default: 6 \item \code{max_depth} maximum depth of a tree. Default: 6
\item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1 \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1 \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1 \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1 \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
\item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1} whose length equals the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint. \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1} whose length equals the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
\item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
} }
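To make the two constraint parameters concrete, a hypothetical three-feature setup (values purely illustrative; feature indices in \code{interaction_constraints} are zero-based, as noted above):

```r
params <- list(
  max_depth = 4, eta = 0.1,
  # feature 1 constrained increasing, feature 2 decreasing, feature 3 free
  monotone_constraints = c(1, -1, 0),
  # features {0, 1} may interact with each other; feature 2 only by itself
  interaction_constraints = list(c(0, 1), c(2))
)
```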
2.2. Parameter for Linear Booster 2.2. Parameter for Linear Booster
\itemize{ \itemize{
\item \code{lambda} L2 regularization term on weights. Default: 0 \item \code{lambda} L2 regularization term on weights. Default: 0
\item \code{lambda_bias} L2 regularization term on bias. Default: 0 \item \code{lambda_bias} L2 regularization term on bias. Default: 0
\item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0 \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
} }
3. Task Parameters 3. Task Parameters
\itemize{ \itemize{
\item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below: \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
@ -76,31 +77,31 @@ xgboost(data = NULL, label = NULL, missing = NA, weight = NULL,
\item{watchlist}{named list of xgb.DMatrix datasets to use for evaluating model performance. \item{watchlist}{named list of xgb.DMatrix datasets to use for evaluating model performance.
Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each Metrics specified in either \code{eval_metric} or \code{feval} will be computed for each
of these datasets during each boosting iteration, and stored in the end as a field named of these datasets during each boosting iteration, and stored in the end as a field named
\code{evaluation_log} in the resulting object. When either \code{verbose>=1} or \code{evaluation_log} in the resulting object. When either \code{verbose>=1} or
\code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously \code{\link{cb.print.evaluation}} callback is engaged, the performance results are continuously
printed out during the training. printed out during the training.
E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows tracking E.g., specifying \code{watchlist=list(validation1=mat1, validation2=mat2)} allows tracking
the performance of each round's model on mat1 and mat2.} the performance of each round's model on mat1 and mat2.}
\item{obj}{customized objective function. Returns gradient and second order \item{obj}{customized objective function. Returns gradient and second order
gradient with given prediction and dtrain.} gradient with given prediction and dtrain.}
\item{feval}{custimized evaluation function. Returns \item{feval}{customized evaluation function. Returns
\code{list(metric='metric-name', value='metric-value')} with given \code{list(metric='metric-name', value='metric-value')} with given
prediction and dtrain.} prediction and dtrain.}
\item{verbose}{If 0, xgboost will stay silent. If 1, it will print information about performance. \item{verbose}{If 0, xgboost will stay silent. If 1, it will print information about performance.
If 2, some additional information will be printed out. If 2, some additional information will be printed out.
Note that setting \code{verbose > 0} automatically engages the Note that setting \code{verbose > 0} automatically engages the
\code{cb.print.evaluation(period=1)} callback function.} \code{cb.print.evaluation(period=1)} callback function.}
\item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}. \item{print_every_n}{Print each n-th iteration evaluation messages when \code{verbose>0}.
Default is 1 which means all messages are printed. This parameter is passed to the Default is 1 which means all messages are printed. This parameter is passed to the
\code{\link{cb.print.evaluation}} callback.} \code{\link{cb.print.evaluation}} callback.}
\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered. \item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered.
If set to an integer \code{k}, training with a validation set will stop if the performance If set to an integer \code{k}, training with a validation set will stop if the performance
doesn't improve for \code{k} rounds. doesn't improve for \code{k} rounds.
Setting this parameter engages the \code{\link{cb.early.stop}} callback.} Setting this parameter engages the \code{\link{cb.early.stop}} callback.}
@ -115,17 +116,17 @@ This parameter is passed to the \code{\link{cb.early.stop}} callback.}
\item{save_name}{the name or path for periodically saved model file.} \item{save_name}{the name or path for periodically saved model file.}
\item{xgb_model}{a previously built model to continue the training from. \item{xgb_model}{a previously built model to continue the training from.
Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a Could be either an object of class \code{xgb.Booster}, or its raw data, or the name of a
file with a previously saved model.} file with a previously saved model.}
\item{callbacks}{a list of callback functions to perform various tasks during boosting. \item{callbacks}{a list of callback functions to perform various tasks during boosting.
See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
parameters' values. Users can provide either existing or their own callback methods in order parameters' values. Users can provide either existing or their own callback methods in order
to customize the training process.} to customize the training process.}
\item{...}{other parameters to pass to \code{params}.} \item{...}{other parameters to pass to \code{params}.}
\item{label}{vector of response values. Should not be provided when data is \item{label}{vector of response values. Should not be provided when data is
a local data file name or an \code{xgb.DMatrix}.} a local data file name or an \code{xgb.DMatrix}.}
\item{missing}{by default is set to NA, which means that NA values should be considered as 'missing' \item{missing}{by default is set to NA, which means that NA values should be considered as 'missing'
@ -140,23 +141,23 @@ An object of class \code{xgb.Booster} with the following elements:
\item \code{handle} a handle (pointer) to the xgboost model in memory. \item \code{handle} a handle (pointer) to the xgboost model in memory.
\item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type. \item \code{raw} a cached memory dump of the xgboost model saved as R's \code{raw} type.
\item \code{niter} number of boosting iterations. \item \code{niter} number of boosting iterations.
\item \code{evaluation_log} evaluation history storead as a \code{data.table} with the \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
first column corresponding to iteration number and the rest corresponding to evaluation first column corresponding to iteration number and the rest corresponding to evaluation
metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback. metrics' values. It is created by the \code{\link{cb.evaluation.log}} callback.
\item \code{call} a function call. \item \code{call} a function call.
\item \code{params} parameters that were passed to the xgboost library. Note that it does not \item \code{params} parameters that were passed to the xgboost library. Note that it does not
capture parameters changed by the \code{\link{cb.reset.parameters}} callback. capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
\item \code{callbacks} callback functions that were either automatically assigned or \item \code{callbacks} callback functions that were either automatically assigned or
explicitely passed. explicitly passed.
\item \code{best_iteration} iteration number with the best evaluation metric value \item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping). (only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration, \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method which could further be used in \code{predict} method
(only available with early stopping). (only available with early stopping).
\item \code{best_score} the best evaluation metric value during early stopping. \item \code{best_score} the best evaluation metric value during early stopping.
(only available with early stopping). (only available with early stopping).
\item \code{feature_names} names of the training dataset features \item \code{feature_names} names of the training dataset features
(only when comun names were defined in training data). (only when column names were defined in training data).
\item \code{nfeatures} number of features in training data. \item \code{nfeatures} number of features in training data.
} }
} }
@ -165,20 +166,20 @@ An object of class \code{xgb.Booster} with the following elements:
The \code{xgboost} function is a simpler wrapper for \code{xgb.train}. The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
} }
\details{ \details{
These are the training functions for \code{xgboost}. These are the training functions for \code{xgboost}.
The \code{xgb.train} interface supports advanced features such as \code{watchlist}, The \code{xgb.train} interface supports advanced features such as \code{watchlist},
customized objective and evaluation metric functions, therefore it is more flexible customized objective and evaluation metric functions, therefore it is more flexible
than the \code{xgboost} interface. than the \code{xgboost} interface.
Parallelization is automatically enabled if \code{OpenMP} is present. Parallelization is automatically enabled if \code{OpenMP} is present.
Number of threads can also be manually specified via \code{nthread} parameter. Number of threads can also be manually specified via \code{nthread} parameter.
The evaluation metric is chosen automatically by Xgboost (according to the objective) The evaluation metric is chosen automatically by Xgboost (according to the objective)
when the \code{eval_metric} parameter is not provided. when the \code{eval_metric} parameter is not provided.
User may set one or several \code{eval_metric} parameters. User may set one or several \code{eval_metric} parameters.
Note that when using a customized metric, only this single metric can be used. Note that when using a customized metric, only this single metric can be used.
The folloiwing is the list of built-in metrics for which Xgboost provides optimized implementation: The following is the list of built-in metrics for which Xgboost provides optimized implementation:
\itemize{ \itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
@ -210,7 +211,7 @@ dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
watchlist <- list(train = dtrain, eval = dtest) watchlist <- list(train = dtrain, eval = dtest)
## A simple xgb.train example: ## A simple xgb.train example:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2, param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc") objective = "binary:logistic", eval_metric = "auc")
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist) bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
@ -231,12 +232,12 @@ evalerror <- function(preds, dtrain) {
# These functions could be used by passing them either: # These functions could be used by passing them either:
# as 'objective' and 'eval_metric' parameters in the params list: # as 'objective' and 'eval_metric' parameters in the params list:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2, param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = logregobj, eval_metric = evalerror) objective = logregobj, eval_metric = evalerror)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist) bst <- xgb.train(param, dtrain, nrounds = 2, watchlist)
# or through the ... arguments: # or through the ... arguments:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2) param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
objective = logregobj, eval_metric = evalerror) objective = logregobj, eval_metric = evalerror)
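A further sketch along the same lines: continuing training from an existing booster via the \code{xgb_model} argument (reusing \code{param}, \code{dtrain}, \code{watchlist} and \code{bst} from above):

```r
bst2 <- xgb.train(param, dtrain, nrounds = 5, watchlist,
                  xgb_model = bst)  # resumes from bst's trees
```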
@ -246,7 +247,7 @@ bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
## An xgb.train example of using variable learning rates at each iteration: ## An xgb.train example of using variable learning rates at each iteration:
param <- list(max_depth = 2, eta = 1, verbosity = 0, nthread = 2, param <- list(max_depth = 2, eta = 1, verbose = 0, nthread = 2,
objective = "binary:logistic", eval_metric = "auc") objective = "binary:logistic", eval_metric = "auc")
my_etas <- list(eta = c(0.5, 0.1)) my_etas <- list(eta = c(0.5, 0.1))
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, bst <- xgb.train(param, dtrain, nrounds = 2, watchlist,
@ -257,8 +258,8 @@ bst <- xgb.train(param, dtrain, nrounds = 25, watchlist,
early_stopping_rounds = 3) early_stopping_rounds = 3)
## An 'xgboost' interface example: ## An 'xgboost' interface example:
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
max_depth = 2, eta = 1, nthread = 2, nrounds = 2, max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
objective = "binary:logistic") objective = "binary:logistic")
pred <- predict(bst, agaricus.test$data) pred <- predict(bst, agaricus.test$data)
@ -10,7 +10,7 @@ The deprecated parameters would be removed in the next release.
\details{ \details{
To see all the current deprecated and new parameters, check the \code{xgboost:::depr_par_lut} table. To see all the current deprecated and new parameters, check the \code{xgboost:::depr_par_lut} table.
A deprecation warning is shown when any of the deprecated parameters is used in a call. A deprecation warning is shown when any of the deprecated parameters is used in a call.
An additional warning is shown when there was a partial match to a deprecated parameter An additional warning is shown when there was a partial match to a deprecated parameter
(as R is able to partially match parameter names). (as R is able to partially match parameter names).
} }
@ -138,7 +138,7 @@ levels(df[,Treatment])
Next step, we will transform the categorical data to dummy variables. Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach. Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it producess "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)). We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`. The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
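A minimal sketch of that encoding step (assuming the vignette's `df` with the `Improved` outcome column; the `-1` drops the intercept so every level gets its own binary column):

```{r}
library(Matrix)
sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
head(colnames(sparse_matrix))
```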
@ -268,7 +268,7 @@ c2 <- chisq.test(df$Age, output_vector)
print(c2) print(c2)
``` ```
Pearson correlation between Age and illness disapearing is **`r round(c2$statistic, 2 )`**. The chi-squared statistic between Age and illness disappearing is **`r round(c2$statistic, 2 )`**.
```{r, warning=FALSE, message=FALSE} ```{r, warning=FALSE, message=FALSE}
c2 <- chisq.test(df$AgeDiscret, output_vector) c2 <- chisq.test(df$AgeDiscret, output_vector)
@ -313,7 +313,7 @@ Until now, all the learnings we have performed were based on boosting trees. **X
bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic") bst <- xgb.train(data=dtrain, booster = "gblinear", max_depth=2, nthread = 2, nrounds=2, watchlist=watchlist, eval_metric = "error", eval_metric = "logloss", objective = "binary:logistic")
``` ```
In this specific case, *linear boosting* gets sligtly better performance metrics than decision trees based algorithm. In this specific case, *linear boosting* gets slightly better performance metrics than a decision-tree-based algorithm.
In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use. In simple cases, this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Because there is no silver bullet, we advise you to check both algorithms with your own datasets to get an idea of what to use.