[R] Provide better guidance for persisting XGBoost model (#5964)
* [R] Provide better guidance for persisting XGBoost model
* Update saving_model.rst
* Add a paragraph about xgb.serialize()
parent bf2990e773
commit 5a2dcd1c33
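The workflow this commit documents can be summarized in a short R sketch (a sketch only: it assumes the xgboost R package of this era, which provides `xgb.serialize()`/`xgb.unserialize()`; file names are illustrative):

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nthread = 2, nrounds = 2,
               objective = "binary:logistic")

# Long-term archival: future XGBoost releases can read these back.
xgb.save(bst, "xgb.model")        # stand-alone binary file
xgb.save(bst, "xgb.model.json")   # JSON format, selected via the file extension
raw_bytes <- xgb.save.raw(bst)    # raw vector, embeddable in other R objects

# Checkpointing within ONE XGBoost version only: xgb.serialize() also captures
# internal configuration, and its format is not stable across releases.
chk <- xgb.serialize(bst)
bst_restored <- xgb.unserialize(chk)
```

`xgb.unserialize()` is the counterpart of `xgb.serialize()`; neither replaces `xgb.save()`/`xgb.save.raw()` for archival, which is the point of the documentation changes below.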
@@ -64,5 +64,5 @@ Imports:
     data.table (>= 1.9.6),
     magrittr (>= 1.5),
     stringi (>= 0.5.2)
-RoxygenNote: 7.1.0
+RoxygenNote: 7.1.1
 SystemRequirements: GNU make, C++14
@@ -308,18 +308,64 @@ xgb.createFolds <- function(y, k = 10)
 #' @name xgboost-deprecated
 NULL
 
-#' Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.
+#' Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
+#' models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.
 #'
-#' It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
-#' the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
-#' is not advisable to use it if the model is to be accessed in the future. If you train a model
-#' with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
-#' guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
-#' accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
-#' explanation, consult the page
+#' It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
+#' \code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
+#' \code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
+#' the model is to be accessed in the future. If you train a model with the current version of
+#' XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
+#' accessible in later releases of XGBoost. To ensure that your model can be accessed in future
+#' releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
+#'
+#' @details
+#' Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
+#' the JSON format by specifying the JSON extension. To read the model back, use
+#' \code{\link{xgb.load}}.
+#'
+#' Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
+#' in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
+#' re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
+#' The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
+#' as part of another R object.
+#'
+#' Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
+#' model but also internal configurations and parameters, and its format is not stable across
+#' multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
+#'
+#' For more details and explanation about model persistence and archival, consult the page
 #' \url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
 #'
-#' @name a-compatibility-note-for-saveRDS
+#' @examples
+#' data(agaricus.train, package='xgboost')
+#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
+#'                eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
+#'
+#' # Save as a stand-alone file; load it with xgb.load()
+#' xgb.save(bst, 'xgb.model')
+#' bst2 <- xgb.load('xgb.model')
+#'
+#' # Save as a stand-alone file (JSON); load it with xgb.load()
+#' xgb.save(bst, 'xgb.model.json')
+#' bst2 <- xgb.load('xgb.model.json')
+#'
+#' # Save as a raw byte vector; load it with xgb.load.raw()
+#' xgb_bytes <- xgb.save.raw(bst)
+#' bst2 <- xgb.load.raw(xgb_bytes)
+#'
+#' # Persist XGBoost model as part of another R object
+#' obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
+#' # Persist the R object. Here, saveRDS() is okay, since it doesn't persist
+#' # xgb.Booster directly. What's being persisted is the future-proof byte representation
+#' # as given by xgb.save.raw().
+#' saveRDS(obj, 'my_object.rds')
+#' # Read back the R object
+#' obj2 <- readRDS('my_object.rds')
+#' # Re-construct xgb.Booster object from the bytes
+#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
+#'
+#' @name a-compatibility-note-for-saveRDS-save
 NULL
 
 # Lookup table for the deprecated parameters bookkeeping
@@ -111,6 +111,8 @@ xgb.get.handle <- function(object) {
 #'              eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
 #' saveRDS(bst, "xgb.model.rds")
 #'
+#' # Warning: The resulting RDS file is only compatible with the current XGBoost version.
+#' # Refer to the section titled "a-compatibility-note-for-saveRDS-save".
 #' bst1 <- readRDS("xgb.model.rds")
 #' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
 #' # the handle is invalid:
@@ -13,7 +13,11 @@
 #'
 #' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}}
 #' or \code{\link[base]{save}}). However, it would then only be compatible with R, and
-#' corresponding R-methods would need to be used to load it.
+#' corresponding R-methods would need to be used to load it. Moreover, persisting the model with
+#' \code{\link[base]{readRDS}} or \code{\link[base]{save}} will cause compatibility problems in
+#' future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
+#' how to persist models in a future-proof way, i.e. to make the model accessible in future
+#' releases of XGBoost.
 #'
 #' @seealso
 #' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}.
R-package/man/a-compatibility-note-for-saveRDS-save.Rd (new file, 62 lines)
@@ -0,0 +1,62 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/utils.R
+\name{a-compatibility-note-for-saveRDS-save}
+\alias{a-compatibility-note-for-saveRDS-save}
+\title{Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
+models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.}
+\description{
+It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
+\code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
+\code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
+the model is to be accessed in the future. If you train a model with the current version of
+XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
+accessible in later releases of XGBoost. To ensure that your model can be accessed in future
+releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
+}
+\details{
+Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
+the JSON format by specifying the JSON extension. To read the model back, use
+\code{\link{xgb.load}}.
+
+Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
+in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
+re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
+The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
+as part of another R object.
+
+Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
+model but also internal configurations and parameters, and its format is not stable across
+multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
+
+For more details and explanation about model persistence and archival, consult the page
+\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
+}
+\examples{
+data(agaricus.train, package='xgboost')
+bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
+               eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
+
+# Save as a stand-alone file; load it with xgb.load()
+xgb.save(bst, 'xgb.model')
+bst2 <- xgb.load('xgb.model')
+
+# Save as a stand-alone file (JSON); load it with xgb.load()
+xgb.save(bst, 'xgb.model.json')
+bst2 <- xgb.load('xgb.model.json')
+
+# Save as a raw byte vector; load it with xgb.load.raw()
+xgb_bytes <- xgb.save.raw(bst)
+bst2 <- xgb.load.raw(xgb_bytes)
+
+# Persist XGBoost model as part of another R object
+obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
+# Persist the R object. Here, saveRDS() is okay, since it doesn't persist
+# xgb.Booster directly. What's being persisted is the future-proof byte representation
+# as given by xgb.save.raw().
+saveRDS(obj, 'my_object.rds')
+# Read back the R object
+obj2 <- readRDS('my_object.rds')
+# Re-construct xgb.Booster object from the bytes
+bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
+
+}
@@ -1,15 +0,0 @@
-% Generated by roxygen2: do not edit by hand
-% Please edit documentation in R/utils.R
-\name{a-compatibility-note-for-saveRDS}
-\alias{a-compatibility-note-for-saveRDS}
-\title{Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.}
-\description{
-It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
-the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
-is not advisable to use it if the model is to be accessed in the future. If you train a model
-with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
-guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
-accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
-explanation, consult the page
-\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
-}
@@ -38,6 +38,8 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_dep
              eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
 saveRDS(bst, "xgb.model.rds")
 
+# Warning: The resulting RDS file is only compatible with the current XGBoost version.
+# Refer to the section titled "a-compatibility-note-for-saveRDS-save".
 bst1 <- readRDS("xgb.model.rds")
 if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
 # the handle is invalid:
@@ -24,9 +24,9 @@ This is the function inspired from the paragraph 3.1 of the paper:
 
 \strong{Practical Lessons from Predicting Clicks on Ads at Facebook}
 
 \emph{(Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yan, xin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers,
 Joaquin Quinonero Candela)}
 
 International Workshop on Data Mining for Online Advertising (ADKDD) - August 24, 2014
 
 \url{https://research.fb.com/publications/practical-lessons-from-predicting-clicks-on-ads-at-facebook/}.
@@ -37,10 +37,10 @@ Extract explaining the method:
 convenient way to implement non-linear and tuple transformations
 of the kind we just described. We treat each individual
 tree as a categorical feature that takes as value the
 index of the leaf an instance ends up falling in. We use
 1-of-K coding of this type of features.
 
 For example, consider the boosted tree model in Figure 1 with 2 subtrees,
 where the first subtree has 3 leafs and the second 2 leafs. If an
 instance ends up in leaf 2 in the first subtree and leaf 1 in
 second subtree, the overall input to the linear classifier will
@@ -28,7 +28,7 @@ xgb.cv(
 )
 }
 \arguments{
 \item{params}{the list of parameters. The complete list of parameters is
 available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
 is a shorter summary:
 \itemize{
@@ -16,14 +16,14 @@ xgb.dump(
 \arguments{
 \item{model}{the model object.}
 
 \item{fname}{the name of the text file where to save the model text dump.
 If not provided or set to \code{NULL}, the model is returned as a \code{character} vector.}
 
 \item{fmap}{feature map file representing feature types.
 Detailed description could be found at
 \url{https://github.com/dmlc/xgboost/wiki/Binary-Classification#dump-model}.
 See demo/ for walkthrough example in R, and
 \url{https://github.com/dmlc/xgboost/blob/master/demo/data/featmap.txt}
 for example Format.}
 
 \item{with_stats}{whether to dump some additional statistics about the splits.
@@ -47,7 +47,7 @@ data(agaricus.train, package='xgboost')
 data(agaricus.test, package='xgboost')
 train <- agaricus.train
 test <- agaricus.test
 bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
                eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
 # save the model in file 'xgb.model.dump'
 dump_path = file.path(tempdir(), 'model.dump')
@@ -22,7 +22,7 @@ Non-null \code{feature_names} could be provided to override those in the model.}
 
 \item{trees}{(only for the gbtree booster) an integer vector of tree indices that should be included
 into the importance calculation. If set to \code{NULL}, all trees of the model are parsed.
 It could be useful, e.g., in multiclass classification to get feature importances
 for each class separately. IMPORTANT: the tree index in xgboost models
 is zero-based (e.g., use \code{trees = 0:4} for first 5 trees).}
 
@@ -37,7 +37,7 @@ For a tree model, a \code{data.table} with the following columns:
 \itemize{
 \item \code{Features} names of the features used in the model;
 \item \code{Gain} represents fractional contribution of each feature to the model based on
 the total gain of this feature's splits. Higher percentage means a more important
 predictive feature.
 \item \code{Cover} metric of the number of observation related to this feature;
 \item \code{Frequency} percentage representing the relative number of times
@@ -51,7 +51,7 @@ A linear model's importance \code{data.table} has the following columns:
 \item \code{Class} (only for multiclass models) class label.
 }
 
 If \code{feature_names} is not provided and \code{model} doesn't have \code{feature_names},
 index of the features will be used instead. Because the index is extracted from the model dump
 (based on C++ code), it starts at 0 (as in C/C++ or Python) instead of 1 (usual in R).
 }
@@ -61,21 +61,21 @@ Creates a \code{data.table} of feature importances in a model.
 \details{
 This function works for both linear and tree models.
 
 For linear models, the importance is the absolute magnitude of linear coefficients.
 For that reason, in order to obtain a meaningful ranking by importance for a linear model,
 the features need to be on the same scale (which you also would want to do when using either
 L1 or L2 regularization).
 }
 \examples{
 
 # binomial classification using gbtree:
 data(agaricus.train, package='xgboost')
 bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
                eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
 xgb.importance(model = bst)
 
 # binomial classification using gblinear:
 bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, booster = "gblinear",
                eta = 0.3, nthread = 1, nrounds = 20, objective = "binary:logistic")
 xgb.importance(model = bst)
 
@@ -20,7 +20,7 @@ Non-null \code{feature_names} could be provided to override those in the model.}
 
 \item{model}{object of class \code{xgb.Booster}}
 
 \item{text}{\code{character} vector previously generated by the \code{xgb.dump}
 function (where parameter \code{with_stats = TRUE} should have been set).
 \code{text} takes precedence over \code{model}.}
 
@@ -53,10 +53,10 @@ The columns of the \code{data.table} are:
 \item \code{Quality}: either the split gain (change in loss) or the leaf value
 \item \code{Cover}: metric related to the number of observation either seen by a split
 or collected by a leaf during training.
 }
 
 When \code{use_int_id=FALSE}, columns "Yes", "No", and "Missing" point to model-wide node identifiers
 in the "ID" column. When \code{use_int_id=TRUE}, those columns point to node identifiers from
 the corresponding trees in the "Node" column.
 }
 \description{
@@ -67,17 +67,17 @@ Parse a boosted tree model text dump into a \code{data.table} structure.
 
 data(agaricus.train, package='xgboost')
 
 bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
                eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
 
 (dt <- xgb.model.dt.tree(colnames(agaricus.train$data), bst))
 
 # This bst model already has feature_names stored with it, so those would be used when
 # feature_names is not set:
 (dt <- xgb.model.dt.tree(model = bst))
 
 # How to match feature names of splits that are following a current 'Yes' branch:
 
 merge(dt, dt[, .(ID, Y.Feature=Feature)], by.x='Yes', by.y='ID', all.x=TRUE)[order(Tree,Node)]
 
 }
@@ -23,7 +23,7 @@ or a data.table result of the \code{xgb.model.dt.tree} function.}
 
 \item{which}{which distribution to plot (see details).}
 
 \item{plot}{(base R barplot) whether a barplot should be produced.
 If FALSE, only a data.table is returned.}
 
 \item{...}{other parameters passed to \code{barplot} or \code{plot}.}
@@ -45,10 +45,10 @@ When \code{which="2x1"}, two distributions with respect to the leaf depth
 are plotted on top of each other:
 \itemize{
 \item the distribution of the number of leafs in a tree model at a certain depth;
 \item the distribution of average weighted number of observations ("cover")
 ending up in leafs at certain depth.
 }
 Those could be helpful in determining sensible ranges of the \code{max_depth}
 and \code{min_child_weight} parameters.
 
 When \code{which="max.depth"} or \code{which="med.depth"}, plots of either maximum or median depth
@@ -60,7 +60,7 @@ The content of each node is organised that way:
 \item \code{Gain} (for split nodes): the information gain metric of a split
 (corresponds to the importance of the node in the model).
 \item \code{Value} (for leafs): the margin value that the leaf may contribute to prediction.
 }
 The tree root nodes also indicate the Tree index (0-based).
 
 The "Yes" branches are marked by the "< split_value" label.
@@ -80,7 +80,7 @@ xgb.plot.tree(model = bst)
 xgb.plot.tree(model = bst, trees = 0, show_node_id = TRUE)
 
 \dontrun{
 # Below is an example of how to save this plot to a file.
 # Note that for `export_graph` to work, the DiagrammeRsvg and rsvg packages must also be installed.
 library(DiagrammeR)
 gr <- xgb.plot.tree(model=bst, trees=0:1, render=FALSE)
@@ -15,21 +15,25 @@ xgb.save(model, fname)
 Save xgboost model to a file in binary format.
 }
 \details{
 This methods allows to save a model in an xgboost-internal binary format which is universal
 among the various xgboost interfaces. In R, the saved model file could be read-in later
 using either the \code{\link{xgb.load}} function or the \code{xgb_model} parameter
 of \code{\link{xgb.train}}.
 
 Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}}
 or \code{\link[base]{save}}). However, it would then only be compatible with R, and
-corresponding R-methods would need to be used to load it.
+corresponding R-methods would need to be used to load it. Moreover, persisting the model with
+\code{\link[base]{readRDS}} or \code{\link[base]{save}} will cause compatibility problems in
+future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
+how to persist models in a future-proof way, i.e. to make the model accessible in future
+releases of XGBoost.
 }
 \examples{
 data(agaricus.train, package='xgboost')
 data(agaricus.test, package='xgboost')
 train <- agaricus.train
 test <- agaricus.test
 bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
                eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
 xgb.save(bst, 'xgb.model')
 bst <- xgb.load('xgb.model')
@ -42,7 +42,7 @@ xgboost(
)
}
\arguments{
\item{params}{the list of parameters. The complete list of parameters is
  available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
  is a shorter summary:
@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model:
``booster.save_model('model.json')``. More details below.

Before we get started, XGBoost is a gradient boosting library with a focus on tree models,
which means that inside XGBoost there are 2 distinct parts:

1. The model, consisting of trees, and
2. Hyperparameters and configurations used for building the model.

If you come from the Deep Learning community, then it should be
clear to you that there are differences between the neural network structures composed of
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.

So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
parameters like the number of input columns in trained trees, and the objective function, which combine
to represent the concept of "model" in XGBoost. As for why we are saving the objective as
part of the model, that's because the objective controls the transformation of the global bias (called
``base_score`` in XGBoost). Users can share this model with others for prediction,
evaluation, or to continue the training with a different set of hyperparameters, etc.

However, this is not the end of the story. There are cases where we need to save something
more than just the model itself. For example, in distributed training, XGBoost performs
checkpointing operations. Or, for some reason, your favorite distributed computing
framework decides to copy the model from one worker to another and continue the training
there. In such cases, the serialisation output is required to contain enough information
to continue the previous training without the user providing any parameters again. We refer to
such a scenario as a **memory snapshot** (or memory based serialisation method) and distinguish it
from the normal model IO operation. Currently, memory snapshot is used in the following places:

* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module.
* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS``
  or ``save``.

Other language bindings are still a work in progress.
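The model/snapshot distinction can be sketched with plain Python's ``pickle`` and a hypothetical stand-in object (``FakeBooster`` below is illustrative only, not the real xgboost ``Booster``; no xgboost installation is assumed):

```python
import pickle

# Hypothetical stand-in for a trained booster: the "model" part (trees and
# objective) plus the training configuration that only a memory snapshot keeps.
class FakeBooster:
    def __init__(self, trees, objective, params):
        self.trees = trees            # model: the trees themselves
        self.objective = objective    # model: controls the base_score transformation
        self.params = params          # configuration: captured only by a snapshot

bst = FakeBooster(trees=["tree0", "tree1"],
                  objective="binary:logistic",
                  params={"eta": 1.0, "max_depth": 2})

# Memory snapshot: pickling captures the complete in-memory state, including
# the training configuration, so training could resume without re-supplying it.
snapshot = pickle.dumps(bst)
restored = pickle.loads(snapshot)
assert restored.params == {"eta": 1.0, "max_depth": 2}
```

The price of this completeness is that the snapshot bytes depend on the internal object layout, which is why XGBoost treats snapshots differently from saved models.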

.. note::
@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective)
a filename with ``.json`` as file extension:

.. code-block:: python
  :caption: Python

  bst.save_model('model_file_name.json')

.. code-block:: r
  :caption: R

  xgb.save(bst, 'model_file_name.json')

To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
parameter. In Python this can be done by:

.. code-block:: python
@ -63,13 +77,33 @@ done by:

Notice the ``filename`` is for the Python intrinsic function ``open``, not for XGBoost. Hence
the parameter ``enable_experimental_json_serialization`` is required to enable the JSON format.

Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
parameters:

.. code-block:: r

  params <- list(enable_experimental_json_serialization = TRUE, ...)
  bst <- xgb.train(params, dtrain, nrounds = 10)
  saveRDS(bst, 'filename.rds')

***************************************************************
A note on backward compatibility of models and memory snapshots
***************************************************************

**We guarantee backward compatibility for models but not for memory snapshots.**

Models (trees and objective) use a stable representation, so that models produced in earlier
versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive
your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R).

On the other hand, a memory snapshot (serialisation) captures many things internal to XGBoost, and its
format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for
checkpointing only, where you persist the complete snapshot of the training configuration so that
you can recover robustly from possible failures and resume the training process. Loading a memory
snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may
not be accessible in later versions of XGBoost.**
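Why a stable model format can honor this guarantee while a snapshot cannot can be sketched in plain Python (the ``Booster`` class below is a hypothetical stand-in with an assumed versioned JSON schema, not the real xgboost API):

```python
import json
import pickle

# Hypothetical sketch: a stable "model" export writes an explicit, versioned
# schema, while a pickle snapshot ties the bytes to the current class layout.
class Booster:  # stand-in, not the real xgboost Booster
    def __init__(self, trees):
        self.trees = trees

    def save_model(self):
        # Stable representation: only explicitly chosen fields, plus a version
        # tag so future readers know how to interpret the payload.
        return json.dumps({"version": 1, "trees": self.trees})

    @classmethod
    def load_model(cls, blob):
        data = json.loads(blob)
        return cls(data["trees"])  # later versions can keep reading "version": 1

bst = Booster(["tree0"])
stable = bst.save_model()       # safe for long-term archival
snapshot = pickle.dumps(bst)    # internal layout; may break across versions
print(Booster.load_model(stable).trees)  # ['tree0']
```

If a later release renames or removes an attribute, ``load_model`` can still translate the old schema, whereas unpickling the old ``snapshot`` would reconstruct an object with a stale layout.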
***************************
Custom objective and metric
***************************
@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is needed.
It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See
comments in the script for more details.

A similar procedure may be used to recover a model persisted in an old RDS file. In R, you are
able to install an older version of XGBoost using the ``remotes`` package:

.. code-block:: r

  library(remotes)
  remotes::install_version("xgboost", "0.90.0.1")  # Install version 0.90.0.1

Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the
``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation.
Now you should be able to use the model in the latest version of XGBoost.

********************************************************
Saving and Loading the internal parameters configuration
********************************************************