[R] Provide better guidance for persisting XGBoost model (#5964)
* [R] Provide better guidance for persisting XGBoost model
* Update saving_model.rst
* Add a paragraph about xgb.serialize()
This commit is contained in:
parent
bf2990e773
commit
5a2dcd1c33
@ -64,5 +64,5 @@ Imports:
|
||||
data.table (>= 1.9.6),
|
||||
magrittr (>= 1.5),
|
||||
stringi (>= 0.5.2)
|
||||
RoxygenNote: 7.1.0
|
||||
RoxygenNote: 7.1.1
|
||||
SystemRequirements: GNU make, C++14
|
||||
|
||||
@ -308,18 +308,64 @@ xgb.createFolds <- function(y, k = 10)
|
||||
#' @name xgboost-deprecated
|
||||
NULL
|
||||
|
||||
#' Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.
|
||||
#' Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
|
||||
#' models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.
|
||||
#'
|
||||
#' It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
|
||||
#' the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
|
||||
#' is not advisable to use it if the model is to be accessed in the future. If you train a model
|
||||
#' with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
|
||||
#' guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
|
||||
#' accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
|
||||
#' explanation, consult the page
|
||||
#' It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
|
||||
#' \code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
|
||||
#' \code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
|
||||
#' the model is to be accessed in the future. If you train a model with the current version of
|
||||
#' XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
|
||||
#' accessible in later releases of XGBoost. To ensure that your model can be accessed in future
|
||||
#' releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
|
||||
#'
|
||||
#' @details
|
||||
#' Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
|
||||
#' the JSON format by specifying the JSON extension. To read the model back, use
|
||||
#' \code{\link{xgb.load}}.
|
||||
#'
|
||||
#' Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
|
||||
#' in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
|
||||
#' re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
|
||||
#' The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
|
||||
#' as part of another R object.
|
||||
#'
|
||||
#' Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
|
||||
#' model but also internal configurations and parameters, and its format is not stable across
|
||||
#' multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
|
||||
#'
|
||||
#' For more details and explanation about model persistence and archival, consult the page
|
||||
#' \url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
|
||||
#'
|
||||
#' @name a-compatibility-note-for-saveRDS
|
||||
#' @examples
|
||||
#' data(agaricus.train, package='xgboost')
|
||||
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
|
||||
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
|
||||
#'
|
||||
#' # Save as a stand-alone file; load it with xgb.load()
|
||||
#' xgb.save(bst, 'xgb.model')
|
||||
#' bst2 <- xgb.load('xgb.model')
|
||||
#'
|
||||
#' # Save as a stand-alone file (JSON); load it with xgb.load()
|
||||
#' xgb.save(bst, 'xgb.model.json')
|
||||
#' bst2 <- xgb.load('xgb.model.json')
|
||||
#'
|
||||
#' # Save as a raw byte vector; load it with xgb.load.raw()
|
||||
#' xgb_bytes <- xgb.save.raw(bst)
|
||||
#' bst2 <- xgb.load.raw(xgb_bytes)
|
||||
#'
|
||||
#' # Persist XGBoost model as part of another R object
|
||||
#' obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
|
||||
#' # Persist the R object. Here, saveRDS() is okay, since it doesn't persist
|
||||
#' # xgb.Booster directly. What's being persisted is the future-proof byte representation
|
||||
#' # as given by xgb.save.raw().
|
||||
#' saveRDS(obj, 'my_object.rds')
|
||||
#' # Read back the R object
|
||||
#' obj2 <- readRDS('my_object.rds')
|
||||
#' # Re-construct xgb.Booster object from the bytes
|
||||
#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
|
||||
#'
|
||||
#' @name a-compatibility-note-for-saveRDS-save
|
||||
NULL
|
||||
|
||||
# Lookup table for the deprecated parameters bookkeeping
|
||||
|
||||
@ -111,6 +111,8 @@ xgb.get.handle <- function(object) {
|
||||
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
|
||||
#' saveRDS(bst, "xgb.model.rds")
|
||||
#'
|
||||
#' # Warning: The resulting RDS file is only compatible with the current XGBoost version.
|
||||
#' # Refer to the section titled "a-compatibility-note-for-saveRDS-save".
|
||||
#' bst1 <- readRDS("xgb.model.rds")
|
||||
#' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
|
||||
#' # the handle is invalid:
|
||||
|
||||
@ -13,7 +13,11 @@
|
||||
#'
|
||||
#' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}}
|
||||
#' or \code{\link[base]{save}}). However, it would then only be compatible with R, and
|
||||
#' corresponding R-methods would need to be used to load it.
|
||||
#' corresponding R-methods would need to be used to load it. Moreover, persisting the model with
|
||||
#' \code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in
|
||||
#' future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
|
||||
#' how to persist models in a future-proof way, i.e. to make the model accessible in future
|
||||
#' releases of XGBoost.
|
||||
#'
|
||||
#' @seealso
|
||||
#' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}.
|
||||
|
||||
62
R-package/man/a-compatibility-note-for-saveRDS-save.Rd
Normal file
@ -0,0 +1,62 @@
|
||||
% Generated by roxygen2: do not edit by hand
|
||||
% Please edit documentation in R/utils.R
|
||||
\name{a-compatibility-note-for-saveRDS-save}
|
||||
\alias{a-compatibility-note-for-saveRDS-save}
|
||||
\title{Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
|
||||
models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.}
|
||||
\description{
|
||||
It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
|
||||
\code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
|
||||
\code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
|
||||
the model is to be accessed in the future. If you train a model with the current version of
|
||||
XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
|
||||
accessible in later releases of XGBoost. To ensure that your model can be accessed in future
|
||||
releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
|
||||
}
|
||||
\details{
|
||||
Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
|
||||
the JSON format by specifying the JSON extension. To read the model back, use
|
||||
\code{\link{xgb.load}}.
|
||||
|
||||
Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
|
||||
in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
|
||||
re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
|
||||
The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
|
||||
as part of another R object.
|
||||
|
||||
Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
|
||||
model but also internal configurations and parameters, and its format is not stable across
|
||||
multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
|
||||
|
||||
For more details and explanation about model persistence and archival, consult the page
|
||||
\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
|
||||
}
|
||||
\examples{
|
||||
data(agaricus.train, package='xgboost')
|
||||
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
|
||||
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
|
||||
|
||||
# Save as a stand-alone file; load it with xgb.load()
|
||||
xgb.save(bst, 'xgb.model')
|
||||
bst2 <- xgb.load('xgb.model')
|
||||
|
||||
# Save as a stand-alone file (JSON); load it with xgb.load()
|
||||
xgb.save(bst, 'xgb.model.json')
|
||||
bst2 <- xgb.load('xgb.model.json')
|
||||
|
||||
# Save as a raw byte vector; load it with xgb.load.raw()
|
||||
xgb_bytes <- xgb.save.raw(bst)
|
||||
bst2 <- xgb.load.raw(xgb_bytes)
|
||||
|
||||
# Persist XGBoost model as part of another R object
|
||||
obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
|
||||
# Persist the R object. Here, saveRDS() is okay, since it doesn't persist
|
||||
# xgb.Booster directly. What's being persisted is the future-proof byte representation
|
||||
# as given by xgb.save.raw().
|
||||
saveRDS(obj, 'my_object.rds')
|
||||
# Read back the R object
|
||||
obj2 <- readRDS('my_object.rds')
|
||||
# Re-construct xgb.Booster object from the bytes
|
||||
bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
|
||||
|
||||
}
|
||||
@ -1,15 +0,0 @@
|
||||
% Generated by roxygen2: do not edit by hand
|
||||
% Please edit documentation in R/utils.R
|
||||
\name{a-compatibility-note-for-saveRDS}
|
||||
\alias{a-compatibility-note-for-saveRDS}
|
||||
\title{Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.}
|
||||
\description{
|
||||
It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
|
||||
the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
|
||||
is not advisable to use it if the model is to be accessed in the future. If you train a model
|
||||
with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
|
||||
guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
|
||||
accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
|
||||
explanation, consult the page
|
||||
\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
|
||||
}
|
||||
@ -38,6 +38,8 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_dep
|
||||
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
|
||||
saveRDS(bst, "xgb.model.rds")
|
||||
|
||||
# Warning: The resulting RDS file is only compatible with the current XGBoost version.
|
||||
# Refer to the section titled "a-compatibility-note-for-saveRDS-save".
|
||||
bst1 <- readRDS("xgb.model.rds")
|
||||
if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
|
||||
# the handle is invalid:
|
||||
|
||||
@ -22,7 +22,11 @@ of \code{\link{xgb.train}}.
|
||||
|
||||
Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}}
|
||||
or \code{\link[base]{save}}). However, it would then only be compatible with R, and
|
||||
corresponding R-methods would need to be used to load it.
|
||||
corresponding R-methods would need to be used to load it. Moreover, persisting the model with
|
||||
\code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in
|
||||
future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
|
||||
how to persist models in a future-proof way, i.e. to make the model accessible in future
|
||||
releases of XGBoost.
|
||||
}
|
||||
\examples{
|
||||
data(agaricus.train, package='xgboost')
|
||||
|
||||
@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model:
|
||||
``booster.save_model('model.json')``. More details below.
|
||||
|
||||
Before we get started, note that XGBoost is a gradient boosting library with a focus on tree models,
|
||||
which means inside XGBoost, there are 2 distinct parts: the model consisted of trees and
|
||||
algorithms used to build it. If you come from Deep Learning community, then it should be
|
||||
clear to you that there are differences between the neural network structures composed of
|
||||
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train
|
||||
them.
|
||||
which means inside XGBoost, there are 2 distinct parts:
|
||||
|
||||
So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
|
||||
like number of input columns in trained trees, and the objective function, which combined
|
||||
1. The model consisting of trees and
|
||||
2. Hyperparameters and configurations used for building the model.
|
||||
|
||||
If you come from the deep learning community, then it should be
|
||||
clear to you that there are differences between the neural network structures composed of
|
||||
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
|
||||
|
||||
So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
|
||||
parameters such as the number of input columns in the trained trees, and the objective function, which combine
|
||||
to represent the concept of "model" in XGBoost. As for why we save the objective as
|
||||
part of the model, that's because the objective controls the transformation of the global bias (called
|
||||
``base_score`` in XGBoost). Users can share this model with others for prediction,
|
||||
evaluation, or to continue training with a different set of hyperparameters.
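
The split between the model and its training configuration can be pictured with a toy sketch. The ``ToyBooster`` class, its fields, and the ``save_model``/``load_model`` helpers below are hypothetical stand-ins written for this illustration; they are not XGBoost's implementation or its file format.

```python
import json


class ToyBooster:
    """Hypothetical stand-in for a booster; not the XGBoost class."""

    def __init__(self, trees, num_feature, objective, train_config=None):
        # Part 1: the model itself.
        self.trees = trees
        self.num_feature = num_feature
        self.objective = objective
        # Part 2: configuration that is only needed while building the model.
        self.train_config = train_config if train_config is not None else {}


def save_model(booster):
    """Serialise only the model part: trees, feature count and objective."""
    return json.dumps({
        "trees": booster.trees,
        "num_feature": booster.num_feature,
        "objective": booster.objective,
    })


def load_model(text):
    """Rebuild a booster from the model part; the training config starts empty."""
    doc = json.loads(text)
    return ToyBooster(doc["trees"], doc["num_feature"], doc["objective"])


bst = ToyBooster(
    trees=[{"split_feature": 0, "threshold": 0.5, "leaves": [-1.0, 1.0]}],
    num_feature=3,
    objective="binary:logistic",
    train_config={"eta": 1.0, "max_depth": 2, "nthread": 2},
)
restored = load_model(save_model(bst))
```

In this sketch the restored object can still be used for prediction or further training, but the hyperparameters of the original run are deliberately absent, which mirrors the point that the saved model is independent of how it was built.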
|
||||
|
||||
However, this is not the end of the story. There are cases where we need to save something
|
||||
more than just the model itself. For example, in distributed training, XGBoost performs
|
||||
checkpointing operations. Or, for some reason, your favorite distributed computing
|
||||
framework decides to copy the model from one worker to another and continue the training in
|
||||
there. In such cases, the serialisation output is required to contain enough information
|
||||
to continue the previous training without the user providing any parameters again. We consider
|
||||
such scenario as memory snapshot (or memory based serialisation method) and distinguish it
|
||||
with normal model IO operation. In Python, this can be invoked by pickling the
|
||||
``Booster`` object. Other language bindings are still working in progress.
|
||||
such a scenario a **memory snapshot** (or memory-based serialisation method) and distinguish it
|
||||
from normal model IO operations. Currently, memory snapshots are used in the following places:
|
||||
|
||||
* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module.
|
||||
* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS``
|
||||
or ``save``.
|
||||
|
||||
Other language bindings are still a work in progress.
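
As a rough sketch of how a memory snapshot differs from model IO, the following uses only the Python standard library. The dictionary layout and the helper names are hypothetical and chosen for illustration; real snapshots come from pickling a ``Booster`` (Python) or calling ``saveRDS``/``save`` on an ``xgb.Booster`` (R).

```python
import json
import pickle

# Hypothetical full in-memory state of a booster in the middle of training.
state = {
    "model": {
        "trees": [{"split_feature": 0, "threshold": 0.5, "leaves": [-1.0, 1.0]}],
        "num_feature": 3,
        "objective": "binary:logistic",
    },
    "train_config": {"eta": 1.0, "max_depth": 2, "nthread": 2},
    "completed_rounds": 2,
}


def model_io(full_state):
    """Model IO: keep only the model and drop everything training-related."""
    return json.dumps(full_state["model"])


def memory_snapshot(full_state):
    """Memory snapshot: capture the complete state so training can resume."""
    return pickle.dumps(full_state)


def resume_training(snapshot, extra_rounds):
    """Continue from a snapshot without asking the user for parameters again."""
    restored = pickle.loads(snapshot)
    restored["completed_rounds"] += extra_rounds  # placeholder for real boosting
    return restored


resumed = resume_training(memory_snapshot(state), extra_rounds=3)
model_only = json.loads(model_io(state))
```

The snapshot carries the training configuration and progress along with the trees, which is what a worker needs to pick up where another left off; the model file carries none of that.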
|
||||
|
||||
.. note::
|
||||
|
||||
@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective)
|
||||
a filename with ``.json`` as file extension:
|
||||
|
||||
.. code-block:: python
|
||||
:caption: Python
|
||||
|
||||
bst.save_model('model_file_name.json')
|
||||
|
||||
While for enabling JSON as memory based serialisation format, pass
|
||||
``enable_experimental_json_serialization`` as a training parameter. In Python this can be
|
||||
done by:
|
||||
.. code-block:: r
|
||||
:caption: R
|
||||
|
||||
xgb.save(bst, 'model_file_name.json')
|
||||
|
||||
To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
|
||||
parameter. In Python this can be done by:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@ -63,13 +77,33 @@ done by:
|
||||
|
||||
Note that ``filename`` is for Python's built-in function ``open``, not for XGBoost. Hence
|
||||
the parameter ``enable_experimental_json_serialization`` is required to enable the JSON format.
|
||||
As the name suggested, memory based serialisation captures many stuffs internal to
|
||||
XGBoost, so it's only suitable to be used for checkpoints, which doesn't require stable
|
||||
output format. That being said, loading pickled booster (memory snapshot) in a different
|
||||
XGBoost version may lead to errors or undefined behaviors. But we promise the stable
|
||||
output format of binary model and JSON model (once it's no-longer experimental) as they
|
||||
are designed to be reusable. This scheme fits as Python itself doesn't guarantee pickled
|
||||
bytecode can be used in different Python version.
|
||||
|
||||
Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
|
||||
parameters:
|
||||
|
||||
.. code-block:: r
|
||||
|
||||
params <- list(enable_experimental_json_serialization = TRUE, ...)
|
||||
bst <- xgb.train(params, dtrain, nrounds = 10)
|
||||
saveRDS(bst, 'filename.rds')
|
||||
|
||||
***************************************************************
|
||||
A note on backward compatibility of models and memory snapshots
|
||||
***************************************************************
|
||||
|
||||
**We guarantee backward compatibility for models but not for memory snapshots.**
|
||||
|
||||
Models (trees and objective) use a stable representation, so that models produced in earlier
|
||||
versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive
|
||||
your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R).
|
||||
|
||||
On the other hand, a memory snapshot (serialisation) captures many details internal to XGBoost, and its
|
||||
format is not stable and is subject to frequent changes. Therefore, a memory snapshot is suitable for
|
||||
checkpointing only, where you persist the complete snapshot of the training configurations so that
|
||||
you can recover robustly from possible failures and resume the training process. Loading a memory
|
||||
snapshot generated by an earlier version of XGBoost may result in errors or undefined behavior.
|
||||
**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may
|
||||
not be accessible in later versions of XGBoost.**
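
The guarantee above can be pictured with a toy sketch: a model file follows a fixed schema that every later reader understands, while a snapshot mirrors whatever internal layout the writing version happened to use. Everything below (the "v1"/"v2" labels, the rename of ``eta`` to ``learning_rate``, and the helper names) is invented for illustration and does not describe actual XGBoost internals.

```python
import json
import pickle


def write_v1(trees, objective, eta):
    """Hypothetical 'old release': returns (model_file, memory_snapshot)."""
    model_file = json.dumps({"trees": trees, "objective": objective})
    # The snapshot dumps internal state as-is, including a field named "eta".
    snapshot = pickle.dumps({"trees": trees, "objective": objective, "eta": eta})
    return model_file, snapshot


def load_model_v2(model_file):
    """Hypothetical 'new release' reading the stable model schema."""
    doc = json.loads(model_file)
    return {"trees": doc["trees"], "objective": doc["objective"]}


def load_snapshot_v2(snapshot):
    """Hypothetical 'new release' whose internals renamed eta to learning_rate."""
    state = pickle.loads(snapshot)
    return {"trees": state["trees"], "learning_rate": state["learning_rate"]}


model_file, snapshot = write_v1([{"leaf": 0.3}], "binary:logistic", eta=1.0)

model = load_model_v2(model_file)  # works: the model schema did not change

try:
    load_snapshot_v2(snapshot)
    snapshot_loaded = True
except KeyError:
    # The old snapshot has no "learning_rate" field, so loading fails.
    snapshot_loaded = False
```

The failure here is a clean ``KeyError`` only because the toy is simple; with a real snapshot the mismatch may surface as an error or as silently wrong state, which is why snapshots are recommended for checkpointing only.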
|
||||
|
||||
***************************
|
||||
Custom objective and metric
|
||||
@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is nee
|
||||
It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See
|
||||
comments in the script for more details.
|
||||
|
||||
A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
|
||||
able to install an older version of XGBoost using the ``remotes`` package:
|
||||
|
||||
.. code-block:: r
|
||||
|
||||
library(remotes)
|
||||
remotes::install_version("xgboost", "0.90.0.1") # Install version 0.90.0.1
|
||||
|
||||
Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the
|
||||
``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation.
|
||||
Now you should be able to use the model in the latest version of XGBoost.
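
The three recovery steps (read the snapshot with the old release, export to the stable representation, load with the new release) can be sketched with toy stand-ins. The function names and data layout below are hypothetical; in practice the steps correspond to ``readRDS`` under the older XGBoost, ``xgb.save``, and ``xgb.load`` under the newer one.

```python
import json
import pickle


def load_snapshot_old(snapshot):
    """Stand-in for reading the snapshot with the old release installed."""
    return pickle.loads(snapshot)


def export_model(state):
    """Stand-in for exporting to the stable model representation."""
    return json.dumps({"trees": state["trees"], "objective": state["objective"]})


def load_model_new(model_file):
    """Stand-in for loading the stable model file with the latest release."""
    return json.loads(model_file)


# A snapshot written by the (hypothetical) old release, with internal extras.
old_snapshot = pickle.dumps(
    {"trees": [{"leaf": 0.3}], "objective": "binary:logistic", "eta": 1.0}
)

state = load_snapshot_old(old_snapshot)  # step 1: old release reads its own snapshot
model_file = export_model(state)         # step 2: export the stable representation
model = load_model_new(model_file)       # step 3: latest release loads the model
```

Only the first step depends on the old release; once the stable file exists, the snapshot is no longer needed.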
|
||||
|
||||
********************************************************
|
||||
Saving and Loading the internal parameters configuration
|
||||
********************************************************
|
||||
|
||||