[R] Provide better guidance for persisting XGBoost model (#5964)

* [R] Provide better guidance for persisting XGBoost model

* Update saving_model.rst

* Add a paragraph about xgb.serialize()
This commit is contained in:
Philip Hyunsu Cho 2020-07-31 20:00:26 -07:00 committed by GitHub
parent bf2990e773
commit 5a2dcd1c33
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
17 changed files with 233 additions and 82 deletions

View File

@ -64,5 +64,5 @@ Imports:
data.table (>= 1.9.6),
magrittr (>= 1.5),
stringi (>= 0.5.2)
RoxygenNote: 7.1.0
RoxygenNote: 7.1.1
SystemRequirements: GNU make, C++14

View File

@ -308,18 +308,64 @@ xgb.createFolds <- function(y, k = 10)
#' @name xgboost-deprecated
NULL
#' Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.
#' Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
#' models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.
#'
#' It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
#' the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
#' is not advisable to use it if the model is to be accessed in the future. If you train a model
#' with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
#' guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
#' accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
#' explanation, consult the page
#' It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
#' \code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
#' \code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
#' the model is to be accessed in the future. If you train a model with the current version of
#' XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
#' accessible in later releases of XGBoost. To ensure that your model can be accessed in future
#' releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
#'
#' @details
#' Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
#' the JSON format by specifying the JSON extension. To read the model back, use
#' \code{\link{xgb.load}}.
#'
#' Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
#' in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
#' re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
#' The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
#' as part of another R object.
#'
#' Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
#' model but also internal configurations and parameters, and its format is not stable across
#' multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
#'
#' For more details and explanation about model persistence and archival, consult the page
#' \url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
#'
#' @name a-compatibility-note-for-saveRDS
#' @examples
#' data(agaricus.train, package='xgboost')
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#'
#' # Save as a stand-alone file; load it with xgb.load()
#' xgb.save(bst, 'xgb.model')
#' bst2 <- xgb.load('xgb.model')
#'
#' # Save as a stand-alone file (JSON); load it with xgb.load()
#' xgb.save(bst, 'xgb.model.json')
#' bst2 <- xgb.load('xgb.model.json')
#'
#' # Save as a raw byte vector; load it with xgb.load.raw()
#' xgb_bytes <- xgb.save.raw(bst)
#' bst2 <- xgb.load.raw(xgb_bytes)
#'
#' # Persist XGBoost model as part of another R object
#' obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
#' # Persist the R object. Here, saveRDS() is okay, since it doesn't persist
#' # xgb.Booster directly. What's being persisted is the future-proof byte representation
#' # as given by xgb.save.raw().
#' saveRDS(obj, 'my_object.rds')
#' # Read back the R object
#' obj2 <- readRDS('my_object.rds')
#' # Re-construct xgb.Booster object from the bytes
#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
#'
#' @name a-compatibility-note-for-saveRDS-save
NULL
# Lookup table for the deprecated parameters bookkeeping

View File

@ -111,6 +111,8 @@ xgb.get.handle <- function(object) {
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#' saveRDS(bst, "xgb.model.rds")
#'
#' # Warning: The resulting RDS file is only compatible with the current XGBoost version.
#' # Refer to the section titled "a-compatibility-note-for-saveRDS-save".
#' bst1 <- readRDS("xgb.model.rds")
#' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
#' # the handle is invalid:

View File

@ -13,7 +13,11 @@
#'
#' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}}
#' or \code{\link[base]{save}}). However, it would then only be compatible with R, and
#' corresponding R-methods would need to be used to load it.
#' corresponding R-methods would need to be used to load it. Moreover, persisting the model with
#' \code{\link[base]{readRDS}} or \code{\link[base]{save}}) will cause compatibility problems in
#' future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
#' how to persist models in a future-proof way, i.e. to make the model accessible in future
#' releases of XGBoost.
#'
#' @seealso
#' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}.

View File

@ -0,0 +1,62 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utils.R
\name{a-compatibility-note-for-saveRDS-save}
\alias{a-compatibility-note-for-saveRDS-save}
\title{Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.}
\description{
It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
\code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
\code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
the model is to be accessed in the future. If you train a model with the current version of
XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
accessible in later releases of XGBoost. To ensure that your model can be accessed in future
releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
}
\details{
Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
the JSON format by specifying the JSON extension. To read the model back, use
\code{\link{xgb.load}}.
Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
as part of another R object.
Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
model but also internal configurations and parameters, and its format is not stable across
multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
For more details and explanation about model persistence and archival, consult the page
\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
# Save as a stand-alone file; load it with xgb.load()
xgb.save(bst, 'xgb.model')
bst2 <- xgb.load('xgb.model')
# Save as a stand-alone file (JSON); load it with xgb.load()
xgb.save(bst, 'xgb.model.json')
bst2 <- xgb.load('xgb.model.json')
# Save as a raw byte vector; load it with xgb.load.raw()
xgb_bytes <- xgb.save.raw(bst)
bst2 <- xgb.load.raw(xgb_bytes)
# Persist XGBoost model as part of another R object
obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
# Persist the R object. Here, saveRDS() is okay, since it doesn't persist
# xgb.Booster directly. What's being persisted is the future-proof byte representation
# as given by xgb.save.raw().
saveRDS(obj, 'my_object.rds')
# Read back the R object
obj2 <- readRDS('my_object.rds')
# Re-construct xgb.Booster object from the bytes
bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
}

View File

@ -1,15 +0,0 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utils.R
\name{a-compatibility-note-for-saveRDS}
\alias{a-compatibility-note-for-saveRDS}
\title{Do not use saveRDS() for long-term archival of models. Use xgb.save() instead.}
\description{
It is a common practice to use the built-in \code{saveRDS()} function to persist R objects to
the disk. While \code{xgb.Booster} objects can be persisted with \code{saveRDS()} as well, it
is not advisable to use it if the model is to be accessed in the future. If you train a model
with the current version of XGBoost and persist it with \code{saveRDS()}, the model is not
guaranteed to be accessible in later releases of XGBoost. To ensure that your model can be
accessed in future releases of XGBoost, use \code{xgb.save()} instead. For more details and
explanation, consult the page
\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
}

View File

@ -38,6 +38,8 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_dep
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
saveRDS(bst, "xgb.model.rds")
# Warning: The resulting RDS file is only compatible with the current XGBoost version.
# Refer to the section titled "a-compatibility-note-for-saveRDS-save".
bst1 <- readRDS("xgb.model.rds")
if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
# the handle is invalid:

View File

@ -22,7 +22,11 @@ of \code{\link{xgb.train}}.
Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{readRDS}}
or \code{\link[base]{save}}). However, it would then only be compatible with R, and
corresponding R-methods would need to be used to load it.
corresponding R-methods would need to be used to load it. Moreover, persisting the model with
\code{\link[base]{readRDS}} or \code{\link[base]{save}}) will cause compatibility problems in
future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
how to persist models in a future-proof way, i.e. to make the model accessible in future
releases of XGBoost.
}
\examples{
data(agaricus.train, package='xgboost')

View File

@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model:
``booster.save_model('model.json')``. More details below.
Before we get started, XGBoost is a gradient boosting library with focus on tree model,
which means inside XGBoost, there are 2 distinct parts: the model consisted of trees and
algorithms used to build it. If you come from Deep Learning community, then it should be
clear to you that there are differences between the neural network structures composed of
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train
them.
which means inside XGBoost, there are 2 distinct parts:
So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
like number of input columns in trained trees, and the objective function, which combined
1. The model consisting of trees and
2. Hyperparameters and configurations used for building the model.
If you come from Deep Learning community, then it should be
clear to you that there are differences between the neural network structures composed of
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
parameters like number of input columns in trained trees, and the objective function, which combined
to represent the concept of "model" in XGBoost. As for why are we saving the objective as
part of model, that's because objective controls transformation of global bias (called
``base_score`` in XGBoost). Users can share this model with others for prediction,
evaluation or continue the training with a different set of hyper-parameters etc.
However, this is not the end of story. There are cases where we need to save something
more than just the model itself. For example, in distrbuted training, XGBoost performs
checkpointing operation. Or for some reasons, your favorite distributed computing
framework decide to copy the model from one worker to another and continue the training in
there. In such cases, the serialisation output is required to contain enougth information
to continue previous training without user providing any parameters again. We consider
such scenario as memory snapshot (or memory based serialisation method) and distinguish it
with normal model IO operation. In Python, this can be invoked by pickling the
``Booster`` object. Other language bindings are still working in progress.
such scenario as **memory snapshot** (or memory based serialisation method) and distinguish it
with normal model IO operation. Currently, memory snapshot is used in the following places:
* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module.
* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS``
or ``save``.
Other language bindings are still working in progress.
.. note::
@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective)
a filename with ``.json`` as file extension:
.. code-block:: python
:caption: Python
bst.save_model('model_file_name.json')
While for enabling JSON as memory based serialisation format, pass
``enable_experimental_json_serialization`` as a training parameter. In Python this can be
done by:
.. code-block:: r
:caption: R
xgb.save(bst, 'model_file_name.json')
To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
parameter. In Python this can be done by:
.. code-block:: python
@ -63,13 +77,33 @@ done by:
Notice the ``filename`` is for Python intrinsic function ``open``, not for XGBoost. Hence
parameter ``enable_experimental_json_serialization`` is required to enable JSON format.
As the name suggested, memory based serialisation captures many stuffs internal to
XGBoost, so it's only suitable to be used for checkpoints, which doesn't require stable
output format. That being said, loading pickled booster (memory snapshot) in a different
XGBoost version may lead to errors or undefined behaviors. But we promise the stable
output format of binary model and JSON model (once it's no-longer experimental) as they
are designed to be reusable. This scheme fits as Python itself doesn't guarantee pickled
bytecode can be used in different Python version.
Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
parameter:
.. code-block:: r
params <- list(enable_experimental_json_serialization = TRUE, ...)
bst <- xgboost.train(params, dtrain, nrounds = 10)
saveRDS(bst, 'filename.rds')
***************************************************************
A note on backward compatibility of models and memory snapshots
***************************************************************
**We guarantee backward compatibility for models but not for memory snapshots.**
Models (trees and objective) use a stable representation, so that models produced in earlier
versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive
your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R).
On the other hand, memory snapshot (serialisation) captures many stuff internal to XGBoost, and its
format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for
checkpointing only, where you persist the complete snapshot of the training configurations so that
you can recover robustly from possible failures and resume the training process. Loading memory
snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may
not be accessible in later versions of XGBoost.**
***************************
Custom objective and metric
@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is nee
It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See
comments in the script for more details.
A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
able to install an older version of XGBoost using the ``remotes`` package:
.. code-block:: r
library(remotes)
remotes::install_version("xgboost", "0.90.0.1") # Install version 0.90.0.1
Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the
``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation.
Now you should be able to use the model in the latest version of XGBoost.
********************************************************
Saving and Loading the internal parameters configuration
********************************************************