[R] Provide better guidance for persisting XGBoost model (#5964)
* [R] Provide better guidance for persisting XGBoost model * Update saving_model.rst * Add a paragraph about xgb.serialize()
This commit is contained in:
committed by
GitHub
parent
bf2990e773
commit
5a2dcd1c33
@@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model:
|
||||
``booster.save_model('model.json')``. More details below.
|
||||
|
||||
Before we get started, XGBoost is a gradient boosting library with focus on tree model,
|
||||
which means inside XGBoost, there are 2 distinct parts: the model consisted of trees and
|
||||
algorithms used to build it. If you come from Deep Learning community, then it should be
|
||||
clear to you that there are differences between the neural network structures composed of
|
||||
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train
|
||||
them.
|
||||
which means inside XGBoost, there are 2 distinct parts:
|
||||
|
||||
So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
|
||||
like number of input columns in trained trees, and the objective function, which combined
|
||||
1. The model consisting of trees and
|
||||
2. Hyperparameters and configurations used for building the model.
|
||||
|
||||
If you come from Deep Learning community, then it should be
|
||||
clear to you that there are differences between the neural network structures composed of
|
||||
weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
|
||||
|
||||
So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
|
||||
parameters like number of input columns in trained trees, and the objective function, which combined
|
||||
to represent the concept of "model" in XGBoost. As for why are we saving the objective as
|
||||
part of model, that's because objective controls transformation of global bias (called
|
||||
``base_score`` in XGBoost). Users can share this model with others for prediction,
|
||||
evaluation or continue the training with a different set of hyper-parameters etc.
|
||||
|
||||
However, this is not the end of story. There are cases where we need to save something
|
||||
more than just the model itself. For example, in distrbuted training, XGBoost performs
|
||||
checkpointing operation. Or for some reasons, your favorite distributed computing
|
||||
framework decide to copy the model from one worker to another and continue the training in
|
||||
there. In such cases, the serialisation output is required to contain enougth information
|
||||
to continue previous training without user providing any parameters again. We consider
|
||||
such scenario as memory snapshot (or memory based serialisation method) and distinguish it
|
||||
with normal model IO operation. In Python, this can be invoked by pickling the
|
||||
``Booster`` object. Other language bindings are still working in progress.
|
||||
such scenario as **memory snapshot** (or memory based serialisation method) and distinguish it
|
||||
with normal model IO operation. Currently, memory snapshot is used in the following places:
|
||||
|
||||
* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module.
|
||||
* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS``
|
||||
or ``save``.
|
||||
|
||||
Other language bindings are still working in progress.
|
||||
|
||||
.. note::
|
||||
|
||||
@@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective)
|
||||
a filename with ``.json`` as file extension:
|
||||
|
||||
.. code-block:: python
|
||||
:caption: Python
|
||||
|
||||
bst.save_model('model_file_name.json')
|
||||
|
||||
While for enabling JSON as memory based serialisation format, pass
|
||||
``enable_experimental_json_serialization`` as a training parameter. In Python this can be
|
||||
done by:
|
||||
.. code-block:: r
|
||||
:caption: R
|
||||
|
||||
xgb.save(bst, 'model_file_name.json')
|
||||
|
||||
To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
|
||||
parameter. In Python this can be done by:
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@@ -63,13 +77,33 @@ done by:
|
||||
|
||||
Notice the ``filename`` is for Python intrinsic function ``open``, not for XGBoost. Hence
|
||||
parameter ``enable_experimental_json_serialization`` is required to enable JSON format.
|
||||
As the name suggested, memory based serialisation captures many stuffs internal to
|
||||
XGBoost, so it's only suitable to be used for checkpoints, which doesn't require stable
|
||||
output format. That being said, loading pickled booster (memory snapshot) in a different
|
||||
XGBoost version may lead to errors or undefined behaviors. But we promise the stable
|
||||
output format of binary model and JSON model (once it's no-longer experimental) as they
|
||||
are designed to be reusable. This scheme fits as Python itself doesn't guarantee pickled
|
||||
bytecode can be used in different Python version.
|
||||
|
||||
Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
|
||||
parameter:
|
||||
|
||||
.. code-block:: r
|
||||
|
||||
params <- list(enable_experimental_json_serialization = TRUE, ...)
|
||||
bst <- xgboost.train(params, dtrain, nrounds = 10)
|
||||
saveRDS(bst, 'filename.rds')
|
||||
|
||||
***************************************************************
|
||||
A note on backward compatibility of models and memory snapshots
|
||||
***************************************************************
|
||||
|
||||
**We guarantee backward compatibility for models but not for memory snapshots.**
|
||||
|
||||
Models (trees and objective) use a stable representation, so that models produced in earlier
|
||||
versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive
|
||||
your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R).
|
||||
|
||||
On the other hand, memory snapshot (serialisation) captures many stuff internal to XGBoost, and its
|
||||
format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for
|
||||
checkpointing only, where you persist the complete snapshot of the training configurations so that
|
||||
you can recover robustly from possible failures and resume the training process. Loading memory
|
||||
snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
|
||||
**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may
|
||||
not be accessible in later versions of XGBoost.**
|
||||
|
||||
***************************
|
||||
Custom objective and metric
|
||||
@@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is nee
|
||||
It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See
|
||||
comments in the script for more details.
|
||||
|
||||
A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
|
||||
able to install an older version of XGBoost using the ``remotes`` package:
|
||||
|
||||
.. code-block:: r
|
||||
|
||||
library(remotes)
|
||||
remotes::install_version("xgboost", "0.90.0.1") # Install version 0.90.0.1
|
||||
|
||||
Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the
|
||||
``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation.
|
||||
Now you should be able to use the model in the latest version of XGBoost.
|
||||
|
||||
********************************************************
|
||||
Saving and Loading the internal parameters configuration
|
||||
********************************************************
|
||||
|
||||
Reference in New Issue
Block a user