[R] Provide better guidance for persisting XGBoost model (#5964)

* [R] Provide better guidance for persisting XGBoost model * Update saving_model.rst * Add a paragraph about xgb.serialize()
2020-07-31 20:00:26 -07:00
parent bf2990e773
commit 5a2dcd1c33
17 changed files with 233 additions and 82 deletions
--- a/doc/tutorials/saving_model.rst
+++ b/doc/tutorials/saving_model.rst
@@ -15,27 +15,36 @@ name with ``.json`` as file extension when saving/loading model:
 ``booster.save_model('model.json')``.  More details below.

 Before we get started, XGBoost is a gradient boosting library with focus on tree model,
-which means inside XGBoost, there are 2 distinct parts: the model consisted of trees and
-algorithms used to build it.  If you come from Deep Learning community, then it should be
-clear to you that there are differences between the neural network structures composed of
-weights with fixed tensor operations, and the optimizers (like RMSprop) used to train
-them.
+which means inside XGBoost, there are 2 distinct parts:

-So when one calls ``booster.save_model``, XGBoost saves the trees, some model parameters
-like number of input columns in trained trees, and the objective function, which combined
+1. The model consisting of trees and
+2. Hyperparameters and configurations used for building the model.
+
+If you come from Deep Learning community, then it should be
+clear to you that there are differences between the neural network structures composed of
+weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
+
+So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
+parameters like number of input columns in trained trees, and the objective function, which combined
 to represent the concept of "model" in XGBoost.  As for why are we saving the objective as
 part of model, that's because objective controls transformation of global bias (called
 ``base_score`` in XGBoost).  Users can share this model with others for prediction,
 evaluation or continue the training with a different set of hyper-parameters etc.
+
 However, this is not the end of story.  There are cases where we need to save something
 more than just the model itself.  For example, in distrbuted training, XGBoost performs
 checkpointing operation.  Or for some reasons, your favorite distributed computing
 framework decide to copy the model from one worker to another and continue the training in
 there.  In such cases, the serialisation output is required to contain enougth information
 to continue previous training without user providing any parameters again.  We consider
-such scenario as memory snapshot (or memory based serialisation method) and distinguish it
-with normal model IO operation.  In Python, this can be invoked by pickling the
-``Booster`` object.  Other language bindings are still working in progress.
+such scenario as **memory snapshot** (or memory based serialisation method) and distinguish it
+with normal model IO operation. Currently, memory snapshot is used in the following places:
+
+* Python package: when the ``Booster`` object is pickled with the built-in ``pickle`` module.
+* R package: when the ``xgb.Booster`` object is persisted with the built-in functions ``saveRDS``
+  or ``save``.
+
+Other language bindings are still working in progress.

 .. note::

@@ -48,12 +57,17 @@ To enable JSON format support for model IO (saving only the trees and objective)
 a filename with ``.json`` as file extension:

 .. code-block:: python
+  :caption: Python

  bst.save_model('model_file_name.json')

-While for enabling JSON as memory based serialisation format, pass
-``enable_experimental_json_serialization`` as a training parameter.  In Python this can be
-done by:
+.. code-block:: r
+  :caption: R
+
+  xgb.save(bst, 'model_file_name.json')
+
+To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
+parameter.  In Python this can be done by:

 .. code-block:: python

@@ -63,13 +77,33 @@ done by:

 Notice the ``filename`` is for Python intrinsic function ``open``, not for XGBoost.  Hence
 parameter ``enable_experimental_json_serialization`` is required to enable JSON format.
-As the name suggested, memory based serialisation captures many stuffs internal to
-XGBoost, so it's only suitable to be used for checkpoints, which doesn't require stable
-output format.  That being said, loading pickled booster (memory snapshot) in a different
-XGBoost version may lead to errors or undefined behaviors.  But we promise the stable
-output format of binary model and JSON model (once it's no-longer experimental) as they
-are designed to be reusable.  This scheme fits as Python itself doesn't guarantee pickled
-bytecode can be used in different Python version.
+
+Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
+parameter:
+
+.. code-block:: r
+
+  params <- list(enable_experimental_json_serialization = TRUE, ...)
+  bst <- xgboost.train(params, dtrain, nrounds = 10)
+  saveRDS(bst, 'filename.rds')
+
+***************************************************************
+A note on backward compatibility of models and memory snapshots
+***************************************************************
+
+**We guarantee backward compatibility for models but not for memory snapshots.**
+
+Models (trees and objective) use a stable representation, so that models produced in earlier
+versions of XGBoost are accessible in later versions of XGBoost. **If you'd like to store or archive
+your model for long-term storage, use** ``save_model`` (Python) and ``xgb.save`` (R).
+
+On the other hand, memory snapshot (serialisation) captures many stuff internal to XGBoost, and its
+format is not stable and is subject to frequent changes. Therefore, memory snapshot is suitable for
+checkpointing only, where you persist the complete snapshot of the training configurations so that
+you can recover robustly from possible failures and resume the training process. Loading memory
+snapshot generated by an earlier version of XGBoost may result in errors or undefined behaviors.
+**If a model is persisted with** ``pickle.dump`` (Python) or ``saveRDS`` (R), **then the model may
+not be accessible in later versions of XGBoost.**

 ***************************
 Custom objective and metric
@@ -98,6 +132,18 @@ suits simple use cases, and it's advised not to use pickle when stability is nee
 It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``.  See
 comments in the script for more details.

+A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
+able to install an older version of XGBoost using the ``remotes`` package:
+
+.. code-block:: r
+
+  library(remotes)
+  remotes::install_version("xgboost", "0.90.0.1")  # Install version 0.90.0.1
+
+Once the desired version is installed, you can load the RDS file with ``readRDS`` and recover the
+``xgb.Booster`` object. Then call ``xgb.save`` to export the model using the stable representation.
+Now you should be able to use the model in the latest version of XGBoost.
+
 ********************************************************
 Saving and Loading the internal parameters configuration
 ********************************************************