Test loading models with invalid file extensions. (#9955)

Jiaming Yuan 2024-01-08 19:26:24 +08:00 committed by GitHub
parent 3ff3a5f1ed
commit 9a30bdd313
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 82 additions and 10 deletions

@@ -2,14 +2,20 @@
 Introduction to Model IO
 ########################
 
+Since 2.1.0, the default model format for XGBoost is the UBJSON format, the option is
+enabled for serializing models to file, serializing models to buffer, and for memory
+snapshot (pickle and alike).
+
 In XGBoost 1.0.0, we introduced support of using `JSON
 <https://www.json.org/json-en.html>`_ for saving/loading XGBoost models and related
 hyper-parameters for training, aiming to replace the old binary internal format with an
 open format that can be easily reused. Later in XGBoost 1.6.0, additional support for
 `Universal Binary JSON <https://ubjson.org/>`__ is added as an optimization for more
-efficient model IO. They have the same document structure with different representations,
-and we will refer them collectively as the JSON format. This tutorial aims to share some
-basic insights into the JSON serialisation method used in XGBoost. Without explicitly
+efficient model IO, which is set to default in 2.1.
+
+JSON and UBJSON have the same document structure with different representations, and we
+will refer them collectively as the JSON format. This tutorial aims to share some basic
+insights into the JSON serialisation method used in XGBoost. Without explicitly
 mentioned, the following sections assume you are using the one of the 2 outputs formats,
 which can be enabled by providing the file name with ``.json`` (or ``.ubj`` for binary
 JSON) as file extension when saving/loading model: ``booster.save_model('model.json')``.
@@ -25,12 +31,13 @@ If you come from Deep Learning community, then it should be
 clear to you that there are differences between the neural network structures composed of
 weights with fixed tensor operations, and the optimizers (like RMSprop) used to train them.
 
-So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees, some model
-parameters like number of input columns in trained trees, and the objective function, which combined
-to represent the concept of "model" in XGBoost. As for why are we saving the objective as
-part of model, that's because objective controls transformation of global bias (called
-``base_score`` in XGBoost). Users can share this model with others for prediction,
-evaluation or continue the training with a different set of hyper-parameters etc.
+So when one calls ``booster.save_model`` (``xgb.save`` in R), XGBoost saves the trees,
+some model parameters like number of input columns in trained trees, and the objective
+function, which combined to represent the concept of "model" in XGBoost. As for why are
+we saving the objective as part of model, that's because objective controls transformation
+of global bias (called ``base_score`` in XGBoost) and task-specific information. Users
+can share this model with others for prediction, evaluation or continue the training with
+a different set of hyper-parameters etc.
 
 However, this is not the end of story. There are cases where we need to save something
 more than just the model itself. For example, in distributed training, XGBoost performs
@@ -81,7 +88,10 @@ a filename with ``.json`` or ``.ubj`` as file extension, the latter is the exten
 JSON files that were produced by an external source may lead to undefined behaviors
 and crashes.
 
-While for memory snapshot, UBJSON is the default starting with xgboost 1.6.
+While for memory snapshot, UBJSON is the default starting with xgboost 1.6. When loading
+the model back, XGBoost recognizes the file extensions ``.json`` and ``.ubj``, and can
+dispatch accordingly. If the extension is not specified, XGBoost tries to guess the right
+one.
 
 ***************************************************************
 A note on backward compatibility of models and memory snapshots
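
The loading rule described in the last hunk (dispatch on ``.json`` / ``.ubj``, otherwise guess) can be pictured with a small standard-library sketch. This is only an illustration of the idea, not XGBoost's implementation: `pick_format` is a hypothetical helper, and the content heuristic (anything that parses as JSON text is JSON, everything else is assumed to be UBJSON) is an assumption made for the example.

```python
import json
import os


def pick_format(path: str) -> str:
    """Return "json" or "ubj" for a model file (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lower()
    # A recognized extension decides the parser directly; the content is not
    # inspected, so a mislabelled file would fail later in the chosen parser.
    if ext == ".json":
        return "json"
    if ext == ".ubj":
        return "ubj"
    # Missing or unrecognized extension: guess from the content.
    # Assumption for this sketch only: whatever parses as JSON text is JSON,
    # anything else is treated as UBJSON.
    with open(path, "rb") as fd:
        raw = fd.read()
    try:
        json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "ubj"
    return "json"
```

Under this sketch, a JSON document saved as ``model.deprecated`` is still detected as JSON, while the same document renamed to ``model.ubj`` is handed to the UBJSON parser. That matches the behaviour the accompanying test expects: a recognized but wrong extension produces a parsing error.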

@@ -254,6 +254,68 @@ class TestBoosterIO:
         # remove file
         Path.unlink(save_path)
 
+    def test_invalid_postfix(self) -> None:
+        """Test mis-specified model format, no special handling is expected, the
+        JSON/UBJ parser can emit parsing errors.
+
+        """
+        X, y, w = tm.make_regression(64, 16, False)
+        booster = xgb.train({}, xgb.QuantileDMatrix(X, y, weight=w), num_boost_round=3)
+
+        def rename(src: str, dst: str) -> None:
+            if os.path.exists(dst):
+                # Windows cannot overwrite an existing file.
+                os.remove(dst)
+            os.rename(src, dst)
+
+        with tempfile.TemporaryDirectory() as tmpdir:
+            path_dep = os.path.join(tmpdir, "model.deprecated")
+            # save into deprecated format
+            with pytest.warns(UserWarning, match="UBJSON"):
+                booster.save_model(path_dep)
+
+            path_ubj = os.path.join(tmpdir, "model.ubj")
+            rename(path_dep, path_ubj)
+
+            with pytest.raises(ValueError, match="{"):
+                xgb.Booster(model_file=path_ubj)
+
+            path_json = os.path.join(tmpdir, "model.json")
+            rename(path_ubj, path_json)
+
+            with pytest.raises(ValueError, match="{"):
+                xgb.Booster(model_file=path_json)
+
+            # save into ubj format
+            booster.save_model(path_ubj)
+            rename(path_ubj, path_dep)
+            # deprecated is not a recognized format internally, XGBoost can guess the
+            # right format
+            xgb.Booster(model_file=path_dep)
+            rename(path_dep, path_json)
+            with pytest.raises(ValueError, match="Expecting"):
+                xgb.Booster(model_file=path_json)
+
+            # save into JSON format
+            booster.save_model(path_json)
+            rename(path_json, path_dep)
+            # deprecated is not a recognized format internally, XGBoost can guess the
+            # right format
+            xgb.Booster(model_file=path_dep)
+            rename(path_dep, path_ubj)
+            with pytest.raises(ValueError, match="Expecting"):
+                xgb.Booster(model_file=path_ubj)
+
+            # save model without file extension
+            path_no = os.path.join(tmpdir, "model")
+            with pytest.warns(UserWarning, match="UBJSON"):
+                booster.save_model(path_no)
+
+            booster_1 = xgb.Booster(model_file=path_no)
+            r0 = booster.save_raw(raw_format="json")
+            r1 = booster_1.save_raw(raw_format="json")
+            assert r0 == r1
+
 
 def save_load_model(model_path: str) -> None:
     from sklearn.datasets import load_digits