Drop saving binary format for memory snapshot. (#6513)

parent 0e97d97d50
commit c5876277a8
@@ -9,10 +9,9 @@ open format that can be easily reused. The support for binary format will be co
 the future until JSON format is no-longer experimental and has satisfying performance.
 This tutorial aims to share some basic insights into the JSON serialisation method used in
 XGBoost. Without explicitly mentioned, the following sections assume you are using the
-experimental JSON format, which can be enabled by passing
-``enable_experimental_json_serialization=True`` as training parameter, or provide the file
-name with ``.json`` as file extension when saving/loading model:
-``booster.save_model('model.json')``. More details below.
+JSON format, which can be enabled by providing the file name with ``.json`` as file
+extension when saving/loading model: ``booster.save_model('model.json')``. More details
+below.
 
 Before we get started, XGBoost is a gradient boosting library with focus on tree model,
 which means inside XGBoost, there are 2 distinct parts:
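The extension-based dispatch described in the hunk above can be sketched in Python. The `snapshot_format` helper below is hypothetical (not part of the XGBoost API); it only mirrors the documented rule that a ``.json`` file name selects the JSON serializer:

```python
import os

def snapshot_format(filename):
    """Mirror the documented rule: a '.json' extension selects the JSON
    serializer; any other extension falls back to the binary format."""
    ext = os.path.splitext(filename)[1]
    return 'json' if ext == '.json' else 'binary'
```

For example, `snapshot_format('model.json')` selects JSON while `snapshot_format('model.bin')` selects binary.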
@@ -66,26 +65,7 @@ a filename with ``.json`` as file extension:
 
   xgb.save(bst, 'model_file_name.json')
 
-To use JSON to store memory snapshots, add ``enable_experimental_json_serialization`` as a training
-parameter. In Python this can be done by:
-
-.. code-block:: python
-
-  bst = xgboost.train({'enable_experimental_json_serialization': True}, dtrain)
-  with open('filename', 'wb') as fd:
-      pickle.dump(bst, fd)
-
-Notice the ``filename`` is for Python intrinsic function ``open``, not for XGBoost. Hence
-parameter ``enable_experimental_json_serialization`` is required to enable JSON format.
-
-Similarly, in the R package, add ``enable_experimental_json_serialization`` to the training
-parameter:
-
-.. code-block:: r
-
-  params <- list(enable_experimental_json_serialization = TRUE, ...)
-  bst <- xgboost.train(params, dtrain, nrounds = 10)
-  saveRDS(bst, 'filename.rds')
+For memory snapshots, JSON is the default format starting with XGBoost 1.3.
 
 ***************************************************************
 A note on backward compatibility of models and memory snapshots
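Because the training parameter is gone, pickling a booster now needs no special setup. The round trip can be sketched with a toy stand-in class (hypothetical `ToyBooster`, used here only so the example is self-contained without a trained XGBoost model):

```python
import json
import pickle

class ToyBooster:
    """Stand-in for xgboost.Booster: serializes itself to a JSON string
    when pickled, mimicking the JSON-by-default memory snapshot."""
    def __init__(self, trees):
        self.trees = trees

    def __getstate__(self):
        return {'snapshot': json.dumps({'trees': self.trees})}

    def __setstate__(self, state):
        self.trees = json.loads(state['snapshot'])['trees']

bst = ToyBooster(trees=3)
blob = pickle.dumps(bst)       # no extra training parameter required
restored = pickle.loads(blob)
```

With the real library, `pickle.dumps(booster)` after training follows the same shape.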
@@ -110,11 +90,11 @@ Custom objective and metric
 ***************************
 
 XGBoost accepts user provided objective and metric functions as an extension. These
-functions are not saved in model file as they are language dependent feature. With
+functions are not saved in model file as they are language dependent features. With
 Python, user can pickle the model to include these functions in saved binary. One
 drawback is, the output from pickle is not a stable serialization format and doesn't work
-on different Python version or XGBoost version, not to mention different language
-environment. Another way to workaround this limitation is to provide these functions
+on different Python version nor XGBoost version, not to mention different language
+environments. Another way to workaround this limitation is to provide these functions
 again after the model is loaded. If the customized function is useful, please consider
 making a PR for implementing it inside XGBoost, this way we can have your functions
 working with different language bindings.
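As a concrete illustration of "provide these functions again after the model is loaded": a custom objective is a plain function returning gradient and hessian, so it must be passed to training again after `load_model`. A minimal sketch of such a function (squared error, written over plain lists so it runs without XGBoost or NumPy):

```python
def squared_error_objective(preds, labels):
    """Gradient and hessian of 0.5 * (pred - label)^2, element-wise.
    This kind of function is never stored in the model file, so the
    user must supply it again when continuing training after a load."""
    grad = [p - y for p, y in zip(preds, labels)]
    hess = [1.0 for _ in preds]
    return grad, hess
```

The real signature takes predictions and a `DMatrix`; the list form above is only for illustration.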
@@ -128,9 +108,9 @@ models are valuable. One way to restore it in the future is to load it back wit
 specific version of Python and XGBoost, export the model by calling `save_model`. To help
 easing the mitigation, we created a simple script for converting pickled XGBoost 0.90
 Scikit-Learn interface object to XGBoost 1.0.0 native model. Please note that the script
-suits simple use cases, and it's advised not to use pickle when stability is needed.
-It's located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See
-comments in the script for more details.
+suits simple use cases, and it's advised not to use pickle when stability is needed. It's
+located in ``xgboost/doc/python`` with the name ``convert_090to100.py``. See comments in
+the script for more details.
 
 A similar procedure may be used to recover the model persisted in an old RDS file. In R, you are
 able to install an older version of XGBoost using the ``remotes`` package:
@@ -172,7 +152,6 @@ Will print out something similiar to (not actual output as it's too long for dem
 {
   "Learner": {
     "generic_parameter": {
-      "enable_experimental_json_serialization": "0",
       "gpu_id": "0",
       "gpu_page_size": "0",
       "n_jobs": "0",
@@ -31,7 +31,6 @@ struct GenericParameter : public XGBoostParameter<GenericParameter> {
   bool fail_on_invalid_gpu_id {false};
   // gpu page size in external memory mode, 0 means using the default.
   size_t gpu_page_size;
-  bool enable_experimental_json_serialization {true};
   bool validate_parameters {false};
 
   void CheckDeprecated() {
@@ -74,10 +73,6 @@ struct GenericParameter : public XGBoostParameter<GenericParameter> {
       .set_default(0)
       .set_lower_bound(0)
       .describe("GPU page size when running in external memory mode.");
-  DMLC_DECLARE_FIELD(enable_experimental_json_serialization)
-      .set_default(true)
-      .describe("Enable using JSON for memory serialization (Python Pickle, "
-                "rabit checkpoints etc.).");
   DMLC_DECLARE_FIELD(validate_parameters)
       .set_default(false)
       .describe("Enable checking whether parameters are used or not.");
@@ -874,7 +874,6 @@ class LearnerIO : public LearnerConfiguration {
   }
 
   void Save(dmlc::Stream* fo) const override {
-    if (generic_parameters_.enable_experimental_json_serialization) {
     Json memory_snapshot{Object()};
     memory_snapshot["Model"] = Object();
     auto &model = memory_snapshot["Model"];
@@ -885,29 +884,6 @@ class LearnerIO : public LearnerConfiguration {
     std::string out_str;
     Json::Dump(memory_snapshot, &out_str);
     fo->Write(out_str.c_str(), out_str.size());
-    } else {
-      std::string binary_buf;
-      common::MemoryBufferStream s(&binary_buf);
-      this->SaveModel(&s);
-      Json config{ Object() };
-      // Do not use std::size_t as it's not portable.
-      int64_t const json_offset = binary_buf.size();
-      this->SaveConfig(&config);
-      std::string config_str;
-      Json::Dump(config, &config_str);
-      // concatonate the model and config at final output, it's a temporary solution for
-      // continuing support for binary model format
-      fo->Write(&serialisation_header_[0], serialisation_header_.size());
-      if (DMLC_IO_NO_ENDIAN_SWAP) {
-        fo->Write(&json_offset, sizeof(json_offset));
-      } else {
-        auto x = json_offset;
-        dmlc::ByteSwap(&x, sizeof(x), 1);
-        fo->Write(&x, sizeof(json_offset));
-      }
-      fo->Write(&binary_buf[0], binary_buf.size());
-      fo->Write(&config_str[0], config_str.size());
-    }
   }
 
   void Load(dmlc::Stream* fi) override {
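The deleted `else` branch wrote a composite binary snapshot: header bytes, an `int64` giving the size of the binary model (always stored byte-swapped to the same on-disk order), the model, then the JSON config. That legacy layout can be sketched in Python (hypothetical `pack_legacy_snapshot` helper, for illustration only, assuming little-endian on-disk order):

```python
import struct

def pack_legacy_snapshot(header: bytes, model: bytes, config_json: bytes) -> bytes:
    """Sketch of the removed on-disk layout: serialisation header,
    little-endian int64 with the binary model's size (the config offset),
    the binary model, then the JSON config string."""
    offset = struct.pack('<q', len(model))  # int64_t json_offset
    return header + offset + model + config_json
```

Reading such a file back would reverse the steps: strip the header, read the offset, then split model from config.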
@@ -42,17 +42,9 @@ class TestPickling:
         if os.path.exists(filename):
            os.remove(filename)
 
-    def test_model_pickling_binary(self):
-        params = {
-            'nthread': 1,
-            'tree_method': 'hist'
-        }
-        self.run_model_pickling(params)
-
     def test_model_pickling_json(self):
         params = {
             'nthread': 1,
             'tree_method': 'hist',
-            'enable_experimental_json_serialization': True
         }
         self.run_model_pickling(params)
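The surviving JSON test exercises a pickle round trip on a trained booster; the core check can be sketched independently of XGBoost (hypothetical `roundtrip` helper, shown here with a plain parameter dict standing in for the booster):

```python
import pickle

def roundtrip(obj):
    """Serialize and restore via pickle, as the pickling test does with a
    trained booster, returning the restored object for comparison."""
    return pickle.loads(pickle.dumps(obj))

params = {'nthread': 1, 'tree_method': 'hist'}
restored_params = roundtrip(params)
```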
@@ -9,7 +9,7 @@ rng = np.random.RandomState(1337)
 class TestTrainingContinuation:
     num_parallel_tree = 3
 
-    def generate_parameters(self, use_json):
+    def generate_parameters(self):
         xgb_params_01_binary = {
             'nthread': 1,
         }
@@ -24,13 +24,6 @@ class TestTrainingContinuation:
             'num_class': 5,
             'num_parallel_tree': self.num_parallel_tree
         }
-        if use_json:
-            xgb_params_01_binary[
-                'enable_experimental_json_serialization'] = True
-            xgb_params_02_binary[
-                'enable_experimental_json_serialization'] = True
-            xgb_params_03_binary[
-                'enable_experimental_json_serialization'] = True
 
         return [
             xgb_params_01_binary, xgb_params_02_binary, xgb_params_03_binary
@@ -136,31 +129,16 @@ class TestTrainingContinuation:
                                       ntree_limit=gbdt_05.best_ntree_limit)
         np.testing.assert_almost_equal(res1, res2)
 
-    @pytest.mark.skipif(**tm.no_sklearn())
-    def test_training_continuation_binary(self):
-        params = self.generate_parameters(False)
-        self.run_training_continuation(params[0], params[1], params[2])
-
     @pytest.mark.skipif(**tm.no_sklearn())
     def test_training_continuation_json(self):
-        params = self.generate_parameters(True)
-        for p in params:
-            p['enable_experimental_json_serialization'] = True
-        self.run_training_continuation(params[0], params[1], params[2])
-
-    @pytest.mark.skipif(**tm.no_sklearn())
-    def test_training_continuation_updaters_binary(self):
-        updaters = 'grow_colmaker,prune,refresh'
-        params = self.generate_parameters(False)
-        for p in params:
-            p['updater'] = updaters
+        params = self.generate_parameters()
         self.run_training_continuation(params[0], params[1], params[2])
 
     @pytest.mark.skipif(**tm.no_sklearn())
     def test_training_continuation_updaters_json(self):
         # Picked up from R tests.
         updaters = 'grow_colmaker,prune,refresh'
-        params = self.generate_parameters(True)
+        params = self.generate_parameters()
         for p in params:
             p['updater'] = updaters
         self.run_training_continuation(params[0], params[1], params[2])
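The remaining JSON tests share the pattern of mutating each parameter dict in place before running continuation. That loop can be captured as a small helper (hypothetical `with_updater`, mirroring the loop in `test_training_continuation_updaters_json`):

```python
def with_updater(params_list, updaters):
    """Set the 'updater' key on every parameter dict in place, as the
    updaters-json test loop does before running training continuation."""
    for p in params_list:
        p['updater'] = updaters
    return params_list
```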