Add dump_format=json option (#1726)

* Add format to the params accepted by DumpModel

Currently, only the text format is supported when dumping a
model. The plan is to add formats such as JSON that are easy for
machines to read and parse, and to make the interface generic
enough that other formats can be added later.

Hence, we modify these functions to be generic and to accept a
new parameter "format", which specifies the format of the dump
to be created.
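
One way to picture a generic, format-keyed dump interface is the
sketch below. It is not the code in this commit: the registry, the
function names and the toy "model" (a plain list of weights) are all
hypothetical and only illustrate how a "format" argument can select
a dumper while leaving room for new formats.

```python
import json

# Hypothetical registry: format name -> dumper function.
# None of these names come from XGBoost itself.
DUMPERS = {}


def register_dumper(name):
    """Register a dumper function under the given format name."""
    def wrap(func):
        DUMPERS[name] = func
        return func
    return wrap


@register_dumper("text")
def dump_text(weights, with_stats):
    # Toy text dump: one "w<i>=<value>" line per weight.
    lines = ["w{}={}".format(i, w) for i, w in enumerate(weights)]
    if with_stats:
        lines.append("count={}".format(len(weights)))
    return "\n".join(lines)


@register_dumper("json")
def dump_json(weights, with_stats):
    # Toy JSON dump of the same information.
    out = {"weights": list(weights)}
    if with_stats:
        out["count"] = len(weights)
    return json.dumps(out)


def dump_model(weights, with_stats=False, dump_format="text"):
    """Look up the dumper for dump_format and apply it."""
    try:
        dumper = DUMPERS[dump_format]
    except KeyError:
        raise ValueError("unknown dump format: {!r}".format(dump_format))
    return dumper(weights, with_stats)
```

Adding another format in this sketch only requires registering one
more function; callers keep passing a format name.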

* Fix typos and errors in docs

* plugin: Mention all the register macros available

Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.

* sparse_page_source: Use same arg name in .h and .cc

* gbm: Add JSON dump

The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.

The JSON file contains an array; each item is a JSON object.
For gblinear:
 - The item holds the bias and weights vectors.
For gbtree:
 - The item is the root node of a tree. The root node has an
   attribute "children" which holds the child nodes, and each
   child is nested the same way, recursively.
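
The recursive gbtree layout described above can be consumed with a
short recursive walk. The sketch below is an illustration only: the
sample tree is hand-written, and every key other than "children"
("nodeid", "leaf") is an assumption made for the example, not
something this message specifies.

```python
import json

# Hand-written sample standing in for one item of the dumped array.
# Only "children" is taken from the description above; "nodeid" and
# "leaf" are hypothetical keys used for illustration.
SAMPLE_TREE = """
{"nodeid": 0, "children": [
    {"nodeid": 1, "leaf": 0.5},
    {"nodeid": 2, "children": [
        {"nodeid": 3, "leaf": -0.25},
        {"nodeid": 4, "leaf": 0.125}
    ]}
]}
"""


def count_leaves(node):
    """Count nodes that have no "children" entry (or an empty one)."""
    children = node.get("children")
    if not children:
        return 1
    return sum(count_leaves(child) for child in children)


def max_depth(node):
    """Return the number of edges on the longest root-to-leaf path."""
    children = node.get("children")
    if not children:
        return 0
    return 1 + max(max_depth(child) for child in children)


root = json.loads(SAMPLE_TREE)
```

With this sample, `count_leaves(root)` and `max_depth(root)` visit
every node by following "children" until a node without that key is
reached.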

* core.py: Add arg dump_format for get_dump()
This commit is contained in:
AbdealiJK
2016-11-04 22:25:25 +05:30
committed by Tianqi Chen
parent 9c693f0f5f
commit b94fcab4dc
16 changed files with 320 additions and 92 deletions

View File

@@ -446,6 +446,23 @@ XGB_DLL int XGBoosterDumpModel(BoosterHandle handle,
bst_ulong *out_len,
const char ***out_dump_array);
/*!
* \brief dump model, return array of strings representing model dump
* \param handle handle
* \param fmap name to fmap can be empty string
* \param with_stats whether to dump with statistics
* \param format the format to dump the model in
* \param out_len length of output array
* \param out_dump_array pointer to hold representing dump of each model
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterDumpModelEx(BoosterHandle handle,
const char *fmap,
int with_stats,
const char *format,
bst_ulong *out_len,
const char ***out_dump_array);
/*!
* \brief dump model, return array of strings representing model dump
* \param handle handle
@@ -465,6 +482,27 @@ XGB_DLL int XGBoosterDumpModelWithFeatures(BoosterHandle handle,
bst_ulong *out_len,
const char ***out_models);
/*!
* \brief dump model, return array of strings representing model dump
* \param handle handle
* \param fnum number of features
* \param fname names of features
* \param ftype types of features
* \param with_stats whether to dump with statistics
* \param format the format to dump the model in
* \param out_len length of output array
* \param out_models pointer to hold representing dump of each model
* \return 0 when success, -1 when failure happens
*/
XGB_DLL int XGBoosterDumpModelExWithFeatures(BoosterHandle handle,
int fnum,
const char **fname,
const char **ftype,
int with_stats,
const char *format,
bst_ulong *out_len,
const char ***out_models);
/*!
* \brief Get string attribute from Booster.
* \param handle handle

View File

@@ -225,7 +225,7 @@ struct RowSet {
* - Provide a dmlc::Parser and pass into the DMatrix::Create
* - Alternatively, if data can be represented by an URL, define a new dmlc::Parser and register by DMLC_REGISTER_DATA_PARSER;
* - This works best for user defined data input source, such as data-base, filesystem.
* - Provdie a DataSource, that can be passed to DMatrix::Create
* - Provide a DataSource, that can be passed to DMatrix::Create
* This can be used to re-use inmemory data structure into DMatrix.
*/
class DMatrix {

View File

@@ -108,12 +108,15 @@ class GradientBooster {
std::vector<float>* out_preds,
unsigned ntree_limit = 0) = 0;
/*!
* \brief dump the model to text format
* \brief dump the model in the requested format
* \param fmap feature map that may help give interpretations of feature
* \param option extra option of the dump model
* \param with_stats extra statistics while dumping model
* \param format the format to dump the model in
* \return a vector of dump for boosters.
*/
virtual std::vector<std::string> Dump2Text(const FeatureMap& fmap, int option) const = 0;
virtual std::vector<std::string> DumpModel(const FeatureMap& fmap,
bool with_stats,
std::string format) const = 0;
/*!
* \brief create a gradient booster from given name
* \param name name of gradient booster

View File

@@ -140,12 +140,15 @@ class Learner : public rabit::Serializable {
*/
bool AllowLazyCheckPoint() const;
/*!
* \brief dump the model in text format
* \brief dump the model in the requested format
* \param fmap feature map that may help give interpretations of feature
* \param option extra option of the dump model
* \param with_stats extra statistics while dumping model
* \param format the format to dump the model in
* \return a vector of dump for boosters.
*/
std::vector<std::string> Dump2Text(const FeatureMap& fmap, int option) const;
std::vector<std::string> DumpModel(const FeatureMap& fmap,
bool with_stats,
std::string format) const;
/*!
* \brief online prediction function, predict score for one instance at a time
* NOTE: use the batch prediction interface if possible, batch prediction is usually

View File

@@ -480,12 +480,15 @@ class RegTree: public TreeModel<bst_float, RTreeNodeStat> {
*/
inline int GetNext(int pid, float fvalue, bool is_unknown) const;
/*!
* \brief dump model to text string
* \param fmap feature map of feature types
* \brief dump the model in the requested format as a text string
* \param fmap feature map that may help give interpretations of feature
* \param with_stats whether dump out statistics as well
* \param format the format to dump the model in
* \return the string of dumped model
*/
std::string Dump2Text(const FeatureMap& fmap, bool with_stats) const;
std::string DumpModel(const FeatureMap& fmap,
bool with_stats,
std::string format) const;
};
// implementations of inline functions