From 9f52e834dca67fe19e6d5933f8bbcf21fd7e2c7f Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Tue, 26 Nov 2019 00:22:02 +0800
Subject: [PATCH] [doc] Some notes for external memory. (#5065)

---
 doc/python/python_intro.rst       |  2 +-
 doc/tutorials/external_memory.rst | 22 ++++++++++++++++------
 doc/tutorials/input_format.rst    |  5 +----
 3 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/doc/python/python_intro.rst b/doc/python/python_intro.rst
index 761d252ad..6bb138a2f 100644
--- a/doc/python/python_intro.rst
+++ b/doc/python/python_intro.rst
@@ -54,7 +54,7 @@ The data is stored in a :py:class:`DMatrix ` object.
 .. note:: Categorical features not supported
 
   Note that XGBoost does not provide specialization for categorical features; if your data contains
-  categorical features, load it as a NumPy array first and then perform
+  categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like
   `one-hot encoding `_.
 
 .. note:: Use Pandas to load CSV files with headers
diff --git a/doc/tutorials/external_memory.rst b/doc/tutorials/external_memory.rst
index 7800fa83c..b5427d127 100644
--- a/doc/tutorials/external_memory.rst
+++ b/doc/tutorials/external_memory.rst
@@ -4,24 +4,34 @@ Using XGBoost External Memory Version (beta)
 There is no big difference between using external memory version and in-memory version.
 The only difference is the filename format.
 
-The external memory version takes in the following filename format:
+The external memory version takes in the following `URI `_ format:
 
 .. code-block:: none
 
   filename#cacheprefix
 
-The ``filename`` is the normal path to libsvm file you want to load in, and ``cacheprefix`` is a
-path to a cache file that XGBoost will use for external memory cache.
+The ``filename`` is the normal path to libsvm format file you want to load in, and
+``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
+data in binary form.
 
 .. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
 
-The following code was extracted from `demo/guide-python/external_memory.py `_:
+To provide a simple example for illustration, extracting the code from
+`demo/guide-python/external_memory.py `_. If
+you have a dataset stored in a file similar to ``agaricus.txt.train`` with libSVM format, the external memory support can be enabled by:
+
+.. code-block:: python
+
+  dtrain = DMatrix('../data/agaricus.txt.train#dtrain.cache')
+
+XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
+``dtrain.cache`` as an on disk cache for storing preprocessed data in a internal binary format. For
+more notes about text input formats, see :doc:`/tutorials/input_format`.
 
 .. code-block:: python
 
   dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
 
-You can find that there is additional ``#dtrain.cache`` following the libsvm file, this is the name of cache file.
 For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
 
 ****************
@@ -47,7 +57,7 @@ so that you can directly use ``dtrain.cache`` to cache to current folder.
 **********
 Usage Note
 **********
-* This is a experimental version
+* This is an experimental version
 * Currently only importing from libsvm format is supported
 * OSX is not tested.
 
diff --git a/doc/tutorials/input_format.rst b/doc/tutorials/input_format.rst
index f844e09a4..f0cb69c2c 100644
--- a/doc/tutorials/input_format.rst
+++ b/doc/tutorials/input_format.rst
@@ -5,10 +5,7 @@ Text Input Format of DMatrix
 ******************
 Basic Input Format
 ******************
-XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See `this Wikipedia article `_ for a description of the CSV format.)
-
-.. note::
-  * XGBoost does **not** understand file extensions nor try to guess the file format. Instead it employs uri format for specifying input file type. For example if you provide a `csv` file ``./data.train.csv`` as input, XGBoost will use the default libsvm parser to digest it and generate a parser error. Instead, users need to provide an uri in the form of ``train.csv?format=csv``. For external memory input, the uri should of a form similar to ``train.csv?format=csv#dtrain.cache``. See :ref:`python_data_interface` also.
+XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See `this Wikipedia article `_ for a description of the CSV format.). Please be careful that, XGBoost does **not** understand file extensions, nor try to guess the file format, as there is no universal agreement upon file extension of LibSVM or CSV. Instead it employs `URI `_ format for specifying the precise input file type. For example if you provide a `csv` file ``./data.train.csv`` as input, XGBoost will blindly use the default libsvm parser to digest it and generate a parser error. Instead, users need to provide an uri in the form of ``train.csv?format=csv``. For external memory input, the uri should of a form similar to ``train.csv?format=csv#dtrain.cache``. See :ref:`python_data_interface` and :doc:`/tutorials/external_memory` also.
 
 For training or predicting, XGBoost takes an instance file with the format as below: