[doc] Some notes for external memory. (#5065)
Parent: d667ea9335
Commit: 9f52e834dc
@ -54,7 +54,7 @@ The data is stored in a :py:class:`DMatrix <xgboost.DMatrix>` object.
|
|||||||
.. note:: Categorical features not supported
|
.. note:: Categorical features not supported
|
||||||
|
|
||||||
Note that XGBoost does not provide specialization for categorical features; if your data contains
|
Note that XGBoost does not provide specialization for categorical features; if your data contains
|
||||||
categorical features, load it as a NumPy array first and then perform
|
categorical features, load it as a NumPy array first and then perform corresponding preprocessing steps like
|
||||||
`one-hot encoding <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`_.
|
`one-hot encoding <http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html>`_.
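The preprocessing step described in this note can be sketched as follows. This is only an illustration under stated assumptions: it uses plain NumPy rather than the scikit-learn ``OneHotEncoder`` linked above, and the column values and variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: one categorical column and one numeric column.
colors = np.array(["red", "green", "blue", "green"])
numeric = np.array([[1.0], [2.0], [3.0], [4.0]], dtype=np.float32)

# Map each category to an integer code (categories come back sorted).
categories, codes = np.unique(colors, return_inverse=True)

# Pick one row of the identity matrix per sample: a one-hot encoding.
one_hot = np.eye(len(categories), dtype=np.float32)[codes]

# Combine numeric and encoded columns into a single feature matrix.
features = np.hstack([numeric, one_hot])
print(features.shape)
```

The resulting ``features`` array is purely numeric, so it could then be handed to ``xgboost.DMatrix`` together with a label array.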
|
||||||
|
|
||||||
.. note:: Use Pandas to load CSV files with headers
|
.. note:: Use Pandas to load CSV files with headers
|
||||||
|
|||||||
@ -4,24 +4,34 @@ Using XGBoost External Memory Version (beta)
|
|||||||
There is no big difference between using external memory version and in-memory version.
|
There is no big difference between using external memory version and in-memory version.
|
||||||
The only difference is the filename format.
|
The only difference is the filename format.
|
||||||
|
|
||||||
The external memory version takes in the following filename format:
|
The external memory version takes in the following `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format:
|
||||||
|
|
||||||
.. code-block:: none
|
.. code-block:: none
|
||||||
|
|
||||||
filename#cacheprefix
|
filename#cacheprefix
|
||||||
|
|
||||||
The ``filename`` is the normal path to libsvm file you want to load in, and ``cacheprefix`` is a
|
The ``filename`` is the normal path to the LibSVM format file you want to load, and
|
||||||
path to a cache file that XGBoost will use for external memory cache.
|
``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
|
||||||
|
data in binary form.
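To make the ``filename#cacheprefix`` convention concrete, here is a minimal sketch that splits such a string into its two parts. The helper ``split_external_memory_uri`` is hypothetical and only illustrates the naming scheme; it is not how XGBoost parses the URI internally.

```python
def split_external_memory_uri(uri):
    """Split 'filename#cacheprefix' into (filename, cacheprefix).

    Hypothetical helper for illustration; cacheprefix is None when
    the URI has no '#' part (i.e. plain in-memory loading).
    """
    filename, sep, cache_prefix = uri.partition("#")
    return filename, (cache_prefix if sep else None)


filename, cache_prefix = split_external_memory_uri(
    "../data/agaricus.txt.train#dtrain.cache"
)
print(filename)
print(cache_prefix)
```

A string without the ``#`` part corresponds to the ordinary in-memory case, which is why the two usages differ only in the filename format.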
|
||||||
|
|
||||||
.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
|
.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
|
||||||
|
|
||||||
The following code was extracted from `demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_:
|
As a simple illustration, consider the code from
|
||||||
|
`demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_. If
|
||||||
|
you have a dataset stored in a file similar to ``agaricus.txt.train`` with libSVM format, the external memory support can be enabled by:
|
||||||
|
|
||||||
|
.. code-block:: python
|
||||||
|
|
||||||
|
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
|
||||||
|
|
||||||
|
XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
|
||||||
|
``dtrain.cache`` as an on-disk cache for storing preprocessed data in an internal binary format. For
|
||||||
|
more notes about text input formats, see :doc:`/tutorials/input_format`.
|
||||||
|
|
||||||
.. code-block:: python
|
.. code-block:: python
|
||||||
|
|
||||||
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
|
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
|
||||||
|
|
||||||
You can find that there is additional ``#dtrain.cache`` following the libsvm file, this is the name of cache file.
|
|
||||||
For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
|
For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
|
||||||
|
|
||||||
****************
|
****************
|
||||||
@ -47,7 +57,7 @@ so that you can directly use ``dtrain.cache`` to cache to current folder.
|
|||||||
**********
|
**********
|
||||||
Usage Note
|
Usage Note
|
||||||
**********
|
**********
|
||||||
* This is a experimental version
|
* This is an experimental version
|
||||||
* Currently only importing from libsvm format is supported
|
* Currently only importing from libsvm format is supported
|
||||||
* OSX is not tested.
|
* OSX is not tested.
|
||||||
|
|
||||||
|
|||||||
@ -5,10 +5,7 @@ Text Input Format of DMatrix
|
|||||||
******************
|
******************
|
||||||
Basic Input Format
|
Basic Input Format
|
||||||
******************
|
******************
|
||||||
XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See `this Wikipedia article <https://en.wikipedia.org/wiki/Comma-separated_values>`_ for a description of the CSV format.)
|
XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this document will describe the LibSVM format. (See `this Wikipedia article <https://en.wikipedia.org/wiki/Comma-separated_values>`_ for a description of the CSV format.) Note that XGBoost does **not** understand file extensions, nor does it try to guess the file format, as there is no universal agreement on file extensions for LibSVM or CSV. Instead, it employs the `URI <https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format for specifying the precise input file type. For example, if you provide a CSV file ``./data.train.csv`` as input, XGBoost will blindly use the default LibSVM parser to digest it and generate a parser error. To load it correctly, provide a URI in the form of ``train.csv?format=csv``. For external memory input, the URI should be of a form similar to ``train.csv?format=csv#dtrain.cache``. See also :ref:`python_data_interface` and :doc:`/tutorials/external_memory`.
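The URI forms described above can be assembled with a few lines of string handling. The sketch below is an illustration only: ``build_data_uri`` and its parameter names are hypothetical and not part of the XGBoost API; only the resulting string shapes come from the text.

```python
def build_data_uri(path, file_format=None, cache_prefix=None):
    """Compose 'path?format=<fmt>#<cacheprefix>' (hypothetical helper)."""
    uri = path
    if file_format is not None:
        # Tell XGBoost which parser to use instead of the LibSVM default.
        uri += "?format=" + file_format
    if cache_prefix is not None:
        # Request external memory by naming the on-disk cache.
        uri += "#" + cache_prefix
    return uri


csv_uri = build_data_uri("train.csv", file_format="csv")
external_uri = build_data_uri(
    "train.csv", file_format="csv", cache_prefix="dtrain.cache"
)
print(csv_uri)
print(external_uri)
```

Either string would then be passed as the data argument to ``xgboost.DMatrix``; keeping the format and the cache name as separate arguments makes it harder to forget the ``?format=csv`` part when loading CSV files.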
|
||||||
|
|
||||||
.. note::
|
|
||||||
* XGBoost does **not** understand file extensions nor try to guess the file format. Instead it employs uri format for specifying input file type. For example if you provide a `csv` file ``./data.train.csv`` as input, XGBoost will use the default libsvm parser to digest it and generate a parser error. Instead, users need to provide an uri in the form of ``train.csv?format=csv``. For external memory input, the uri should of a form similar to ``train.csv?format=csv#dtrain.cache``. See :ref:`python_data_interface` also.
|
|
||||||
|
|
||||||
For training or predicting, XGBoost takes an instance file with the format as below:
|
For training or predicting, XGBoost takes an instance file with the format as below:
|
||||||
|
|
||||||
|
|||||||