* - fix external memory documentation [skip ci] - to state that it is supported now on gpu algorithms
54 lines
2.0 KiB
ReStructuredText
54 lines
2.0 KiB
ReStructuredText
############################################
|
|
Using XGBoost External Memory Version (beta)
|
|
############################################
|
|
There is no big difference between using external memory version and in-memory version.
|
|
The only difference is the filename format.
|
|
|
|
The external memory version takes in the following filename format:
|
|
|
|
.. code-block:: none
|
|
|
|
filename#cacheprefix
|
|
|
|
The ``filename`` is the normal path to libsvm file you want to load in, and ``cacheprefix`` is a
|
|
path to a cache file that XGBoost will use for external memory cache.
|
|
|
|
.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
|
|
|
|
The following code was extracted from `demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_:
|
|
|
|
.. code-block:: python
|
|
|
|
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
|
|
|
|
You can find that there is additional ``#dtrain.cache`` following the libsvm file, this is the name of cache file.
|
|
For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
|
|
|
|
****************
|
|
Performance Note
|
|
****************
|
|
* the parameter ``nthread`` should be set to number of **physical** cores
|
|
|
|
- Most modern CPUs use hyperthreading, which means a 4 core CPU may carry 8 threads
|
|
- Set ``nthread`` to be 4 for maximum performance in such case
|
|
|
|
*******************
|
|
Distributed Version
|
|
*******************
|
|
The external memory mode naturally works on distributed version, you can simply set path like
|
|
|
|
.. code-block:: none
|
|
|
|
data = "hdfs://path-to-data/#dtrain.cache"
|
|
|
|
XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporal
|
|
so that you can directly use ``dtrain.cache`` to cache to current folder.
|
|
|
|
**********
|
|
Usage Note
|
|
**********
|
|
* This is a experimental version
|
|
* Currently only importing from libsvm format is supported
|
|
|
|
- Contribution of ingestion from other common external memory data source is welcomed
|