update docs for gpu external memory (#5332)
* update docs for gpu external memory
* add hist limitation

#####################################
Using XGBoost External Memory Version
#####################################
There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.

The external memory version takes in the following filename format:

.. code-block:: none

  filename#cacheprefix

The ``filename`` is the normal path to the libsvm format file you want to load in, and
``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
data in binary form.

.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)

To load from csv files, use the following syntax:

.. code-block:: none

  filename.csv?format=csv&label_column=0#cacheprefix

where ``label_column`` should point to the csv column acting as the label.

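To make the filename format concrete, here is a small self-contained sketch. Note that ``split_cache_uri`` is a hypothetical helper written for this tutorial, not part of the XGBoost API; it only shows how the ``filename[?key=value&...]#cacheprefix`` format decomposes:

```python
# Hypothetical helper, NOT part of XGBoost: illustrates how the
# "filename[?key=value&...]#cacheprefix" format breaks down.
def split_cache_uri(uri):
    # The cache prefix follows the '#' separator, if present.
    path, _, cacheprefix = uri.partition('#')
    # Optional parameters (e.g. format, label_column) follow '?'.
    filename, _, query = path.partition('?')
    params = dict(kv.split('=', 1) for kv in query.split('&')) if query else {}
    return filename, params, cacheprefix

# -> ('filename.csv', {'format': 'csv', 'label_column': '0'}, 'cacheprefix')
print(split_cache_uri('filename.csv?format=csv&label_column=0#cacheprefix'))
```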
To provide a simple example for illustration, we extract the code from
`demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_. If
you have a dataset stored in a file similar to ``agaricus.txt.train`` in libSVM format, external
memory support can be enabled by:

.. code-block:: python

  dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
``dtrain.cache`` as an on-disk cache for storing preprocessed data in an internal binary format. For
more notes about text input formats, see :doc:`/tutorials/input_format`.

For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.

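For the CLI version, the data path with its cache suffix goes into the ``data`` entry of the configuration file. A minimal sketch of generating such a configuration; the file name ``mushroom.conf`` and the other parameter values are illustrative assumptions, not taken from this document:

```python
# Sketch: writing a minimal CLI configuration that enables the external
# memory cache via the '#cacheprefix' suffix on the data path.
# 'mushroom.conf' and the booster/objective values are illustrative.
conf = '\n'.join([
    'booster = gbtree',
    'objective = binary:logistic',
    '# the "#dtrain.cache" suffix turns on external-memory caching',
    'data = "../data/agaricus.txt.train#dtrain.cache"',
])
with open('mushroom.conf', 'w') as f:
    f.write(conf)
```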
****************
Performance Note
****************
* the parameter ``nthread`` should be set to the number of **physical** cores

  - Most modern CPUs use hyperthreading, which means a 4 core CPU may carry 8 threads
  - Set ``nthread`` to be 4 for maximum performance in such a case

***********
GPU Version
***********
External memory is fully supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).

If you are still getting out-of-memory errors after enabling external memory, try subsampling the
data to further reduce GPU memory usage:

.. code-block:: python

  param = {
    ...
    'subsample': 0.1,
    'sampling_method': 'gradient_based',
  }

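For reference, a fuller parameter dictionary might look like the following. This is a sketch: the objective and the concrete subsample value are illustrative assumptions, not tuned recommendations, and gradient-based sampling requires ``tree_method`` set to ``gpu_hist``:

```python
# Illustrative parameter set for gpu_hist with external memory and
# gradient-based sampling; values are examples, not recommendations.
param = {
    'tree_method': 'gpu_hist',          # GPU hist algorithm (supports external memory)
    'objective': 'binary:logistic',     # assumed objective for the agaricus example
    'subsample': 0.1,                   # keep ~10% of rows per iteration
    'sampling_method': 'gradient_based',  # sample rows proportionally to gradient magnitude
}
print(param['sampling_method'])
```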
*******************
Distributed Version
*******************
The external memory mode naturally works on the distributed version; you can simply set the path like:

.. code-block:: none

  data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporary,
so you can directly use ``dtrain.cache`` to cache to the current folder.

**********
Usage Note
**********
* This is an experimental version
* Currently only importing from libsvm format is supported

  - Contribution of ingestion from other common external memory data sources is welcome

***********
Limitations
***********
* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
  `this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
* OSX is not tested.