update docs for gpu external memory (#5332)
* update docs for gpu external memory
* add hist limitation
commit d6b31df449 (parent 7ac7e8778f)
@@ -88,6 +88,17 @@ Parameters for Tree Booster
   - Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost would randomly sample half of the training data prior to growing trees, and this will prevent overfitting. Subsampling will occur once in every boosting iteration.
   - range: (0,1]
 
+* ``sampling_method`` [default= ``uniform``]
+
+  - The method to use to sample the training instances.
+  - ``uniform``: each training instance has an equal probability of being selected. Typically set
+    ``subsample`` >= 0.5 for good results.
+  - ``gradient_based``: the selection probability for each training instance is proportional to the
+    *regularized absolute value* of gradients (more specifically, :math:`\sqrt{g^2+\lambda h^2}`).
+    ``subsample`` may be set to as low as 0.1 without loss of model accuracy. Note that this
+    sampling method is only supported when ``tree_method`` is set to ``gpu_hist``; other tree
+    methods only support ``uniform`` sampling.
+
 * ``colsample_bytree``, ``colsample_bylevel``, ``colsample_bynode`` [default=1]
 
   - This is a family of parameters for subsampling of columns.
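A minimal sketch of how the ``gradient_based`` sampling added above could be used from the Python package; ``dtrain`` and the round count are placeholders, not part of the patch:

.. code-block:: python

   import xgboost as xgb

   # Gradient-based sampling is only supported by the GPU histogram method.
   params = {
       'tree_method': 'gpu_hist',
       'sampling_method': 'gradient_based',
       'subsample': 0.1,  # may be set much lower than with 'uniform' sampling
   }
   # dtrain: an xgb.DMatrix constructed elsewhere.
   bst = xgb.train(params, dtrain, num_boost_round=100)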
@@ -1,6 +1,6 @@
-############################################
-Using XGBoost External Memory Version (beta)
-############################################
+#####################################
+Using XGBoost External Memory Version
+#####################################
 There is no big difference between using the external memory version and the in-memory version.
 The only difference is the filename format.
 
@@ -14,7 +14,13 @@ The ``filename`` is the normal path to libsvm format file you want to load in, a
 ``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
 data in binary form.
 
-.. note:: External memory is also available with GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``)
+To load from csv files, use the following syntax:
+
+.. code-block:: none
+
+  filename.csv?format=csv&label_column=0#cacheprefix
+
+where ``label_column`` should point to the csv column acting as the label.
 
 To provide a simple example for illustration, here is an extract of the code from
 `demo/guide-python/external_memory.py <https://github.com/dmlc/xgboost/blob/master/demo/guide-python/external_memory.py>`_. If
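For illustration only (not part of the patch), the csv syntax above could be used from Python like this; the file names are hypothetical:

.. code-block:: python

   import xgboost as xgb

   # Column 0 of train.csv holds the label; 'train.cache' is the cache prefix.
   dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0#train.cache')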
@@ -25,22 +31,26 @@ you have a dataset stored in a file similar to ``agaricus.txt.train`` with libSV
   dtrain = DMatrix('../data/agaricus.txt.train#dtrain.cache')
 
 XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
-``dtrain.cache`` as an on disk cache for storing preprocessed data in a internal binary format. For
+``dtrain.cache`` as an on disk cache for storing preprocessed data in an internal binary format. For
 more notes about text input formats, see :doc:`/tutorials/input_format`.
 
 .. code-block:: python
 
   dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
 
 For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
 
-****************
-Performance Note
-****************
-* the parameter ``nthread`` should be set to number of **physical** cores
-
-  - Most modern CPUs use hyperthreading, which means a 4 core CPU may carry 8 threads
-  - Set ``nthread`` to be 4 for maximum performance in such case
+***********
+GPU Version
+***********
+External memory is fully supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
+
+If you are still getting out-of-memory errors after enabling external memory, try subsampling the
+data to further reduce GPU memory usage:
+
+.. code-block:: python
+
+  param = {
+    ...
+    'subsample': 0.1,
+    'sampling_method': 'gradient_based',
+  }
 
 *******************
 Distributed Version
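Pulling the pieces of this hunk together, a complete GPU run with external memory might look like the following sketch; the objective and number of rounds are illustrative additions, not part of the patch:

.. code-block:: python

   import xgboost as xgb

   # The '#dtrain.cache' suffix enables the on-disk cache for this file.
   dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

   param = {
       'objective': 'binary:logistic',
       'tree_method': 'gpu_hist',
       # Aggressive subsampling to further reduce GPU memory usage.
       'subsample': 0.1,
       'sampling_method': 'gradient_based',
   }
   bst = xgb.train(param, dtrain, num_boost_round=10)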
@@ -51,14 +61,12 @@ The external memory mode naturally works on distributed version, you can simply
 
   data = "hdfs://path-to-data/#dtrain.cache"
 
-XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporal
+XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporary
 so that you can directly use ``dtrain.cache`` to cache to the current folder.
 
-**********
-Usage Note
-**********
-* This is an experimental version
-* Currently only importing from libsvm format is supported
-
-  - Contribution of ingestion from other common external memory data source is welcomed
+***********
+Limitations
+***********
+* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
+  `this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
+* OSX is not tested.
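As a rough sketch (not part of the patch), the hdfs cache path above is passed wherever the training data path normally goes, e.g. in a CLI-style configuration file; the surrounding keys are illustrative:

.. code-block:: none

   # hypothetical mushroom.conf for the CLI version
   booster = gbtree
   objective = binary:logistic
   data = "hdfs://path-to-data/#dtrain.cache"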