Export Python Interface for external memory. (#7070)

* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
2021-07-22 15:15:53 +08:00
parent e64ee6592f
commit e6088366df
34 changed files with 961 additions and 200 deletions
--- a/doc/tutorials/external_memory.rst
+++ b/doc/tutorials/external_memory.rst
@@ -1,6 +1,75 @@
 #####################################
 Using XGBoost External Memory Version
 #####################################
+
+XGBoost supports loading data from external memory using its built-in data parser.
+Starting from version 1.5, users can also define a custom iterator to load data in chunks.
+The feature is still experimental and not yet ready for production use.  In this tutorial
+we will introduce both methods.  Please note that training on data from external memory is
+not supported by the ``exact`` tree method.
+
+*************
+Data Iterator
+*************
+
+Starting from XGBoost 1.5, users can define their own data loader using the Python or C
+interface.  There are some examples in the ``demo`` directory for a quick start.  This is a
+generalized version of text input external memory, where users no longer need to prepare a
+text file that XGBoost recognizes.  To enable the feature, users need to define a data
+iterator with two methods, ``next`` and ``reset``, then pass it into the ``DMatrix``
+constructor.
+
+.. code-block:: python
+
+  import os
+  from typing import List, Callable
+  import xgboost
+  from sklearn.datasets import load_svmlight_file
+
+  class Iterator(xgboost.DataIter):
+    def __init__(self, svm_file_paths: List[str]):
+      self._file_paths = svm_file_paths
+      self._it = 0
+      # XGBoost will generate some cache files under the current directory with the
+      # prefix "cache".
+      super().__init__(cache_prefix=os.path.join(".", "cache"))
+
+    def next(self, input_data: Callable):
+      """Advance the iterator by 1 step and pass the data to XGBoost.  This function is
+      called by XGBoost during the construction of ``DMatrix``
+
+      """
+      if self._it == len(self._file_paths):
+        # return 0 to let XGBoost know this is the end of iteration
+        return 0
+
+      # input_data is a function passed in by XGBoost that has the same signature as
+      # ``DMatrix``.  Pass the arguments by keyword.
+      X, y = load_svmlight_file(self._file_paths[self._it])
+      input_data(data=X, label=y)
+      self._it += 1
+      # Return 1 to let XGBoost know we haven't seen all the files yet.
+      return 1
+
+    def reset(self):
+      """Reset the iterator to its beginning"""
+      self._it = 0
+
+  it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
+  Xy = xgboost.DMatrix(it)
+
+  # Other tree methods including ``hist`` and ``gpu_hist`` also work, but have some
+  # caveats, as noted in the following sections.
+  booster = xgboost.train({"tree_method": "approx"}, Xy)
+
+
+The above snippet is a simplified version of ``demo/guide-python/external_memory.py``.  For
+an example in C, please see ``demo/c-api/external-memory/``.
+
+****************
+Text File Inputs
+****************
+
 There is no big difference between using external memory version and in-memory version.
 The only difference is the filename format.

@@ -36,10 +105,11 @@ more notes about text input formats, see :doc:`/tutorials/input_format`.

 For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.

-***********
-GPU Version
-***********
-External memory is fully supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
+
+**********************************
+GPU Version (GPU Hist tree method)
+**********************************
+External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).

 If you are still getting out-of-memory errors after enabling external memory, try subsampling the
 data to further reduce GPU memory usage:
@@ -52,23 +122,14 @@ data to further reduce GPU memory usage:
    'sampling_method': 'gradient_based',
  }

-For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_.
+For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_.  Internally,
+the tree method still concatenates all the chunks into one final histogram index for
+performance reasons, but in a compressed format.  So its scalability has an upper bound,
+but it still has a lower memory cost in general.

-*******************
-Distributed Version
-*******************
-The external memory mode naturally works on distributed version, you can simply set path like
+********
+CPU Hist
+********

-.. code-block:: none
-
-  data = "hdfs://path-to-data/#dtrain.cache"
-
-XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporary
-so that you can directly use ``dtrain.cache`` to cache to current folder.
-
-***********
-Limitations
-***********
-* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
-  `this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
-* OSX is not tested.
+It is limited by the same factor as GPU Hist, except that gradient-based sampling is not
+yet supported on CPU.