[EM] Make page concatenation optional. (#10826)
This PR introduces a new parameter `extmem_concat_pages` to make page concatenation optional for GPU hist. In addition, the documentation is updated for the new GPU-based external memory support.
@@ -4,15 +4,13 @@ Using XGBoost External Memory Version

When working with large datasets, training XGBoost models can be challenging as the
entire dataset needs to be loaded into memory. This can be costly and sometimes
infeasible. Starting from 1.5, users can define a custom iterator to load data in chunks
for running XGBoost algorithms. External memory can be used for training and prediction,
but training is the primary use case and it will be our focus in this tutorial. For
prediction and evaluation, users can iterate through the data themselves, whereas training
requires the entire dataset to be loaded into memory. Significant progress was made in the
3.0 release for the GPU implementation. We will introduce the differences between the CPU
and GPU implementations in the following sections.

.. note::
@@ -20,27 +18,33 @@ GPU-based training algorithm. We will introduce them in the following sections.

.. note::

   The feature is considered experimental but ready for public testing in 3.0. Vector-leaf
   is not yet supported.

The external memory support has undergone multiple development iterations. Like the
:py:class:`~xgboost.QuantileDMatrix` with :py:class:`~xgboost.DataIter`, XGBoost loads
data batch-by-batch using a custom iterator supplied by the user. However, unlike the
:py:class:`~xgboost.QuantileDMatrix`, external memory does not concatenate the batches
unless the ``extmem_concat_pages`` parameter is specified. Instead, it caches all batches
in external memory and fetches them on demand. Go to the end of the document to see a
comparison between the :py:class:`~xgboost.QuantileDMatrix` and its external memory
counterpart, the :py:class:`~xgboost.ExtMemQuantileDMatrix`.

**Contents**

.. contents::
   :backlinks: none
   :local:

*************
Data Iterator
*************

Starting with XGBoost 1.5, users can define their own data loader using the Python or C
interface. Some examples are in the ``demo`` directory for a quick start. To enable
external memory training, users need to define a data iterator with two class methods:
``next`` and ``reset``, then pass it into the :py:class:`~xgboost.DMatrix` or the
:py:class:`~xgboost.ExtMemQuantileDMatrix` constructor.

.. code-block:: python
@@ -53,20 +57,20 @@ iterator with 2 class methods: ``next`` and ``reset``, then pass it into the

    def __init__(self, svm_file_paths: List[str]):
        self._file_paths = svm_file_paths
        self._it = 0
        # XGBoost will generate some cache files under the current directory with the
        # prefix "cache"
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data: Callable):
        """Advance the iterator by 1 step and pass the data to XGBoost. This function
        is called by XGBoost during the construction of ``DMatrix``.

        """
        if self._it == len(self._file_paths):
            # Return 0 to let XGBoost know this is the end of the iteration.
            return 0

        # input_data is a function passed in by XGBoost that has the exact same
        # signature as the ``DMatrix`` constructor.
        X, y = load_svmlight_file(self._file_paths[self._it])
        input_data(data=X, label=y)
@@ -79,59 +83,106 @@ iterator with 2 class methods: ``next`` and ``reset``, then pass it into the

        self._it = 0

it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])

Xy = xgboost.ExtMemQuantileDMatrix(it)
booster = xgboost.train({"tree_method": "hist"}, Xy)

# The ``approx`` tree method also works, but with lower performance and cannot be used
# with the quantile DMatrix.
Xy = xgboost.DMatrix(it)
booster = xgboost.train({"tree_method": "approx"}, Xy)

The above snippet is a simplified version of :ref:`sphx_glr_python_examples_external_memory.py`.
For an example in C, please see ``demo/c-api/external-memory/``. The iterator is the
common interface for using external memory with XGBoost; you can pass the resulting
:py:class:`~xgboost.DMatrix` object for training, prediction, and evaluation.

The :py:class:`~xgboost.ExtMemQuantileDMatrix` is an external memory version of the
:py:class:`~xgboost.QuantileDMatrix`. These two classes are specifically designed for the
``hist`` tree method to reduce memory usage and data loading overhead. See the respective
references for more information.

***********
CPU Version
***********

It is important to set the batch size based on the memory available. A good starting point
for the CPU is to set the batch size to 10GB per batch if you have 64GB of memory. It is
*not* recommended to set small batch sizes like 32 samples per batch, as this can severely
hurt performance in gradient boosting. See the sections below for information about the
GPU version and other best practices.
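
As a rough sanity check, the sizing advice above can be turned into a few lines of
arithmetic. This is only a sketch: the helper name is made up for illustration, and the
dense float32 layout and the 10GB budget are assumptions taken from the paragraph above.

```python
def rows_per_batch(batch_bytes: int, n_features: int, bytes_per_cell: int = 4) -> int:
    """How many rows of a dense float32 matrix fit in the given batch budget."""
    return batch_bytes // (n_features * bytes_per_cell)


# With a 10GB budget and 100 dense float32 features, a batch holds roughly
# 27 million rows -- nowhere near the tiny 32-sample batches discouraged above.
print(rows_per_batch(10 * 1024**3, n_features=100))
```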

**********************************
GPU Version (GPU Hist tree method)
**********************************

External memory is supported by GPU algorithms (i.e., when ``device`` is set to
``cuda``). Starting with 3.0, the default GPU implementation is similar to what the CPU
version does. It also supports the use of :py:class:`~xgboost.ExtMemQuantileDMatrix` when
the ``hist`` tree method is employed. For a GPU device, the main memory is the device
memory, whereas the external memory can be either a disk or the CPU memory. XGBoost stages
the cache on CPU memory by default. Users can change the backing storage to disk by
specifying the ``on_host`` parameter in the :py:class:`~xgboost.DataIter`. However, using
the disk is not recommended, as it is likely to make the GPU slower than the CPU. The
option is there for experimental purposes only.

Inputs to the :py:class:`~xgboost.ExtMemQuantileDMatrix` (through the iterator) must be on
the GPU. This is a current limitation that we aim to address in the future.

.. code-block:: python

import cupy as cp
import rmm
import xgboost
from rmm.allocators.cupy import rmm_cupy_allocator

# It's important to use RMM for GPU-based external memory to improve performance.
# If XGBoost is not built with RMM support, a warning will be raised.
mr = rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource())
rmm.mr.set_current_device_resource(mr)
# Set the allocator for cupy as well.
cp.cuda.set_allocator(rmm_cupy_allocator)

# Make sure XGBoost is using RMM for all allocations.
with xgboost.config_context(use_rmm=True):
    # Construct the iterators for the ExtMemQuantileDMatrix
    # ...
    # Build the ExtMemQuantileDMatrix and start training.
    Xy_train = xgboost.ExtMemQuantileDMatrix(it_train, max_bin=n_bins)
    Xy_valid = xgboost.ExtMemQuantileDMatrix(it_valid, max_bin=n_bins, ref=Xy_train)
    booster = xgboost.train(
        {
            "tree_method": "hist",
            "max_depth": 6,
            "max_bin": n_bins,
            "device": device,
        },
        Xy_train,
        num_boost_round=n_rounds,
        evals=[(Xy_train, "Train"), (Xy_valid, "Valid")],
    )

It's crucial to use `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`__ for
all memory allocations when training with external memory. XGBoost relies on the memory
pool to reduce the overhead of data fetching. The size of each batch should be slightly
smaller than a quarter of the available GPU memory. In addition, the open source `NVIDIA
Linux driver
<https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/>`__
is required for heterogeneous memory management (HMM) support.

In addition to the batch-based data fetching, the GPU version supports concatenating
batches into a single blob for the training data to improve performance. For GPUs
connected via PCIe instead of NVLink, the performance overhead of batch-based training
is significant, particularly for non-dense data. Overall, it can be at least five times
slower than in-core training. Concatenating pages can bring the performance closer to
in-core training. This option should be used in combination with subsampling to reduce
the memory usage. During concatenation, subsampling removes a portion of the samples,
reducing the size of the training dataset. The GPU hist tree method supports
`gradient-based sampling`, enabling users to set a low sampling rate without compromising
accuracy. Before 3.0, concatenation with subsampling was the only option for GPU-based
external memory. After 3.0, XGBoost uses regular batch fetching as the default, while
page concatenation can be enabled by:

.. code-block:: python

param = {
    # ...
    "device": "cuda",
    "extmem_concat_pages": True,
    "subsample": 0.2,
    "sampling_method": "gradient_based",
}

@@ -139,10 +190,70 @@ without compromising accuracy.

For more information about the sampling algorithm and its use in external memory training,
see `this paper <https://arxiv.org/abs/2005.09148>`_.

==========
NVLink-C2C
==========

The newer NVIDIA platforms like `Grace-Hopper
<https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/>`__ use `NVLink-C2C
<https://www.nvidia.com/en-us/data-center/nvlink-c2c/>`__, which provides a fast
interconnect between the CPU and the GPU. With the host memory serving as the data cache,
XGBoost can retrieve data with significantly lower overhead. When the input data is dense,
there's minimal to no performance loss for training, except for the initial construction
of the :py:class:`~xgboost.ExtMemQuantileDMatrix`. The initial construction iterates
through the input data twice; as a result, the most significant overhead compared to
in-core training is one additional data read when the data is dense.

To run experiments on these platforms, the open source `NVIDIA Linux driver
<https://developer.nvidia.com/blog/nvidia-transitions-fully-towards-open-source-gpu-kernel-modules/>`__
with version ``>=565.47`` is required.

**************
Best Practices
**************

In previous sections, we demonstrated how to train a tree-based model with data residing
in external memory and made some recommendations for the batch size. Here are some other
configurations we find useful. The external memory feature involves iterating through data
batches stored in a cache during tree construction. For optimal performance, we recommend
using the ``grow_policy=depthwise`` setting, which allows XGBoost to build an entire layer
of tree nodes with only a few batch iterations. Conversely, using the ``lossguide`` policy
requires XGBoost to iterate over the dataset for each tree node, resulting in
significantly slower performance.

In addition, the ``hist`` tree method should be preferred over the ``approx`` tree method,
as the former doesn't recreate the histogram bins for every iteration. Creating the
histogram bins requires loading the raw input data, which is prohibitively expensive. The
:py:class:`~xgboost.ExtMemQuantileDMatrix`, designed for the ``hist`` tree method, can
significantly speed up the initial data construction and the evaluation for external
memory.

Since the external memory implementation focuses on training, where XGBoost needs to
access the entire dataset, only the ``X`` is divided into batches, while everything else
is concatenated. As a result, it's recommended that users define their own management code
to iterate through the data for inference, especially for SHAP value computation. The size
of the SHAP results can be larger than ``X``, making external memory in XGBoost less
effective. Some frameworks like ``dask`` can help with chunking the data and iterating
through it for inference with memory spilling.
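
The per-batch inference pattern suggested above can be sketched as a plain Python loop.
This is purely illustrative: ``predict_in_chunks`` is not an XGBoost API, and the ``score``
function stands in for a real scorer such as ``booster.inplace_predict`` or a SHAP
computation.

```python
from typing import Callable, List, Sequence


def predict_in_chunks(
    rows: Sequence[Sequence[float]],
    predict_fn: Callable[[Sequence[Sequence[float]]], List[float]],
    chunk_size: int,
) -> List[float]:
    """Score `rows` one chunk at a time so only `chunk_size` rows are resident."""
    out: List[float] = []
    for start in range(0, len(rows), chunk_size):
        out.extend(predict_fn(rows[start : start + chunk_size]))
    return out


# Stand-in scorer: sum of the features. With XGBoost, this would wrap a call
# like ``booster.inplace_predict(chunk)`` instead.
score = lambda chunk: [sum(row) for row in chunk]
print(predict_in_chunks([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], score, chunk_size=2))
# [3.0, 7.0, 11.0]
```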

When external memory is used, the performance of CPU training is limited by disk IO
(input/output) speed. This means that the disk IO speed primarily determines the training
speed. Similarly, PCIe bandwidth limits the GPU performance, assuming the CPU memory is
used as a cache and address translation services (ATS) is unavailable. We recommend using
the regular :py:class:`~xgboost.QuantileDMatrix` over the
:py:class:`~xgboost.ExtMemQuantileDMatrix` for constructing the validation dataset when
feasible. Running inference is much less computation-intensive than training and, hence,
much faster. For the GPU, the time it takes to read the data from host to device
completely determines the time it takes to run inference, even if a C2C link is available.

.. code-block:: python

# Try to use the `QuantileDMatrix` for validation if it can fit into the GPU memory.
Xy_train = xgboost.ExtMemQuantileDMatrix(it_train, max_bin=n_bins)
Xy_valid = xgboost.QuantileDMatrix(it_valid, max_bin=n_bins, ref=Xy_train)

During CPU benchmarking, we used an NVMe drive connected to a PCIe-4 slot. Other types of
storage can be too slow for practical usage. However, your system will likely perform some
caching to reduce the overhead of the file read. See the following sections for remarks.

.. _ext_remarks:

@@ -157,43 +268,43 @@ and internal runtime structures are concatenated. This means that memory reducti

effective when dealing with wide datasets where ``X`` is significantly larger in size
compared to other data like ``y``, while it has little impact on slim datasets.

As one might expect, fetching data on demand puts significant pressure on the storage
device. Today's computing devices can process far more data than storage devices can read
in a single unit of time. The ratio is on the order of magnitudes. A GPU is capable of
processing hundreds of gigabytes of floating-point data in a split second. On the other
hand, a four-lane NVMe storage device connected to a PCIe-4 slot usually has about 6GB/s
of data transfer rate. As a result, the training is likely to be severely bounded by your
storage
device. Before adopting the external memory solution, some back-of-envelope calculations
might help you determine its viability. For instance, if your NVMe drive can transfer 4GB
(a reasonably practical number) of data per second, and you have 100GB of data in a
compressed XGBoost cache (corresponding to a dense float32 numpy array of about 200GB,
give or take), a tree with depth 8 needs at least 16 iterations through the data when the
parameter is optimal. You need about 14 minutes to train a single tree without accounting
for some other overheads and assume the computation overlaps with the IO. If your dataset
happens to have a TB-level size, you might need thousands of trees to get a generalized
model. These calculations can help you get an estimate of the expected training time.
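
The estimate above can be checked with a few lines of arithmetic. The figures below are
the assumptions from the text (a 4GB/s drive, 16 passes for a depth-8 tree); one way to
land near the quoted ~14 minutes is to assume roughly the 200GB uncompressed form streams
on every pass, which is an interpretation on our part rather than a measured number.

```python
data_gb = 200.0        # dense float32 form of the 100GB compressed cache, give or take
bandwidth_gbps = 4.0   # practical NVMe transfer rate
iterations = 16        # a depth-8 tree needs at least 16 passes over the data

# Pure IO-bound lower bound on the time to train one tree.
seconds_per_tree = data_gb * iterations / bandwidth_gbps
print(seconds_per_tree / 60)  # ~13.3 minutes per tree
```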

However, sometimes, we can ameliorate this limitation. One should also consider that the
OS (mainly talking about the Linux kernel) can usually cache the data in host memory. It
only evicts pages when new data comes in and there's no room left. In practice, at least
some portion of the data can persist in the host memory throughout the entire training
session. We are aware of this cache when optimizing the external memory fetcher. The
compressed cache is usually smaller than the raw input data, especially when the input is
dense without any missing values. If the host memory can fit a significant portion of this
compressed cache, the performance should be decent after initialization. Our development
so far focuses on the following fronts of optimization for external memory:

- Avoid iterating through the data whenever appropriate.
- If the OS can cache the data, the performance should be close to in-core training.
- For GPU, the actual computation should overlap with the memory copy as much as possible.

Starting with XGBoost 2.0, the implementation of external memory uses ``mmap``. It has not
been tested against system errors like disconnected network devices (`SIGBUS`). In the
face of a bus error, you will see a hard crash and need to clean up the cache files. If
the training session might take a long time and you use solutions like NVMe-oF, we
recommend checkpointing your model periodically. Also, it's worth noting that most tests
have been conducted on Linux distributions.

Another important point to keep in mind is that creating the initial cache for XGBoost may
take some time. The interface to external memory is through custom iterators, which we
cannot assume to be thread-safe. Therefore, initialization is performed sequentially. Using
@@ -206,13 +317,30 @@ Compared to the QuantileDMatrix

Passing an iterator to the :py:class:`~xgboost.QuantileDMatrix` enables direct
construction of the :py:class:`~xgboost.QuantileDMatrix` with data chunks. On the other
hand, if it's passed to the :py:class:`~xgboost.DMatrix` or the
:py:class:`~xgboost.ExtMemQuantileDMatrix`, it instead enables the external memory
feature. The :py:class:`~xgboost.QuantileDMatrix` concatenates the data in memory after
compression and doesn't fetch data during training. On the other hand, the external memory
:py:class:`~xgboost.DMatrix` (:py:class:`~xgboost.ExtMemQuantileDMatrix`) fetches data
batches from external memory on demand. Use the :py:class:`~xgboost.QuantileDMatrix` (with
an iterator if necessary) when you can fit most of your data in memory. For many
platforms, the training speed can be an order of magnitude faster than with external
memory.
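
The guidance above amounts to a simple decision rule. The helper below and its 80%
threshold are purely illustrative and not part of the XGBoost API; they merely encode
"use the in-memory class when most of the data fits".

```python
def choose_dmatrix(data_gb: float, memory_gb: float) -> str:
    """Pick a quantile DMatrix flavor for the ``hist`` tree method.

    Illustrative rule: stay in-core with QuantileDMatrix when "most" of the
    data fits in memory; otherwise fall back to the external memory variant.
    """
    if data_gb <= 0.8 * memory_gb:  # assumed meaning of "most of your data"
        return "QuantileDMatrix"
    return "ExtMemQuantileDMatrix"


print(choose_dmatrix(data_gb=40, memory_gb=64))   # QuantileDMatrix
print(choose_dmatrix(data_gb=500, memory_gb=64))  # ExtMemQuantileDMatrix
```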

*************
Brief History
*************

For a long time, external memory support has been an experimental feature and has
undergone multiple development iterations. Here's a brief summary of the major changes:

- Gradient-based sampling was introduced to the GPU hist tree method in 1.1.
- The iterator interface was introduced in 1.5, along with a major rewrite of the
  internal framework.
- 2.0 introduced the use of ``mmap``, along with optimizations in XGBoost to enable
  zero-copy data fetching.
- 3.0 reworked the GPU implementation to support caching data on the host and disk,
  introduced the :py:class:`~xgboost.ExtMemQuantileDMatrix` class, and added support for
  quantile-based objectives.

****************
Text File Inputs
****************
@@ -220,11 +348,11 @@ Text File Inputs

.. warning::

   This is the original form of external memory support before 1.5 and is now deprecated;
   users are encouraged to use a custom data iterator instead.

There is no significant difference between using the external memory version of text input
and the in-memory version. The only difference is the filename format.

The external memory version takes in the following `URI
<https://en.wikipedia.org/wiki/Uniform_Resource_Identifier>`_ format:
@@ -233,7 +361,7 @@ The external memory version takes in the following `URI

filename?format=libsvm#cacheprefix

The ``filename`` is the typical path to the LIBSVM format file you want to load in, and
``cacheprefix`` is a path to a cache file that XGBoost will use for caching preprocessed
data in binary form.

@@ -253,7 +381,7 @@ format, the external memory support can be enabled by:

dtrain = DMatrix('../data/agaricus.txt.train?format=libsvm#dtrain.cache')

XGBoost will first load ``agaricus.txt.train`` in, preprocess it, then write to a new file named
``dtrain.cache`` as an on-disk cache for storing preprocessed data in an internal binary format. For
more notes about text input formats, see :doc:`/tutorials/input_format`.

For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train?format=libsvm#dtrain.cache"``.