Export Python Interface for external memory. (#7070)
* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
@@ -1,8 +1,8 @@
##############
C API Tutorial
##############

In this tutorial, we are going to install the XGBoost library and configure the CMakeLists.txt file of our C/C++ application to link XGBoost with our application. Later on, we will see some useful tips for using the C API, along with code snippets as examples of using the various functions available in the C API to perform basic tasks like loading data, training a model and predicting on a test dataset.

.. contents::
  :backlinks: none
@@ -12,7 +12,7 @@ In this tutorial, we are going to install XGBoost library & configure the CMakeL

Requirements
************

Install CMake - Follow the `cmake installation documentation <https://cmake.org/install/>`_ for instructions.
Install Conda - Follow the `conda installation documentation <https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html>`_ for instructions.

*************************************
@@ -31,18 +31,18 @@ Run the following commands on your terminal. The below commands will install the

# Activate the Conda environment, into which we'll install XGBoost
conda activate [env_name]
# Build the compiled version of XGBoost inside the build folder
cmake .. -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
# install XGBoost in your conda environment (usually under [your home directory]/miniconda3)
make install

*********************************************************************
Configure CMakeList.txt file of your application to link with XGBoost
*********************************************************************

Here, we assume that your C++ application is using CMake for builds.

Use ``find_package()`` and ``target_link_libraries()`` in your application's CMakeLists.txt to link with the XGBoost library:

.. code-block:: cmake

  cmake_minimum_required(VERSION 3.13)
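  # Illustrative continuation: the remaining lines of this example are not shown in
  # this hunk; "your_app" and the source path below are placeholders.
  project(your_app LANGUAGES C CXX)

  # Locate the installed XGBoost package and link it into your executable.
  find_package(xgboost REQUIRED)
  add_executable(your_app /path/to/your_app.c)
  target_link_libraries(your_app xgboost::xgboost)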
@@ -79,8 +79,8 @@ a. In a C application: Use the following macro to guard all calls to XGBoost's C

   .. code-block:: c

     #define safe_xgboost(call) {  \
       int err = (call); \
       if (err != 0) { \
         fprintf(stderr, "%s:%d: error in %s: %s\n", __FILE__, __LINE__, #call, XGBGetLastError()); \
         exit(1); \
@@ -101,8 +101,8 @@ b. In a C++ application: modify the macro ``safe_xgboost`` to throw an exception

   .. code-block:: cpp

     #define safe_xgboost(call) {  \
       int err = (call); \
       if (err != 0) { \
         throw new Exception(std::string(__FILE__) + ":" + std::to_string(__LINE__) + \
                             ": error in " + #call + ":" + XGBGetLastError()); \
@@ -125,29 +125,29 @@ c. Assertion technique: It works both in C/ C++. If expression evaluates to 0 (f

     #include <stdio.h>
     #include <stdlib.h>
     #include <xgboost/c_api.h>

     int main(int argc, char** argv) {
       int silent = 0;

       BoosterHandle booster;

       // do something with booster

       // free the memory
       XGBoosterFree(booster);

       DMatrixHandle DMatrixHandle_param;

       // do something with DMatrixHandle_param

       // free the memory
       XGDMatrixFree(DMatrixHandle_param);

       return 0;
     }

3. For tree models, it is important to use consistent data formats during training and scoring/predicting, otherwise it will result in wrong outputs.
   For example, if our training data is in ``dense matrix`` format, then your prediction dataset should also be a ``dense matrix``; likewise, if we train in ``libsvm`` format, then the dataset for prediction should also be in ``libsvm`` format.

@@ -166,7 +166,7 @@ Sample examples along with Code snippet to use C API functions

1. If the dataset is available in a file, it can be loaded into a ``DMatrix`` object using the `XGDMatrixCreateFromFile <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#a357c3654a1a4dcc05e6b5c50acd17105>`_ function.

   .. code-block:: c

     DMatrixHandle data; // handle to DMatrix
     // Load the data from file & store it in the data variable of DMatrixHandle datatype
     safe_xgboost(XGDMatrixCreateFromFile("/path/to/file/filename", silent, &data));
@@ -188,10 +188,10 @@ Sample examples along with Code snippet to use C API functions

     // the dmatrix variable will contain the created DMatrix
     safe_xgboost(XGDMatrixCreateFromMat(data1, 1, 50, 0, &dmatrix));
     // here -1 represents the missing value in the matrix dataset
     safe_xgboost(XGDMatrixCreateFromMat(data2, ROWS, COLS, -1, &dmatrix2));

3. Create a ``Booster`` object for training & testing on the dataset using `XGBoosterCreate <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#ad9fe6f8c8c4901db1c7581a96a21f9ae>`_.

   .. code-block:: c

@@ -201,7 +201,7 @@ Sample examples along with Code snippet to use C API functions

     DMatrixHandle eval_dmats[eval_dmats_size] = {train, test};
     safe_xgboost(XGBoosterCreate(eval_dmats, eval_dmats_size, &booster));

4. For each ``DMatrix`` object, set the labels using `XGDMatrixSetFloatInfo <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#aef75cda93db3ae9af89e465ae7e9cbe3>`_. Later you can access the label using `XGDMatrixGetFloatInfo <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#ab0ee317539a1fb1ce2b5f249e8c768f6>`_.

   .. code-block:: c
@@ -221,7 +221,7 @@ Sample examples along with Code snippet to use C API functions

     // Loading the labels
     safe_xgboost(XGDMatrixSetFloatInfo(dmatrix, "label", labels, ROWS));

     // Read the labels and store the length of the result
     bst_ulong result_len;
@@ -233,12 +233,12 @@ Sample examples along with Code snippet to use C API functions

     for(unsigned int i = 0; i < result_len; i++) {
       printf("label[%i] = %f\n", i, result[i]);
     }

5. Set the parameters for the ``Booster`` object as required, using `XGBoosterSetParam <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#af7378865b0c999d2d08a5b16483b8bcb>`_. Check out the full list of parameters available `here <https://xgboost.readthedocs.io/en/latest/parameter.html>`_.

   .. code-block:: c

     BoosterHandle booster;
     safe_xgboost(XGBoosterSetParam(booster, "booster", "gblinear"));
     // default max_depth is 6
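     // Illustrative continuation (not part of the original snippet): other parameters
     // are set the same way, with every value passed as a string.
     safe_xgboost(XGBoosterSetParam(booster, "max_depth", "3"));
     safe_xgboost(XGBoosterSetParam(booster, "eta", "0.1"));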
@@ -1,6 +1,75 @@
#####################################
Using XGBoost External Memory Version
#####################################

XGBoost supports loading data from external memory using its builtin data parser, and
starting from version 1.5, users can also define a custom iterator to load data in chunks.
The feature is still experimental and not yet ready for production use. In this tutorial
we will introduce both methods. Please note that training on data from external memory is
not supported by the ``exact`` tree method.

*************
Data Iterator
*************

Starting from XGBoost 1.5, users can define their own data loader using the Python or C
interface. There are some examples in the ``demo`` directory for a quick start. This is a
generalized version of text-input external memory, where users no longer need to prepare a
text file that XGBoost recognizes. To enable the feature, users need to define a data
iterator with two class methods, ``next`` and ``reset``, and then pass it into the
``DMatrix`` constructor.

.. code-block:: python

  import os
  from typing import List, Callable

  import xgboost
  from sklearn.datasets import load_svmlight_file


  class Iterator(xgboost.DataIter):
      def __init__(self, svm_file_paths: List[str]):
          self._file_paths = svm_file_paths
          self._it = 0
          # XGBoost will generate some cache files under the current directory with the
          # prefix "cache".
          super().__init__(cache_prefix=os.path.join(".", "cache"))

      def next(self, input_data: Callable):
          """Advance the iterator by 1 step and pass the data to XGBoost. This function
          is called by XGBoost during the construction of ``DMatrix``.

          """
          if self._it == len(self._file_paths):
              # Return 0 to let XGBoost know this is the end of the iteration.
              return 0

          # input_data is a function passed in by XGBoost that has the exact same
          # signature as ``DMatrix``.
          X, y = load_svmlight_file(self._file_paths[self._it])
          input_data(X, y)
          self._it += 1
          # Return 1 to let XGBoost know we haven't seen all the files yet.
          return 1

      def reset(self):
          """Reset the iterator to its beginning."""
          self._it = 0


  it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
  Xy = xgboost.DMatrix(it)

  # Other tree methods, including ``hist`` and ``gpu_hist``, also work, but have some
  # caveats as noted in the following sections.
  booster = xgboost.train({"tree_method": "approx"}, Xy)

The above snippet is a simplified version of ``demo/guide-python/external_memory.py``. For
an example in C, please see ``demo/c-api/external-memory/``.

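The snippet assumes that ``file_0.svm``, ``file_1.svm`` and ``file_2.svm`` already exist on
disk. If you just want to try the iterator end to end, one way to create such files (purely
illustrative, not part of the original demo) is to dump some random data with scikit-learn:

.. code-block:: python

  from sklearn.datasets import make_regression, dump_svmlight_file

  # Write three small LIBSVM files so the Iterator above has something to read.
  for i in range(3):
      X, y = make_regression(n_samples=1000, n_features=10, random_state=i)
      dump_svmlight_file(X, y, f"file_{i}.svm")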

****************
Text File Inputs
****************

There is no big difference between using the external memory version and the in-memory version.
The only difference is the filename format.
@@ -36,10 +105,11 @@ more notes about text input formats, see :doc:`/tutorials/input_format`.

For the CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.

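In Python the same suffix is appended to the file name passed to ``DMatrix``. A minimal
sketch (the path below is only a placeholder):

.. code-block:: python

  import xgboost

  # The "#dtrain.cache" suffix asks XGBoost to cache the parsed data on disk
  # using files with the prefix "dtrain.cache".
  dtrain = xgboost.DMatrix("../data/agaricus.txt.train#dtrain.cache")
  booster = xgboost.train({"tree_method": "approx"}, dtrain)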

**********************************
GPU Version (GPU Hist tree method)
**********************************

External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).

If you are still getting out-of-memory errors after enabling external memory, try subsampling the
data to further reduce GPU memory usage:
@@ -52,23 +122,14 @@ data to further reduce GPU memory usage:

    'sampling_method': 'gradient_based',
  }

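Put together, a complete parameter set for this mode might look like the sketch below; the
exact ``subsample`` value is only an illustration, and ``Xy`` stands for the external-memory
``DMatrix`` constructed earlier:

.. code-block:: python

  import xgboost

  params = {
      "tree_method": "gpu_hist",
      # Gradient-based sampling lets gpu_hist train on a small subsample of the
      # data while keeping accuracy close to training on the full data.
      "subsample": 0.1,
      "sampling_method": "gradient_based",
  }
  booster = xgboost.train(params, Xy)  # Xy is the external-memory DMatrix from above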

For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally
the tree method still concatenates all the chunks into one final histogram index for
performance reasons, but in a compressed format. So its scalability has an upper bound,
but it still has a lower memory cost in general.

********
CPU Hist
********

It is subject to the same limitation as GPU Hist, except that gradient-based sampling is not
yet supported on CPU.