Export Python Interface for external memory. (#7070)

* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
This commit is contained in:
Jiaming Yuan
2021-07-22 15:15:53 +08:00
committed by GitHub
parent e64ee6592f
commit e6088366df
34 changed files with 961 additions and 200 deletions

View File

@@ -1,8 +1,8 @@
##############################
C API Tutorial
##############################
##############
C API Tutorial
##############
In this tutorial, we are going to install XGBoost library & configure the CMakeLists.txt file of our C/C++ application to link XGBoost library with our application. Later on, we will see some useful tips for using C API and code snippets as examples to use various functions available in C API to perform basic task like loading, training model & predicting on test dataset.
In this tutorial, we are going to install XGBoost library & configure the CMakeLists.txt file of our C/C++ application to link XGBoost library with our application. Later on, we will see some useful tips for using C API and code snippets as examples to use various functions available in C API to perform basic task like loading, training model & predicting on test dataset.
.. contents::
:backlinks: none
@@ -12,7 +12,7 @@ In this tutorial, we are going to install XGBoost library & configure the CMakeL
Requirements
************
Install CMake - Follow the `cmake installation documentation <https://cmake.org/install/>`_ for instructions.
Install CMake - Follow the `cmake installation documentation <https://cmake.org/install/>`_ for instructions.
Install Conda - Follow the `conda installation documentation <https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html>`_ for instructions
*************************************
@@ -31,18 +31,18 @@ Run the following commands on your terminal. The below commands will install the
# Activate the Conda environment, into which we'll install XGBoost
conda activate [env_name]
# Build the compiled version of XGBoost inside the build folder
cmake .. -DBUILD_STATIC_LIB=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
cmake .. -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
# install XGBoost in your conda environment (usually under [your home directory]/miniconda3)
make install
*********************************************************************
Configure CMakeList.txt file of your application to link with XGBoost
Configure CMakeList.txt file of your application to link with XGBoost
*********************************************************************
Here, we assume that your C++ application is using CMake for builds.
Use ``find_package()`` and ``target_link_libraries()`` in your application's CMakeList.txt to link with the XGBoost library:
.. code-block:: cmake
cmake_minimum_required(VERSION 3.13)
@@ -79,8 +79,8 @@ a. In a C application: Use the following macro to guard all calls to XGBoost's C
.. code-block:: c
#define safe_xgboost(call) { \
int err = (call); \
#define safe_xgboost(call) { \
int err = (call); \
if (err != 0) { \
fprintf(stderr, "%s:%d: error in %s: %s\n", __FILE__, __LINE__, #call, XGBGetLastError()); \
exit(1); \
@@ -101,8 +101,8 @@ b. In a C++ application: modify the macro ``safe_xgboost`` to throw an exception
.. code-block:: cpp
#define safe_xgboost(call) { \
int err = (call); \
#define safe_xgboost(call) { \
int err = (call); \
if (err != 0) { \
throw new Exception(std::string(__FILE__) + ":" + std::to_string(__LINE__) + \
": error in " + #call + ":" + XGBGetLastError())); \
@@ -125,29 +125,29 @@ c. Assertion technique: It works both in C/ C++. If expression evaluates to 0 (f
#include <stdio.h>
#include <stdlib.h>
#include <xgboost/c_api.h>
int main(int argc, char** argv) {
int silent = 0;
BoosterHandle booster;
// do something with booster
//free the memory
XGBoosterFree(booster)
DMatrixHandle DMatrixHandle_param;
// do something with DMatrixHandle_param
// free the memory
XGDMatrixFree(DMatrixHandle_param);
return 0;
}
3. For tree models, it is important to use consistent data formats during training and scoring/ predicting otherwise it will result in wrong outputs.
3. For tree models, it is important to use consistent data formats during training and scoring/ predicting otherwise it will result in wrong outputs.
Example if we our training data is in ``dense matrix`` format then your prediction dataset should also be a ``dense matrix`` or if training in ``libsvm`` format then dataset for prediction should also be in ``libsvm`` format.
@@ -166,7 +166,7 @@ Sample examples along with Code snippet to use C API functions
1. If the dataset is available in a file, it can be loaded into a ``DMatrix`` object using the `XGDMatrixCreateFromFile <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#a357c3654a1a4dcc05e6b5c50acd17105>`_
.. code-block:: c
DMatrixHandle data; // handle to DMatrix
// Load the dat from file & store it in data variable of DMatrixHandle datatype
safe_xgboost(XGDMatrixCreateFromFile("/path/to/file/filename", silent, &data));
@@ -188,10 +188,10 @@ Sample examples along with Code snippet to use C API functions
// dmatrix variable will contain the created DMatrix using it
safe_xgboost(XGDMatrixCreateFromMat(data1, 1, 50, 0, &dmatrix));
// here -1 represents the missing value in the matrix dataset
safe_xgboost(XGDMatrixCreateFromMat(data2, ROWS, COLS, -1, &dmatrix2)(;
safe_xgboost(XGDMatrixCreateFromMat(data2, ROWS, COLS, -1, &dmatrix2));
3. Create a Booster object for training & testing on dataset using `XGBoosterCreate <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#ad9fe6f8c8c4901db1c7581a96a21f9ae>`_
3. Create a Booster object for training & testing on dataset using `XGBoosterCreate <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#ad9fe6f8c8c4901db1c7581a96a21f9ae>`_
.. code-block:: c
@@ -201,7 +201,7 @@ Sample examples along with Code snippet to use C API functions
DMatrixHandle eval_dmats[eval_dmats_size] = {train, test};
safe_xgboost(XGBoosterCreate(eval_dmats, eval_dmats_size, &booster));
4. For each ``DMatrix`` object, set the labels using `XGDMatrixSetFloatInfo <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#aef75cda93db3ae9af89e465ae7e9cbe3>`_. Later you can access the label using `XGDMatrixGetFloatInfo <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#ab0ee317539a1fb1ce2b5f249e8c768f6>`_.
.. code-block:: c
@@ -221,7 +221,7 @@ Sample examples along with Code snippet to use C API functions
// Loading the labels
safe_xgboost(XGDMatrixSetFloatInfo(dmatrix, "label", labels, ROWS));
// reading the labels and store the length of the result
bst_ulong result_len;
@@ -233,12 +233,12 @@ Sample examples along with Code snippet to use C API functions
for(unsigned int i = 0; i < result_len; i++) {
printf("label[%i] = %f\n", i, result[i]);
}
5. Set the parameters for the ``Booster`` object according to the requirement using `XGBoosterSetParam <https://xgboost.readthedocs.io/en/stable/dev/c__api_8h.html#af7378865b0c999d2d08a5b16483b8bcb>`_ . Check out the full list of parameters available `here <https://xgboost.readthedocs.io/en/latest/parameter.html>`_ .
.. code-block :: c
BoosterHandle booster;
safe_xgboost(XGBoosterSetParam(booster, "booster", "gblinear"));
// default max_depth =6

View File

@@ -1,6 +1,75 @@
#####################################
Using XGBoost External Memory Version
#####################################
XGBoost supports loading data from external memory using builtin data parser. And
starting from version 1.5, users can also define a custom iterator to load data in chunks.
The feature is still experimental and not yet ready for production use. In this tutorial
we will introduce both methods. Please note that training on data from external memory is
not supported by ``exact`` tree method.
*************
Data Iterator
*************
Starting from XGBoost 1.5, users can define their own data loader using Python or C
interface. There are some examples in the ``demo`` directory for quick start. This is a
generalized version of text input external memory, where users no longer need to prepare a
text file that XGBoost recognizes. To enable the feature, user need to define a data
iterator with 2 class methods ``next`` and ``reset`` then pass it into ``DMatrix``
constructor.
.. code-block:: python
import os
from typing import List, Callable
import xgboost
from sklearn.datasets import load_svmlight_file
class Iterator(xgboost.DataIter):
def __init__(self, svm_file_paths: List[str]):
self._file_paths = svm_file_paths
self._it = 0
# XGBoost will generate some cache files under current directory with the prefix
# "cache"
super().__init__(cache_prefix=os.path.join(".", "cache"))
def next(self, input_data: Callable):
"""Advance the iterator by 1 step and pass the data to XGBoost. This function is
called by XGBoost during the construction of ``DMatrix``
"""
if self._it == len(self._file_paths):
# return 0 to let XGBoost know this is the end of iteration
return 0
# input_data is a function passed in by XGBoost who has the exact same signature of
# ``DMatrix``
X, y = load_svmlight_file(self._file_paths[self._it])
input_data(X, y)
self._it += 1
# Return 1 to let XGBoost know we haven't seen all the files yet.
return 1
def reset(self):
"""Reset the iterator to its beginning"""
self._it = 0
it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
Xy = xgboost.DMatrix(it)
# Other tree methods including ``hist`` and ``gpu_hist`` also work, but has some caveats
# as noted in following sections.
booster = xgboost.train({"tree_method": "approx"}, Xy)
The above snippet is a simplifed version of ``demo/guide-python/external_memory.py``. For
an example in C, please see ``demo/c-api/external-memory/``.
****************
Text File Inputs
****************
There is no big difference between using external memory version and in-memory version.
The only difference is the filename format.
@@ -36,10 +105,11 @@ more notes about text input formats, see :doc:`/tutorials/input_format`.
For CLI version, simply add the cache suffix, e.g. ``"../data/agaricus.txt.train#dtrain.cache"``.
***********
GPU Version
***********
External memory is fully supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
**********************************
GPU Version (GPU Hist tree method)
**********************************
External memory is supported in GPU algorithms (i.e. when ``tree_method`` is set to ``gpu_hist``).
If you are still getting out-of-memory errors after enabling external memory, try subsampling the
data to further reduce GPU memory usage:
@@ -52,23 +122,14 @@ data to further reduce GPU memory usage:
'sampling_method': 'gradient_based',
}
For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_.
For more information, see `this paper <https://arxiv.org/abs/2005.09148>`_. Internally
the tree method still concatenate all the chunks into 1 final histogram index due to
performance reason, but in compressed format. So its scalability has an upper bound but
still has lower memory cost in general.
*******************
Distributed Version
*******************
The external memory mode naturally works on distributed version, you can simply set path like
********
CPU Hist
********
.. code-block:: none
data = "hdfs://path-to-data/#dtrain.cache"
XGBoost will cache the data to the local position. When you run on YARN, the current folder is temporary
so that you can directly use ``dtrain.cache`` to cache to current folder.
***********
Limitations
***********
* The ``hist`` tree method hasn't been tested thoroughly with external memory support (see
`this issue <https://github.com/dmlc/xgboost/issues/4093>`_).
* OSX is not tested.
It's limited by the same factor of GPU Hist, except that gradient based sampling is not
yet supported on CPU.