[GPU-Plugin] Major refactor 2 (#2664)

* Change cmake option

* Move source files

* Move google tests

* Move python tests

* Move benchmarks

* Move documentation

* Remove makefile support

* Fix test run

* Move GPU tests
This commit is contained in:
Rory Mitchell
2017-09-08 09:57:16 +12:00
committed by GitHub
parent 8244f6f120
commit 15267eedf2
21 changed files with 76 additions and 249 deletions

View File

@@ -1,161 +1,3 @@
# CUDA Accelerated Tree Construction Algorithms
This plugin adds GPU accelerated tree construction and prediction algorithms to XGBoost.
## Usage
Specify the 'tree_method' parameter as one of the following algorithms.
### Algorithms
| tree_method | Description |
| --- | --- |
gpu_exact | The standard XGBoost tree construction algorithm. Performs exact search for splits. Slower and uses considerably more memory than 'gpu_hist' |
gpu_hist | Equivalent to the XGBoost fast histogram algorithm. Faster and uses considerably less memory. Splits may be less accurate. |
### Supported parameters
| parameter | gpu_exact | gpu_hist |
| --- | --- | --- |
subsample | ✔ | ✔ |
colsample_bytree | ✔ | ✔|
colsample_bylevel | ✔ | ✔ |
max_bin | ✖ | ✔ |
gpu_id | ✔ | ✔ |
n_gpus | ✖ | ✔ |
predictor | ✔ | ✔ |
GPU accelerated prediction is enabled by default for the above mentioned 'tree_method' parameters but can be switched to CPU prediction by setting 'predictor':'cpu_predictor'. This could be useful if you want to conserve GPU memory. Likewise when using CPU algorithms, GPU accelerated prediction can be enabled by setting 'predictor':'gpu_predictor'.
The device ordinal can be selected using the 'gpu_id' parameter, which defaults to 0.
Multiple GPUs can be used with the grow_gpu_hist parameter using the n_gpus parameter. which defaults to 1. If this is set to -1 all available GPUs will be used. If gpu_id is specified as non-zero, the gpu device order is mod(gpu_id + i) % n_visible_devices for i=0 to n_gpus-1. As with GPU vs. CPU, multi-GPU will not always be faster than a single GPU due to PCI bus bandwidth that can limit performance. For example, when n_features * n_bins * 2^depth divided by time of each round/iteration becomes comparable to the real PCI 16x bus bandwidth of order 4GB/s to 10GB/s, then AllReduce will dominant code speed and multiple GPUs become ineffective at increasing performance. Also, CPU overhead between GPU calls can limit usefulness of multiple GPUs.
This plugin currently works with the CLI version and python version.
Python example:
```python
param['gpu_id'] = 0
param['max_bin'] = 16
param['tree_method'] = 'gpu_hist'
```
## Benchmarks
To run benchmarks on synthetic data for binary classification:
```bash
$ python benchmark/benchmark.py
```
Training time time on 1,000,000 rows x 50 columns with 500 boosting iterations and 0.25/0.75 test/train split on i7-6700K CPU @ 4.00GHz and Pascal Titan X.
| tree_method | Time (s) |
| --- | --- |
| gpu_hist | 13.87 |
| hist | 63.55 |
| gpu_exact | 161.08 |
| exact | 1082.20 |
[See here](http://dmlc.ml/2016/12/14/GPU-accelerated-xgboost.html) for additional performance benchmarks of the 'gpu_exact' tree_method.
## Test
To run python tests:
```bash
$ python -m nose test/python/
```
Google tests can be enabled by specifying -DGOOGLE_TEST=ON when building with cmake.
## Dependencies
A CUDA capable GPU with at least compute capability >= 3.5
Building the plug-in requires CUDA Toolkit 7.5 or later (https://developer.nvidia.com/cuda-downloads)
## Build
From the command line on Linux starting from the xgboost directory:
On Linux, from the xgboost directory:
```bash
$ mkdir build
$ cd build
$ cmake .. -DPLUGIN_UPDATER_GPU=ON
$ make -j
```
On Windows using cmake, see what options for Generators you have for cmake, and choose one with [arch] replaced by Win64:
```bash
cmake -help
```
Then run cmake as:
```bash
$ mkdir build
$ cd build
$ cmake .. -G"Visual Studio 14 2015 Win64" -DPLUGIN_UPDATER_GPU=ON
```
Cmake will create an xgboost.sln solution file in the build directory. Build this solution in release mode as a x64 build.
Visual studio community 2015, supported by cuda toolkit (http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/#axzz4isREr2nS), can be downloaded from: https://my.visualstudio.com/Downloads?q=Visual%20Studio%20Community%202015 . You may also be able to use a later version of visual studio depending on whether the CUDA toolkit supports it. Note that Mingw cannot be used with cuda.
### For other nccl libraries
On some systems, nccl libraries are specific to a particular system (IBM Power or nvidia-docker) and can enable use of nvlink (between GPUs or even between GPUs and system memory). In that case, one wants to avoid the static nccl library by changing "STATIC" to "SHARED" in nccl/CMakeLists.txt and deleting the shared nccl library created (so that the system one is used).
### For Developers!
In case you want to build only for a specific GPU(s), for eg. GP100 and GP102,
whose compute capability are 60 and 61 respectively:
```bash
$ cmake .. -DPLUGIN_UPDATER_GPU=ON -DGPU_COMPUTE_VER="60;61"
```
### Using make
Now, it also supports the usual 'make' flow to build gpu-enabled tree construction plugins. It's currently only tested on Linux. From the xgboost directory
```bash
# make sure CUDA SDK bin directory is in the 'PATH' env variable
$ make -j PLUGIN_UPDATER_GPU=ON
```
Similar to cmake, if you want to build only for a specific GPU(s):
```bash
$ make -j PLUGIN_UPDATER_GPU=ON GPU_COMPUTE_VER="60 61"
```
## Changelog
##### 2017/8/14
* Added GPU accelerated prediction. Considerably improved performance when using test/eval sets.
##### 2017/7/10
* Memory performance improved 4x for gpu_hist
##### 2017/6/26
* Change API to use tree_method parameter
* Increase required cmake version to 3.5
* Add compute arch 3.5 to default archs
* Set default n_gpus to 1
##### 2017/6/5
* Multi-GPU support for histogram method using NVIDIA NCCL.
##### 2017/5/31
* Faster version of the grow_gpu plugin
* Added support for building gpu plugin through 'make' flow too
##### 2017/5/19
* Further performance enhancements for histogram method.
##### 2017/5/5
* Histogram performance improvements
* Fix gcc build issues
##### 2017/4/25
* Add fast histogram algorithm
* Fix Linux build
* Add 'gpu_id' parameter
## References
[Mitchell, Rory, and Eibe Frank. Accelerating the XGBoost algorithm using GPU computing. No. e2911v1. PeerJ Preprints, 2017.](https://peerj.com/preprints/2911/)
## Author
Rory Mitchell
Jonathan C. McKinney
Shankara Rao Thejaswi Nanditale
Vinay Deshpande
... and the rest of the H2O.ai and NVIDIA team.
Please report bugs to the xgboost/issues page.
[The XGBoost GPU documentation has moved here.](https://xgboost.readthedocs.io/en/latest/gpu/index.html)

View File

@@ -1,61 +0,0 @@
# pylint: skip-file
import sys, argparse
import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import time
def run_benchmark(args, gpu_algorithm, cpu_algorithm):
print("Generating dataset: {} rows * {} columns".format(args.rows, args.columns))
print("{}/{} test/train split".format(args.test_size, 1.0 - args.test_size))
tmp = time.time()
X, y = make_classification(args.rows, n_features=args.columns, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=args.test_size, random_state=7)
print ("Generate Time: %s seconds" % (str(time.time() - tmp)))
tmp = time.time()
print ("DMatrix Start")
# omp way
dtrain = xgb.DMatrix(X_train, y_train, nthread=-1)
dtest = xgb.DMatrix(X_test, y_test, nthread=-1)
print ("DMatrix Time: %s seconds" % (str(time.time() - tmp)))
param = {'objective': 'binary:logistic',
'max_depth': 6,
'silent': 0,
'n_gpus': 1,
'gpu_id': 0,
'eval_metric': 'error',
'debug_verbose': 0,
}
param['tree_method'] = gpu_algorithm
print("Training with '%s'" % param['tree_method'])
tmp = time.time()
xgb.train(param, dtrain, args.iterations, evals=[(dtest, "test")])
print ("Train Time: %s seconds" % (str(time.time() - tmp)))
param['silent'] = 1
param['tree_method'] = cpu_algorithm
print("Training with '%s'" % param['tree_method'])
tmp = time.time()
xgb.train(param, dtrain, args.iterations, evals=[(dtest, "test")])
print ("Time: %s seconds" % (str(time.time() - tmp)))
parser = argparse.ArgumentParser()
parser.add_argument('--algorithm', choices=['all', 'gpu_exact', 'gpu_hist'], default='all')
parser.add_argument('--rows', type=int, default=1000000)
parser.add_argument('--columns', type=int, default=50)
parser.add_argument('--iterations', type=int, default=500)
parser.add_argument('--test_size', type=float, default=0.25)
args = parser.parse_args()
if 'gpu_hist' in args.algorithm:
run_benchmark(args, args.algorithm, 'hist')
elif 'gpu_exact' in args.algorithm:
run_benchmark(args, args.algorithm, 'exact')
elif 'all' in args.algorithm:
run_benchmark(args, 'gpu_exact', 'exact')
run_benchmark(args, 'gpu_hist', 'hist')

View File

@@ -1,6 +0,0 @@
PLUGIN_OBJS += build_plugin/updater_gpu/src/register_updater_gpu.o \
build_plugin/updater_gpu/src/updater_gpu.o \
build_plugin/updater_gpu/src/gpu_hist_builder.o \
build_plugin/updater_gpu/src/gpu_predictor.o
PLUGIN_LDFLAGS += -L$(CUDA_ROOT)/lib64 -lcudart

View File

@@ -1,884 +0,0 @@
/*!
* Copyright 2017 XGBoost contributors
*/
#pragma once
#include <thrust/device_vector.h>
#include <thrust/system/cuda/error.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/system_error.h>
#include <xgboost/logging.h>
#include <algorithm>
#include <chrono>
#include <ctime>
#include <cub/cub.cuh>
#include <numeric>
#include <sstream>
#include <string>
#include <vector>
#include "nccl.h"
// Uncomment to enable
#define TIMERS
namespace dh {
#define HOST_DEV_INLINE __host__ __device__ __forceinline__
#define DEV_INLINE __device__ __forceinline__
/*
* Error handling functions
*/
#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)
inline cudaError_t throw_on_cuda_error(cudaError_t code, const char *file,
int line) {
if (code != cudaSuccess) {
std::stringstream ss;
ss << file << "(" << line << ")";
std::string file_and_line;
ss >> file_and_line;
throw thrust::system_error(code, thrust::cuda_category(), file_and_line);
}
return code;
}
#define safe_nccl(ans) throw_on_nccl_error((ans), __FILE__, __LINE__)
inline ncclResult_t throw_on_nccl_error(ncclResult_t code, const char *file,
int line) {
if (code != ncclSuccess) {
std::stringstream ss;
ss << "NCCL failure :" << ncclGetErrorString(code) << " ";
ss << file << "(" << line << ")";
throw std::runtime_error(ss.str());
}
return code;
}
inline int n_visible_devices() {
int n_visgpus = 0;
cudaGetDeviceCount(&n_visgpus);
return n_visgpus;
}
inline int n_devices_all(int n_gpus) {
int n_devices_visible = dh::n_visible_devices();
int n_devices = n_gpus < 0 ? n_devices_visible : n_gpus;
return (n_devices);
}
inline int n_devices(int n_gpus, int num_rows) {
int n_devices = dh::n_devices_all(n_gpus);
// fix-up device number to be limited by number of rows
n_devices = n_devices > num_rows ? num_rows : n_devices;
return (n_devices);
}
// if n_devices=-1, then use all visible devices
inline void synchronize_n_devices(int n_devices, std::vector<int> dList) {
for (int d_idx = 0; d_idx < n_devices; d_idx++) {
int device_idx = dList[d_idx];
safe_cuda(cudaSetDevice(device_idx));
safe_cuda(cudaDeviceSynchronize());
}
}
inline void synchronize_all() {
for (int device_idx = 0; device_idx < n_visible_devices(); device_idx++) {
safe_cuda(cudaSetDevice(device_idx));
safe_cuda(cudaDeviceSynchronize());
}
}
inline std::string device_name(int device_idx) {
cudaDeviceProp prop;
dh::safe_cuda(cudaGetDeviceProperties(&prop, device_idx));
return std::string(prop.name);
}
inline size_t available_memory(int device_idx) {
size_t device_free = 0;
size_t device_total = 0;
safe_cuda(cudaSetDevice(device_idx));
dh::safe_cuda(cudaMemGetInfo(&device_free, &device_total));
return device_free;
}
/**
* \fn inline int max_shared_memory(int device_idx)
*
* \brief Maximum shared memory per block on this device.
*
* \param device_idx Zero-based index of the device.
*/
inline int max_shared_memory(int device_idx) {
cudaDeviceProp prop;
dh::safe_cuda(cudaGetDeviceProperties(&prop, device_idx));
return prop.sharedMemPerBlock;
}
// ensure gpu_id is correct, so not dependent upon user knowing details
inline int get_device_idx(int gpu_id) {
// protect against overrun for gpu_id
return (std::abs(gpu_id) + 0) % dh::n_visible_devices();
}
/*
* Timers
*/
struct Timer {
typedef std::chrono::high_resolution_clock ClockT;
typedef std::chrono::high_resolution_clock::time_point TimePointT;
TimePointT start;
Timer() { reset(); }
void reset() { start = ClockT::now(); }
int64_t elapsed() const { return (ClockT::now() - start).count(); }
double elapsedSeconds() const {
return elapsed() * ((double)ClockT::period::num / ClockT::period::den);
}
void printElapsed(std::string label) {
// synchronize_n_devices(n_devices, dList);
printf("%s:\t %fs\n", label.c_str(), elapsedSeconds());
reset();
}
};
/*
* Range iterator
*/
class range {
public:
class iterator {
friend class range;
public:
__host__ __device__ int64_t operator*() const { return i_; }
__host__ __device__ const iterator &operator++() {
i_ += step_;
return *this;
}
__host__ __device__ iterator operator++(int) {
iterator copy(*this);
i_ += step_;
return copy;
}
__host__ __device__ bool operator==(const iterator &other) const {
return i_ >= other.i_;
}
__host__ __device__ bool operator!=(const iterator &other) const {
return i_ < other.i_;
}
__host__ __device__ void step(int s) { step_ = s; }
protected:
__host__ __device__ explicit iterator(int64_t start) : i_(start) {}
public:
uint64_t i_;
int step_ = 1;
};
__host__ __device__ iterator begin() const { return begin_; }
__host__ __device__ iterator end() const { return end_; }
__host__ __device__ range(int64_t begin, int64_t end)
: begin_(begin), end_(end) {}
__host__ __device__ void step(int s) { begin_.step(s); }
private:
iterator begin_;
iterator end_;
};
template <typename T>
__device__ range grid_stride_range(T begin, T end) {
begin += blockDim.x * blockIdx.x + threadIdx.x;
range r(begin, end);
r.step(gridDim.x * blockDim.x);
return r;
}
template <typename T>
__device__ range block_stride_range(T begin, T end) {
begin += threadIdx.x;
range r(begin, end);
r.step(blockDim.x);
return r;
}
// Threadblock iterates over range, filling with value. Requires all threads in
// block to be active.
template <typename IterT, typename ValueT>
__device__ void block_fill(IterT begin, size_t n, ValueT value) {
for (auto i : block_stride_range(static_cast<size_t>(0), n)) {
begin[i] = value;
}
}
/*
* Memory
*/
enum memory_type { DEVICE, DEVICE_MANAGED };
template <memory_type MemoryT>
class bulk_allocator;
template <typename T>
class dvec2;
template <typename T>
class dvec {
friend class dvec2<T>;
private:
T *_ptr;
size_t _size;
int _device_idx;
public:
void external_allocate(int device_idx, void *ptr, size_t size) {
if (!empty()) {
throw std::runtime_error("Tried to allocate dvec but already allocated");
}
_ptr = static_cast<T *>(ptr);
_size = size;
_device_idx = device_idx;
safe_cuda(cudaSetDevice(_device_idx));
}
dvec() : _ptr(NULL), _size(0), _device_idx(-1) {}
size_t size() const { return _size; }
int device_idx() const { return _device_idx; }
bool empty() const { return _ptr == NULL || _size == 0; }
T *data() { return _ptr; }
const T *data() const { return _ptr; }
std::vector<T> as_vector() const {
std::vector<T> h_vector(size());
safe_cuda(cudaSetDevice(_device_idx));
safe_cuda(cudaMemcpy(h_vector.data(), _ptr, size() * sizeof(T),
cudaMemcpyDeviceToHost));
return h_vector;
}
void fill(T value) {
safe_cuda(cudaSetDevice(_device_idx));
thrust::fill_n(thrust::device_pointer_cast(_ptr), size(), value);
}
void print() {
auto h_vector = this->as_vector();
for (auto e : h_vector) {
std::cout << e << " ";
}
std::cout << "\n";
}
thrust::device_ptr<T> tbegin() { return thrust::device_pointer_cast(_ptr); }
thrust::device_ptr<T> tend() {
return thrust::device_pointer_cast(_ptr + size());
}
template <typename T2>
dvec &operator=(const std::vector<T2> &other) {
this->copy(other.begin(), other.end());
return *this;
}
dvec &operator=(dvec<T> &other) {
if (other.size() != size()) {
throw std::runtime_error(
"Cannot copy assign dvec to dvec, sizes are different");
}
safe_cuda(cudaSetDevice(this->device_idx()));
if (other.device_idx() == this->device_idx()) {
thrust::copy(other.tbegin(), other.tend(), this->tbegin());
} else {
std::cout << "deviceother: " << other.device_idx()
<< " devicethis: " << this->device_idx() << std::endl;
std::cout << "size deviceother: " << other.size()
<< " devicethis: " << this->device_idx() << std::endl;
throw std::runtime_error("Cannot copy to/from different devices");
}
return *this;
}
template <typename IterT>
void copy(IterT begin, IterT end) {
safe_cuda(cudaSetDevice(this->device_idx()));
if (end - begin != size()) {
throw std::runtime_error(
"Cannot copy assign vector to dvec, sizes are different");
}
thrust::copy(begin, end, this->tbegin());
}
};
/**
* @class dvec2 device_helpers.cuh
* @brief wrapper for storing 2 dvec's which are needed for cub::DoubleBuffer
*/
template <typename T>
class dvec2 {
private:
dvec<T> _d1, _d2;
cub::DoubleBuffer<T> _buff;
int _device_idx;
public:
void external_allocate(int device_idx, void *ptr1, void *ptr2, size_t size) {
if (!empty()) {
throw std::runtime_error("Tried to allocate dvec2 but already allocated");
}
_device_idx = device_idx;
_d1.external_allocate(_device_idx, ptr1, size);
_d2.external_allocate(_device_idx, ptr2, size);
_buff.d_buffers[0] = static_cast<T *>(ptr1);
_buff.d_buffers[1] = static_cast<T *>(ptr2);
_buff.selector = 0;
}
dvec2() : _d1(), _d2(), _buff(), _device_idx(-1) {}
size_t size() const { return _d1.size(); }
int device_idx() const { return _device_idx; }
bool empty() const { return _d1.empty() || _d2.empty(); }
cub::DoubleBuffer<T> &buff() { return _buff; }
dvec<T> &d1() { return _d1; }
dvec<T> &d2() { return _d2; }
T *current() { return _buff.Current(); }
dvec<T> &current_dvec() { return _buff.selector == 0 ? d1() : d2(); }
T *other() { return _buff.Alternate(); }
};
template <memory_type MemoryT>
class bulk_allocator {
std::vector<char *> d_ptr;
std::vector<size_t> _size;
std::vector<int> _device_idx;
const int align = 256;
template <typename SizeT>
size_t align_round_up(SizeT n) {
n = (n + align - 1) / align;
return n * align;
}
template <typename T, typename SizeT>
size_t get_size_bytes(dvec<T> *first_vec, SizeT first_size) {
return align_round_up<SizeT>(first_size * sizeof(T));
}
template <typename T, typename SizeT, typename... Args>
size_t get_size_bytes(dvec<T> *first_vec, SizeT first_size, Args... args) {
return get_size_bytes<T, SizeT>(first_vec, first_size) +
get_size_bytes(args...);
}
template <typename T, typename SizeT>
void allocate_dvec(int device_idx, char *ptr, dvec<T> *first_vec,
SizeT first_size) {
first_vec->external_allocate(device_idx, static_cast<void *>(ptr),
first_size);
}
template <typename T, typename SizeT, typename... Args>
void allocate_dvec(int device_idx, char *ptr, dvec<T> *first_vec,
SizeT first_size, Args... args) {
first_vec->external_allocate(device_idx, static_cast<void *>(ptr),
first_size);
ptr += align_round_up(first_size * sizeof(T));
allocate_dvec(device_idx, ptr, args...);
}
// template <memory_type MemoryT>
char *allocate_device(int device_idx, size_t bytes, memory_type t) {
char *ptr;
if (t == memory_type::DEVICE) {
safe_cuda(cudaSetDevice(device_idx));
safe_cuda(cudaMalloc(&ptr, bytes));
} else {
safe_cuda(cudaMallocManaged(&ptr, bytes));
}
return ptr;
}
template <typename T, typename SizeT>
size_t get_size_bytes(dvec2<T> *first_vec, SizeT first_size) {
return 2 * align_round_up(first_size * sizeof(T));
}
template <typename T, typename SizeT, typename... Args>
size_t get_size_bytes(dvec2<T> *first_vec, SizeT first_size, Args... args) {
return get_size_bytes<T, SizeT>(first_vec, first_size) +
get_size_bytes(args...);
}
template <typename T, typename SizeT>
void allocate_dvec(int device_idx, char *ptr, dvec2<T> *first_vec,
SizeT first_size) {
first_vec->external_allocate(
device_idx, static_cast<void *>(ptr),
static_cast<void *>(ptr + align_round_up(first_size * sizeof(T))),
first_size);
}
template <typename T, typename SizeT, typename... Args>
void allocate_dvec(int device_idx, char *ptr, dvec2<T> *first_vec,
SizeT first_size, Args... args) {
allocate_dvec<T, SizeT>(device_idx, ptr, first_vec, first_size);
ptr += (align_round_up(first_size * sizeof(T)) * 2);
allocate_dvec(device_idx, ptr, args...);
}
public:
~bulk_allocator() {
for (size_t i = 0; i < d_ptr.size(); i++) {
if (!(d_ptr[i] == nullptr)) {
safe_cuda(cudaSetDevice(_device_idx[i]));
safe_cuda(cudaFree(d_ptr[i]));
}
}
}
// returns sum of bytes for all allocations
size_t size() {
return std::accumulate(_size.begin(), _size.end(), static_cast<size_t>(0));
}
template <typename... Args>
void allocate(int device_idx, bool silent, Args... args) {
size_t size = get_size_bytes(args...);
char *ptr = allocate_device(device_idx, size, MemoryT);
allocate_dvec(device_idx, ptr, args...);
d_ptr.push_back(ptr);
_size.push_back(size);
_device_idx.push_back(device_idx);
if (!silent) {
const int mb_size = 1048576;
LOG(CONSOLE) << "Allocated " << size / mb_size << "MB on [" << device_idx
<< "] " << device_name(device_idx) << ", "
<< available_memory(device_idx) / mb_size << "MB remaining.";
}
}
};
// Keep track of cub library device allocation
struct CubMemory {
void *d_temp_storage;
size_t temp_storage_bytes;
// Thrust
typedef char value_type;
CubMemory() : d_temp_storage(NULL), temp_storage_bytes(0) {}
~CubMemory() { Free(); }
void Free() {
if (this->IsAllocated()) {
safe_cuda(cudaFree(d_temp_storage));
}
}
void LazyAllocate(size_t num_bytes) {
if (num_bytes > temp_storage_bytes) {
Free();
safe_cuda(cudaMalloc(&d_temp_storage, num_bytes));
temp_storage_bytes = num_bytes;
}
}
// Thrust
char *allocate(std::ptrdiff_t num_bytes) {
LazyAllocate(num_bytes);
return reinterpret_cast<char *>(d_temp_storage);
}
// Thrust
void deallocate(char *ptr, size_t n) {
// Do nothing
}
bool IsAllocated() { return d_temp_storage != NULL; }
};
/*
* Utility functions
*/
template <typename T>
void print(const thrust::device_vector<T> &v, size_t max_items = 10) {
thrust::host_vector<T> h = v;
for (size_t i = 0; i < std::min(max_items, h.size()); i++) {
std::cout << " " << h[i];
}
std::cout << "\n";
}
template <typename T>
void print(const dvec<T> &v, size_t max_items = 10) {
std::vector<T> h = v.as_vector();
for (size_t i = 0; i < std::min(max_items, h.size()); i++) {
std::cout << " " << h[i];
}
std::cout << "\n";
}
template <typename T>
void print(char *label, const thrust::device_vector<T> &v,
const char *format = "%d ", size_t max = 10) {
thrust::host_vector<T> h_v = v;
std::cout << label << ":\n";
for (size_t i = 0; i < std::min(static_cast<size_t>(h_v.size()), max); i++) {
printf(format, h_v[i]);
}
std::cout << "\n";
}
template <typename T1, typename T2>
T1 div_round_up(const T1 a, const T2 b) {
return static_cast<T1>(ceil(static_cast<double>(a) / b));
}
template <typename T>
thrust::device_ptr<T> dptr(T *d_ptr) {
return thrust::device_pointer_cast(d_ptr);
}
template <typename T>
T *raw(thrust::device_vector<T> &v) { // NOLINT
return raw_pointer_cast(v.data());
}
template <typename T>
const T *raw(const thrust::device_vector<T> &v) { // NOLINT
return raw_pointer_cast(v.data());
}
template <typename T>
size_t size_bytes(const thrust::device_vector<T> &v) {
return sizeof(T) * v.size();
}
/*
* Kernel launcher
*/
template <typename L>
__global__ void launch_n_kernel(size_t begin, size_t end, L lambda) {
for (auto i : grid_stride_range(begin, end)) {
lambda(i);
}
}
template <typename L>
__global__ void launch_n_kernel(int device_idx, size_t begin, size_t end,
L lambda) {
for (auto i : grid_stride_range(begin, end)) {
lambda(i, device_idx);
}
}
template <int ITEMS_PER_THREAD = 8, int BLOCK_THREADS = 256, typename L>
inline void launch_n(int device_idx, size_t n, L lambda) {
safe_cuda(cudaSetDevice(device_idx));
// TODO: Template on n so GRID_SIZE always fits into int.
const int GRID_SIZE = div_round_up(n, ITEMS_PER_THREAD * BLOCK_THREADS);
#if defined(__CUDACC__)
launch_n_kernel<<<GRID_SIZE, BLOCK_THREADS>>>(static_cast<size_t>(0), n,
lambda);
#endif
}
// if n_devices=-1, then use all visible devices
template <int ITEMS_PER_THREAD = 8, int BLOCK_THREADS = 256, typename L>
inline void multi_launch_n(size_t n, int n_devices, L lambda) {
n_devices = n_devices < 0 ? n_visible_devices() : n_devices;
CHECK_LE(n_devices, n_visible_devices()) << "Number of devices requested "
"needs to be less than equal to "
"number of visible devices.";
// TODO: Template on n so GRID_SIZE always fits into int.
const int GRID_SIZE = div_round_up(n, ITEMS_PER_THREAD * BLOCK_THREADS);
#if defined(__CUDACC__)
n_devices = n_devices > n ? n : n_devices;
for (int device_idx = 0; device_idx < n_devices; device_idx++) {
safe_cuda(cudaSetDevice(device_idx));
size_t begin = (n / n_devices) * device_idx;
size_t end = std::min((n / n_devices) * (device_idx + 1), n);
launch_n_kernel<<<GRID_SIZE, BLOCK_THREADS>>>(device_idx, begin, end,
lambda);
}
#endif
}
/**
* @brief Helper macro to measure timing on GPU
* @param call the GPU call
* @param name name used to track later
* @param stream cuda stream where to measure time
*/
#define TIMEIT(call, name) \
do { \
dh::Timer t1234; \
call; \
t1234.printElapsed(name); \
} while (0)
// Load balancing search
template <typename coordinate_t, typename segments_t, typename offset_t>
void FindMergePartitions(int device_idx, coordinate_t *d_tile_coordinates,
int num_tiles, int tile_size, segments_t segments,
offset_t num_rows, offset_t num_elements) {
dh::launch_n(device_idx, num_tiles + 1, [=] __device__(int idx) {
offset_t diagonal = idx * tile_size;
coordinate_t tile_coordinate;
cub::CountingInputIterator<offset_t> nonzero_indices(0);
// Search the merge path
// Cast to signed integer as this function can have negatives
cub::MergePathSearch(static_cast<int64_t>(diagonal), segments + 1,
nonzero_indices, static_cast<int64_t>(num_rows),
static_cast<int64_t>(num_elements), tile_coordinate);
// Output starting offset
d_tile_coordinates[idx] = tile_coordinate;
});
}
template <int TILE_SIZE, int ITEMS_PER_THREAD, int BLOCK_THREADS,
typename offset_t, typename coordinate_t, typename func_t,
typename segments_iter>
__global__ void LbsKernel(coordinate_t *d_coordinates,
segments_iter segment_end_offsets, func_t f,
offset_t num_segments) {
int tile = blockIdx.x;
coordinate_t tile_start_coord = d_coordinates[tile];
coordinate_t tile_end_coord = d_coordinates[tile + 1];
int64_t tile_num_rows = tile_end_coord.x - tile_start_coord.x;
int64_t tile_num_elements = tile_end_coord.y - tile_start_coord.y;
cub::CountingInputIterator<offset_t> tile_element_indices(tile_start_coord.y);
coordinate_t thread_start_coord;
typedef typename std::iterator_traits<segments_iter>::value_type segment_t;
__shared__ struct {
segment_t tile_segment_end_offsets[TILE_SIZE + 1];
segment_t output_segment[TILE_SIZE];
} temp_storage;
for (auto item : dh::block_stride_range(int(0), int(tile_num_rows + 1))) {
temp_storage.tile_segment_end_offsets[item] =
segment_end_offsets[min(tile_start_coord.x + item, num_segments - 1)];
}
__syncthreads();
int64_t diag = threadIdx.x * ITEMS_PER_THREAD;
// Cast to signed integer as this function can have negatives
cub::MergePathSearch(diag, // Diagonal
temp_storage.tile_segment_end_offsets, // List A
tile_element_indices, // List B
tile_num_rows, tile_num_elements, thread_start_coord);
coordinate_t thread_current_coord = thread_start_coord;
#pragma unroll
for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM) {
if (tile_element_indices[thread_current_coord.y] <
temp_storage.tile_segment_end_offsets[thread_current_coord.x]) {
temp_storage.output_segment[thread_current_coord.y] =
thread_current_coord.x + tile_start_coord.x;
++thread_current_coord.y;
} else {
++thread_current_coord.x;
}
}
__syncthreads();
for (auto item : dh::block_stride_range(int(0), int(tile_num_elements))) {
f(tile_start_coord.y + item, temp_storage.output_segment[item]);
}
}
template <typename func_t, typename segments_iter, typename offset_t>
void SparseTransformLbs(int device_idx, dh::CubMemory *temp_memory,
offset_t count, segments_iter segments,
offset_t num_segments, func_t f) {
typedef typename cub::CubVector<offset_t, 2>::Type coordinate_t;
dh::safe_cuda(cudaSetDevice(device_idx));
const int BLOCK_THREADS = 256;
const int ITEMS_PER_THREAD = 1;
const int TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD;
int num_tiles = dh::div_round_up(count + num_segments, BLOCK_THREADS);
temp_memory->LazyAllocate(sizeof(coordinate_t) * (num_tiles + 1));
coordinate_t *tmp_tile_coordinates =
reinterpret_cast<coordinate_t *>(temp_memory->d_temp_storage);
FindMergePartitions(device_idx, tmp_tile_coordinates, num_tiles,
BLOCK_THREADS, segments, num_segments, count);
LbsKernel<TILE_SIZE, ITEMS_PER_THREAD, BLOCK_THREADS, offset_t>
<<<num_tiles, BLOCK_THREADS>>>(tmp_tile_coordinates, segments + 1, f,
num_segments);
}
template <typename func_t, typename offset_t>
void DenseTransformLbs(int device_idx, offset_t count, offset_t num_segments,
func_t f) {
CHECK(count % num_segments == 0) << "Data is not dense.";
launch_n(device_idx, count, [=] __device__(offset_t idx) {
offset_t segment = idx / (count / num_segments);
f(idx, segment);
});
}
/**
* \fn template <typename func_t, typename segments_iter, typename offset_t>
* void TransformLbs(int device_idx, dh::CubMemory *temp_memory, offset_t count,
* segments_iter segments, offset_t num_segments, bool is_dense, func_t f)
*
* \brief Load balancing search function. Reads a CSR type matrix description
* and allows a function to be executed on each element. Search 'modern GPU load
* balancing search' for more information.
*
* \author Rory
* \date 7/9/2017
*
* \tparam func_t Type of the function t.
* \tparam segments_iter Type of the segments iterator.
* \tparam offset_t Type of the offset.
* \param device_idx Zero-based index of the device.
* \param [in,out] temp_memory Temporary memory allocator.
* \param count Number of elements.
* \param segments Device pointer to segments.
* \param num_segments Number of segments.
* \param is_dense True if this object is dense.
* \param f Lambda to be executed on matrix elements.
*/
template <typename func_t, typename segments_iter, typename offset_t>
void TransformLbs(int device_idx, dh::CubMemory *temp_memory, offset_t count,
segments_iter segments, offset_t num_segments, bool is_dense,
func_t f) {
if (is_dense) {
DenseTransformLbs(device_idx, count, num_segments, f);
} else {
SparseTransformLbs(device_idx, temp_memory, count, segments, num_segments,
f);
}
}
/**
* @brief Helper function to sort the pairs using cub's segmented RadixSortPairs
* @param tmp_mem cub temporary memory info
* @param keys keys double-buffer array
* @param vals the values double-buffer array
* @param nVals number of elements in the array
* @param nSegs number of segments
* @param offsets the segments
*/
template <typename T1, typename T2>
void segmentedSort(dh::CubMemory *tmp_mem, dh::dvec2<T1> *keys,
dh::dvec2<T2> *vals, int nVals, int nSegs,
const dh::dvec<int> &offsets, int start = 0,
int end = sizeof(T1) * 8) {
size_t tmpSize;
dh::safe_cuda(cub::DeviceSegmentedRadixSort::SortPairs(
NULL, tmpSize, keys->buff(), vals->buff(), nVals, nSegs, offsets.data(),
offsets.data() + 1, start, end));
tmp_mem->LazyAllocate(tmpSize);
dh::safe_cuda(cub::DeviceSegmentedRadixSort::SortPairs(
tmp_mem->d_temp_storage, tmpSize, keys->buff(), vals->buff(), nVals,
nSegs, offsets.data(), offsets.data() + 1, start, end));
}
/**
* @brief Helper function to perform device-wide sum-reduction
* @param tmp_mem cub temporary memory info
* @param in the input array to be reduced
* @param out the output reduced value
* @param nVals number of elements in the input array
*/
template <typename T>
void sumReduction(dh::CubMemory &tmp_mem, dh::dvec<T> &in, dh::dvec<T> &out,
int nVals) {
size_t tmpSize;
dh::safe_cuda(
cub::DeviceReduce::Sum(NULL, tmpSize, in.data(), out.data(), nVals));
tmp_mem.LazyAllocate(tmpSize);
dh::safe_cuda(cub::DeviceReduce::Sum(tmp_mem.d_temp_storage, tmpSize,
in.data(), out.data(), nVals));
}
/**
* @brief Fill a given constant value across all elements in the buffer
* @param out the buffer to be filled
* @param len number of elements i the buffer
* @param def default value to be filled
*/
template <typename T, int BlkDim = 256, int ItemsPerThread = 4>
void fillConst(int device_idx, T *out, int len, T def) {
dh::launch_n<ItemsPerThread, BlkDim>(device_idx, len,
[=] __device__(int i) { out[i] = def; });
}
/**
* @brief gather elements
* @param out1 output gathered array for the first buffer
* @param in1 first input buffer
* @param out2 output gathered array for the second buffer
* @param in2 second input buffer
* @param instId gather indices
* @param nVals length of the buffers
*/
template <typename T1, typename T2, int BlkDim = 256, int ItemsPerThread = 4>
void gather(int device_idx, T1 *out1, const T1 *in1, T2 *out2, const T2 *in2,
const int *instId, int nVals) {
dh::launch_n<ItemsPerThread, BlkDim>(device_idx, nVals,
[=] __device__(int i) {
int iid = instId[i];
T1 v1 = in1[iid];
T2 v2 = in2[iid];
out1[i] = v1;
out2[i] = v2;
});
}
/**
* @brief gather elements
* @param out output gathered array
* @param in input buffer
* @param instId gather indices
* @param nVals length of the buffers
*/
template <typename T, int BlkDim = 256, int ItemsPerThread = 4>
void gather(int device_idx, T *out, const T *in, const int *instId, int nVals) {
dh::launch_n<ItemsPerThread, BlkDim>(device_idx, nVals,
[=] __device__(int i) {
int iid = instId[i];
out[i] = in[iid];
});
}
} // namespace dh

View File

@@ -1,411 +0,0 @@
/*!
* Copyright by Contributors 2017
*/
#include <dmlc/parameter.h>
#include <thrust/copy.h>
#include <xgboost/data.h>
#include <xgboost/predictor.h>
#include <xgboost/tree_model.h>
#include <xgboost/tree_updater.h>
#include <memory>
#include "device_helpers.cuh"
namespace xgboost {
namespace predictor {
DMLC_REGISTRY_FILE_TAG(gpu_predictor);
/*! \brief prediction parameters */
struct GPUPredictionParam : public dmlc::Parameter<GPUPredictionParam> {
int gpu_id;
int n_gpus;
bool silent;
// declare parameters
DMLC_DECLARE_PARAMETER(GPUPredictionParam) {
DMLC_DECLARE_FIELD(gpu_id).set_default(0).describe(
"Device ordinal for GPU prediction.");
DMLC_DECLARE_FIELD(n_gpus).set_default(1).describe(
"Number of devices to use for prediction (NOT IMPLEMENTED).");
DMLC_DECLARE_FIELD(silent).set_default(false).describe(
"Do not print information during trainig.");
}
};
DMLC_REGISTER_PARAMETER(GPUPredictionParam);
template <typename iter_t>
void increment_offset(iter_t begin_itr, iter_t end_itr, size_t amount) {
thrust::transform(begin_itr, end_itr, begin_itr,
[=] __device__(size_t elem) { return elem + amount; });
}
/**
* \struct DeviceMatrix
*
* \brief A csr representation of the input matrix allocated on the device.
*/
struct DeviceMatrix {
DMatrix* p_mat; // Pointer to the original matrix on the host
dh::bulk_allocator<dh::memory_type::DEVICE> ba;
dh::dvec<size_t> row_ptr;
dh::dvec<SparseBatch::Entry> data;
thrust::device_vector<float> predictions;
DeviceMatrix(DMatrix* dmat, int device_idx, bool silent) : p_mat(dmat) {
dh::safe_cuda(cudaSetDevice(device_idx));
auto info = dmat->info();
ba.allocate(device_idx, silent, &row_ptr, info.num_row + 1, &data,
info.num_nonzero);
auto iter = dmat->RowIterator();
iter->BeforeFirst();
size_t data_offset = 0;
while (iter->Next()) {
auto batch = iter->Value();
// Copy row ptr
thrust::copy(batch.ind_ptr, batch.ind_ptr + batch.size + 1,
row_ptr.tbegin() + batch.base_rowid);
if (batch.base_rowid > 0) {
auto begin_itr = row_ptr.tbegin() + batch.base_rowid;
auto end_itr = begin_itr + batch.size + 1;
increment_offset(begin_itr, end_itr, batch.base_rowid);
}
// Copy data
thrust::copy(batch.data_ptr, batch.data_ptr + batch.ind_ptr[batch.size],
data.tbegin() + data_offset);
data_offset += batch.ind_ptr[batch.size];
}
}
};
/**
* \struct DevicePredictionNode
*
* \brief Packed 16 byte representation of a tree node for use in device
* prediction
*/
struct DevicePredictionNode {
XGBOOST_DEVICE DevicePredictionNode()
: fidx(-1), left_child_idx(-1), right_child_idx(-1) {}
union NodeValue {
float leaf_weight;
float fvalue;
};
int fidx;
int left_child_idx;
int right_child_idx;
NodeValue val;
DevicePredictionNode(const RegTree::Node& n) { // NOLINT
this->left_child_idx = n.cleft();
this->right_child_idx = n.cright();
this->fidx = n.split_index();
if (n.default_left()) {
fidx |= (1U << 31);
}
if (n.is_leaf()) {
this->val.leaf_weight = n.leaf_value();
} else {
this->val.fvalue = n.split_cond();
}
}
XGBOOST_DEVICE bool IsLeaf() const { return left_child_idx == -1; }
XGBOOST_DEVICE int GetFidx() const { return fidx & ((1U << 31) - 1U); }
XGBOOST_DEVICE bool MissingLeft() const { return (fidx >> 31) != 0; }
XGBOOST_DEVICE int MissingIdx() const {
if (MissingLeft()) {
return this->left_child_idx;
} else {
return this->right_child_idx;
}
}
XGBOOST_DEVICE float GetFvalue() const { return val.fvalue; }
XGBOOST_DEVICE float GetWeight() const { return val.leaf_weight; }
};
struct ElementLoader {
bool use_shared;
size_t* d_row_ptr;
SparseBatch::Entry* d_data;
int num_features;
float* smem;
__device__ ElementLoader(bool use_shared, size_t* row_ptr,
SparseBatch::Entry* entry, int num_features,
float* smem, int num_rows)
: use_shared(use_shared),
d_row_ptr(row_ptr),
d_data(entry),
num_features(num_features),
smem(smem) {
// Copy instances
if (use_shared) {
bst_uint global_idx = blockDim.x * blockIdx.x + threadIdx.x;
int shared_elements = blockDim.x * num_features;
dh::block_fill(smem, shared_elements, nanf(""));
__syncthreads();
if (global_idx < num_rows) {
bst_uint elem_begin = d_row_ptr[global_idx];
bst_uint elem_end = d_row_ptr[global_idx + 1];
for (bst_uint elem_idx = elem_begin; elem_idx < elem_end; elem_idx++) {
SparseBatch::Entry elem = d_data[elem_idx];
smem[threadIdx.x * num_features + elem.index] = elem.fvalue;
}
}
__syncthreads();
}
}
__device__ float GetFvalue(int ridx, int fidx) {
if (use_shared) {
return smem[threadIdx.x * num_features + fidx];
} else {
// Binary search
auto begin_ptr = d_data + d_row_ptr[ridx];
auto end_ptr = d_data + d_row_ptr[ridx + 1];
SparseBatch::Entry* previous_middle = nullptr;
while (end_ptr != begin_ptr) {
auto middle = begin_ptr + (end_ptr - begin_ptr) / 2;
if (middle == previous_middle) {
break;
} else {
previous_middle = middle;
}
if (middle->index == fidx) {
return middle->fvalue;
} else if (middle->index < fidx) {
begin_ptr = middle;
} else {
end_ptr = middle;
}
}
// Value is missing
return nanf("");
}
}
};
__device__ float GetLeafWeight(bst_uint ridx, const DevicePredictionNode* tree,
ElementLoader* loader) {
DevicePredictionNode n = tree[0];
while (!n.IsLeaf()) {
float fvalue = loader->GetFvalue(ridx, n.GetFidx());
// Missing value
if (isnan(fvalue)) {
n = tree[n.MissingIdx()];
} else {
if (fvalue < n.GetFvalue()) {
n = tree[n.left_child_idx];
} else {
n = tree[n.right_child_idx];
}
}
}
return n.GetWeight();
}
template <int BLOCK_THREADS>
__global__ void PredictKernel(const DevicePredictionNode* d_nodes,
float* d_out_predictions, int* d_tree_segments,
int* d_tree_group, size_t* d_row_ptr,
SparseBatch::Entry* d_data, int tree_begin,
int tree_end, int num_features, bst_uint num_rows,
bool use_shared, int num_group) {
extern __shared__ float smem[];
bst_uint global_idx = blockDim.x * blockIdx.x + threadIdx.x;
ElementLoader loader(use_shared, d_row_ptr, d_data, num_features, smem,
num_rows);
if (global_idx >= num_rows) return;
if (num_group == 1) {
float sum = 0;
for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
const DevicePredictionNode* d_tree =
d_nodes + d_tree_segments[tree_idx - tree_begin];
sum += GetLeafWeight(global_idx, d_tree, &loader);
}
d_out_predictions[global_idx] += sum;
} else {
for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
int tree_group = d_tree_group[tree_idx];
const DevicePredictionNode* d_tree =
d_nodes + d_tree_segments[tree_idx - tree_begin];
bst_uint out_prediction_idx = global_idx * num_group + tree_group;
d_out_predictions[out_prediction_idx] +=
GetLeafWeight(global_idx, d_tree, &loader);
}
}
}
class GPUPredictor : public xgboost::Predictor {
private:
void DevicePredictInternal(DMatrix* dmat, std::vector<bst_float>* out_preds,
const gbm::GBTreeModel& model, int tree_begin,
int tree_end) {
if (tree_end - tree_begin == 0) {
return;
}
// Add dmatrix to device if not seen before
if (this->device_matrix_cache_.find(dmat) ==
this->device_matrix_cache_.end()) {
this->device_matrix_cache_.emplace(
dmat, std::unique_ptr<DeviceMatrix>(
new DeviceMatrix(dmat, param.gpu_id, param.silent)));
}
DeviceMatrix* device_matrix = device_matrix_cache_.find(dmat)->second.get();
dh::safe_cuda(cudaSetDevice(param.gpu_id));
CHECK_EQ(model.param.size_leaf_vector, 0);
// Copy decision trees to device
thrust::host_vector<int> h_tree_segments;
h_tree_segments.reserve((tree_end - tree_end) + 1);
int sum = 0;
h_tree_segments.push_back(sum);
for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
sum += model.trees[tree_idx]->GetNodes().size();
h_tree_segments.push_back(sum);
}
thrust::host_vector<DevicePredictionNode> h_nodes(h_tree_segments.back());
for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
auto& src_nodes = model.trees[tree_idx]->GetNodes();
std::copy(src_nodes.begin(), src_nodes.end(),
h_nodes.begin() + h_tree_segments[tree_idx - tree_begin]);
}
nodes.resize(h_nodes.size());
thrust::copy(h_nodes.begin(), h_nodes.end(), nodes.begin());
tree_segments.resize(h_tree_segments.size());
thrust::copy(h_tree_segments.begin(), h_tree_segments.end(),
tree_segments.begin());
tree_group.resize(model.tree_info.size());
thrust::copy(model.tree_info.begin(), model.tree_info.end(),
tree_group.begin());
if (device_matrix->predictions.size() != out_preds->size()) {
device_matrix->predictions.resize(out_preds->size());
thrust::copy(out_preds->begin(), out_preds->end(),
device_matrix->predictions.begin());
}
const int BLOCK_THREADS = 128;
const int GRID_SIZE =
dh::div_round_up(device_matrix->row_ptr.size() - 1, BLOCK_THREADS);
int shared_memory_bytes =
sizeof(float) * device_matrix->p_mat->info().num_col * BLOCK_THREADS;
bool use_shared = true;
if (shared_memory_bytes > dh::max_shared_memory(param.gpu_id)) {
shared_memory_bytes = 0;
use_shared = false;
}
PredictKernel<BLOCK_THREADS>
<<<GRID_SIZE, BLOCK_THREADS, shared_memory_bytes>>>(
dh::raw(nodes), dh::raw(device_matrix->predictions),
dh::raw(tree_segments), dh::raw(tree_group),
device_matrix->row_ptr.data(), device_matrix->data.data(),
tree_begin, tree_end, device_matrix->p_mat->info().num_col,
device_matrix->p_mat->info().num_row, use_shared,
model.param.num_output_group);
dh::safe_cuda(cudaDeviceSynchronize());
thrust::copy(device_matrix->predictions.begin(),
device_matrix->predictions.end(), out_preds->begin());
}
public:
GPUPredictor() : cpu_predictor(Predictor::Create("cpu_predictor")) {}
void PredictBatch(DMatrix* dmat, std::vector<bst_float>* out_preds,
const gbm::GBTreeModel& model, int tree_begin,
unsigned ntree_limit = 0) override {
if (this->PredictFromCache(dmat, out_preds, model, ntree_limit)) {
return;
}
this->InitOutPredictions(dmat->info(), out_preds, model);
int tree_end = ntree_limit * model.param.num_output_group;
if (ntree_limit == 0 || ntree_limit > model.trees.size()) {
tree_end = static_cast<unsigned>(model.trees.size());
}
DevicePredictInternal(dmat, out_preds, model, tree_begin, tree_end);
}
void UpdatePredictionCache(
const gbm::GBTreeModel& model,
std::vector<std::unique_ptr<TreeUpdater>>* updaters,
int num_new_trees) override {
// dh::Timer t;
int old_ntree = model.trees.size() - num_new_trees;
// update cache entry
for (auto& kv : cache_) {
PredictionCacheEntry& e = kv.second;
DMatrix* dmat = kv.first;
if (e.predictions.size() == 0) {
cpu_predictor->PredictBatch(dmat, &(e.predictions), model, 0,
model.trees.size());
} else if (model.param.num_output_group == 1 && updaters->size() > 0 &&
num_new_trees == 1 &&
updaters->back()->UpdatePredictionCache(e.data.get(),
&(e.predictions))) {
{} // do nothing
} else {
DevicePredictInternal(dmat, &(e.predictions), model, old_ntree,
model.trees.size());
}
}
}
void PredictInstance(const SparseBatch::Inst& inst,
std::vector<bst_float>* out_preds,
const gbm::GBTreeModel& model, unsigned ntree_limit,
unsigned root_index) override {
cpu_predictor->PredictInstance(inst, out_preds, model, root_index);
}
void PredictLeaf(DMatrix* p_fmat, std::vector<bst_float>* out_preds,
const gbm::GBTreeModel& model,
unsigned ntree_limit) override {
cpu_predictor->PredictLeaf(p_fmat, out_preds, model, ntree_limit);
}
void PredictContribution(DMatrix* p_fmat,
std::vector<bst_float>* out_contribs,
const gbm::GBTreeModel& model,
unsigned ntree_limit) override {
cpu_predictor->PredictContribution(p_fmat, out_contribs, model,
ntree_limit);
}
void Init(const std::vector<std::pair<std::string, std::string>>& cfg,
const std::vector<std::shared_ptr<DMatrix>>& cache) override {
Predictor::Init(cfg, cache);
cpu_predictor->Init(cfg, cache);
param.InitAllowUnknown(cfg);
}
private:
GPUPredictionParam param;
std::unique_ptr<Predictor> cpu_predictor;
std::unordered_map<DMatrix*, std::unique_ptr<DeviceMatrix>>
device_matrix_cache_;
thrust::device_vector<DevicePredictionNode> nodes;
thrust::device_vector<int> tree_segments;
thrust::device_vector<int> tree_group;
};
XGBOOST_REGISTER_PREDICTOR(GPUPredictor, "gpu_predictor")
.describe("Make predictions using GPU.")
.set_body([]() { return new GPUPredictor(); });
} // namespace predictor
} // namespace xgboost

View File

@@ -1,754 +0,0 @@
/*!
* Copyright 2017 XGBoost contributors
*/
#include <xgboost/tree_updater.h>
#include <utility>
#include <vector>
#include "../../../src/tree/param.h"
#include "updater_gpu_common.cuh"
namespace xgboost {
namespace tree {
DMLC_REGISTRY_FILE_TAG(updater_gpu);
/**
* @brief Absolute BFS order IDs to col-wise unique IDs based on user input
* @param tid the index of the element that this thread should access
* @param abs the array of absolute IDs
* @param colIds the array of column IDs for each element
* @param nodeStart the start of the node ID at this level
* @param nKeys number of nodes at this level.
* @return the uniq key
*/
static HOST_DEV_INLINE node_id_t abs2uniqKey(int tid, const node_id_t* abs,
const int* colIds, node_id_t nodeStart,
int nKeys) {
int a = abs[tid];
if (a == UNUSED_NODE) return a;
return ((a - nodeStart) + (colIds[tid] * nKeys));
}
/**
* @struct Pair
* @brief Pair used for key basd scan operations on bst_gpair
*/
struct Pair {
int key;
bst_gpair value;
};
/** define a key that's not used at all in the entire boosting process */
static const int NONE_KEY = -100;
/**
* @brief Allocate temporary buffers needed for scan operations
* @param tmpScans gradient buffer
* @param tmpKeys keys buffer
* @param size number of elements that will be scanned
*/
template <int BLKDIM_L1L3 = 256>
int scanTempBufferSize(int size) {
int nBlks = dh::div_round_up(size, BLKDIM_L1L3);
return nBlks;
}
struct AddByKey {
template <typename T>
HOST_DEV_INLINE T operator()(const T& first, const T& second) const {
T result;
if (first.key == second.key) {
result.key = first.key;
result.value = first.value + second.value;
} else {
result.key = second.key;
result.value = second.value;
}
return result;
}
};
/**
* @brief Gradient value getter function
* @param id the index into the vals or instIds array to which to fetch
* @param vals the gradient value buffer
* @param instIds instance index buffer
* @return the expected gradient value
*/
HOST_DEV_INLINE bst_gpair get(int id, const bst_gpair* vals,
const int* instIds) {
id = instIds[id];
return vals[id];
}
template <int BLKDIM_L1L3>
__global__ void cubScanByKeyL1(bst_gpair* scans, const bst_gpair* vals,
const int* instIds, bst_gpair* mScans,
int* mKeys, const node_id_t* keys, int nUniqKeys,
const int* colIds, node_id_t nodeStart,
const int size) {
Pair rootPair = {NONE_KEY, bst_gpair(0.f, 0.f)};
int myKey;
bst_gpair myValue;
typedef cub::BlockScan<Pair, BLKDIM_L1L3> BlockScan;
__shared__ typename BlockScan::TempStorage temp_storage;
Pair threadData;
int tid = blockIdx.x * BLKDIM_L1L3 + threadIdx.x;
if (tid < size) {
myKey = abs2uniqKey(tid, keys, colIds, nodeStart, nUniqKeys);
myValue = get(tid, vals, instIds);
} else {
myKey = NONE_KEY;
myValue = 0.f;
}
threadData.key = myKey;
threadData.value = myValue;
// get previous key, especially needed for the last thread in this block
// in order to pass on the partial scan values.
// this statement MUST appear before the checks below!
// else, the result of this shuffle operation will be undefined
int previousKey = __shfl_up(myKey, 1);
// Collectively compute the block-wide exclusive prefix sum
BlockScan(temp_storage)
.ExclusiveScan(threadData, threadData, rootPair, AddByKey());
if (tid < size) {
scans[tid] = threadData.value;
} else {
return;
}
if (threadIdx.x == BLKDIM_L1L3 - 1) {
threadData.value =
(myKey == previousKey) ? threadData.value : bst_gpair(0.0f, 0.0f);
mKeys[blockIdx.x] = myKey;
mScans[blockIdx.x] = threadData.value + myValue;
}
}
template <int BLKSIZE>
__global__ void cubScanByKeyL2(bst_gpair* mScans, int* mKeys, int mLength) {
typedef cub::BlockScan<Pair, BLKSIZE, cub::BLOCK_SCAN_WARP_SCANS> BlockScan;
Pair threadData;
__shared__ typename BlockScan::TempStorage temp_storage;
for (int i = threadIdx.x; i < mLength; i += BLKSIZE - 1) {
threadData.key = mKeys[i];
threadData.value = mScans[i];
BlockScan(temp_storage).InclusiveScan(threadData, threadData, AddByKey());
mScans[i] = threadData.value;
__syncthreads();
}
}
template <int BLKDIM_L1L3>
__global__ void cubScanByKeyL3(bst_gpair* sums, bst_gpair* scans,
const bst_gpair* vals, const int* instIds,
const bst_gpair* mScans, const int* mKeys,
const node_id_t* keys, int nUniqKeys,
const int* colIds, node_id_t nodeStart,
const int size) {
int relId = threadIdx.x;
int tid = (blockIdx.x * BLKDIM_L1L3) + relId;
// to avoid the following warning from nvcc:
// __shared__ memory variable with non-empty constructor or destructor
// (potential race between threads)
__shared__ char gradBuff[sizeof(bst_gpair)];
__shared__ int s_mKeys;
bst_gpair* s_mScans = reinterpret_cast<bst_gpair*>(gradBuff);
if (tid >= size) return;
// cache block-wide partial scan info
if (relId == 0) {
s_mKeys = (blockIdx.x > 0) ? mKeys[blockIdx.x - 1] : NONE_KEY;
s_mScans[0] = (blockIdx.x > 0) ? mScans[blockIdx.x - 1] : bst_gpair();
}
int myKey = abs2uniqKey(tid, keys, colIds, nodeStart, nUniqKeys);
int previousKey =
tid == 0 ? NONE_KEY
: abs2uniqKey(tid - 1, keys, colIds, nodeStart, nUniqKeys);
bst_gpair myValue = scans[tid];
__syncthreads();
if (blockIdx.x > 0 && s_mKeys == previousKey) {
myValue += s_mScans[0];
}
if (tid == size - 1) {
sums[previousKey] = myValue + get(tid, vals, instIds);
}
if ((previousKey != myKey) && (previousKey >= 0)) {
sums[previousKey] = myValue;
myValue = bst_gpair(0.0f, 0.0f);
}
scans[tid] = myValue;
}
/**
* @brief Performs fused reduce and scan by key functionality. It is assumed
* that
* the keys occur contiguously!
* @param sums the output gradient reductions for each element performed
* key-wise
* @param scans the output gradient scans for each element performed key-wise
* @param vals the gradients evaluated for each observation.
* @param instIds instance ids for each element
* @param keys keys to be used to segment the reductions. They need not occur
* contiguously in contrast to scan_by_key. Currently, we need one key per
* value in the 'vals' array.
* @param size number of elements in the 'vals' array
* @param nUniqKeys max number of uniq keys found per column
* @param nCols number of columns
* @param tmpScans temporary scan buffer needed for cub-pyramid algo
* @param tmpKeys temporary key buffer needed for cub-pyramid algo
* @param colIds column indices for each element in the array
* @param nodeStart index of the leftmost node in the current level
*/
template <int BLKDIM_L1L3 = 256, int BLKDIM_L2 = 512>
void reduceScanByKey(bst_gpair* sums, bst_gpair* scans, const bst_gpair* vals,
const int* instIds, const node_id_t* keys, int size,
int nUniqKeys, int nCols, bst_gpair* tmpScans,
int* tmpKeys, const int* colIds, node_id_t nodeStart) {
int nBlks = dh::div_round_up(size, BLKDIM_L1L3);
cudaMemset(sums, 0, nUniqKeys * nCols * sizeof(bst_gpair));
cubScanByKeyL1<BLKDIM_L1L3>
<<<nBlks, BLKDIM_L1L3>>>(scans, vals, instIds, tmpScans, tmpKeys, keys,
nUniqKeys, colIds, nodeStart, size);
cubScanByKeyL2<BLKDIM_L2><<<1, BLKDIM_L2>>>(tmpScans, tmpKeys, nBlks);
cubScanByKeyL3<BLKDIM_L1L3>
<<<nBlks, BLKDIM_L1L3>>>(sums, scans, vals, instIds, tmpScans, tmpKeys,
keys, nUniqKeys, colIds, nodeStart, size);
}
/**
* @struct ExactSplitCandidate
* @brief Abstraction of a possible split in the decision tree
*/
struct ExactSplitCandidate {
/** the optimal gain score for this node */
float score;
/** index where to split in the DMatrix */
int index;
HOST_DEV_INLINE ExactSplitCandidate() : score(-FLT_MAX), index(INT_MAX) {}
/**
* @brief Whether the split info is valid to be used to create a new child
* @param minSplitLoss minimum score above which decision to split is made
* @return true if splittable, else false
*/
HOST_DEV_INLINE bool isSplittable(float minSplitLoss) const {
return ((score >= minSplitLoss) && (index != INT_MAX));
}
};
/**
* @enum ArgMaxByKeyAlgo best_split_evaluation.cuh
* @brief Help decide which algorithm to use for multi-argmax operation
*/
enum ArgMaxByKeyAlgo {
/** simplest, use gmem-atomics for all updates */
ABK_GMEM = 0,
/** use smem-atomics for updates (when number of keys are less) */
ABK_SMEM
};
/** max depth until which to use shared mem based atomics for argmax */
static const int MAX_ABK_LEVELS = 3;
HOST_DEV_INLINE ExactSplitCandidate maxSplit(ExactSplitCandidate a,
ExactSplitCandidate b) {
ExactSplitCandidate out;
if (a.score < b.score) {
out.score = b.score;
out.index = b.index;
} else if (a.score == b.score) {
out.score = a.score;
out.index = (a.index < b.index) ? a.index : b.index;
} else {
out.score = a.score;
out.index = a.index;
}
return out;
}
DEV_INLINE void atomicArgMax(ExactSplitCandidate* address,
ExactSplitCandidate val) {
unsigned long long* intAddress = (unsigned long long*)address; // NOLINT
unsigned long long old = *intAddress; // NOLINT
unsigned long long assumed; // NOLINT
do {
assumed = old;
ExactSplitCandidate res =
maxSplit(val, *reinterpret_cast<ExactSplitCandidate*>(&assumed));
old = atomicCAS(intAddress, assumed, *reinterpret_cast<uint64_t*>(&res));
} while (assumed != old);
}
DEV_INLINE void argMaxWithAtomics(
int id, ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
const bst_gpair* gradSums, const float* vals, const int* colIds,
const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
node_id_t nodeStart, int len, const GPUTrainingParam& param) {
int nodeId = nodeAssigns[id];
// @todo: this is really a bad check! but will be fixed when we move
// to key-based reduction
if ((id == 0) ||
!((nodeId == nodeAssigns[id - 1]) && (colIds[id] == colIds[id - 1]) &&
(vals[id] == vals[id - 1]))) {
if (nodeId != UNUSED_NODE) {
int sumId = abs2uniqKey(id, nodeAssigns, colIds, nodeStart, nUniqKeys);
bst_gpair colSum = gradSums[sumId];
int uid = nodeId - nodeStart;
DeviceDenseNode n = nodes[nodeId];
bst_gpair parentSum = n.sum_gradients;
float parentGain = n.root_gain;
bool tmp;
ExactSplitCandidate s;
bst_gpair missing = parentSum - colSum;
s.score = loss_chg_missing(gradScans[id], missing, parentSum, parentGain,
param, tmp);
s.index = id;
atomicArgMax(nodeSplits + uid, s);
} // end if nodeId != UNUSED_NODE
} // end if id == 0 ...
}
__global__ void atomicArgMaxByKeyGmem(
ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
const bst_gpair* gradSums, const float* vals, const int* colIds,
const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
node_id_t nodeStart, int len, const TrainParam param) {
int id = threadIdx.x + (blockIdx.x * blockDim.x);
const int stride = blockDim.x * gridDim.x;
for (; id < len; id += stride) {
argMaxWithAtomics(id, nodeSplits, gradScans, gradSums, vals, colIds,
nodeAssigns, nodes, nUniqKeys, nodeStart, len,
GPUTrainingParam(param));
}
}
__global__ void atomicArgMaxByKeySmem(
ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
const bst_gpair* gradSums, const float* vals, const int* colIds,
const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
node_id_t nodeStart, int len, const TrainParam param) {
extern __shared__ char sArr[];
ExactSplitCandidate* sNodeSplits =
reinterpret_cast<ExactSplitCandidate*>(sArr);
int tid = threadIdx.x;
ExactSplitCandidate defVal;
#pragma unroll 1
for (int i = tid; i < nUniqKeys; i += blockDim.x) {
sNodeSplits[i] = defVal;
}
__syncthreads();
int id = tid + (blockIdx.x * blockDim.x);
const int stride = blockDim.x * gridDim.x;
for (; id < len; id += stride) {
argMaxWithAtomics(id, sNodeSplits, gradScans, gradSums, vals, colIds,
nodeAssigns, nodes, nUniqKeys, nodeStart, len, param);
}
__syncthreads();
for (int i = tid; i < nUniqKeys; i += blockDim.x) {
ExactSplitCandidate s = sNodeSplits[i];
atomicArgMax(nodeSplits + i, s);
}
}
/**
* @brief Performs argmax_by_key functionality but for cases when keys need not
* occur contiguously
* @param nodeSplits will contain information on best split for each node
* @param gradScans exclusive sum on sorted segments for each col
* @param gradSums gradient sum for each column in DMatrix based on to node-ids
* @param vals feature values
* @param colIds column index for each element in the feature values array
* @param nodeAssigns node-id assignments to each element in DMatrix
* @param nodes pointer to all nodes for this tree in BFS order
* @param nUniqKeys number of unique node-ids in this level
* @param nodeStart start index of the node-ids in this level
* @param len number of elements
* @param param training parameters
* @param algo which algorithm to use for argmax_by_key
*/
template <int BLKDIM = 256, int ITEMS_PER_THREAD = 4>
void argMaxByKey(ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
const bst_gpair* gradSums, const float* vals,
const int* colIds, const node_id_t* nodeAssigns,
const DeviceDenseNode* nodes, int nUniqKeys,
node_id_t nodeStart, int len, const TrainParam param,
ArgMaxByKeyAlgo algo) {
dh::fillConst<ExactSplitCandidate, BLKDIM, ITEMS_PER_THREAD>(
dh::get_device_idx(param.gpu_id), nodeSplits, nUniqKeys,
ExactSplitCandidate());
int nBlks = dh::div_round_up(len, ITEMS_PER_THREAD * BLKDIM);
switch (algo) {
case ABK_GMEM:
atomicArgMaxByKeyGmem<<<nBlks, BLKDIM>>>(
nodeSplits, gradScans, gradSums, vals, colIds, nodeAssigns, nodes,
nUniqKeys, nodeStart, len, param);
break;
case ABK_SMEM:
atomicArgMaxByKeySmem<<<nBlks, BLKDIM,
sizeof(ExactSplitCandidate) * nUniqKeys>>>(
nodeSplits, gradScans, gradSums, vals, colIds, nodeAssigns, nodes,
nUniqKeys, nodeStart, len, param);
break;
default:
throw std::runtime_error("argMaxByKey: Bad algo passed!");
}
}
__global__ void assignColIds(int* colIds, const int* colOffsets) {
int myId = blockIdx.x;
int start = colOffsets[myId];
int end = colOffsets[myId + 1];
for (int id = start + threadIdx.x; id < end; id += blockDim.x) {
colIds[id] = myId;
}
}
__global__ void fillDefaultNodeIds(node_id_t* nodeIdsPerInst,
const DeviceDenseNode* nodes, int nRows) {
int id = threadIdx.x + (blockIdx.x * blockDim.x);
if (id >= nRows) {
return;
}
// if this element belongs to none of the currently active node-id's
node_id_t nId = nodeIdsPerInst[id];
if (nId == UNUSED_NODE) {
return;
}
const DeviceDenseNode n = nodes[nId];
node_id_t result;
if (n.IsLeaf() || n.IsUnused()) {
result = UNUSED_NODE;
} else if (n.dir == LeftDir) {
result = (2 * n.idx) + 1;
} else {
result = (2 * n.idx) + 2;
}
nodeIdsPerInst[id] = result;
}
__global__ void assignNodeIds(node_id_t* nodeIdsPerInst, int* nodeLocations,
const node_id_t* nodeIds, const int* instId,
const DeviceDenseNode* nodes,
const int* colOffsets, const float* vals,
int nVals, int nCols) {
int id = threadIdx.x + (blockIdx.x * blockDim.x);
const int stride = blockDim.x * gridDim.x;
for (; id < nVals; id += stride) {
// fusing generation of indices for node locations
nodeLocations[id] = id;
// using nodeIds here since the previous kernel would have updated
// the nodeIdsPerInst with all default assignments
int nId = nodeIds[id];
// if this element belongs to none of the currently active node-id's
if (nId != UNUSED_NODE) {
const DeviceDenseNode n = nodes[nId];
int colId = n.fidx;
// printf("nid=%d colId=%d id=%d\n", nId, colId, id);
int start = colOffsets[colId];
int end = colOffsets[colId + 1];
// @todo: too much wasteful threads!!
if ((id >= start) && (id < end) && !(n.IsLeaf() || n.IsUnused())) {
node_id_t result = (2 * n.idx) + 1 + (vals[id] >= n.fvalue);
nodeIdsPerInst[instId[id]] = result;
}
}
}
}
__global__ void markLeavesKernel(DeviceDenseNode* nodes, int len) {
int id = (blockIdx.x * blockDim.x) + threadIdx.x;
if ((id < len) && !nodes[id].IsUnused()) {
int lid = (id << 1) + 1;
int rid = (id << 1) + 2;
if ((lid >= len) || (rid >= len)) {
nodes[id].root_gain = -FLT_MAX; // bottom-most nodes
} else if (nodes[lid].IsUnused() && nodes[rid].IsUnused()) {
nodes[id].root_gain = -FLT_MAX; // unused child nodes
}
}
}
class GPUMaker : public TreeUpdater {
protected:
TrainParam param;
/** whether we have initialized memory already (so as not to repeat!) */
bool allocated;
/** feature values stored in column-major compressed format */
dh::dvec2<float> vals;
dh::dvec<float> vals_cached;
/** corresponding instance id's of these featutre values */
dh::dvec2<int> instIds;
dh::dvec<int> instIds_cached;
/** column offsets for these feature values */
dh::dvec<int> colOffsets;
dh::dvec<bst_gpair> gradsInst;
dh::dvec2<node_id_t> nodeAssigns;
dh::dvec2<int> nodeLocations;
dh::dvec<DeviceDenseNode> nodes;
dh::dvec<node_id_t> nodeAssignsPerInst;
dh::dvec<bst_gpair> gradSums;
dh::dvec<bst_gpair> gradScans;
dh::dvec<ExactSplitCandidate> nodeSplits;
int nVals;
int nRows;
int nCols;
int maxNodes;
int maxLeaves;
dh::CubMemory tmp_mem;
dh::dvec<bst_gpair> tmpScanGradBuff;
dh::dvec<int> tmpScanKeyBuff;
dh::dvec<int> colIds;
dh::bulk_allocator<dh::memory_type::DEVICE> ba;
public:
GPUMaker() : allocated(false) {}
~GPUMaker() {}
void Init(
const std::vector<std::pair<std::string, std::string>>& args) override {
param.InitAllowUnknown(args);
maxNodes = (1 << (param.max_depth + 1)) - 1;
maxLeaves = 1 << param.max_depth;
}
void Update(const std::vector<bst_gpair>& gpair, DMatrix* dmat,
const std::vector<RegTree*>& trees) override {
GradStats::CheckInfo(dmat->info());
// rescale learning rate according to size of trees
float lr = param.learning_rate;
param.learning_rate = lr / trees.size();
try {
// build tree
for (size_t i = 0; i < trees.size(); ++i) {
UpdateTree(gpair, dmat, trees[i]);
}
} catch (const std::exception& e) {
LOG(FATAL) << "GPU plugin exception: " << e.what() << std::endl;
}
param.learning_rate = lr;
}
/// @note: Update should be only after Init!!
void UpdateTree(const std::vector<bst_gpair>& gpair, DMatrix* dmat,
RegTree* hTree) {
if (!allocated) {
setupOneTimeData(dmat);
}
for (int i = 0; i < param.max_depth; ++i) {
if (i == 0) {
// make sure to start on a fresh tree with sorted values!
vals.current_dvec() = vals_cached;
instIds.current_dvec() = instIds_cached;
transferGrads(gpair);
}
int nNodes = 1 << i;
node_id_t nodeStart = nNodes - 1;
initNodeData(i, nodeStart, nNodes);
findSplit(i, nodeStart, nNodes);
}
// mark all the used nodes with unused children as leaf nodes
markLeaves();
dense2sparse_tree(hTree, nodes, param);
}
void split2node(int nNodes, node_id_t nodeStart) {
auto d_nodes = nodes.data();
auto d_gradScans = gradScans.data();
auto d_gradSums = gradSums.data();
auto d_nodeAssigns = nodeAssigns.current();
auto d_colIds = colIds.data();
auto d_vals = vals.current();
auto d_nodeSplits = nodeSplits.data();
int nUniqKeys = nNodes;
float min_split_loss = param.min_split_loss;
auto gpu_param = GPUTrainingParam(param);
dh::launch_n(param.gpu_id, nNodes, [=] __device__(int uid) {
int absNodeId = uid + nodeStart;
ExactSplitCandidate s = d_nodeSplits[uid];
if (s.isSplittable(min_split_loss)) {
int idx = s.index;
int nodeInstId =
abs2uniqKey(idx, d_nodeAssigns, d_colIds, nodeStart, nUniqKeys);
bool missingLeft = true;
const DeviceDenseNode& n = d_nodes[absNodeId];
bst_gpair gradScan = d_gradScans[idx];
bst_gpair gradSum = d_gradSums[nodeInstId];
float thresh = d_vals[idx];
int colId = d_colIds[idx];
// get the default direction for the current node
bst_gpair missing = n.sum_gradients - gradSum;
loss_chg_missing(gradScan, missing, n.sum_gradients, n.root_gain,
gpu_param, missingLeft);
// get the score/weight/id/gradSum for left and right child nodes
bst_gpair lGradSum = missingLeft ? gradScan + missing : gradScan;
bst_gpair rGradSum = n.sum_gradients - lGradSum;
// Create children
d_nodes[left_child_nidx(absNodeId)] =
DeviceDenseNode(lGradSum, left_child_nidx(absNodeId), gpu_param);
d_nodes[right_child_nidx(absNodeId)] =
DeviceDenseNode(rGradSum, right_child_nidx(absNodeId), gpu_param);
// Set split for parent
d_nodes[absNodeId].SetSplit(thresh, colId,
missingLeft ? LeftDir : RightDir);
} else {
// cannot be split further, so this node is a leaf!
d_nodes[absNodeId].root_gain = -FLT_MAX;
}
});
}
void findSplit(int level, node_id_t nodeStart, int nNodes) {
reduceScanByKey(gradSums.data(), gradScans.data(), gradsInst.data(),
instIds.current(), nodeAssigns.current(), nVals, nNodes,
nCols, tmpScanGradBuff.data(), tmpScanKeyBuff.data(),
colIds.data(), nodeStart);
argMaxByKey(nodeSplits.data(), gradScans.data(), gradSums.data(),
vals.current(), colIds.data(), nodeAssigns.current(),
nodes.data(), nNodes, nodeStart, nVals, param,
level <= MAX_ABK_LEVELS ? ABK_SMEM : ABK_GMEM);
split2node(nNodes, nodeStart);
}
void allocateAllData(int offsetSize) {
int tmpBuffSize = scanTempBufferSize(nVals);
ba.allocate(dh::get_device_idx(param.gpu_id), param.silent, &vals, nVals,
&vals_cached, nVals, &instIds, nVals, &instIds_cached, nVals,
&colOffsets, offsetSize, &gradsInst, nRows, &nodeAssigns, nVals,
&nodeLocations, nVals, &nodes, maxNodes, &nodeAssignsPerInst,
nRows, &gradSums, maxLeaves * nCols, &gradScans, nVals,
&nodeSplits, maxLeaves, &tmpScanGradBuff, tmpBuffSize,
&tmpScanKeyBuff, tmpBuffSize, &colIds, nVals);
}
void setupOneTimeData(DMatrix* dmat) {
size_t free_memory = dh::available_memory(dh::get_device_idx(param.gpu_id));
if (!dmat->SingleColBlock()) {
throw std::runtime_error("exact::GPUBuilder - must have 1 column block");
}
std::vector<float> fval;
std::vector<int> fId, offset;
convertToCsc(dmat, &fval, &fId, &offset);
allocateAllData(static_cast<int>(offset.size()));
transferAndSortData(fval, fId, offset);
allocated = true;
}
void convertToCsc(DMatrix* dmat, std::vector<float>* fval,
std::vector<int>* fId, std::vector<int>* offset) {
MetaInfo info = dmat->info();
nRows = info.num_row;
nCols = info.num_col;
offset->reserve(nCols + 1);
offset->push_back(0);
fval->reserve(nCols * nRows);
fId->reserve(nCols * nRows);
// in case you end up with a DMatrix having no column access
// then make sure to enable that before copying the data!
if (!dmat->HaveColAccess()) {
const std::vector<bool> enable(nCols, true);
dmat->InitColAccess(enable, 1, nRows);
}
dmlc::DataIter<ColBatch>* iter = dmat->ColIterator();
iter->BeforeFirst();
while (iter->Next()) {
const ColBatch& batch = iter->Value();
for (int i = 0; i < batch.size; i++) {
const ColBatch::Inst& col = batch[i];
for (const ColBatch::Entry* it = col.data; it != col.data + col.length;
it++) {
int inst_id = static_cast<int>(it->index);
fval->push_back(it->fvalue);
fId->push_back(inst_id);
}
offset->push_back(fval->size());
}
}
nVals = fval->size();
}
void transferAndSortData(const std::vector<float>& fval,
const std::vector<int>& fId,
const std::vector<int>& offset) {
vals.current_dvec() = fval;
instIds.current_dvec() = fId;
colOffsets = offset;
dh::segmentedSort<float, int>(&tmp_mem, &vals, &instIds, nVals, nCols,
colOffsets);
vals_cached = vals.current_dvec();
instIds_cached = instIds.current_dvec();
assignColIds<<<nCols, 512>>>(colIds.data(), colOffsets.data());
}
void transferGrads(const std::vector<bst_gpair>& gpair) {
// HACK
dh::safe_cuda(cudaMemcpy(gradsInst.data(), &(gpair[0]),
sizeof(bst_gpair) * nRows,
cudaMemcpyHostToDevice));
// evaluate the full-grad reduction for the root node
dh::sumReduction<bst_gpair>(tmp_mem, gradsInst, gradSums, nRows);
}
void initNodeData(int level, node_id_t nodeStart, int nNodes) {
// all instances belong to root node at the beginning!
if (level == 0) {
nodes.fill(DeviceDenseNode());
nodeAssigns.current_dvec().fill(0);
nodeAssignsPerInst.fill(0);
// for root node, just update the gradient/score/weight/id info
// before splitting it! Currently all data is on GPU, hence this
// stupid little kernel
auto d_nodes = nodes.data();
auto d_sums = gradSums.data();
auto gpu_params = GPUTrainingParam(param);
dh::launch_n(param.gpu_id, 1, [=] __device__(int idx) {
d_nodes[0] = DeviceDenseNode(d_sums[0], 0, gpu_params);
});
} else {
const int BlkDim = 256;
const int ItemsPerThread = 4;
// assign default node ids first
int nBlks = dh::div_round_up(nRows, BlkDim);
fillDefaultNodeIds<<<nBlks, BlkDim>>>(nodeAssignsPerInst.data(),
nodes.data(), nRows);
// evaluate the correct child indices of non-missing values next
nBlks = dh::div_round_up(nVals, BlkDim * ItemsPerThread);
assignNodeIds<<<nBlks, BlkDim>>>(
nodeAssignsPerInst.data(), nodeLocations.current(),
nodeAssigns.current(), instIds.current(), nodes.data(),
colOffsets.data(), vals.current(), nVals, nCols);
// gather the node assignments across all other columns too
dh::gather(dh::get_device_idx(param.gpu_id), nodeAssigns.current(),
nodeAssignsPerInst.data(), instIds.current(), nVals);
sortKeys(level);
}
}
void sortKeys(int level) {
// segmented-sort the arrays based on node-id's
// but we don't need more than level+1 bits for sorting!
segmentedSort(&tmp_mem, &nodeAssigns, &nodeLocations, nVals, nCols,
colOffsets, 0, level + 1);
dh::gather<float, int>(dh::get_device_idx(param.gpu_id), vals.other(),
vals.current(), instIds.other(), instIds.current(),
nodeLocations.current(), nVals);
vals.buff().selector ^= 1;
instIds.buff().selector ^= 1;
}
void markLeaves() {
const int BlkDim = 128;
int nBlks = dh::div_round_up(maxNodes, BlkDim);
markLeavesKernel<<<nBlks, BlkDim>>>(nodes.data(), maxNodes);
}
};
XGBOOST_REGISTER_TREE_UPDATER(GPUMaker, "grow_gpu")
.describe("Grow tree with GPU.")
.set_body([]() { return new GPUMaker(); });
} // namespace tree
} // namespace xgboost

View File

@@ -1,243 +0,0 @@
/*!
* Copyright 2017 XGBoost contributors
*/
#pragma once
#include <thrust/random.h>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>
#include "../../../src/common/random.h"
#include "../../../src/tree/param.h"
#include "cub/cub.cuh"
#include "device_helpers.cuh"
namespace xgboost {
namespace tree {
struct GPUTrainingParam {
// minimum amount of hessian(weight) allowed in a child
float min_child_weight;
// L2 regularization factor
float reg_lambda;
// L1 regularization factor
float reg_alpha;
// maximum delta update we can add in weight estimation
// this parameter can be used to stabilize update
// default=0 means no constraint on weight delta
float max_delta_step;
__host__ __device__ GPUTrainingParam() {}
__host__ __device__ GPUTrainingParam(const TrainParam& param)
: min_child_weight(param.min_child_weight),
reg_lambda(param.reg_lambda),
reg_alpha(param.reg_alpha),
max_delta_step(param.max_delta_step) {}
};
typedef int node_id_t;
/** used to assign default id to a Node */
static const int UNUSED_NODE = -1;
/**
* @enum DefaultDirection node.cuh
* @brief Default direction to be followed in case of missing values
*/
enum DefaultDirection {
/** move to left child */
LeftDir = 0,
/** move to right child */
RightDir
};
struct DeviceDenseNode {
bst_gpair sum_gradients;
float root_gain;
float weight;
/** default direction for missing values */
DefaultDirection dir;
/** threshold value for comparison */
float fvalue;
/** \brief The feature index. */
int fidx;
/** node id (used as key for reduce/scan) */
node_id_t idx;
HOST_DEV_INLINE DeviceDenseNode()
: sum_gradients(),
root_gain(-FLT_MAX),
weight(-FLT_MAX),
dir(LeftDir),
fvalue(0.f),
fidx(UNUSED_NODE),
idx(UNUSED_NODE) {}
HOST_DEV_INLINE DeviceDenseNode(bst_gpair sum_gradients, node_id_t nidx,
const GPUTrainingParam& param)
: sum_gradients(sum_gradients),
dir(LeftDir),
fvalue(0.f),
fidx(UNUSED_NODE),
idx(nidx) {
this->root_gain = CalcGain(param, sum_gradients.grad, sum_gradients.hess);
this->weight = CalcWeight(param, sum_gradients.grad, sum_gradients.hess);
}
HOST_DEV_INLINE void SetSplit(float fvalue, int fidx, DefaultDirection dir) {
this->fvalue = fvalue;
this->fidx = fidx;
this->dir = dir;
}
/** Tells whether this node is part of the decision tree */
HOST_DEV_INLINE bool IsUnused() const { return (idx == UNUSED_NODE); }
/** Tells whether this node is a leaf of the decision tree */
HOST_DEV_INLINE bool IsLeaf() const {
return (!IsUnused() && (fidx == UNUSED_NODE));
}
};
template <typename gpair_t>
__device__ inline float device_calc_loss_chg(
const GPUTrainingParam& param, const gpair_t& scan, const gpair_t& missing,
const gpair_t& parent_sum, const float& parent_gain, bool missing_left) {
gpair_t left = scan;
if (missing_left) {
left += missing;
}
gpair_t right = parent_sum - left;
float left_gain = CalcGain(param, left.grad, left.hess);
float right_gain = CalcGain(param, right.grad, right.hess);
return left_gain + right_gain - parent_gain;
}
template <typename gpair_t>
__device__ float inline loss_chg_missing(const gpair_t& scan,
const gpair_t& missing,
const gpair_t& parent_sum,
const float& parent_gain,
const GPUTrainingParam& param,
bool& missing_left_out) { // NOLINT
float missing_left_loss =
device_calc_loss_chg(param, scan, missing, parent_sum, parent_gain, true);
float missing_right_loss = device_calc_loss_chg(
param, scan, missing, parent_sum, parent_gain, false);
if (missing_left_loss >= missing_right_loss) {
missing_left_out = true;
return missing_left_loss;
} else {
missing_left_out = false;
return missing_right_loss;
}
}
// Total number of nodes in tree, given depth
__host__ __device__ inline int n_nodes(int depth) {
return (1 << (depth + 1)) - 1;
}
// Number of nodes at this level of the tree
__host__ __device__ inline int n_nodes_level(int depth) { return 1 << depth; }
// Whether a node is currently being processed at current depth
__host__ __device__ inline bool is_active(int nidx, int depth) {
return nidx >= n_nodes(depth - 1);
}
__host__ __device__ inline int parent_nidx(int nidx) { return (nidx - 1) / 2; }
__host__ __device__ inline int left_child_nidx(int nidx) {
return nidx * 2 + 1;
}
__host__ __device__ inline int right_child_nidx(int nidx) {
return nidx * 2 + 2;
}
__host__ __device__ inline bool is_left_child(int nidx) {
return nidx % 2 == 1;
}
// Copy gpu dense representation of tree to xgboost sparse representation
inline void dense2sparse_tree(RegTree* p_tree,
const dh::dvec<DeviceDenseNode>& nodes,
const TrainParam& param) {
RegTree& tree = *p_tree;
std::vector<DeviceDenseNode> h_nodes = nodes.as_vector();
int nid = 0;
for (int gpu_nid = 0; gpu_nid < h_nodes.size(); gpu_nid++) {
const DeviceDenseNode& n = h_nodes[gpu_nid];
if (!n.IsUnused() && !n.IsLeaf()) {
tree.AddChilds(nid);
tree[nid].set_split(n.fidx, n.fvalue, n.dir == LeftDir);
tree.stat(nid).loss_chg = n.root_gain;
tree.stat(nid).base_weight = n.weight;
tree.stat(nid).sum_hess = n.sum_gradients.hess;
tree[tree[nid].cleft()].set_leaf(0);
tree[tree[nid].cright()].set_leaf(0);
nid++;
} else if (n.IsLeaf()) {
tree[nid].set_leaf(n.weight * param.learning_rate);
tree.stat(nid).sum_hess = n.sum_gradients.hess;
nid++;
}
}
}
/*
* Random
*/
struct BernoulliRng {
float p;
int seed;
__host__ __device__ BernoulliRng(float p, int seed) : p(p), seed(seed) {}
__host__ __device__ bool operator()(const int i) const {
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> dist;
rng.discard(i);
return dist(rng) <= p;
}
};
// Set gradient pair to 0 with p = 1 - subsample
inline void subsample_gpair(dh::dvec<bst_gpair>* p_gpair, float subsample,
int offset = 0) {
if (subsample == 1.0) {
return;
}
dh::dvec<bst_gpair>& gpair = *p_gpair;
auto d_gpair = gpair.data();
BernoulliRng rng(subsample, common::GlobalRandom()());
dh::launch_n(gpair.device_idx(), gpair.size(), [=] __device__(int i) {
if (!rng(i + offset)) {
d_gpair[i] = bst_gpair();
}
});
}
inline std::vector<int> col_sample(std::vector<int> features, float colsample) {
CHECK_GT(features.size(), 0);
int n = std::max(1, static_cast<int>(colsample * features.size()));
std::shuffle(features.begin(), features.end(), common::GlobalRandom());
features.resize(n);
return features;
}
} // namespace tree
} // namespace xgboost

File diff suppressed because it is too large Load Diff

View File

@@ -1,78 +0,0 @@
/*!
* Copyright 2017 XGBoost contributors
*/
#include <thrust/device_vector.h>
#include <xgboost/base.h>
#include "../../src/device_helpers.cuh"
#include "gtest/gtest.h"
void CreateTestData(xgboost::bst_uint num_rows, int max_row_size,
thrust::host_vector<int> *row_ptr,
thrust::host_vector<xgboost::bst_uint> *rows) {
row_ptr->resize(num_rows + 1);
int sum = 0;
for (int i = 0; i <= num_rows; i++) {
(*row_ptr)[i] = sum;
sum += rand() % max_row_size; // NOLINT
if (i < num_rows) {
for (int j = (*row_ptr)[i]; j < sum; j++) {
(*rows).push_back(i);
}
}
}
}
void SpeedTest() {
int num_rows = 1000000;
int max_row_size = 100;
dh::CubMemory temp_memory;
thrust::host_vector<int> h_row_ptr;
thrust::host_vector<xgboost::bst_uint> h_rows;
CreateTestData(num_rows, max_row_size, &h_row_ptr, &h_rows);
thrust::device_vector<int> row_ptr = h_row_ptr;
thrust::device_vector<int> output_row(h_rows.size());
auto d_output_row = output_row.data();
dh::Timer t;
dh::TransformLbs(
0, &temp_memory, h_rows.size(), dh::raw(row_ptr), row_ptr.size() - 1, false,
[=] __device__(size_t idx, size_t ridx) { d_output_row[idx] = ridx; });
dh::safe_cuda(cudaDeviceSynchronize());
double time = t.elapsedSeconds();
const int mb_size = 1048576;
size_t size = (sizeof(int) * h_rows.size()) / mb_size;
printf("size: %llumb, time: %fs, bandwidth: %fmb/s\n", size, time,
size / time);
}
void TestLbs() {
srand(17);
dh::CubMemory temp_memory;
std::vector<int> test_rows = {4, 100, 1000};
std::vector<int> test_max_row_sizes = {4, 100, 1300};
for (auto num_rows : test_rows) {
for (auto max_row_size : test_max_row_sizes) {
thrust::host_vector<int> h_row_ptr;
thrust::host_vector<xgboost::bst_uint> h_rows;
CreateTestData(num_rows, max_row_size, &h_row_ptr, &h_rows);
thrust::device_vector<size_t> row_ptr = h_row_ptr;
thrust::device_vector<int> output_row(h_rows.size());
auto d_output_row = output_row.data();
dh::TransformLbs(0, &temp_memory, h_rows.size(), dh::raw(row_ptr),
row_ptr.size() - 1, false,
[=] __device__(size_t idx, size_t ridx) {
d_output_row[idx] = ridx;
});
dh::safe_cuda(cudaDeviceSynchronize());
ASSERT_TRUE(h_rows == output_row);
}
}
}
TEST(cub_lbs, Test) { TestLbs(); }

View File

@@ -1,73 +0,0 @@
/*!
* Copyright 2017 XGBoost contributors
*/
#include <xgboost/c_api.h>
#include <xgboost/predictor.h>
#include "gtest/gtest.h"
#include "../../../../tests/cpp/helpers.h"
namespace xgboost {
namespace predictor {
TEST(gpu_predictor, Test) {
std::unique_ptr<Predictor> gpu_predictor =
std::unique_ptr<Predictor>(Predictor::Create("gpu_predictor"));
std::unique_ptr<Predictor> cpu_predictor =
std::unique_ptr<Predictor>(Predictor::Create("cpu_predictor"));
std::vector<std::unique_ptr<RegTree>> trees;
trees.push_back(std::make_unique<RegTree>());
trees.back()->InitModel();
(*trees.back())[0].set_leaf(1.5f);
gbm::GBTreeModel model(0.5);
model.CommitModel(std::move(trees), 0);
model.param.num_output_group = 1;
int n_row = 5;
int n_col = 5;
auto dmat = CreateDMatrix(n_row, n_col, 0);
// Test predict batch
std::vector<float> gpu_out_predictions;
std::vector<float> cpu_out_predictions;
gpu_predictor->PredictBatch(dmat.get(), &gpu_out_predictions, model, 0);
cpu_predictor->PredictBatch(dmat.get(), &cpu_out_predictions, model, 0);
float abs_tolerance = 0.001;
for (int i = 0; i < gpu_out_predictions.size(); i++) {
ASSERT_LT(std::abs(gpu_out_predictions[i] - cpu_out_predictions[i]),
abs_tolerance);
}
// Test predict instance
auto batch = dmat->RowIterator()->Value();
for (int i = 0; i < batch.size; i++) {
std::vector<float> gpu_instance_out_predictions;
std::vector<float> cpu_instance_out_predictions;
cpu_predictor->PredictInstance(batch[i], &cpu_instance_out_predictions,
model);
gpu_predictor->PredictInstance(batch[i], &gpu_instance_out_predictions,
model);
ASSERT_EQ(gpu_instance_out_predictions[0], cpu_instance_out_predictions[0]);
}
// Test predict leaf
std::vector<float> gpu_leaf_out_predictions;
std::vector<float> cpu_leaf_out_predictions;
cpu_predictor->PredictLeaf(dmat.get(), &cpu_leaf_out_predictions, model);
gpu_predictor->PredictLeaf(dmat.get(), &gpu_leaf_out_predictions, model);
for (int i = 0; i < gpu_leaf_out_predictions.size(); i++) {
ASSERT_EQ(gpu_leaf_out_predictions[i], cpu_leaf_out_predictions[i]);
}
// Test predict contribution
std::vector<float> gpu_out_contribution;
std::vector<float> cpu_out_contribution;
cpu_predictor->PredictContribution(dmat.get(), &cpu_out_contribution, model);
gpu_predictor->PredictContribution(dmat.get(), &gpu_out_contribution, model);
for (int i = 0; i < gpu_out_contribution.size(); i++) {
ASSERT_EQ(gpu_out_contribution[i], cpu_out_contribution[i]);
}
}
} // namespace predictor
} // namespace xgboost

View File

@@ -1,319 +0,0 @@
from __future__ import print_function
#pylint: skip-file
import sys
sys.path.append("../../tests/python")
import xgboost as xgb
import testing as tm
import numpy as np
import unittest
rng = np.random.RandomState(1994)
dpath = '../../demo/data/'
ag_dtrain = xgb.DMatrix(dpath + 'agaricus.txt.train')
ag_dtest = xgb.DMatrix(dpath + 'agaricus.txt.test')
def eprint(*args, **kwargs):
print(*args, file=sys.stderr, **kwargs)
print(*args, file=sys.stdout, **kwargs)
class TestGPU(unittest.TestCase):
def test_grow_gpu(self):
tm._skip_if_no_sklearn()
from sklearn.datasets import load_digits
try:
from sklearn.model_selection import train_test_split
except:
from sklearn.cross_validation import train_test_split
ag_param = {'max_depth': 2,
'tree_method': 'exact',
'nthread': 0,
'eta': 1,
'silent': 1,
'debug_verbose': 0,
'objective': 'binary:logistic',
'eval_metric': 'auc'}
ag_param2 = {'max_depth': 2,
'tree_method': 'gpu_exact',
'nthread': 0,
'eta': 1,
'silent': 1,
'debug_verbose': 0,
'objective': 'binary:logistic',
'eval_metric': 'auc'}
ag_res = {}
ag_res2 = {}
num_rounds = 10
xgb.train(ag_param, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
evals_result=ag_res)
xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
evals_result=ag_res2)
assert ag_res['train']['auc'] == ag_res2['train']['auc']
assert ag_res['test']['auc'] == ag_res2['test']['auc']
digits = load_digits(2)
X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_exact',
'max_depth': 3,
'debug_verbose': 0,
'eval_metric': 'auc'}
res = {}
xgb.train(param, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
evals_result=res)
assert self.non_decreasing(res['train']['auc'])
assert self.non_decreasing(res['test']['auc'])
# fail-safe test for dense data
from sklearn.datasets import load_svmlight_file
X2, y2 = load_svmlight_file(dpath + 'agaricus.txt.train')
X2 = X2.toarray()
dtrain2 = xgb.DMatrix(X2, label=y2)
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_exact',
'max_depth': 2,
'debug_verbose': 0,
'eval_metric': 'auc'}
res = {}
xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
assert res['train']['auc'][0] >= 0.85
for j in range(X2.shape[1]):
for i in rng.choice(X2.shape[0], size=num_rounds, replace=False):
X2[i, j] = 2
dtrain3 = xgb.DMatrix(X2, label=y2)
res = {}
xgb.train(param, dtrain3, num_rounds, [(dtrain3, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
assert res['train']['auc'][0] >= 0.85
for j in range(X2.shape[1]):
for i in np.random.choice(X2.shape[0], size=num_rounds, replace=False):
X2[i, j] = 3
dtrain4 = xgb.DMatrix(X2, label=y2)
res = {}
xgb.train(param, dtrain4, num_rounds, [(dtrain4, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
assert res['train']['auc'][0] >= 0.85
def test_grow_gpu_hist(self):
n_gpus=-1
tm._skip_if_no_sklearn()
from sklearn.datasets import load_digits
try:
from sklearn.model_selection import train_test_split
except:
from sklearn.cross_validation import train_test_split
for max_depth in range(3,10): # TODO: Doesn't work with 2 for some tests
#eprint("max_depth=%d" % (max_depth))
for max_bin_i in range(3,11):
max_bin = np.power(2,max_bin_i)
#eprint("max_bin=%d" % (max_bin))
# regression test --- hist must be same as exact on all-categorial data
ag_param = {'max_depth': max_depth,
'tree_method': 'exact',
'nthread': 0,
'eta': 1,
'silent': 1,
'debug_verbose': 0,
'objective': 'binary:logistic',
'eval_metric': 'auc'}
ag_param2 = {'max_depth': max_depth,
'nthread': 0,
'tree_method': 'gpu_hist',
'eta': 1,
'silent': 1,
'debug_verbose': 0,
'n_gpus': 1,
'objective': 'binary:logistic',
'max_bin': max_bin,
'eval_metric': 'auc'}
ag_param3 = {'max_depth': max_depth,
'nthread': 0,
'tree_method': 'gpu_hist',
'eta': 1,
'silent': 1,
'debug_verbose': 0,
'n_gpus': n_gpus,
'objective': 'binary:logistic',
'max_bin': max_bin,
'eval_metric': 'auc'}
ag_res = {}
ag_res2 = {}
ag_res3 = {}
num_rounds = 10
#eprint("normal updater");
xgb.train(ag_param, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
evals_result=ag_res)
#eprint("grow_gpu_hist updater 1 gpu");
xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
evals_result=ag_res2)
#eprint("grow_gpu_hist updater %d gpus" % (n_gpus));
xgb.train(ag_param3, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
evals_result=ag_res3)
# assert 1==0
assert ag_res['train']['auc'] == ag_res2['train']['auc']
assert ag_res['test']['auc'] == ag_res2['test']['auc']
assert ag_res['test']['auc'] == ag_res3['test']['auc']
######################################################################
digits = load_digits(2)
X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
dtrain = xgb.DMatrix(X_train, y_train)
dtest = xgb.DMatrix(X_test, y_test)
param = {'objective': 'binary:logistic',
'tree_method': 'gpu_hist',
'nthread': 0,
'max_depth': max_depth,
'n_gpus': 1,
'max_bin': max_bin,
'debug_verbose': 0,
'eval_metric': 'auc'}
res = {}
#eprint("digits: grow_gpu_hist updater 1 gpu");
xgb.train(param, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
evals_result=res)
assert self.non_decreasing(res['train']['auc'])
#assert self.non_decreasing(res['test']['auc'])
param2 = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_hist',
'max_depth': max_depth,
'n_gpus': n_gpus,
'max_bin': max_bin,
'debug_verbose': 0,
'eval_metric': 'auc'}
res2 = {}
#eprint("digits: grow_gpu_hist updater %d gpus" % (n_gpus));
xgb.train(param2, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
evals_result=res2)
assert self.non_decreasing(res2['train']['auc'])
#assert self.non_decreasing(res2['test']['auc'])
assert res['train']['auc'] == res2['train']['auc']
#assert res['test']['auc'] == res2['test']['auc']
######################################################################
# fail-safe test for dense data
from sklearn.datasets import load_svmlight_file
X2, y2 = load_svmlight_file(dpath + 'agaricus.txt.train')
X2 = X2.toarray()
dtrain2 = xgb.DMatrix(X2, label=y2)
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_hist',
'max_depth': max_depth,
'n_gpus': n_gpus,
'max_bin': max_bin,
'debug_verbose': 0,
'eval_metric': 'auc'}
res = {}
xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
for j in range(X2.shape[1]):
for i in rng.choice(X2.shape[0], size=num_rounds, replace=False):
X2[i, j] = 2
dtrain3 = xgb.DMatrix(X2, label=y2)
res = {}
xgb.train(param, dtrain3, num_rounds, [(dtrain3, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
for j in range(X2.shape[1]):
for i in np.random.choice(X2.shape[0], size=num_rounds, replace=False):
X2[i, j] = 3
dtrain4 = xgb.DMatrix(X2, label=y2)
res = {}
xgb.train(param, dtrain4, num_rounds, [(dtrain4, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
######################################################################
# fail-safe test for max_bin
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_hist',
'max_depth': max_depth,
'n_gpus': n_gpus,
'debug_verbose': 0,
'eval_metric': 'auc',
'max_bin': max_bin}
res = {}
xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
######################################################################
# subsampling
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_hist',
'max_depth': max_depth,
'n_gpus': n_gpus,
'eval_metric': 'auc',
'colsample_bytree': 0.5,
'colsample_bylevel': 0.5,
'subsample': 0.5,
'debug_verbose': 0,
'max_bin': max_bin}
res = {}
xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
######################################################################
# fail-safe test for max_bin=2
param = {'objective': 'binary:logistic',
'nthread': 0,
'tree_method': 'gpu_hist',
'max_depth': 2,
'n_gpus': n_gpus,
'debug_verbose': 0,
'eval_metric': 'auc',
'max_bin': 2}
res = {}
xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
assert self.non_decreasing(res['train']['auc'])
if max_bin>32:
assert res['train']['auc'][0] >= 0.85
def non_decreasing(self, L):
return all((x - y) < 0.001 for x, y in zip(L, L[1:]))

View File

@@ -1,112 +0,0 @@
from __future__ import print_function
#pylint: skip-file
import sys
import time
sys.path.append("../../tests/python")
import xgboost as xgb
import testing as tm
import numpy as np
import unittest
from sklearn.datasets import make_classification
def eprint(*args, **kwargs):
print(*args, file=sys.stderr, **kwargs) ; sys.stderr.flush()
print(*args, file=sys.stdout, **kwargs) ; sys.stdout.flush()
eprint("Testing Big Data (this may take a while)")
rng = np.random.RandomState(1994)
# "realistic" size based upon http://stat-computing.org/dataexpo/2009/ , which has been processed to one-hot encode categoricalsxsy
cols = 31
# reduced to fit onto 1 gpu but still be large
rows3 = 5000 # small
rows2 = 4360032 # medium
rows1 = 42360032 # large
#rows1 = 152360032 # can do this for multi-gpu test (very large)
rowslist = [rows1, rows2, rows3]
class TestGPU(unittest.TestCase):
def test_large(self):
eprint("Starting test for large data")
tm._skip_if_no_sklearn()
for rows in rowslist:
eprint("Creating train data rows=%d cols=%d" % (rows,cols))
tmp = time.time()
np.random.seed(7)
X = np.random.rand(rows, cols)
y = np.random.rand(rows)
print("Time to Create Data: %r" % (time.time() - tmp))
eprint("Starting DMatrix(X,y)")
tmp = time.time()
ag_dtrain = xgb.DMatrix(X,y,nthread=40)
print("Time to DMatrix: %r" % (time.time() - tmp))
max_depth=6
max_bin=1024
# regression test --- hist must be same as exact on all-categorial data
ag_param = {'max_depth': max_depth,
'tree_method': 'exact',
'nthread': 0,
'eta': 1,
'silent': 0,
'debug_verbose': 5,
'objective': 'binary:logistic',
'eval_metric': 'auc'}
ag_paramb = {'max_depth': max_depth,
'tree_method': 'hist',
'nthread': 0,
'eta': 1,
'silent': 0,
'debug_verbose': 5,
'objective': 'binary:logistic',
'eval_metric': 'auc'}
ag_param2 = {'max_depth': max_depth,
'tree_method': 'gpu_hist',
'nthread': 0,
'eta': 1,
'silent': 0,
'debug_verbose': 5,
'n_gpus': 1,
'objective': 'binary:logistic',
'max_bin': max_bin,
'eval_metric': 'auc'}
ag_param3 = {'max_depth': max_depth,
'tree_method': 'gpu_hist',
'nthread': 0,
'eta': 1,
'silent': 0,
'debug_verbose': 5,
'n_gpus': -1,
'objective': 'binary:logistic',
'max_bin': max_bin,
'eval_metric': 'auc'}
ag_res = {}
ag_resb = {}
ag_res2 = {}
ag_res3 = {}
num_rounds = 1
tmp = time.time()
#eprint("hist updater")
#xgb.train(ag_paramb, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
# evals_result=ag_resb)
#print("Time to Train: %s seconds" % (str(time.time() - tmp)))
tmp = time.time()
eprint("gpu_hist updater 1 gpu")
xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
evals_result=ag_res2)
print("Time to Train: %s seconds" % (str(time.time() - tmp)))
tmp = time.time()
eprint("gpu_hist updater all gpus")
xgb.train(ag_param3, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
evals_result=ag_res3)
print("Time to Train: %s seconds" % (str(time.time() - tmp)))

View File

@@ -1,37 +0,0 @@
from __future__ import print_function
#pylint: skip-file
import sys
sys.path.append("../../tests/python")
import xgboost as xgb
import testing as tm
import numpy as np
import unittest
rng = np.random.RandomState(1994)
class TestGPUPredict (unittest.TestCase):
def test_predict(self):
iterations = 1
np.random.seed(1)
test_num_rows = [10,1000,5000]
test_num_cols = [10,50,500]
for num_rows in test_num_rows:
for num_cols in test_num_cols:
dm = xgb.DMatrix(np.random.randn(num_rows, num_cols), label=[0, 1] * int(num_rows/2))
watchlist = [(dm, 'train')]
res = {}
param = {
"objective":"binary:logistic",
"predictor":"gpu_predictor",
'eval_metric': 'auc',
}
bst = xgb.train(param, dm,iterations,evals=watchlist, evals_result=res)
assert self.non_decreasing(res["train"]["auc"])
gpu_pred = bst.predict(dm, output_margin=True)
bst.set_param({"predictor":"cpu_predictor"})
cpu_pred = bst.predict(dm, output_margin=True)
np.testing.assert_allclose(cpu_pred, gpu_pred, rtol=1e-5)
def non_decreasing(self, L):
return all((x - y) < 0.001 for x, y in zip(L, L[1:]))