[GPU-Plugin] Major refactor 2 (#2664)

* Change cmake option * Move source files * Move google tests * Move python tests * Move benchmarks * Move documentation * Remove makefile support * Fix test run * Move GPU tests
2017-09-08 09:57:16 +12:00
parent 8244f6f120
commit 15267eedf2
21 changed files with 76 additions and 249 deletions
--- a/plugin/updater_gpu/README.md
+++ b/plugin/updater_gpu/README.md
@@ -1,161 +1,3 @@
 # CUDA Accelerated Tree Construction Algorithms
-This plugin adds GPU accelerated tree construction and prediction algorithms to XGBoost.
-## Usage
-Specify the 'tree_method' parameter as one of the following algorithms. 
-
-### Algorithms
-| tree_method | Description |
-| --- | --- |
-gpu_exact | The standard XGBoost tree construction algorithm. Performs exact search for splits. Slower and uses considerably more memory than 'gpu_hist' |
-gpu_hist | Equivalent to the XGBoost fast histogram algorithm. Faster and uses considerably less memory. Splits may be less accurate. |
-
-### Supported parameters 
-| parameter | gpu_exact | gpu_hist |
-| --- | --- | --- |
-subsample | &#10004; | &#10004; |
-colsample_bytree | &#10004; | &#10004;|
-colsample_bylevel | &#10004; | &#10004; |
-max_bin | &#10006; | &#10004; |
-gpu_id | &#10004; | &#10004; | 
-n_gpus | &#10006; | &#10004; | 
-predictor | &#10004; | &#10004; |
-
-GPU accelerated prediction is enabled by default for the above mentioned 'tree_method' parameters but can be switched to CPU prediction by setting 'predictor':'cpu_predictor'. This could be useful if you want to conserve GPU memory. Likewise when using CPU algorithms, GPU accelerated prediction can be enabled by setting 'predictor':'gpu_predictor'.
-
-The device ordinal can be selected using the 'gpu_id' parameter, which defaults to 0.
-
-Multiple GPUs can be used with the grow_gpu_hist parameter using the n_gpus parameter. which defaults to 1. If this is set to -1 all available GPUs will be used.  If gpu_id is specified as non-zero, the gpu device order is mod(gpu_id + i) % n_visible_devices for i=0 to n_gpus-1.  As with GPU vs. CPU, multi-GPU will not always be faster than a single GPU due to PCI bus bandwidth that can limit performance.  For example, when n_features * n_bins * 2^depth divided by time of each round/iteration becomes comparable to the real PCI 16x bus bandwidth of order 4GB/s to 10GB/s, then AllReduce will dominant code speed and multiple GPUs become ineffective at increasing performance.  Also, CPU overhead between GPU calls can limit usefulness of multiple GPUs.
-
-This plugin currently works with the CLI version and python version.
-
-Python example:
-```python
-param['gpu_id'] = 0
-param['max_bin'] = 16
-param['tree_method'] = 'gpu_hist'
-```
-## Benchmarks
-To run benchmarks on synthetic data for binary classification:
-```bash
-$ python benchmark/benchmark.py
-```
-
-Training time time on 1,000,000 rows x 50 columns with 500 boosting iterations and 0.25/0.75 test/train split on i7-6700K CPU @ 4.00GHz and Pascal Titan X.
-
-| tree_method | Time (s) |
-| --- | --- |
-| gpu_hist | 13.87 |
-| hist | 63.55 |
-| gpu_exact | 161.08 |
-| exact | 1082.20 |
-
-
-[See here](http://dmlc.ml/2016/12/14/GPU-accelerated-xgboost.html) for additional performance benchmarks of the 'gpu_exact' tree_method.
-
-## Test
-To run python tests:
-```bash
-$ python -m nose test/python/
-```
-
-Google tests can be enabled by specifying -DGOOGLE_TEST=ON when building with cmake.
-
-## Dependencies
-A CUDA capable GPU with at least compute capability >= 3.5 
-
-Building the plug-in requires CUDA Toolkit 7.5 or later (https://developer.nvidia.com/cuda-downloads)
-
-## Build
-
-From the command line on Linux starting from the xgboost directory:
-
-On Linux, from the xgboost directory:
-```bash
-$ mkdir build
-$ cd build
-$ cmake .. -DPLUGIN_UPDATER_GPU=ON
-$ make -j
-```
-On Windows using cmake, see what options for Generators you have for cmake, and choose one with [arch] replaced by Win64:
-```bash
-cmake -help
-```
-Then run cmake as:
-```bash
-$ mkdir build
-$ cd build
-$ cmake .. -G"Visual Studio 14 2015 Win64" -DPLUGIN_UPDATER_GPU=ON
-```
-Cmake will create an xgboost.sln solution file in the build directory. Build this solution in release mode as a x64 build.
-
-Visual studio community 2015, supported by cuda toolkit (http://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/#axzz4isREr2nS), can be downloaded from: https://my.visualstudio.com/Downloads?q=Visual%20Studio%20Community%202015 .  You may also be able to use a later version of visual studio depending on whether the CUDA toolkit supports it.  Note that Mingw cannot be used with cuda.
-
-### For other nccl libraries
-
-On some systems, nccl libraries are specific to a particular system (IBM Power or nvidia-docker) and can enable use of nvlink (between GPUs or even between GPUs and system memory).  In that case, one wants to avoid the static nccl library by changing "STATIC" to "SHARED" in nccl/CMakeLists.txt and deleting the shared nccl library created (so that the system one is used).
-
-### For Developers!
-
-In case you want to build only for a specific GPU(s), for eg. GP100 and GP102,
-whose compute capability are 60 and 61 respectively:
-```bash
-$ cmake .. -DPLUGIN_UPDATER_GPU=ON -DGPU_COMPUTE_VER="60;61"
-```
-
-### Using make
-Now, it also supports the usual 'make' flow to build gpu-enabled tree construction plugins. It's currently only tested on Linux. From the xgboost directory
-```bash
-# make sure CUDA SDK bin directory is in the 'PATH' env variable
-$ make -j PLUGIN_UPDATER_GPU=ON
-```
-
-Similar to cmake, if you want to build only for a specific GPU(s):
-```bash
-$ make -j PLUGIN_UPDATER_GPU=ON GPU_COMPUTE_VER="60 61"
-```
-
-## Changelog
-##### 2017/8/14
-* Added GPU accelerated prediction. Considerably improved performance when using test/eval sets.
-
-##### 2017/7/10
-* Memory performance improved 4x for gpu_hist
-
-##### 2017/6/26
-* Change API to use tree_method parameter
-* Increase required cmake version to 3.5
-* Add compute arch 3.5 to default archs
-* Set default n_gpus to 1
-
-##### 2017/6/5
-
-* Multi-GPU support for histogram method using NVIDIA NCCL.
-
-##### 2017/5/31
-* Faster version of the grow_gpu plugin
-* Added support for building gpu plugin through 'make' flow too
-
-##### 2017/5/19
-* Further performance enhancements for histogram method.
-
-##### 2017/5/5
-* Histogram performance improvements
-* Fix gcc build issues 
-
-##### 2017/4/25
-* Add fast histogram algorithm
-* Fix Linux build
-* Add 'gpu_id' parameter
-
-## References
-[Mitchell, Rory, and Eibe Frank. Accelerating the XGBoost algorithm using GPU computing. No. e2911v1. PeerJ Preprints, 2017.](https://peerj.com/preprints/2911/)
-
-## Author
-Rory Mitchell
-Jonathan C. McKinney
-Shankara Rao Thejaswi Nanditale
-Vinay Deshpande
-... and the rest of the H2O.ai and NVIDIA team.
-
-Please report bugs to the xgboost/issues page.

+[The XGBoost GPU documentation has moved here.](https://xgboost.readthedocs.io/en/latest/gpu/index.html)
--- a/plugin/updater_gpu/benchmark/benchmark.py
+++ b/plugin/updater_gpu/benchmark/benchmark.py
@@ -1,61 +0,0 @@
-# pylint: skip-file
-import sys, argparse
-import xgboost as xgb
-import numpy as np
-from sklearn.datasets import make_classification
-from sklearn.model_selection import train_test_split
-import time
-
-
-def run_benchmark(args, gpu_algorithm, cpu_algorithm):
-    print("Generating dataset: {} rows * {} columns".format(args.rows, args.columns))
-    print("{}/{} test/train split".format(args.test_size, 1.0 - args.test_size))
-    tmp = time.time()
-    X, y = make_classification(args.rows, n_features=args.columns, random_state=7)
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=args.test_size, random_state=7)
-    print ("Generate Time: %s seconds" % (str(time.time() - tmp)))
-    tmp = time.time()
-    print ("DMatrix Start")
-    # omp way
-    dtrain = xgb.DMatrix(X_train, y_train, nthread=-1)
-    dtest = xgb.DMatrix(X_test, y_test, nthread=-1)
-    print ("DMatrix Time: %s seconds" % (str(time.time() - tmp)))
-
-    param = {'objective': 'binary:logistic',
-             'max_depth': 6,
-             'silent': 0,
-             'n_gpus': 1,
-             'gpu_id': 0,
-             'eval_metric': 'error',
-             'debug_verbose': 0,
-             }
-
-    param['tree_method'] = gpu_algorithm
-    print("Training with '%s'" % param['tree_method'])
-    tmp = time.time()
-    xgb.train(param, dtrain, args.iterations, evals=[(dtest, "test")])
-    print ("Train Time: %s seconds" % (str(time.time() - tmp)))
-
-    param['silent'] = 1
-    param['tree_method'] = cpu_algorithm
-    print("Training with '%s'" % param['tree_method'])
-    tmp = time.time()
-    xgb.train(param, dtrain, args.iterations, evals=[(dtest, "test")])
-    print ("Time: %s seconds" % (str(time.time() - tmp)))
-
-
-parser = argparse.ArgumentParser()
-parser.add_argument('--algorithm', choices=['all', 'gpu_exact', 'gpu_hist'], default='all')
-parser.add_argument('--rows', type=int, default=1000000)
-parser.add_argument('--columns', type=int, default=50)
-parser.add_argument('--iterations', type=int, default=500)
-parser.add_argument('--test_size', type=float, default=0.25)
-args = parser.parse_args()
-
-if 'gpu_hist' in args.algorithm:
-    run_benchmark(args, args.algorithm, 'hist')
-elif 'gpu_exact' in args.algorithm:
-    run_benchmark(args, args.algorithm, 'exact')
-elif 'all' in args.algorithm:
-    run_benchmark(args, 'gpu_exact', 'exact')
-    run_benchmark(args, 'gpu_hist', 'hist')
--- a/plugin/updater_gpu/plugin.mk
+++ b/plugin/updater_gpu/plugin.mk
@@ -1,6 +0,0 @@
-
-PLUGIN_OBJS += build_plugin/updater_gpu/src/register_updater_gpu.o \
-               build_plugin/updater_gpu/src/updater_gpu.o \
-               build_plugin/updater_gpu/src/gpu_hist_builder.o \
-               build_plugin/updater_gpu/src/gpu_predictor.o
-PLUGIN_LDFLAGS += -L$(CUDA_ROOT)/lib64 -lcudart
--- a/plugin/updater_gpu/src/device_helpers.cuh
+++ b/plugin/updater_gpu/src/device_helpers.cuh
@@ -1,884 +0,0 @@
-/*!
- * Copyright 2017 XGBoost contributors
- */
-#pragma once
-#include <thrust/device_vector.h>
-#include <thrust/system/cuda/error.h>
-#include <thrust/system/cuda/execution_policy.h>
-#include <thrust/system_error.h>
-#include <xgboost/logging.h>
-#include <algorithm>
-#include <chrono>
-#include <ctime>
-#include <cub/cub.cuh>
-#include <numeric>
-#include <sstream>
-#include <string>
-#include <vector>
-#include "nccl.h"
-
-// Uncomment to enable
-#define TIMERS
-
-namespace dh {
-
-#define HOST_DEV_INLINE __host__ __device__ __forceinline__
-#define DEV_INLINE __device__ __forceinline__
-
-/*
- * Error handling  functions
- */
-
-#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)
-
-inline cudaError_t throw_on_cuda_error(cudaError_t code, const char *file,
-                                       int line) {
-  if (code != cudaSuccess) {
-    std::stringstream ss;
-    ss << file << "(" << line << ")";
-    std::string file_and_line;
-    ss >> file_and_line;
-    throw thrust::system_error(code, thrust::cuda_category(), file_and_line);
-  }
-
-  return code;
-}
-
-#define safe_nccl(ans) throw_on_nccl_error((ans), __FILE__, __LINE__)
-
-inline ncclResult_t throw_on_nccl_error(ncclResult_t code, const char *file,
-                                        int line) {
-  if (code != ncclSuccess) {
-    std::stringstream ss;
-    ss << "NCCL failure :" << ncclGetErrorString(code) << " ";
-    ss << file << "(" << line << ")";
-    throw std::runtime_error(ss.str());
-  }
-
-  return code;
-}
-
-inline int n_visible_devices() {
-  int n_visgpus = 0;
-
-  cudaGetDeviceCount(&n_visgpus);
-
-  return n_visgpus;
-}
-
-inline int n_devices_all(int n_gpus) {
-  int n_devices_visible = dh::n_visible_devices();
-  int n_devices = n_gpus < 0 ? n_devices_visible : n_gpus;
-  return (n_devices);
-}
-inline int n_devices(int n_gpus, int num_rows) {
-  int n_devices = dh::n_devices_all(n_gpus);
-  // fix-up device number to be limited by number of rows
-  n_devices = n_devices > num_rows ? num_rows : n_devices;
-  return (n_devices);
-}
-
-// if n_devices=-1, then use all visible devices
-inline void synchronize_n_devices(int n_devices, std::vector<int> dList) {
-  for (int d_idx = 0; d_idx < n_devices; d_idx++) {
-    int device_idx = dList[d_idx];
-    safe_cuda(cudaSetDevice(device_idx));
-    safe_cuda(cudaDeviceSynchronize());
-  }
-}
-inline void synchronize_all() {
-  for (int device_idx = 0; device_idx < n_visible_devices(); device_idx++) {
-    safe_cuda(cudaSetDevice(device_idx));
-    safe_cuda(cudaDeviceSynchronize());
-  }
-}
-
-inline std::string device_name(int device_idx) {
-  cudaDeviceProp prop;
-  dh::safe_cuda(cudaGetDeviceProperties(&prop, device_idx));
-  return std::string(prop.name);
-}
-
-inline size_t available_memory(int device_idx) {
-  size_t device_free = 0;
-  size_t device_total = 0;
-  safe_cuda(cudaSetDevice(device_idx));
-  dh::safe_cuda(cudaMemGetInfo(&device_free, &device_total));
-  return device_free;
-}
-
-/**
- * \fn  inline int max_shared_memory(int device_idx)
- *
- * \brief Maximum shared memory per block on this device.
- *
- * \param device_idx  Zero-based index of the device.
- */
-
-inline int max_shared_memory(int device_idx) {
-  cudaDeviceProp prop;
-  dh::safe_cuda(cudaGetDeviceProperties(&prop, device_idx));
-  return prop.sharedMemPerBlock;
-}
-
-// ensure gpu_id is correct, so not dependent upon user knowing details
-inline int get_device_idx(int gpu_id) {
-  // protect against overrun for gpu_id
-  return (std::abs(gpu_id) + 0) % dh::n_visible_devices();
-}
-
-/*
- *  Timers
- */
-
-struct Timer {
-  typedef std::chrono::high_resolution_clock ClockT;
-
-  typedef std::chrono::high_resolution_clock::time_point TimePointT;
-  TimePointT start;
-  Timer() { reset(); }
-
-  void reset() { start = ClockT::now(); }
-  int64_t elapsed() const { return (ClockT::now() - start).count(); }
-  double elapsedSeconds() const {
-    return elapsed() * ((double)ClockT::period::num / ClockT::period::den);
-  }
-  void printElapsed(std::string label) {
-    //    synchronize_n_devices(n_devices, dList);
-    printf("%s:\t %fs\n", label.c_str(), elapsedSeconds());
-    reset();
-  }
-};
-
-/*
- * Range iterator
- */
-
-class range {
- public:
-  class iterator {
-    friend class range;
-
-   public:
-    __host__ __device__ int64_t operator*() const { return i_; }
-    __host__ __device__ const iterator &operator++() {
-      i_ += step_;
-      return *this;
-    }
-    __host__ __device__ iterator operator++(int) {
-      iterator copy(*this);
-      i_ += step_;
-      return copy;
-    }
-
-    __host__ __device__ bool operator==(const iterator &other) const {
-      return i_ >= other.i_;
-    }
-    __host__ __device__ bool operator!=(const iterator &other) const {
-      return i_ < other.i_;
-    }
-
-    __host__ __device__ void step(int s) { step_ = s; }
-
-   protected:
-    __host__ __device__ explicit iterator(int64_t start) : i_(start) {}
-
-   public:
-    uint64_t i_;
-    int step_ = 1;
-  };
-
-  __host__ __device__ iterator begin() const { return begin_; }
-  __host__ __device__ iterator end() const { return end_; }
-  __host__ __device__ range(int64_t begin, int64_t end)
-      : begin_(begin), end_(end) {}
-  __host__ __device__ void step(int s) { begin_.step(s); }
-
- private:
-  iterator begin_;
-  iterator end_;
-};
-
-template <typename T>
-__device__ range grid_stride_range(T begin, T end) {
-  begin += blockDim.x * blockIdx.x + threadIdx.x;
-  range r(begin, end);
-  r.step(gridDim.x * blockDim.x);
-  return r;
-}
-
-template <typename T>
-__device__ range block_stride_range(T begin, T end) {
-  begin += threadIdx.x;
-  range r(begin, end);
-  r.step(blockDim.x);
-  return r;
-}
-
-// Threadblock iterates over range, filling with value. Requires all threads in
-// block to be active.
-template <typename IterT, typename ValueT>
-__device__ void block_fill(IterT begin, size_t n, ValueT value) {
-  for (auto i : block_stride_range(static_cast<size_t>(0), n)) {
-    begin[i] = value;
-  }
-}
-
-/*
- * Memory
- */
-
-enum memory_type { DEVICE, DEVICE_MANAGED };
-
-template <memory_type MemoryT>
-class bulk_allocator;
-template <typename T>
-class dvec2;
-
-template <typename T>
-class dvec {
-  friend class dvec2<T>;
-
- private:
-  T *_ptr;
-  size_t _size;
-  int _device_idx;
-
- public:
-  void external_allocate(int device_idx, void *ptr, size_t size) {
-    if (!empty()) {
-      throw std::runtime_error("Tried to allocate dvec but already allocated");
-    }
-    _ptr = static_cast<T *>(ptr);
-    _size = size;
-    _device_idx = device_idx;
-    safe_cuda(cudaSetDevice(_device_idx));
-  }
-
-  dvec() : _ptr(NULL), _size(0), _device_idx(-1) {}
-  size_t size() const { return _size; }
-  int device_idx() const { return _device_idx; }
-  bool empty() const { return _ptr == NULL || _size == 0; }
-
-  T *data() { return _ptr; }
-
-  const T *data() const { return _ptr; }
-
-  std::vector<T> as_vector() const {
-    std::vector<T> h_vector(size());
-    safe_cuda(cudaSetDevice(_device_idx));
-    safe_cuda(cudaMemcpy(h_vector.data(), _ptr, size() * sizeof(T),
-                         cudaMemcpyDeviceToHost));
-    return h_vector;
-  }
-
-  void fill(T value) {
-    safe_cuda(cudaSetDevice(_device_idx));
-    thrust::fill_n(thrust::device_pointer_cast(_ptr), size(), value);
-  }
-
-  void print() {
-    auto h_vector = this->as_vector();
-    for (auto e : h_vector) {
-      std::cout << e << " ";
-    }
-    std::cout << "\n";
-  }
-
-  thrust::device_ptr<T> tbegin() { return thrust::device_pointer_cast(_ptr); }
-
-  thrust::device_ptr<T> tend() {
-    return thrust::device_pointer_cast(_ptr + size());
-  }
-
-  template <typename T2>
-  dvec &operator=(const std::vector<T2> &other) {
-    this->copy(other.begin(), other.end());
-    return *this;
-  }
-
-  dvec &operator=(dvec<T> &other) {
-    if (other.size() != size()) {
-      throw std::runtime_error(
-          "Cannot copy assign dvec to dvec, sizes are different");
-    }
-    safe_cuda(cudaSetDevice(this->device_idx()));
-    if (other.device_idx() == this->device_idx()) {
-      thrust::copy(other.tbegin(), other.tend(), this->tbegin());
-    } else {
-      std::cout << "deviceother: " << other.device_idx()
-                << " devicethis: " << this->device_idx() << std::endl;
-      std::cout << "size deviceother: " << other.size()
-                << " devicethis: " << this->device_idx() << std::endl;
-      throw std::runtime_error("Cannot copy to/from different devices");
-    }
-
-    return *this;
-  }
-
-  template <typename IterT>
-  void copy(IterT begin, IterT end) {
-    safe_cuda(cudaSetDevice(this->device_idx()));
-    if (end - begin != size()) {
-      throw std::runtime_error(
-          "Cannot copy assign vector to dvec, sizes are different");
-    }
-    thrust::copy(begin, end, this->tbegin());
-  }
-};
-
-/**
- * @class dvec2 device_helpers.cuh
- * @brief wrapper for storing 2 dvec's which are needed for cub::DoubleBuffer
- */
-template <typename T>
-class dvec2 {
- private:
-  dvec<T> _d1, _d2;
-  cub::DoubleBuffer<T> _buff;
-  int _device_idx;
-
- public:
-  void external_allocate(int device_idx, void *ptr1, void *ptr2, size_t size) {
-    if (!empty()) {
-      throw std::runtime_error("Tried to allocate dvec2 but already allocated");
-    }
-    _device_idx = device_idx;
-    _d1.external_allocate(_device_idx, ptr1, size);
-    _d2.external_allocate(_device_idx, ptr2, size);
-    _buff.d_buffers[0] = static_cast<T *>(ptr1);
-    _buff.d_buffers[1] = static_cast<T *>(ptr2);
-    _buff.selector = 0;
-  }
-  dvec2() : _d1(), _d2(), _buff(), _device_idx(-1) {}
-
-  size_t size() const { return _d1.size(); }
-  int device_idx() const { return _device_idx; }
-  bool empty() const { return _d1.empty() || _d2.empty(); }
-
-  cub::DoubleBuffer<T> &buff() { return _buff; }
-
-  dvec<T> &d1() { return _d1; }
-  dvec<T> &d2() { return _d2; }
-
-  T *current() { return _buff.Current(); }
-
-  dvec<T> &current_dvec() { return _buff.selector == 0 ? d1() : d2(); }
-
-  T *other() { return _buff.Alternate(); }
-};
-
-template <memory_type MemoryT>
-class bulk_allocator {
-  std::vector<char *> d_ptr;
-  std::vector<size_t> _size;
-  std::vector<int> _device_idx;
-
-  const int align = 256;
-
-  template <typename SizeT>
-  size_t align_round_up(SizeT n) {
-    n = (n + align - 1) / align;
-    return n * align;
-  }
-
-  template <typename T, typename SizeT>
-  size_t get_size_bytes(dvec<T> *first_vec, SizeT first_size) {
-    return align_round_up<SizeT>(first_size * sizeof(T));
-  }
-
-  template <typename T, typename SizeT, typename... Args>
-  size_t get_size_bytes(dvec<T> *first_vec, SizeT first_size, Args... args) {
-    return get_size_bytes<T, SizeT>(first_vec, first_size) +
-           get_size_bytes(args...);
-  }
-
-  template <typename T, typename SizeT>
-  void allocate_dvec(int device_idx, char *ptr, dvec<T> *first_vec,
-                     SizeT first_size) {
-    first_vec->external_allocate(device_idx, static_cast<void *>(ptr),
-                                 first_size);
-  }
-
-  template <typename T, typename SizeT, typename... Args>
-  void allocate_dvec(int device_idx, char *ptr, dvec<T> *first_vec,
-                     SizeT first_size, Args... args) {
-    first_vec->external_allocate(device_idx, static_cast<void *>(ptr),
-                                 first_size);
-    ptr += align_round_up(first_size * sizeof(T));
-    allocate_dvec(device_idx, ptr, args...);
-  }
-
-  //    template <memory_type MemoryT>
-  char *allocate_device(int device_idx, size_t bytes, memory_type t) {
-    char *ptr;
-    if (t == memory_type::DEVICE) {
-      safe_cuda(cudaSetDevice(device_idx));
-      safe_cuda(cudaMalloc(&ptr, bytes));
-    } else {
-      safe_cuda(cudaMallocManaged(&ptr, bytes));
-    }
-    return ptr;
-  }
-  template <typename T, typename SizeT>
-  size_t get_size_bytes(dvec2<T> *first_vec, SizeT first_size) {
-    return 2 * align_round_up(first_size * sizeof(T));
-  }
-
-  template <typename T, typename SizeT, typename... Args>
-  size_t get_size_bytes(dvec2<T> *first_vec, SizeT first_size, Args... args) {
-    return get_size_bytes<T, SizeT>(first_vec, first_size) +
-           get_size_bytes(args...);
-  }
-
-  template <typename T, typename SizeT>
-  void allocate_dvec(int device_idx, char *ptr, dvec2<T> *first_vec,
-                     SizeT first_size) {
-    first_vec->external_allocate(
-        device_idx, static_cast<void *>(ptr),
-        static_cast<void *>(ptr + align_round_up(first_size * sizeof(T))),
-        first_size);
-  }
-
-  template <typename T, typename SizeT, typename... Args>
-  void allocate_dvec(int device_idx, char *ptr, dvec2<T> *first_vec,
-                     SizeT first_size, Args... args) {
-    allocate_dvec<T, SizeT>(device_idx, ptr, first_vec, first_size);
-    ptr += (align_round_up(first_size * sizeof(T)) * 2);
-    allocate_dvec(device_idx, ptr, args...);
-  }
-
- public:
-  ~bulk_allocator() {
-    for (size_t i = 0; i < d_ptr.size(); i++) {
-      if (!(d_ptr[i] == nullptr)) {
-        safe_cuda(cudaSetDevice(_device_idx[i]));
-        safe_cuda(cudaFree(d_ptr[i]));
-      }
-    }
-  }
-
-  // returns sum of bytes for all allocations
-  size_t size() {
-    return std::accumulate(_size.begin(), _size.end(), static_cast<size_t>(0));
-  }
-
-  template <typename... Args>
-  void allocate(int device_idx, bool silent, Args... args) {
-    size_t size = get_size_bytes(args...);
-
-    char *ptr = allocate_device(device_idx, size, MemoryT);
-
-    allocate_dvec(device_idx, ptr, args...);
-
-    d_ptr.push_back(ptr);
-    _size.push_back(size);
-    _device_idx.push_back(device_idx);
-
-    if (!silent) {
-      const int mb_size = 1048576;
-      LOG(CONSOLE) << "Allocated " << size / mb_size << "MB on [" << device_idx
-                   << "] " << device_name(device_idx) << ", "
-                   << available_memory(device_idx) / mb_size << "MB remaining.";
-    }
-  }
-};
-
-// Keep track of cub library device allocation
-struct CubMemory {
-  void *d_temp_storage;
-  size_t temp_storage_bytes;
-
-  // Thrust
-  typedef char value_type;
-
-  CubMemory() : d_temp_storage(NULL), temp_storage_bytes(0) {}
-
-  ~CubMemory() { Free(); }
-
-  void Free() {
-    if (this->IsAllocated()) {
-      safe_cuda(cudaFree(d_temp_storage));
-    }
-  }
-
-  void LazyAllocate(size_t num_bytes) {
-    if (num_bytes > temp_storage_bytes) {
-      Free();
-      safe_cuda(cudaMalloc(&d_temp_storage, num_bytes));
-      temp_storage_bytes = num_bytes;
-    }
-  }
-  // Thrust
-  char *allocate(std::ptrdiff_t num_bytes) {
-    LazyAllocate(num_bytes);
-    return reinterpret_cast<char *>(d_temp_storage);
-  }
-
-  // Thrust
-  void deallocate(char *ptr, size_t n) {
-    // Do nothing
-  }
-
-  bool IsAllocated() { return d_temp_storage != NULL; }
-};
-
-/*
- *  Utility functions
- */
-
-template <typename T>
-void print(const thrust::device_vector<T> &v, size_t max_items = 10) {
-  thrust::host_vector<T> h = v;
-  for (size_t i = 0; i < std::min(max_items, h.size()); i++) {
-    std::cout << " " << h[i];
-  }
-  std::cout << "\n";
-}
-
-template <typename T>
-void print(const dvec<T> &v, size_t max_items = 10) {
-  std::vector<T> h = v.as_vector();
-  for (size_t i = 0; i < std::min(max_items, h.size()); i++) {
-    std::cout << " " << h[i];
-  }
-  std::cout << "\n";
-}
-
-template <typename T>
-void print(char *label, const thrust::device_vector<T> &v,
-           const char *format = "%d ", size_t max = 10) {
-  thrust::host_vector<T> h_v = v;
-  std::cout << label << ":\n";
-  for (size_t i = 0; i < std::min(static_cast<size_t>(h_v.size()), max); i++) {
-    printf(format, h_v[i]);
-  }
-  std::cout << "\n";
-}
-
-template <typename T1, typename T2>
-T1 div_round_up(const T1 a, const T2 b) {
-  return static_cast<T1>(ceil(static_cast<double>(a) / b));
-}
-
-template <typename T>
-thrust::device_ptr<T> dptr(T *d_ptr) {
-  return thrust::device_pointer_cast(d_ptr);
-}
-
-template <typename T>
-T *raw(thrust::device_vector<T> &v) {  //  NOLINT
-  return raw_pointer_cast(v.data());
-}
-
-template <typename T>
-const T *raw(const thrust::device_vector<T> &v) {  //  NOLINT
-  return raw_pointer_cast(v.data());
-}
-
-template <typename T>
-size_t size_bytes(const thrust::device_vector<T> &v) {
-  return sizeof(T) * v.size();
-}
-/*
- * Kernel launcher
- */
-
-template <typename L>
-__global__ void launch_n_kernel(size_t begin, size_t end, L lambda) {
-  for (auto i : grid_stride_range(begin, end)) {
-    lambda(i);
-  }
-}
-template <typename L>
-__global__ void launch_n_kernel(int device_idx, size_t begin, size_t end,
-                                L lambda) {
-  for (auto i : grid_stride_range(begin, end)) {
-    lambda(i, device_idx);
-  }
-}
-
-template <int ITEMS_PER_THREAD = 8, int BLOCK_THREADS = 256, typename L>
-inline void launch_n(int device_idx, size_t n, L lambda) {
-  safe_cuda(cudaSetDevice(device_idx));
-  // TODO: Template on n so GRID_SIZE always fits into int.
-  const int GRID_SIZE = div_round_up(n, ITEMS_PER_THREAD * BLOCK_THREADS);
-#if defined(__CUDACC__)
-  launch_n_kernel<<<GRID_SIZE, BLOCK_THREADS>>>(static_cast<size_t>(0), n,
-                                                lambda);
-#endif
-}
-
-// if n_devices=-1, then use all visible devices
-template <int ITEMS_PER_THREAD = 8, int BLOCK_THREADS = 256, typename L>
-inline void multi_launch_n(size_t n, int n_devices, L lambda) {
-  n_devices = n_devices < 0 ? n_visible_devices() : n_devices;
-  CHECK_LE(n_devices, n_visible_devices()) << "Number of devices requested "
-                                              "needs to be less than equal to "
-                                              "number of visible devices.";
-  // TODO: Template on n so GRID_SIZE always fits into int.
-  const int GRID_SIZE = div_round_up(n, ITEMS_PER_THREAD * BLOCK_THREADS);
-#if defined(__CUDACC__)
-  n_devices = n_devices > n ? n : n_devices;
-  for (int device_idx = 0; device_idx < n_devices; device_idx++) {
-    safe_cuda(cudaSetDevice(device_idx));
-    size_t begin = (n / n_devices) * device_idx;
-    size_t end = std::min((n / n_devices) * (device_idx + 1), n);
-    launch_n_kernel<<<GRID_SIZE, BLOCK_THREADS>>>(device_idx, begin, end,
-                                                  lambda);
-  }
-#endif
-}
-
-/**
- * @brief Helper macro to measure timing on GPU
- * @param call the GPU call
- * @param name name used to track later
- * @param stream cuda stream where to measure time
- */
-#define TIMEIT(call, name)    \
-  do {                        \
-    dh::Timer t1234;          \
-    call;                     \
-    t1234.printElapsed(name); \
-  } while (0)
-
-// Load balancing search
-
-template <typename coordinate_t, typename segments_t, typename offset_t>
-void FindMergePartitions(int device_idx, coordinate_t *d_tile_coordinates,
-                         int num_tiles, int tile_size, segments_t segments,
-                         offset_t num_rows, offset_t num_elements) {
-  dh::launch_n(device_idx, num_tiles + 1, [=] __device__(int idx) {
-    offset_t diagonal = idx * tile_size;
-    coordinate_t tile_coordinate;
-    cub::CountingInputIterator<offset_t> nonzero_indices(0);
-
-    // Search the merge path
-    // Cast to signed integer as this function can have negatives
-    cub::MergePathSearch(static_cast<int64_t>(diagonal), segments + 1,
-                         nonzero_indices, static_cast<int64_t>(num_rows),
-                         static_cast<int64_t>(num_elements), tile_coordinate);
-
-    // Output starting offset
-    d_tile_coordinates[idx] = tile_coordinate;
-  });
-}
-
-template <int TILE_SIZE, int ITEMS_PER_THREAD, int BLOCK_THREADS,
-          typename offset_t, typename coordinate_t, typename func_t,
-          typename segments_iter>
-__global__ void LbsKernel(coordinate_t *d_coordinates,
-                          segments_iter segment_end_offsets, func_t f,
-                          offset_t num_segments) {
-  int tile = blockIdx.x;
-  coordinate_t tile_start_coord = d_coordinates[tile];
-  coordinate_t tile_end_coord = d_coordinates[tile + 1];
-  int64_t tile_num_rows = tile_end_coord.x - tile_start_coord.x;
-  int64_t tile_num_elements = tile_end_coord.y - tile_start_coord.y;
-
-  cub::CountingInputIterator<offset_t> tile_element_indices(tile_start_coord.y);
-  coordinate_t thread_start_coord;
-
-  typedef typename std::iterator_traits<segments_iter>::value_type segment_t;
-  __shared__ struct {
-    segment_t tile_segment_end_offsets[TILE_SIZE + 1];
-    segment_t output_segment[TILE_SIZE];
-  } temp_storage;
-
-  for (auto item : dh::block_stride_range(int(0), int(tile_num_rows + 1))) {
-    temp_storage.tile_segment_end_offsets[item] =
-        segment_end_offsets[min(tile_start_coord.x + item, num_segments - 1)];
-  }
-  __syncthreads();
-
-  int64_t diag = threadIdx.x * ITEMS_PER_THREAD;
-
-  // Cast to signed integer as this function can have negatives
-  cub::MergePathSearch(diag,                                   // Diagonal
-                       temp_storage.tile_segment_end_offsets,  // List A
-                       tile_element_indices,                   // List B
-                       tile_num_rows, tile_num_elements, thread_start_coord);
-
-  coordinate_t thread_current_coord = thread_start_coord;
-#pragma unroll
-  for (int ITEM = 0; ITEM < ITEMS_PER_THREAD; ++ITEM) {
-    if (tile_element_indices[thread_current_coord.y] <
-        temp_storage.tile_segment_end_offsets[thread_current_coord.x]) {
-      temp_storage.output_segment[thread_current_coord.y] =
-          thread_current_coord.x + tile_start_coord.x;
-      ++thread_current_coord.y;
-    } else {
-      ++thread_current_coord.x;
-    }
-  }
-  __syncthreads();
-
-  for (auto item : dh::block_stride_range(int(0), int(tile_num_elements))) {
-    f(tile_start_coord.y + item, temp_storage.output_segment[item]);
-  }
-}
-
-template <typename func_t, typename segments_iter, typename offset_t>
-void SparseTransformLbs(int device_idx, dh::CubMemory *temp_memory,
-                        offset_t count, segments_iter segments,
-                        offset_t num_segments, func_t f) {
-  typedef typename cub::CubVector<offset_t, 2>::Type coordinate_t;
-  dh::safe_cuda(cudaSetDevice(device_idx));
-  const int BLOCK_THREADS = 256;
-  const int ITEMS_PER_THREAD = 1;
-  const int TILE_SIZE = BLOCK_THREADS * ITEMS_PER_THREAD;
-  int num_tiles = dh::div_round_up(count + num_segments, BLOCK_THREADS);
-
-  temp_memory->LazyAllocate(sizeof(coordinate_t) * (num_tiles + 1));
-  coordinate_t *tmp_tile_coordinates =
-      reinterpret_cast<coordinate_t *>(temp_memory->d_temp_storage);
-
-  FindMergePartitions(device_idx, tmp_tile_coordinates, num_tiles,
-                      BLOCK_THREADS, segments, num_segments, count);
-
-  LbsKernel<TILE_SIZE, ITEMS_PER_THREAD, BLOCK_THREADS, offset_t>
-      <<<num_tiles, BLOCK_THREADS>>>(tmp_tile_coordinates, segments + 1, f,
-                                     num_segments);
-}
-
-template <typename func_t, typename offset_t>
-void DenseTransformLbs(int device_idx, offset_t count, offset_t num_segments,
-                       func_t f) {
-  CHECK(count % num_segments == 0) << "Data is not dense.";
-
-  launch_n(device_idx, count, [=] __device__(offset_t idx) {
-    offset_t segment = idx / (count / num_segments);
-    f(idx, segment);
-  });
-}
-
-/**
- * \fn  template <typename func_t, typename segments_iter, typename offset_t>
- * void TransformLbs(int device_idx, dh::CubMemory *temp_memory, offset_t count,
- * segments_iter segments, offset_t num_segments, bool is_dense, func_t f)
- *
- * \brief Load balancing search function. Reads a CSR type matrix description
- * and allows a function to be executed on each element. Search 'modern GPU load
- * balancing search' for more information.
- *
- * \author  Rory
- * \date  7/9/2017
- *
- * \tparam  func_t        Type of the function t.
- * \tparam  segments_iter Type of the segments iterator.
- * \tparam  offset_t      Type of the offset.
- * \param           device_idx    Zero-based index of the device.
- * \param [in,out]  temp_memory   Temporary memory allocator.
- * \param           count         Number of elements.
- * \param           segments      Device pointer to segments.
- * \param           num_segments  Number of segments.
- * \param           is_dense      True if this object is dense.
- * \param           f             Lambda to be executed on matrix elements.
- */
-
-template <typename func_t, typename segments_iter, typename offset_t>
-void TransformLbs(int device_idx, dh::CubMemory *temp_memory, offset_t count,
-                  segments_iter segments, offset_t num_segments, bool is_dense,
-                  func_t f) {
-  if (is_dense) {
-    DenseTransformLbs(device_idx, count, num_segments, f);
-  } else {
-    SparseTransformLbs(device_idx, temp_memory, count, segments, num_segments,
-                       f);
-  }
-}
-
-/**
- * @brief Helper function to sort the pairs using cub's segmented RadixSortPairs
- * @param tmp_mem cub temporary memory info
- * @param keys keys double-buffer array
- * @param vals the values double-buffer array
- * @param nVals number of elements in the array
- * @param nSegs number of segments
- * @param offsets the segments
- */
-template <typename T1, typename T2>
-void segmentedSort(dh::CubMemory *tmp_mem, dh::dvec2<T1> *keys,
-                   dh::dvec2<T2> *vals, int nVals, int nSegs,
-                   const dh::dvec<int> &offsets, int start = 0,
-                   int end = sizeof(T1) * 8) {
-  size_t tmpSize;
-  dh::safe_cuda(cub::DeviceSegmentedRadixSort::SortPairs(
-      NULL, tmpSize, keys->buff(), vals->buff(), nVals, nSegs, offsets.data(),
-      offsets.data() + 1, start, end));
-  tmp_mem->LazyAllocate(tmpSize);
-  dh::safe_cuda(cub::DeviceSegmentedRadixSort::SortPairs(
-      tmp_mem->d_temp_storage, tmpSize, keys->buff(), vals->buff(), nVals,
-      nSegs, offsets.data(), offsets.data() + 1, start, end));
-}
-
-/**
- * @brief Helper function to perform device-wide sum-reduction
- * @param tmp_mem cub temporary memory info
- * @param in the input array to be reduced
- * @param out the output reduced value
- * @param nVals number of elements in the input array
- */
-template <typename T>
-void sumReduction(dh::CubMemory &tmp_mem, dh::dvec<T> &in, dh::dvec<T> &out,
-                  int nVals) {
-  size_t tmpSize;
-  dh::safe_cuda(
-      cub::DeviceReduce::Sum(NULL, tmpSize, in.data(), out.data(), nVals));
-  tmp_mem.LazyAllocate(tmpSize);
-  dh::safe_cuda(cub::DeviceReduce::Sum(tmp_mem.d_temp_storage, tmpSize,
-                                       in.data(), out.data(), nVals));
-}
-
-/**
- * @brief Fill a given constant value across all elements in the buffer
- * @param out the buffer to be filled
- * @param len number of elements i the buffer
- * @param def default value to be filled
- */
-template <typename T, int BlkDim = 256, int ItemsPerThread = 4>
-void fillConst(int device_idx, T *out, int len, T def) {
-  dh::launch_n<ItemsPerThread, BlkDim>(device_idx, len,
-                                       [=] __device__(int i) { out[i] = def; });
-}
-
-/**
- * @brief gather elements
- * @param out1 output gathered array for the first buffer
- * @param in1 first input buffer
- * @param out2 output gathered array for the second buffer
- * @param in2 second input buffer
- * @param instId gather indices
- * @param nVals length of the buffers
- */
-template <typename T1, typename T2, int BlkDim = 256, int ItemsPerThread = 4>
-void gather(int device_idx, T1 *out1, const T1 *in1, T2 *out2, const T2 *in2,
-            const int *instId, int nVals) {
-  dh::launch_n<ItemsPerThread, BlkDim>(device_idx, nVals,
-                                       [=] __device__(int i) {
-                                         int iid = instId[i];
-                                         T1 v1 = in1[iid];
-                                         T2 v2 = in2[iid];
-                                         out1[i] = v1;
-                                         out2[i] = v2;
-                                       });
-}
-
-/**
- * @brief gather elements
- * @param out output gathered array
- * @param in input buffer
- * @param instId gather indices
- * @param nVals length of the buffers
- */
-template <typename T, int BlkDim = 256, int ItemsPerThread = 4>
-void gather(int device_idx, T *out, const T *in, const int *instId, int nVals) {
-  dh::launch_n<ItemsPerThread, BlkDim>(device_idx, nVals,
-                                       [=] __device__(int i) {
-                                         int iid = instId[i];
-                                         out[i] = in[iid];
-                                       });
-}
-
-}  // namespace dh
--- a/plugin/updater_gpu/src/gpu_predictor.cu
+++ b/plugin/updater_gpu/src/gpu_predictor.cu
@@ -1,411 +0,0 @@
-/*!
- * Copyright by Contributors 2017
- */
-#include <dmlc/parameter.h>
-#include <thrust/copy.h>
-#include <xgboost/data.h>
-#include <xgboost/predictor.h>
-#include <xgboost/tree_model.h>
-#include <xgboost/tree_updater.h>
-#include <memory>
-#include "device_helpers.cuh"
-
-namespace xgboost {
-namespace predictor {
-
-DMLC_REGISTRY_FILE_TAG(gpu_predictor);
-
-/*! \brief prediction parameters */
-struct GPUPredictionParam : public dmlc::Parameter<GPUPredictionParam> {
-  int gpu_id;
-  int n_gpus;
-  bool silent;
-  // declare parameters
-  DMLC_DECLARE_PARAMETER(GPUPredictionParam) {
-    DMLC_DECLARE_FIELD(gpu_id).set_default(0).describe(
-        "Device ordinal for GPU prediction.");
-    DMLC_DECLARE_FIELD(n_gpus).set_default(1).describe(
-        "Number of devices to use for prediction (NOT IMPLEMENTED).");
-    DMLC_DECLARE_FIELD(silent).set_default(false).describe(
-        "Do not print information during trainig.");
-  }
-};
-DMLC_REGISTER_PARAMETER(GPUPredictionParam);
-
-template <typename iter_t>
-void increment_offset(iter_t begin_itr, iter_t end_itr, size_t amount) {
-  thrust::transform(begin_itr, end_itr, begin_itr,
-                    [=] __device__(size_t elem) { return elem + amount; });
-}
-
-/**
- * \struct  DeviceMatrix
- *
- * \brief A csr representation of the input matrix allocated on the device.
- */
-
-struct DeviceMatrix {
-  DMatrix* p_mat;  // Pointer to the original matrix on the host
-  dh::bulk_allocator<dh::memory_type::DEVICE> ba;
-  dh::dvec<size_t> row_ptr;
-  dh::dvec<SparseBatch::Entry> data;
-  thrust::device_vector<float> predictions;
-
-  DeviceMatrix(DMatrix* dmat, int device_idx, bool silent) : p_mat(dmat) {
-    dh::safe_cuda(cudaSetDevice(device_idx));
-    auto info = dmat->info();
-    ba.allocate(device_idx, silent, &row_ptr, info.num_row + 1, &data,
-                info.num_nonzero);
-    auto iter = dmat->RowIterator();
-    iter->BeforeFirst();
-    size_t data_offset = 0;
-    while (iter->Next()) {
-      auto batch = iter->Value();
-      // Copy row ptr
-      thrust::copy(batch.ind_ptr, batch.ind_ptr + batch.size + 1,
-                   row_ptr.tbegin() + batch.base_rowid);
-      if (batch.base_rowid > 0) {
-        auto begin_itr = row_ptr.tbegin() + batch.base_rowid;
-        auto end_itr = begin_itr + batch.size + 1;
-        increment_offset(begin_itr, end_itr, batch.base_rowid);
-      }
-      // Copy data
-      thrust::copy(batch.data_ptr, batch.data_ptr + batch.ind_ptr[batch.size],
-                   data.tbegin() + data_offset);
-      data_offset += batch.ind_ptr[batch.size];
-    }
-  }
-};
-
-/**
- * \struct  DevicePredictionNode
- *
- * \brief Packed 16 byte representation of a tree node for use in device
- * prediction
- */
-
-struct DevicePredictionNode {
-  XGBOOST_DEVICE DevicePredictionNode()
-      : fidx(-1), left_child_idx(-1), right_child_idx(-1) {}
-
-  union NodeValue {
-    float leaf_weight;
-    float fvalue;
-  };
-
-  int fidx;
-  int left_child_idx;
-  int right_child_idx;
-  NodeValue val;
-
-  DevicePredictionNode(const RegTree::Node& n) {  // NOLINT
-    this->left_child_idx = n.cleft();
-    this->right_child_idx = n.cright();
-    this->fidx = n.split_index();
-    if (n.default_left()) {
-      fidx |= (1U << 31);
-    }
-
-    if (n.is_leaf()) {
-      this->val.leaf_weight = n.leaf_value();
-    } else {
-      this->val.fvalue = n.split_cond();
-    }
-  }
-
-  XGBOOST_DEVICE bool IsLeaf() const { return left_child_idx == -1; }
-
-  XGBOOST_DEVICE int GetFidx() const { return fidx & ((1U << 31) - 1U); }
-
-  XGBOOST_DEVICE bool MissingLeft() const { return (fidx >> 31) != 0; }
-
-  XGBOOST_DEVICE int MissingIdx() const {
-    if (MissingLeft()) {
-      return this->left_child_idx;
-    } else {
-      return this->right_child_idx;
-    }
-  }
-
-  XGBOOST_DEVICE float GetFvalue() const { return val.fvalue; }
-
-  XGBOOST_DEVICE float GetWeight() const { return val.leaf_weight; }
-};
-
-struct ElementLoader {
-  bool use_shared;
-  size_t* d_row_ptr;
-  SparseBatch::Entry* d_data;
-  int num_features;
-  float* smem;
-
-  __device__ ElementLoader(bool use_shared, size_t* row_ptr,
-                           SparseBatch::Entry* entry, int num_features,
-                           float* smem, int num_rows)
-      : use_shared(use_shared),
-        d_row_ptr(row_ptr),
-        d_data(entry),
-        num_features(num_features),
-        smem(smem) {
-    // Copy instances
-    if (use_shared) {
-      bst_uint global_idx = blockDim.x * blockIdx.x + threadIdx.x;
-      int shared_elements = blockDim.x * num_features;
-      dh::block_fill(smem, shared_elements, nanf(""));
-      __syncthreads();
-      if (global_idx < num_rows) {
-        bst_uint elem_begin = d_row_ptr[global_idx];
-        bst_uint elem_end = d_row_ptr[global_idx + 1];
-        for (bst_uint elem_idx = elem_begin; elem_idx < elem_end; elem_idx++) {
-          SparseBatch::Entry elem = d_data[elem_idx];
-          smem[threadIdx.x * num_features + elem.index] = elem.fvalue;
-        }
-      }
-      __syncthreads();
-    }
-  }
-  __device__ float GetFvalue(int ridx, int fidx) {
-    if (use_shared) {
-      return smem[threadIdx.x * num_features + fidx];
-    } else {
-      // Binary search
-      auto begin_ptr = d_data + d_row_ptr[ridx];
-      auto end_ptr = d_data + d_row_ptr[ridx + 1];
-      SparseBatch::Entry* previous_middle = nullptr;
-      while (end_ptr != begin_ptr) {
-        auto middle = begin_ptr + (end_ptr - begin_ptr) / 2;
-        if (middle == previous_middle) {
-          break;
-        } else {
-          previous_middle = middle;
-        }
-
-        if (middle->index == fidx) {
-          return middle->fvalue;
-        } else if (middle->index < fidx) {
-          begin_ptr = middle;
-        } else {
-          end_ptr = middle;
-        }
-      }
-      // Value is missing
-      return nanf("");
-    }
-  }
-};
-
-__device__ float GetLeafWeight(bst_uint ridx, const DevicePredictionNode* tree,
-                               ElementLoader* loader) {
-  DevicePredictionNode n = tree[0];
-  while (!n.IsLeaf()) {
-    float fvalue = loader->GetFvalue(ridx, n.GetFidx());
-    // Missing value
-    if (isnan(fvalue)) {
-      n = tree[n.MissingIdx()];
-    } else {
-      if (fvalue < n.GetFvalue()) {
-        n = tree[n.left_child_idx];
-      } else {
-        n = tree[n.right_child_idx];
-      }
-    }
-  }
-  return n.GetWeight();
-}
-
-template <int BLOCK_THREADS>
-__global__ void PredictKernel(const DevicePredictionNode* d_nodes,
-                              float* d_out_predictions, int* d_tree_segments,
-                              int* d_tree_group, size_t* d_row_ptr,
-                              SparseBatch::Entry* d_data, int tree_begin,
-                              int tree_end, int num_features, bst_uint num_rows,
-                              bool use_shared, int num_group) {
-  extern __shared__ float smem[];
-  bst_uint global_idx = blockDim.x * blockIdx.x + threadIdx.x;
-  ElementLoader loader(use_shared, d_row_ptr, d_data, num_features, smem,
-                       num_rows);
-  if (global_idx >= num_rows) return;
-  if (num_group == 1) {
-    float sum = 0;
-    for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
-      const DevicePredictionNode* d_tree =
-          d_nodes + d_tree_segments[tree_idx - tree_begin];
-      sum += GetLeafWeight(global_idx, d_tree, &loader);
-    }
-    d_out_predictions[global_idx] += sum;
-  } else {
-    for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
-      int tree_group = d_tree_group[tree_idx];
-      const DevicePredictionNode* d_tree =
-          d_nodes + d_tree_segments[tree_idx - tree_begin];
-      bst_uint out_prediction_idx = global_idx * num_group + tree_group;
-      d_out_predictions[out_prediction_idx] +=
-          GetLeafWeight(global_idx, d_tree, &loader);
-    }
-  }
-}
-
-class GPUPredictor : public xgboost::Predictor {
- private:
-  void DevicePredictInternal(DMatrix* dmat, std::vector<bst_float>* out_preds,
-                             const gbm::GBTreeModel& model, int tree_begin,
-                             int tree_end) {
-    if (tree_end - tree_begin == 0) {
-      return;
-    }
-
-    // Add dmatrix to device if not seen before
-    if (this->device_matrix_cache_.find(dmat) ==
-        this->device_matrix_cache_.end()) {
-      this->device_matrix_cache_.emplace(
-          dmat, std::unique_ptr<DeviceMatrix>(
-                    new DeviceMatrix(dmat, param.gpu_id, param.silent)));
-    }
-    DeviceMatrix* device_matrix = device_matrix_cache_.find(dmat)->second.get();
-
-    dh::safe_cuda(cudaSetDevice(param.gpu_id));
-    CHECK_EQ(model.param.size_leaf_vector, 0);
-    // Copy decision trees to device
-    thrust::host_vector<int> h_tree_segments;
-    h_tree_segments.reserve((tree_end - tree_end) + 1);
-    int sum = 0;
-    h_tree_segments.push_back(sum);
-    for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
-      sum += model.trees[tree_idx]->GetNodes().size();
-      h_tree_segments.push_back(sum);
-    }
-
-    thrust::host_vector<DevicePredictionNode> h_nodes(h_tree_segments.back());
-    for (int tree_idx = tree_begin; tree_idx < tree_end; tree_idx++) {
-      auto& src_nodes = model.trees[tree_idx]->GetNodes();
-      std::copy(src_nodes.begin(), src_nodes.end(),
-                h_nodes.begin() + h_tree_segments[tree_idx - tree_begin]);
-    }
-
-    nodes.resize(h_nodes.size());
-    thrust::copy(h_nodes.begin(), h_nodes.end(), nodes.begin());
-    tree_segments.resize(h_tree_segments.size());
-    thrust::copy(h_tree_segments.begin(), h_tree_segments.end(),
-                 tree_segments.begin());
-    tree_group.resize(model.tree_info.size());
-    thrust::copy(model.tree_info.begin(), model.tree_info.end(),
-                 tree_group.begin());
-
-    if (device_matrix->predictions.size() != out_preds->size()) {
-      device_matrix->predictions.resize(out_preds->size());
-      thrust::copy(out_preds->begin(), out_preds->end(),
-                   device_matrix->predictions.begin());
-    }
-
-    const int BLOCK_THREADS = 128;
-    const int GRID_SIZE =
-        dh::div_round_up(device_matrix->row_ptr.size() - 1, BLOCK_THREADS);
-
-    int shared_memory_bytes =
-        sizeof(float) * device_matrix->p_mat->info().num_col * BLOCK_THREADS;
-    bool use_shared = true;
-    if (shared_memory_bytes > dh::max_shared_memory(param.gpu_id)) {
-      shared_memory_bytes = 0;
-      use_shared = false;
-    }
-
-    PredictKernel<BLOCK_THREADS>
-        <<<GRID_SIZE, BLOCK_THREADS, shared_memory_bytes>>>(
-            dh::raw(nodes), dh::raw(device_matrix->predictions),
-            dh::raw(tree_segments), dh::raw(tree_group),
-            device_matrix->row_ptr.data(), device_matrix->data.data(),
-            tree_begin, tree_end, device_matrix->p_mat->info().num_col,
-            device_matrix->p_mat->info().num_row, use_shared,
-            model.param.num_output_group);
-
-    dh::safe_cuda(cudaDeviceSynchronize());
-    thrust::copy(device_matrix->predictions.begin(),
-                 device_matrix->predictions.end(), out_preds->begin());
-  }
-
- public:
-  GPUPredictor() : cpu_predictor(Predictor::Create("cpu_predictor")) {}
-
-  void PredictBatch(DMatrix* dmat, std::vector<bst_float>* out_preds,
-                    const gbm::GBTreeModel& model, int tree_begin,
-                    unsigned ntree_limit = 0) override {
-    if (this->PredictFromCache(dmat, out_preds, model, ntree_limit)) {
-      return;
-    }
-    this->InitOutPredictions(dmat->info(), out_preds, model);
-
-    int tree_end = ntree_limit * model.param.num_output_group;
-    if (ntree_limit == 0 || ntree_limit > model.trees.size()) {
-      tree_end = static_cast<unsigned>(model.trees.size());
-    }
-
-    DevicePredictInternal(dmat, out_preds, model, tree_begin, tree_end);
-  }
-
-  void UpdatePredictionCache(
-      const gbm::GBTreeModel& model,
-      std::vector<std::unique_ptr<TreeUpdater>>* updaters,
-      int num_new_trees) override {
-    // dh::Timer t;
-    int old_ntree = model.trees.size() - num_new_trees;
-    // update cache entry
-    for (auto& kv : cache_) {
-      PredictionCacheEntry& e = kv.second;
-      DMatrix* dmat = kv.first;
-
-      if (e.predictions.size() == 0) {
-        cpu_predictor->PredictBatch(dmat, &(e.predictions), model, 0,
-                                    model.trees.size());
-      } else if (model.param.num_output_group == 1 && updaters->size() > 0 &&
-                 num_new_trees == 1 &&
-                 updaters->back()->UpdatePredictionCache(e.data.get(),
-                                                         &(e.predictions))) {
-        {}  // do nothing
-      } else {
-        DevicePredictInternal(dmat, &(e.predictions), model, old_ntree,
-                              model.trees.size());
-      }
-    }
-  }
-
-  void PredictInstance(const SparseBatch::Inst& inst,
-                       std::vector<bst_float>* out_preds,
-                       const gbm::GBTreeModel& model, unsigned ntree_limit,
-                       unsigned root_index) override {
-    cpu_predictor->PredictInstance(inst, out_preds, model, root_index);
-  }
-  void PredictLeaf(DMatrix* p_fmat, std::vector<bst_float>* out_preds,
-                   const gbm::GBTreeModel& model,
-                   unsigned ntree_limit) override {
-    cpu_predictor->PredictLeaf(p_fmat, out_preds, model, ntree_limit);
-  }
-
-  void PredictContribution(DMatrix* p_fmat,
-                           std::vector<bst_float>* out_contribs,
-                           const gbm::GBTreeModel& model,
-                           unsigned ntree_limit) override {
-    cpu_predictor->PredictContribution(p_fmat, out_contribs, model,
-                                       ntree_limit);
-  }
-
-  void Init(const std::vector<std::pair<std::string, std::string>>& cfg,
-            const std::vector<std::shared_ptr<DMatrix>>& cache) override {
-    Predictor::Init(cfg, cache);
-    cpu_predictor->Init(cfg, cache);
-    param.InitAllowUnknown(cfg);
-  }
-
- private:
-  GPUPredictionParam param;
-  std::unique_ptr<Predictor> cpu_predictor;
-  std::unordered_map<DMatrix*, std::unique_ptr<DeviceMatrix>>
-      device_matrix_cache_;
-  thrust::device_vector<DevicePredictionNode> nodes;
-  thrust::device_vector<int> tree_segments;
-  thrust::device_vector<int> tree_group;
-};
-XGBOOST_REGISTER_PREDICTOR(GPUPredictor, "gpu_predictor")
-    .describe("Make predictions using GPU.")
-    .set_body([]() { return new GPUPredictor(); });
-}  // namespace predictor
-}  // namespace xgboost
--- a/plugin/updater_gpu/src/updater_gpu.cu
+++ b/plugin/updater_gpu/src/updater_gpu.cu
@@ -1,754 +0,0 @@
-/*!
- * Copyright 2017 XGBoost contributors
- */
-#include <xgboost/tree_updater.h>
-#include <utility>
-#include <vector>
-#include "../../../src/tree/param.h"
-#include "updater_gpu_common.cuh"
-
-namespace xgboost {
-namespace tree {
-
-DMLC_REGISTRY_FILE_TAG(updater_gpu);
-
-/**
- * @brief Absolute BFS order IDs to col-wise unique IDs based on user input
- * @param tid the index of the element that this thread should access
- * @param abs the array of absolute IDs
- * @param colIds the array of column IDs for each element
- * @param nodeStart the start of the node ID at this level
- * @param nKeys number of nodes at this level.
- * @return the uniq key
- */
-
-static HOST_DEV_INLINE node_id_t abs2uniqKey(int tid, const node_id_t* abs,
-                                const int* colIds, node_id_t nodeStart,
-                                int nKeys) {
-  int a = abs[tid];
-  if (a == UNUSED_NODE) return a;
-  return ((a - nodeStart) + (colIds[tid] * nKeys));
-}
-
-
-/**
- * @struct Pair
- * @brief Pair used for key basd scan operations on bst_gpair
- */
-struct Pair {
-  int key;
-  bst_gpair value;
-};
-
-/** define a key that's not used at all in the entire boosting process */
-static const int NONE_KEY = -100;
-
-/**
- * @brief Allocate temporary buffers needed for scan operations
- * @param tmpScans gradient buffer
- * @param tmpKeys keys buffer
- * @param size number of elements that will be scanned
- */
-template <int BLKDIM_L1L3 = 256>
-int scanTempBufferSize(int size) {
-  int nBlks = dh::div_round_up(size, BLKDIM_L1L3);
-  return nBlks;
-}
-
-struct AddByKey {
-  template <typename T>
-  HOST_DEV_INLINE T operator()(const T& first, const T& second) const {
-    T result;
-    if (first.key == second.key) {
-      result.key = first.key;
-      result.value = first.value + second.value;
-    } else {
-      result.key = second.key;
-      result.value = second.value;
-    }
-    return result;
-  }
-};
-
-/**
- * @brief Gradient value getter function
- * @param id the index into the vals or instIds array to which to fetch
- * @param vals the gradient value buffer
- * @param instIds instance index buffer
- * @return the expected gradient value
- */
-HOST_DEV_INLINE bst_gpair get(int id, const bst_gpair* vals,
-                              const int* instIds) {
-  id = instIds[id];
-  return vals[id];
-}
-
-template <int BLKDIM_L1L3>
-__global__ void cubScanByKeyL1(bst_gpair* scans, const bst_gpair* vals,
-                               const int* instIds, bst_gpair* mScans,
-                               int* mKeys, const node_id_t* keys, int nUniqKeys,
-                               const int* colIds, node_id_t nodeStart,
-                               const int size) {
-  Pair rootPair = {NONE_KEY, bst_gpair(0.f, 0.f)};
-  int myKey;
-  bst_gpair myValue;
-  typedef cub::BlockScan<Pair, BLKDIM_L1L3> BlockScan;
-  __shared__ typename BlockScan::TempStorage temp_storage;
-  Pair threadData;
-  int tid = blockIdx.x * BLKDIM_L1L3 + threadIdx.x;
-  if (tid < size) {
-    myKey = abs2uniqKey(tid, keys, colIds, nodeStart, nUniqKeys);
-    myValue = get(tid, vals, instIds);
-  } else {
-    myKey = NONE_KEY;
-    myValue = 0.f;
-  }
-  threadData.key = myKey;
-  threadData.value = myValue;
-  // get previous key, especially needed for the last thread in this block
-  // in order to pass on the partial scan values.
-  // this statement MUST appear before the checks below!
-  // else, the result of this shuffle operation will be undefined
-  int previousKey = __shfl_up(myKey, 1);
-  // Collectively compute the block-wide exclusive prefix sum
-  BlockScan(temp_storage)
-      .ExclusiveScan(threadData, threadData, rootPair, AddByKey());
-  if (tid < size) {
-    scans[tid] = threadData.value;
-  } else {
-    return;
-  }
-  if (threadIdx.x == BLKDIM_L1L3 - 1) {
-    threadData.value =
-        (myKey == previousKey) ? threadData.value : bst_gpair(0.0f, 0.0f);
-    mKeys[blockIdx.x] = myKey;
-    mScans[blockIdx.x] = threadData.value + myValue;
-  }
-}
-
-template <int BLKSIZE>
-__global__ void cubScanByKeyL2(bst_gpair* mScans, int* mKeys, int mLength) {
-  typedef cub::BlockScan<Pair, BLKSIZE, cub::BLOCK_SCAN_WARP_SCANS> BlockScan;
-  Pair threadData;
-  __shared__ typename BlockScan::TempStorage temp_storage;
-  for (int i = threadIdx.x; i < mLength; i += BLKSIZE - 1) {
-    threadData.key = mKeys[i];
-    threadData.value = mScans[i];
-    BlockScan(temp_storage).InclusiveScan(threadData, threadData, AddByKey());
-    mScans[i] = threadData.value;
-    __syncthreads();
-  }
-}
-
-template <int BLKDIM_L1L3>
-__global__ void cubScanByKeyL3(bst_gpair* sums, bst_gpair* scans,
-                               const bst_gpair* vals, const int* instIds,
-                               const bst_gpair* mScans, const int* mKeys,
-                               const node_id_t* keys, int nUniqKeys,
-                               const int* colIds, node_id_t nodeStart,
-                               const int size) {
-  int relId = threadIdx.x;
-  int tid = (blockIdx.x * BLKDIM_L1L3) + relId;
-  // to avoid the following warning from nvcc:
-  //   __shared__ memory variable with non-empty constructor or destructor
-  //     (potential race between threads)
-  __shared__ char gradBuff[sizeof(bst_gpair)];
-  __shared__ int s_mKeys;
-  bst_gpair* s_mScans = reinterpret_cast<bst_gpair*>(gradBuff);
-  if (tid >= size) return;
-  // cache block-wide partial scan info
-  if (relId == 0) {
-    s_mKeys = (blockIdx.x > 0) ? mKeys[blockIdx.x - 1] : NONE_KEY;
-    s_mScans[0] = (blockIdx.x > 0) ? mScans[blockIdx.x - 1] : bst_gpair();
-  }
-  int myKey = abs2uniqKey(tid, keys, colIds, nodeStart, nUniqKeys);
-  int previousKey =
-      tid == 0 ? NONE_KEY
-               : abs2uniqKey(tid - 1, keys, colIds, nodeStart, nUniqKeys);
-  bst_gpair myValue = scans[tid];
-  __syncthreads();
-  if (blockIdx.x > 0 && s_mKeys == previousKey) {
-    myValue += s_mScans[0];
-  }
-  if (tid == size - 1) {
-    sums[previousKey] = myValue + get(tid, vals, instIds);
-  }
-  if ((previousKey != myKey) && (previousKey >= 0)) {
-    sums[previousKey] = myValue;
-    myValue = bst_gpair(0.0f, 0.0f);
-  }
-  scans[tid] = myValue;
-}
-
-/**
- * @brief Performs fused reduce and scan by key functionality. It is assumed
- * that
- *  the keys occur contiguously!
- * @param sums the output gradient reductions for each element performed
- * key-wise
- * @param scans the output gradient scans for each element performed key-wise
- * @param vals the gradients evaluated for each observation.
- * @param instIds instance ids for each element
- * @param keys keys to be used to segment the reductions. They need not occur
- *  contiguously in contrast to scan_by_key. Currently, we need one key per
- *  value in the 'vals' array.
- * @param size number of elements in the 'vals' array
- * @param nUniqKeys max number of uniq keys found per column
- * @param nCols number of columns
- * @param tmpScans temporary scan buffer needed for cub-pyramid algo
- * @param tmpKeys temporary key buffer needed for cub-pyramid algo
- * @param colIds column indices for each element in the array
- * @param nodeStart index of the leftmost node in the current level
- */
-template <int BLKDIM_L1L3 = 256, int BLKDIM_L2 = 512>
-void reduceScanByKey(bst_gpair* sums, bst_gpair* scans, const bst_gpair* vals,
-                     const int* instIds, const node_id_t* keys, int size,
-                     int nUniqKeys, int nCols, bst_gpair* tmpScans,
-                     int* tmpKeys, const int* colIds, node_id_t nodeStart) {
-  int nBlks = dh::div_round_up(size, BLKDIM_L1L3);
-  cudaMemset(sums, 0, nUniqKeys * nCols * sizeof(bst_gpair));
-  cubScanByKeyL1<BLKDIM_L1L3>
-      <<<nBlks, BLKDIM_L1L3>>>(scans, vals, instIds, tmpScans, tmpKeys, keys,
-                               nUniqKeys, colIds, nodeStart, size);
-  cubScanByKeyL2<BLKDIM_L2><<<1, BLKDIM_L2>>>(tmpScans, tmpKeys, nBlks);
-  cubScanByKeyL3<BLKDIM_L1L3>
-      <<<nBlks, BLKDIM_L1L3>>>(sums, scans, vals, instIds, tmpScans, tmpKeys,
-                               keys, nUniqKeys, colIds, nodeStart, size);
-}
-
-/**
- * @struct ExactSplitCandidate
- * @brief Abstraction of a possible split in the decision tree
- */
-struct ExactSplitCandidate {
-  /** the optimal gain score for this node */
-  float score;
-  /** index where to split in the DMatrix */
-  int index;
-
-  HOST_DEV_INLINE ExactSplitCandidate() : score(-FLT_MAX), index(INT_MAX) {}
-
-  /**
-   * @brief Whether the split info is valid to be used to create a new child
-   * @param minSplitLoss minimum score above which decision to split is made
-   * @return true if splittable, else false
-   */
-  HOST_DEV_INLINE bool isSplittable(float minSplitLoss) const {
-    return ((score >= minSplitLoss) && (index != INT_MAX));
-  }
-};
-
-/**
- * @enum ArgMaxByKeyAlgo best_split_evaluation.cuh
- * @brief Help decide which algorithm to use for multi-argmax operation
- */
-enum ArgMaxByKeyAlgo {
-  /** simplest, use gmem-atomics for all updates */
-  ABK_GMEM = 0,
-  /** use smem-atomics for updates (when number of keys are less) */
-  ABK_SMEM
-};
-
-/** max depth until which to use shared mem based atomics for argmax */
-static const int MAX_ABK_LEVELS = 3;
-
-HOST_DEV_INLINE ExactSplitCandidate maxSplit(ExactSplitCandidate a,
-                                             ExactSplitCandidate b) {
-  ExactSplitCandidate out;
-  if (a.score < b.score) {
-    out.score = b.score;
-    out.index = b.index;
-  } else if (a.score == b.score) {
-    out.score = a.score;
-    out.index = (a.index < b.index) ? a.index : b.index;
-  } else {
-    out.score = a.score;
-    out.index = a.index;
-  }
-  return out;
-}
-
-DEV_INLINE void atomicArgMax(ExactSplitCandidate* address,
-                             ExactSplitCandidate val) {
-  unsigned long long* intAddress = (unsigned long long*)address;  // NOLINT
-  unsigned long long old = *intAddress;                           // NOLINT
-  unsigned long long assumed;                                     // NOLINT
-  do {
-    assumed = old;
-    ExactSplitCandidate res =
-        maxSplit(val, *reinterpret_cast<ExactSplitCandidate*>(&assumed));
-    old = atomicCAS(intAddress, assumed, *reinterpret_cast<uint64_t*>(&res));
-  } while (assumed != old);
-}
-
-DEV_INLINE void argMaxWithAtomics(
-    int id, ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
-    const bst_gpair* gradSums, const float* vals, const int* colIds,
-    const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
-    node_id_t nodeStart, int len, const GPUTrainingParam& param) {
-  int nodeId = nodeAssigns[id];
-  // @todo: this is really a bad check! but will be fixed when we move
-  //  to key-based reduction
-  if ((id == 0) ||
-      !((nodeId == nodeAssigns[id - 1]) && (colIds[id] == colIds[id - 1]) &&
-        (vals[id] == vals[id - 1]))) {
-    if (nodeId != UNUSED_NODE) {
-      int sumId = abs2uniqKey(id, nodeAssigns, colIds, nodeStart, nUniqKeys);
-      bst_gpair colSum = gradSums[sumId];
-      int uid = nodeId - nodeStart;
-      DeviceDenseNode n = nodes[nodeId];
-      bst_gpair parentSum = n.sum_gradients;
-      float parentGain = n.root_gain;
-      bool tmp;
-      ExactSplitCandidate s;
-      bst_gpair missing = parentSum - colSum;
-      s.score = loss_chg_missing(gradScans[id], missing, parentSum, parentGain,
-                                 param, tmp);
-      s.index = id;
-      atomicArgMax(nodeSplits + uid, s);
-    }  // end if nodeId != UNUSED_NODE
-  }    // end if id == 0 ...
-}
-
-__global__ void atomicArgMaxByKeyGmem(
-    ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
-    const bst_gpair* gradSums, const float* vals, const int* colIds,
-    const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
-    node_id_t nodeStart, int len, const TrainParam param) {
-  int id = threadIdx.x + (blockIdx.x * blockDim.x);
-  const int stride = blockDim.x * gridDim.x;
-  for (; id < len; id += stride) {
-    argMaxWithAtomics(id, nodeSplits, gradScans, gradSums, vals, colIds,
-                      nodeAssigns, nodes, nUniqKeys, nodeStart, len,
-                      GPUTrainingParam(param));
-  }
-}
-
-__global__ void atomicArgMaxByKeySmem(
-    ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
-    const bst_gpair* gradSums, const float* vals, const int* colIds,
-    const node_id_t* nodeAssigns, const DeviceDenseNode* nodes, int nUniqKeys,
-    node_id_t nodeStart, int len, const TrainParam param) {
-  extern __shared__ char sArr[];
-  ExactSplitCandidate* sNodeSplits =
-      reinterpret_cast<ExactSplitCandidate*>(sArr);
-  int tid = threadIdx.x;
-  ExactSplitCandidate defVal;
-#pragma unroll 1
-  for (int i = tid; i < nUniqKeys; i += blockDim.x) {
-    sNodeSplits[i] = defVal;
-  }
-  __syncthreads();
-  int id = tid + (blockIdx.x * blockDim.x);
-  const int stride = blockDim.x * gridDim.x;
-  for (; id < len; id += stride) {
-    argMaxWithAtomics(id, sNodeSplits, gradScans, gradSums, vals, colIds,
-                      nodeAssigns, nodes, nUniqKeys, nodeStart, len, param);
-  }
-  __syncthreads();
-  for (int i = tid; i < nUniqKeys; i += blockDim.x) {
-    ExactSplitCandidate s = sNodeSplits[i];
-    atomicArgMax(nodeSplits + i, s);
-  }
-}
-
-/**
- * @brief Performs argmax_by_key functionality but for cases when keys need not
- *  occur contiguously
- * @param nodeSplits will contain information on best split for each node
- * @param gradScans exclusive sum on sorted segments for each col
- * @param gradSums gradient sum for each column in DMatrix based on to node-ids
- * @param vals feature values
- * @param colIds column index for each element in the feature values array
- * @param nodeAssigns node-id assignments to each element in DMatrix
- * @param nodes pointer to all nodes for this tree in BFS order
- * @param nUniqKeys number of unique node-ids in this level
- * @param nodeStart start index of the node-ids in this level
- * @param len number of elements
- * @param param training parameters
- * @param algo which algorithm to use for argmax_by_key
- */
-template <int BLKDIM = 256, int ITEMS_PER_THREAD = 4>
-void argMaxByKey(ExactSplitCandidate* nodeSplits, const bst_gpair* gradScans,
-                 const bst_gpair* gradSums, const float* vals,
-                 const int* colIds, const node_id_t* nodeAssigns,
-                 const DeviceDenseNode* nodes, int nUniqKeys,
-                 node_id_t nodeStart, int len, const TrainParam param,
-                 ArgMaxByKeyAlgo algo) {
-  dh::fillConst<ExactSplitCandidate, BLKDIM, ITEMS_PER_THREAD>(
-      dh::get_device_idx(param.gpu_id), nodeSplits, nUniqKeys,
-      ExactSplitCandidate());
-  int nBlks = dh::div_round_up(len, ITEMS_PER_THREAD * BLKDIM);
-  switch (algo) {
-    case ABK_GMEM:
-      atomicArgMaxByKeyGmem<<<nBlks, BLKDIM>>>(
-          nodeSplits, gradScans, gradSums, vals, colIds, nodeAssigns, nodes,
-          nUniqKeys, nodeStart, len, param);
-      break;
-    case ABK_SMEM:
-      atomicArgMaxByKeySmem<<<nBlks, BLKDIM,
-                              sizeof(ExactSplitCandidate) * nUniqKeys>>>(
-          nodeSplits, gradScans, gradSums, vals, colIds, nodeAssigns, nodes,
-          nUniqKeys, nodeStart, len, param);
-      break;
-    default:
-      throw std::runtime_error("argMaxByKey: Bad algo passed!");
-  }
-}
-
-__global__ void assignColIds(int* colIds, const int* colOffsets) {
-  int myId = blockIdx.x;
-  int start = colOffsets[myId];
-  int end = colOffsets[myId + 1];
-  for (int id = start + threadIdx.x; id < end; id += blockDim.x) {
-    colIds[id] = myId;
-  }
-}
-
-__global__ void fillDefaultNodeIds(node_id_t* nodeIdsPerInst,
-                                   const DeviceDenseNode* nodes, int nRows) {
-  int id = threadIdx.x + (blockIdx.x * blockDim.x);
-  if (id >= nRows) {
-    return;
-  }
-  // if this element belongs to none of the currently active node-id's
-  node_id_t nId = nodeIdsPerInst[id];
-  if (nId == UNUSED_NODE) {
-    return;
-  }
-  const DeviceDenseNode n = nodes[nId];
-  node_id_t result;
-  if (n.IsLeaf() || n.IsUnused()) {
-    result = UNUSED_NODE;
-  } else if (n.dir == LeftDir) {
-    result = (2 * n.idx) + 1;
-  } else {
-    result = (2 * n.idx) + 2;
-  }
-  nodeIdsPerInst[id] = result;
-}
-
-__global__ void assignNodeIds(node_id_t* nodeIdsPerInst, int* nodeLocations,
-                              const node_id_t* nodeIds, const int* instId,
-                              const DeviceDenseNode* nodes,
-                              const int* colOffsets, const float* vals,
-                              int nVals, int nCols) {
-  int id = threadIdx.x + (blockIdx.x * blockDim.x);
-  const int stride = blockDim.x * gridDim.x;
-  for (; id < nVals; id += stride) {
-    // fusing generation of indices for node locations
-    nodeLocations[id] = id;
-    // using nodeIds here since the previous kernel would have updated
-    // the nodeIdsPerInst with all default assignments
-    int nId = nodeIds[id];
-    // if this element belongs to none of the currently active node-id's
-    if (nId != UNUSED_NODE) {
-      const DeviceDenseNode n = nodes[nId];
-      int colId = n.fidx;
-      // printf("nid=%d colId=%d id=%d\n", nId, colId, id);
-      int start = colOffsets[colId];
-      int end = colOffsets[colId + 1];
-      // @todo: too much wasteful threads!!
-      if ((id >= start) && (id < end) && !(n.IsLeaf() || n.IsUnused())) {
-        node_id_t result = (2 * n.idx) + 1 + (vals[id] >= n.fvalue);
-        nodeIdsPerInst[instId[id]] = result;
-      }
-    }
-  }
-}
-
-__global__ void markLeavesKernel(DeviceDenseNode* nodes, int len) {
-  int id = (blockIdx.x * blockDim.x) + threadIdx.x;
-  if ((id < len) && !nodes[id].IsUnused()) {
-    int lid = (id << 1) + 1;
-    int rid = (id << 1) + 2;
-    if ((lid >= len) || (rid >= len)) {
-      nodes[id].root_gain = -FLT_MAX;  // bottom-most nodes
-    } else if (nodes[lid].IsUnused() && nodes[rid].IsUnused()) {
-      nodes[id].root_gain = -FLT_MAX;  // unused child nodes
-    }
-  }
-}
-
-class GPUMaker : public TreeUpdater {
- protected:
-  TrainParam param;
-  /** whether we have initialized memory already (so as not to repeat!) */
-  bool allocated;
-  /** feature values stored in column-major compressed format */
-  dh::dvec2<float> vals;
-  dh::dvec<float> vals_cached;
-  /** corresponding instance id's of these featutre values */
-  dh::dvec2<int> instIds;
-  dh::dvec<int> instIds_cached;
-  /** column offsets for these feature values */
-  dh::dvec<int> colOffsets;
-  dh::dvec<bst_gpair> gradsInst;
-  dh::dvec2<node_id_t> nodeAssigns;
-  dh::dvec2<int> nodeLocations;
-  dh::dvec<DeviceDenseNode> nodes;
-  dh::dvec<node_id_t> nodeAssignsPerInst;
-  dh::dvec<bst_gpair> gradSums;
-  dh::dvec<bst_gpair> gradScans;
-  dh::dvec<ExactSplitCandidate> nodeSplits;
-  int nVals;
-  int nRows;
-  int nCols;
-  int maxNodes;
-  int maxLeaves;
-  dh::CubMemory tmp_mem;
-  dh::dvec<bst_gpair> tmpScanGradBuff;
-  dh::dvec<int> tmpScanKeyBuff;
-  dh::dvec<int> colIds;
-  dh::bulk_allocator<dh::memory_type::DEVICE> ba;
-
- public:
-  GPUMaker() : allocated(false) {}
-  ~GPUMaker() {}
-
-  void Init(
-      const std::vector<std::pair<std::string, std::string>>& args) override {
-    param.InitAllowUnknown(args);
-    maxNodes = (1 << (param.max_depth + 1)) - 1;
-    maxLeaves = 1 << param.max_depth;
-  }
-
-  void Update(const std::vector<bst_gpair>& gpair, DMatrix* dmat,
-              const std::vector<RegTree*>& trees) override {
-    GradStats::CheckInfo(dmat->info());
-    // rescale learning rate according to size of trees
-    float lr = param.learning_rate;
-    param.learning_rate = lr / trees.size();
-
-    try {
-      // build tree
-      for (size_t i = 0; i < trees.size(); ++i) {
-        UpdateTree(gpair, dmat, trees[i]);
-      }
-    } catch (const std::exception& e) {
-      LOG(FATAL) << "GPU plugin exception: " << e.what() << std::endl;
-    }
-    param.learning_rate = lr;
-  }
-  /// @note: Update should be only after Init!!
-  void UpdateTree(const std::vector<bst_gpair>& gpair, DMatrix* dmat,
-                  RegTree* hTree) {
-    if (!allocated) {
-      setupOneTimeData(dmat);
-    }
-    for (int i = 0; i < param.max_depth; ++i) {
-      if (i == 0) {
-        // make sure to start on a fresh tree with sorted values!
-        vals.current_dvec() = vals_cached;
-        instIds.current_dvec() = instIds_cached;
-        transferGrads(gpair);
-      }
-      int nNodes = 1 << i;
-      node_id_t nodeStart = nNodes - 1;
-      initNodeData(i, nodeStart, nNodes);
-      findSplit(i, nodeStart, nNodes);
-    }
-    // mark all the used nodes with unused children as leaf nodes
-    markLeaves();
-    dense2sparse_tree(hTree, nodes, param);
-  }
-
-  void split2node(int nNodes, node_id_t nodeStart) {
-    auto d_nodes = nodes.data();
-    auto d_gradScans = gradScans.data();
-    auto d_gradSums = gradSums.data();
-    auto d_nodeAssigns = nodeAssigns.current();
-    auto d_colIds = colIds.data();
-    auto d_vals = vals.current();
-    auto d_nodeSplits = nodeSplits.data();
-    int nUniqKeys = nNodes;
-    float min_split_loss = param.min_split_loss;
-    auto gpu_param = GPUTrainingParam(param);
-
-    dh::launch_n(param.gpu_id, nNodes, [=] __device__(int uid) {
-      int absNodeId = uid + nodeStart;
-      ExactSplitCandidate s = d_nodeSplits[uid];
-      if (s.isSplittable(min_split_loss)) {
-        int idx = s.index;
-        int nodeInstId =
-            abs2uniqKey(idx, d_nodeAssigns, d_colIds, nodeStart, nUniqKeys);
-        bool missingLeft = true;
-        const DeviceDenseNode& n = d_nodes[absNodeId];
-        bst_gpair gradScan = d_gradScans[idx];
-        bst_gpair gradSum = d_gradSums[nodeInstId];
-        float thresh = d_vals[idx];
-        int colId = d_colIds[idx];
-        // get the default direction for the current node
-        bst_gpair missing = n.sum_gradients - gradSum;
-        loss_chg_missing(gradScan, missing, n.sum_gradients, n.root_gain,
-                         gpu_param, missingLeft);
-        // get the score/weight/id/gradSum for left and right child nodes
-        bst_gpair lGradSum = missingLeft ? gradScan + missing : gradScan;
-        bst_gpair rGradSum = n.sum_gradients - lGradSum;
-
-        // Create children
-        d_nodes[left_child_nidx(absNodeId)] =
-            DeviceDenseNode(lGradSum, left_child_nidx(absNodeId), gpu_param);
-        d_nodes[right_child_nidx(absNodeId)] =
-            DeviceDenseNode(rGradSum, right_child_nidx(absNodeId), gpu_param);
-        // Set split for parent
-        d_nodes[absNodeId].SetSplit(thresh, colId,
-                                    missingLeft ? LeftDir : RightDir);
-      } else {
-        // cannot be split further, so this node is a leaf!
-        d_nodes[absNodeId].root_gain = -FLT_MAX;
-      }
-    });
-  }
-
-  void findSplit(int level, node_id_t nodeStart, int nNodes) {
-    reduceScanByKey(gradSums.data(), gradScans.data(), gradsInst.data(),
-                    instIds.current(), nodeAssigns.current(), nVals, nNodes,
-                    nCols, tmpScanGradBuff.data(), tmpScanKeyBuff.data(),
-                    colIds.data(), nodeStart);
-    argMaxByKey(nodeSplits.data(), gradScans.data(), gradSums.data(),
-                vals.current(), colIds.data(), nodeAssigns.current(),
-                nodes.data(), nNodes, nodeStart, nVals, param,
-                level <= MAX_ABK_LEVELS ? ABK_SMEM : ABK_GMEM);
-    split2node(nNodes, nodeStart);
-  }
-
-  void allocateAllData(int offsetSize) {
-    int tmpBuffSize = scanTempBufferSize(nVals);
-    ba.allocate(dh::get_device_idx(param.gpu_id), param.silent, &vals, nVals,
-                &vals_cached, nVals, &instIds, nVals, &instIds_cached, nVals,
-                &colOffsets, offsetSize, &gradsInst, nRows, &nodeAssigns, nVals,
-                &nodeLocations, nVals, &nodes, maxNodes, &nodeAssignsPerInst,
-                nRows, &gradSums, maxLeaves * nCols, &gradScans, nVals,
-                &nodeSplits, maxLeaves, &tmpScanGradBuff, tmpBuffSize,
-                &tmpScanKeyBuff, tmpBuffSize, &colIds, nVals);
-  }
-
-  void setupOneTimeData(DMatrix* dmat) {
-    size_t free_memory = dh::available_memory(dh::get_device_idx(param.gpu_id));
-    if (!dmat->SingleColBlock()) {
-      throw std::runtime_error("exact::GPUBuilder - must have 1 column block");
-    }
-    std::vector<float> fval;
-    std::vector<int> fId, offset;
-    convertToCsc(dmat, &fval, &fId, &offset);
-    allocateAllData(static_cast<int>(offset.size()));
-    transferAndSortData(fval, fId, offset);
-    allocated = true;
-  }
-
-  void convertToCsc(DMatrix* dmat, std::vector<float>* fval,
-                    std::vector<int>* fId, std::vector<int>* offset) {
-    MetaInfo info = dmat->info();
-    nRows = info.num_row;
-    nCols = info.num_col;
-    offset->reserve(nCols + 1);
-    offset->push_back(0);
-    fval->reserve(nCols * nRows);
-    fId->reserve(nCols * nRows);
-    // in case you end up with a DMatrix having no column access
-    // then make sure to enable that before copying the data!
-    if (!dmat->HaveColAccess()) {
-      const std::vector<bool> enable(nCols, true);
-      dmat->InitColAccess(enable, 1, nRows);
-    }
-    dmlc::DataIter<ColBatch>* iter = dmat->ColIterator();
-    iter->BeforeFirst();
-    while (iter->Next()) {
-      const ColBatch& batch = iter->Value();
-      for (int i = 0; i < batch.size; i++) {
-        const ColBatch::Inst& col = batch[i];
-        for (const ColBatch::Entry* it = col.data; it != col.data + col.length;
-             it++) {
-          int inst_id = static_cast<int>(it->index);
-          fval->push_back(it->fvalue);
-          fId->push_back(inst_id);
-        }
-        offset->push_back(fval->size());
-      }
-    }
-    nVals = fval->size();
-  }
-
-  void transferAndSortData(const std::vector<float>& fval,
-                           const std::vector<int>& fId,
-                           const std::vector<int>& offset) {
-    vals.current_dvec() = fval;
-    instIds.current_dvec() = fId;
-    colOffsets = offset;
-    dh::segmentedSort<float, int>(&tmp_mem, &vals, &instIds, nVals, nCols,
-                              colOffsets);
-    vals_cached = vals.current_dvec();
-    instIds_cached = instIds.current_dvec();
-    assignColIds<<<nCols, 512>>>(colIds.data(), colOffsets.data());
-  }
-
-  void transferGrads(const std::vector<bst_gpair>& gpair) {
-    // HACK
-    dh::safe_cuda(cudaMemcpy(gradsInst.data(), &(gpair[0]),
-                             sizeof(bst_gpair) * nRows,
-                             cudaMemcpyHostToDevice));
-    // evaluate the full-grad reduction for the root node
-    dh::sumReduction<bst_gpair>(tmp_mem, gradsInst, gradSums, nRows);
-  }
-
-  void initNodeData(int level, node_id_t nodeStart, int nNodes) {
-    // all instances belong to root node at the beginning!
-    if (level == 0) {
-      nodes.fill(DeviceDenseNode());
-      nodeAssigns.current_dvec().fill(0);
-      nodeAssignsPerInst.fill(0);
-      // for root node, just update the gradient/score/weight/id info
-      // before splitting it! Currently all data is on GPU, hence this
-      // stupid little kernel
-      auto d_nodes = nodes.data();
-      auto d_sums = gradSums.data();
-      auto gpu_params = GPUTrainingParam(param);
-      dh::launch_n(param.gpu_id, 1, [=] __device__(int idx) {
-        d_nodes[0] = DeviceDenseNode(d_sums[0], 0, gpu_params);
-      });
-    } else {
-      const int BlkDim = 256;
-      const int ItemsPerThread = 4;
-      // assign default node ids first
-      int nBlks = dh::div_round_up(nRows, BlkDim);
-      fillDefaultNodeIds<<<nBlks, BlkDim>>>(nodeAssignsPerInst.data(),
-                                            nodes.data(), nRows);
-      // evaluate the correct child indices of non-missing values next
-      nBlks = dh::div_round_up(nVals, BlkDim * ItemsPerThread);
-      assignNodeIds<<<nBlks, BlkDim>>>(
-          nodeAssignsPerInst.data(), nodeLocations.current(),
-          nodeAssigns.current(), instIds.current(), nodes.data(),
-          colOffsets.data(), vals.current(), nVals, nCols);
-      // gather the node assignments across all other columns too
-      dh::gather(dh::get_device_idx(param.gpu_id), nodeAssigns.current(),
-             nodeAssignsPerInst.data(), instIds.current(), nVals);
-      sortKeys(level);
-    }
-  }
-
-  void sortKeys(int level) {
-    // segmented-sort the arrays based on node-id's
-    // but we don't need more than level+1 bits for sorting!
-    segmentedSort(&tmp_mem, &nodeAssigns, &nodeLocations, nVals, nCols,
-                  colOffsets, 0, level + 1);
-    dh::gather<float, int>(dh::get_device_idx(param.gpu_id), vals.other(),
-                       vals.current(), instIds.other(), instIds.current(),
-                       nodeLocations.current(), nVals);
-    vals.buff().selector ^= 1;
-    instIds.buff().selector ^= 1;
-  }
-
-  void markLeaves() {
-    const int BlkDim = 128;
-    int nBlks = dh::div_round_up(maxNodes, BlkDim);
-    markLeavesKernel<<<nBlks, BlkDim>>>(nodes.data(), maxNodes);
-  }
-};
-
-XGBOOST_REGISTER_TREE_UPDATER(GPUMaker, "grow_gpu")
-    .describe("Grow tree with GPU.")
-    .set_body([]() { return new GPUMaker(); });
-
-}  // namespace tree
-}  // namespace xgboost
--- a/plugin/updater_gpu/src/updater_gpu_common.cuh
+++ b/plugin/updater_gpu/src/updater_gpu_common.cuh
@@ -1,243 +0,0 @@
-/*!
- * Copyright 2017 XGBoost contributors
- */
-#pragma once
-#include <thrust/random.h>
-#include <cstdio>
-#include <stdexcept>
-#include <string>
-#include <vector>
-#include "../../../src/common/random.h"
-#include "../../../src/tree/param.h"
-#include "cub/cub.cuh"
-#include "device_helpers.cuh"
-
-namespace xgboost {
-namespace tree {
-
-struct GPUTrainingParam {
-  // minimum amount of hessian(weight) allowed in a child
-  float min_child_weight;
-  // L2 regularization factor
-  float reg_lambda;
-  // L1 regularization factor
-  float reg_alpha;
-  // maximum delta update we can add in weight estimation
-  // this parameter can be used to stabilize update
-  // default=0 means no constraint on weight delta
-  float max_delta_step;
-
-  __host__ __device__ GPUTrainingParam() {}
-
-  __host__ __device__ GPUTrainingParam(const TrainParam& param)
-      : min_child_weight(param.min_child_weight),
-        reg_lambda(param.reg_lambda),
-        reg_alpha(param.reg_alpha),
-        max_delta_step(param.max_delta_step) {}
-};
-
-typedef int node_id_t;
-
-/** used to assign default id to a Node */
-static const int UNUSED_NODE = -1;
-
-/**
- * @enum DefaultDirection node.cuh
- * @brief Default direction to be followed in case of missing values
- */
-enum DefaultDirection {
-  /** move to left child */
-  LeftDir = 0,
-  /** move to right child */
-  RightDir
-};
-
-struct DeviceDenseNode {
-  bst_gpair sum_gradients;
-  float root_gain;
-  float weight;
-
-  /** default direction for missing values */
-  DefaultDirection dir;
-  /** threshold value for comparison */
-  float fvalue;
-  /** \brief The feature index. */
-  int fidx;
-  /** node id (used as key for reduce/scan) */
-  node_id_t idx;
-
-  HOST_DEV_INLINE DeviceDenseNode()
-      : sum_gradients(),
-        root_gain(-FLT_MAX),
-        weight(-FLT_MAX),
-        dir(LeftDir),
-        fvalue(0.f),
-        fidx(UNUSED_NODE),
-        idx(UNUSED_NODE) {}
-
-  HOST_DEV_INLINE DeviceDenseNode(bst_gpair sum_gradients, node_id_t nidx,
-                                  const GPUTrainingParam& param)
-      : sum_gradients(sum_gradients),
-        dir(LeftDir),
-        fvalue(0.f),
-        fidx(UNUSED_NODE),
-        idx(nidx) {
-    this->root_gain = CalcGain(param, sum_gradients.grad, sum_gradients.hess);
-    this->weight = CalcWeight(param, sum_gradients.grad, sum_gradients.hess);
-  }
-
-  HOST_DEV_INLINE void SetSplit(float fvalue, int fidx, DefaultDirection dir) {
-    this->fvalue = fvalue;
-    this->fidx = fidx;
-    this->dir = dir;
-  }
-
-  /** Tells whether this node is part of the decision tree */
-  HOST_DEV_INLINE bool IsUnused() const { return (idx == UNUSED_NODE); }
-
-  /** Tells whether this node is a leaf of the decision tree */
-  HOST_DEV_INLINE bool IsLeaf() const {
-    return (!IsUnused() && (fidx == UNUSED_NODE));
-  }
-};
-
-template <typename gpair_t>
-__device__ inline float device_calc_loss_chg(
-    const GPUTrainingParam& param, const gpair_t& scan, const gpair_t& missing,
-    const gpair_t& parent_sum, const float& parent_gain, bool missing_left) {
-  gpair_t left = scan;
-
-  if (missing_left) {
-    left += missing;
-  }
-
-  gpair_t right = parent_sum - left;
-
-  float left_gain = CalcGain(param, left.grad, left.hess);
-  float right_gain = CalcGain(param, right.grad, right.hess);
-  return left_gain + right_gain - parent_gain;
-}
-
-template <typename gpair_t>
-__device__ float inline loss_chg_missing(const gpair_t& scan,
-                                         const gpair_t& missing,
-                                         const gpair_t& parent_sum,
-                                         const float& parent_gain,
-                                         const GPUTrainingParam& param,
-                                         bool& missing_left_out) {  // NOLINT
-  float missing_left_loss =
-      device_calc_loss_chg(param, scan, missing, parent_sum, parent_gain, true);
-  float missing_right_loss = device_calc_loss_chg(
-      param, scan, missing, parent_sum, parent_gain, false);
-
-  if (missing_left_loss >= missing_right_loss) {
-    missing_left_out = true;
-    return missing_left_loss;
-  } else {
-    missing_left_out = false;
-    return missing_right_loss;
-  }
-}
-
-// Total number of nodes in tree, given depth
-__host__ __device__ inline int n_nodes(int depth) {
-  return (1 << (depth + 1)) - 1;
-}
-
-// Number of nodes at this level of the tree
-__host__ __device__ inline int n_nodes_level(int depth) { return 1 << depth; }
-
-// Whether a node is currently being processed at current depth
-__host__ __device__ inline bool is_active(int nidx, int depth) {
-  return nidx >= n_nodes(depth - 1);
-}
-
-__host__ __device__ inline int parent_nidx(int nidx) { return (nidx - 1) / 2; }
-
-__host__ __device__ inline int left_child_nidx(int nidx) {
-  return nidx * 2 + 1;
-}
-
-__host__ __device__ inline int right_child_nidx(int nidx) {
-  return nidx * 2 + 2;
-}
-
-__host__ __device__ inline bool is_left_child(int nidx) {
-  return nidx % 2 == 1;
-}
-
-// Copy gpu dense representation of tree to xgboost sparse representation
-inline void dense2sparse_tree(RegTree* p_tree,
-                              const dh::dvec<DeviceDenseNode>& nodes,
-                              const TrainParam& param) {
-  RegTree& tree = *p_tree;
-  std::vector<DeviceDenseNode> h_nodes = nodes.as_vector();
-
-  int nid = 0;
-  for (int gpu_nid = 0; gpu_nid < h_nodes.size(); gpu_nid++) {
-    const DeviceDenseNode& n = h_nodes[gpu_nid];
-    if (!n.IsUnused() && !n.IsLeaf()) {
-      tree.AddChilds(nid);
-      tree[nid].set_split(n.fidx, n.fvalue, n.dir == LeftDir);
-      tree.stat(nid).loss_chg = n.root_gain;
-      tree.stat(nid).base_weight = n.weight;
-      tree.stat(nid).sum_hess = n.sum_gradients.hess;
-      tree[tree[nid].cleft()].set_leaf(0);
-      tree[tree[nid].cright()].set_leaf(0);
-      nid++;
-    } else if (n.IsLeaf()) {
-      tree[nid].set_leaf(n.weight * param.learning_rate);
-      tree.stat(nid).sum_hess = n.sum_gradients.hess;
-      nid++;
-    }
-  }
-}
-
-/*
- * Random
- */
-
-struct BernoulliRng {
-  float p;
-  int seed;
-
-  __host__ __device__ BernoulliRng(float p, int seed) : p(p), seed(seed) {}
-
-  __host__ __device__ bool operator()(const int i) const {
-    thrust::default_random_engine rng(seed);
-    thrust::uniform_real_distribution<float> dist;
-    rng.discard(i);
-    return dist(rng) <= p;
-  }
-};
-
-// Set gradient pair to 0 with p = 1 - subsample
-inline void subsample_gpair(dh::dvec<bst_gpair>* p_gpair, float subsample,
-                            int offset = 0) {
-  if (subsample == 1.0) {
-    return;
-  }
-
-  dh::dvec<bst_gpair>& gpair = *p_gpair;
-
-  auto d_gpair = gpair.data();
-  BernoulliRng rng(subsample, common::GlobalRandom()());
-
-  dh::launch_n(gpair.device_idx(), gpair.size(), [=] __device__(int i) {
-    if (!rng(i + offset)) {
-      d_gpair[i] = bst_gpair();
-    }
-  });
-}
-
-inline std::vector<int> col_sample(std::vector<int> features, float colsample) {
-  CHECK_GT(features.size(), 0);
-  int n = std::max(1, static_cast<int>(colsample * features.size()));
-
-  std::shuffle(features.begin(), features.end(), common::GlobalRandom());
-  features.resize(n);
-
-  return features;
-}
-}  // namespace tree
-}  // namespace xgboost
--- a/plugin/updater_gpu/src/updater_gpu_hist.cu
+++ b/plugin/updater_gpu/src/updater_gpu_hist.cu
--- a/plugin/updater_gpu/test/cpp/test_device_helpers.cu
+++ b/plugin/updater_gpu/test/cpp/test_device_helpers.cu
@@ -1,78 +0,0 @@
-
-/*!
- * Copyright 2017 XGBoost contributors
- */
-#include <thrust/device_vector.h>
-#include <xgboost/base.h>
-#include "../../src/device_helpers.cuh"
-#include "gtest/gtest.h"
-
-void CreateTestData(xgboost::bst_uint num_rows, int max_row_size,
-                    thrust::host_vector<int> *row_ptr,
-                    thrust::host_vector<xgboost::bst_uint> *rows) {
-  row_ptr->resize(num_rows + 1);
-  int sum = 0;
-  for (int i = 0; i <= num_rows; i++) {
-    (*row_ptr)[i] = sum;
-    sum += rand() % max_row_size;  // NOLINT
-
-    if (i < num_rows) {
-      for (int j = (*row_ptr)[i]; j < sum; j++) {
-        (*rows).push_back(i);
-      }
-    }
-  }
-}
-
-void SpeedTest() {
-  int num_rows = 1000000;
-  int max_row_size = 100;
-  dh::CubMemory temp_memory;
-  thrust::host_vector<int> h_row_ptr;
-  thrust::host_vector<xgboost::bst_uint> h_rows;
-  CreateTestData(num_rows, max_row_size, &h_row_ptr, &h_rows);
-  thrust::device_vector<int> row_ptr = h_row_ptr;
-  thrust::device_vector<int> output_row(h_rows.size());
-  auto d_output_row = output_row.data();
-
-  dh::Timer t;
-  dh::TransformLbs(
-      0, &temp_memory, h_rows.size(), dh::raw(row_ptr), row_ptr.size() - 1, false,
-      [=] __device__(size_t idx, size_t ridx) { d_output_row[idx] = ridx; });
-
-  dh::safe_cuda(cudaDeviceSynchronize());
-  double time = t.elapsedSeconds();
-  const int mb_size = 1048576;
-  size_t size = (sizeof(int) * h_rows.size()) / mb_size;
-  printf("size: %llumb, time: %fs, bandwidth: %fmb/s\n", size, time,
-         size / time);
-}
-
-void TestLbs() {
-  srand(17);
-  dh::CubMemory temp_memory;
-
-  std::vector<int> test_rows = {4, 100, 1000};
-  std::vector<int> test_max_row_sizes = {4, 100, 1300};
-
-  for (auto num_rows : test_rows) {
-    for (auto max_row_size : test_max_row_sizes) {
-      thrust::host_vector<int> h_row_ptr;
-      thrust::host_vector<xgboost::bst_uint> h_rows;
-      CreateTestData(num_rows, max_row_size, &h_row_ptr, &h_rows);
-      thrust::device_vector<size_t> row_ptr = h_row_ptr;
-      thrust::device_vector<int> output_row(h_rows.size());
-      auto d_output_row = output_row.data();
-
-      dh::TransformLbs(0, &temp_memory, h_rows.size(), dh::raw(row_ptr),
-                       row_ptr.size() - 1, false,
-                       [=] __device__(size_t idx, size_t ridx) {
-                         d_output_row[idx] = ridx;
-                       });
-
-      dh::safe_cuda(cudaDeviceSynchronize());
-      ASSERT_TRUE(h_rows == output_row);
-    }
-  }
-}
-TEST(cub_lbs, Test) { TestLbs(); }
--- a/plugin/updater_gpu/test/cpp/test_gpu_predictor.cu
+++ b/plugin/updater_gpu/test/cpp/test_gpu_predictor.cu
@@ -1,73 +0,0 @@
-
-/*!
- * Copyright 2017 XGBoost contributors
- */
-#include <xgboost/c_api.h>
-#include <xgboost/predictor.h>
-#include "gtest/gtest.h"
-#include "../../../../tests/cpp/helpers.h"
-
-namespace xgboost {
-namespace predictor {
-TEST(gpu_predictor, Test) {
-  std::unique_ptr<Predictor> gpu_predictor =
-      std::unique_ptr<Predictor>(Predictor::Create("gpu_predictor"));
-  std::unique_ptr<Predictor> cpu_predictor =
-      std::unique_ptr<Predictor>(Predictor::Create("cpu_predictor"));
-
-  std::vector<std::unique_ptr<RegTree>> trees;
-  trees.push_back(std::make_unique<RegTree>());
-  trees.back()->InitModel();
-  (*trees.back())[0].set_leaf(1.5f);
-  gbm::GBTreeModel model(0.5);
-  model.CommitModel(std::move(trees), 0);
-  model.param.num_output_group = 1;
-
-  int n_row = 5;
-  int n_col = 5;
-
-  auto dmat = CreateDMatrix(n_row, n_col, 0);
-
-  // Test predict batch
-  std::vector<float> gpu_out_predictions;
-  std::vector<float> cpu_out_predictions;
-  gpu_predictor->PredictBatch(dmat.get(), &gpu_out_predictions, model, 0);
-  cpu_predictor->PredictBatch(dmat.get(), &cpu_out_predictions, model, 0);
-  float abs_tolerance = 0.001;
-  for (int i = 0; i < gpu_out_predictions.size(); i++) {
-    ASSERT_LT(std::abs(gpu_out_predictions[i] - cpu_out_predictions[i]),
-              abs_tolerance);
-  }
-
-  // Test predict instance
-  auto batch = dmat->RowIterator()->Value();
-  for (int i = 0; i < batch.size; i++) {
-    std::vector<float> gpu_instance_out_predictions;
-    std::vector<float> cpu_instance_out_predictions;
-    cpu_predictor->PredictInstance(batch[i], &cpu_instance_out_predictions,
-                                   model);
-    gpu_predictor->PredictInstance(batch[i], &gpu_instance_out_predictions,
-                                   model);
-    ASSERT_EQ(gpu_instance_out_predictions[0], cpu_instance_out_predictions[0]);
-  }
-
-  // Test predict leaf
-  std::vector<float> gpu_leaf_out_predictions;
-  std::vector<float> cpu_leaf_out_predictions;
-  cpu_predictor->PredictLeaf(dmat.get(), &cpu_leaf_out_predictions, model);
-  gpu_predictor->PredictLeaf(dmat.get(), &gpu_leaf_out_predictions, model);
-  for (int i = 0; i < gpu_leaf_out_predictions.size(); i++) {
-    ASSERT_EQ(gpu_leaf_out_predictions[i], cpu_leaf_out_predictions[i]);
-  }
-
-  // Test predict contribution
-  std::vector<float> gpu_out_contribution;
-  std::vector<float> cpu_out_contribution;
-  cpu_predictor->PredictContribution(dmat.get(), &cpu_out_contribution, model);
-  gpu_predictor->PredictContribution(dmat.get(), &gpu_out_contribution, model);
-  for (int i = 0; i < gpu_out_contribution.size(); i++) {
-    ASSERT_EQ(gpu_out_contribution[i], cpu_out_contribution[i]);
-  }
-}
-}  // namespace predictor
-}  // namespace xgboost
--- a/plugin/updater_gpu/test/python/test.py
+++ b/plugin/updater_gpu/test/python/test.py
@@ -1,319 +0,0 @@
-from __future__ import print_function
-#pylint: skip-file
-import sys
-sys.path.append("../../tests/python")
-import xgboost as xgb
-import testing as tm
-import numpy as np
-import unittest
-
-rng = np.random.RandomState(1994)
-
-dpath = '../../demo/data/'
-ag_dtrain = xgb.DMatrix(dpath + 'agaricus.txt.train')
-ag_dtest = xgb.DMatrix(dpath + 'agaricus.txt.test')
-
-def eprint(*args, **kwargs):
-    print(*args, file=sys.stderr, **kwargs)
-    print(*args, file=sys.stdout, **kwargs)
-        
-
-class TestGPU(unittest.TestCase):
-    def test_grow_gpu(self):
-        tm._skip_if_no_sklearn()
-        from sklearn.datasets import load_digits
-        try:
-            from sklearn.model_selection import train_test_split
-        except:
-            from sklearn.cross_validation import train_test_split
-
-        ag_param = {'max_depth': 2,
-                    'tree_method': 'exact',
-                    'nthread': 0,
-                    'eta': 1,
-                    'silent': 1,
-                    'debug_verbose': 0,
-                    'objective': 'binary:logistic',
-                    'eval_metric': 'auc'}
-        ag_param2 = {'max_depth': 2,
-                     'tree_method': 'gpu_exact',
-                     'nthread': 0,
-                     'eta': 1,
-                     'silent': 1,
-                     'debug_verbose': 0,
-                     'objective': 'binary:logistic',
-                     'eval_metric': 'auc'}
-        ag_res = {}
-        ag_res2 = {}
-
-        num_rounds = 10
-        xgb.train(ag_param, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
-                  evals_result=ag_res)
-        xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
-                  evals_result=ag_res2)
-        assert ag_res['train']['auc'] == ag_res2['train']['auc']
-        assert ag_res['test']['auc'] == ag_res2['test']['auc']
-
-        digits = load_digits(2)
-        X = digits['data']
-        y = digits['target']
-        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
-        dtrain = xgb.DMatrix(X_train, y_train)
-        dtest = xgb.DMatrix(X_test, y_test)
-
-        param = {'objective': 'binary:logistic',
-                 'nthread': 0,
-                 'tree_method': 'gpu_exact',
-                 'max_depth': 3,
-                 'debug_verbose': 0,
-                 'eval_metric': 'auc'}
-        res = {}
-        xgb.train(param, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
-                  evals_result=res)
-        assert self.non_decreasing(res['train']['auc'])
-        assert self.non_decreasing(res['test']['auc'])
-
-        # fail-safe test for dense data
-        from sklearn.datasets import load_svmlight_file
-        X2, y2 = load_svmlight_file(dpath + 'agaricus.txt.train')
-        X2 = X2.toarray()
-        dtrain2 = xgb.DMatrix(X2, label=y2)
-
-        param = {'objective': 'binary:logistic',
-                 'nthread': 0,
-                 'tree_method': 'gpu_exact',
-                 'max_depth': 2,
-                 'debug_verbose': 0,
-                 'eval_metric': 'auc'}
-        res = {}
-        xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
-
-        assert self.non_decreasing(res['train']['auc'])
-        assert res['train']['auc'][0] >= 0.85
-
-        for j in range(X2.shape[1]):
-            for i in rng.choice(X2.shape[0], size=num_rounds, replace=False):
-                X2[i, j] = 2
-
-        dtrain3 = xgb.DMatrix(X2, label=y2)
-        res = {}
-
-        xgb.train(param, dtrain3, num_rounds, [(dtrain3, 'train')], evals_result=res)
-
-        assert self.non_decreasing(res['train']['auc'])
-        assert res['train']['auc'][0] >= 0.85
-
-        for j in range(X2.shape[1]):
-            for i in np.random.choice(X2.shape[0], size=num_rounds, replace=False):
-                X2[i, j] = 3
-
-        dtrain4 = xgb.DMatrix(X2, label=y2)
-        res = {}
-        xgb.train(param, dtrain4, num_rounds, [(dtrain4, 'train')], evals_result=res)
-        assert self.non_decreasing(res['train']['auc'])
-        assert res['train']['auc'][0] >= 0.85
-
-        
-    def test_grow_gpu_hist(self):
-        n_gpus=-1
-        tm._skip_if_no_sklearn()
-        from sklearn.datasets import load_digits
-        try:
-            from sklearn.model_selection import train_test_split
-        except:
-            from sklearn.cross_validation import train_test_split
-
-        for max_depth in range(3,10): # TODO: Doesn't work with 2 for some tests
-            #eprint("max_depth=%d" % (max_depth))
-            
-            for max_bin_i in range(3,11):
-                max_bin = np.power(2,max_bin_i)
-                #eprint("max_bin=%d" % (max_bin))
-
-                
-            
-                # regression test --- hist must be same as exact on all-categorial data
-                ag_param = {'max_depth': max_depth,
-                            'tree_method': 'exact',
-                            'nthread': 0,
-                            'eta': 1,
-                            'silent': 1,
-                            'debug_verbose': 0,
-                            'objective': 'binary:logistic',
-                            'eval_metric': 'auc'}
-                ag_param2 = {'max_depth': max_depth,
-                             'nthread': 0,
-                             'tree_method': 'gpu_hist',
-                             'eta': 1,
-                             'silent': 1,
-                             'debug_verbose': 0,
-                             'n_gpus': 1,
-                             'objective': 'binary:logistic',
-                             'max_bin': max_bin,
-                             'eval_metric': 'auc'}
-                ag_param3 = {'max_depth': max_depth,
-                             'nthread': 0,
-                             'tree_method': 'gpu_hist',
-                             'eta': 1,
-                             'silent': 1,
-                             'debug_verbose': 0,
-                             'n_gpus': n_gpus,
-                             'objective': 'binary:logistic',
-                             'max_bin': max_bin,
-                             'eval_metric': 'auc'}
-                ag_res = {}
-                ag_res2 = {}
-                ag_res3 = {}
-
-                num_rounds = 10
-                #eprint("normal updater");
-                xgb.train(ag_param, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
-                          evals_result=ag_res)
-                #eprint("grow_gpu_hist updater 1 gpu");
-                xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
-                          evals_result=ag_res2)
-                #eprint("grow_gpu_hist updater %d gpus" % (n_gpus));
-                xgb.train(ag_param3, ag_dtrain, num_rounds, [(ag_dtrain, 'train'), (ag_dtest, 'test')],
-                          evals_result=ag_res3)
-                #        assert 1==0
-                assert ag_res['train']['auc'] == ag_res2['train']['auc']
-                assert ag_res['test']['auc'] == ag_res2['test']['auc']
-                assert ag_res['test']['auc'] == ag_res3['test']['auc']
-
-                ######################################################################
-                digits = load_digits(2)
-                X = digits['data']
-                y = digits['target']
-                X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
-                dtrain = xgb.DMatrix(X_train, y_train)
-                dtest = xgb.DMatrix(X_test, y_test)
-
-                param = {'objective': 'binary:logistic',
-                         'tree_method': 'gpu_hist',
-                         'nthread': 0,
-                         'max_depth': max_depth,
-                         'n_gpus': 1,
-                         'max_bin': max_bin,
-                         'debug_verbose': 0,
-                         'eval_metric': 'auc'}
-                res = {}
-                #eprint("digits: grow_gpu_hist updater 1 gpu");
-                xgb.train(param, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
-                          evals_result=res)
-                assert self.non_decreasing(res['train']['auc'])
-                #assert self.non_decreasing(res['test']['auc'])
-                param2 = {'objective': 'binary:logistic',
-                          'nthread': 0,
-                          'tree_method': 'gpu_hist',
-                          'max_depth': max_depth,
-                          'n_gpus': n_gpus,
-                          'max_bin': max_bin,
-                          'debug_verbose': 0,
-                          'eval_metric': 'auc'}
-                res2 = {}
-                #eprint("digits: grow_gpu_hist updater %d gpus" % (n_gpus));
-                xgb.train(param2, dtrain, num_rounds, [(dtrain, 'train'), (dtest, 'test')],
-                          evals_result=res2)
-                assert self.non_decreasing(res2['train']['auc'])
-                #assert self.non_decreasing(res2['test']['auc'])
-                assert res['train']['auc'] == res2['train']['auc']
-                #assert res['test']['auc'] == res2['test']['auc']
-
-                ######################################################################
-                # fail-safe test for dense data
-                from sklearn.datasets import load_svmlight_file
-                X2, y2 = load_svmlight_file(dpath + 'agaricus.txt.train')
-                X2 = X2.toarray()
-                dtrain2 = xgb.DMatrix(X2, label=y2)
-
-                param = {'objective': 'binary:logistic',
-                         'nthread': 0,
-                         'tree_method': 'gpu_hist',
-                         'max_depth': max_depth,
-                         'n_gpus': n_gpus,
-                         'max_bin': max_bin,
-                         'debug_verbose': 0,
-                         'eval_metric': 'auc'}
-                res = {}
-                xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
-
-                assert self.non_decreasing(res['train']['auc'])
-                if max_bin>32:
-                    assert res['train']['auc'][0] >= 0.85
-
-                for j in range(X2.shape[1]):
-                    for i in rng.choice(X2.shape[0], size=num_rounds, replace=False):
-                        X2[i, j] = 2
-
-                dtrain3 = xgb.DMatrix(X2, label=y2)
-                res = {}
-
-                xgb.train(param, dtrain3, num_rounds, [(dtrain3, 'train')], evals_result=res)
-
-                assert self.non_decreasing(res['train']['auc'])
-                if max_bin>32:
-                    assert res['train']['auc'][0] >= 0.85
-
-                for j in range(X2.shape[1]):
-                    for i in np.random.choice(X2.shape[0], size=num_rounds, replace=False):
-                        X2[i, j] = 3
-
-                dtrain4 = xgb.DMatrix(X2, label=y2)
-                res = {}
-                xgb.train(param, dtrain4, num_rounds, [(dtrain4, 'train')], evals_result=res)
-                assert self.non_decreasing(res['train']['auc'])
-                if max_bin>32:
-                    assert res['train']['auc'][0] >= 0.85
-
-                ######################################################################
-                # fail-safe test for max_bin
-                param = {'objective': 'binary:logistic',
-                         'nthread': 0,
-                         'tree_method': 'gpu_hist',
-                         'max_depth': max_depth,
-                         'n_gpus': n_gpus,
-                         'debug_verbose': 0,
-                         'eval_metric': 'auc',
-                         'max_bin': max_bin}
-                res = {}
-                xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
-                assert self.non_decreasing(res['train']['auc'])
-                if max_bin>32:
-                    assert res['train']['auc'][0] >= 0.85
-                ######################################################################
-                # subsampling
-                param = {'objective': 'binary:logistic',
-                         'nthread': 0,
-                         'tree_method': 'gpu_hist',
-                         'max_depth': max_depth,
-                         'n_gpus': n_gpus,
-                         'eval_metric': 'auc',
-                         'colsample_bytree': 0.5,
-                         'colsample_bylevel': 0.5,
-                         'subsample': 0.5,
-                         'debug_verbose': 0,
-                         'max_bin': max_bin}
-                res = {}
-                xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
-                assert self.non_decreasing(res['train']['auc'])
-                if max_bin>32:
-                    assert res['train']['auc'][0] >= 0.85
-        ######################################################################
-        # fail-safe test for max_bin=2
-        param = {'objective': 'binary:logistic',
-                 'nthread': 0,
-                 'tree_method': 'gpu_hist',
-                 'max_depth': 2,
-                 'n_gpus': n_gpus,
-                 'debug_verbose': 0,
-                 'eval_metric': 'auc',
-                 'max_bin': 2}
-        res = {}
-        xgb.train(param, dtrain2, num_rounds, [(dtrain2, 'train')], evals_result=res)
-        assert self.non_decreasing(res['train']['auc'])
-        if max_bin>32:
-            assert res['train']['auc'][0] >= 0.85
-        
-        
-    def non_decreasing(self, L):
-            return all((x - y) < 0.001 for x, y in zip(L, L[1:]))
--- a/plugin/updater_gpu/test/python/test_large.py
+++ b/plugin/updater_gpu/test/python/test_large.py
@@ -1,112 +0,0 @@
-from __future__ import print_function
-#pylint: skip-file
-import sys
-import time
-sys.path.append("../../tests/python")
-import xgboost as xgb
-import testing as tm
-import numpy as np
-import unittest
-from sklearn.datasets import make_classification
-def eprint(*args, **kwargs):
-    print(*args, file=sys.stderr, **kwargs) ; sys.stderr.flush()
-    print(*args, file=sys.stdout, **kwargs) ; sys.stdout.flush()
-
-eprint("Testing Big Data (this may take a while)")
-
-rng = np.random.RandomState(1994)
-
-# "realistic" size based upon http://stat-computing.org/dataexpo/2009/ , which has been processed to one-hot encode categoricalsxsy
-cols = 31
-# reduced to fit onto 1 gpu but still be large
-rows3 = 5000 # small
-rows2 = 4360032 # medium
-rows1 = 42360032 # large
-#rows1 = 152360032 # can do this for multi-gpu test (very large)
-rowslist = [rows1, rows2, rows3]
-
-
-class TestGPU(unittest.TestCase):
-    def test_large(self):
-        eprint("Starting test for large data")
-        tm._skip_if_no_sklearn()
-
-        for rows in rowslist:
-            
-            eprint("Creating train data rows=%d cols=%d" % (rows,cols))
-            tmp = time.time()
-            np.random.seed(7)
-            X = np.random.rand(rows, cols)
-            y = np.random.rand(rows)
-            print("Time to Create Data: %r" % (time.time() - tmp))
-
-            eprint("Starting DMatrix(X,y)")
-            tmp = time.time()
-            ag_dtrain = xgb.DMatrix(X,y,nthread=40)
-            print("Time to DMatrix: %r" % (time.time() - tmp))
-
-            max_depth=6
-            max_bin=1024
-                
-            # regression test --- hist must be same as exact on all-categorial data
-            ag_param = {'max_depth': max_depth,
-                        'tree_method': 'exact',
-                        'nthread': 0,
-                        'eta': 1,
-                        'silent': 0,
-                        'debug_verbose': 5,
-                        'objective': 'binary:logistic',
-                        'eval_metric': 'auc'}
-            ag_paramb = {'max_depth': max_depth,
-                        'tree_method': 'hist',
-                        'nthread': 0,
-                        'eta': 1,
-                        'silent': 0,
-                        'debug_verbose': 5,
-                        'objective': 'binary:logistic',
-                        'eval_metric': 'auc'}
-            ag_param2 = {'max_depth': max_depth,
-                        'tree_method': 'gpu_hist',
-                        'nthread': 0,
-                        'eta': 1,
-                        'silent': 0,
-                        'debug_verbose': 5,
-                        'n_gpus': 1,
-                        'objective': 'binary:logistic',
-                        'max_bin': max_bin,
-                        'eval_metric': 'auc'}
-            ag_param3 = {'max_depth': max_depth,
-                         'tree_method': 'gpu_hist',
-                         'nthread': 0,
-                         'eta': 1,
-                         'silent': 0,
-                         'debug_verbose': 5,
-                         'n_gpus': -1,
-                         'objective': 'binary:logistic',
-                         'max_bin': max_bin,
-                         'eval_metric': 'auc'}
-            ag_res = {}
-            ag_resb = {}
-            ag_res2 = {}
-            ag_res3 = {}
-
-            num_rounds = 1
-            tmp = time.time()
-            #eprint("hist updater")
-            #xgb.train(ag_paramb, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
-            #          evals_result=ag_resb)
-            #print("Time to Train: %s seconds" % (str(time.time() - tmp)))
-
-            tmp = time.time()
-            eprint("gpu_hist updater 1 gpu")
-            xgb.train(ag_param2, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
-                      evals_result=ag_res2)
-            print("Time to Train: %s seconds" % (str(time.time() - tmp)))
-
-            tmp = time.time()
-            eprint("gpu_hist updater all gpus")
-            xgb.train(ag_param3, ag_dtrain, num_rounds, [(ag_dtrain, 'train')],
-                      evals_result=ag_res3)
-            print("Time to Train: %s seconds" % (str(time.time() - tmp)))
-
-    
--- a/plugin/updater_gpu/test/python/test_prediction.py
+++ b/plugin/updater_gpu/test/python/test_prediction.py
@@ -1,37 +0,0 @@
-from __future__ import print_function
-#pylint: skip-file
-import sys
-sys.path.append("../../tests/python")
-import xgboost as xgb
-import testing as tm
-import numpy as np
-import unittest
-
-rng = np.random.RandomState(1994)
-
-
-class TestGPUPredict (unittest.TestCase):
-    def test_predict(self):
-        iterations = 1
-        np.random.seed(1)
-        test_num_rows = [10,1000,5000]
-        test_num_cols = [10,50,500]
-        for num_rows in test_num_rows:
-            for num_cols in test_num_cols:
-                dm = xgb.DMatrix(np.random.randn(num_rows, num_cols), label=[0, 1] * int(num_rows/2))
-                watchlist = [(dm, 'train')]
-                res = {}
-                param = {
-                    "objective":"binary:logistic",
-                    "predictor":"gpu_predictor",
-                    'eval_metric': 'auc',
-                }
-                bst = xgb.train(param, dm,iterations,evals=watchlist, evals_result=res)
-                assert self.non_decreasing(res["train"]["auc"])
-                gpu_pred = bst.predict(dm, output_margin=True)
-                bst.set_param({"predictor":"cpu_predictor"})
-                cpu_pred = bst.predict(dm, output_margin=True)
-                np.testing.assert_allclose(cpu_pred, gpu_pred, rtol=1e-5)
-
-    def non_decreasing(self, L):
-        return all((x - y) < 0.001 for x, y in zip(L, L[1:]))