Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategies
* Add a brand-new updater to support the histogram-based algorithm, which buckets
continuous features into discrete bins to speed up training. To use it, set
`tree_method = fast_hist` in the configuration. (A sketch of the core idea appears
at the end of this summary.)
* Support multiple tree growing strategies. For now, two policies are supported:
* `grow_policy=depthwise` (default): favor splitting at nodes closest to the
root, i.e. grow depth-wise.
* `grow_policy=lossguide`: favor splitting at nodes with the highest loss change (see the sketch below)
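
A hedged sketch of the lossguide ordering (the entry and comparator names are illustrative, not the updater's actual ones); it also shows the timestamp tie-breaking mentioned further down:

```cpp
// Illustrative only: expand the candidate leaf with the largest loss reduction
// first; break ties by insertion timestamp so the result is deterministic.
#include <queue>
#include <vector>

struct ExpandEntry {
  int nid;             // candidate leaf
  double loss_chg;     // loss reduction of its best split
  unsigned timestamp;  // insertion order
};

struct LossGuideOrder {
  bool operator()(const ExpandEntry& a, const ExpandEntry& b) const {
    if (a.loss_chg == b.loss_chg) return a.timestamp > b.timestamp;  // earlier wins ties
    return a.loss_chg < b.loss_chg;  // larger loss reduction expanded first
  }
};

typedef std::priority_queue<ExpandEntry, std::vector<ExpandEntry>, LossGuideOrder>
    ExpandQueue;
```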
* Improve single-threaded performance
* Unroll critical loops (pattern sketched below)
* Introduce specialized code for dense data (i.e. no missing values)
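
A minimal sketch of the unrolling pattern, a simplified stand-in for the K=8 loop in `GHistBuilder::BuildHist` below (not the actual code):

```cpp
// Illustrative only: process rows in groups of K with a fixed-count inner loop
// the compiler can unroll, then handle the remainder with a scalar loop.
#include <cstddef>
#include <vector>

void AddToHist(const std::vector<unsigned>& bin_of_row,
               const std::vector<double>& grad,
               std::vector<double>* hist) {
  const int K = 8;  // unrolling factor
  const std::size_t n = bin_of_row.size();
  const std::size_t rest = n % K;
  for (std::size_t i = 0; i < n - rest; i += K) {
    for (int k = 0; k < K; ++k) {  // fixed trip count per group
      (*hist)[bin_of_row[i + k]] += grad[i + k];
    }
  }
  for (std::size_t i = n - rest; i < n; ++i) {  // leftover rows
    (*hist)[bin_of_row[i]] += grad[i];
  }
}
```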
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`
* Adding a small test for hist method
* Fix memory error in row_set.h
When std::vector is resized, a reference to one of its elements may become
stale. Any such reference must be refreshed after the resize. (See the sketch below.)
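
A minimal illustration of the hazard (illustrative only, not the actual row_set.h code):

```cpp
// A pointer/reference into a std::vector is invalidated when the vector
// reallocates during resize(); it must be re-taken afterwards.
#include <cassert>
#include <vector>

int main() {
  std::vector<int> v = {1, 2, 3};
  int* first = &v[0];   // pointer into the vector's storage
  v.resize(1024);       // may reallocate; 'first' is now stale
  first = &v[0];        // fix: re-acquire the pointer after resizing
  assert(*first == 1);
  return 0;
}
```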
* Resolve cross-platform compilation issues
* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
alignas(*) and the new initializer syntax. To support g++ 4.6, use pre-C++11
initializers and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
(which use `using` to declare type aliases), so always use `typedef`.
(See the sketch below.)
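
A hedged sketch of the portability pattern described above; the `XGBOOST_ALIGNAS` macro name is hypothetical and only stands in for whatever guard the code actually uses:

```cpp
// Illustrative only: enable alignas() only where the compiler supports it,
// and prefer typedef over template aliases for old compilers.
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))
  #define XGBOOST_ALIGNAS(n) alignas(n)  // g++ 4.8+: keep the alignment hint
#else
  #define XGBOOST_ALIGNAS(n)             // older g++/MSVC: drop it
#endif

// pre-C++11-friendly type alias: typedef instead of a template alias
typedef unsigned int bin_id_t;

struct XGBOOST_ALIGNAS(64) PaddedCounter {
  double value;
};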
* Fix a host of CI issues
* Remove dependency on libz on OS X
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging
* Enable tree_method=hist in R
* Rename HistMaker to GHistBuilder to avoid confusion
* Fix R integration
* Respond to style comments
* Consistent tie-breaking for priority queue using timestamps
* Last-minute style fixes
* Fix issuecomment-271977647
The way we quantized data was broken. The agaricus data consists entirely of
categorical values. When NAs are converted into 0's, `HistCutMatrix::Init`
assigns both 0's and 1's to the same single bin. Why? gmat contained only the
smallest value (0) and an upper bound (2), which is twice the maximum value (1).
Add the maximum value itself to gmat to fix the issue.
* Fix issuecomment-272266358
* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe
(see the sketch below)
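
To illustrate both fixes, a minimal stand-alone sketch of the bin lookup (a simplified stand-in; the real logic lives in `HistCutMatrix::Init` and `GHistIndexMatrix::Init` below):

```cpp
// Illustrative only: why the old cut points collapsed 0 and 1 into one bin,
// and how a midpoint boundary separates them.
#include <algorithm>
#include <cassert>
#include <vector>

// Mirrors the bin search in GHistIndexMatrix::Init: first cut greater than v,
// clamped to the last cut.
unsigned BinIndex(const std::vector<float>& cuts, float v) {
  auto it = std::upper_bound(cuts.begin(), cuts.end(), v);
  if (it == cuts.end()) it = cuts.end() - 1;
  return static_cast<unsigned>(it - cuts.begin());
}

int main() {
  // Old scheme for a {0, 1} feature: only an upper bound of 2 (twice the max).
  std::vector<float> old_cuts = {2.0f};
  assert(BinIndex(old_cuts, 0.0f) == BinIndex(old_cuts, 1.0f));  // same bin: broken

  // Fixed scheme: midpoint 0.5 as a boundary, plus a value beyond the maximum.
  std::vector<float> new_cuts = {0.5f, 2.0f};
  assert(BinIndex(new_cuts, 0.0f) != BinIndex(new_cuts, 1.0f));  // separate bins
  return 0;
}
```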
* Fix CI issue -- do not use xrange(*)
* Fix corner case in quantile sketch
Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>
* Add a test for an edge case in the quantile sketcher
max_bin=2 used to cause an exception.
* Fix fast_hist test
The test used to require a strictly increasing Test AUC for all examples.
One of them exhibited a small blip in Test AUC before reaching a Test AUC
of 1. (See the log below.)
Solution: do not require a monotonic increase for this particular example.
[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
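
As referenced at the top of this summary, a minimal stand-alone sketch of the histogram-based split search that `fast_hist` implements (toy data, one feature, single thread; not the updater's actual code):

```cpp
// Illustrative only: bucket rows into bins, accumulate per-bin gradient/hessian
// sums in one pass, then scan bin boundaries for the best split.
#include <cstdio>
#include <vector>

struct GradPair { double grad, hess; };

int main() {
  // toy data: one feature, pre-binned into 4 bins, with per-row gradient stats
  std::vector<unsigned> bin_of_row = {0, 0, 1, 2, 2, 3};
  std::vector<GradPair> gpair = {{-1, 1}, {-0.5, 1}, {0.2, 1}, {0.8, 1}, {1.0, 1}, {1.5, 1}};
  const unsigned nbins = 4;
  const double lambda = 1.0;  // L2 regularization

  // build the histogram: one pass over rows instead of one pass per split candidate
  std::vector<GradPair> hist(nbins, GradPair{0.0, 0.0});
  for (size_t i = 0; i < bin_of_row.size(); ++i) {
    hist[bin_of_row[i]].grad += gpair[i].grad;
    hist[bin_of_row[i]].hess += gpair[i].hess;
  }

  // scan bins left-to-right; each bin boundary is a split candidate
  GradPair total{0, 0};
  for (const GradPair& h : hist) { total.grad += h.grad; total.hess += h.hess; }
  auto score = [&](const GradPair& s) { return s.grad * s.grad / (s.hess + lambda); };
  GradPair left{0, 0};
  double best_gain = 0; unsigned best_bin = 0;
  for (unsigned b = 0; b + 1 < nbins; ++b) {
    left.grad += hist[b].grad; left.hess += hist[b].hess;
    GradPair right{total.grad - left.grad, total.hess - left.hess};
    double gain = score(left) + score(right) - score(total);
    if (gain > best_gain) { best_gain = gain; best_bin = b; }
  }
  std::printf("best split after bin %u, gain %.3f\n", best_bin, best_gain);
  return 0;
}
```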
src/common/hist_util.cc (new file, 227 lines)
@@ -0,0 +1,227 @@
/*!
 * Copyright 2017 by Contributors
 * \file hist_util.h
 * \brief Utilities to store histograms
 * \author Philip Cho, Tianqi Chen
 */
#include <dmlc/omp.h>
#include <vector>
#include "./sync.h"
#include "./hist_util.h"
#include "./quantile.h"

namespace xgboost {
namespace common {

void HistCutMatrix::Init(DMatrix* p_fmat, size_t max_num_bins) {
  typedef common::WXQuantileSketch<bst_float, bst_float> WXQSketch;
  const MetaInfo& info = p_fmat->info();

  // safe factor for better accuracy
  const int kFactor = 8;
  std::vector<WXQSketch> sketchs;

  int nthread;
  #pragma omp parallel
  {
    nthread = omp_get_num_threads();
  }
  nthread = std::max(nthread / 2, 1);

  unsigned nstep = (info.num_col + nthread - 1) / nthread;
  unsigned ncol = static_cast<unsigned>(info.num_col);
  sketchs.resize(info.num_col);
  for (auto& s : sketchs) {
    s.Init(info.num_row, 1.0 / (max_num_bins * kFactor));
  }

  dmlc::DataIter<RowBatch>* iter = p_fmat->RowIterator();
  iter->BeforeFirst();
  while (iter->Next()) {
    const RowBatch& batch = iter->Value();
    #pragma omp parallel num_threads(nthread)
    {
      CHECK_EQ(nthread, omp_get_num_threads());
      unsigned tid = static_cast<unsigned>(omp_get_thread_num());
      unsigned begin = std::min(nstep * tid, ncol);
      unsigned end = std::min(nstep * (tid + 1), ncol);
      for (size_t i = 0; i < batch.size; ++i) {  // NOLINT(*)
        bst_uint ridx = static_cast<bst_uint>(batch.base_rowid + i);
        RowBatch::Inst inst = batch[i];
        for (bst_uint j = 0; j < inst.length; ++j) {
          if (inst[j].index >= begin && inst[j].index < end) {
            sketchs[inst[j].index].Push(inst[j].fvalue, info.GetWeight(ridx));
          }
        }
      }
    }
  }

  // gather the histogram data
  rabit::SerializeReducer<WXQSketch::SummaryContainer> sreducer;
  std::vector<WXQSketch::SummaryContainer> summary_array;
  summary_array.resize(sketchs.size());
  for (size_t i = 0; i < sketchs.size(); ++i) {
    WXQSketch::SummaryContainer out;
    sketchs[i].GetSummary(&out);
    summary_array[i].Reserve(max_num_bins * kFactor);
    summary_array[i].SetPrune(out, max_num_bins * kFactor);
  }
  size_t nbytes = WXQSketch::SummaryContainer::CalcMemCost(max_num_bins * kFactor);
  sreducer.Allreduce(dmlc::BeginPtr(summary_array), nbytes, summary_array.size());

  this->min_val.resize(info.num_col);
  row_ptr.push_back(0);
  for (size_t fid = 0; fid < summary_array.size(); ++fid) {
    WXQSketch::SummaryContainer a;
    a.Reserve(max_num_bins);
    a.SetPrune(summary_array[fid], max_num_bins);
    const bst_float mval = a.data[0].value;
    this->min_val[fid] = mval - fabs(mval);
    if (a.size > 1 && a.size <= 16) {
      /* specialized code categorial / ordinal data -- use midpoints */
      for (size_t i = 1; i < a.size; ++i) {
        bst_float cpt = (a.data[i].value + a.data[i - 1].value) / 2.0;
        if (i == 1 || cpt > cut.back()) {
          cut.push_back(cpt);
        }
      }
    } else {
      for (size_t i = 2; i < a.size; ++i) {
        bst_float cpt = a.data[i - 1].value;
        if (i == 2 || cpt > cut.back()) {
          cut.push_back(cpt);
        }
      }
    }
    // push a value that is greater than anything
    if (a.size != 0) {
      bst_float cpt = a.data[a.size - 1].value;
      // this must be bigger than last value in a scale
      bst_float last = cpt + fabs(cpt);
      cut.push_back(last);
    }
    row_ptr.push_back(cut.size());
  }
}


void GHistIndexMatrix::Init(DMatrix* p_fmat) {
  CHECK(cut != nullptr);
  dmlc::DataIter<RowBatch>* iter = p_fmat->RowIterator();
  hit_count.resize(cut->row_ptr.back(), 0);

  int nthread;
  #pragma omp parallel
  {
    nthread = omp_get_num_threads();
  }
  nthread = std::max(nthread / 2, 1);

  iter->BeforeFirst();
  row_ptr.push_back(0);
  while (iter->Next()) {
    const RowBatch& batch = iter->Value();
    size_t rbegin = row_ptr.size() - 1;
    for (size_t i = 0; i < batch.size; ++i) {
      row_ptr.push_back(batch[i].length + row_ptr.back());
    }
    index.resize(row_ptr.back());

    CHECK_GT(cut->cut.size(), 0);
    CHECK_EQ(cut->row_ptr.back(), cut->cut.size());

    omp_ulong bsize = static_cast<omp_ulong>(batch.size);
    #pragma omp parallel for num_threads(nthread) schedule(static)
    for (omp_ulong i = 0; i < bsize; ++i) {  // NOLINT(*)
      size_t ibegin = row_ptr[rbegin + i];
      size_t iend = row_ptr[rbegin + i + 1];
      RowBatch::Inst inst = batch[i];
      CHECK_EQ(ibegin + inst.length, iend);
      for (bst_uint j = 0; j < inst.length; ++j) {
        unsigned fid = inst[j].index;
        auto cbegin = cut->cut.begin() + cut->row_ptr[fid];
        auto cend = cut->cut.begin() + cut->row_ptr[fid + 1];
        CHECK(cbegin != cend);
        auto it = std::upper_bound(cbegin, cend, inst[j].fvalue);
        if (it == cend) it = cend - 1;
        unsigned idx = static_cast<unsigned>(it - cut->cut.begin());
        index[ibegin + j] = idx;
      }
      std::sort(index.begin() + ibegin, index.begin() + iend);
    }
  }
}

void GHistBuilder::BuildHist(const std::vector<bst_gpair>& gpair,
                             const RowSetCollection::Elem row_indices,
                             const GHistIndexMatrix& gmat,
                             GHistRow hist) {
  CHECK(!data_.empty()) << "GHistBuilder must be initialized";
  CHECK_EQ(data_.size(), nbins_ * nthread_) << "invalid dimensions for temp buffer";

  std::fill(data_.begin(), data_.end(), GHistEntry());

  const int K = 8;  // loop unrolling factor
  const bst_omp_uint nthread = static_cast<bst_omp_uint>(this->nthread_);
  const bst_omp_uint nrows = row_indices.end - row_indices.begin;
  const bst_omp_uint rest = nrows % K;

  #pragma omp parallel for num_threads(nthread) schedule(static)
  for (bst_omp_uint i = 0; i < nrows - rest; i += K) {
    const bst_omp_uint tid = omp_get_thread_num();
    const size_t off = tid * nbins_;
    bst_uint rid[K];
    bst_gpair stat[K];
    size_t ibegin[K], iend[K];
    for (int k = 0; k < K; ++k) {
      rid[k] = row_indices.begin[i + k];
    }
    for (int k = 0; k < K; ++k) {
      stat[k] = gpair[rid[k]];
    }
    for (int k = 0; k < K; ++k) {
      ibegin[k] = static_cast<size_t>(gmat.row_ptr[rid[k]]);
      iend[k] = static_cast<size_t>(gmat.row_ptr[rid[k] + 1]);
    }
    for (int k = 0; k < K; ++k) {
      for (size_t j = ibegin[k]; j < iend[k]; ++j) {
        const size_t bin = gmat.index[j];
        data_[off + bin].Add(stat[k]);
      }
    }
  }
  for (bst_omp_uint i = nrows - rest; i < nrows; ++i) {
    const bst_uint rid = row_indices.begin[i];
    const bst_gpair stat = gpair[rid];
    const size_t ibegin = static_cast<size_t>(gmat.row_ptr[rid]);
    const size_t iend = static_cast<size_t>(gmat.row_ptr[rid + 1]);
    for (size_t j = ibegin; j < iend; ++j) {
      const size_t bin = gmat.index[j];
      data_[bin].Add(stat);
    }
  }

  /* reduction */
  const bst_omp_uint nbins = static_cast<bst_omp_uint>(nbins_);
  #pragma omp parallel for num_threads(nthread) schedule(static)
  for (bst_omp_uint bin_id = 0; bin_id < nbins; ++bin_id) {
    for (bst_omp_uint tid = 0; tid < nthread; ++tid) {
      hist.begin[bin_id].Add(data_[tid * nbins_ + bin_id]);
    }
  }
}

void GHistBuilder::SubtractionTrick(GHistRow self,
                                    GHistRow sibling,
                                    GHistRow parent) {
  const bst_omp_uint nthread = static_cast<bst_omp_uint>(this->nthread_);
  const bst_omp_uint nbins = static_cast<bst_omp_uint>(nbins_);
  #pragma omp parallel for num_threads(nthread) schedule(static)
  for (bst_omp_uint bin_id = 0; bin_id < nbins; ++bin_id) {
    self.begin[bin_id].SetSubtract(parent.begin[bin_id], sibling.begin[bin_id]);
  }
}

}  // namespace common
}  // namespace xgboost
src/common/hist_util.h (new file, 214 lines)
@@ -0,0 +1,214 @@
/*!
 * Copyright 2017 by Contributors
 * \file hist_util.h
 * \brief Utility for fast histogram aggregation
 * \author Philip Cho, Tianqi Chen
 */
#ifndef XGBOOST_COMMON_HIST_UTIL_H_
#define XGBOOST_COMMON_HIST_UTIL_H_

#include <xgboost/data.h>
#include <limits>
#include <vector>
#include "row_set.h"

namespace xgboost {
namespace common {

/*! \brief sums of gradient statistics corresponding to a histogram bin */
struct GHistEntry {
  /*! \brief sum of first-order gradient statistics */
  double sum_grad;
  /*! \brief sum of second-order gradient statistics */
  double sum_hess;

  GHistEntry() : sum_grad(0), sum_hess(0) {}

  /*! \brief add a bst_gpair to the sum */
  inline void Add(const bst_gpair& e) {
    sum_grad += e.grad;
    sum_hess += e.hess;
  }

  /*! \brief add a GHistEntry to the sum */
  inline void Add(const GHistEntry& e) {
    sum_grad += e.sum_grad;
    sum_hess += e.sum_hess;
  }

  /*! \brief set sum to be difference of two GHistEntry's */
  inline void SetSubtract(const GHistEntry& a, const GHistEntry& b) {
    sum_grad = a.sum_grad - b.sum_grad;
    sum_hess = a.sum_hess - b.sum_hess;
  }
};


/*! \brief Cut configuration for one feature */
struct HistCutUnit {
  /*! \brief the index pointer of each histunit */
  const bst_float* cut;
  /*! \brief number of cutting point, containing the maximum point */
  size_t size;
  // default constructor
  HistCutUnit() {}
  // constructor
  HistCutUnit(const bst_float* cut, unsigned size)
      : cut(cut), size(size) {}
};

/*! \brief cut configuration for all the features */
struct HistCutMatrix {
  /*! \brief actual unit pointer */
  std::vector<unsigned> row_ptr;
  /*! \brief minimum value of each feature */
  std::vector<bst_float> min_val;
  /*! \brief the cut field */
  std::vector<bst_float> cut;
  /*! \brief Get histogram bound for fid */
  inline HistCutUnit operator[](unsigned fid) const {
    return HistCutUnit(dmlc::BeginPtr(cut) + row_ptr[fid],
                       row_ptr[fid + 1] - row_ptr[fid]);
  }
  // create histogram cut matrix given statistics from data
  // using approximate quantile sketch approach
  void Init(DMatrix* p_fmat, size_t max_num_bins);
};


/*!
 * \brief A single row in global histogram index.
 *  Directly represent the global index in the histogram entry.
 */
struct GHistIndexRow {
  /*! \brief The index of the histogram */
  const unsigned* index;
  /*! \brief The size of the histogram */
  unsigned size;
  GHistIndexRow() {}
  GHistIndexRow(const unsigned* index, unsigned size)
      : index(index), size(size) {}
};

/*!
 * \brief preprocessed global index matrix, in CSR format
 *  Transform floating values to integer index in histogram
 *  This is a global histogram index.
 */
struct GHistIndexMatrix {
  /*! \brief row pointer */
  std::vector<unsigned> row_ptr;
  /*! \brief The index data */
  std::vector<unsigned> index;
  /*! \brief hit count of each index */
  std::vector<unsigned> hit_count;
  /*! \brief optional remap index from outter row_id -> internal row_id*/
  std::vector<unsigned> remap_index;
  /*! \brief The corresponding cuts */
  const HistCutMatrix* cut;
  // Create a global histogram matrix, given cut
  void Init(DMatrix* p_fmat);
  // build remap
  void Remap();
  // get i-th row
  inline GHistIndexRow operator[](bst_uint i) const {
    return GHistIndexRow(&index[0] + row_ptr[i], row_ptr[i + 1] - row_ptr[i]);
  }
};

/*!
 * \brief histogram of graident statistics for a single node.
 *  Consists of multiple GHistEntry's, each entry showing total graident statistics
 *  for that particular bin
 *  Uses global bin id so as to represent all features simultaneously
 */
struct GHistRow {
  /*! \brief base pointer to first entry */
  GHistEntry* begin;
  /*! \brief number of entries */
  unsigned size;

  GHistRow() {}
  GHistRow(GHistEntry* begin, unsigned size)
      : begin(begin), size(size) {}
};

/*!
 * \brief histogram of gradient statistics for multiple nodes
 */
class HistCollection {
 public:
  // access histogram for i-th node
  inline GHistRow operator[](bst_uint nid) const {
    const size_t kMax = std::numeric_limits<size_t>::max();
    CHECK_NE(row_ptr_[nid], kMax);
    return GHistRow(const_cast<GHistEntry*>(dmlc::BeginPtr(data_) + row_ptr_[nid]), nbins_);
  }

  // have we computed a histogram for i-th node?
  inline bool RowExists(bst_uint nid) const {
    const size_t kMax = std::numeric_limits<size_t>::max();
    return (nid < row_ptr_.size() && row_ptr_[nid] != kMax);
  }

  // initialize histogram collection
  inline void Init(size_t nbins) {
    nbins_ = nbins;
    row_ptr_.clear();
    data_.clear();
  }

  // create an empty histogram for i-th node
  inline void AddHistRow(bst_uint nid) {
    const size_t kMax = std::numeric_limits<size_t>::max();
    if (nid >= row_ptr_.size()) {
      row_ptr_.resize(nid + 1, kMax);
    }
    CHECK_EQ(row_ptr_[nid], kMax);

    row_ptr_[nid] = data_.size();
    data_.resize(data_.size() + nbins_);
  }

 private:
  /*! \brief number of all bins over all features */
  size_t nbins_;

  std::vector<GHistEntry> data_;

  /*! \brief row_ptr_[nid] locates bin for historgram of node nid */
  std::vector<size_t> row_ptr_;
};

/*!
 * \brief builder for histograms of gradient statistics
 */
class GHistBuilder {
 public:
  // initialize builder
  inline void Init(size_t nthread, size_t nbins) {
    nthread_ = nthread;
    nbins_ = nbins;
    data_.resize(nthread * nbins, GHistEntry());
  }

  // construct a histogram via histogram aggregation
  void BuildHist(const std::vector<bst_gpair>& gpair,
                 const RowSetCollection::Elem row_indices,
                 const GHistIndexMatrix& gmat,
                 GHistRow hist);
  // construct a histogram via subtraction trick
  void SubtractionTrick(GHistRow self, GHistRow sibling, GHistRow parent);

 private:
  /*! \brief number of threads for parallel computation */
  size_t nthread_;
  /*! \brief number of all bins over all features */
  size_t nbins_;
  std::vector<GHistEntry> data_;
};


}  // namespace common
}  // namespace xgboost
#endif  // XGBOOST_COMMON_HIST_UTIL_H_
src/common/quantile.h
@@ -348,10 +348,12 @@ struct WXQSummary : public WQSummary<DType, RType> {
         this->CopyFrom(src); return;
       }
       RType begin = src.data[0].rmax;
-      size_t n = maxsize - 1, nbig = 0;
+      // n is number of points exclude the min/max points
+      size_t n = maxsize - 2, nbig = 0;
+      // these is the range of data exclude the min/max point
       RType range = src.data[src.size - 1].rmin - begin;
       // prune off zero weights
-      if (range == 0.0f) {
+      if (range == 0.0f || maxsize <= 2) {
         // special case, contain only two effective data pts
         this->data[0] = src.data[0];
         this->data[1] = src.data[src.size - 1];
@@ -360,16 +362,21 @@ struct WXQSummary : public WQSummary<DType, RType> {
       } else {
         range = std::max(range, static_cast<RType>(1e-3f));
       }
+      // Get a big enough chunk size, bigger than range / n
+      // (multiply by 2 is a safe factor)
       const RType chunk = 2 * range / n;
       // minimized range
       RType mrange = 0;
       {
         // first scan, grab all the big chunk
-        // moving block index
+        // moving block index, exclude the two ends.
         size_t bid = 0;
-        for (size_t i = 1; i < src.size; ++i) {
+        for (size_t i = 1; i < src.size - 1; ++i) {
+          // detect big chunk data point in the middle
+          // always save these data points.
           if (CheckLarge(src.data[i], chunk)) {
             if (bid != i - 1) {
+              // accumulate the range of the rest points
               mrange += src.data[i].rmax_prev() - src.data[bid].rmin_next();
             }
             bid = i; ++nbig;
@@ -379,17 +386,18 @@ struct WXQSummary : public WQSummary<DType, RType> {
           mrange += src.data[src.size-1].rmax_prev() - src.data[bid].rmin_next();
         }
       }
-      if (nbig >= n - 1) {
+      // assert: there cannot be more than n big data points
+      if (nbig >= n) {
         // see what was the case
         LOG(INFO) << " check quantile stats, nbig=" << nbig << ", n=" << n;
         LOG(INFO) << " srcsize=" << src.size << ", maxsize=" << maxsize
                   << ", range=" << range << ", chunk=" << chunk;
         src.Print();
-        CHECK(nbig < n - 1) << "quantile: too many large chunk";
+        CHECK(nbig < n) << "quantile: too many large chunk";
       }
       this->data[0] = src.data[0];
       this->size = 1;
-      // use smaller size
+      // The counter on the rest of points, to be selected equally from small chunks.
       n = n - nbig;
       // find the rest of point
       size_t bid = 0, k = 1, lastidx = 0;
src/common/row_set.h (new file, 104 lines)
@@ -0,0 +1,104 @@
/*!
 * Copyright 2017 by Contributors
 * \file row_set.h
 * \brief Quick Utility to compute subset of rows
 * \author Philip Cho, Tianqi Chen
 */
#ifndef XGBOOST_COMMON_ROW_SET_H_
#define XGBOOST_COMMON_ROW_SET_H_

#include <xgboost/data.h>
#include <algorithm>
#include <vector>

namespace xgboost {
namespace common {

/*! \brief collection of rowset */
class RowSetCollection {
 public:
  /*! \brief subset of rows */
  struct Elem {
    const bst_uint* begin;
    const bst_uint* end;
    Elem(void)
        : begin(nullptr), end(nullptr) {}
    Elem(const bst_uint* begin,
         const bst_uint* end)
        : begin(begin), end(end) {}

    inline size_t size() const {
      return end - begin;
    }
  };
  /* \brief specifies how to split a rowset into two */
  struct Split {
    std::vector<bst_uint> left;
    std::vector<bst_uint> right;
  };
  /*! \brief return corresponding element set given the node_id */
  inline const Elem& operator[](unsigned node_id) const {
    const Elem& e = elem_of_each_node_[node_id];
    CHECK(e.begin != nullptr)
        << "access element that is not in the set";
    return e;
  }
  // clear up things
  inline void Clear() {
    row_indices_.clear();
    elem_of_each_node_.clear();
  }
  // initialize node id 0->everything
  inline void Init() {
    CHECK_EQ(elem_of_each_node_.size(), 0);
    const bst_uint* begin = dmlc::BeginPtr(row_indices_);
    const bst_uint* end = dmlc::BeginPtr(row_indices_) + row_indices_.size();
    elem_of_each_node_.emplace_back(Elem(begin, end));
  }
  // split rowset into two
  inline void AddSplit(unsigned node_id,
                       const std::vector<Split>& row_split_tloc,
                       unsigned left_node_id,
                       unsigned right_node_id) {
    const Elem e = elem_of_each_node_[node_id];
    const unsigned nthread = row_split_tloc.size();
    CHECK(e.begin != nullptr);
    bst_uint* all_begin = dmlc::BeginPtr(row_indices_);
    bst_uint* begin = all_begin + (e.begin - all_begin);

    bst_uint* it = begin;
    // TODO(hcho3): parallelize this section
    for (bst_omp_uint tid = 0; tid < nthread; ++tid) {
      std::copy(row_split_tloc[tid].left.begin(), row_split_tloc[tid].left.end(), it);
      it += row_split_tloc[tid].left.size();
    }
    bst_uint* split_pt = it;
    for (bst_omp_uint tid = 0; tid < nthread; ++tid) {
      std::copy(row_split_tloc[tid].right.begin(), row_split_tloc[tid].right.end(), it);
      it += row_split_tloc[tid].right.size();
    }

    if (left_node_id >= elem_of_each_node_.size()) {
      elem_of_each_node_.resize(left_node_id + 1, Elem(nullptr, nullptr));
    }
    if (right_node_id >= elem_of_each_node_.size()) {
      elem_of_each_node_.resize(right_node_id + 1, Elem(nullptr, nullptr));
    }

    elem_of_each_node_[left_node_id] = Elem(begin, split_pt);
    elem_of_each_node_[right_node_id] = Elem(split_pt, e.end);
    elem_of_each_node_[node_id] = Elem(nullptr, nullptr);
  }

  // stores the row indices in the set
  std::vector<bst_uint> row_indices_;

 private:
  // vector: node_id -> elements
  std::vector<Elem> elem_of_each_node_;
};

}  // namespace common
}  // namespace xgboost

#endif  // XGBOOST_COMMON_ROW_SET_H_