Histogram Optimized Tree Grower (#1940)

* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` to configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its element may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lacks support for a few C++11 features, e.g.
  alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
  initializer and remove alignas(*).
* Versions of MSVC older than 2015 does not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which uses `using` to declate type aliases). So always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assign both 0's and 1's to the same single bin.

Why? gmat only the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
This commit is contained in:
Philip Cho
2017-01-13 09:25:55 -08:00
committed by Tianqi Chen
parent ef8d92fc52
commit aeb4e76118
13 changed files with 1509 additions and 31 deletions

214
src/common/hist_util.h Normal file
View File

@@ -0,0 +1,214 @@
/*!
* Copyright 2017 by Contributors
* \file hist_util.h
* \brief Utility for fast histogram aggregation
* \author Philip Cho, Tianqi Chen
*/
#ifndef XGBOOST_COMMON_HIST_UTIL_H_
#define XGBOOST_COMMON_HIST_UTIL_H_
#include <xgboost/data.h>
#include <limits>
#include <vector>
#include "row_set.h"
namespace xgboost {
namespace common {
/*! \brief sums of gradient statistics corresponding to a histogram bin */
struct GHistEntry {
/*! \brief sum of first-order gradient statistics */
double sum_grad;
/*! \brief sum of second-order gradient statistics */
double sum_hess;
GHistEntry() : sum_grad(0), sum_hess(0) {}
/*! \brief add a bst_gpair to the sum */
inline void Add(const bst_gpair& e) {
sum_grad += e.grad;
sum_hess += e.hess;
}
/*! \brief add a GHistEntry to the sum */
inline void Add(const GHistEntry& e) {
sum_grad += e.sum_grad;
sum_hess += e.sum_hess;
}
/*! \brief set sum to be difference of two GHistEntry's */
inline void SetSubtract(const GHistEntry& a, const GHistEntry& b) {
sum_grad = a.sum_grad - b.sum_grad;
sum_hess = a.sum_hess - b.sum_hess;
}
};
/*! \brief Cut configuration for one feature */
struct HistCutUnit {
/*! \brief the index pointer of each histunit */
const bst_float* cut;
/*! \brief number of cutting point, containing the maximum point */
size_t size;
// default constructor
HistCutUnit() {}
// constructor
HistCutUnit(const bst_float* cut, unsigned size)
: cut(cut), size(size) {}
};
/*! \brief cut configuration for all the features */
struct HistCutMatrix {
/*! \brief actual unit pointer */
std::vector<unsigned> row_ptr;
/*! \brief minimum value of each feature */
std::vector<bst_float> min_val;
/*! \brief the cut field */
std::vector<bst_float> cut;
/*! \brief Get histogram bound for fid */
inline HistCutUnit operator[](unsigned fid) const {
return HistCutUnit(dmlc::BeginPtr(cut) + row_ptr[fid],
row_ptr[fid + 1] - row_ptr[fid]);
}
// create histogram cut matrix given statistics from data
// using approximate quantile sketch approach
void Init(DMatrix* p_fmat, size_t max_num_bins);
};
/*!
* \brief A single row in global histogram index.
* Directly represent the global index in the histogram entry.
*/
struct GHistIndexRow {
/*! \brief The index of the histogram */
const unsigned* index;
/*! \brief The size of the histogram */
unsigned size;
GHistIndexRow() {}
GHistIndexRow(const unsigned* index, unsigned size)
: index(index), size(size) {}
};
/*!
* \brief preprocessed global index matrix, in CSR format
* Transform floating values to integer index in histogram
* This is a global histogram index.
*/
struct GHistIndexMatrix {
/*! \brief row pointer */
std::vector<unsigned> row_ptr;
/*! \brief The index data */
std::vector<unsigned> index;
/*! \brief hit count of each index */
std::vector<unsigned> hit_count;
/*! \brief optional remap index from outter row_id -> internal row_id*/
std::vector<unsigned> remap_index;
/*! \brief The corresponding cuts */
const HistCutMatrix* cut;
// Create a global histogram matrix, given cut
void Init(DMatrix* p_fmat);
// build remap
void Remap();
// get i-th row
inline GHistIndexRow operator[](bst_uint i) const {
return GHistIndexRow(&index[0] + row_ptr[i], row_ptr[i + 1] - row_ptr[i]);
}
};
/*!
* \brief histogram of graident statistics for a single node.
* Consists of multiple GHistEntry's, each entry showing total graident statistics
* for that particular bin
* Uses global bin id so as to represent all features simultaneously
*/
struct GHistRow {
/*! \brief base pointer to first entry */
GHistEntry* begin;
/*! \brief number of entries */
unsigned size;
GHistRow() {}
GHistRow(GHistEntry* begin, unsigned size)
: begin(begin), size(size) {}
};
/*!
* \brief histogram of gradient statistics for multiple nodes
*/
class HistCollection {
public:
// access histogram for i-th node
inline GHistRow operator[](bst_uint nid) const {
const size_t kMax = std::numeric_limits<size_t>::max();
CHECK_NE(row_ptr_[nid], kMax);
return GHistRow(const_cast<GHistEntry*>(dmlc::BeginPtr(data_) + row_ptr_[nid]), nbins_);
}
// have we computed a histogram for i-th node?
inline bool RowExists(bst_uint nid) const {
const size_t kMax = std::numeric_limits<size_t>::max();
return (nid < row_ptr_.size() && row_ptr_[nid] != kMax);
}
// initialize histogram collection
inline void Init(size_t nbins) {
nbins_ = nbins;
row_ptr_.clear();
data_.clear();
}
// create an empty histogram for i-th node
inline void AddHistRow(bst_uint nid) {
const size_t kMax = std::numeric_limits<size_t>::max();
if (nid >= row_ptr_.size()) {
row_ptr_.resize(nid + 1, kMax);
}
CHECK_EQ(row_ptr_[nid], kMax);
row_ptr_[nid] = data_.size();
data_.resize(data_.size() + nbins_);
}
private:
/*! \brief number of all bins over all features */
size_t nbins_;
std::vector<GHistEntry> data_;
/*! \brief row_ptr_[nid] locates bin for historgram of node nid */
std::vector<size_t> row_ptr_;
};
/*!
* \brief builder for histograms of gradient statistics
*/
class GHistBuilder {
public:
// initialize builder
inline void Init(size_t nthread, size_t nbins) {
nthread_ = nthread;
nbins_ = nbins;
data_.resize(nthread * nbins, GHistEntry());
}
// construct a histogram via histogram aggregation
void BuildHist(const std::vector<bst_gpair>& gpair,
const RowSetCollection::Elem row_indices,
const GHistIndexMatrix& gmat,
GHistRow hist);
// construct a histogram via subtraction trick
void SubtractionTrick(GHistRow self, GHistRow sibling, GHistRow parent);
private:
/*! \brief number of threads for parallel computation */
size_t nthread_;
/*! \brief number of all bins over all features */
size_t nbins_;
std::vector<GHistEntry> data_;
};
} // namespace common
} // namespace xgboost
#endif // XGBOOST_COMMON_HIST_UTIL_H_