* Add interaction effects and cox loss
* Minimize whitespace changes
* Cox loss no longer needs a pre-sorted dataset.
* Address code review comments
* Remove mem check, rename to pred_interactions, include bias
* Make lint happy
* More lint fixes
* Fix cox loss indexing
* Fix main effects and tests
* Fix lint
* Use half interaction values on the off-diagonals
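One possible reading of the off-diagonal change, as a minimal Python sketch: the total interaction effect of a feature pair is split evenly between the two symmetric off-diagonal entries, and the diagonal keeps whatever is left of each feature's contribution. The function name, its inputs, and the numbers are hypothetical and are not taken from the patch.

```python
def interaction_matrix(phi, pair_effects):
    """Build a symmetric matrix of interaction values (illustrative only).

    phi: per-feature contributions (hypothetical input).
    pair_effects: {(i, j): total interaction effect of the pair}, with i < j.
    """
    n = len(phi)
    m = [[0.0] * n for _ in range(n)]
    for (i, j), total in pair_effects.items():
        # Each off-diagonal entry receives half of the pairwise effect.
        m[i][j] = total / 2.0
        m[j][i] = total / 2.0
    for i in range(n):
        # Main effect: what remains of phi[i] after removing its interactions.
        m[i][i] = phi[i] - sum(m[i][j] for j in range(n) if j != i)
    return m


m = interaction_matrix([1.0, 2.0], {(0, 1): 0.5})
```

With this layout every row sums back to the corresponding entry of `phi`.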
* Fix lint again
* Added GPU objective function and no-copy interface.
- xgboost::HostDeviceVector<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
directly on GPU
* [R] fix predict contributions for data with no colnames
* [R] add a render parameter for xgb.plot.multi.trees; fixes #2628
* [R] update Rd's
* [R] remove unnecessary dep-package from R cmake install
* silence type warnings; readability
* [R] silence complaint about incomplete line at the end
* [R] initial version of xgb.plot.shap()
* [R] more work on xgb.plot.shap
* [R] enforce black font in xgb.plot.tree; fixes #2640
* [R] if feature names are available, check in predict that they are the same; fixes #2857
* [R] cran check and lint fixes
* remove tabs
* [R] add references; a test for plot.shap
* Fatal error if GPU algorithm selected without GPU support compiled
* Resolve type conversion warnings
* Fix gpu unit test failure
* Fix compressed iterator edge case
* Fix python unit test failures due to flake8 update on pip
* SHAP values for feature contributions
* Fix commenting error
* New polynomial time SHAP value estimation algorithm
* Update API to support SHAP values
* Fix merge conflicts with updates in master
* Correct submodule hashes
* Fix variable sized stack allocation
* Make lint happy
* Add docs
* Fix typo
* Adjust tolerances
* Remove unneeded def
* Fixed cpp test setup
* Updated R API and cleaned up
* Fixed test typo
* for MinGW, drop the 'lib' prefix from shared library name
* fix defines for 'g++ 4.8 or higher' to include g++ >= 5
* fix compile warnings
* [Appveyor] add MinGW with python; remove redundant jobs
* [Appveyor] also do python build for one of msvc jobs
* Integrating a faster version of grow_gpu plugin
1. Removed the older files to reduce duplication
2. Moved all of the grow_gpu files under 'exact' folder
3. All of them are inside 'exact' namespace to avoid any conflicts
4. Fixed a bug in benchmark.py while running only 'grow_gpu' plugin
5. Added cub and googletest submodules to ease integration and unit-testing
6. Updates to CMakeLists.txt to directly build cuda objects into libxgboost
* Added support for building gpu plugins through make flow
1. updated makefile and config.mk to add right targets
2. added unit-tests for gpu exact plugin code
* 1. Added support for building gpu plugin using 'make' flow as well
2. Updated instructions for building and testing gpu plugin
* Fix travis-ci errors for PR#2360
1. lint errors on unit-tests
2. removed googletest; instead depend on the gtest cache provided by dmlc-core
* Some more fixes to travis-ci lint failures PR#2360
* Added Rory's copyrights to the files containing code from both.
* updated copyright statement as per Rory's request
* moved the static datasets into a script to generate them at runtime
* 1. memory usage print when silent=0
2. tests/ and test/ folder organization
3. removal of the dependency of googletest for just building xgboost
4. coding style updates for .cuh as well
* Fixes for compilation warnings
* add cuda object files as well when JVM_BINDINGS=ON
* [gblinear] add features contribution prediction; fix DumpModel bug
* [gbtree] minor changes to PredContrib
* [R] add feature contribution prediction to R
* [R] bump up version; update NEWS
* [gblinear] fix the base_margin issue; fixes #1969
* [R] list of matrices as output of multiclass feature contributions
* [gblinear] make order of DumpModel coefficients consistent: group index changes the fastest
* Fix compilation on OS X with GCC 7
Compilation failed with
In file included from src/tree/tree_updater.cc:6:0:
include/xgboost/tree_updater.h:75:46: error: 'function' is not a member of 'std'
std::function<TreeUpdater* ()> > {
caused by a missing <functional> include.
* Fixed another occurrence of that issue spotted by @ClimberPG
* [R] add native routines registration
* c_api.h needs to include <cstdint> since it uses fixed width integer types
* [R] use registered native routines from R code
* [R] bump version; add info on native routine registration to the contributors guide
* make lint happy
* Add prediction of feature contributions
This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight into how a prediction is composed of its feature contributions
and a bias.
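A minimal Python sketch of the idea, assuming a single tree in which every node stores a mean value: the bias is the root value, and each split on the decision path credits the change in node value to the feature that was split on. The tree layout and field names are hypothetical and do not reflect xgboost's internal structures.

```python
# Hypothetical tree: internal nodes hold a split, every node holds a mean value.
TREE = {
    "value": 0.5, "feature": 0, "threshold": 1.0,
    "left": {"value": 0.25},
    "right": {
        "value": 0.75, "feature": 1, "threshold": 2.0,
        "left": {"value": 0.5},
        "right": {"value": 1.0},
    },
}


def predict_contributions(tree, x, n_features):
    """Return (per-feature contributions, bias, leaf prediction) for one row."""
    contribs = [0.0] * n_features
    node = tree
    bias = node["value"]  # the root's mean value acts as the bias term
    while "feature" in node:
        f = node["feature"]
        child = node["left"] if x[f] < node["threshold"] else node["right"]
        # Credit the change in node value to the feature used for this split.
        contribs[f] += child["value"] - node["value"]
        node = child
    return contribs, bias, node["value"]


contribs, bias, prediction = predict_contributions(TREE, [2.0, 3.0], 2)
```

The contributions plus the bias add up to the leaf value reached by the row.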
* Support multi-class models
* Calculate learning_rate per-tree instead of using the one from the first tree
* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly
* Add simple test for contributions feature
* Check against param.num_nodes instead of checking for non-zero length
* Loop over all roots instead of only the first
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()
When there are more threads than rows in rowset, some threads end up
with empty ranges, causing them to crash. (iend - 1 needs to be
accessible as part of algorithm)
Fix: run only those threads with nonempty ranges.
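The fix can be illustrated with a small Python sketch of the partitioning, assuming rows are split into equal contiguous chunks per thread. The helper name and the chunking scheme are assumptions for illustration; the actual code is C++ with OpenMP.

```python
def thread_ranges(n_rows, n_threads):
    """Split [0, n_rows) into per-thread ranges, skipping empty ones."""
    chunk = -(-n_rows // n_threads)  # ceiling division
    ranges = []
    for tid in range(n_threads):
        ibegin = min(tid * chunk, n_rows)
        iend = min(ibegin + chunk, n_rows)
        if ibegin < iend:
            # Only nonempty ranges are handed to a worker, so iend - 1
            # is always a valid row position.
            ranges.append((ibegin, iend))
    return ranges


few_rows = thread_ranges(3, 8)    # more threads than rows
many_rows = thread_ranges(10, 4)
```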
* Add regression test for Bugfix 1
* Moving python_omp_test to existing python test group
Turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading. Just add nthread parameter.
* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature
When split value is less than all cut points, split_cond is set
incorrectly.
Fix: set split_cond = -1 to indicate this scenario
* Bugfix 3: Initialize data layout indicator before using it
data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.
Fix: re-order code in InitData() to initialize data_layout_ first
* Adding regression test for Bugfix 2
Unfortunately, no regression test for Bugfix 3, as there is no
way to deterministically assign a value to an uninitialized variable.
* Add UpdatePredictionCache() option to updaters
Some updaters (e.g. fast_hist) have enough information to quickly compute
the prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.
* Respond to code review
* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale
* Improve multi-threaded performance
* Use columnwise accessor to accelerate ApplySplit() step,
with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()
* Add missing #if macro
* Respond to code review
* Use wrapper to enable parallel sort on Linux
* Fix C++ compatibility issues
* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support the `using` keyword for type aliases
* Fix lint issues
* Respond to code review
* Fix bug in ApplySplitSparseData()
* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows has missing values
* Fix training continuation bug
Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.
* Add regression test for fast_hist
* Respond to code review
* Add back old version of ApplySplitSparseData
When I use the online prediction function (`inline void Predict(const SparseBatch::Inst &inst, ... ) const;`), the results differ from those of the batch prediction function (`virtual void Predict(DMatrix* data, ...) const = 0`). On investigation, I found that the online prediction function uses the `base_score_` parameter, whereas the batch prediction function does not. I also found that the value of `base_score_` differs each time the same model file is loaded:
```
1st times:base_score_: 6.69023e-21
2nd times:base_score_: -3.7668e+19
3rd times:base_score_: 5.40507e+07
```
Online prediction results are therefore affected by the `base_score_` parameter. After deleting the if condition (`if (out_preds->size() == 1)`), the online prediction results are consistent with the batch prediction results, and the xgboost prediction results are consistent with the Python version. It is therefore likely that the online prediction function has a bug.
* Support histogram-based algorithm + multiple tree growing strategy
* Add a brand new updater to support histogram-based algorithm, which buckets
continuous features into discrete bins to speed up training. To use it, set
`tree_method = fast_hist` in the configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
* `grow_policy=depthwise` (default): favor splitting at nodes closest to the
root, i.e. grow depth-wise.
* `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
* Unroll critical loops
* Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`
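The two growing policies differ only in which candidate node is expanded next. A minimal Python sketch, assuming each candidate carries its depth and loss change; the dictionary fields and node ids are hypothetical.

```python
def next_node(candidates, grow_policy):
    """Pick the next node to split from a list of candidate dicts."""
    if grow_policy == "depthwise":
        # Favor the node closest to the root.
        return min(candidates, key=lambda c: c["depth"])
    if grow_policy == "lossguide":
        # Favor the node whose split gives the highest loss change.
        return max(candidates, key=lambda c: c["loss_chg"])
    raise ValueError("unknown grow_policy: %s" % grow_policy)


candidates = [
    {"nid": 3, "depth": 2, "loss_chg": 9.0},
    {"nid": 2, "depth": 1, "loss_chg": 1.5},
]
depthwise_pick = next_node(candidates, "depthwise")["nid"]
lossguide_pick = next_node(candidates, "lossguide")["nid"]
```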
* Adding a small test for hist method
* Fix memory error in row_set.h
When std::vector is resized, a reference to one of its elements may become
stale. Any such reference must be updated as well.
* Resolve cross-platform compilation issues
* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
initializer and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
(which use `using` to declare type aliases). So always use `typedef`.
* Fix a host of CI issues
* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging
* Enable tree_method=hist in R
* Renaming HistMaker to GHistBuilder to avoid confusion
* Fix R integration
* Respond to style comments
* Consistent tie-breaking for priority queue using timestamps
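A minimal Python sketch of timestamp tie-breaking, assuming candidates are ordered by loss change and that the earlier entry wins a tie; the class and its methods are hypothetical and only illustrate why a monotonically increasing counter makes the pop order deterministic.

```python
import heapq
import itertools


class ExpandQueue:
    """Priority queue over (loss_chg, node id) with deterministic ties."""

    def __init__(self):
        self._heap = []
        self._clock = itertools.count()  # monotonically increasing timestamp

    def push(self, nid, loss_chg):
        # heapq is a min-heap, so loss_chg is negated to pop the largest
        # first; the timestamp decides between equal loss changes.
        heapq.heappush(self._heap, (-loss_chg, next(self._clock), nid))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = ExpandQueue()
queue.push(7, 1.0)
queue.push(3, 1.0)  # same loss change as node 7, pushed later
queue.push(5, 2.0)
order = [queue.pop() for _ in range(3)]
```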
* Last-minute style fixes
* Fix issuecomment-271977647
The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assign both 0's and 1's to the same single bin.
Why? gmat has only the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.
* Fix issuecomment-272266358
* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe
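A minimal Python sketch of midpoint boundaries, assuming the cut points are placed halfway between consecutive distinct values. The helper names are hypothetical; the point is only that values 0 and 1 no longer share a bin.

```python
from bisect import bisect_right


def midpoint_cuts(values):
    """Place one cut halfway between each pair of consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]


def bin_index(cuts, v):
    """Index of the bin that v falls into, given sorted cut points."""
    return bisect_right(cuts, v)


cuts = midpoint_cuts([0, 1, 1, 0, 1])  # binary feature, as in the agaricus case
bins = [bin_index(cuts, v) for v in (0, 1)]
```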
* Fix CI issue -- do not use xrange(*)
* Fix corner case in quantile sketch
Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>
* Adding a test for an edge case in quantile sketcher
max_bin=2 used to cause an exception.
* Fix fast_hist test
The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)
Solution: do not require monotonic increase for this particular example.
[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
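The blip can be checked mechanically. The sketch below copies the test-auc values from the log above and tests for a non-decreasing sequence, which is an assumption about the form of the old requirement; the helper names are hypothetical.

```python
# test-auc values copied from the log above, rounds 0 through 9.
test_auc = [0.999497, 0.999749, 0.999749, 0.999749, 0.999749,
            0.999497, 1.0, 1.0, 1.0, 1.0]


def is_non_decreasing(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


def first_drop(xs):
    """Index of the first round whose value is lower than the previous one."""
    for i in range(1, len(xs)):
        if xs[i] < xs[i - 1]:
            return i
    return None


monotonic = is_non_decreasing(test_auc)
drop_round = first_drop(test_auc)
```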
* Fix various typos
* Add override to functions that are overridden
gcc gives warnings about functions that override a base-class function
without being marked as override. This fixes it.
* Use bst_float consistently
Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.
In some cases, where additions and similar operations can make the value
grow larger, double is used.
This ensures that type conversions are minimal and reduces loss of
precision.
* Add format to the params accepted by DumpModel
Currently, only the text format is supported when trying to dump
a model. The plan is to add more such formats like JSON which are
easy to read and/or parse by machines. And to make the interface
for this even more generic to allow other formats to be added.
Hence, we make some modifications to make these function generic
and accept a new parameter "format" which signifies the format of
the dump to be created.
* Fix typos and errors in docs
* plugin: Mention all the register macros available
Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.
* sparse_page_source: Use same arg name in .h and .cc
* gbm: Add JSON dump
The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.
The JSON file has an array, each item is a JSON object for the tree.
For gblinear:
- The item is the bias and weights vectors
For gbtree:
- The item is the root node. The root node has an attribute "children"
which holds the children nodes. This happens recursively.
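A minimal Python sketch of walking such a gbtree dump, assuming only the recursive "children" attribute described above. The other field names ("nodeid", "leaf") and the sample document are hypothetical placeholders for illustration.

```python
import json

# Hypothetical dump: one array item per tree, nested through "children".
DUMP = """
[{"nodeid": 0, "children": [
    {"nodeid": 1, "leaf": 0.1},
    {"nodeid": 2, "children": [
        {"nodeid": 3, "leaf": -0.2},
        {"nodeid": 4, "leaf": 0.3}
    ]}
]}]
"""


def leaves(node):
    """Collect leaf nodes by recursing through the "children" attribute."""
    children = node.get("children")
    if not children:
        return [node]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found


trees = json.loads(DUMP)
leaf_ids = [n["nodeid"] for n in leaves(trees[0])]
```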
* core.py: Add arg dump_format for get_dump()
* Changes for Mingw64 compilation to ensure long is a consistent size.
Mainly impacts the Java API which would not compile, but there may be
silent errors on Windows with large datasets before this patch (as long
is 32 bits when compiled with mingw64, even in 64-bit mode).
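The kind of silent error meant here can be reproduced with Python's ctypes, which does not check for overflow when constructing fixed-width integers. The entry count below is a hypothetical value chosen to exceed the signed 32-bit range.

```python
import ctypes

n_entries = 3_000_000_000  # hypothetical count, larger than 2**31 - 1

# Stored in a 32-bit signed integer, the count silently wraps around.
as_int32 = ctypes.c_int32(n_entries).value

# A 64-bit unsigned integer holds it unchanged.
as_uint64 = ctypes.c_uint64(n_entries).value

# The width of C long depends on the platform and compiler.
long_bits = ctypes.sizeof(ctypes.c_long) * 8
print("C long is", long_bits, "bits on this platform")
```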
* Adding ifdefs to ensure it still compiles on MacOS
* Makefile and create_jni.bat changes for Windows.
* Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast
* Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md