xgboost

Author	SHA1	Message	Date
Philip Hyunsu Cho	983cb0b374	Add option to disable default metric (#3606 )	2018-08-18 11:39:20 -07:00
Philip Hyunsu Cho	3c72654e3b	Revert "Fix #3485 , #3540 : Don't use dropout for predicting test sets" (#3563 ) * Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)" This reverts commit 44811f233071c5805d70c287abd22b155b732727. * Document behavior of predict() for DART booster * Add notice to parameter.rst	2018-08-08 09:48:55 -07:00
Philip Hyunsu Cho	44811f2330	Fix #3485 , #3540 : Don't use dropout for predicting test sets (#3556 ) * Fix #3485, #3540: Don't use dropout for predicting test sets Dropout (for DART) should only be used at training time. * Add regression test	2018-08-05 10:17:21 -07:00
Philip Hyunsu Cho	8a5209c55e	Fix model saving for 'count:poisson': max_delta_step as Booster attribute (#3515 ) * Save max_delta_step as an extra attribute of Booster Fixes #3509 and #3026, where `max_delta_step` parameter gets lost during serialization. * fix lint * Use camel case for global constant * disable local variable case in clang-tidy	2018-07-27 09:55:54 -07:00
Rory Mitchell	a96039141a	Dmatrix refactor stage 1 (#3301 ) * Use sparse page as singular CSR matrix representation * Simplify dmatrix methods * Reduce statefulness of batch iterators * BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.	2018-06-07 10:25:58 +12:00
Rory Mitchell	ccf80703ef	Clang-tidy static analysis (#3222 ) * Clang-tidy static analysis * Modernise checks * Google coding standard checks * Identifier renaming according to Google style	2018-04-19 18:57:13 +12:00
Andrew V. Adinetz	d5992dd881	Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. (#3116 ) * Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. - replacement was performed in the learner, boosters, predictors, updaters, and objective functions - only interfaces used in training were replaced; interfaces like PredictInstance() still use std::vector - refactoring necessary for replacement of interfaces was also performed, such as using HostDeviceVector in prediction cache * HostDeviceVector-based interfaces for custom objective function example plugin.	2018-02-28 13:00:04 +13:00
Rory Mitchell	10eb05a63a	Refactor linear modelling and add new coordinate descent updater (#3103 ) * Refactor linear modelling and add new coordinate descent updater * Allow unsorted column iterator * Add prediction caching to gblinear	2018-02-17 09:17:01 +13:00
Scott Lundberg	d878c36c84	Add SHAP interaction effects, fix minor bug, and add cox loss (#3043 ) * Add interaction effects and cox loss * Minimize whitespace changes * Cox loss now no longer needs a pre-sorted dataset. * Address code review comments * Remove mem check, rename to pred_interactions, include bias * Make lint happy * More lint fixes * Fix cox loss indexing * Fix main effects and tests * Fix lint * Use half interaction values on the off-diagonals * Fix lint again	2018-02-07 20:38:01 -06:00
Thejaswi	84ab74f3a5	Objective function evaluation on GPU with minimal PCIe transfers (#2935 ) * Added GPU objective function and no-copy interface. - xgboost::HostDeviceVector<T> syncs automatically between host and device - no-copy interfaces have been added - default implementations just sync the data to host and call the implementations with std::vector - GPU objective function, predictor, histogram updater process data directly on GPU	2018-01-12 21:33:39 +13:00
Rory Mitchell	c55f14668e	Update gpu_hist algorithm (#2901 )	2017-11-27 13:44:24 +13:00
Rory Mitchell	40c6e2f0c8	Improved gpu_hist_experimental algorithm (#2866 ) - Implement colsampling, subsampling for gpu_hist_experimental - Optimised multi-GPU implementation for gpu_hist_experimental - Make nccl optional - Add Volta architecture flag - Optimise RegLossObj - Add timing utilities for debug verbose mode - Bump required cuda version to 8.0	2017-11-11 13:58:40 +13:00
Rory Mitchell	13e7a2cff0	Various bug fixes (#2825 ) * Fatal error if GPU algorithm selected without GPU support compiled * Resolve type conversion warnings * Fix gpu unit test failure * Fix compressed iterator edge case * Fix python unit test failures due to flake8 update on pip	2017-10-25 14:45:01 +13:00
Scott Lundberg	78c4188cec	SHAP values for feature contributions (#2438 ) * SHAP values for feature contributions * Fix commenting error * New polynomial time SHAP value estimation algorithm * Update API to support SHAP values * Fix merge conflicts with updates in master * Correct submodule hashes * Fix variable sized stack allocation * Make lint happy * Add docs * Fix typo * Adjust tolerances * Remove unneeded def * Fixed cpp test setup * Updated R API and cleaned up * Fixed test typo	2017-10-12 12:35:51 -07:00
Rory Mitchell	4cb2f7598b	-Add experimental GPU algorithm for lossguided mode (#2755 ) -Improved GPU algorithm unit tests -Removed some thrust code to improve compile times	2017-10-01 00:18:35 +13:00
Rory Mitchell	ef23e424f1	[GPU-Plugin] Add GPU accelerated prediction (#2593 ) * [GPU-Plugin] Add GPU accelerated prediction * Improve allocation message * Update documentation * Resolve linker error for predictor * Add unit tests	2017-08-16 12:31:59 +12:00
Rory Mitchell	0e06d1805d	[WIP] Extract prediction into separate interface (#2531 ) * [WIP] Extract prediction into separate interface * Add copyright, fix linter errors * Add predictor to amalgamation * Fix documentation * Move prediction cache into predictor, add GBTreeModel * Updated predictor doc comments	2017-07-28 17:01:03 -07:00
Rory Mitchell	48f3003302	[GPU-Plugin] Change GPU plugin to use tree_method parameter, bump cmake version to 3.5 for GPU plugin, add compute architecture 3.5, remove unused cmake files (#2455 )	2017-06-29 16:19:45 +12:00
Maurus Cuelenaere	6bd1869026	Add prediction of feature contributions (#2003 ) * Add prediction of feature contributions This implements the idea described at http://blog.datadive.net/interpreting-random-forests/ which tries to give insight in how a prediction is composed of its feature contributions and a bias. * Support multi-class models * Calculate learning_rate per-tree instead of using the one from the first tree * Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly * Add simple test for contributions feature * Check against param.num_nodes instead of checking for non-zero length * Loop over all roots instead of only the first	2017-05-14 00:58:10 -05:00
ebernhardson	da58f34ff8	Store metrics with learner (#2241 ) Storing and then loading a model loses any eval_metric that was provided. This causes implementations that always store/load, like xgboost4j-spark, to be unable to eval with the desired metric.	2017-04-30 14:23:24 -07:00
Philip Cho	14fba01b5a	Improve multi-threaded performance (#2104 ) * Add UpdatePredictionCache() option to updaters Some updaters (e.g. fast_hist) have enough information to quickly compute prediction cache for the training data. Each updater may override UpdatePredictionCache() method to update the prediction cache. Note: this trick does not apply to validation data. * Respond to code review * Disable some debug messages by default * Document UpdatePredictionCache() interface * Remove base_margin logic from UpdatePredictionCache() implementation * Do not take pointer to cfg, as reference may get stale * Improve multi-threaded performance * Use columnwise accessor to accelerate ApplySplit() step, with support for a compressed representation * Parallel sort for evaluation step * Inline BuildHist() function * Cache gradient pairs when building histograms in BuildHist() * Add missing #if macro * Respond to code review * Use wrapper to enable parallel sort on Linux * Fix C++ compatibility issues * MSVC doesn't support unsigned in OpenMP loops * gcc 4.6 doesn't support using keyword * Fix lint issues * Respond to code review * Fix bug in ApplySplitSparseData() * Attempting to read beyond the end of a sparse column * Mishandling the case where an entire range of rows have missing values * Fix training continuation bug Disable UpdatePredictionCache() in the first iteration. This way, we can accommodate the scenario where we build off of an existing (nonempty) ensemble. * Add regression test for fast_hist * Respond to code review * Add back old version of ApplySplitSparseData	2017-03-25 10:35:01 -07:00
Tianqi Chen	d581a3d0e7	[UPDATE] Update rabit and threadlocal (#2114 ) * [UPDATE] Update rabit and threadlocal * minor fix to make build system happy * upgrade requirement to g++4.8 * upgrade dmlc-core * update travis	2017-03-16 18:48:37 -07:00
Philip Cho	aeb4e76118	Histogram Optimized Tree Grower (#1940 ) * Support histogram-based algorithm + multiple tree growing strategy * Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set `tree_method = fast_hist` to configuration. * Support multiple tree growing strategies. For now, two policies are supported: * `grow_policy=depthwise` (default): favor splitting at nodes closest to the root, i.e. grow depth-wise. * `grow_policy=lossguide`: favor splitting at nodes with highest loss change * Improve single-threaded performance * Unroll critical loops * Introduce specialized code for dense data (i.e. no missing values) * Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose` * Adding a small test for hist method * Fix memory error in row_set.h When std::vector is resized, a reference to one of its element may become stale. Any such reference must be updated as well. * Resolve cross-platform compilation issues * Versions of g++ older than 4.8 lack support for a few C++11 features, e.g. alignas() and new initializer syntax. To support g++ 4.6, use pre-C++11 initializer and remove alignas(). * Versions of MSVC older than 2015 do not support alignas(). To support MSVC 2012, remove alignas(). * For g++ 4.8 and newer, alignas() is enabled for performance benefits. Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases (which uses `using` to declare type aliases). So always use `typedef`. * Fix a host of CI issues * Remove dependency for libz on osx * Fix heading for hist_util * Fix minor style issues * Add missing #include * Remove extraneous logging * Enable tree_method=hist in R * Renaming HistMaker to GHistBuilder to avoid confusion * Fix R integration * Respond to style comments * Consistent tie-breaking for priority queue using timestamps * Last-minute style fixes * Fix issuecomment-271977647 The way we quantize data is broken. The agaricus data consists of all categorical values. When NAs are converted into 0's, `HistCutMatrix::Init` assign both 0's and 1's to the same single bin. Why? gmat only the smallest value (0) and an upper bound (2), which is twice the maximum value (1). Add the maximum value itself to gmat to fix the issue. * Fix issuecomment-272266358 * Remove padding from cut values for the continuous case * For categorical/ordinal values, use midpoints as bin boundaries to be safe * Fix CI issue -- do not use xrange() Fix corner case in quantile sketch Signed-off-by: Philip Cho <chohyu01@cs.washington.edu> * Adding a test for an edge case in quantile sketcher max_bin=2 used to cause an exception. * Fix fast_hist test The test used to require a strictly increasing Test AUC for all examples. One of them exhibits a small blip in Test AUC before achieving a Test AUC of 1. (See bottom.) Solution: do not require monotonic increase for this particular example. [0] train-auc:0.99989 test-auc:0.999497 [1] train-auc:1 test-auc:0.999749 [2] train-auc:1 test-auc:0.999749 [3] train-auc:1 test-auc:0.999749 [4] train-auc:1 test-auc:0.999749 [5] train-auc:1 test-auc:0.999497 [6] train-auc:1 test-auc:1 [7] train-auc:1 test-auc:1 [8] train-auc:1 test-auc:1 [9] train-auc:1 test-auc:1	2017-01-13 09:25:55 -08:00
AbdealiJK	6f16f0ef58	Use bst_float consistently throughout (#1824 ) * Fix various typos * Add override to functions that are overridden gcc gives warnings about functions that are being overridden but not being marked as overridden. This fixes it. * Use bst_float consistently Use bst_float for all the variables that involve weight, leaf value, gradient, hessian, gain, loss_chg, predictions, base_margin, feature values. In some cases, when due to additions and so on the value can take a larger value, double is used. This ensures that type conversions are minimal and reduces loss of precision.	2016-11-30 10:02:10 -08:00
Simon DENEL	58aa1129ea	Fixing a few typos (#1771 ) * Fixing a few typos * Fixing a few typos	2016-11-13 15:47:52 -08:00
AbdealiJK	b94fcab4dc	Add dump_format=json option (#1726 ) * Add format to the params accepted by DumpModel Currently, only the text format is supported when trying to dump a model. The plan is to add more such formats like JSON which are easy to read and/or parse by machines. And to make the interface for this even more generic to allow other formats to be added. Hence, we make some modifications to make these function generic and accept a new parameter "format" which signifies the format of the dump to be created. * Fix typos and errors in docs * plugin: Mention all the register macros available Document the register macros currently available to the plugin writers so they know what exactly can be extended using hooks. * sparse_page_source: Use same arg name in .h and .cc * gbm: Add JSON dump The dump_format argument can be used to specify what type of dump file should be created. Add functionality to dump gblinear and gbtree into a JSON file. The JSON file has an array, each item is a JSON object for the tree. For gblinear: - The item is the bias and weights vectors For gbtree: - The item is the root node. The root node has an attribute "children" which holds the children nodes. This happens recursively. * core.py: Add arg dump_format for get_dump()	2016-11-04 09:55:25 -07:00
Shengwen Yang	3b9987ca9c	Fix the issue 1474 (#1615 ) * Fix 1474 * Fix crash issue when saving and loading poisson model * Rollback the wrong fix	2016-09-29 19:29:47 -07:00
Tianqi Chen	ecec5f7959	[CORE] Refactor cache mechanism (#1540 )	2016-09-02 20:39:07 -07:00
Vadim Khotilovich	75f401481f	no exception throwing within omp parallel; set nthread in Learner (#1421 )	2016-07-29 10:08:03 -07:00
Vadim Khotilovich	9a48a40cf1	Fixes for multiple and default metric (#1239 ) * fix multiple evaluation metrics * create DefaultEvalMetric only when really necessary * py test for #1239 * make travis happy	2016-06-04 22:17:35 -07:00
Vadim Khotilovich	185fef3fce	fixes for lint	2016-05-15 02:35:37 -05:00
Vadim Khotilovich	ea9285dd4f	methods to delete an attribute and get names of available attributes	2016-05-14 18:19:18 -05:00
Vadim Khotilovich	3e0732dea9	in Configure, set random seed only for uninitialized model	2016-04-26 02:03:22 -05:00
tqchen	a2714fe052	[METHOD], add tree method option to prefer faster algo	2016-03-13 12:24:47 -07:00
tqchen	ec2fb5bc48	Fix multi-class loading	2016-03-10 19:22:26 -08:00
tqchen	96b17971ac	Fix continue training in CLI	2016-03-10 12:43:25 -08:00
tqchen	ecb3a271be	[PYTHON-DIST] Distributed xgboost python training API.	2016-02-29 16:54:13 -08:00
tqchen	1495a43cea	[R] make all customizations to meet strict standard of cran	2016-01-16 10:25:12 -08:00
tqchen	c4d389c5df	[LZ] Improve lz4 format	2016-01-16 10:25:11 -08:00
tqchen	d75e3ed05d	[LIBXGBOOST] pass demo running.	2016-01-16 10:24:01 -08:00
tqchen	0d95e863c9	[LEARNER] refactor learner	2016-01-16 10:24:01 -08:00
tqchen	82ceb4de0a	[LEARNER] Init learner interface	2016-01-16 10:24:01 -08:00


192 Commits