107 Commits

Author SHA1 Message Date
Andy Adinets
58d783df16 Fixed issue 3605. (#3628)
* Fixed issue 3605.

- https://github.com/dmlc/xgboost/issues/3605

* Fixed the bug in a better way.

* Added a test to catch the bug.

* Fixed linter errors.
2018-08-28 10:50:52 -07:00
Shiki-H
24a268a2e3 sklearn api for ranking (#3560)
* added xgbranker

* fixed predict method and ranking test

* reformatted code in accordance with pep8

* fixed lint error

* fixed docstring and added checks on objective

* added ranking demo for python

* fixed suffix in rank.py
2018-08-21 08:26:48 -07:00
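
A minimal sketch of the new sklearn-style ranking interface; the group sizes passed to fit() follow how XGBoost ranking objectives consume query groups, and the parameter values are illustrative:

    import numpy as np
    import xgboost as xgb

    # 20 rows split into two query groups of 12 and 8 documents.
    X = np.random.rand(20, 5)
    y = np.random.randint(0, 3, size=20)  # graded relevance labels

    ranker = xgb.XGBRanker(objective='rank:pairwise', n_estimators=10)
    ranker.fit(X, y, group=[12, 8])  # group sizes must sum to len(X)
    scores = ranker.predict(X)       # one relevance score per row
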
Philip Hyunsu Cho
3c72654e3b Revert "Fix #3485, #3540: Don't use dropout for predicting test sets" (#3563)
* Revert "Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)"

This reverts commit 44811f233071c5805d70c287abd22b155b732727.

* Document behavior of predict() for DART booster

* Add notice to parameter.rst
2018-08-08 09:48:55 -07:00
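
A sketch of the behavior documented here (parameter values illustrative): predict() on a DART booster applies dropout, so passing a nonzero ntree_limit is the way to score a test set with all trees:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 10)
    y = np.random.randint(2, size=100)
    dtrain = xgb.DMatrix(X, label=y)

    params = {'booster': 'dart', 'objective': 'binary:logistic',
              'rate_drop': 0.1, 'skip_drop': 0.5}
    bst = xgb.train(params, dtrain, num_boost_round=50)

    # Without ntree_limit, predict() performs dropout on a DART booster;
    # passing the full round count evaluates every tree deterministically.
    preds = bst.predict(xgb.DMatrix(X), ntree_limit=50)
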
Philip Hyunsu Cho
44811f2330 Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)
* Fix #3485, #3540: Don't use dropout for predicting test sets

Dropout (for DART) should only be used at training time.

* Add regression test
2018-08-05 10:17:21 -07:00
Andy Adinets
cc6a5a3666 Added finding quantiles on GPU. (#3393)
* Added finding quantiles on GPU.

- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
  as those found by the old one, test thresholds in
  tests/python-gpu/test_gpu_updaters.py have been adjusted.

* Adjustments and improved testing for finding quantiles on the GPU.

- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
2018-07-27 14:03:16 +12:00
jqmp
e9a97e0d88 Add total_gain and total_cover importance measures (#3498)
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.

`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits.  `'total_gain'` does the
same, but returns a total rather than an average.  This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.

I also added an analogous `'total_cover'` option for consistency.

This should resolve #3484.
2018-07-23 00:30:55 -07:00
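
A short sketch of the extended importance_type argument:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(200, 4)
    y = np.random.randint(2, size=200)
    bst = xgb.train({'objective': 'binary:logistic'},
                    xgb.DMatrix(X, label=y), num_boost_round=20)

    # 'gain' averages each feature's gain over its splits;
    # 'total_gain' sums it instead (same relationship for cover).
    avg_gain = bst.get_score(importance_type='gain')
    tot_gain = bst.get_score(importance_type='total_gain')
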
Henry Gouk
a13e29ece1 Add LASSO (#3429)
* Allow multiple split constraints

* Replace RidgePenalty with ElasticNet

* Add test for checking Ridge, LASSO, and Elastic Net are implemented
2018-07-15 16:38:26 +12:00
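
These penalties surface through the existing regularization parameters; the mapping onto reg_alpha / reg_lambda below is an inference from the commit, not stated in it:

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(100, 5),
                         label=np.random.rand(100))

    # reg_alpha is the L1 (LASSO) term, reg_lambda the L2 (Ridge) term;
    # setting both nonzero yields an Elastic Net penalty on leaf weights.
    params = {'reg_alpha': 0.5, 'reg_lambda': 1.0}
    bst = xgb.train(params, dtrain, num_boost_round=10)
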
Philip Hyunsu Cho
66e74d2223 Fix get_uint_info() (#3442)
* Add regression test
2018-07-05 20:06:59 -07:00
Oliver Laslett
18813a26ab allow arbitrary cross validation fold indices (#3353)
* allow arbitrary cross validation fold indices

 - use training indices passed to `folds` parameter in `training.cv`
 - update doc string

* add tests for arbitrary fold indices
2018-06-30 19:23:49 +00:00
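
A minimal sketch of passing explicit fold indices through the `folds` parameter (indices are illustrative):

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(100, 5),
                         label=np.random.randint(2, size=100))

    # Each fold is an arbitrary (train_indices, test_indices) pair.
    folds = [(np.arange(0, 80), np.arange(80, 100)),
             (np.arange(20, 100), np.arange(0, 20))]
    res = xgb.cv({'objective': 'binary:logistic'}, dtrain,
                 num_boost_round=10, folds=folds)
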
Mike Liu
594bcea83e Save and load model in sklearn API (#3192)
* Add (load|save)_model to XGBModel

* Add docstring

* Fix docstring

* Fix mixed use of space and tab

* Add a test

* Fix Flake8 style errors
2018-06-30 19:21:49 +00:00
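
A sketch of the added (load|save)_model round trip on the sklearn wrapper:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 5)
    y = np.random.randint(2, size=100)

    clf = xgb.XGBClassifier(n_estimators=10)
    clf.fit(X, y)
    clf.save_model('model.bin')   # persist the underlying booster

    clf2 = xgb.XGBClassifier()
    clf2.load_model('model.bin')  # restore into a fresh estimator
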
Rory Mitchell
a0a1df1aba Refactor python tests (#3410)
* Add unit test utility

* Refactor updater tests. Add coverage for histmaker.
2018-06-27 11:20:27 +12:00
PSEUDOTENSOR / Jonathan McKinney
9ac163d0bb Allow import via python datatable. (#3272)
* Allow import via python datatable.

* Write unit tests

* Refactor dt API functions

* Refactor python code

* Lint fixes

* Address review comments
2018-06-20 13:16:18 -07:00
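
A sketch of the new input path, assuming the third-party datatable package is installed and exposes Frame as in its Python API:

    import numpy as np
    import datatable
    import xgboost as xgb

    frame = datatable.Frame(np.random.rand(100, 5))
    dmat = xgb.DMatrix(frame)  # DMatrix construction now accepts a Frame
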
pdavalo
480e3fd764 Sklearn: validation set weights (#2354)
* Add option to use weights when evaluating metrics in validation sets

* Add test for validation-set weights functionality

* simplify case with no weights for test sets

* fix lint issues
2018-05-23 17:06:20 -07:00
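
A sketch of the option, assuming it is exposed as a sample_weight_eval_set argument on the sklearn fit() (that exact parameter name is an assumption here):

    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(100, 5), np.random.randint(2, size=100)
    Xv, yv = np.random.rand(30, 5), np.random.randint(2, size=30)
    wv = np.random.rand(30)  # per-row weights for the validation set

    clf = xgb.XGBClassifier(n_estimators=10)
    clf.fit(X, y, eval_set=[(Xv, yv)],
            sample_weight_eval_set=[wv])  # name assumed, see note above
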
Rory Mitchell
a185ddfe03 Implement GPU accelerated coordinate descent algorithm (#3178)
* Implement GPU accelerated coordinate descent algorithm. 

* Exclude external memory tests for GPU
2018-04-20 14:56:35 +12:00
Rory Mitchell
9fa45d3a9c Fix bug with gpu_predictor caching behaviour (#3177)
* Fixes #3162
2018-03-18 10:35:10 +13:00
Vadim Khotilovich
706be4e5d4 Additional improvements for gblinear (#3134)
* fix rebase conflict

* [core] additional gblinear improvements

* [R] callback for gblinear coefficients history

* force eta=1 for gblinear python tests

* add top_k to GreedyFeatureSelector

* set eta=1 in shotgun test

* [core] fix SparsePage processing in gblinear; col-wise multithreading in greedy updater

* set sorted flag within TryInitColData

* gblinear tests: use scale, add external memory test

* fix multiclass for greedy updater

* fix whitespace

* fix typo
2018-03-13 01:27:13 -05:00
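
A sketch of the greedy selector with top_k, using the gblinear parameter names from the current docs (treat them as assumptions for this exact commit):

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(100, 20),
                         label=np.random.rand(100))

    # top_k limits each greedy update to the k features whose
    # gradients have the largest magnitude.
    params = {'booster': 'gblinear', 'updater': 'coord_descent',
              'feature_selector': 'greedy', 'top_k': 5}
    bst = xgb.train(params, dtrain, num_boost_round=10)
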
redditur
d5f1b74ef5 'hist': Monotonic Constraints (#3085)
* Extended monotonic constraints support to 'hist' tree method.

* Added monotonic constraints tests.

* Fix the signature of NoConstraint::CalcSplitGain()

* Document monotonic constraint support in 'hist'

* Update signature of Update to account for latest refactor
2018-03-05 16:45:49 -08:00
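
A minimal sketch of a monotonic constraint under the 'hist' method (the constraint string is illustrative):

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(200, 2),
                         label=np.random.rand(200))

    # Feature 0 constrained to be monotonically increasing,
    # feature 1 to be decreasing.
    params = {'tree_method': 'hist', 'monotone_constraints': '(1,-1)'}
    bst = xgb.train(params, dtrain, num_boost_round=10)
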
Rory Mitchell
10eb05a63a Refactor linear modelling and add new coordinate descent updater (#3103)
* Refactor linear modelling and add new coordinate descent updater

* Allow unsorted column iterator

* Add prediction caching to gblinear
2018-02-17 09:17:01 +13:00
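
A sketch of selecting the new linear updater explicitly:

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(100, 5),
                         label=np.random.rand(100))

    # 'coord_descent' is the new updater; 'shotgun' remains the
    # parallel alternative for gblinear.
    params = {'booster': 'gblinear', 'updater': 'coord_descent'}
    bst = xgb.train(params, dtrain, num_boost_round=10)
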
Scott Lundberg
d878c36c84 Add SHAP interaction effects, fix minor bug, and add cox loss (#3043)
* Add interaction effects and cox loss

* Minimize whitespace changes

* Cox loss no longer needs a pre-sorted dataset.

* Address code review comments

* Remove mem check, rename to pred_interactions, include bias

* Make lint happy

* More lint fixes

* Fix cox loss indexing

* Fix main effects and tests

* Fix lint

* Use half interaction values on the off-diagonals

* Fix lint again
2018-02-07 20:38:01 -06:00
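
A sketch of the renamed pred_interactions flag:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(50, 4)
    y = np.random.randint(2, size=50)
    bst = xgb.train({'objective': 'binary:logistic'},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

    # Shape (rows, n_features + 1, n_features + 1); the extra slot is
    # the bias, and off-diagonal cells hold half of each pairwise
    # interaction effect, as noted in the commit.
    inter = bst.predict(xgb.DMatrix(X), pred_interactions=True)
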
Scott Lundberg
78c4188cec SHAP values for feature contributions (#2438)
* SHAP values for feature contributions

* Fix commenting error

* New polynomial time SHAP value estimation algorithm

* Update API to support SHAP values

* Fix merge conflicts with updates in master

* Correct submodule hashes

* Fix variable sized stack allocation

* Make lint happy

* Add docs

* Fix typo

* Adjust tolerances

* Remove unneeded def

* Fixed cpp test setup

* Updated R API and cleaned up

* Fixed test typo
2017-10-12 12:35:51 -07:00
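
A sketch of requesting SHAP-based feature contributions through the prediction API:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(50, 4)
    y = np.random.randint(2, size=50)
    bst = xgb.train({'objective': 'binary:logistic'},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

    # One column per feature plus a final bias column; each row's
    # contributions sum to that row's raw (margin) prediction.
    contribs = bst.predict(xgb.DMatrix(X), pred_contribs=True)
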
Tsukasa OMOTO
8d15024ac7 python: follow the default warning filters of Python (#2666)
* python: follow the default warning filters of Python

https://docs.python.org/3/library/warnings.html#default-warning-filters

* update tests

* update tests
2017-09-27 03:03:01 -04:00
Icyblade Dai
0e85b30fdd Fix issue 2670 (#2671)
* fix issue 2670

* add python<3.6 compatibility

* fix Index

* fix Index/MultiIndex

* fix lint

* fix W0622

really nonsense

* fix lambda

* Trigger Travis

* add test for MultiIndex

* remove trailing whitespace
2017-09-19 15:49:41 -04:00
PSEUDOTENSOR / Jonathan McKinney
6b375f6ad8 Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530)
* Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays in the Python interface.
2017-07-21 14:43:17 +12:00
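
A sketch of the faster construction path, assuming the thread count is exposed through an nthread argument on DMatrix as in later releases:

    import numpy as np
    import xgboost as xgb

    arr = np.random.rand(100000, 20).astype(np.float32)

    # nthread=-1 requests all available cores during construction.
    dmat = xgb.DMatrix(arr, nthread=-1)
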
wxchan
65d2513714 [python-package] fix sklearn n_jobs/nthreads and seed/random_state bug (#2378)
* add a testcase causing RuntimeError

* move seed/random_state/nthread/n_jobs check to get_xgb_params()

* fix failed test
2017-06-12 09:33:42 -04:00
gaw89
0f3a404d91 Sklearn kwargs (#2338)
* Added kwargs support for Sklearn API

* Updated NEWS and CONTRIBUTORS

* Fixed CONTRIBUTORS.md

* Added clarification of **kwargs and test for proper usage

* Fixed lint error

* Fixed more lint errors and clf assigned but never used

* Fixed more lint errors

* Fixed more lint errors

* Fixed issue with changes from different branch bleeding over

* Fixed issue with changes from other branch bleeding over

* Added note that kwargs may not be compatible with Sklearn

* Fixed linting on kwargs note
2017-05-23 21:47:53 -05:00
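
A sketch of the pass-through: keyword arguments with no dedicated sklearn attribute become booster parameters (and, per the commit's caveat, may not be compatible with sklearn tooling such as GridSearchCV):

    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(100, 5), np.random.randint(2, size=100)

    # tree_method has no dedicated constructor attribute here, so it
    # travels through **kwargs into the booster parameters.
    clf = xgb.XGBClassifier(n_estimators=10, tree_method='hist')
    clf.fit(X, y)
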
gaw89
6cea1e3fb7 Sklearn convention update (#2323)
* Added n_jobs and random_state to keep up to date with sklearn API.
Deprecated nthread and seed.  Added tests for new params and
deprecations.

* Fixed docstring to reflect updates to n_jobs and random_state.

* Fixed whitespace issues and removed nose import.

* Added deprecation note for nthread and seed in docstring.

* Attempted fix of deprecation tests.

* Second attempted fix to tests.

* Set n_jobs to 1.
2017-05-22 08:22:05 -05:00
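
A short sketch of the renamed parameters:

    import xgboost as xgb

    # sklearn-conventional names; the old ones still work but emit
    # a DeprecationWarning.
    clf = xgb.XGBClassifier(n_jobs=4, random_state=42)  # preferred
    old = xgb.XGBClassifier(nthread=4, seed=42)         # deprecated
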
jayzed82
29289d2302 Add option to choose booster in scikit interface (gbtree by default) (#2303)
* Add option to choose booster in scikit interface (gbtree by default)

* Add option to choose booster in scikit interface: complete docstring.

* Fix XGBClassifier to work with booster option

* Added test case for gblinear booster
2017-05-18 23:12:27 -04:00
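
A minimal sketch of the new booster option on the sklearn interface:

    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(100, 5), np.random.randint(2, size=100)

    # booster defaults to 'gbtree'; 'gblinear' and 'dart' can be
    # selected instead.
    clf = xgb.XGBClassifier(booster='gblinear', n_estimators=10)
    clf.fit(X, y)
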
Maurus Cuelenaere
6bd1869026 Add prediction of feature contributions (#2003)
* Add prediction of feature contributions

This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight into how a prediction is composed of its feature contributions
and a bias.

* Support multi-class models

* Calculate learning_rate per-tree instead of using the one from the first tree

* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly

* Add simple test for contributions feature

* Check against param.num_nodes instead of checking for non-zero length

* Loop over all roots instead of only the first
2017-05-14 00:58:10 -05:00
Philip Cho
2715baef64 Fix bugs in multithreaded ApplySplitSparseData() (#2161)
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()

When there are more threads than rows in the rowset, some threads end up
with empty ranges, causing them to crash. (iend - 1 must be
accessible as part of the algorithm.)

Fix: run only those threads with nonempty ranges.

* Add regression test for Bugfix 1

* Moving python_omp_test to existing python test group

Turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading. Just add the nthread parameter.

* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature

When split value is less than all cut points, split_cond is set
incorrectly.

Fix: set split_cond = -1 to indicate this scenario

* Bugfix 3: Initialize data layout indicator before using it

data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.

Fix: re-order code in InitData() to initialize data_layout_ first

* Adding regression test for Bugfix 2

Unfortunately, no regression test for Bugfix 3, as there is no
way to deterministically assign a value to an uninitialized variable.
2017-04-02 11:37:39 -07:00
Philip Cho
14fba01b5a Improve multi-threaded performance (#2104)
* Add UpdatePredictionCache() option to updaters

Some updaters (e.g. fast_hist) have enough information to quickly compute
the prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.

* Respond to code review

* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale

* Improve multi-threaded performance

* Use columnwise accessor to accelerate ApplySplit() step,
  with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()

* Add missing #if macro

* Respond to code review

* Use wrapper to enable parallel sort on Linux

* Fix C++ compatibility issues

* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword

* Fix lint issues

* Respond to code review

* Fix bug in ApplySplitSparseData()

* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows have missing values

* Fix training continuation bug

Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.

* Add regression test for fast_hist

* Respond to code review

* Add back old version of ApplySplitSparseData
2017-03-25 10:35:01 -07:00
Laurae
5c13aa0a8a GLM test unit: make run deterministic (#2147) 2017-03-24 08:54:39 -07:00
Icyblade Dai
301540f1d9 fix DeprecationWarning on sklearn.cross_validation (#2075)
* fix DeprecationWarning on sklearn.cross_validation

* fix syntax

* fix kfold n_split issue

* fix typo

* fix n_splits multiple value issue

* split should pass an iterable

* use np.arange instead of xrange, py3 compatibility
2017-03-17 08:38:22 -05:00
Philip Cho
aeb4e76118 Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support the histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` in the configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its elements may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
  alignas(*) and the new initializer syntax. To support g++ 4.6, use pre-C++11
  initializers and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which use `using` to declare type aliases), so always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists entirely of
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assigns both 0's and 1's to the same single bin.

Why? gmat contains only the smallest value (0) and an upper bound (2), which is
twice the maximum value (1). Add the maximum value itself to gmat to fix the issue.

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
2017-01-13 09:25:55 -08:00
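
A sketch of the new knobs together, written with 'hist' as the tree_method value (the spelling the released parameter docs settled on; the message above calls the updater fast_hist):

    import numpy as np
    import xgboost as xgb

    dtrain = xgb.DMatrix(np.random.rand(1000, 10),
                         label=np.random.randint(2, size=1000))

    params = {'objective': 'binary:logistic',
              'tree_method': 'hist',       # histogram-based updater
              'grow_policy': 'lossguide',  # split highest loss change
              'max_leaves': 64,
              'max_bin': 256}
    bst = xgb.train(params, dtrain, num_boost_round=20)
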
jokari69
fb0fc0c580 option to shuffle data in mknfolds (#1459)
* option to shuffle data in mknfolds

* removed possibility to run as stand alone test

* split function def in 2 lines for lint
2016-12-23 07:53:30 +08:00
Yuan (Terry) Tang
090b37e85d Bumped up err assert in glm test (#1792) 2016-11-20 18:23:19 -06:00
AbdealiJK
b94fcab4dc Add dump_format=json option (#1726)
* Add format to the params accepted by DumpModel

Currently, only the text format is supported when trying to dump
a model. The plan is to add more formats, such as JSON, which are
easy to read and/or parse by machines, and to make the interface
generic enough to allow other formats to be added.

Hence, we make some modifications to make these functions generic
and accept a new parameter "format" which signifies the format of
the dump to be created.

* Fix typos and errors in docs

* plugin: Mention all the register macros available

Document the register macros currently available to plugin
writers so they know exactly what can be extended using hooks.

* sparse_page_source: Use same arg name in .h and .cc

* gbm: Add JSON dump

The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.

The JSON file holds an array; each item is a JSON object for one tree.
For gblinear:
 - The item is the bias and weight vectors
For gbtree:
 - The item is the root node. The root node has an attribute "children"
   which holds the child nodes. This nesting continues recursively.

* core.py: Add arg dump_format for get_dump()
2016-11-04 09:55:25 -07:00
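
A sketch of the new dump_format argument from Python:

    import json
    import numpy as np
    import xgboost as xgb

    X, y = np.random.rand(100, 4), np.random.randint(2, size=100)
    bst = xgb.train({'objective': 'binary:logistic'},
                    xgb.DMatrix(X, label=y), num_boost_round=5)

    dumps = bst.get_dump(dump_format='json')  # one JSON string per tree
    root = json.loads(dumps[0])  # root node; children nested under it
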
Yuan (Terry) Tang
63829d656c Fix mknfold using new StratifiedKFold API (#1660) 2016-10-12 14:43:37 -07:00
Yuan (Terry) Tang
a64fd74421 Fix wrong expected feature types (#1646) 2016-10-08 21:16:29 -07:00
Vadim Khotilovich
693ddb860e More robust DMatrix creation from a sparse matrix (#1606)
* [CORE] DMatrix from sparse w/ explicit #col #row; safer arg types

* [python-package] c-api change for _init_from_csr _init_from_csc

* fix spaces

* [R-package] adopt the new XGDMatrixCreateFromCSCEx interface

* [CORE] redirect old sparse creators to new ones
2016-09-25 10:01:22 -07:00
Yoshinori Nakano
949d1e3027 add Dart booster (#1220) 2016-06-08 14:04:01 -07:00
Vadim Khotilovich
9a48a40cf1 Fixes for multiple and default metric (#1239)
* fix multiple evaluation metrics

* create DefaultEvalMetric only when really necessary

* py test for #1239

* make travis happy
2016-06-04 22:17:35 -07:00
tqchen
149589c583 [PYTHON] Refactor training API to use callback 2016-05-19 21:31:23 -07:00
Alistair Johnson
6750c8b743 Added other feature importances in python package (#1135)
* added new function to calculate other feature importances

* added capability to plot other feature importance measures

* changed plotting default to fscore

* added info on importance_type to boilerplate comment

* updated text of error statement

* added self module name to fix call

* added unit test for feature importances

* style fixes
2016-05-02 12:25:24 -05:00
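
A sketch of plotting an alternative importance measure (matplotlib assumed available):

    import numpy as np
    import xgboost as xgb
    import matplotlib.pyplot as plt

    X, y = np.random.rand(200, 5), np.random.randint(2, size=200)
    bst = xgb.train({'objective': 'binary:logistic'},
                    xgb.DMatrix(X, label=y), num_boost_round=20)

    # The default stays the f-score ('weight'); 'gain' and 'cover'
    # are the additional measures wired into plotting here.
    xgb.plot_importance(bst, importance_type='gain')
    plt.show()
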
sinhrks
9da2f3e613 DOC/TST: Fix Python sklearn dep 2016-05-01 17:27:43 +09:00
Faron
ad3f49e881 [py] eta decay bugfix 2016-04-30 15:51:57 +02:00
sinhrks
6bab164d80 Bug mixing DMatrix's with and without feature names 2016-04-30 14:42:57 +09:00
Faron
cf607e2448 [py] split value histograms 2016-04-28 20:26:21 +02:00
sinhrks
c55cc809e5 BUG: XGBClassifier.feature_importances_ raises ValueError if input is pandas DataFrame 2016-04-27 21:50:03 +09:00
sinhrks
8fc2456c87 Enable flake8 2016-04-24 17:32:31 +09:00
tqchen
ec2fb5bc48 Fix multi-class loading 2016-03-10 19:22:26 -08:00