xgboost

Author	SHA1	Message	Date
Maurus Cuelenaere	6bd1869026	Add prediction of feature contributions (#2003) * Add prediction of feature contributions This implements the idea described at http://blog.datadive.net/interpreting-random-forests/ which tries to give insight into how a prediction is composed of its feature contributions and a bias. * Support multi-class models * Calculate learning_rate per-tree instead of using the one from the first tree * Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly * Add simple test for contributions feature * Check against param.num_nodes instead of checking for non-zero length * Loop over all roots instead of only the first	2017-05-14 00:58:10 -05:00
Philip Cho	2715baef64	Fix bugs in multithreaded ApplySplitSparseData() (#2161) * Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData() When there are more threads than rows in rowset, some threads end up with empty ranges, causing them to crash. (iend - 1 needs to be accessible as part of algorithm) Fix: run only those threads with nonempty ranges. * Add regression test for Bugfix 1 * Moving python_omp_test to existing python test group Turns out you don't need to set "OMP_NUM_THREADS" to enable multithreading. Just add nthread parameter. * Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature When split value is less than all cut points, split_cond is set incorrectly. Fix: set split_cond = -1 to indicate this scenario * Bugfix 3: Initialize data layout indicator before using it data_layout_ is accessed before being set; this variable determines whether feature 0 is included in feat_set. Fix: re-order code in InitData() to initialize data_layout_ first * Adding regression test for Bugfix 2 Unfortunately, no regression test for Bugfix 3, as there is no way to deterministically assign value to an uninitialized variable.	2017-04-02 11:37:39 -07:00
Philip Cho	14fba01b5a	Improve multi-threaded performance (#2104) * Add UpdatePredictionCache() option to updaters Some updaters (e.g. fast_hist) have enough information to quickly compute prediction cache for the training data. Each updater may override UpdatePredictionCache() method to update the prediction cache. Note: this trick does not apply to validation data. * Respond to code review * Disable some debug messages by default * Document UpdatePredictionCache() interface * Remove base_margin logic from UpdatePredictionCache() implementation * Do not take pointer to cfg, as reference may get stale * Improve multi-threaded performance * Use columnwise accessor to accelerate ApplySplit() step, with support for a compressed representation * Parallel sort for evaluation step * Inline BuildHist() function * Cache gradient pairs when building histograms in BuildHist() * Add missing #if macro * Respond to code review * Use wrapper to enable parallel sort on Linux * Fix C++ compatibility issues * MSVC doesn't support unsigned in OpenMP loops * gcc 4.6 doesn't support using keyword * Fix lint issues * Respond to code review * Fix bug in ApplySplitSparseData() * Attempting to read beyond the end of a sparse column * Mishandling the case where an entire range of rows have missing values * Fix training continuation bug Disable UpdatePredictionCache() in the first iteration. This way, we can accommodate the scenario where we build off of an existing (nonempty) ensemble. * Add regression test for fast_hist * Respond to code review * Add back old version of ApplySplitSparseData	2017-03-25 10:35:01 -07:00
Laurae	5c13aa0a8a	GLM test unit: make run deterministic (#2147)	2017-03-24 08:54:39 -07:00
Icyblade Dai	301540f1d9	fix DeprecationWarning on sklearn.cross_validation (#2075) * fix DeprecationWarning on sklearn.cross_validation * fix syntax * fix kfold n_split issue * fix mistype * fix n_splits multiple value issue * split should pass an iterable * use np.arange instead of xrange, py3 compatibility	2017-03-17 08:38:22 -05:00
Tianqi Chen	d581a3d0e7	[UPDATE] Update rabit and threadlocal (#2114) * [UPDATE] Update rabit and threadlocal * minor fix to make build system happy * upgrade requirement to g++4.8 * upgrade dmlc-core * update travis	2017-03-16 18:48:37 -07:00
Philip Cho	aeb4e76118	Histogram Optimized Tree Grower (#1940) * Support histogram-based algorithm + multiple tree growing strategy * Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set `tree_method = fast_hist` in the configuration. * Support multiple tree growing strategies. For now, two policies are supported: * `grow_policy=depthwise` (default): favor splitting at nodes closest to the root, i.e. grow depth-wise. * `grow_policy=lossguide`: favor splitting at nodes with highest loss change * Improve single-threaded performance * Unroll critical loops * Introduce specialized code for dense data (i.e. no missing values) * Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose` * Adding a small test for hist method * Fix memory error in row_set.h When std::vector is resized, a reference to one of its elements may become stale. Any such reference must be updated as well. * Resolve cross-platform compilation issues * Versions of g++ older than 4.8 lack support for a few C++11 features, e.g. alignas() and new initializer syntax. To support g++ 4.6, use pre-C++11 initializer and remove alignas(). * Versions of MSVC older than 2015 do not support alignas(). To support MSVC 2012, remove alignas(). * For g++ 4.8 and newer, alignas() is enabled for performance benefits. Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases (which use `using` to declare type aliases). So always use `typedef`. * Fix a host of CI issues * Remove dependency for libz on osx * Fix heading for hist_util * Fix minor style issues * Add missing #include * Remove extraneous logging * Enable tree_method=hist in R * Renaming HistMaker to GHistBuilder to avoid confusion * Fix R integration * Respond to style comments * Consistent tie-breaking for priority queue using timestamps * Last-minute style fixes * Fix issuecomment-271977647 The way we quantize data is broken. The agaricus data consists of all categorical values. When NAs are converted into 0's, `HistCutMatrix::Init` assigns both 0's and 1's to the same single bin. Why? gmat only has the smallest value (0) and an upper bound (2), which is twice the maximum value (1). Add the maximum value itself to gmat to fix the issue. * Fix issuecomment-272266358 * Remove padding from cut values for the continuous case * For categorical/ordinal values, use midpoints as bin boundaries to be safe * Fix CI issue -- do not use xrange() Fix corner case in quantile sketch Signed-off-by: Philip Cho <chohyu01@cs.washington.edu> * Adding a test for an edge case in quantile sketcher max_bin=2 used to cause an exception. * Fix fast_hist test The test used to require a strictly increasing Test AUC for all examples. One of them exhibits a small blip in Test AUC before achieving a Test AUC of 1. (See bottom.) Solution: do not require monotonic increase for this particular example. [0] train-auc:0.99989 test-auc:0.999497 [1] train-auc:1 test-auc:0.999749 [2] train-auc:1 test-auc:0.999749 [3] train-auc:1 test-auc:0.999749 [4] train-auc:1 test-auc:0.999749 [5] train-auc:1 test-auc:0.999497 [6] train-auc:1 test-auc:1 [7] train-auc:1 test-auc:1 [8] train-auc:1 test-auc:1 [9] train-auc:1 test-auc:1	2017-01-13 09:25:55 -08:00
jokari69	fb0fc0c580	option to shuffle data in mknfolds (#1459) * option to shuffle data in mknfolds * removed possibility to run as stand alone test * split function def in 2 lines for lint	2016-12-23 07:53:30 +08:00
AbdealiJK	47ba2de7d4	tests/cpp: Add tests for multiclass_metric.cc	2016-12-04 11:25:57 -08:00
AbdealiJK	a7e20555a3	tests/cpp: Add tests for rank_metrics.cc	2016-12-04 11:25:57 -08:00
AbdealiJK	4a2ef130a7	tests/cpp: Add test for elementwise_metric.cc	2016-12-04 11:25:57 -08:00
AbdealiJK	03abd47f49	tests/cpp: Add tests for Metric RMSE	2016-12-04 11:25:57 -08:00
AbdealiJK	582c373274	tests/cpp: Add tests for metric.cc	2016-12-04 11:25:57 -08:00
AbdealiJK	cc859420ba	tests/cpp: Add tests for TweedieRegression	2016-12-04 11:25:57 -08:00
AbdealiJK	fa865564f6	tests/cpp: Add tests for GammaRegression	2016-12-04 11:25:57 -08:00
AbdealiJK	401e4b5220	tests/cpp: Add tests for PoissonRegression	2016-12-04 11:25:57 -08:00
AbdealiJK	d41aab4f61	tests/cpp: Add tests for regression_obj.cc Test the objective functions in regression_obj.cc tests/cpp: Add tests for objective.cc and RegLossObj	2016-12-04 11:25:57 -08:00
AbdealiJK	fd99d39372	tests/cpp: Add tests for SplitEntry	2016-12-04 11:25:57 -08:00
AbdealiJK	62e3468603	tests/cpp: Add tests for param.h	2016-12-04 11:25:57 -08:00
AbdealiJK	d6407c3746	tests/cpp: Add tests for SparsePageDMatrix The SparsePageDMatrix or external memory DMatrix reads data from the file IO rather than load it into RAM.	2016-12-04 11:25:57 -08:00
AbdealiJK	c3629c91d3	tests/cpp: Add tests for SimpleCSRSource Test the binary format saved and read by a SimpleDMatrix, which is internally the SimpleCSRSource.	2016-12-04 11:25:57 -08:00
AbdealiJK	be0f55d563	tests/cpp: Add tests for SimpleDMatrix	2016-12-04 11:25:57 -08:00
AbdealiJK	ef7fe06cf8	tests/cpp/test_metainfo: Add tests to save and load	2016-12-04 11:25:57 -08:00
AbdealiJK	8eb69e0677	travis: Add code coverage on success Update the code coverage of the project on codecov for easy viewing. Also the gcov on travis uses a different version which cannot find the directory of the given files, and it needs to be specified in the -o flag. Hence now we loop over the list of files and run them independently.	2016-12-04 11:25:57 -08:00
AbdealiJK	61a9b3a49e	travis: Run CPP tests	2016-12-04 11:25:57 -08:00
AbdealiJK	006f9e0760	Makefile: Add CPP code coverage	2016-12-04 11:25:57 -08:00
AbdealiJK	1f2ad36bad	Add make commands for tests This adds the make commands required to build and run tests.	2016-12-04 11:25:57 -08:00
Yuan (Terry) Tang	090b37e85d	Bumped up err assert in glm test (#1792)	2016-11-20 18:23:19 -06:00
Tianqi Chen	060a0ac396	Update setup.sh	2016-11-19 17:57:47 -08:00
Tianqi Chen	aa841ee58d	Update setup.sh	2016-11-19 17:56:36 -08:00
AbdealiJK	b94fcab4dc	Add dump_format=json option (#1726) * Add format to the params accepted by DumpModel Currently, only the text format is supported when trying to dump a model. The plan is to add more such formats like JSON which are easy to read and/or parse by machines. And to make the interface for this even more generic to allow other formats to be added. Hence, we make some modifications to make these functions generic and accept a new parameter "format" which signifies the format of the dump to be created. * Fix typos and errors in docs * plugin: Mention all the register macros available Document the register macros currently available to the plugin writers so they know what exactly can be extended using hooks. * sparce_page_source: Use same arg name in .h and .cc * gbm: Add JSON dump The dump_format argument can be used to specify what type of dump file should be created. Add functionality to dump gblinear and gbtree into a JSON file. The JSON file has an array, each item is a JSON object for the tree. For gblinear: - The item is the bias and weights vectors For gbtree: - The item is the root node. The root node has an attribute "children" which holds the children nodes. This happens recursively. * core.py: Add arg dump_format for get_dump()	2016-11-04 09:55:25 -07:00
Yuan (Terry) Tang	63829d656c	Fix mknfold using new StratifiedKFold API (#1660)	2016-10-12 14:43:37 -07:00
Yuan (Terry) Tang	a64fd74421	Fix wrong expected feature types (#1646)	2016-10-08 21:16:29 -07:00
Vadim Khotilovich	693ddb860e	More robust DMatrix creation from a sparse matrix (#1606) * [CORE] DMatrix from sparse w/ explicit #col #row; safer arg types * [python-package] c-api change for _init_from_csr _init_from_csc * fix spaces * [R-package] adopt the new XGDMatrixCreateFromCSCEx interface * [CORE] redirect old sparse creators to new ones	2016-09-25 10:01:22 -07:00
RAMitchell	93196eb811	cmake build system (#1314) * Changed c api to compile under MSVC * Include functional.h header for MSVC * Add cmake build	2016-07-02 19:07:35 -07:00
Yoshinori Nakano	949d1e3027	add Dart booster (#1220)	2016-06-08 14:04:01 -07:00
Vadim Khotilovich	9a48a40cf1	Fixes for multiple and default metric (#1239) * fix multiple evaluation metrics * create DefaultEvalMetric only when really necessary * py test for #1239 * make travis happy	2016-06-04 22:17:35 -07:00
tqchen	149589c583	[PYTHON] Refactor training API to use callback	2016-05-19 21:31:23 -07:00
Alistair Johnson	6750c8b743	Added other feature importances in python package (#1135) * added new function to calculate other feature importances * added capability to plot other feature importance measures * changed plotting default to fscore * added info on importance_type to boilerplate comment * updated text of error statement * added self module name to fix call * added unit test for feature importances * style fixes	2016-05-02 12:25:24 -05:00
sinhrks	9da2f3e613	DOC/TST: Fix Python sklearn dep	2016-05-01 17:27:43 +09:00
Faron	ad3f49e881	[py] eta decay bugfix	2016-04-30 15:51:57 +02:00
sinhrks	6bab164d80	Bug mixing DMatrix's with and without feature names	2016-04-30 14:42:57 +09:00
Faron	cf607e2448	[py] split value histograms	2016-04-28 20:26:21 +02:00
sinhrks	c55cc809e5	BUG: XGBClassifier.feature_importances_ raises ValueError if input is pandas DataFrame	2016-04-27 21:50:03 +09:00
sinhrks	8fc2456c87	Enable flake8	2016-04-24 17:32:31 +09:00
CodingCat	a31a978471	run native lib building command from maven	2016-03-16 16:47:08 -04:00
tqchen	ec2fb5bc48	Fix multi-class loading	2016-03-10 19:22:26 -08:00
tqchen	ced6d45e01	Update rabit	2016-03-02 20:53:34 -08:00
CodingCat	f8fff6c6fc	rename files/packages	2016-03-01 23:48:35 -05:00
CodingCat	55e36893cd	add style check for java and scala code	2016-03-01 20:53:50 -05:00

164 Commits
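
Commit 6bd1869026 in the log above describes a prediction as a bias plus a sum of per-feature contributions. The following is a minimal sketch of that decomposition for a single tree, not the code from the commit. The dict-based tree layout and the helper names `leaf`, `node`, and `contributions` are hypothetical and chosen only for illustration; each node is assumed to store the mean value of the training targets that reach it.

```python
# Hypothetical toy tree representation (not xgboost's internal format).
# Every node carries "value": the mean target of the samples reaching it.

def leaf(value):
    return {"value": value}

def node(feature, threshold, value, left, right):
    return {"feature": feature, "threshold": threshold,
            "value": value, "left": left, "right": right}

def contributions(tree, x, n_features):
    """Walk the decision path for x and credit each change in node value
    to the feature that was split on at that step."""
    contribs = [0.0] * n_features
    bias = tree["value"]  # value at the root, before any split is applied
    current = tree
    while "feature" in current:  # leaves have no "feature" key
        f = current["feature"]
        if x[f] < current["threshold"]:
            child = current["left"]
        else:
            child = current["right"]
        contribs[f] += child["value"] - current["value"]
        current = child
    return bias, contribs

# Root splits on feature 0; its right child splits on feature 1.
tree = node(0, 0.5, 10.0,
            leaf(4.0),
            node(1, 2.0, 16.0, leaf(12.0), leaf(20.0)))

bias, contribs = contributions(tree, [0.7, 3.0], n_features=2)
prediction = bias + sum(contribs)  # equals the value of the leaf reached
```

For an ensemble, the same walk would be repeated per tree and the results summed, with each tree's values scaled as that model requires; the commit message notes that the learning rate is calculated per tree.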