xgboost

Author	SHA1	Message	Date
Philip Cho	2715baef64	Fix bugs in multithreaded ApplySplitSparseData() (#2161 ) * Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData() When there are more threads than rows in rowset, some threads end up with empty ranges, causing them to crash. (iend - 1 needs to be accessible as part of algorithm) Fix: run only those threads with nonempty ranges. * Add regression test for Bugfix 1 * Moving python_omp_test to existing python test group Turns out you don't need to set "OMP_NUM_THREADS" to enable multithreading. Just add nthread parameter. * Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature When split value is less than all cut points, split_cond is set incorrectly. Fix: set split_cond = -1 to indicate this scenario * Bugfix 3: Initialize data layout indicator before using it data_layout_ is accessed before being set; this variable determines whether feature 0 is included in feat_set. Fix: re-order code in InitData() to initialize data_layout_ first * Adding regression test for Bugfix 2 Unfortunately, no regression test for Bugfix 3, as there is no way to deterministically assign value to an uninitialized variable.	2017-04-02 11:37:39 -07:00
Huffers	d45cf240a9	Remove xgboost's thread_local and switch to dmlc::ThreadLocalStore (#2121 ) * Remove xgboost's own version of thread_local and switch to dmlc::ThreadLocalStore (#2109) * Update dmlc-core	2017-03-27 09:09:18 -07:00
Philip Cho	14fba01b5a	Improve multi-threaded performance (#2104 ) * Add UpdatePredictionCache() option to updaters Some updaters (e.g. fast_hist) have enough information to quickly compute the prediction cache for the training data. Each updater may override the UpdatePredictionCache() method to update the prediction cache. Note: this trick does not apply to validation data. * Respond to code review * Disable some debug messages by default * Document UpdatePredictionCache() interface * Remove base_margin logic from UpdatePredictionCache() implementation * Do not take pointer to cfg, as reference may get stale * Improve multi-threaded performance * Use columnwise accessor to accelerate ApplySplit() step, with support for a compressed representation * Parallel sort for evaluation step * Inline BuildHist() function * Cache gradient pairs when building histograms in BuildHist() * Add missing #if macro * Respond to code review * Use wrapper to enable parallel sort on Linux * Fix C++ compatibility issues * MSVC doesn't support unsigned in OpenMP loops * gcc 4.6 doesn't support using keyword * Fix lint issues * Respond to code review * Fix bug in ApplySplitSparseData() * Attempting to read beyond the end of a sparse column * Mishandling the case where an entire range of rows have missing values * Fix training continuation bug Disable UpdatePredictionCache() in the first iteration. This way, we can accommodate the scenario where we build off of an existing (nonempty) ensemble. * Add regression test for fast_hist * Respond to code review * Add back old version of ApplySplitSparseData	2017-03-25 10:35:01 -07:00
Qin Xiaoming	12cf0ae122	Update sparse_page_dmatrix.h (#2139 )	2017-03-23 11:01:40 -07:00
Zhiquan	e65564ba59	Update rank_obj.cc (#2126 ) typo: PairwieRankObj -> PairwiseRankObj	2017-03-21 20:06:16 -07:00
Tianqi Chen	d581a3d0e7	[UPDATE] Update rabit and threadlocal (#2114 ) * [UPDATE] Update rabit and threadlocal * minor fix to make build system happy * upgrade requirement to g++4.8 * upgrade dmlc-core * update travis	2017-03-16 18:48:37 -07:00
Oleg Sofrygin	9d19e13ed0	adding a copy of base_margin to slice, fixes a bug where base_margin was not copied during cross-validation (#2007 )	2017-03-16 10:36:57 -07:00
Tianqi Chen	fd19b7a188	Automatically remove nan from input data when it is sparse. (#2062 ) * [DATALoad] Automatically remove Nan when load from sparse matrix * add log	2017-02-25 08:59:17 -08:00
Theodore Vasiloudis	9fb46e2c5e	[trivial] Fix typo in Poisson metric name. (#2026 )	2017-02-09 09:32:06 -08:00
Philip Cho	5d74578095	Disallow multiple roots for tree_method=hist (#1979 ) As discussed in issue #1978, tree_method=hist ignores the parameter param.num_roots; it simply assumes that the tree has only one root. In particular, when InitData() method initializes row_set_collection_, it simply assigns all rows to node 0, the value that's hard-coded. For now, the updater will simply fail when num_roots exceeds 1. I will revise the updater soon to support multiple roots.	2017-01-21 12:02:29 -08:00
Philip Cho	49ff7c1649	Rename parameter in fast_hist to disambiguate (#1962 )	2017-01-13 11:35:55 -08:00
Philip Cho	aeb4e76118	Histogram Optimized Tree Grower (#1940 ) * Support histogram-based algorithm + multiple tree growing strategy * Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set `tree_method = fast_hist` in the configuration. * Support multiple tree growing strategies. For now, two policies are supported: * `grow_policy=depthwise` (default): favor splitting at nodes closest to the root, i.e. grow depth-wise. * `grow_policy=lossguide`: favor splitting at nodes with highest loss change * Improve single-threaded performance * Unroll critical loops * Introduce specialized code for dense data (i.e. no missing values) * Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose` * Adding a small test for hist method * Fix memory error in row_set.h When std::vector is resized, a reference to one of its elements may become stale. Any such reference must be updated as well. * Resolve cross-platform compilation issues * Versions of g++ older than 4.8 lack support for a few C++11 features, e.g. alignas() and new initializer syntax. To support g++ 4.6, use pre-C++11 initializer and remove alignas(). * Versions of MSVC older than 2015 do not support alignas(). To support MSVC 2012, remove alignas(). * For g++ 4.8 and newer, alignas() is enabled for performance benefits. Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases (which use `using` to declare type aliases). So always use `typedef`. * Fix a host of CI issues * Remove dependency for libz on osx * Fix heading for hist_util * Fix minor style issues * Add missing #include * Remove extraneous logging * Enable tree_method=hist in R * Renaming HistMaker to GHistBuilder to avoid confusion * Fix R integration * Respond to style comments * Consistent tie-breaking for priority queue using timestamps * Last-minute style fixes * Fix issuecomment-271977647 The way we quantize data is broken. The agaricus data consists of all categorical values. When NAs are converted into 0's, `HistCutMatrix::Init` assigns both 0's and 1's to the same single bin. Why? gmat has only the smallest value (0) and an upper bound (2), which is twice the maximum value (1). Add the maximum value itself to gmat to fix the issue. * Fix issuecomment-272266358 * Remove padding from cut values for the continuous case * For categorical/ordinal values, use midpoints as bin boundaries to be safe * Fix CI issue -- do not use xrange() Fix corner case in quantile sketch Signed-off-by: Philip Cho <chohyu01@cs.washington.edu> * Adding a test for an edge case in quantile sketcher max_bin=2 used to cause an exception. * Fix fast_hist test The test used to require a strictly increasing Test AUC for all examples. One of them exhibits a small blip in Test AUC before achieving a Test AUC of 1. (See bottom.) Solution: do not require monotonic increase for this particular example. [0] train-auc:0.99989 test-auc:0.999497 [1] train-auc:1 test-auc:0.999749 [2] train-auc:1 test-auc:0.999749 [3] train-auc:1 test-auc:0.999749 [4] train-auc:1 test-auc:0.999749 [5] train-auc:1 test-auc:0.999497 [6] train-auc:1 test-auc:1 [7] train-auc:1 test-auc:1 [8] train-auc:1 test-auc:1 [9] train-auc:1 test-auc:1	2017-01-13 09:25:55 -08:00
Vadim Khotilovich	d23ea5ca7d	An option for doing binomial+1 or epsilon-dropout from DART paper (#1922 ) * An option for doing binomial+1 or epsilon-dropout from DART paper * use callback-based discrete_distribution to make MSVC2013 happy	2017-01-05 16:23:22 -08:00
Qiang Kou (KK)	7948d1c799	disable openmp on solaris (#1912 )	2016-12-28 11:32:56 -08:00
wxchan	cee4aafb93	fix dart bug (#1882 )	2016-12-19 18:01:28 +01:00
Simon DENEL	7078c41dad	Changing omp_get_num_threads to omp_get_max_threads (#1831 ) * Updating dmlc-core * Changing omp_get_num_threads to omp_get_max_threads	2016-12-04 11:26:45 -08:00
AbdealiJK	5912e051b1	rank_metric.cc: Use GetWeight in EvalAMS The GetWeight is a wrapper which sets the correct weight if the weights vector is not provided. Hence accessing the default weights vector is not recommended.	2016-12-04 11:25:57 -08:00
AbdealiJK	b045ccd764	data.cc: Remove redundant ftype variable	2016-12-04 11:25:57 -08:00
JohnStott	1683e07461	Fix issue introduced from correction to log2 (#1837 ) https://github.com/dmlc/xgboost/pull/1642	2016-12-04 11:11:56 -08:00
Vadim Khotilovich	a44032d095	[CORE] The update process for a tree model, and its application to feature importance (#1670 ) * [CORE] allow updating trees in an existing model * [CORE] in refresh updater, allow keeping old leaf values and update stats only * [R-package] xgb.train mod to allow updating trees in an existing model * [R-package] added check for nrounds when is_update * [CORE] merge parameter declaration changes; unify their code style * [CORE] move the update-process trees initialization to Configure; rename default process_type to 'default'; fix the trees and trees_to_update sizes comparison check * [R-package] unit tests for the update process type * [DOC] documentation for process_type parameter; improved docs for updater, Gamma and Tweedie; added some parameter aliases; metrics indentation and some were non-documented * fix my sloppy merge conflict resolutions * [CORE] add a TreeProcessType enum * whitespace fix	2016-12-04 09:33:52 -08:00
AbdealiJK	6f16f0ef58	Use bst_float consistently throughout (#1824 ) * Fix various typos * Add override to functions that are overridden gcc gives warnings about functions that are overridden but not marked as overridden. This fixes it. * Use bst_float consistently Use bst_float for all the variables that involve weight, leaf value, gradient, hessian, gain, loss_chg, predictions, base_margin, feature values. In some cases, when due to additions and so on the value can take a larger value, double is used. This ensures that type conversions are minimal and reduces loss of precision.	2016-11-30 10:02:10 -08:00
RAMitchell	be2f28ec08	Update build instructions, improve memory usage (#1811 )	2016-11-25 09:43:22 -08:00
AbdealiJK	97371ff7e5	c_api.cc: Bring back silent argument (#1794 ) In ecb3a271bed151252fb048528ce5a90ad75bb68f the silent argument in XGDMatrixCreateFromFile of c_api.cc was always overridden to be false. This disabled the functionality to hide log messages. This commit reverts that part to enable the hiding of log messages.	2016-11-20 22:04:36 -08:00
Tony DiFranco	f11f2bd5fd	add default to poisson -> max_delta_step to enable loading/saving/dumping of model (#1781 )	2016-11-16 14:25:00 -08:00
Simon DENEL	58aa1129ea	Fixing a few typos (#1771 ) * Fixing a few typos * Fixing a few typos	2016-11-13 15:47:52 -08:00
Morten Hustveit	8b9d9669bb	Have ConsoleLogger log to stderr instead of stdout (#1714 ) On Unix systems, it's common for programs to read their input from stdin, and write their output to stdout. Messages should be written to stderr, where they won't corrupt a program's output, and where they can be seen by the user even if the output is being redirected. This is mostly a problem when XGBoost is being used from Python or from another program.	2016-11-10 12:39:52 -08:00
wl2776	6b5a23ccd5	fix build in MSVC 2013 (#1757 )	2016-11-10 12:34:30 -08:00
Tony DiFranco	2ad0948444	Tweedie Regression Post-Rebase (#1737 ) * add support for tweedie regression * added back readme line that was accidentally deleted * fixed linting errors * add support for tweedie regression * added back readme line that was accidentally deleted * fixed linting errors * rebased with upstream master and added R example * changed parameter name to tweedie_variance_power * linting error fix * refactored tweedie-nloglik metric to be more like the other parameterized metrics * added upper and lower bound check to tweedie metric * add support for tweedie regression * added back readme line that was accidentally deleted * fixed linting errors * added upper and lower bound check to tweedie metric * added back readme line that was accidentally deleted * rebased with upstream master and added R example * rebased again on top of upstream master * linting error fix * added upper and lower bound check to tweedie metric * rebased with master * lint fix * removed whitespace at end of line 186 - elementwise_metric.cc	2016-11-05 17:02:32 -07:00
AbdealiJK	b94fcab4dc	Add dump_format=json option (#1726 ) * Add format to the params accepted by DumpModel Currently, only the text format is supported when trying to dump a model. The plan is to add more such formats like JSON which are easy to read and/or parse by machines. And to make the interface for this even more generic to allow other formats to be added. Hence, we make some modifications to make these functions generic and accept a new parameter "format" which signifies the format of the dump to be created. * Fix typos and errors in docs * plugin: Mention all the register macros available Document the register macros currently available to the plugin writers so they know what exactly can be extended using hooks. * sparse_page_source: Use same arg name in .h and .cc * gbm: Add JSON dump The dump_format argument can be used to specify what type of dump file should be created. Add functionality to dump gblinear and gbtree into a JSON file. The JSON file has an array, each item is a JSON object for the tree. For gblinear: - The item is the bias and weights vectors For gbtree: - The item is the root node. The root node has an attribute "children" which holds the children nodes. This happens recursively. * core.py: Add arg dump_format for get_dump()	2016-11-04 09:55:25 -07:00
AbdealiJK	378eb7d7c8	Fix typos and messages in docs (#1723 )	2016-10-30 22:52:19 -07:00
RAMitchell	ac41845d4b	Add GPU accelerated tree construction plugin (#1679 )	2016-10-20 20:14:47 -07:00
Liam Huang	001d8c4023	correct CalcDCG in rank_metric.cc and rank_obj.cc (#1642 ) * correct CalcDCG in rank_metric.cc DCG uses log base-2, however `std::log` returns log base-e. * correct CalcDCG in rank_obj.cc DCG uses log base-2, however `std::log` returns log base-e. * use std::log2 instead of std::log make it more elegant * use std::log2 instead of std::log make it more elegant	2016-10-18 10:23:41 -07:00
Shengwen Yang	3b9987ca9c	Fix the issue 1474 (#1615 ) * Fix 1474 * Fix crash issue when saving and loading poisson model * Rollback the wrong fix	2016-09-29 19:29:47 -07:00
Vadim Khotilovich	3efff6d052	fix for VX (#1614 )	2016-09-27 15:19:20 -07:00
phoenixbai	915ac0b8fe	the fix of missing value assignment for name_ variable in EvalRankList method (#1558 )	2016-09-26 08:57:17 -05:00
Vadim Khotilovich	693ddb860e	More robust DMatrix creation from a sparse matrix (#1606 ) * [CORE] DMatrix from sparse w/ explicit #col #row; safer arg types * [python-package] c-api change for _init_from_csr _init_from_csc * fix spaces * [R-package] adopt the new XGDMatrixCreateFromCSCEx interface * [CORE] redirect old sparse creators to new ones	2016-09-25 10:01:22 -07:00
Tianqi Chen	c93c9b7ed6	[TREE] Experimental version of monotone constraint (#1516 ) * [TREE] Experimental version of monotone constraint * Allow default detection of monotone option * loosen the condition of strict check * Update gbtree.cc	2016-09-07 21:28:43 -07:00
Tianqi Chen	ecec5f7959	[CORE] Refactor cache mechanism (#1540 )	2016-09-02 20:39:07 -07:00
Tianqi Chen	df38f251be	Fix warnings from g++5 or higher (#1510 )	2016-08-26 16:14:10 -07:00
Vadim Khotilovich	75f401481f	no exception throwing within omp parallel; set nthread in Learner (#1421 )	2016-07-29 10:08:03 -07:00
Shengwen Yang	7089301b62	Metrics for gamma regression (#1369 ) * Add deviance metric for gamma regression * Simplify the computation of nloglik for gamma regression * Add a description for gamma-deviance * Minor fix	2016-07-18 09:10:44 -05:00
anpark	0e61c514a7	fix duplicate loop over output_group when predict (#1342 ) * fix sparse page source meta info empty when load from dmatrix * fix duplicate loop over output_group when predict	2016-07-13 10:03:10 -07:00
anpark	3f32b3f0eb	fix sparse page source meta info empty when load from dmatrix (#1336 )	2016-07-07 21:17:35 -07:00
Shengwen Yang	77d17f6264	Add support for Gamma regression (#1258 ) * Add support for Gamma regression * Use base_score to replace the lp_bias * Remove the lp_bias config block * Add a demo for running gamma regression in Python * Typo fix * Revise the description for objective * Add a script to generate the autoclaims dataset	2016-07-06 10:22:46 -07:00
RAMitchell	93196eb811	cmake build system (#1314 ) * Changed c api to compile under MSVC * Include functional.h header for MSVC * Add cmake build	2016-07-02 19:07:35 -07:00
Frank	3b73824842	Fix ambiguous call to abs(c or c++). (#1308 )	2016-06-29 14:28:28 -07:00
Yoshinori Nakano	7cfeb5f012	fix Dart::NormalizeTrees (#1265 )	2016-06-09 15:28:24 -07:00
Yoshinori Nakano	949d1e3027	add Dart booster (#1220 )	2016-06-08 14:04:01 -07:00
Shengwen Yang	e034fdf74c	Fix issue #1236 : cli_main crashes when dumping count:poisson model (#1253 )	2016-06-07 21:52:47 -07:00
Vadim Khotilovich	9a48a40cf1	Fixes for multiple and default metric (#1239 ) * fix multiple evaluation metrics * create DefaultEvalMetric only when really necessary * py test for #1239 * make travis happy	2016-06-04 22:17:35 -07:00

447 Commits