xgboost

Author	SHA1	Message	Date
Rory Mitchell	e6a9063344	Integer gradient summation for GPU histogram algorithm. (#2681 )	2017-09-08 15:07:29 +12:00
Rory Mitchell	5661a67d20	Add parallel sort for MSVC (#2609 )	2017-08-17 17:14:39 +12:00
Rory Mitchell	eda9e180f0	[GPU-Plugin] Various fixes (#2579 ) * Fix test large * Add check for max_depth 0 * Update readme * Add LBS specialisation for dense data * Add bst_gpair_precise * Temporarily disable accuracy tests on test_large.py * Solve unused variable compiler warning * Fix max_bin > 1024 error	2017-08-05 22:16:23 +12:00
Vadim Khotilovich	00eda28b3c	MinGW: shared library prefix and appveyor CI (#2539 ) * for MinGW, drop the 'lib' prefix from shared library name * fix defines for 'g++ 4.8 or higher' to include g++ >= 5 * fix compile warnings * [Appveyor] add MinGW with python; remove redundant jobs * [Appveyor] also do python build for one of msvc jobs	2017-07-25 01:06:47 -05:00
Rory Mitchell	5f1b0bb386	[GPU-Plugin] Unify gpu_gpair/bst_gpair. Refactor. (#2477 )	2017-07-01 17:31:13 +12:00
Rory Mitchell	8ab5d4611c	[GPU-Plugin] (#2227 ) * Add fast histogram algorithm * Fix Linux build * Add 'gpu_id' parameter	2017-04-25 16:37:10 -07:00
Philip Cho	2715baef64	Fix bugs in multithreaded ApplySplitSparseData() (#2161 ) * Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData() When there are more threads than rows in rowset, some threads end up with empty ranges, causing them to crash. (iend - 1 needs to be accessible as part of algorithm) Fix: run only those threads with nonempty ranges. * Add regression test for Bugfix 1 * Moving python_omp_test to existing python test group Turns out you don't need to set "OMP_NUM_THREADS" to enable multithreading. Just add nthread parameter. * Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature When split value is less than all cut points, split_cond is set incorrectly. Fix: set split_cond = -1 to indicate this scenario * Bugfix 3: Initialize data layout indicator before using it data_layout_ is accessed before being set; this variable determines whether feature 0 is included in feat_set. Fix: re-order code in InitData() to initialize data_layout_ first * Adding regression test for Bugfix 2 Unfortunately, no regression test for Bugfix 3, as there is no way to deterministically assign value to an uninitialized variable.	2017-04-02 11:37:39 -07:00
Philip Cho	14fba01b5a	Improve multi-threaded performance (#2104 ) * Add UpdatePredictionCache() option to updaters Some updaters (e.g. fast_hist) has enough information to quickly compute prediction cache for the training data. Each updater may override UpdaterPredictionCache() method to update the prediction cache. Note: this trick does not apply to validation data. * Respond to code review * Disable some debug messages by default * Document UpdatePredictionCache() interface * Remove base_margin logic from UpdatePredictionCache() implementation * Do not take pointer to cfg, as reference may get stale * Improve multi-threaded performance * Use columnwise accessor to accelerate ApplySplit() step, with support for a compressed representation * Parallel sort for evaluation step * Inline BuildHist() function * Cache gradient pairs when building histograms in BuildHist() * Add missing #if macro * Respond to code review * Use wrapper to enable parallel sort on Linux * Fix C++ compatibility issues * MSVC doesn't support unsigned in OpenMP loops * gcc 4.6 doesn't support using keyword * Fix lint issues * Respond to code review * Fix bug in ApplySplitSparseData() * Attempting to read beyond the end of a sparse column * Mishandling the case where an entire range of rows have missing values * Fix training continuation bug Disable UpdatePredictionCache() in the first iteration. This way, we can accomodate the scenario where we build off of an existing (nonempty) ensemble. * Add regression test for fast_hist * Respond to code review * Add back old version of ApplySplitSparseData	2017-03-25 10:35:01 -07:00
Philip Cho	aeb4e76118	Histogram Optimized Tree Grower (#1940 ) * Support histogram-based algorithm + multiple tree growing strategy * Add a brand new updater to support histogram-based algorithm, which buckets continuous features into discrete bins to speed up training. To use it, set `tree_method = fast_hist` to configuration. * Support multiple tree growing strategies. For now, two policies are supported: * `grow_policy=depthwise` (default): favor splitting at nodes closest to the root, i.e. grow depth-wise. * `grow_policy=lossguide`: favor splitting at nodes with highest loss change * Improve single-threaded performance * Unroll critical loops * Introduce specialized code for dense data (i.e. no missing values) * Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose` * Adding a small test for hist method * Fix memory error in row_set.h When std::vector is resized, a reference to one of its element may become stale. Any such reference must be updated as well. * Resolve cross-platform compilation issues * Versions of g++ older than 4.8 lacks support for a few C++11 features, e.g. alignas() and new initializer syntax. To support g++ 4.6, use pre-C++11 initializer and remove alignas(). * Versions of MSVC older than 2015 does not support alignas(). To support MSVC 2012, remove alignas(). * For g++ 4.8 and newer, alignas() is enabled for performance benefits. Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases (which uses `using` to declate type aliases). So always use `typedef`. * Fix a host of CI issues * Remove dependency for libz on osx * Fix heading for hist_util * Fix minor style issues * Add missing #include * Remove extraneous logging * Enable tree_method=hist in R * Renaming HistMaker to GHistBuilder to avoid confusion * Fix R integration * Respond to style comments * Consistent tie-breaking for priority queue using timestamps * Last-minute style fixes * Fix issuecomment-271977647 The way we quantize data is broken. The agaricus data consists of all categorical values. When NAs are converted into 0's, `HistCutMatrix::Init` assign both 0's and 1's to the same single bin. Why? gmat only the smallest value (0) and an upper bound (2), which is twice the maximum value (1). Add the maximum value itself to gmat to fix the issue. * Fix issuecomment-272266358 * Remove padding from cut values for the continuous case * For categorical/ordinal values, use midpoints as bin boundaries to be safe * Fix CI issue -- do not use xrange() Fix corner case in quantile sketch Signed-off-by: Philip Cho <chohyu01@cs.washington.edu> * Adding a test for an edge case in quantile sketcher max_bin=2 used to cause an exception. * Fix fast_hist test The test used to require a strictly increasing Test AUC for all examples. One of them exhibits a small blip in Test AUC before achieving a Test AUC of 1. (See bottom.) Solution: do not require monotonic increase for this particular example. [0] train-auc:0.99989 test-auc:0.999497 [1] train-auc:1 test-auc:0.999749 [2] train-auc:1 test-auc:0.999749 [3] train-auc:1 test-auc:0.999749 [4] train-auc:1 test-auc:0.999749 [5] train-auc:1 test-auc:0.999497 [6] train-auc:1 test-auc:1 [7] train-auc:1 test-auc:1 [8] train-auc:1 test-auc:1 [9] train-auc:1 test-auc:1	2017-01-13 09:25:55 -08:00
AbdealiJK	6f16f0ef58	Use bst_float consistently throughout (#1824 ) * Fix various typos * Add override to functions that are overridden gcc gives warnings about functions that are being overridden by not being marked as oveirridden. This fixes it. * Use bst_float consistently Use bst_float for all the variables that involve weight, leaf value, gradient, hessian, gain, loss_chg, predictions, base_margin, feature values. In some cases, when due to additions and so on the value can take a larger value, double is used. This ensures that type conversions are minimal and reduces loss of precision.	2016-11-30 10:02:10 -08:00
Adam Pocock	445029bb82	[jvm-packages] XGBoost4j Windows fixes (#1639 ) * Changes for Mingw64 compilation to ensure long is a consistent size. Mainly impacts the Java API which would not compile, but there may be silent errors on Windows with large datasets before this patch (as long is 32-bits when compiled with mingw64 even in 64-bit mode). * Adding ifdefs to ensure it still compiles on MacOS * Makefile and create_jni.bat changes for Windows. * Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast * Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md	2016-10-18 08:35:25 -04:00
tqchen	2f2080a337	[TREE] Remove gap constraint, make tree construction more robust	2016-02-10 11:17:54 -08:00
tqchen	1495a43cea	[R] make all customizations to meet strict standard of cran	2016-01-16 10:25:12 -08:00
tqchen	ef1021e759	[IO] Enable external memory	2016-01-16 10:24:01 -08:00
tqchen	d75e3ed05d	[LIBXGBOOST] pass demo running.	2016-01-16 10:24:01 -08:00
tqchen	e4567bbc47	[REFACTOR] Add alias, allow missing variables, init gbm interface	2016-01-16 10:24:01 -08:00
tqchen	dedd87662b	[OBJ] Add basic objective function and registry	2016-01-16 10:24:01 -08:00
tqchen	46bcba7173	[DATA] basic data refactor done, basic version of csr source.	2016-01-16 10:24:00 -08:00
tqchen	7ff91fe5f9	Data interface ready	2016-01-16 10:24:00 -08:00
tqchen	d530e0c14f	[REFACTOR] cleanup structure	2016-01-16 10:24:00 -08:00

20 Commits