160 Commits

Author SHA1 Message Date
Rory Mitchell
5f1b0bb386 [GPU-Plugin] Unify gpu_gpair/bst_gpair. Refactor. (#2477) 2017-07-01 17:31:13 +12:00
PSEUDOTENSOR / Jonathan McKinney
6b287177c8 [GPU-Plugin] Multi-GPU gpu_id bug fixes for grow_gpu_hist and grow_gpu methods, and additional documentation for the gpu plugin. (#2463) 2017-06-30 20:04:17 +12:00
Rory Mitchell
0e48f87529 [GPU-Plugin] Make node_idx type 32 bit for hist algo. Set default n_gpus to 1. (#2445) 2017-06-23 18:26:45 +12:00
PSEUDOTENSOR / Jonathan McKinney
41efe32aa5 [GPU-Plugin] Multi-GPU for grow_gpu_hist histogram method using NVIDIA NCCL. (#2395) 2017-06-12 05:06:08 +12:00
Rory Mitchell
6bf968efe6 [GPU Plugin] Fast histogram speed improvements. Updated benchmarks. (#2258) 2017-05-08 09:21:38 -07:00
Rory Mitchell
8ab5d4611c [GPU-Plugin] (#2227)
* Add fast histogram algorithm
* Fix Linux build
* Add 'gpu_id' parameter
2017-04-25 16:37:10 -07:00
Philip Cho
2715baef64 Fix bugs in multithreaded ApplySplitSparseData() (#2161)
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()

When there are more threads than rows in rowset, some threads end up
with empty ranges, causing them to crash. (iend - 1 needs to be
accessible as part of algorithm)

Fix: run only those threads with nonempty ranges.

* Add regression test for Bugfix 1

* Moving python_omp_test to existing python test group

Turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading. Just add nthread parameter.

* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature

When split value is less than all cut points, split_cond is set
incorrectly.

Fix: set split_cond = -1 to indicate this scenario

* Bugfix 3: Initialize data layout indicator before using it

data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.

Fix: re-order code in InitData() to initialize data_layout_ first

* Adding regression test for Bugfix 2

Unfortunately, no regression test for Bugfix 3, as there is no
way to deterministically assign value to an uninitialized variable.
2017-04-02 11:37:39 -07:00
Philip Cho
14fba01b5a Improve multi-threaded performance (#2104)
* Add UpdatePredictionCache() option to updaters

Some updaters (e.g. fast_hist) have enough information to quickly compute
the prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.

* Respond to code review

* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale

* Improve multi-threaded performance

* Use columnwise accessor to accelerate ApplySplit() step,
  with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()

* Add missing #if macro

* Respond to code review

* Use wrapper to enable parallel sort on Linux

* Fix C++ compatibility issues

* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword

* Fix lint issues

* Respond to code review

* Fix bug in ApplySplitSparseData()

* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows has missing values

* Fix training continuation bug

Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.

* Add regression test for fast_hist

* Respond to code review

* Add back old version of ApplySplitSparseData
2017-03-25 10:35:01 -07:00
Tianqi Chen
d581a3d0e7 [UPDATE] Update rabit and threadlocal (#2114)
* [UPDATE] Update rabit and threadlocal

* minor fix to make build system happy

* upgrade requirement to g++4.8

* upgrade dmlc-core

* update travis
2017-03-16 18:48:37 -07:00
Tianqi Chen
fd19b7a188 Automatically remove nan from input data when it is sparse. (#2062)
* [DATALoad] Automatically remove Nan when load from sparse matrix

* add log
2017-02-25 08:59:17 -08:00
Philip Cho
5d74578095 Disallow multiple roots for tree_method=hist (#1979)
As discussed in issue #1978, tree_method=hist ignores the parameter
param.num_roots; it simply assumes that the tree has only one root. In
particular, when InitData() method initializes row_set_collection_, it simply
assigns all rows to node 0, the value that's hard-coded.

For now, the updater will simply fail when num_roots exceeds 1. I will revise
the updater soon to support multiple roots.
2017-01-21 12:02:29 -08:00
Philip Cho
49ff7c1649 Rename parameter in fast_hist to disambiguate (#1962) 2017-01-13 11:35:55 -08:00
Philip Cho
aeb4e76118 Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` in the configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its element may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
  alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
  initializer and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which use `using` to declare type aliases). So always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assigns both 0's and 1's to the same single bin.

Why? gmat only contains the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
2017-01-13 09:25:55 -08:00
Simon DENEL
7078c41dad Changing omp_get_num_threads to omp_get_max_threads (#1831)
* Updating dmlc-core

* Changing omp_get_num_threads to omp_get_max_threads
2016-12-04 11:26:45 -08:00
Vadim Khotilovich
a44032d095 [CORE] The update process for a tree model, and its application to feature importance (#1670)
* [CORE] allow updating trees in an existing model

* [CORE] in refresh updater, allow keeping old leaf values and update stats only

* [R-package] xgb.train mod to allow updating trees in an existing model

* [R-package] added check for nrounds when is_update

* [CORE] merge parameter declaration changes; unify their code style

* [CORE] move the update-process trees initialization to Configure; rename default process_type to 'default'; fix the trees and trees_to_update sizes comparison check

* [R-package] unit tests for the update process type

* [DOC] documentation for process_type parameter; improved docs for updater, Gamma and Tweedie; added some parameter aliases; metrics indentation and some were non-documented

* fix my sloppy merge conflict resolutions

* [CORE] add a TreeProcessType enum

* whitespace fix
2016-12-04 09:33:52 -08:00
AbdealiJK
6f16f0ef58 Use bst_float consistently throughout (#1824)
* Fix various typos

* Add override to functions that are overridden

gcc gives warnings about functions that override a base-class function
without being marked as override. This fixes it.

* Use bst_float consistently

Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.

In some cases, where additions and similar operations can make the
value larger, double is used.

This ensures that type conversions are minimal and reduces loss of
precision.
2016-11-30 10:02:10 -08:00
RAMitchell
be2f28ec08 Update build instructions, improve memory usage (#1811) 2016-11-25 09:43:22 -08:00
Simon DENEL
58aa1129ea Fixing a few typos (#1771)
* Fixing a few typos

* Fixing a few typos
2016-11-13 15:47:52 -08:00
AbdealiJK
b94fcab4dc Add dump_format=json option (#1726)
* Add format to the params accepted by DumpModel

Currently, only the text format is supported when trying to dump
a model. The plan is to add more such formats like JSON which are
easy to read and/or parse by machines. And to make the interface
for this even more generic to allow other formats to be added.

Hence, we make some modifications to make these functions generic
and accept a new parameter "format" which signifies the format of
the dump to be created.

* Fix typos and errors in docs

* plugin: Mention all the register macros available

Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.

* sparse_page_source: Use same arg name in .h and .cc

* gbm: Add JSON dump

The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.

The JSON file has an array, each item is a JSON object for the tree.
For gblinear:
 - The item is the bias and weights vectors
For gbtree:
 - The item is the root node. The root node has an attribute "children"
   which holds the children nodes. This happens recursively.

* core.py: Add arg dump_format for get_dump()
2016-11-04 09:55:25 -07:00
RAMitchell
ac41845d4b Add GPU accelerated tree construction plugin (#1679) 2016-10-20 20:14:47 -07:00
Tianqi Chen
c93c9b7ed6 [TREE] Experimental version of monotone constraint (#1516)
* [TREE] Experimental version of monotone constraint

* Allow default detection of monotone option

* loosen the condition of strict check

* Update gbtree.cc
2016-09-07 21:28:43 -07:00
Vadim Khotilovich
75f401481f no exception throwing within omp parallel; set nthread in Learner (#1421) 2016-07-29 10:08:03 -07:00
Frank
3b73824842 Fix ambiguous call to abs(c or c++). (#1308) 2016-06-29 14:28:28 -07:00
tqchen
ecb3a271be [PYTHON-DIST] Distributed xgboost python training API. 2016-02-29 16:54:13 -08:00
tqchen
413f119c7e Update dmlc-core 2016-02-10 13:11:21 -08:00
tqchen
63c4ad7617 [APPROX] Make global proposal default, add group ptr solution 2016-02-10 11:19:10 -08:00
tqchen
ce4d59ed69 [TREE] Enable global proposal for faster speed 2016-02-10 11:19:10 -08:00
tqchen
2f2080a337 [TREE] Remove gap constraint, make tree construction more robust 2016-02-10 11:17:54 -08:00
tqchen
a500fbc9b0 [TREE] switch to two pass 2016-02-10 11:17:17 -08:00
tqchen
523afcbcd2 [TREE] Cleanup some functions, add utility function for two pass 2016-02-10 11:17:17 -08:00
tqchen
52227a8920 [TREE] Refactor histmaker 2016-02-10 11:17:17 -08:00
tqchen
88447ca32e [MEM] Add rowset struct to save memory with billion level rows 2016-02-10 11:17:17 -08:00
samuel-liyi
d3540aacc5 change the formula of fsplit value 2016-02-08 15:00:04 +08:00
tqchen
1495a43cea [R] make all customizations to meet strict standard of cran 2016-01-16 10:25:12 -08:00
tqchen
d75e3ed05d [LIBXGBOOST] pass demo running. 2016-01-16 10:24:01 -08:00
tqchen
4b4b36d047 [GBM] remove need to explicit InitModel, rename save/load 2016-01-16 10:24:01 -08:00
tqchen
e4567bbc47 [REFACTOR] Add alias, allow missing variables, init gbm interface 2016-01-16 10:24:01 -08:00
tqchen
d4677b6561 [TREE] finish move of updater 2016-01-16 10:24:01 -08:00
tqchen
4adc4cf0b9 [TREE] Move the files to target refactor location 2016-01-16 10:24:01 -08:00
tqchen
3128e1705b [TREE] Refactor colmaker 2016-01-16 10:24:01 -08:00
tqchen
20043f63a6 [TREE] Move colmaker 2016-01-16 10:24:01 -08:00
tqchen
c8ccb61b9e [TREE] Enable updater registry 2016-01-16 10:24:01 -08:00
tqchen
a62a66d545 [TREE] Finalize regression tree refactor 2016-01-16 10:24:01 -08:00
tqchen
d530e0c14f [REFACTOR] cleanup structure 2016-01-16 10:24:00 -08:00
Julian Quick
f51e1893fe fix minor typo 2016-01-01 20:03:45 -08:00
Vadim Khotilovich
c70022e6c4 spelling, wording, and doc fixes in c++ code
I was reading through the code and fixing some things in the comments.
Only a few trivial actual code changes were made to make things more
readable.
2015-12-12 21:40:12 -06:00
Tianqi Chen
fd8439ffbc Update param.h
enforce parallel option to 0 for now for stable result
2015-10-19 08:59:06 -07:00
tqchen
0162bb7034 lint half way 2015-07-03 18:31:52 -07:00
tqchen
e5dd894960 add an indicator opt 2015-06-02 11:38:06 -07:00
tqchen
09a841f810 auto turn on optimization 2015-05-15 23:54:34 -07:00