99 Commits

Author SHA1 Message Date
Jiaming Yuan
7d52c0b8c2
Requires setting leaf stat when expanding tree. (#5501)
* Fix GPU Hist feature importance.
2020-04-10 12:27:03 +08:00
Jiaming Yuan
6671b42dd4
Use ellpack for prediction only when sparsepage doesn't exist. (#5504) 2020-04-10 12:15:46 +08:00
Jiaming Yuan
0012f2ef93
Upgrade clang-tidy on CI. (#5469)
* Correct all clang-tidy errors.
* Upgrade clang-tidy to 10 on CI.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-05 04:42:29 +08:00
Jiaming Yuan
459b175dc6
Split up test helpers header. (#5455) 2020-04-03 10:36:53 +08:00
Jiaming Yuan
4942da64ae
Refactor tests with data generator. (#5439) 2020-03-27 06:44:44 +08:00
Jiaming Yuan
ab7a46a1a4
Check whether current updater can modify a tree. (#5406)
* Check whether current updater can modify a tree.

* Fix tree model JSON IO for pruned trees.
2020-03-14 09:24:08 +08:00
Rory Mitchell
b745b7acce
Fix memory usage of device sketching (#5407) 2020-03-14 13:43:24 +13:00
Rory Mitchell
3ad4333b0e
Partial rewrite EllpackPage (#5352) 2020-03-11 10:15:53 +13:00
Rory Mitchell
a38e7bd19c
Sketching from adapters (#5365)
* Sketching from adapters

* Add weights test
2020-03-07 21:07:58 +13:00
Jiaming Yuan
8d06878bf9
Deterministic GPU histogram. (#5361)
* Use pre-rounding based method to obtain reproducible floating point
  summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
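
The pre-rounding idea mentioned above can be illustrated with a minimal CPU sketch. This is not the GPU Hist code from the PR. The helper names (`rounding_quantum`, `pre_round`, `reproducible_sum`) are hypothetical, and the way the grid spacing is chosen is an assumption made for illustration. Each value is snapped to a power-of-two grid coarse enough that every partial sum is exactly representable, so the result no longer depends on summation order.

```python
import math
import random


def rounding_quantum(max_abs, n):
    """Power-of-two grid spacing for n values bounded by max_abs (hypothetical helper).

    Assumes max_abs * n is a normal-range double (no underflow or overflow).
    """
    if max_abs == 0.0 or n == 0:
        return 1.0
    # frexp returns (m, exp) with 0.5 <= m < 1, so max_abs * n < 2**exp.
    _, exp = math.frexp(max_abs * n)
    # A double has 53 significant bits; this keeps every partial sum,
    # measured in quanta, below 2**53.
    return math.ldexp(1.0, exp - 52)


def pre_round(x, quantum):
    """Snap x to the nearest multiple of quantum, dropping the low-order bits."""
    return round(x / quantum) * quantum


def reproducible_sum(values):
    """Sum floats so that the result does not depend on the order of the inputs."""
    values = list(values)
    max_abs = max((abs(v) for v in values), default=0.0)
    quantum = rounding_quantum(max_abs, len(values))
    return sum((pre_round(v, quantum) for v in values), 0.0)


random.seed(0)
gradients = [random.uniform(-1e6, 1e6) for _ in range(1000)]
first = reproducible_sum(gradients)
random.shuffle(gradients)
second = reproducible_sum(gradients)
print(first == second)
```

The price of reproducibility is a small loss of precision: bits below the grid spacing are discarded before the sum is taken.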
2020-03-04 15:13:28 +08:00
Egor Smirnov
1b97eaf7a7
Optimized ApplySplit, BuildHist and UpdatePredictCache functions on CPU (#5244)
* Split up sparse and dense build hist kernels.
* Add `PartitionBuilder`.
2020-02-29 16:11:42 +08:00
sriramch
b81f8cbbc0
Move segment sorter to common (#5378)
- move segment sorter to common
- this is the first of a handful of PRs that split up the larger PR #5326
- it moves this facility to common (from the ranking objective class), so that it can be
    used for metric computation
- it also wraps all the bare device pointers into span.
2020-02-29 15:42:07 +08:00
Jiaming Yuan
e0509b3307
Fix pruner. (#5335)
* Honor the tree depth.
* Prevent pruning pruned node.
2020-02-25 08:32:46 +08:00
Rory Mitchell
24ad9dec0b
Testing hist_util (#5251)
* Rank tests

* Remove categorical split specialisation

* Extend tests to multiple features, switch to WQSketch

* Add tests for SparseCuts

* Add external memory quantile tests, fix some existing tests
2020-02-14 14:36:43 +13:00
Jiaming Yuan
29eeea709a
Pass shared pointer instead of raw pointer to Learner. (#5302)
Extracted from https://github.com/dmlc/xgboost/pull/5220 .
2020-02-11 14:16:38 +08:00
Rong Ou
e4b74c4d22
Gradient based sampling for GPU Hist (#5093)
* Implement gradient based sampling for GPU Hist tree method.
* Add samplers and handle compacted page in GPU Hist.
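
As a rough illustration of what gradient based sampling means, here is a minimal sketch of one common formulation. It is an assumption and not necessarily the scheme implemented in this PR: rows are kept with probability proportional to the absolute gradient, and kept gradients are rescaled by the inverse probability so the expected gradient sum is unchanged. The function name `gradient_based_sample` and its parameters are hypothetical.

```python
import random


def gradient_based_sample(gradients, sample_fraction, rng):
    """Return {row_index: rescaled_gradient} for the sampled rows.

    Row i is kept with probability p_i = min(1, target * |g_i| / sum(|g|)),
    where target = sample_fraction * number_of_rows. Because of the cap at 1,
    the expected sample size can be smaller than target.
    """
    total = sum(abs(g) for g in gradients)
    target = sample_fraction * len(gradients)
    sampled = {}
    for i, g in enumerate(gradients):
        if total == 0.0:
            p = sample_fraction  # all gradients are zero: fall back to uniform
        else:
            p = min(1.0, target * abs(g) / total)
        if p > 0.0 and rng.random() < p:
            sampled[i] = g / p  # inverse-probability weighting
    return sampled


rng = random.Random(0)
print(gradient_based_sample([0.1, -2.0, 0.5, 3.0], 0.5, rng))
```

Rows with large gradients are almost always kept, while rows with small gradients are mostly dropped and up-weighted when they do survive.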
2020-02-04 10:31:27 +08:00
Egor Smirnov
c67163250e
Optimized BuildHist function (#5156) 2020-01-29 23:32:57 -08:00
Jiaming Yuan
3eb1279bbf
Config for linear updaters. (#5222) 2020-01-25 11:26:46 +08:00
Egor Smirnov
7b17e76c5b Optimized EvaluateSplit function (#5138)
* Add block based threading utilities.
2019-12-31 18:18:42 +08:00
Jiaming Yuan
04db125699
Quick fix for memory leak in CPU Hist. (#5153)
Closes https://github.com/dmlc/xgboost/issues/3579 .

* Don't use map.
2019-12-31 14:05:53 +08:00
Jiaming Yuan
ad4a1c732c
Small refinements for JSON model. (#5112)
* Naming consistency.

* Remove duplicated test.
2019-12-11 19:49:01 +08:00
Jiaming Yuan
208ab3b1ff
Model IO in JSON. (#5110) 2019-12-11 11:20:40 +08:00
Jiaming Yuan
7ef5b78003
Implement JSON IO for updaters (#5094)
* Implement JSON IO for updaters.

* Remove parameters in split evaluator.
2019-12-07 00:24:00 +08:00
Jiaming Yuan
df9bdbbcb9
Fix parsing empty vector in parameter. (#5087) 2019-12-05 11:42:01 +08:00
Rong Ou
0afcc55d98 Support multiple batches in gpu_hist (#5014)
* Initial external memory training support for GPU Hist tree method.
2019-11-16 14:50:20 +08:00
Jiaming Yuan
97abcc7ee2
Extract interaction constraint from split evaluator. (#5034)
*  Extract interaction constraints from split evaluator.

The main reason for doing so is model IO, where num_feature and interaction_constraints are copied into the split evaluator. An interaction constraint is also a feature selector in its own right, acting like a column sampler, and it is inefficient to bury it deep in the evaluator chain. Lastly, removing another copied parameter is a win.

*  Enable inc for approx tree method.

Now that the implementation is split out from the evaluator class, it is also enabled for the approx method.

*  Removing obsoleted code in colmaker.

This code was never documented nor actually used in the real world, and there isn't a single test for these code blocks.

*  Unifying the types used for row and column.

As the size of input datasets marches toward billions of rows, incorrect use of int is subject to overflow, and signed integer overflow is undefined behaviour. This PR starts the process of unifying the index types used to unsigned integers. There are optimizations that can exploit this undefined behaviour, but after some testing I don't see that they are beneficial to XGBoost.
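
The remark that an interaction constraint acts as a feature selector can be sketched as follows. This is a simplified illustration and not the extracted C++ class: the function name `allowed_features` and its arguments are hypothetical, and the handling of features that belong to no constraint group is an assumption.

```python
def allowed_features(constraints, path_features, n_features):
    """Features that may be used for the next split on a tree path.

    constraints   -- list of feature-index groups allowed to interact
    path_features -- features already used by splits on the path to this node
    n_features    -- total number of features
    """
    used = set(path_features)
    if not used:
        # At the root nothing has been used yet, so every feature is a candidate.
        return set(range(n_features))
    # A feature that is already on the path may always be reused.
    allowed = set(used)
    for group in constraints:
        # Only groups that contain every feature used so far stay eligible.
        if used <= set(group):
            allowed |= set(group)
    return allowed


constraints = [[0, 1], [2, 3, 4]]
print(allowed_features(constraints, [], 5))
print(allowed_features(constraints, [0], 5))
print(allowed_features(constraints, [2, 3], 5))
```

Like a column sampler, the function only narrows the candidate set before split evaluation, which is why it does not need to live inside the evaluator chain.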
2019-11-14 20:11:41 +08:00
Philip Hyunsu Cho
f4e7b707c9
Revert #4529 (#5008)
* Revert "Optimize ‘hist’ for multi-core CPU (#4529)"

This reverts commit 4d6590be3c9a043d44d9e4fe0a456a9f8179ec72.

* Fix build
2019-11-12 09:35:03 -08:00
Jiaming Yuan
ac457c56a2
Use `UpdateAllowUnknown' for non-model related parameter. (#4961)
* Use `UpdateAllowUnknown' for non-model related parameter.

Model parameter can not pack an additional boolean value due to binary IO
format.  This commit deals only with non-model related parameter configuration.

* Add tidy command line arg for use-dmlc-gtest.
2019-10-23 05:50:12 -04:00
Rong Ou
5b1715d97c Write ELLPACK pages to disk (#4879)
* add ellpack source
* add batch param
* extract function to parse cache info
* construct ellpack info separately
* push batch to ellpack page
* write ellpack page.
* make sparse page source reusable
2019-10-22 23:44:32 -04:00
Jiaming Yuan
ae536756ae
Add Model and Configurable interface. (#4945)
* Apply Configurable to objective functions.
* Apply Model to Learner and Regtree, gbm.
* Add Load/SaveConfig to objs.
* Refactor obj tests to use smart pointer.
* Dummy methods for Save/Load Model.
2019-10-18 01:56:02 -04:00
Jiaming Yuan
b61d534472
Span: use `size_t' for index_type, add `front' and `back'. (#4935)
* Use `size_t' for index_type.  Add `front' and `back'.

* Remove a batch of `static_cast'.
2019-10-14 09:13:33 -04:00
Jiaming Yuan
095de3bf5f
Export c++ headers in CMake installation. (#4897)
* Move get transpose into cc.

* Clean up headers in host device vector, remove thrust dependency.

* Move span and host device vector into public.

* Install c++ headers.

* Short notes for c and c++.

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2019-10-06 23:53:09 -04:00
Rong Ou
562bb0ae31 remove device shards (#4867) 2019-09-25 13:15:46 +08:00
Jiaming Yuan
0b89cd1dfa
Support gamma in GPU_Hist. (#4874)
* Just prevent building the tree instead of using an explicit pruner.
2019-09-24 10:16:08 +08:00
Rong Ou
125bcec62e Move ellpack page construction into DMatrix (#4833) 2019-09-16 23:50:55 -04:00
Rong Ou
733ed24dd9 further cleanup of single process multi-GPU code (#4810)
* use subspan in gpu predictor instead of copying
* Revise `HostDeviceVector`
2019-08-30 05:27:23 -04:00
Rong Ou
38ab79f889 Make HostDeviceVector single gpu only (#4773)
* Make HostDeviceVector single gpu only
2019-08-26 09:51:13 +12:00
Jiaming Yuan
9700776597 Cudf support. (#4745)
* Initial support for cudf integration.

* Add two C APIs for consuming data and metainfo.

* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.

* Add FromDeviceColumnar for consuming device data.

* Add new MetaInfo::SetInfo for consuming label, weight etc.
2019-08-19 16:51:40 +12:00
Xu Xiao
ef9af33a00 [HOTFIX] distributed training with hist method (#4716)
* add parallel test for hist.EvaluateSplit

* update test_openmp.py

* update test_openmp.py

* update test_openmp.py

* update test_openmp.py

* update test_openmp.py

* fix OMP schedule policy

* fix clang-tidy

* add logging: total_num_bins

* fix

* fix

* test

* replace guided OPENMP policy with static in updater_quantile_hist.cc
2019-08-13 11:27:29 -07:00
Rong Ou
c5b229632d [BREAKING] prevent multi-gpu usage (#4749)
* prevent multi-gpu usage

* fix distributed test

* combine gpu predictor tests

* set upper bound on n_gpus
2019-08-13 09:11:35 +12:00
Rong Ou
851b5b3808 Remove gpu_exact tree method (#4742) 2019-08-07 11:43:20 +12:00
Jiaming Yuan
9c469b3844
Move bitfield into common. (#4737)
* Prepare for columnar format support.
2019-08-06 02:49:32 -04:00
Rong Ou
6edddd7966 Refactor DMatrix to return batches of different page types (#4686)
* Use explicit template parameter for specifying page type.
2019-08-03 15:10:34 -04:00
Jiaming Yuan
f0064c07ab
Refactor configuration [Part II]. (#4577)
* Refactor configuration [Part II].

* General changes:
** Remove `Init` methods to avoid ambiguity.
** Remove `Configure(std::map<>)` to avoid redundant copying and prepare for
   parameter validation. (`std::vector` is returned from `InitAllowUnknown`).
** Add name to tree updaters for easier debugging.

* Learner changes:
** Make `LearnerImpl` the only source of configuration.

    All configurations are stored and carried out by `LearnerImpl::Configure()`.

** Remove booster in C API.

    Originally kept for "compatibility reason", but did not state why.  So here
    we just remove it.

** Add a `metric_names_` field in `LearnerImpl`.
** Remove `LazyInit`.  Configuration will always be lazy.
** Run `Configure` before every iteration.

* Predictor changes:
** Allocate both cpu and gpu predictor.
** Remove cpu_predictor from gpu_predictor.

    `GBTree` is now used to dispatch the predictor.

** Remove some GPU Predictor tests.

* IO

No IO changes.  The stability of the binary model format is tested by comparing
hash values of saved models between two commits.
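
A check of this kind could look roughly like the sketch below. The commit does not say which hash function or tooling was used, so the SHA-256 choice, the helper names (`file_digest`, `same_model_bytes`) and the demo file names are assumptions.

```python
import hashlib
import os
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_model_bytes(path_a, path_b):
    """True when two saved model files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)


# Demo with stand-in files; in practice these would be models saved by
# builds of the two commits being compared.
with tempfile.TemporaryDirectory() as tmp:
    before = os.path.join(tmp, "model_before.bin")
    after = os.path.join(tmp, "model_after.bin")
    for path in (before, after):
        with open(path, "wb") as handle:
            handle.write(b"stand-in model bytes")
    print(same_model_bytes(before, after))
```

Comparing digests only detects byte-level changes, so it suits a binary format that is expected to stay stable across a refactoring.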
2019-07-20 08:34:56 -04:00
sriramch
7a388cbf8b Modify caching allocator/vector and fix issues relating to inability to train large datasets (#4615) 2019-07-09 18:33:27 +12:00
Jiaming Yuan
d9a47794a5 Fix CPU hist init for sparse dataset. (#4625)
* Fix CPU hist init for sparse dataset.

* Implement sparse histogram cut.
* Allow empty features.

* Fix windows build, don't use sparse in distributed environment.

* Comments.

* Smaller threshold.

* Fix windows omp.

* Fix msvc lambda capture.

* Fix MSVC macro.

* Fix MSVC initialization list.

* Fix MSVC initialization list x2.

* Preserve categorical feature behavior.

* Rename matrix to sparse cuts.
* Reuse UseGroup.
* Check for categorical data when adding cut.

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Sanity check.

* Fix comments.

* Fix comment.
2019-07-04 16:27:03 -07:00
Egor Smirnov
4d6590be3c Optimize ‘hist’ for multi-core CPU (#4529)
* Initial performance optimizations for xgboost

* remove includes

* revert float->double

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* fix for CI

* Check existence of _mm_prefetch and __builtin_prefetch

* Fix lint

* optimizations for CPU

* applying comments in review

* add some comments, code refactoring

* fixing issues in CI

* adding runtime checks

* remove 1 extra check

* remove extra checks in BuildHist

* remove checks

* add debug info

* added debug info

* revert changes

* added comments

* Apply suggestions from code review

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* apply review comments

* Remove unused function CreateNewNodes()

* Add descriptive comment on node_idx variable in QuantileHistMaker::Builder::BuildHistsBatch()
2019-06-27 11:33:49 -07:00
Jiaming Yuan
8bdf15120a
Implement tree model dump with code generator. (#4602)
* Implement tree model dump with a code generator.

* Split up generators.
* Implement graphviz generator.
* Use pattern matching.

* [Breaking] Return a Source in `to_graphviz` instead of Digraph in Python package.


Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2019-06-26 15:20:44 +08:00
Rong Ou
6125521caf fix compiler warning (#4588) 2019-06-21 04:06:26 +08:00
Rory Mitchell
221e163185
Refactor out row partitioning logic from gpu_hist, introduce caching device vectors (#4554) 2019-06-20 18:24:09 +12:00