117 Commits

Author SHA1 Message Date
Rory Mitchell
87ebfc1315
Implement cudf construction with adapters. (#5189) 2020-01-09 20:23:06 +13:00
Jiaming Yuan
3136185bc5
JSON configuration IO. (#5111)
* Add saving/loading JSON configuration.
* Implement Python pickle interface with new IO routines.
* Basic tests for training continuation.
2019-12-15 17:31:53 +08:00
Jiaming Yuan
208ab3b1ff
Model IO in JSON. (#5110) 2019-12-11 11:20:40 +08:00
Rory Mitchell
c7cc657a4d
Use adapters for SparsePageDMatrix (#5092) 2019-12-11 15:59:23 +13:00
Jiaming Yuan
64af1ecf86
[Breaking] Remove num roots. (#5059) 2019-12-05 21:58:43 +08:00
Rory Mitchell
e3c34c79be
External data adapters (#5044)
* Use external data adapters as lightweight intermediate layer between external data and DMatrix
2019-12-04 10:56:17 +13:00
Jiaming Yuan
97abcc7ee2
Extract interaction constraint from split evaluator. (#5034)
*  Extract interaction constraints from split evaluator.

The reason for doing so is mostly for model IO, where num_feature and interaction_constraints are copied in split evaluator. Also interaction constraint by itself is a feature selector, acting like column sampler and it's inefficient to bury it deep in the evaluator chain. Lastly removing one another copied parameter is a win.

*  Enable inc for approx tree method.

As now the implementation is spited up from evaluator class, it's also enabled for approx method.

*  Removing obsoleted code in colmaker.

They are never documented nor actually used in real world. Also there isn't a single test for those code blocks.

*  Unifying the types used for row and column.

As the size of input dataset is marching to billion, incorrect use of int is subject to overflow, also singed integer overflow is undefined behaviour. This PR starts the procedure for unifying used index type to unsigned integers. There's optimization that can utilize this undefined behaviour, but after some testings I don't see the optimization is beneficial to XGBoost.
2019-11-14 20:11:41 +08:00
Jiaming Yuan
f24be2efb4 Use configure_file() to configure version only (#4974)
* Avoid writing build_config.h

* Remove build_config.h all together.

* Lint.
2019-10-22 23:47:00 -07:00
Jiaming Yuan
5620322a48
[Breaking] Add global versioning. (#4936)
* Use CMake config file for representing version.

* Generate c and Python version file with CMake.

The generated file is written into source tree.  But unless XGBoost upgrades
its version, there will be no actual modification.  This retains compatibility
with Makefiles for R.

* Add XGBoost version the DMatrix binaries.
* Simplify prefetch detection in CMakeLists.txt
2019-10-22 23:27:26 -04:00
Jiaming Yuan
d669ea1eaa
Deprecate set group (#4864)
* Convert jvm package and R package.

* Restore for compatibility.
2019-09-17 21:26:54 -04:00
Jiaming Yuan
5374f52531
Complete cudf support. (#4850)
* Handles missing value.
* Accept all floating point and integer types.
* Move to cudf 9.0 API.
* Remove requirement on `null_count`.
* Arbitrary column types support.
2019-09-16 23:52:00 -04:00
Jiaming Yuan
9700776597 Cudf support. (#4745)
* Initial support for cudf integration.

* Add two C APIs for consuming data and metainfo.

* Add CopyFrom for SimpleCSRSource as a generic function to consume the data.

* Add FromDeviceColumnar for consuming device data.

* Add new MetaInfo::SetInfo for consuming label, weight etc.
2019-08-19 16:51:40 +12:00
Jiaming Yuan
f0064c07ab
Refactor configuration [Part II]. (#4577)
* Refactor configuration [Part II].

* General changes:
** Remove `Init` methods to avoid ambiguity.
** Remove `Configure(std::map<>)` to avoid redundant copying and prepare for
   parameter validation. (`std::vector` is returned from `InitAllowUnknown`).
** Add name to tree updaters for easier debugging.

* Learner changes:
** Make `LearnerImpl` the only source of configuration.

    All configurations are stored and carried out by `LearnerImpl::Configure()`.

** Remove booster in C API.

    Originally kept for "compatibility reason", but did not state why.  So here
    we just remove it.

** Add a `metric_names_` field in `LearnerImpl`.
** Remove `LazyInit`.  Configuration will always be lazy.
** Run `Configure` before every iteration.

* Predictor changes:
** Allocate both cpu and gpu predictor.
** Remove cpu_predictor from gpu_predictor.

    `GBTree` is now used to dispatch the predictor.

** Remove some GPU Predictor tests.

* IO

No IO changes.  The binary model format stability is tested by comparing
hashing value of save models between two commits
2019-07-20 08:34:56 -04:00
Jiaming Yuan
d9a47794a5 Fix CPU hist init for sparse dataset. (#4625)
* Fix CPU hist init for sparse dataset.

* Implement sparse histogram cut.
* Allow empty features.

* Fix windows build, don't use sparse in distributed environment.

* Comments.

* Smaller threshold.

* Fix windows omp.

* Fix msvc lambda capture.

* Fix MSVC macro.

* Fix MSVC initialization list.

* Fix MSVC initialization list x2.

* Preserve categorical feature behavior.

* Rename matrix to sparse cuts.
* Reuse UseGroup.
* Check for categorical data when adding cut.

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Sanity check.

* Fix comments.

* Fix comment.
2019-07-04 16:27:03 -07:00
Jiaming Yuan
8bdf15120a
Implement tree model dump with code generator. (#4602)
* Implement tree model dump with a code generator.

* Split up generators.
* Implement graphviz generator.
* Use pattern matching.

* [Breaking] Return a Source in `to_graphviz` instead of Digraph in Python package.


Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2019-06-26 15:20:44 +08:00
Bryan Woods
278562db13 Add support for cross-validation using query ID (#4474)
* adding support for matrix slicing with query ID for cross-validation

* hail mary test of unrar installation for windows tests

* trying to modify tests to run in Github CI

* Remove dependency on wget and unrar

* Save error log from R test

* Relax assertion in test_training

* Use int instead of bool in C function interface

* Revise R interface

* Add XGDMatrixSliceDMatrixEx and keep old XGDMatrixSliceDMatrix for API compatibility
2019-05-23 10:45:02 -07:00
Jiaming Yuan
207f058711 Refactor CMake scripts. (#4323)
* Refactor CMake scripts.

* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.

* Revert machine.conf.

* Move CLI test to gpu.

* Small cleanup.

* Support using XGBoost as submodule.

* Fix windows

* Fix cpp tests on Windows

* Remove duplicated find_package.
2019-04-15 10:08:12 -07:00
Hajime Morrita
680a1b36f3 Get rid of a few trivial compiler warnings. (#4312) 2019-03-31 00:02:29 +08:00
Jiaming Yuan
fdcae024e7
Remove deprecated C APIs. (#4266) 2019-03-17 16:42:44 +08:00
Jiaming Yuan
7b9043cf71
Fix clang-tidy warnings. (#4149)
* Upgrade gtest for clang-tidy.
* Use CMake to install GTest instead of mv.
* Don't enforce clang-tidy to return 0 due to errors in thrust.
* Add a small test for tidy itself.

* Reformat.
2019-03-13 02:25:51 +08:00
Philip Hyunsu Cho
2aaae2e7bb
Fix #4163: always copy sliced data (#4165)
* Revert "Accept numpy array view. (#4147)"

This reverts commit a985a99cf0dacb26a5d734835473d492d3c2a0df.

* Fix #4163: always copy sliced data

* Remove print() from the test; check shape equality

* Check if 'base' attribute exists

* Fix lint

* Address reviewer comment

* Fix lint
2019-02-20 14:46:34 -08:00
Jiaming Yuan
a985a99cf0
Accept numpy array view. (#4147)
* Accept array view (slice) in metainfo.
2019-02-18 22:21:34 +08:00
Philip Hyunsu Cho
91537e7353
Fix #3342 and h2oai/h2o4gpu#625: Save predictor parameters in model file (#3856)
* Fix #3342 and h2oai/h2o4gpu#625: Save predictor parameters in model file

This allows pickled models to retain predictor attributes, such as
'predictor' (whether to use CPU or GPU) and 'n_gpu' (number of GPUs
to use). Related: h2oai/h2o4gpu#625

Closes #3342.

TODO. Write a test.

* Fix lint

* Do not load GPU predictor into CPU-only XGBoost

* Add a test for pickling GPU predictors

* Make sample data big enough to pass multi GPU test

* Update test_gpu_predictor.cu
2018-11-03 21:45:38 -07:00
Andy Adinets
72cd1517d6 Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage. (#3446)
* Replaced std::vector with HostDeviceVector in MetaInfo and SparsePage.

- added distributions to HostDeviceVector
- using HostDeviceVector for labels, weights and base margings in MetaInfo
- using HostDeviceVector for offset and data in SparsePage
- other necessary refactoring

* Added const version of HostDeviceVector API calls.

- const versions added to calls that can trigger data transfers, e.g. DevicePointer()
- updated the code that uses HostDeviceVector
- objective functions now accept const HostDeviceVector<bst_float>& for predictions

* Updated src/linear/updater_gpu_coordinate.cu.

* Added read-only state for HostDeviceVector sync.

- this means no copies are performed if both host and devices access
  the HostDeviceVector read-only

* Fixed linter and test errors.

- updated the lz4 plugin
- added ConstDeviceSpan to HostDeviceVector
- using device % dh::NVisibleDevices() for the physical device number,
  e.g. in calls to cudaSetDevice()

* Fixed explicit template instantiation errors for HostDeviceVector.

- replaced HostDeviceVector<unsigned int> with HostDeviceVector<int>

* Fixed HostDeviceVector tests that require multiple GPUs.

- added a mock set device handler; when set, it is called instead of cudaSetDevice()
2018-08-30 14:28:47 +12:00
trivialfis
2c502784ff Span class. (#3548)
* Add basic Span class based on ISO++20.

* Use Span<Entry const> instead of Inst in SparsePage.

* Add DeviceSpan in HostDeviceVector, use it in regression obj.
2018-08-14 17:58:11 +12:00
Philip Hyunsu Cho
109473dae2
Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows (#3553)
* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows

Description: The bug is triggered when

1. The data matrix has empty rows at the bottom. More precisely, the rows
   `n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
   dimensions (`n` number of instances, `k` number of trailing rows)
2. The data matrix is given as Compressed Sparse Column (CSC) format.

Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (this is common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not take account of these rows.

Fix: Modify the row pointer.

* Add regression test
2018-08-05 10:15:42 -07:00
hlsc
5850a2558a fix DMatrix load_row_split bug (#3431) 2018-07-28 17:21:30 -07:00
Rory Mitchell
1b59316444
Updates for GPU CI tests (#3467)
* Fail GPU CI after test failure

* Fix GPU linear tests

* Reduced number of GPU tests to speed up CI

* Remove static allocations of device memory

* Resolve illegal memory access for updater_fast_hist.cc

* Fix broken r tests dependency

* Update python install documentation for GPU
2018-07-16 18:05:53 +12:00
Philip Hyunsu Cho
48d6e68690
Add callback interface to re-direct console output (#3438)
* Add callback interface to re-direct console output

* Exempt TrackerLogger from custom logging

* Fix lint
2018-07-05 11:32:30 -07:00
liuliang01
0cf88d036f Add qid like ranklib format (#2749)
* add qid for https://github.com/dmlc/xgboost/issues/2748

* change names

* change spaces

* change qid to bst_uint type

* change qid type to size_t

* change qid first to SIZE_MAX

* change qid type from size_t to uint64_t

* update dmlc-core

* fix qids name error

* fix group_ptr_ error

* Style fix

* Add qid handling logic to SparsePage

* New MetaInfo format + backward compatibility fix

Old MetaInfo format (1.0) doesn't contain qid field. We still want to be able
to read from MetaInfo files saved in old format. Also, define a new format
(2.0) that contains the qid field. This way, we can distinguish files that
contain qid and those that do not.

* Update MetaInfo test

* Simply group assignment logic

* Explicitly set qid=nullptr in NativeDataIter

NativeDataIter's callback does not support qid field. Users of NativeDataIter
will need to call setGroup() function separately to set group information.

* Save qids_ in SaveBinary()

* Upgrade dmlc-core submodule

* Add a test for reading qid

* Add contributor

* Check the size of qids_

* Document qid format
2018-06-30 20:24:03 +00:00
Yun Ni
30d10ab035 Convert handle == nullptr from SegFault to user-friendly error. (#3021)
* Convert SegFault to user-friendly error.

* Apply the change to DMatrix API as well
2018-06-29 06:30:26 +00:00
PSEUDOTENSOR / Jonathan McKinney
9ac163d0bb Allow import via python datatable. (#3272)
* Allow import via python datatable.

* Write unit tests

* Refactor dt API functions

* Refactor python code

* Lint fixes

* Address review comments
2018-06-20 13:16:18 -07:00
Rory Mitchell
a96039141a
Dmatrix refactor stage 1 (#3301)
* Use sparse page as singular CSR matrix representation

* Simplify dmatrix methods

* Reduce statefullness of batch iterators

* BREAKING CHANGE: Remove prob_buffer_row parameter. Users are instead recommended to sample their dataset as a preprocessing step before using XGBoost.
2018-06-07 10:25:58 +12:00
Rory Mitchell
ccf80703ef
Clang-tidy static analysis (#3222)
* Clang-tidy static analysis

* Modernise checks

* Google coding standard checks

* Identifier renaming according to Google style
2018-04-19 18:57:13 +12:00
Will Storey
00d9728e4b Fix memory leak in XGDMatrixCreateFromMat_omp() (#3182)
* Fix memory leak in XGDMatrixCreateFromMat_omp()

This replaces the array allocated by new with a std::vector.

Fixes #3161
2018-03-18 15:03:27 +13:00
Andrew V. Adinetz
d5992dd881 Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. (#3116)
* Replaced std::vector-based interfaces with HostDeviceVector-based interfaces.

- replacement was performed in the learner, boosters, predictors,
  updaters, and objective functions
- only interfaces used in training were replaced;
  interfaces like PredictInstance() still use std::vector
- refactoring necessary for replacement of interfaces was also performed,
  such as using HostDeviceVector in prediction cache

* HostDeviceVector-based interfaces for custom objective function example plugin.
2018-02-28 13:00:04 +13:00
Abraham Zhan
874525c152 c_api.cc variable declared inapproiate (#3044)
In line 461, the "size_t offset = 0;" should be declared before any calculation, otherwise will cause compilation error. 

```
I:\Libraries\xgboost\src\c_api\c_api.cc(416): error C2146: Missing ";" before "offset" [I:\Libraries\xgboost\build\objxgboost.vcxproj]
```
2018-02-09 01:32:01 -08:00
Scott Lundberg
d878c36c84 Add SHAP interaction effects, fix minor bug, and add cox loss (#3043)
* Add interaction effects and cox loss

* Minimize whitespace changes

* Cox loss now no longer needs a pre-sorted dataset.

* Address code review comments

* Remove mem check, rename to pred_interactions, include bias

* Make lint happy

* More lint fixes

* Fix cox loss indexing

* Fix main effects and tests

* Fix lint

* Use half interaction values on the off-diagonals

* Fix lint again
2018-02-07 20:38:01 -06:00
Scott Lundberg
78c4188cec SHAP values for feature contributions (#2438)
* SHAP values for feature contributions

* Fix commenting error

* New polynomial time SHAP value estimation algorithm

* Update API to support SHAP values

* Fix merge conflicts with updates in master

* Correct submodule hashes

* Fix variable sized stack allocation

* Make lint happy

* Add docs

* Fix typo

* Adjust tolerances

* Remove unneeded def

* Fixed cpp test setup

* Updated R API and cleaned up

* Fixed test typo
2017-10-12 12:35:51 -07:00
Vadim Khotilovich
00eda28b3c MinGW: shared library prefix and appveyor CI (#2539)
* for MinGW, drop the 'lib' prefix from shared library name

* fix defines for 'g++ 4.8 or higher' to include g++ >= 5

* fix compile warnings

* [Appveyor] add MinGW with python; remove redundant jobs

* [Appveyor] also do python build for one of msvc jobs
2017-07-25 01:06:47 -05:00
PSEUDOTENSOR / Jonathan McKinney
6b375f6ad8 Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530)
* Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays for python interface.
2017-07-21 14:43:17 +12:00
Maurus Cuelenaere
6bd1869026 Add prediction of feature contributions (#2003)
* Add prediction of feature contributions

This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight in how a prediction is composed of its feature contributions
and a bias.

* Support multi-class models

* Calculate learning_rate per-tree instead of using the one from the first tree

* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly

* Add simple test for contributions feature

* Check against param.num_nodes instead of checking for non-zero length

* Loop over all roots instead of only the first
2017-05-14 00:58:10 -05:00
Preston Parry
1ab8088a09 Removes extraneous log (#2186)
This log appears to fire every time I ask the python package to make a prediction. It's the only log that fires from XGBoost. When we're getting predictions on millions of items a day in production, this log seems out of place.
2017-04-11 17:38:29 -07:00
Huffers
d45cf240a9 Remove xgboost's thread_local and switch to dmlc::ThreadLocalStore (#2121)
* Remove xgboost's own version of thread_local and switch to dmlc::ThreadLocalStore (#2109)

* Update dmlc-core
2017-03-27 09:09:18 -07:00
Tianqi Chen
d581a3d0e7 [UPDATE] Update rabit and threadlocal (#2114)
* [UPDATE] Update rabit and threadlocal

* minor fix to make build system happy

* upgrade requirement to g++4.8

* upgrade dmlc-core

* update travis
2017-03-16 18:48:37 -07:00
Oleg Sofrygin
9d19e13ed0 adding a copy of base_margin to slice, fixes a bug where base_margin was notcopied during cross-validation (#2007) 2017-03-16 10:36:57 -07:00
Tianqi Chen
fd19b7a188 Automatically remove nan from input data when it is sparse. (#2062)
* [DATALoad] Automatically remove Nan when load from sparse matrix

* add log
2017-02-25 08:59:17 -08:00
Simon DENEL
7078c41dad Changing omp_get_num_threads to omp_get_max_threads (#1831)
* Updating dmlc-core

* Changing omp_get_num_threads to omp_get_max_threads
2016-12-04 11:26:45 -08:00
AbdealiJK
6f16f0ef58 Use bst_float consistently throughout (#1824)
* Fix various typos

* Add override to functions that are overridden

gcc gives warnings about functions that are being overridden by not
being marked as oveirridden. This fixes it.

* Use bst_float consistently

Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.

In some cases, when due to additions and so on the value can
take a larger value, double is used.

This ensures that type conversions are minimal and reduces loss of
precision.
2016-11-30 10:02:10 -08:00
AbdealiJK
97371ff7e5 c_api.cc: Bring back silent argument (#1794)
In ecb3a271bed151252fb048528ce5a90ad75bb68f the silent argument
in XGDMatrixCreateFromFile of c_api.cc was always overridden to
be false. This disabled the functionality to hide log messages.

This commit reverts that part to enable the hiding of log messages.
2016-11-20 22:04:36 -08:00