147 Commits

Author SHA1 Message Date
Andrew V. Adinetz
d5992dd881 Replaced std::vector-based interfaces with HostDeviceVector-based interfaces. (#3116)
* Replaced std::vector-based interfaces with HostDeviceVector-based interfaces.

- replacement was performed in the learner, boosters, predictors,
  updaters, and objective functions
- only interfaces used in training were replaced;
  interfaces like PredictInstance() still use std::vector
- refactoring necessary for replacement of interfaces was also performed,
  such as using HostDeviceVector in prediction cache

* HostDeviceVector-based interfaces for custom objective function example plugin.
2018-02-28 13:00:04 +13:00
Yuan (Terry) Tang
11bfa8584d
Remove unnecessary dependencies in distributed test (#3132) 2018-02-24 20:24:34 -05:00
Rory Mitchell
10eb05a63a
Refactor linear modelling and add new coordinate descent updater (#3103)
* Refactor linear modelling and add new coordinate descent updater

* Allow unsorted column iterator

* Add prediction caching to gblinear
2018-02-17 09:17:01 +13:00
Scott Lundberg
d878c36c84 Add SHAP interaction effects, fix minor bug, and add cox loss (#3043)
* Add interaction effects and cox loss

* Minimize whitespace changes

* Cox loss no longer needs a pre-sorted dataset.

* Address code review comments

* Remove mem check, rename to pred_interactions, include bias

* Make lint happy

* More lint fixes

* Fix cox loss indexing

* Fix main effects and tests

* Fix lint

* Use half interaction values on the off-diagonals

* Fix lint again
2018-02-07 20:38:01 -06:00
Thejaswi
84ab74f3a5 Objective function evaluation on GPU with minimal PCIe transfers (#2935)
* Added GPU objective function and no-copy interface.

- xgboost::HostDeviceVector<T> syncs automatically between host and device
- no-copy interfaces have been added
- default implementations just sync the data to host
  and call the implementations with std::vector
- GPU objective function, predictor, histogram updater process data
  directly on GPU
2018-01-12 21:33:39 +13:00
Rory Mitchell
7759ab99ee
Fix Google test warnings and error (#2957) 2017-12-20 00:13:56 +13:00
Rory Mitchell
1b77903eeb
Fix several GPU bugs (#2916)
* Fix #2905

* Fix gpu_exact test failures

* Fix bug in GPU prediction where multiple calls to batch prediction can produce incorrect results

* Fix GPU documentation formatting
2017-12-04 08:27:49 +13:00
Rory Mitchell
c51adb49b6
Monotone constraints for gpu_hist (#2904) 2017-11-30 10:26:19 +13:00
Rory Mitchell
c55f14668e
Update gpu_hist algorithm (#2901) 2017-11-27 13:44:24 +13:00
Rory Mitchell
24f527a1c0
AVX gradients (#2878)
* AVX gradients

* Add google test for AVX

* Create fallback implementation, remove fma instruction

* Improved accuracy of AVX exp function
2017-11-27 08:56:01 +13:00
Rory Mitchell
40c6e2f0c8
Improved gpu_hist_experimental algorithm (#2866)
- Implement colsampling, subsampling for gpu_hist_experimental

 - Optimised multi-GPU implementation for gpu_hist_experimental

 - Make nccl optional

 - Add Volta architecture flag

 - Optimise RegLossObj

 - Add timing utilities for debug verbose mode

 - Bump required cuda version to 8.0
2017-11-11 13:58:40 +13:00
Rory Mitchell
13e7a2cff0 Various bug fixes (#2825)
* Fatal error if GPU algorithm selected without GPU support compiled

* Resolve type conversion warnings

* Fix gpu unit test failure

* Fix compressed iterator edge case

* Fix python unit test failures due to flake8 update on pip
2017-10-25 14:45:01 +13:00
Scott Lundberg
78c4188cec SHAP values for feature contributions (#2438)
* SHAP values for feature contributions

* Fix commenting error

* New polynomial time SHAP value estimation algorithm

* Update API to support SHAP values

* Fix merge conflicts with updates in master

* Correct submodule hashes

* Fix variable sized stack allocation

* Make lint happy

* Add docs

* Fix typo

* Adjust tolerances

* Remove unneeded def

* Fixed cpp test setup

* Updated R API and cleaned up

* Fixed test typo
2017-10-12 12:35:51 -07:00
Rory Mitchell
4cb2f7598b -Add experimental GPU algorithm for lossguided mode (#2755)
-Improved GPU algorithm unit tests
-Removed some thrust code to improve compile times
2017-10-01 00:18:35 +13:00
Tsukasa OMOTO
8d15024ac7 python: follow the default warning filters of Python (#2666)
* python: follow the default warning filters of Python

https://docs.python.org/3/library/warnings.html#default-warning-filters

* update tests

* update tests
2017-09-27 03:03:01 -04:00
Icyblade Dai
0e85b30fdd Fix issue 2670 (#2671)
* fix issue 2670

* add python<3.6 compatibility

* fix Index

* fix Index/MultiIndex

* fix lint

* fix W0622

really nonsense

* fix lambda

* Trigger Travis

* add test for MultiIndex

* remove trailing whitespace
2017-09-19 15:49:41 -04:00
Rory Mitchell
9c85903f0b Add GPU documentation (#2695)
* Add GPU documentation

* Update Python GPU tests
2017-09-10 19:42:46 +12:00
Rory Mitchell
e6a9063344 Integer gradient summation for GPU histogram algorithm. (#2681) 2017-09-08 15:07:29 +12:00
Rory Mitchell
15267eedf2 [GPU-Plugin] Major refactor 2 (#2664)
* Change cmake option

* Move source files

* Move google tests

* Move python tests

* Move benchmarks

* Move documentation

* Remove makefile support

* Fix test run

* Move GPU tests
2017-09-08 09:57:16 +12:00
Yun Ni
f04bde05fd Add Coverage Report for Java and Python (#2667)
* Add coverage report for java

* Add coverage report for python

* Increase memory for JVM unit tests
2017-09-05 14:46:51 -07:00
Rory Mitchell
ef23e424f1 [GPU-Plugin] Add GPU accelerated prediction (#2593)
* [GPU-Plugin] Add GPU accelerated prediction

* Improve allocation message

* Update documentation

* Resolve linker error for predictor

* Add unit tests
2017-08-16 12:31:59 +12:00
PSEUDOTENSOR / Jonathan McKinney
6b375f6ad8 Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530)
* Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays for python interface.
2017-07-21 14:43:17 +12:00
PSEUDOTENSOR / Jonathan McKinney
ca7fc9fda3 [GPU-Plugin] Fix gpu_hist to allow matrices with more than just 2^{32} elements. Also fixed CPU hist algorithm. (#2518) 2017-07-18 11:19:27 +12:00
Michal Malohlava
33ee7d1615 [BUILD] Dockerfile and Jenkinsfile revisited (#2514)
Includes:
  - Dockerfile changes
    - Dockerfile clean up
    - Fix execution privileges of files used from Dockerfile.
    - New Dockerfile entrypoint to replace with_user script
    - Defined placeholders for CPU testing (script and Dockerfile)
  - Jenkinsfile
    - Jenkins file milestone defined
    - Single source code checkout and propagation via stash/unstash
    - Bash needs to be explicitly used to launch the make build, since we need
      access to the environment
    - Jenkinsfile build factory for cmake and make style of jobs
    - Archiving of artifacts (*.so, *.whl, *.egg) produced by cmake build

Missing:
  - CPU testing
  - Python3 env build and testing
2017-07-13 17:51:47 +12:00
Rory Mitchell
530f01e21c [GPU-Plugin] Add load balancing search to gpu_hist. Add compressed iterator. (#2504) 2017-07-11 22:36:39 +12:00
Rory Mitchell
e939192978 Cmake improvements (#2487)
* Cmake improvements
* Add google test to cmake
2017-07-06 18:05:11 +12:00
Rory Mitchell
1899f9e744 [GPU-Plugin] Add basic continuous integration for GPU plugin. (#2431) 2017-06-22 10:15:28 -04:00
Sergei Lebedev
2cb51f7097 [jvm-packages] Another pack of build/CI improvements (#2422)
* [jvm-packages] Fixed compilation on Windows

* [jvm-packages] Build the JNI bindings on Appveyor

* [jvm-packages] Build & test on OS X

* [jvm-packages] Re-applied the CMake build changes reverted by #2395

* Fixed Appveyor JVM build

* Muted Maven on Travis

* Don't link with libawt

* "linux2"->"linux"

Python 2.x and 3.x use slightly different values for ``sys.platform``.
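
A minimal sketch of a check that tolerates both platform strings. The helper name `is_linux` is hypothetical and is not taken from this commit; it only illustrates why matching on the prefix is more robust than comparing against "linux2".

```python
import sys


def is_linux(platform=None):
    """Return True for both the Python 2 ("linux2") and Python 3 ("linux") values."""
    # Hypothetical helper: prefix matching avoids depending on the exact string.
    name = sys.platform if platform is None else platform
    return name.startswith("linux")
```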
2017-06-21 12:28:35 -07:00
wxchan
65d2513714 [python-package] fix sklearn n_jobs/nthreads and seed/random_state bug (#2378)
* add a testcase causing RuntimeError

* move seed/random_state/nthread/n_jobs check to get_xgb_params()

* fix failed test
2017-06-12 09:33:42 -04:00
Thejaswi
85b2fb3eee [GPU-Plugin] Integration of a faster version of grow_gpu plugin into mainstream (#2360)
* Integrating a faster version of grow_gpu plugin
1. Removed the older files to reduce duplication
2. Moved all of the grow_gpu files under 'exact' folder
3. All of them are inside 'exact' namespace to avoid any conflicts
4. Fixed a bug in benchmark.py while running only 'grow_gpu' plugin
5. Added cub and googletest submodules to ease integration and unit-testing
6. Updates to CMakeLists.txt to directly build cuda objects into libxgboost

* Added support for building gpu plugins through make flow
1. updated makefile and config.mk to add right targets
2. added unit-tests for gpu exact plugin code

* 1. Added support for building gpu plugin using 'make' flow as well
2. Updated instructions for building and testing gpu plugin

* Fix travis-ci errors for PR#2360
1. lint errors on unit-tests
2. removed googletest; depend instead on the gtest cache provided by dmlc-core

* Some more fixes to travis-ci lint failures PR#2360

* Added Rory's copyrights to the files containing code from both.

* updated copyright statement as per Rory's request

* moved the static datasets into a script to generate them at runtime

* 1. memory usage print when silent=0
2. tests/ and test/ folder organization
3. removal of the dependency of googletest for just building xgboost
4. coding style updates for .cuh as well

* Fixes for compilation warnings

* add cuda object files as well when JVM_BINDINGS=ON
2017-06-06 09:39:53 +12:00
gaw89
0f3a404d91 Sklearn kwargs (#2338)
* Added kwargs support for Sklearn API

* Updated NEWS and CONTRIBUTORS

* Fixed CONTRIBUTORS.md

* Added clarification of **kwargs and test for proper usage

* Fixed lint error

* Fixed more lint errors and clf assigned but never used

* Fixed more lint errors

* Fixed more lint errors

* Fixed issue with changes from different branch bleeding over

* Fixed issue with changes from other branch bleeding over

* Added note that kwargs may not be compatible with Sklearn

* Fixed linting on kwargs note
2017-05-23 21:47:53 -05:00
gaw89
6cea1e3fb7 Sklearn convention update (#2323)
* Added n_jobs and random_state to keep up to date with sklearn API.
Deprecated nthread and seed.  Added tests for new params and
deprecations.

* Fixed docstring to reflect updates to n_jobs and random_state.

* Fixed whitespace issues and removed nose import.

* Added deprecation note for nthread and seed in docstring.

* Attempted fix of deprecation tests.

* Second attempted fix to tests.

* Set n_jobs to 1.
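
The deprecation described above could be handled roughly as follows. This is a sketch under assumptions: `resolve_params` is a hypothetical helper, the `random_state` default of 0 is assumed, and the actual wrapper code in the package may differ.

```python
import warnings


def resolve_params(n_jobs=1, random_state=0, nthread=None, seed=None):
    """Map the deprecated nthread/seed onto n_jobs/random_state, warning on use."""
    if nthread is not None:
        warnings.warn("nthread is deprecated, use n_jobs instead",
                      DeprecationWarning)
        n_jobs = nthread
    if seed is not None:
        warnings.warn("seed is deprecated, use random_state instead",
                      DeprecationWarning)
        random_state = seed
    return {"n_jobs": n_jobs, "random_state": random_state}
```

Callers that still pass `nthread` or `seed` keep working but see a `DeprecationWarning`, which gives them time to migrate.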
2017-05-22 08:22:05 -05:00
jayzed82
29289d2302 Add option to choose booster in scikit interface (gbtree by default) (#2303)
* Add option to choose booster in scikit interface (gbtree by default)

* Add option to choose booster in scikit interface: complete docstring.

* Fix XGBClassifier to work with booster option

* Added test case for gblinear booster
2017-05-18 23:12:27 -04:00
Maurus Cuelenaere
6bd1869026 Add prediction of feature contributions (#2003)
* Add prediction of feature contributions

This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight into how a prediction is composed of its feature contributions
and a bias.

* Support multi-class models

* Calculate learning_rate per-tree instead of using the one from the first tree

* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly

* Add simple test for contributions feature

* Check against param.num_nodes instead of checking for non-zero length

* Loop over all roots instead of only the first
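
The decomposition of a prediction into a bias plus per-feature contributions can be sketched on a toy tree. This is an illustration of the idea from the linked blog post, not the code added by this commit; the nested-dict tree layout and the name `path_contributions` are assumptions made for the example.

```python
def path_contributions(tree, x, n_features):
    """Return (bias, contributions) for one sample on one toy tree."""
    contribs = [0.0] * n_features
    node = tree
    bias = node["value"]  # mean value at the root acts as the bias
    while "feature" in node:  # only internal nodes carry a split
        feat = node["feature"]
        child = node["left"] if x[feat] < node["threshold"] else node["right"]
        # Credit the change in node mean value to the feature that was split on.
        contribs[feat] += child["value"] - node["value"]
        node = child
    return bias, contribs


# Hypothetical two-level tree; "value" is the mean value stored at each node.
toy_tree = {
    "value": 0.5, "feature": 0, "threshold": 1.0,
    "left": {"value": 0.2},
    "right": {"value": 0.8, "feature": 1, "threshold": 3.0,
              "left": {"value": 0.6}, "right": {"value": 1.0}},
}

bias, contribs = path_contributions(toy_tree, [2.0, 5.0], n_features=2)
prediction = bias + sum(contribs)  # equals the value of the leaf that was reached
```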
2017-05-14 00:58:10 -05:00
Philip Cho
2715baef64 Fix bugs in multithreaded ApplySplitSparseData() (#2161)
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()

When there are more threads than rows in rowset, some threads end up
with empty ranges, causing them to crash. (iend - 1 needs to be
accessible as part of algorithm)

Fix: run only those threads with nonempty ranges.
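
A rough sketch of the fix, written in Python for brevity rather than as the C++/OpenMP code the commit touches. The chunking scheme and the name `thread_ranges` are assumptions; the point is only that threads whose range is empty are never scheduled.

```python
def thread_ranges(n_rows, n_threads):
    """Split [0, n_rows) into per-thread [begin, end) ranges, dropping empty ones."""
    chunk = (n_rows + n_threads - 1) // n_threads  # ceiling division
    ranges = []
    for tid in range(n_threads):
        begin = min(tid * chunk, n_rows)
        end = min(begin + chunk, n_rows)
        if begin < end:  # an empty range would make "end - 1" inaccessible
            ranges.append((begin, end))
    return ranges
```

With more threads than rows, only as many ranges as there are rows come back, so no worker is handed an empty slice.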

* Add regression test for Bugfix 1

* Moving python_omp_test to existing python test group

Turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading. Just add nthread parameter.

* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature

When split value is less than all cut points, split_cond is set
incorrectly.

Fix: set split_cond = -1 to indicate this scenario

* Bugfix 3: Initialize data layout indicator before using it

data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.

Fix: re-order code in InitData() to initialize data_layout_ first

* Adding regression test for Bugfix 2

Unfortunately, no regression test for Bugfix 3, as there is no
way to deterministically assign value to an uninitialized variable.
2017-04-02 11:37:39 -07:00
Philip Cho
14fba01b5a Improve multi-threaded performance (#2104)
* Add UpdatePredictionCache() option to updaters

Some updaters (e.g. fast_hist) have enough information to quickly compute
prediction cache for the training data. Each updater may override
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.

* Respond to code review

* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale

* Improve multi-threaded performance

* Use columnwise accessor to accelerate ApplySplit() step,
  with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()

* Add missing #if macro

* Respond to code review

* Use wrapper to enable parallel sort on Linux

* Fix C++ compatibility issues

* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword

* Fix lint issues

* Respond to code review

* Fix bug in ApplySplitSparseData()

* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows have missing values

* Fix training continuation bug

Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build off of an existing (nonempty) ensemble.

* Add regression test for fast_hist

* Respond to code review

* Add back old version of ApplySplitSparseData
2017-03-25 10:35:01 -07:00
Laurae
5c13aa0a8a GLM test unit: make run deterministic (#2147) 2017-03-24 08:54:39 -07:00
Icyblade Dai
301540f1d9 fix DeprecationWarning on sklearn.cross_validation (#2075)
* fix DeprecationWarning on sklearn.cross_validation

* fix syntax

* fix kfold n_split issue

* fix mistype

* fix n_splits multiple value issue

* split should pass an iterable

* use np.arange instead of xrange, py3 compatibility
2017-03-17 08:38:22 -05:00
Tianqi Chen
d581a3d0e7 [UPDATE] Update rabit and threadlocal (#2114)
* [UPDATE] Update rabit and threadlocal

* minor fix to make build system happy

* upgrade requirement to g++4.8

* upgrade dmlc-core

* update travis
2017-03-16 18:48:37 -07:00
Philip Cho
aeb4e76118 Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` to configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`
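
As a sketch, the parameters named above might be combined like this. The numeric values are illustrative assumptions, not defaults or recommendations from this commit.

```python
# Illustrative parameter set; the numeric values are assumptions.
params = {
    "tree_method": "fast_hist",  # histogram-based updater named in this commit
    "grow_policy": "lossguide",  # split the node with the highest loss change first
    "max_leaves": 31,            # assumed value
    "max_bin": 256,              # assumed value
}
# A dict like this is normally passed as the first argument to
# xgboost.train(params, dtrain, ...).
```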

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its elements may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lack support for a few C++11 features, e.g.
  alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
  initializer and remove alignas(*).
* Versions of MSVC older than 2015 do not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which use `using` to declare type aliases). So always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assigns both 0's and 1's to the same single bin.

Why? gmat only contains the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.
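
The effect can be reproduced with a small sketch. The binning rule used here (index of the first cut strictly greater than the value) is an assumption chosen for illustration and is not taken from `HistCutMatrix::Init`; `bin_index` is a hypothetical helper.

```python
from bisect import bisect_right


def bin_index(cuts, value):
    """Assumed binning rule: index of the first cut strictly greater than value."""
    return bisect_right(cuts, value)


broken_cuts = [0, 2]    # smallest value and the upper bound (twice the maximum)
fixed_cuts = [0, 1, 2]  # maximum value (1) added, as the fix describes

broken_bins = [bin_index(broken_cuts, v) for v in (0, 1)]
fixed_bins = [bin_index(fixed_cuts, v) for v in (0, 1)]
```

Under this rule, 0 and 1 share a bin with the broken cuts and land in separate bins once the maximum value is added.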

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
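
The blip can be checked directly against the test-auc values in the log above. The helper `is_non_decreasing` is hypothetical and only illustrates the kind of monotonicity check the test used to apply.

```python
# test-auc values copied from the log above, iterations 0 through 9.
test_auc = [0.999497, 0.999749, 0.999749, 0.999749, 0.999749,
            0.999497, 1.0, 1.0, 1.0, 1.0]


def is_non_decreasing(values):
    """True if no value is lower than the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Iterations at which test-auc dropped relative to the previous iteration.
dips = [i for i in range(1, len(test_auc)) if test_auc[i] < test_auc[i - 1]]
monotonic = is_non_decreasing(test_auc)
```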
2017-01-13 09:25:55 -08:00
jokari69
fb0fc0c580 option to shuffle data in mknfolds (#1459)
* option to shuffle data in mknfolds

* removed possibility to run as stand alone test

* split function def in 2 lines for lint
2016-12-23 07:53:30 +08:00
AbdealiJK
47ba2de7d4 tests/cpp: Add tests for multiclass_metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
a7e20555a3 tests/cpp: Add tests for rank_metrics.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
4a2ef130a7 tests/cpp: Add test for elementwise_metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
03abd47f49 tests/cpp: Add tests for Metric RMSE 2016-12-04 11:25:57 -08:00
AbdealiJK
582c373274 tests/cpp: Add tests for metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
cc859420ba tests/cpp: Add tests for TweedieRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
fa865564f6 tests/cpp: Add tests for GammaRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
401e4b5220 tests/cpp: Add tests for PoissonRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
d41aab4f61 tests/cpp: Add tests for regression_obj.cc
Test the objective functions in regression_obj.cc

tests/cpp: Add tests for objective.cc and RegLossObj
2016-12-04 11:25:57 -08:00