3050 Commits

Author SHA1 Message Date
Michaël Benesty
8e2a1ff2bf Improve setinfo documentation on R package (#2357) 2017-05-30 20:08:31 +02:00
Sergei Lebedev
433269c335 Minor improvements to xgboost/jvm-packages build (#2356)
* Specified 'exec-maven-plugin' version

* Changed 'create_jni.sh' to fail on error

and also report each of the executed commands, which makes it easier
to debug.
2017-05-30 17:51:27 +02:00
davidt0x
b29b7d1d76 Fixed loop bound in create.new.tree.features (#2328)
The for loop in create.new.tree.features was using length(trees) as its upper bound, but trees is a base R dataset, not the model that the code is generating. Changed the loop bound to model$niter, which should be the number of trees.
2017-05-30 17:50:33 +02:00
Juang, Yi-Lin
812300bb7f Update CONTRIBUTORS.md (#2350) 2017-05-27 08:38:32 -07:00
Juang, Yi-Lin
6776292951 Minor cleanup (#2342)
* Clean up demo of multiclass classification

* Remove extra space
2017-05-26 09:40:41 -04:00
Alexander Kiselev
f1dc82e3e1 Update parameter.md (#2348) 2017-05-25 09:27:10 -04:00
gaw89
0f3a404d91 Sklearn kwargs (#2338)
* Added kwargs support for Sklearn API

* Updated NEWS and CONTRIBUTORS

* Fixed CONTRIBUTORS.md

* Added clarification of **kwargs and test for proper usage

* Fixed lint error

* Fixed more lint errors and clf assigned but never used

* Fixed more lint errors

* Fixed more lint errors

* Fixed issue with changes from different branch bleeding over

* Fixed issue with changes from other branch bleeding over

* Added note that kwargs may not be compatible with Sklearn

* Fixed linting on kwargs note
2017-05-23 21:47:53 -05:00
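As a rough Python sketch of the **kwargs pass-through described in #2338 (the forwarded parameter here, tree_method, is an ordinary booster parameter chosen only for illustration; as the last note above says, parameters passed this way may not play well with scikit-learn tooling such as get_params/set_params):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, size=200)

# tree_method is not one of the wrapper's explicit constructor arguments,
# so it is forwarded to the underlying Booster through **kwargs.
clf = xgb.XGBClassifier(max_depth=3, n_estimators=10, tree_method="exact")
clf.fit(X, y)
print(clf.predict(X[:5]))
```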
gaw89
6cea1e3fb7 Sklearn convention update (#2323)
* Added n_jobs and random_state to keep up to date with the sklearn API.
Deprecated nthread and seed. Added tests for the new params and the
deprecations.

* Fixed docstring to reflect updates to n_jobs and random_state.

* Fixed whitespace issues and removed nose import.

* Added deprecation note for nthread and seed in docstring.

* Attempted fix of deprecation tests.

* Second attempted fix to tests.

* Set n_jobs to 1.
2017-05-22 08:22:05 -05:00
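A minimal Python sketch of the renamed arguments from #2323, assuming the deprecation shims behave as the bullet points above describe:

```python
import xgboost as xgb

# Preferred scikit-learn style names after this change:
clf = xgb.XGBClassifier(n_jobs=4, random_state=42)

# The old names still work for now but are deprecated and should
# trigger a DeprecationWarning:
clf_old = xgb.XGBClassifier(nthread=4, seed=42)
```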
Vadim Khotilovich
da1629e848 [gbtree] fix update process to work with multiclass and multitree; fixes #2315 (#2332) 2017-05-21 23:47:57 -05:00
Vadim Khotilovich
b52db87d5c adding feature contributions to R and gblinear (#2295)
* [gblinear] add features contribution prediction; fix DumpModel bug

* [gbtree] minor changes to PredContrib

* [R] add feature contribution prediction to R

* [R] bump up version; update NEWS

* [gblinear] fix the base_margin issue; fixes #1969

* [R] list of matrices as output of multiclass feature contributions

* [gblinear] make order of DumpModel coefficients consistent: group index changes the fastest
2017-05-21 07:41:51 -04:00
Sergei Lebedev
e5e721722e Fix compilation on OS X with GCC 7 (#2256)
* Fix compilation on OS X with GCC 7

Compilation failed with

In file included from src/tree/tree_updater.cc:6:0:
include/xgboost/tree_updater.h:75:46: error: 'function' is not a member of 'std'
                                         std::function<TreeUpdater* ()> > {

caused by a missing <functional> include.

* Fixed another occurrence of that issue spotted by @ClimberPG
2017-05-19 22:04:07 -07:00
PSEUDOTENSOR / Jonathan McKinney
3ca64ffa02 [GPU-Plugin] Improved split finding performance. (#2325) 2017-05-19 19:16:24 -07:00
jayzed82
29289d2302 Add option to choose booster in scikit interface (gbtree by default) (#2303)
* Add option to choose booster in scikit interface (gbtree by default)

* Add option to choose booster in scikit interface: complete docstring.

* Fix XGBClassifier to work with booster option

* Added test case for gblinear booster
2017-05-18 23:12:27 -04:00
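A small Python sketch of the option added in #2303; "gbtree" stays the default, and "gblinear" is the alternative exercised by the new test case:

```python
import xgboost as xgb

tree_clf = xgb.XGBClassifier()                       # booster="gbtree" by default
linear_clf = xgb.XGBClassifier(booster="gblinear")   # linear booster via the sklearn API
```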
Nan Zhu
96f9776ab0 Update ISSUE_TEMPLATE.md (#2308)
* Update ISSUE_TEMPLATE.md

* Update ISSUE_TEMPLATE.md
2017-05-18 08:49:07 -07:00
Nan Zhu
a607f697e3 [jvm-packages] Disable fast histo for spark (#2296)
* add back train method but mark as deprecated

* fix scalastyle error

* disable fast histogram in xgboost4j-spark temporarily
2017-05-15 20:43:16 -07:00
Vadim Khotilovich
c66ca79221 [R] native routines registration (#2290)
* [R] add native routines registration

* c_api.h needs to include <cstdint> since it uses fixed width integer types

* [R] use registered native routines from R code

* [R] bump version; add info on native routine registration to the contributors guide

* make lint happy
2017-05-14 11:00:46 -07:00
Maurus Cuelenaere
6bd1869026 Add prediction of feature contributions (#2003)
* Add prediction of feature contributions

This implements the idea described at http://blog.datadive.net/interpreting-random-forests/,
which tries to give insight into how a prediction is composed of its feature contributions
and a bias.

* Support multi-class models

* Calculate learning_rate per-tree instead of using the one from the first tree

* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly

* Add simple test for contributions feature

* Check against param.num_nodes instead of checking for non-zero length

* Loop over all roots instead of only the first
2017-05-14 00:58:10 -05:00
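From the Python side, the feature contributions added here are reachable through Booster.predict; a minimal sketch on synthetic data (the allclose check reflects the property described above, that contributions plus bias compose the prediction):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

# One column per feature plus a trailing bias column.
contribs = bst.predict(dtrain, pred_contribs=True)

# Each row of contributions sums (approximately) to the raw margin prediction.
margin = bst.predict(dtrain, output_margin=True)
print(np.allclose(contribs.sum(axis=1), margin))
```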
Sergei Lebedev
e62be19c70 Removed 'flink.suffix' and added 'flink.version' (#2277)
The former was just the Scala binary tag, and the latter was hardcoded in
the 'xgboost4j-flink' POM.
2017-05-10 08:42:40 -07:00
Nan Zhu
428453f7d6 [jvm-packages] fix the persistence of XGBoostEstimator (#2265)
* add back train method but mark as deprecated

* fix scalastyle error

* fix the persistence of XGBoostEstimator

* test persistence of a complete pipeline

* fix compilation issue

* do not allow persist custom_eval and custom_obj

* fix the failed test
2017-05-08 21:58:06 -07:00
Rory Mitchell
6bf968efe6 [GPU Plugin] Fast histogram speed improvements. Updated benchmarks. (#2258) 2017-05-08 09:21:38 -07:00
Dmitry Nikulin
98ea461532 Fix typo (#2264) 2017-05-07 16:54:48 -07:00
ebernhardson
197a9eacc5 [jvm-packages] Expose json dumps to scala (#2247)
* Add parameter passthru of format on Booster.getModelDump
2017-05-02 17:41:27 -07:00
ebernhardson
ccccf8a015 [jvm-packages] Accept groupData in spark model eval (#2244)
* Support model evaluation for ranking tasks by accepting groupData in XGBoostModel.eval
2017-05-02 10:03:20 -07:00
Vadim Khotilovich
a375ad2822 [R] maintenance Apr 2017 (#2237)
* [R] make sure things work for a single split model; fixes #2191

* [R] add option use_int_id to xgb.model.dt.tree

* [R] add example of exporting tree plot to a file

* [R] set save_period = NULL as default in xgboost() to be the same as in xgb.train; fixes #2182

* [R] it's a good practice after CRAN releases to bump up package version in dev

* [R] allow xgb.DMatrix construction from integer dense matrices

* [R] xgb.DMatrix: silent parameter; improve documentation

* [R] xgb.model.dt.tree code style changes

* [R] update NEWS with parameter changes

* [R] code safety & style; handle non-strict matrix and inherited classes of input and model; fixes #2242

* [R] change to x.y.z.p R-package versioning scheme and set version to 0.6.4.3

* [R] add an R package versioning section to the contributors guide

* [R] R-package/README.md: clean up the redundant old installation instructions, link the contributors guide
2017-05-01 22:51:34 -07:00
Philip Cho
d769b6bcb5 Fix performance degradation of BuildHist on Windows (#2243)
Reported in issue #2165. Dynamic scheduling of OpenMP loops involves
implicit synchronization. To implement synchronization, libgomp uses futex
(fast userspace mutex), whereas MinGW uses a kernel-space mutex, which is more
costly. With a chunk size of 1, the synchronization overhead may become prohibitive
on Windows machines.

Solution: use the 'guided' schedule to minimize the number of syncs.
2017-05-01 15:54:44 -07:00
ebernhardson
da58f34ff8 Store metrics with learner (#2241)
Storing and then loading a model loses any eval_metric that was
provided. This causes implementations that always store/load, such as
xgboost4j-spark, to be unable to evaluate with the desired metric.
2017-04-30 14:23:24 -07:00
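A rough Python sketch of the save/load round trip this change addresses (file name arbitrary); before the fix, the reloaded booster forgot the configured eval_metric:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "eval_metric": "auc"}
bst = xgb.train(params, dtrain, num_boost_round=5)

bst.save_model("model.bin")
bst2 = xgb.Booster(model_file="model.bin")

# With the metric stored in the learner, evaluation after reload can
# still use "auc" instead of falling back to the objective's default.
print(bst2.eval(dtrain, "reload"))
```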
ebernhardson
d3b866e3fd [jvm-packages] Expose json formatted booster dumps (#2233) (#2234)
* Change Booster dump from XGBoosterDumpModel to XGBoosterDumpModelEx

Allows exposing multiple formatting options of model dumping.
2017-04-29 20:23:09 -07:00
Qiang Kou (KK)
c441d0916e fix #2228 (#2238) 2017-04-29 18:44:08 -07:00
Rory Mitchell
8ab5d4611c [GPU-Plugin] (#2227)
* Add fast histogram algorithm
* Fix Linux build
* Add 'gpu_id' parameter
2017-04-25 16:37:10 -07:00
Tianqi Chen
d281c6aafa Update CONTRIBUTORS.md 2017-04-22 08:46:31 -07:00
Alex Bain
dbaa5d0bdf Disable an invalid check for a completely sparse batch that results in a failed assertion (issue #1827) (#2213) 2017-04-21 09:28:02 -07:00
Nan Zhu
392aa6d1d3 [jvm-packages] make XGBoostModel hold BoosterParams as well (#2214)
* add back train method but mark as deprecated

* fix scalastyle error

* make XGBoostModel hold BoosterParams as well
2017-04-21 08:12:50 -07:00
Benjamin Pachev
e38bea3cdf Update README.md (#2202)
Add a link to a demo for the proposed PHP XGBoost wrapper.
2017-04-17 15:28:37 -07:00
avpronkin
31e800f340 Fix erratum in index.md (#2203)
'Mxnet' appeared in place of 'XGBoost'.
2017-04-17 15:24:18 -07:00
Seong-Jin Kim
8222755564 Fix typo in R-package README.md (#2190) 2017-04-13 20:22:23 +02:00
Preston Parry
1ab8088a09 Removes extraneous log (#2186)
This log message appears to fire every time I ask the Python package to make a prediction, and it's the only log output from XGBoost. When we're getting predictions on millions of items a day in production, this log seems out of place.
2017-04-11 17:38:29 -07:00
Nan Zhu
a837fa9620 [jvm-packages] RDDs containing boosters should be cleaned up once the boosters have been collected to the driver (#2183) 2017-04-11 06:12:49 -07:00
Nan Zhu
f08077606c [jvm-packages] Clean external cache (#2181)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* small fix for cleanExternalCache
2017-04-10 07:49:58 -07:00
Nan Zhu
8d8cbcc6db [jvm-packages] fixed several issues in unit tests (#2173)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix several issues in tests
2017-04-06 06:25:23 -07:00
Philip Cho
2715baef64 Fix bugs in multithreaded ApplySplitSparseData() (#2161)
* Bugfix 1: Fix segfault in multithreaded ApplySplitSparseData()

When there are more threads than rows in the rowset, some threads end up
with empty ranges, causing them to crash (iend - 1 needs to be
accessible as part of the algorithm).

Fix: run only those threads with nonempty ranges.

* Add regression test for Bugfix 1

* Moving python_omp_test to existing python test group

It turns out you don't need to set "OMP_NUM_THREADS" to enable
multithreading; just add the nthread parameter.

* Bugfix 2: Fix corner case of ApplySplitSparseData() for categorical feature

When the split value is less than all cut points, split_cond is set
incorrectly.

Fix: set split_cond = -1 to indicate this scenario

* Bugfix 3: Initialize data layout indicator before using it

data_layout_ is accessed before being set; this variable determines
whether feature 0 is included in feat_set.

Fix: re-order code in InitData() to initialize data_layout_ first

* Adding regression test for Bugfix 2

Unfortunately, there is no regression test for Bugfix 3, as there is no
way to deterministically assign a value to an uninitialized variable.
2017-04-02 11:37:39 -07:00
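A brief Python illustration of the testing note above (no OMP_NUM_THREADS needed, just the nthread parameter); tree_method="hist" is assumed here to select the fast histogram updater that these fixes touch:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(500, 10)
y = np.random.randint(0, 2, size=500)
dtrain = xgb.DMatrix(X, label=y)

# Multithreading is enabled via the nthread parameter alone;
# the OMP_NUM_THREADS environment variable is not required.
params = {"objective": "binary:logistic", "tree_method": "hist", "nthread": 4}
bst = xgb.train(params, dtrain, num_boost_round=10)
```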
Denis M Korzhenkov
ed5e75de2f Nonreproducible sequence of evaluations fixed (#2153)
Since `num_round=2`, there is no `0003.model` file after training.
2017-03-29 10:11:23 -07:00
Rory Mitchell
a33fa05bda GPU Plugin: Bug fix #2048 (#2155) 2017-03-29 10:10:57 -07:00
Huffers
d45cf240a9 Remove xgboost's thread_local and switch to dmlc::ThreadLocalStore (#2121)
* Remove xgboost's own version of thread_local and switch to dmlc::ThreadLocalStore (#2109)

* Update dmlc-core
2017-03-27 09:09:18 -07:00
Philip Cho
14fba01b5a Improve multi-threaded performance (#2104)
* Add UpdatePredictionCache() option to updaters

Some updaters (e.g. fast_hist) have enough information to quickly compute
the prediction cache for the training data. Each updater may override the
UpdatePredictionCache() method to update the prediction cache. Note: this
trick does not apply to validation data.

* Respond to code review

* Disable some debug messages by default
* Document UpdatePredictionCache() interface
* Remove base_margin logic from UpdatePredictionCache() implementation
* Do not take pointer to cfg, as reference may get stale

* Improve multi-threaded performance

* Use columnwise accessor to accelerate ApplySplit() step,
  with support for a compressed representation
* Parallel sort for evaluation step
* Inline BuildHist() function
* Cache gradient pairs when building histograms in BuildHist()

* Add missing #if macro

* Respond to code review

* Use wrapper to enable parallel sort on Linux

* Fix C++ compatibility issues

* MSVC doesn't support unsigned in OpenMP loops
* gcc 4.6 doesn't support using keyword

* Fix lint issues

* Respond to code review

* Fix bug in ApplySplitSparseData()

* Attempting to read beyond the end of a sparse column
* Mishandling the case where an entire range of rows have missing values

* Fix training continuation bug

Disable UpdatePredictionCache() in the first iteration. This way, we can
accommodate the scenario where we build on top of an existing (nonempty) ensemble.

* Add regression test for fast_hist

* Respond to code review

* Add back old version of ApplySplitSparseData
2017-03-25 10:35:01 -07:00
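A short Python sketch of the training-continuation scenario targeted by the last fix above, assuming tree_method="hist" selects the fast_hist updater; passing xgb_model continues boosting from an existing (nonempty) ensemble:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(300, 8)
y = np.random.randint(0, 2, size=300)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "binary:logistic", "tree_method": "hist"}
bst = xgb.train(params, dtrain, num_boost_round=10)

# Continue training on top of the existing ensemble; the prediction
# cache must not be reused for the first iteration of the new run.
bst = xgb.train(params, dtrain, num_boost_round=10, xgb_model=bst)
```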
Denis M Korzhenkov
332aea26a3 Formatting fixed for CLI parameters (#2145)
Fixed the formatting of the parameter list for CLI mode.
2017-03-24 08:54:58 -07:00
Laurae
5c13aa0a8a GLM unit test: make the run deterministic (#2147) 2017-03-24 08:54:39 -07:00
付雨帆
f1fe024a9d Update Markdown grammar in README.md (#2141) 2017-03-23 11:02:06 -07:00
Qin Xiaoming
12cf0ae122 Update sparse_page_dmatrix.h (#2139) 2017-03-23 11:01:40 -07:00
Yang Zhang
48835c3a4e Update predict leaf indices (#2135)
* Updated sklearn_parallel.py for soon-to-be-deprecated modules

* Updated predict_leaf_indices.py; use python3 print() as in other examples and removed an unused module
2017-03-22 19:12:34 -07:00
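For context, the demo exercises the pred_leaf mode of Booster.predict; a minimal sketch:

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=3)

# One column per boosted tree; each entry is the index of the leaf
# the row ends up in for that tree.
leaf_indices = bst.predict(dtrain, pred_leaf=True)
print(leaf_indices.shape)  # (100, 3)
```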
Matthew R. Becker
a4bae1bdcd ENH more makefile updates (#2133)
This commit proposes a simpler single compiler specification for OS X and *nix. It also lets people override the setting on both systems, not just *nix.
2017-03-22 16:22:15 -05:00