Compare commits

...

1152 Commits

Author SHA1 Message Date
Jiaming Yuan
584b45a9cc Release 1.5.0. (#7317) 2021-10-15 12:21:04 +08:00
Jiaming Yuan
30c1b5c54c [backport] Fix prediction with cat data in sklearn interface. (#7306) (#7312)
* Specify DMatrix parameter for pre-processing dataframe.
* Add document about the behaviour of prediction.
2021-10-12 18:49:57 +08:00
Jiaming Yuan
36e247aca4 Fix weighted samples in multi-class AUC. (#7300) (#7305) 2021-10-11 18:00:36 +08:00
Jiaming Yuan
c4aff733bb [backport] Fix cv verbose_eval (#7291) (#7296) 2021-10-08 14:24:27 +08:00
Jiaming Yuan
cdbfd21d31 [backport] Fix gamma neg log likelihood. (#7275) (#7285) 2021-10-05 23:01:11 +08:00
Jiaming Yuan
508a0b0dbd [backport] [R] Fix document for nthread. (#7263) (#7269) 2021-09-28 14:41:32 +08:00
Jiaming Yuan
e04e773f9f Add RC1 tag for building packages. (#7261) 2021-09-28 11:50:18 +08:00
Jiaming Yuan
1debabb321 Change version to 1.5.0. (#7258) 2021-09-26 13:27:54 +08:00
Jiaming Yuan
d8a549e6ac Avoid thread block with sparse data. (#7255) 2021-09-25 13:11:34 +08:00
Jiaming Yuan
ca17f8a5fc Dispatch thrust versions and upgrade rmm. (#7254)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-09-25 03:43:23 +08:00
Jiaming Yuan
fbd58bf190 [jvm-packages] Create demo and test for xgboost4j early stopping. (#7252) 2021-09-25 03:29:27 +08:00
Bobby Wang
0ee11dac77 [jvm-packages][xgboost4j-gpu] Support GPU dataframe and DeviceQuantileDMatrix (#7195)
The following classes are added to support dataframes in the Java binding:

- `Column` is an abstract type for a single column in tabular data.
- `ColumnBatch` is an abstract type for a dataframe.

- `CuDFColumn` is an implementation of `Column` that consumes a cuDF column.
- `CudfColumnBatch` is an implementation of `ColumnBatch` that consumes a cuDF dataframe.

- `DeviceQuantileDMatrix` is the interface for quantized data.

The Java implementation mimics the Python interface and uses the `__cuda_array_interface__` protocol for memory indexing. One difference is that in the JVM package the data batch is staged on the host, as Java iterators cannot be reset.

Co-authored-by: jiamingy <jm.yuan@outlook.com>
2021-09-24 14:25:00 +08:00
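For reference, a minimal sketch of the Python interface that this JVM binding mimics (assumes a CUDA-enabled XGBoost build and an installed cuDF; data values are illustrative):

```
import cudf
import xgboost

# Each cuDF column exposes the __cuda_array_interface__ protocol.
df = cudf.DataFrame({"f0": [1.0, 2.0, 3.0], "f1": [4.0, 5.0, 6.0]})
y = cudf.Series([0.0, 1.0, 0.0])

# Quantized DMatrix constructed directly from device memory.
Xy = xgboost.DeviceQuantileDMatrix(df, label=y)
booster = xgboost.train({"tree_method": "gpu_hist"}, Xy, num_boost_round=10)
```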
Philip Hyunsu Cho
d27a427dc5 [CI] Rotate access keys for uploading MacOS artifacts from Travis CI (#7253) 2021-09-24 10:44:00 +08:00
ShvetsKS
475fd1abec Reduced span overheads in objective function calculation (#7206)
Co-authored-by: fis <jm.yuan@outlook.com>
2021-09-23 04:43:59 +08:00
Jiaming Yuan
9472be7d77 Fix initialization from pandas series. (#7243) 2021-09-23 04:43:25 +08:00
david-cortes
4f93e5586a Improve wording for warning (#7248)
This warning sounds a bit ungrammatical. Additionally, the second part of the warning is not clear. This PR changes the wording to make it clearer.
2021-09-21 10:48:11 +08:00
Jiaming Yuan
18bd16341a Update Python intro. [skip ci] (#7235)
* Fix the link to demo.
* Stop recommending text file inputs.
* Briefly mention the scikit-learn interface.
* Fix indent warning in tree method doc.
2021-09-21 02:47:09 +00:00
david-cortes
61a619b5c3 [R] Avoid symbol naming conflicts with other packages (#7245)
* don't register all R symbols

* typo
2021-09-19 11:17:08 -07:00
Jiaming Yuan
e48e05e6e2 Add typehint to rabit module. (#7240) 2021-09-17 18:31:02 +08:00
Jiaming Yuan
c735c17f33 Disable callback and ES on random forest. (#7236) 2021-09-17 18:21:17 +08:00
Jiaming Yuan
c311a8c1d8 Enable compiling with system cub. (#7232)
- Tested with all CUDA 11.x.
- Workaround cub scan by using discard iterator in AUC.
- Limit the size of Argsort when compiled with CUDA cub.
2021-09-17 14:28:18 +08:00
Jiaming Yuan
b18f5f61b0 Fix pylint (#7241) 2021-09-17 11:50:36 +08:00
Jiaming Yuan
38a23f66a8 Fix typo in release script. [skip ci] (#7238) 2021-09-17 11:14:05 +08:00
Jiaming Yuan
8ad7e8eeb0 [doc] Fix typo. [skip ci] (#7226) 2021-09-17 11:13:49 +08:00
Jiaming Yuan
22d56cebf1 Encode pandas categorical data automatically. (#7231) 2021-09-17 11:09:55 +08:00
Jiaming Yuan
32e0858501 Fix travis. (#7237) 2021-09-17 10:06:23 +08:00
Jiaming Yuan
31c1e13f90 Categorical data support in CPU sketching. (#7221) 2021-09-17 04:37:09 +08:00
Jiaming Yuan
9f63d6fead [jvm-packages] Deprecate constructors with implicit missing value. (#7225) 2021-09-17 04:35:04 +08:00
Jiaming Yuan
0ed979b096 Support more input types for categorical data. (#7220)
* Support more input types for categorical data.

* Shorten the type name from "categorical" to "c".
* Tests for np/cp array and scipy csr/csc/coo.
* Specify the type for feature info.
2021-09-16 20:39:30 +08:00
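A minimal sketch of the shortened type name on a NumPy input (hedged: validation details may differ between versions, and categorical training was still experimental at this point):

```
import numpy as np
import xgboost

X = np.array([[0.5, 0.0], [1.5, 1.0], [2.5, 2.0]])
y = np.array([0.0, 1.0, 2.0])

# "q" marks a quantitative feature, "c" a categorical one.
Xy = xgboost.DMatrix(X, label=y, feature_types=["q", "c"])
```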
Jiaming Yuan
2942dc68e4 Fix mixed types in GPU sketching. (#7228) 2021-09-16 00:10:25 +08:00
Jiaming Yuan
037dd0820d Implement __sklearn_is_fitted__. (#7230) 2021-09-15 19:09:04 +08:00
Jiaming Yuan
d997c967d5 Demo for experimental categorical data support. (#7213) 2021-09-15 08:20:12 +08:00
Jiaming Yuan
3515931305 Initial support for external memory in gradient index. (#7183)
* Add hessian to batch param in preparation of new approx impl.
* Extract a push method for gradient index matrix.
* Use span instead of vector ref for hessian in sketching.
* Create a binary format for gradient index.
2021-09-13 12:40:56 +08:00
Christian Lorentzen
a0dcf6f5c1 [DOC] Improve tutorial on feature interactions (#7219) 2021-09-12 21:40:02 +08:00
Jiaming Yuan
804b2ac60f Expose DMatrix API for CUDA columnar and array. (#7217)
* Use JSON encoded configurations.
* Expose them into header file.
2021-09-09 17:55:25 +08:00
Jiaming Yuan
68a2c7b8d6 Fix memory leak in demo. (#7216) 2021-09-09 13:51:03 +08:00
Jiaming Yuan
b12e7f7edd Add noexcept to JSON objects. (#7205) 2021-09-07 13:56:48 +08:00
Jiaming Yuan
3a4f51f39f Avoid calling CUDA code on CPU for linear model. (#7154) 2021-09-01 10:45:31 +08:00
Jiaming Yuan
ba69244a94 Restore the custom double atomic add. (#7198) 2021-08-28 18:30:42 +08:00
Jiaming Yuan
7a1d67f9cb [breaking] Use integer atomic for GPU histogram. (#7180)
On GPU we use a rounding factor to truncate the gradient for deterministic results. This PR changes the gradient representation to a fixed-point number with the exponent aligned with the rounding factor.

    [breaking] Drop non-deterministic histogram.
    Use fixed point for shared memory.

This PR is to improve the performance of GPU Hist. 

Co-authored-by: Andy Adinets <aadinets@nvidia.com>
2021-08-28 05:17:05 +08:00
Jiaming Yuan
e7d7ab6bc3 Better error message for ncclUnhandledCudaError. (#7190) 2021-08-27 10:29:22 +08:00
Philip Hyunsu Cho
b70e07da1f [CI] Clean up in beginning of each task in Win CI (#7189) 2021-08-25 04:15:22 -07:00
Jiaming Yuan
cdfaa705f3 Fix building on CUDA 11.0. (#7187) 2021-08-25 02:57:53 -07:00
Philip Hyunsu Cho
3060f0b562 [CI] Automatically build GPU-enabled R package for Windows (#7185)
* [CI] Automatically build GPU-enabled R package for Windows

* Update Jenkinsfile-win64

* Build R package for the release branch only

* Update install doc
2021-08-25 02:11:01 -07:00
Jiaming Yuan
9c64618cb6 [breaking] Remove CUDA sm_35, add sm_86 (#7182) 2021-08-25 16:04:23 +08:00
Philip Hyunsu Cho
d04312b9c0 [CI] Fix hanging Python setup in Windows CI (#7186) 2021-08-24 22:03:51 -07:00
Jiaming Yuan
ee8d1f5ed8 Fix histogram truncation. (#7181)
* Fix truncation.

* Lint.

* lint.
2021-08-24 18:34:32 -07:00
Jiaming Yuan
3290a4f3ed Re-enable feature validation in predict proba. (#7177) 2021-08-22 15:28:08 +08:00
Jiaming Yuan
bf562bd33c Remove unused code. (#7175) 2021-08-18 14:02:19 +08:00
Anton Kostin
01b7acba30 Update conf.py (#7174) 2021-08-17 03:38:26 +08:00
Anton Kostin
ec849ec335 Update README.md (#7173) 2021-08-17 03:37:53 +08:00
Martin Petříček
46c46829ce Fix model loading from stream (#7067)
Fix bug introduced in 17913713b5 (allow loading from byte array)

When loading a model from a stream, only the last buffer read from the input stream was used to construct the model.

This may work for models smaller than 1 MiB (if you are lucky enough to read the whole model at once), but it will always fail if the model is larger.
2021-08-15 21:04:33 +08:00
Jiaming Yuan
6bcbc77226 [doc] Fix typo. [skip ci] (#7170) 2021-08-13 03:48:16 +08:00
Jiaming Yuan
3f38d983a6 Fix prediction configuration. (#7159)
After the predictor parameter was added to the constructor, this configuration was broken.
2021-08-11 16:34:36 +08:00
Jiaming Yuan
9600ca83f3 Remove synchronization in monitor. (#7164)
* Remove synchronization in monitor.

Calling rabit functions during destruction is flaky.

* Add xgboost prefix to nvtx marker.
2021-08-11 16:33:53 +08:00
Jiaming Yuan
149f209af6 Extract histogram builder from CPU Hist. (#7152)
* Extract the CPU histogram builder.
* Fix tests.
* Reduce number of histograms being built.
2021-08-09 21:15:21 +08:00
Philip Hyunsu Cho
336af4f974 Work around a segfault observed in SparsePage::Push() (#7161)
* Work around a segfault observed in SparsePage::Push()

* Revert "Work around a segfault observed in SparsePage::Push()"

This reverts commit 30934844d00908750a5442082eb4769b1489f6a9.

* Don't call vector::resize() inside OpenMP block

* Set GITHUB_PAT env var to fix R tests

* Use built-in GITHUB_TOKEN
2021-08-08 02:12:30 -07:00
AJ Schmidt
f7003dc819 Include cpack (#7160)
Co-authored-by: ptaylor <paul.e.taylor@me.com>
2021-08-07 00:57:34 +08:00
Jiaming Yuan
8a84be37b8 Pass scikit learn estimator checks for regressor. (#7130)
* Check data shape.
* Check labels.
2021-08-03 18:58:20 +08:00
Jiaming Yuan
8ee127469f [R] Fix nthread in DMatrix constructor. (#7127)
* Break the R C API for nthread.
2021-08-03 17:39:25 +08:00
Jiaming Yuan
ba47eda61b [doc] Use figure directive. (#7143) 2021-08-03 15:56:25 +08:00
Jiaming Yuan
e2c406f5c8 Support min_delta in early stopping. (#7137)
* Support `min_delta` in early stopping.

* Remove abs_tol.
2021-08-03 14:29:17 +08:00
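A short sketch of the new parameter (synthetic data; `min_delta` requires this release or later):

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 4), np.random.rand(100)
dtrain = xgboost.DMatrix(X[:80], label=y[:80])
dvalid = xgboost.DMatrix(X[80:], label=y[80:])

# Stop when the metric fails to improve by at least min_delta for `rounds` rounds.
early_stop = xgboost.callback.EarlyStopping(rounds=5, min_delta=1e-3, save_best=True)
booster = xgboost.train({"objective": "reg:squarederror"}, dtrain,
                        num_boost_round=100, evals=[(dvalid, "validation")],
                        callbacks=[early_stop])
```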
Jiaming Yuan
7bdedacb54 Document for process_type. (#7135)
* Update document for prune and refresh.

* Add demo.
2021-08-03 13:11:52 +08:00
Jiaming Yuan
d080b5a953 Fix model slicing. (#7149)
* Use correct pointer.
* Remove best_iteration/best_score.
2021-08-03 11:51:56 +08:00
Jiaming Yuan
36346f8f56 C API demo for inference. (#7151) 2021-08-03 00:46:47 +08:00
Jiaming Yuan
1369133916 [dask] Remove the workaround for segfault. (#7146) 2021-07-30 03:57:53 +08:00
Philip Hyunsu Cho
f1a4a1ac95 [CI] Upgrade build image to CentOS 7 + GCC 8; require CUDA 10.1 and later (#7141) 2021-07-29 10:54:33 -07:00
graue70
dfdf0b08fc Fix typo and grammatical mistake in error message (#7134) 2021-07-28 17:17:05 +08:00
Gil Forsyth
92ae3abc97 [dask] Disallow importing non-dask estimators from xgboost.dask (#7133)
* Disallow importing non-dask estimators from xgboost.dask

This is mostly a style change, but also avoids a user error (that I have
committed on a few occasions).  Since `XGBRegressor` and `XGBClassifier`
are imported as parent classes for the `dask` estimators, without
defining an `__all__`, autocomplete (or muscle) memory will produce the
following with little prompting:

```
from xgboost.dask import XGBClassifier
```

There's nothing inherently wrong with that, but given that
`XGBClassifier` is not `dask` enabled, it can lead to confusing behavior
until you figure out you should've typed

```
from xgboost.dask import DaskXGBClassifier
```

Another option is to alias import the existing non-dask estimators.

* Remove base/iter class, add train predict funcs
2021-07-28 02:07:23 +08:00
Robert Maynard
1a75f43304 Allow compilation with nvcc 11.4 (#7131)
* Use type aliases for discard iterators

* update to include host_vector as thrust 1.12 doesn't bring it in as a side-effect

* cub::DispatchRadixSort requires signed offset types
2021-07-27 20:05:33 +08:00
Jiaming Yuan
7017dd5a26 [JVM-Packages] Use Python tracker in XGBoost for JVM package. (#7132) 2021-07-27 16:20:42 +08:00
Jiaming Yuan
48d5de80a2 [R] Fix softprob reshape. (#7126) 2021-07-27 15:25:17 +08:00
Jiaming Yuan
7ee7a95b84 Use upstream URI in distributed quantile tests. (#7129)
* Use upstream URI in distributed quantile tests.

* Fix test cv `PytestAssertRewriteWarning`.
2021-07-27 14:09:49 +08:00
Jiaming Yuan
e88ac9cc54 [dask] Extend tree stats tests. (#7128)
* Add tests to GPU.
* Assert cover in children sums up to the parent.
2021-07-27 12:22:13 +08:00
Jiaming Yuan
778135f657 Fix parameter loading with training continuation. (#7121)
* Add a demo for training continuation.
2021-07-23 10:51:47 +08:00
Taewoo Kim
41e882f80b Check input value is duplicated when quantile queue is full (#7091)
Co-authored-by: Taewoo Kim <taewoo@layer6.com>
2021-07-23 03:07:01 +08:00
ShvetsKS
caa9e527dd Remove extra sync for dense data (#7120)
Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2021-07-22 19:02:31 +08:00
Jiaming Yuan
e6088366df Export Python Interface for external memory. (#7070)
* Add Python iterator interface.
* Add tests.
* Add demo.
* Add documents.
* Handle empty dataset.
2021-07-22 15:15:53 +08:00
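A minimal sketch of the exported iterator interface (in-memory batches stand in for data loaded from disk; the cache path is illustrative):

```
import os
import numpy as np
import xgboost

class Iterator(xgboost.DataIter):
    """Yields one batch at a time; XGBoost caches pages under cache_prefix."""

    def __init__(self, batches):
        self._batches = batches
        self._it = 0
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        if self._it == len(self._batches):
            return 0                      # no more batches
        X, y = self._batches[self._it]
        input_data(data=X, label=y)       # same keyword arguments as DMatrix
        self._it += 1
        return 1

    def reset(self):
        self._it = 0                      # rewind for the next pass

batches = [(np.random.rand(64, 4), np.random.rand(64)) for _ in range(4)]
Xy = xgboost.DMatrix(Iterator(batches))   # external-memory DMatrix
```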
farfarawayzyt
e64ee6592f fix typo in src/common/hist.cc BuildHistKernel (#7116) 2021-07-21 19:53:05 +08:00
naveenkb
9f7f8b976d [XGBoost4J-Spark] bestIteration and bestScore for early stopping (#7095) 2021-07-19 18:46:49 +08:00
farfarawayzyt
d7c14496d2 fix typo in arguments of PartitionBuilder::Init (#7113)
Co-authored-by: Yuntian Zhang <zhangyt@lamda.nju.edu.cn>
2021-07-16 15:46:22 +08:00
Jiaming Yuan
bd1f3a38f0 Rewrite sparse dmatrix using callbacks. (#7092)
- Reduce dependency on dmlc parsers and provide an interface for users to load data by themselves.
- Remove use of threaded iterator and IO queue.
- Remove `page_size`.
- Make sure the number of pages in memory is bounded.
- Make sure the cache can not be violated.
- Provide an interface for internal algorithms to process data asynchronously.
2021-07-16 12:33:31 +08:00
Jiaming Yuan
2f524e9f41 [dask] Work around segfault in prediction. (#7112) 2021-07-16 04:27:05 +08:00
Jiaming Yuan
abec3dbf6d Fix thread safety of softmax prediction. (#7104) 2021-07-16 02:06:55 +08:00
Philip Hyunsu Cho
2801d69fb7 [CI] Pin libomp to 11.1.0 (#7107) 2021-07-15 11:16:51 +08:00
Jiaming Yuan
8e8232fb4c [CI] Update R cache. (#7102) 2021-07-14 03:15:35 +08:00
Jiaming Yuan
345796825f Optional find dependency in installed cmake config. (#7099)
* Find dependency only when xgboost is built as static library.
* Resolve msvc warning.
* Add test for linking shared library.
2021-07-11 17:20:55 +08:00
ZabelTech
1d91f71119 fix typo in XGDMatrixSetFloatInfo example (#7097) 2021-07-10 21:40:25 +08:00
Jiaming Yuan
77f6cf2d13 Support hessian in host sketch container. (#7081)
Prepare for migrating approx onto hist's codebase.
2021-07-08 16:33:58 +08:00
Jiaming Yuan
84d359efb8 Support host data in proxy DMatrix. (#7087) 2021-07-08 11:35:48 +08:00
Jiaming Yuan
5d7cdf2e36 [Breaking] Rename Quantile DMatrix C API. (#7082)
The role of ProxyDMatrix has grown beyond what it was designed for. It is now used by both
QuantileDeviceDMatrix and inplace prediction. After the refactoring of the sparse DMatrix it
will also be used for external memory. This renames the C API to decouple it from
QuantileDeviceDMatrix.
2021-07-08 11:34:14 +08:00
Jiaming Yuan
c766f143ab Refactor external memory formats. (#7089)
* Save base_rowid.
* Return write size.
* Remove unused function.
2021-07-08 04:04:51 +08:00
Jiaming Yuan
689eb8f620 Check external memory support for exact tree method. (#7088) 2021-07-08 02:12:57 +08:00
Jiaming Yuan
615ab2b03e Extract evaluate splits from CPU hist. (#7079)
Other than modularizing the split evaluation function, this PR also removes some more functions, including `InitNewNodes` and `BuildNodeStats`, among some other unused variables. Also, scattered code like setting leaf weights is grouped into the split evaluator, and `NodeEntry` is simplified and made private. Another subtle difference from the original implementation is that the modified code doesn't call `tree[nidx].Parent()` to traverse upward.
2021-07-07 15:16:25 +08:00
Jeff H
d22b293f2f Update reference to treelite website (#7084)
treelite.io is no longer a valid site and redirects users to a parked domain. Redirecting to the documentation is safer at this point.
2021-07-06 22:15:07 -07:00
Jiaming Yuan
f937f514aa Remove lz4 compression with external memory. (#7076) 2021-07-06 14:46:43 +08:00
Jiaming Yuan
116d711815 Make SimpleDMatrix ctor reusable. (#7075) 2021-07-06 13:38:24 +08:00
Jiaming Yuan
d7e1fa7664 Fix feature names and types in output model slice. (#7078) 2021-07-06 11:47:49 +08:00
Jiaming Yuan
ffa66aace0 Persist data in dask test. (#7077) 2021-07-06 11:47:17 +08:00
Jiaming Yuan
b56d3d5d5c Fix with latest pandas range index. (#7074) 2021-07-03 16:43:52 +08:00
Jiaming Yuan
93f3acdef9 Fix with latest pylint. (#7071) 2021-07-02 21:26:00 +08:00
Jiaming Yuan
a5d222fcdb Handle categorical split in model histogram and dataframe. (#7065)
* Error on get_split_value_histogram when feature is categorical
* Add a category column to output dataframe
2021-07-02 13:10:36 +08:00
Jiaming Yuan
1cd20efe68 Move GHistIndex into DMatrix. (#7064) 2021-07-01 00:44:49 +08:00
Jiaming Yuan
1c8fdf2218 Remove use of device_idx in dh::LaunchN. (#7063)
It's an unused parameter; removing it makes the CI log more readable.
2021-06-29 11:37:26 +08:00
Philip Hyunsu Cho
dd4db347f3 Fix early stopping behavior with MAPE metric (#7061) 2021-06-26 03:02:33 +08:00
Jiaming Yuan
8fa32fdda2 Implement categorical data support for SHAP. (#7053)
* Add CPU implementation.
* Update GPUTreeSHAP.
* Add GPU implementation by defining custom split condition.
2021-06-25 19:02:46 +08:00
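SHAP values are obtained through the normal predict call; a small sketch with synthetic data:

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 4), np.random.rand(100)
Xy = xgboost.DMatrix(X, label=y)
booster = xgboost.train({"tree_method": "hist"}, Xy, num_boost_round=10)

# One contribution column per feature plus a trailing bias column.
shap_values = booster.predict(Xy, pred_contribs=True)
assert shap_values.shape == (100, 5)
```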
Jiaming Yuan
663136aa08 Implement feature score for linear model. (#7048)
* Add feature score support for linear model.
* Port R interface to the new implementation.
* Add linear model support in Python.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-06-25 14:34:02 +08:00
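A hedged sketch of the Python side (for `gblinear`, the score is derived from the model coefficients; synthetic data):

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 3), np.random.rand(100)
Xy = xgboost.DMatrix(X, label=y)
booster = xgboost.train({"booster": "gblinear"}, Xy, num_boost_round=10)

print(booster.get_score(importance_type="weight"))  # per-feature scores
```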
Philip Hyunsu Cho
b2d300e727 [CI] Upgrade to CMake 3.14 (#7060)
* [CI] Upgrade to CMake 3.14

* Add FATAL_ERROR directive, for users with CMake 2.x
2021-06-24 18:07:24 -07:00
Jiaming Yuan
1d4d345634 Tests for dask skl categorical data support. (#7054) 2021-06-24 16:33:57 +08:00
Jiaming Yuan
da1ad798ca Convert numpy float to Python float in feat score. (#7047) 2021-06-21 20:58:43 +08:00
Jiaming Yuan
bbfffb444d Fix race condition in CPU shap. (#7050) 2021-06-21 10:03:15 +08:00
Jiaming Yuan
29f8fd6fee Support categorical split in tree model dump. (#7036) 2021-06-18 16:46:20 +08:00
Jiaming Yuan
7968c0d051 Test on s390x. (#7038)
* Fix && remove unused parameter.
2021-06-18 14:55:08 +08:00
Jiaming Yuan
86715e4cd4 Support categorical data for dask functional interface and DQM. (#7043)
* Support categorical data for dask functional interface and DQM.

* Implement categorical data support for GPU GK-merge.
* Add support for dask functional interface.
* Add support for DQM.

* Get newer cupy.
2021-06-18 13:06:52 +08:00
Jiaming Yuan
7dd29ffd47 Implement feature score in GBTree. (#7041)
* Categorical data support.
* Eliminate text parsing during feature score computation.
2021-06-18 11:53:16 +08:00
Jiaming Yuan
dcd84b3979 [CI] Configure RAPIDS, dask, modin (#7033)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-06-18 10:27:51 +08:00
Jiaming Yuan
d9799b09d0 Categorical data support for cuDF. (#7042)
* Add support in DMatrix.
* Add support in DQM, except for iterator.
2021-06-17 13:54:33 +08:00
Jiaming Yuan
5c2d7a18c9 Parallel model dump for trees. (#7040) 2021-06-15 14:08:26 +08:00
ShvetsKS
2567404ab6 Simplify sparse and dense CPU hist kernels (#7029)
* Simplify sparse and dense kernels
* Extract row partitioner.

Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2021-06-11 18:26:30 +08:00
Jiaming Yuan
1faad825f4 Remove appveyor badge. [skip ci] (#7035) 2021-06-11 14:37:18 +08:00
Jiaming Yuan
b56614e9b8 [R] Use new predict function. (#6819)
* Call new C prediction API.
* Add `strict_shape`.
* Add `iterationrange`.
* Update document.
2021-06-11 13:03:29 +08:00
jmoralez
25514e104a [dask] speed up tests (#7020) 2021-06-11 11:43:01 +08:00
Jiaming Yuan
f79cc4a7a4 Implement categorical prediction for CPU and GPU predict leaf. (#7001)
* Categorical prediction with CPU predictor and GPU predict leaf.

* Implement categorical prediction for CPU prediction.
* Implement categorical prediction for GPU predict leaf.
* Refactor the prediction functions to have a unified get next node function.

Co-authored-by: Shvets Kirill <kirill.shvets@intel.com>
2021-06-11 10:11:45 +08:00
Jiaming Yuan
72f9daf9b6 Fix gpu_id with custom objective. (#7015) 2021-06-09 14:51:17 +08:00
TP Boudreau
bd2ca543c4 Fix BinarySearchBin() argument types (#7026) 2021-06-08 19:05:46 +08:00
Jiaming Yuan
7beb2f7fae Hide symbols in CI build + hide symbols for C and CUDA (#6798)
* Hide symbols in CI build.
* Hide symbols for other languages.
2021-06-04 02:35:46 +08:00
Jiaming Yuan
c4b9f4f622 Add enable_categorical to sklearn. (#7011) 2021-06-04 02:29:14 +08:00
Philip Hyunsu Cho
655e6992f6 [Dask] Add example of using custom callback in Dask (#6995) 2021-06-03 07:05:55 +08:00
ShvetsKS
5cdaac00c1 Remove feature grouping (#7018)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2021-06-03 04:35:26 +08:00
Philip Hyunsu Cho
05db6a6c29 [CI] Upgrade cuDF and RMM to 21.06 nightly (#7012)
* [CI] Upgrade cuDF and RMM to 21.06 nightly

* Trim outdated test cases

* Pin Dask version to 2021.05.0 for now
2021-06-02 11:59:30 -07:00
ShvetsKS
57c732655e Merge lossgude and depthwise strategies for CPU hist (#7007)
* fix java/scala test: max depth is also valid parameter for lossguide

Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2021-06-03 01:49:43 +08:00
Jiaming Yuan
ee4f51a631 Support for all primitive types from array. (#7003)
* Change C API name.
* Test for all primitive types from array.
* Add native support for CPU 128 float.
* Convert boolean and float16 in Python.

* Fix dask version for now.
2021-06-01 08:34:48 +08:00
Jiaming Yuan
816b789bf0 Add predictor to skl constructor. (#7000) 2021-05-29 04:52:56 +08:00
ShvetsKS
55b823b27d Reduce 'InitSampling' complexity and set gradients to zero (#6922)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2021-05-29 04:52:23 +08:00
Jiaming Yuan
89a49cf30e Fix dask predict on DaskDMatrix with iteration_range. (#7005) 2021-05-29 04:43:12 +08:00
Jiaming Yuan
4cf95a6041 Support numpy array interface (#6998) 2021-05-27 16:08:22 +08:00
Jiaming Yuan
ab6fd304c4 [Python] Change development release postfix to dev (#6988) 2021-05-27 16:06:51 +08:00
Jiaming Yuan
29d6a5e2b8 [CI] Move appveyor tests to action (#6986)
* Drop support for VS14, use VS15 instead.
* Drop support for mingw.
* Remove debug build.
* Split up jvm tests.
* Split up Python tests.
2021-05-27 04:49:45 +08:00
Jiaming Yuan
86e60e3ba8 Guard against index error in prediction. (#6982)
* Remove `best_ntree_limit` from documents.
2021-05-25 23:24:59 +08:00
Philip Hyunsu Cho
c6d87e5e18 [CI] Remove stray build artifact to avoid error in artifact packaging (#6994) 2021-05-25 19:48:27 +08:00
Jiaming Yuan
a4bc7ecf27 Restore R cache on github action. (#6985) 2021-05-25 18:53:44 +08:00
Jiaming Yuan
6e52aefb37 Revert OMP guard. (#6987)
The guard protects the global variable from being changed by XGBoost. But this leads to a
bug where the `n_threads` parameter is no longer used after the first iteration. This is
due to the fact that `omp_set_num_threads` is only called once in `Learner::Configure` at
the beginning of the training process.

The guard is still useful for `gpu_id`, since that is set every time in our codebase,
regardless of which iteration is currently running.
2021-05-25 08:56:28 +08:00
Jiaming Yuan
cf06a266a8 [dask][doc] Wrap the example in main guard. (#6979) 2021-05-25 08:24:47 +08:00
Mads R. B. Kristensen
81bdfb835d lazy_isinstance(): use .__class__ for type check (#6974) 2021-05-21 11:33:08 +08:00
Emil Sadek
29c942f2a8 [doc] Capitalize section headers (#6976) 2021-05-21 11:31:05 +08:00
Adam Pocock
2320aa0da2 Making the Java library loader emit helpful error messages on missing dependencies. (#6926) 2021-05-19 14:53:56 +08:00
Jiaming Yuan
5cb51a191e [dask][doc] Add small example for sklearn interface. (#6970) 2021-05-19 13:50:45 +08:00
Jiaming Yuan
7e846bb965 Fix prediction on df with latest dask. (#6969) 2021-05-19 12:23:03 +08:00
Jiaming Yuan
6e104f0570 Add news for 1.4.2. [skip ci] (#6963) 2021-05-17 02:50:55 +08:00
ReeceGoding
42fc7ca6a0 Corrected lapply comment in callbacks.R (#6967)
The comment was made false by the removal of the pipes.
2021-05-17 02:31:50 +08:00
Livius
a4886c404a Fix compilation error on x86 (#6964)
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2021-05-14 13:31:49 +08:00
ReeceGoding
f94f479358 Simplify list2mat call from lapply in callbacks.R (#6966) 2021-05-14 03:40:58 +08:00
Jiaming Yuan
d245bc891e Add tolerance to early stopping. (#6942) 2021-05-14 00:19:51 +08:00
James Lamb
894e9bc5d4 [R-package] remove dependency on {magrittr} (#6928)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2021-05-13 04:34:59 +08:00
Jiaming Yuan
44cc9c04ea Fix multiclass auc with empty dataset. (#6947) 2021-05-12 15:01:14 +08:00
Jiaming Yuan
05ac415780 [dask] Set dataframe index in predict. (#6944) 2021-05-12 13:24:21 +08:00
Andrew Ziem
3e7e426b36 Fix spelling in documents (#6948)
* Update roxygen2 doc.

Co-authored-by: fis <jm.yuan@outlook.com>
2021-05-11 20:44:36 +08:00
vslaykovsky
2a9979e256 Fixed incorrect feature mismatch error message (#6949)
data.shape[0] denotes the number of samples, while data.shape[1] is the number of features.
2021-05-11 13:52:11 +08:00
Philip Hyunsu Cho
90cd724be1 [CI] Fix CI/CD pipeline broken by latest auditwheel (4.0.0) (#6951) 2021-05-10 22:43:15 -07:00
Daniel Saxton
e41619b1fc Link to valid tree_method values in docs (#6935) 2021-05-06 17:33:18 +08:00
Philip Hyunsu Cho
ec6ce08cd0 [jvm-packages] Make it easier to release GPU/CPU code artifacts to Maven Central (#6940) 2021-05-04 14:00:03 -07:00
Jose Manuel Llorens
4ddbaeea32 Improve warning when using np.ndarray subsets (#6934) 2021-05-04 13:24:41 +08:00
Ali
b35dd76dca [R] don't remove CMakeLists in cleanup (#6930)
Currently, installing the R package leaves the repo in a dirty state, since
`CMakeLists.txt` is already checked in. This fixes the `cleanup`
script so that it does not delete this file.
2021-05-03 17:46:15 +08:00
Jiaming Yuan
37ad60fe25 Enforce input data is not object. (#6927)
* Check for object data type.
* Allow strided arrays with greater underlying buffer size.
2021-05-02 00:09:01 +08:00
Jiaming Yuan
a1d23f6613 Relax test for decision stump in distributed environment. (#6919) 2021-04-30 09:04:11 +08:00
Jiaming Yuan
45ddc39c1d Relax shotgun test. (#6918) 2021-04-30 09:03:12 +08:00
Jiaming Yuan
34df1f588b Reduce Travis environment setup time. (#6912)
* Remove unused r from travis.
* Don't update homebrew.
* Don't install indirect/unused dependencies like libgit2, wget, openssl.
* Move graphviz installation to conda.
2021-04-30 09:02:40 +08:00
Jiaming Yuan
b31d37eac5 [CI] Fix custom metric test with empty dataset. (#6917) 2021-04-30 09:00:05 +08:00
Jiaming Yuan
db6285fb55 [CI] Skip external memory gtest on osx. (#6901) 2021-04-30 08:59:33 +08:00
david-cortes
4e1a8b1fe5 Update R handles in-place (#6903)
* update R handles in-place #fixes 6896

* update test to expect non-null handle

* remove unused variable

* fix failing tests

* solve linter complains
2021-04-29 12:50:46 -07:00
Philip Hyunsu Cho
5472ef626c [R] Re-generate Roxygen2 doc (#6915) 2021-04-29 11:55:07 -07:00
James Lamb
20f34d9776 [R-package] Update dependencies from CMake-based installation (#6906)
* remove stringi
* add Matrix and jsonlite
2021-04-29 01:32:01 +08:00
Jiaming Yuan
ef473b1f09 Disable pylint error. (#6911) 2021-04-29 01:01:37 +08:00
Jiaming Yuan
8760ec4827 Ensure predict leaf output 1-dim vector where there's only 1 tree. (#6889) 2021-04-23 15:07:48 +08:00
Jiaming Yuan
54afa3ac7a Relax shotgun test. (#6900)
It's a non-deterministic algorithm, so the test is flaky.
2021-04-23 13:01:44 +08:00
Jiaming Yuan
a2ecbdaa31 Add an API guard to prevent global variables being changed. (#6891) 2021-04-23 10:27:57 +08:00
Jiaming Yuan
896aede340 Reorganize the installation documents. (#6877)
* Split up installation and building from source.
* Use consistent section titles.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-04-22 04:48:32 +08:00
Jiaming Yuan
74b41637de Revert "[jvm-packages] Add XGBOOST_RABIT_TRACKER_IP_FOR_TEST to set rabit tracker IP. (#6869)" (#6886)
This reverts commit 2828da3c4c.
2021-04-21 11:20:10 -07:00
Kai Fricke
c8cc3eacc9 [docs] Add tutorial for XGBoost-Ray (#6884)
* Add XGBoost-Ray tutorial

* Add link to modin
2021-04-22 02:07:13 +08:00
Bobby Wang
2828da3c4c [jvm-packages] Add XGBOOST_RABIT_TRACKER_IP_FOR_TEST to set rabit tracker IP. (#6869)
* Add `XGBOOST_RABIT_TRACKER_IP_FOR_TEST` to set rabit tracker IP

* Change Spark and rabit tracker IP to 127.0.0.1 on GitHub Actions.

Co-authored-by: fis <jm.yuan@outlook.com>
2021-04-22 02:00:22 +08:00
Jiaming Yuan
233bdf105f Remove setDaemon in tracker. (#6872) 2021-04-22 01:57:13 +08:00
Jiaming Yuan
71b938f608 1.4.1 release news. (#6876) 2021-04-22 01:55:57 +08:00
Jiaming Yuan
146549260a Bump version to 1.5.0 snapshot in master. (#6875) 2021-04-22 01:53:44 +08:00
Jiaming Yuan
bec2b4f094 Revert "Use CPU input for test_boost_from_prediction. (#6818)" (#6858)
This reverts commit 74f3a2f4b5.
2021-04-20 14:54:02 +08:00
Bobby Wang
2c684ffd32 [jvm-packages] fix "key not found: train" issue (#6842)
* [jvm-packages] fix "key not found: train" issue

* fix bug
2021-04-18 23:28:39 -07:00
Jiaming Yuan
556a83022d Implement unified update prediction cache for (gpu_)hist. (#6860)
* Implement utilities for linalg.
* Unify the update prediction cache functions.
* Implement update prediction cache for multi-class gpu hist.
2021-04-17 00:29:34 +08:00
Jiaming Yuan
1b26a2a561 Copy output data for argsort. (#6866)
Fix GPU AUC.
2021-04-16 21:05:01 +08:00
Jiaming Yuan
a5d7094a45 Update documents. (#6856)
* Add early stopping section to prediction doc.
* Remove best_ntree_limit.
* Better doxygen output.
2021-04-16 12:41:03 +08:00
ReeceGoding
d31a57cf5f Removed typo in callbacks.R (#6863)
Changed "TURE" to "TRUE".
2021-04-16 05:43:22 +08:00
Jiaming Yuan
bccb7e87d1 Update dmlc-core. (#6862)
* Install pandoc, pandoc-citeproc on CI.
2021-04-16 00:14:17 +08:00
ReeceGoding
2e8c101b4a Removed magrittr dependency in callbacks.R (#6855) 2021-04-15 18:45:17 +08:00
Philip Hyunsu Cho
4224c08cac Add demo for using AFT survival with Dask (#6853) 2021-04-13 16:18:33 -07:00
Philip Hyunsu Cho
878b990fcd [CI] Upload Doxygen to correct destination (#6854) 2021-04-13 16:18:13 -07:00
Jiaming Yuan
dee5ef2dfd Typehint for Sklearn. (#6799) 2021-04-14 06:55:21 +08:00
Jiaming Yuan
3d919db0c0 Fix pip release script. [skip ci] (#6845) 2021-04-14 06:46:02 +08:00
Jiaming Yuan
b9a4f3336a 1.4 release notes. (#6843) 2021-04-13 08:38:27 +08:00
Philip Hyunsu Cho
ea7a6a0321 [CI] Pack R package tarball with pre-built xgboost.so (with GPU support) (#6827)
* Add scripts for packaging R package with GPU-enabled libxgboost.so

* [CI] Automatically build R package tarball

* Add comments

* Don't build tarball for pull requests

* Update the installation doc
2021-04-07 21:15:34 -07:00
Jiaming Yuan
f294c4e023 Use constexpr in dh::CopyIf. (#6828) 2021-04-08 07:37:47 +08:00
Viktor Szathmáry
b65e3c4444 [jvm] reduce scala-compiler, scalatest dependency scopes (#6730)
* [jvm] reduce scala-compiler, scalatest dependency scopes

* [jvm] workaround for GpuTestSuite scalatest dependency

* scalatest scope tweak
2021-04-07 15:22:08 -07:00
Jiaming Yuan
7bcc8b3e5c Use batched copy if. (#6826) 2021-04-06 10:34:04 +08:00
giladmaya
aa0d8f20c1 Support configuring constraints by feature names (#6783)
Co-authored-by: fis <jm.yuan@outlook.com>
2021-04-04 06:53:33 +08:00
Jiaming Yuan
7e06c81894 Fix approximated predict contribution. (#6811) 2021-04-03 02:15:03 +08:00
Jiaming Yuan
0cced530ea [doc] Clarify prediction function. (#6813) 2021-04-03 02:12:04 +08:00
Jiaming Yuan
b1fdb220f4 Remove deprecated n_gpus parameter. (#6821) 2021-04-02 03:02:32 +08:00
Jiaming Yuan
74f3a2f4b5 Use CPU input for test_boost_from_prediction. (#6818) 2021-04-02 00:11:35 +08:00
Jiaming Yuan
47b62480af More general predict proba. (#6817)
* Use `output_margin` for `softmax`.
* Add test for dask binary cls.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-04-01 19:52:12 +08:00
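The idea, as a short sketch: class probabilities can be recovered from the raw margins with a softmax (synthetic data; not the exact library code):

```
import numpy as np
import xgboost

X = np.random.rand(90, 4)
y = np.random.randint(0, 3, size=90)
Xy = xgboost.DMatrix(X, label=y)
booster = xgboost.train({"objective": "multi:softmax", "num_class": 3}, Xy, 10)

margin = booster.predict(Xy, output_margin=True)   # raw score per class
margin -= margin.max(axis=1, keepdims=True)        # numerical stability
proba = np.exp(margin) / np.exp(margin).sum(axis=1, keepdims=True)
```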
Jiaming Yuan
a5c852660b Update document for sklearn model IO. (#6809)
* Update the use of JSON.
* Remove unnecessary type cast.
2021-04-01 15:52:36 +08:00
Jiaming Yuan
905fdd3e08 Fix typos in AUC. (#6795) 2021-03-31 16:35:42 +08:00
Jiaming Yuan
ca998df912 Clarify the behavior of use_rmm. (#6808)
* Clarify the `use_rmm` flag in document and demo.
2021-03-31 15:43:11 +08:00
Jiaming Yuan
3039dd194b Don't estimate sketch batch size when rmm is used. (#6807) 2021-03-31 15:29:56 +08:00
Jiaming Yuan
10ae0f9511 Fix doc for apply method. (#6796) 2021-03-31 15:28:31 +08:00
Jiaming Yuan
138fe8516a Remove unnecessary calls to iota. (#6797) 2021-03-31 15:27:23 +08:00
Jiaming Yuan
79b8b560d2 Optimize dart inplace predict perf. (#6804) 2021-03-31 15:20:54 +08:00
JohanWork
4aa12e10c0 Update URL (#6810) 2021-03-30 22:27:30 +08:00
James Lamb
f01af43eb0 [dask] disable work stealing explicitly for training tasks (#6794) 2021-03-29 16:47:56 +08:00
Jiaming Yuan
a59c7323b4 Fix inplace predict missing value. (#6787) 2021-03-27 05:36:10 +08:00
Jiaming Yuan
5c87c2bba8 Update demo for prediction. (#6789)
* Remove use of deprecated ntree_limit.
* Add sklearn demo.
2021-03-27 03:09:25 +08:00
ShvetsKS
8825670c9c Memory consumption fix for row-major adapters (#6779)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
Co-authored-by: fis <jm.yuan@outlook.com>
2021-03-26 08:44:30 +08:00
Philip Hyunsu Cho
744c46995c [CI] Upload xgboost4j.dll to S3 (#6781) 2021-03-25 11:34:34 -07:00
Jiaming Yuan
a7083d3c13 Fix dart inplace prediction with GPU input. (#6777)
* Fix dart inplace predict with data on GPU, which might trigger a fatal check
for device access right.
* Avoid copying data whenever possible.
2021-03-25 12:00:32 +08:00
Jiaming Yuan
1d90577800 Verify strictly positive labels for gamma regression. (#6778)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2021-03-25 11:46:52 +08:00
Jiaming Yuan
794fd6a46b Support v3 cuda array interface. (#6776) 2021-03-25 09:58:09 +08:00
Jiaming Yuan
bcc0277338 Re-implement ROC-AUC. (#6747)
* Re-implement ROC-AUC.

* Binary
* MultiClass
* LTR
* Add documents.

This PR resolves a few issues:
  - Define a value when the dataset is invalid, which can happen if there's an
  empty dataset, or when the dataset contains only positive or negative values.
  - Define ROC-AUC for multi-class classification.
  - Define weighted average value for distributed setting.
  - A correct implementation for the learning-to-rank task.  The previous
  implementation was just binary classification with averaging across groups,
  which doesn't measure ordered learning to rank.
2021-03-20 16:52:40 +08:00
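A minimal NumPy sketch of the weighted binary case, including a defined value for the invalid dataset (ignores tied scores; not the library's exact implementation):

```
import numpy as np

def binary_roc_auc(y_true, y_score, weight):
    order = np.argsort(-y_score)               # sort by score, descending
    y, w = y_true[order], weight[order]
    tp = np.cumsum(w * y)                      # weighted true positives
    fp = np.cumsum(w * (1.0 - y))              # weighted false positives
    if tp[-1] == 0 or fp[-1] == 0:
        return 0.5                             # invalid: only one class present
    tpr = np.concatenate([[0.0], tp / tp[-1]])
    fpr = np.concatenate([[0.0], fp / fp[-1]])
    return np.trapz(tpr, fpr)                  # area under the ROC curve
```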
Jiaming Yuan
4ee8340e79 Support column major array. (#6765) 2021-03-20 05:19:46 +08:00
Jiaming Yuan
f6fe15d11f Improve parameter validation (#6769)
* Add quotes to unused parameters.
* Check for whitespace.
2021-03-20 01:56:55 +08:00
Jiaming Yuan
23b4165a6b Fix gamma deviance (#6761) 2021-03-20 01:56:17 +08:00
ReeceGoding
c2b6b80600 R documentation: Make construction of DMatrix consistent.
* Fix inconsistency of construction of DMatrix.
* Fix missing parameters.
2021-03-20 01:55:13 +08:00
Qingyun Wu
642336add7 [doc] Add FLAML as a fast tuning tool for XGBoost (#6770)
Co-authored-by: Qingyun Wu <qiw@microsoft.com>
2021-03-20 01:47:39 +08:00
Philip Hyunsu Cho
4230dcb614 Re-introduce double buffer in UpdatePosition, to fix perf regression in gpu_hist (#6757)
* Revert "gpu_hist performance tweaks (#5707)"

This reverts commit f779980f7e.

* Address reviewer's comment

* Fix build error
2021-03-18 13:56:10 -07:00
Jiaming Yuan
e2d8a99413 Add document for tests directory. [skip ci] (#6760) 2021-03-18 15:15:50 +08:00
ReeceGoding
4e00737c60 Fix R documentation for xgb.train. (#6764)
The [general documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#parameters-for-tree-booster) clearly has alpha and lambda under its "Parameters for Tree Booster" heading. Furthermore, the R package clearly uses alpha and lambda when told to use the tree booster. This update adds those two parameters to the documentation for the R package.


Closed issue #6763.
2021-03-18 15:04:00 +08:00
Jiaming Yuan
4f75f514ce Fix GPU RF (#6755)
* Fix sampling.
2021-03-17 06:23:35 +08:00
Jiaming Yuan
1a73a28511 Add device argsort. (#6749)
This is part of https://github.com/dmlc/xgboost/pull/6747.
2021-03-16 16:05:22 +08:00
Jiaming Yuan
325bc93e16 [dask] Use distributed.MultiLock (#6743)
* [dask] Use `distributed.MultiLock`

This enables training multiple models in parallel.

* Conditionally import `MultiLock`.
* Use async train directly in scikit learn interface.
* Use `worker_client` when available.
2021-03-16 14:19:41 +08:00
Igor Rukhovich
19a2c54265 Prediction by indices (subsample < 1) (#6683)
* Another implementation of predicting by indices

* Fixed omp parallel_for variable type

* Removed SparsePageView from Updater
2021-03-16 15:08:20 +13:00
Philip Hyunsu Cho
366f3cb9d8 Add use_rmm flag to global configuration (#6656)
* Ensure RMM is 0.18 or later

* Add use_rmm flag to global configuration

* Modify XGBCachingDeviceAllocatorImpl to skip CUB when use_rmm=True

* Update the demo

* [CI] Pin NumPy to 1.19.4, since NumPy 1.19.5 doesn't work with latest Shap
2021-03-09 14:53:05 -08:00
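Usage sketch for the new flag (requires a build with the RMM plugin enabled):

```
import xgboost

# Scoped: allocate GPU memory through RMM only within the block.
with xgboost.config_context(use_rmm=True):
    pass  # construct DMatrix / train with gpu_hist here

# Or set it globally.
xgboost.set_config(use_rmm=True)
```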
Philip Hyunsu Cho
e4894111ba Update dmlc-core submodule (#6745) 2021-03-07 00:30:26 -08:00
Bobby Wang
49c22c23b4 [jvm-packages] fix early stopping doesn't work even without custom_eval setting (#6738)
* [jvm-packages] fix early stopping doesn't work even without custom_eval setting

* remove debug info

* resolve comment
2021-03-06 20:19:40 -08:00
Philip Hyunsu Cho
5ae7f9944b [CI] Clear R package cache (#6746) 2021-03-06 08:37:16 -08:00
Jiaming Yuan
f20074e826 Check for invalid data. (#6742) 2021-03-04 14:37:20 +08:00
Jiaming Yuan
a9b4a95225 Fix learning rate scheduler with cv. (#6720)
* Expose more methods in cvpack and packed booster.
* Fix cv context in deprecated callbacks.
* Fix document.
2021-02-28 13:57:42 +08:00
kangsheng89
9c8523432a fix relocatable include in CMakeList (#6734) (#6737) 2021-02-27 19:17:29 +08:00
Roffild
1fa6793a4e Tests for regression metrics with weights. (#6729) 2021-02-25 22:08:14 +08:00
Jiaming Yuan
9da2287ab8 [breaking] Save booster feature info in JSON, remove feature name generation. (#6605)
* Save feature info in booster in JSON model.
* [breaking] Remove automatic feature name generation in `DMatrix`.

This PR is to enable reliable feature validation in Python package.
2021-02-25 18:54:16 +08:00
capybara
b6167cd2ff [dask] Use client to persist collections (#6722)
Co-authored-by: fis <jm.yuan@outlook.com>
2021-02-25 16:40:38 +08:00
Louis Desreumaux
9b530e5697 Improve OpenMP exception handling (#6680) 2021-02-25 13:56:16 +08:00
Jiaming Yuan
c375173dca Support pylint 2.7.0 (#6726) 2021-02-25 12:49:58 +08:00
Honza Sterba
17913713b5 [jvm] Add ability to load booster direct from byte array (#6655)
* Add ability to load booster direct from byte array

* fix compiler error

* move InputStream to byte-buffer conversion

- move it from Booster to XGBoost facade class
2021-02-23 11:28:27 -08:00
Jiaming Yuan
872e559b91 Use inplace predict for sklearn. (#6718)
* Use inplace predict for sklearn when possible.
2021-02-22 12:27:04 +08:00
Benjamin Lehmann
25077564ab Fixes small typo in sklearn documentation (#6717)
Replaces "dowm" with "down" on parameter n_jobs
2021-02-20 07:36:06 +08:00
Jiaming Yuan
bdedaab8d1 Fix pylint. (#6714) 2021-02-19 11:53:27 +08:00
ShvetsKS
9f15b9e322 Optimize CPU prediction (#6696)
Co-authored-by: Shvets Kirill <kirill.shvets@intel.com>
2021-02-16 14:41:22 +08:00
James Lamb
dc97b5f19f [dask] remove outdated comment (#6699) 2021-02-15 18:49:11 +08:00
Roffild
4c5d2608e0 [python-package] Fix class Booster: feature_types = None (#6705) 2021-02-13 17:50:23 +08:00
ShvetsKS
9a0399e898 Removed unnecessary PredictBatch calls (#6700)
Co-authored-by: Shvets Kirill <kirill.shvets@intel.com>
2021-02-10 20:15:14 +08:00
Ali
9b267a435e Bail out early if libxgboost exists in python setup (#6694)
Skip `copy_tree` when an existing build is found.
2021-02-10 10:50:10 +08:00
Jiaming Yuan
e8c5c53e2f Use Predictor for dart. (#6693)
* Use normal predictor for dart booster.
* Implement `inplace_predict` for dart.
* Enable `dart` for dask interface now that it's thread-safe.
* Categorical data should work out of the box for dart now.

The implementation is not very efficient, as it has to pull back the data and
apply the weight for each tree, but it is still a significant improvement over the
previous implementation, as we no longer do a binary search for each sample.

* Fix output prediction shape on dataframe.
2021-02-09 23:30:19 +08:00
Jiaming Yuan
dbf7e9d3cb Remove R cache in github action. (#6695)
The cache stores outdated packages with wrong linkage.  Right now there's no way to clear the cache.
2021-02-09 18:53:20 +08:00
Jiaming Yuan
1335db6113 [dask] Improve documents. (#6687)
* Add tag for versions.
* use autoclass in sphinx build.
Made some class methods private to avoid exporting them in the documents.
2021-02-09 09:20:58 +08:00
Jiaming Yuan
5d48d40d9a Fix DMatrix slice with feature types. (#6689) 2021-02-09 08:13:51 +08:00
Jiaming Yuan
218a5fb6dd Simplify Span checks. (#6685)
* Stop printing out message.
* Remove R specialization.

The printed message is not really useful anyway; without a reproducible example
there's no way to fix it. But if there's a reproducible example, we can always
obtain this information with a debugger. Removing the `printf` function avoids
creating the context in the kernel.
2021-02-09 08:12:58 +08:00
Jiaming Yuan
4656b09d5d [breaking] Add prediction function for DMatrix and use inplace predict for dask. (#6668)
* Add a new API function for predicting on `DMatrix`.  This function aligns
with rest of the `XGBoosterPredictFrom*` functions on semantic of function
arguments.
* Purge `ntree_limit` from libxgboost, use iteration instead.
* [dask] Use `inplace_predict` by default for dask sklearn models.
* [dask] Run prediction shape inference on worker instead of client.

The breaking change is in the Python sklearn `apply` function; I made it
consistent with other prediction functions, where `best_iteration` is used by
default.
2021-02-08 18:26:32 +08:00
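A short sketch of the in-place prediction path (synthetic data):

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 4), np.random.rand(100)
booster = xgboost.train({}, xgboost.DMatrix(X, label=y), num_boost_round=10)

# Predict straight from the array: no DMatrix construction, thread safe.
preds = booster.inplace_predict(X)

# A subset of trees is selected with iteration_range, replacing ntree_limit.
first_five = booster.inplace_predict(X, iteration_range=(0, 5))
```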
Jiaming Yuan
dbb5208a0a Use __array_interface__ for creating DMatrix from CSR. (#6675)
* Use __array_interface__ for creating DMatrix from CSR.
* Add configuration.
2021-02-05 21:09:47 +08:00
Jiaming Yuan
1e949110da Use generic dispatching routine for array interface. (#6672) 2021-02-05 09:23:38 +08:00
Jiaming Yuan
a4101de678 Fix divide by 0 in feature importance when no split is found. (#6676) 2021-02-05 03:39:30 +08:00
Jiaming Yuan
72892cc80d [dask] Disable gblinear and dart. (#6665) 2021-02-04 09:13:09 +08:00
Jiaming Yuan
9d62b14591 Fix document. [skip ci] (#6669) 2021-02-02 20:43:31 +08:00
Jiaming Yuan
411592a347 Enhance inplace prediction. (#6653)
* Accept array interface for csr and array.
* Accept an optional proxy dmatrix for metainfo.

This constructs an explicit `_ProxyDMatrix` type in Python.

* Remove unused doc.
* Add strict output.
2021-02-02 11:41:46 +08:00
Jiaming Yuan
87ab1ad607 [dask] Accept Future of model for prediction. (#6650)
This PR changes predict and inplace_predict to accept a Future of the model, to avoid sending models to workers repeatedly.

* The document is updated to reflect the functionality added in recent changes.
2021-02-02 08:45:52 +08:00
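A hedged sketch of passing a Future of the model (the local cluster and data are illustrative):

```
import xgboost
from dask import array as da
from distributed import Client, LocalCluster

if __name__ == "__main__":
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        X = da.random.random((1000, 4), chunks=(250, 4))
        y = da.random.random(1000, chunks=250)
        Xy = xgboost.dask.DaskDMatrix(client, X, y)
        output = xgboost.dask.train(client, {}, Xy, num_boost_round=10)

        # Scatter the booster once; the Future can be reused across predict
        # calls without re-sending the model to the workers.
        model = client.scatter(output["booster"], broadcast=True)
        predictions = xgboost.dask.predict(client, model, X)
```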
Jiaming Yuan
a9ec0ea6da Align device id in predict transform with predictor. (#6662) 2021-02-02 08:33:29 +08:00
Jiaming Yuan
d8ec7aad5a [dask] Add a 1 line sample to infer output shape. (#6645)
* [dask] Use a 1 line sample to infer output shape.

This is for inferring the shape with direct prediction (without DaskDMatrix).
There are a few things that require a known output shape before carrying out the
actual prediction, including dask metadata and output dataframe columns.

* Infer output shape based on local prediction.
* Remove set param in the predict function, as it's not thread safe nor necessary
now that we let dask decide the parallelism.
* Simplify prediction on `DaskDMatrix`.
2021-01-30 18:55:50 +08:00
Jiaming Yuan
c3c8e66fc9 Make prediction functions thread safe. (#6648) 2021-01-28 23:29:43 +08:00
Philip Hyunsu Cho
0f2ed21a9d [Breaking] Change default evaluation metric for binary:logitraw objective to logloss (#6647) 2021-01-29 00:12:12 +09:00
Jiaming Yuan
d167892c7e [dask] Ensure model can be pickled. (#6651) 2021-01-28 21:47:57 +08:00
Philip Hyunsu Cho
0ad6e18a2a [CI] Do not mix up stashed executable built for ARM and x86_64 platforms (#6646) 2021-01-27 23:57:26 +09:00
Philip Hyunsu Cho
55ee2bd77f [CI] Add ARM64 test to Jenkins pipeline (#6643)
* Add ARM64 test to Jenkins pipeline

* Check for bundled libgomp

* Use a separate test suite for ARM64

* Ensure that x86 jobs don't run on ARM workers
2021-01-27 21:51:17 +09:00
Jiaming Yuan
1b70a323a7 Improve string view to reduce string allocation. (#6644) 2021-01-27 19:08:52 +08:00
Jiaming Yuan
bc08e0c9d1 Remove experimental_json_serialization from tests. (#6640) 2021-01-27 17:44:49 +08:00
Jiaming Yuan
8968ca7c0a Disable s390x and arm64 tests on travis for now. (#6641) 2021-01-27 16:21:40 +08:00
Jiaming Yuan
d19a0ddacf Move sdist test to action. (#6635)
* Move x86 linux and osx sdist test to action.

* Add Windows.
2021-01-26 08:25:59 +08:00
Jiaming Yuan
740d042255 Add base_margin for evaluation dataset. (#6591)
* Add base margin to evaluation datasets.
* Unify the code base for evaluation matrices.
2021-01-26 02:11:02 +08:00
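A small sketch of the sklearn-interface parameter (zero margins used purely for illustration):

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 4), np.random.rand(100)
Xv, yv = np.random.rand(20, 4), np.random.rand(20)
margin, margin_v = np.zeros(100), np.zeros(20)

reg = xgboost.XGBRegressor(n_estimators=10)
reg.fit(X, y, base_margin=margin,
        eval_set=[(Xv, yv)], base_margin_eval_set=[margin_v])
```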
Jiaming Yuan
4bf23c2391 Specify shape in prediction contrib and interaction. (#6614) 2021-01-26 02:08:22 +08:00
Jiaming Yuan
8942c98054 Define metainfo and other parameters for all DMatrix interfaces. (#6601)
This PR ensures all DMatrix types have a common interface.

* Fix logic in avoiding duplicated DMatrix in sklearn.
* Check for consistency between DMatrix types.
* Add doc for bounds.
2021-01-25 16:06:06 +08:00
Jiaming Yuan
561809200a Fix document for tree methods. (#6633) 2021-01-25 15:52:08 +08:00
Adam Pocock
fec66d033a [jvm-packages] JVM library loader extensions (#6630)
* [java] extending the library loader to use both OS and CPU architecture.

* Simplifying create_jni.py's architecture detection.

* Tidying up the architecture detection in create_jni.py
2021-01-25 15:51:39 +08:00
Jiaming Yuan
a275f40267 [dask] Rework base margin test. (#6627) 2021-01-22 17:49:13 +08:00
Jiaming Yuan
7bc56fa0ed Use simple print in tracker print function. (#6609) 2021-01-21 21:15:43 +08:00
Jiaming Yuan
26982f9fce Skip unused CMake argument in setup.py (#6611) 2021-01-21 17:25:33 +08:00
Jiaming Yuan
f0fd7629ae Add helper script and doc for releasing pip package. (#6613)
* Fix `long_description_content_type`.
2021-01-21 14:46:52 +08:00
Bobby Wang
9d2832a3a3 fix potential TaskFailedListener's callback won't be called (#6612)
There is a possibility that onJobStart of TaskFailedListener won't be called if
the job is submitted before the other thread calls addSparkListener.

Details can be found at https://github.com/dmlc/xgboost/pull/6019#issuecomment-760937628
2021-01-21 14:20:32 +08:00
Jiaming Yuan
f8bb678c67 Exclude dmlc test on github action. (#6625) 2021-01-20 18:50:20 +08:00
Jiaming Yuan
d6d72de339 Revert ntree limit fix (#6616)
The old (pre-fix) best_ntree_limit ignored the num_class parameter, which is incorrect. We previously worked around it in the C++ layer to avoid possible breaking changes in other language bindings, but the Python interpretation stayed incorrect. The PR fixed that in Python to consider num_class, but didn't remove the old workaround, so the tree calculation in the predictor is incorrect; see PredictBatch in CPUPredictor.
2021-01-19 23:51:16 +08:00
Jiaming Yuan
d132933550 Remove type check for solaris. (#6610) 2021-01-16 02:58:19 +08:00
Jiaming Yuan
d356b7a071 Restore unknown data support. (#6595) 2021-01-14 04:51:16 +08:00
Jiaming Yuan
89a00a5866 [dask] Random forest estimators (#6602) 2021-01-13 20:59:20 +08:00
Jiaming Yuan
0027220aa0 [breaking] Remove duplicated predict functions, Fix attributes IO. (#6593)
* Fix attributes not being restored.
* Rename all `data` to `X`. [breaking]
2021-01-13 16:56:49 +08:00
ShvetsKS
7f4d3a91b9 Multiclass prediction caching for CPU Hist (#6550)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2021-01-13 04:42:07 +08:00
Jiaming Yuan
03cd087da1 Remove duplicated DMatrix. (#6592) 2021-01-12 09:36:56 +08:00
Jiaming Yuan
c709f2aaaf Fix evaluation result for XGBRanker. (#6594)
* Remove duplicated code, which fixes typo `evals_result` -> `evals_result_`.
2021-01-12 09:36:41 +08:00
Jiaming Yuan
f2f7dd87b8 Use view for SparsePage exclusively. (#6590) 2021-01-11 18:04:55 +08:00
Jiaming Yuan
78f2cd83d7 Suppress hypothesis health check for dask client. (#6589) 2021-01-11 14:11:57 +08:00
Jiaming Yuan
80065d571e [dask] Add DaskXGBRanker (#6576)
* Initial support for distributed LTR using dask.

* Support `qid` in libxgboost.
* Refactor `predict` and `n_features_in_`, `best_[score/iteration/ntree_limit]`
  to avoid duplicated code.
* Define `DaskXGBRanker`.

The dask ranker doesn't support the group structure; instead it uses query IDs and
converts them to group pointers internally.
2021-01-08 18:35:09 +08:00
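A hedged sketch of the query-id based ranker (synthetic relevance labels; qid must be sorted within partitions):

```
import numpy as np
import xgboost
from dask import array as da
from distributed import Client, LocalCluster

if __name__ == "__main__":
    with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
        X = da.random.random((100, 4), chunks=(50, 4))
        y = da.random.randint(0, 5, size=100, chunks=50)      # relevance labels
        qid = da.from_array(np.repeat(np.arange(10), 10), chunks=50)

        ranker = xgboost.dask.DaskXGBRanker(n_estimators=10)
        ranker.fit(X, y, qid=qid)                             # qid instead of group
```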
Jiaming Yuan
96d3d32265 [dask] Add shap tests. (#6575) 2021-01-08 14:59:27 +08:00
Jiaming Yuan
7c9dcbedbc Fix best_ntree_limit for dart and gblinear. (#6579) 2021-01-08 10:05:39 +08:00
Jiaming Yuan
f5ff90cd87 Support _estimator_type. (#6582)
* Use `_estimator_type`.

For more info, see: https://scikit-learn.org/stable/developers/develop.html#estimator-types

* Model trained from dask can be loaded by single node skl interface.
2021-01-08 10:01:16 +08:00
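How scikit-learn picks this up, in brief:

```
from sklearn.base import is_classifier, is_regressor
import xgboost

# scikit-learn dispatches on the _estimator_type attribute.
assert is_classifier(xgboost.XGBClassifier())
assert is_regressor(xgboost.XGBRegressor())
```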
Jiaming Yuan
8747885a8b Support Solaris. (#6578)
* Add system header.

* Remove use of TR1 on Solaris

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2021-01-07 09:05:05 +08:00
TP Boudreau
b2246ae7ef Update dmlc-core submodule and conform to new API (#6431)
* Update dmlc-core submodule and conform to new API

* Remove unsupported parameter from method signature

* Update dmlc-core submodule and conform to new API

* Update dmlc-core

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2021-01-05 16:12:22 -08:00
Jiaming Yuan
60cfd14349 [dask, sklearn] Fix predict proba. (#6566)
* For sklearn:
  - Handles user defined objective function.
  - Handles `softmax`.

* For dask:
  - Use the implementation from sklearn, the previous implementation doesn't perform any extra handling.
2021-01-05 08:29:06 +08:00
Jiaming Yuan
516a93d25c Fix best_ntree_limit. (#6569) 2021-01-03 05:58:54 +08:00
James Lamb
195a41cef1 [python-package] remove unnecessary files to reduce sdist size (fixes #6560) (#6565) 2021-01-02 15:56:39 +08:00
Jiaming Yuan
2b049b32e9 Document various tree methods. (#6564) 2021-01-02 15:40:46 +08:00
Philip Hyunsu Cho
fa13992264 Calling XGBModel.fit() should clear the Booster by default (#6562)
* Calling XGBModel.fit() should clear the Booster by default

* Document the behavior of fit()

* Allow sklearn object to be passed in directly via xgb_model argument

* Fix lint
2020-12-31 11:02:08 -08:00
Jiaming Yuan
5e9e525223 Remove warnings in tests. (#6554) 2020-12-31 13:41:18 +08:00
James Lamb
8ad22bf4e7 Add credentials to .gitignore (#6559) 2020-12-30 15:58:14 -08:00
Jiaming Yuan
de8fd852a5 [dask] Add type hints. (#6519)
* Add validate_features.
* Show type hints in doc.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-12-29 19:41:02 +08:00
Jiaming Yuan
610ee632cc [Breaking] Rename data to X in predict_proba. (#6555)
The new scikit-learn version uses keyword arguments, and `X` is the predefined
keyword.

* Use pip to install latest Python graphviz on Windows CI.
2020-12-28 21:36:03 +08:00
Jiaming Yuan
cb207a355d Add script for generating release tarball. (#6544) 2020-12-23 16:08:10 +08:00
Gorkem Ozkaya
2231940d1d Clip small positive values in gamma-nloglik (#6537)
For the `gamma-nloglik` eval metric, small positive values in the labels cause `NaN`s in the outputs, as reported here: https://github.com/dmlc/xgboost/issues/5349. This adds clipping to them, similar to what is done in other metrics like `poisson-nloglik` and `logloss`.
2020-12-22 03:11:40 +08:00
MBSMachineLearning
95cbfad990 "featue_map" typo changed to "feature_map" (#6540) 2020-12-21 22:11:11 +08:00
Philip Hyunsu Cho
fbb980d9d3 Expand ~ into the home directory on Linux and MacOS (#6531) 2020-12-19 23:35:13 -08:00
Philip Hyunsu Cho
cd0821500c Add Saturn Cloud Dask XGBoost tutorial to Awesome XGBoost [skip ci] (#6532) 2020-12-19 15:57:05 -08:00
Philip Hyunsu Cho
380f6f4ab8 Remove cupy.array_equal, since it's not compatible with cuPy 7.8 (#6528) 2020-12-18 09:16:52 -08:00
Jiaming Yuan
ca3da55de4 Support early stopping with training continuation, correct num boosted rounds. (#6506)
* Implement early stopping with training continuation.

* Add new C API for obtaining boosted rounds.

* Fix off by 1 in `save_best`.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-12-17 19:59:19 +08:00
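A short sketch combining training continuation with the new boosted-rounds query (synthetic data):

```
import numpy as np
import xgboost

X, y = np.random.rand(100, 4), np.random.rand(100)
dtrain = xgboost.DMatrix(X[:80], label=y[:80])
dvalid = xgboost.DMatrix(X[80:], label=y[80:])

booster = xgboost.train({}, dtrain, num_boost_round=4)
print(booster.num_boosted_rounds())   # 4, via the new C API

# Continue training; rounds from the first session are counted correctly.
early_stop = xgboost.callback.EarlyStopping(rounds=2, save_best=True)
booster = xgboost.train({}, dtrain, num_boost_round=16, xgb_model=booster,
                        evals=[(dvalid, "validation")], callbacks=[early_stop])
```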
Philip Hyunsu Cho
125b3c0f2d Lazy import cuDF and Dask (#6522)
* Lazy import cuDF

* Lazy import Dask

Co-authored-by: PSEUDOTENSOR / Jonathan McKinney <pseudotensor@gmail.com>

* Fix lint

Co-authored-by: PSEUDOTENSOR / Jonathan McKinney <pseudotensor@gmail.com>
2020-12-17 01:51:35 -08:00
Philip Hyunsu Cho
ad1a527709 Enable loading model from <1.0.0 trained with objective='binary:logitraw' (#6517)
* Enable loading model from <1.0.0 trained with objective='binary:logitraw'

* Add binary:logitraw in model compatibility testing suite

* Feedback from @trivialfis: Override ProbToMargin() for LogisticRaw

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2020-12-16 16:53:46 -08:00
Philip Hyunsu Cho
bf6cfe3b99 [Breaking] Upgrade cuDF and RMM to 0.18 nightlies; require RMM 0.18+ for RMM plugin (#6510)
* [CI] Upgrade cuDF and RMM to 0.18 nightlies

* Modify RMM plugin to be compatible with RMM 0.18

* Update src/common/device_helpers.cuh

Co-authored-by: Mark Harris <mharris@nvidia.com>

Co-authored-by: Mark Harris <mharris@nvidia.com>
2020-12-16 10:07:52 -08:00
Jiaming Yuan
d8d684538c [CI] Split up main.yml, add mypy. (#6515) 2020-12-17 00:15:44 +08:00
Jiaming Yuan
c5876277a8 Drop saving binary format for memory snapshot. (#6513) 2020-12-17 00:14:57 +08:00
Jiaming Yuan
0e97d97d50 Fix merge conflict. (#6512) 2020-12-16 18:02:25 +08:00
hzy001
749364f25d Update the C API comments (#6457)
Signed-off-by: Hao Ziyu <haoziyu@qiyi.com>

Co-authored-by: Hao Ziyu <haoziyu@qiyi.com>
2020-12-16 14:56:13 +08:00
Jiaming Yuan
347f593169 Accept numpy array for DMatrix slice index. (#6368) 2020-12-16 14:42:52 +08:00
Jiaming Yuan
ef4a0e0aac Fix DMatrix feature names/types IO. (#6507)
* Fix feature names/types IO

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-12-16 14:24:27 +08:00
Jiaming Yuan
886486a519 Support categorical data in GPU weighted sketching. (#6508) 2020-12-16 14:23:28 +08:00
Igor Rukhovich
5c8ccf4455 Improved InitSampling function speed by 2.12 times (#6410)
* Improved InitSampling function speed by 2.12 times

* Added explicit conversion
2020-12-15 20:59:24 -08:00
Jiaming Yuan
3c3f026ec1 Move metric configuration into booster. (#6504) 2020-12-16 05:35:04 +08:00
Jiaming Yuan
d45c0d843b Show partition status in dask error. (#6366) 2020-12-16 02:58:21 +08:00
James Lamb
1e2c3ade9e [doc] [dask] Add example on early stopping with Dask (#6501)
Co-authored-by: fis <jm.yuan@outlook.com>
2020-12-15 22:23:23 +08:00
ShvetsKS
8139849ab6 Fix handling of print period in EvaluationMonitor (#6499)
Co-authored-by: Kirill Shvets <kirill.shvets@intel.com>
2020-12-15 19:20:19 +08:00
Philip Hyunsu Cho
9a194273cd Add conda-forge badge (#6502) 2020-12-14 18:58:03 -08:00
Philip Hyunsu Cho
aac4eba2ef Add release note for 1.3.0 in NEWS.md (#6495)
* Add release note for 1.3.0

* Address reviewer's comment

* Fix silly mistake

* Apply suggestions from code review

Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>

Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
2020-12-14 14:42:30 -08:00
James Lamb
afc4567268 [doc] [dask] fix partitioning in Dask example (#6389) 2020-12-14 18:37:49 +08:00
Jiaming Yuan
a30461cf87 [dask] Support all parameters in regressor and classifier. (#6471)
* Add eval_metric.
* Add callback.
* Add feature weights.
* Add custom objective.
2020-12-14 07:35:56 +08:00
Philip Hyunsu Cho
c31e3efa7c Pass correct split_type to GPU predictor (#6491)
* Pass correct split_type to GPU predictor

* Add a test
2020-12-11 19:30:00 -08:00
Philip Hyunsu Cho
0d483cb7c1 Bump version to 1.4.0 snapshot in master (#6486) 2020-12-10 07:38:08 -08:00
Philip Hyunsu Cho
b8044e6136 [CI] Use manylinux2010_x86_64 container to vendor libgomp (#6485) 2020-12-10 07:37:15 -08:00
Jiaming Yuan
0ffaf0f5be Fix dask ip resolution. (#6475)
This adopts the solution used in dask/dask-xgboost#40, which employs get_host_ip from the dmlc-core tracker.
2020-12-07 16:36:23 -08:00
Jiaming Yuan
47b86180f6 Don't validate feature when number of rows is 0. (#6472) 2020-12-07 18:08:51 +08:00
Philip Hyunsu Cho
55bdf084cb [Doc] Document that AUC and AUCPR are for binary classification/ranking [skip ci] (#5899) 2020-12-06 22:17:20 -08:00
Jiaming Yuan
703c2d06aa Fix global config default value. (#6470) 2020-12-06 06:15:33 +08:00
Jiaming Yuan
d6386e45e8 Fix filtering callable objects in skl xgb param. (#6466)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-12-05 17:20:36 +08:00
Philip Hyunsu Cho
05e5563c2c [CI] Fix CentOS 6 Docker images (#6467) 2020-12-04 21:33:11 -08:00
Philip Hyunsu Cho
84b726ef53 Vendor libgomp in the manylinux Python wheel (#6461)
* Vendor libgomp in the manylinux2014_aarch64 wheel

* Use vault repo, since CentOS 6 has reached End-of-Life on Nov 30

* Vendor libgomp in the manylinux2010_x86_64 wheel

* Run verification step inside the container
2020-12-03 19:55:32 -08:00
Philip Hyunsu Cho
c103ec51d8 Enforce row-major order in cuPy array (#6459) 2020-12-03 18:29:10 -08:00
Philip Hyunsu Cho
4f70e14031 Fix docstring of config.py to use correct versionadded (#6458) 2020-12-03 10:41:53 -08:00
Philip Hyunsu Cho
fb56da5e8b Add global configuration (#6414)
* Add management functions for global configuration: XGBSetGlobalConfig(), XGBGetGlobalConfig().
* Add Python interface: set_config(), get_config(), and config_context().
* Add unit tests for Python
* Add R interface: xgb.set.config(), xgb.get.config()
* Add unit tests for R

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2020-12-03 00:05:18 -08:00
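A minimal sketch of the Python side of this API; `verbosity` is one of the standard global options, though the exact set of supported options depends on the release:

```python
import xgboost as xgb

# Set a process-wide option.
xgb.set_config(verbosity=2)

# Read the current global configuration back as a dict.
assert xgb.get_config()["verbosity"] == 2

# Temporarily override options within a scope; they are restored on exit.
with xgb.config_context(verbosity=0):
    pass  # training here runs silently
```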
hzy001
c2ba4fb957 Fix broken links. (#6455)
Co-authored-by: Hao Ziyu <haoziyu@qiyi.com>
Co-authored-by: fis <jm.yuan@outlook.com>
2020-12-02 17:39:12 +08:00
Jiaming Yuan
927c316aeb Fix period in evaluation monitor. (#6441) 2020-11-29 03:18:33 +08:00
Jiaming Yuan
f4ff1c53fd Fix CLI ranking demo. (#6439)
Save model at final round.
2020-11-29 03:12:06 +08:00
Honza Sterba
b0036b339b Optionally fail when gpu_id is set to an invalid value (#6342) 2020-11-28 15:14:12 +08:00
ShvetsKS
956beead70 Thread local memory allocation for BuildHist (#6358)
* thread mem locality

* fix apply

* cleanup

* fix lint

* fix tests

* simple try

* fix

* fix

* apply comments

* fix comments

* fix

* apply simple comment

Co-authored-by: ShvetsKS <kirill.shvets@intel.com>
2020-11-25 17:50:12 +03:00
Philip Hyunsu Cho
4dbbeb635d [CI] Upgrade cuDF and RMM to 0.17 nightlies (#6434) 2020-11-24 13:21:41 -08:00
Philip Hyunsu Cho
0c85b90671 [R] Fix R package installation via CMake (#6423) 2020-11-22 05:49:09 -08:00
Jiaming Yuan
42d31d9dcb Fix MPI build. (#6403) 2020-11-21 13:38:21 +08:00
Jiaming Yuan
2ce2a1a4d8 [SKL] Propagate parameters to booster during set_param. (#6416) 2020-11-20 20:37:35 +08:00
zhang_jf
cc581b3b6b Misleading exception information: no such param of "allow_non_zero_missing" (#6418) 2020-11-20 19:33:34 +08:00
Jiaming Yuan
00218d065a [dask] Update document. [skip ci] (#6413) 2020-11-20 19:16:19 +08:00
Jiaming Yuan
c120822a24 Fix flaky sparse page dmatrix test. (#6417) 2020-11-20 19:15:45 +08:00
Jiaming Yuan
a7b42adb74 Fix dask predict (#6412) 2020-11-20 10:10:52 +08:00
Jiaming Yuan
44a9d69efb Small cleanup to evaluator. (#6400) 2020-11-20 09:33:51 +08:00
Philip Hyunsu Cho
9c9070aea2 Use pytest conventions consistently (#6337)
* Do not derive from unittest.TestCase (not needed for pytest)

* assertRaises -> pytest.raises

* Simplify test_empty_dmatrix with test parametrization

* setUpClass -> setup_class, tearDownClass -> teardown_class

* Don't import unittest; import pytest

* Use plain assert

* Use parametrized tests in more places

* Fix test_gpu_with_sklearn.py

* Put back run_empty_dmatrix_reg / run_empty_dmatrix_cls

* Fix test_eta_decay_gpu_hist

* Add parametrized tests for monotone constraints

* Fix test names

* Remove test parametrization

* Revise test_slice to be not flaky
2020-11-19 17:00:15 -08:00
Philip Hyunsu Cho
c763b50dd0 [CI] Upgrade to MacOS Mojave image (#6406) 2020-11-18 20:29:10 -08:00
Nan Zhu
4d1d5d4010 [jvm-packages] fix potential unit test suites aborted issue (#6373)
* fix race condition

* code cleaning

rm pom.xml-e

* clean again

* fix compilation issue

* recover

* avoid using getOrCreate

* interrupt zombie threads

* safe guard

* fix deadlock

* Update SparkParallelismTracker.scala
2020-11-17 10:59:26 -08:00
Philip Hyunsu Cho
e426b6e040 [R] Do not convert continuous labels to factors (#6380)
* [R] Do not convert continuous labels to factors

* Address reviewer's comment
2020-11-17 09:19:16 -08:00
James Lamb
3cca1c5fa1 [R] remove uses of exists() (#6387) 2020-11-17 15:06:23 +08:00
Jiaming Yuan
3ac173fc8b Fix typo. (#6399) 2020-11-16 16:59:12 -08:00
Nikhil Choudhary
ae1662028a Fixed few grammatical mistakes in doc (#6393) 2020-11-15 13:48:08 +08:00
Philip Hyunsu Cho
5cb24d0d39 Fix broken link in CLI doc (#6396) 2020-11-14 17:58:07 -08:00
ShvetsKS
512b464cfa Disable HT for DMatrix creation (#6386)
Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-11-14 22:18:33 +08:00
Jiaming Yuan
fcd6fad822 [dask] Small cleanup. (#6391) 2020-11-14 22:15:05 +08:00
Jiaming Yuan
4ccf92ea34 [dask] Fix union of workers. (#6375) 2020-11-13 16:55:05 +08:00
Jiaming Yuan
fcfeb4959c Deprecate positional arguments. (#6365)
Deprecate positional arguments in following functions:

- `__init__` for all classes in sklearn module.
- `fit` method for all classes in sklearn module.
- dask interface.
- `set_info` for `DMatrix` class.

Refactor the evaluation matrices handling.
2020-11-13 11:10:30 +08:00
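In practice this means arguments other than the data itself should be passed by keyword; a small sketch with random data:

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 10), np.random.rand(100)

# Keyword arguments instead of positional ones in the sklearn interface.
model = xgb.XGBRegressor(n_estimators=10, max_depth=3)
model.fit(X, y, eval_set=[(X, y)], verbose=False)

# Likewise for DMatrix.set_info.
dtrain = xgb.DMatrix(X, label=y)
dtrain.set_info(weight=np.ones(100))
```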
Philip Hyunsu Cho
e5193c21a1 [dask] Allow empty data matrix in AFT survival (#6379)
* [dask] Allow empty data matrix in AFT survival

* Add unit test
2020-11-12 17:49:58 -08:00
Philip Hyunsu Cho
5a33c2f3a0 [CI] Add noLD R test (#6382)
* [CI] Add noLD test

* Make noLD test only trigger with a PR comment

* [CI] Don't install stringi

* Add the Titanic example as a unit test

* Document trigger

* add to index

* Clarify that it needs to be a review comment
2020-11-12 12:41:25 -08:00
Jiaming Yuan
c1a62b5fa2 Expect gpu external memory to fail. (#6381) 2020-11-12 19:24:48 +08:00
Jiaming Yuan
c90f968d92 Update Python documents. (#6376) 2020-11-12 17:51:32 +08:00
Philip Hyunsu Cho
c5645180a6 [R] Fix a crash that occurs with noLD R (#6378) 2020-11-11 21:09:08 -08:00
James Lamb
12d27f43ff [doc] make Dask distributed example copy-pastable (#6345) 2020-11-11 20:22:17 -08:00
Jiaming Yuan
d711d648cb Fix label errors in graph visualization (#6369) 2020-11-11 17:44:59 -08:00
Jiaming Yuan
debeae2509 [R] Fix warnings from R check --as-cran (#6374)
* Remove exit and printf.

* Fix warnings.
2020-11-11 18:39:37 +08:00
Jiaming Yuan
6e12c2a6f8 [dask] Support running on GKE. (#6343)
* Avoid accessing `scheduler_info()['workers']`.
* Avoid calling `client.gather` inside task.
* Avoid using `client.scheduler_address`.
2020-11-11 18:04:34 +08:00
Jiaming Yuan
8a17610666 Implement GPU predict leaf. (#6187) 2020-11-11 17:33:47 +08:00
Philip Hyunsu Cho
7f101d1b33 [CI] Remove R check from Jenkins (#6372)
* Remove R check from Jenkins

* Print stacktrace when CRAN test fail in GitHub Actions

* Add verbose flag in tests/ci_build/print_r_stacktrace.sh

* Fix path in tests/ci_build/print_r_stacktrace.sh
2020-11-10 22:46:54 -08:00
Jiaming Yuan
a5cfa7841e Run R check as cran on action. [skip ci] (#6371)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-11-11 12:02:53 +08:00
Jiaming Yuan
43efadea2e Deterministic data partitioning for external memory (#6317)
* Make external memory data partitioning deterministic.

* Change the meaning of `page_size` from bytes to number of rows.

* Design a data pool.

* Note for external memory.

* Enable unity build on Windows CI.

* Force garbage collect on test.
2020-11-11 06:11:06 +08:00
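For context, external-memory mode is selected by appending a cache-file prefix to the data path, as in the demos; the file name below is a placeholder:

```python
import xgboost as xgb

# "#dtrain.cache" asks XGBoost to page the data to disk under the given
# cache prefix instead of holding everything in memory.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
booster = xgb.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
```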
Jean Lescut-Muller
9564886d9f Update custom_metric_obj.rst (#6367) 2020-11-10 22:29:22 +08:00
Jiaming Yuan
e65e3cf36e Support shared library in system path. (#6362) 2020-11-10 16:04:25 +08:00
Jiaming Yuan
184e2eac7d Add period to evaluation monitor. (#6348) 2020-11-10 07:47:48 +08:00
ShvetsKS
d411f98d26 simple fix for static schedule in predict (#6357)
Co-authored-by: ShvetsKS <kirill.shvets@intel.com>
2020-11-09 17:01:30 +08:00
Jiaming Yuan
519cee115a Avoid resetting seed for every configuration. (#6349) 2020-11-06 10:28:35 +08:00
James Lamb
f3a4253984 Ignore files from local Dask development (#6346) 2020-11-05 13:54:46 +08:00
Jack Dunn
51e6531315 Fix missing space in warning message (#6340) 2020-11-04 06:03:16 -05:00
Jiaming Yuan
2cc9662005 Support slicing tree model (#6302)
This PR is meant to end the confusion around best_ntree_limit and unify model slicing. With multi-class models and random forests, asking users to understand how to set ntree_limit is difficult and error-prone.

* Implement the save_best option in early stopping.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-11-02 23:27:39 -08:00
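A brief sketch of the resulting slicing syntax, which replaces reasoning about best_ntree_limit (random data for illustration):

```python
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)
booster = xgb.train({"objective": "binary:logistic"},
                    xgb.DMatrix(X, label=y), num_boost_round=10)

# Take the first five boosting rounds as a new Booster; for multi-class
# and forest models the slice is still by round, not by individual tree.
first_half = booster[0:5]
```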
Rory Mitchell
29745c6df2 Fix inclusive scan for large sizes (#6234) 2020-11-03 17:01:43 +13:00
Jiaming Yuan
7756192906 [dask] Fix prediction on DaskDMatrix with multiple meta data. (#6333)
* Unify the meta handling methods.
2020-11-02 19:18:44 -05:00
Jiaming Yuan
5a7b3592ed Optional find_package for sanitizers. (#6329) 2020-11-02 19:17:17 -05:00
Jiaming Yuan
048acf81cd Enable shap sparse test. (#6332) 2020-11-01 20:59:27 +08:00
Igor Moura
5e1e972aea Clean up warnings (#6325) 2020-10-30 23:50:29 +08:00
nabokovas
f0fe18fc28 Add a new github actions badge (#6321) 2020-10-30 17:57:21 +08:00
Jiaming Yuan
6ff331b705 Fix Python callback. (#6320) 2020-10-30 05:03:44 +08:00
Sergio Gavilán
b181a88f9f Reduced some C++ compiler warnings (#6197)
* Removed some warnings

* Rebase with master

* Solved C++ Google Test errors caused by the refactoring done to remove warnings

* Undo renaming path -> path_

* Fix style check

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-29 12:36:00 -07:00
Jiaming Yuan
c80657b542 Fix flaky data initialization test. (#6318) 2020-10-30 03:11:22 +08:00
Naveed Ahmed Saleem Janvekar
608bda7052 [jvm-packages] add example to handle missing value other than 0 (#5677)
add an example of handling missing values other than 0 under the "Dealing with missing values" section
2020-10-28 17:24:35 -07:00
Jiaming Yuan
74ea82209b Lazy import dask libraries. (#6309)
* Lazy import dask libraries.

* Lint && fix.

* Use short name.
2020-10-28 15:50:11 -07:00
Jiaming Yuan
dfac5f89e9 Group CLI demo into subdirectory. (#6258)
The CLI is not the most developed interface. Putting these demos into their own directory helps new users avoid it, as most use cases go through a language binding.
2020-10-28 14:40:44 -07:00
James Lamb
6383757dca [R] allow xgb.plot.importance() calls to fill a grid (#6294) 2020-10-28 14:37:28 -07:00
Tanuja Kirthi Doddapaneni
d261ba029a Added USE_NCCL_LIB_PATH option to enable user to set NCCL_LIBRARY during build (#6310)
Description: Enable the user to set NCCL_LIBRARY during the build.
2020-10-28 14:36:31 -07:00
vcarpani
671971e12e Compiler warnings (#6286)
* Fix warnings for json.h

* Fix warnings for metric.h

* Fix warnings for updater_quantile_hist.cc.

* Fix warnings for updater_histmaker.cc.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-28 13:46:15 -07:00
Jiaming Yuan
e8884c4637 Document tree method for feature weights. (#6312) 2020-10-28 13:42:13 -07:00
Philip Hyunsu Cho
143b278267 Mark flaky tests as XFAIL (#6299)
* Temporarily skip TestGPUUpdaters::test_categorical

* Temporarily skip test_boost_from_prediction[approx]
2020-10-28 11:50:57 -07:00
Jiaming Yuan
c4da967b5c Support unity build. (#6295)
* Support unity build.

* Setup on Windows Jenkins.

* Revert "Setup on Windows Jenkins."

This reverts commit 8345cb8d2b009eec8ae9fa6f16412a7c9b6ec12c.
2020-10-28 11:49:28 -07:00
Philip Hyunsu Cho
f6169c0b16 [CI] Use separate Docker cache for each CUDA version (#6305) 2020-10-28 11:07:00 -07:00
Jiaming Yuan
3310e208fd Fix inplace prediction interval. (#6259)
* Add back the interval in call.
* Make the interval non-optional.
2020-10-28 13:13:59 +08:00
Jiaming Yuan
cc76724762 Reduce warning. (#6273) 2020-10-27 12:24:19 -07:00
DIVYA CHAUHAN
4e9c4f2d73 Create a tutorial for using the C API in a C/C++ application (#6285)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-27 12:19:20 -07:00
James Lamb
e1de390e6e [ci] replace 'egrep' with 'grep -E' (#6287) 2020-10-27 12:05:48 -07:00
Rory Mitchell
f0c3ff313f Update GPUTreeShap, add docs (#6281)
* Update GPUTreeShap, add docs

* Fix test

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-27 18:22:12 +13:00
Jiaming Yuan
b180223d18 Cleanup RABIT. (#6290)
* Remove recovery and MPI speed tests.
* Remove readme.
* Remove Python binding.
* Add checks in C API.
2020-10-27 08:48:22 +08:00
Akira Funahashi
8e0f5a6fc7 Update plugin instructions for CMake build (#6289) 2020-10-26 17:42:07 -07:00
Philip Hyunsu Cho
c8ec62103a Deprecate LabelEncoder in XGBClassifier; Enable cuDF/cuPy inputs in XGBClassifier (#6269)
* Deprecate LabelEncoder in XGBClassifier; skip LabelEncoder for cuDF/cuPy inputs

* Add unit tests for cuDF and cuPy inputs with XGBClassifier

* Fix lint

* Clarify warning

* Move use_label_encoder option to XGBClassifier constructor

* Add a test for cudf.Series

* Add use_label_encoder to XGBRFClassifier doc

* Address reviewer feedback
2020-10-26 13:20:51 -07:00
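The migration path suggested by the deprecation warning of that era looked roughly like this (the string labels are an illustrative assumption):

```python
import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 10)
y = np.array(["cat", "dog"] * 50)

# Encode labels yourself instead of relying on the built-in LabelEncoder.
classes, y_encoded = np.unique(y, return_inverse=True)

clf = XGBClassifier(use_label_encoder=False, eval_metric="logloss")
clf.fit(X, y_encoded)
```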
Jiaming Yuan
bcfab4d726 Revert "Disable JSON full serialization for now. (#6248)" (#6266)
This reverts commit 6d293020fb.
2020-10-27 03:30:47 +08:00
Jiaming Yuan
d61b628bf5 Remove RABIT CMake targets. (#6275)
* Now it's built as part of libxgboost.
* Set correct C API error in RABIT initialization and finalization.
* Remove redundant message.
* Guard the tracker print C API.
2020-10-27 01:30:20 +08:00
Jiaming Yuan
2686d32a36 Skip dask tests on ARM. (#6267)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-26 15:09:05 +08:00
Philip Hyunsu Cho
677f676172 Use UserWarning for old callback, as DeprecationWarning is not visible (#6270) 2020-10-22 01:10:52 -07:00
Philip Hyunsu Cho
1300467d36 Fix a typo in is_arm() in testing.py [skip ci] (#6271) 2020-10-22 13:07:14 +08:00
Jiaming Yuan
b5c2a47b20 Drop single point model recovery (#6262)
* Pass rabit params in JVM package.
* Implement timeout using poll timeout parameter.
* Remove OOB data check.
2020-10-21 15:27:03 +08:00
Jiaming Yuan
81c37c28d5 Time the CPU tests on Jenkins. (#6257)
* Time the CPU tests on Jenkins.
* Reduce thread contention.
* Add doc.
* Skip heavy tests on ARM.
2020-10-20 17:19:07 -07:00
Igor Moura
d1254808d5 Clean up C++ warnings (#6213) 2020-10-19 23:02:33 +08:00
Jiaming Yuan
ddf37cca30 Unify thread configuration. (#6186) 2020-10-19 16:05:42 +08:00
Philip Hyunsu Cho
7f6ed5780c [CI] Build a Python wheel for aarch64 platform (#6253) 2020-10-18 22:35:19 -07:00
Jiaming Yuan
5037abeb86 Fix linear gpu input (#6255) 2020-10-19 12:02:36 +08:00
Yuan Tang
cdcdab98b8 Add sponsors link to FUNDING.yml (#6252) 2020-10-18 19:17:11 -07:00
Philip Hyunsu Cho
65ea42bd42 [CI] Reduce testing load with RMM (#6249)
* [CI] Reduce testing load with RMM

* Address reviewer's comment
2020-10-18 19:16:46 -07:00
Manikya Bardhan
549f361b71 Updated winning solutions list (#6254) 2020-10-19 04:06:48 +08:00
Jiaming Yuan
6d293020fb Disable JSON full serialization for now. (#6248)
* Disable JSON serialization for now.

* Multi-class classification checkpoints at each iteration,
which brings significant overhead.

Revert: 90355b4f00

* Set R tests to use binary.
2020-10-16 17:59:54 +08:00
Jiaming Yuan
52452bebb9 Fix cls typo. (#6247) 2020-10-16 16:40:44 +08:00
Yuan Tang
3098d7cee0 Add link to XGBoost's Twitter handle (#6244) 2020-10-15 16:54:34 -07:00
Jiaming Yuan
3da5a69dc9 Fix typo in dask interface. (#6240) 2020-10-15 15:26:29 +08:00
dependabot[bot]
06e453ddf4 Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j (#6230)
Bumps [junit](https://github.com/junit-team/junit4) from 4.11 to 4.13.1.
- [Release notes](https://github.com/junit-team/junit4/releases)
- [Changelog](https://github.com/junit-team/junit4/blob/main/doc/ReleaseNotes4.11.md)
- [Commits](https://github.com/junit-team/junit4/compare/r4.11...r4.13.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2020-10-13 19:46:19 -07:00
dependabot[bot]
b51a717deb Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j-gpu (#6233)
Bumps [junit](https://github.com/junit-team/junit4) from 4.11 to 4.13.1.
- [Release notes](https://github.com/junit-team/junit4/releases)
- [Changelog](https://github.com/junit-team/junit4/blob/main/doc/ReleaseNotes4.11.md)
- [Commits](https://github.com/junit-team/junit4/compare/r4.11...r4.13.1)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2020-10-13 19:44:56 -07:00
Jiaming Yuan
bed7ae4083 Loop over thrust::reduce. (#6229)
* Check input chunk size of dqdm.
* Add doc for current limitation.
2020-10-14 10:40:56 +13:00
Rory Mitchell
734a911a26 Loop over copy_if (#6201)
* Loop over copy_if

* Catch OOM.

Co-authored-by: fis <jm.yuan@outlook.com>
2020-10-14 10:23:16 +13:00
Wittty-Panda
0fc263ead5 Update the list of winning solutions (#6222) 2020-10-13 20:05:12 +08:00
Jiaming Yuan
b05073bda5 [dask] Test for data initialization. (#6226) 2020-10-13 11:08:35 +08:00
Jiaming Yuan
2443275891 Cleanup Python code. (#6223)
* Remove pathlike as XGBoost 1.2 requires Python 3.6.
* Move conditional import of dask/distributed into dask module.
2020-10-12 15:44:41 +08:00
Jiaming Yuan
70c2039748 Catch all standard exceptions in C API. (#6220)
* `std::bad_alloc` is not guaranteed to be caught.
2020-10-12 14:01:46 +08:00
Jiaming Yuan
2241563f23 Handle duplicated values in sketching. (#6178)
* Accumulate weights in duplicated values.
* Fix device id in iterative dmatrix.
2020-10-10 19:32:44 +08:00
Jiaming Yuan
ab5b35134f Rework Python callback functions. (#6199)
* Define a new callback interface for Python.
* Deprecate the old callbacks.
* Enable early stopping on dask.
2020-10-10 17:52:36 +08:00
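The new interface is class-based: custom callbacks derive from `xgboost.callback.TrainingCallback` and override hook methods. A minimal sketch:

```python
import numpy as np
import xgboost as xgb

class PrintRound(xgb.callback.TrainingCallback):
    def after_iteration(self, model, epoch, evals_log):
        # Called after each boosting round; returning True stops training.
        print("finished round", epoch)
        return False

X, y = np.random.rand(100, 10), np.random.rand(100)
xgb.train({"objective": "reg:squarederror"}, xgb.DMatrix(X, label=y),
          num_boost_round=3, callbacks=[PrintRound()])
```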
Jiaming Yuan
b5b24354b8 More categorical tests and disable shap sparse test. (#6219)
* Fix tree load with 32 category.
2020-10-10 16:12:37 +08:00
Philip Hyunsu Cho
c991eb612d [jvm-packages] Fix up build for xgboost4j-gpu, xgboost4j-spark-gpu (#6216)
* [CI] Clean up build for JVM packages

* Use correct path for saving native lib

* Fix groupId of maven-surefire-plugin

* Fix stashing of xgboost4j_jar_gpu

* [CI] Don't run xgboost4j-tester with GPU, since it doesn't use gpu_hist
2020-10-09 14:08:15 -07:00
Jiaming Yuan
70ce5216b5 Add high level tests for categorical data. (#6179)
* Fix unique.
2020-10-09 09:27:23 +08:00
vcarpani
6bc9747df5 Reduce compile warnings (#6198)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-08 23:14:59 +08:00
ShvetsKS
a4ce0eae43 CPU predict performance improvement (#6127)
Co-authored-by: ShvetsKS <kirill.shvets@intel.com>
2020-10-08 15:50:21 +03:00
Jiaming Yuan
4cfdcaaf7b Move non-OpenMP gtest to GitHub Actions (#6210) 2020-10-08 00:58:21 -07:00
Jiaming Yuan
ddc4f20e54 Add JSON schema for categorical splits. (#6194) 2020-10-07 17:33:31 +08:00
odidev
a2fea33103 Added arm64 job in Travis-CI (#6200)
Signed-off-by: odidev <odidev@puresoftware.com>
2020-10-07 15:02:09 +08:00
Igor Moura
5908598666 [Doc] Add info on GPU compiler (#6204)
* Add a note about the required compiler version for CUDA.
* Also add a link that briefly explains compute capability versions.
2020-10-06 11:35:18 +08:00
Yuan Tang
1013224888 Consistent style for build status badge (#6203) 2020-10-05 18:23:21 -07:00
Philip Hyunsu Cho
f121f2738f [CI] Fix Docker build for CUDA 11 (#6202) 2020-10-05 17:54:14 -07:00
Jiaming Yuan
fd58005edf Ignore cachedir by joblib. [skip ci] (#6193) 2020-10-04 14:54:32 +08:00
DIVYA CHAUHAN
750bd0ae9a Update the list of winning solutions using XGBoost (#6192)
Co-authored-by: divya <divyachauhan661@gmail.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-03 13:39:58 -07:00
Christian Lorentzen
cf4f019ed6 [Breaking] Change default evaluation metric for classification to logloss / mlogloss (#6183)
* Change DefaultEvalMetric of classification from error to logloss

* Change default binary metric in plugin/example/custom_obj.cc

* Set old error metric in python tests

* Set old error metric in R tests

* Fix missed eval metrics and typos in R tests

* Fix setting eval_metric twice in R tests

* Add warning for empty eval_metric for classification

* Fix Dask tests

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-02 12:06:47 -07:00
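Users who depend on the old default can pin it back explicitly; a sketch for the binary case:

```python
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 10), np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)

# Restore the pre-change default metric for binary classification.
params = {"objective": "binary:logistic", "eval_metric": "error"}
xgb.train(params, dtrain, num_boost_round=10, evals=[(dtrain, "train")])
```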
John Quitto-Graham
e0e4f15d0e Fix a comment in demo to use correct reference (#6190)
Co-authored-by: John Quitto Graham <johnq@dgx07.aselab.nvidia.com>
2020-10-01 13:16:04 -07:00
Philip Hyunsu Cho
eb7946ff25 Hide C++ symbols from dmlc-core (#6188) 2020-10-01 10:07:13 -07:00
lacrosse91
6bc41df2fe [Doc] Add list of winning solutions in data science competitions using XGBoost (#6177) 2020-09-30 14:41:29 -07:00
Jiaming Yuan
f0c63902ff Use default allocator in sketching. (#6182) 2020-09-30 14:55:59 +08:00
Jiaming Yuan
444131a2e6 Add categorical data support to GPU Hist. (#6164) 2020-09-29 11:27:25 +08:00
Jiaming Yuan
798af22ff4 Add categorical data support to GPU predictor. (#6165) 2020-09-29 11:25:34 +08:00
Jiaming Yuan
7622b8cdb8 Enable categorical data support on Python DMatrix. (#6166)
* Only pandas is recognized.
2020-09-29 11:22:56 +08:00
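A sketch of the pandas path, assuming the opt-in flag is named `enable_categorical` as in later releases; columns must use the pandas category dtype:

```python
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame({
    "color": pd.Series(["red", "green", "blue"] * 10, dtype="category"),
    "value": np.random.rand(30),
})
y = np.random.rand(30)

# The category dtype marks which columns are categorical; the flag opts
# into the (then experimental) categorical support.
dtrain = xgb.DMatrix(df, label=y, enable_categorical=True)
```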
Jiaming Yuan
52c0b3f100 Fix error message. (#6176) 2020-09-29 11:18:25 +08:00
Rory Mitchell
dda9e1e487 Update GPUTreeshap (#6163)
* Reduce shap test duration

* Test interoperability with shap package

* Add feature interactions

* Update GPUTreeShap
2020-09-28 09:43:47 +13:00
Jiaming Yuan
434a3f35a3 Add TAGS to gitignore. [skip ci] (#6175) 2020-09-27 21:27:40 +08:00
Jiaming Yuan
07355599c2 Option for generating device debug info. (#6168)
* Supply `-G;-src-in-ptx` when `USE_DEVICE_DEBUG` is set and debug mode is selected.
* Refactor CMake script to gather all CUDA configuration.
* Use CMAKE_CUDA_ARCHITECTURES.  Close #6029.
* Add compute 80.  Close #5999
2020-09-27 03:26:56 +08:00
Kyle Nicholson
e6a238c020 Update base margin dask (#6155)
* Add `base-margin`
* Add `output_margin` to regressor.

Co-authored-by: fis <jm.yuan@outlook.com>
2020-09-26 21:30:52 +08:00
Alexander Gugel
03b8fdec74 Add DMatrix usage examples to c-api-demo (#5854)
* Add DMatrix usage examples to c-api-demo

* Add XGDMatrixCreateFromCSREx example

* Add XGDMatrixCreateFromCSCEx example
2020-09-26 02:10:12 -07:00
Philip Hyunsu Cho
2c4dedb7a0 [CI] Test C API demo (#6159)
* Fix CMake install config to use dependencies

* [CI] Test C API demo

* Explicitly cast num_feature, to avoid warning in Linux
2020-09-25 14:49:01 -07:00
Philip Hyunsu Cho
bd2b1eabd0 Add back support for scipy.sparse.coo_matrix (#6162) 2020-09-25 00:49:49 -07:00
Philip Hyunsu Cho
72ef553550 Fall back to CUB allocator if RMM memory pool is not set up (#6150)
* Fall back to CUB allocator if RMM memory pool is not set up

* Fix build

* Prevent memory leak

* Add note about lack of memory initialisation

* Add check for other fast allocators

* Set use_cub_allocator_ to true when RMM is not enabled

* Fix clang-tidy

* Do not demangle symbol; add check to ensure Linux+Clang/GCC combo
2020-09-24 11:04:50 -07:00
Zeno Gantner
5b05f88ba9 Cosmetic fixes in faq.rst (#6161) 2020-09-24 21:05:10 +08:00
Jiaming Yuan
14afdb4d92 Support categorical data in ellpack. (#6140) 2020-09-24 19:28:57 +08:00
Jiaming Yuan
78d72ef936 Add DaskDeviceQuantileDMatrix demo. (#6156) 2020-09-24 14:08:28 +08:00
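DaskDeviceQuantileDMatrix is a drop-in replacement for DaskDMatrix when the inputs already live on the GPU and `gpu_hist` is used; a sketch (the GPU-backed collections X and y are assumptions):

```python
import xgboost as xgb

def train_on_gpu(client, X, y):
    # X and y are GPU-backed dask collections, e.g. from dask_cudf.
    dtrain = xgb.dask.DaskDeviceQuantileDMatrix(client, X, y)
    return xgb.dask.train(client, {"tree_method": "gpu_hist"},
                          dtrain, num_boost_round=10)
```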
Philip Hyunsu Cho
678ea40b24 [CI] Upgrade cuDF and RMM to 0.16 nightlies; upgrade to Ubuntu 18.04 (#6157)
* [CI] Upgrade cuDF and RMM to 0.16 nightlies

* Use Ubuntu 18.04 in RMM test, since RMM needs GCC 7+
2020-09-23 19:48:44 -07:00
James Lamb
c686bc0461 [R] remove warning in configure.ac (fixes #6151) (#6152)
* [R] remove warning in configure.ac (fixes #6151)

* update configure
2020-09-22 22:47:38 -07:00
Jiaming Yuan
e033caa3ba Remove linking RMM library. (#6146)
* Remove linking RMM library.

* RMM is now header only.

* Remove remaining reference.
2020-09-22 16:59:33 -07:00
Jiaming Yuan
452ac8ea62 Time GPU tests on CI. (#6141) 2020-09-22 14:25:10 +08:00
Jiaming Yuan
33d80ffad0 [dask] Support more meta data on functional interface. (#6132)
* Add base_margin, label_(lower|upper)_bound.
* Test survival training with dask.
2020-09-21 16:56:37 +08:00
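A sketch of the survival-training side of this change; the helper and the dask collections are assumptions:

```python
import xgboost as xgb

def make_aft_dtrain(client, X, lower, upper):
    # For AFT survival training, labels are interval bounds rather than a
    # single label column; pair this with {"objective": "survival:aft"}.
    return xgb.dask.DaskDMatrix(client, X,
                                label_lower_bound=lower,
                                label_upper_bound=upper)
```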
Jiaming Yuan
7065779afa Improve JSON format for categorical features. (#6128)
* Gather categories for all nodes.
2020-09-21 15:35:05 +08:00
Jiaming Yuan
210c131ce7 Support categorical data in GPU sketching. (#6137) 2020-09-21 13:53:06 +08:00
Nan Zhu
c932fb50a1 [jvm-packages]add xgboost4j-gpu/xgboost4j-spark-gpu module to facilitate release (#6136)
* add xgboost4j-gpu/xgboost4j-spark-gpu module to facilitate release

* Update pom.xml
2020-09-20 09:20:38 -07:00
Jiaming Yuan
a069a21e03 Implement intrusive ptr (#6129)
* Use intrusive ptr for JSON.
2020-09-20 20:07:16 +08:00
Jiaming Yuan
e319b63f9e Merge extract cuts into QuantileContainer. (#6125)
* Use pruning for initial summary construction.
2020-09-18 16:36:39 +08:00
Jiaming Yuan
cc82ca167a [dask] Refactor meta data handling. (#6130) 2020-09-18 13:26:40 +08:00
Jiaming Yuan
5384ed85c8 Use caching allocator from RMM, when RMM is enabled (#6131) 2020-09-17 21:51:49 -07:00
neko
6bc9b9dc4f Fix doc for CMake requirement. (#6123) 2020-09-16 17:59:43 +08:00
Philip Hyunsu Cho
9e955fb9b0 [R] Check warnings explicitly for model compatibility tests (#6114)
* [R] Check warnings explicitly for model compatibility tests

* Address reviewer's feedback
2020-09-15 10:49:48 -07:00
Philip Hyunsu Cho
33577ef5d3 Add MAPE metric (#6119) 2020-09-14 18:45:27 -07:00
Rory Mitchell
47350f6acb Allow kwargs in dask predict (#6117) 2020-09-15 13:04:03 +12:00
Jiaming Yuan
b5f52f0b1b Validate weights are positive values. (#6115) 2020-09-15 09:03:55 +08:00
Jiaming Yuan
c6f2b8c841 Upgrade gputreeshap. (#6099)
* Upgrade gputreeshap.

Co-authored-by: Rory Mitchell <r.a.mitchell.nz@gmail.com>
2020-09-15 12:57:22 +12:00
Vitalie Spinu
1453bee3e7 [R] Remove stringi dependency (#6109)
* [R] Fix empty tests and test warnings

* [R] Remove stringi dependency (fix #5905)

* Fix R lint check

* [R] Fix automatic conversion to factor in R < 4.0.0 in xgb.model.dt.tree

* Add `R` Makefile variable

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-09-12 13:18:08 -07:00
Jiaming Yuan
07945290a2 Remove unused RABIT targets. (#6110)
* Remove rabit mock.
* Remove rabit base.
2020-09-11 14:09:44 +08:00
Jiaming Yuan
c92d751ad1 Enable building rabit on Windows (#6105) 2020-09-11 11:54:46 +08:00
Jiaming Yuan
08bdb2efc8 Fix dask doc. [skip ci] (#6108) 2020-09-11 10:56:12 +08:00
Bobby Wang
00b0ad1293 [Doc] add doc for kill_spark_context_on_worker_failure parameter (#6097)
* [Doc] add doc for kill_spark_context_on_worker_failure parameter

* resolve comments
2020-09-09 21:28:44 -07:00
Philip Hyunsu Cho
d0ccb13d09 Work around a compiler bug in MacOS AppleClang 11 (#6103)
* Workaround a compiler bug in MacOS AppleClang

* [CI] Run C++ test with MacOS Catalina + AppleClang 11.0.3

* [CI] Migrate cmake_test on MacOS from Travis CI to GitHub Actions

* Install OpenMP runtime

* [CI] Use CMake to locate lz4 lib
2020-09-09 21:21:55 -07:00
Philip Hyunsu Cho
9338582d79 [CI] Fix CTest by running it in a correct directory (#6104)
* [CI] Fix CTest by running it in a correct directory

* [CI] Do not run dmlc-core unit tests with sanitizer
2020-09-09 10:31:09 -07:00
Jiaming Yuan
3dcd85fab5 Refactor rabit tests (#6096)
* Merge rabit tests into XGBoost.
* Run them On CI.
* Simplification for CMake scripts.
2020-09-09 12:30:29 +08:00
Jiaming Yuan
318bffaa10 Fix custom obj link. [skip ci] (#6100) 2020-09-09 10:55:38 +08:00
Jiaming Yuan
b0001a6e29 Correct style warnings from clang-tidy for rabit. (#6095) 2020-09-08 12:13:58 +08:00
Hristo Iliev
da61d9460b [jvm-packages] Add getNumFeature method (#6075)
* Add getNumFeature to the Java API
* Add getNumFeature to the Scala API
* Add unit tests for getNumFeature

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-09-07 20:57:46 -07:00
Jiaming Yuan
93e9af43bb Unify set index data. (#6062) 2020-09-08 11:38:41 +08:00
Jiaming Yuan
e5d40b39cd [Breaking] Don't save leaf child count in JSON. (#6094)
The field is deprecated and not used anywhere in XGBoost.
2020-09-08 11:11:13 +08:00
Jiaming Yuan
5994f3b14c Don't link imported target. (#6093) 2020-09-07 02:51:09 -07:00
Philip Hyunsu Cho
974ba12f38 Fix CMake build with BUILD_STATIC_LIB option (#6090)
* Fix CMake build with BUILD_STATIC_LIB option

* Disable BUILD_STATIC_LIB option when R/JVM pkg is enabled

* Add objxgboost to install target only when BUILD_STATIC_LIB=ON
2020-09-07 02:38:29 -07:00
Daniel Steinberg
68c55a37d9 Add cache name back to external_memory.py files. (#6088) 2020-09-06 16:01:09 +08:00
Boris Feld
24ca9348f7 Fix typo in xgboost.callback.early_stop docstring (#6071) 2020-09-06 13:37:07 +08:00
Rory Mitchell
2e907abdb8 Updates to GPUTreeShap (#6087)
* Extract paths on device

* Update GPUTreeShap
2020-09-06 13:39:08 +12:00
Bobby Wang
0e2d5669f6 [jvm-packages] cancel job instead of killing SparkContext (#6019)
* cancel job instead of killing SparkContext

This PR changes the default behavior of killing the SparkContext. Instead, it
cancels the affected jobs when a task fails, which means the SparkContext
stays alive even when some exceptions happen.

* add a parameter to control whether to kill the SparkContext

* cancel the jobs the failed task belongs to

* remove the jobId from the map when one job failed.

* resolve comments
2020-09-02 14:20:59 -07:00
Tong He
3912f3de06 Updates from 1.2.0 cran submission (#6077)
* update for 1.2.0 cran submission

* recover cmakelists

* fix unittest from the shap PR

* trigger CI
2020-09-02 20:50:23 +08:00
Philip Hyunsu Cho
9be969cc7a Add release note for 1.2.0 in NEWS.md (#6063)
* Update query_contributors.py to account for pagination

* Add the release note for 1.2.0

* Add release note for patch releases 

* Apply suggestions from code review 

* Fix typo 

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: John Zedlewski <904524+JohnZed@users.noreply.github.com>
2020-09-02 00:49:02 -07:00
Anthony D'Amato
ada964f16e Clean the way deterministic partitioning is computed (#6033)
We propose using only the rowHashCode to compute the partitionKey; adding the FeatureValue hashCode does not bring more value and would make the computation slower. Even though collisions occur at a rate of about 0.2% with MurmurHash3, this is bearable for partitioning and has no impact on data balancing.
2020-08-30 14:38:23 -07:00
ShvetsKS
c1ca872d1e Modin DF support (#6055)
* Modin DF support

* mode change

* tests were added, ci env was extended

* mode change

* Remove redundant installation of modin

* Add a pytest skip marker for modin

* Install Modin[ray] from PyPI

* fix interfering

* avoid extra conversion

* delete cv test for modin

* revert cv function

Co-authored-by: ShvetsKS <kirill.shvets@intel.com>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-29 22:33:30 +03:00
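With this change a Modin dataframe is accepted wherever a pandas one is; a small sketch, assuming Modin is installed:

```python
import modin.pandas as mpd
import numpy as np
import xgboost as xgb

df = mpd.DataFrame(np.random.rand(100, 4))
y = np.random.rand(100)

dtrain = xgb.DMatrix(df, label=y)  # dispatched like a pandas DataFrame
```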
FelixYBW
3a990433f9 set maxBins to 256. Align with c code in src/tree/param.h (#6066) 2020-08-28 15:06:11 +03:00
Rory Mitchell
9bddecee05 Update GPUTreeShap (#6064)
* Update GPUTreeShap

* Update src/CMakeLists.txt

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-27 12:01:53 -07:00
Jiaming Yuan
2fcc4f2886 Unify evaluation functions. (#6037) 2020-08-26 14:23:27 +08:00
Jiaming Yuan
80c8547147 Make binary bin search reusable. (#6058)
* Move binary search row to hist util.
* Remove dead code.
2020-08-26 05:05:11 +08:00
Philip Hyunsu Cho
9c14e430af [CI] Improve JVM test in GitHub Actions (#5930)
* [CI] Improve JVM test in GitHub Actions

* Use env var for Wagon options [skip ci]

* Move the retry flag to pom.xml [skip ci]

* Export env var RABIT_MOCK to run Spark tests [skip ci]

* Correct location of env var

* Re-try up to 5 times [skip ci]

* Don't run distributed training test on Windows

* Fix typo

* Update main.yml
2020-08-25 10:14:46 -07:00
Jiaming Yuan
81d8dd79ca Bump header version. (#6056) 2020-08-26 00:29:00 +08:00
Jiaming Yuan
20c95be625 Expand categorical node. (#6028)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-25 18:53:57 +08:00
Rory Mitchell
9a4e8b1d81 GPUTreeShap (#6038) 2020-08-25 12:47:41 +12:00
Philip Hyunsu Cho
b3193052b3 Bump version to 1.3.0 snapshot in master (#6052) 2020-08-23 17:13:46 -07:00
Philip Hyunsu Cho
4729458a36 [jvm-packages] [doc] Update install doc for JVM packages (#6051) 2020-08-23 14:14:53 -07:00
Philip Hyunsu Cho
cfced58c1c [CI] Port CI fixes from the 1.2.0 branch (#6050)
* Fix a unit test on CLI, to handle RC versions

* [CI] Use mgpu machine to run gpu hist unit tests

* [CI] Build GPU-enabled JAR artifact and deploy to xgboost-maven-repo
2020-08-22 23:24:46 -07:00
Jiaming Yuan
a144daf034 Limit tree depth for GPU hist. (#6045) 2020-08-22 19:34:52 +08:00
Jiaming Yuan
b9ebbffc57 Fix plotting test. (#6040)
Previously the test loaded a model generated by `test_basic.py`; now we generate
the model explicitly.

* Cleanup saved files for basic tests.
2020-08-22 13:18:48 +08:00
Jiaming Yuan
7a46515d3d Remove win2016 jvm github action test. (#6042) 2020-08-20 19:39:46 -07:00
Jiaming Yuan
7be2e04bd4 Fix scikit learn cls doc. (#6041) 2020-08-20 19:23:06 -07:00
Philip Hyunsu Cho
1fd29edf66 [CI] Migrate linters to GitHub Actions (#6035)
* [CI] Move lint to GitHub Actions

* [CI] Move Doxygen to GitHub Actions

* [CI] Move Sphinx build test to GitHub Actions

* [CI] Reduce workload for Windows R tests

* [CI] Move clang-tidy to Build stage
2020-08-19 12:33:51 -07:00
ShvetsKS
24f2e6c97e Optimize DMatrix build time. (#5877)
Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-08-20 01:37:03 +08:00
Jiaming Yuan
29b7fea572 Optimize cpu sketch allreduce for sparse data. (#6009)
* Bypass RABIT serialization reducer and use custom allgather based merging.
2020-08-19 10:03:45 +08:00
Jiaming Yuan
90355b4f00 Make JSON the default full serialization format. (#6027) 2020-08-19 09:57:43 +08:00
Anthony D'Amato
f58e41bad8 Fix deterministic partitioning with dataset containing Double.NaN (#5996)
The functions featureValueOfSparseVector and featureValueOfDenseVector could return Float.NaN if the input vector contained any missing values. This made the partition-key computation fail, and most of the vectors would end up in the same partition. We fix this by avoiding the NaN and simply using the row hashCode in this case.
We added a test to ensure that the repartition is now uniform on input datasets containing missing values, by checking that the variance of the partition sizes is below a certain threshold.

Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
2020-08-18 18:55:37 -07:00
Cuong Duong
e51cba6195 Add SHAP summary plot using ggplot2 (#5882)
* add SHAP summary plot using ggplot2

* Update xgb.plot.shap

* Update example in xgb.plot.shap documentation

* update logic, add tests

* whitespace fixes

* whitespace fixes for test_helpers

* namespace for sd function

* explicitly declare variables that are automatically evaluated by data.table

* Fix R lint

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-18 18:04:09 -07:00
Qi Zhang
989ddd036f Swap byte-order in binary serializer to support big-endian arch (#5813)
* fixed some endian issues

* Use dmlc::ByteSwap() to simplify code

* Fix lint check

* [CI] Add test for s390x

* Download latest CMake on s390x

* Fix a bug in my code

* Save magic number in dmatrix with byteswap on big-endian machine

* Save version in binary with byteswap on big-endian machine

* Load scalar with byteswap in MetaInfo

* Add a debugging message

* Handle arrays correctly when byteswapping

* EOF can also be 255

* Handle magic number in MetaInfo carefully

* Skip Tree.Load test for big-endian, since the test manually builds little-endian binary model

* Handle missing packages in Python tests

* Don't use boto3 in model compatibility tests

* Add s390 Docker file for local testing

* Add model compatibility tests

* Add R compatibility test

* Revert "Add R compatibility test"

This reverts commit c2d2bdcb7dbae133cbb927fcd20f7e83ee2b18a8.

Co-authored-by: Qi Zhang <q.zhang@ibm.com>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-18 14:47:17 -07:00
Jiaming Yuan
4d99c58a5f Feature weights (#5962) 2020-08-18 19:55:41 +08:00
Jiaming Yuan
a418278064 Merge pull request #6023 from trivialfis/merge-rabit
Merge rabit
2020-08-18 09:01:56 +08:00
Philip Hyunsu Cho
14d5ce712c [CI] Fix Dask Pytest fixture (#6024) 2020-08-17 16:45:22 -07:00
fis
111968ca58 Merge rabit 2020-08-18 03:52:33 +08:00
fis
1c5904df3f Remove rabit. 2020-08-18 03:48:36 +08:00
Jiaming Yuan
d240463b38 Revert "Remove warning about memset. (#6003)" (#6020)
This reverts commit 12e3fb6a6c.
2020-08-17 20:10:15 +08:00
Philip Hyunsu Cho
511bb22ffd [Doc] Add dtreeviz as a showcase example of integration with 3rd-party software (#6013) 2020-08-13 20:53:59 -07:00
Philip Hyunsu Cho
e3ec7b01df [CI] Cancel builds on subsequent pushes (#6011)
* [CI] Cancel builds on subsequent pushes

* Use a more secure method

* test commit
2020-08-13 11:17:39 -07:00
Jiaming Yuan
674c409e9d Remove rabit dependency on public headers. (#6005) 2020-08-13 08:26:20 +08:00
Jiaming Yuan
12e3fb6a6c Remove warning about memset. (#6003) 2020-08-13 08:25:46 +08:00
Philip Hyunsu Cho
9adb812a0a RMM integration plugin (#5873)
* [CI] Add RMM as an optional dependency

* Replace caching allocator with pool allocator from RMM

* Revert "Replace caching allocator with pool allocator from RMM"

This reverts commit e15845d4e72e890c2babe31a988b26503a7d9038.

* Use rmm::mr::get_default_resource()

* Try setting default resource (doesn't work yet)

* Allocate pool_mr in the heap

* Prevent leaking pool_mr handle

* Separate EXPECT_DEATH() in separate test suite suffixed DeathTest

* Turn off death tests for RMM

* Address reviewer's feedback

* Prevent leaking of cuda_mr

* Fix Jenkinsfile syntax

* Remove unnecessary function in Jenkinsfile

* [CI] Install NCCL into RMM container

* Run Python tests

* Try building with RMM, CUDA 10.0

* Do not use RMM for CUDA 10.0 target

* Actually test for test_rmm flag

* Fix TestPythonGPU

* Use CNMeM allocator, since pool allocator doesn't yet support multiGPU

* Use 10.0 container to build RMM-enabled XGBoost

* Revert "Use 10.0 container to build RMM-enabled XGBoost"

This reverts commit 789021fa31112e25b683aef39fff375403060141.

* Fix Jenkinsfile

* [CI] Assign larger /dev/shm to NCCL

* Use 10.2 artifact to run multi-GPU Python tests

* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target

* Rename Conda env rmm_test -> gpu_test

* Use env var to opt into CNMeM pool for C++ tests

* Use identical CUDA version for RMM builds and tests

* Use Pytest fixtures to enable RMM pool in Python tests

* Move RMM to plugin/CMakeLists.txt; use PLUGIN_RMM

* Use per-device MR; use command arg in gtest

* Set CMake prefix path to use Conda env

* Use 0.15 nightly version of RMM

* Remove unnecessary header

* Fix a unit test when cudf is missing

* Add RMM demos

* Remove print()

* Use HostDeviceVector in GPU predictor

* Simplify pytest setup; use LocalCUDACluster fixture

* Address reviewers' commments

Co-authored-by: Hyunsu Cho <chohyu01@cs.wasshington.edu>
2020-08-12 01:26:02 -07:00
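Usage-wise, the plugin keys off an RMM pool that the application sets up; a sketch assuming XGBoost was built with the RMM plugin enabled (the pool size is arbitrary):

```python
import rmm
import xgboost as xgb

# Create the RMM memory pool before any CUDA work; a plugin-enabled build
# of XGBoost then draws its device allocations from this pool.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 ** 30)

# GPU training proceeds as usual, e.g. with tree_method="gpu_hist".
```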
Jiaming Yuan
c3ea3b7e37 Fix nightly build doc. [skip ci] (#6004)
* Fix nightly build doc. [skip ci]

* Fix title too short. [skip ci]
2020-08-12 15:00:40 +08:00
Jiaming Yuan
ee70a2380b Unify CPU hist sketching (#5880) 2020-08-12 01:33:06 +08:00
jameskrach
bd6b7f4aa7 [Breaking] Fix .predict() method and add .predict_proba() in xgboost.dask.DaskXGBClassifier (#5986) 2020-08-11 16:11:28 +08:00
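After the fix the Dask classifier mirrors the single-node sklearn API: `.predict()` returns class labels and `.predict_proba()` returns per-class probabilities. A sketch, with the dask collections as assumptions:

```python
from xgboost.dask import DaskXGBClassifier

def fit_predict(client, X, y):
    clf = DaskXGBClassifier(n_estimators=10)
    clf.client = client
    clf.fit(X, y)
    labels = clf.predict(X)       # class labels
    proba = clf.predict_proba(X)  # per-class probabilities
    return labels, proba
```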
Jiaming Yuan
6f7112a848 Move warning about empty dataset. (#5998) 2020-08-11 14:10:51 +08:00
Jiaming Yuan
f93f1c03fc Rabit update. (#5978)
* Remove parameter on JVM Packages.
2020-08-11 09:17:32 +08:00
Jiaming Yuan
0b2a26fa74 Remove skmaker. (#5971) 2020-08-09 15:23:31 +08:00
Vladislav Epifanov
388f975cf5 Introducing DPC++-based plugin (predictor, objective function) supporting oneAPI programming model (#5825)
* Added plugin with DPC++-based predictor and objective function

* Update CMakeLists.txt

* Update regression_obj_oneapi.cc

* Added README.md for OneAPI plugin

* Added OneAPI predictor support to gbtree

* Update README.md

* Merged kernels in gradient computation. Enabled multiple loss functions with DPC++ backend

* Aligned plugin CMake files with latest master changes. Fixed whitespace typos

* Removed debug output

* [CI] Make oneapi_plugin a CMake target

* Added tests for OneAPI plugin for predictor and obj. functions

* Temporarily switched to the default selector for device dispatching in the OneAPI plugin, to enable execution in environments without GPUs

* Updated readme file.

* Fixed USM usage in predictor

* Removed workaround with explicit templated names for DPC++ kernels

* Fixed warnings in plugin tests

* Fix CMake build of gtest

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-08 18:40:40 -07:00
Anthony D'Amato
7cf3e9be59 Fix typo in tracker logging (#5994) 2020-08-09 03:45:46 +08:00
James Lamb
589b385ec6 [R] fix uses of 1:length(x) and other small things (#5992) 2020-08-09 03:31:33 +08:00
Jiaming Yuan
801e6b6800 Fix dask predict shape infer. (#5989) 2020-08-08 14:29:22 +08:00
Jiaming Yuan
4acdd7c6f6 Remove stop process. (#143) 2020-08-05 10:12:00 -07:00
Jiaming Yuan
9c6e791e64 Enforce tree order in JSON. (#5974)
* Make JSON model IO more future proof by using tree id in model loading.
2020-08-05 16:44:52 +08:00
Jiaming Yuan
dde9c5aaff Fix missing data warning. (#5969)
* Fix data warning.

* Add numpy/scipy test.
2020-08-05 16:19:12 +08:00
Jiaming Yuan
8599f87597 Update JSON schema. (#5982)
* Update JSON schema for pseudo huber.
* Update JSON model schema.
2020-08-05 15:21:11 +08:00
Jiaming Yuan
9c93531709 Update Python custom objective demo. (#5981) 2020-08-05 12:27:19 +08:00
Jiaming Yuan
1149a7a292 Fix sklearn doc. (#5980) 2020-08-05 12:26:19 +08:00
Jiaming Yuan
b069431c28 Export DaskDeviceQuantileDMatrix in doc. [skip ci] (#5975) 2020-08-05 00:48:10 +08:00
FelixYBW
e6cd74ead3 Set a minimal reducer size and parent_down size (#139)
* Set a minimal reducer message size; receive the same data size from the parent each time.

* When a parent reads from a child, check that it receives at least the minimal reduce size.
 Fix a bug and rewrite the minimal reducer size check, making sure each transfer is 1~N times the minimal reduce size.

 Assume the minimal reduce size is X; the logic here is:
 1: each child uploads total_size of message
 2: each parent receives at least X of message, up to total_size
 3: the parent reduces X, NxX, or total_size of message
 4: the parent sends X, NxX, or total_size of message to its parent
 5: the parent's parent receives at least X of message, up to total_size, then reduces X, NxX, or total_size of message
 6: the parent's parent sends X, NxX, or total_size of message down to its children
 7: the parent receives X, NxX, or total_size of message and sends it to its children
 8: the child receives X, NxX, or total_size of message

 During the whole process, each transfer is a (1~N)xX-byte message, or up to total_size.

 If X is larger than total_size, then allreduce always reduces the whole message and passes it down.

* Follow the style check rule

* Fix the cpplint check

* Fix the allreduce_base header order

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-25 12:46:45 -07:00
Philip Hyunsu Cho
74bf00a5ab De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE (#136)
* De-duplicate macro _CRT_SECURE_NO_WARNINGS / _CRT_SECURE_NO_DEPRECATE

* Move all macros to base.h

* Fix CI
2020-06-28 09:51:50 -07:00
Philip Hyunsu Cho
8fe7f5dc43 [CI] Pass new cpplint check (#141) 2020-06-05 00:45:53 -07:00
Jiaming Yuan
a6008d5d93 Add RABIT_DLL tag to definitions of rabit APIs. (#140)
* Add RABIT_DLL tag to definitions of rabit APIs.
* Fix Travis tests.
2020-05-19 18:20:31 +08:00
Philip Hyunsu Cho
4fb34a008d Use 'default' visibility for C symbols (#138) 2020-04-23 20:48:52 -07:00
Philip Hyunsu Cho
2f7fcff4d7 Fix build on FreeBSD (#133) 2020-01-27 12:15:32 -08:00
Nan Zhu
6e563951af fix hanging trainings (#132)
* fix hanging connections

* remove logging
2020-01-27 09:12:02 -08:00
Chen Qin
0d6a853212 fix xgboost build failure introduced by allgather interface (#129)
* fix missing allgather rabit declaration

* fix allgather signature mismatch

* fix type conversion

* fix GetRingPrevRank
2020-01-01 22:45:14 +08:00
Chen Qin
493ad834a1 allow duplicated bootstrap allreduce to overwrite previous results (#128)
* allow timeout of 0 to enable immediate exit

* disable duplicated signature check, overwrite results with same key
2019-11-13 10:19:58 +08:00
nateagr
1907b25cd0 Expose RabitAllGatherRing and RabitGetRingPrevRank (#113)
* add unittests

* Expose RabitAllGatherRing and RabitGetRingPrevRank

* Enabled TCP_NODELAY to decrease latency
2019-11-12 19:55:32 +08:00
Jiaming Yuan
90e2239372 Fix cmake variable. (#126) 2019-11-05 01:27:08 -05:00
Chen Qin
2f25347168 allow timeout of 0 to enable immediate exit (#125) 2019-10-22 14:38:55 -07:00
Chen Qin
d22e0809a8 throw dmlc::Error (#120)
* throw dmlc::Error handled by xgboost jni
2019-10-16 13:12:15 -04:00
Philip Hyunsu Cho
33dbc10aab Fix compilation failure on Windows (#119)
* Fix compilation failure on Windows

* Fix lint
2019-10-15 23:37:42 +07:00
Chen Qin
8e2c201d23 fix assert timeout_sec (#117) 2019-10-14 04:44:26 -04:00
Jiaming Yuan
ed9328ceae Fix lint. (#115) 2019-10-13 07:38:29 -04:00
Jiaming Yuan
6dab74689c Add SeekEnd to MemoryFixSizeBuffer. (#109)
* Don't assert buffer size.
2019-10-13 00:09:25 -04:00
Chen Qin
5d1b613910 exit when allreduce/broadcast error cause timeout (#112)
* keep async timeout task

* add missing pthread to cmake

* add tests

* Add a sleep period to avoid flushing the tracker.
2019-10-11 03:39:39 -04:00
Chen Qin
af7281afe3 unittests mock, cleanup (#111)
* cleanup; fix issue introduced after removing the is_bootstrap parameter

* misc

* clean

* add unittests
2019-10-01 13:36:11 -07:00
Chen Qin
ddcc2d85da Clean up cmake script and code includes (#106)
* Clean up CMake scripts and related include paths.
* Add unittests.
2019-09-26 02:29:04 -04:00
Xu Xiao
e92641887b remove unreached code of AllreduceRobust::CheckAndRecover (#108) 2019-09-18 23:06:59 -04:00
Jiaming Yuan
d4ce6807c7 Don't use _builtin_FUNCTION. (#107) 2019-09-18 12:05:23 -04:00
Chen Qin
9a7ac85d7e remove is_bootstrap parameter (#102)
* apply openmp simd

* clean up __builtin detection, move the Windows build check from the xgboost project, add OpenMP support for vectorized reduce

* apply openmp only to rabit

* organize rabit signature

* remove is_bootstrap, use load_checkpoint as an implicit flag

* Visual Studio doesn't support the latest OpenMP

* organize omp declarations

* replace memory copy with vector cast

* Revert "replace memory copy with vector cast"

This reverts commit 28de4792dcdff40d83d458510d23b7ef0b191d79.

* Revert "orgnize omp declarations"

This reverts commit 31341233d31ce93ccf34d700262b1f3f6690bbfe.

* remove openmp settings, merge into an upcoming PR

* misc

* per feedback, update comments
2019-09-10 11:45:50 -07:00
Chen Qin
5797dcb64e support bootstrap allreduce/broadcast (#98)
* support running rabit tests as an xgboost subproject using xgboost/dmlc-core

* support tracker config set/get

* remove redundant printf

* remove redundant printf

* add c++0x declaration

* log allreduce/broadcast caller, engine should track caller stack for
investigation

* tracker support binary config format

* Revert "tracker support binary config format"

This reverts commit 2a28e5e2b55c200cb621af8d19f17ab1bc62503b.

* remove caller, prototype fetch allreduce/broadcast results from resbuf

* store cached allreduce/broadcast seq_no to tracker

* allow restoring all caches from other nodes

* try new rabit collective cache, todo: recv_link seems down

* link up cache restore with main recovery

* cleanup load cache state

* update cache api

* pass test.mk

* have working tests

* try to unify check into actionsummary

* more logging to debug the distributed hist tree method issue

* update rabit interface to support caller signature matching

* split seq_counter and cur_cache_seq into different variables

* still see issue with inf loop

* support debug print caller as well as allreduce op

* cleanup

* remove get/set cache from model_recover, adding recover in
loadcheckpoint

* clarify the rabit cache strategy: a cache entry is set only by a successful
collective call involving all nodes with a unique cache key. If all nodes call
getcache at the same time, we let rabit run the collective call. If some nodes
call getcache while others do not, we backfill the cache from the nodes with
the most entries

* revert caller logs

* fix lint error

* fix engine mpi signature

* support getcache by ref

* allow the result buffer to persist to a filestream

* add logging

* try fix checkpoint failure recovery case

* use int64_t to avoid an overflow-caused seg fault

* try to avoid int overflow

* try fix checkpoint failure recovery case

* try to avoid seqno overflowing to negative by offsetting the special flag value;
add the cache seq no to checkpoint/load checkpoint/checkpoint ack to avoid
confusion from cache recovery

* fix cache seq assert error

* remove loging, handle edge case

* add extensive logging to checkpoint state with different seq no

* fix lint errors

* clean up comments before merge back to master

* add logs to allreduce/broadcast/checkpoint

* use unsigned int32 and give seq no a larger range

* address remove allreduce dropseq code segment

* using caller signature to filter bootstrapallreduces

* remove get/set cache from empty

* apply signature to reducer

* apply signature to broadcast

* add key to broadcast log

* fix broadcast signature

* fix default _line value for non-Linux systems

* adding comments, remove sleep(1)

* fix osx build issue

* try fix mpi

* fix doc

* fix engine_empty api

* logging, adding more logs, restore immutable assertion

* print unsigned int with ud

* fix lint

* rename seqtype to kSeq and KCache, indicating its usage;
apply the kDiffSeq check to the load_cache routine

* comment allreduce/broadcast log

* allow tests to run on ARM

* enable flag to turn on / off cache

* add a log info alert if the user chooses to enable the rabit bootstrap cache

* add a rabit_debug setting so the user can turn it on via config

* log flags when the user turns on rabit_debug

* force rabit restart if tracker assign -1 rank

* use OPENMP to vectorize the reducer

* address comment

* Revert "address comment"

This reverts commit 1dc61f33e7357dad8fa65528abeb81db92c5f9ed.

* fix checkpoint size print 0

* per feedback, remove DISABLEOPEMP, address race condition

* - remove openmp from this pr
- update name from cache to bootstrapcache

* add default value of signature macros

* remove openmp from cmake file

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update src/allreduce_robust.cc

Co-Authored-By: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* run test with cmake

* remove openmp

* fix cmake based tests

* use cmake test fix darwin .dylib issue

* move around rabit_signature definition due to windows build

* misc, add c++ check in CMakeFile

* per feedback

* resolve CMake file

* update rabit version
2019-08-27 18:12:33 -07:00
Nan Zhu
dba32d54d1 allow shutdown to be called multiple times (#99) 2019-07-16 12:41:39 -07:00
Nan Zhu
65b718a5e7 return values in Init and Finalize (#96)
* make init function return values

* address the comments
2019-06-25 20:05:54 -07:00
Nan Zhu
fc85f776f4 allow not stopping the process on error (#97)
* allow not stopping the process on error

* fix merge error
2019-06-25 13:04:39 -07:00
Nan Zhu
a429748e24 allow multi call on init (#92) 2019-04-26 18:41:02 -07:00
Chen Qin
5c3b36f346 Allow using external dmlc-core (#91)
* Set `RABIT_BUILD_DMLC=1` if using dmlc-core in rabit

* remove dmlc-core
2019-04-26 15:28:45 +08:00
Chen Qin
e3d51d3e62 [rabit harden] Enable all tests (#90)
* include osx in tests
* address `time_wait` on port assignment
* increase submit attempts.
* cleanup tests
2019-04-24 19:12:11 +08:00
Chen Qin
ecd4bf7aae [rabit harden] replace hardcopy dmlc-core headers with submodule links (#86)
* backport dmlc header changes to rabit

* use gitmodule to reference latest dmlc header files

* include ref to dmlc-core
fix cmake

* update cmake file, add a cmake build Travis task

* try force using g++-4.8

* per feedback, update cmake
2019-03-23 13:11:29 +08:00
Chen Qin
785d7e54d3 [mpi] add engine_mpi travis build (#83) 2019-03-15 22:58:47 +08:00
Chen Qin
ed06e0c6af [rabit harden] fix rabit tests (#81)
* enable model recovery tests
* force use gcc4.8 in Travis
2019-03-15 07:16:45 +08:00
Jiaming Yuan
1cc34f01db Fix ssize_t definition. (#80)
* Fix linter.
2019-02-18 19:25:08 +08:00
Jiaming Yuan
0101a4719c Remove dmlc logging. (#78)
* Remove dmlc logging header.

* Fix lint.
2019-02-16 18:37:54 -08:00
Jiaming Yuan
05941a5f96 Try fixing mingw build error when using CMake. (#77)
* Try fixing mingw build error when using CMake.

* Check __MINGW32__ .

* Fix linter.
2019-02-16 22:35:43 +08:00
Chen Qin
eb2590b774 workaround macosx java test race condition (#74)
* fix error in dmlc#57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master

* fix mac osx test failure in https://github.com/dmlc/xgboost/pull/3818

* Update allreduce_robust.cc
2018-10-26 12:39:31 -07:00
Chen Qin
3a35dabfae support larger cluster (#73)
* fix error in dmlc#57, clean up comments and naming

* include missing packages, disable recovery tests for now

* disable local_recover tests until we have a bug fix

* support larger cluster

* fix lint, merge with master
2018-10-22 10:13:45 -07:00
Chen Qin
69cdfae22f disable travis model_recover tests, fix doc generation failure (#71)
* add missing packages used in dmlc submit

* disable local_recovery tests until we have a code fix

* fix doc gen failure
2018-10-19 18:18:16 -07:00
Chen Qin
785bde6f87 add missing packages used in dmlc submit (#70) 2018-10-19 13:04:33 -07:00
Ruifeng Zheng
edc403fb2c init (#60) 2018-07-04 12:31:24 -07:00
Philip Hyunsu Cho
87143deb4c Don't define DMLC_LOG_STACK_TRACE on Solaris (#59)
DMLC_LOG_STACK_TRACE involves use of non-standard header execinfo.h, which
causes compilation failure on Solaris.
2018-06-15 22:33:46 -07:00
trivialfis
fc5072b100 Fix building shared library. (#58) 2018-05-24 09:05:37 -07:00
Will Storey
7bc46b8c75 Allow compiling with -Werror=strict-prototypes (#56)
Without this, with gcc 7.3.0, we see things like:

/xgboost/include/xgboost/c_api.h:98:1: error: function
declaration isn't a prototype [-Werror=strict-prototypes]
 XGB_DLL const char *XGBGetLastError();
  ^~~~~~~
2018-03-18 22:21:35 -07:00
Dennis O'Brien
440e81db0b Fixed print statements and xrange to be compatible with Python 2 and 3. (#55) 2018-02-26 12:19:04 -08:00
David Hirvonen
0759d5ed2b add cmake w/ relocatable pkgconfig installation (#53) 2018-01-07 14:49:39 -08:00
snehlatamohite
2eb1a1a371 Use -msse2 flag depending upon architecture while compiling the rabit code (#49) 2017-09-01 08:42:45 -07:00
Qiang Kou (KK)
41c96a25a9 To compile on ARM CPU (#46) 2017-07-12 20:24:19 -07:00
Artem Krylysov
0b406754fa Fix C API header compatibility with C compilers (#44) 2017-06-01 09:21:48 -07:00
Ziyue Huang
ab5f203b44 fix error: ‘nullptr’ was not declared in this scope (#43) 2017-04-23 10:44:11 -07:00
tqchen
a1acf23b60 only doc rabit 2017-03-17 22:09:13 -07:00
tqchen
a764d45cfb sync dmlc headers 2017-03-16 10:16:23 -07:00
AbdealiJK
21b5e12913 allreduce_robust.cc: Allow num_global_replica to be 0 (#38)
In some cases, users may not want to have any global replica of
the data being broadcasted/all-reduced. In such cases, set the
result_buffer_round to -1 as a flag that this is not necessary
and check for it.
2016-11-23 19:34:11 -08:00
Tianqi Chen
032152ad24 Update .travis.yml 2016-11-23 10:14:32 -08:00
kabu4i
af1b7d6e7a Applied FreeBSD support (#37) 2016-11-15 21:10:51 -08:00
tqchen
a9a2a69dc1 Merge branch 'master' of ssh://github.com/tqchen/rabit 2016-08-26 15:06:03 -07:00
tqchen
cd1db1afaa sync dmlc header 2016-08-26 15:05:42 -07:00
tomlaube
1007a26641 Fixing the imports to work with MPI (#30) 2016-08-26 15:04:41 -07:00
elferdo
7e15fdd9c6 FreeBSD does not have fopen64 (as of 10.3). Detect it and replace with fopen. (#29)
2016-08-20 08:35:01 -07:00
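A sketch of the detect-and-replace this describes, assuming the __FreeBSD__ predefined macro (the actual patch may hinge on a build-system check instead):

    // fopen64 is a glibc-ism; on FreeBSD plain fopen is already 64-bit capable.
    #if defined(__FreeBSD__)
    #define fopen64 fopen
    #endif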
Tianqi Chen
2dd7476ad7 Merge pull request #28 from randomjohnnyh/master
Use getaddrinfo instead of gethostbyname for thread safety
2016-07-27 10:40:24 -07:00
Johnny Ho
9d235c31a7 Use getaddrinfo instead of gethostbyname for thread safety 2016-07-27 02:35:02 -04:00
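The motivation: gethostbyname() returns a pointer to static storage, so concurrent lookups from different threads can race, whereas getaddrinfo() fills a caller-owned list. A minimal POSIX resolution sketch (host and port are placeholders, not values from the patch):

    #include <netdb.h>
    #include <sys/socket.h>
    #include <cstring>

    bool Resolve(const char *host, const char *port) {
      addrinfo hints, *res = nullptr;
      std::memset(&hints, 0, sizeof(hints));
      hints.ai_family = AF_INET;        // IPv4
      hints.ai_socktype = SOCK_STREAM;  // TCP
      if (getaddrinfo(host, port, &hints, &res) != 0) return false;
      // res->ai_addr / res->ai_addrlen are now ready for connect()
      freeaddrinfo(res);                // caller frees its own copy; no shared state
      return true;
    }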
Tianqi Chen
8f61535b83 Update README.md 2016-05-10 20:14:53 -07:00
Tianqi Chen
b8aec1730c Update README.md 2016-05-10 20:14:29 -07:00
tqchen
e19fced5cb [FIX] rabit on single node 2016-05-10 20:05:59 -07:00
tqchen
849b20b7c8 add distributed checking 2016-04-11 15:43:01 -07:00
tqchen
be50e7b632 Make rabit library thread local 2016-03-01 20:12:51 -08:00
tqchen
aeb4008606 remove connect msg 2016-02-29 16:27:48 -08:00
tqchen
1392e9f3da fix travis 2016-02-29 15:51:36 -08:00
tqchen
225f5258c7 [DMLC] Add dep to dmlc logging 2016-02-29 14:59:44 -08:00
tqchen
56ec4263f9 fix type 2016-02-28 13:19:54 -08:00
tqchen
e3188afbe8 fix 2016-02-28 13:09:18 -08:00
tqchen
c7d53aecc3 add link tag 2016-02-28 09:44:11 -08:00
tqchen
26c87ec6e7 fix test 2016-02-28 09:35:08 -08:00
tqchen
f0f07ecd22 fix 2016-02-27 20:51:00 -08:00
tqchen
e814dc8a4b Fix docstring 2016-02-27 18:13:42 -08:00
tqchen
d45fca0298 fix build 2016-02-27 18:10:58 -08:00
tqchen
7479791f6a refactor: librabit 2016-02-27 10:14:41 -08:00
tqchen
73b6e9bbd0 [TRACKER] remove tracker in rabit, use DMLC 2016-02-27 09:07:40 -08:00
tqchen
112d866dc9 [RABIT] fix rabit in local mode 2016-01-12 21:34:26 -08:00
tqchen
05b958c178 [RABIT] Sync with dmlc 2016-01-09 21:43:29 -08:00
Tianqi Chen
bed63208af Merge pull request #26 from DrAndrey/master
Fix bug with name of sleep function
2015-11-18 09:58:21 -08:00
Andrey
291ab05023 Remove redundant whitespace again 2015-11-18 10:21:03 +03:00
Andrey
de251635b1 Remove redundant whitespace 2015-11-18 00:53:53 +03:00
Andrey
3a6be65a20 Fix bug with name of sleep function 2015-11-17 21:45:52 +03:00
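The diff is not shown here, but a classic bug matching this title is calling the POSIX sleep() (seconds) where Windows only offers Sleep() (capital S, milliseconds). A hedged illustration of the portable shape, not the actual patch; the helper name is invented:

    #ifdef _WIN32
    #include <windows.h>
    inline void SleepSeconds(int sec) { Sleep(sec * 1000); }  // Sleep() takes ms
    #else
    #include <unistd.h>
    inline void SleepSeconds(int sec) { sleep(sec); }         // sleep() takes seconds
    #endif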
Tianqi Chen
e81a11dd7e Merge pull request #25 from daiyl0320/master
add retry mechanism to ConnectTracker and modify Listen backlog to 128 in rabit_tracker.py
2015-10-20 19:34:01 -07:00
yonglong.dyl
35c3b371ea add retry mechanism to ConnectTracker and modify Listen backlog to 128 in rabit_tracker.py
2015-10-21 10:24:07 +08:00
tqchen
c71ed6fccb try deploy doxygen 2015-08-22 21:37:14 -07:00
tqchen
62e5647a33 try deploy doxygen 2015-08-22 21:33:23 -07:00
tqchen
732f1c634c try 2015-08-21 08:40:55 -07:00
tqchen
2fa6e0245a ok 2015-08-21 08:08:32 -07:00
tqchen
053766503c minor 2015-08-21 08:07:25 -07:00
tqchen
7b59dcb8b8 minor 2015-08-21 07:59:06 -07:00
tqchen
5934950ce2 new doc 2015-08-02 17:54:46 -07:00
tqchen
f5381871a3 ok 2015-08-01 21:40:05 -07:00
tqchen
44b60490f4 new doc 2015-08-01 21:36:09 -07:00
tqchen
387339bf17 add more 2015-07-30 18:16:15 -07:00
tqchen
9d4397aa4a chg 2015-07-30 17:59:16 -07:00
tqchen
2879a4853b chg 2015-07-30 17:58:42 -07:00
tqchen
30e3110170 ok 2015-07-28 23:18:15 -07:00
tqchen
9ff0301515 add link translation 2015-07-28 23:16:48 -07:00
tqchen
6b629c2e81 k 2015-07-27 18:41:17 -07:00
tqchen
32e19558e6 ok 2015-07-27 18:38:22 -07:00
tqchen
8f4839d1d9 fix 2015-07-27 18:34:43 -07:00
tqchen
93137b2e52 ok 2015-07-27 18:34:07 -07:00
tqchen
7eeeb79599 reload recommonmark 2015-07-27 18:33:19 -07:00
tqchen
a8f00cc4a5 minor 2015-07-27 18:16:03 -07:00
tqchen
19b0f019c7 ok 2015-07-27 18:14:01 -07:00
tqchen
dd011849b7 minor 2015-07-27 17:59:22 -07:00
tqchen
c1cdc194e9 minor 2015-07-27 17:50:02 -07:00
tqchen
fcf0f4351a try rst 2015-07-27 17:47:28 -07:00
tqchen
cbc21ae531 try 2015-07-27 17:46:08 -07:00
tqchen
62ddfa7709 tiny 2015-07-26 21:13:35 -07:00
tqchen
aefc05cb91 final change 2015-07-26 21:09:58 -07:00
tqchen
2aee9b4959 minor 2015-07-26 20:57:48 -07:00
tqchen
fe4e7c2b96 ok 2015-07-26 20:56:54 -07:00
tqchen
800198349f change to subtitle 2015-07-26 20:54:10 -07:00
tqchen
5ca33e48ea ok 2015-07-26 20:52:52 -07:00
tqchen
88f7d24de9 update guide 2015-07-26 20:52:34 -07:00
tqchen
29d43ab52f add code 2015-07-26 14:57:24 -07:00
tqchen
fe8bb3b60e minor hack for readthedocs 2015-07-26 14:47:40 -07:00
tqchen
229c71d9b5 Merge branch 'master' of ssh://github.com/dmlc/rabit 2015-07-26 14:46:24 -07:00
tqchen
7424218392 ok 2015-07-26 14:46:16 -07:00
Tianqi Chen
d1d45bbdae Update README.md 2015-07-26 14:43:08 -07:00
Tianqi Chen
1e8813f3bd Update README.md 2015-07-26 14:42:57 -07:00
Tianqi Chen
1ccc9903a1 Update README.md 2015-07-26 14:41:25 -07:00
tqchen
0323e0670e remove readme 2015-07-26 14:22:50 -07:00
tqchen
679a835d38 remove theme 2015-07-26 14:14:19 -07:00
tqchen
7ea5b7c209 remove numpydoc to napoleon 2015-07-26 14:02:43 -07:00
tqchen
b73e2be55e Merge branch 'master' of ssh://github.com/dmlc/rabit
Conflicts:
	doc/python-requirements.txt
2015-07-26 13:52:31 -07:00
tqchen
174228356d ok 2015-07-26 13:51:56 -07:00
Tianqi Chen
1838e25b8a Update python-requirements.txt 2015-07-26 13:05:52 -07:00
tqchen
bc4e957c39 ok 2015-07-26 13:00:18 -07:00
tqchen
fba6fc208c ok 2015-07-26 12:54:21 -07:00
tqchen
025110185e ok 2015-07-26 12:52:37 -07:00
tqchen
d50b905824 ok 2015-07-26 12:46:19 -07:00
tqchen
d4f2509178 ok 2015-07-26 12:43:49 -07:00
tqchen
cdf401a77c ok 2015-07-26 12:40:21 -07:00
tqchen
fef0ef26f1 new doc 2015-07-26 12:29:18 -07:00
tqchen
cef360d782 ok 2015-07-26 12:15:00 -07:00
tqchen
c125d2a8bb ok 2015-07-26 12:14:54 -07:00
tqchen
270a49ee75 add requirments 2015-07-23 22:22:52 -07:00
tqchen
744f9015bb get the basic doc 2015-07-23 22:14:42 -07:00
tqchen
1cb5cad50c Merge branch 'master' of ssh://github.com/dmlc/rabit 2015-07-03 17:42:00 -07:00
tqchen
8cc07ba391 minor 2015-07-03 17:41:52 -07:00
Tianqi Chen
d74f126592 Update .travis.yml 2015-07-03 15:35:47 -07:00
Tianqi Chen
52b3dcdf07 Update .travis.yml 2015-07-03 15:33:38 -07:00
Tianqi Chen
099581b591 Update .travis.yml 2015-07-03 15:31:43 -07:00
Tianqi Chen
1258046f14 Update .travis.yml 2015-07-03 15:29:30 -07:00
Tianqi Chen
7addac910b Update Makefile 2015-07-03 15:23:26 -07:00
Tianqi Chen
0ea7adff92 Update .travis.yml 2015-07-03 15:21:20 -07:00
Tianqi Chen
f858856586 Update travis_script.sh 2015-07-03 15:20:59 -07:00
Tianqi Chen
d8eac4ae27 Update README.md 2015-07-03 15:17:22 -07:00
tqchen
3cc49ad0e8 lint and travis 2015-07-03 15:15:11 -07:00
tqchen
ceedf4ea96 fix 2015-05-28 12:37:06 -07:00
Tianqi Chen
fd8920c71d fix win32 2015-05-28 12:24:26 -07:00
tqchen
8bbed35736 modify 2015-05-28 10:44:19 -07:00
Tianqi Chen
9520b90c4f Merge pull request #14 from dmlc/hjk41
add kLongLong and kULongLong
2015-05-20 05:38:01 +02:00
Chuntao Hong
df14bb1671 fix type 2015-05-20 11:36:17 +08:00
Chuntao Hong
f441dc7ed8 replace tab with blank space 2015-05-20 11:33:48 +08:00
Chuntao Hong
2467942886 remove unnecessary include 2015-05-20 11:32:16 +08:00
Chuntao Hong
181ef47053 defined long long and ulonglong 2015-05-20 11:27:50 +08:00
Chuntao Hong
1582180e5b use int32_t to define int and int64_t to define long; in VC, long is 32-bit 2015-05-20 10:09:09 +08:00
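The portability fact behind this commit: 64-bit Windows uses the LLP64 data model, where long remains 32-bit, while 64-bit Linux uses LP64, where long is 64-bit. Mapping serialized types onto <cstdint> widths keeps the wire format identical across platforms; a small self-checking illustration:

    #include <cstdint>
    // Identical on MSVC x64 and Linux x64:
    static_assert(sizeof(std::int32_t) == 4, "always 32-bit");
    static_assert(sizeof(std::int64_t) == 8, "always 64-bit");
    // By contrast, sizeof(long) is 4 under MSVC x64 but 8 under Linux x64.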
tqchen
e0b7da0302 fix 2015-05-02 21:47:43 -07:00
tqchen
fa99857467 try fix warning on some platforms 2015-05-01 22:45:11 -07:00
tqchen
24f17df782 ok 2015-04-29 20:23:39 -07:00
tqchen
4fe8d1d66b ok io 2015-04-29 20:21:37 -07:00
tqchen
a5d77ca08d checkin new dmlc interface 2015-04-29 20:17:27 -07:00
tqchen
d1d2ab4599 remove at end 2015-04-28 10:49:44 -07:00
tqchen
e1ddcc2eb7 Merge branch 'master' of ssh://github.com/dmlc/rabit 2015-04-27 15:55:58 -07:00
tqchen
6745667eb0 new dmlc io 2015-04-27 15:55:51 -07:00
tqchen
c5b4610cfe sge scheduler change 2015-04-26 22:08:47 -07:00
tqchen
fed1683b9b minor 2015-04-25 21:24:38 -07:00
Tianqi Chen
c01520f173 change 2015-04-25 21:23:16 -07:00
tqchen
27340f95e4 final minor 2015-04-25 21:19:42 -07:00
Tianqi Chen
e03eabccda allow win32 2015-04-25 21:18:36 -07:00
tqchen
82ca10acb6 better handling at msvc 2015-04-25 20:52:07 -07:00
Tianqi Chen
6601939588 Merge pull request #12 from zjf/patch-2
Update rabit-inl.h
2015-04-23 23:16:49 -07:00
Jianfeng Zhu
df8f917463 Update rabit-inl.h
Fix missing parenthesis
2015-04-24 14:09:47 +08:00
tqchen
c60b284e1f resize during tracker print 2015-04-20 11:37:45 -07:00
tqchen
c67967161e fix io style 2015-04-19 00:21:38 -07:00
tqchen
f52daf9be1 make timer cross platform 2015-04-19 00:01:48 -07:00
tqchen
7568f75f45 new io interface 2015-04-17 20:35:44 -07:00
tqchen
3bf8661ec1 add std before basic 2015-04-13 13:43:34 -07:00
tqchen
18f4d6c0ba remove rabit learn 2015-04-11 20:25:52 -07:00
tqchen
bcfbe51e7e fix dmlc io 2015-04-11 18:16:52 -07:00
tqchen
ad383b084d ok 2015-04-11 17:55:20 -07:00
tqchen
3b8c04a902 Merge branch 'master' of ssh://github.com/dmlc/rabit 2015-04-11 17:35:11 -07:00
tqchen
9dd97cc141 keepup with dmlc core 2015-04-11 17:35:03 -07:00
Ubuntu
ef13aaf379 ch 2015-04-11 05:29:07 +00:00
tqchen
50a66b3855 fix empty engine 2015-04-09 08:44:33 -07:00
tqchen
e08542c635 fix doc 2015-04-08 15:30:56 -07:00
tqchen
e95c96232a remove I prefix from interface, serializable now takes in pointer 2015-04-08 15:25:58 -07:00
tqchen
b15f6cd2ac rabit unifies with dmlc 2015-04-05 09:55:24 -07:00
tqchen
5634ec3008 ok 2015-04-03 22:25:33 -07:00
tqchen
2dd6c2f0c9 Merge branch 'master' of ssh://github.com/dmlc/rabit 2015-03-30 22:18:20 -07:00
tqchen
38d7f999a7 checkin wormhole spliter 2015-03-30 22:18:02 -07:00
Tianqi Chen
8acb96a627 Merge pull request #10 from ryanzz/master
fixed a mistake
2015-03-30 08:46:15 -07:00
ryanzz
911a1f0ce2 fixed a mistake 2015-03-30 16:25:36 +08:00
tqchen
732d8c33d1 interface changing 2015-03-29 22:00:37 -07:00
tqchen
684ea0ad26 interface changing 2015-03-29 22:00:33 -07:00
tqchen
8cb4c02165 add dmlc support 2015-03-28 22:44:10 -07:00
tqchen
be2ff703bc allow adapting wormhole 2015-03-27 17:33:51 -07:00
tqchen
16975b447c try pass on tokens during application submission 2015-03-27 11:04:19 -07:00
tqchen
eb1f4a4003 change auto to ip 2015-03-26 23:26:30 -07:00
tqchen
59e63bc135 minor 2015-03-21 00:38:37 -07:00
tqchen
62330505e1 ok 2015-03-21 00:37:59 -07:00
tqchen
14477f9f5a add namenode 2015-03-21 00:35:30 -07:00
tqchen
75a6d349c6 add libhdfs opts 2015-03-21 00:26:30 -07:00
tqchen
e3c76bfafb minimum fix 2015-03-21 00:25:16 -07:00
tqchen
8b3c435241 chg 2015-03-20 15:11:50 -07:00
tqchen
2035799817 test code 2015-03-20 13:02:46 -07:00
tqchen
7751b2b320 add debug 2015-03-15 23:52:16 -07:00
tqchen
769031375a ok 2015-03-15 23:47:24 -07:00
tqchen
bd346b4844 ok 2015-03-15 23:44:32 -07:00
tqchen
faba1dca6c add testload 2015-03-15 23:42:18 -07:00
tqchen
6f7783e4f6 add testload 2015-03-15 23:42:17 -07:00
tqchen
e5f034040e ok 2015-03-15 23:20:30 -07:00
tqchen
3ed9ec808f chg 2015-03-15 23:19:54 -07:00
tqchen
e552ac401e ask for more ram in am 2015-03-15 23:14:56 -07:00
tqchen
b2505e3d6f only stop nm on success 2015-03-15 23:02:15 -08:00
tqchen
bc696c9273 add queue info 2015-03-15 22:54:09 -07:00
tqchen
f3e867ed97 add option queue 2015-03-15 22:38:51 -07:00
tqchen
5dc843cff3 refactor fileio 2015-03-14 16:46:54 -07:00
tqchen
cd9c81be91 quick fix 2015-03-14 09:20:04 -07:00
tqchen
1e23af2adc add virtual destructor to iseekstream 2015-03-14 00:20:37 -07:00
tqchen
f165ffbc95 fix hdfs 2015-03-13 22:59:04 -07:00
tqchen
8cc650847a allow demo to pass in env 2015-03-13 22:27:36 -07:00
tqchen
fad4d69ee4 ok 2015-03-13 21:38:03 -07:00
tqchen
0fd6197b8b fix more 2015-03-13 21:36:09 -07:00
tqchen
7423837303 fix more 2015-03-13 21:36:08 -07:00
tqchen
d25de54008 add temporal solution, run_yarn_prog.py 2015-03-13 21:13:19 -07:00
tqchen
e5a9e31d13 final attempt 2015-03-13 00:04:51 -07:00
tqchen
ed3bee84c2 add command back 2015-03-12 22:48:30 -07:00
tqchen
07740003b8 add hdfs to resource 2015-03-12 22:43:41 -07:00
tqchen
9b66e7edf2 fix hadoop 2015-03-12 20:57:49 -07:00
tqchen
6812f14886 ok 2015-03-12 09:44:43 -07:00
tqchen
08e1c16dd2 change hadoop prefix back to hadoop home 2015-03-12 09:06:42 -07:00
Tianqi Chen
d6b68286ee Update build.sh 2015-03-12 09:03:02 -07:00
tqchen
146e069000 bugfix: logical boundary for ring buffer 2015-03-11 20:28:34 -07:00
tqchen
19cb685c40 ok 2015-03-12 02:59:50 +00:00
tqchen
4cf3c13750 Merge branch 'master' of ssh://github.com/tqchen/rabit
Conflicts:
	tracker/rabit_tracker.py
2015-03-11 13:35:35 -07:00
tqchen
20daddbeda add tracker 2015-03-11 13:27:23 -07:00
tqchen
c57dad8b17 add ringbased passing and batch schedule 2015-03-11 12:00:19 -07:00
tqchen
295d8a12f1 update 2015-03-10 15:28:10 -07:00
tqchen
994cb02a66 add sge 2015-03-10 15:26:40 -07:00
tqchen
014c86603d OK 2015-03-10 10:51:39 -07:00
tqchen
091634b259 fix 2015-03-09 14:56:01 -07:00
tqchen
d558f6f550 redefine distributed means 2015-03-09 14:43:05 -07:00
tqchen
c8efc01367 more complicated yarn script 2015-03-09 14:36:44 -07:00
tqchen
28ca7becbd add linear readme 2015-03-09 13:12:40 -07:00
tqchen
ca4b20fad1 add linear readme 2015-03-09 13:12:04 -07:00
tqchen
1133628c01 add linear readme 2015-03-09 13:11:17 -07:00
tqchen
6a1167611c update docs 2015-03-09 13:00:34 -07:00
Tianqi Chen
a607047aa1 Update build.sh 2015-03-08 23:55:42 -07:00
tqchen
2c1cfd8be6 complete yarn 2015-03-08 23:51:42 -07:00
tqchen
4f28e32ebd change formater 2015-03-08 12:29:07 -07:00
tqchen
2fbda812bc fix stdin input 2015-03-08 12:22:11 -07:00
tqchen
3258bcf531 checkin yarn master 2015-03-08 11:03:13 -07:00
tqchen
67ebf81e7a allow setup from env variables 2015-03-07 16:45:31 -08:00
tqchen
9b6bf57e79 fix hdfs 2015-03-07 09:08:21 -08:00
tqchen
395d5c29d5 add make system 2015-03-06 22:30:23 -08:00
tqchen
88ce76767e refactor io, initial hdfs file access need test 2015-03-06 22:17:27 -08:00
tqchen
19be870562 chgs 2015-03-06 21:12:04 -08:00
tqchen
a1bd3c64f0 Merge branch 'master' of ssh://github.com/tqchen/rabit 2015-03-06 21:09:59 -08:00
tqchen
1a573f987b introduce input split 2015-03-06 21:08:04 -08:00
tqchen
29476f1c6b fix timer issue 2015-03-06 20:59:10 -08:00
tqchen
d4ec037f2e fix rabit 2015-03-03 13:12:05 -08:00
tqchen
6612fcf36c Merge branch 'master' of ssh://github.com/tqchen/rabit 2015-03-02 16:10:15 -08:00
tqchen
d29892cb22 add mock option statis 2015-03-02 16:10:08 -08:00
tqchen
4fa054e26e new tracker 2015-03-02 07:32:25 +00:00
tqchen
75c647cd84 update tracker for host IP 2015-03-01 23:27:59 -08:00
tqchen
e4ce8efab5 add hadoop linear example 2015-03-02 04:36:48 +00:00
tqchen
76ecb4a031 add hadoop linear example 2015-03-02 04:35:56 +00:00
Ubuntu
2e1c4c945e add hadoop linear example 2015-03-02 04:35:01 +00:00
tqchen
4db0a62a06 bugfix of lazy prepare 2015-02-11 20:31:46 -08:00
tqchen
87017bd4cd license 2015-02-11 14:49:51 -08:00
tqchen
dc703e1b62 license 2015-02-11 14:48:59 -08:00
tqchen
c171440324 change license to bsd 2015-02-11 14:44:26 -08:00
Tianqi Chen
7db2070598 Update README.md 2015-02-09 20:53:29 -08:00
tqchen
581fe06a9b add mocktest 2015-02-09 20:46:38 -08:00
tqchen
d2f252f87a ok 2015-02-09 20:35:30 -08:00
tqchen
4a5b9e5f78 add all 2015-02-09 20:26:39 -08:00
tqchen
12ee049a74 init version of lbfgs 2015-02-09 17:44:32 -08:00
tqchen
37a28376bb complete lbfgs solver 2015-02-09 11:04:19 -08:00
tqchen
6ade7cba94 complete lbfgs 2015-02-08 23:08:59 -08:00
tqchen
1bb8fe9615 chg makefile 2015-01-30 16:46:10 -08:00
tqchen
fb13cab216 change makefile 2015-01-30 16:30:45 -08:00
Tyler
1479e370f8 fixed small bug in mpi submission script 2015-01-25 00:12:46 -08:00
Tianqi Chen
0ca7a63670 Update README.md 2015-01-22 09:16:46 -08:00
tqchen
5ef4830b55 ok 2015-01-20 20:30:22 -08:00
tqchen
93a13381c1 chg note 2015-01-20 20:27:43 -08:00
tqchen
4ebe657dd7 fix in cxx11 2015-01-19 21:37:02 -08:00
tqchen
85b746394e change def of reducer to take function ptr 2015-01-19 21:24:52 -08:00
tqchen
fe6366eb40 add engine base 2015-01-19 19:11:15 -08:00
Tianqi Chen
a98720ebc9 more deps 2015-01-19 08:20:43 -08:00
tqchen
1db6449b01 remove include in -I, make things easier to direct compile 2015-01-18 21:30:19 -08:00
tqchen
c7282acb2a doc 2015-01-18 19:55:04 -08:00
tqchen
f332750359 minor fix 2015-01-18 18:17:41 -08:00
tqchen
9edb3b306f update doc 2015-01-18 18:14:20 -08:00
tqchen
c46120a46b add win32 ver 2015-01-16 21:10:47 -08:00
Tianqi Chen
537497f520 changes 2015-01-16 21:10:01 -08:00
Tianqi Chen
56a80f431b check in windows solutions, pass small test in windows 2015-01-16 20:56:34 -08:00
tqchen
774d501c1f add languages 2015-01-16 11:13:27 -08:00
tqchen
7396c87249 chg 2015-01-16 10:53:31 -08:00
tqchen
c7533f92bb design goal 2015-01-16 10:50:05 -08:00
tqchen
38b7fec37a ok 2015-01-16 10:46:55 -08:00
tqchen
c798fc2a29 change toolkit to rabitlearn 2015-01-16 10:45:54 -08:00
tqchen
f5245c615c ok 2015-01-16 10:12:47 -08:00
nachocano
aebb7998a3 updating doc 2015-01-16 00:45:04 -08:00
nachocano
b87da8fe9a small typo 2015-01-15 10:52:39 -08:00
tqchen
1f35478b82 chg docs 2015-01-15 10:29:32 -08:00
tqchen
6d5ac6446c chg test folder 2015-01-15 10:24:58 -08:00
tqchen
8f23eb11d7 change convention 2015-01-15 10:22:59 -08:00
tqchen
0617281863 phrase python as a lib 2015-01-15 10:09:14 -08:00
nachocano
7d67f6f26d removing section 2015-01-15 01:24:04 -08:00
nachocano
34c8253ad6 Merge branch 'master' of https://github.com/tqchen/allreduce 2015-01-15 01:22:21 -08:00
nachocano
86e61ad6a5 adding changes suggested by Tianqi 2015-01-15 01:21:40 -08:00
tqchen
6dbaddd2b9 ok 2015-01-14 22:11:00 -08:00
tqchen
a7faac2f09 ok 2015-01-14 21:59:45 -08:00
tqchen
f161d2f1e5 fix bug in initialization of routing 2015-01-14 19:40:41 -08:00
tqchen
797fe27efe struct return type version 2015-01-14 15:43:28 -08:00
tqchen
a57c5c5425 add more error reporting when things go wrong, needs review 2015-01-14 15:32:36 -08:00
tqchen
968b33ec79 set all tracker threads to daemon 2015-01-14 12:05:00 -08:00
tqchen
87c7817124 add lazy check, need test, find a race condition 2015-01-14 11:58:43 -08:00
Tianqi Chen
bddfa2fc24 Merge pull request #7 from lqhl/master
update the fault tolerance section
2015-01-14 10:09:04 -08:00
Tianqi Chen
d05df9836b Merge pull request #8 from cblsjtu/master
correct a mistake
2015-01-14 10:07:17 -08:00
Boliang Chen
2f2e481fc3 correct a mistake 2015-01-14 20:53:34 +08:00
Qin Liu
1dda51f1fa update the fault tolerance section 2015-01-14 17:07:30 +08:00
tqchen
348a1e7619 change default behavior to behave normal 2015-01-13 22:21:15 -08:00
tqchen
478d250818 minor change 2015-01-13 20:01:15 -08:00
tqchen
532575b752 ok 2015-01-13 14:41:37 -08:00
tqchen
c127f9650c Merge branch 'master' of ssh://github.com/tqchen/rabit 2015-01-13 14:29:20 -08:00
tqchen
3419cf9aa7 add auto caching of python in hadoop script, mock test module to python, with checkpt 2015-01-13 14:29:10 -08:00
tqchen
877fc42e40 add data 2015-01-13 12:51:55 -08:00
nachocano
f79e5fc041 adding more stuff 2015-01-13 01:00:58 -08:00
nachocano
95c6d7398f adding more stuff 2015-01-13 00:59:20 -08:00
nachocano
5c7967e863 adding link 2015-01-13 00:49:57 -08:00
nachocano
54e2f7e90d adding wrapper section 2015-01-13 00:48:37 -08:00
nachocano
48c42bf189 fixing stuff 2015-01-13 00:18:46 -08:00
nachocano
92c94176c1 adding some changes to kmeans 2015-01-13 00:13:05 -08:00
tqchen
15e085cd32 basic allreduce lib ready 2015-01-12 22:59:36 -08:00
tqchen
2d72c853df checkin broadcast python module 2015-01-12 22:32:13 -08:00
tqchen
9a4a81f100 add wrapper 2015-01-12 21:33:01 -08:00
tqchen
61626aaf85 add more data types 2015-01-12 20:45:07 -08:00
tqchen
5a457d69fc Merge branch 'master' of ssh://github.com/tqchen/rabit
Conflicts:
	tracker/rabit_hadoop.py
2015-01-12 12:03:00 -08:00
tqchen
7572794add add stacklevel for rabit 2015-01-12 12:02:28 -08:00
Tianqi Chen
60a10b3322 Merge pull request #6 from cblsjtu/master
modify some explanation
2015-01-12 08:55:41 -08:00
Boliang Chen
ec3fd9bd2a modify some explanation 2015-01-12 23:46:20 +08:00
Boliang Chen
34cde09b2b modify some explanation 2015-01-12 23:41:45 +08:00
nachocano
8dd94461e1 guide 2015-01-12 00:24:46 -08:00
nachocano
9e04ab62fb adding breaks 2015-01-12 00:23:42 -08:00
nachocano
9907bafa1d fix 2015-01-12 00:20:43 -08:00
nachocano
30f3971bee adding more description to toolkit 2015-01-12 00:14:40 -08:00
tqchen
6b651176a3 yarn is part of hadoop script 2015-01-11 21:28:13 -08:00
tqchen
a120edc56e shorter 2015-01-11 11:48:08 -08:00
tqchen
5146409a1d simpler 2015-01-11 11:47:37 -08:00
tqchen
db2ebf7410 use unified script, auto detect hadoop version 2015-01-11 11:46:12 -08:00
tqchen
bfc3f61010 minor 2015-01-11 11:15:12 -08:00
tqchen
78bfe867e6 unify hadoop and yarn script 2015-01-11 11:13:02 -08:00
Tianqi Chen
03dca6d6b3 Merge pull request #5 from EricChenDM/master
add yarn script
2015-01-11 10:41:08 -08:00
chenshuaihua
b2dec95862 yarn script 2015-01-12 00:09:00 +08:00
chenshuaihua
26b5fdac40 yarn script 2015-01-11 23:54:31 +08:00
chenshuaihua
00323f462a yarn script 2015-01-11 23:32:14 +08:00
chenshuaihua
981f69ff55 yarn script 2015-01-11 23:23:58 +08:00
chenshuaihua
5e843cfbbd yarn script 2015-01-11 23:22:26 +08:00
chenshuaihua
b5ac85f103 yarn script 2015-01-11 23:19:04 +08:00
chenshuaihua
d81fb6a9e6 test 2015-01-11 21:59:38 +08:00
nachocano
d269cb9c50 guide stuff 2015-01-11 01:43:32 -08:00
nachocano
2d97833f48 slightly change 2015-01-11 01:35:04 -08:00
nachocano
eef79067a8 more cosmetic stuff 2015-01-11 01:31:10 -08:00
nachocano
aea4c10847 cosmetic changes to tutorial 2015-01-11 01:07:51 -08:00
Tianqi Chen
7eb4258951 Merge pull request #4 from cblsjtu/master
explain time out
2015-01-10 23:43:01 -08:00
Boliang Chen
c6d0be57d4 explain timeout 2015-01-11 15:39:50 +08:00
Boliang Chen
80b0d06b7e merge from tqchen 2015-01-11 14:56:20 +08:00
Boliang Chen
8685b740cc Merge remote-tracking branch 'tqchen/master' 2015-01-11 14:53:10 +08:00
Boliang Chen
7fa23f2d2f modify default jobname 2015-01-11 14:52:48 +08:00
tqchen
ed264002a0 Merge branch 'master' of ssh://github.com/tqchen/rabit
Conflicts:
	tracker/rabit_hadoop.py
2015-01-10 22:50:38 -08:00
tqchen
2e3361f0e0 fix -f 2015-01-10 22:49:56 -08:00
Boliang Chen
363994f29d Merge remote-tracking branch 'tqchen/master' 2015-01-11 13:46:32 +08:00
Boliang Chen
3f4bf96c5d temp 2015-01-11 13:46:18 +08:00
tqchen
0100fdd18d auto jobname 2015-01-10 21:21:39 -08:00
cblsjtu
c0f85c681e Merge pull request #1 from tqchen/master
merge from tqchen
2015-01-11 11:00:08 +08:00
tqchen
43c129f431 chg script 2015-01-10 17:49:09 -08:00
tqchen
500a57697d chg script 2015-01-10 17:45:53 -08:00
tqchen
c2ab64afe3 fix comment 2015-01-10 10:01:31 -08:00
tqchen
6b30fb2bea update cache script 2015-01-10 09:58:10 -08:00
Tianqi Chen
9d34d2e036 Merge pull request #1 from cblsjtu/master
fix several bugs
2015-01-10 09:29:55 -08:00
Boliang Chen
76c15dffde remove blank 2015-01-11 00:16:05 +08:00
Boliang Chen
d986693fbd fix bugs 2015-01-11 00:14:37 +08:00
Boliang Chen
7f5cb3aa0e modify hs 2015-01-10 10:58:53 +08:00
Boliang Chen
697a01bfb4 har -> jar 2015-01-10 10:54:12 +08:00
tqchen
1b4921977f update doc 2015-01-03 05:20:18 -08:00
tqchen
be355c1e60 minor 2015-01-01 06:06:55 -08:00
tqchen
d10a435d64 correct 2015-01-01 06:06:02 -08:00
tqchen
eb2b086b65 ok 2015-01-01 06:04:02 -08:00
tqchen
08ca3b0849 add more links 2015-01-01 06:02:32 -08:00
tqchen
61f21859d9 add api 2015-01-01 05:57:46 -08:00
tqchen
2bfbbfb381 checkin API doc 2015-01-01 05:48:34 -08:00
tqchen
31a3d22af4 add broadcast 2015-01-01 05:42:38 -08:00
tqchen
90a8505208 update guide 2015-01-01 05:42:03 -08:00
tqchen
06206e1d03 start checkin guides 2014-12-30 06:22:54 -08:00
tqchen
bfb9aa3d77 add native script 2014-12-30 04:37:50 -08:00
tqchen
1bcea65117 change nslave to nworker 2014-12-29 18:44:30 -08:00
tqchen
bdfa1a0220 change nslave to nworker 2014-12-29 18:42:24 -08:00
tqchen
39504825d8 add kmeans example 2014-12-29 18:32:56 -08:00
tqchen
76abd80cb7 change indentation 2014-12-29 18:17:20 -08:00
tqchen
b1340bf310 add auto cache 2014-12-29 06:50:17 -08:00
tqchen
c731e82fae add command 2014-12-29 06:37:07 -08:00
tqchen
491716c418 chg 2014-12-29 06:21:34 -08:00
tqchen
d64d0ef1dc cleanup submission script 2014-12-29 06:11:58 -08:00
tqchen
27d6977a3e cpplint pass 2014-12-28 05:12:07 -08:00
tqchen
15836eb98e add task id 2014-12-22 04:17:23 -08:00
tqchen
0dd51d5dd0 add attempt id for hadoop 2014-12-22 04:12:38 -08:00
tqchen
6e6031cbe9 add mock 2014-12-22 03:59:01 -08:00
tqchen
d82a6ed811 add file command 2014-12-22 03:48:14 -08:00
tqchen
ab7492dbc2 add support for yarn 2014-12-22 03:24:00 -08:00
tqchen
d3433c5946 change script 2014-12-22 01:54:11 -08:00
tqchen
975bcc8261 fix 2014-12-22 01:26:59 -08:00
tqchen
dd8d9646c4 rm mpi dep 2014-12-22 01:25:06 -08:00
tqchen
bb2ecc6ad5 remove c++11 2014-12-22 01:10:14 -08:00
tqchen
7a2ae105ea fix script 2014-12-22 01:03:12 -08:00
tqchen
fd533d9a76 add kmeans 2014-12-22 00:32:08 -08:00
tqchen
5fe3c58b4a add kmeans hadoop 2014-12-22 00:31:01 -08:00
tqchen
dcb6e22a9e add mapred tasks 2014-12-22 00:20:13 -08:00
tqchen
12399a1d42 add more mocktest 2014-12-21 17:59:12 -08:00
tqchen
a624051b85 add keepalive to socket, fix recovery problem when a node is a requester and passes data 2014-12-21 17:55:08 -08:00
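A sketch of enabling TCP keepalive on a connected POSIX socket, the kind of change this message describes (function name invented, error handling elided, not the literal patch):

    #include <sys/socket.h>

    void EnableKeepAlive(int fd) {
      int on = 1;  // turn on periodic probes so dead peers are detected
      setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    }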
tqchen
cfea4dbe85 fix rabit for single node without initialization 2014-12-21 04:35:32 -08:00
tqchen
e40047f9c2 new mock test 2014-12-20 18:38:54 -08:00
tqchen
10bb407a2c add mock engine 2014-12-20 18:31:33 -08:00
tqchen
ecf91ee081 change usage 2014-12-20 16:54:15 -08:00
tqchen
925d014271 change file structure 2014-12-20 16:19:54 -08:00
tqchen
77d74f6c0d fix bug in lambda allreduce 2014-12-20 05:04:16 -08:00
tqchen
5570e7ceae add complex types 2014-12-19 21:12:10 -08:00
tqchen
e72a869fd1 add complex reducer in 2014-12-19 20:57:53 -08:00
tqchen
2c0a0671ad skip actions when there is only 1 node 2014-12-19 19:21:21 -08:00
tqchen
6151899ce2 add tracker print 2014-12-19 18:40:06 -08:00
tqchen
6bf282c6c2 isolate iserializable 2014-12-19 17:36:42 -08:00
tqchen
8c35cff02c improve script 2014-12-19 04:21:16 -08:00
tqchen
9f42b78a18 improve tracker script 2014-12-19 04:20:45 -08:00
tqchen
69d7f71ae8 change kmeans to using lambda 2014-12-19 02:12:53 -08:00
tqchen
1754fdbf4e enable support for lambda preprocessing function, and c++11 2014-12-19 02:00:43 -08:00
tqchen
58331067f8 cleanup testcases 2014-12-18 23:50:59 -08:00
tqchen
aa2cb38543 ResetLink still not ok 2014-12-18 21:45:38 -08:00
tqchen
6b18ee9edb Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-18 19:02:05 -08:00
tqchen
c8faed0b54 pass local model recover test 2014-12-18 18:53:58 -08:00
tqchen
dbd05a65b5 nice fix, start check local check 2014-12-18 18:39:24 -08:00
Tianqi Chen
31403a41cd Update rabit.h 2014-12-09 21:03:41 -08:00
tqchen
3f22596e3c check in license 2014-12-09 20:57:54 -08:00
tqchen
cc5efb8d81 Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-09 20:56:33 -08:00
root
5aff7fab29 adding : 2014-12-08 17:15:49 +00:00
root
dfb3961eea changing port 2014-12-08 17:13:42 +00:00
Tianqi Chen
39f2dcdfef Update rabit_tracker.py 2014-12-08 08:36:55 -08:00
tqchen
2750679270 normal state running ok 2014-12-07 20:57:29 -08:00
tqchen
b38fa40fa6 fix ring passing 2014-12-07 20:25:42 -08:00
tqchen
8d570b54c7 add code to help link reuse, start test numreplica 2014-12-07 16:22:02 -08:00
tqchen
e2adce1cc1 add ring setup version 2014-12-07 16:09:28 -08:00
tqchen
322e40c72e Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-06 23:00:18 -08:00
tqchen
328cf187ba check in the ring passing 2014-12-06 23:00:10 -08:00
nachocano
20b03e781c to run all executables 2014-12-06 15:37:09 -08:00
nachocano
fcf2f0a03d to stderr 2014-12-06 15:22:29 -08:00
nachocano
cd8ab469ff Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-06 15:14:19 -08:00
nachocano
659b9cd517 changing number of repetitions 2014-12-06 15:14:14 -08:00
root
52d472c209 using hostfile 2014-12-06 20:30:35 +00:00
nachocano
9ed59e71f6 speed runner 2014-12-06 12:09:40 -08:00
nachocano
e0053c62e1 adding executable 2014-12-06 12:05:08 -08:00
nachocano
8f0d7d1d3e changing to -ho not to conflict with help 2014-12-06 12:01:05 -08:00
nachocano
771891491c Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-06 11:59:22 -08:00
nachocano
f203d13efc speed runner 2014-12-06 11:59:16 -08:00
nachocano
14e400226a submit mpi to include machine file 2014-12-06 11:33:05 -08:00
tqchen
58f80c5675 Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-06 11:25:18 -08:00
tqchen
4a7d84e861 chg string bcast 2014-12-06 11:25:08 -08:00
tqchen
1519f74f3c ok 2014-12-06 11:20:52 -08:00
tqchen
0e012cb05e add speed test 2014-12-06 11:05:24 -08:00
tqchen
19631ecef6 more tracker renaming 2014-12-06 09:24:12 -08:00
tqchen
a569bf2698 change gitignore 2014-12-06 09:19:08 -08:00
tqchen
dc12958fc7 rename master to tracker, to emphasize rabit is p2p in computing 2014-12-06 09:15:31 -08:00
nachocano
67b68ceae6 adding timing 2014-12-05 16:00:47 -08:00
nachocano
54eb5623cb worked on my machine !!! finally 2014-12-05 15:24:00 -08:00
nachocano
d9c22e54de closer, but still does not work... stays in map 100%. I think an exception is being thrown 2014-12-05 13:28:42 -08:00
tqchen
7765e2dc55 add status report 2014-12-05 09:49:26 -08:00
tqchen
ab278513ab ok 2014-12-05 09:39:51 -08:00
Tianqi Chen
e7a22792ac Update submit_job_hadoop.py 2014-12-05 09:14:44 -08:00
Tianqi Chen
e05098cacb Update submit_job_hadoop.py 2014-12-05 09:10:26 -08:00
Tianqi Chen
f9e95ab522 Update submit_job_hadoop.py 2014-12-05 09:09:20 -08:00
nachocano
bb7d6814a7 creating initial version of hadoop submit script. Not working.
Not sure how to get the master uri and port. I believe I cannot do it before I launch the job.

Updating the name from submit_job to submit_job_mpi
2014-12-05 03:27:02 -08:00
nachocano
e00fb99e7b cosmetic 2014-12-04 19:02:11 -08:00
nachocano
e9a3f5169e cosmetic changes 2014-12-04 18:02:07 -08:00
tqchen
1af3e81ada chg robust to reliable 2014-12-04 17:32:22 -08:00
tqchen
7cd5474f1a chg interface 2014-12-04 17:31:40 -08:00
tqchen
821eb21ae2 before make rabit public 2014-12-04 17:30:58 -08:00
tqchen
cc410b8c90 add local model in checkpoint interface, a new goal 2014-12-04 11:09:15 -08:00
tqchen
79e7862583 change note 2014-12-04 09:09:56 -08:00
tqchen
f9d634ce06 change notes 2014-12-04 09:09:29 -08:00
tqchen
65a1cdf8e5 remove doc from main repo 2014-12-04 09:07:36 -08:00
tqchen
67229fd7a9 change model 2014-12-04 09:05:48 -08:00
tqchen
3033177e9e ok 2014-12-03 22:36:16 -08:00
tqchen
656a8fa3a2 ok 2014-12-03 22:32:30 -08:00
tqchen
0e9b64649a ok 2014-12-03 22:30:23 -08:00
tqchen
9da3c6c573 Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-03 22:28:59 -08:00
tqchen
09a1305628 chg readme 2014-12-03 22:27:52 -08:00
nachocano
7d314fef78 open for writing 2014-12-03 21:58:58 -08:00
nachocano
dece767084 Revert "open for writing"
This reverts commit 63bf9c7995.
2014-12-03 21:58:33 -08:00
nachocano
63bf9c7995 open for writing 2014-12-03 21:58:17 -08:00
tqchen
1c76483b4b ok 2014-12-03 21:53:34 -08:00
tqchen
9abe6ad4d8 checkin makefile 2014-12-03 21:30:11 -08:00
tqchen
8175df1002 bug fix in kmeans 2014-12-03 20:05:16 -08:00
tqchen
a1a1a8895e add kmeans 2014-12-03 18:23:58 -08:00
tqchen
69af79d45d sparse kmeans 2014-12-03 18:15:28 -08:00
nachocano
e3a95b2d1a Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-03 15:39:05 -08:00
nachocano
5c23b94069 updating kmeans based on Tianqi feedback. More efficient now 2014-12-03 15:38:58 -08:00
tqchen
85bb6cd027 Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-03 15:13:09 -08:00
tqchen
90b9f1a98a add keepalive script 2014-12-03 15:04:30 -08:00
nachocano
55c2a5dc83 Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-03 14:21:42 -08:00
nachocano
1d0d5bb141 kmeans seems to be working.. not restarting anything though 2014-12-03 14:21:10 -08:00
tqchen
7a983a4079 add keepalive 2014-12-03 13:21:30 -08:00
tqchen
2523288509 basic recovery works 2014-12-03 12:19:08 -08:00
tqchen
8a6768763d bug fixed ver 2014-12-03 11:51:39 -08:00
tqchen
a186f8c3aa ok 2014-12-03 11:19:43 -08:00
tqchen
ceeb6f0690 bug version, check in and rollback 2014-12-03 11:17:39 -08:00
tqchen
f3e5b6e13c ok 2014-12-03 10:00:47 -08:00
tqchen
34f2f887b1 add more broadcast and basic broadcast 2014-12-03 09:59:13 -08:00
nachocano
20b51cc9ce cleaner 2014-12-03 01:44:34 -08:00
nachocano
56aad86231 adding incomplete kmeans.
I'm having a problem with the broadcast, and still need to implement the logic
2014-12-03 01:16:13 -08:00
tqchen
ed1de6df80 change AllReduce to Allreduce 2014-12-02 21:11:48 -08:00
nachocano
8cb5b68cb6 Merge branch 'master' of https://github.com/tqchen/allreduce 2014-12-02 11:28:27 -08:00
nachocano
e4abca9494 changing report folder to doc 2014-12-02 11:28:20 -08:00
tqchen
0a3300d773 rabit run on MPI 2014-12-02 11:20:19 -08:00
nachocano
2fab05c83e adding some design goals. 2014-12-02 11:07:07 -08:00
nachocano
40f7ee1cab adding simple image 2014-12-02 01:49:54 -08:00
nachocano
2c166d7a3a adding some initial skeleton of the report. 2014-12-02 01:19:36 -08:00
tqchen
dcea64c838 check in model recover 2014-12-01 21:41:37 -08:00
tqchen
255218a2f3 change in interface, seems resetlink is still bad 2014-12-01 21:39:51 -08:00
tqchen
b76cd5858c seems ok version 2014-12-01 20:18:25 -08:00
tqchen
46b5d46111 fix one bug, another comes 2014-12-01 19:53:41 -08:00
tqchen
993ff8bb91 find one bug, continue to next one 2014-12-01 19:34:27 -08:00
tqchen
2cde04867f Merge branch 'master' of ssh://github.com/tqchen/rabit 2014-12-01 16:57:33 -08:00
tqchen
337840d29b recover not yet working 2014-12-01 16:57:26 -08:00
Tianqi Chen
fd2c57b8a4 Update engine_robust.cc 2014-12-01 15:32:57 -08:00
tqchen
1c5167d96e rabit seems ready to run 2014-12-01 10:32:30 -08:00
Tianqi Chen
0d63646015 Update README.md 2014-12-01 10:04:10 -08:00
Tianqi Chen
b5367f48f6 Update README.md 2014-12-01 10:03:45 -08:00
Tianqi Chen
62c8ce9657 Update README.md 2014-12-01 10:03:31 -08:00
tqchen
eb2ca06d67 fresh name fresh start 2014-12-01 09:17:05 -08:00
tqchen
16f729115e checkin allreduce recover 2014-11-30 22:41:04 -08:00
tqchen
9355f5faf2 more conservative exception watching 2014-11-30 21:39:22 -08:00
tqchen
8cef2086f5 smarter select for allreduce and bcast 2014-11-30 21:31:45 -08:00
tqchen
f7928c68a3 next round try more careful select design 2014-11-30 21:07:34 -08:00
tqchen
ecb09a23bc add recover data, do a round of review 2014-11-30 20:59:55 -08:00
tqchen
b9b58a1275 bugfix in decide 2014-11-30 17:48:30 -08:00
tqchen
4a6c01c83c minor change in decide 2014-11-30 17:48:02 -08:00
tqchen
27f6f8ea9e bugfix in msg passing 2014-11-30 17:42:18 -08:00
tqchen
d8d648549f finish message passing, do a review on msg passing and decide 2014-11-30 17:40:30 -08:00
tqchen
38cd595235 check in message passing 2014-11-30 16:38:47 -08:00
tqchen
7a60cb7f3e checkin decide request, todo message passing 2014-11-30 16:37:26 -08:00
tqchen
68f13cd739 tight 2014-11-30 11:46:21 -08:00
tqchen
d1ce3c697c inline 2014-11-30 11:45:50 -08:00
tqchen
2e536eda29 check in the recover strategy 2014-11-30 11:42:59 -08:00
tqchen
155ed3a814 seems an OK version of reset, start to work on decide exec 2014-11-29 22:22:51 -08:00
tqchen
5b0bb53184 refactor code style, reset link still need thoughts 2014-11-29 20:15:27 -08:00
tqchen
42505f473d finish reset link log 2014-11-29 15:14:43 -08:00
tqchen
98756c068a livelock in oob send recv 2014-11-28 21:58:15 -08:00
tqchen
aa54a038f2 livelock in oob send recv 2014-11-28 21:56:58 -08:00
tqchen
a30075794b initial version of robust engine, add discard link, need more random mock test, next milestone will be recovery 2014-11-28 15:56:12 -08:00
nachocano
a8128493c2 execute it like this: ./test.sh 4 4000 testcase0.conf ./
Now we are passing the folder where the round instances are saved.
The problem is that calling utils::Check or utils::Assert on 1 or 2 nodes shuts down all of them. Only those nodes should be shut down, and then this will work. There may be some other mechanism to shut down a particular node. Tianqi?
2014-11-28 01:48:26 -08:00
nachocano
faed8285cd execute it like ./test.sh 4 4000 testcase0.conf to obtain a successful execution
updating mock. It now wraps the calls to sync and reads its config from a configuration file.
I believe it's better not to use the preprocessor directive, i.e. not to put any test code in the engine_tcp. I just call the mock in the test_allreduce file. It's a file purely for testing purposes, so it's fine to use the mock there.
2014-11-28 00:16:35 -08:00
nachocano
21f3f3eec4 adding const to variable to comply with google code convention...
may need to change more stuff though. Tianqi, what else do you mean? Spaces, tabs, names?
2014-11-27 17:03:31 -08:00
tqchen
2f1ba40786 change in socket, to pass out error code 2014-11-27 16:17:07 -08:00
nachocano
c565104491 adding some references to mock inside TEST preprocessor directive.
It shouldn't be an assert because it shuts down the process. Instead it should check the value and return some sort of error, so that we can recover.
The mock contains queues, indexed by the rank of the process. For each node, you can configure the behavior you expect (success or failure for now) when you call any of the methods (AllReduce, Broadcast, LoadCheckPoint and CheckPoint)... If you call AllReduce several times, the outputs will pop from the queue, i.e., first you can retrieve a success, then a failure, and so on.
Pretty basic for now, need to tune it better
2014-11-26 17:24:29 -08:00
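A rough C++ shape of the mock this message describes: per-rank queues of scripted outcomes that pop one entry per call. All names here are hypothetical, not taken from the actual code:

    #include <map>
    #include <queue>

    struct MockEngine {
      // rank -> queue of scripted outcomes; true = succeed, false = fail
      std::map<int, std::queue<bool>> allreduce_script;

      bool AllReduce(int rank) {
        std::queue<bool> &q = allreduce_script[rank];
        if (q.empty()) return true;   // nothing scripted: behave normally
        bool ok = q.front();
        q.pop();                      // each call consumes one scripted outcome
        return ok;                    // false lets the caller exercise recovery
      }
    };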
nachocano
54fcff189f dummy mock for now 2014-11-26 16:37:23 -08:00
tqchen
d37f38c455 initial version of allreduce 2014-11-25 16:15:56 -08:00
Tianqi Chen
5e5bdda491 Initial commit 2014-11-25 14:37:18 -08:00
688 changed files with 47154 additions and 16427 deletions

.github/FUNDING.yml

@@ -1 +1,2 @@
open_collective: xgboost
custom: https://xgboost.ai/sponsors

.github/workflows/jvm_tests.yml (new file)

@@ -0,0 +1,74 @@
name: XGBoost-JVM-Tests
on: [push, pull_request]
jobs:
test-with-jvm:
name: Test JVM on OS ${{ matrix.os }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [windows-latest, ubuntu-latest]
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-python@v2
with:
python-version: '3.8'
architecture: 'x64'
- uses: actions/setup-java@v1
with:
java-version: 1.8
- name: Install Python packages
run: |
python -m pip install wheel setuptools
python -m pip install awscli
- name: Cache Maven packages
uses: actions/cache@v2
with:
path: ~/.m2
key: ${{ runner.os }}-m2-${{ hashFiles('./jvm-packages/pom.xml') }}
restore-keys: ${{ runner.os }}-m2
- name: Test XGBoost4J
run: |
cd jvm-packages
mvn test -B -pl :xgboost4j_2.12
- name: Extract branch name
shell: bash
run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
id: extract_branch
if: |
(github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')) &&
matrix.os == 'windows-latest'
- name: Publish artifact xgboost4j.dll to S3
run: |
cd lib/
Rename-Item -Path xgboost4j.dll -NewName xgboost4j_${{ github.sha }}.dll
dir
python -m awscli s3 cp xgboost4j_${{ github.sha }}.dll s3://xgboost-nightly-builds/${{ steps.extract_branch.outputs.branch }}/ --acl public-read
if: |
(github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')) &&
matrix.os == 'windows-latest'
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_IAM_S3_UPLOADER }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_IAM_S3_UPLOADER }}
- name: Test XGBoost4J-Spark
run: |
rm -rfv build/
cd jvm-packages
mvn -B test
if: matrix.os == 'ubuntu-latest' # Distributed training doesn't work on Windows
env:
RABIT_MOCK: ON


@@ -6,133 +6,227 @@ name: XGBoost-CI
# events but only for the master branch
on: [push, pull_request]
env:
R_PACKAGES: c('XML', 'igraph', 'data.table', 'magrittr', 'stringi', 'ggplot2', 'DiagrammeR', 'Ckmeans.1d.dp', 'vcd', 'testthat', 'lintr', 'knitr', 'rmarkdown', 'e1071', 'cplm', 'devtools')
# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
test-with-jvm:
name: Test JVM on OS ${{ matrix.os }}
gtest-cpu:
name: Test Google C++ test (CPU)
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: [windows-latest, windows-2016, ubuntu-latest]
os: [macos-10.15]
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-java@v1
with:
java-version: 1.8
- name: Cache Maven packages
uses: actions/cache@v2
with:
path: ~/.m2
key: ${{ runner.os }}-m2-${{ hashFiles('./jvm-packages/pom.xml') }}
restore-keys: ${{ runner.os }}-m2
- name: Test JVM packages
- name: Install system packages
run: |
cd jvm-packages
mvn test -pl :xgboost4j_2.12
lintr:
runs-on: ${{ matrix.config.os }}
name: Run R linters on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
strategy:
matrix:
config:
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: r-lib/actions/setup-r@master
with:
r-version: ${{ matrix.config.r }}
- name: Cache R packages
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-1-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-
- name: Install dependencies
shell: Rscript {0}
# Use libomp 11.1.0: https://github.com/dmlc/xgboost/issues/7039
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew install ninja libomp
brew pin libomp
- name: Build gtest binary
run: |
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
- name: Run lintr
mkdir build
cd build
cmake .. -DGOOGLE_TEST=ON -DUSE_OPENMP=ON -DUSE_DMLC_GTEST=ON -DPLUGIN_DENSE_PARSER=ON -GNinja
ninja -v
- name: Run gtest binary
run: |
cd R-package
R.exe CMD INSTALL .
Rscript.exe tests/run_lint.R
test-with-R:
runs-on: ${{ matrix.config.os }}
name: Test R on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
cd build
./testxgboost
ctest -R TestXGBoostCLI --extra-verbose
gtest-cpu-nonomp:
name: Test Google C++ unittest (CPU Non-OMP)
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
config:
- {os: windows-latest, r: 'release', compiler: 'msvc', build: 'autotools'}
- {os: windows-2016, r: 'release', compiler: 'msvc', build: 'autotools'}
- {os: windows-latest, r: 'release', compiler: 'msvc', build: 'cmake'}
- {os: windows-2016, r: 'release', compiler: 'msvc', build: 'cmake'}
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'autotools'}
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'cmake'}
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'cmake'}
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
os: [ubuntu-latest]
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: r-lib/actions/setup-r@master
with:
r-version: ${{ matrix.config.r }}
- name: Cache R packages
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-1-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-
- name: Install dependencies
shell: Rscript {0}
- name: Install system packages
run: |
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
sudo apt-get install -y --no-install-recommends ninja-build
- name: Build and install XGBoost
shell: bash -l {0}
run: |
mkdir build
cd build
cmake .. -GNinja -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON -DUSE_OPENMP=OFF
ninja -v
- name: Run gtest binary
run: |
cd build
ctest --extra-verbose
c-api-demo:
name: Test installing XGBoost lib + building the C API demo
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
os: ["ubuntu-latest"]
python-version: ["3.8"]
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- name: Install system packages
run: |
sudo apt-get install -y --no-install-recommends ninja-build
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
python-version: ${{ matrix.python-version }}
activate-environment: test
- name: Display Conda env
shell: bash -l {0}
run: |
conda info
conda list
- name: Build and install XGBoost static library
shell: bash -l {0}
run: |
mkdir build
cd build
cmake .. -DBUILD_STATIC_LIB=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -GNinja
ninja -v install
cd -
- name: Build and run C API demo with static
shell: bash -l {0}
run: |
pushd .
cd demo/c-api/
mkdir build
cd build
cmake .. -GNinja -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
ninja -v
ctest
cd ..
rm -rf ./build
popd
- name: Build and install XGBoost shared library
shell: bash -l {0}
run: |
cd build
cmake .. -DBUILD_STATIC_LIB=OFF -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -GNinja
ninja -v install
cd -
- name: Build and run C API demo with shared
shell: bash -l {0}
run: |
pushd .
cd demo/c-api/
mkdir build
cd build
cmake .. -GNinja -DCMAKE_PREFIX_PATH=$CONDA_PREFIX
ninja -v
ctest
popd
./tests/ci_build/verify_link.sh ./demo/c-api/build/basic/api-demo
./tests/ci_build/verify_link.sh ./demo/c-api/build/external-memory/external-memory-demo
lint:
runs-on: ubuntu-latest
name: Code linting for Python and C++
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-python@v2
with:
python-version: '3.6' # Version range or exact version of a Python version to use, using SemVer's version range syntax
architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
- name: Test R
python-version: '3.7'
architecture: 'x64'
- name: Install Python packages
run: |
python tests/ci_build/test_r_package.py --compiler="${{ matrix.config.compiler }}" --build-tool="${{ matrix.config.build }}"
python -m pip install wheel setuptools
python -m pip install pylint cpplint numpy scipy scikit-learn
- name: Run lint
run: |
make lint
mypy:
runs-on: ubuntu-latest
name: Type checking for Python
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-python@v2
with:
python-version: '3.7'
architecture: 'x64'
- name: Install Python packages
run: |
python -m pip install wheel setuptools mypy pandas dask[complete] distributed
- name: Run mypy
run: |
make mypy
doxygen:
runs-on: ubuntu-latest
name: Generate C/C++ API doc using Doxygen
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-python@v2
with:
python-version: '3.7'
architecture: 'x64'
- name: Install system packages
run: |
sudo apt-get install -y --no-install-recommends doxygen graphviz ninja-build
python -m pip install wheel setuptools
python -m pip install awscli
- name: Run Doxygen
run: |
mkdir build
cd build
cmake .. -DBUILD_C_DOC=ON -GNinja
ninja -v doc_doxygen
- name: Extract branch name
shell: bash
run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
id: extract_branch
if: github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')
- name: Publish
run: |
cd build/
tar cvjf ${{ steps.extract_branch.outputs.branch }}.tar.bz2 doc_doxygen/
python -m awscli s3 cp ./${{ steps.extract_branch.outputs.branch }}.tar.bz2 s3://xgboost-docs/doxygen/ --acl public-read
if: github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID_IAM_S3_UPLOADER }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY_IAM_S3_UPLOADER }}
sphinx:
runs-on: ubuntu-latest
name: Build docs using Sphinx
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: actions/setup-python@v2
with:
python-version: '3.8'
architecture: 'x64'
- name: Install system packages
run: |
sudo apt-get install -y --no-install-recommends graphviz
python -m pip install wheel setuptools
python -m pip install -r doc/requirements.txt
- name: Extract branch name
shell: bash
run: echo "##[set-output name=branch;]$(echo ${GITHUB_REF#refs/heads/})"
id: extract_branch
if: github.ref == 'refs/heads/master' || contains(github.ref, 'refs/heads/release_')
- name: Run Sphinx
run: |
make -C doc html
env:
SPHINX_GIT_BRANCH: ${{ steps.extract_branch.outputs.branch }}

.github/workflows/python_tests.yml (new file)

@@ -0,0 +1,94 @@
name: XGBoost-Python-Tests
on: [push, pull_request]
jobs:
python-sdist-test:
runs-on: ${{ matrix.os }}
name: Test installing XGBoost Python source package on ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-10.15, windows-latest]
python-version: ["3.8"]
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- name: Install osx system dependencies
if: matrix.os == 'macos-10.15'
run: |
# Use libomp 11.1.0: https://github.com/dmlc/xgboost/issues/7039
wget https://raw.githubusercontent.com/Homebrew/homebrew-core/679923b4eb48a8dc7ecc1f05d06063cd79b3fc00/Formula/libomp.rb -O $(find $(brew --repository) -name libomp.rb)
brew install ninja libomp
brew pin libomp
- name: Install Ubuntu system dependencies
if: matrix.os == 'ubuntu-latest'
run: |
sudo apt-get install -y --no-install-recommends ninja-build
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
python-version: ${{ matrix.python-version }}
activate-environment: test
- name: Display Conda env
shell: bash -l {0}
run: |
conda info
conda list
- name: Build and install XGBoost
shell: bash -l {0}
run: |
cd python-package
python --version
python setup.py sdist
pip install -v ./dist/xgboost-*.tar.gz
cd ..
python -c 'import xgboost'
python-tests:
name: Test XGBoost Python package on ${{ matrix.config.os }}
runs-on: ${{ matrix.config.os }}
strategy:
matrix:
config:
- {os: windows-2016, compiler: 'msvc', python-version: '3.8'}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: conda-incubator/setup-miniconda@v2
with:
auto-update-conda: true
python-version: ${{ matrix.config.python-version }}
activate-environment: win64_test
environment-file: tests/ci_build/conda_env/win64_cpu_test.yml
- name: Display Conda env
shell: bash -l {0}
run: |
conda info
conda list
- name: Build XGBoost with msvc
shell: bash -l {0}
if: matrix.config.compiler == 'msvc'
run: |
mkdir build_msvc
cd build_msvc
cmake .. -G"Visual Studio 15 2017" -DCMAKE_CONFIGURATION_TYPES="Release" -A x64 -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON
cmake --build . --config Release --parallel $(nproc)
- name: Install Python package
shell: bash -l {0}
run: |
cd python-package
python --version
python setup.py bdist_wheel --universal
pip install ./dist/*.whl
- name: Test Python package
shell: bash -l {0}
run: |
pytest -s -v ./tests/python

.github/workflows/r_nold.yml (new file)

@@ -0,0 +1,44 @@
# Run R tests with noLD R. Only triggered by a pull request review
# See discussion at https://github.com/dmlc/xgboost/pull/6378
name: XGBoost-R-noLD
on:
pull_request_review_comment:
types: [created]
env:
R_PACKAGES: c('XML', 'igraph', 'data.table', 'ggplot2', 'DiagrammeR', 'Ckmeans.1d.dp', 'vcd', 'testthat', 'lintr', 'knitr', 'rmarkdown', 'e1071', 'cplm', 'devtools', 'float', 'titanic')
jobs:
test-R-noLD:
if: github.event.comment.body == '/gha run r-nold-test' && contains('OWNER,MEMBER,COLLABORATOR', github.event.comment.author_association)
timeout-minutes: 120
runs-on: ubuntu-latest
container: rhub/debian-gcc-devel-nold
steps:
- name: Install git and system packages
shell: bash
run: |
apt-get update && apt-get install -y git libcurl4-openssl-dev libssl-dev libssh2-1-dev libgit2-dev libxml2-dev
- uses: actions/checkout@v2
with:
submodules: 'true'
- name: Install dependencies
shell: bash
run: |
cat > install_libs.R <<EOT
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
EOT
/tmp/R-devel/bin/Rscript install_libs.R
- name: Run R tests
shell: bash
run: |
cd R-package && \
/tmp/R-devel/bin/R CMD INSTALL . && \
/tmp/R-devel/bin/R -q -e "library(testthat); setwd('tests'); source('testthat.R')"

.github/workflows/r_tests.yml (new file)

@@ -0,0 +1,138 @@
name: XGBoost-R-Tests
on: [push, pull_request]
env:
R_PACKAGES: c('XML', 'igraph', 'data.table', 'ggplot2', 'DiagrammeR', 'Ckmeans.1d.dp', 'vcd', 'testthat', 'lintr', 'knitr', 'rmarkdown', 'e1071', 'cplm', 'devtools', 'float', 'titanic')
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
jobs:
lintr:
runs-on: ${{ matrix.config.os }}
name: Run R linters on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
strategy:
matrix:
config:
- {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: r-lib/actions/setup-r@master
with:
r-version: ${{ matrix.config.r }}
- name: Cache R packages
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}
run: |
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
- name: Run lintr
run: |
cd R-package
R.exe CMD INSTALL .
Rscript.exe tests/helper_scripts/run_lint.R
test-with-R:
runs-on: ${{ matrix.config.os }}
name: Test R on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
strategy:
fail-fast: false
matrix:
config:
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'autotools'}
- {os: windows-2016, r: 'release', compiler: 'msvc', build: 'cmake'}
- {os: windows-2016, r: 'release', compiler: 'mingw', build: 'cmake'}
env:
R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
RSPM: ${{ matrix.config.rspm }}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: r-lib/actions/setup-r@master
with:
r-version: ${{ matrix.config.r }}
- name: Cache R packages
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}
run: |
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
- uses: actions/setup-python@v2
with:
python-version: '3.7'
architecture: 'x64'
- name: Test R
run: |
python tests/ci_build/test_r_package.py --compiler="${{ matrix.config.compiler }}" --build-tool="${{ matrix.config.build }}"
test-R-CRAN:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
config:
- {r: 'release'}
steps:
- uses: actions/checkout@v2
with:
submodules: 'true'
- uses: r-lib/actions/setup-r@master
with:
r-version: ${{ matrix.config.r }}
- uses: r-lib/actions/setup-tinytex@master
- name: Install system packages
run: |
sudo apt-get update && sudo apt-get install libcurl4-openssl-dev libssl-dev libssh2-1-dev libgit2-dev pandoc pandoc-citeproc
- name: Cache R packages
uses: actions/cache@v2
with:
path: ${{ env.R_LIBS_USER }}
key: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-${{ hashFiles('R-package/DESCRIPTION') }}
- name: Install dependencies
shell: Rscript {0}
run: |
install.packages(${{ env.R_PACKAGES }},
repos = 'http://cloud.r-project.org',
dependencies = c('Depends', 'Imports', 'LinkingTo'))
- name: Check R Package
run: |
# Print stacktrace upon success of failure
make Rcheck || tests/ci_build/print_r_stacktrace.sh fail
tests/ci_build/print_r_stacktrace.sh success

.gitignore

@@ -71,6 +71,7 @@ build
build_plugin
recommonmark/
tags
TAGS
*.class
target
*.swp
@@ -104,4 +105,23 @@ R-package/src/Makevars
/cmake-build-debug/
# GDB
.gdb_history
.gdb_history
# Python joblib.Memory used in pytest.
cachedir/
# Files from local Dask work
dask-worker-space/
# Jupyter notebook checkpoints
.ipynb_checkpoints/
# credentials and key material
config
credentials
credentials.csv
*.env
*.pem
*.pub
*.rdp
*_rsa

.gitmodules

@@ -1,9 +1,10 @@
[submodule "dmlc-core"]
path = dmlc-core
url = https://github.com/dmlc/dmlc-core
[submodule "rabit"]
path = rabit
url = https://github.com/dmlc/rabit
branch = main
[submodule "cub"]
path = cub
url = https://github.com/NVlabs/cub
[submodule "gputreeshap"]
path = gputreeshap
url = https://github.com/rapidsai/gputreeshap.git


@@ -1,52 +1,35 @@
# disable sudo for container build.
sudo: required
# Enabling test OS X
os:
- linux
- osx
osx_image: xcode10.1
dist: bionic
# Use Build Matrix to do lint and build separately
env:
matrix:
# python package test
- TASK=python_test
# test installation of Python source distribution
- TASK=python_sdist_test
# java package test
- TASK=java_test
# cmake test
- TASK=cmake_test
global:
- secure: "PR16i9F8QtNwn99C5NDp8nptAS+97xwDtXEJJfEiEVhxPaaRkOp0MPWhogCaK0Eclxk1TqkgWbdXFknwGycX620AzZWa/A1K3gAs+GrpzqhnPMuoBJ0Z9qxXTbSJvCyvMbYwVrjaxc/zWqdMU8waWz8A7iqKGKs/SqbQ3rO6v7c="
- secure: "dAGAjBokqm/0nVoLMofQni/fWIBcYSmdq4XvCBX1ZAMDsWnuOfz/4XCY6h2lEI1rVHZQ+UdZkc9PioOHGPZh5BnvE49/xVVWr9c4/61lrDOlkD01ZjSAeoV0fAZq+93V/wPl4QV+MM+Sem9hNNzFSbN5VsQLAiWCSapWsLdKzqA="
- secure: "lqkL5SCM/CBwgVb1GWoOngpojsa0zCSGcvF0O3/45rBT1EpNYtQ4LRJ1+XcHi126vdfGoim/8i7AQhn5eOgmZI8yAPBeoUZ5zSrejD3RUpXr2rXocsvRRP25Z4mIuAGHD9VAHtvTdhBZRVV818W02pYduSzAeaY61q/lU3xmWsE="
- secure: "mzms6X8uvdhRWxkPBMwx+mDl3d+V1kUpZa7UgjT+dr4rvZMzvKtjKp/O0JZZVogdgZjUZf444B98/7AvWdSkGdkfz2QdmhWmXzNPfNuHtmfCYMdijsgFIGLuD3GviFL/rBiM2vgn32T3QqFiEJiC5StparnnXimPTc9TpXQRq5c="
matrix:
exclude:
- os: linux
jobs:
include:
- os: osx
arch: amd64
osx_image: xcode10.2
env: TASK=python_test
- os: linux
- os: osx
arch: amd64
osx_image: xcode10.2
env: TASK=java_test
- os: linux
env: TASK=cmake_test
arch: s390x
env: TASK=s390x_test
# dependent brew packages
# the dependencies from homebrew are installed manually from the setup script due to the outdated image from travis.
addons:
homebrew:
update: false
apt:
packages:
- cmake
- libomp
- graphviz
- openssl
- libgit2
- lz4
- wget
- r
update: true
- unzip
before_install:
- source tests/travis/travis_setup_env.sh

CMakeLists.txt

@@ -1,9 +1,10 @@
cmake_minimum_required(VERSION 3.13)
project(xgboost LANGUAGES CXX C VERSION 1.2.0)
cmake_minimum_required(VERSION 3.14 FATAL_ERROR)
project(xgboost LANGUAGES CXX C VERSION 1.5.0)
include(cmake/Utils.cmake)
list(APPEND CMAKE_MODULE_PATH "${xgboost_SOURCE_DIR}/cmake/modules")
cmake_policy(SET CMP0022 NEW)
cmake_policy(SET CMP0079 NEW)
set(CMAKE_POLICY_DEFAULT_CMP0063 NEW)
cmake_policy(SET CMP0063 NEW)
if ((${CMAKE_VERSION} VERSION_GREATER 3.13) OR (${CMAKE_VERSION} VERSION_EQUAL 3.13))
@@ -23,9 +24,11 @@ write_version()
set_default_configuration_release()
#-- Options
## User options
option(BUILD_C_DOC "Build documentation for C APIs using Doxygen." OFF)
option(USE_OPENMP "Build with OpenMP support." ON)
option(BUILD_STATIC_LIB "Build static library" OFF)
option(RABIT_BUILD_MPI "Build MPI" OFF)
## Bindings
option(JVM_BINDINGS "Build JVM bindings" OFF)
option(R_LIB "Build shared library for R package" OFF)
@@ -37,6 +40,7 @@ option(ENABLE_ALL_WARNINGS "Enable all compiler warnings. Only effective for GCC
option(LOG_CAPI_INVOCATION "Log all C API invocations for debugging" OFF)
option(GOOGLE_TEST "Build google tests" OFF)
option(USE_DMLC_GTEST "Use google tests bundled with dmlc-core submodule" OFF)
option(USE_DEVICE_DEBUG "Generate CUDA device debug info." OFF)
option(USE_NVTX "Build with cuda profiling annotations. Developers only." OFF)
set(NVTX_HEADER_DIR "" CACHE PATH "Path to the stand-alone nvtx header")
option(RABIT_MOCK "Build rabit with mock" OFF)
@@ -45,6 +49,7 @@ option(HIDE_CXX_SYMBOLS "Build shared library and hide all C++ symbols" OFF)
option(USE_CUDA "Build with GPU acceleration" OFF)
option(USE_NCCL "Build with NCCL to enable distributed GPU support." OFF)
option(BUILD_WITH_SHARED_NCCL "Build with shared NCCL library." OFF)
option(BUILD_WITH_CUDA_CUB "Build with cub in CUDA installation" OFF)
set(GPU_COMPUTE_VER "" CACHE STRING
"Semicolon separated list of compute versions to be built against, e.g. '35;61'")
## Copied From dmlc
@@ -58,8 +63,10 @@ set(ENABLED_SANITIZERS "address" "leak" CACHE STRING
"Semicolon separated list of sanitizer names. E.g 'address;leak'. Supported sanitizers are
address, leak, undefined and thread.")
## Plugins
option(PLUGIN_LZ4 "Build lz4 plugin" OFF)
option(PLUGIN_DENSE_PARSER "Build dense parser plugin" OFF)
option(PLUGIN_RMM "Build with RAPIDS Memory Manager (RMM)" OFF)
## TODO: 1. Add check if DPC++ compiler is used for building
option(PLUGIN_UPDATER_ONEAPI "DPC++ updater" OFF)
option(ADD_PKGCONFIG "Add xgboost.pc into system." ON)
#-- Checks for building XGBoost
@@ -69,6 +76,9 @@ endif (USE_DEBUG_OUTPUT AND (NOT (CMAKE_BUILD_TYPE MATCHES Debug)))
if (USE_NCCL AND NOT (USE_CUDA))
message(SEND_ERROR "`USE_NCCL` must be enabled with `USE_CUDA` flag.")
endif (USE_NCCL AND NOT (USE_CUDA))
if (USE_DEVICE_DEBUG AND NOT (USE_CUDA))
message(SEND_ERROR "`USE_DEVICE_DEBUG` must be enabled with `USE_CUDA` flag.")
endif (USE_DEVICE_DEBUG AND NOT (USE_CUDA))
if (BUILD_WITH_SHARED_NCCL AND (NOT USE_NCCL))
message(SEND_ERROR "Build XGBoost with -DUSE_NCCL=ON to enable BUILD_WITH_SHARED_NCCL.")
endif (BUILD_WITH_SHARED_NCCL AND (NOT USE_NCCL))
@@ -82,11 +92,29 @@ endif (R_LIB AND GOOGLE_TEST)
if (USE_AVX)
message(SEND_ERROR "The option 'USE_AVX' is deprecated as experimental AVX features have been removed from XGBoost.")
endif (USE_AVX)
if (PLUGIN_LZ4)
message(SEND_ERROR "The option 'PLUGIN_LZ4' is removed from XGBoost.")
endif (PLUGIN_LZ4)
if (PLUGIN_RMM AND NOT (USE_CUDA))
message(SEND_ERROR "`PLUGIN_RMM` must be enabled with `USE_CUDA` flag.")
endif (PLUGIN_RMM AND NOT (USE_CUDA))
if (PLUGIN_RMM AND NOT ((CMAKE_CXX_COMPILER_ID STREQUAL "Clang") OR (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")))
message(SEND_ERROR "`PLUGIN_RMM` must be used with GCC or Clang compiler.")
endif (PLUGIN_RMM AND NOT ((CMAKE_CXX_COMPILER_ID STREQUAL "Clang") OR (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")))
if (PLUGIN_RMM AND NOT (CMAKE_SYSTEM_NAME STREQUAL "Linux"))
message(SEND_ERROR "`PLUGIN_RMM` must be used with Linux.")
endif (PLUGIN_RMM AND NOT (CMAKE_SYSTEM_NAME STREQUAL "Linux"))
if (ENABLE_ALL_WARNINGS)
if ((NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang") AND (NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU"))
message(SEND_ERROR "ENABLE_ALL_WARNINGS is only available for Clang and GCC.")
endif ((NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang") AND (NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU"))
endif (ENABLE_ALL_WARNINGS)
if (BUILD_STATIC_LIB AND (R_LIB OR JVM_BINDINGS))
message(SEND_ERROR "Cannot build a static library libxgboost.a when R or JVM packages are enabled.")
endif (BUILD_STATIC_LIB AND (R_LIB OR JVM_BINDINGS))
if (PLUGIN_RMM AND (NOT BUILD_WITH_CUDA_CUB))
message(SEND_ERROR "Cannot build with RMM using cub submodule.")
endif (PLUGIN_RMM AND (NOT BUILD_WITH_CUDA_CUB))
#-- Sanitizer
if (USE_SANITIZER)
@@ -95,18 +123,18 @@ if (USE_SANITIZER)
endif (USE_SANITIZER)
if (USE_CUDA)
SET(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
set(USE_OPENMP ON CACHE BOOL "CUDA requires OpenMP" FORCE)
# `export CXX=' is ignored by CMake CUDA.
set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER})
message(STATUS "Configured CUDA host compiler: ${CMAKE_CUDA_HOST_COMPILER}")
enable_language(CUDA)
if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS 10.0)
message(FATAL_ERROR "CUDA version must be at least 10.0!")
if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS 10.1)
message(FATAL_ERROR "CUDA version must be at least 10.1!")
endif()
set(GEN_CODE "")
format_gencode_flags("${GPU_COMPUTE_VER}" GEN_CODE)
message(STATUS "CUDA GEN_CODE: ${GEN_CODE}")
add_subdirectory(${PROJECT_SOURCE_DIR}/gputreeshap)
endif (USE_CUDA)
if (FORCE_COLORED_OUTPUT AND (CMAKE_GENERATOR STREQUAL "Ninja") AND
@@ -126,67 +154,41 @@ if (USE_OPENMP)
find_package(OpenMP REQUIRED)
endif (USE_OPENMP)
# core xgboost
add_subdirectory(${xgboost_SOURCE_DIR}/src)
if (USE_NCCL)
find_package(Nccl REQUIRED)
endif (USE_NCCL)
# dmlc-core
msvc_use_static_runtime()
add_subdirectory(${xgboost_SOURCE_DIR}/dmlc-core)
set_target_properties(dmlc PROPERTIES
CXX_STANDARD 14
CXX_STANDARD_REQUIRED ON
POSITION_INDEPENDENT_CODE ON)
if (MSVC)
target_compile_options(dmlc PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
if (TARGET dmlc_unit_tests)
target_compile_options(dmlc_unit_tests PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (TARGET dmlc_unit_tests)
endif (MSVC)
if (ENABLE_ALL_WARNINGS)
target_compile_options(dmlc PRIVATE -Wall -Wextra)
endif (ENABLE_ALL_WARNINGS)
target_link_libraries(objxgboost PUBLIC dmlc)
# rabit
set(RABIT_BUILD_DMLC OFF)
set(DMLC_ROOT ${xgboost_SOURCE_DIR}/dmlc-core)
set(RABIT_WITH_R_LIB ${R_LIB})
add_subdirectory(rabit)
if (RABIT_BUILD_MPI)
find_package(MPI REQUIRED)
endif (RABIT_BUILD_MPI)
if (RABIT_MOCK)
target_link_libraries(objxgboost PUBLIC rabit_mock_static)
if (MSVC)
target_compile_options(rabit_mock_static PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (MSVC)
else()
target_link_libraries(objxgboost PUBLIC rabit)
if (MSVC)
target_compile_options(rabit PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (MSVC)
endif(RABIT_MOCK)
foreach(lib rabit rabit_base rabit_empty rabit_mock rabit_mock_static)
# Explicitly link dmlc to rabit, so that configured header (build_config.h)
# from dmlc is correctly applied to rabit.
if (TARGET ${lib})
target_link_libraries(${lib} dmlc ${CMAKE_THREAD_LIBS_INIT})
if (HIDE_CXX_SYMBOLS) # Hide all C++ symbols from Rabit
set_target_properties(${lib} PROPERTIES CXX_VISIBILITY_PRESET hidden)
endif (HIDE_CXX_SYMBOLS)
if (ENABLE_ALL_WARNINGS)
target_compile_options(${lib} PRIVATE -Wall -Wextra)
endif (ENABLE_ALL_WARNINGS)
endif (TARGET ${lib})
endforeach()
# core xgboost
add_subdirectory(${xgboost_SOURCE_DIR}/src)
target_link_libraries(objxgboost PUBLIC dmlc)
# Exports some R specific definitions and objects
if (R_LIB)
add_subdirectory(${xgboost_SOURCE_DIR}/R-package)
endif (R_LIB)
# This creates its own shared library `xgboost4j'.
if (JVM_BINDINGS)
add_subdirectory(${xgboost_SOURCE_DIR}/jvm-packages)
endif (JVM_BINDINGS)
# Plugin
add_subdirectory(${xgboost_SOURCE_DIR}/plugin)
@@ -197,47 +199,37 @@ else (BUILD_STATIC_LIB)
add_library(xgboost SHARED)
endif (BUILD_STATIC_LIB)
target_link_libraries(xgboost PRIVATE objxgboost)
if (USE_NVTX)
enable_nvtx(xgboost)
endif (USE_NVTX)
#-- Hide all C++ symbols
if (HIDE_CXX_SYMBOLS)
set_target_properties(objxgboost PROPERTIES CXX_VISIBILITY_PRESET hidden)
set_target_properties(xgboost PROPERTIES CXX_VISIBILITY_PRESET hidden)
endif (HIDE_CXX_SYMBOLS)
target_include_directories(xgboost
INTERFACE
$<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/include>
$<INSTALL_INTERFACE:$<INSTALL_PREFIX>/include>
$<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}/include>)
# This creates its own shared library `xgboost4j'.
if (JVM_BINDINGS)
add_subdirectory(${xgboost_SOURCE_DIR}/jvm-packages)
endif (JVM_BINDINGS)
#-- End shared library
#-- CLI for xgboost
add_executable(runxgboost ${xgboost_SOURCE_DIR}/src/cli_main.cc)
target_link_libraries(runxgboost PRIVATE objxgboost)
if (USE_NVTX)
enable_nvtx(runxgboost)
endif (USE_NVTX)
target_include_directories(runxgboost
PRIVATE
${xgboost_SOURCE_DIR}/include
${xgboost_SOURCE_DIR}/dmlc-core/include
${xgboost_SOURCE_DIR}/rabit/include)
set_target_properties(
runxgboost PROPERTIES
OUTPUT_NAME xgboost
CXX_STANDARD 14
CXX_STANDARD_REQUIRED ON)
${xgboost_SOURCE_DIR}/rabit/include
)
set_target_properties(runxgboost PROPERTIES OUTPUT_NAME xgboost)
#-- End CLI for xgboost
# Common setup for all targets
foreach(target xgboost objxgboost dmlc runxgboost)
xgboost_target_properties(${target})
xgboost_target_link_libraries(${target})
xgboost_target_defs(${target})
endforeach()
if (JVM_BINDINGS)
xgboost_target_properties(xgboost4j)
xgboost_target_link_libraries(xgboost4j)
xgboost_target_defs(xgboost4j)
endif (JVM_BINDINGS)
set_output_directory(runxgboost ${xgboost_SOURCE_DIR})
set_output_directory(xgboost ${xgboost_SOURCE_DIR}/lib)
# Ensure these two targets do not build simultaneously, as they produce outputs with conflicting names
@@ -262,12 +254,27 @@ if (BUILD_C_DOC)
run_doxygen()
endif (BUILD_C_DOC)
include(CPack)
include(GNUInstallDirs)
# Install all headers. Please note that currently the C++ headers do not form an "API".
install(DIRECTORY ${xgboost_SOURCE_DIR}/include/xgboost
DESTINATION ${CMAKE_INSTALL_INCLUDEDIR})
install(TARGETS xgboost runxgboost
# Install libraries. If `xgboost` is a static lib, specify `objxgboost` also, to avoid the
# following error:
#
# > install(EXPORT ...) includes target "xgboost" which requires target "objxgboost" that is not
# > in any export set.
#
# https://github.com/dmlc/xgboost/issues/6085
if (BUILD_STATIC_LIB)
set(INSTALL_TARGETS xgboost runxgboost objxgboost dmlc)
else (BUILD_STATIC_LIB)
set(INSTALL_TARGETS xgboost runxgboost)
endif (BUILD_STATIC_LIB)
install(TARGETS ${INSTALL_TARGETS}
EXPORT XGBoostTargets
ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}
LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
@@ -297,12 +304,18 @@ install(
if (GOOGLE_TEST)
enable_testing()
# Unittests.
add_executable(testxgboost)
target_link_libraries(testxgboost PRIVATE objxgboost)
xgboost_target_properties(testxgboost)
xgboost_target_link_libraries(testxgboost)
xgboost_target_defs(testxgboost)
add_subdirectory(${xgboost_SOURCE_DIR}/tests/cpp)
add_test(
NAME TestXGBoostLib
COMMAND testxgboost
WORKING_DIRECTORY ${xgboost_BINARY_DIR})
# CLI tests
configure_file(
${xgboost_SOURCE_DIR}/tests/cli/machine.conf.in

CONTRIBUTORS.md

@@ -43,7 +43,7 @@ Committers are people who have made substantial contribution to the project and
Become a Committer
------------------
XGBoost is a opensource project and we are actively looking for new committers who are willing to help maintaining and lead the project.
XGBoost is an open source project, and we are actively looking for new committers who are willing to help maintain and lead the project.
Committers come from contributors who:
* Made substantial contributions to the project.
* Are willing to spend time maintaining and leading the project.
@@ -59,7 +59,7 @@ List of Contributors
* [Skipper Seabold](https://github.com/jseabold)
- Skipper is the major contributor to the scikit-learn module of XGBoost.
* [Zygmunt Zając](https://github.com/zygmuntz)
- Zygmunt is the master behind the early stopping feature frequently used by kagglers.
- Zygmunt is the master behind the early stopping feature frequently used by Kagglers.
* [Ajinkya Kale](https://github.com/ajkl)
* [Boliang Chen](https://github.com/cblsjtu)
* [Yangqing Men](https://github.com/yanqingmen)
@@ -91,7 +91,7 @@ List of Contributors
* [Henry Gouk](https://github.com/henrygouk)
* [Pierre de Sahb](https://github.com/pdesahb)
* [liuliang01](https://github.com/liuliang01)
- liuliang01 added support for the qid column for LibSVM input format. This makes ranking task easier in distributed setting.
- liuliang01 added support for the qid column for LIBSVM input format. This makes ranking tasks easier in a distributed setting.
* [Andrew Thia](https://github.com/BlueTea88)
- Andrew Thia implemented feature interaction constraints
* [Wei Tian](https://github.com/weitian)

Jenkinsfile

@@ -7,7 +7,7 @@
dockerRun = 'tests/ci_build/ci_build.sh'
// Which CUDA version to use when building reference distribution wheel
ref_cuda_ver = '10.0'
ref_cuda_ver = '10.1'
import groovy.transform.Field
@@ -38,26 +38,15 @@ pipeline {
agent { label 'job_initializer' }
steps {
script {
def buildNumber = env.BUILD_NUMBER as int
if (buildNumber > 1) milestone(buildNumber - 1)
milestone(buildNumber)
checkoutSrcs()
commit_id = "${GIT_COMMIT}"
}
sh 'python3 tests/jenkins_get_approval.py'
stash name: 'srcs'
milestone ordinal: 1
}
}
stage('Jenkins Linux: Formatting Check') {
agent none
steps {
script {
parallel ([
'clang-tidy': { ClangTidy() },
'lint': { Lint() },
'sphinx-doc': { SphinxDoc() },
'doxygen': { Doxygen() }
])
}
milestone ordinal: 2
}
}
stage('Jenkins Linux: Build') {
@@ -65,22 +54,21 @@ pipeline {
steps {
script {
parallel ([
'clang-tidy': { ClangTidy() },
'build-cpu': { BuildCPU() },
'build-cpu-arm64': { BuildCPUARM64() },
'build-cpu-rabit-mock': { BuildCPUMock() },
'build-cpu-non-omp': { BuildCPUNonOmp() },
// Build reference, distribution-ready Python wheel with CUDA 10.0
// using CentOS 6 image
'build-gpu-cuda10.0': { BuildCUDA(cuda_version: '10.0') },
// The build-gpu-* builds below use Ubuntu image
// Build reference, distribution-ready Python wheel with CUDA 10.1
// using CentOS 7 image
'build-gpu-cuda10.1': { BuildCUDA(cuda_version: '10.1') },
'build-gpu-cuda10.2': { BuildCUDA(cuda_version: '10.2') },
'build-gpu-cuda11.0': { BuildCUDA(cuda_version: '11.0') },
'build-jvm-packages-gpu-cuda10.0': { BuildJVMPackagesWithCUDA(spark_version: '3.0.0', cuda_version: '10.0') },
// The build-gpu-* builds below use Ubuntu image
'build-gpu-cuda11.0': { BuildCUDA(cuda_version: '11.0', build_rmm: true) },
'build-gpu-rpkg': { BuildRPackageWithCUDA(cuda_version: '10.1') },
'build-jvm-packages-gpu-cuda10.1': { BuildJVMPackagesWithCUDA(spark_version: '3.0.0', cuda_version: '11.0') },
'build-jvm-packages': { BuildJVMPackages(spark_version: '3.0.0') },
'build-jvm-doc': { BuildJVMDoc() }
])
}
milestone ordinal: 3
}
}
stage('Jenkins Linux: Test') {
@@ -89,20 +77,17 @@ pipeline {
script {
parallel ([
'test-python-cpu': { TestPythonCPU() },
'test-python-gpu-cuda10.2': { TestPythonGPU(host_cuda_version: '10.2') },
'test-python-gpu-cuda11.0-cross': { TestPythonGPU(artifact_cuda_version: '10.0', host_cuda_version: '11.0') },
'test-python-cpu-arm64': { TestPythonCPUARM64() },
// artifact_cuda_version doesn't apply to RMM tests; RMM tests will always match CUDA version between artifact and host env
'test-python-gpu-cuda11.0-cross': { TestPythonGPU(artifact_cuda_version: '10.1', host_cuda_version: '11.0', test_rmm: true) },
'test-python-gpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0') },
'test-python-mgpu-cuda10.2': { TestPythonGPU(artifact_cuda_version: '10.2', host_cuda_version: '10.2', multi_gpu: true) },
'test-cpp-gpu-cuda10.2': { TestCppGPU(artifact_cuda_version: '10.2', host_cuda_version: '10.2') },
'test-cpp-gpu-cuda11.0': { TestCppGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0') },
'test-jvm-jdk8-cuda10.0': { CrossTestJVMwithJDKGPU(artifact_cuda_version: '10.0', host_cuda_version: '10.0') },
'test-python-mgpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '10.1', host_cuda_version: '11.0', multi_gpu: true, test_rmm: true) },
'test-cpp-gpu-cuda11.0': { TestCppGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0', test_rmm: true) },
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '3.0.0') },
'test-jvm-jdk11': { CrossTestJVMwithJDK(jdk_version: '11') },
'test-jvm-jdk12': { CrossTestJVMwithJDK(jdk_version: '12') },
'test-r-3.5.3': { TestR(use_r35: true) }
'test-jvm-jdk12': { CrossTestJVMwithJDK(jdk_version: '12') }
])
}
milestone ordinal: 4
}
}
stage('Jenkins Linux: Deploy') {
@@ -113,7 +98,6 @@ pipeline {
'deploy-jvm-packages': { DeployJVMPackages(spark_version: '3.0.0') }
])
}
milestone ordinal: 5
}
}
}
@@ -135,7 +119,7 @@ def checkoutSrcs() {
}
def GetCUDABuildContainerType(cuda_version) {
return (cuda_version == ref_cuda_ver) ? 'gpu_build_centos6' : 'gpu_build'
return (cuda_version == ref_cuda_ver) ? 'gpu_build_centos7' : 'gpu_build'
}
def ClangTidy() {
@@ -144,7 +128,7 @@ def ClangTidy() {
echo "Running clang-tidy job..."
def container_type = "clang_tidy"
def docker_binary = "docker"
def dockerArgs = "--build-arg CUDA_VERSION=10.1"
def dockerArgs = "--build-arg CUDA_VERSION_ARG=10.1"
sh """
${dockerRun} ${container_type} ${docker_binary} ${dockerArgs} python3 tests/ci_build/tidy.py
"""
@@ -152,50 +136,6 @@ def ClangTidy() {
}
}
def Lint() {
node('linux && cpu') {
unstash name: 'srcs'
echo "Running lint..."
def container_type = "cpu"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} bash -c "source activate cpu_test && make lint"
"""
deleteDir()
}
}
def SphinxDoc() {
node('linux && cpu') {
unstash name: 'srcs'
echo "Running sphinx-doc..."
def container_type = "cpu"
def docker_binary = "docker"
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e SPHINX_GIT_BRANCH=${BRANCH_NAME}'"
sh """#!/bin/bash
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} bash -c "source activate cpu_test && make -C doc html"
"""
deleteDir()
}
}
def Doxygen() {
node('linux && cpu') {
unstash name: 'srcs'
echo "Running doxygen..."
def container_type = "cpu"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/doxygen.sh ${BRANCH_NAME}
"""
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading doc...'
s3Upload file: "build/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "doxygen/${BRANCH_NAME}.tar.bz2"
}
deleteDir()
}
}
def BuildCPU() {
node('linux && cpu') {
unstash name: 'srcs'
@@ -207,15 +147,15 @@ def BuildCPU() {
# This step is not necessary, but here we include it, to ensure that DMLC_CORE_USE_CMAKE flag is correctly propagated
# We want to make sure that we use the configured header build/dmlc/build_config.h instead of include/dmlc/build_config_default.h.
# See discussion at https://github.com/dmlc/xgboost/issues/5510
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DPLUGIN_LZ4=ON -DPLUGIN_DENSE_PARSER=ON
${dockerRun} ${container_type} ${docker_binary} build/testxgboost
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DPLUGIN_DENSE_PARSER=ON
${dockerRun} ${container_type} ${docker_binary} bash -c "cd build && ctest --extra-verbose"
"""
// Sanitizer test
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer -e ASAN_OPTIONS=symbolize=1 -e UBSAN_OPTIONS=print_stacktrace=1:log_path=ubsan_error.log --cap-add SYS_PTRACE'"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DUSE_SANITIZER=ON -DENABLED_SANITIZERS="address;leak;undefined" \
-DCMAKE_BUILD_TYPE=Debug -DSANITIZER_PATH=/usr/lib/x86_64-linux-gnu/
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} build/testxgboost
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} bash -c "cd build && ctest --exclude-regex AllTestsInDMLCUnitTests --extra-verbose"
"""
stash name: 'xgboost_cli', includes: 'xgboost'
@@ -223,6 +163,35 @@ def BuildCPU() {
}
}
def BuildCPUARM64() {
node('linux && arm64') {
unstash name: 'srcs'
echo "Build CPU ARM64"
def container_type = "aarch64"
def docker_binary = "docker"
def wheel_tag = "manylinux2014_aarch64"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh --conda-env=aarch64_test -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOL=ON
${dockerRun} ${container_type} ${docker_binary} bash -c "cd build && ctest --extra-verbose"
${dockerRun} ${container_type} ${docker_binary} bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"
${dockerRun} ${container_type} ${docker_binary} python tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} ${wheel_tag}
${dockerRun} ${container_type} ${docker_binary} bash -c "auditwheel repair --plat ${wheel_tag} python-package/dist/*.whl && python tests/ci_build/rename_whl.py wheelhouse/*.whl ${commit_id} ${wheel_tag}"
mv -v wheelhouse/*.whl python-package/dist/
# Make sure that libgomp.so is vendored in the wheel
${dockerRun} ${container_type} ${docker_binary} bash -c "unzip -l python-package/dist/*.whl | grep libgomp || exit -1"
"""
echo 'Stashing Python wheel...'
stash name: "xgboost_whl_arm64_cpu", includes: 'python-package/dist/*.whl'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading Python wheel...'
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
}
stash name: 'xgboost_cli_arm64', includes: 'xgboost'
deleteDir()
}
}
def BuildCPUMock() {
node('linux && cpu') {
unstash name: 'srcs'
@@ -238,39 +207,32 @@ def BuildCPUMock() {
}
}
def BuildCPUNonOmp() {
node('linux && cpu') {
unstash name: 'srcs'
echo "Build CPU without OpenMP"
def container_type = "cpu"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DUSE_OPENMP=OFF
"""
echo "Running Non-OpenMP C++ test..."
sh """
${dockerRun} ${container_type} ${docker_binary} build/testxgboost
"""
deleteDir()
}
}
def BuildCUDA(args) {
node('linux && cpu_build') {
unstash name: 'srcs'
echo "Build with CUDA ${args.cuda_version}"
def container_type = GetCUDABuildContainerType(args.cuda_version)
def docker_binary = "docker"
def docker_args = "--build-arg CUDA_VERSION=${args.cuda_version}"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
def arch_flag = ""
if (env.BRANCH_NAME != 'master' && !(env.BRANCH_NAME.startsWith('release'))) {
arch_flag = "-DGPU_COMPUTE_VER=75"
}
def wheel_tag = "manylinux2014_x86_64"
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON ${arch_flag}
${dockerRun} ${container_type} ${docker_binary} ${docker_args} bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} manylinux2010_x86_64
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} ${wheel_tag}
"""
if (args.cuda_version == ref_cuda_ver) {
sh """
${dockerRun} auditwheel_x86_64 ${docker_binary} auditwheel repair --plat ${wheel_tag} python-package/dist/*.whl
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python tests/ci_build/rename_whl.py wheelhouse/*.whl ${commit_id} ${wheel_tag}
mv -v wheelhouse/*.whl python-package/dist/
# Make sure that libgomp.so is vendored in the wheel
${dockerRun} auditwheel_x86_64 ${docker_binary} bash -c "unzip -l python-package/dist/*.whl | grep libgomp || exit -1"
"""
}
echo 'Stashing Python wheel...'
stash name: "xgboost_whl_cuda${args.cuda_version}", includes: 'python-package/dist/*.whl'
if (args.cuda_version == ref_cuda_ver && (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release'))) {
@@ -280,17 +242,51 @@ def BuildCUDA(args) {
}
echo 'Stashing C++ test executable (testxgboost)...'
stash name: "xgboost_cpp_tests_cuda${args.cuda_version}", includes: 'build/testxgboost'
if (args.build_rmm) {
echo "Build with CUDA ${args.cuda_version} and RMM"
container_type = "rmm"
docker_binary = "docker"
docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
sh """
rm -rf build/
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_via_cmake.sh --conda-env=gpu_test -DUSE_CUDA=ON -DUSE_NCCL=ON -DPLUGIN_RMM=ON -DBUILD_WITH_CUDA_CUB=ON ${arch_flag}
${dockerRun} ${container_type} ${docker_binary} ${docker_args} bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} manylinux2014_x86_64
"""
echo 'Stashing Python wheel...'
stash name: "xgboost_whl_rmm_cuda${args.cuda_version}", includes: 'python-package/dist/*.whl'
echo 'Stashing C++ test executable (testxgboost)...'
stash name: "xgboost_cpp_tests_rmm_cuda${args.cuda_version}", includes: 'build/testxgboost'
}
deleteDir()
}
}
def BuildRPackageWithCUDA(args) {
node('linux && cpu_build') {
unstash name: 'srcs'
def container_type = 'gpu_build_r_centos7'
def docker_binary = "docker"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_r_pkg_with_cuda.sh ${commit_id}
"""
echo 'Uploading R tarball...'
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', includePathPattern:'xgboost_r_gpu_linux_*.tar.gz'
}
deleteDir()
}
}
def BuildJVMPackagesWithCUDA(args) {
node('linux && gpu') {
node('linux && mgpu') {
unstash name: 'srcs'
echo "Build XGBoost4J-Spark with Spark ${args.spark_version}, CUDA ${args.cuda_version}"
def container_type = "jvm_gpu_build"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.cuda_version}"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
def arch_flag = ""
if (env.BRANCH_NAME != 'master' && !(env.BRANCH_NAME.startsWith('release'))) {
arch_flag = "-DGPU_COMPUTE_VER=75"
@@ -301,7 +297,7 @@ def BuildJVMPackagesWithCUDA(args) {
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_jvm_packages.sh ${args.spark_version} -Duse.cuda=ON $arch_flag
"""
echo "Stashing XGBoost4J JAR with CUDA ${args.cuda_version} ..."
stash name: 'xgboost4j_jar_gpu', includes: "jvm-packages/xgboost4j/target/*.jar,jvm-packages/xgboost4j-spark/target/*.jar,jvm-packages/xgboost4j-example/target/*.jar"
stash name: 'xgboost4j_jar_gpu', includes: "jvm-packages/xgboost4j-gpu/target/*.jar,jvm-packages/xgboost4j-spark-gpu/target/*.jar"
deleteDir()
}
}
@@ -355,6 +351,21 @@ def TestPythonCPU() {
}
}
def TestPythonCPUARM64() {
node('linux && arm64') {
unstash name: "xgboost_whl_arm64_cpu"
unstash name: 'srcs'
unstash name: 'xgboost_cli_arm64'
echo "Test Python CPU ARM64"
def container_type = "aarch64"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/test_python.sh cpu-arm64
"""
deleteDir()
}
}
def TestPythonGPU(args) {
def nodeReq = (args.multi_gpu) ? 'linux && mgpu' : 'linux && gpu'
def artifact_cuda_version = (args.artifact_cuda_version) ?: ref_cuda_ver
@@ -365,38 +376,21 @@ def TestPythonGPU(args) {
echo "Test Python GPU: CUDA ${args.host_cuda_version}"
def container_type = "gpu"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.host_cuda_version}"
if (args.multi_gpu) {
echo "Using multiple GPUs"
// Allocate extra space in /dev/shm to enable NCCL
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=4g'"
sh """
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh mgpu
"""
} else {
echo "Using a single GPU"
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh gpu
"""
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
def mgpu_indicator = (args.multi_gpu) ? 'mgpu' : 'gpu'
// Allocate extra space in /dev/shm to enable NCCL
def docker_extra_params = (args.multi_gpu) ? "CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=4g'" : ''
sh "${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh ${mgpu_indicator}"
if (args.test_rmm) {
sh "rm -rfv build/ python-package/dist/"
unstash name: "xgboost_whl_rmm_cuda${args.host_cuda_version}"
unstash name: "xgboost_cpp_tests_rmm_cuda${args.host_cuda_version}"
sh "${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh ${mgpu_indicator} --use-rmm-pool"
}
deleteDir()
}
}
def TestCppRabit() {
node(nodeReq) {
unstash name: 'xgboost_rabit_tests'
unstash name: 'srcs'
echo "Test C++, rabit mock on"
def container_type = "cpu"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/runxgb.sh xgboost tests/ci_build/approx.conf.in
"""
deleteDir()
}
}
def TestCppGPU(args) {
def nodeReq = 'linux && mgpu'
def artifact_cuda_version = (args.artifact_cuda_version) ?: ref_cuda_ver
@@ -406,26 +400,19 @@ def TestCppGPU(args) {
echo "Test C++, CUDA ${args.host_cuda_version}"
def container_type = "gpu"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.host_cuda_version}"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} build/testxgboost"
deleteDir()
}
}
def CrossTestJVMwithJDKGPU(args) {
def nodeReq = 'linux && mgpu'
node(nodeReq) {
unstash name: "xgboost4j_jar_gpu"
unstash name: 'srcs'
if (args.spark_version != null) {
echo "Test XGBoost4J on a machine with JDK ${args.jdk_version}, Spark ${args.spark_version}, CUDA ${args.host_cuda_version}"
} else {
echo "Test XGBoost4J on a machine with JDK ${args.jdk_version}, CUDA ${args.host_cuda_version}"
if (args.test_rmm) {
sh "rm -rfv build/"
unstash name: "xgboost_cpp_tests_rmm_cuda${args.host_cuda_version}"
echo "Test C++, CUDA ${args.host_cuda_version} with RMM"
container_type = "rmm"
docker_binary = "nvidia-docker"
docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} bash -c "source activate gpu_test && build/testxgboost --use-rmm-pool --gtest_filter=-*DeathTest.*"
"""
}
def container_type = "gpu_jvm"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.host_cuda_version}"
sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_jvm_gpu_cross.sh"
deleteDir()
}
}
@@ -452,30 +439,13 @@ def CrossTestJVMwithJDK(args) {
}
}
def TestR(args) {
node('linux && cpu') {
unstash name: 'srcs'
echo "Test R package"
def container_type = "rproject"
def docker_binary = "docker"
def use_r35_flag = (args.use_r35) ? "1" : "0"
def docker_args = "--build-arg USE_R35=${use_r35_flag}"
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_test_rpkg.sh || tests/ci_build/print_r_stacktrace.sh
"""
deleteDir()
}
}
def DeployJVMPackages(args) {
node('linux && cpu') {
unstash name: 'srcs'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Deploying to xgboost-maven-repo S3 repo...'
def container_type = "jvm"
def docker_binary = "docker"
sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/deploy_jvm_packages.sh ${args.spark_version}
${dockerRun} jvm_gpu_build docker --build-arg CUDA_VERSION_ARG=10.1 tests/ci_build/deploy_jvm_packages.sh ${args.spark_version}
"""
}
deleteDir()

Jenkinsfile-win64

@@ -25,12 +25,14 @@ pipeline {
agent { label 'job_initializer' }
steps {
script {
def buildNumber = env.BUILD_NUMBER as int
if (buildNumber > 1) milestone(buildNumber - 1)
milestone(buildNumber)
checkoutSrcs()
commit_id = "${GIT_COMMIT}"
}
sh 'python3 tests/jenkins_get_approval.py'
stash name: 'srcs'
milestone ordinal: 1
}
}
stage('Jenkins Win64: Build') {
@@ -38,10 +40,10 @@ pipeline {
steps {
script {
parallel ([
'build-win64-cuda10.1': { BuildWin64() }
'build-win64-cuda10.1': { BuildWin64() },
'build-rpkg-win64-cuda10.1': { BuildRPackageWithCUDAWin64() }
])
}
milestone ordinal: 2
}
}
stage('Jenkins Win64: Test') {
@@ -52,7 +54,6 @@ pipeline {
'test-win64-cuda10.1': { TestWin64() },
])
}
milestone ordinal: 3
}
}
}
@@ -75,6 +76,7 @@ def checkoutSrcs() {
def BuildWin64() {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
echo "Building XGBoost for Windows AMD64 target..."
bat "nvcc --version"
@@ -85,7 +87,7 @@ def BuildWin64() {
bat """
mkdir build
cd build
cmake .. -G"Visual Studio 15 2017 Win64" -DUSE_CUDA=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON ${arch_flag}
cmake .. -G"Visual Studio 15 2017 Win64" -DUSE_CUDA=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON ${arch_flag} -DCMAKE_UNITY_BUILD=ON
"""
bat """
cd build
@@ -115,8 +117,26 @@ def BuildWin64() {
}
}
def BuildRPackageWithCUDAWin64() {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
bat "nvcc --version"
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
bat """
bash tests/ci_build/build_r_pkg_with_cuda_win64.sh ${commit_id}
"""
echo 'Uploading R tarball...'
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', includePathPattern:'xgboost_r_gpu_win64_*.tar.gz'
}
deleteDir()
}
}
def TestWin64() {
node('win64 && cuda10_unified') {
deleteDir()
unstash name: 'srcs'
unstash name: 'xgboost_whl'
unstash name: 'xgboost_cli'
@@ -127,7 +147,7 @@ def TestWin64() {
bat "build\\testxgboost.exe"
echo "Installing Python dependencies..."
def env_name = 'win64_' + UUID.randomUUID().toString().replaceAll('-', '')
bat "conda env create -n ${env_name} --file=tests/ci_build/conda_env/win64_test.yml"
bat "conda activate && mamba env create -n ${env_name} --file=tests/ci_build/conda_env/win64_test.yml"
echo "Installing Python wheel..."
bat """
conda activate ${env_name} && for /R %%i in (python-package\\dist\\*.whl) DO python -m pip install "%%i"

Makefile

@@ -86,6 +86,20 @@ cover: check
)
endif
# dask is required to pass; the others are not.
# If any of the dask tests fail, the contributor won't see the other errors.
mypy:
cd python-package; \
mypy ./xgboost/dask.py && \
mypy ./xgboost/rabit.py && \
mypy ../demo/guide-python/external_memory.py && \
mypy ../tests/python-gpu/test_gpu_with_dask.py && \
mypy ../tests/python/test_data_iterator.py && \
mypy ../tests/python-gpu/test_gpu_data_iterator.py && \
mypy ./xgboost/sklearn.py || exit 1; \
mypy . || true ;
clean:
$(RM) -rf build lib bin *~ */*~ */*/*~ */*/*/*~ */*.o */*/*.o */*/*/*.o #xgboost
$(RM) -rf build_tests *.gcov tests/cpp/xgboost_test
@@ -134,14 +148,18 @@ Rpack: clean_all
sed -i -e 's/@OPENMP_LIB@//g' xgboost/src/Makevars.win
rm -f xgboost/src/Makevars.win-e # OSX sed create this extra file; remove it
bash R-package/remove_warning_suppression_pragma.sh
bash xgboost/remove_warning_suppression_pragma.sh
rm xgboost/remove_warning_suppression_pragma.sh
rm -rfv xgboost/tests/helper_scripts/
R ?= R
Rbuild: Rpack
R CMD build --no-build-vignettes xgboost
$(R) CMD build xgboost
rm -rf xgboost
Rcheck: Rbuild
R CMD check xgboost*.tar.gz
$(R) CMD check --as-cran xgboost*.tar.gz
-include build/*.d
-include build/*/*.d

NEWS.md

@@ -3,6 +3,670 @@ XGBoost Change Log
This file records the changes in the XGBoost library in reverse chronological order.
## v1.4.2 (2021.05.13)
This is a patch release for the Python package with the following fixes:
* Handle the latest version of cupy.ndarray in inplace_predict. (#6933)
* Ensure output array from predict_leaf is (n_samples, ) when there's only 1 tree. 1.4.0 outputs (n_samples, 1). (#6889)
* Fix empty dataset handling with multi-class AUC. (#6947)
* Handle object type from pandas in inplace_predict. (#6927)
## v1.4.1 (2021.04.20)
This is a bug fix release.
* Fix GPU implementation of AUC on some large datasets. (#6866)
## v1.4.0 (2021.04.12)
### Introduction of pre-built binary package for R, with GPU support
Starting with release 1.4.0, users now have the option of installing `{xgboost}` without
having to build it from the source. This is particularly advantageous for users who want
to take advantage of the GPU algorithm (`gpu_hist`), as previously they'd have to build
`{xgboost}` from the source using CMake and NVCC. Now installing `{xgboost}` with GPU
support is as easy as: `R CMD INSTALL ./xgboost_r_gpu_linux.tar.gz`. (#6827)
See the instructions at https://xgboost.readthedocs.io/en/latest/build.html
### Improvements on prediction functions
XGBoost has many prediction types, including SHAP value computation and inplace prediction.
In 1.4 we overhauled the underlying prediction functions for the C and Python APIs with a
unified interface. (#6777, #6693, #6653, #6662, #6648, #6668, #6804)
* Starting with 1.4, sklearn interface prediction will use inplace predict by default when
input data is supported.
* Users can use inplace predict with `dart` booster and enable GPU acceleration just
like `gbtree`.
* Also all prediction functions with tree models are now thread-safe. Inplace predict is
improved with `base_margin` support.
* A new set of C predict functions are exposed in the public interface.
* A user-visible change is a newly added parameter called `strict_shape`. See
https://xgboost.readthedocs.io/en/latest/prediction.html for more details.
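For orientation, here is a minimal sketch of the new parameter and of inplace prediction; the data is synthetic and not part of the release notes, and it assumes XGBoost >= 1.4 with numpy installed:

```python
# Minimal sketch, assuming XGBoost >= 1.4; data is synthetic.
import numpy as np
import xgboost as xgb

X = np.random.rand(100, 10)
y = np.random.randint(0, 2, size=100)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)

# With strict_shape=True the output always has shape (n_samples, n_groups).
preds = booster.predict(dtrain, strict_shape=True)
print(preds.shape)  # (100, 1)

# Inplace prediction skips DMatrix construction entirely and is thread-safe.
preds_inplace = booster.inplace_predict(X)
```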
### Improvement on Dask interface
* Starting with 1.4, the Dask interface is considered to be feature-complete, which means
all of the models found in the single node Python interface are now supported in Dask,
including but not limited to ranking and random forest. Also, the prediction function
is significantly faster and supports shap value computation.
- Most of the parameters found in single node sklearn interface are supported by
Dask interface. (#6471, #6591)
- Implements learning to rank. On the Dask interface, we use the newly added support of
query ID to enable group structure. (#6576)
- The Dask interface has Python type hints support. (#6519)
- All models can be safely pickled. (#6651)
- Random forest estimators are now supported. (#6602)
- Shap value computation is now supported. (#6575, #6645, #6614)
- Evaluation result is printed on the scheduler process. (#6609)
- `DaskDMatrix` (and device quantile dmatrix) now accepts all meta-information. (#6601)
* Prediction optimization. We enhanced and sped up the prediction function for the
Dask interface. See the latest Dask tutorial page in our documentation for an overview of
how you can optimize it even further. (#6650, #6645, #6648, #6668)
* Bug fixes
- If you are using the latest Dask and distributed where `distributed.MultiLock` is
present, XGBoost supports training multiple models on the same cluster in
parallel. (#6743)
- A bug fix for when using `dask.client` to launch async task, XGBoost might use a
different client object internally. (#6722)
* Other improvements on documents, blogs, tutorials, and demos. (#6389, #6366, #6687,
#6699, #6532, #6501)
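As a rough illustration of the workflow this section describes, the sketch below trains and predicts through the Dask interface. It assumes `dask`, `distributed` and XGBoost >= 1.4 are installed, and uses a throwaway local cluster with synthetic data:

```python
# Minimal sketch, assuming dask, distributed and XGBoost >= 1.4.
import dask.array as da
import xgboost as xgb
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    with Client(LocalCluster(n_workers=2)) as client:
        X = da.random.random((1000, 10), chunks=(250, 10))
        y = da.random.randint(0, 2, size=1000, chunks=250)
        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        output = xgb.dask.train(
            client, {"objective": "binary:logistic"}, dtrain, num_boost_round=10
        )
        preds = xgb.dask.predict(client, output["booster"], dtrain)
```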
### Python package
With changes from Dask and general improvements to prediction, we have made several
enhancements to the general Python interface and to IO for booster information. Starting
from 1.4, booster feature names and types can be saved into the JSON model. Also, some
model attributes like `best_iteration` and `best_score` are restored upon model load. On
the sklearn interface, some attributes are now implemented as Python object properties with
better documentation.
* Breaking change: All `data` parameters in prediction functions are renamed to `X`
for better compliance with the sklearn estimator interface guidelines.
* Breaking change: XGBoost used to generate some pseudo feature names with `DMatrix`
when inputs like `np.ndarray` don't have column names. The procedure is removed to
avoid conflict with other inputs. (#6605)
* Early stopping with training continuation is now supported. (#6506)
* Optional import for Dask and cuDF are now lazy. (#6522)
* As mentioned in the prediction improvement summary, the sklearn interface uses inplace
prediction whenever possible. (#6718)
* Booster information such as feature names and feature types is now saved into the JSON
model file (see the sketch after this list). (#6605)
* All `DMatrix` interfaces including `DeviceQuantileDMatrix` and counterparts in Dask
interface (as mentioned in the Dask changes summary) now accept all the meta-information
like `group` and `qid` in their constructor for better consistency. (#6601)
* Booster attributes are restored upon model load so users don't have to call `attr`
manually. (#6593)
* On sklearn interface, all models accept `base_margin` for evaluation datasets. (#6591)
* Improvements over the setup script including smaller sdist size and faster installation
if the C++ library is already built (#6611, #6694, #6565).
* Bug fixes for Python package:
- Don't validate feature when number of rows is 0. (#6472)
- Move metric configuration into booster. (#6504)
- Calling XGBModel.fit() should clear the Booster by default (#6562)
- Support `_estimator_type`. (#6582)
- [dask, sklearn] Fix predict proba. (#6566, #6817)
- Restore unknown data support. (#6595)
- Fix learning rate scheduler with cv. (#6720)
- Fixes small typo in sklearn documentation (#6717)
- [python-package] Fix class Booster: feature_types = None (#6705)
- Fix divide by 0 in feature importance when no split is found. (#6676)
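The sketch referenced above shows the feature-name round-trip through the JSON model format; the column names are illustrative and the data synthetic, assuming XGBoost >= 1.4 with pandas installed:

```python
# Minimal sketch, assuming XGBoost >= 1.4; column names are illustrative.
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.DataFrame(np.random.rand(50, 3), columns=["f_a", "f_b", "f_c"])
dtrain = xgb.DMatrix(df, label=np.random.rand(50))
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5)

booster.save_model("model.json")                 # feature names go into the JSON file
restored = xgb.Booster(model_file="model.json")
print(restored.feature_names)                    # ['f_a', 'f_b', 'f_c']
```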
### JVM package
* [jvm-packages] fix early stopping doesn't work even without custom_eval setting (#6738)
* fix potential TaskFailedListener's callback won't be called (#6612)
* [jvm] Add ability to load booster direct from byte array (#6655)
* [jvm-packages] JVM library loader extensions (#6630)
### R package
* R documentation: Make construction of DMatrix consistent.
* Fix R documentation for xgb.train. (#6764)
### ROC-AUC
We re-implemented the ROC-AUC metric in XGBoost. The new implementation supports
multi-class classification and has better support for learning to rank tasks that are not
binary. Also, it has a better-defined average in distributed environments, with additional
handling for invalid datasets. (#6749, #6747, #6797)
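A minimal sketch of the multi-class case, with synthetic labels and assuming XGBoost >= 1.4:

```python
# Minimal sketch, assuming XGBoost >= 1.4; labels are synthetic.
import numpy as np
import xgboost as xgb

X = np.random.rand(150, 4)
y = np.random.randint(0, 3, size=150)
dtrain = xgb.DMatrix(X, label=y)
# "auc" is now accepted for multi-class objectives as well.
xgb.train(
    {"objective": "multi:softprob", "num_class": 3, "eval_metric": "auc"},
    dtrain, num_boost_round=5, evals=[(dtrain, "train")],
)
```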
### Global configuration.
Starting from 1.4, XGBoost's Python, R and C interfaces support a new global configuration
model where users can specify some global parameters. Currently, supported parameters are
`verbosity` and `use_rmm`. The latter is experimental; see the RMM plugin demo and the
related README for details. (#6414, #6656)
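A minimal sketch of the Python side of the global configuration, assuming XGBoost >= 1.4:

```python
# Minimal sketch, assuming XGBoost >= 1.4.
import xgboost as xgb

xgb.set_config(verbosity=2)              # process-wide setting
with xgb.config_context(verbosity=0):    # scoped override
    pass                                 # train quietly here
print(xgb.get_config()["verbosity"])     # back to 2 outside the context
```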
### Other New features.
* Better handling for input data types that support `__array_interface__`. For some
data types including GPU inputs and `scipy.sparse.csr_matrix`, XGBoost employs
`__array_interface__` for processing the underlying data. Starting from 1.4, XGBoost
can accept arbitrary array strides (which means column-major is supported) without
making data copies, potentially reducing a significant amount of memory consumption.
Also version 3 of `__cuda_array_interface__` is now supported. (#6776, #6765, #6459,
#6675)
* Improved parameter validation: feeding XGBoost parameters that contain
whitespace now triggers an error. (#6769)
* For the Python and R packages, file paths containing the home indicator `~` are supported.
* As mentioned in the Python changes summary, the JSON model can now save feature
information of the trained booster. The JSON schema is updated accordingly. (#6605)
* Development of categorical data support continues, with newly added support for weighted
data and the `dart` booster. (#6508, #6693)
* As mentioned in Dask change summary, ranking now supports the `qid` parameter for
query groups (see the sketch after this list). (#6576)
* `DMatrix.slice` can now consume a numpy array. (#6368)
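The sketch below covers the last two items: `qid`-based ranking and `DMatrix.slice` with a numpy array. The query groups are synthetic, and it assumes XGBoost >= 1.4:

```python
# Minimal sketch, assuming XGBoost >= 1.4; query groups are synthetic.
import numpy as np
import xgboost as xgb

X = np.random.rand(8, 4)
y = np.random.randint(0, 3, size=8)
qid = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # rows must be grouped by query

dtrain = xgb.DMatrix(X, label=y, qid=qid)
booster = xgb.train({"objective": "rank:pairwise"}, dtrain, num_boost_round=5)

dmat = xgb.DMatrix(X, label=y)
sub = dmat.slice(np.array([0, 1, 2]))      # DMatrix.slice consuming a numpy array
```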
### Other breaking changes
* Aside from the feature name generation, there are 2 breaking changes:
- Drop saving binary format for memory snapshot. (#6513, #6640)
- Change default evaluation metric for binary:logitraw objective to logloss (#6647)
### CPU Optimization
* Aside from the general changes on predict function, some optimizations are applied on
CPU implementation. (#6683, #6550, #6696, #6700)
* Also performance for sampling initialization in `hist` is improved. (#6410)
### Notable fixes in the core library
These fixes do not reside in particular language bindings:
* Fixes for gamma regression. This includes checking for invalid input values, fixes for
gamma deviance metric, and better floating point guard for gamma negative log-likelihood
metric. (#6778, #6537, #6761)
* Random forest with `gpu_hist` might generate low accuracy in previous versions. (#6755)
* Fix a bug in GPU sketching when data size exceeds limit of 32-bit integer. (#6826)
* Memory consumption fix for row-major adapters (#6779)
* Don't estimate sketch batch size when rmm is used. (#6807) (#6830)
* Fix in-place predict with missing value. (#6787)
* Re-introduce double buffer in UpdatePosition, to fix perf regression in gpu_hist (#6757)
* Pass correct split_type to GPU predictor (#6491)
* Fix DMatrix feature names/types IO. (#6507)
* Use view for `SparsePage` exclusively to avoid some data access races. (#6590)
* Check for invalid data. (#6742)
* Fix relocatable include in CMakeList (#6734) (#6737)
* Fix DMatrix slice with feature types. (#6689)
### Other deprecation notices:
* This release will be the last release to support CUDA 10.0. (#6642)
* Starting in the next release, the Python package will require Pip 19.3+ due to the use
of the manylinux2014 tag. Also, CentOS 6, RHEL 6 and other old distributions will not be
supported.
### Known issue:
The MacOS build of the JVM packages doesn't support multi-threading out of the box. To enable
multi-threading with the JVM packages, MacOS users will need to build them from
source. See https://xgboost.readthedocs.io/en/latest/jvm/index.html#installation-from-source
### Doc
* Dedicated page for `tree_method` parameter is added. (#6564, #6633)
* [doc] Add FLAML as a fast tuning tool for XGBoost (#6770)
* Add document for tests directory. [skip ci] (#6760)
* Fix doc string of config.py to use correct `versionadded` (#6458)
* Update demo for prediction. (#6789)
* [Doc] Document that AUCPR is for binary classification/ranking (#5899)
* Update the C API comments (#6457)
* Fix document. [skip ci] (#6669)
### Maintenance: Testing, continuous integration
* Use CPU input for test_boost_from_prediction. (#6818)
* [CI] Upload xgboost4j.dll to S3 (#6781)
* Update dmlc-core submodule (#6745)
* [CI] Use manylinux2010_x86_64 container to vendor libgomp (#6485)
* Add conda-forge badge (#6502)
* Fix merge conflict. (#6512)
* [CI] Split up main.yml, add mypy. (#6515)
* [Breaking] Upgrade cuDF and RMM to 0.18 nightlies; require RMM 0.18+ for RMM plugin (#6510)
* "featue_map" typo changed to "feature_map" (#6540)
* Add script for generating release tarball. (#6544)
* Add credentials to .gitignore (#6559)
* Remove warnings in tests. (#6554)
* Update dmlc-core submodule and conform to new API (#6431)
* Suppress hypothesis health check for dask client. (#6589)
* Fix pylint. (#6714)
* [CI] Clear R package cache (#6746)
* Exclude dmlc test on github action. (#6625)
* Tests for regression metrics with weights. (#6729)
* Add helper script and doc for releasing pip package. (#6613)
* Support pylint 2.7.0 (#6726)
* Remove R cache in github action. (#6695)
* [CI] Do not mix up stashed executable built for ARM and x86_64 platforms (#6646)
* [CI] Add ARM64 test to Jenkins pipeline (#6643)
* Disable s390x and arm64 tests on travis for now. (#6641)
* Move sdist test to action. (#6635)
* [dask] Rework base margin test. (#6627)
### Maintenance: Refactor code for legibility and maintainability
* Improve OpenMP exception handling (#6680)
* Improve string view to reduce string allocation. (#6644)
* Simplify Span checks. (#6685)
* Use generic dispatching routine for array interface. (#6672)
## v1.3.0 (2020.12.08)
### XGBoost4J-Spark: Exceptions should cancel jobs gracefully instead of killing SparkContext (#6019).
* By default, exceptions in XGBoost4J-Spark cause the whole SparkContext to shut down, necessitating the restart of the Spark cluster. This behavior is often a major inconvenience.
* Starting from 1.3.0 release, XGBoost adds a new parameter `killSparkContextOnWorkerFailure` to optionally prevent killing SparkContext. If this parameter is set, exceptions will gracefully cancel training jobs instead of killing SparkContext.
### GPUTreeSHAP: GPU acceleration of the TreeSHAP algorithm (#6038, #6064, #6087, #6099, #6163, #6281, #6332)
* [SHAP (SHapley Additive exPlanations)](https://github.com/slundberg/shap) is a game theoretic approach to explain predictions of machine learning models. It computes feature importance scores for individual examples, establishing how each feature influences a particular prediction. TreeSHAP is an optimized SHAP algorithm specifically designed for decision tree ensembles.
* Starting with 1.3.0 release, it is now possible to leverage CUDA-capable GPUs to accelerate the TreeSHAP algorithm. Check out [the demo notebook](https://github.com/dmlc/xgboost/blob/master/demo/gpu_acceleration/shap.ipynb).
* The CUDA implementation of the TreeSHAP algorithm is hosted at [rapidsai/GPUTreeSHAP](https://github.com/rapidsai/gputreeshap). XGBoost imports it as a Git submodule.
### New style Python callback API (#6199, #6270, #6320, #6348, #6376, #6399, #6441)
* The XGBoost Python package now offers a re-designed callback API. It lets you design various extensions of training in idiomatic Python, and it allows you to use early stopping with the native Dask API (`xgboost.dask`). Check out [the tutorial](https://xgboost.readthedocs.io/en/release_1.3.0/python/callbacks.html) and [the demo](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/callbacks.py).
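A minimal sketch of a custom callback under the new API; the class name is illustrative and the data synthetic, assuming XGBoost >= 1.3:

```python
# Minimal sketch, assuming XGBoost >= 1.3; the callback name is illustrative.
import numpy as np
import xgboost as xgb

class PrintLastEval(xgb.callback.TrainingCallback):
    def after_iteration(self, model, epoch, evals_log):
        # evals_log maps evaluation-set name -> metric name -> history list.
        last = {name: {m: log[-1] for m, log in metrics.items()}
                for name, metrics in evals_log.items()}
        print(epoch, last)
        return False  # returning True would stop training early

X, y = np.random.rand(100, 5), np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)
xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=5,
          evals=[(dtrain, "train")], callbacks=[PrintLastEval()])
```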
### Enable the use of `DeviceQuantileDMatrix` / `DaskDeviceQuantileDMatrix` with large data (#6201, #6229, #6234).
* `DeviceQuantileDMatrix` can achieve memory saving by avoiding extra copies of the training data, and the saving is bigger for large data. Unfortunately, large data with more than 2^31 elements was triggering integer overflow bugs in CUB and Thrust. Tracking issue: #6228.
* This release contains a series of work-arounds to allow the use of `DeviceQuantileDMatrix` with large data:
- Loop over `copy_if` (#6201)
- Loop over `thrust::reduce` (#6229)
- Implement the inclusive scan algorithm in-house, to handle large offsets (#6234)
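For reference, a minimal sketch of the class this section discusses; it assumes a CUDA-capable GPU plus cupy and XGBoost >= 1.3, and uses synthetic data:

```python
# Minimal sketch, assuming a CUDA GPU plus cupy and XGBoost >= 1.3.
import cupy as cp
import xgboost as xgb

X = cp.random.rand(100_000, 20)
y = cp.random.randint(0, 2, size=100_000)

# Builds quantized data directly, avoiding an extra copy of the input.
dtrain = xgb.DeviceQuantileDMatrix(X, label=y)
booster = xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=10)
```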
### Support slicing of tree models (#6302)
* Accessing the best iteration of a model after the application of early stopping used to be error-prone, as one needed to manually pass the `ntree_limit` argument to the `predict()` function.
* Now we provide a simple interface to slice tree models by specifying a range of boosting rounds. The tree ensemble can be split into multiple sub-ensembles via the slicing interface. Check out [an example](https://xgboost.readthedocs.io/en/release_1.3.0/python/model.html).
* In addition, the early stopping callback now supports `save_best` option. When enabled, XGBoost will save (persist) the model at the best boosting round and discard the trees that were fit subsequent to the best round.
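A minimal sketch of the slicing interface, with synthetic data and assuming XGBoost >= 1.3:

```python
# Minimal sketch, assuming XGBoost >= 1.3.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=30)

first_ten = booster[0:10]            # sub-ensemble with trees 0-9
preds = first_ten.predict(dtrain)    # no ntree_limit bookkeeping required
```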
### Weighted subsampling of features (columns) (#5962)
* It is now possible to sample features (columns) via weighted subsampling, in which features with higher weights are more likely to be selected in the sample. Weighted subsampling allows you to encode domain knowledge by emphasizing a particular set of features in the choice of tree splits. In addition, you can prevent particular features from being used in any splits, by assigning them zero weights.
* Check out [the demo](https://github.com/dmlc/xgboost/blob/master/demo/guide-python/feature_weights.py).
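A minimal sketch of weighted column subsampling; the weights are illustrative, the data synthetic, and it assumes XGBoost >= 1.3:

```python
# Minimal sketch, assuming XGBoost >= 1.3; weights are illustrative.
import numpy as np
import xgboost as xgb

X, y = np.random.rand(100, 4), np.random.rand(100)
fw = np.array([1.0, 1.0, 2.0, 0.0])   # the last feature is never sampled

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_info(feature_weights=fw)
# The weights only take effect when column subsampling is enabled.
booster = xgb.train({"colsample_bynode": 0.5}, dtrain, num_boost_round=10)
```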
### Improved integration with Dask
* Support reverse-proxy environment such as Google Kubernetes Engine (#6343, #6475)
* An XGBoost training job will no longer use all available workers. Instead, it will only use the workers that contain input data (#6343).
* The new callback API works well with the Dask training API.
* The `predict()` and `fit()` function of `DaskXGBClassifier` and `DaskXGBRegressor` now accept a base margin (#6155).
* Support more meta data in the Dask API (#6130, #6132, #6333).
* Allow passing extra keyword arguments as `kwargs` in `predict()` (#6117)
* Fix typo in dask interface: `sample_weights` -> `sample_weight` (#6240)
* Allow empty data matrix in AFT survival, as Dask may produce empty partitions (#6379)
* Speed up prediction by overlapping prediction jobs in all workers (#6412)
### Experimental support for direct splits with categorical features (#6028, #6128, #6137, #6140, #6164, #6165, #6166, #6179, #6194, #6219)
* Currently, XGBoost requires users to one-hot-encode categorical variables. This has adverse performance implications, as the creation of many dummy variables results in higher memory consumption and may require fitting deeper trees to achieve equivalent model accuracy.
* The 1.3.0 release of XGBoost contains experimental support for direct handling of categorical variables in test nodes. Each test node will have the condition of form `feature_value \in match_set`, where the `match_set` on the right hand side contains one or more matching categories. The matching categories in `match_set` represent the condition for traversing to the right child node. Currently, XGBoost will only generate categorical splits with a single matching category ("one-vs-rest split"). In a future release, we plan to remove this restriction and produce splits with multiple matching categories in `match_set`.
* The categorical split requires the use of JSON model serialization. The legacy binary serialization method cannot be used to save (persist) models with categorical splits.
* Note. This feature is currently highly experimental. Use it at your own risk. See the detailed list of limitations at [#5949](https://github.com/dmlc/xgboost/pull/5949).
### Experimental plugin for RAPIDS Memory Manager (#5873, #6131, #6146, #6150, #6182)
* RAPIDS Memory Manager library ([rapidsai/rmm](https://github.com/rapidsai/rmm)) provides a collection of efficient memory allocators for NVIDIA GPUs. It is now possible to use XGBoost with memory allocators provided by RMM, by enabling the RMM integration plugin. With this plugin, XGBoost is now able to share a common GPU memory pool with other applications using RMM, such as the RAPIDS data science packages.
* See [the demo](https://github.com/dmlc/xgboost/blob/master/demo/rmm_plugin/README.md) for a working example, as well as directions for building XGBoost with the RMM plugin.
* The plugin will soon be considered non-experimental, once #6297 is resolved. A minimal usage sketch follows.
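A minimal sketch, assuming XGBoost was built with the plugin enabled (`-DPLUGIN_RMM=ON`); the training lines are placeholders:

```python
import rmm
import xgboost as xgb

# Route GPU allocations through an RMM memory pool before anything allocates.
rmm.reinitialize(pool_allocator=True)

# With the plugin compiled in, GPU training below draws from the shared pool.
# dtrain = xgb.DMatrix(X, label=y)
# booster = xgb.train({"tree_method": "gpu_hist"}, dtrain)
```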
### Experimental plugin for oneAPI programming model (#5825)
* oneAPI is a programming interface developed by Intel aimed at providing one programming model for many types of hardware, such as CPUs, GPUs, FPGAs, and other hardware accelerators.
* XGBoost now includes an experimental plugin for using oneAPI for the predictor and objective functions. The plugin is hosted in the directory `plugin/updater_oneapi`.
* Roadmap: #5442
### Pickling the XGBoost model will now trigger JSON serialization (#6027)
* The pickle will now contain the JSON string representation of the XGBoost model, as well as related configuration. A round-trip sketch follows.
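A minimal round-trip sketch with the scikit-learn wrapper and synthetic data:

```python
import pickle

import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
clf = xgb.XGBClassifier(n_estimators=10).fit(X, y)

payload = pickle.dumps(clf)  # embeds the JSON model plus configuration
clf2 = pickle.loads(payload)
assert (clf.predict(X) == clf2.predict(X)).all()
```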
### Performance improvements
* Various performance improvements on multi-core CPUs
  - Optimize DMatrix build time by up to 3.7x. (#5877)
  - Improve CPU predict performance by up to 3.6x. (#6127)
  - Optimize CPU sketch allreduce for sparse data (#6009)
  - Thread-local memory allocation for BuildHist, leading to a speedup of up to 1.7x. (#6358)
  - Disable hyperthreading for DMatrix creation (#6386). This speeds up DMatrix creation by up to 2x.
  - Simple fix for static schedule in predict (#6357)
* Unify thread configuration, to make it easy to utilize all CPU cores (#6186)
* [jvm-packages] Clean the way deterministic partitioning is computed (#6033)
* Speed up JSON serialization by implementing an intrusive pointer class (#6129). It leads to 1.5x-2x performance boost.
### API additions
* [R] Add SHAP summary plot using ggplot2 (#5882)
* Modin DataFrame can now be used as input (#6055)
* [jvm-packages] Add `getNumFeature` method (#6075)
* Add MAPE metric (#6119)
* Implement GPU predict leaf. (#6187)
* Enable cuDF/cuPy inputs in `XGBClassifier` (#6269)
* Document tree method for feature weights. (#6312)
* Add `fail_on_invalid_gpu_id` parameter, which will cause XGBoost to terminate upon seeing an invalid value of `gpu_id` (#6342)
### Breaking: the default evaluation metric for classification is changed to `logloss` / `mlogloss` (#6183)
* The default metric used to be accuracy, and it is not statistically consistent to perform early stopping with the accuracy metric when we are really optimizing the log loss for the `binary:logistic` objective.
* For statistical consistency, the default metric for classification has been changed to `logloss`. Users may choose to preserve the old behavior by explicitly specifying `eval_metric`, as sketched below.
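A minimal sketch of pinning the old metric explicitly; the data names in the commented lines are placeholders:

```python
import xgboost as xgb

# Native interface: name the metric in the parameter dict.
params = {"objective": "binary:logistic", "eval_metric": "error"}

# scikit-learn wrapper: pass the metric to fit() with an evaluation set.
# clf = xgb.XGBClassifier()
# clf.fit(X, y, eval_set=[(X_valid, y_valid)], eval_metric="error")
```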
### Breaking: `skmaker` is now removed (#5971)
* The `skmaker` updater was neither documented nor tested.
### Breaking: the JSON model format no longer stores the leaf child count (#6094).
* The leaf child count field has been deprecated and is not used anywhere in the XGBoost codebase.
### Breaking: XGBoost now requires MacOS 10.14 (Mojave) or later.
* Homebrew has dropped support for MacOS 10.13 (High Sierra), so we are not able to install the OpenMP runtime (`libomp`) from Homebrew on MacOS 10.13. Please use MacOS 10.14 (Mojave) or later.
### Deprecation notices
* The use of `LabelEncoder` in `XGBClassifier` is now deprecated and will be removed in the next minor release (#6269). The deprecation is necessary to support multiple types of inputs, such as cuDF data frames or cuPy arrays.
* The use of certain positional arguments in the Python interface is deprecated (#6365). Users will see deprecation warnings when passing certain function parameters positionally. New code should use keyword arguments as much as possible. We have not yet decided when we will fully require the use of keyword arguments.
### Bug-fixes
* On big-endian arch, swap the byte order in the binary serializer to enable loading models that were produced by a little-endian machine (#5813).
* [jvm-packages] Fix deterministic partitioning with dataset containing Double.NaN (#5996)
* Limit tree depth for GPU hist to 31 to prevent integer overflow (#6045)
* [jvm-packages] Set `maxBins` to 256 to align with the default value in the C++ code (#6066)
* [R] Fix CRAN check (#6077)
* Add back support for `scipy.sparse.coo_matrix` (#6162)
* Handle duplicated values in sketching. (#6178)
* Catch all standard exceptions in C API. (#6220)
* Fix linear GPU input (#6255)
* Fix inplace prediction interval. (#6259)
* [R] allow `xgb.plot.importance()` calls to fill a grid (#6294)
* Lazy import dask libraries. (#6309)
* Deterministic data partitioning for external memory (#6317)
* Avoid resetting seed for every configuration. (#6349)
* Fix label errors in graph visualization (#6369)
* [jvm-packages] Fix potential abort of unit test suites due to a race condition (#6373)
* [R] Fix warnings from `R check --as-cran` (#6374)
* [R] Fix a crash that occurs with noLD R (#6378)
* [R] Do not convert continuous labels to factors (#6380)
* [R] remove uses of `exists()` (#6387)
* Propagate parameters to the underlying `Booster` handle from `XGBClassifier.set_param` / `XGBRegressor.set_param`. (#6416)
* [R] Fix R package installation via CMake (#6423)
* Enforce row-major order in cuPy array (#6459)
* Fix filtering callable objects in the parameters passed to the scikit-learn API. (#6466)
### Maintenance: Testing, continuous integration, build system
* [CI] Improve JVM test in GitHub Actions (#5930)
* Refactor plotting test so that it can run independently (#6040)
* [CI] Cancel builds on subsequent pushes (#6011)
* Fix Dask Pytest fixture (#6024)
* [CI] Migrate linters to GitHub Actions (#6035)
* [CI] Remove win2016 JVM test from GitHub Actions (#6042)
* Fix CMake build with `BUILD_STATIC_LIB` option (#6090)
* Don't link imported target in CMake (#6093)
* Work around a compiler bug in MacOS AppleClang 11 (#6103)
* [CI] Fix CTest by running it in a correct directory (#6104)
* [R] Check warnings explicitly for model compatibility tests (#6114)
* [jvm-packages] add xgboost4j-gpu/xgboost4j-spark-gpu module to facilitate release (#6136)
* [CI] Time GPU tests. (#6141)
* [R] remove warning in configure.ac (#6152)
* [CI] Upgrade cuDF and RMM to 0.16 nightlies; upgrade to Ubuntu 18.04 (#6157)
* [CI] Test C API demo (#6159)
* Option for generating device debug info. (#6168)
* Update `.gitignore` (#6175, #6193, #6346)
* Hide C++ symbols from dmlc-core (#6188)
* [CI] Added arm64 job in Travis-CI (#6200)
* [CI] Fix Docker build for CUDA 11 (#6202)
* [CI] Move non-OpenMP gtest to GitHub Actions (#6210)
* [jvm-packages] Fix up build for xgboost4j-gpu, xgboost4j-spark-gpu (#6216)
* Add more tests for categorical data support (#6219)
* [dask] Test for data initialization. (#6226)
* Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j (#6230)
* Bump junit from 4.11 to 4.13.1 in /jvm-packages/xgboost4j-gpu (#6233)
* [CI] Reduce testing load with RMM (#6249)
* [CI] Build a Python wheel for aarch64 platform (#6253)
* [CI] Time the CPU tests on Jenkins. (#6257)
* [CI] Skip Dask tests on ARM. (#6267)
* Fix a typo in `is_arm()` in testing.py (#6271)
* [CI] replace `egrep` with `grep -E` (#6287)
* Support unity build. (#6295)
* [CI] Mark flaky tests as XFAIL (#6299)
* [CI] Use separate Docker cache for each CUDA version (#6305)
* Added `USE_NCCL_LIB_PATH` option to enable users to set `NCCL_LIBRARY` during build (#6310)
* Fix flaky data initialization test. (#6318)
* Add a badge for GitHub Actions (#6321)
* Optional `find_package` for sanitizers. (#6329)
* Use pytest conventions consistently in Python tests (#6337)
* Fix missing space in warning message (#6340)
* Update `custom_metric_obj.rst` (#6367)
* [CI] Run R check with `--as-cran` flag on GitHub Actions (#6371)
* [CI] Remove R check from Jenkins (#6372)
* Mark GPU external memory test as XFAIL. (#6381)
* [CI] Add noLD R test (#6382)
* Fix MPI build. (#6403)
* [CI] Upgrade to MacOS Mojave image (#6406)
* Fix flaky sparse page dmatrix test. (#6417)
* [CI] Upgrade cuDF and RMM to 0.17 nightlies (#6434)
* [CI] Fix CentOS 6 Docker images (#6467)
* [CI] Vendor libgomp in the manylinux Python wheel (#6461)
* [CI] Hot fix for libgomp vendoring (#6482)
### Maintenance: Clean up and merge the Rabit submodule (#6023, #6095, #6096, #6105, #6110, #6262, #6275, #6290)
* The Rabit submodule is now maintained as part of the XGBoost codebase.
* Tests for Rabit are now part of the test suites of XGBoost.
* Rabit can now be built on the Windows platform.
* We re-formatted various parts of the C++ code with clang-tidy.
* Public headers of XGBoost no longer depend on Rabit headers.
* Unused CMake targets for Rabit were removed.
* Single-point model recovery has been dropped and removed from Rabit, simplifying the Rabit code greatly. The single-point model recovery feature has not been adequately maintained over the years.
* We removed the parts of Rabit that were not useful for XGBoost.
### Maintenance: Refactor code for legibility and maintainability
* Unify CPU hist sketching (#5880)
* [R] fix uses of 1:length(x) and other small things (#5992)
* Unify evaluation functions. (#6037)
* Make binary bin search reusable. (#6058)
* Unify set index data. (#6062)
* [R] Remove `stringi` dependency (#6109)
* Merge extract cuts into QuantileContainer. (#6125)
* Reduce C++ compiler warnings (#6197, #6198, #6213, #6286, #6325)
* Cleanup Python code. (#6223)
* Small cleanup to evaluator. (#6400)
### Usability Improvements, Documentation
* [jvm-packages] add example to handle missing value other than 0 (#5677)
* Add DMatrix usage examples to the C API demo (#5854)
* List `DaskDeviceQuantileDMatrix` in the doc. (#5975)
* Update Python custom objective demo. (#5981)
* Update the JSON model schema to document more objective functions. (#5982)
* [Python] Fix warning when `missing` field is not used. (#5969)
* Fix typo in tracker logging (#5994)
* Move a warning about empty dataset, so that it's shown for all objectives and metrics (#5998)
* Fix the instructions for installing the nightly build. (#6004)
* [Doc] Add dtreeviz as a showcase example of integration with 3rd-party software (#6013)
* [jvm-packages] [doc] Update install doc for JVM packages (#6051)
* Fix typo in `xgboost.callback.early_stop` docstring (#6071)
* Add cache suffix to the files used in the external memory demo. (#6088)
* [Doc] Document the parameter `kill_spark_context_on_worker_failure` (#6097)
* Fix link to the demo for custom objectives (#6100)
* Update Dask doc. (#6108)
* Validate weights are positive values. (#6115)
* Document the updated CMake version requirement. (#6123)
* Add demo for `DaskDeviceQuantileDMatrix`. (#6156)
* Cosmetic fixes in `faq.rst` (#6161)
* Fix error message. (#6176)
* [Doc] Add list of winning solutions in data science competitions using XGBoost (#6177)
* Fix a comment in demo to use correct reference (#6190)
* Update the list of winning solutions using XGBoost (#6192)
* Consistent style for build status badge (#6203)
* [Doc] Add info on GPU compiler (#6204)
* Update the list of winning solutions (#6222, #6254)
* Add link to XGBoost's Twitter handle (#6244)
* Fix minor typos in XGBClassifier methods' docstrings (#6247)
* Add sponsors link to FUNDING.yml (#6252)
* Group CLI demo into subdirectory. (#6258)
* Reduce warning messages from `gbtree`. (#6273)
* Create a tutorial for using the C API in a C/C++ application (#6285)
* Update plugin instructions for CMake build (#6289)
* [doc] make Dask distributed example copy-pastable (#6345)
* [Python] Add option to use `libxgboost.so` from the system path (#6362)
* Fixed few grammatical mistakes in doc (#6393)
* Fix broken link in CLI doc (#6396)
* Improve documentation for the Dask API (#6413)
* Revise misleading exception information: no such param of `allow_non_zero_missing` (#6418)
* Fix CLI ranking demo. (#6439)
* Fix broken links. (#6455)
### Acknowledgement
**Contributors**: Nan Zhu (@CodingCat), @FelixYBW, Jack Dunn (@JackDunnNZ), Jean Lescut-Muller (@JeanLescut), Boris Feld (@Lothiraldan), Nikhil Choudhary (@Nikhil1O1), Rory Mitchell (@RAMitchell), @ShvetsKS, Anthony D'Amato (@Totoketchup), @Wittty-Panda, neko (@akiyamaneko), Alexander Gugel (@alexanderGugel), @dependabot[bot], DIVYA CHAUHAN (@divya661), Daniel Steinberg (@dstein64), Akira Funahashi (@funasoul), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), Hristo Iliev (@hiliev), Honza Sterba (@honzasterba), @hzy001, Igor Moura (@igormp), @jameskrach, James Lamb (@jameslamb), Naveed Ahmed Saleem Janvekar (@janvekarnaveed), Kyle Nicholson (@kylejn27), lacrosse91 (@lacrosse91), Christian Lorentzen (@lorentzenchr), Manikya Bardhan (@manikyabard), @nabokovas, John Quitto-Graham (@nvidia-johnq), @odidev, Qi Zhang (@qzhang90), Sergio Gavilán (@sgavil), Tanuja Kirthi Doddapaneni (@tanuja3), Cuong Duong (@tcuongd), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), vcarpani (@vcarpani), Vladislav Epifanov (@vepifanov), Vitalie Spinu (@vspinu), Bobby Wang (@wbo4958), Zeno Gantner (@zenogantner), zhang_jf (@zuston)
**Reviewers**: Nan Zhu (@CodingCat), John Zedlewski (@JohnZed), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Anthony D'Amato (@Totoketchup), @Wittty-Panda, Alexander Gugel (@alexanderGugel), Codecov Comments Bot (@codecov-commenter), Codecov (@codecov-io), DIVYA CHAUHAN (@divya661), Devin Robison (@drobison00), Geoffrey Blake (@geoffreyblake), Mark Harris (@harrism), Philip Hyunsu Cho (@hcho3), Honza Sterba (@honzasterba), Igor Moura (@igormp), @jakirkham, @jameskrach, James Lamb (@jameslamb), Janakarajan Natarajan (@janaknat), Jake Hemstad (@jrhemstad), Keith Kraus (@kkraus14), Kyle Nicholson (@kylejn27), Christian Lorentzen (@lorentzenchr), Michael Mayer (@mayer79), Nikolay Petrov (@napetrov), @odidev, PSEUDOTENSOR / Jonathan McKinney (@pseudotensor), Qi Zhang (@qzhang90), Sergio Gavilán (@sgavil), Scott Lundberg (@slundberg), Cuong Duong (@tcuongd), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), vcarpani (@vcarpani), Vladislav Epifanov (@vepifanov), Vincent Nijs (@vnijs), Vitalie Spinu (@vspinu), Bobby Wang (@wbo4958), William Hicks (@wphicks)
## v1.2.0 (2020.08.22)
### XGBoost4J-Spark now supports the GPU algorithm (#5171)
* Now XGBoost4J-Spark is able to leverage NVIDIA GPU hardware to speed up training.
* There is on-going work for accelerating the rest of the data pipeline with NVIDIA GPUs (#5950, #5972).
### XGBoost now supports CUDA 11 (#5808)
* It is now possible to build XGBoost with CUDA 11. Note that we do not yet distribute pre-built binaries built with CUDA 11; all current distributions use CUDA 10.0.
### Better guidance for persisting XGBoost models in an R environment (#5940, #5964)
* Users are strongly encouraged to use `xgb.save()` and `xgb.save.raw()` instead of `saveRDS()`. This is so that the persisted models can be accessed with future releases of XGBoost.
* The previous release (1.1.0) had problems loading models that were saved with `saveRDS()`. This release adds a compatibility layer to restore access to the old RDS files. Note that this is meant to be a temporary measure; users are advised to stop using `saveRDS()` and migrate to `xgb.save()` and `xgb.save.raw()`.
### New objectives and metrics
* The pseudo-Huber loss `reg:pseudohubererror` is added (#5647). The corresponding metric is `mphe`. Right now, the slope is hard-coded to 1. (A parameter sketch follows this list.)
* The Accelerated Failure Time objective for survival analysis (`survival:aft`) is now accelerated on GPUs (#5714, #5716). The survival metrics `aft-nloglik` and `interval-regression-accuracy` are also accelerated on GPUs.
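A minimal parameter sketch for the pseudo-Huber objective above, assuming an existing `dtrain`:

```python
import xgboost as xgb

params = {"objective": "reg:pseudohubererror", "eval_metric": "mphe"}
# booster = xgb.train(params, dtrain, num_boost_round=100)
```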
### Improved integration with scikit-learn
* Added `n_features_in_` attribute to the scikit-learn interface to store the number of features used (#5780). This is useful for integrating with some scikit-learn features such as `StackingClassifier`. See [this link](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep010/proposal.html) for more details.
* `XGBoostError` now inherits `ValueError`, which conforms to scikit-learn's exception requirement (#5696).
### Improved integration with Dask
* The XGBoost Dask API now exposes an asynchronous interface (#5862). See [the document](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#working-with-asyncio) for details.
* Zero-copy ingestion of GPU arrays via `DaskDeviceQuantileDMatrix` (#5623, #5799, #5800, #5803, #5837, #5874, #5901): Previously, the Dask interface had to make two data copies: one for concatenating the Dask partition/block into a single block and another for internal representation. To save memory, we introduce `DaskDeviceQuantileDMatrix`. As long as Dask partitions are resident in the GPU memory, `DaskDeviceQuantileDMatrix` is able to ingest them directly without making copies. This matrix type wraps `DeviceQuantileDMatrix`. A minimal sketch follows this list.
* The prediction function now returns GPU Series type if the input is from Dask-cuDF (#5710). This is to preserve the input data type.
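A minimal sketch of the zero-copy path, assuming `client` is a connected Dask client and `X`, `y` are dask-cuDF collections already resident on GPU workers:

```python
import xgboost as xgb

# Ingest GPU-resident partitions without the extra concatenation copy.
dtrain = xgb.dask.DaskDeviceQuantileDMatrix(client, X, y)
output = xgb.dask.train(client, {"tree_method": "gpu_hist"}, dtrain,
                        num_boost_round=50)
```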
### Robust handling of external data types (#5689, #5893)
- As we support more and more external data types, the handling logic has proliferated all over the code base and become hard to keep track of. It also became unclear how missing values and threads are handled. We refactored the Python package code to collect all data handling logic in a central location, and now we have an explicit list of all supported data types.
### Improvements in GPU-side data matrix (`DeviceQuantileDMatrix`)
* The GPU-side data matrix now implements its own quantile sketching logic, so that data don't have to be transported back to the main memory (#5700, #5747, #5760, #5846, #5870, #5898). The GK sketching algorithm is also now better documented.
- Now we can load extremely sparse datasets such as URL, although performance is still sub-optimal.
* The GPU-side data matrix now exposes an iterative interface (#5783), so that users are able to construct a matrix from a data iterator. See the [Python demo](https://github.com/dmlc/xgboost/blob/release_1.2.0/demo/guide-python/data_iterator.py).
### New language binding: Swift (#5728)
* Visit https://github.com/kongzii/SwiftXGBoost for more details.
### Robust model serialization with JSON (#5772, #5804, #5831, #5857, #5934)
* We continue efforts from the 1.0.0 release to adopt JSON as the format to save and load models robustly.
* JSON model IO is significantly faster and produces smaller model files.
* Round-trip reproducibility is guaranteed, via the introduction of an efficient float-to-string conversion algorithm known as [the Ryū algorithm](https://dl.acm.org/doi/10.1145/3192366.3192369). The conversion is locale-independent, producing consistent numeric representation regardless of the locale setting of the user's machine.
* We fixed an issue in loading large JSON files to memory.
* It is now possible to load a JSON file from a remote source such as S3. A minimal local round trip is sketched below.
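A minimal save/load round trip through JSON, assuming `booster` is an already trained model:

```python
import xgboost as xgb

booster.save_model("model.json")  # the .json extension selects the JSON format

restored = xgb.Booster()
restored.load_model("model.json")
```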
### Performance improvements
* CPU hist tree method optimization
- Skip missing lookup in hist row partitioning if data is dense. (#5644)
- Specialize training procedures for the CPU hist tree method in distributed environments. (#5557)
- Add single point histogram for CPU hist. Previously the gradient histogram for CPU hist was hard-coded to be 64-bit; now users can specify the parameter `single_precision_histogram` to use a 32-bit histogram instead for faster training performance. (#5624, #5811)
* GPU hist tree method optimization
- Removed some unnecessary synchronizations and better memory allocation pattern. (#5707)
- Optimize GPU Hist for wide datasets. Previously, for wide datasets, the atomic operations were performed on global memory; now they can run in shared memory for faster histogram building. There is a known small regression on GeForce cards with dense data. (#5795, #5926, #5948, #5631)
### API additions
* Support passing fmap to importance plot (#5719). Now the importance plot can show actual names of features instead of default ones.
* Support 64-bit seed. (#5643)
* A new C API `XGBoosterGetNumFeature` is added for getting the number of features in a booster (#5856).
* Feature names and feature types are now stored in C++ core and saved in binary DMatrix (#5858).
### Breaking: The `predict()` method of `DaskXGBClassifier` now produces class predictions (#5986). Use `predict_proba()` to obtain probability predictions.
* Previously, `DaskXGBClassifier.predict()` produced probability predictions. This is inconsistent with the behavior of other scikit-learn classifiers, where `predict()` returns class predictions. We make a breaking change in the 1.2.0 release so that `DaskXGBClassifier.predict()` now correctly produces class predictions and thus behaves like other scikit-learn classifiers. Furthermore, we introduce the `predict_proba()` method for obtaining probability predictions, again to be in line with other scikit-learn classifiers. A minimal sketch follows.
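A minimal sketch, assuming an active `dask.distributed` client and Dask collections `X` and `y`:

```python
import xgboost as xgb

clf = xgb.dask.DaskXGBClassifier(n_estimators=10)
clf.fit(X, y)

labels = clf.predict(X)        # class labels (the new 1.2.0 behaviour)
probs = clf.predict_proba(X)   # class probabilities
```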
### Breaking: Custom evaluation metric now receives raw prediction (#5954)
* Previously, the custom evaluation metric received a transformed prediction result when used with a classifier. Now the custom metric will receive a raw (untransformed) prediction and will need to transform the prediction itself. See [demo/guide-python/custom\_softmax.py](https://github.com/dmlc/xgboost/blob/release_1.2.0/demo/guide-python/custom_softmax.py) for an example.
* This change is to make the custom metric behave consistently with the custom objective, which already receives raw prediction (#5564). A minimal sketch follows.
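A minimal sketch for `binary:logistic`, where the metric now applies the sigmoid itself; the metric name is illustrative, and `params`/`dtrain` are assumed to exist:

```python
import numpy as np
import xgboost as xgb

def my_logloss(predt: np.ndarray, dtrain: xgb.DMatrix):
    # `predt` is the raw margin now, so transform it before scoring.
    y = dtrain.get_label()
    prob = 1.0 / (1.0 + np.exp(-predt))
    eps = 1e-12
    val = -np.mean(y * np.log(prob + eps) + (1 - y) * np.log(1 - prob + eps))
    return "my-logloss", float(val)

# booster = xgb.train(params, dtrain, evals=[(dtrain, "train")], feval=my_logloss)
```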
### Breaking: XGBoost4J-Spark now requires Spark 3.0 and Scala 2.12 (#5836, #5890)
* Starting with version 3.0, Spark can manage GPU resources and allocate them among executors.
* Spark 3.0 dropped support for Scala 2.11 and now only supports Scala 2.12. Thus, XGBoost4J-Spark also only supports Scala 2.12.
### Breaking: XGBoost Python package now requires Python 3.6 and later (#5715)
* Python 3.6 has many useful features such as f-strings.
### Breaking: XGBoost now adopts the C++14 standard (#5664)
* Make sure to use a sufficiently modern C++ compiler that supports C++14, such as Visual Studio 2017, GCC 5.0+, and Clang 3.4+.
### Bug-fixes
* Fix a data race in the prediction function (#5853). As a byproduct, the prediction function now uses a thread-local data store and has become thread-safe.
* Restore capability to run prediction when the test input has fewer features than the training data (#5955). This capability is necessary to support predicting with LIBSVM inputs. The previous release (1.1) had broken this capability, so we restore it in this version with better tests.
* Fix OpenMP build with CMake for R package, to support CMake 3.13 (#5895).
* Fix Windows 2016 build (#5902, #5918).
* Fix edge cases in scikit-learn interface with Pandas input by disabling feature validation. (#5953)
* [R] Enable weighted learning to rank (#5945)
* [R] Fix early stopping with custom objective (#5923)
* Fix NDK Build (#5886)
* Add missing explicit template specializations for greater portability (#5921)
* Handle empty rows in data iterators correctly (#5929). This bug affected the file loader and JVM data frames.
* Fix `IsDense` (#5702)
* [jvm-packages] Fix wrong method name `setAllowZeroForMissingValue` (#5740)
* Fix shape inference for Dask predict (#5989)
### Usability Improvements, Documentation
* [Doc] Document that CUDA 10.0 is required (#5872)
* Refactored command line interface (CLI). Now the CLI is able to handle user errors and output basic documentation. (#5574)
* Better error handling in Python: use `raise from` syntax to preserve full stacktrace (#5787).
* The JSON model dump now has a formal schema (#5660, #5818). The benefit is to prevent the `dump_model()` function from breaking. See [this document](https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html#difference-between-saving-model-and-dumping-model) to understand the difference between saving and dumping models.
* Add a reference to the GPU external memory paper (#5684)
* Document more objective parameters in the R package (#5682)
* Document the existence of pre-built binary wheels for MacOS (#5711)
* Remove `max.depth` in the R gblinear example. (#5753)
* Added conda environment file for building docs (#5773)
* Mention the Dask blog post in the doc, which introduces using Dask with GPUs and explains some internal workings. (#5789)
* Fix rendering of Markdown docs (#5821)
* Document new objectives and metrics available on GPUs (#5909)
* Better message when no GPU is found. (#5594)
* Remove the use of `silent` parameter from R demos. (#5675)
* Don't use masked array in array interface. (#5730)
* Update affiliation of @terrytangyuan: Ant Financial -> Ant Group (#5827)
* Move dask tutorial closer to other distributed tutorials (#5613)
* Update XGBoost + Dask overview documentation (#5961)
* Show `n_estimators` in the docstring of the scikit-learn interface (#6041)
* Fix a typo in a docstring of the scikit-learn interface (#5980)
### Maintenance: testing, continuous integration, build system
* [CI] Remove CUDA 9.0 from CI (#5674, #5745)
* Require CUDA 10.0+ in CMake build (#5718)
* [R] Remove dependency on gendef for Visual Studio builds (fixes #5608) (#5764). This enables building XGBoost with GPU support with R 4.x.
* [R-package] Reduce duplication in configure.ac (#5693)
* Bump com.esotericsoftware to 4.0.2 (#5690)
* Migrate some tests from AppVeyor to GitHub Actions to speed up the tests. (#5911, #5917, #5919, #5922, #5928)
* Reduce cost of the Jenkins CI server (#5884, #5904, #5892). We now enforce a daily budget via an automated monitor. We also dramatically reduced the workload for the Windows platform, since the cloud VM cost is vastly greater for Windows.
* [R] Set up automated R linter (#5944)
* [R] replace uses of T and F with TRUE and FALSE (#5778)
* Update Docker container 'CPU' (#5956)
* Simplify CMake build with modern CMake techniques (#5871)
* Use `hypothesis` package for testing (#5759, #5835, #5849).
* Define `_CRT_SECURE_NO_WARNINGS` to remove unneeded warnings in MSVC (#5434)
* Run all Python demos in CI, to ensure that they don't break (#5651)
* Enhance nvtx support (#5636). Now we can use a unified timer between CPU and GPU. Also, CMake is able to find nvtx automatically.
* Speed up python test. (#5752)
* Add helper for generating batches of data. (#5756)
* Add c-api-demo to .gitignore (#5855)
* Add option to enable all compiler warnings in GCC/Clang (#5897)
* Make Python model compatibility test runnable locally (#5941)
* Add cupy to Windows CI (#5797)
* [CI] Fix cuDF install; merge 'gpu' and 'cudf' test suite (#5814)
* Update rabit submodule (#5680, #5876)
* Force colored output for Ninja build. (#5959)
* [CI] Assign larger /dev/shm to NCCL (#5966)
* Add missing Pytest marks to AsyncIO unit test (#5968)
* [CI] Use latest cuDF and dask-cudf (#6048)
* Add CMake flag to log C API invocations, to aid debugging (#5925)
* Fix a unit test on CLI, to handle RC versions (#6050)
* [CI] Use mgpu machine to run gpu hist unit tests (#6050)
* [CI] Build GPU-enabled JAR artifact and deploy to xgboost-maven-repo (#6050)
### Maintenance: Refactor code for legibility and maintainability
* Remove dead code in DMatrix initialization. (#5635)
* Catch dmlc error by ref. (#5678)
* Refactor the `gpu_hist` split evaluation in preparation for batched nodes enumeration. (#5610)
* Remove column major specialization. (#5755)
* Remove unused imports in Python (#5776)
* Avoid including `c_api.h` in header files. (#5782)
* Remove unweighted GK quantile, which is unused. (#5816)
* Add Python binding for rabit ops. (#5743)
* Implement `Empty` method for host device vector. (#5781)
* Remove print (#5867)
* Enforce tree order in JSON (#5974)
### Acknowledgement
**Contributors**: Nan Zhu (@CodingCat), @LionOrCatThatIsTheQuestion, Dmitry Mottl (@Mottl), Rory Mitchell (@RAMitchell), @ShvetsKS, Alex Wozniakowski (@a-wozniakowski), Alexander Gugel (@alexanderGugel), @anttisaukko, @boxdot, Andy Adinets (@canonizer), Ram Rachum (@cool-RR), Elliot Hershberg (@elliothershberg), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), @jameskrach, James Lamb (@jameslamb), James Bourbeau (@jrbourbeau), Peter Jung (@kongzii), Lorenz Walthert (@lorenzwalthert), Oleksandr Kuvshynov (@okuvshynov), Rong Ou (@rongou), Shaochen Shi (@shishaochen), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)
**Reviewers**: Nan Zhu (@CodingCat), @LionOrCatThatIsTheQuestion, Hao Yang (@QuantHao), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Alex Wozniakowski (@a-wozniakowski), Amit Kumar (@aktech), Avinash Barnwal (@avinashbarnwal), @boxdot, Andy Adinets (@canonizer), Chandra Shekhar Reddy (@chandrureddy), Ram Rachum (@cool-RR), Cristiano Goncalves (@cristianogoncalves), Elliot Hershberg (@elliothershberg), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), James Lamb (@jameslamb), James Bourbeau (@jrbourbeau), Lee Drake (@leedrake5), DougM (@mengdong), Oleksandr Kuvshynov (@okuvshynov), RongOu (@rongou), Shaochen Shi (@shishaochen), Xu Xiao (@sperlingxx), Yuan Tang (@terrytangyuan), Theodore Vasiloudis (@thvasilo), Jiaming Yuan (@trivialfis), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)
## v1.1.1 (2020.06.06)
This patch release applies the following patches to 1.1.0 release:
* CPU performance improvement in the PyPI wheels (#5720)
* Fix loading old model (#5724)
* Install pkg-config file (#5744)
## v1.1.0 (2020.05.17)
### Better performance on multi-core CPUs (#5244, #5334, #5522)
@@ -203,6 +867,16 @@ Upgrading to latest pip allows us to depend on newer versions of system librarie
**Reviewers**: Nan Zhu (@CodingCat), @LeZhengThu, Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Steve Bronder (@SteveBronder), Nikita Titov (@StrikerRUS), Andrew Kane (@ankane), Avinash Barnwal (@avinashbarnwal), @brydag, Andy Adinets (@canonizer), Chandra Shekhar Reddy (@chandrureddy), Chen Qin (@chenqin), Codecov (@codecov-io), David Díaz Vico (@daviddiazvico), Darby Payne (@dpayne), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), @johnny-cat, Mu Li (@mli), Mate Soos (@msoos), @rnyak, Rong Ou (@rongou), Sriram Chandramouli (@sriramch), Toby Dylan Hocking (@tdhock), Yuan Tang (@terrytangyuan), Oleksandr Pryimak (@trams), Jiaming Yuan (@trivialfis), Liang-Chi Hsieh (@viirya), Bobby Wang (@wbo4958),
## v1.0.2 (2020.03.03)
This patch release applies the following patches to 1.0.0 release:
* Fix a small typo in sklearn.py that broke multiple eval metrics (#5341)
* Restore loading model from buffer (#5360)
* Use type name for data type check (#5364)
## v1.0.1 (2020.02.21)
This release is identical to the 1.0.0 release, except that it fixes a small bug that rendered 1.0.0 incompatible with Python 3.5. See #5328.
## v1.0.0 (2020.02.19)
This release marks a major milestone for the XGBoost project.
@@ -440,7 +1114,7 @@ This release marks a major milestone for the XGBoost project.
* Specify version macro in CMake. (#4730)
* Include dmlc-tracker into XGBoost Python package (#4731)
* [CI] Use long key ID for Ubuntu repository fingerprints. (#4783)
* Remove plugin, CUDA related code in automake & autoconf files (#4789)
* Skip related tests when scikit-learn is not installed. (#4791)
* Ignore vscode and clion files (#4866)
* Use bundled Google Test by default (#4900)
@@ -471,7 +1145,7 @@ This release marks a major milestone for the XGBoost project.
### Usability Improvements, Documentation
* Add Random Forest API to Python API doc (#4500)
* Fix Python demo and doc. (#4545)
* Remove doc about not supporting CUDA 10.1 (#4578)
* Address some sphinx warnings and errors, add doc for building doc. (#4589)
* Add instruction to run formatting checks locally (#4591)
* Fix docstring for `XGBModel.predict()` (#4592)
@@ -486,7 +1160,7 @@ This release marks a major milestone for the XGBoost project.
* Update XGBoost4J-Spark doc (#4804)
* Regular formatting for evaluation metrics (#4803)
* [jvm-packages] Refine documentation for handling missing values in XGBoost4J-Spark (#4805)
* Monitor for distributed environment (#4829). This is useful for identifying performance bottleneck.
* Add check for length of weights and produce a good error message (#4872)
* Fix DMatrix doc (#4884)
* Export C++ headers in CMake installation (#4897)
@@ -958,7 +1632,7 @@ This release is packed with many new features and bug fixes.
### Known issues
* Quantile sketcher fails to produce any quantile for some edge cases (#2943)
* The `hist` algorithm leaks memory when used with learning rate decay callback (#3579)
* Using custom evaluation function together with early stopping causes assertion failure in XGBoost4J-Spark (#3595)
* Early stopping doesn't work with `gblinear` learner (#3789)
* Label and weight vectors are not reshared upon the change in number of GPUs (#3794). To get around this issue, delete the `DMatrix` object and re-load.
* The `DMatrix` Python objects are initialized with incorrect values when given array slices (#3841)
@@ -1052,7 +1726,7 @@ This version is only applicable for the Python package. The content is identical
- Add scripts to cross-build and deploy artifacts (#3276, #3307)
- Fix a compilation error for Scala 2.10 (#3332)
* BREAKING CHANGES
- `XGBClassifier.predict_proba()` no longer accepts parameter `output_margin`. The parameter makes no sense for `predict_proba()` because the method is to predict class probabilities, not raw margin scores.
## v0.71 (2018.04.11)
* This is a minor release, mainly motivated by issues concerning `pip install`, e.g. #2426, #3189, #3118, and #3194.
@@ -1068,7 +1742,7 @@ This version is only applicable for the Python package. The content is identical
- AUC-PR metric for ranking task (#3172)
- Monotonic constraints for 'hist' algorithm (#3085)
* GPU support
- Create an abstract 1D vector class that moves data seamlessly between the main and GPU memory (#2935, #3116, #3068). This eliminates unnecessary PCIe data transfer during training time.
- Fix minor bugs (#3051, #3217)
- Fix compatibility error for CUDA 9.1 (#3218)
* Python package:
@@ -1096,7 +1770,7 @@ This version is only applicable for the Python package. The content is identical
* Refactored gbm to allow more friendly cache strategy
- Specialized some prediction routine
* Robust `DMatrix` construction from a sparse matrix
* Faster construction of `DMatrix` from 2D NumPy matrices: elide copies, use of multiple threads
* Automatically remove nan from input data when it is sparse.
- This can solve some of user reported problem of istart != hist.size
* Fix the single-instance prediction function to obtain correct predictions
@@ -1124,7 +1798,7 @@ This version is only applicable for the Python package. The content is identical
- Faster, histogram-based tree algorithm (`tree_method='hist'`) .
- GPU/CUDA accelerated tree algorithms (`tree_method='gpu_hist'` or `'gpu_exact'`), including the GPU-based predictor.
- Monotonic constraints: when other features are fixed, force the prediction to be monotonic increasing with respect to a certain specified feature.
- Faster gradient calculation using AVX SIMD
- Ability to export models in JSON format
- Support for Tweedie regression
- Additional dropout options for DART: binomial+1, epsilon
R-package/DESCRIPTION
@@ -1,8 +1,8 @@
Package: xgboost
Type: Package
Title: Extreme Gradient Boosting
Version: 1.5.0.1
Date: 2021-10-13
Authors@R: c(
person("Tianqi", "Chen", role = c("aut"),
email = "tianqi.tchen@gmail.com"),
@@ -53,16 +53,15 @@ Suggests:
testthat,
lintr,
igraph (>= 1.0.1),
float,
crayon,
titanic
Depends:
R (>= 3.3.0)
Imports:
Matrix (>= 1.1-0),
methods,
data.table (>= 1.9.6),
jsonlite (>= 1.0),
RoxygenNote: 7.1.1
SystemRequirements: GNU make, C++14
R-package/NAMESPACE
@@ -36,8 +36,10 @@ export(xgb.create.features)
export(xgb.cv)
export(xgb.dump)
export(xgb.gblinear.history)
export(xgb.get.config)
export(xgb.ggplot.deepness)
export(xgb.ggplot.importance)
export(xgb.ggplot.shap.summary)
export(xgb.importance)
export(xgb.load)
export(xgb.load.raw)
@@ -46,10 +48,12 @@ export(xgb.plot.deepness)
export(xgb.plot.importance)
export(xgb.plot.multi.trees)
export(xgb.plot.shap)
export(xgb.plot.shap.summary)
export(xgb.plot.tree)
export(xgb.save)
export(xgb.save.raw)
export(xgb.serialize)
export(xgb.set.config)
export(xgb.train)
export(xgb.unserialize)
export(xgboost)
@@ -76,14 +80,10 @@ importFrom(graphics,lines)
importFrom(graphics,par)
importFrom(graphics,points)
importFrom(graphics,title)
importFrom(jsonlite,fromJSON)
importFrom(jsonlite,toJSON)
importFrom(stats,median)
importFrom(stats,predict)
importFrom(utils,head)
importFrom(utils,object.size)
importFrom(utils,str)
R-package/R/callbacks.R
@@ -188,7 +188,7 @@ cb.reset.parameters <- function(new_params) {
pnames <- gsub("\\.", "_", names(new_params))
nrounds <- NULL
# run some checks in the beginning
init <- function(env) {
nrounds <<- env$end_iteration - env$begin_iteration + 1
@@ -263,10 +263,7 @@ cb.reset.parameters <- function(new_params) {
#' \itemize{
#' \item \code{best_score} the evaluation score at the best iteration
#' \item \code{best_iteration} at which boosting iteration the best score has occurred (1-based index)
#' }
#'
#' The Same values are also stored as xgb-attributes:
#' \itemize{
#' \item \code{best_iteration} is stored as a 0-based iteration index (for interoperability of binary models)
@@ -352,9 +349,15 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
finalizer <- function(env) {
if (!is.null(env$bst)) {
attr_best_score <- as.numeric(xgb.attr(env$bst$handle, 'best_score'))
if (best_score != attr_best_score) {
# If the difference is too big, throw an error
if (abs(best_score - attr_best_score) >= 1e-14) {
stop("Inconsistent 'best_score' values between the closure state: ", best_score,
" and the xgb.attr: ", attr_best_score)
}
# If the difference is due to floating-point truncation, update best_score
best_score <- attr_best_score
}
env$bst$best_iteration <- best_iteration
env$bst$best_ntreelimit <- best_ntreelimit
env$bst$best_score <- best_score
@@ -492,13 +495,12 @@ cb.cv.predict <- function(save_models = FALSE) {
rep(NA_real_, N)
}
iterationrange <- c(1, NVL(env$basket$best_iteration, env$end_iteration) + 1)
if (NVL(env$params[['booster']], '') == 'gblinear') {
iterationrange <- c(1, 1) # must be 0 for gblinear
}
for (fd in env$bst_folds) {
pr <- predict(fd$bst, fd$watchlist[[2]], iterationrange = iterationrange, reshape = TRUE)
if (is.matrix(pred)) {
pred[fd$index, ] <- pr
} else {
@@ -527,7 +529,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' Callback closure for collecting the model coefficients history of a gblinear booster
#' during its training.
#'
#' @param sparse when set to FALSE/TRUE, a dense/sparse matrix is used to store the result.
#' Sparse format is useful when one expects only a subset of coefficients to be non-zero,
#' when using the "thrifty" feature selector with fairly small number of top features
#' selected per iteration.
@@ -554,7 +556,6 @@ cb.cv.predict <- function(save_models = FALSE) {
#' #
#' # In the iris dataset, it is hard to linearly separate Versicolor class from the rest
#' # without considering the 2nd order interactions:
#' x <- model.matrix(Species ~ .^2, iris)[,-1]
#' colnames(x)
#' dtrain <- xgb.DMatrix(scale(x), label = 1*(iris$Species == "versicolor"))
@@ -575,7 +576,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 200, eta = 0.8,
#' updater = 'coord_descent', feature_selector = 'thrifty', top_k = 1,
#' callbacks = list(cb.gblinear.history()))
#' matplot(xgb.gblinear.history(bst), type = 'l')
#' # Componentwise boosting is known to have similar effect to Lasso regularization.
#' # Try experimenting with various values of top_k, eta, nrounds,
#' # as well as different feature_selectors.
@@ -584,7 +585,7 @@ cb.cv.predict <- function(save_models = FALSE) {
#' bst <- xgb.cv(param, dtrain, nfold = 5, nrounds = 100, eta = 0.8,
#' callbacks = list(cb.gblinear.history()))
#' # coefficients in the CV fold #3
#' matplot(xgb.gblinear.history(bst)[[3]], type = 'l')
#'
#'
#' #### Multiclass classification:
@@ -597,15 +598,15 @@ cb.cv.predict <- function(save_models = FALSE) {
#' bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 70, eta = 0.5,
#' callbacks = list(cb.gblinear.history()))
#' # Will plot the coefficient paths separately for each class:
#' matplot(xgb.gblinear.history(bst, class_index = 0), type = 'l')
#' matplot(xgb.gblinear.history(bst, class_index = 1), type = 'l')
#' matplot(xgb.gblinear.history(bst, class_index = 2), type = 'l')
#'
#' # CV:
#' bst <- xgb.cv(param, dtrain, nfold = 5, nrounds = 70, eta = 0.5,
#' callbacks = list(cb.gblinear.history(FALSE)))
#' # 1st fold of 1st class
#' matplot(xgb.gblinear.history(bst, class_index = 0)[[1]], type = 'l')
#'
#' @export
cb.gblinear.history <- function(sparse=FALSE) {
@@ -636,9 +637,14 @@ cb.gblinear.history <- function(sparse=FALSE) {
if (!is.null(env$bst)) { # # xgb.train:
coefs <<- list2mat(coefs)
} else { # xgb.cv:
# second lapply transposes the list
coefs <<- lapply(
X = lapply(
X = seq_along(coefs[[1]]),
FUN = function(i) lapply(coefs, "[[", i)
),
FUN = list2mat
)
}
}
R-package/R/utils.R
@@ -1,6 +1,6 @@
#
# This file is for the low level reusable utility functions
# that are not supposed to be visible to a user.
#
#
@@ -20,6 +20,12 @@ NVL <- function(x, val) {
stop("typeof(x) == ", typeof(x), " is not supported by NVL")
}
# List of classification and ranking objectives
.CLASSIFICATION_OBJECTIVES <- function() {
return(c('binary:logistic', 'binary:logitraw', 'binary:hinge', 'multi:softmax',
'multi:softprob', 'rank:pairwise', 'rank:ndcg', 'rank:map'))
}
#
# Low-level functions for boosting --------------------------------------------
@@ -167,13 +173,13 @@ xgb.iter.eval <- function(booster_handle, watchlist, iter, feval = NULL) {
evnames <- names(watchlist)
if (is.null(feval)) {
msg <- .Call(XGBoosterEvalOneIter_R, booster_handle, as.integer(iter), watchlist, as.list(evnames))
mat <- matrix(strsplit(msg, '\\s+|:')[[1]][-1], nrow = 2)
res <- structure(as.numeric(mat[2, ]), names = mat[1, ])
} else {
res <- sapply(seq_along(watchlist), function(j) {
w <- watchlist[[j]]
## predict using all trees
preds <- predict(booster_handle, w, outputmargin = TRUE, iterationrange = c(1, 1))
eval_res <- feval(preds, w)
out <- eval_res$value
names(out) <- paste0(evnames[j], "-", eval_res$metric)
@@ -188,13 +194,23 @@ xgb.iter.eval <- function(booster_handle, watchlist, iter, feval = NULL) {
# Helper functions for cross validation ---------------------------------------
#
# Possibly convert the labels into factors, depending on the objective.
# The labels are converted into factors only when the given objective refers to the classification
# or ranking tasks.
convert.labels <- function(labels, objective_name) {
if (objective_name %in% .CLASSIFICATION_OBJECTIVES()) {
return(as.factor(labels))
} else {
return(labels)
}
}
# Generates random (stratified if needed) CV folds
generate.cv.folds <- function(nfold, nrows, stratified, label, params) {
# cannot do it for rank
objective <- params$objective
if (is.character(objective) && strtrim(objective, 5) == 'rank:') {
stop("\n\tAutomatic generation of CV-folds is not implemented for ranking!\n",
"\tConsider providing pre-computed CV-folds through the 'folds=' parameter.\n")
}
@@ -207,19 +223,16 @@ generate.cv.folds <- function(nfold, nrows, stratified, label, params) {
# - For classification, need to convert y labels to factor before making the folds,
# and then do stratification by factor levels.
# - For regression, leave y numeric and do stratification by quantiles.
if (is.character(objective)) {
y <- convert.labels(y, params$objective)
} else {
# If no 'objective' given in params, it means that user either wants to
# use the default 'reg:squarederror' objective or has provided a custom
# obj function. Here, assume classification setting when y has 5 or less
# unique values:
if (length(unique(y)) <= 5) {
y <- factor(y)
}
}
folds <- xgb.createFolds(y, nfold)
} else {
@@ -272,7 +285,7 @@ xgb.createFolds <- function(y, k = 10)
for (i in seq_along(numInClass)) {
## create a vector of integers from 1:k as many times as possible without
## going over the number of samples in the class. Note that if the number
## of samples in a class is less than k, nothing is produced here.
seqVector <- rep(seq_len(k), numInClass[i] %/% k)
## add enough random integers to get length(seqVector) == numInClass[i]
if (numInClass[i] %% k > 0) seqVector <- c(seqVector, sample.int(k, numInClass[i] %% k))
@@ -349,6 +362,7 @@ NULL
#' # Save as a stand-alone file (JSON); load it with xgb.load()
#' xgb.save(bst, 'xgb.model.json')
#' bst2 <- xgb.load('xgb.model.json')
#' if (file.exists('xgb.model.json')) file.remove('xgb.model.json')
#'
#' # Save as a raw byte vector; load it with xgb.load.raw()
#' xgb_bytes <- xgb.save.raw(bst)
@@ -364,6 +378,7 @@ NULL
#' obj2 <- readRDS('my_object.rds')
#' # Re-construct xgb.Booster object from the bytes
#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
#' if (file.exists('my_object.rds')) file.remove('my_object.rds')
#'
#' @name a-compatibility-note-for-saveRDS-save
NULL
R-package/R/xgb.Booster.R
@@ -1,7 +1,7 @@
# Construct an internal xgboost Booster and return a handle to it.
# internal utility function
xgb.Booster.handle <- function(params = list(), cachelist = list(),
modelfile = NULL, handle = NULL) {
if (typeof(cachelist) != "list" ||
!all(vapply(cachelist, inherits, logical(1), what = 'xgb.DMatrix'))) {
stop("cachelist must be a list of xgb.DMatrix objects")
@@ -11,6 +11,7 @@ xgb.Booster.handle <- function(params = list(), cachelist = list(),
if (typeof(modelfile) == "character") {
## A filename
handle <- .Call(XGBoosterCreate_R, cachelist)
modelfile <- path.expand(modelfile)
.Call(XGBoosterLoadModel_R, handle, modelfile[1])
class(handle) <- "xgb.Booster.handle"
if (length(params) > 0) {
@@ -19,7 +20,7 @@ xgb.Booster.handle <- function(params = list(), cachelist = list(),
return(handle)
} else if (typeof(modelfile) == "raw") {
## A memory buffer
bst <- xgb.unserialize(modelfile, handle)
xgb.parameters(bst) <- params
return (bst)
} else if (inherits(modelfile, "xgb.Booster")) {
@@ -128,7 +129,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
stop("argument type must be xgb.Booster")
if (is.null.handle(object$handle)) {
object$handle <- xgb.Booster.handle(modelfile = object$raw, handle = object$handle)
} else {
if (is.null(object$raw) && saveraw) {
object$raw <- xgb.serialize(object$handle)
@@ -167,8 +168,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' @param outputmargin whether the prediction should be returned in the for of original untransformed
#' sum of predictions from boosting iterations' results. E.g., setting \code{outputmargin=TRUE} for
#' logistic regression would result in predictions for log-odds instead of probabilities.
#' @param ntreelimit Deprecated, use \code{iterationrange} instead.
#' @param predleaf whether predict leaf index.
#' @param predcontrib whether to return feature contributions to individual predictions (see Details).
#' @param approxcontrib whether to use a fast approximation for feature contributions (see Details).
@@ -178,16 +178,19 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' or predinteraction flags is TRUE.
#' @param training whether is the prediction result used for training. For dart booster,
#' training predicting will perform dropout.
#' @param iterationrange Specifies which layer of trees are used in prediction. For
#' example, if a random forest is trained with 100 rounds. Specifying
#' `iteration_range=(1, 21)`, then only the forests built during [1, 21) (half open set)
#' rounds are used in this prediction. It's 1-based index just like R vector. When set
#' to \code{c(1, 1)} XGBoost will use all trees.
#' @param strict_shape Default is \code{FALSE}. When it's set to \code{TRUE}, output
#' type and shape of prediction are invariant to model type.
#'
#' @param ... Parameters passed to \code{predict.xgb.Booster}
#'
#' @details
#' Note that \code{iterationrange} would currently do nothing for predictions from gblinear,
#' since gblinear doesn't keep its boosting history.
#'
#' One possible practical applications of the \code{predleaf} option is to use the model
@@ -208,7 +211,8 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' of the most important features first. See below about the format of the returned results.
#'
#' @return
#' The return type is different depending whether \code{strict_shape} is set to \code{TRUE}. By default,
#' for regression or binary classification, it returns a vector of length \code{nrows(newdata)}.
#' For multiclass classification, either a \code{num_class * nrows(newdata)} vector or
#' a \code{(nrows(newdata), num_class)} dimension matrix is returned, depending on
#' the \code{reshape} value.
@@ -230,6 +234,13 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' For a multiclass case, a list of \code{num_class} elements is returned, where each element is
#' such an array.
#'
#' When \code{strict_shape} is set to \code{TRUE}, the output is always an array. For
#' normal prediction, the output is a 2-dimension array \code{(num_class, nrow(newdata))}.
#'
#' For \code{predcontrib = TRUE}, output is \code{(ncol(newdata) + 1, num_class, nrow(newdata))}
#' For \code{predinteraction = TRUE}, output is \code{(ncol(newdata) + 1, ncol(newdata) + 1, num_class, nrow(newdata))}
#' For \code{predleaf = TRUE}, output is \code{(n_trees_in_forest, num_class, n_iterations, nrow(newdata))}
#'
#' @seealso
#' \code{\link{xgb.train}}.
#'
@@ -252,7 +263,7 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' # use all trees by default
#' pred <- predict(bst, test$data)
#' # use only the 1st tree
#' pred1 <- predict(bst, test$data, iterationrange = c(1, 2))
#'
#' # Predicting tree leafs:
#' # the result is an nsamples X ntrees matrix
@@ -304,31 +315,14 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' all.equal(pred, pred_labels)
#' # prediction from using only 5 iterations should result
#' # in the same error as seen in iteration 5:
#' pred5 <- predict(bst, as.matrix(iris[, -5]), iterationrange=c(1, 6))
#' sum(pred5 != lb)/length(lb)
#'
#'
#'
#' @rdname predict.xgb.Booster
#' @export
predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FALSE, ntreelimit = NULL,
predleaf = FALSE, predcontrib = FALSE, approxcontrib = FALSE, predinteraction = FALSE,
reshape = FALSE, training = FALSE, iterationrange = NULL, strict_shape = FALSE, ...) {
object <- xgb.Booster.complete(object, saveraw = FALSE)
if (!inherits(newdata, "xgb.DMatrix"))
newdata <- xgb.DMatrix(newdata, missing = missing)
@@ -336,62 +330,114 @@ predict.xgb.Booster <- function(object, newdata, missing = NA, outputmargin = FA
!is.null(colnames(newdata)) &&
!identical(object[["feature_names"]], colnames(newdata)))
stop("Feature names stored in `object` and `newdata` are different!")
if (NVL(object$params[['booster']], '') == 'gblinear' || is.null(ntreelimit))
ntreelimit <- 0
if (ntreelimit < 0)
stop("ntreelimit cannot be negative")
if (ntreelimit != 0 && is.null(iterationrange)) {
## only ntreelimit, initialize iteration range
iterationrange <- c(0, 0)
} else if (ntreelimit == 0 && !is.null(iterationrange)) {
## only iteration range, handle 1-based indexing
iterationrange <- c(iterationrange[1] - 1, iterationrange[2] - 1)
} else if (ntreelimit != 0 && !is.null(iterationrange)) {
## both are specified, let libxgboost throw an error
} else {
## no limit is supplied, use best
if (is.null(object$best_iteration)) {
iterationrange <- c(0, 0)
} else {
matrix(ret, nrow = n_row, byrow = TRUE)
## We don't need to + 1 as R is 1-based index.
iterationrange <- c(0, as.integer(object$best_iteration))
}
} else if (predcontrib) {
n_col1 <- ncol(newdata) + 1
n_group <- npred_per_case / n_col1
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
ret <- if (n_ret == n_row) {
matrix(ret, ncol = 1, dimnames = list(NULL, cnames))
} else if (n_group == 1) {
matrix(ret, nrow = n_row, byrow = TRUE, dimnames = list(NULL, cnames))
} else {
arr <- array(ret, c(n_col1, n_group, n_row),
dimnames = list(cnames, NULL, NULL)) %>% aperm(c(2, 3, 1)) # [group, row, col]
lapply(seq_len(n_group), function(g) arr[g, , ])
}
## Handle the 0 length values.
box <- function(val) {
if (length(val) == 0) {
cval <- vector(, 1)
cval[0] <- val
return(cval)
}
return (val)
}
## We set strict_shape to TRUE then drop the dimensions conditionally
args <- list(
training = box(training),
strict_shape = box(TRUE),
iteration_begin = box(as.integer(iterationrange[1])),
iteration_end = box(as.integer(iterationrange[2])),
ntree_limit = box(as.integer(ntreelimit)),
type = box(as.integer(0))
)
set_type <- function(type) {
if (args$type != 0) {
stop("One type of prediction at a time.")
}
return(box(as.integer(type)))
}
if (outputmargin) {
args$type <- set_type(1)
}
if (predcontrib) {
args$type <- set_type(if (approxcontrib) 3 else 2)
}
if (predinteraction) {
args$type <- set_type(if (approxcontrib) 5 else 4)
}
if (predleaf) {
args$type <- set_type(6)
}
predts <- .Call(
XGBoosterPredictFromDMatrix_R, object$handle, newdata, jsonlite::toJSON(args, auto_unbox = TRUE)
)
names(predts) <- c("shape", "results")
shape <- predts$shape
ret <- predts$results
n_row <- nrow(newdata)
if (n_row != shape[1]) {
stop("Incorrect predict shape.")
}
arr <- array(data = ret, dim = rev(shape))
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
if (predcontrib) {
dimnames(arr) <- list(cnames, NULL, NULL)
if (!strict_shape) {
arr <- aperm(a = arr, perm = c(2, 3, 1)) # [group, row, col]
}
} else if (predinteraction) {
n_col1 <- ncol(newdata) + 1
n_group <- npred_per_case / n_col1^2
cnames <- if (!is.null(colnames(newdata))) c(colnames(newdata), "BIAS") else NULL
ret <- if (n_ret == n_row) {
matrix(ret, ncol = 1, dimnames = list(NULL, cnames))
} else if (n_group == 1) {
array(ret, c(n_col1, n_col1, n_row), dimnames = list(cnames, cnames, NULL)) %>% aperm(c(3, 1, 2))
} else {
arr <- array(ret, c(n_col1, n_col1, n_group, n_row),
dimnames = list(cnames, cnames, NULL, NULL)) %>% aperm(c(3, 4, 1, 2)) # [group, row, col1, col2]
lapply(seq_len(n_group), function(g) arr[g, , , ])
dimnames(arr) <- list(cnames, cnames, NULL, NULL)
if (!strict_shape) {
arr <- aperm(a = arr, perm = c(3, 4, 1, 2)) # [group, row, col, col]
}
} else if (reshape && npred_per_case > 1) {
ret <- matrix(ret, nrow = n_row, byrow = TRUE)
}
return(ret)
if (!strict_shape) {
n_groups <- shape[2]
if (predleaf) {
arr <- matrix(arr, nrow = n_row, byrow = TRUE)
} else if (predcontrib && n_groups != 1) {
arr <- lapply(seq_len(n_groups), function(g) arr[g, , ])
} else if (predinteraction && n_groups != 1) {
arr <- lapply(seq_len(n_groups), function(g) arr[g, , , ])
} else if (!reshape && n_groups != 1) {
arr <- ret
} else if (reshape && n_groups != 1) {
arr <- matrix(arr, ncol = n_groups, byrow = TRUE)
}
arr <- drop(arr)
if (length(dim(arr)) == 1) {
arr <- as.vector(arr)
} else if (length(dim(arr)) == 2) {
arr <- as.matrix(arr)
}
}
return(arr)
}
#' @rdname predict.xgb.Booster
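To illustrate the arguments this hunk introduces, here is a minimal sketch (assuming the agaricus data and a freshly trained booster, as in the examples above; not part of the diff). `iterationrange` is 1-based with an exclusive upper bound, and `strict_shape = TRUE` keeps the full array dimensions documented at the top of this file.

library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max_depth = 2, eta = 1, nrounds = 4, nthread = 2,
               objective = "binary:logistic")
# iterationrange is 1-based: c(1, 3) uses the first two boosting rounds
pred2 <- predict(bst, agaricus.test$data, iterationrange = c(1, 3))
# keep the strict (num_class, nrow(newdata)) shape instead of dropping
# singleton dimensions
pred_strict <- predict(bst, agaricus.test$data, strict_shape = TRUE)
dim(pred_strict)  # c(1, nrow(agaricus.test$data)) for this binary model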


@@ -1,7 +1,7 @@
#' Construct xgb.DMatrix object
#'
#' Construct xgb.DMatrix object from either a dense matrix, a sparse matrix, or a local file.
#' Supported input file formats are either a libsvm text file or a binary file that was created previously by
#' Supported input file formats are either a LIBSVM text file or a binary file that was created previously by
#' \code{\link{xgb.DMatrix.save}}).
#'
#' @param data a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
@@ -11,25 +11,26 @@
#' @param missing a float value that represents missing values in data (used only when input is a dense matrix).
#' It is useful when a 0 or some other extreme value represents missing values in data.
#' @param silent whether to suppress printing an informational message after loading from a file.
#' @param nthread Number of threads used for creating DMatrix.
#' @param ... the \code{info} data could be passed directly as parameters, without creating an \code{info} list.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
#' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
#' if (file.exists('xgb.DMatrix.data')) file.remove('xgb.DMatrix.data')
#' @export
xgb.DMatrix <- function(data, info = list(), missing = NA, silent = FALSE, ...) {
xgb.DMatrix <- function(data, info = list(), missing = NA, silent = FALSE, nthread = NULL, ...) {
cnames <- NULL
if (typeof(data) == "character") {
if (length(data) > 1)
stop("'data' has class 'character' and length ", length(data),
".\n 'data' accepts either a numeric matrix or a single filename.")
data <- path.expand(data)
handle <- .Call(XGDMatrixCreateFromFile_R, data, as.integer(silent))
} else if (is.matrix(data)) {
handle <- .Call(XGDMatrixCreateFromMat_R, data, missing)
handle <- .Call(XGDMatrixCreateFromMat_R, data, missing, as.integer(NVL(nthread, -1)))
cnames <- colnames(data)
} else if (inherits(data, "dgCMatrix")) {
handle <- .Call(XGDMatrixCreateFromCSC_R, data@p, data@i, data@x, nrow(data))
@@ -51,12 +52,12 @@ xgb.DMatrix <- function(data, info = list(), missing = NA, silent = FALSE, ...)
# get dmatrix from data, label
# internal helper method
xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL, nthread = NULL) {
if (inherits(data, "dgCMatrix") || is.matrix(data)) {
if (is.null(label)) {
stop("label must be provided when data is a matrix")
}
dtrain <- xgb.DMatrix(data, label = label, missing = missing)
dtrain <- xgb.DMatrix(data, label = label, missing = missing, nthread = nthread)
if (!is.null(weight)){
setinfo(dtrain, "weight", weight)
}
@@ -65,6 +66,7 @@ xgb.get.DMatrix <- function(data, label = NULL, missing = NA, weight = NULL) {
warning("xgboost: label will be ignored.")
}
if (is.character(data)) {
data <- path.expand(data)
dtrain <- xgb.DMatrix(data[1])
} else if (inherits(data, "xgb.DMatrix")) {
dtrain <- data
@@ -160,9 +162,9 @@ dimnames.xgb.DMatrix <- function(x) {
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#' \item \code{label}: label Xgboost learn from ;
#' \item \code{label}: label XGBoost learns from;
#' \item \code{weight}: to do a weight rescale;
#' \item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
#' \item \code{base_margin}: base margin is the base prediction XGBoost will boost from;
#' \item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
#'
#' }
@@ -171,8 +173,7 @@ dimnames.xgb.DMatrix <- function(x) {
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
@@ -216,16 +217,15 @@ getinfo.xgb.DMatrix <- function(object, name, ...) {
#' The \code{name} field can be one of the following:
#'
#' \itemize{
#' \item \code{label}: label Xgboost learn from ;
#' \item \code{label}: label XGBoost learns from;
#' \item \code{weight}: to do a weight rescale;
#' \item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
#' \item \code{base_margin}: base margin is the base prediction XGBoost will boost from;
#' \item \code{group}: number of rows in each group (to use with \code{rank:pairwise} objective).
#' }
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#'
#' labels <- getinfo(dtrain, 'label')
#' setinfo(dtrain, 'label', 1-labels)
@@ -290,8 +290,7 @@ setinfo.xgb.DMatrix <- function(object, name, info, ...) {
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#'
#' dsub <- slice(dtrain, 1:42)
#' labels1 <- getinfo(dsub, 'label')
@@ -347,8 +346,7 @@ slice.xgb.DMatrix <- function(object, idxset, ...) {
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#'
#' dtrain
#' print(dtrain, verbose=TRUE)
@@ -357,7 +355,7 @@ slice.xgb.DMatrix <- function(object, idxset, ...) {
#' @export
print.xgb.DMatrix <- function(x, verbose = FALSE, ...) {
cat('xgb.DMatrix dim:', nrow(x), 'x', ncol(x), ' info: ')
infos <- c()
infos <- character(0)
if (length(getinfo(x, 'label')) > 0) infos <- 'label'
if (length(getinfo(x, 'weight')) > 0) infos <- c(infos, 'weight')
if (length(getinfo(x, 'base_margin')) > 0) infos <- c(infos, 'base_margin')
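A small usage sketch for the new `nthread` argument (illustrative values; note that in this diff the thread count is forwarded only on the dense-matrix path, and `NULL` falls back to the library default of -1):

library(xgboost)
# dense input: the thread count is passed through to XGDMatrixCreateFromMat_R
x <- as.matrix(mtcars[, -1])
dtrain <- xgb.DMatrix(x, label = mtcars$mpg, nthread = 2)
# omitting nthread keeps the previous behaviour
dtrain_default <- xgb.DMatrix(x, label = mtcars$mpg)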


@@ -7,8 +7,7 @@
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#' dtrain <- xgb.DMatrix(train$data, label=train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#' xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
#' dtrain <- xgb.DMatrix('xgb.DMatrix.data')
#' if (file.exists('xgb.DMatrix.data')) file.remove('xgb.DMatrix.data')
@@ -19,6 +18,7 @@ xgb.DMatrix.save <- function(dmatrix, fname) {
if (!inherits(dmatrix, "xgb.DMatrix"))
stop("dmatrix must be xgb.DMatrix")
fname <- path.expand(fname)
.Call(XGDMatrixSaveBinary_R, dmatrix, fname[1], 0L)
return(TRUE)
}

R-package/R/xgb.config.R (new file)

@@ -0,0 +1,38 @@
#' Global configuration consists of a collection of parameters that can be applied in the global
#' scope. See \url{https://xgboost.readthedocs.io/en/stable/parameter.html} for the full list of
#' parameters supported in the global configuration. Use \code{xgb.set.config} to update the
#' values of one or more global-scope parameters. Use \code{xgb.get.config} to fetch the current
#' values of all global-scope parameters (listed in
#' \url{https://xgboost.readthedocs.io/en/stable/parameter.html}).
#'
#' @rdname xgbConfig
#' @title Set and get global configuration
#' @name xgb.set.config, xgb.get.config
#' @export xgb.set.config xgb.get.config
#' @param ... List of parameters to be set, as keyword arguments
#' @return
#' \code{xgb.set.config} returns \code{TRUE} to signal success. \code{xgb.get.config} returns
#' a list containing all global-scope parameters and their values.
#'
#' @examples
#' # Set verbosity level to silent (0)
#' xgb.set.config(verbosity = 0)
#' # Now global verbosity level is 0
#' config <- xgb.get.config()
#' print(config$verbosity)
#' # Set verbosity level to warning (1)
#' xgb.set.config(verbosity = 1)
#' # Now global verbosity level is 1
#' config <- xgb.get.config()
#' print(config$verbosity)
xgb.set.config <- function(...) {
new_config <- list(...)
.Call(XGBSetGlobalConfig_R, jsonlite::toJSON(new_config, auto_unbox = TRUE))
return(TRUE)
}
#' @rdname xgbConfig
xgb.get.config <- function() {
config <- .Call(XGBGetGlobalConfig_R)
return(jsonlite::fromJSON(config))
}
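A complementary usage sketch (hypothetical values): because the getter returns the full configuration list, it can be used to restore the previous state after temporarily silencing the library.

old <- xgb.get.config()
xgb.set.config(verbosity = 0)              # silence messages for a noisy section
# ... training code here ...
xgb.set.config(verbosity = old$verbosity)  # restore the previous level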


@@ -48,8 +48,8 @@
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#' dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#' dtest <- with(agaricus.test, xgb.DMatrix(data, label = label))
#'
#' param <- list(max_depth=2, eta=1, silent=1, objective='binary:logistic')
#' nrounds = 4


@@ -36,6 +36,8 @@
#' \item \code{error} binary classification error rate
#' \item \code{rmse} Root mean square error
#' \item \code{logloss} negative log-likelihood function
#' \item \code{mae} Mean absolute error
#' \item \code{mape} Mean absolute percentage error
#' \item \code{auc} Area under curve
#' \item \code{aucpr} Area under PR curve
#' \item \code{merror} Exact matching error, used to evaluate multi-class classification
@@ -79,7 +81,7 @@
#'
#' All observations are used for both training and validation.
#'
#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
#' Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
#'
#' @return
#' An object of class \code{xgb.cv.synchronous} with the following elements:
@@ -99,9 +101,7 @@
#' parameter or randomly generated.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{best_ntreelimit} and \code{ntreelimit}: deprecated attributes, use \code{best_iteration} instead.
#' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' It is either vector or matrix (see \code{\link{cb.cv.predict}}).
#' \item \code{models} a list of the CV folds' models. It is only available with the explicit
@@ -110,7 +110,7 @@
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#' cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse","auc"),
#' max_depth = 3, eta = 1, objective = "binary:logistic")
#' print(cv)
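Since `best_ntreelimit` is deprecated above in favour of `best_iteration`, a sketch of the intended replacement workflow, continuing the example (parameter values are illustrative):

cv <- xgb.cv(data = dtrain, nrounds = 50, nfold = 5, metrics = "auc",
             max_depth = 3, eta = 0.3, objective = "binary:logistic",
             early_stopping_rounds = 5)
best <- cv$best_iteration  # replaces the deprecated best_ntreelimit
# e.g. retrain a final model with that many rounds
bst <- xgb.train(params = list(max_depth = 3, eta = 0.3,
                               objective = "binary:logistic"),
                 data = dtrain, nrounds = best)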


@@ -56,16 +56,17 @@ xgb.dump <- function(model, fname = NULL, fmap = "", with_stats=FALSE,
as.character(dump_format))
if (is.null(fname))
model_dump <- stri_replace_all_regex(model_dump, '\t', '')
model_dump <- gsub('\t', '', model_dump, fixed = TRUE)
if (dump_format == "text")
model_dump <- unlist(stri_split_regex(model_dump, '\n'))
model_dump <- unlist(strsplit(model_dump, '\n', fixed = TRUE))
model_dump <- grep('^\\s*$', model_dump, invert = TRUE, value = TRUE)
if (is.null(fname)) {
return(model_dump)
} else {
fname <- path.expand(fname)
writeLines(model_dump, fname[1])
return(TRUE)
}
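The hunk above swaps stringi calls for base-R equivalents; a tiny standalone sketch of the same substitutions on a made-up dump fragment:

line <- "0:[f29<0.5] yes=1,no=2\tgain=4000.5,cover=1628.2"
# gsub(..., fixed = TRUE) replaces stri_replace_all_regex for literal tabs
no_tabs <- gsub('\t', '', line, fixed = TRUE)
# strsplit(..., fixed = TRUE) replaces stri_split_regex for literal newlines
parts <- unlist(strsplit("a\nb\n\nc", '\n', fixed = TRUE))
# drop blank lines, as xgb.dump does
parts <- grep('^\\s*$', parts, invert = TRUE, value = TRUE)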


@@ -99,6 +99,84 @@ xgb.ggplot.deepness <- function(model = NULL, which = c("2x1", "max.depth", "med
}
}
#' @rdname xgb.plot.shap.summary
#' @export
xgb.ggplot.shap.summary <- function(data, shap_contrib = NULL, features = NULL, top_n = 10, model = NULL,
trees = NULL, target_class = NULL, approxcontrib = FALSE, subsample = NULL) {
data_list <- xgb.shap.data(
data = data,
shap_contrib = shap_contrib,
features = features,
top_n = top_n,
model = model,
trees = trees,
target_class = target_class,
approxcontrib = approxcontrib,
subsample = subsample,
max_observations = 10000 # 10,000 samples per feature.
)
p_data <- prepare.ggplot.shap.data(data_list, normalize = TRUE)
# Reverse factor levels so that the first level is at the top of the plot
p_data[, "feature" := factor(feature, rev(levels(feature)))]
p <- ggplot2::ggplot(p_data, ggplot2::aes(x = feature, y = p_data$shap_value, colour = p_data$feature_value)) +
ggplot2::geom_jitter(alpha = 0.5, width = 0.1) +
ggplot2::scale_colour_viridis_c(limits = c(-3, 3), option = "plasma", direction = -1) +
ggplot2::geom_abline(slope = 0, intercept = 0, colour = "darkgrey") +
ggplot2::coord_flip()
p
}
#' Combine and melt feature values and SHAP contributions for sample
#' observations.
#'
#' Conforms to data format required for ggplot functions.
#'
#' Internal utility function.
#'
#' @param data_list List containing 'data' and 'shap_contrib' returned by
#' \code{xgb.shap.data()}.
#' @param normalize Whether to standardize feature values to have mean 0 and
#' standard deviation 1 (useful for comparing multiple features on the same
#' plot). Default \code{FALSE}.
#'
#' @return A data.table containing the observation ID, the feature name, the
#' feature value (normalized if specified), and the SHAP contribution value.
prepare.ggplot.shap.data <- function(data_list, normalize = FALSE) {
data <- data_list[["data"]]
shap_contrib <- data_list[["shap_contrib"]]
data <- data.table::as.data.table(as.matrix(data))
if (normalize) {
data[, (names(data)) := lapply(.SD, normalize)]
}
data[, "id" := seq_len(nrow(data))]
data_m <- data.table::melt.data.table(data, id.vars = "id", variable.name = "feature", value.name = "feature_value")
shap_contrib <- data.table::as.data.table(as.matrix(shap_contrib))
shap_contrib[, "id" := seq_len(nrow(shap_contrib))]
shap_contrib_m <- data.table::melt.data.table(shap_contrib, id.vars = "id", variable.name = "feature", value.name = "shap_value")
p_data <- data.table::merge.data.table(data_m, shap_contrib_m, by = c("id", "feature"))
p_data
}
#' Scale feature value to have mean 0, standard deviation 1
#'
#' This is used to compare multiple features on the same plot.
#' Internal utility function
#'
#' @param x Numeric vector
#'
#' @return Numeric vector with mean 0 and sd 1.
normalize <- function(x) {
loc <- mean(x, na.rm = TRUE)
scale <- stats::sd(x, na.rm = TRUE)
(x - loc) / scale
}
# Plot multiple ggplot graph aligned by rows and columns.
# ... the plots
# cols number of columns
@@ -131,5 +209,5 @@ multiplot <- function(..., cols = 1) {
globalVariables(c(
"Cluster", "ggplot", "aes", "geom_bar", "coord_flip", "xlab", "ylab", "ggtitle", "theme",
"element_blank", "element_text", "V1", "Weight"
"element_blank", "element_text", "V1", "Weight", "feature"
))
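A usage sketch for the summary plot added above, mirroring the examples this changeset adds to the xgb.plot.shap docs (assumes ggplot2 is installed and `bst` is a trained binary booster on the agaricus data):

data(agaricus.test, package = 'xgboost')
contr <- predict(bst, agaricus.test$data, predcontrib = TRUE)
xgb.ggplot.shap.summary(agaricus.test$data, contr, model = bst, top_n = 12)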


@@ -96,40 +96,44 @@ xgb.importance <- function(feature_names = NULL, model = NULL, trees = NULL,
if (!(is.null(feature_names) || is.character(feature_names)))
stop("feature_names: Has to be a character vector")
model_text_dump <- xgb.dump(model = model, with_stats = TRUE)
# linear model
if (model_text_dump[2] == "bias:"){
weights <- which(model_text_dump == "weight:") %>%
{model_text_dump[(. + 1):length(model_text_dump)]} %>%
as.numeric
num_class <- NVL(model$params$num_class, 1)
if (is.null(feature_names))
feature_names <- seq(to = length(weights) / num_class) - 1
if (length(feature_names) * num_class != length(weights))
stop("feature_names length does not match the number of features used in the model")
result <- if (num_class == 1) {
data.table(Feature = feature_names, Weight = weights)[order(-abs(Weight))]
model <- xgb.Booster.complete(model)
config <- jsonlite::fromJSON(xgb.config(model))
if (config$learner$gradient_booster$name == "gblinear") {
args <- list(importance_type = "weight", feature_names = feature_names)
results <- .Call(
XGBoosterFeatureScore_R, model$handle, jsonlite::toJSON(args, auto_unbox = TRUE, null = "null")
)
names(results) <- c("features", "shape", "weight")
n_classes <- if (length(results$shape) == 2) { results$shape[2] } else { 0 }
importance <- if (n_classes == 0) {
data.table(Feature = results$features, Weight = results$weight)[order(-abs(Weight))]
} else {
data.table(Feature = rep(feature_names, each = num_class),
Weight = weights,
Class = seq_len(num_class) - 1)[order(Class, -abs(Weight))]
data.table(
Feature = rep(results$features, each = n_classes), Weight = results$weight, Class = seq_len(n_classes) - 1
)[order(Class, -abs(Weight))]
}
} else { # tree model
result <- xgb.model.dt.tree(feature_names = feature_names,
text = model_text_dump,
trees = trees)[
Feature != "Leaf", .(Gain = sum(Quality),
Cover = sum(Cover),
Frequency = .N), by = Feature][
, `:=`(Gain = Gain / sum(Gain),
Cover = Cover / sum(Cover),
Frequency = Frequency / sum(Frequency))][
order(Gain, decreasing = TRUE)]
} else {
concatenated <- list()
output_names <- vector()
for (importance_type in c("weight", "gain", "cover")) {
args <- list(importance_type = importance_type, feature_names = feature_names)
results <- .Call(
XGBoosterFeatureScore_R, model$handle, jsonlite::toJSON(args, auto_unbox = TRUE, null = "null")
)
names(results) <- c("features", "shape", importance_type)
concatenated[
switch(importance_type, "weight" = "Frequency", "gain" = "Gain", "cover" = "Cover")
] <- results[importance_type]
output_names <- results$features
}
importance <- data.table(
Feature = output_names,
Gain = concatenated$Gain / sum(concatenated$Gain),
Cover = concatenated$Cover / sum(concatenated$Cover),
Frequency = concatenated$Frequency / sum(concatenated$Frequency)
)[order(Gain, decreasing = TRUE)]
}
result
importance
}
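The rewrite above routes both booster types through `XGBoosterFeatureScore_R`, but the user-facing call is unchanged; a sketch (assuming a trained booster `bst`):

# tree booster: Gain / Cover / Frequency, each normalised to sum to 1
imp <- xgb.importance(model = bst)
head(imp)
# a gblinear booster instead yields a per-feature Weight column
# (plus a Class column for multi-class models)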
# Avoid error messages during CRAN check.


@@ -87,11 +87,11 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
}
if (length(text) < 2 ||
sum(stri_detect_regex(text, 'yes=(\\d+),no=(\\d+)')) < 1) {
sum(grepl('yes=(\\d+),no=(\\d+)', text)) < 1) {
stop("Non-tree model detected! This function can only be used with tree models.")
}
position <- which(!is.na(stri_match_first_regex(text, "booster")))
position <- which(grepl("booster", text, fixed = TRUE))
add.tree.id <- function(node, tree) if (use_int_id) node else paste(tree, node, sep = "-")
@@ -108,9 +108,9 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
}
td <- td[Tree %in% trees & !grepl('^booster', t)]
td[, Node := stri_match_first_regex(t, "(\\d+):")[, 2] %>% as.integer]
td[, Node := as.integer(sub("^([0-9]+):.*", "\\1", t))]
if (!use_int_id) td[, ID := add.tree.id(Node, Tree)]
td[, isLeaf := !is.na(stri_match_first_regex(t, "leaf"))]
td[, isLeaf := grepl("leaf", t, fixed = TRUE)]
# parse branch lines
branch_rx <- paste0("f(\\d+)<(", anynumber_regex, ")\\] yes=(\\d+),no=(\\d+),missing=(\\d+),",
@@ -118,10 +118,11 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
branch_cols <- c("Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover")
td[isLeaf == FALSE,
(branch_cols) := {
# skip some indices with spurious capture groups from anynumber_regex
xtr <- stri_match_first_regex(t, branch_rx)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
xtr[, 3:5] <- add.tree.id(xtr[, 3:5], Tree)
lapply(seq_len(ncol(xtr)), function(i) xtr[, i])
matches <- regmatches(t, regexec(branch_rx, t))
# skip some indices with spurious capture groups from anynumber_regex
xtr <- do.call(rbind, matches)[, c(2, 3, 5, 6, 7, 8, 10), drop = FALSE]
xtr[, 3:5] <- add.tree.id(xtr[, 3:5], Tree)
as.data.table(xtr)
}]
# assign feature_names when available
if (!is.null(feature_names)) {
@@ -135,8 +136,9 @@ xgb.model.dt.tree <- function(feature_names = NULL, model = NULL, text = NULL,
leaf_cols <- c("Feature", "Quality", "Cover")
td[isLeaf == TRUE,
(leaf_cols) := {
xtr <- stri_match_first_regex(t, leaf_rx)[, c(2, 4)]
c("Leaf", lapply(seq_len(ncol(xtr)), function(i) xtr[, i]))
matches <- regmatches(t, regexec(leaf_rx, t))
xtr <- do.call(rbind, matches)[, c(2, 4)]
c("Leaf", as.data.table(xtr))
}]
# convert some columns to numeric
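For reference, a minimal sketch of the base-R extraction pattern that replaces `stri_match_first_regex` in this hunk:

t <- c("1:[f2<1.5] yes=3,no=4", "5:leaf=0.25")
# regexec locates the match and its capture groups; regmatches extracts them,
# yielding one character vector per input line (full match, then groups)
matches <- regmatches(t, regexec("^(\\d+):", t))
node_ids <- as.integer(vapply(matches, `[`, character(1), 2))
node_ids  # 1 5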


@@ -99,21 +99,20 @@ xgb.plot.importance <- function(importance_matrix = NULL, top_n = NULL, measure
}
if (plot) {
op <- par(no.readonly = TRUE)
mar <- op$mar
original_mar <- par()$mar
# reset margins so this function doesn't have side effects
on.exit({par(mar = original_mar)})
mar <- original_mar
if (!is.null(left_margin))
mar[2] <- left_margin
par(mar = mar)
# reverse the order of rows to have the highest ranked at the top
importance_matrix[nrow(importance_matrix):1,
importance_matrix[rev(seq_len(nrow(importance_matrix))),
barplot(Importance, horiz = TRUE, border = NA, cex.names = cex,
names.arg = Feature, las = 1, ...)]
grid(NULL, NA)
# redraw over the grid
importance_matrix[nrow(importance_matrix):1,
barplot(Importance, horiz = TRUE, border = NA, add = TRUE)]
par(op)
}
invisible(importance_matrix)
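The change above makes the margin adjustment free of side effects; the general pattern, as a sketch:

plot_with_margin <- function(left_margin = 10) {
  original_mar <- par()$mar
  # restore the caller's margins no matter how the function exits
  on.exit(par(mar = original_mar))
  mar <- original_mar
  mar[2] <- left_margin
  par(mar = mar)
  barplot(c(3, 1, 2), horiz = TRUE, names.arg = c("a", "b", "c"), las = 1)
}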


@@ -67,7 +67,7 @@ xgb.plot.multi.trees <- function(model, feature_names = NULL, features_keep = 5,
# first number of the path represents the tree, then the following numbers are related to the path to follow
# root init
root.nodes <- tree.matrix[stri_detect_regex(ID, "\\d+-0"), ID]
root.nodes <- tree.matrix[Node == 0, ID]
tree.matrix[ID %in% root.nodes, abs.node.position := root.nodes]
precedent.nodes <- root.nodes
@@ -75,8 +75,8 @@ xgb.plot.multi.trees <- function(model, feature_names = NULL, features_keep = 5,
while (tree.matrix[, sum(is.na(abs.node.position))] > 0) {
yes.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(Yes)]
no.row.nodes <- tree.matrix[abs.node.position %in% precedent.nodes & !is.na(No)]
yes.nodes.abs.pos <- yes.row.nodes[, abs.node.position] %>% paste0("_0")
no.nodes.abs.pos <- no.row.nodes[, abs.node.position] %>% paste0("_1")
yes.nodes.abs.pos <- paste0(yes.row.nodes[, abs.node.position], "_0")
no.nodes.abs.pos <- paste0(no.row.nodes[, abs.node.position], "_1")
tree.matrix[ID %in% yes.row.nodes[, Yes], abs.node.position := yes.nodes.abs.pos]
tree.matrix[ID %in% no.row.nodes[, No], abs.node.position := no.nodes.abs.pos]
@@ -86,28 +86,34 @@ xgb.plot.multi.trees <- function(model, feature_names = NULL, features_keep = 5,
tree.matrix[!is.na(Yes), Yes := paste0(abs.node.position, "_0")]
tree.matrix[!is.na(No), No := paste0(abs.node.position, "_1")]
remove.tree <- . %>% stri_replace_first_regex(pattern = "^\\d+-", replacement = "")
tree.matrix[, `:=`(abs.node.position = remove.tree(abs.node.position),
Yes = remove.tree(Yes),
No = remove.tree(No))]
for (nm in c("abs.node.position", "Yes", "No"))
data.table::set(tree.matrix, j = nm, value = sub("^\\d+-", "", tree.matrix[[nm]]))
nodes.dt <- tree.matrix[
, .(Quality = sum(Quality))
, by = .(abs.node.position, Feature)
][, .(Text = paste0(Feature[1:min(length(Feature), features_keep)],
" (",
format(Quality[1:min(length(Quality), features_keep)], digits = 5),
")") %>%
paste0(collapse = "\n"))
, by = abs.node.position]
][, .(Text = paste0(
paste0(
Feature[1:min(length(Feature), features_keep)],
" (",
format(Quality[1:min(length(Quality), features_keep)], digits = 5),
")"
),
collapse = "\n"
)
)
, by = abs.node.position
]
edges.dt <- tree.matrix[Feature != "Leaf", .(abs.node.position, Yes)] %>%
list(tree.matrix[Feature != "Leaf", .(abs.node.position, No)]) %>%
rbindlist() %>%
setnames(c("From", "To")) %>%
.[, .N, .(From, To)] %>%
.[, N := NULL]
edges.dt <- data.table::rbindlist(
l = list(
tree.matrix[Feature != "Leaf", .(abs.node.position, Yes)],
tree.matrix[Feature != "Leaf", .(abs.node.position, No)]
)
)
data.table::setnames(edges.dt, c("From", "To"))
edges.dt <- edges.dt[, .N, .(From, To)]
edges.dt[, N := NULL]
nodes <- DiagrammeR::create_node_df(
n = nrow(nodes.dt),
@@ -123,21 +129,25 @@ xgb.plot.multi.trees <- function(model, feature_names = NULL, features_keep = 5,
nodes_df = nodes,
edges_df = edges,
attr_theme = NULL
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "graph",
attr = c("layout", "rankdir"),
value = c("dot", "LR")
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "node",
attr = c("color", "fillcolor", "style", "shape", "fontname"),
value = c("DimGray", "beige", "filled", "rectangle", "Helvetica")
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "edge",
attr = c("color", "arrowsize", "arrowhead", "fontname"),
value = c("DimGray", "1.5", "vee", "Helvetica"))
value = c("DimGray", "1.5", "vee", "Helvetica")
)
if (!render) return(invisible(graph))


@@ -33,7 +33,7 @@
#' @param col_loess a color to use for the loess curves.
#' @param span_loess the \code{span} parameter in \code{\link[stats]{loess}}'s call.
#' @param which whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.
#' @param plot whether a plot should be drawn. If FALSE, only a lits of matrices is returned.
#' @param plot whether a plot should be drawn. If FALSE, only a list of matrices is returned.
#' @param ... other parameters passed to \code{plot}.
#'
#' @details
@@ -81,6 +81,7 @@
#' xgb.plot.shap(agaricus.test$data, model = bst, features = "odor=none")
#' contr <- predict(bst, agaricus.test$data, predcontrib = TRUE)
#' xgb.plot.shap(agaricus.test$data, contr, model = bst, top_n = 12, n_col = 3)
#' xgb.ggplot.shap.summary(agaricus.test$data, contr, model = bst, top_n = 12) # Summary plot
#'
#' # multiclass example - plots for each class separately:
#' nclass <- 3
@@ -99,6 +100,7 @@
#' n_col = 2, col = col, pch = 16, pch_NA = 17)
#' xgb.plot.shap(x, model = mbst, trees = trees0 + 2, target_class = 2, top_n = 4,
#' n_col = 2, col = col, pch = 16, pch_NA = 17)
#' xgb.ggplot.shap.summary(x, model = mbst, target_class = 0, top_n = 4) # Summary plot
#'
#' @rdname xgb.plot.shap
#' @export
@@ -109,69 +111,33 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
plot_NA = TRUE, col_NA = rgb(0.7, 0, 1, 0.6), pch_NA = '.', pos_NA = 1.07,
plot_loess = TRUE, col_loess = 2, span_loess = 0.5,
which = c("1d", "2d"), plot = TRUE, ...) {
if (!is.matrix(data) && !inherits(data, "dgCMatrix"))
stop("data: must be either matrix or dgCMatrix")
if (is.null(shap_contrib) && (is.null(model) || !inherits(model, "xgb.Booster")))
stop("when shap_contrib is not provided, one must provide an xgb.Booster model")
if (is.null(features) && (is.null(model) || !inherits(model, "xgb.Booster")))
stop("when features are not provided, one must provide an xgb.Booster model to rank the features")
if (!is.null(shap_contrib) &&
(!is.matrix(shap_contrib) || nrow(shap_contrib) != nrow(data) || ncol(shap_contrib) != ncol(data) + 1))
stop("shap_contrib is not compatible with the provided data")
nsample <- if (is.null(subsample)) min(100000, nrow(data)) else as.integer(subsample * nrow(data))
idx <- sample(1:nrow(data), nsample)
data <- data[idx, ]
if (is.null(shap_contrib)) {
shap_contrib <- predict(model, data, predcontrib = TRUE, approxcontrib = approxcontrib)
} else {
shap_contrib <- shap_contrib[idx, ]
}
data_list <- xgb.shap.data(
data = data,
shap_contrib = shap_contrib,
features = features,
top_n = top_n,
model = model,
trees = trees,
target_class = target_class,
approxcontrib = approxcontrib,
subsample = subsample,
max_observations = 100000
)
data <- data_list[["data"]]
shap_contrib <- data_list[["shap_contrib"]]
features <- colnames(data)
which <- match.arg(which)
if (which == "2d")
stop("2D plots are not implemented yet")
if (is.null(features)) {
imp <- xgb.importance(model = model, trees = trees)
top_n <- as.integer(top_n[1])
if (top_n < 1 && top_n > 100)
stop("top_n: must be an integer within [1, 100]")
features <- imp$Feature[1:min(top_n, NROW(imp))]
}
if (is.character(features)) {
if (is.null(colnames(data)))
stop("Either provide `data` with column names or provide `features` as column indices")
features <- match(features, colnames(data))
}
if (n_col > length(features)) n_col <- length(features)
if (is.list(shap_contrib)) { # multiclass: either choose a class or merge
shap_contrib <- if (!is.null(target_class)) shap_contrib[[target_class + 1]]
else Reduce("+", lapply(shap_contrib, abs))
}
shap_contrib <- shap_contrib[, features, drop = FALSE]
data <- data[, features, drop = FALSE]
cols <- colnames(data)
if (is.null(cols)) cols <- colnames(shap_contrib)
if (is.null(cols)) cols <- paste0('X', 1:ncol(data))
colnames(data) <- cols
colnames(shap_contrib) <- cols
if (plot && which == "1d") {
op <- par(mfrow = c(ceiling(length(features) / n_col), n_col),
oma = c(0, 0, 0, 0) + 0.2,
mar = c(3.5, 3.5, 0, 0) + 0.1,
mgp = c(1.7, 0.6, 0))
for (f in cols) {
for (f in features) {
ord <- order(data[, f])
x <- data[, f][ord]
y <- shap_contrib[, f][ord]
@@ -191,7 +157,7 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
plot(x2plot, y, pch = pch, xlab = f, col = col, xlim = x_lim, ylim = y_lim, ylab = ylab, ...)
grid()
if (plot_loess) {
# compress x to 3 digits, and mean-aggredate y
# compress x to 3 digits, and mean-aggregate y
zz <- data.table(x = signif(x, 3), y)[, .(.N, y = mean(y)), x]
if (nrow(zz) <= 5) {
lines(zz$x, zz$y, col = col_loess)
@@ -216,3 +182,108 @@ xgb.plot.shap <- function(data, shap_contrib = NULL, features = NULL, top_n = 1,
}
invisible(list(data = data, shap_contrib = shap_contrib))
}
#' SHAP contribution dependency summary plot
#'
#' Compare SHAP contributions of different features.
#'
#' A point plot (each point representing one sample from \code{data}) is
#' produced for each feature, with the points plotted on the SHAP value axis.
#' Each point (observation) is coloured based on its feature value. The plot
#' hence allows us to see which features have a negative / positive contribution
#' to the model prediction, and whether the contribution is different for larger
#' or smaller values of the feature. We effectively try to replicate the
#' \code{summary_plot} function from https://github.com/slundberg/shap.
#'
#' @inheritParams xgb.plot.shap
#'
#' @return A \code{ggplot2} object.
#' @export
#'
#' @examples # See \code{\link{xgb.plot.shap}}.
#' @seealso \code{\link{xgb.plot.shap}}, \code{\link{xgb.ggplot.shap.summary}},
#' \url{https://github.com/slundberg/shap}
xgb.plot.shap.summary <- function(data, shap_contrib = NULL, features = NULL, top_n = 10, model = NULL,
trees = NULL, target_class = NULL, approxcontrib = FALSE, subsample = NULL) {
# Only ggplot implementation is available.
xgb.ggplot.shap.summary(data, shap_contrib, features, top_n, model, trees, target_class, approxcontrib, subsample)
}
#' Prepare data for SHAP plots. To be used in xgb.plot.shap, xgb.plot.shap.summary, etc.
#' Internal utility function.
#'
#' @inheritParams xgb.plot.shap
#' @keywords internal
#'
#' @return A list containing: 'data', a matrix containing sample observations
#' and their feature values; 'shap_contrib', a matrix containing the SHAP contribution
#' values for these observations.
xgb.shap.data <- function(data, shap_contrib = NULL, features = NULL, top_n = 1, model = NULL,
trees = NULL, target_class = NULL, approxcontrib = FALSE,
subsample = NULL, max_observations = 100000) {
if (!is.matrix(data) && !inherits(data, "dgCMatrix"))
stop("data: must be either matrix or dgCMatrix")
if (is.null(shap_contrib) && (is.null(model) || !inherits(model, "xgb.Booster")))
stop("when shap_contrib is not provided, one must provide an xgb.Booster model")
if (is.null(features) && (is.null(model) || !inherits(model, "xgb.Booster")))
stop("when features are not provided, one must provide an xgb.Booster model to rank the features")
if (!is.null(shap_contrib) &&
(!is.matrix(shap_contrib) || nrow(shap_contrib) != nrow(data) || ncol(shap_contrib) != ncol(data) + 1))
stop("shap_contrib is not compatible with the provided data")
if (is.character(features) && is.null(colnames(data)))
stop("either provide `data` with column names or provide `features` as column indices")
if (is.null(model$feature_names) && model$nfeatures != ncol(data))
stop("if model has no feature_names, columns in `data` must match features in model")
if (!is.null(subsample)) {
idx <- sample(x = seq_len(nrow(data)), size = as.integer(subsample * nrow(data)), replace = FALSE)
} else {
idx <- seq_len(min(nrow(data), max_observations))
}
data <- data[idx, ]
if (is.null(colnames(data))) {
colnames(data) <- paste0("X", seq_len(ncol(data)))
}
if (!is.null(shap_contrib)) {
if (is.list(shap_contrib)) { # multiclass: either choose a class or merge
shap_contrib <- if (!is.null(target_class)) shap_contrib[[target_class + 1]] else Reduce("+", lapply(shap_contrib, abs))
}
shap_contrib <- shap_contrib[idx, ]
if (is.null(colnames(shap_contrib))) {
colnames(shap_contrib) <- paste0("X", seq_len(ncol(data)))
}
} else {
shap_contrib <- predict(model, newdata = data, predcontrib = TRUE, approxcontrib = approxcontrib)
if (is.list(shap_contrib)) { # multiclass: either choose a class or merge
shap_contrib <- if (!is.null(target_class)) shap_contrib[[target_class + 1]] else Reduce("+", lapply(shap_contrib, abs))
}
}
if (is.null(features)) {
if (!is.null(model$feature_names)) {
imp <- xgb.importance(model = model, trees = trees)
} else {
imp <- xgb.importance(model = model, trees = trees, feature_names = colnames(data))
}
top_n <- top_n[1]
if (top_n < 1 | top_n > 100) stop("top_n: must be an integer within [1, 100]")
features <- imp$Feature[1:min(top_n, NROW(imp))]
}
if (is.character(features)) {
features <- match(features, colnames(data))
}
shap_contrib <- shap_contrib[, features, drop = FALSE]
data <- data[, features, drop = FALSE]
list(
data = data,
shap_contrib = shap_contrib
)
}
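A sketch of how the plotting front-ends consume this helper (it is internal and not exported, hence the `:::`; assumes `bst` is a booster trained on the agaricus data):

data_list <- xgboost:::xgb.shap.data(data = agaricus.test$data, model = bst,
                                     top_n = 5, max_observations = 1000)
str(data_list$data)          # sampled feature values, top_n columns
str(data_list$shap_contrib)  # matching SHAP contribution matrix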


@@ -99,33 +99,41 @@ xgb.plot.tree <- function(feature_names = NULL, model = NULL, trees = NULL, plot
fontcolor = "black")
edges <- DiagrammeR::create_edge_df(
from = match(dt[Feature != "Leaf", c(ID)] %>% rep(2), dt$ID),
from = match(rep(dt[Feature != "Leaf", c(ID)], 2), dt$ID),
to = match(dt[Feature != "Leaf", c(Yes, No)], dt$ID),
label = dt[Feature != "Leaf", paste("<", Split)] %>%
c(rep("", nrow(dt[Feature != "Leaf"]))),
style = dt[Feature != "Leaf", ifelse(Missing == Yes, "bold", "solid")] %>%
c(dt[Feature != "Leaf", ifelse(Missing == No, "bold", "solid")]),
label = c(
dt[Feature != "Leaf", paste("<", Split)],
rep("", nrow(dt[Feature != "Leaf"]))
),
style = c(
dt[Feature != "Leaf", ifelse(Missing == Yes, "bold", "solid")],
dt[Feature != "Leaf", ifelse(Missing == No, "bold", "solid")]
),
rel = "leading_to")
graph <- DiagrammeR::create_graph(
nodes_df = nodes,
edges_df = edges,
attr_theme = NULL
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "graph",
attr = c("layout", "rankdir"),
value = c("dot", "LR")
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "node",
attr = c("color", "style", "fontname"),
value = c("DimGray", "filled", "Helvetica")
) %>%
DiagrammeR::add_global_graph_attrs(
)
graph <- DiagrammeR::add_global_graph_attrs(
graph = graph,
attr_type = "edge",
attr = c("color", "arrowsize", "arrowhead", "fontname"),
value = c("DimGray", "1.5", "vee", "Helvetica"))
value = c("DimGray", "1.5", "vee", "Helvetica")
)
if (!render) return(invisible(graph))


@@ -42,6 +42,7 @@ xgb.save <- function(model, fname) {
if (inherits(model, "xgb.DMatrix")) " Use xgb.DMatrix.save to save an xgb.DMatrix object." else "")
}
model <- xgb.Booster.complete(model, saveraw = FALSE)
fname <- path.expand(fname)
.Call(XGBoosterSaveModel_R, model$handle, fname[1])
return(TRUE)
}


@@ -15,7 +15,7 @@
#'
#' 2. Booster Parameters
#'
#' 2.1. Parameter for Tree Booster
#' 2.1. Parameters for Tree Booster
#'
#' \itemize{
#' \item \code{eta} control the learning rate: scale the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. Lower value for \code{eta} implies larger value for \code{nrounds}: low \code{eta} value means model more robust to overfitting but slower to compute. Default: 0.3
@@ -24,12 +24,14 @@
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. It makes computation shorter (because less data to analyse). It is advised to use this parameter with \code{eta} and increase \code{nrounds}. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
#' \item \code{lambda} L2 regularization term on weights. Default: 1
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
#' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
#' \item \code{monotone_constraints} A numerical vector consisting of \code{1}, \code{0} and \code{-1} whose length equals the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
#' \item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
#' }
#'
#' 2.2. Parameter for Linear Booster
#' 2.2. Parameters for Linear Booster
#'
#' \itemize{
#' \item \code{lambda} L2 regularization term on weights. Default: 0
@@ -49,10 +51,10 @@
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{binary:hinge}: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
#' \item \code{count:poisson}: poisson regression for count data, output mean of poisson distribution. \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).
#' \item \code{count:poisson}: Poisson regression for count data, output mean of Poisson distribution. \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).
#' \item \code{survival:cox}: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function \code{h(t) = h0(t) * HR)}.
#' \item \code{survival:aft}: Accelerated failure time model for censored survival time data. See \href{https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html}{Survival Analysis with Accelerated Failure Time} for details.
#' \item \code{aft_loss_distribution}: Probabilty Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
#' \item \code{aft_loss_distribution}: Probability Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
#' \item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
@@ -124,22 +126,24 @@
#' Parallelization is automatically enabled if \code{OpenMP} is present.
#' Number of threads can also be manually specified via \code{nthread} parameter.
#'
#' The evaluation metric is chosen automatically by Xgboost (according to the objective)
#' The evaluation metric is chosen automatically by XGBoost (according to the objective)
#' when the \code{eval_metric} parameter is not provided.
#' User may set one or several \code{eval_metric} parameters.
#' Note that when using a customized metric, only this single metric can be used.
#' The following is the list of built-in metrics for which Xgboost provides optimized implementation:
#' The following is the list of built-in metrics for which XGBoost provides optimized implementation:
#' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
#' \item \code{mlogloss} multiclass logloss. \url{http://wiki.fast.ai/index.php/Log_Loss}
#' \item \code{rmse} root mean square error. \url{https://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{https://en.wikipedia.org/wiki/Log-likelihood}
#' \item \code{mlogloss} multiclass logloss. \url{https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html}
#' \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
#' Different threshold (e.g., 0.) could be specified as "error@0."
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
#' \item \code{mae} Mean absolute error
#' \item \code{mape} Mean absolute percentage error
#' \item \code{auc} Area under the curve. \url{https://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
#' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{https://en.wikipedia.org/wiki/NDCG}
#' }
#'
#' The following callbacks are automatically created when certain parameters are set:
@@ -167,9 +171,6 @@
#' explicitly passed.
#' \item \code{best_iteration} iteration number with the best evaluation metric value
#' (only available with early stopping).
#' \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
#' which could further be used in \code{predict} method
#' (only available with early stopping).
#' \item \code{best_score} the best evaluation metric value during early stopping.
#' (only available with early stopping).
#' \item \code{feature_names} names of the training dataset features
@@ -191,8 +192,8 @@
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#'
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
#' dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
#' dtest <- with(agaricus.test, xgb.DMatrix(data, label = label))
#' watchlist <- list(train = dtrain, eval = dtest)
#'
#' ## A simple xgb.train example:
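#' # (sketch completing the truncated example; parameter values are illustrative)
#' param <- list(max_depth = 2, eta = 1, nthread = 2, objective = "binary:logistic")
#' bst <- xgb.train(params = param, data = dtrain, nrounds = 2, watchlist = watchlist)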


@@ -1,11 +1,21 @@
#' Load the instance back from \code{\link{xgb.serialize}}
#'
#' @param buffer the buffer containing booster instance saved by \code{\link{xgb.serialize}}
#' @param handle An \code{xgb.Booster.handle} object which will be overwritten with
#' the new deserialized object. Must be a null handle (e.g. when loading the model through
#' `readRDS`). If not provided, a new handle will be created.
#' @return An \code{xgb.Booster.handle} object.
#'
#' @export
xgb.unserialize <- function(buffer) {
xgb.unserialize <- function(buffer, handle = NULL) {
cachelist <- list()
handle <- .Call(XGBoosterCreate_R, cachelist)
if (is.null(handle)) {
handle <- .Call(XGBoosterCreate_R, cachelist)
} else {
if (!is.null.handle(handle))
stop("'handle' is not null/empty. Cannot overwrite existing handle.")
.Call(XGBoosterCreateInEmptyObj_R, cachelist, handle)
}
tryCatch(
.Call(XGBoosterUnserializeFromBuffer_R, handle, buffer),
error = function(e) {
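A usage sketch for the new `handle` argument (illustrative, assuming a booster saved with `saveRDS` while its raw bytes were kept; after `readRDS` the external-pointer handle is null and can be filled in place):

bst <- readRDS('model.rds')        # bst$handle is a null pointer here
bst$handle <- xgb.unserialize(bst$raw, bst$handle)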


@@ -10,7 +10,7 @@ xgboost <- function(data = NULL, label = NULL, missing = NA, weight = NULL,
save_period = NULL, save_name = "xgboost.model",
xgb_model = NULL, callbacks = list(), ...) {
dtrain <- xgb.get.DMatrix(data, label, missing, weight)
dtrain <- xgb.get.DMatrix(data, label, missing, weight, nthread = params$nthread)
watchlist <- list(train = dtrain)
@@ -90,12 +90,8 @@ NULL
#' @importFrom data.table setkey
#' @importFrom data.table setkeyv
#' @importFrom data.table setnames
#' @importFrom magrittr %>%
#' @importFrom stringi stri_detect_regex
#' @importFrom stringi stri_match_first_regex
#' @importFrom stringi stri_replace_first_regex
#' @importFrom stringi stri_replace_all_regex
#' @importFrom stringi stri_split_regex
#' @importFrom jsonlite fromJSON
#' @importFrom jsonlite toJSON
#' @importFrom utils object.size str tail
#' @importFrom stats predict
#' @importFrom stats median


@@ -30,4 +30,4 @@ Examples
Development
-----------
* See the [R Package section](https://xgboost.readthedocs.io/en/latest/contribute.html#r-package) of the contributors guide.
* See the [R Package section](https://xgboost.readthedocs.io/en/latest/contrib/coding_guide.html#r-coding-guideline) of the contributors guide.


@@ -1,4 +1,3 @@
#!/bin/sh
rm -f src/Makevars
rm -f CMakeLists.txt

R-package/configure (vendored)

@@ -2732,7 +2732,7 @@ $as_echo "${ac_pkg_openmp}" >&6; }
OPENMP_CXXFLAGS=''
OPENMP_LIB=''
echo '*****************************************************************************************'
echo 'WARNING: OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.'
echo ' OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.'
echo ' To use all CPU cores for training jobs, you should install OpenMP by running\n'
echo ' brew install libomp'
echo '*****************************************************************************************'


@@ -39,7 +39,7 @@ then
OPENMP_CXXFLAGS=''
OPENMP_LIB=''
echo '*****************************************************************************************'
echo 'WARNING: OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.'
echo ' OpenMP is unavailable on this Mac OSX system. Training speed may be suboptimal.'
echo ' To use all CPU cores for training jobs, you should install OpenMP by running\n'
echo ' brew install libomp'
echo '*****************************************************************************************'
@@ -52,4 +52,3 @@ AC_SUBST(ENDIAN_FLAG)
AC_SUBST(BACKTRACE_LIB)
AC_CONFIG_FILES([src/Makevars])
AC_OUTPUT


@@ -1,6 +1,6 @@
basic_walkthrough Basic feature walkthrough
caret_wrapper Use xgboost to train in caret library
custom_objective Cutomize loss function, and evaluation metric
custom_objective Customize loss function, and evaluation metric
boost_from_prediction Boosting from existing prediction
predict_first_ntree Predicting using first n trees
generalized_linear_model Generalized Linear Model
@@ -8,8 +8,8 @@ cross_validation Cross validation
create_sparse_matrix Create Sparse Matrix
predict_leaf_indices Predicting the corresponding leaves
early_stopping Early Stop in training
poisson_regression Poisson Regression on count data
tweedie_regression Tweddie Regression
poisson_regression Poisson regression on count data
tweedie_regression Tweedie regression
gpu_accelerated GPU-accelerated tree building algorithms
interaction_constraints Interaction constraints among features


@@ -2,7 +2,7 @@ XGBoost R Feature Walkthrough
====
* [Basic walkthrough of wrappers](basic_walkthrough.R)
* [Train a xgboost model from caret library](caret_wrapper.R)
* [Cutomize loss function, and evaluation metric](custom_objective.R)
* [Customize loss function, and evaluation metric](custom_objective.R)
* [Boosting from existing prediction](boost_from_prediction.R)
* [Predicting using first n trees](predict_first_ntree.R)
* [Generalized Linear Model](generalized_linear_model.R)


@@ -40,7 +40,7 @@ print("Train xgboost with verbose 2, also print information about tree")
bst <- xgboost(data = dtrain, max_depth = 2, eta = 1, nrounds = 2,
nthread = 2, objective = "binary:logistic", verbose = 2)
# you can also specify data as file path to a LibSVM format input
# you can also specify data as file path to a LIBSVM format input
# since we do not have this file with us, the following line is just for illustration
# bst <- xgboost(data = 'agaricus.train.svm', max_depth = 2, eta = 1, nrounds = 2,objective = "binary:logistic")


@@ -2,17 +2,17 @@ require(xgboost)
require(Matrix)
require(data.table)
if (!require(vcd)) {
install.packages('vcd') #Available in Cran. Used for its dataset with categorical values.
install.packages('vcd') #Available in CRAN. Used for its dataset with categorical values.
require(vcd)
}
# According to its documentation, Xgboost works only on numbers.
# According to its documentation, XGBoost works only on numbers.
# Sometimes the dataset we have to work on contains categorical data.
# A categorical variable is one which has a fixed number of values. For example, if for each observation a variable called "Colour" can have only "red", "blue" or "green" as value, it is a categorical variable.
#
# In R, categorical variable is called Factor.
# Type ?factor in console for more information.
#
# In this demo we will see how to transform a dense dataframe with categorical variables to a sparse matrix before analyzing it in Xgboost.
# In this demo we will see how to transform a dense dataframe with categorical variables to a sparse matrix before analyzing it in XGBoost.
# The method we are going to see is usually called "one hot encoding".
#load Arthritis dataset in memory.
@@ -25,13 +25,13 @@ df <- data.table(Arthritis, keep.rownames = FALSE)
cat("Print the dataset\n")
print(df)
# 2 columns have factor type, one has ordinal type (ordinal variable is a categorical variable with values wich can be ordered, here: None > Some > Marked).
# 2 columns have factor type, one has ordinal type (ordinal variable is a categorical variable with values which can be ordered, here: None > Some > Marked).
cat("Structure of the dataset\n")
str(df)
# Let's add some new categorical features to see if it helps. Of course these features are highly correlated to the Age feature. Usually that's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treat them as independant values.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treats them as independent values.
df[, AgeDiscret := as.factor(round(Age / 10, 0))]
# Here is an even stronger simplification of the real age with an arbitrary split at 30 years old. I chose this value based on nothing. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).


@@ -22,10 +22,10 @@ xgb.cv(param, dtrain, nrounds, nfold = 5,
metrics = 'error', showsd = FALSE)
###
# you can also do cross validation with cutomized loss function
# you can also do cross validation with customized loss function
# See custom_objective.R
##
print ('running cross validation, with cutomsized loss function')
print ('running cross validation, with customized loss function')
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")


@@ -12,7 +12,7 @@ watchlist <- list(eval = dtest, train = dtrain)
num_round <- 2
# user define objective function, given prediction, return gradient and second order gradient
# this is loglikelihood loss
# this is log likelihood loss
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1 / (1 + exp(-preds))
@@ -23,9 +23,9 @@ logregobj <- function(preds, dtrain) {
# user-defined evaluation function, returning a pair (metric_name, result)
# NOTE: when using a customized loss function, the default prediction value is the margin
# this may make buildin evalution metric not function properly
# this may make builtin evaluation metric not function properly
# for example, with logistic loss the prediction is the score before the logistic transformation
# the buildin evaluation error assumes input is after logistic transformation
# the builtin evaluation error assumes input is after logistic transformation
# Keep this in mind when you use the customization; you may need to write a customized evaluation function
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")

View File

@@ -11,7 +11,7 @@ param <- list(max_depth = 2, eta = 1, nthread = 2, verbosity = 0)
watchlist <- list(eval = dtest)
num_round <- 20
# user-defined objective function: given predictions, return the gradient and second-order gradient
# this is loglikelihood loss
# this is log likelihood loss
logregobj <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
preds <- 1 / (1 + exp(-preds))
@@ -21,9 +21,9 @@ logregobj <- function(preds, dtrain) {
}
# user-defined evaluation function, returning a pair (metric_name, result)
# NOTE: when using a customized loss function, the default prediction value is the margin
# this may make buildin evalution metric not function properly
# this may make builtin evaluation metric not function properly
# for example, with logistic loss the prediction is the score before the logistic transformation
# the buildin evaluation error assumes input is after logistic transformation
# the builtin evaluation error assumes input is after logistic transformation
# Keep this in mind when you use the customization; you may need to write a customized evaluation function
evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label")
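These two callbacks plug into training through the obj and feval arguments; a minimal sketch, assuming the param, dtrain, num_round and watchlist objects set up earlier in this demo:

bst <- xgb.train(param, dtrain, num_round, watchlist,
                 obj = logregobj, feval = evalerror)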

View File

@@ -36,7 +36,7 @@ treeInteractions <- function(input_tree, input_max_depth) {
interaction_trees <- trees[!is.na(Split) & !is.na(parent_1),
c('Feature', paste0('parent_feat_', 1:(input_max_depth - 1))),
with = FALSE]
interaction_trees_split <- split(interaction_trees, 1:nrow(interaction_trees))
interaction_trees_split <- split(interaction_trees, seq_len(nrow(interaction_trees)))
interaction_list <- lapply(interaction_trees_split, as.character)
# Remove NAs (no parent interaction)
@@ -101,8 +101,8 @@ bst3_interactions <- treeInteractions(bst3_tree, 4)
# Show monotonic constraints still apply by checking scores after incrementing V1
x1 <- sort(unique(x[['V1']]))
for (i in 1:length(x1)){
testdata <- copy(x[, -c('V1')])
for (i in seq_along(x1)){
testdata <- copy(x[, - ('V1')])
testdata[['V1']] <- x1[i]
testdata <- testdata[, paste0('V', 1:10), with = FALSE]
pred <- predict(bst3, as.matrix(testdata))

View File

@@ -43,6 +43,7 @@ bst2 <- xgb.load('xgb.model')
# Save as a stand-alone file (JSON); load it with xgb.load()
xgb.save(bst, 'xgb.model.json')
bst2 <- xgb.load('xgb.model.json')
if (file.exists('xgb.model.json')) file.remove('xgb.model.json')
# Save as a raw byte vector; load it with xgb.load.raw()
xgb_bytes <- xgb.save.raw(bst)
@@ -58,5 +59,6 @@ saveRDS(obj, 'my_object.rds')
obj2 <- readRDS('my_object.rds')
# Re-construct xgb.Booster object from the bytes
bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
if (file.exists('my_object.rds')) file.remove('my_object.rds')
}

View File

@@ -38,10 +38,7 @@ The following additional fields are assigned to the model's R object:
\itemize{
\item \code{best_score} the evaluation score at the best iteration
\item \code{best_iteration} at which boosting iteration the best score has occurred (1-based index)
\item \code{best_ntreelimit} to use with the \code{ntreelimit} parameter in \code{predict}.
It differs from \code{best_iteration} in multiclass or random forest settings.
}
The same values are also stored as xgb-attributes:
\itemize{
\item \code{best_iteration} is stored as a 0-based iteration index (for interoperability of binary models)

View File

@@ -8,7 +8,7 @@ during its training.}
cb.gblinear.history(sparse = FALSE)
}
\arguments{
\item{sparse}{when set to FALSE/TURE, a dense/sparse matrix is used to store the result.
\item{sparse}{when set to FALSE/TRUE, a dense/sparse matrix is used to store the result.
Sparse format is useful when one expects only a subset of coefficients to be non-zero,
when using the "thrifty" feature selector with a fairly small number of top features
selected per iteration.}
@@ -36,7 +36,6 @@ Callback function expects the following values to be set in its calling frame:
#
# In the iris dataset, it is hard to linearly separate Versicolor class from the rest
# without considering the 2nd order interactions:
require(magrittr)
x <- model.matrix(Species ~ .^2, iris)[,-1]
colnames(x)
dtrain <- xgb.DMatrix(scale(x), label = 1*(iris$Species == "versicolor"))
@@ -57,7 +56,7 @@ matplot(coef_path, type = 'l')
bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 200, eta = 0.8,
updater = 'coord_descent', feature_selector = 'thrifty', top_k = 1,
callbacks = list(cb.gblinear.history()))
xgb.gblinear.history(bst) \%>\% matplot(type = 'l')
matplot(xgb.gblinear.history(bst), type = 'l')
# Componentwise boosting is known to have similar effect to Lasso regularization.
# Try experimenting with various values of top_k, eta, nrounds,
# as well as different feature_selectors.
@@ -66,7 +65,7 @@ xgb.gblinear.history(bst) \%>\% matplot(type = 'l')
bst <- xgb.cv(param, dtrain, nfold = 5, nrounds = 100, eta = 0.8,
callbacks = list(cb.gblinear.history()))
# coefficients in the CV fold #3
xgb.gblinear.history(bst)[[3]] \%>\% matplot(type = 'l')
matplot(xgb.gblinear.history(bst)[[3]], type = 'l')
#### Multiclass classification:
@@ -79,15 +78,15 @@ param <- list(booster = "gblinear", objective = "multi:softprob", num_class = 3,
bst <- xgb.train(param, dtrain, list(tr=dtrain), nrounds = 70, eta = 0.5,
callbacks = list(cb.gblinear.history()))
# Will plot the coefficient paths separately for each class:
xgb.gblinear.history(bst, class_index = 0) \%>\% matplot(type = 'l')
xgb.gblinear.history(bst, class_index = 1) \%>\% matplot(type = 'l')
xgb.gblinear.history(bst, class_index = 2) \%>\% matplot(type = 'l')
matplot(xgb.gblinear.history(bst, class_index = 0), type = 'l')
matplot(xgb.gblinear.history(bst, class_index = 1), type = 'l')
matplot(xgb.gblinear.history(bst, class_index = 2), type = 'l')
# CV:
bst <- xgb.cv(param, dtrain, nfold = 5, nrounds = 70, eta = 0.5,
callbacks = list(cb.gblinear.history(FALSE)))
# 1st forld of 1st class
xgb.gblinear.history(bst, class_index = 0)[[1]] \%>\% matplot(type = 'l')
# 1st fold of 1st class
matplot(xgb.gblinear.history(bst, class_index = 0)[[1]], type = 'l')
}
\seealso{

View File

@@ -23,9 +23,9 @@ Get information of an xgb.DMatrix object
The \code{name} field can be one of the following:
\itemize{
\item \code{label}: label Xgboost learn from ;
\item \code{label}: label XGBoost learns from;
\item \code{weight}: to do a weight rescale;
\item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
\item \code{base_margin}: base margin is the base prediction XGBoost will boost from;
\item \code{nrow}: number of rows of the \code{xgb.DMatrix}.
}
@@ -34,8 +34,7 @@ The \code{name} field can be one of the following:
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
labels <- getinfo(dtrain, 'label')
setinfo(dtrain, 'label', 1-labels)
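A small sketch of another field from the list above, per-row weights (the weight values themselves are hypothetical):

w <- runif(nrow(dtrain))      # one weight per training row
setinfo(dtrain, 'weight', w)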

View File

@@ -0,0 +1,18 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.ggplot.R
\name{normalize}
\alias{normalize}
\title{Scale feature value to have mean 0, standard deviation 1}
\usage{
normalize(x)
}
\arguments{
\item{x}{Numeric vector}
}
\value{
Numeric vector with mean 0 and sd 1.
}
\description{
This is used to compare multiple features on the same plot.
Internal utility function.
}
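A plausible one-line sketch of what this internal helper computes (an assumption based on the documentation above, not necessarily the package's exact code):

normalize <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)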

View File

@@ -17,6 +17,8 @@
predinteraction = FALSE,
reshape = FALSE,
training = FALSE,
iterationrange = NULL,
strict_shape = FALSE,
...
)
@@ -34,8 +36,7 @@ missing values in data (e.g., sometimes 0 or some other extreme value is used).}
sum of predictions from boosting iterations' results. E.g., setting \code{outputmargin=TRUE} for
logistic regression would result in predictions for log-odds instead of probabilities.}
\item{ntreelimit}{limit the number of model's trees or boosting iterations used in prediction (see Details).
It will use all the trees by default (\code{NULL} value).}
\item{ntreelimit}{Deprecated, use \code{iterationrange} instead.}
\item{predleaf}{whether predict leaf index.}
@@ -52,10 +53,20 @@ or predinteraction flags is TRUE.}
\item{training}{whether is the prediction result used for training. For dart booster,
training predicting will perform dropout.}
\item{iterationrange}{Specifies which layers of trees are used in prediction. For
example, if a random forest is trained with 100 rounds, specifying
`iteration_range=(1, 21)` means that only the forests built during rounds [1, 21)
(a half-open interval) are used in this prediction. The index is 1-based, just like an
R vector. When set to \code{c(1, 1)}, XGBoost will use all trees.}
\item{strict_shape}{Default is \code{FALSE}. When it's set to \code{TRUE}, output
type and shape of prediction are invariant to model type.}
\item{...}{Parameters passed to \code{predict.xgb.Booster}}
}
\value{
For regression or binary classification, it returns a vector of length \code{nrows(newdata)}.
The return type is different depending on whether \code{strict_shape} is set to \code{TRUE}. By default,
for regression or binary classification, it returns a vector of length \code{nrows(newdata)}.
For multiclass classification, either a \code{num_class * nrows(newdata)} vector or
a \code{(nrows(newdata), num_class)} dimension matrix is returned, depending on
the \code{reshape} value.
@@ -76,18 +87,19 @@ two dimensions. The "+ 1" columns corresponds to bias. Summing this array along
produce practically the same result as predict with \code{predcontrib = TRUE}.
For a multiclass case, a list of \code{num_class} elements is returned, where each element is
such an array.
When \code{strict_shape} is set to \code{TRUE}, the output is always an array. For
normal prediction, the output is a 2-dimension array \code{(num_class, nrow(newdata))}.
For \code{predcontrib = TRUE}, output is \code{(ncol(newdata) + 1, num_class, nrow(newdata))}
For \code{predinteraction = TRUE}, output is \code{(ncol(newdata) + 1, ncol(newdata) + 1, num_class, nrow(newdata))}
For \code{predleaf = TRUE}, output is \code{(n_trees_in_forest, num_class, n_iterations, nrow(newdata))}
}
\description{
Predicted values based on either xgboost model or model handle object.
}
\details{
Note that \code{ntreelimit} is not necessarily equal to the number of boosting iterations
and it is not necessarily equal to the number of trees in a model.
E.g., in a random forest-like model, \code{ntreelimit} would limit the number of trees.
But for multiclass classification, while there are multiple trees per iteration,
\code{ntreelimit} limits the number of boosting iterations.
Also note that \code{ntreelimit} would currently do nothing for predictions from gblinear,
Note that \code{iterationrange} would currently do nothing for predictions from gblinear,
since gblinear doesn't keep its boosting history.
One possible practical application of the \code{predleaf} option is to use the model
@@ -120,7 +132,7 @@ bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
# use all trees by default
pred <- predict(bst, test$data)
# use only the 1st tree
pred1 <- predict(bst, test$data, ntreelimit = 1)
pred1 <- predict(bst, test$data, iterationrange = c(1, 2))
# Predicting tree leafs:
# the result is an nsamples X ntrees matrix
@@ -172,25 +184,9 @@ str(pred)
all.equal(pred, pred_labels)
# prediction from using only 5 iterations should result
# in the same error as seen in iteration 5:
pred5 <- predict(bst, as.matrix(iris[, -5]), ntreelimit=5)
pred5 <- predict(bst, as.matrix(iris[, -5]), iterationrange=c(1, 6))
sum(pred5 != lb)/length(lb)
## random forest-like model of 25 trees for binary classification:
set.seed(11)
bst <- xgboost(data = train$data, label = train$label, max_depth = 5,
nthread = 2, nrounds = 1, objective = "binary:logistic",
num_parallel_tree = 25, subsample = 0.6, colsample_bytree = 0.1)
# Inspect the prediction error vs number of trees:
lb <- test$label
dtest <- xgb.DMatrix(test$data, label=lb)
err <- sapply(1:25, function(n) {
pred <- predict(bst, dtest, ntreelimit=n)
sum((pred > 0.5) != lb)/length(lb)
})
plot(err, type='l', ylim=c(0,0.1), xlab='#trees')
}
\references{
Scott M. Lundberg, Su-In Lee, "A Unified Approach to Interpreting Model Predictions", NIPS Proceedings 2017, \url{https://arxiv.org/abs/1705.07874}
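A short sketch of the strict_shape behaviour described above, assuming the binary agaricus booster bst and the test data from these examples: the output becomes an array whose dimensions no longer depend on the model type.

pred <- predict(bst, test$data, strict_shape = TRUE)
dim(pred)  # (num_class, nrow(newdata)); here c(1, nrow(test$data))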

View File

@@ -0,0 +1,27 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.ggplot.R
\name{prepare.ggplot.shap.data}
\alias{prepare.ggplot.shap.data}
\title{Combine and melt feature values and SHAP contributions for sample
observations.}
\usage{
prepare.ggplot.shap.data(data_list, normalize = FALSE)
}
\arguments{
\item{data_list}{List containing 'data' and 'shap_contrib' returned by
\code{xgb.shap.data()}.}
\item{normalize}{Whether to standardize feature values to have mean 0 and
standard deviation 1 (useful for comparing multiple features on the same
plot). Default \code{FALSE}.}
}
\value{
A data.table containing the observation ID, the feature name, the
feature value (normalized if specified), and the SHAP contribution value.
}
\description{
Conforms to data format required for ggplot functions.
}
\details{
Internal utility function.
}

View File

@@ -19,8 +19,7 @@ Currently it displays dimensions and presence of info-fields and colnames.
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
dtrain
print(dtrain, verbose=TRUE)

View File

@@ -25,16 +25,15 @@ Set information of an xgb.DMatrix object
The \code{name} field can be one of the following:
\itemize{
\item \code{label}: label Xgboost learn from ;
\item \code{label}: label XGBoost learns from;
\item \code{weight}: to do a weight rescale;
\item \code{base_margin}: base margin is the base prediction Xgboost will boost from ;
\item \code{base_margin}: base margin is the base prediction XGBoost will boost from;
\item \code{group}: number of rows in each group (to use with \code{rank:pairwise} objective).
}
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
labels <- getinfo(dtrain, 'label')
setinfo(dtrain, 'label', 1-labels)

View File

@@ -28,8 +28,7 @@ original xgb.DMatrix object
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
dsub <- slice(dtrain, 1:42)
labels1 <- getinfo(dsub, 'label')

View File

@@ -4,7 +4,14 @@
\alias{xgb.DMatrix}
\title{Construct xgb.DMatrix object}
\usage{
xgb.DMatrix(data, info = list(), missing = NA, silent = FALSE, ...)
xgb.DMatrix(
data,
info = list(),
missing = NA,
silent = FALSE,
nthread = NULL,
...
)
}
\arguments{
\item{data}{a \code{matrix} object (either numeric or integer), a \code{dgCMatrix} object, or a character
@@ -18,17 +25,18 @@ It is useful when a 0 or some other extreme value represents missing values in d
\item{silent}{whether to suppress printing an informational message after loading from a file.}
\item{nthread}{Number of threads used for creating DMatrix.}
\item{...}{the \code{info} data could be passed directly as parameters, without creating an \code{info} list.}
}
\description{
Construct xgb.DMatrix object from either a dense matrix, a sparse matrix, or a local file.
Supported input file formats are either a libsvm text file or a binary file that was created previously by
Supported input file formats are either a LIBSVM text file or a binary file that was created previously by
\code{\link{xgb.DMatrix.save}}).
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
dtrain <- xgb.DMatrix('xgb.DMatrix.data')
if (file.exists('xgb.DMatrix.data')) file.remove('xgb.DMatrix.data')
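A one-line sketch of the newly added nthread argument, capping the number of threads used while constructing the DMatrix:

dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label, nthread = 2))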

View File

@@ -16,8 +16,7 @@ Save xgb.DMatrix object to binary file
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
dtrain <- xgb.DMatrix(train$data, label=train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
xgb.DMatrix.save(dtrain, 'xgb.DMatrix.data')
dtrain <- xgb.DMatrix('xgb.DMatrix.data')
if (file.exists('xgb.DMatrix.data')) file.remove('xgb.DMatrix.data')

View File

@@ -59,8 +59,8 @@ a rule on certain features."
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
dtest <- with(agaricus.test, xgb.DMatrix(data, label = label))
param <- list(max_depth=2, eta=1, silent=1, objective='binary:logistic')
nrounds = 4

View File

@@ -70,6 +70,8 @@ from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callb
\item \code{error} binary classification error rate
\item \code{rmse} Root mean square error
\item \code{logloss} negative log-likelihood function
\item \code{mae} Mean absolute error
\item \code{mape} Mean absolute percentage error
\item \code{auc} Area under curve
\item \code{aucpr} Area under PR curve
\item \code{merror} Exact matching error, used to evaluate multi-class classification
@@ -133,9 +135,7 @@ An object of class \code{xgb.cv.synchronous} with the following elements:
parameter or randomly generated.
\item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method
(only available with early stopping).
\item \code{best_ntreelimit} and \code{ntreelimit}: deprecated attributes, use \code{best_iteration} instead.
\item \code{pred} CV prediction values available when \code{prediction} is set.
It is either vector or matrix (see \code{\link{cb.cv.predict}}).
\item \code{models} a list of the CV folds' models. It is only available with the explicit
@@ -154,11 +154,11 @@ The cross-validation process is then repeated \code{nrounds} times, with each of
All observations are used for both training and validation.
Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
}
\examples{
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse","auc"),
max_depth = 3, eta = 1, objective = "binary:logistic")
print(cv)
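As a follow-up sketch: per-iteration CV results live in the returned object's evaluation_log (a data.table), assuming the cv object and the rmse/auc metrics from the example above.

print(cv$evaluation_log)
# e.g. pick the iteration with the best held-out AUC:
best_iter <- cv$evaluation_log[which.max(test_auc_mean), iter]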

View File

@@ -87,7 +87,7 @@ more than 5 distinct values.}
\item{which}{whether to do univariate or bivariate plotting. NOTE: only 1D is implemented so far.}
\item{plot}{whether a plot should be drawn. If FALSE, only a lits of matrices is returned.}
\item{plot}{whether a plot should be drawn. If FALSE, only a list of matrices is returned.}
\item{...}{other parameters passed to \code{plot}.}
}
@@ -131,6 +131,7 @@ bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
xgb.plot.shap(agaricus.test$data, model = bst, features = "odor=none")
contr <- predict(bst, agaricus.test$data, predcontrib = TRUE)
xgb.plot.shap(agaricus.test$data, contr, model = bst, top_n = 12, n_col = 3)
xgb.ggplot.shap.summary(agaricus.test$data, contr, model = bst, top_n = 12) # Summary plot
# multiclass example - plots for each class separately:
nclass <- 3
@@ -149,6 +150,7 @@ xgb.plot.shap(x, model = mbst, trees = trees0 + 1, target_class = 1, top_n = 4,
n_col = 2, col = col, pch = 16, pch_NA = 17)
xgb.plot.shap(x, model = mbst, trees = trees0 + 2, target_class = 2, top_n = 4,
n_col = 2, col = col, pch = 16, pch_NA = 17)
xgb.ggplot.shap.summary(x, model = mbst, target_class = 0, top_n = 4) # Summary plot
}
\references{

View File

@@ -0,0 +1,78 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.ggplot.R, R/xgb.plot.shap.R
\name{xgb.ggplot.shap.summary}
\alias{xgb.ggplot.shap.summary}
\alias{xgb.plot.shap.summary}
\title{SHAP contribution dependency summary plot}
\usage{
xgb.ggplot.shap.summary(
data,
shap_contrib = NULL,
features = NULL,
top_n = 10,
model = NULL,
trees = NULL,
target_class = NULL,
approxcontrib = FALSE,
subsample = NULL
)
xgb.plot.shap.summary(
data,
shap_contrib = NULL,
features = NULL,
top_n = 10,
model = NULL,
trees = NULL,
target_class = NULL,
approxcontrib = FALSE,
subsample = NULL
)
}
\arguments{
\item{data}{data as a \code{matrix} or \code{dgCMatrix}.}
\item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above
\code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.}
\item{features}{a vector of either column indices or of feature names to plot. When it is NULL,
feature importance is calculated, and \code{top_n} high ranked features are taken.}
\item{top_n}{when \code{features} is NULL, top_n [1, 100] most important features in a model are taken.}
\item{model}{an \code{xgb.Booster} model. It has to be provided when either \code{shap_contrib}
or \code{features} is missing.}
\item{trees}{passed to \code{\link{xgb.importance}} when \code{features = NULL}.}
\item{target_class}{is only relevant for multiclass models. When it is set to a 0-based class index,
only SHAP contributions for that specific class are used.
If it is not set, SHAP importances are averaged over all classes.}
\item{approxcontrib}{passed to \code{\link{predict.xgb.Booster}} when \code{shap_contrib = NULL}.}
\item{subsample}{a random fraction of data points to use for plotting. When it is NULL,
it is set so that up to 100K data points are used.}
}
\value{
A \code{ggplot2} object.
}
\description{
Compare SHAP contributions of different features.
}
\details{
A point plot (each point representing one sample from \code{data}) is
produced for each feature, with the points plotted on the SHAP value axis.
Each point (observation) is coloured based on its feature value. The plot
hence allows us to see which features have a negative / positive contribution
on the model prediction, and whether the contribution is different for larger
or smaller values of the feature. We effectively try to replicate the
\code{summary_plot} function from https://github.com/slundberg/shap.
}
\examples{
# See \code{\link{xgb.plot.shap}}.
}
\seealso{
\code{\link{xgb.plot.shap}}, \code{\link{xgb.ggplot.shap.summary}},
\url{https://github.com/slundberg/shap}
}
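Since the examples section only defers to xgb.plot.shap, a minimal usage sketch mirroring those examples (the hyperparameters are illustrative):

data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
bst <- xgboost(agaricus.train$data, agaricus.train$label, nrounds = 50,
               eta = 0.1, max_depth = 3, objective = "binary:logistic", nthread = 2)
xgb.ggplot.shap.summary(agaricus.test$data, model = bst, top_n = 6)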

View File

@@ -0,0 +1,55 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.plot.shap.R
\name{xgb.shap.data}
\alias{xgb.shap.data}
\title{Prepare data for SHAP plots. To be used in xgb.plot.shap, xgb.plot.shap.summary, etc.
Internal utility function.}
\usage{
xgb.shap.data(
data,
shap_contrib = NULL,
features = NULL,
top_n = 1,
model = NULL,
trees = NULL,
target_class = NULL,
approxcontrib = FALSE,
subsample = NULL,
max_observations = 1e+05
)
}
\arguments{
\item{data}{data as a \code{matrix} or \code{dgCMatrix}.}
\item{shap_contrib}{a matrix of SHAP contributions that was computed earlier for the above
\code{data}. When it is NULL, it is computed internally using \code{model} and \code{data}.}
\item{features}{a vector of either column indices or of feature names to plot. When it is NULL,
feature importance is calculated, and \code{top_n} high ranked features are taken.}
\item{top_n}{when \code{features} is NULL, top_n [1, 100] most important features in a model are taken.}
\item{model}{an \code{xgb.Booster} model. It has to be provided when either \code{shap_contrib}
or \code{features} is missing.}
\item{trees}{passed to \code{\link{xgb.importance}} when \code{features = NULL}.}
\item{target_class}{is only relevant for multiclass models. When it is set to a 0-based class index,
only SHAP contributions for that specific class are used.
If it is not set, SHAP importances are averaged over all classes.}
\item{approxcontrib}{passed to \code{\link{predict.xgb.Booster}} when \code{shap_contrib = NULL}.}
\item{subsample}{a random fraction of data points to use for plotting. When it is NULL,
it is set so that up to 100K data points are used.}
}
\value{
A list containing: 'data', a matrix containing sample observations
and their feature values; 'shap_contrib', a matrix containing the SHAP contribution
values for these observations.
}
\description{
Prepare data for SHAP plots. To be used in xgb.plot.shap, xgb.plot.shap.summary, etc.
Internal utility function.
}
\keyword{internal}

View File

@@ -54,7 +54,7 @@ xgboost(
2. Booster Parameters
2.1. Parameter for Tree Booster
2.1. Parameters for Tree Booster
\itemize{
\item \code{eta} controls the learning rate: it scales the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} value means a model more robust to overfitting but slower to compute. Default: 0.3
@@ -63,12 +63,14 @@ xgboost(
\item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger it is, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly samples half of the data instances to grow trees, which prevents overfitting and makes computation shorter (less data to analyse). It is advised to use this parameter with \code{eta} and to increase \code{nrounds}. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
\item \code{lambda} L2 regularization term on weights. Default: 1
\item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
\item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
\item \code{monotone_constraints} A numerical vector consists of \code{1}, \code{0} and \code{-1} with its length equals to the number of features in the training data. \code{1} is increasing, \code{-1} is decreasing and \code{0} is no constraint.
\item \code{interaction_constraints} A list of vectors specifying feature indices of permitted interactions. Each item of the list represents one permitted interaction where specified features are allowed to interact with each other. Feature index values should start from \code{0} (\code{0} references the first column). Leave argument unspecified for no interaction constraints.
}
2.2. Parameter for Linear Booster
2.2. Parameters for Linear Booster
\itemize{
\item \code{lambda} L2 regularization term on weights. Default: 0
@@ -88,10 +90,10 @@ xgboost(
\item \code{binary:logistic} logistic regression for binary classification. Output probability.
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
\item \code{binary:hinge}: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
\item \code{count:poisson}: poisson regression for count data, output mean of poisson distribution. \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).
\item \code{count:poisson}: Poisson regression for count data, output mean of Poisson distribution. \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).
\item \code{survival:cox}: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function \code{h(t) = h0(t) * HR}).
\item \code{survival:aft}: Accelerated failure time model for censored survival time data. See \href{https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html}{Survival Analysis with Accelerated Failure Time} for details.
\item \code{aft_loss_distribution}: Probabilty Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
\item \code{aft_loss_distribution}: Probability Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
\item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
@@ -185,9 +187,6 @@ An object of class \code{xgb.Booster} with the following elements:
explicitly passed.
\item \code{best_iteration} iteration number with the best evaluation metric value
(only available with early stopping).
\item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration,
which could further be used in \code{predict} method
(only available with early stopping).
\item \code{best_score} the best evaluation metric value during early stopping.
(only available with early stopping).
\item \code{feature_names} names of the training dataset features
@@ -209,22 +208,24 @@ than the \code{xgboost} interface.
Parallelization is automatically enabled if \code{OpenMP} is present.
Number of threads can also be manually specified via \code{nthread} parameter.
The evaluation metric is chosen automatically by Xgboost (according to the objective)
The evaluation metric is chosen automatically by XGBoost (according to the objective)
when the \code{eval_metric} parameter is not provided.
The user may set one or several \code{eval_metric} parameters.
Note that when using a customized metric, only this single metric can be used.
The following is the list of built-in metrics for which Xgboost provides optimized implementation:
The following is the list of built-in metrics for which XGBoost provides optimized implementation:
\itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
\item \code{mlogloss} multiclass logloss. \url{http://wiki.fast.ai/index.php/Log_Loss}
\item \code{rmse} root mean square error. \url{https://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{https://en.wikipedia.org/wiki/Log-likelihood}
\item \code{mlogloss} multiclass logloss. \url{https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html}
\item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
Different threshold (e.g., 0.) could be specified as "error@0."
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
\item \code{mae} Mean absolute error
\item \code{mape} Mean absolute percentage error
\item \code{auc} Area under the curve. \url{https://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
\item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
\item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
\item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{https://en.wikipedia.org/wiki/NDCG}
}
The following callbacks are automatically created when certain parameters are set:
@@ -240,8 +241,8 @@ The following callbacks are automatically created when certain parameters are se
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
dtrain <- with(agaricus.train, xgb.DMatrix(data, label = label))
dtest <- with(agaricus.test, xgb.DMatrix(data, label = label))
watchlist <- list(train = dtrain, eval = dtest)
## A simple xgb.train example:
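The excerpt cuts off here. As an illustrative sketch of the two constraint parameters documented earlier in this file (hypothetical three-feature data: the first feature forced increasing, the third decreasing, and interactions permitted only within the 0-based feature groups {0, 1} and {2}):

param <- list(objective = "binary:logistic",
              monotone_constraints = c(1, 0, -1),             # per-feature directions
              interaction_constraints = list(c(0, 1), c(2)))  # 0-based column indices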

View File

@@ -4,10 +4,17 @@
\alias{xgb.unserialize}
\title{Load the instance back from \code{\link{xgb.serialize}}}
\usage{
xgb.unserialize(buffer)
xgb.unserialize(buffer, handle = NULL)
}
\arguments{
\item{buffer}{the buffer containing booster instance saved by \code{\link{xgb.serialize}}}
\item{handle}{An \code{xgb.Booster.handle} object which will be overwritten with
the new deserialized object. Must be a null handle (e.g. when loading the model through
`readRDS`). If not provided, a new handle will be created.}
}
\value{
An \code{xgb.Booster.handle} object.
}
\description{
Load the instance back from \code{\link{xgb.serialize}}

View File

@@ -0,0 +1,39 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.config.R
\name{xgb.set.config, xgb.get.config}
\alias{xgb.set.config, xgb.get.config}
\alias{xgb.set.config}
\alias{xgb.get.config}
\title{Set and get global configuration}
\usage{
xgb.set.config(...)
xgb.get.config()
}
\arguments{
\item{...}{List of parameters to be set, as keyword arguments}
}
\value{
\code{xgb.set.config} returns \code{TRUE} to signal success. \code{xgb.get.config} returns
a list containing all global-scope parameters and their values.
}
\description{
Global configuration consists of a collection of parameters that can be applied in the global
scope. See \url{https://xgboost.readthedocs.io/en/stable/parameter.html} for the full list of
parameters supported in the global configuration. Use \code{xgb.set.config} to update the
values of one or more global-scope parameters. Use \code{xgb.get.config} to fetch the current
values of all global-scope parameters (listed in
\url{https://xgboost.readthedocs.io/en/stable/parameter.html}).
}
\examples{
# Set verbosity level to silent (0)
xgb.set.config(verbosity = 0)
# Now global verbosity level is 0
config <- xgb.get.config()
print(config$verbosity)
# Set verbosity level to warning (1)
xgb.set.config(verbosity = 1)
# Now global verbosity level is 1
config <- xgb.get.config()
print(config$verbosity)
}

View File

@@ -8,7 +8,7 @@ CXX_STD = CXX14
XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\
-DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\
-DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1\
-DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_
-DRABIT_CUSTOMIZE_MSG_
# disable the use of thread_local for 32 bit windows:
ifeq ($(R_OSTYPE)$(WIN),windows)
@@ -17,8 +17,9 @@ endif
$(foreach v, $(XGB_RFLAGS), $(warning $(v)))
PKG_CPPFLAGS= -I$(PKGROOT)/include -I$(PKGROOT)/dmlc-core/include -I$(PKGROOT)/rabit/include -I$(PKGROOT) $(XGB_RFLAGS)
PKG_CXXFLAGS= @OPENMP_CXXFLAGS@ @ENDIAN_FLAG@ -pthread
PKG_CXXFLAGS= @OPENMP_CXXFLAGS@ @ENDIAN_FLAG@ -pthread $(CXX_VISIBILITY)
PKG_LIBS = @OPENMP_CXXFLAGS@ @OPENMP_LIB@ @ENDIAN_FLAG@ @BACKTRACE_LIB@ -pthread
OBJECTS= ./xgboost_R.o ./xgboost_custom.o ./xgboost_assert.o ./init.o\
$(PKGROOT)/amalgamation/xgboost-all0.o $(PKGROOT)/amalgamation/dmlc-minimum0.o\
$(PKGROOT)/rabit/src/engine_empty.o $(PKGROOT)/rabit/src/c_api.o
OBJECTS= ./xgboost_R.o ./xgboost_custom.o ./xgboost_assert.o ./init.o \
$(PKGROOT)/amalgamation/xgboost-all0.o $(PKGROOT)/amalgamation/dmlc-minimum0.o \
$(PKGROOT)/rabit/src/engine.o $(PKGROOT)/rabit/src/rabit_c_api.o \
$(PKGROOT)/rabit/src/allreduce_base.o

View File

@@ -3,7 +3,7 @@ PKGROOT=./
ENABLE_STD_THREAD=0
# _*_ mode: Makefile; _*_
# This file is only used for windows compilation from github
# This file is only used for Windows compilation from GitHub
# It will be replaced with Makevars.in for the CRAN version
.PHONY: all xgblib
all: $(SHLIB)
@@ -20,7 +20,7 @@ CXX_STD = CXX14
XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\
-DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\
-DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1\
-DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_
-DRABIT_CUSTOMIZE_MSG_
# disable the use of thread_local for 32 bit windows:
ifeq ($(R_OSTYPE)$(WIN),windows)
@@ -31,8 +31,9 @@ $(foreach v, $(XGB_RFLAGS), $(warning $(v)))
PKG_CPPFLAGS= -I$(PKGROOT)/include -I$(PKGROOT)/dmlc-core/include -I$(PKGROOT)/rabit/include -I$(PKGROOT) $(XGB_RFLAGS)
PKG_CXXFLAGS= $(SHLIB_OPENMP_CXXFLAGS) $(SHLIB_PTHREAD_FLAGS)
PKG_LIBS = $(SHLIB_OPENMP_CXXFLAGS) $(SHLIB_PTHREAD_FLAGS)
OBJECTS= ./xgboost_R.o ./xgboost_custom.o ./xgboost_assert.o ./init.o\
$(PKGROOT)/amalgamation/xgboost-all0.o $(PKGROOT)/amalgamation/dmlc-minimum0.o\
$(PKGROOT)/rabit/src/engine_empty.o $(PKGROOT)/rabit/src/c_api.o
OBJECTS= ./xgboost_R.o ./xgboost_custom.o ./xgboost_assert.o ./init.o \
$(PKGROOT)/amalgamation/xgboost-all0.o $(PKGROOT)/amalgamation/dmlc-minimum0.o \
$(PKGROOT)/rabit/src/engine.o $(PKGROOT)/rabit/src/rabit_c_api.o \
$(PKGROOT)/rabit/src/allreduce_base.o
$(OBJECTS) : xgblib

View File

@@ -9,6 +9,7 @@
#include <Rinternals.h>
#include <stdlib.h>
#include <R_ext/Rdynload.h>
#include <R_ext/Visibility.h>
/* FIXME:
Check these declarations against the C/Fortran source code.
@@ -17,6 +18,7 @@ Check these declarations against the C/Fortran source code.
/* .Call calls */
extern SEXP XGBoosterBoostOneIter_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterCreate_R(SEXP);
extern SEXP XGBoosterCreateInEmptyObj_R(SEXP, SEXP);
extern SEXP XGBoosterDumpModel_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterEvalOneIter_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterGetAttrNames_R(SEXP);
@@ -29,6 +31,7 @@ extern SEXP XGBoosterSerializeToBuffer_R(SEXP handle);
extern SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw);
extern SEXP XGBoosterModelToRaw_R(SEXP);
extern SEXP XGBoosterPredict_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterPredictFromDMatrix_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSaveModel_R(SEXP, SEXP);
extern SEXP XGBoosterSetAttr_R(SEXP, SEXP, SEXP);
extern SEXP XGBoosterSetParam_R(SEXP, SEXP, SEXP);
@@ -36,17 +39,21 @@ extern SEXP XGBoosterUpdateOneIter_R(SEXP, SEXP, SEXP);
extern SEXP XGCheckNullPtr_R(SEXP);
extern SEXP XGDMatrixCreateFromCSC_R(SEXP, SEXP, SEXP, SEXP);
extern SEXP XGDMatrixCreateFromFile_R(SEXP, SEXP);
extern SEXP XGDMatrixCreateFromMat_R(SEXP, SEXP);
extern SEXP XGDMatrixCreateFromMat_R(SEXP, SEXP, SEXP);
extern SEXP XGDMatrixGetInfo_R(SEXP, SEXP);
extern SEXP XGDMatrixNumCol_R(SEXP);
extern SEXP XGDMatrixNumRow_R(SEXP);
extern SEXP XGDMatrixSaveBinary_R(SEXP, SEXP, SEXP);
extern SEXP XGDMatrixSetInfo_R(SEXP, SEXP, SEXP);
extern SEXP XGDMatrixSliceDMatrix_R(SEXP, SEXP);
extern SEXP XGBSetGlobalConfig_R(SEXP);
extern SEXP XGBGetGlobalConfig_R();
extern SEXP XGBoosterFeatureScore_R(SEXP, SEXP);
static const R_CallMethodDef CallEntries[] = {
{"XGBoosterBoostOneIter_R", (DL_FUNC) &XGBoosterBoostOneIter_R, 4},
{"XGBoosterCreate_R", (DL_FUNC) &XGBoosterCreate_R, 1},
{"XGBoosterCreateInEmptyObj_R", (DL_FUNC) &XGBoosterCreateInEmptyObj_R, 2},
{"XGBoosterDumpModel_R", (DL_FUNC) &XGBoosterDumpModel_R, 4},
{"XGBoosterEvalOneIter_R", (DL_FUNC) &XGBoosterEvalOneIter_R, 4},
{"XGBoosterGetAttrNames_R", (DL_FUNC) &XGBoosterGetAttrNames_R, 1},
@@ -59,6 +66,7 @@ static const R_CallMethodDef CallEntries[] = {
{"XGBoosterUnserializeFromBuffer_R", (DL_FUNC) &XGBoosterUnserializeFromBuffer_R, 2},
{"XGBoosterModelToRaw_R", (DL_FUNC) &XGBoosterModelToRaw_R, 1},
{"XGBoosterPredict_R", (DL_FUNC) &XGBoosterPredict_R, 5},
{"XGBoosterPredictFromDMatrix_R", (DL_FUNC) &XGBoosterPredictFromDMatrix_R, 3},
{"XGBoosterSaveModel_R", (DL_FUNC) &XGBoosterSaveModel_R, 2},
{"XGBoosterSetAttr_R", (DL_FUNC) &XGBoosterSetAttr_R, 3},
{"XGBoosterSetParam_R", (DL_FUNC) &XGBoosterSetParam_R, 3},
@@ -66,20 +74,23 @@ static const R_CallMethodDef CallEntries[] = {
{"XGCheckNullPtr_R", (DL_FUNC) &XGCheckNullPtr_R, 1},
{"XGDMatrixCreateFromCSC_R", (DL_FUNC) &XGDMatrixCreateFromCSC_R, 4},
{"XGDMatrixCreateFromFile_R", (DL_FUNC) &XGDMatrixCreateFromFile_R, 2},
{"XGDMatrixCreateFromMat_R", (DL_FUNC) &XGDMatrixCreateFromMat_R, 2},
{"XGDMatrixCreateFromMat_R", (DL_FUNC) &XGDMatrixCreateFromMat_R, 3},
{"XGDMatrixGetInfo_R", (DL_FUNC) &XGDMatrixGetInfo_R, 2},
{"XGDMatrixNumCol_R", (DL_FUNC) &XGDMatrixNumCol_R, 1},
{"XGDMatrixNumRow_R", (DL_FUNC) &XGDMatrixNumRow_R, 1},
{"XGDMatrixSaveBinary_R", (DL_FUNC) &XGDMatrixSaveBinary_R, 3},
{"XGDMatrixSetInfo_R", (DL_FUNC) &XGDMatrixSetInfo_R, 3},
{"XGDMatrixSliceDMatrix_R", (DL_FUNC) &XGDMatrixSliceDMatrix_R, 2},
{"XGBSetGlobalConfig_R", (DL_FUNC) &XGBSetGlobalConfig_R, 1},
{"XGBGetGlobalConfig_R", (DL_FUNC) &XGBGetGlobalConfig_R, 0},
{"XGBoosterFeatureScore_R", (DL_FUNC) &XGBoosterFeatureScore_R, 2},
{NULL, NULL, 0}
};
#if defined(_WIN32)
__declspec(dllexport)
#endif // defined(_WIN32)
void R_init_xgboost(DllInfo *dll) {
void attribute_visible R_init_xgboost(DllInfo *dll) {
R_registerRoutines(dll, NULL, CallEntries, NULL, NULL);
R_useDynamicSymbols(dll, FALSE);
}

View File

@@ -0,0 +1,3 @@
LIBRARY xgboost.dll
EXPORTS
R_init_xgboost

View File

@@ -1,6 +1,7 @@
// Copyright (c) 2014 by Contributors
#include <dmlc/logging.h>
#include <dmlc/omp.h>
#include <dmlc/common.h>
#include <xgboost/c_api.h>
#include <vector>
#include <string>
@@ -8,6 +9,8 @@
#include <cstring>
#include <cstdio>
#include <sstream>
#include "../../src/common/threading_utils.h"
#include "./xgboost_R.h"
/*!
@@ -37,11 +40,11 @@
using namespace dmlc;
SEXP XGCheckNullPtr_R(SEXP handle) {
XGB_DLL SEXP XGCheckNullPtr_R(SEXP handle) {
return ScalarLogical(R_ExternalPtrAddr(handle) == NULL);
}
void _DMatrixFinalizer(SEXP ext) {
XGB_DLL void _DMatrixFinalizer(SEXP ext) {
R_API_BEGIN();
if (R_ExternalPtrAddr(ext) == NULL) return;
CHECK_CALL(XGDMatrixFree(R_ExternalPtrAddr(ext)));
@@ -49,7 +52,22 @@ void _DMatrixFinalizer(SEXP ext) {
R_API_END();
}
SEXP XGDMatrixCreateFromFile_R(SEXP fname, SEXP silent) {
XGB_DLL SEXP XGBSetGlobalConfig_R(SEXP json_str) {
R_API_BEGIN();
CHECK_CALL(XGBSetGlobalConfig(CHAR(asChar(json_str))));
R_API_END();
return R_NilValue;
}
XGB_DLL SEXP XGBGetGlobalConfig_R() {
const char* json_str;
R_API_BEGIN();
CHECK_CALL(XGBGetGlobalConfig(&json_str));
R_API_END();
return mkString(json_str);
}
XGB_DLL SEXP XGDMatrixCreateFromFile_R(SEXP fname, SEXP silent) {
SEXP ret;
R_API_BEGIN();
DMatrixHandle handle;
@@ -61,8 +79,7 @@ SEXP XGDMatrixCreateFromFile_R(SEXP fname, SEXP silent) {
return ret;
}
SEXP XGDMatrixCreateFromMat_R(SEXP mat,
SEXP missing) {
XGB_DLL SEXP XGDMatrixCreateFromMat_R(SEXP mat, SEXP missing, SEXP n_threads) {
SEXP ret;
R_API_BEGIN();
SEXP dim = getAttrib(mat, R_DimSymbol);
@@ -77,14 +94,21 @@ SEXP XGDMatrixCreateFromMat_R(SEXP mat,
din = REAL(mat);
}
std::vector<float> data(nrow * ncol);
#pragma omp parallel for schedule(static)
dmlc::OMPException exc;
int32_t threads = xgboost::common::OmpGetNumThreads(asInteger(n_threads));
#pragma omp parallel for schedule(static) num_threads(threads)
for (omp_ulong i = 0; i < nrow; ++i) {
for (size_t j = 0; j < ncol; ++j) {
data[i * ncol +j] = is_int ? static_cast<float>(iin[i + nrow * j]) : din[i + nrow * j];
}
exc.Run([&]() {
for (size_t j = 0; j < ncol; ++j) {
data[i * ncol +j] = is_int ? static_cast<float>(iin[i + nrow * j]) : din[i + nrow * j];
}
});
}
exc.Rethrow();
DMatrixHandle handle;
CHECK_CALL(XGDMatrixCreateFromMat(BeginPtr(data), nrow, ncol, asReal(missing), &handle));
CHECK_CALL(XGDMatrixCreateFromMat_omp(BeginPtr(data), nrow, ncol,
asReal(missing), &handle, threads));
ret = PROTECT(R_MakeExternalPtr(handle, R_NilValue, R_NilValue));
R_RegisterCFinalizerEx(ret, _DMatrixFinalizer, TRUE);
R_API_END();
@@ -92,10 +116,8 @@ SEXP XGDMatrixCreateFromMat_R(SEXP mat,
return ret;
}
SEXP XGDMatrixCreateFromCSC_R(SEXP indptr,
SEXP indices,
SEXP data,
SEXP num_row) {
XGB_DLL SEXP XGDMatrixCreateFromCSC_R(SEXP indptr, SEXP indices, SEXP data,
SEXP num_row) {
SEXP ret;
R_API_BEGIN();
const int *p_indptr = INTEGER(indptr);
@@ -111,11 +133,15 @@ SEXP XGDMatrixCreateFromCSC_R(SEXP indptr,
for (size_t i = 0; i < nindptr; ++i) {
col_ptr_[i] = static_cast<size_t>(p_indptr[i]);
}
dmlc::OMPException exc;
#pragma omp parallel for schedule(static)
for (int64_t i = 0; i < static_cast<int64_t>(ndata); ++i) {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
exc.Run([&]() {
indices_[i] = static_cast<unsigned>(p_indices[i]);
data_[i] = static_cast<float>(p_data[i]);
});
}
exc.Rethrow();
DMatrixHandle handle;
CHECK_CALL(XGDMatrixCreateFromCSCEx(BeginPtr(col_ptr_), BeginPtr(indices_),
BeginPtr(data_), nindptr, ndata,
@@ -127,7 +153,7 @@ SEXP XGDMatrixCreateFromCSC_R(SEXP indptr,
return ret;
}
SEXP XGDMatrixSliceDMatrix_R(SEXP handle, SEXP idxset) {
XGB_DLL SEXP XGDMatrixSliceDMatrix_R(SEXP handle, SEXP idxset) {
SEXP ret;
R_API_BEGIN();
int len = length(idxset);
@@ -147,7 +173,7 @@ SEXP XGDMatrixSliceDMatrix_R(SEXP handle, SEXP idxset) {
return ret;
}
SEXP XGDMatrixSaveBinary_R(SEXP handle, SEXP fname, SEXP silent) {
XGB_DLL SEXP XGDMatrixSaveBinary_R(SEXP handle, SEXP fname, SEXP silent) {
R_API_BEGIN();
CHECK_CALL(XGDMatrixSaveBinary(R_ExternalPtrAddr(handle),
CHAR(asChar(fname)),
@@ -156,16 +182,20 @@ SEXP XGDMatrixSaveBinary_R(SEXP handle, SEXP fname, SEXP silent) {
return R_NilValue;
}
SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
XGB_DLL SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
R_API_BEGIN();
int len = length(array);
const char *name = CHAR(asChar(field));
dmlc::OMPException exc;
if (!strcmp("group", name)) {
std::vector<unsigned> vec(len);
#pragma omp parallel for schedule(static)
for (int i = 0; i < len; ++i) {
vec[i] = static_cast<unsigned>(INTEGER(array)[i]);
exc.Run([&]() {
vec[i] = static_cast<unsigned>(INTEGER(array)[i]);
});
}
exc.Rethrow();
CHECK_CALL(XGDMatrixSetUIntInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len));
@@ -173,8 +203,11 @@ SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
std::vector<float> vec(len);
#pragma omp parallel for schedule(static)
for (int i = 0; i < len; ++i) {
vec[i] = REAL(array)[i];
exc.Run([&]() {
vec[i] = REAL(array)[i];
});
}
exc.Rethrow();
CHECK_CALL(XGDMatrixSetFloatInfo(R_ExternalPtrAddr(handle),
CHAR(asChar(field)),
BeginPtr(vec), len));
@@ -183,7 +216,7 @@ SEXP XGDMatrixSetInfo_R(SEXP handle, SEXP field, SEXP array) {
return R_NilValue;
}
SEXP XGDMatrixGetInfo_R(SEXP handle, SEXP field) {
XGB_DLL SEXP XGDMatrixGetInfo_R(SEXP handle, SEXP field) {
SEXP ret;
R_API_BEGIN();
bst_ulong olen;
@@ -201,7 +234,7 @@ SEXP XGDMatrixGetInfo_R(SEXP handle, SEXP field) {
return ret;
}
SEXP XGDMatrixNumRow_R(SEXP handle) {
XGB_DLL SEXP XGDMatrixNumRow_R(SEXP handle) {
bst_ulong nrow;
R_API_BEGIN();
CHECK_CALL(XGDMatrixNumRow(R_ExternalPtrAddr(handle), &nrow));
@@ -209,7 +242,7 @@ SEXP XGDMatrixNumRow_R(SEXP handle) {
return ScalarInteger(static_cast<int>(nrow));
}
SEXP XGDMatrixNumCol_R(SEXP handle) {
XGB_DLL SEXP XGDMatrixNumCol_R(SEXP handle) {
bst_ulong ncol;
R_API_BEGIN();
CHECK_CALL(XGDMatrixNumCol(R_ExternalPtrAddr(handle), &ncol));
@@ -224,7 +257,7 @@ void _BoosterFinalizer(SEXP ext) {
R_ClearExternalPtr(ext);
}
SEXP XGBoosterCreate_R(SEXP dmats) {
XGB_DLL SEXP XGBoosterCreate_R(SEXP dmats) {
SEXP ret;
R_API_BEGIN();
int len = length(dmats);
@@ -241,7 +274,22 @@ SEXP XGBoosterCreate_R(SEXP dmats) {
return ret;
}
SEXP XGBoosterSetParam_R(SEXP handle, SEXP name, SEXP val) {
XGB_DLL SEXP XGBoosterCreateInEmptyObj_R(SEXP dmats, SEXP R_handle) {
R_API_BEGIN();
int len = length(dmats);
std::vector<void*> dvec;
for (int i = 0; i < len; ++i) {
dvec.push_back(R_ExternalPtrAddr(VECTOR_ELT(dmats, i)));
}
BoosterHandle handle;
CHECK_CALL(XGBoosterCreate(BeginPtr(dvec), dvec.size(), &handle));
R_SetExternalPtrAddr(R_handle, handle);
R_RegisterCFinalizerEx(R_handle, _BoosterFinalizer, TRUE);
R_API_END();
return R_NilValue;
}
XGB_DLL SEXP XGBoosterSetParam_R(SEXP handle, SEXP name, SEXP val) {
R_API_BEGIN();
CHECK_CALL(XGBoosterSetParam(R_ExternalPtrAddr(handle),
CHAR(asChar(name)),
@@ -250,7 +298,7 @@ SEXP XGBoosterSetParam_R(SEXP handle, SEXP name, SEXP val) {
return R_NilValue;
}
SEXP XGBoosterUpdateOneIter_R(SEXP handle, SEXP iter, SEXP dtrain) {
XGB_DLL SEXP XGBoosterUpdateOneIter_R(SEXP handle, SEXP iter, SEXP dtrain) {
R_API_BEGIN();
CHECK_CALL(XGBoosterUpdateOneIter(R_ExternalPtrAddr(handle),
asInteger(iter),
@@ -259,17 +307,21 @@ SEXP XGBoosterUpdateOneIter_R(SEXP handle, SEXP iter, SEXP dtrain) {
return R_NilValue;
}
SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP hess) {
XGB_DLL SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP hess) {
R_API_BEGIN();
CHECK_EQ(length(grad), length(hess))
<< "gradient and hess must have same length";
int len = length(grad);
std::vector<float> tgrad(len), thess(len);
dmlc::OMPException exc;
#pragma omp parallel for schedule(static)
for (int j = 0; j < len; ++j) {
tgrad[j] = REAL(grad)[j];
thess[j] = REAL(hess)[j];
exc.Run([&]() {
tgrad[j] = REAL(grad)[j];
thess[j] = REAL(hess)[j];
});
}
exc.Rethrow();
CHECK_CALL(XGBoosterBoostOneIter(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dtrain),
BeginPtr(tgrad), BeginPtr(thess),
@@ -278,7 +330,7 @@ SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP hess) {
return R_NilValue;
}
SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames) {
XGB_DLL SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames) {
const char *ret;
R_API_BEGIN();
CHECK_EQ(length(dmats), length(evnames))
@@ -303,8 +355,8 @@ SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames) {
return mkString(ret);
}
SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
SEXP ntree_limit, SEXP training) {
XGB_DLL SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
SEXP ntree_limit, SEXP training) {
SEXP ret;
R_API_BEGIN();
bst_ulong olen;
@@ -324,21 +376,60 @@ SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
return ret;
}
SEXP XGBoosterLoadModel_R(SEXP handle, SEXP fname) {
XGB_DLL SEXP XGBoosterPredictFromDMatrix_R(SEXP handle, SEXP dmat, SEXP json_config) {
SEXP r_out_shape;
SEXP r_out_result;
SEXP r_out;
R_API_BEGIN();
char const *c_json_config = CHAR(asChar(json_config));
bst_ulong out_dim;
bst_ulong const *out_shape;
float const *out_result;
CHECK_CALL(XGBoosterPredictFromDMatrix(R_ExternalPtrAddr(handle),
R_ExternalPtrAddr(dmat), c_json_config,
&out_shape, &out_dim, &out_result));
r_out_shape = PROTECT(allocVector(INTSXP, out_dim));
size_t len = 1;
for (size_t i = 0; i < out_dim; ++i) {
INTEGER(r_out_shape)[i] = out_shape[i];
len *= out_shape[i];
}
r_out_result = PROTECT(allocVector(REALSXP, len));
#pragma omp parallel for
for (omp_ulong i = 0; i < len; ++i) {
REAL(r_out_result)[i] = out_result[i];
}
r_out = PROTECT(allocVector(VECSXP, 2));
SET_VECTOR_ELT(r_out, 0, r_out_shape);
SET_VECTOR_ELT(r_out, 1, r_out_result);
R_API_END();
UNPROTECT(3);
return r_out;
}
XGB_DLL SEXP XGBoosterLoadModel_R(SEXP handle, SEXP fname) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname))));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterSaveModel_R(SEXP handle, SEXP fname) {
XGB_DLL SEXP XGBoosterSaveModel_R(SEXP handle, SEXP fname) {
R_API_BEGIN();
CHECK_CALL(XGBoosterSaveModel(R_ExternalPtrAddr(handle), CHAR(asChar(fname))));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterModelToRaw_R(SEXP handle) {
XGB_DLL SEXP XGBoosterModelToRaw_R(SEXP handle) {
SEXP ret;
R_API_BEGIN();
bst_ulong olen;
@@ -353,7 +444,7 @@ SEXP XGBoosterModelToRaw_R(SEXP handle) {
return ret;
}
SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
XGB_DLL SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModelFromBuffer(R_ExternalPtrAddr(handle),
RAW(raw),
@@ -362,7 +453,7 @@ SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
return R_NilValue;
}
SEXP XGBoosterSaveJsonConfig_R(SEXP handle) {
XGB_DLL SEXP XGBoosterSaveJsonConfig_R(SEXP handle) {
const char* ret;
R_API_BEGIN();
bst_ulong len {0};
@@ -373,14 +464,14 @@ SEXP XGBoosterSaveJsonConfig_R(SEXP handle) {
return mkString(ret);
}
SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value) {
XGB_DLL SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadJsonConfig(R_ExternalPtrAddr(handle), CHAR(asChar(value))));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterSerializeToBuffer_R(SEXP handle) {
XGB_DLL SEXP XGBoosterSerializeToBuffer_R(SEXP handle) {
SEXP ret;
R_API_BEGIN();
bst_ulong out_len;
@@ -395,7 +486,7 @@ SEXP XGBoosterSerializeToBuffer_R(SEXP handle) {
return ret;
}
SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw) {
XGB_DLL SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterUnserializeFromBuffer(R_ExternalPtrAddr(handle),
RAW(raw),
@@ -404,7 +495,7 @@ SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw) {
return R_NilValue;
}
SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats, SEXP dump_format) {
XGB_DLL SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats, SEXP dump_format) {
SEXP out;
R_API_BEGIN();
bst_ulong olen;
@@ -441,7 +532,7 @@ SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats, SEXP dump_for
return out;
}
SEXP XGBoosterGetAttr_R(SEXP handle, SEXP name) {
XGB_DLL SEXP XGBoosterGetAttr_R(SEXP handle, SEXP name) {
SEXP out;
R_API_BEGIN();
int success;
@@ -461,7 +552,7 @@ SEXP XGBoosterGetAttr_R(SEXP handle, SEXP name) {
return out;
}
SEXP XGBoosterSetAttr_R(SEXP handle, SEXP name, SEXP val) {
XGB_DLL SEXP XGBoosterSetAttr_R(SEXP handle, SEXP name, SEXP val) {
R_API_BEGIN();
const char *v = isNull(val) ? nullptr : CHAR(asChar(val));
CHECK_CALL(XGBoosterSetAttr(R_ExternalPtrAddr(handle),
@@ -470,7 +561,7 @@ SEXP XGBoosterSetAttr_R(SEXP handle, SEXP name, SEXP val) {
return R_NilValue;
}
SEXP XGBoosterGetAttrNames_R(SEXP handle) {
XGB_DLL SEXP XGBoosterGetAttrNames_R(SEXP handle) {
SEXP out;
R_API_BEGIN();
bst_ulong len;
@@ -489,3 +580,51 @@ SEXP XGBoosterGetAttrNames_R(SEXP handle) {
UNPROTECT(1);
return out;
}
XGB_DLL SEXP XGBoosterFeatureScore_R(SEXP handle, SEXP json_config) {
SEXP out_features_sexp;
SEXP out_scores_sexp;
SEXP out_shape_sexp;
SEXP r_out;
R_API_BEGIN();
char const *c_json_config = CHAR(asChar(json_config));
bst_ulong out_n_features;
char const **out_features;
bst_ulong out_dim;
bst_ulong const *out_shape;
float const *out_scores;
CHECK_CALL(XGBoosterFeatureScore(R_ExternalPtrAddr(handle), c_json_config,
&out_n_features, &out_features,
&out_dim, &out_shape, &out_scores));
out_shape_sexp = PROTECT(allocVector(INTSXP, out_dim));
size_t len = 1;
for (size_t i = 0; i < out_dim; ++i) {
INTEGER(out_shape_sexp)[i] = out_shape[i];
len *= out_shape[i];
}
out_scores_sexp = PROTECT(allocVector(REALSXP, len));
#pragma omp parallel for
for (omp_ulong i = 0; i < len; ++i) {
REAL(out_scores_sexp)[i] = out_scores[i];
}
out_features_sexp = PROTECT(allocVector(STRSXP, out_n_features));
for (size_t i = 0; i < out_n_features; ++i) {
SET_STRING_ELT(out_features_sexp, i, mkChar(out_features[i]));
}
r_out = PROTECT(allocVector(VECSXP, 3));
SET_VECTOR_ELT(r_out, 0, out_features_sexp);
SET_VECTOR_ELT(r_out, 1, out_shape_sexp);
SET_VECTOR_ELT(r_out, 2, out_scores_sexp);
R_API_END();
UNPROTECT(4);
return r_out;
}
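The three PROTECTed vectors above come back to R as a plain three-element list. A minimal sketch of unpacking it from the R side, assuming a string-based `.Call` for illustration (the package's real wrapper uses registered native symbols, so treat the plumbing here as hypothetical):

# Hypothetical R-side unpacking; illustrative only.
feature_score <- function(handle, config = '{"importance_type": "gain"}') {
  res <- .Call("XGBoosterFeatureScore_R", handle, config, PACKAGE = "xgboost")
  # res[[1]]: feature names; res[[2]]: shape; res[[3]]: flat score values
  if (length(res[[3]]) == length(res[[1]])) setNames(res[[3]], res[[1]]) else res
}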

View File

@@ -21,6 +21,19 @@
*/
XGB_DLL SEXP XGCheckNullPtr_R(SEXP handle);
/*!
* \brief Set global configuration
* \param json_str a JSON string representing the list of key-value pairs
* \return R_NilValue
*/
XGB_DLL SEXP XGBSetGlobalConfig_R(SEXP json_str);
/*!
* \brief Get global configuration
* \return JSON string
*/
XGB_DLL SEXP XGBGetGlobalConfig_R();
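These two entry points back the exported `xgb.set.config()` and `xgb.get.config()` helpers, exercised by the new test file later in this diff:

library(xgboost)
xgb.set.config(verbosity = 0)               # silence messages process-wide
stopifnot(xgb.get.config()$verbosity == 0)  # round-trips through XGBGetGlobalConfig_R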
/*!
* \brief load a data matrix
* \param fname name of the content
@@ -34,10 +47,12 @@ XGB_DLL SEXP XGDMatrixCreateFromFile_R(SEXP fname, SEXP silent);
* This assumes the matrix is stored in column major format
* \param data R Matrix object
* \param missing which value to represent missing value
* \param n_threads Number of threads used to construct DMatrix from dense matrix.
* \return created dmatrix
*/
XGB_DLL SEXP XGDMatrixCreateFromMat_R(SEXP mat,
SEXP missing);
SEXP missing,
SEXP n_threads);
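Assuming the R-level `xgb.DMatrix()` constructor forwards a matching `nthread` argument to this entry point (the R argument name is an assumption read off this header), dense construction can then be parallelised:

x <- matrix(runif(1e5), ncol = 10)
dtrain <- xgb.DMatrix(x, nthread = 4)  # assumed R-level spelling of n_threads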
/*!
* \brief create a matrix content from CSC format
* \param indptr pointer to column headers
@@ -103,6 +118,14 @@ XGB_DLL SEXP XGDMatrixNumCol_R(SEXP handle);
*/
XGB_DLL SEXP XGBoosterCreate_R(SEXP dmats);
/*!
* \brief create xgboost learner, saving the pointer into an existing R object
* \param dmats a list of dmatrix handles that will be cached
* \param R_handle a clean R external pointer (not holding any object)
*/
XGB_DLL SEXP XGBoosterCreateInEmptyObj_R(SEXP dmats, SEXP R_handle);
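A "clean" external pointer in the sense used here is simply `new("externalptr")`; the serialization tests later in this diff rely on the same idiom. A hedged sketch of the calling pattern (the string-based `.Call` is illustrative only):

handle <- new("externalptr")           # clean pointer, holding nothing yet
class(handle) <- "xgb.Booster.handle"
# .Call("XGBoosterCreateInEmptyObj_R", cached_dmats, handle)  # fills it in place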
/*!
* \brief set parameters
* \param handle handle
@@ -143,7 +166,7 @@ XGB_DLL SEXP XGBoosterBoostOneIter_R(SEXP handle, SEXP dtrain, SEXP grad, SEXP h
XGB_DLL SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evnames);
/*!
* \brief make prediction based on dmat
* \brief (Deprecated) make prediction based on dmat
* \param handle handle
* \param dmat data matrix
* \param option_mask output_margin:1 predict_leaf:2
@@ -152,6 +175,16 @@ XGB_DLL SEXP XGBoosterEvalOneIter_R(SEXP handle, SEXP iter, SEXP dmats, SEXP evn
*/
XGB_DLL SEXP XGBoosterPredict_R(SEXP handle, SEXP dmat, SEXP option_mask,
SEXP ntree_limit, SEXP training);
/*!
* \brief Run prediction on DMatrix, replacing `XGBoosterPredict_R`
* \param handle handle
* \param dmat data matrix
* \param json_config See `XGBoosterPredictFromDMatrix` in xgboost c_api.h
*
* \return A list containing 2 vectors: the first gives the shape and the second the prediction values.
*/
XGB_DLL SEXP XGBoosterPredictFromDMatrix_R(SEXP handle, SEXP dmat, SEXP json_config);
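The JSON configuration mirrors `XGBoosterPredictFromDMatrix` in the C API; a sketch of assembling it from R (field names follow the C API documentation, and the direct `.Call` is illustrative only):

config <- jsonlite::toJSON(
  list(type = 0L,            # 0 = normal prediction
       training = FALSE,
       iteration_begin = 0L,
       iteration_end = 0L,   # 0 = use all trees
       strict_shape = FALSE),
  auto_unbox = TRUE
)
# res <- .Call("XGBoosterPredictFromDMatrix_R", handle, dmat, config)
# res[[1]]: shape; res[[2]]: flat prediction values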
/*!
* \brief load model from existing file
* \param handle handle
@@ -244,4 +277,12 @@ XGB_DLL SEXP XGBoosterSetAttr_R(SEXP handle, SEXP name, SEXP val);
*/
XGB_DLL SEXP XGBoosterGetAttrNames_R(SEXP handle);
/*!
* \brief Get feature scores from the model.
* \param json_config See `XGBoosterFeatureScore` in xgboost c_api.h
* \return A list with the first element as feature names, the second element as the shape of
* feature scores, and the third element as the feature scores.
*/
XGB_DLL SEXP XGBoosterFeatureScore_R(SEXP handle, SEXP json_config);
#endif // XGBOOST_WRAPPER_R_H_ // NOLINT(*)

View File

@@ -13,27 +13,10 @@ void CustomLogMessage::Log(const std::string& msg) {
}
} // namespace dmlc
// implements rabit error handling.
extern "C" {
void XGBoostAssert_R(int exp, const char *fmt, ...);
void XGBoostCheck_R(int exp, const char *fmt, ...);
}
namespace rabit {
namespace utils {
extern "C" {
void (*Printf)(const char *fmt, ...) = Rprintf;
void (*Assert)(int exp, const char *fmt, ...) = XGBoostAssert_R;
void (*Check)(int exp, const char *fmt, ...) = XGBoostCheck_R;
void (*Error)(const char *fmt, ...) = error;
}
}
}
namespace xgboost {
ConsoleLogger::~ConsoleLogger() {
if (cur_verbosity_ == LogVerbosity::kIgnore ||
cur_verbosity_ <= global_verbosity_) {
cur_verbosity_ <= GlobalVerbosity()) {
dmlc::CustomLogMessage::Log(log_stream_.str());
}
}

View File

@@ -1,10 +0,0 @@
model_generator_metadata <- function() {
return (list(
kRounds = 2,
kRows = 1000,
kCols = 4,
kForests = 2,
kMaxDepth = 2,
kClasses = 3
))
}

View File

@@ -2,10 +2,16 @@
# of saved model files from XGBoost version 0.90 and 1.0.x.
library(xgboost)
library(Matrix)
source('./generate_models_params.R')
set.seed(0)
metadata <- model_generator_metadata()
metadata <- list(
kRounds = 2,
kRows = 1000,
kCols = 4,
kForests = 2,
kMaxDepth = 2,
kClasses = 3
)
X <- Matrix(data = rnorm(metadata$kRows * metadata$kCols), nrow = metadata$kRows,
ncol = metadata$kCols, sparse = TRUE)
w <- runif(metadata$kRows)
@@ -46,11 +52,16 @@ generate_logistic_model <- function () {
y <- sample(0:1, size = metadata$kRows, replace = TRUE)
stopifnot(max(y) == 1, min(y) == 0)
data <- xgb.DMatrix(X, label = y, weight = w)
params <- list(tree_method = 'hist', num_parallel_tree = metadata$kForests,
max_depth = metadata$kMaxDepth, objective = 'binary:logistic')
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, 'logit')
objective <- c('binary:logistic', 'binary:logitraw')
name <- c('logit', 'logitraw')
for (i in seq_len(length(objective))) {
data <- xgb.DMatrix(X, label = y, weight = w)
params <- list(tree_method = 'hist', num_parallel_tree = metadata$kForests,
max_depth = metadata$kMaxDepth, objective = objective[i])
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, name[i])
}
}
generate_classification_model <- function () {

View File

@@ -6,21 +6,21 @@ my_linters <- list(
assignment_linter = lintr::assignment_linter,
closed_curly_linter = lintr::closed_curly_linter,
commas_linter = lintr::commas_linter,
# commented_code_linter = lintr::commented_code_linter,
equals_na = lintr::equals_na_linter,
infix_spaces_linter = lintr::infix_spaces_linter,
line_length_linter = lintr::line_length_linter,
no_tab_linter = lintr::no_tab_linter,
object_usage_linter = lintr::object_usage_linter,
# snake_case_linter = lintr::snake_case_linter,
# multiple_dots_linter = lintr::multiple_dots_linter,
object_length_linter = lintr::object_length_linter,
open_curly_linter = lintr::open_curly_linter,
# single_quotes_linter = lintr::single_quotes_linter,
semicolon = lintr::semicolon_terminator_linter,
seq = lintr::seq_linter,
spaces_inside_linter = lintr::spaces_inside_linter,
spaces_left_parentheses_linter = lintr::spaces_left_parentheses_linter,
trailing_blank_lines_linter = lintr::trailing_blank_lines_linter,
trailing_whitespace_linter = lintr::trailing_whitespace_linter,
true_false = lintr::T_and_F_symbol_linter
true_false = lintr::T_and_F_symbol_linter,
unneeded_concatenation = lintr::unneeded_concatenation_linter
)
results <- lapply(

View File

@@ -17,7 +17,8 @@ test_that("train and predict binary classification", {
nrounds <- 2
expect_output(
bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = nrounds, objective = "binary:logistic")
eta = 1, nthread = 2, nrounds = nrounds, objective = "binary:logistic",
eval_metric = "error")
, "train-error")
expect_equal(class(bst), "xgb.Booster")
expect_equal(bst$niter, nrounds)
@@ -33,6 +34,10 @@ test_that("train and predict binary classification", {
err_pred1 <- sum((pred1 > 0.5) != train$label) / length(train$label)
err_log <- bst$evaluation_log[1, train_error]
expect_lt(abs(err_pred1 - err_log), 10e-6)
pred2 <- predict(bst, train$data, iterationrange = c(1, 2))
expect_length(pred1, 6513)
expect_equal(pred1, pred2)
})
test_that("parameter validation works", {
@@ -65,7 +70,7 @@ test_that("parameter validation works", {
xgb.train(params = params, data = dtrain, nrounds = nrounds))
print(output)
}
expect_output(incorrect(), "bar, foo")
expect_output(incorrect(), '\\\\"bar\\\\", \\\\"foo\\\\"')
})
@@ -122,7 +127,7 @@ test_that("train and predict softprob", {
expect_output(
bst <- xgboost(data = as.matrix(iris[, -5]), label = lb,
max_depth = 3, eta = 0.5, nthread = 2, nrounds = 5,
objective = "multi:softprob", num_class = 3)
objective = "multi:softprob", num_class = 3, eval_metric = "merror")
, "train-merror")
expect_false(is.null(bst$evaluation_log))
expect_lt(bst$evaluation_log[, min(train_merror)], 0.025)
@@ -142,6 +147,24 @@ test_that("train and predict softprob", {
pred_labels <- max.col(mpred) - 1
err <- sum(pred_labels != lb) / length(lb)
expect_equal(bst$evaluation_log[1, train_merror], err, tolerance = 5e-6)
mpred1 <- predict(bst, as.matrix(iris[, -5]), reshape = TRUE, iterationrange = c(1, 2))
expect_equal(mpred, mpred1)
d <- cbind(
x1 = rnorm(100),
x2 = rnorm(100),
x3 = rnorm(100)
)
y <- sample.int(10, 100, replace = TRUE) - 1
dtrain <- xgb.DMatrix(data = d, info = list(label = y))
booster <- xgb.train(
params = list(tree_method = "hist"), data = dtrain, nrounds = 4, num_class = 10,
objective = "multi:softprob"
)
predt <- predict(booster, as.matrix(d), reshape = TRUE, strict_shape = FALSE)
expect_equal(ncol(predt), 10)
expect_equal(rowSums(predt), rep(1, 100), tolerance = 1e-7)
})
test_that("train and predict softmax", {
@@ -150,7 +173,7 @@ test_that("train and predict softmax", {
expect_output(
bst <- xgboost(data = as.matrix(iris[, -5]), label = lb,
max_depth = 3, eta = 0.5, nthread = 2, nrounds = 5,
objective = "multi:softmax", num_class = 3)
objective = "multi:softmax", num_class = 3, eval_metric = "merror")
, "train-merror")
expect_false(is.null(bst$evaluation_log))
expect_lt(bst$evaluation_log[, min(train_merror)], 0.025)
@@ -167,7 +190,7 @@ test_that("train and predict RF", {
lb <- train$label
# single iteration
bst <- xgboost(data = train$data, label = lb, max_depth = 5,
nthread = 2, nrounds = 1, objective = "binary:logistic",
nthread = 2, nrounds = 1, objective = "binary:logistic", eval_metric = "error",
num_parallel_tree = 20, subsample = 0.6, colsample_bytree = 0.1)
expect_equal(bst$niter, 1)
expect_equal(xgb.ntree(bst), 20)
@@ -181,10 +204,8 @@ test_that("train and predict RF", {
pred_err_20 <- sum((pred > 0.5) != lb) / length(lb)
expect_equal(pred_err_20, pred_err)
#pred <- predict(bst, train$data, ntreelimit = 1)
#pred_err_1 <- sum((pred > 0.5) != lb)/length(lb)
#expect_lt(pred_err, pred_err_1)
#expect_lt(pred_err, 0.08)
pred1 <- predict(bst, train$data, iterationrange = c(1, 2))
expect_equal(pred, pred1)
})
test_that("train and predict RF with softprob", {
@@ -193,7 +214,8 @@ test_that("train and predict RF with softprob", {
set.seed(11)
bst <- xgboost(data = as.matrix(iris[, -5]), label = lb,
max_depth = 3, eta = 0.9, nthread = 2, nrounds = nrounds,
objective = "multi:softprob", num_class = 3, verbose = 0,
objective = "multi:softprob", eval_metric = "merror",
num_class = 3, verbose = 0,
num_parallel_tree = 4, subsample = 0.5, colsample_bytree = 0.5)
expect_equal(bst$niter, 15)
expect_equal(xgb.ntree(bst), 15 * 3 * 4)
@@ -245,11 +267,12 @@ test_that("training continuation works", {
expect_equal(bst$raw, bst2$raw)
expect_equal(dim(bst2$evaluation_log), c(2, 2))
# test continuing from a model in file
xgb.save(bst1, "xgboost.model")
bst2 <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0, xgb_model = "xgboost.model")
xgb.save(bst1, "xgboost.json")
bst2 <- xgb.train(param, dtrain, nrounds = 2, watchlist, verbose = 0, xgb_model = "xgboost.json")
if (!windows_flag && !solaris_flag)
expect_equal(bst$raw, bst2$raw)
expect_equal(dim(bst2$evaluation_log), c(2, 2))
file.remove("xgboost.json")
})
test_that("model serialization works", {
@@ -273,7 +296,7 @@ test_that("xgb.cv works", {
expect_output(
cv <- xgb.cv(data = train$data, label = train$label, max_depth = 2, nfold = 5,
eta = 1., nthread = 2, nrounds = 2, objective = "binary:logistic",
verbose = TRUE)
eval_metric = "error", verbose = TRUE)
, "train-error:")
expect_is(cv, 'xgb.cv.synchronous')
expect_false(is.null(cv$evaluation_log))
@@ -298,7 +321,7 @@ test_that("xgb.cv works with stratified folds", {
eta = 1., nthread = 2, nrounds = 2, objective = "binary:logistic",
verbose = TRUE, stratified = TRUE)
# Stratified folds should result in a different evaluation logs
expect_true(all(cv$evaluation_log[, test_error_mean] != cv2$evaluation_log[, test_error_mean]))
expect_true(all(cv$evaluation_log[, test_logloss_mean] != cv2$evaluation_log[, test_logloss_mean]))
})
test_that("train and predict with non-strict classes", {
@@ -328,7 +351,7 @@ test_that("train and predict with non-strict classes", {
expect_error(pr <- predict(bst, train_dense), regexp = NA)
expect_equal(pr0, pr)
# when someone inhertis from xgb.Booster, it should still be possible to use it as xgb.Booster
# when someone inherits from xgb.Booster, it should still be possible to use it as xgb.Booster
class(bst) <- c('super.Booster', 'xgb.Booster')
expect_error(pr <- predict(bst, train_dense), regexp = NA)
expect_equal(pr0, pr)
@@ -343,7 +366,7 @@ test_that("max_delta_step works", {
bst1 <- xgb.train(param, dtrain, nrounds, watchlist, verbose = 1)
# model with restricted max_delta_step
bst2 <- xgb.train(param, dtrain, nrounds, watchlist, verbose = 1, max_delta_step = 1)
# the no-restriction model is expected to have consistently lower loss during the initial interations
# the no-restriction model is expected to have consistently lower loss during the initial iterations
expect_true(all(bst1$evaluation_log$train_logloss < bst2$evaluation_log$train_logloss))
expect_lt(mean(bst1$evaluation_log$train_logloss) / mean(bst2$evaluation_log$train_logloss), 0.8)
})
@@ -382,3 +405,57 @@ test_that("Configuration works", {
reloaded_config <- xgb.config(bst)
expect_equal(config, reloaded_config);
})
test_that("strict_shape works", {
n_rounds <- 2
test_strict_shape <- function(bst, X, n_groups) {
predt <- predict(bst, X, strict_shape = TRUE)
margin <- predict(bst, X, outputmargin = TRUE, strict_shape = TRUE)
contri <- predict(bst, X, predcontrib = TRUE, strict_shape = TRUE)
interact <- predict(bst, X, predinteraction = TRUE, strict_shape = TRUE)
leaf <- predict(bst, X, predleaf = TRUE, strict_shape = TRUE)
n_rows <- nrow(X)
n_cols <- ncol(X)
expect_equal(dim(predt), c(n_groups, n_rows))
expect_equal(dim(margin), c(n_groups, n_rows))
expect_equal(dim(contri), c(n_cols + 1, n_groups, n_rows))
expect_equal(dim(interact), c(n_cols + 1, n_cols + 1, n_groups, n_rows))
expect_equal(dim(leaf), c(1, n_groups, n_rounds, n_rows))
if (n_groups != 1) {
for (g in seq_len(n_groups)) {
expect_lt(max(abs(colSums(contri[, g, ]) - margin[g, ])), 1e-5)
}
}
}
test_iris <- function() {
y <- as.numeric(iris$Species) - 1
X <- as.matrix(iris[, -5])
bst <- xgboost(data = X, label = y,
max_depth = 2, nrounds = n_rounds,
objective = "multi:softprob", num_class = 3, eval_metric = "merror")
test_strict_shape(bst, X, 3)
}
test_agaricus <- function() {
data(agaricus.train, package = 'xgboost')
X <- agaricus.train$data
y <- agaricus.train$label
bst <- xgboost(data = X, label = y, max_depth = 2,
nrounds = n_rounds, objective = "binary:logistic",
eval_metric = 'error', eval_metric = 'auc', eval_metric = "logloss")
test_strict_shape(bst, X, 1)
}
test_iris()
test_agaricus()
})
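The dimension checks above pin down the `strict_shape` convention: the C API reports a row-major shape, so after R reads the flat buffer column-major the group axis comes first. A short sketch of recovering the usual rows-by-classes layout under that convention:

predt <- predict(bst, X, strict_shape = TRUE)  # dim(predt) == c(n_groups, n_rows)
probs <- t(predt)                              # rows by classes, the familiar layout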

View File

@@ -2,6 +2,7 @@
require(xgboost)
require(data.table)
require(titanic)
context("callbacks")
@@ -26,7 +27,8 @@ watchlist <- list(train = dtrain, test = dtest)
err <- function(label, pr) sum((pr > 0.5) != label) / length(label)
param <- list(objective = "binary:logistic", max_depth = 2, nthread = 2)
param <- list(objective = "binary:logistic", eval_metric = "error",
max_depth = 2, nthread = 2)
test_that("cb.print.evaluation works as expected", {
@@ -105,7 +107,8 @@ test_that("cb.evaluation.log works as expected", {
})
param <- list(objective = "binary:logistic", max_depth = 4, nthread = 2)
param <- list(objective = "binary:logistic", eval_metric = "error",
max_depth = 4, nthread = 2)
test_that("can store evaluation_log without printing", {
expect_silent(
@@ -173,16 +176,16 @@ test_that("cb.reset.parameters works as expected", {
})
test_that("cb.save.model works as expected", {
files <- c('xgboost_01.model', 'xgboost_02.model', 'xgboost.model')
files <- c('xgboost_01.json', 'xgboost_02.json', 'xgboost.json')
for (f in files) if (file.exists(f)) file.remove(f)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, eta = 1, verbose = 0,
save_period = 1, save_name = "xgboost_%02d.model")
expect_true(file.exists('xgboost_01.model'))
expect_true(file.exists('xgboost_02.model'))
b1 <- xgb.load('xgboost_01.model')
save_period = 1, save_name = "xgboost_%02d.json")
expect_true(file.exists('xgboost_01.json'))
expect_true(file.exists('xgboost_02.json'))
b1 <- xgb.load('xgboost_01.json')
expect_equal(xgb.ntree(b1), 1)
b2 <- xgb.load('xgboost_02.model')
b2 <- xgb.load('xgboost_02.json')
expect_equal(xgb.ntree(b2), 2)
xgb.config(b2) <- xgb.config(bst)
@@ -191,9 +194,9 @@ test_that("cb.save.model works as expected", {
# save_period = 0 saves the last iteration's model
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, eta = 1, verbose = 0,
save_period = 0)
expect_true(file.exists('xgboost.model'))
b2 <- xgb.load('xgboost.model')
save_period = 0, save_name = 'xgboost.json')
expect_true(file.exists('xgboost.json'))
b2 <- xgb.load('xgboost.json')
xgb.config(b2) <- xgb.config(bst)
expect_equal(bst$raw, b2$raw)
@@ -236,7 +239,7 @@ test_that("early stopping xgb.train works", {
test_that("early stopping using a specific metric works", {
set.seed(11)
expect_output(
bst <- xgb.train(param, dtrain, nrounds = 20, watchlist, eta = 0.6,
bst <- xgb.train(param[-2], dtrain, nrounds = 20, watchlist, eta = 0.6,
eval_metric = "logloss", eval_metric = "auc",
callbacks = list(cb.early.stop(stopping_rounds = 3, maximize = FALSE,
metric_name = 'test_logloss')))
@@ -252,6 +255,26 @@ test_that("early stopping using a specific metric works", {
expect_equal(logloss_log, logloss_pred, tolerance = 1e-5)
})
test_that("early stopping works with titanic", {
# This test was inspired by https://github.com/dmlc/xgboost/issues/5935
# It catches possible issues on noLD R
titanic <- titanic::titanic_train
titanic$Pclass <- as.factor(titanic$Pclass)
dtx <- model.matrix(~ 0 + ., data = titanic[, c("Pclass", "Sex")])
dty <- titanic$Survived
xgboost::xgboost(
data = dtx,
label = dty,
objective = "binary:logistic",
eval_metric = "auc",
nrounds = 100,
early_stopping_rounds = 3
)
expect_true(TRUE) # should not crash
})
test_that("early stopping xgb.cv works", {
set.seed(11)
expect_output(

View File

@@ -0,0 +1,21 @@
context('Test global configuration')
test_that('Global configuration works with verbosity', {
old_verbosity <- xgb.get.config()$verbosity
for (v in c(0, 1, 2, 3)) {
xgb.set.config(verbosity = v)
expect_equal(xgb.get.config()$verbosity, v)
}
xgb.set.config(verbosity = old_verbosity)
expect_equal(xgb.get.config()$verbosity, old_verbosity)
})
test_that('Global configuration works with use_rmm flag', {
old_use_rmm_flag <- xgb.get.config()$use_rmm
for (v in c(TRUE, FALSE)) {
xgb.set.config(use_rmm = v)
expect_equal(xgb.get.config()$use_rmm, v)
}
xgb.set.config(use_rmm = old_use_rmm_flag)
expect_equal(xgb.get.config()$use_rmm, old_use_rmm_flag)
})

View File

@@ -47,7 +47,7 @@ test_that("custom objective with early stop works", {
bst <- xgb.train(param, dtrain, 10, watchlist)
expect_equal(class(bst), "xgb.Booster")
train_log <- bst$evaluation_log$train_error
expect_true(all(diff(train_log)) <= 0)
expect_true(all(diff(train_log) <= 0))
})
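The parenthesis fix above matters: `all(diff(train_log))` collapses the numeric vector to a single logical before the comparison, so the old assertion could pass even when the error increases. For instance:

train_log <- c(0.5, 0.5, 0.6)  # a plateau, then the error goes up
all(diff(train_log)) <= 0      # TRUE: the zero diff makes all() FALSE, and FALSE <= 0 holds
all(diff(train_log) <= 0)      # FALSE: the increase is correctly caught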
test_that("custom objective using DMatrix attr works", {

View File

@@ -64,8 +64,8 @@ test_that("xgb.DMatrix: getinfo & setinfo", {
expect_true(setinfo(dtest, 'group', c(50, 50)))
expect_error(setinfo(dtest, 'group', test_label))
# providing character values will give a warning
expect_warning(setinfo(dtest, 'weight', rep('a', nrow(test_data))))
# providing character values will give an error
expect_error(setinfo(dtest, 'weight', rep('a', nrow(test_data))))
# any other label should error
expect_error(setinfo(dtest, 'asdf', test_label))
@@ -99,7 +99,7 @@ test_that("xgb.DMatrix: colnames", {
dtest <- xgb.DMatrix(test_data, label = test_label)
expect_equal(colnames(dtest), colnames(test_data))
expect_error(colnames(dtest) <- 'asdf')
new_names <- make.names(1:ncol(test_data))
new_names <- make.names(seq_len(ncol(test_data)))
expect_silent(colnames(dtest) <- new_names)
expect_equal(colnames(dtest), new_names)
expect_silent(colnames(dtest) <- NULL)
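The `seq_len()` substitution above (enforced by the `seq` linter newly enabled in the lintr config earlier in this diff) guards against the empty-range pitfall of `1:n`:

n <- 0
1:n          # c(1, 0): counts downwards, so a loop body would run twice
seq_len(n)   # integer(0): the empty case is skipped correctly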

View File

@@ -9,7 +9,8 @@ test_that("train and prediction when gctorture is on", {
test <- agaricus.test
gctorture(TRUE)
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
pred <- predict(bst, test$data)
gctorture(FALSE)
expect_length(pred, length(test$label))
})

View File

@@ -8,7 +8,7 @@ test_that("gblinear works", {
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
param <- list(objective = "binary:logistic", booster = "gblinear",
param <- list(objective = "binary:logistic", eval_metric = "error", booster = "gblinear",
nthread = 2, eta = 0.8, alpha = 0.0001, lambda = 0.0001)
watchlist <- list(eval = dtest, train = dtrain)

View File

@@ -110,7 +110,7 @@ test_that("predict feature contributions works", {
pred <- predict(bst.GLM, sparse_matrix, outputmargin = TRUE)
expect_lt(max(abs(rowSums(pred_contr) - pred)), 1e-5)
# manual calculation of linear terms
coefs <- xgb.dump(bst.GLM)[-c(1, 2, 4)] %>% as.numeric
coefs <- as.numeric(xgb.dump(bst.GLM)[-c(1, 2, 4)])
coefs <- c(coefs[-1], coefs[1]) # intercept must be the last
pred_contr_manual <- sweep(cbind(sparse_matrix, 1), 2, coefs, FUN = "*")
expect_equal(as.numeric(pred_contr), as.numeric(pred_contr_manual),
@@ -130,7 +130,11 @@ test_that("predict feature contributions works", {
pred <- predict(mbst.GLM, as.matrix(iris[, -5]), outputmargin = TRUE, reshape = TRUE)
pred_contr <- predict(mbst.GLM, as.matrix(iris[, -5]), predcontrib = TRUE)
expect_length(pred_contr, 3)
coefs_all <- xgb.dump(mbst.GLM)[-c(1, 2, 6)] %>% as.numeric %>% matrix(ncol = 3, byrow = TRUE)
coefs_all <- matrix(
data = as.numeric(xgb.dump(mbst.GLM)[-c(1, 2, 6)]),
ncol = 3,
byrow = TRUE
)
for (g in seq_along(pred_contr)) {
expect_equal(colnames(pred_contr[[g]]), c(colnames(iris[, -5]), "BIAS"))
expect_lt(max(abs(rowSums(pred_contr[[g]]) - pred[, g])), float_tolerance)
@@ -174,7 +178,7 @@ test_that("SHAPs sum to predictions, with or without DART", {
expect_equal(rowSums(shap), pred, tol = tol)
expect_equal(apply(shapi, 1, sum), pred, tol = tol)
for (i in 1 : nrow(d))
for (i in seq_len(nrow(d)))
for (f in list(rowSums, colSums))
expect_equal(f(shapi[i, , ]), shap[i, ], tol = tol)
}
@@ -238,12 +242,13 @@ if (grepl('Windows', Sys.info()[['sysname']]) ||
test_that("xgb.Booster serializing as R object works", {
saveRDS(bst.Tree, 'xgb.model.rds')
bst <- readRDS('xgb.model.rds')
if (file.exists('xgb.model.rds')) file.remove('xgb.model.rds')
dtrain <- xgb.DMatrix(sparse_matrix, label = label)
expect_equal(predict(bst.Tree, dtrain), predict(bst, dtrain), tolerance = float_tolerance)
expect_equal(xgb.dump(bst.Tree), xgb.dump(bst))
xgb.save(bst, 'xgb.model')
if (file.exists('xgb.model')) file.remove('xgb.model')
bst <- readRDS('xgb.model.rds')
if (file.exists('xgb.model.rds')) file.remove('xgb.model.rds')
nil_ptr <- new("externalptr")
class(nil_ptr) <- "xgb.Booster.handle"
expect_true(identical(bst$handle, nil_ptr))
@@ -335,8 +340,8 @@ test_that("xgb.model.dt.tree and xgb.importance work with a single split model",
})
test_that("xgb.plot.tree works with and without feature names", {
xgb.plot.tree(feature_names = feature.names, model = bst.Tree)
xgb.plot.tree(model = bst.Tree)
expect_silent(xgb.plot.tree(feature_names = feature.names, model = bst.Tree))
expect_silent(xgb.plot.tree(model = bst.Tree))
})
test_that("xgb.plot.multi.trees works with and without feature names", {
@@ -351,11 +356,47 @@ test_that("xgb.plot.deepness works", {
xgb.ggplot.deepness(model = bst.Tree)
})
test_that("xgb.shap.data works when top_n is provided", {
data_list <- xgb.shap.data(data = sparse_matrix, model = bst.Tree, top_n = 2)
expect_equal(names(data_list), c("data", "shap_contrib"))
expect_equal(NCOL(data_list$data), 2)
expect_equal(NCOL(data_list$shap_contrib), 2)
expect_equal(NROW(data_list$data), NROW(data_list$shap_contrib))
expect_gt(length(colnames(data_list$data)), 0)
expect_gt(length(colnames(data_list$shap_contrib)), 0)
# for multiclass without target class provided
data_list <- xgb.shap.data(data = as.matrix(iris[, -5]), model = mbst.Tree, top_n = 2)
expect_equal(dim(data_list$shap_contrib), c(nrow(iris), 2))
# for multiclass with target class provided
data_list <- xgb.shap.data(data = as.matrix(iris[, -5]), model = mbst.Tree, top_n = 2, target_class = 0)
expect_equal(dim(data_list$shap_contrib), c(nrow(iris), 2))
})
test_that("xgb.shap.data works with subsampling", {
data_list <- xgb.shap.data(data = sparse_matrix, model = bst.Tree, top_n = 2, subsample = 0.8)
expect_equal(NROW(data_list$data), as.integer(0.8 * nrow(sparse_matrix)))
expect_equal(NROW(data_list$data), NROW(data_list$shap_contrib))
})
test_that("prepare.ggplot.shap.data works", {
data_list <- xgb.shap.data(data = sparse_matrix, model = bst.Tree, top_n = 2)
plot_data <- prepare.ggplot.shap.data(data_list, normalize = TRUE)
expect_s3_class(plot_data, "data.frame")
expect_equal(names(plot_data), c("id", "feature", "feature_value", "shap_value"))
expect_s3_class(plot_data$feature, "factor")
# Each observation should have 1 row for each feature
expect_equal(nrow(plot_data), nrow(sparse_matrix) * 2)
})
test_that("xgb.plot.shap works", {
sh <- xgb.plot.shap(data = sparse_matrix, model = bst.Tree, top_n = 2, col = 4)
expect_equal(names(sh), c("data", "shap_contrib"))
expect_equal(NCOL(sh$data), 2)
expect_equal(NCOL(sh$shap_contrib), 2)
})
test_that("xgb.plot.shap.summary works", {
expect_silent(xgb.plot.shap.summary(data = sparse_matrix, model = bst.Tree, top_n = 2))
expect_silent(xgb.ggplot.shap.summary(data = sparse_matrix, model = bst.Tree, top_n = 2))
})
test_that("check.deprecation works", {
@@ -374,3 +415,26 @@ test_that("check.deprecation works", {
, "\'dumm\' was partially matched to \'dummy\'")
expect_equal(res, list(a = 1, DUMMY = 22))
})
test_that('convert.labels works', {
y <- c(0, 1, 0, 0, 1)
for (objective in c('binary:logistic', 'binary:logitraw', 'binary:hinge')) {
res <- xgboost:::convert.labels(y, objective_name = objective)
expect_s3_class(res, 'factor')
expect_equal(res, factor(res))
}
y <- c(0, 1, 3, 2, 1, 4)
for (objective in c('multi:softmax', 'multi:softprob', 'rank:pairwise', 'rank:ndcg',
'rank:map')) {
res <- xgboost:::convert.labels(y, objective_name = objective)
expect_s3_class(res, 'factor')
expect_equal(res, factor(res))
}
y <- c(1.2, 3.0, -1.0, 10.0)
for (objective in c('reg:squarederror', 'reg:squaredlogerror', 'reg:logistic',
'reg:pseudohubererror', 'count:poisson', 'survival:cox', 'survival:aft',
'reg:gamma', 'reg:tweedie')) {
res <- xgboost:::convert.labels(y, objective_name = objective)
expect_equal(class(res), 'numeric')
}
})

View File

@@ -1,7 +1,6 @@
context('Test prediction of feature interactions')
require(xgboost)
require(magrittr)
set.seed(123)
@@ -32,7 +31,7 @@ test_that("predict feature interactions works", {
cont <- predict(b, dm, predcontrib = TRUE)
expect_equal(dim(cont), c(N, P + 1))
# make sure for each row they add up to marginal predictions
max(abs(rowSums(cont) - pred)) %>% expect_lt(0.001)
expect_lt(max(abs(rowSums(cont) - pred)), 0.001)
# Hand-construct the 'ground truth' feature contributions:
gt_cont <- cbind(
2. * X[, 1],
@@ -52,21 +51,24 @@ test_that("predict feature interactions works", {
expect_equal(dimnames(intr), list(NULL, cn, cn))
# check the symmetry
max(abs(aperm(intr, c(1, 3, 2)) - intr)) %>% expect_lt(0.00001)
expect_lt(max(abs(aperm(intr, c(1, 3, 2)) - intr)), 0.00001)
# sums WRT columns must be close to feature contributions
max(abs(apply(intr, c(1, 2), sum) - cont)) %>% expect_lt(0.00001)
expect_lt(max(abs(apply(intr, c(1, 2), sum) - cont)), 0.00001)
# diagonal terms for features 3,4,5 must be close to zero
Reduce(max, sapply(3:P, function(i) max(abs(intr[, i, i])))) %>% expect_lt(0.05)
expect_lt(Reduce(max, sapply(3:P, function(i) max(abs(intr[, i, i])))), 0.05)
# BIAS must have no interactions
max(abs(intr[, 1:P, P + 1])) %>% expect_lt(0.00001)
expect_lt(max(abs(intr[, 1:P, P + 1])), 0.00001)
# interactions other than 2 x 3 must be close to zero
intr23 <- intr
intr23[, 2, 3] <- 0
Reduce(max, sapply(1:P, function(i) max(abs(intr23[, i, (i + 1):(P + 1)])))) %>% expect_lt(0.05)
expect_lt(
Reduce(max, sapply(1:P, function(i) max(abs(intr23[, i, (i + 1):(P + 1)])))),
0.05
)
# Construct the 'ground truth' contributions of interactions directly from the linear terms:
gt_intr <- array(0, c(N, P + 1, P + 1))
@@ -119,23 +121,39 @@ test_that("multiclass feature interactions work", {
dm <- xgb.DMatrix(as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1)
param <- list(eta = 0.1, max_depth = 4, objective = 'multi:softprob', num_class = 3)
b <- xgb.train(param, dm, 40)
pred <- predict(b, dm, outputmargin = TRUE) %>% array(c(3, 150)) %>% t
pred <- t(
array(
data = predict(b, dm, outputmargin = TRUE),
dim = c(3, 150)
)
)
# SHAP contributions:
cont <- predict(b, dm, predcontrib = TRUE)
expect_length(cont, 3)
# rewrap them as a 3d array
cont <- unlist(cont) %>% array(c(150, 5, 3))
cont <- array(
data = unlist(cont),
dim = c(150, 5, 3)
)
# make sure for each row they add up to marginal predictions
max(abs(apply(cont, c(1, 3), sum) - pred)) %>% expect_lt(0.001)
expect_lt(max(abs(apply(cont, c(1, 3), sum) - pred)), 0.001)
# SHAP interaction contributions:
intr <- predict(b, dm, predinteraction = TRUE)
expect_length(intr, 3)
# rewrap them as a 4d array
intr <- unlist(intr) %>% array(c(150, 5, 5, 3)) %>% aperm(c(4, 1, 2, 3)) # [grp, row, col, col]
intr <- aperm(
a = array(
data = unlist(intr),
dim = c(150, 5, 5, 3)
),
perm = c(4, 1, 2, 3) # [grp, row, col, col]
)
# check the symmetry
max(abs(aperm(intr, c(1, 2, 4, 3)) - intr)) %>% expect_lt(0.00001)
expect_lt(max(abs(aperm(intr, c(1, 2, 4, 3)) - intr)), 0.00001)
# sums WRT columns must be close to feature contributions
max(abs(apply(intr, c(1, 2, 3), sum) - aperm(cont, c(3, 1, 2)))) %>% expect_lt(0.00001)
expect_lt(max(abs(apply(intr, c(1, 2, 3), sum) - aperm(cont, c(3, 1, 2)))), 0.00001)
})

View File

@@ -1,10 +1,16 @@
require(xgboost)
require(jsonlite)
source('../generate_models_params.R')
context("Models from previous versions of XGBoost can be loaded")
metadata <- model_generator_metadata()
metadata <- list(
kRounds = 2,
kRows = 1000,
kCols = 4,
kForests = 2,
kMaxDepth = 2,
kClasses = 3
)
run_model_param_check <- function (config) {
testthat::expect_equal(config$learner$learner_model_param$num_feature, '4')
@@ -33,6 +39,10 @@ run_booster_check <- function (booster, name) {
testthat::expect_equal(config$learner$learner_train_param$objective, 'multi:softmax')
testthat::expect_equal(as.numeric(config$learner$learner_model_param$num_class),
metadata$kClasses)
} else if (name == 'logitraw') {
testthat::expect_equal(get_num_tree(booster), metadata$kForests * metadata$kRounds)
testthat::expect_equal(as.numeric(config$learner$learner_model_param$num_class), 0)
testthat::expect_equal(config$learner$learner_train_param$objective, 'binary:logitraw')
} else if (name == 'logit') {
testthat::expect_equal(get_num_tree(booster), metadata$kForests * metadata$kRounds)
testthat::expect_equal(as.numeric(config$learner$learner_model_param$num_class), 0)
@@ -55,7 +65,7 @@ test_that("Models from previous versions of XGBoost can be loaded", {
zipfile <- file.path(getwd(), file_name)
model_dir <- file.path(getwd(), 'models')
download.file(paste('https://', bucket, '.s3-', region, '.amazonaws.com/', file_name, sep = ''),
destfile = zipfile, mode = 'wb')
destfile = zipfile, mode = 'wb', quiet = TRUE)
unzip(zipfile, overwrite = TRUE)
pred_data <- xgb.DMatrix(matrix(c(0, 0, 0, 0), nrow = 1, ncol = 4))
@@ -66,13 +76,35 @@ test_that("Models from previous versions of XGBoost can be loaded", {
m <- regmatches(model_file, m)[[1]]
model_xgb_ver <- m[2]
name <- m[3]
is_rds <- endsWith(model_file, '.rds')
if (endsWith(model_file, '.rds')) {
booster <- readRDS(model_file)
} else {
booster <- xgb.load(model_file)
cpp_warning <- capture.output({
# Expect an R warning when a model is loaded from RDS and it was generated by version < 1.1.x
if (is_rds && compareVersion(model_xgb_ver, '1.1.1.1') < 0) {
booster <- readRDS(model_file)
expect_warning(predict(booster, newdata = pred_data))
booster <- readRDS(model_file)
expect_warning(run_booster_check(booster, name))
} else {
if (is_rds) {
booster <- readRDS(model_file)
} else {
booster <- xgb.load(model_file)
}
predict(booster, newdata = pred_data)
run_booster_check(booster, name)
}
})
if (compareVersion(model_xgb_ver, '1.0.0.0') < 0) {
# Expect a C++ warning when a model was generated in version < 1.0.x
m <- grepl(paste0('.*Loading model from XGBoost < 1\\.0\\.0, consider saving it again for ',
'improved compatibility.*'), cpp_warning, perl = TRUE)
expect_true(length(m) > 0 && all(m))
} else if (is_rds && model_xgb_ver == '1.1.1.1') {
# Expect a C++ warning when a model is loaded from RDS and it was generated by version 1.1.x
m <- grepl(paste0('.*Attempted to load internal configuration for a model file that was ',
'generated by a previous version of XGBoost.*'), cpp_warning, perl = TRUE)
expect_true(length(m) > 0 && all(m))
}
predict(booster, newdata = pred_data)
run_booster_check(booster, name)
})
})

View File

@@ -19,5 +19,5 @@ test_that("monotone constraints for regression", {
pred.ord <- pred[ind]
expect_true({
!any(diff(pred.ord) > 0)
}, "Monotone Contraint Satisfied")
}, "Monotone constraint satisfied")
})

View File

@@ -1,9 +1,9 @@
context('Test poisson regression model')
context('Test Poisson regression model')
require(xgboost)
set.seed(1994)
test_that("poisson regression works", {
test_that("Poisson regression works", {
data(mtcars)
bst <- xgboost(data = as.matrix(mtcars[, -11]), label = mtcars[, 11],
objective = 'count:poisson', nrounds = 10, verbose = 0)

View File

@@ -1,5 +1,5 @@
---
title: "Understand your dataset with Xgboost"
title: "Understand your dataset with XGBoost"
output:
rmarkdown::html_vignette:
css: vignette.css
@@ -18,9 +18,9 @@ Understand your dataset with XGBoost
Introduction
------------
The purpose of this vignette is to show you how to use **Xgboost** to discover and understand your own dataset better.
The purpose of this vignette is to show you how to use **XGBoost** to discover and understand your own dataset better.
This vignette is not about predicting anything (see [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **Xgboost** to highlight the *link* between the *features* of your data and the *outcome*.
This vignette is not about predicting anything (see [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)). We will explain how to use **XGBoost** to highlight the *link* between the *features* of your data and the *outcome*.
Package loading:
@@ -39,7 +39,7 @@ Preparation of the dataset
### Numeric v.s. categorical variables
**Xgboost** manages only `numeric` vectors.
**XGBoost** manages only `numeric` vectors.
What to do when you have *categorical* data?
@@ -57,7 +57,7 @@ To answer the question above we will convert *categorical* variables to `numeric
In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot).
The method we are going to see is usually called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).
The first step is to load `Arthritis` dataset in memory and wrap it with `data.table` package.
@@ -66,7 +66,7 @@ data(Arthritis)
df <- data.table(Arthritis, keep.rownames = FALSE)
```
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **Xgboost** **R** package use `data.table`.
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **XGBoost** **R** package use `data.table`.
The first thing we want to do is to have a look at the first few lines of the `data.table`:
@@ -137,8 +137,8 @@ levels(df[,Treatment])
#### Encoding categorical features
Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
Several encoding methods exist, e.g., [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
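A compact way to do this in R is `Matrix::sparse.model.matrix()` with the intercept dropped; the formula below is illustrative of the vignette's approach rather than a quote of its code:

sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
head(sparse_matrix)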
@@ -166,7 +166,7 @@ output_vector = df[,Improved] == "Marked"
Build the model
---------------
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [Xgboost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or at the vignette [XGBoost presentation](https://github.com/dmlc/xgboost/blob/master/R-package/vignettes/xgboostPresentation.Rmd)).
```{r}
bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
@@ -176,7 +176,7 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.
A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future).
A small value for training error may be a symptom of [overfitting](https://en.wikipedia.org/wiki/Overfitting), meaning the model will not accurately predict the future values.
> Here you can see the numbers decrease until line 7 and then increase.
>
@@ -304,19 +304,19 @@ Linear model may not be that smart in this scenario.
Special Note: What about Random Forests™?
-----------------------------------------
As you may know, [Random Forests](http://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.
As you may know, [Random Forests](https://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) family.
Both train several decision trees for one dataset. The *main* difference is that in Random Forests, trees are independent, while in boosting, tree `N+1` focuses its learning on the loss (<=> what has not been well modeled by tree `N`).
This difference has an impact on a corner case in feature importance analysis: the *correlated features*.
Imagine two features perfectly correlated, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forests).
However, in Random Forests this random choice will be made for each tree, because each tree is independent of the others. Therefore, approximately, and depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the *importance* of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. You won't easily know that this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...
In boosting, when a specific link between feature and outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature has an important role in the link between the observations and the label. It is still up to you to search for the features correlated with the one detected as important if you need to know all of them.
If you want to try Random Forests algorithm, you can tweak Xgboost parameters!
If you want to try Random Forests algorithm, you can tweak XGBoost parameters!
For instance, to compute a model with 1000 trees, with a 0.5 factor on sampling rows and columns:
@@ -326,7 +326,7 @@ data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
#Random Forest - 1000 trees
bst <- xgboost(data = train$data, label = train$label, max_depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nrounds = 1, objective = "binary:logistic")
#Boosting - 3 rounds
@@ -335,4 +335,4 @@ bst <- xgboost(data = train$data, label = train$label, max_depth = 4, nrounds =
> Note that the parameter `round` is set to `1`.
> [**Random Forests**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.

View File

@@ -24,7 +24,7 @@
author = "K. Bache and M. Lichman",
year = "2013",
title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml",
url = "http://archive.ics.uci.edu/ml/",
institution = "University of California, Irvine, School of Information and Computer Sciences"
}

View File

@@ -1,5 +1,5 @@
---
title: "Xgboost presentation"
title: "XGBoost presentation"
output:
rmarkdown::html_vignette:
css: vignette.css
@@ -8,7 +8,7 @@ output:
bibliography: xgboost.bib
author: Tianqi Chen, Tong He, Michaël Benesty
vignette: >
%\VignetteIndexEntry{Xgboost presentation}
%\VignetteIndexEntry{XGBoost presentation}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
@@ -19,9 +19,9 @@ XGBoost R Tutorial
## Introduction
**Xgboost** is short for e**X**treme **G**radient **Boost**ing package.
**XGBoost** is short for e**X**treme **G**radient **Boost**ing package.
The purpose of this Vignette is to show you how to use **Xgboost** to build a model and make predictions.
The purpose of this Vignette is to show you how to use **XGBoost** to build a model and make predictions.
It is an efficient and scalable implementation of gradient boosting framework by @friedman2000additive and @friedman2001greedy. Two solvers are included:
@@ -46,10 +46,10 @@ It has several features:
## Installation
### Github version
### GitHub version
For weekly updated version (highly recommended), install from *Github*:
For weekly updated version (highly recommended), install from *GitHub*:
```{r installGithub, eval=FALSE}
install.packages("drat", repos="https://cran.rstudio.com")
@@ -68,7 +68,7 @@ The version 0.4-2 is on CRAN, and you can install it by:
install.packages("xgboost")
```
Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost)
Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost/)
## Learning
@@ -82,7 +82,7 @@ require(xgboost)
### Dataset presentation
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, example data are the the same as you will use on in your every day life :-).
In this example, we are aiming to predict whether a mushroom can be eaten or not (like in many tutorials, the example data are the same as those you will use in your everyday life :-).
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
@@ -148,7 +148,7 @@ We will train decision tree model using the following parameters:
* `objective = "binary:logistic"`: we will train a binary classification model ;
* `max_depth = 2`: the trees won't be deep, because our case is very simple ;
* `nthread = 2`: the number of cpu threads we are going to use;
* `nthread = 2`: the number of CPU threads we are going to use;
* `nrounds = 2`: there will be two passes on the data, the second one will enhance the model by further reducing the difference between ground truth and prediction.
```{r trainingSparse, message=F, warning=F}
@@ -180,7 +180,7 @@ bstDMatrix <- xgboost(data = dtrain, max_depth = 2, eta = 1, nthread = 2, nround
**XGBoost** has several features to help you view how the learning progresses internally. The purpose is to help you set the best parameters, which is the key to your model's quality.
One of the simplest way to see the training progress is to set the `verbose` option (see below for more advanced technics).
One of the simplest ways to see the training progress is to set the `verbose` option (see below for more advanced techniques).
```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
@@ -253,7 +253,7 @@ The most important thing to remember is that **to do a classification, you just
*Multiclass* classification works in a similar way.
This metric is **`r round(err, 2)`** and is pretty low: our yummly mushroom model works well!
This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!
## Advanced features

View File

@@ -16,7 +16,7 @@ XGBoost from JSON
## Introduction
The purpose of this Vignette is to show you how to correctly load and work with an **Xgboost** model that has been dumped to JSON. **Xgboost** internally converts all data to [32-bit floats](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat:
The purpose of this Vignette is to show you how to correctly load and work with an **XGBoost** model that has been dumped to JSON. **XGBoost** internally converts all data to [32-bit floats](https://en.wikipedia.org/wiki/Single-precision_floating-point_format), and the values dumped to JSON are decimal representations of these values. When working with a model that has been parsed from a JSON file, care must be taken to correctly treat:
- the input data, which should be converted to 32-bit floats
- any 32-bit floats that were stored in JSON as decimal representations
@@ -172,9 +172,9 @@ bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),
bst_preds == bst_from_json_preds
```
None are exactly equal again. What is going on here? Well, since we are using the value `1` in the calcuations, we have introduced a double into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponentiation operator `exp` is also used. On the other hand, xgboost uses the 32-bit version of the exponentation operator in its [sigmoid function](https://github.com/dmlc/xgboost/blob/54980b8959680a0da06a3fc0ec776e47c8cbb0a1/src/common/math.h#L25-L27).
None are exactly equal again. What is going on here? Well, since we are using the value `1` in the calculations, we have introduced a double into the calculation. Because of this, all float values are promoted to 64-bit doubles and the 64-bit version of the exponentiation operator `exp` is also used. On the other hand, xgboost uses the 32-bit version of the exponentiation operator in its [sigmoid function](https://github.com/dmlc/xgboost/blob/54980b8959680a0da06a3fc0ec776e47c8cbb0a1/src/common/math.h#L25-L27).
How do we fix this? We have to ensure we use the correct datatypes everywhere and the correct operators. If we use only floats, the float library that we have loaded will ensure the 32-bit float exponention operator is applied.
How do we fix this? We have to ensure we use the correct data types everywhere and the correct operators. If we use only floats, the float library that we have loaded will ensure the 32-bit float exponentiation operator is applied.
```{r}
# calculate the predictions casting doubles to floats
bst_from_json_preds <- ifelse(fl(data$dates)<fl(node$split_condition),

View File

@@ -1,13 +1,15 @@
<img src=https://raw.githubusercontent.com/dmlc/dmlc.github.io/master/img/logo-m/xgboost.png width=135/> eXtreme Gradient Boosting
===========
[![Build Status](https://xgboost-ci.net/job/xgboost/job/master/badge/icon?style=plastic)](https://xgboost-ci.net/blue/organizations/jenkins/xgboost/activity)
[![Build Status](https://xgboost-ci.net/job/xgboost/job/master/badge/icon)](https://xgboost-ci.net/blue/organizations/jenkins/xgboost/activity)
[![Build Status](https://img.shields.io/travis/dmlc/xgboost.svg?label=build&logo=travis&branch=master)](https://travis-ci.org/dmlc/xgboost)
[![Build Status](https://ci.appveyor.com/api/projects/status/5ypa8vaed6kpmli8?svg=true)](https://ci.appveyor.com/project/tqchen/xgboost)
[![XGBoost-CI](https://github.com/dmlc/xgboost/workflows/XGBoost-CI/badge.svg?branch=master)](https://github.com/dmlc/xgboost/actions)
[![Documentation Status](https://readthedocs.org/projects/xgboost/badge/?version=latest)](https://xgboost.readthedocs.org)
[![GitHub license](http://dmlc.github.io/img/apache2.svg)](./LICENSE)
[![CRAN Status Badge](http://www.r-pkg.org/badges/version/xgboost)](http://cran.r-project.org/web/packages/xgboost)
[![PyPI version](https://badge.fury.io/py/xgboost.svg)](https://pypi.python.org/pypi/xgboost/)
[![Conda version](https://img.shields.io/conda/vn/conda-forge/py-xgboost.svg)](https://anaconda.org/conda-forge/py-xgboost)
[![Optuna](https://img.shields.io/badge/Optuna-integrated-blue)](https://optuna.org)
[![Twitter](https://img.shields.io/badge/@XGBoostProject--_.svg?style=social&logo=twitter)](https://twitter.com/XGBoostProject)
[Community](https://xgboost.ai/community) |
[Documentation](https://xgboost.readthedocs.org) |
@@ -22,7 +24,7 @@ The same code runs on major distributed environment (Kubernetes, Hadoop, SGE, MP
License
-------
© Contributors, 2019. Licensed under an [Apache-2](https://github.com/dmlc/xgboost/blob/master/LICENSE) license.
© Contributors, 2021. Licensed under an [Apache-2](https://github.com/dmlc/xgboost/blob/master/LICENSE) license.
Contribute to XGBoost
---------------------

View File

@@ -12,5 +12,3 @@
#include "../dmlc-core/src/data.cc"
#include "../dmlc-core/src/io.cc"
#include "../dmlc-core/src/recordio.cc"

View File

@@ -14,6 +14,7 @@
#include "../src/metric/elementwise_metric.cc"
#include "../src/metric/multiclass_metric.cc"
#include "../src/metric/rank_metric.cc"
#include "../src/metric/auc.cc"
#include "../src/metric/survival_metric.cc"
// objectives
@@ -36,19 +37,18 @@
#include "../src/data/simple_dmatrix.cc"
#include "../src/data/sparse_page_raw_format.cc"
#include "../src/data/ellpack_page.cc"
#include "../src/data/ellpack_page_source.cc"
#include "../src/data/gradient_index.cc"
#include "../src/data/gradient_index_page_source.cc"
#include "../src/data/gradient_index_format.cc"
#include "../src/data/sparse_page_dmatrix.cc"
#include "../src/data/proxy_dmatrix.cc"
// prediction
#include "../src/predictor/predictor.cc"
#include "../src/predictor/cpu_predictor.cc"
#if DMLC_ENABLE_STD_THREAD
#include "../src/data/sparse_page_dmatrix.cc"
#endif
// trees
#include "../src/tree/param.cc"
#include "../src/tree/split_evaluator.cc"
#include "../src/tree/tree_model.cc"
#include "../src/tree/tree_updater.cc"
#include "../src/tree/updater_colmaker.cc"
@@ -57,7 +57,6 @@
#include "../src/tree/updater_refresh.cc"
#include "../src/tree/updater_sync.cc"
#include "../src/tree/updater_histmaker.cc"
#include "../src/tree/updater_skmaker.cc"
#include "../src/tree/constraints.cc"
// linear
@@ -68,9 +67,12 @@
// global
#include "../src/learner.cc"
#include "../src/logging.cc"
#include "../src/global_config.cc"
#include "../src/common/common.cc"
#include "../src/common/random.cc"
#include "../src/common/charconv.cc"
#include "../src/common/timer.cc"
#include "../src/common/quantile.cc"
#include "../src/common/host_device_vector.cc"
#include "../src/common/hist_util.cc"
#include "../src/common/json.cc"

View File

@@ -1,71 +0,0 @@
environment:
matrix:
- target: msvc
ver: 2015
generator: "Visual Studio 14 2015 Win64"
configuration: Debug
- target: msvc
ver: 2015
generator: "Visual Studio 14 2015 Win64"
configuration: Release
- target: mingw
generator: "Unix Makefiles"
#matrix:
# fast_finish: true
platform:
- x64
install:
- git submodule update --init --recursive
# MinGW
- set PATH=C:\msys64\mingw64\bin;C:\msys64\usr\bin;%PATH%
- gcc -v
- ls -l C:\
# Miniconda3
- call C:\Miniconda3-x64\Scripts\activate.bat
- conda info
- where python
- python --version
# do python build for mingw and one of the msvc jobs
- set DO_PYTHON=off
- if /i "%target%" == "mingw" set DO_PYTHON=on
- if /i "%target%_%ver%_%configuration%" == "msvc_2015_Release" set DO_PYTHON=on
- if /i "%DO_PYTHON%" == "on" (
conda config --set always_yes true &&
conda update -q conda &&
conda install -y numpy scipy pandas matplotlib pytest scikit-learn graphviz python-graphviz hypothesis
)
- set PATH=C:\Miniconda3-x64\Library\bin\graphviz;%PATH%
build_script:
- cd %APPVEYOR_BUILD_FOLDER%
- if /i "%target%" == "msvc" (
mkdir build_msvc%ver% &&
cd build_msvc%ver% &&
cmake .. -G"%generator%" -DCMAKE_CONFIGURATION_TYPES="Release;Debug;" &&
msbuild xgboost.sln
)
- if /i "%target%" == "mingw" (
mkdir build_mingw &&
cd build_mingw &&
cmake .. -G"%generator%" &&
make -j2
)
# Python package
- if /i "%DO_PYTHON%" == "on" (
cd %APPVEYOR_BUILD_FOLDER%\python-package &&
python setup.py install &&
mkdir wheel &&
python setup.py bdist_wheel --universal --plat-name win-amd64 -d wheel
)
test_script:
- cd %APPVEYOR_BUILD_FOLDER%
- if /i "%DO_PYTHON%" == "on" python -m pytest tests/python
artifacts:
# binary Python wheel package
- path: '**\*.whl'
name: Bits

View File

@@ -1 +1 @@
@xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@-SNAPSHOT
@xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@

Some files were not shown because too many files have changed in this diff.