Compare commits

307 Commits

Author SHA1 Message Date
Philip Hyunsu Cho
bcb15a980f 1.2.1 patch release (#6206)
* Hide C++ symbols from dmlc-core (#6188)

* Up version to 1.2.1

* Fix lint

* [CI] Fix Docker build for CUDA 11 (#6202)

* Update Dockerfile.gpu
2020-10-12 15:10:16 -07:00
Tong He
0cd0dad0b5 Fix CRAN submission (#6076) 2020-09-01 23:38:27 -07:00
Philip Hyunsu Cho
884098ec22 [CI] Fix CRAN check (#6067) 2020-08-28 21:24:49 +08:00
Hyunsu Cho
738786680b Release 1.2.0 2020-08-22 18:25:18 -07:00
Philip Hyunsu Cho
04232c01b2 [CI] Fix broken tests (#6048) 2020-08-22 11:43:38 -07:00
Jiaming Yuan
0353a78ab7 Fix scikit learn cls doc. (#6041) 2020-08-20 19:25:12 -07:00
Hyunsu Cho
0089a0e6bf Fix another typo 2020-08-12 19:29:08 +00:00
Philip Hyunsu Cho
03a68a1714 Fix typo 2020-08-12 01:34:33 -07:00
Hyunsu Cho
a0da8a7e0a Make RC2 2020-08-12 00:50:51 -07:00
Hyunsu Cho
eee4eff49b [CI] Build GPU-enabled JAR artifact and deploy to xgboost-maven-repo 2020-08-12 00:50:47 -07:00
Jiaming Yuan
936a854baa Back port fixes to 1.2 (#6002)
* Fix sklearn doc. (#5980)

* Enforce tree order in JSON. (#5974)

* Make JSON model IO more future proof by using tree id in model loading.

* Fix dask predict shape infer. (#5989)

* [Breaking] Fix .predict() method and add .predict_proba() in xgboost.dask.DaskXGBClassifier (#5986)
2020-08-11 20:22:31 +08:00
Hyunsu Cho
7856da5827 [CI] Use mgpu machine to run gpu hist unit tests 2020-08-02 02:33:05 -07:00
Hyunsu Cho
50a0def6c3 Make RC1 2020-08-02 08:56:20 +00:00
Hyunsu Cho
9116a0ec10 Fix a unit test on CLI, to handle RC versions 2020-08-02 08:56:15 +00:00
Shaochen Shi
71197d1dfa [jvm-packages] Fix wrong method name setAllowZeroForMissingValue. (#5740)
* Allow non-zero for missing value when training.

* Fix wrong method names.

* Add a unit test

* Move the getter/setter unit test to MissingValueHandlingSuite

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-01 17:16:42 -07:00
Philip Hyunsu Cho
5a2dcd1c33 [R] Provide better guidance for persisting XGBoost model (#5964)
* [R] Provide better guidance for persisting XGBoost model

* Update saving_model.rst

* Add a paragraph about xgb.serialize()
2020-07-31 20:00:26 -07:00
Philip Hyunsu Cho
bf2990e773 Add missing Pytest marks to AsyncIO unit test (#5968) 2020-08-01 10:56:24 +08:00
Philip Hyunsu Cho
5f3c811e84 [CI] Assign larger /dev/shm to NCCL (#5966)
* [CI] Assign larger /dev/shm to NCCL

* Use 10.2 artifact to run multi-GPU Python tests

* Add CUDA 10.0 -> 11.0 cross-version test; remove CUDA 10.0 target
2020-07-31 10:05:04 -07:00
Philip Hyunsu Cho
3fcfaad577 Add CMake flag to log C API invocations, to aid debugging (#5925)
* Add CMake flag to log C API invocations, to aid debugging

* Remove unnecessary parentheses
2020-07-30 19:24:28 -07:00
James Bourbeau
3b88bc948f Update XGBoost + Dask overview documentation (#5961)
* Add imports to code snippet

* Better writing.
2020-07-31 09:58:50 +08:00
Jiaming Yuan
70903c872f Force colored output for ninja build. (#5959) 2020-07-30 20:48:03 +08:00
boxdot
d268a2a463 Thread-safe prediction by making the prediction cache thread-local. (#5853)
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2020-07-30 12:33:50 +08:00
Jiaming Yuan
fa3715f584 [Dask] Asyncio support. (#5862) 2020-07-30 06:23:58 +08:00
Jiaming Yuan
e4a273e1da Fix evaluate root split. (#5948) 2020-07-29 19:33:29 +08:00
Philip Hyunsu Cho
071e10c1d1 [CI] Fix broken Docker container 'cpu' (#5956) 2020-07-29 04:29:57 -07:00
Jiaming Yuan
f5fdcbe194 Disable feature validation on sklearn predict prob. (#5953)
* Fix issue when scikit learn interface receives transformed inputs.
2020-07-29 19:26:44 +08:00
Jiaming Yuan
18349a7ccf [Breaking] Fix custom metric for multi output. (#5954)
* Set output margin to true for custom metric. This fixes only R and Python.
2020-07-29 19:25:27 +08:00
Jiaming Yuan
75b8c22b0b Fix prediction heuristic (#5955)
* Relax check for prediction.
* Relax test in spark test.
* Add tests in C++.
2020-07-29 19:24:07 +08:00
Philip Hyunsu Cho
5879acde9a [CI] Improve R linter script (#5944)
* [CI] Move lint to a separate script

* [CI] Improved lintr launcher

* Add lintr as a separate action

* Add custom parsing logic to print out logs

* Fix lintr issues in demos

* Run R demos

* Fix CRAN checks

* Install XGBoost into R env before running lintr

* Install devtools (needed to run demos)
2020-07-27 00:55:35 -07:00
Bobby Wang
8943eb4314 [BLOCKING] [jvm-packages] add gpu_hist and enable gpu scheduling (#5171)
* [jvm-packages] add gpu_hist tree method

* change updater hist to grow_quantile_histmaker

* add gpu scheduling

* pass correct parameters to xgboost library

* remove debug info

* add use.cuda for pom

* add CI for gpu_hist for jvm

* add gpu unit tests

* use gpu node to build jvm

* use nvidia-docker

* Add CLI interface to create_jni.py using argparse

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-26 21:53:24 -07:00
Philip Hyunsu Cho
6347fa1c2e [R] Enable weighted learning to rank (#5945)
* [R] enable weighted learning to rank

* Add R unit test for ranking

* Fix lint
2020-07-26 21:10:36 -07:00
Philip Hyunsu Cho
ace7fd328b [R] Add a compatibility layer to load Booster object from an old RDS file (#5940)
* [R] Add a compatibility layer to load Booster from an old RDS
* Modify QuantileHistMaker::LoadConfig() to be backward compatible with 1.1.x
* Add a big warning about compatibility in QuantileHistMaker::LoadConfig()
* Add testing suite
* Discourage use of saveRDS() in CRAN doc
2020-07-26 00:06:49 -07:00
Jiaming Yuan
40361043ae [BLOCKING] Remove to_string. (#5934) 2020-07-26 10:21:26 +08:00
Philip Hyunsu Cho
12110c900e [CI] Make Python model compatibility test runnable locally (#5941) 2020-07-25 16:58:02 -07:00
Philip Hyunsu Cho
487ab0ce73 [BLOCKING] Handle empty rows in data iterators correctly (#5929)
* [jvm-packages] Handle empty rows in data iterators correctly

* Fix clang-tidy error

* last empty row

* Add comments [skip ci]

Co-authored-by: Nan Zhu <nanzhu@uber.com>
2020-07-25 13:46:19 -07:00
Jiaming Yuan
a4de2f68e4 Use cudaOccupancyMaxPotentialBlockSize to calculate the block size. (#5926) 2020-07-23 14:24:42 +08:00
Jiaming Yuan
fbfbd525d8 Cache dependencies on Github Action. (#5928) 2020-07-23 14:00:19 +08:00
Philip Hyunsu Cho
4af857f95d Add explicit template specialization for portability (#5921)
* Add explicit template specializations

* Adding Specialization for FileAdapterBatch
2020-07-22 12:31:17 -07:00
Jiaming Yuan
bc1d3ee230 Fix r early stop with custom objective. (#5923)
* Specify `ntreelimit`.
2020-07-23 03:28:17 +08:00
Jiaming Yuan
30363d9c35 Remove R and JVM from appveyor. (#5922) 2020-07-23 03:26:48 +08:00
Jiaming Yuan
66cc1e02aa Setup github action. (#5917) 2020-07-22 15:05:25 +08:00
Philip Hyunsu Cho
627cf41a60 Add option to enable all compiler warnings in GCC/Clang (#5897)
* Add option to enable all compiler warnings in GCC/Clang

* Fix -Wall for CUDA sources

* Make -Wall private req for xgboost-r
2020-07-21 23:34:03 -07:00
Jiaming Yuan
9b688aca3b Fix mingw build with R. (#5918) 2020-07-22 02:56:49 +08:00
Philip Hyunsu Cho
8d7702766a [Doc] Document new objectives and metrics available on GPUs (#5909) 2020-07-21 02:10:59 -07:00
Jiaming Yuan
03fb98fbde Fix typo in CI. [skip ci] (#5919) 2020-07-21 14:25:27 +08:00
Jiaming Yuan
8b1afce316 Add Github Action for R. (#5911)
* Fix lintr errors.
2020-07-20 19:23:36 +08:00
Andy Adinets
b3d2e7644a Support building XGBoost with CUDA 11 (#5808)
* Change serialization test.
* Add CUDA 11 tests on Linux CI.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-20 07:58:41 +08:00
Philip Hyunsu Cho
ac9136ee49 Further improvements and savings in Jenkins pipeline (#5904)
* Publish artifacts only on the master and release branches

* Build CUDA only for Compute Capability 7.5 when building PRs

* Run all Windows jobs in a single worker image

* Build nightly XGBoost4J SNAPSHOT JARs with Scala 2.12 only

* Show skipped Python tests on Windows

* Make Graphviz optional for Python tests

* Add back C++ tests

* Unstash xgboost_cpp_tests

* Fix label to CUDA 10.1

* Install cuPy for CUDA 10.1

* Install jsonschema

* Address reviewer's feedback
2020-07-18 03:30:40 -07:00
Jiaming Yuan
6c0c87216f Fix Windows 2016 build. (#5902) 2020-07-18 05:50:17 +08:00
Philip Hyunsu Cho
71b0528a2f GPU implementation of AFT survival objective and metric (#5714)
* Add interval accuracy

* De-virtualize AFT functions

* Lint

* Refactor AFT metric using GPU-CPU reducer

* Fix R build

* Fix build on Windows

* Fix copyright header

* Clang-tidy

* Fix crashing demo

* Fix typos in comment; explain GPU ID

* Remove unnecessary #include

* Add C++ test for interval accuracy

* Fix a bug in accuracy metric: use log pred

* Refactor AFT objective using GPU-CPU Transform

* Lint

* Fix lint

* Use Ninja to speed up build

* Use time, not /usr/bin/time

* Add cpu_build worker class, with concurrency = 1

* Use concurrency = 1 only for CUDA build

* concurrency = 1 for clang-tidy

* Address reviewer's feedback

* Update link to AFT paper
2020-07-17 01:18:13 -07:00
Jiaming Yuan
7c2686146e Dask device dmatrix (#5901)
* Fix softprob with empty dmatrix.
2020-07-17 13:17:43 +08:00
Jiaming Yuan
e471056ec4 Fix sketch size calculation. (#5898) 2020-07-17 08:33:16 +08:00
Bobby Wang
730866a7bc [CI] update spark version to 3.0.0 (#5890)
* [CI] update spark version to 3.0.0

* Update Dockerfile.jvm_cross

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-16 00:23:44 -07:00
Jiaming Yuan
029a8b533f Simplify the data backends. (#5893) 2020-07-16 15:17:31 +08:00
Philip Hyunsu Cho
7aee0e51ed Fix R package build with CMake 3.13 (#5895)
* Fix R package build with CMake 3.13

* Require OpenMP for xgboost-r target
2020-07-15 20:22:11 -07:00
Philip Hyunsu Cho
3c40f4a7f5 [CI] Reduce load on Windows CI pipeline (#5892) 2020-07-14 18:47:05 -07:00
Jiaming Yuan
3cae287dea Fix NDK Build. (#5886)
* Explicit cast for slice.
2020-07-14 18:34:19 +08:00
Alexander Gugel
970b4b3fa2 Add XGBoosterGetNumFeature (#5856)
- add GetNumFeature to Learner
- add XGBoosterGetNumFeature to C API
- update c-api-demo accordingly
2020-07-13 23:25:17 -07:00
Philip Hyunsu Cho
e0c179c7cc [CI] Enforce daily budget in Jenkins CI (#5884)
* [CI] Throttle Jenkins CI

* Don't use Jenkins master instance
2020-07-13 21:51:11 -07:00
Jiaming Yuan
dd445af56e Cleanup on device sketch. (#5874)
* Remove old functions.

* Merge weighted and un-weighted into a common interface.
2020-07-14 10:15:54 +08:00
Bobby Wang
9f85e92602 [jvm-packages] update spark dependency to 3.0.0 (#5836) 2020-07-12 20:58:30 -07:00
Philip Hyunsu Cho
23e2c6ec91 Upgrade Rabit (#5876) 2020-07-09 16:18:33 -07:00
Zhang Zhang
1813804e36 Add new parameter singlePrecisionHistogram to xgboost4j-spark (#5811)
Expose the existing 'singlePrecisionHistogram' param to the Spark layer.
2020-07-08 16:29:35 -07:00
Philip Hyunsu Cho
0d411b0397 [CI] Simplify CMake build with modern CMake techniques (#5871)
* [CI] Simplify CMake build

* Make sure that plugins can be built

* [CI] Install lz4 on Mac
2020-07-08 04:23:24 -07:00
Philip Hyunsu Cho
22a31b1faa [Doc] Document that CUDA 10.0 is required [skip ci] (#5872) 2020-07-07 18:55:19 -07:00
Rong Ou
06320729d4 fix device sketch with weights in external memory mode (#5870) 2020-07-08 08:44:07 +08:00
Jiaming Yuan
d0a29c3135 Remove print. (#5867) 2020-07-08 04:12:14 +08:00
Jiaming Yuan
a3ec964346 Accept iterator in device dmatrix. (#5783)
* Remove Device DMatrix.
2020-07-07 21:44:48 +08:00
Jiaming Yuan
048d969be4 Implement GK sketching on GPU. (#5846)
* Implement GK sketching on GPU.
* Strong tests on quantile building.
* Handle sparse dataset by binary searching the column index.
* Hypothesis test on dask.
2020-07-07 12:16:21 +08:00
Andy Adinets
ac3f0e78dc Split Features into Groups to Compute Histograms in Shared Memory (#5795) 2020-07-07 15:04:35 +12:00
Jiaming Yuan
93c44a9a64 Move feature names and types of DMatrix from Python to C++. (#5858)
* Add thread local return entry for DMatrix.
* Save feature name and feature type in binary file.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-07 09:40:13 +08:00
Jiaming Yuan
4b0852ee41 Use dmlc stream when URI protocol is not local file. (#5857) 2020-07-07 03:07:12 +08:00
Alexander Gugel
0f17e35bce Add c-api-demo to .gitignore (#5855) 2020-07-05 04:35:22 +08:00
Philip Hyunsu Cho
efe3e48ae2 Ensure that LoadSequentialFile() actually read the whole file (#5831) 2020-07-04 16:17:11 +08:00
Jiaming Yuan
1a0801238e Implement iterative DMatrix. (#5837) 2020-07-03 11:44:52 +08:00
Jiaming Yuan
4d277d750d Relax linear test. (#5849)
* Increased error in coordinate is mostly due to floating point error.
* Shotgun uses Hogwild!, which is non-deterministic and can have even greater floating point error.
2020-07-03 07:49:53 +08:00
Jiaming Yuan
eb067c1c34 Relax test for shotgun. (#5835) 2020-07-01 19:20:29 +08:00
Jiaming Yuan
90a9c68874 Implement a DMatrix Proxy. (#5803) 2020-06-29 15:03:10 +08:00
Jiaming Yuan
47c89775d6 Accept string for ArrayInterface constructor. (#5799) 2020-06-27 00:06:54 +08:00
Yuan Tang
95f11ed27e Rename Ant Financial to Ant Group (#5827) 2020-06-25 15:25:36 -04:00
Jiaming Yuan
8234091368 Remove unweighted GK quantile. (#5816) 2020-06-23 14:27:46 +08:00
Philip Hyunsu Cho
dcff96ed27 [Doc] Fix rendering of Markdown docs, e.g. R doc (#5821) 2020-06-21 23:49:22 -07:00
Jiaming Yuan
8104f10328 Update document for model dump. (#5818)
* Clarify the relationship between dump and save.
* Mention the schema.
2020-06-22 14:33:54 +08:00
Jiaming Yuan
26143ad0b1 Update rabit. (#5680) 2020-06-22 14:32:43 +08:00
Jiaming Yuan
c4d721200a Implement extend method for meta info. (#5800)
* Implement extend for host device vector.
2020-06-20 03:32:03 +08:00
Philip Hyunsu Cho
a6d9a06b7b [CI] Fix cuDF install; merge 'gpu' and 'cudf' test suite (#5814) 2020-06-19 16:42:57 +08:00
Philip Hyunsu Cho
a67bc64819 Add an option to run brute-force test for JSON round-trip (#5804)
* Add an option to run brute-force test for JSON round-trip

* Apply reviewer's feedback

* Remove unneeded objects

* Parallel run.

* Max.

* Use signed 64-bit loop var, to support MSVC

* Add exhaustive test to CI

* Run JSON test in Win build worker

* Revert "Run JSON test in Win build worker"

This reverts commit c97b2c7dda37b3585b445d36961605b79552ca89.

* Revert "Add exhaustive test to CI"

This reverts commit c149c2ce9971a07a7289f9b9bc247818afd5a667.

Co-authored-by: fis <jm.yuan@outlook.com>
2020-06-17 23:46:02 -07:00
Rory Mitchell
abdf894fcf Add cupy to Windows CI (#5797)
* Add cupy to Windows CI

* Update Jenkinsfile-win64

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update Jenkinsfile-win64

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

* Update tests/python-gpu/test_gpu_prediction.py

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-06-17 21:55:09 -07:00
Jiaming Yuan
38ee514787 Implement fast number serialization routines. (#5772)
* Implement ryu algorithm.
* Implement integer printing.
* Full coverage roundtrip test.
2020-06-17 12:39:23 +08:00
fis
7c3a168ffd Revert "Accept string for ArrayInterface constructor."
This reverts commit e8ecafb8dc.
2020-06-16 20:02:35 +08:00
fis
e8ecafb8dc Accept string for ArrayInterface constructor. 2020-06-16 20:00:24 +08:00
Rory Mitchell
b47b5ac771 Use hypothesis (#5759)
* Use hypothesis

* Allow int64 array interface for groups

* Add packages to Windows CI

* Add to travis

* Make sure device index is set correctly

* Fix dask-cudf test

* appveyor
2020-06-16 12:45:59 +12:00
Ram Rachum
02884b08aa Fix exception causes all over the codebase (#5787) 2020-06-15 21:06:07 +08:00
Alex
ae18a094b0 Add new skl model attribute for number of features (#5780) 2020-06-15 18:01:59 +08:00
James Lamb
d39da42e69 [R] Remove dependency on gendef for Visual Studio builds (fixes #5608) (#5764)
* [R-package] Remove dependency on gendef for Visual Studio builds (fixes #5608)

* clarify docs

* removed debugging print statement

* Make R CMake install more robust

* Fix doc format; add ToC

* Update build.rst

* Fix AppVeyor

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-06-15 00:20:44 +00:00
Jiaming Yuan
529b5c2cfd [DOC] Mention dask blog post in doc. [skip ci] (#5789) 2020-06-14 13:00:19 +08:00
anttisaukko
1bcbe1fc14 Bump com.esotericsoftware to 4.0.2 (#5690)
Co-authored-by: Antti Saukko <antti.saukko@verizonmedia.com>
2020-06-13 21:06:14 -07:00
Jiaming Yuan
1fa84b61c1 Implement Empty method for host device vector. (#5781)
* Fix accessing nullptr.
2020-06-13 19:02:26 +08:00
Jiaming Yuan
306e38ff31 Avoid including c_api.h in header files. (#5782) 2020-06-12 16:24:24 +08:00
Jiaming Yuan
3028fa6b42 Implement weighted sketching for adapter. (#5760)
* Bounded memory tests.
* Fixed memory estimation.
2020-06-12 06:20:39 +08:00
James Lamb
c35be9dc40 [R] replace uses of T and F with TRUE and FALSE (#5778)
* [R-package] replace uses of T and F with TRUE and FALSE

* enable linting

* Remove skip

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-06-11 06:08:02 -04:00
Elliot Hershberg
cb7f7e542c Added conda environment file for building docs (#5773) 2020-06-11 16:51:24 +08:00
James Lamb
c96e1ef283 [python-package] remove unused imports (#5776) 2020-06-11 16:50:27 +08:00
Philip Hyunsu Cho
1d22a9be1c Revert "Reorder includes. (#5749)" (#5771)
This reverts commit d3a0efbf16.
2020-06-09 10:29:28 -07:00
Philip Hyunsu Cho
d087a12b04 Add release note for 1.1.0 in NEWS.md (#5763)
* Add release note for 1.1.0 in NEWS.md

* Address reviewer's feedback
2020-06-08 14:16:10 -07:00
Philip Hyunsu Cho
b5ab009c19 Document addition of new committer @SmirnovEgorRu (#5762) 2020-06-07 22:57:49 -07:00
Jiaming Yuan
cacff9232a Remove column major specialization. (#5755)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-06-05 16:19:14 +08:00
Jiaming Yuan
bd9d57f579 Add helper for generating batches of data. (#5756)
* Add helper for generating batches of data.

* VC keyword clash.

* Another clash.
2020-06-05 09:53:56 +08:00
Rory Mitchell
359023c0fa Speed up python test (#5752)
* Speed up tests

* Prevent DeviceQuantileDMatrix initialisation with numpy

* Use joblib.memory

* Use RandomState
2020-06-05 11:39:24 +12:00
Jiaming Yuan
cfc23c6a6b Remove max.depth in R gblinear example. (#5753) 2020-06-04 02:59:22 +08:00
Jiaming Yuan
d3a0efbf16 Reorder includes. (#5749)
* Reorder includes.

* R.
2020-06-03 17:30:47 +12:00
ShvetsKS
cd3d14ad0e Add float32 histogram (#5624)
* new single_precision_histogram param was added.

Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
Co-authored-by: fis <jm.yuan@outlook.com>
2020-06-03 11:24:53 +08:00
Jiaming Yuan
e49607af19 Add Python binding for rabit ops. (#5743) 2020-06-02 19:47:23 +08:00
Jiaming Yuan
e533908922 Expose device sketching in header. (#5747) 2020-06-02 13:02:53 +08:00
Peter Jung
0be0e6fd88 Add pkgconfig to cmake (#5744)
* Add pkgconfig to cmake

* Move xgboost.pc.in to cmake/

Co-authored-by: Peter Jung <peter.jung@heureka.cz>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-06-01 18:22:33 -07:00
Philip Hyunsu Cho
b77e3e3fcc [CI] Remove CUDA 9.0 from CI (#5745) 2020-06-01 18:15:45 -07:00
Jiaming Yuan
325156c7a9 Bump version in header. (#5742) 2020-06-01 18:21:18 +08:00
Jiaming Yuan
d19cec70f1 Don't use mask in array interface. (#5730) 2020-06-01 12:17:24 +08:00
Peter Jung
267c1ed784 Add swift package reference (#5728)
Co-authored-by: Peter Jung <peter.jung@heureka.cz>
2020-06-01 15:29:23 +12:00
Philip Hyunsu Cho
073b625bde Bump version to 1.2.0 snapshot in master (#5733) 2020-05-31 00:11:34 -07:00
Jiaming Yuan
9e1b29944e Fix loading old model. (#5724)
* Add test.
2020-05-31 14:55:32 +08:00
ShvetsKS
057c762ecd Fix release degradation (#5720)
* fix release degradation, related to 5666

* less resizes

Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-05-31 04:37:54 +03:00
Peter Jung
251dc8a663 Allow pass fmap to importance plot (#5719)
Co-authored-by: Peter Jung <peter.jung@heureka.cz>
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-05-29 19:55:35 +08:00
Rory Mitchell
f779980f7e gpu_hist performance tweaks (#5707)
* Remove device vectors

* Remove allreduce synchronize

* Remove double buffer
2020-05-29 16:48:53 +12:00
Philip Hyunsu Cho
ca0d605b34 [Doc] Fix typos in AFT tutorial (#5716) 2020-05-28 14:04:34 -07:00
Jiaming Yuan
35e2205256 [dask] Return GPU Series when input is from cuDF. (#5710)
* Refactor predict function.
2020-05-28 17:51:20 +08:00
Philip Hyunsu Cho
91c646392d Require Python 3.6+; drop Python 3.5 from CI (#5715) 2020-05-27 16:19:30 -07:00
Philip Hyunsu Cho
fdbb6ae856 Require CUDA 10.0+ in CMake build (#5718) 2020-05-27 16:18:18 -07:00
Jiaming Yuan
75a0025a3d [CI] Remove CUDA 9.0 from Windows CI. (#5674)
* Remove CUDA 9.0 on Windows CI.

* Require cuda10 tag, to differentiate

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-05-27 12:23:36 -07:00
Dmitry Mottl
78b4e95f25 Changed build.rst (binary wheels are supported for macOS also) (#5711) 2020-05-27 07:18:45 -07:00
Philip Hyunsu Cho
e3aa7f1441 Define _CRT_SECURE_NO_WARNINGS to remove unneeded warnings in MSVC (#5434) 2020-05-25 22:46:07 -07:00
Jiaming Yuan
f145241593 Let XGBoostError inherit ValueError. (#5696) 2020-05-26 08:34:56 +08:00
Jiaming Yuan
8438c7d0e4 Fix IsDense. (#5702) 2020-05-26 08:24:37 +08:00
Philip Hyunsu Cho
e35ad8a074 [R] Fix duplicated libomp.dylib error on Mac OSX (#5701) 2020-05-24 23:37:33 -07:00
Jiaming Yuan
1ba24a7597 Remove redundant sketching. (#5700) 2020-05-24 08:47:20 +08:00
James Lamb
f656ef2fed [R-package] Reduce duplication in configure.ac (#5693)
* updated configure
2020-05-22 12:15:22 +08:00
Jiaming Yuan
5af8161a1a Implement Python data handler. (#5689)
* Define data handlers for DMatrix.
* Throw ValueError in scikit learn interface.
2020-05-22 11:53:55 +08:00
Andy Adinets
646def51e0 C++14 for xgboost (#5664) 2020-05-21 12:26:40 +12:00
Lorenz Walthert
60511a3222 Document more objective parameters in R package (#5682) 2020-05-20 14:00:55 +08:00
ShvetsKS
dd01e4ba8d Distributed optimizations for 'hist' method with CPUs (#5557)
Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-05-20 06:03:03 +03:00
Rong Ou
e21a608552 add pointers to the gpu external memory paper (#5684) 2020-05-19 19:46:16 -07:00
Jiaming Yuan
7903286961 Remove silent from R demos. (#5675)
* Remove silent from R demos.

* Vignettes.
2020-05-19 18:20:46 +08:00
Jiaming Yuan
dd9aeb60ae [JVM Packages] Catch dmlc error by ref. (#5678) 2020-05-19 13:00:12 +08:00
LionOrCatThatIsTheQuestion
83981a9ce3 Pseudo-huber loss metric added (#5647)
- Add pseudo huber loss objective.
- Add pseudo huber loss metric.

Co-authored-by: Reetz <s02reetz@iavgroup.local>
2020-05-18 21:08:07 +08:00
Jiaming Yuan
535479e69f Add JSON schema to model dump. (#5660) 2020-05-15 10:18:43 +08:00
Jiaming Yuan
2c1a439869 Update Python demos with tests. (#5651)
* Remove GPU memory usage demo.
* Add tests for demos.
* Remove `silent`.
* Remove shebang as it's not portable.
2020-05-12 12:04:42 +08:00
Oleksandr Kuvshynov
4e64e2ef8e skip missing lookup if nothing is missing in CPU hist partition kernel. (#5644)
* [xgboost] skip missing lookup if nothing is missing
2020-05-12 05:50:08 +03:00
Jiaming Yuan
9ad40901a8 Upgrade to CUDA 10.0 (#5649) (#5652)
Co-authored-by: fis <jm.yuan@outlook.com>

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-05-11 22:27:36 +08:00
Rory Mitchell
fcf57823b6 Reduce device synchronisation (#5631)
* Reduce device synchronisation

* Initialise pinned memory
2020-05-07 21:19:46 +12:00
Rory Mitchell
9910265064 Resolve vector<bool>::iterator crash (#5642) 2020-05-07 21:18:01 +12:00
Jiaming Yuan
21ed1f0c6d Support 64bit seed. (#5643) 2020-05-07 14:52:38 +08:00
Jiaming Yuan
eaf2a00b5c Enhance nvtx support. (#5636) 2020-05-06 22:54:24 +08:00
Jiaming Yuan
67d267f9da Move device dmatrix construction code into ellpack. (#5623) 2020-05-06 19:43:59 +08:00
Jiaming Yuan
33e052b1e5 Remove dead code. (#5635) 2020-05-06 17:03:48 +08:00
Philip Hyunsu Cho
8de7f1928e Fix build on big endian CPUs (#5617)
* Fix build on big endian CPUs

* Clang-tidy
2020-04-29 21:56:34 -07:00
Rory Mitchell
b9649e7b8e Refactor gpu_hist split evaluation (#5610)
* Refactor

* Rewrite evaluate splits

* Add more tests
2020-04-30 08:58:12 +12:00
Yuan Tang
dfcdfabf1f Move dask tutorial closer to other distributed tutorials (#5613) 2020-04-28 02:24:00 +08:00
Jiaming Yuan
c90457f489 Refactor the CLI. (#5574)
* Enable parameter validation.
* Enable JSON.
* Catch `dmlc::Error`.
* Show help message.
2020-04-26 10:56:33 +08:00
Jiaming Yuan
7d93932423 Better message when no GPU is found. (#5594) 2020-04-26 10:00:57 +08:00
Jason E. Aten, Ph.D
8dfe7b3686 Clarify meaning of training parameter in XGBoosterPredict() (#5604)
Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2020-04-25 16:48:42 -07:00
Philip Hyunsu Cho
4fd95272c8 Instruct Mac users to install libomp (#5606) 2020-04-25 15:50:30 -07:00
Philip Hyunsu Cho
474cfddf91 [R] Address warnings to comply with CRAN submission policy (#5600)
* [R] Address warnings to comply with CRAN submission policy

* Include <xgboost/logging.h>
2020-04-25 13:34:36 -07:00
Philip Hyunsu Cho
a23de1c108 [CI] Grant public read access to Mac OSX wheels (#5602) 2020-04-25 11:51:26 -07:00
Philip Hyunsu Cho
f68155de6c Fix compilation on Mac OSX High Sierra (10.13) (#5597)
* Fix compilation on Mac OSX High Sierra

* [CI] Build Mac OSX binary wheel using Travis CI
2020-04-25 10:53:03 -07:00
Jiaming Yuan
e726dd9902 Set device in device dmatrix. (#5596) 2020-04-25 13:42:53 +08:00
Philip Hyunsu Cho
ef26bc45bf Hide C++ symbols in libxgboost.so when building Python wheel (#5590)
* Hide C++ symbols in libxgboost.so when building Python wheel

* Update Jenkinsfile

* Add test

* Upgrade rabit

* Add setup.py option.

Co-authored-by: fis <jm.yuan@outlook.com>
2020-04-24 13:32:05 -07:00
Rory Mitchell
660be66207 Avoid rabit calls in learner configuration (#5581) 2020-04-24 14:59:29 +12:00
Philip Hyunsu Cho
92913aaf7f [CI] Use Vault repository to re-gain access to devtoolset-4 (#5589)
* [CI] Use Vault repository to re-gain access to devtoolset-4

* Use manylinux2010 tag

* Update Dockerfile.jvm

* Fix rename_whl.py

* Upgrade Pip, to handle manylinux2010 tag

* Update insert_vcomp140.py

* Update test_python.sh
2020-04-23 18:53:54 -07:00
Philip Hyunsu Cho
e4f5b6c84f Port R compatibility patches from 1.0.0 release branch (#5577)
* Don't use memset to set struct when compiling for R

* Support 32-bit Solaris target for R package
2020-04-21 22:51:18 -07:00
Jiaming Yuan
f27b6f9ba6 Update document. (#5572) 2020-04-22 02:37:37 +08:00
Jiaming Yuan
c355ab65ed Enable parameter validation for R. (#5569)
* Enable parameter validation for R.

* Add test.
2020-04-21 11:19:09 -07:00
Jiaming Yuan
564b22cee5 Restore attributes in complete. (#5573) 2020-04-21 11:06:55 -07:00
Rory Mitchell
a734f52807 Use cudaDeviceGetAttribute instead of cudaGetDeviceProperties (#5570) 2020-04-21 14:58:29 +12:00
Andy Adinets
73142041b9 For histograms, opting into maximum shared memory available per block. (#5491) 2020-04-21 14:56:42 +12:00
Jiaming Yuan
9c1103e06c [Breaking] Set output margin to True for custom objective. (#5564)
* Set output margin to True for custom objective in Python and R.

* Add a demo for writing multi-class custom objective function.

* Run tests on selected demos.
2020-04-20 20:44:12 +08:00
Jiaming Yuan
fcbedcedf8 Fix configuration I load model. (#5562) 2020-04-20 17:25:11 +08:00
Jiaming Yuan
29a4cfe400 Group aware GPU sketching. (#5551)
* Group aware GPU weighted sketching.

* Distribute group weights to each data point.
* Relax the test.
* Validate input meta info.
* Fix metainfo copy ctor.
2020-04-20 17:18:52 +08:00
Liang-Chi Hsieh
397d8f0ee7 [jvm-packages] XGBoost Spark should deal with NaN when parsing evaluation output (#5546) 2020-04-19 23:10:30 -07:00
Jiaming Yuan
b809f5d8b8 Don't set seed on CLI interface. (#5563) 2020-04-20 12:17:03 +08:00
Jiaming Yuan
ccd30e4491 Fix non-openmp build. (#5566)
* Add test to Jenkins.
* Fix threading utils tests.
* Require thread library.
2020-04-20 12:16:38 +08:00
Rory Mitchell
b2827a80e1 Use non-synchronising scan (#5560) 2020-04-20 15:51:34 +12:00
Rory Mitchell
d6d1035950 gpu_hist performance fixes (#5558)
* Remove unnecessary cuda API calls

* Fix histogram memory growth
2020-04-19 12:21:13 +12:00
Jiaming Yuan
e1f22baf8c Fix slice and get info. (#5552) 2020-04-18 18:00:13 +08:00
Jiaming Yuan
c245eb8755 Fix r interaction constraints (#5543)
* Unify the parsing code.

* Cleanup.
2020-04-18 06:53:51 +08:00
Jiaming Yuan
93df871c8c Assert matching length of evaluation inputs. (#5540) 2020-04-18 06:52:55 +08:00
Jiaming Yuan
c69a19e2b1 Fix skl nan tag. (#5538) 2020-04-18 06:52:17 +08:00
Jiaming Yuan
cfee9fae91 Don't use uint for threads. (#5542) 2020-04-17 09:45:42 +08:00
Jiaming Yuan
bb29ce2818 Add missing aft parameters. [skip ci] (#5553) 2020-04-16 12:08:55 -07:00
ShvetsKS
a2d86b8e4b Optimizations for RNG in InitData kernel (#5522)
* optimizations for subsampling in InitData

* optimizations for subsampling in InitData

Co-authored-by: SHVETS, KIRILL <kirill.shvets@intel.com>
2020-04-16 18:24:32 +03:00
Rory Mitchell
e268fb0093 Use thrust functions instead of custom functions (#5544) 2020-04-16 21:41:16 +12:00
Melissa Kohl
6a169cd41a Fix uninitialized value bug in xgboost callback (#5463)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-16 07:50:54 +08:00
Jiaming Yuan
468b1594d3 Fix CLI model IO. (#5535)
* Add test for comparing Python and CLI training result.
2020-04-16 07:48:47 +08:00
Philip Hyunsu Cho
0676a19e70 [jvm-packages] [CI] Publish XGBoost4J JARs with Scala 2.11 and 2.12 (#5539) 2020-04-15 09:32:02 -07:00
Philip Hyunsu Cho
ec02f40d42 [CI] Use Ubuntu 18.04 LTS in JVM CI, because 19.04 is EOL (#5537) 2020-04-15 07:32:46 -07:00
Jiaming Yuan
8b04736b81 [dask] dask cudf inplace prediction. (#5512)
* Add inplace prediction for dask-cudf.

* Remove Dockerfile.release, since it's not used anywhere

* Use Conda exclusively in CUDF and GPU containers

* Improve cupy memory copying.

* Add skip marks to tests.

* Add mgpu-cudf category on the CI to run all distributed tests.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-15 18:15:51 +08:00
Rory Mitchell
ca4e05660e Purge device_helpers.cuh (#5534)
* Simplifications with caching_device_vector

* Purge device helpers
2020-04-15 21:51:56 +12:00
Jiaming Yuan
a2f54963b6 Write binary header. (#5532) 2020-04-15 17:47:57 +08:00
Philip Hyunsu Cho
1b1969f20d [jvm-packages] [CI] Create a Maven repository to host SNAPSHOT JARs (#5533) 2020-04-14 19:33:32 -07:00
Kamil A. Kaczmarek
2809fb8b6f Add Neptune and Optuna to list of examples (#5528) 2020-04-14 11:00:50 -07:00
Jiaming Yuan
c90119eb67 Update Python doc. [skip ci] (#5517)
* Update doc for copying booster. [skip ci]

The issue is resolved in #5312.

* Add version for new APIs. [skip ci]
2020-04-14 16:25:20 +08:00
Philip Hyunsu Cho
88b64c8162 Ensure that configured dmlc/build_config.h is picked up by Rabit and XGBoost (#5514)
* Ensure that configured header (build_config.h) from dmlc-core is picked up by Rabit and XGBoost

* Check which Rabit target is being used

* Use CMake 3.13 in all Jenkins tests

* Upgrade CMake in Travis CI

* Install CMake using Kitware installer

* Remove existing CMake (3.12.4)
2020-04-11 23:48:28 -07:00
Nicolas Scozzaro
04f69b43e6 fix typo "customized" (#5515) 2020-04-12 14:43:48 +08:00
Liang-Chi Hsieh
449ab79e0c [CI] Use devtoolset-6 because devtoolset-4 is EOL and no longer available (#5506)
* Use devtoolset-6.

* [CI] Use devtoolset-6 because devtoolset-4 is EOL and no longer available

* CUDA 9.0 doesn't work with devtoolset-6; use devtoolset-4 for GPU build only

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-11 19:49:06 -07:00
Jiaming Yuan
b56c902841 [R] R raw serialization. (#5123)
* Add bindings for serialization.
* Change `xgb.save.raw' into full serialization instead of simple model.
* Add `xgb.load.raw' for unserialization.
* Run devtools.
2020-04-11 17:16:54 +08:00
Jiaming Yuan
a3db79df22 Remove makefiles. (#5513) 2020-04-11 13:25:53 +08:00
Rory Mitchell
093e2227e3 Serialise booster after training to reset state (#5484)
* Serialise booster after training to reset state

* Prevent process_type being set on load

* Check for correct updater sequence
2020-04-11 16:27:12 +12:00
Jiaming Yuan
4a0c8ef237 Update doc for parameter validation. (#5508)
* Update doc for parameter validation.

* Fix github rebase.
2020-04-11 00:43:46 +08:00
Jiaming Yuan
1334aca437 Fix github merge. (#5509) 2020-04-10 22:17:38 +08:00
Jiaming Yuan
866a477319 Unify max nodes. (#5497) 2020-04-10 19:26:35 +08:00
Jiaming Yuan
bd653fad4c Remove distcol updater. (#5507)
Closes #5498.
2020-04-10 12:52:56 +08:00
Jiaming Yuan
7d52c0b8c2 Requires setting leaf stat when expanding tree. (#5501)
* Fix GPU Hist feature importance.
2020-04-10 12:27:03 +08:00
Jiaming Yuan
dc2950fd90 Fix checking booster. (#5505)
* Use `get_params()` instead of `getattr` intrinsic.
2020-04-10 12:21:21 +08:00
Jiaming Yuan
6671b42dd4 Use ellpack for prediction only when sparsepage doesn't exist. (#5504) 2020-04-10 12:15:46 +08:00
Bobby Wang
ad826e913f [jvm-packages]add feature size for LabelPoint and DataBatch (#5303)
* fix type error

* Validate number of features.

* resolve comments

* add feature size for LabelPoint and DataBatch

* pass the feature size to native

* move feature size validating tests into a separate suite

* resolve comments

Co-authored-by: fis <jm.yuan@outlook.com>
2020-04-07 16:49:52 -07:00
Zhang Zhang
8bc595ea1e Fix out-of-bound array access in WQSummary::SetPrune() (#5493) 2020-04-08 10:02:31 +12:00
Rong Ou
a1085396e2 add reference to gpu external memory (#5490) 2020-04-07 11:15:58 +12:00
Yuan Tang
9097e8f0d9 Edits on tutorial for XGBoost job on Kubernetes (#5487) 2020-04-05 07:36:33 -04:00
Paul Kaefer
c362125d7b corrected spelling of 'list' (#5482) 2020-04-05 09:15:08 +08:00
Jiaming Yuan
0012f2ef93 Upgrade clang-tidy on CI. (#5469)
* Correct all clang-tidy errors.
* Upgrade clang-tidy to 10 on CI.

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-04-05 04:42:29 +08:00
Philip Hyunsu Cho
30e94ddd04 Add R code to AFT tutorial [skip ci] (#5486) 2020-04-04 13:06:12 -07:00
Rory Mitchell
15800107ad Small updates to GPU documentation (#5483) 2020-04-04 13:02:27 -07:00
Jiaming Yuan
a9313802ea Fix dump model. (#5485) 2020-04-05 03:52:54 +08:00
Philip Hyunsu Cho
5fc5ec539d Implement robust regularization in 'survival:aft' objective (#5473)
* Robust regularization of AFT gradient and hessian

* Fix AFT doc; expose it to tutorial TOC

* Apply robust regularization to uncensored case too

* Revise unit test slightly

* Fix lint

* Update test_survival.py

* Use GradientPairPrecise

* Remove unused variables
2020-04-04 12:21:24 -07:00
Jiaming Yuan
939973630d Accept other gradient types for split entry. (#5467) 2020-04-03 10:38:44 +08:00
Jiaming Yuan
86beb68ce8 Implement host span. (#5459) 2020-04-03 10:37:51 +08:00
Jiaming Yuan
459b175dc6 Split up test helpers header. (#5455) 2020-04-03 10:36:53 +08:00
Jiaming Yuan
c218d8ffbf Enable parameter validation for skl. (#5477) 2020-04-03 10:23:58 +08:00
Jiaming Yuan
d0b86c75d9 Remove silent parameter. (#5476) 2020-04-03 08:03:26 +08:00
Jiaming Yuan
29c6ad943a Prevent copying SimpleDMatrix. (#5453)
* Set default dtor for SimpleDMatrix to initialize default copy ctor, which is
deleted due to unique ptr.

* Remove commented code.
* Remove warning for calling host function (std::max).
* Remove warning for initialization order.
* Remove warning for unused variables.
2020-04-02 07:01:49 +08:00
Jiaming Yuan
e86030c360 Update dmlc-core. (#5466)
* Copy dmlc travis script to XGBoost.
2020-04-02 04:16:39 +08:00
Jiaming Yuan
babcb996e7 Reduce span check overhead. (#5464) 2020-04-01 22:07:24 +08:00
Rory Mitchell
15f40e51e9 Add support for dlpack, expose python docs for DeviceQuantileDMatrix (#5465) 2020-04-01 23:34:32 +13:00
Jiaming Yuan
6601a641d7 Thread safe, inplace prediction. (#5389)
Normal prediction with DMatrix is now thread safe with locks.  Added inplace prediction is lock free thread safe.

When data is on device (cupy, cudf), the returned data is also on device.

* Implementation for numpy, csr, cudf and cupy.

* Implementation for dask.

* Remove sync in simple dmatrix.
2020-03-30 15:35:28 +08:00
James Lamb
7f980e9f83 [R-package] fixed inconsistency in R -e calls in FindLibR.cmake (#5438) 2020-03-28 19:24:21 +08:00
ShvetsKS
27a8e36fc3 Reducing memory consumption for 'hist' method on CPU (#5334) 2020-03-28 14:45:52 +13:00
Rory Mitchell
13b10a6370 Device dmatrix (#5420) 2020-03-28 14:42:21 +13:00
Jiaming Yuan
780de49ddb Resolve travis failure. (#5445)
* Install dependencies by pip.
2020-03-27 19:37:58 +08:00
Jiaming Yuan
4942da64ae Refactor tests with data generator. (#5439) 2020-03-27 06:44:44 +08:00
Jiaming Yuan
7146b91d5a Force compressed buffer to be 4 bytes aligned. (#5441) 2020-03-27 06:43:52 +08:00
Avinash Barnwal
dcf439932a Add Accelerated Failure Time loss for survival analysis task (#4763)
* [WIP] Add lower and upper bounds on the label for survival analysis

* Update test MetaInfo.SaveLoadBinary to account for extra two fields

* Don't clear qids_ for version 2 of MetaInfo

* Add SetInfo() and GetInfo() method for lower and upper bounds

* changes to aft

* Add parameter class for AFT; use enum's to represent distribution and event type

* Add AFT metric

* changes to neg grad to grad

* changes to binomial loss

* changes to overflow

* changes to eps

* changes to code refactoring

* changes to code refactoring

* changes to code refactoring

* Re-factor survival analysis

* Remove aft namespace

* Move function bodies out of AFTNormal and AFTLogistic, to reduce clutter

* Move function bodies out of AFTLoss, to reduce clutter

* Use smart pointer to store AFTDistribution and AFTLoss

* Rename AFTNoiseDistribution enum to AFTDistributionType for clarity

The enum class was not a distribution itself but a distribution type

* Add AFTDistribution::Create() method for convenience

* changes to extreme distribution

* changes to extreme distribution

* changes to extreme

* changes to extreme distribution

* changes to left censored

* deleted cout

* changes to x,mu and sd and code refactoring

* changes to print

* changes to hessian formula in censored and uncensored

* changes to variable names and pow

* changes to Logistic Pdf

* changes to parameter

* Expose lower and upper bound labels to R package

* Use example weights; normalize log likelihood metric

* changes to CHECK

* changes to logistic hessian to standard formula

* changes to logistic formula

* Comply with coding style guideline

* Revert back Rabit submodule

* Revert dmlc-core submodule

* Comply with coding style guideline (clang-tidy)

* Fix an error in AFTLoss::Gradient()

* Add missing files to amalgamation

* Address @RAMitchell's comment: minimize future change in MetaInfo interface

* Fix lint

* Fix compilation error on 32-bit target, when size_t == bst_uint

* Allocate sufficient memory to hold extra label info

* Use OpenMP to speed up

* Fix compilation on Windows

* Address reviewer's feedback

* Add unit tests for probability distributions

* Make Metric subclass of Configurable

* Address reviewer's feedback: Configure() AFT metric

* Add a dummy test for AFT metric configuration

* Complete AFT configuration test; remove debugging print

* Rename AFT parameters

* Clarify test comment

* Add a dummy test for AFT loss for uncensored case

* Fix a bug in AFT loss for uncensored labels

* Complete unit test for AFT loss metric

* Simplify unit tests for AFT metric

* Add unit test to verify aggregate output from AFT metric

* Use EXPECT_* instead of ASSERT_*, so that we run all unit tests

* Use aft_loss_param when serializing AFTObj

This is to be consistent with AFT metric

* Add unit tests for AFT Objective

* Fix OpenMP bug; clarify semantics for shared variables used in OpenMP loops

* Add comments

* Remove AFT prefix from probability distribution; put probability distribution in separate source file

* Add comments

* Define kPI and kEulerMascheroni in probability_distribution.h

* Add probability_distribution.cc to amalgamation

* Remove unnecessary diff

* Address reviewer's feedback: define variables where they're used

* Eliminate all INFs and NANs from AFT loss and gradient

* Add demo

* Add tutorial

* Fix lint

* Use 'survival:aft' to be consistent with 'survival:cox'

* Move sample data to demo/data

* Add visual demo with 1D toy data

* Add Python tests

Co-authored-by: Philip Cho <chohyu01@cs.washington.edu>
2020-03-25 13:52:51 -07:00
Rory Mitchell
1de36cdf1e Add link to GPU documentation (#5437) 2020-03-24 09:29:29 +13:00
sriramch
d2231fc840 Ranking metric acceleration on the gpu (#5398) 2020-03-22 19:38:48 +13:00
Jiaming Yuan
cd7d6f7d59 [dask] Fix missing value for scikit-learn interface. (#5435) 2020-03-20 10:56:01 -04:00
James Lamb
4b7e2b7bff [R-package] fixed uses of class() (#5426)
Thank you a lot. Good catch!
2020-03-20 14:51:20 +01:00
Jiaming Yuan
abca9908ba Support pandas SparseArray. (#5431) 2020-03-20 21:40:22 +08:00
James Lamb
3cf665d3ec [R-package] changed FindLibR to take advantage of CMake cache (#5427) 2020-03-20 03:32:15 +08:00
Jiaming Yuan
760d5d0c3c [dask] Accept other inputs for prediction. (#5428)
* Returns a series when input is dataframe.

* Merge assert client.
2020-03-19 17:05:55 +08:00
Jiaming Yuan
8ca06ab329 [dask] Check non-equal when setting threads. (#5421)
* Check non-equal.

`nthread` can be restored from internal parameter, which is mis-interpreted as
user defined parameter.

* Check None.
2020-03-17 13:07:20 +08:00
Jiaming Yuan
b51124c158 [dask] Enable gridsearching with skl. (#5417) 2020-03-16 04:51:51 +08:00
Jiaming Yuan
761a5dbdfc [dask] Honor nthreads from dask worker. (#5414) 2020-03-16 04:51:24 +08:00
Jiaming Yuan
21b671aa06 [dask] Order the prediction result. (#5416) 2020-03-15 19:34:04 +08:00
Jiaming Yuan
668e432e2d [dask] Use DMLC_TASK_ID. (#5415) 2020-03-15 16:47:03 +08:00
Jiaming Yuan
fc88105620 Better error message for updating. (#5418) 2020-03-15 16:46:21 +08:00
Jiaming Yuan
ab7a46a1a4 Check whether current updater can modify a tree. (#5406)
* Check whether current updater can modify a tree.

* Fix tree model JSON IO for pruned trees.
2020-03-14 09:24:08 +08:00
Rory Mitchell
b745b7acce Fix memory usage of device sketching (#5407) 2020-03-14 13:43:24 +13:00
Jan Borchmann
bb8c8df39d [dask] passed through verbose for dask fit (#5413) 2020-03-14 06:33:53 +08:00
Jiaming Yuan
45a97ddf32 Split up LearnerImpl. (#5350) 2020-03-12 16:30:23 +08:00
Rory Mitchell
3ad4333b0e Partial rewrite EllpackPage (#5352) 2020-03-11 10:15:53 +13:00
Darby Payne
7a99f8f27f Adding static library option (#5397) 2020-03-10 18:22:15 +08:00
Bart Broere
a931589c96 Fix typo (#5399) 2020-03-09 19:41:39 +08:00
Rory Mitchell
a38e7bd19c Sketching from adapters (#5365)
* Sketching from adapters

* Add weights test
2020-03-07 21:07:58 +13:00
Jiaming Yuan
0dd97c206b Move thread local entry into Learner. (#5396)
* Move thread local entry into Learner.

This is an attempt to workaround CUDA context issue in static variable, where
the CUDA context can be released before device vector.

* Add PredictionEntry to thread local entry.

This eliminates one copy of prediction vector.

* Don't define CUDA C API in a namespace.
2020-03-07 15:37:39 +08:00
sriramch
1ba6706167 - create a gpu metrics (internal) registry (#5387)
* - create a gpu metrics (internal) registry
  - the objective is to separate the cpu and gpu implementations such that they evolve
    independently. to that end, this approach will:
    - preserve the same metrics configuration (from the end user perspective)
    - internally delegate the responsibility to the gpu metrics builder when there is a
      valid device present
    - decouple the gpu metrics builder from the cpu ones to prevent misuse
    - move away from including the cuda file from within the cc file and segregate the code
      via ifdef's
2020-03-07 15:31:35 +13:00
Jiaming Yuan
8d06878bf9 Deterministic GPU histogram. (#5361)
* Use pre-rounding based method to obtain reproducible floating point
  summation.
* GPU Hist for regression and classification are bit-by-bit reproducible.
* Add doc.
* Switch to thrust reduce for `node_sum_gradient`.
2020-03-04 15:13:28 +08:00
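The pre-rounding idea named in the commit above can be illustrated with a small Python sketch. This is not the CUDA code from the pull request: the helper names `prerounded` and `deterministic_sum` and the exact choice of grid are assumptions made for illustration, and inputs are assumed to be finite. Every value is first rounded onto a shared power-of-two grid that is coarse enough for all partial sums to be exactly representable, so the additions incur no rounding error and the result does not depend on summation order.

```python
import math
import random


def prerounded(values):
    """Round every value onto a shared power-of-two grid (illustrative helper)."""
    n = len(values)
    if n == 0:
        return []
    max_abs = max(abs(v) for v in values)
    # frexp gives max_abs = m * 2**exp with 0.5 <= m < 1, hence max_abs < 2**exp.
    _, exp = math.frexp(max_abs)
    # Smallest k such that n <= 2**k.
    k = (n - 1).bit_length()
    # Grid spacing is 2**shift. Each rounded value is an integer multiple of the
    # grid with |multiple| <= 2**(52 - k), so any partial sum of n values has
    # |multiple| <= 2**52 and is exactly representable as a double.
    shift = exp + k - 52
    return [math.ldexp(round(math.ldexp(v, -shift)), shift) for v in values]


def deterministic_sum(values):
    """Sum whose result does not depend on the order of `values`."""
    return sum(prerounded(values), 0.0)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.uniform(-1e6, 1e6) for _ in range(1000)]
    first = deterministic_sum(data)
    rng.shuffle(data)
    # Same bits regardless of order; the price is a small, bounded rounding loss.
    print(first == deterministic_sum(data))
```

The trade-off is a small loss of precision (at most half a grid step per value) in exchange for bit-identical results across runs and thread schedules.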
Philip Hyunsu Cho
9775da02d9 Add release note for 1.0.0 in NEWS.md (#5329)
* Add release note for 1.0.0

* Fix a small bug in the Python script that compiles the list of contributors

* Clarify governance of CI infrastructure; now PMC is formally in charge

* Address reviewer comment

* Fix typo
2020-03-03 21:35:43 -08:00
sriramch
5dc8e894c9 Fixes and changes to the ranking metrics computed on cpu (#5380)
* - fixes and changes to the ranking metrics computed on cpu
  - auc/aucpr ranking metric accelerated on cpu
  - fixes to the auc/aucpr metrics
2020-03-03 15:56:36 +13:00
Darius Kharazi
71a8b8c65a Fix simple typo: information.c -> information (#5384)
Closes #5383
2020-03-03 08:50:14 +08:00
Egor Smirnov
1b97eaf7a7 Optimized ApplySplit, BuildHist and UpdatePredictCache functions on CPU (#5244)
* Split up sparse and dense build hist kernels.
* Add `PartitionBuilder`.
2020-02-29 16:11:42 +08:00
sriramch
b81f8cbbc0 Move segment sorter to common (#5378)
- move segment sorter to common
- this is the first of a handful of pr's that splits the larger pr #5326
- it moves this facility to common (from ranking objective class), so that it can be
    used for metric computation
- it also wraps all the bald device pointers into span.
2020-02-29 15:42:07 +08:00
Jiaming Yuan
2ba8c13b69 Revert "Enable rabit test (#5358)" (#5377)
This reverts commit 9a5efffebe.
2020-02-29 04:25:03 +08:00
Chen Qin
9a5efffebe Enable rabit test (#5358) 2020-02-28 22:29:02 +08:00
Samrat Pandiri
2d76d40dfd Update dask.rst to correct a spelling mistake (#5371)
Change `signle-node` to `single-node`
2020-02-27 20:46:41 +08:00
Jiaming Yuan
a461a9a90a Define lazy isinstance for Python compat. (#5364)
* Avoid importing datatable.
* Fix #5363.
2020-02-26 14:23:33 +08:00
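A lazy `isinstance` check of the kind described in the commit above can be sketched as follows. This is an illustrative version, not necessarily the code in the pull request: it identifies a type by the module and class names recorded on the object, so the optional dependency never has to be imported just to run the check. Walking the method resolution order so that subclasses also match is an assumption of this sketch.

```python
from fractions import Fraction


def lazy_isinstance(instance, module, name):
    """Return True if `instance` is of a class called `name` defined in `module`.

    Only string comparison is used, so `module` is never imported here.
    """
    return any(
        cls.__module__ == module and cls.__name__ == name
        for cls in type(instance).__mro__
    )


if __name__ == "__main__":
    # Fraction stands in for a type from an optional third-party package.
    print(lazy_isinstance(Fraction(1, 2), "fractions", "Fraction"))
    print(lazy_isinstance(0.5, "fractions", "Fraction"))
```

Because the comparison is by name only, a renamed or re-exported class would need its defining module name passed in, which is the usual caveat of this approach.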
Jiaming Yuan
0fd455e162 Restore loading model from buffer. (#5360) 2020-02-26 11:30:13 +08:00
Jiaming Yuan
f2b8cd2922 Add number of columns to native data iterator. (#5202)
* Change native data iter into an adapter.
2020-02-25 23:42:01 +08:00
Jiaming Yuan
e0509b3307 Fix pruner. (#5335)
* Honor the tree depth.
* Prevent pruning pruned node.
2020-02-25 08:32:46 +08:00
Rory Mitchell
b0ed3f0a66 Remove unnecessary DMatrix methods (#5324) 2020-02-25 12:40:39 +13:00
Jiaming Yuan
655cf17b60 Predict on Ellpack. (#5327)
* Unify GPU prediction node.
* Add `PageExists`.
* Dispatch prediction on input data for GPU Predictor.
2020-02-23 06:27:03 +08:00
daiki katsuragawa
70a91ec3ba Update README.md (#5346) 2020-02-23 02:52:37 +08:00
Philip Hyunsu Cho
cfae247231 Fix a small typo in sklearn.py that broke multiple eval metrics (#5341) 2020-02-22 19:02:37 +08:00
Rong Ou
d6b31df449 update docs for gpu external memory (#5332)
* update docs for gpu external memory

* add hist limitation
2020-02-22 14:57:40 +08:00
Philip Hyunsu Cho
7ac7e8778f Port patches from 1.0.0 branch (#5336)
* Remove f-string, since it's not supported by Python 3.5 (#5330)

* Remove f-string, since it's not supported by Python 3.5

* Add Python 3.5 to CI, to ensure compatibility

* Remove duplicated matplotlib

* Show deprecation notice for Python 3.5

* Fix lint

* Fix lint

* Fix a unit test that mistook MINOR ver for PATCH ver

* Enforce only major version in JSON model schema

* Bump version to 1.1.0-SNAPSHOT
2020-02-21 13:13:21 -08:00
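For context on the f-string removal in the commit above: f-strings were introduced in Python 3.6, so code that must still run on Python 3.5 has to use `str.format()` or `%` formatting instead. A minimal sketch follows; the `describe` function is a hypothetical example, not code from the repository.

```python
def describe(name, version):
    """Build a message in a way that also works on Python 3.5."""
    # On Python 3.6+ this could be written as f"{name} version {version}",
    # but that form is a SyntaxError on 3.5.
    return "{} version {}".format(name, version)


if __name__ == "__main__":
    print(describe("xgboost", "1.1.0"))
```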
Philip Hyunsu Cho
8aa8ef1031 Display Sponsor button, link to OpenCollective (#5325) 2020-02-19 01:58:21 -08:00
Rory Mitchell
bc96ceb8b2 Refactor SparsePageSource, delete cache files after use (#5321)
* Refactor sparse page source

* Delete temporary cache files

* Log fatal if cache exists

* Log fatal if multiple threads used with prefetcher
2020-02-19 16:43:41 +13:00
Rory Mitchell
b2b2c4e231 Remove SimpleCSRSource (#5315) 2020-02-18 16:49:17 +13:00
Jiaming Yuan
9f77c18b0d Add JVM_CHECK_CALL. (#5199)
* Added a check call macro in jvm package, prevents executing other functions
from jvm when error occurred in XGBoost. For example, when prediction fails jvm
should not try to allocate memory based on the output prediction size.
2020-02-18 11:10:55 +08:00
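The commit above adds a C++ macro on the JVM binding side; the same check-then-continue pattern can be sketched in Python. All names here (`NativeError`, `check_call`, `predict_then_allocate`) are hypothetical and only illustrate the idea: the status of a native-style call is inspected immediately, and nothing that depends on its outputs runs if the call failed.

```python
class NativeError(RuntimeError):
    """Raised when a (hypothetical) native call reports failure."""


def check_call(return_code, message="native call failed"):
    """Stop immediately if a native-style call returned a non-zero status."""
    if return_code != 0:
        raise NativeError(message)


def predict_then_allocate(native_predict):
    """Call `native_predict` and allocate a buffer only if it succeeded.

    `native_predict` is assumed to return a (status, output_length) pair.
    """
    status, out_len = native_predict()
    check_call(status, "prediction failed")
    # Reached only on success, so out_len can be trusted here.
    return [0.0] * out_len


if __name__ == "__main__":
    print(len(predict_then_allocate(lambda: (0, 4))))
    try:
        predict_then_allocate(lambda: (-1, 10**12))
    except NativeError as err:
        print("stopped before allocating:", err)
```

Without the check, the failing call in the usage example would lead to an allocation sized by a meaningless output length, which is the situation the commit message describes.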
Jiaming Yuan
0110754a76 Remove update prediction cache from predictors. (#5312)
Move this function into gbtree, and uses only updater for doing so. As now the predictor knows exactly how many trees to predict, there's no need for it to update the prediction cache.
2020-02-17 11:35:47 +08:00
Jiaming Yuan
e433a379e4 Fix changing locale. (#5314)
* Fix changing locale.

* Don't use locale guard.

As number parsing is implemented in house, we don't need locale.

* Update doc.
2020-02-17 11:31:13 +08:00
Rory Mitchell
7e32af5c21 Wide dataset quantile performance improvement (#5306) 2020-02-16 10:24:42 +13:00
Jiaming Yuan
ed2465cce4 Add configuration to R interface. (#5217)
* Save and load internal parameter configuration as JSON.
2020-02-16 03:01:58 +08:00
Jiaming Yuan
8ca9744b07 Use scikit-learn in extra dependencies. (#5310) 2020-02-15 07:12:51 +08:00
Jiaming Yuan
c35cdecddd Move prediction cache to Learner. (#5220)
* Move prediction cache into Learner.

* Clean-ups

- Remove duplicated cache in Learner and GBM.
- Remove ad-hoc fix of invalid cache.
- Remove `PredictFromCache` in predictors.
- Remove prediction cache for linear altogether, as it's only moving the
  prediction into training process but doesn't provide any actual overall speed
  gain.
- The cache is now unique to Learner, which means the ownership is no longer
  shared by any other components.

* Changes

- Add version to prediction cache.
- Use weak ptr to check expired DMatrix.
- Pass shared pointer instead of raw pointer.
2020-02-14 13:04:23 +08:00
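The cache design listed in the commit above (a single owner, a version per entry, and weak references to detect expired data) can be sketched in Python. This is an illustration under stated assumptions, not the C++ implementation: `Matrix` is a hypothetical stand-in for DMatrix, and the version is assumed to count how many boosting rounds the cached prediction already includes.

```python
import weakref


class Matrix:
    """Hypothetical stand-in for a DMatrix."""

    def __init__(self, rows):
        self.rows = rows


class PredictionCache:
    """Cache owned by one learner; entries hold only weak references to data."""

    def __init__(self):
        self._entries = {}  # id(matrix) -> (weak reference, version, predictions)

    def __len__(self):
        return len(self._entries)

    def store(self, matrix, version, predictions):
        self._entries[id(matrix)] = (weakref.ref(matrix), version, predictions)

    def lookup(self, matrix, version):
        """Return cached predictions only if they match this matrix and version."""
        entry = self._entries.get(id(matrix))
        if entry is None:
            return None
        ref, cached_version, predictions = entry
        if ref() is not matrix or cached_version != version:
            return None
        return predictions

    def drop_expired(self):
        """Remove entries whose matrix has already been garbage collected."""
        self._entries = {
            key: entry for key, entry in self._entries.items()
            if entry[0]() is not None
        }


if __name__ == "__main__":
    cache = PredictionCache()
    data = Matrix([1, 2, 3])
    cache.store(data, version=2, predictions=[0.1, 0.2, 0.3])
    print(cache.lookup(data, 2))   # hit
    print(cache.lookup(data, 3))   # stale version, treated as a miss
```

Holding only a weak reference means the cache never keeps a dataset alive on its own, and the version check prevents predictions from an older model state from being reused.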
Rory Mitchell
24ad9dec0b Testing hist_util (#5251)
* Rank tests

* Remove categorical split specialisation

* Extend tests to multiple features, switch to WQSketch

* Add tests for SparseCuts

* Add external memory quantile tests, fix some existing tests
2020-02-14 14:36:43 +13:00
Jiaming Yuan
911a902835 Merge model compatibility fixes from 1.0rc branch. (#5305)
* Port test model compatibility.
* Port logit model fix.

https://github.com/dmlc/xgboost/pull/5248
https://github.com/dmlc/xgboost/pull/5281
2020-02-13 20:41:58 +08:00
Jiaming Yuan
29eeea709a Pass shared pointer instead of raw pointer to Learner. (#5302)
Extracted from https://github.com/dmlc/xgboost/pull/5220 .
2020-02-11 14:16:38 +08:00
Philip Hyunsu Cho
2e0067e790 Update affiliation of @hcho3 (#5292) 2020-02-06 20:58:39 -08:00
Andrew Kane
94828a7c0c Updated Windows build docs (#5283) 2020-02-05 12:19:54 +08:00
Jiaming Yuan
84e395d91e Fix CMake build on Windows with setuptools. (#5280) 2020-02-05 10:47:39 +08:00
Jiaming Yuan
595a00466d Rewrite setup.py. (#5271)
The setup.py is rewritten.  This new script uses only Python code and provide customized
implementation of setuptools commands.  This way users can run most of setuptools commands
just like any other Python libraries.

* Remove setup_pip.py
* Remove soft links.
* Define customized commands.
* Remove shell script.
* Remove makefile script.
* Update the doc for building from source.
2020-02-04 13:35:42 +08:00
Rong Ou
e4b74c4d22 Gradient based sampling for GPU Hist (#5093)
* Implement gradient based sampling for GPU Hist tree method.
* Add samplers and handle compacted page in GPU Hist.
2020-02-04 10:31:27 +08:00
Philip Hyunsu Cho
c74216f22c Declare Python 3.8 support in setup.py (#5274) 2020-02-03 10:38:52 -08:00
David Díaz Vico
71e7e3b96f Improved sklearn compatibility (#5255) 2020-02-03 13:30:45 +08:00
Jiaming Yuan
a5cc112eea Export JSON config in get_params. (#5256) 2020-02-03 12:46:51 +08:00
Jiaming Yuan
ed0216642f Avoid dask test fixtures. (#5270)
* Fix Travis OSX timeout.

* Fix classifier.
2020-02-03 12:39:20 +08:00
Jiaming Yuan
856b81c727 Ignore gdb_history. [skip ci] (#5257) 2020-02-02 20:40:09 +08:00
Nan Zhu
d7b45fbcaf [jvm-packages] do not use multiple jobs to make checkpoints (#5082)
* temp

* temp

* tep

* address the comments

* fix stylistic issues

* fix

* external checkpoint
2020-02-01 19:36:39 -08:00
Philip Hyunsu Cho
fa26313feb Remove use of std::cout from R package (#5261) 2020-02-01 05:52:19 -08:00
551 changed files with 33605 additions and 13687 deletions

1
.github/FUNDING.yml vendored Normal file

@@ -0,0 +1 @@
open_collective: xgboost

138
.github/workflows/main.yml vendored Normal file

@@ -0,0 +1,138 @@
# This is a basic workflow to help you get started with Actions
name: XGBoost-CI
# Controls when the action will run. Triggers the workflow on push or pull request
# events but only for the master branch
on: [push, pull_request]
env:
  R_PACKAGES: c('XML', 'igraph', 'data.table', 'magrittr', 'stringi', 'ggplot2', 'DiagrammeR', 'Ckmeans.1d.dp', 'vcd', 'testthat', 'lintr', 'knitr', 'rmarkdown', 'e1071', 'cplm', 'devtools')
# A workflow run is made up of one or more jobs that can run sequentially or in parallel
jobs:
  test-with-jvm:
    name: Test JVM on OS ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [windows-latest, windows-2016, ubuntu-latest]
    steps:
    - uses: actions/checkout@v2
      with:
        submodules: 'true'
    - uses: actions/setup-java@v1
      with:
        java-version: 1.8
    - name: Cache Maven packages
      uses: actions/cache@v2
      with:
        path: ~/.m2
        key: ${{ runner.os }}-m2-${{ hashFiles('./jvm-packages/pom.xml') }}
        restore-keys: ${{ runner.os }}-m2
    - name: Test JVM packages
      run: |
        cd jvm-packages
        mvn test -pl :xgboost4j_2.12
  lintr:
    runs-on: ${{ matrix.config.os }}
    name: Run R linters on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
    strategy:
      matrix:
        config:
        - {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
    env:
      R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
      RSPM: ${{ matrix.config.rspm }}
    steps:
    - uses: actions/checkout@v2
      with:
        submodules: 'true'
    - uses: r-lib/actions/setup-r@master
      with:
        r-version: ${{ matrix.config.r }}
    - name: Cache R packages
      uses: actions/cache@v2
      with:
        path: ${{ env.R_LIBS_USER }}
        key: ${{ runner.os }}-r-${{ matrix.config.r }}-1-${{ hashFiles('R-package/DESCRIPTION') }}
        restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-
    - name: Install dependencies
      shell: Rscript {0}
      run: |
        install.packages(${{ env.R_PACKAGES }},
                         repos = 'http://cloud.r-project.org',
                         dependencies = c('Depends', 'Imports', 'LinkingTo'))
    - name: Run lintr
      run: |
        cd R-package
        R.exe CMD INSTALL .
        Rscript.exe tests/helper_scripts/run_lint.R
  test-with-R:
    runs-on: ${{ matrix.config.os }}
    name: Test R on OS ${{ matrix.config.os }}, R ${{ matrix.config.r }}, Compiler ${{ matrix.config.compiler }}, Build ${{ matrix.config.build }}
    strategy:
      fail-fast: false
      matrix:
        config:
        - {os: windows-latest, r: 'release', compiler: 'msvc', build: 'autotools'}
        - {os: windows-2016, r: 'release', compiler: 'msvc', build: 'autotools'}
        - {os: windows-latest, r: 'release', compiler: 'msvc', build: 'cmake'}
        - {os: windows-2016, r: 'release', compiler: 'msvc', build: 'cmake'}
        - {os: windows-latest, r: 'release', compiler: 'mingw', build: 'autotools'}
        - {os: windows-2016, r: 'release', compiler: 'mingw', build: 'autotools'}
        - {os: windows-latest, r: 'release', compiler: 'mingw', build: 'cmake'}
        - {os: windows-2016, r: 'release', compiler: 'mingw', build: 'cmake'}
    env:
      R_REMOTES_NO_ERRORS_FROM_WARNINGS: true
      RSPM: ${{ matrix.config.rspm }}
    steps:
    - uses: actions/checkout@v2
      with:
        submodules: 'true'
    - uses: r-lib/actions/setup-r@master
      with:
        r-version: ${{ matrix.config.r }}
    - name: Cache R packages
      uses: actions/cache@v2
      with:
        path: ${{ env.R_LIBS_USER }}
        key: ${{ runner.os }}-r-${{ matrix.config.r }}-1-${{ hashFiles('R-package/DESCRIPTION') }}
        restore-keys: ${{ runner.os }}-r-${{ matrix.config.r }}-2-
    - name: Install dependencies
      shell: Rscript {0}
      run: |
        install.packages(${{ env.R_PACKAGES }},
                         repos = 'http://cloud.r-project.org',
                         dependencies = c('Depends', 'Imports', 'LinkingTo'))
    - uses: actions/setup-python@v2
      with:
        python-version: '3.6' # Version range or exact version of a Python version to use, using SemVer's version range syntax
        architecture: 'x64' # optional x64 or x86. Defaults to x64 if not specified
    - name: Test R
      run: |
        python tests/ci_build/test_r_package.py --compiler="${{ matrix.config.compiler }}" --build-tool="${{ matrix.config.build }}"

6
.gitignore vendored

@@ -51,6 +51,7 @@ Debug
 #.Rbuildignore
 R-package.Rproj
 *.cache*
+.mypy_cache/
 # java
 java/xgboost4j/target
 java/xgboost4j/tmp
@@ -65,7 +66,6 @@ nb-configuration*
 .pydevproject
 .settings/
 build
-config.mk
 /xgboost
 *.data
 build_plugin
@@ -93,6 +93,7 @@ metastore_db
 # files from R-package source install
 **/config.status
 R-package/src/Makevars
+*.lib
 # Visual Studio Code
 /.vscode/
@@ -101,3 +102,6 @@ R-package/src/Makevars
 .idea
 *.iml
 /cmake-build-debug/
+# GDB
+.gdb_history

View File

@@ -6,7 +6,7 @@ os:
 - linux
 - osx
-osx_image: xcode10.3
+osx_image: xcode10.1
 dist: bionic
 # Use Build Matrix to do lint and build seperately
@@ -21,6 +21,10 @@ env:
 # cmake test
 - TASK=cmake_test
+global:
+- secure: "PR16i9F8QtNwn99C5NDp8nptAS+97xwDtXEJJfEiEVhxPaaRkOp0MPWhogCaK0Eclxk1TqkgWbdXFknwGycX620AzZWa/A1K3gAs+GrpzqhnPMuoBJ0Z9qxXTbSJvCyvMbYwVrjaxc/zWqdMU8waWz8A7iqKGKs/SqbQ3rO6v7c="
+- secure: "dAGAjBokqm/0nVoLMofQni/fWIBcYSmdq4XvCBX1ZAMDsWnuOfz/4XCY6h2lEI1rVHZQ+UdZkc9PioOHGPZh5BnvE49/xVVWr9c4/61lrDOlkD01ZjSAeoV0fAZq+93V/wPl4QV+MM+Sem9hNNzFSbN5VsQLAiWCSapWsLdKzqA="
 matrix:
 exclude:
 - os: linux
@@ -39,12 +43,13 @@ addons:
 - graphviz
 - openssl
 - libgit2
+- lz4
 - wget
 - r
 update: true
 before_install:
-- source dmlc-core/scripts/travis/travis_setup_env.sh
+- source tests/travis/travis_setup_env.sh
 - if [ "${TASK}" != "python_sdist_test" ]; then export PYTHONPATH=${PYTHONPATH}:${PWD}/python-package; fi
 - echo "MAVEN_OPTS='-Xmx2g -XX:MaxPermSize=1024m -XX:ReservedCodeCacheSize=512m -Dorg.slf4j.simpleLogger.defaultLogLevel=error'" > ~/.mavenrc
@@ -60,7 +65,7 @@ cache:
 - ${HOME}/.cache/pip
 before_cache:
-- dmlc-core/scripts/travis/travis_before_cache.sh
+- tests/travis/travis_before_cache.sh
 after_failure:
 - tests/travis/travis_after_failure.sh

View File

@@ -1,8 +1,11 @@
-cmake_minimum_required(VERSION 3.12)
-project(xgboost LANGUAGES CXX C VERSION 1.0.0)
+cmake_minimum_required(VERSION 3.13)
+project(xgboost LANGUAGES CXX C VERSION 1.2.1)
 include(cmake/Utils.cmake)
 list(APPEND CMAKE_MODULE_PATH "${xgboost_SOURCE_DIR}/cmake/modules")
 cmake_policy(SET CMP0022 NEW)
+cmake_policy(SET CMP0079 NEW)
+set(CMAKE_POLICY_DEFAULT_CMP0063 NEW)
+cmake_policy(SET CMP0063 NEW)
 if ((${CMAKE_VERSION} VERSION_GREATER 3.13) OR (${CMAKE_VERSION} VERSION_EQUAL 3.13))
 cmake_policy(SET CMP0077 NEW)
@@ -23,17 +26,22 @@ set_default_configuration_release()
 #-- Options
 option(BUILD_C_DOC "Build documentation for C APIs using Doxygen." OFF)
 option(USE_OPENMP "Build with OpenMP support." ON)
+option(BUILD_STATIC_LIB "Build static library" OFF)
 ## Bindings
 option(JVM_BINDINGS "Build JVM bindings" OFF)
 option(R_LIB "Build shared library for R package" OFF)
 ## Dev
 option(USE_DEBUG_OUTPUT "Dump internal training results like gradients and predictions to stdout.
 Should only be used for debugging." OFF)
+option(FORCE_COLORED_OUTPUT "Force colored output from compilers, useful when ninja is used instead of make." OFF)
+option(ENABLE_ALL_WARNINGS "Enable all compiler warnings. Only effective for GCC/Clang" OFF)
+option(LOG_CAPI_INVOCATION "Log all C API invocations for debugging" OFF)
 option(GOOGLE_TEST "Build google tests" OFF)
 option(USE_DMLC_GTEST "Use google tests bundled with dmlc-core submodule" OFF)
 option(USE_NVTX "Build with cuda profiling annotations. Developers only." OFF)
 set(NVTX_HEADER_DIR "" CACHE PATH "Path to the stand-alone nvtx header")
 option(RABIT_MOCK "Build rabit with mock" OFF)
+option(HIDE_CXX_SYMBOLS "Build shared library and hide all C++ symbols" OFF)
 ## CUDA
 option(USE_CUDA "Build with GPU acceleration" OFF)
 option(USE_NCCL "Build with NCCL to enable distributed GPU support." OFF)
@@ -49,10 +57,11 @@ option(USE_SANITIZER "Use santizer flags" OFF)
 option(SANITIZER_PATH "Path to sanitizes.")
 set(ENABLED_SANITIZERS "address" "leak" CACHE STRING
 "Semicolon separated list of sanitizer names. E.g 'address;leak'. Supported sanitizers are
-address, leak and thread.")
+address, leak, undefined and thread.")
 ## Plugins
 option(PLUGIN_LZ4 "Build lz4 plugin" OFF)
 option(PLUGIN_DENSE_PARSER "Build dense parser plugin" OFF)
+option(ADD_PKGCONFIG "Add xgboost.pc into system." ON)
 #-- Checks for building XGBoost
 if (USE_DEBUG_OUTPUT AND (NOT (CMAKE_BUILD_TYPE MATCHES Debug)))
@@ -74,6 +83,11 @@ endif (R_LIB AND GOOGLE_TEST)
 if (USE_AVX)
 message(SEND_ERROR "The option 'USE_AVX' is deprecated as experimental AVX features have been removed from XGBoost.")
 endif (USE_AVX)
+if (ENABLE_ALL_WARNINGS)
+if ((NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang") AND (NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU"))
+message(SEND_ERROR "ENABLE_ALL_WARNINGS is only available for Clang and GCC.")
+endif ((NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang") AND (NOT CMAKE_CXX_COMPILER_ID STREQUAL "GNU"))
+endif (ENABLE_ALL_WARNINGS)
 #-- Sanitizer
 if (USE_SANITIZER)
@@ -88,11 +102,22 @@ if (USE_CUDA)
 message(STATUS "Configured CUDA host compiler: ${CMAKE_CUDA_HOST_COMPILER}")
 enable_language(CUDA)
+if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS 10.0)
+message(FATAL_ERROR "CUDA version must be at least 10.0!")
+endif()
 set(GEN_CODE "")
 format_gencode_flags("${GPU_COMPUTE_VER}" GEN_CODE)
 message(STATUS "CUDA GEN_CODE: ${GEN_CODE}")
 endif (USE_CUDA)
+if (FORCE_COLORED_OUTPUT AND (CMAKE_GENERATOR STREQUAL "Ninja") AND
+((CMAKE_CXX_COMPILER_ID STREQUAL "GNU") OR
+(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")))
+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fdiagnostics-color=always")
+endif()
+find_package(Threads REQUIRED)
 if (USE_OPENMP)
 if (APPLE)
 # Require CMake 3.16+ on Mac OSX, as previous versions of CMake had trouble locating
@@ -102,14 +127,28 @@ if (USE_OPENMP)
find_package(OpenMP REQUIRED) find_package(OpenMP REQUIRED)
endif (USE_OPENMP) endif (USE_OPENMP)
# core xgboost
add_subdirectory(${xgboost_SOURCE_DIR}/src)
# dmlc-core # dmlc-core
msvc_use_static_runtime() msvc_use_static_runtime()
add_subdirectory(${xgboost_SOURCE_DIR}/dmlc-core) add_subdirectory(${xgboost_SOURCE_DIR}/dmlc-core)
set_target_properties(dmlc PROPERTIES set_target_properties(dmlc PROPERTIES
CXX_STANDARD 11 CXX_STANDARD 14
CXX_STANDARD_REQUIRED ON CXX_STANDARD_REQUIRED ON
POSITION_INDEPENDENT_CODE ON) POSITION_INDEPENDENT_CODE ON)
list(APPEND LINKED_LIBRARIES_PRIVATE dmlc) if (MSVC)
target_compile_options(dmlc PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
if (TARGET dmlc_unit_tests)
target_compile_options(dmlc_unit_tests PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (TARGET dmlc_unit_tests)
endif (MSVC)
if (ENABLE_ALL_WARNINGS)
target_compile_options(dmlc PRIVATE -Wall -Wextra)
endif (ENABLE_ALL_WARNINGS)
target_link_libraries(objxgboost PUBLIC dmlc)
# rabit # rabit
set(RABIT_BUILD_DMLC OFF) set(RABIT_BUILD_DMLC OFF)
@@ -118,28 +157,60 @@ set(RABIT_WITH_R_LIB ${R_LIB})
add_subdirectory(rabit) add_subdirectory(rabit)
if (RABIT_MOCK) if (RABIT_MOCK)
list(APPEND LINKED_LIBRARIES_PRIVATE rabit_mock_static) target_link_libraries(objxgboost PUBLIC rabit_mock_static)
if (MSVC)
target_compile_options(rabit_mock_static PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (MSVC)
else() else()
list(APPEND LINKED_LIBRARIES_PRIVATE rabit) target_link_libraries(objxgboost PUBLIC rabit)
if (MSVC)
target_compile_options(rabit PRIVATE
-D_CRT_SECURE_NO_WARNINGS -D_CRT_SECURE_NO_DEPRECATE)
endif (MSVC)
endif(RABIT_MOCK) endif(RABIT_MOCK)
foreach(lib rabit rabit_base rabit_empty rabit_mock rabit_mock_static)
# Explicitly link dmlc to rabit, so that configured header (build_config.h)
# from dmlc is correctly applied to rabit.
if (TARGET ${lib})
target_link_libraries(${lib} dmlc ${CMAKE_THREAD_LIBS_INIT})
if (ENABLE_ALL_WARNINGS)
target_compile_options(${lib} PRIVATE -Wall -Wextra)
endif (ENABLE_ALL_WARNINGS)
endif (TARGET ${lib})
endforeach()
# Exports some R specific definitions and objects # Exports some R specific definitions and objects
if (R_LIB) if (R_LIB)
add_subdirectory(${xgboost_SOURCE_DIR}/R-package) add_subdirectory(${xgboost_SOURCE_DIR}/R-package)
endif (R_LIB) endif (R_LIB)
# core xgboost # Plugin
add_subdirectory(${xgboost_SOURCE_DIR}/plugin) add_subdirectory(${xgboost_SOURCE_DIR}/plugin)
add_subdirectory(${xgboost_SOURCE_DIR}/src)
set(XGBOOST_OBJ_SOURCES "${XGBOOST_OBJ_SOURCES};$<TARGET_OBJECTS:objxgboost>")
#-- Shared library #-- library
add_library(xgboost SHARED ${XGBOOST_OBJ_SOURCES}) if (BUILD_STATIC_LIB)
add_library(xgboost STATIC)
else (BUILD_STATIC_LIB)
add_library(xgboost SHARED)
endif (BUILD_STATIC_LIB)
target_link_libraries(xgboost PRIVATE objxgboost)
if (USE_NVTX)
enable_nvtx(xgboost)
endif (USE_NVTX)
#-- Hide all C++ symbols
if (HIDE_CXX_SYMBOLS)
foreach(target objxgboost xgboost dmlc rabit rabit_mock_static)
set_target_properties(${target} PROPERTIES CXX_VISIBILITY_PRESET hidden)
endforeach()
endif (HIDE_CXX_SYMBOLS)
target_include_directories(xgboost target_include_directories(xgboost
INTERFACE INTERFACE
$<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/include> $<INSTALL_INTERFACE:${CMAKE_INSTALL_PREFIX}/include>
$<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}/include>) $<BUILD_INTERFACE:${CMAKE_CURRENT_LIST_DIR}/include>)
target_link_libraries(xgboost PRIVATE ${LINKED_LIBRARIES_PRIVATE})
# This creates its own shared library `xgboost4j'. # This creates its own shared library `xgboost4j'.
if (JVM_BINDINGS) if (JVM_BINDINGS)
@@ -148,18 +219,21 @@ endif (JVM_BINDINGS)
#-- End shared library #-- End shared library
#-- CLI for xgboost #-- CLI for xgboost
add_executable(runxgboost ${xgboost_SOURCE_DIR}/src/cli_main.cc ${XGBOOST_OBJ_SOURCES}) add_executable(runxgboost ${xgboost_SOURCE_DIR}/src/cli_main.cc)
target_link_libraries(runxgboost PRIVATE objxgboost)
if (USE_NVTX)
enable_nvtx(runxgboost)
endif (USE_NVTX)
target_include_directories(runxgboost target_include_directories(runxgboost
PRIVATE PRIVATE
${xgboost_SOURCE_DIR}/include ${xgboost_SOURCE_DIR}/include
${xgboost_SOURCE_DIR}/dmlc-core/include ${xgboost_SOURCE_DIR}/dmlc-core/include
${xgboost_SOURCE_DIR}/rabit/include) ${xgboost_SOURCE_DIR}/rabit/include)
target_link_libraries(runxgboost PRIVATE ${LINKED_LIBRARIES_PRIVATE})
set_target_properties( set_target_properties(
runxgboost PROPERTIES runxgboost PROPERTIES
OUTPUT_NAME xgboost OUTPUT_NAME xgboost
CXX_STANDARD 11 CXX_STANDARD 14
CXX_STANDARD_REQUIRED ON) CXX_STANDARD_REQUIRED ON)
#-- End CLI for xgboost #-- End CLI for xgboost
@@ -170,11 +244,12 @@ add_dependencies(xgboost runxgboost)
#-- Installing XGBoost #-- Installing XGBoost
if (R_LIB) if (R_LIB)
include(cmake/RPackageInstallTargetSetup.cmake)
set_target_properties(xgboost PROPERTIES PREFIX "") set_target_properties(xgboost PROPERTIES PREFIX "")
if (APPLE) if (APPLE)
set_target_properties(xgboost PROPERTIES SUFFIX ".so") set_target_properties(xgboost PROPERTIES SUFFIX ".so")
endif (APPLE) endif (APPLE)
setup_rpackage_install_target(xgboost ${CMAKE_CURRENT_BINARY_DIR}) setup_rpackage_install_target(xgboost "${CMAKE_CURRENT_BINARY_DIR}/R-package-install")
set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/dummy_inst") set(CMAKE_INSTALL_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/dummy_inst")
endif (R_LIB) endif (R_LIB)
if (MINGW) if (MINGW)
@@ -245,3 +320,12 @@ endif (GOOGLE_TEST)
# replace /MD with /MT. See https://github.com/dmlc/xgboost/issues/4462 # replace /MD with /MT. See https://github.com/dmlc/xgboost/issues/4462
# for issues caused by mixing of /MD and /MT flags # for issues caused by mixing of /MD and /MT flags
msvc_use_static_runtime() msvc_use_static_runtime()
# Add xgboost.pc
if (ADD_PKGCONFIG)
configure_file(${xgboost_SOURCE_DIR}/cmake/xgboost.pc.in ${xgboost_BINARY_DIR}/xgboost.pc @ONLY)
install(
FILES ${xgboost_BINARY_DIR}/xgboost.pc
DESTINATION ${CMAKE_INSTALL_LIBDIR}/pkgconfig)
endif (ADD_PKGCONFIG)


@@ -10,14 +10,14 @@ The Project Management Committee(PMC) consists group of active committers that m
- Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project. - Tianqi is a Ph.D. student working on large-scale machine learning. He is the creator of the project.
* [Michael Benesty](https://github.com/pommedeterresautee) * [Michael Benesty](https://github.com/pommedeterresautee)
- Michael is a lawyer and data scientist in France. He is the creator of XGBoost interactive analysis module in R. - Michael is a lawyer and data scientist in France. He is the creator of XGBoost interactive analysis module in R.
* [Yuan Tang](https://github.com/terrytangyuan), Ant Financial * [Yuan Tang](https://github.com/terrytangyuan), Ant Group
- Yuan is a software engineer in Ant Financial. He contributed mostly in R and Python packages. - Yuan is a software engineer in Ant Group. He contributed mostly in R and Python packages.
* [Nan Zhu](https://github.com/CodingCat), Uber * [Nan Zhu](https://github.com/CodingCat), Uber
- Nan is a software engineer in Uber. He contributed mostly in JVM packages. - Nan is a software engineer in Uber. He contributed mostly in JVM packages.
* [Jiaming Yuan](https://github.com/trivialfis) * [Jiaming Yuan](https://github.com/trivialfis)
- Jiaming contributed to the GPU algorithms. He has also introduced new abstractions to improve the quality of the C++ codebase. - Jiaming contributed to the GPU algorithms. He has also introduced new abstractions to improve the quality of the C++ codebase.
* [Hyunsu Cho](http://hyunsu-cho.io/), Amazon AI * [Hyunsu Cho](http://hyunsu-cho.io/), NVIDIA
- Hyunsu is an applied scientist in Amazon AI. He is the maintainer of the XGBoost Python package. He also manages the Jenkins continuous integration system (https://xgboost-ci.net/). He is the initial author of the CPU 'hist' updater. - Hyunsu is the maintainer of the XGBoost Python package. He also manages the Jenkins continuous integration system (https://xgboost-ci.net/). He is the initial author of the CPU 'hist' updater.
* [Rory Mitchell](https://github.com/RAMitchell), University of Waikato * [Rory Mitchell](https://github.com/RAMitchell), University of Waikato
- Rory is a Ph.D. student at University of Waikato. He is the original creator of the GPU training algorithms. He improved the CMake build system and continuous integration. - Rory is a Ph.D. student at University of Waikato. He is the original creator of the GPU training algorithms. He improved the CMake build system and continuous integration.
* [Hongliang Liu](https://github.com/phunterlau) * [Hongliang Liu](https://github.com/phunterlau)
@@ -37,6 +37,8 @@ Committers are people who have made substantial contribution to the project and
- Sergei is a software engineer in Criteo. He contributed mostly in JVM packages. - Sergei is a software engineer in Criteo. He contributed mostly in JVM packages.
* [Scott Lundberg](http://scottlundberg.com/), University of Washington * [Scott Lundberg](http://scottlundberg.com/), University of Washington
- Scott is a Ph.D. student at University of Washington. He is the creator of SHAP, a unified approach to explain the output of machine learning models such as decision tree ensembles. He also helps maintain the XGBoost Julia package. - Scott is a Ph.D. student at University of Washington. He is the creator of SHAP, a unified approach to explain the output of machine learning models such as decision tree ensembles. He also helps maintain the XGBoost Julia package.
* [Egor Smirnov](https://github.com/SmirnovEgorRu), Intel
- Egor has led a major effort to improve the performance of XGBoost on multi-core CPUs.
Become a Committer Become a Committer

Jenkinsfile

@@ -6,6 +6,9 @@
// Command to run command inside a docker container // Command to run command inside a docker container
dockerRun = 'tests/ci_build/ci_build.sh' dockerRun = 'tests/ci_build/ci_build.sh'
// Which CUDA version to use when building reference distribution wheel
ref_cuda_ver = '10.0'
import groovy.transform.Field import groovy.transform.Field
@Field @Field
@@ -31,13 +34,14 @@ pipeline {
// Build stages // Build stages
stages { stages {
stage('Jenkins Linux: Get sources') { stage('Jenkins Linux: Initialize') {
agent { label 'linux && cpu' } agent { label 'job_initializer' }
steps { steps {
script { script {
checkoutSrcs() checkoutSrcs()
commit_id = "${GIT_COMMIT}" commit_id = "${GIT_COMMIT}"
} }
sh 'python3 tests/jenkins_get_approval.py'
stash name: 'srcs' stash name: 'srcs'
milestone ordinal: 1 milestone ordinal: 1
} }
@@ -63,10 +67,16 @@ pipeline {
parallel ([ parallel ([
'build-cpu': { BuildCPU() }, 'build-cpu': { BuildCPU() },
'build-cpu-rabit-mock': { BuildCPUMock() }, 'build-cpu-rabit-mock': { BuildCPUMock() },
'build-gpu-cuda9.0': { BuildCUDA(cuda_version: '9.0') }, 'build-cpu-non-omp': { BuildCPUNonOmp() },
// Build reference, distribution-ready Python wheel with CUDA 10.0
// using CentOS 6 image
'build-gpu-cuda10.0': { BuildCUDA(cuda_version: '10.0') }, 'build-gpu-cuda10.0': { BuildCUDA(cuda_version: '10.0') },
// The build-gpu-* builds below use Ubuntu image
'build-gpu-cuda10.1': { BuildCUDA(cuda_version: '10.1') }, 'build-gpu-cuda10.1': { BuildCUDA(cuda_version: '10.1') },
'build-jvm-packages': { BuildJVMPackages(spark_version: '2.4.3') }, 'build-gpu-cuda10.2': { BuildCUDA(cuda_version: '10.2') },
'build-gpu-cuda11.0': { BuildCUDA(cuda_version: '11.0') },
'build-jvm-packages-gpu-cuda10.0': { BuildJVMPackagesWithCUDA(spark_version: '3.0.0', cuda_version: '10.0') },
'build-jvm-packages': { BuildJVMPackages(spark_version: '3.0.0') },
'build-jvm-doc': { BuildJVMDoc() } 'build-jvm-doc': { BuildJVMDoc() }
]) ])
} }
@@ -79,22 +89,33 @@ pipeline {
script { script {
parallel ([ parallel ([
'test-python-cpu': { TestPythonCPU() }, 'test-python-cpu': { TestPythonCPU() },
'test-python-gpu-cuda9.0': { TestPythonGPU(cuda_version: '9.0') }, 'test-python-gpu-cuda10.2': { TestPythonGPU(host_cuda_version: '10.2') },
'test-python-gpu-cuda10.0': { TestPythonGPU(cuda_version: '10.0') }, 'test-python-gpu-cuda11.0-cross': { TestPythonGPU(artifact_cuda_version: '10.0', host_cuda_version: '11.0') },
'test-python-gpu-cuda10.1': { TestPythonGPU(cuda_version: '10.1') }, 'test-python-gpu-cuda11.0': { TestPythonGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0') },
'test-python-mgpu-cuda10.1': { TestPythonGPU(cuda_version: '10.1', multi_gpu: true) }, 'test-python-mgpu-cuda10.2': { TestPythonGPU(artifact_cuda_version: '10.0', host_cuda_version: '10.2', multi_gpu: true) },
'test-cpp-gpu': { TestCppGPU(cuda_version: '10.1') }, 'test-cpp-gpu-cuda10.2': { TestCppGPU(artifact_cuda_version: '10.2', host_cuda_version: '10.2') },
'test-cpp-mgpu': { TestCppGPU(cuda_version: '10.1', multi_gpu: true) }, 'test-cpp-gpu-cuda11.0': { TestCppGPU(artifact_cuda_version: '11.0', host_cuda_version: '11.0') },
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '2.4.3') }, 'test-jvm-jdk8-cuda10.0': { CrossTestJVMwithJDKGPU(artifact_cuda_version: '10.0', host_cuda_version: '10.0') },
'test-jvm-jdk8': { CrossTestJVMwithJDK(jdk_version: '8', spark_version: '3.0.0') },
'test-jvm-jdk11': { CrossTestJVMwithJDK(jdk_version: '11') }, 'test-jvm-jdk11': { CrossTestJVMwithJDK(jdk_version: '11') },
'test-jvm-jdk12': { CrossTestJVMwithJDK(jdk_version: '12') }, 'test-jvm-jdk12': { CrossTestJVMwithJDK(jdk_version: '12') },
'test-r-3.4.4': { TestR(use_r35: false) },
'test-r-3.5.3': { TestR(use_r35: true) } 'test-r-3.5.3': { TestR(use_r35: true) }
]) ])
} }
milestone ordinal: 4 milestone ordinal: 4
} }
} }
stage('Jenkins Linux: Deploy') {
agent none
steps {
script {
parallel ([
'deploy-jvm-packages': { DeployJVMPackages(spark_version: '3.0.0') }
])
}
milestone ordinal: 5
}
}
} }
} }
@@ -113,13 +134,17 @@ def checkoutSrcs() {
} }
} }
def GetCUDABuildContainerType(cuda_version) {
return (cuda_version == ref_cuda_ver) ? 'gpu_build_centos6' : 'gpu_build'
}
def ClangTidy() { def ClangTidy() {
node('linux && cpu') { node('linux && cpu_build') {
unstash name: 'srcs' unstash name: 'srcs'
echo "Running clang-tidy job..." echo "Running clang-tidy job..."
def container_type = "clang_tidy" def container_type = "clang_tidy"
def docker_binary = "docker" def docker_binary = "docker"
def dockerArgs = "--build-arg CUDA_VERSION=9.2" def dockerArgs = "--build-arg CUDA_VERSION_ARG=10.1"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} ${dockerArgs} python3 tests/ci_build/tidy.py ${dockerRun} ${container_type} ${docker_binary} ${dockerArgs} python3 tests/ci_build/tidy.py
""" """
@@ -134,7 +159,7 @@ def Lint() {
def container_type = "cpu" def container_type = "cpu"
def docker_binary = "docker" def docker_binary = "docker"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} make lint ${dockerRun} ${container_type} ${docker_binary} bash -c "source activate cpu_test && make lint"
""" """
deleteDir() deleteDir()
} }
@@ -148,7 +173,7 @@ def SphinxDoc() {
def docker_binary = "docker" def docker_binary = "docker"
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e SPHINX_GIT_BRANCH=${BRANCH_NAME}'" def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e SPHINX_GIT_BRANCH=${BRANCH_NAME}'"
sh """#!/bin/bash sh """#!/bin/bash
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} make -C doc html ${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} bash -c "source activate cpu_test && make -C doc html"
""" """
deleteDir() deleteDir()
} }
@@ -163,8 +188,10 @@ def Doxygen() {
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/doxygen.sh ${BRANCH_NAME} ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/doxygen.sh ${BRANCH_NAME}
""" """
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading doc...' echo 'Uploading doc...'
s3Upload file: "build/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "doxygen/${BRANCH_NAME}.tar.bz2" s3Upload file: "build/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "doxygen/${BRANCH_NAME}.tar.bz2"
}
deleteDir() deleteDir()
} }
} }
@@ -176,17 +203,22 @@ def BuildCPU() {
def container_type = "cpu" def container_type = "cpu"
def docker_binary = "docker" def docker_binary = "docker"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh ${dockerRun} ${container_type} ${docker_binary} rm -fv dmlc-core/include/dmlc/build_config_default.h
# This step is not necessary, but here we include it, to ensure that DMLC_CORE_USE_CMAKE flag is correctly propagated
# We want to make sure that we use the configured header build/dmlc/build_config.h instead of include/dmlc/build_config_default.h.
# See discussion at https://github.com/dmlc/xgboost/issues/5510
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DPLUGIN_LZ4=ON -DPLUGIN_DENSE_PARSER=ON
${dockerRun} ${container_type} ${docker_binary} build/testxgboost ${dockerRun} ${container_type} ${docker_binary} build/testxgboost
""" """
// Sanitizer test // Sanitizer test
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer -e ASAN_OPTIONS=symbolize=1 -e UBSAN_OPTIONS=print_stacktrace=1:log_path=ubsan_error.log --cap-add SYS_PTRACE'" def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='-e ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer -e ASAN_OPTIONS=symbolize=1 -e UBSAN_OPTIONS=print_stacktrace=1:log_path=ubsan_error.log --cap-add SYS_PTRACE'"
def docker_args = "--build-arg CMAKE_VERSION=3.12"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_via_cmake.sh -DUSE_SANITIZER=ON -DENABLED_SANITIZERS="address;leak;undefined" \ ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DUSE_SANITIZER=ON -DENABLED_SANITIZERS="address;leak;undefined" \
-DCMAKE_BUILD_TYPE=Debug -DSANITIZER_PATH=/usr/lib/x86_64-linux-gnu/ -DCMAKE_BUILD_TYPE=Debug -DSANITIZER_PATH=/usr/lib/x86_64-linux-gnu/
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} build/testxgboost ${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} build/testxgboost
""" """
stash name: 'xgboost_cli', includes: 'xgboost'
deleteDir() deleteDir()
} }
} }
@@ -206,28 +238,70 @@ def BuildCPUMock() {
} }
} }
def BuildCPUNonOmp() {
def BuildCUDA(args) {
node('linux && cpu') { node('linux && cpu') {
unstash name: 'srcs' unstash name: 'srcs'
echo "Build with CUDA ${args.cuda_version}" echo "Build CPU without OpenMP"
def container_type = "gpu_build" def container_type = "cpu"
def docker_binary = "docker" def docker_binary = "docker"
def docker_args = "--build-arg CUDA_VERSION=${args.cuda_version}"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON -DOPEN_MP:BOOL=ON ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_via_cmake.sh -DUSE_OPENMP=OFF
${dockerRun} ${container_type} ${docker_binary} ${docker_args} bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal" """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python3 tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} manylinux1_x86_64 echo "Running Non-OpenMP C++ test..."
sh """
${dockerRun} ${container_type} ${docker_binary} build/testxgboost
"""
deleteDir()
}
}
def BuildCUDA(args) {
node('linux && cpu_build') {
unstash name: 'srcs'
echo "Build with CUDA ${args.cuda_version}"
def container_type = GetCUDABuildContainerType(args.cuda_version)
def docker_binary = "docker"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
def arch_flag = ""
if (env.BRANCH_NAME != 'master' && !(env.BRANCH_NAME.startsWith('release'))) {
arch_flag = "-DGPU_COMPUTE_VER=75"
}
sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON ${arch_flag}
${dockerRun} ${container_type} ${docker_binary} ${docker_args} bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"
${dockerRun} ${container_type} ${docker_binary} ${docker_args} python tests/ci_build/rename_whl.py python-package/dist/*.whl ${commit_id} manylinux2010_x86_64
""" """
// Stash wheel for CUDA 9.0 target
if (args.cuda_version == '9.0') {
echo 'Stashing Python wheel...' echo 'Stashing Python wheel...'
stash name: 'xgboost_whl_cuda9', includes: 'python-package/dist/*.whl' stash name: "xgboost_whl_cuda${args.cuda_version}", includes: 'python-package/dist/*.whl'
if (args.cuda_version == ref_cuda_ver && (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release'))) {
echo 'Uploading Python wheel...'
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/" path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl' s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
echo 'Stashing C++ test executable (testxgboost)...'
stash name: 'xgboost_cpp_tests', includes: 'build/testxgboost'
} }
echo 'Stashing C++ test executable (testxgboost)...'
stash name: "xgboost_cpp_tests_cuda${args.cuda_version}", includes: 'build/testxgboost'
deleteDir()
}
}
def BuildJVMPackagesWithCUDA(args) {
node('linux && mgpu') {
unstash name: 'srcs'
echo "Build XGBoost4J-Spark with Spark ${args.spark_version}, CUDA ${args.cuda_version}"
def container_type = "jvm_gpu_build"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.cuda_version}"
def arch_flag = ""
if (env.BRANCH_NAME != 'master' && !(env.BRANCH_NAME.startsWith('release'))) {
arch_flag = "-DGPU_COMPUTE_VER=75"
}
// Use only 4 CPU cores
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='--cpuset-cpus 0-3'"
sh """
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/build_jvm_packages.sh ${args.spark_version} -Duse.cuda=ON $arch_flag
"""
echo "Stashing XGBoost4J JAR with CUDA ${args.cuda_version} ..."
stash name: 'xgboost4j_jar_gpu', includes: "jvm-packages/xgboost4j/target/*.jar,jvm-packages/xgboost4j-spark/target/*.jar,jvm-packages/xgboost4j-example/target/*.jar"
deleteDir() deleteDir()
} }
} }
@@ -244,7 +318,7 @@ def BuildJVMPackages(args) {
${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_jvm_packages.sh ${args.spark_version} ${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_jvm_packages.sh ${args.spark_version}
""" """
echo 'Stashing XGBoost4J JAR...' echo 'Stashing XGBoost4J JAR...'
stash name: 'xgboost4j_jar', includes: 'jvm-packages/xgboost4j/target/*.jar,jvm-packages/xgboost4j-spark/target/*.jar,jvm-packages/xgboost4j-example/target/*.jar' stash name: 'xgboost4j_jar', includes: "jvm-packages/xgboost4j/target/*.jar,jvm-packages/xgboost4j-spark/target/*.jar,jvm-packages/xgboost4j-example/target/*.jar"
deleteDir() deleteDir()
} }
} }
@@ -258,16 +332,19 @@ def BuildJVMDoc() {
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_jvm_doc.sh ${BRANCH_NAME} ${dockerRun} ${container_type} ${docker_binary} tests/ci_build/build_jvm_doc.sh ${BRANCH_NAME}
""" """
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading doc...' echo 'Uploading doc...'
s3Upload file: "jvm-packages/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "${BRANCH_NAME}.tar.bz2" s3Upload file: "jvm-packages/${BRANCH_NAME}.tar.bz2", bucket: 'xgboost-docs', acl: 'PublicRead', path: "${BRANCH_NAME}.tar.bz2"
}
deleteDir() deleteDir()
} }
} }
def TestPythonCPU() { def TestPythonCPU() {
node('linux && cpu') { node('linux && cpu') {
unstash name: 'xgboost_whl_cuda9' unstash name: "xgboost_whl_cuda${ref_cuda_ver}"
unstash name: 'srcs' unstash name: 'srcs'
unstash name: 'xgboost_cli'
echo "Test Python CPU" echo "Test Python CPU"
def container_type = "cpu" def container_type = "cpu"
def docker_binary = "docker" def docker_binary = "docker"
@@ -279,18 +356,22 @@ def TestPythonCPU() {
} }
def TestPythonGPU(args) { def TestPythonGPU(args) {
nodeReq = (args.multi_gpu) ? 'linux && mgpu' : 'linux && gpu' def nodeReq = (args.multi_gpu) ? 'linux && mgpu' : 'linux && gpu'
def artifact_cuda_version = (args.artifact_cuda_version) ?: ref_cuda_ver
node(nodeReq) { node(nodeReq) {
unstash name: 'xgboost_whl_cuda9' unstash name: "xgboost_whl_cuda${artifact_cuda_version}"
unstash name: "xgboost_cpp_tests_cuda${artifact_cuda_version}"
unstash name: 'srcs' unstash name: 'srcs'
echo "Test Python GPU: CUDA ${args.cuda_version}" echo "Test Python GPU: CUDA ${args.host_cuda_version}"
def container_type = "gpu" def container_type = "gpu"
def docker_binary = "nvidia-docker" def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.cuda_version}" def docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
if (args.multi_gpu) { if (args.multi_gpu) {
echo "Using multiple GPUs" echo "Using multiple GPUs"
// Allocate extra space in /dev/shm to enable NCCL
def docker_extra_params = "CI_DOCKER_EXTRA_PARAMS_INIT='--shm-size=4g'"
sh """ sh """
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh mgpu ${docker_extra_params} ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh mgpu
""" """
} else { } else {
echo "Using a single GPU" echo "Using a single GPU"
@@ -298,13 +379,6 @@ def TestPythonGPU(args) {
${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh gpu ${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_python.sh gpu
""" """
} }
// For CUDA 10.0 target, run cuDF tests too
if (args.cuda_version == '10.0') {
echo "Running tests with cuDF..."
sh """
${dockerRun} cudf ${docker_binary} ${docker_args} tests/ci_build/test_python.sh cudf
"""
}
deleteDir() deleteDir()
} }
} }
@@ -324,21 +398,34 @@ def TestCppRabit() {
} }
def TestCppGPU(args) { def TestCppGPU(args) {
nodeReq = (args.multi_gpu) ? 'linux && mgpu' : 'linux && gpu' def nodeReq = 'linux && mgpu'
def artifact_cuda_version = (args.artifact_cuda_version) ?: ref_cuda_ver
node(nodeReq) { node(nodeReq) {
unstash name: 'xgboost_cpp_tests' unstash name: "xgboost_cpp_tests_cuda${artifact_cuda_version}"
unstash name: 'srcs' unstash name: 'srcs'
echo "Test C++, CUDA ${args.cuda_version}" echo "Test C++, CUDA ${args.host_cuda_version}"
def container_type = "gpu" def container_type = "gpu"
def docker_binary = "nvidia-docker" def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION=${args.cuda_version}" def docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
if (args.multi_gpu) { sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} build/testxgboost"
echo "Using multiple GPUs" deleteDir()
sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} build/testxgboost --gtest_filter=*.MGPU_*"
} else {
echo "Using a single GPU"
sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} build/testxgboost --gtest_filter=-*.MGPU_*"
} }
}
def CrossTestJVMwithJDKGPU(args) {
def nodeReq = 'linux && mgpu'
node(nodeReq) {
unstash name: "xgboost4j_jar_gpu"
unstash name: 'srcs'
if (args.spark_version != null) {
echo "Test XGBoost4J on a machine with JDK ${args.jdk_version}, Spark ${args.spark_version}, CUDA ${args.host_cuda_version}"
} else {
echo "Test XGBoost4J on a machine with JDK ${args.jdk_version}, CUDA ${args.host_cuda_version}"
}
def container_type = "gpu_jvm"
def docker_binary = "nvidia-docker"
def docker_args = "--build-arg CUDA_VERSION_ARG=${args.host_cuda_version}"
sh "${dockerRun} ${container_type} ${docker_binary} ${docker_args} tests/ci_build/test_jvm_gpu_cross.sh"
deleteDir() deleteDir()
} }
} }
@@ -379,3 +466,19 @@ def TestR(args) {
deleteDir() deleteDir()
} }
} }
def DeployJVMPackages(args) {
node('linux && cpu') {
unstash name: 'srcs'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Deploying to xgboost-maven-repo S3 repo...'
sh """
${dockerRun} jvm docker tests/ci_build/deploy_jvm_packages.sh ${args.spark_version} 0
"""
sh """
${dockerRun} jvm_gpu_build docker --build-arg CUDA_VERSION_ARG=10.0 tests/ci_build/deploy_jvm_packages.sh ${args.spark_version} 1
"""
}
deleteDir()
}
}


@@ -10,15 +10,25 @@ def commit_id // necessary to pass a variable from one stage to another
pipeline { pipeline {
agent none agent none
// Setup common job properties
options {
timestamps()
timeout(time: 240, unit: 'MINUTES')
buildDiscarder(logRotator(numToKeepStr: '10'))
preserveStashes()
}
// Build stages // Build stages
stages { stages {
stage('Jenkins Win64: Get sources') { stage('Jenkins Win64: Initialize') {
agent { label 'win64 && build' } agent { label 'job_initializer' }
steps { steps {
script { script {
checkoutSrcs() checkoutSrcs()
commit_id = "${GIT_COMMIT}" commit_id = "${GIT_COMMIT}"
} }
sh 'python3 tests/jenkins_get_approval.py'
stash name: 'srcs' stash name: 'srcs'
milestone ordinal: 1 milestone ordinal: 1
} }
@@ -28,7 +38,7 @@ pipeline {
steps { steps {
script { script {
parallel ([ parallel ([
'build-win64-cuda9.0': { BuildWin64() } 'build-win64-cuda10.1': { BuildWin64() }
]) ])
} }
milestone ordinal: 2 milestone ordinal: 2
@@ -39,10 +49,7 @@ pipeline {
steps { steps {
script { script {
parallel ([ parallel ([
'test-win64-cpu': { TestWin64CPU() }, 'test-win64-cuda10.1': { TestWin64() },
'test-win64-gpu-cuda9.0': { TestWin64GPU(cuda_target: 'cuda9') },
'test-win64-gpu-cuda10.0': { TestWin64GPU(cuda_target: 'cuda10_0') },
'test-win64-gpu-cuda10.1': { TestWin64GPU(cuda_target: 'cuda10_1') }
]) ])
} }
milestone ordinal: 3 milestone ordinal: 3
@@ -67,14 +74,18 @@ def checkoutSrcs() {
} }
def BuildWin64() { def BuildWin64() {
node('win64 && build') { node('win64 && cuda10_unified') {
unstash name: 'srcs' unstash name: 'srcs'
echo "Building XGBoost for Windows AMD64 target..." echo "Building XGBoost for Windows AMD64 target..."
bat "nvcc --version" bat "nvcc --version"
def arch_flag = ""
if (env.BRANCH_NAME != 'master' && !(env.BRANCH_NAME.startsWith('release'))) {
arch_flag = "-DGPU_COMPUTE_VER=75"
}
bat """ bat """
mkdir build mkdir build
cd build cd build
cmake .. -G"Visual Studio 15 2017 Win64" -DUSE_CUDA=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON cmake .. -G"Visual Studio 15 2017 Win64" -DUSE_CUDA=ON -DCMAKE_VERBOSE_MAKEFILE=ON -DGOOGLE_TEST=ON -DUSE_DMLC_GTEST=ON ${arch_flag}
""" """
bat """ bat """
cd build cd build
@@ -92,50 +103,41 @@ def BuildWin64() {
""" """
echo 'Stashing Python wheel...' echo 'Stashing Python wheel...'
stash name: 'xgboost_whl', includes: 'python-package/dist/*.whl' stash name: 'xgboost_whl', includes: 'python-package/dist/*.whl'
if (env.BRANCH_NAME == 'master' || env.BRANCH_NAME.startsWith('release')) {
echo 'Uploading Python wheel...'
path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/" path = ("${BRANCH_NAME}" == 'master') ? '' : "${BRANCH_NAME}/"
s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl' s3Upload bucket: 'xgboost-nightly-builds', path: path, acl: 'PublicRead', workingDir: 'python-package/dist', includePathPattern:'**/*.whl'
}
echo 'Stashing C++ test executable (testxgboost)...' echo 'Stashing C++ test executable (testxgboost)...'
stash name: 'xgboost_cpp_tests', includes: 'build/testxgboost.exe' stash name: 'xgboost_cpp_tests', includes: 'build/testxgboost.exe'
stash name: 'xgboost_cli', includes: 'xgboost.exe'
deleteDir() deleteDir()
} }
} }
def TestWin64CPU() { def TestWin64() {
node('win64 && cpu') { node('win64 && cuda10_unified') {
unstash name: 'srcs'
unstash name: 'xgboost_whl'
echo "Test Win64 CPU"
echo "Installing Python wheel..."
bat "conda activate && (python -m pip uninstall -y xgboost || cd .)"
bat """
conda activate && for /R %%i in (python-package\\dist\\*.whl) DO python -m pip install "%%i"
"""
echo "Running Python tests..."
bat "conda activate && python -m pytest -v -s --fulltrace tests\\python"
bat "conda activate && python -m pip uninstall -y xgboost"
deleteDir()
}
}
def TestWin64GPU(args) {
node("win64 && gpu && ${args.cuda_target}") {
unstash name: 'srcs' unstash name: 'srcs'
unstash name: 'xgboost_whl' unstash name: 'xgboost_whl'
unstash name: 'xgboost_cli'
unstash name: 'xgboost_cpp_tests' unstash name: 'xgboost_cpp_tests'
echo "Test Win64 GPU (${args.cuda_target})" echo "Test Win64"
bat "nvcc --version" bat "nvcc --version"
echo "Running C++ tests..." echo "Running C++ tests..."
bat "build\\testxgboost.exe" bat "build\\testxgboost.exe"
echo "Installing Python dependencies..."
def env_name = 'win64_' + UUID.randomUUID().toString().replaceAll('-', '')
bat "conda env create -n ${env_name} --file=tests/ci_build/conda_env/win64_test.yml"
echo "Installing Python wheel..." echo "Installing Python wheel..."
bat "conda activate && (python -m pip uninstall -y xgboost || cd .)"
bat """ bat """
conda activate && for /R %%i in (python-package\\dist\\*.whl) DO python -m pip install "%%i" conda activate ${env_name} && for /R %%i in (python-package\\dist\\*.whl) DO python -m pip install "%%i"
""" """
echo "Running Python tests..." echo "Running Python tests..."
bat "conda activate ${env_name} && python -m pytest -v -s -rxXs --fulltrace tests\\python"
bat """ bat """
conda activate && python -m pytest -v -s --fulltrace -m "(not slow) and (not mgpu)" tests\\python-gpu conda activate ${env_name} && python -m pytest -v -s -rxXs --fulltrace -m "(not slow) and (not mgpu)" tests\\python-gpu
""" """
bat "conda activate && python -m pip uninstall -y xgboost" bat "conda env remove --name ${env_name}"
deleteDir() deleteDir()
} }
} }

143
Makefile

@@ -1,11 +1,3 @@
ifndef config
ifneq ("$(wildcard ./config.mk)","")
config = config.mk
else
config = make/config.mk
endif
endif
ifndef DMLC_CORE ifndef DMLC_CORE
DMLC_CORE = dmlc-core DMLC_CORE = dmlc-core
endif endif
@@ -30,16 +22,6 @@ ifndef MAKE_OK
endif endif
$(warning MAKE [$(MAKE)] - $(if $(MAKE_OK),checked OK,PROBLEM)) $(warning MAKE [$(MAKE)] - $(if $(MAKE_OK),checked OK,PROBLEM))
ifeq ($(OS), Windows_NT)
UNAME="Windows"
else
UNAME=$(shell uname)
endif
include $(config)
ifeq ($(USE_OPENMP), 0)
export NO_OPENMP = 1
endif
include $(DMLC_CORE)/make/dmlc.mk include $(DMLC_CORE)/make/dmlc.mk
# set compiler defaults for OSX versus *nix # set compiler defaults for OSX versus *nix
@@ -62,75 +44,21 @@ export CXX = g++
endif endif
endif endif
export LDFLAGS= -pthread -lm $(ADD_LDFLAGS) $(DMLC_LDFLAGS) export CFLAGS= -DDMLC_LOG_CUSTOMIZE=1 -std=c++14 -Wall -Wno-unknown-pragmas -Iinclude $(ADD_CFLAGS)
export CFLAGS= -DDMLC_LOG_CUSTOMIZE=1 -std=c++11 -Wall -Wno-unknown-pragmas -Iinclude $(ADD_CFLAGS)
CFLAGS += -I$(DMLC_CORE)/include -I$(RABIT)/include -I$(GTEST_PATH)/include CFLAGS += -I$(DMLC_CORE)/include -I$(RABIT)/include -I$(GTEST_PATH)/include
#java include path
export JAVAINCFLAGS = -I${JAVA_HOME}/include -I./java
ifeq ($(TEST_COVER), 1) ifeq ($(TEST_COVER), 1)
CFLAGS += -g -O0 -fprofile-arcs -ftest-coverage CFLAGS += -g -O0 -fprofile-arcs -ftest-coverage
else else
CFLAGS += -O3 -funroll-loops CFLAGS += -O3 -funroll-loops
ifeq ($(USE_SSE), 1)
CFLAGS += -msse2
endif
endif endif
ifndef LINT_LANG ifndef LINT_LANG
LINT_LANG= "all" LINT_LANG= "all"
endif endif
ifeq ($(UNAME), Windows)
XGBOOST_DYLIB = lib/xgboost.dll
JAVAINCFLAGS += -I${JAVA_HOME}/include/win32
else
ifeq ($(UNAME), Darwin)
XGBOOST_DYLIB = lib/libxgboost.dylib
CFLAGS += -fPIC
else
XGBOOST_DYLIB = lib/libxgboost.so
CFLAGS += -fPIC
endif
endif
ifeq ($(UNAME), Linux)
LDFLAGS += -lrt
JAVAINCFLAGS += -I${JAVA_HOME}/include/linux
endif
ifeq ($(UNAME), Darwin)
JAVAINCFLAGS += -I${JAVA_HOME}/include/darwin
endif
OPENMP_FLAGS =
ifeq ($(USE_OPENMP), 1)
OPENMP_FLAGS = -fopenmp
else
OPENMP_FLAGS = -DDISABLE_OPENMP
endif
CFLAGS += $(OPENMP_FLAGS)
# specify tensor path # specify tensor path
.PHONY: clean all lint clean_all doxygen rcpplint pypack Rpack Rbuild Rcheck java pylint .PHONY: clean all lint clean_all doxygen rcpplint pypack Rpack Rbuild Rcheck
all: lib/libxgboost.a $(XGBOOST_DYLIB) xgboost
$(DMLC_CORE)/libdmlc.a: $(wildcard $(DMLC_CORE)/src/*.cc $(DMLC_CORE)/src/*/*.cc)
+ cd $(DMLC_CORE); "$(MAKE)" libdmlc.a config=$(ROOTDIR)/$(config); cd $(ROOTDIR)
$(RABIT)/lib/$(LIB_RABIT): $(wildcard $(RABIT)/src/*.cc)
+ cd $(RABIT); "$(MAKE)" lib/$(LIB_RABIT) USE_SSE=$(USE_SSE); cd $(ROOTDIR)
jvm: jvm-packages/lib/libxgboost4j.so
SRC = $(wildcard src/*.cc src/*/*.cc)
ALL_OBJ = $(patsubst src/%.cc, build/%.o, $(SRC))
AMALGA_OBJ = amalgamation/xgboost-all0.o
LIB_DEP = $(DMLC_CORE)/libdmlc.a $(RABIT)/lib/$(LIB_RABIT)
ALL_DEP = $(filter-out build/cli_main.o, $(ALL_OBJ)) $(LIB_DEP)
CLI_OBJ = build/cli_main.o
include tests/cpp/xgboost_test.mk
build/%.o: src/%.cc build/%.o: src/%.cc
@mkdir -p $(@D) @mkdir -p $(@D)
@@ -141,27 +69,6 @@ build/%.o: src/%.cc
amalgamation/xgboost-all0.o: amalgamation/xgboost-all0.cc amalgamation/xgboost-all0.o: amalgamation/xgboost-all0.cc
$(CXX) -c $(CFLAGS) $< -o $@ $(CXX) -c $(CFLAGS) $< -o $@
# Equivalent to lib/libxgboost_all.so
lib/libxgboost_all.so: $(AMALGA_OBJ) $(LIB_DEP)
@mkdir -p $(@D)
$(CXX) $(CFLAGS) -shared -o $@ $(filter %.o %.a, $^) $(LDFLAGS)
lib/libxgboost.a: $(ALL_DEP)
@mkdir -p $(@D)
ar crv $@ $(filter %.o, $?)
lib/xgboost.dll lib/libxgboost.so lib/libxgboost.dylib: $(ALL_DEP)
@mkdir -p $(@D)
$(CXX) $(CFLAGS) -shared -o $@ $(filter %.o %a, $^) $(LDFLAGS)
jvm-packages/lib/libxgboost4j.so: jvm-packages/xgboost4j/src/native/xgboost4j.cpp $(ALL_DEP)
@mkdir -p $(@D)
$(CXX) $(CFLAGS) $(JAVAINCFLAGS) -shared -o $@ $(filter %.cpp %.o %.a, $^) $(LDFLAGS)
xgboost: $(CLI_OBJ) $(ALL_DEP)
$(CXX) $(CFLAGS) -o $@ $(filter %.o %.a, $^) $(LDFLAGS)
rcpplint: rcpplint:
python3 dmlc-core/scripts/lint.py xgboost ${LINT_LANG} R-package/src python3 dmlc-core/scripts/lint.py xgboost ${LINT_LANG} R-package/src
@@ -172,16 +79,6 @@ lint: rcpplint
python-package/xgboost/src --pylint-rc ${PWD}/python-package/.pylintrc xgboost \ python-package/xgboost/src --pylint-rc ${PWD}/python-package/.pylintrc xgboost \
${LINT_LANG} include src python-package ${LINT_LANG} include src python-package
pylint:
flake8 --ignore E501 python-package
flake8 --ignore E501 tests/python
test: $(ALL_TEST)
$(ALL_TEST)
check: test
./tests/cpp/xgboost_test
ifeq ($(TEST_COVER), 1) ifeq ($(TEST_COVER), 1)
cover: check cover: check
@- $(foreach COV_OBJ, $(COVER_OBJ), \ @- $(foreach COV_OBJ, $(COVER_OBJ), \
@@ -202,38 +99,9 @@ clean_all: clean
cd $(DMLC_CORE); "$(MAKE)" clean; cd $(ROOTDIR) cd $(DMLC_CORE); "$(MAKE)" clean; cd $(ROOTDIR)
cd $(RABIT); "$(MAKE)" clean; cd $(ROOTDIR) cd $(RABIT); "$(MAKE)" clean; cd $(ROOTDIR)
doxygen:
doxygen doc/Doxyfile
# create standalone python tar file.
pypack: ${XGBOOST_DYLIB}
cp ${XGBOOST_DYLIB} python-package/xgboost
cd python-package; tar cf xgboost.tar xgboost; cd ..
# create pip source dist (sdist) pack for PyPI # create pip source dist (sdist) pack for PyPI
pippack: clean_all pippack: clean_all
rm -rf xgboost-python cd python-package; python setup.py sdist; mv dist/*.tar.gz ..; cd ..
# remove symlinked directories in python-package/xgboost
rm -rf python-package/xgboost/lib
rm -rf python-package/xgboost/dmlc-core
rm -rf python-package/xgboost/include
rm -rf python-package/xgboost/make
rm -rf python-package/xgboost/rabit
rm -rf python-package/xgboost/src
cp -r python-package xgboost-python
cp -r CMakeLists.txt xgboost-python/xgboost/
cp -r cmake xgboost-python/xgboost/
cp -r plugin xgboost-python/xgboost/
cp -r make xgboost-python/xgboost/
cp -r src xgboost-python/xgboost/
cp -r tests xgboost-python/xgboost/
cp -r include xgboost-python/xgboost/
cp -r dmlc-core xgboost-python/xgboost/
cp -r rabit xgboost-python/xgboost/
# Use setup_pip.py instead of setup.py
mv xgboost-python/setup_pip.py xgboost-python/setup.py
# Build sdist tarball
cd xgboost-python; python setup.py sdist; mv dist/*.tar.gz ..; cd ..
# Script to make a clean installable R package. # Script to make a clean installable R package.
Rpack: clean_all Rpack: clean_all
@@ -265,15 +133,16 @@ Rpack: clean_all
sed -i -e 's/@BACKTRACE_LIB@//g' xgboost/src/Makevars.win sed -i -e 's/@BACKTRACE_LIB@//g' xgboost/src/Makevars.win
sed -i -e 's/@OPENMP_LIB@//g' xgboost/src/Makevars.win sed -i -e 's/@OPENMP_LIB@//g' xgboost/src/Makevars.win
rm -f xgboost/src/Makevars.win-e # OSX sed create this extra file; remove it rm -f xgboost/src/Makevars.win-e # OSX sed create this extra file; remove it
bash R-package/remove_warning_suppression_pragma.sh bash xgboost/remove_warning_suppression_pragma.sh
rm xgboost/remove_warning_suppression_pragma.sh rm xgboost/remove_warning_suppression_pragma.sh
rm -rfv xgboost/tests/helper_scripts/
Rbuild: Rpack Rbuild: Rpack
R CMD build --no-build-vignettes xgboost R CMD build --no-build-vignettes xgboost
rm -rf xgboost rm -rf xgboost
Rcheck: Rbuild Rcheck: Rbuild
R CMD check xgboost*.tar.gz R CMD check --as-cran xgboost*.tar.gz
-include build/*.d -include build/*.d
-include build/*/*.d -include build/*/*.d

508
NEWS.md

@@ -3,6 +3,514 @@ XGBoost Change Log
This file records the changes in xgboost library in reverse chronological order. This file records the changes in xgboost library in reverse chronological order.
## v1.1.0 (2020.05.17)
### Better performance on multi-core CPUs (#5244, #5334, #5522)
* Poor performance scaling of the `hist` algorithm for multi-core CPUs has been under investigation (#3810). #5244 concludes the ongoing effort to improve performance scaling on multi-CPUs, in particular Intel CPUs. Roadmap: #5104
* #5334 makes steps toward reducing memory consumption for the `hist` tree method on CPU.
* #5522 optimizes random number generation for data sampling.
### Deterministic GPU algorithm for regression and classification (#5361)
* GPU algorithm for regression and classification tasks is now deterministic.
* Roadmap: #5023. Currently only single-GPU training is deterministic. Distributed training with multiple GPUs is not yet deterministic.
### Improve external memory support on GPUs (#5093, #5365)
* Starting from 1.0.0 release, we added support for external memory on GPUs to enable training with larger datasets. Gradient-based sampling (#5093) speeds up the external memory algorithm by intelligently sampling a subset of the training data to copy into the GPU memory. [Learn more about out-of-core GPU gradient boosting.](https://arxiv.org/abs/2005.09148)
* GPU-side data sketching now works with data from external memory (#5365).
### Parameter validation: detection of unused or incorrect parameters (#5477, #5569, #5508)
* Misspelled training parameters are a common user mistake. In previous versions of XGBoost, misspelled parameters were silently ignored. Starting with the 1.0.0 release, XGBoost produces a warning message if there are any unused training parameters. The 1.1.0 release makes parameter validation available to the scikit-learn interface (#5477) and the R binding (#5569).
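The idea behind parameter validation can be sketched in a few lines of plain Python. This is only an illustration, not XGBoost's implementation: `KNOWN_PARAMS` is a hypothetical subset of parameter names, `find_unused_params` is a made-up helper, and the "did you mean" hint is an addition for the example rather than something the release notes promise.

```python
import difflib

# Hypothetical subset of recognized training parameters, for illustration only.
KNOWN_PARAMS = {"max_depth", "eta", "gamma", "subsample", "reg_lambda", "verbosity"}


def find_unused_params(params):
    """Map each unrecognized key to its closest known name (or None)."""
    unused = {}
    for name in params:
        if name not in KNOWN_PARAMS:
            # Suggest at most one known parameter that is spelled similarly.
            matches = difflib.get_close_matches(name, sorted(KNOWN_PARAMS), n=1)
            unused[name] = matches[0] if matches else None
    return unused


report = find_unused_params({"max_depth": 6, "max_dept": 6, "foo": 1})
for name, suggestion in report.items():
    hint = f" (did you mean '{suggestion}'?)" if suggestion else ""
    print(f"WARNING: parameter '{name}' might not be used{hint}")
```

Warning instead of raising keeps existing scripts running while still surfacing the typo.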
### Thread-safe, in-place prediction method (#5389, #5512)
* Previously, the prediction method was not thread-safe (#5339). This release adds a new API function `inplace_predict()` that is thread-safe. It is now possible to serve concurrent requests for prediction using a shared model object.
* It is now possible to compute prediction in-place for selected data formats (`numpy.ndarray` / `scipy.sparse.csr_matrix` / `cupy.ndarray` / `cudf.DataFrame` / `pd.DataFrame`) without creating a `DMatrix` object.
### Addition of Accelerated Failure Time objective for survival analysis (#4763, #5473, #5486, #5552, #5553)
* Survival analysis (regression) models the time it takes for an event of interest to occur. The target label is potentially censored, i.e. the label is a range rather than a single number. We added a new objective `survival:aft` to support survival analysis. Also added is the new API to specify the ranged labels. Check out [the tutorial](https://xgboost.readthedocs.io/en/release_1.1.0/tutorials/aft_survival_analysis.html) and the [demos](https://github.com/dmlc/xgboost/tree/release_1.1.0/demo/aft_survival).
* GPU support is work in progress (#5714).
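As a rough illustration of what a "ranged label" means, the sketch below turns censored observations into parallel lower and upper bound lists. The `(kind, a, b)` record layout and the `to_label_bounds` helper are assumptions made for this example; the linked tutorial describes how the Python package attaches such bounds to a `DMatrix` (via the `label_lower_bound` and `label_upper_bound` fields).

```python
import math


def to_label_bounds(observations):
    """Convert (kind, a, b) records into parallel lower/upper bound lists.

    kind is a hypothetical tag: "exact", "right", "left", or "interval".
    """
    lower, upper = [], []
    for kind, a, b in observations:
        if kind == "exact":        # event observed exactly at time a
            lo, hi = a, a
        elif kind == "right":      # right-censored: event occurred after time a
            lo, hi = a, math.inf
        elif kind == "left":       # left-censored: event occurred before time a
            lo, hi = 0.0, a
        elif kind == "interval":   # event occurred somewhere between a and b
            lo, hi = a, b
        else:
            raise ValueError(f"unknown censoring kind: {kind!r}")
        lower.append(lo)
        upper.append(hi)
    return lower, upper


lower_bounds, upper_bounds = to_label_bounds([
    ("exact", 2.0, None),
    ("right", 3.0, None),
    ("left", 4.0, None),
    ("interval", 5.0, 8.0),
])
```

An uncensored label is simply a range whose two ends coincide, so one representation covers every case.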
### Improved installation experience on Mac OSX (#5597, #5602, #5606, #5701)
* It only takes two commands to install the XGBoost Python package: `brew install libomp` followed by `pip install xgboost`. The installed XGBoost will use all CPU cores. Even better, starting with this release, we distribute pre-compiled binary wheels targeting Mac OSX. Now the install command `pip install xgboost` finishes instantly, as it no longer compiles the C++ source of XGBoost. The last three Mac versions (High Sierra, Mojave, Catalina) are supported.
* R package: the 1.1.0 release fixes the error `Initializing libomp.dylib, but found libomp.dylib already initialized` (#5701)
### Ranking metrics are now accelerated on GPUs (#5380, #5387, #5398)
### GPU-side data matrix to ingest data directly from other GPU libraries (#5420, #5465)
* Previously, data on GPU memory had to be copied back to the main memory before it could be used by XGBoost. Starting with 1.1.0 release, XGBoost provides a dedicated interface (`DeviceQuantileDMatrix`) so that it can ingest data from GPU memory directly. The result is that XGBoost interoperates better with GPU-accelerated data science libraries, such as cuDF, cuPy, and PyTorch.
* Set device in device dmatrix. (#5596)
### Robust model serialization with JSON (#5123, #5217)
* We continue efforts from the 1.0.0 release to adopt JSON as the format to save and load models robustly. Refer to the release note for 1.0.0 to learn more.
* It is now possible to store internal configuration of the trained model (`Booster`) object in R as a JSON string (#5123, #5217).
### Improved integration with Dask
* Pass through `verbose` parameter for dask fit (#5413)
* Use `DMLC_TASK_ID`. (#5415)
* Order the prediction result. (#5416)
* Honor `nthreads` from dask worker. (#5414)
* Enable grid searching with scikit-learn. (#5417)
* Check non-equal when setting threads. (#5421)
* Accept other inputs for prediction. (#5428)
* Fix missing value for scikit-learn interface. (#5435)
### XGBoost4J-Spark: Check number of columns in the data iterator (#5202, #5303)
* Before, the native layer in XGBoost did not know the number of columns (features) ahead of time and had to guess the number of columns by counting the feature index when ingesting data. This method has a failure mode in the distributed setting: if the training data is highly sparse, some features may be completely missing in one or more worker partitions. Thus, one or more workers may deduce an incorrect data shape, leading to crashes or silently wrong models.
* Enforce correct data shape by passing the number of columns explicitly from the JVM layer into the native layer.
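The failure described above is easy to reproduce in a toy setting. The sketch below is plain Python, not the XGBoost4J-Spark code: rows are modeled as `{feature_index: value}` dictionaries, and `infer_num_columns` and `checked_num_columns` are hypothetical helpers.

```python
def infer_num_columns(partition):
    """Guess the column count as (largest feature index seen) + 1."""
    max_index = -1
    for row in partition:
        for feature_index in row:
            max_index = max(max_index, feature_index)
    return max_index + 1


def checked_num_columns(partition, expected):
    """Use the explicitly passed column count, rejecting rows that exceed it."""
    if infer_num_columns(partition) > expected:
        raise ValueError("partition contains a feature index beyond the declared shape")
    return expected


# Two worker partitions of the same sparse dataset with 5 features.
partitions = [
    [{0: 1.0, 4: 2.5}, {1: 0.3}],  # this worker happens to see feature 4
    [{0: 0.7}, {2: 1.1}],          # this worker never sees features 3 or 4
]

guessed = [infer_num_columns(p) for p in partitions]          # workers disagree
enforced = [checked_num_columns(p, 5) for p in partitions]    # workers agree
print(guessed, enforced)
```

Passing the shape explicitly removes the dependence on which rows land on which worker.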
### Major refactoring of the `DMatrix` class
* Continued from 1.0.0 release.
* Remove update prediction cache from predictors. (#5312)
* Predict on Ellpack. (#5327)
* Partial rewrite EllpackPage (#5352)
* Use ellpack for prediction only when sparsepage doesn't exist. (#5504)
* RFC: #4354, Roadmap: #5143
### Breaking: XGBoost Python package now requires Pip 19.0 and higher (#5589)
* Your Linux machine may have an old version of Pip and may attempt to install a source package, leading to long installation time. This is because we are now using `manylinux2010` tag in the binary wheel release. Ensure you have Pip 19.0 or newer by running `python3 -m pip -V` to check the version. Upgrade Pip with command
```
python3 -m pip install --upgrade pip
```
Upgrading to latest pip allows us to depend on newer versions of system libraries. [TensorFlow](https://www.tensorflow.org/install/pip) also requires Pip 19.0+.
### Breaking: GPU algorithm now requires CUDA 10.0 and higher (#5649)
* CUDA 10.0 is necessary to make the GPU algorithm deterministic (#5361).
### Breaking: `silent` parameter is now removed (#5476)
* Please use `verbosity` instead.
### Breaking: Set `output_margin` to True for custom objectives (#5564)
* Now both R and Python interface custom objectives get un-transformed (raw) prediction outputs.
### Breaking: `Makefile` is now removed. We use CMake exclusively to build XGBoost (#5513)
* Exception: the R package uses Autotools, as the CRAN ecosystem has not yet adopted CMake widely.
### Breaking: `distcol` updater is now removed (#5507)
* The `distcol` updater has long been broken, and we currently lack the resources to write a working implementation from scratch.
### Deprecation notices
* **Python 3.5**. This release is the last release to support Python 3.5. The following release (1.2.0) will require Python 3.6.
* **Scala 2.11**. Currently XGBoost4J supports Scala 2.11. However, if a future release of XGBoost adopts Spark 3, it will not support Scala 2.11, as Spark 3 requires Scala 2.12+. We do not yet know which XGBoost release will adopt Spark 3.
### Known limitations
* (Python package) When early stopping is activated with `early_stopping_rounds` at training time, the prediction method (`xgb.predict()`) behaves in a surprising way. If XGBoost runs for M rounds and chooses iteration N (N < M) as the best iteration, then the prediction method will use M trees by default. To use the best iteration (N trees), users will need to manually take the best iteration field `bst.best_iteration` and pass it as the `ntree_limit` argument to `xgb.predict()`. See #5209 and #4052 for additional context.
* GPU ranking objective is currently not deterministic (#5561).
* When training parameter `reg_lambda` is set to zero, some leaf nodes may be assigned a NaN value. (See [discussion](https://discuss.xgboost.ai/t/still-getting-unexplained-nans-new-replication-code/1383/9).) For now, please set `reg_lambda` to a nonzero value.
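For the early-stopping limitation in the first item above, a small wrapper can make the intended behavior explicit. This is a sketch under assumptions: it relies on the `best_ntree_limit` attribute that, to my understanding, the 1.x Python package sets next to `best_iteration` when early stopping triggers, and on `ntree_limit=0` meaning "use all trees". `FakeBooster` is a stand-in so the example runs without a trained model; check the attribute names against your installed version.

```python
def predict_with_best_iteration(booster, data):
    """Predict using only the trees up to the best early-stopping round.

    Falls back to all trees (ntree_limit=0) if early stopping never triggered.
    """
    limit = getattr(booster, "best_ntree_limit", 0)
    return booster.predict(data, ntree_limit=limit)


class FakeBooster:
    """Stand-in for a trained Booster; it only records the limit it was given."""

    def __init__(self, best_ntree_limit=None):
        if best_ntree_limit is not None:
            self.best_ntree_limit = best_ntree_limit

    def predict(self, data, ntree_limit=0):
        return {"rows": len(data), "ntree_limit": ntree_limit}


with_stop = predict_with_best_iteration(FakeBooster(best_ntree_limit=7), [[0.1], [0.2]])
without_stop = predict_with_best_iteration(FakeBooster(), [[0.1], [0.2]])
```

With a real model, `booster` would be the object returned by `xgboost.train()` and `data` a `DMatrix`.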
### Community and Governance
* The XGBoost Project Management Committee (PMC) is pleased to announce a new committer: Egor Smirnov (@SmirnovEgorRu). He has led a major initiative to improve the performance of XGBoost on multi-core CPUs.
### Bug-fixes
* Improved compatibility with scikit-learn (#5255, #5505, #5538)
* Remove f-string, since it's not supported by Python 3.5 (#5330). Note that Python 3.5 support is deprecated and scheduled to be dropped in the upcoming release (1.2.0).
* Fix the pruner so that it doesn't prune the same branch twice (#5335)
* Enforce only major version in JSON model schema (#5336). Any major revision of the model schema would bump up the major version.
* Fix a small typo in sklearn.py that broke multiple eval metrics (#5341)
* Restore loading model from a memory buffer (#5360)
* Define lazy isinstance for Python compat (#5364)
* [R] fixed uses of `class()` (#5426)
* Force compressed buffer to be 4 bytes aligned, to keep cuda-memcheck happy (#5441)
* Remove warning for calling host function (`std::max`) on a GPU device (#5453)
* Fix uninitialized value bug in xgboost callback (#5463)
* Fix model dump in CLI (#5485)
* Fix out-of-bound array access in `WQSummary::SetPrune()` (#5493)
* Ensure that configured `dmlc/build_config.h` is picked up by Rabit and XGBoost, to fix build on Alpine (#5514)
* Fix a misspelled method, made in a git merge (#5509)
* Fix a bug in binary model serialization (#5532)
* Fix CLI model IO (#5535)
* Don't use `uint` for threads (#5542)
* Fix R interaction constraints to handle more than 100000 features (#5543)
* [jvm-packages] XGBoost Spark should deal with NaN when parsing evaluation output (#5546)
* GPU-side data sketching is now aware of query groups in learning-to-rank data (#5551)
* Fix DMatrix slicing for newly added fields (#5552)
* Fix configuration status with loading binary model (#5562)
* Fix build when OpenMP is disabled (#5566)
* R compatibility patches (#5577, #5600)
* gpu\_hist performance fixes (#5558)
* Don't set seed on CLI interface (#5563)
* [R] When serializing model, preserve model attributes related to early stopping (#5573)
* Avoid rabit calls in learner configuration (#5581)
* Hide C++ symbols in libxgboost.so when building Python wheel (#5590). This fixes apache/incubator-tvm#4953.
* Fix compilation on Mac OSX High Sierra (10.13) (#5597)
* Fix build on big endian CPUs (#5617)
* Resolve crash due to use of `vector<bool>::iterator` (#5642)
* Validate JSON model dump using JSON schema (#5660)
### Performance improvements
* Wide dataset quantile performance improvement (#5306)
* Reduce memory usage of GPU-side data sketching (#5407)
* Reduce span check overhead (#5464)
* Serialise booster after training to free up GPU memory (#5484)
* Use the maximum amount of GPU shared memory available to speed up the histogram kernel (#5491)
* Use non-synchronising scan in Thrust (#5560)
* Use `cudaDeviceGetAttribute()` instead of `cudaGetDeviceProperties()` for speed (#5570)
### API changes
* Support importing data from a Pandas SparseArray (#5431)
* `HostDeviceVector` (vector shared between CPU and GPU memory) now exposes `HostSpan` interface, to enable access on the CPU side with bound check (#5459)
* Accept other gradient types for `SplitEntry` (#5467)
### Usability Improvements, Documentation
* Add `JVM_CHECK_CALL` to prevent C++ exceptions from leaking into the JVM layer (#5199)
* Updated Windows build docs (#5283)
* Update affiliation of @hcho3 (#5292)
* Display Sponsor button, link to OpenCollective (#5325)
* Update docs for GPU external memory (#5332)
* Add link to GPU documentation (#5437)
* Small updates to GPU documentation (#5483)
* Edits on tutorial for XGBoost job on Kubernetes (#5487)
* Add reference to GPU external memory (#5490)
* Fix typos (#5346, #5371, #5384, #5399, #5482, #5515)
* Update Python doc (#5517)
* Add Neptune and Optuna to list of examples (#5528)
* Raise error if the number of data weights doesn't match the number of data sets (#5540)
* Add a note about GPU ranking (#5572)
* Clarify meaning of `training` parameter in the C API function `XGBoosterPredict()` (#5604)
* Better error handling for situations where existing trees cannot be modified (#5406, #5418). This feature is enabled when `process_type` is set to `update`.
### Maintenance: testing, continuous integration, build system
* Add C++ test coverage for data sketching (#5251)
* Ignore gdb\_history (#5257)
* Rewrite setup.py. (#5271, #5280)
* Use `scikit-learn` in extra dependencies (#5310)
* Add CMake option to build static library (#5397)
* [R] changed FindLibR to take advantage of CMake cache (#5427)
* [R] fixed inconsistency in R -e calls in FindLibR.cmake (#5438)
* Refactor tests with data generator (#5439)
* Resolve failing Travis CI (#5445)
* Update dmlc-core. (#5466)
* [CI] Use clang-tidy 10 (#5469)
* De-duplicate code for checking maximum number of nodes (#5497)
* [CI] Use Ubuntu 18.04 LTS in JVM CI, because 19.04 is EOL (#5537)
* [jvm-packages] [CI] Create a Maven repository to host SNAPSHOT JARs (#5533)
* [jvm-packages] [CI] Publish XGBoost4J JARs with Scala 2.11 and 2.12 (#5539)
* [CI] Use Vault repository to re-gain access to devtoolset-4 (#5589)
### Maintenance: Refactor code for legibility and maintainability
* Move prediction cache to Learner (#5220, #5302)
* Remove SimpleCSRSource (#5315)
* Refactor SparsePageSource, delete cache files after use (#5321)
* Remove unnecessary DMatrix methods (#5324)
* Split up `LearnerImpl` (#5350)
* Move segment sorter to common (#5378)
* Move thread local entry into Learner (#5396)
* Split up test helpers header (#5455)
* Requires setting leaf stat when expanding tree (#5501)
* Purge device\_helpers.cuh (#5534)
* Use thrust functions instead of custom functions (#5544)
### Acknowledgement
**Contributors**: Nan Zhu (@CodingCat), Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Andrew Kane (@ankane), Avinash Barnwal (@avinashbarnwal), Bart Broere (@bartbroere), Andy Adinets (@canonizer), Chen Qin (@chenqin), Daiki Katsuragawa (@daikikatsuragawa), David Díaz Vico (@daviddiazvico), Darius Kharazi (@dkharazi), Darby Payne (@dpayne), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), Jan Borchmann (@jborchma), Kamil A. Kaczmarek (@kamil-kaczmarek), Melissa Kohl (@mjkohl32), Nicolas Scozzaro (@nscozzaro), Paul Kaefer (@paulkaefer), Rong Ou (@rongou), Samrat Pandiri (@samratp), Sriram Chandramouli (@sriramch), Yuan Tang (@terrytangyuan), Jiaming Yuan (@trivialfis), Liang-Chi Hsieh (@viirya), Bobby Wang (@wbo4958), Zhang Zhang (@zhangzhang10)
**Reviewers**: Nan Zhu (@CodingCat), @LeZhengThu, Rory Mitchell (@RAMitchell), @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Steve Bronder (@SteveBronder), Nikita Titov (@StrikerRUS), Andrew Kane (@ankane), Avinash Barnwal (@avinashbarnwal), @brydag, Andy Adinets (@canonizer), Chandra Shekhar Reddy (@chandrureddy), Chen Qin (@chenqin), Codecov (@codecov-io), David Díaz Vico (@daviddiazvico), Darby Payne (@dpayne), Jason E. Aten, Ph.D. (@glycerine), Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), @johnny-cat, Mu Li (@mli), Mate Soos (@msoos), @rnyak, Rong Ou (@rongou), Sriram Chandramouli (@sriramch), Toby Dylan Hocking (@tdhock), Yuan Tang (@terrytangyuan), Oleksandr Pryimak (@trams), Jiaming Yuan (@trivialfis), Liang-Chi Hsieh (@viirya), Bobby Wang (@wbo4958)
## v1.0.0 (2020.02.19)
This release marks a major milestone for the XGBoost project.
### Apache-style governance, contribution policy, and semantic versioning (#4646, #4659)
* Starting with the 1.0.0 release, the XGBoost Project is adopting Apache-style governance. The full community guideline is [available in the doc website](https://xgboost.readthedocs.io/en/release_1.0.0/contrib/community.html). Note that we now have a Project Management Committee (PMC) that will steward the project on a long-term basis. The PMC is also entrusted to run and fund the project's continuous integration (CI) infrastructure (https://xgboost-ci.net).
* We also adopt [semantic versioning](https://semver.org/). See [our release versioning policy](https://xgboost.readthedocs.io/en/release_1.0.0/contrib/release.html).
### Better performance scaling for multi-core CPUs (#4502, #4529, #4716, #4851, #5008, #5107, #5138, #5156)
* Poor performance scaling of the `hist` algorithm for multi-core CPUs has been under investigation (#3810). Previous effort #4529 was replaced with a series of pull requests (#5107, #5138, #5156) aimed at achieving the same performance benefits while keeping the C++ codebase legible. The latest performance benchmark results show [up to 5x speedup on Intel CPUs with many cores](https://github.com/dmlc/xgboost/pull/5156#issuecomment-580024413). Note: #5244, which concludes the effort, will become part of the upcoming release 1.1.0.
### Improved installation experience on Mac OSX (#4672, #5074, #5080, #5146, #5240)
* It used to be quite complicated to install XGBoost on Mac OSX. XGBoost uses OpenMP to distribute work among multiple CPU cores, and Mac's default C++ compiler (Apple Clang) does not come with OpenMP. The existing workaround (using another C++ compiler) was complex and prone to failing with cryptic diagnostics (#4933, #4949, #4969).
* Now it only takes two commands to install XGBoost: `brew install libomp` followed by `pip install xgboost`. The installed XGBoost will use all CPU cores.
* Even better, XGBoost is now available from Homebrew: `brew install xgboost`. See Homebrew/homebrew-core#50467.
* Previously, if you installed the XGBoost R package using the command `install.packages('xgboost')`, it could only use a single CPU core and you would experience slow training performance. With the 1.0.0 release, the R package will use all CPU cores out of the box.
### Distributed XGBoost now available on Kubernetes (#4621, #4939)
* Check out the [tutorial for setting up distributed XGBoost on a Kubernetes cluster](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/kubernetes.html).
### Ruby binding for XGBoost (#4856)
### New Native Dask interface for multi-GPU and multi-node scaling (#4473, #4507, #4617, #4819, #4907, #4914, #4941, #4942, #4951, #4973, #5048, #5077, #5144, #5270)
* XGBoost now integrates seamlessly with [Dask](https://dask.org/), a lightweight distributed framework for data processing. Together with the first-class support for cuDF data frames (see below), it is now easier than ever to create an end-to-end data pipeline running on one or more NVIDIA GPUs.
* Multi-GPU training with Dask is now up to 20% faster than the previous release (#4914, #4951).
### First-class support for cuDF data frames and cuPy arrays (#4737, #4745, #4794, #4850, #4891, #4902, #4918, #4927, #4928, #5053, #5189, #5194, #5206, #5219, #5225)
* [cuDF](https://github.com/rapidsai/cudf) is a data frame library for loading and processing tabular data on NVIDIA GPUs. It provides a Pandas-like API.
* [cuPy](https://github.com/cupy/cupy) implements a NumPy-compatible multi-dimensional array on NVIDIA GPUs.
* Now users can keep the data on the GPU memory throughout the end-to-end data pipeline, obviating the need for copying data between the main memory and GPU memory.
* XGBoost can accept any data structure that exposes the `__array_interface__` signature, opening the way to support other columnar formats that are compatible with Apache Arrow.
### [Feature interaction constraint](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/feature_interaction_constraint.html) is now available with `approx` and `gpu_hist` algorithms (#4534, #4587, #4596, #5034).
### Learning to rank is now GPU accelerated (#4873, #5004, #5129)
* Supported ranking objectives: NDCG, MAP, Pairwise.
* [Up to 2x improved training performance on GPUs](https://devblogs.nvidia.com/learning-to-rank-with-xgboost-and-gpu/).
### Enable `gamma` parameter for GPU training (#4874, #4953)
* The `gamma` parameter specifies the minimum loss reduction required to add a new split in a tree. A larger value for `gamma` has the effect of pre-pruning the tree, by making it harder to add splits.
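To make the effect of `gamma` concrete, here is a small sketch of the split-gain formula from the XGBoost documentation, where a split is kept only if its loss reduction exceeds `gamma`. The function name and the example gradient and Hessian sums are made up for illustration; this is not the GPU implementation.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a candidate split, net of the gamma penalty.

    g_* and h_* are sums of first- and second-order gradients in each child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    raw_gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return raw_gain - gamma


# The same candidate split evaluated under two gamma settings.
gain_small_gamma = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=1.0)
gain_large_gamma = split_gain(-4.0, 3.0, 5.0, 4.0, reg_lambda=1.0, gamma=5.0)
print(gain_small_gamma > 0, gain_large_gamma > 0)
```

The split survives with `gamma=1.0` but is rejected with `gamma=5.0`, which is the pre-pruning effect described above.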
### External memory for GPU training (#4486, #4526, #4747, #4833, #4879, #5014)
* It is now possible to use NVIDIA GPUs even when the size of training data exceeds the available GPU memory. Note that the external memory support for GPU is still experimental. #5093 will further improve performance and will become part of the upcoming release 1.1.0.
* RFC for enabling external memory with GPU algorithms: #4357
### Improve Scikit-Learn interface (#4558, #4842, #4929, #5049, #5151, #5130, #5227)
* Many users of XGBoost enjoy the convenience and breadth of the Scikit-Learn ecosystem. In this release, we revise the Scikit-Learn API of XGBoost (`XGBRegressor`, `XGBClassifier`, and `XGBRanker`) to achieve feature parity with the traditional XGBoost interface (`xgboost.train()`).
* Insert check to validate data shapes.
* Produce an error message if `eval_set` is not a tuple. An error message is better than silently crashing.
* Allow using a `numpy.random.RandomState` object.
* Add `n_jobs` as an alias of `nthread`.
* Roadmap: #5152
### XGBoost4J-Spark: Redesigning checkpointing mechanism
* RFC is available at #4786
* Clean up checkpoint file after a successful training job (#4754): The current implementation in XGBoost4J-Spark does not clean up the checkpoint file after a successful training job. If the user runs another job with the same checkpointing directory, she will get a wrong model because the second job will re-use the checkpoint file left over from the first job. To prevent this scenario, we propose to always clean up the checkpoint file after every successful training job.
* Avoid Multiple Jobs for Checkpointing (#5082): The current method for checkpointing is to collect the booster produced at the last iteration of each checkpoint interval to the Driver and persist it in HDFS. The major issue with this approach is that it needs to re-perform the data preparation for training if the user did not choose to cache the training dataset. To avoid re-performing data prep, we build external-memory checkpointing in the XGBoost4J layer as well.
* Enable deterministic repartitioning when checkpoint is enabled (#4807): The distributed algorithm for gradient boosting assumes a fixed partition of the training data across iterations. In previous versions, there was no guarantee that the data partition would stay the same, especially when a worker went down and some data had to be recovered from a previous checkpoint. In this release, we make the data partition deterministic by using the hash value of each data row in computing the partition.
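The idea behind deterministic repartitioning can be sketched as follows. This is a Python illustration only: XGBoost4J-Spark itself is written in Scala, and the helper names and the choice of SHA-256 here are assumptions made for the example, not the project's actual code.

```python
import hashlib


def partition_of(row, num_partitions):
    """Map a row to a partition using a hash of its contents.

    A content hash does not depend on row order or on which worker
    holds the row, so the assignment is identical on every run.
    """
    key = ",".join(repr(float(v)) for v in row).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_rows(rows, num_partitions):
    """Group rows into num_partitions buckets by content hash."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[partition_of(row, num_partitions)].append(row)
    return parts
```

Because the partition index depends only on the row contents, a worker that restarts from a checkpoint sees the same rows it held before the failure.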
### XGBoost4J-Spark: handle errors thrown by the native code (#4560)
* All core logic of XGBoost is written in C++, so XGBoost4J-Spark internally uses the C++ code via Java Native Interface (JNI). #4560 adds proper error handling for any errors or exceptions arising from the C++ code, so that the XGBoost Spark application can be torn down in an orderly fashion.
### XGBoost4J-Spark: Refine method to count the number of alive cores (#4858)
* The `SparkParallelismTracker` class ensures that a sufficient number of executor cores are alive. To that end, it is important to query the number of alive cores reliably.
### XGBoost4J: Add `BigDenseMatrix` to store more than `Integer.MAX_VALUE` elements (#4383)
### Robust model serialization with JSON (#4632, #4708, #4739, #4868, #4936, #4945, #4974, #5086, #5087, #5089, #5091, #5094, #5110, #5111, #5112, #5120, #5137, #5218, #5222, #5236, #5245, #5248, #5281)
* In this release, we introduce experimental support for using [JSON](https://www.json.org/json-en.html) to serialize (save/load) XGBoost models and related hyperparameters for training. We would like to eventually replace the old binary format with JSON, since it is an open format and parsers are available in many programming languages and platforms. See [the documentation for model I/O using JSON](https://xgboost.readthedocs.io/en/release_1.0.0/tutorials/saving_model.html). #3980 explains why JSON was chosen over other alternatives.
* To maximize interoperability and compatibility of the serialized models, we now split serialization into two parts (#4855):
1. Model, e.g. decision trees and strictly related metadata like `num_features`.
2. Internal configuration, consisting of training parameters and other configurable parameters. For example, `max_delta_step`, `tree_method`, `objective`, `predictor`, `gpu_id`.
Previously, users often ran into issues where the model file produced by one machine could not load or run on another machine. For example, models trained using a machine with an NVIDIA GPU could not run on another machine without a GPU (#5291, #5234). The reason is that the old binary format saved some internal configuration that was not universally applicable to all machines, e.g. `predictor='gpu_predictor'`.
Now, the model saving function (`Booster.save_model()` in Python) will save only the model, without internal configuration. This guarantees that your model file can be used anywhere. Internal configuration will be serialized in limited circumstances such as:
* Multiple nodes in a distributed system exchange model details over the network.
* Model checkpointing, to recover from possible crashes.
This work proved to be useful for parameter validation as well (see below).
* Starting with 1.0.0 release, we will use semantic versioning to indicate whether the model produced by one version of XGBoost would be compatible with another version of XGBoost. Any change in the major version indicates a breaking change in the serialization format.
* We now provide a robust method to save and load scikit-learn related attributes (#5245). Previously, we used Python pickle to save Python attributes related to `XGBClassifier`, `XGBRegressor`, and `XGBRanker` objects. The attributes are necessary to properly interact with scikit-learn. See #4639 for more details. The use of pickling hampered interoperability, as a pickle from one machine may not necessarily work on another machine. Starting with this release, we use an alternative method to serialize the scikit-learn related attributes. The use of Python pickle is now discouraged (#5236, #5281).
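The split between model and internal configuration, together with the major-version rule, can be sketched with a toy class. Everything below is hypothetical and for illustration only: `ToyBooster`, its JSON layout, and `formats_compatible` are not the real XGBoost schema or API. Only the parameter names (`num_features`, `predictor`, `gpu_id`, `tree_method`) come from the notes above.

```python
import json


class ToyBooster:
    """Hypothetical stand-in for a booster; not the real XGBoost class."""

    def __init__(self, trees, num_features, config):
        # Model: trees plus strictly related metadata.
        self.model = {"trees": trees, "num_features": num_features}
        # Internal configuration: machine- and training-specific settings.
        self.config = config

    def save_model(self):
        """Portable output: the model only, safe to load on any machine."""
        return json.dumps({"model": self.model})

    def serialize(self):
        """Full snapshot for checkpoints or node-to-node exchange."""
        return json.dumps({"model": self.model, "config": self.config})


def formats_compatible(saved_version, runtime_version):
    """Under semantic versioning, only a major-version change breaks the format."""
    return saved_version.split(".")[0] == runtime_version.split(".")[0]
```

In this sketch, a setting such as `predictor='gpu_predictor'` lives only in the full snapshot, so a file written by `save_model()` carries nothing that ties it to the training machine.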
### Parameter validation: detection of unused or incorrect parameters (#4553, #4577, #4738, #4801, #4961, #5101, #5157, #5167, #5256)
* Misspelled training parameters are a common user mistake. In previous versions of XGBoost, misspelled parameters were silently ignored. Starting with the 1.0.0 release, XGBoost will produce a warning message if there are any unused training parameters. Currently, parameter validation is available to R users and Python XGBoost API users. We are working to extend its support to scikit-learn users.
* Configuration steps now have well-defined semantics (#4542, #4738), so we know exactly where and how the internal configurable parameters are changed.
* The user can now use the `save_config()` function to inspect all (used) training parameters. This is helpful for debugging model performance.
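The kind of check described above can be sketched in a few lines. This is an illustration of the idea, not XGBoost's validation code: `KNOWN_PARAMS` is a deliberately tiny subset of real parameter names, and `find_unused` and `validate` are hypothetical helpers.

```python
import warnings

# Illustrative subset only; the real library knows many more parameters.
KNOWN_PARAMS = {"max_depth", "eta", "gamma", "objective", "tree_method"}


def find_unused(params, known=KNOWN_PARAMS):
    """Return the supplied parameter names that nothing would consume."""
    return sorted(name for name in params if name not in known)


def validate(params, known=KNOWN_PARAMS):
    """Warn about unused parameters instead of silently ignoring them."""
    unused = find_unused(params, known)
    if unused:
        warnings.warn("Parameters {} might not be used.".format(unused))
    return unused
```

A misspelling such as `max_dept` is reported rather than dropped, so the user learns that the intended `max_depth` was never set.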
### Allow individual workers to recover from faults (#4808, #4966)
* Status quo: if a worker fails, all workers are shut down and restarted, and learning resumes from the last checkpoint. This involves requesting resources from the scheduler (e.g. Spark) and shuffling all the data again from scratch. Both of these operations can be quite costly and block training for extended periods of time, especially if the training data is big and the number of worker nodes is in the hundreds.
* The proposed solution is to recover the single node that failed, instead of shutting down all workers. The rest of the cluster waits until the single failed worker is bootstrapped and catches up with the rest.
* See roadmap at #4753. Note that this is work in progress. In particular, the feature is not yet available from XGBoost4J-Spark.
### Accurate prediction for DART models
* Use DART tree weights when computing SHAPs (#5050)
* Don't drop trees during DART prediction by default (#5115)
* Fix DART prediction in R (#5204)
### Make external memory more robust
* Fix issues with training with external memory on cpu (#4487)
* Fix crash with approx tree method on cpu (#4510)
* Fix external memory race in `exact` (#4980). Note: `dmlc::ThreadedIter` is not actually thread-safe. We would like to re-design it in the long term.
### Major refactoring of the `DMatrix` class (#4686, #4744, #4748, #5044, #5092, #5108, #5188, #5198)
* Goal 1: Improve performance and reduce memory consumption. Right now, if the user trains a model with a NumPy array as training data, the array gets copied 2-3 times before training begins. We'd like to reduce duplication of the data matrix.
* Goal 2: Expose a common interface to external data, unify the way DMatrix objects are constructed and simplify the process of adding new external data sources. This work is essential for ingesting cuPy arrays.
* Goal 3: Handle missing values consistently.
* RFC: #4354, Roadmap: #5143
* This work is also relevant to external memory support on GPUs.
### Breaking: XGBoost Python package now requires Python 3.5 or newer (#5021, #5274)
* Python 3.4 reached its end-of-life on March 16, 2019, so we now require Python 3.5 or newer.
### Breaking: GPU algorithm now requires CUDA 9.0 and higher (#4527, #4580)
### Breaking: `n_gpus` parameter removed; multi-GPU training now requires a distributed framework (#4579, #4749, #4773, #4810, #4867, #4908)
* #4531 proposed removing support for single-process multi-GPU training. Contributors would focus on multi-GPU support through distributed frameworks such as Dask and Spark, where the framework would be expected to assign a worker process for each GPU independently. By delegating GPU management and data movement to the distributed framework, we can greatly simplify the core XGBoost codebase, make multi-GPU training more robust, and reduce burden for future development.
### Breaking: Some deprecated features have been removed
* ``gpu_exact`` training method (#4527, #4742, #4777). Use ``gpu_hist`` instead.
* ``learning_rates`` parameter in Python (#5155). Use the callback API instead.
* ``num_roots`` (#5059, #5165), since the current training code always uses a single root node.
* GPU-specific objectives (#4690), such as `gpu:reg:linear`. Use objectives without `gpu:` prefix; GPU will be used automatically if your machine has one.
### Breaking: the C API function `XGBoosterPredict()` now asks for an extra parameter `training`.
### Breaking: We now use CMake exclusively to build XGBoost. `Makefile` is being sunset.
* Exception: the R package uses Autotools, as the CRAN ecosystem has not yet adopted CMake widely.
### Performance improvements
* Smarter choice of histogram construction for distributed `gpu_hist` (#4519)
* Optimizations for quantization on device (#4572)
* Introduce caching memory allocator to avoid latency associated with GPU memory allocation (#4554, #4615)
* Optimize the initialization stage of the CPU `hist` algorithm for sparse datasets (#4625)
* Prevent unnecessary data copies from GPU memory to the host (#4795)
* Improve operation efficiency for single prediction (#5016)
* Group builder modified for incremental building, to speed up building large `DMatrix` (#5098)
### Bug-fixes
* Eliminate `FutureWarning: Series.base is deprecated` (#4337)
* Ensure pandas DataFrame column names are treated as strings in type error message (#4481)
* [jvm-packages] Add back `reg:linear` for scala, as it is only deprecated and not meant to be removed yet (#4490)
* Fix library loading for Cygwin users (#4499)
* Fix prediction from loaded pickle (#4516)
* Enforce exclusion between `pred_interactions=True` and `pred_contribs=True` (#4522)
* Do not return dangling reference to local `std::string` (#4543)
* Set the appropriate device before freeing device memory (#4566)
* Mark `SparsePageDmatrix` destructor default. (#4568)
* Choose the appropriate tree method only when the tree method is 'auto' (#4571)
* Fix `benchmark_tree.py` (#4593)
* [jvm-packages] Fix silly bug in feature scoring (#4604)
* Fix GPU predictor when the test data matrix has different number of features than the training data matrix used to train the model (#4613)
* Fix external memory for get column batches. (#4622)
* [R] Use built-in label when xgb.DMatrix is given to xgb.cv() (#4631)
* Fix early stopping in the Python package (#4638)
* Fix AUC error in distributed mode caused by imbalanced dataset (#4645, #4798)
* [jvm-packages] Expose `setMissing` method in `XGBoostClassificationModel` / `XGBoostRegressionModel` (#4643)
* Remove initializing stringstream reference. (#4788)
* [R] `xgb.get.handle` now checks all classes listed for `object` (#4800)
* Do not use `gpu_predictor` unless data comes from GPU (#4836)
* Fix data loading (#4862)
* Workaround `isnan` across different environments. (#4883)
* [jvm-packages] Handle Long-type parameter (#4885)
* Don't `set_params` at the end of `set_state` (#4947). Ensure that the model does not change after pickling and unpickling multiple times.
* C++ exceptions should not crash OpenMP loops (#4960)
* Fix `usegpu` flag in DART. (#4984)
* Run training with empty `DMatrix` (#4990, #5159)
* Ensure that no two processes can use the same GPU (#4990)
* Fix repeated split and 0 cover nodes (#5010)
* Reset histogram hit counter between multiple data batches (#5035)
* Fix `feature_name` created from int64index dataframe. (#5081)
* Don't use 0 for "fresh leaf" (#5084)
* Throw error when user attempts to use multi-GPU training and XGBoost has not been compiled with NCCL (#5170)
* Fix metric name loading (#5122)
* Quick fix for memory leak in CPU `hist` algorithm (#5153)
* Fix wrapping GPU ID and prevent data copying (#5160)
* Fix signature of Span constructor (#5166)
* Lazy initialization of device vector, so that XGBoost compiled with CUDA can run on a machine without any GPU (#5173)
* Model loading should not change system locale (#5314)
* Distributed training jobs would sometimes hang; revert Rabit to fix this regression (dmlc/rabit#132, #5237)
### API changes
* Add support for cross-validation using query ID (#4474)
* Enable feature importance property for DART model (#4525)
* Add `rmsle` metric and `reg:squaredlogerror` objective (#4541)
* All objective and evaluation metrics are now exposed to JVM packages (#4560)
* `dump_model()` and `get_dump()` now support exporting in GraphViz language (#4602)
* Support metrics `ndcg-` and `map-` (#4635)
* [jvm-packages] Allow chaining prediction (transform) in XGBoost4J-Spark (#4667)
* [jvm-packages] Add option to bypass missing value check in the Spark layer (#4805). Only use this option if you know what you are doing.
* [jvm-packages] Add public group getter (#4838)
* `XGDMatrixSetGroup` C API is now deprecated (#4864). Use `XGDMatrixSetUIntInfo` instead.
* [R] Added new `train_folds` parameter to `xgb.cv()` (#5114)
* Ingest meta information from Pandas DataFrame, such as data weights (#5216)
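For reference, the `rmsle` metric listed above is the root mean squared logarithmic error. The sketch below uses the standard definition in plain Python. The function is a hypothetical helper for illustration and is not taken from the XGBoost source.

```python
import math


def rmsle(preds, labels):
    """Root mean squared log error: sqrt(mean((log1p(pred) - log1p(label))^2))."""
    if len(preds) != len(labels) or not preds:
        raise ValueError("preds and labels must be non-empty and equally long")
    total = sum((math.log1p(p) - math.log1p(y)) ** 2
                for p, y in zip(preds, labels))
    return math.sqrt(total / len(preds))
```

Because the error is taken on a log scale, it penalizes relative rather than absolute differences, which suits targets that span several orders of magnitude.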
### Maintenance: Refactor code for legibility and maintainability
* De-duplicate GPU parameters (#4454)
* Simplify INI-style config reader using C++11 STL (#4478, #4521)
* Refactor histogram building code for `gpu_hist` (#4528)
* Overload device memory allocator, to enable instrumentation for compiling memory usage statistics (#4532)
* Refactor out row partitioning logic from `gpu_hist` (#4554)
* Remove an unused variable (#4588)
* Implement tree model dump with code generator, to de-duplicate code for generating dumps in 3 different formats (#4602)
* Remove `RowSet` class which is no longer being used (#4697)
* Remove some unused functions as reported by cppcheck (#4743)
* Mimic CUDA assert output in Span check (#4762)
* [jvm-packages] Refactor `XGBoost.scala` to put all params processing in one place (#4815)
* Add some comments for GPU row partitioner (#4832)
* Span: use `size_t` for `index_type`, add `front` and `back`. (#4935)
* Remove dead code in `exact` algorithm (#5034, #5105)
* Unify integer types used for row and column indices (#5034)
* Extract feature interaction constraint from `SplitEvaluator` class. (#5034)
* [Breaking] De-duplicate parameters and docstrings in the constructors of Scikit-Learn models (#5130)
* Remove benchmark code from GPU tests (#5141)
* Clean up Python 2 compatibility code. (#5161)
* Extensible binary serialization format for `DMatrix::MetaInfo` (#5187). This will be useful for implementing censored labels for survival analysis applications.
* Cleanup clang-tidy warnings. (#5247)
### Maintenance: testing, continuous integration, build system
* Use `yaml.safe_load` instead of `yaml.load`. (#4537)
* Ensure GCC is at least 5.x (#4538)
* Remove all mention of `reg:linear` from tests (#4544)
* [jvm-packages] Upgrade to Scala 2.12 (#4574)
* [jvm-packages] Update kryo dependency to 2.22 (#4575)
* [CI] Specify account ID when logging into ECR Docker registry (#4584)
* Use Sphinx 2.1+ to compile documentation (#4609)
* Make Pandas optional for running Python unit tests (#4620)
* Fix spark tests on machines with many cores (#4634)
* [jvm-packages] Update local dev build process (#4640)
* Add optional dependencies to setup.py (#4655)
* [jvm-packages] Fix maven warnings (#4664)
* Remove extraneous files from the R package, to comply with CRAN policy (#4699)
* Remove VC-2013 support, since it is not C++11 compliant (#4701)
* [CI] Fix broken installation of Pandas (#4704, #4722)
* [jvm-packages] Clean up temporary files after running tests (#4706)
* Specify version macro in CMake. (#4730)
* Include dmlc-tracker into XGBoost Python package (#4731)
* [CI] Use long key ID for Ubuntu repository fingerprints. (#4783)
* Remove plugin, cuda related code in automake & autoconf files (#4789)
* Skip related tests when scikit-learn is not installed. (#4791)
* Ignore vscode and clion files (#4866)
* Use bundled Google Test by default (#4900)
* [CI] Raise timeout threshold in Jenkins (#4938)
* Copy CMake parameter from dmlc-core. (#4948)
* Set correct file permission. (#4964)
* [CI] Update lint configuration to support latest pylint convention (#4971)
* [CI] Upload nightly builds to S3 (#4976, #4979)
* Add asan.so.5 to cmake script. (#4999)
* [CI] Fix Travis tests. (#5062)
* [CI] Locate vcomp140.dll from System32 directory (#5078)
* Implement training observer to dump internal states of objects (#5088). This will be useful for debugging.
* Fix visual studio output library directories (#5119)
* [jvm-packages] Comply with scala style convention + fix broken unit test (#5134)
* [CI] Repair download URL for Maven 3.6.1 (#5139)
* Don't use modernize-use-trailing-return-type in clang-tidy. (#5169)
* Explicitly use UTF-8 codepage when using MSVC (#5197)
* Add CMake option to run Undefined Behavior Sanitizer (UBSan) (#5211)
* Make some GPU tests deterministic (#5229)
* [R] Robust endian detection in CRAN xgboost build (#5232)
* Support FreeBSD (#5233)
* Make `pip install xgboost*.tar.gz` work by fixing build-python.sh (#5241)
* Fix compilation error due to 64-bit integer narrowing to `size_t` (#5250)
* Remove use of `std::cout` from R package, to comply with CRAN policy (#5261)
* Update DMLC-Core submodule (#4674, #4688, #4726, #4924)
* Update Rabit submodule (#4560, #4667, #4718, #4808, #4966, #5237)
### Usability Improvements, Documentation
* Add Random Forest API to Python API doc (#4500)
* Fix Python demo and doc. (#4545)
* Remove doc about not supporting cuda 10.1 (#4578)
* Address some sphinx warnings and errors, add doc for building doc. (#4589)
* Add instruction to run formatting checks locally (#4591)
* Fix docstring for `XGBModel.predict()` (#4592)
* Doc and demo for customized metric and objective (#4598, #4608)
* Add to documentation how to run tests locally (#4610)
* Empty evaluation list in early stopping should produce meaningful error message (#4633)
* Fixed year to 2019 in conf.py, helpers.h and LICENSE (#4661)
* Minor updates to links and grammar (#4673)
* Remove `silent` in doc (#4689)
* Remove old Python troubleshooting doc (#4729)
* Add `os.PathLike` support for file paths to DMatrix and Booster Python classes (#4757)
* Update XGBoost4J-Spark doc (#4804)
* Regular formatting for evaluation metrics (#4803)
* [jvm-packages] Refine documentation for handling missing values in XGBoost4J-Spark (#4805)
* Monitor for distributed environment (#4829). This is useful for identifying performance bottlenecks.
* Add check for length of weights and produce a good error message (#4872)
* Fix DMatrix doc (#4884)
* Export C++ headers in CMake installation (#4897)
* Update license year in README.md to 2019 (#4940)
* Fix incorrectly displayed Note in the doc (#4943)
* Follow PEP 257 Docstring Conventions (#4959)
* Document minimum version required for Google Test (#5001)
* Add better error message for invalid feature names (#5024)
* Some guidelines on device memory usage (#5038)
* [doc] Some notes for external memory. (#5065)
* Update document for `tree_method` (#5106)
* Update demo for ranking. (#5154)
* Add new lines for Spark XGBoost missing values section (#5180)
* Fix simple typo: utilty -> utility (#5182)
* Update R doc by roxygen2 (#5201)
* [R] Direct user to use `set.seed()` instead of setting `seed` parameter (#5125)
* Add Optuna badge to `README.md` (#5208)
* Fix compilation error in `c-api-demo.c` (#5215)
### Acknowledgement
**Contributors**: Nan Zhu (@CodingCat), Crissman Loomis (@Crissman), Cyprien Ricque (@Cyprien-Ricque), Evan Kepner (@EvanKepner), K.O. (@Hi-king), KaiJin Ji (@KerryJi), Peter Badida (@KeyWeeUsr), Kodi Arfer (@Kodiologist), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), Jacob Kim (@TheJacobKim), Vibhu Jawa (@VibhuJawa), Marcos (@astrowonk), Andy Adinets (@canonizer), Chen Qin (@chenqin), Christopher Cowden (@cowden), @cpfarrell, @david-cortes, Liangcai Li (@firestarman), @fuhaoda, Philip Hyunsu Cho (@hcho3), @here-nagini, Tong He (@hetong007), Michal Kurka (@michalkurka), Honza Sterba (@honzasterba), @iblumin, @koertkuipers, mattn (@mattn), Mingjie Tang (@merlintang), OrdoAbChao (@mglowacki100), Matthew Jones (@mt-jones), mitama (@nigimitama), Nathan Moore (@nmoorenz), Daniel Stahl (@phillyfan1138), Michaël Benesty (@pommedeterresautee), Rong Ou (@rongou), Sebastian (@sfahnens), Xu Xiao (@sperlingxx), @sriramch, Sean Owen (@srowen), Stephanie Yang (@stpyang), Yuan Tang (@terrytangyuan), Mathew Wicks (@thesuperzapper), Tim Gates (@timgates42), TinkleG (@tinkle1129), Oleksandr Pryimak (@trams), Jiaming Yuan (@trivialfis), Matvey Turkov (@turk0v), Bobby Wang (@wbo4958), yage (@yage99), @yellowdolphin
**Reviewers**: Nan Zhu (@CodingCat), Crissman Loomis (@Crissman), Cyprien Ricque (@Cyprien-Ricque), Evan Kepner (@EvanKepner), John Zedlewski (@JohnZed), KOLANICH (@KOLANICH), KaiJin Ji (@KerryJi), Kodi Arfer (@Kodiologist), Rory Mitchell (@RAMitchell), Egor Smirnov (@SmirnovEgorRu), Nikita Titov (@StrikerRUS), Jacob Kim (@TheJacobKim), Vibhu Jawa (@VibhuJawa), Andrew Kane (@ankane), Arno Candel (@arnocandel), Marcos (@astrowonk), Bryan Woods (@bryan-woods), Andy Adinets (@canonizer), Chen Qin (@chenqin), Thomas Franke (@coding-komek), Peter (@codingforfun), @cpfarrell, Joshua Patterson (@datametrician), @fuhaoda, Philip Hyunsu Cho (@hcho3), Tong He (@hetong007), Honza Sterba (@honzasterba), @iblumin, @jakirkham, Vadim Khotilovich (@khotilov), Keith Kraus (@kkraus14), @koertkuipers, @melonki, Mingjie Tang (@merlintang), OrdoAbChao (@mglowacki100), Daniel Mahler (@mhlr), Matthew Rocklin (@mrocklin), Matthew Jones (@mt-jones), Michaël Benesty (@pommedeterresautee), PSEUDOTENSOR / Jonathan McKinney (@pseudotensor), Rong Ou (@rongou), Vladimir (@sh1ng), Scott Lundberg (@slundberg), Xu Xiao (@sperlingxx), @sriramch, Pasha Stetsenko (@st-pasha), Stephanie Yang (@stpyang), Yuan Tang (@terrytangyuan), Mathew Wicks (@thesuperzapper), Theodore Vasiloudis (@thvasilo), TinkleG (@tinkle1129), Oleksandr Pryimak (@trams), Jiaming Yuan (@trivialfis), Bobby Wang (@wbo4958), yage (@yage99), @yellowdolphin, Yin Lou (@yinlou)
## v0.90 (2019.05.18)
### XGBoost Python package drops Python 2.x (#4379, #4381)

View File

@@ -6,8 +6,11 @@ file(GLOB_RECURSE R_SOURCES
${CMAKE_CURRENT_LIST_DIR}/src/*.c) ${CMAKE_CURRENT_LIST_DIR}/src/*.c)
# Use object library to expose symbols # Use object library to expose symbols
add_library(xgboost-r OBJECT ${R_SOURCES}) add_library(xgboost-r OBJECT ${R_SOURCES})
if (ENABLE_ALL_WARNINGS)
set(R_DEFINITIONS target_compile_options(xgboost-r PRIVATE -Wall -Wextra)
endif (ENABLE_ALL_WARNINGS)
target_compile_definitions(xgboost-r
PUBLIC
-DXGBOOST_STRICT_R_MODE=1 -DXGBOOST_STRICT_R_MODE=1
-DXGBOOST_CUSTOMIZE_GLOBAL_PRNG=1 -DXGBOOST_CUSTOMIZE_GLOBAL_PRNG=1
-DDMLC_LOG_BEFORE_THROW=0 -DDMLC_LOG_BEFORE_THROW=0
@@ -15,20 +18,27 @@ set(R_DEFINITIONS
-DDMLC_LOG_CUSTOMIZE=1 -DDMLC_LOG_CUSTOMIZE=1
-DRABIT_CUSTOMIZE_MSG_ -DRABIT_CUSTOMIZE_MSG_
-DRABIT_STRICT_CXX98_) -DRABIT_STRICT_CXX98_)
target_compile_definitions(xgboost-r
PRIVATE ${R_DEFINITIONS})
target_include_directories(xgboost-r target_include_directories(xgboost-r
PRIVATE PRIVATE
${LIBR_INCLUDE_DIRS} ${LIBR_INCLUDE_DIRS}
${PROJECT_SOURCE_DIR}/include ${PROJECT_SOURCE_DIR}/include
${PROJECT_SOURCE_DIR}/dmlc-core/include ${PROJECT_SOURCE_DIR}/dmlc-core/include
${PROJECT_SOURCE_DIR}/rabit/include) ${PROJECT_SOURCE_DIR}/rabit/include)
target_link_libraries(xgboost-r PUBLIC ${LIBR_CORE_LIBRARY})
if (USE_OPENMP)
find_package(OpenMP REQUIRED)
target_link_libraries(xgboost-r PUBLIC OpenMP::OpenMP_CXX OpenMP::OpenMP_C)
endif (USE_OPENMP)
set_target_properties( set_target_properties(
xgboost-r PROPERTIES xgboost-r PROPERTIES
CXX_STANDARD 11 CXX_STANDARD 14
CXX_STANDARD_REQUIRED ON CXX_STANDARD_REQUIRED ON
POSITION_INDEPENDENT_CODE ON) POSITION_INDEPENDENT_CODE ON)
set(XGBOOST_DEFINITIONS "${XGBOOST_DEFINITIONS};${R_DEFINITIONS}" PARENT_SCOPE) # Get compilation and link flags of xgboost-r and propagate to objxgboost
set(XGBOOST_OBJ_SOURCES $<TARGET_OBJECTS:xgboost-r> PARENT_SCOPE) target_link_libraries(objxgboost PUBLIC xgboost-r)
set(LINKED_LIBRARIES_PRIVATE ${LINKED_LIBRARIES_PRIVATE} ${LIBR_CORE_LIBRARY} PARENT_SCOPE) # Add all objects of xgboost-r to objxgboost
target_sources(objxgboost INTERFACE $<TARGET_OBJECTS:xgboost-r>)
set(LIBR_HOME "${LIBR_HOME}" PARENT_SCOPE)
set(LIBR_EXECUTABLE "${LIBR_EXECUTABLE}" PARENT_SCOPE)

View File

@@ -1,8 +1,8 @@
Package: xgboost Package: xgboost
Type: Package Type: Package
Title: Extreme Gradient Boosting Title: Extreme Gradient Boosting
Version: 1.0.0.1 Version: 1.2.0.1
Date: 2019-07-23 Date: 2020-08-28
Authors@R: c( Authors@R: c(
person("Tianqi", "Chen", role = c("aut"), person("Tianqi", "Chen", role = c("aut"),
email = "tianqi.tchen@gmail.com"), email = "tianqi.tchen@gmail.com"),
@@ -54,7 +54,8 @@ Suggests:
lintr, lintr,
igraph (>= 1.0.1), igraph (>= 1.0.1),
jsonlite, jsonlite,
float float,
crayon
Depends: Depends:
R (>= 3.3.0) R (>= 3.3.0)
Imports: Imports:
@@ -63,5 +64,5 @@ Imports:
data.table (>= 1.9.6), data.table (>= 1.9.6),
magrittr (>= 1.5), magrittr (>= 1.5),
stringi (>= 0.5.2) stringi (>= 0.5.2)
RoxygenNote: 7.0.2 RoxygenNote: 7.1.1
SystemRequirements: GNU make, C++11 SystemRequirements: GNU make, C++14

View File

@@ -14,6 +14,7 @@ S3method(setinfo,xgb.DMatrix)
S3method(slice,xgb.DMatrix) S3method(slice,xgb.DMatrix)
export("xgb.attr<-") export("xgb.attr<-")
export("xgb.attributes<-") export("xgb.attributes<-")
export("xgb.config<-")
export("xgb.parameters<-") export("xgb.parameters<-")
export(cb.cv.predict) export(cb.cv.predict)
export(cb.early.stop) export(cb.early.stop)
@@ -30,6 +31,7 @@ export(xgb.DMatrix)
export(xgb.DMatrix.save) export(xgb.DMatrix.save)
export(xgb.attr) export(xgb.attr)
export(xgb.attributes) export(xgb.attributes)
export(xgb.config)
export(xgb.create.features) export(xgb.create.features)
export(xgb.cv) export(xgb.cv)
export(xgb.dump) export(xgb.dump)
@@ -38,6 +40,7 @@ export(xgb.ggplot.deepness)
export(xgb.ggplot.importance) export(xgb.ggplot.importance)
export(xgb.importance) export(xgb.importance)
export(xgb.load) export(xgb.load)
export(xgb.load.raw)
export(xgb.model.dt.tree) export(xgb.model.dt.tree)
export(xgb.plot.deepness) export(xgb.plot.deepness)
export(xgb.plot.importance) export(xgb.plot.importance)
@@ -46,7 +49,9 @@ export(xgb.plot.shap)
export(xgb.plot.tree) export(xgb.plot.tree)
export(xgb.save) export(xgb.save)
export(xgb.save.raw) export(xgb.save.raw)
export(xgb.serialize)
export(xgb.train) export(xgb.train)
export(xgb.unserialize)
export(xgboost) export(xgboost)
import(methods) import(methods)
importClassesFrom(Matrix,dgCMatrix) importClassesFrom(Matrix,dgCMatrix)

View File

@@ -351,13 +351,13 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
finalizer <- function(env) { finalizer <- function(env) {
if (!is.null(env$bst)) { if (!is.null(env$bst)) {
attr_best_score = as.numeric(xgb.attr(env$bst$handle, 'best_score')) attr_best_score <- as.numeric(xgb.attr(env$bst$handle, 'best_score'))
if (best_score != attr_best_score) if (best_score != attr_best_score)
stop("Inconsistent 'best_score' values between the closure state: ", best_score, stop("Inconsistent 'best_score' values between the closure state: ", best_score,
" and the xgb.attr: ", attr_best_score) " and the xgb.attr: ", attr_best_score)
env$bst$best_iteration = best_iteration env$bst$best_iteration <- best_iteration
env$bst$best_ntreelimit = best_ntreelimit env$bst$best_ntreelimit <- best_ntreelimit
env$bst$best_score = best_score env$bst$best_score <- best_score
} else { } else {
env$basket$best_iteration <- best_iteration env$basket$best_iteration <- best_iteration
env$basket$best_ntreelimit <- best_ntreelimit env$basket$best_ntreelimit <- best_ntreelimit
@@ -372,7 +372,7 @@ cb.early.stop <- function(stopping_rounds, maximize = FALSE,
return(finalizer(env)) return(finalizer(env))
i <- env$iteration i <- env$iteration
score = env$bst_evaluation[metric_idx] score <- env$bst_evaluation[metric_idx]
if ((maximize && score > best_score) || if ((maximize && score > best_score) ||
(!maximize && score < best_score)) { (!maximize && score < best_score)) {
@@ -613,9 +613,7 @@ cb.gblinear.history <- function(sparse=FALSE) {
init <- function(env) { init <- function(env) {
if (!is.null(env$bst)) { # xgb.train: if (!is.null(env$bst)) { # xgb.train:
coef_path <- list()
} else if (!is.null(env$bst_folds)) { # xgb.cv: } else if (!is.null(env$bst_folds)) { # xgb.cv:
coef_path <- rep(list(), length(env$bst_folds))
} else stop("Parent frame has neither 'bst' nor 'bst_folds'") } else stop("Parent frame has neither 'bst' nor 'bst_folds'")
} }

View File

@@ -28,7 +28,7 @@ NVL <- function(x, val) {
# Merges booster params with whatever is provided in ... # Merges booster params with whatever is provided in ...
# plus runs some checks # plus runs some checks
check.booster.params <- function(params, ...) { check.booster.params <- function(params, ...) {
if (typeof(params) != "list") if (!identical(class(params), "list"))
stop("params must be a list") stop("params must be a list")
# in R interface, allow for '.' instead of '_' in parameter names # in R interface, allow for '.' instead of '_' in parameter names
@@ -69,16 +69,16 @@ check.booster.params <- function(params, ...) {
if (!is.null(params[['monotone_constraints']]) && if (!is.null(params[['monotone_constraints']]) &&
typeof(params[['monotone_constraints']]) != "character") { typeof(params[['monotone_constraints']]) != "character") {
vec2str = paste(params[['monotone_constraints']], collapse = ',') vec2str <- paste(params[['monotone_constraints']], collapse = ',')
vec2str = paste0('(', vec2str, ')') vec2str <- paste0('(', vec2str, ')')
params[['monotone_constraints']] = vec2str params[['monotone_constraints']] <- vec2str
} }
# interaction constraints parser (convert from list of column indices to string) # interaction constraints parser (convert from list of column indices to string)
if (!is.null(params[['interaction_constraints']]) && if (!is.null(params[['interaction_constraints']]) &&
typeof(params[['interaction_constraints']]) != "character"){ typeof(params[['interaction_constraints']]) != "character"){
# check input class # check input class
if (class(params[['interaction_constraints']]) != 'list') stop('interaction_constraints should be class list') if (!identical(class(params[['interaction_constraints']]), 'list')) stop('interaction_constraints should be class list')
if (!all(unique(sapply(params[['interaction_constraints']], class)) %in% c('numeric', 'integer'))) { if (!all(unique(sapply(params[['interaction_constraints']], class)) %in% c('numeric', 'integer'))) {
stop('interaction_constraints should be a list of numeric/integer vectors') stop('interaction_constraints should be a list of numeric/integer vectors')
} }
@@ -145,7 +145,8 @@ xgb.iter.update <- function(booster_handle, dtrain, iter, obj = NULL) {
if (is.null(obj)) { if (is.null(obj)) {
.Call(XGBoosterUpdateOneIter_R, booster_handle, as.integer(iter), dtrain) .Call(XGBoosterUpdateOneIter_R, booster_handle, as.integer(iter), dtrain)
} else { } else {
pred <- predict(booster_handle, dtrain, training = TRUE) pred <- predict(booster_handle, dtrain, outputmargin = TRUE, training = TRUE,
ntreelimit = 0)
gpair <- obj(pred, dtrain) gpair <- obj(pred, dtrain)
.Call(XGBoosterBoostOneIter_R, booster_handle, dtrain, gpair$grad, gpair$hess) .Call(XGBoosterBoostOneIter_R, booster_handle, dtrain, gpair$grad, gpair$hess)
} }
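With `outputmargin = TRUE`, the custom objective in `xgb.iter.update` receives raw margin scores rather than transformed predictions, so a logistic objective would have to apply the sigmoid itself. The sketch below illustrates that under stated assumptions: `logistic_grad_hess` is a hypothetical helper, and it takes the labels directly instead of an `xgb.DMatrix` so that it runs without the package.

```r
# Hypothetical gradient/Hessian of the logistic loss, computed from margins.
logistic_grad_hess <- function(margin, label) {
  prob <- 1 / (1 + exp(-margin))       # sigmoid: margin -> probability
  list(grad = prob - label,            # first-order gradient
       hess = prob * (1 - prob))       # second-order gradient
}

gpair <- logistic_grad_hess(margin = c(0, 2, -2), label = c(1, 1, 0))
```

In an objective passed to the training functions, the signature is `obj(preds, dtrain)` as in the hunk above, and the labels would come from `getinfo(dtrain, 'label')`.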
@@ -172,7 +173,7 @@ xgb.iter.eval <- function(booster_handle, watchlist, iter, feval = NULL) {
} else { } else {
res <- sapply(seq_along(watchlist), function(j) { res <- sapply(seq_along(watchlist), function(j) {
w <- watchlist[[j]] w <- watchlist[[j]]
preds <- predict(booster_handle, w) # predict using all trees preds <- predict(booster_handle, w, outputmargin = TRUE, ntreelimit = 0) # predict using all trees
eval_res <- feval(preds, w) eval_res <- feval(preds, w)
out <- eval_res$value out <- eval_res$value
names(out) <- paste0(evnames[j], "-", eval_res$metric) names(out) <- paste0(evnames[j], "-", eval_res$metric)
@@ -307,6 +308,68 @@ xgb.createFolds <- function(y, k = 10)
#' @name xgboost-deprecated #' @name xgboost-deprecated
NULL NULL
#' Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
#' models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.
#'
#' It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
#' \code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
#' \code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
#' the model is to be accessed in the future. If you train a model with the current version of
#' XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
#' accessible in later releases of XGBoost. To ensure that your model can be accessed in future
#' releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
#'
#' @details
#' Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
#' the JSON format by specifying the JSON extension. To read the model back, use
#' \code{\link{xgb.load}}.
#'
#' Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
#' in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
#' re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
#' The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
#' as part of another R object.
#'
#' Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
#' model but also internal configurations and parameters, and its format is not stable across
#' multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
#'
#' For more details and explanation about model persistence and archival, consult the page
#' \url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#'
#' # Save as a stand-alone file; load it with xgb.load()
#' xgb.save(bst, 'xgb.model')
#' bst2 <- xgb.load('xgb.model')
#'
#' # Save as a stand-alone file (JSON); load it with xgb.load()
#' xgb.save(bst, 'xgb.model.json')
#' bst2 <- xgb.load('xgb.model.json')
#' if (file.exists('xgb.model.json')) file.remove('xgb.model.json')
#'
#' # Save as a raw byte vector; load it with xgb.load.raw()
#' xgb_bytes <- xgb.save.raw(bst)
#' bst2 <- xgb.load.raw(xgb_bytes)
#'
#' # Persist XGBoost model as part of another R object
#' obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
#' # Persist the R object. Here, saveRDS() is okay, since it doesn't persist
#' # xgb.Booster directly. What's being persisted is the future-proof byte representation
#' # as given by xgb.save.raw().
#' saveRDS(obj, 'my_object.rds')
#' # Read back the R object
#' obj2 <- readRDS('my_object.rds')
#' # Re-construct xgb.Booster object from the bytes
#' bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
#' if (file.exists('my_object.rds')) file.remove('my_object.rds')
#'
#' @name a-compatibility-note-for-saveRDS-save
NULL
# Lookup table for the deprecated parameters bookkeeping # Lookup table for the deprecated parameters bookkeeping
depr_par_lut <- matrix(c( depr_par_lut <- matrix(c(
'print.every.n', 'print_every_n', 'print.every.n', 'print_every_n',


@@ -1,24 +1,39 @@
# Construct an internal xgboost Booster and return a handle to it. # Construct an internal xgboost Booster and return a handle to it.
# internal utility function # internal utility function
xgb.Booster.handle <- function(params = list(), cachelist = list(), modelfile = NULL) { xgb.Booster.handle <- function(params = list(), cachelist = list(),
modelfile = NULL) {
if (typeof(cachelist) != "list" || if (typeof(cachelist) != "list" ||
!all(vapply(cachelist, inherits, logical(1), what = 'xgb.DMatrix'))) { !all(vapply(cachelist, inherits, logical(1), what = 'xgb.DMatrix'))) {
stop("cachelist must be a list of xgb.DMatrix objects") stop("cachelist must be a list of xgb.DMatrix objects")
} }
## Load an existing model: dispatch between an on-disk model file and an in-memory buffer
handle <- .Call(XGBoosterCreate_R, cachelist)
if (!is.null(modelfile)) { if (!is.null(modelfile)) {
if (typeof(modelfile) == "character") { if (typeof(modelfile) == "character") {
## A filename
handle <- .Call(XGBoosterCreate_R, cachelist)
.Call(XGBoosterLoadModel_R, handle, modelfile[1]) .Call(XGBoosterLoadModel_R, handle, modelfile[1])
class(handle) <- "xgb.Booster.handle"
if (length(params) > 0) {
xgb.parameters(handle) <- params
}
return(handle)
} else if (typeof(modelfile) == "raw") { } else if (typeof(modelfile) == "raw") {
.Call(XGBoosterLoadModelFromRaw_R, handle, modelfile) ## A memory buffer
bst <- xgb.unserialize(modelfile)
xgb.parameters(bst) <- params
return (bst)
} else if (inherits(modelfile, "xgb.Booster")) { } else if (inherits(modelfile, "xgb.Booster")) {
## A booster object
bst <- xgb.Booster.complete(modelfile, saveraw = TRUE) bst <- xgb.Booster.complete(modelfile, saveraw = TRUE)
.Call(XGBoosterLoadModelFromRaw_R, handle, bst$raw) bst <- xgb.unserialize(bst$raw)
xgb.parameters(bst) <- params
return (bst)
} else { } else {
stop("modelfile must be either character filename, or raw booster dump, or xgb.Booster object") stop("modelfile must be either character filename, or raw booster dump, or xgb.Booster object")
} }
} }
## Create new model
handle <- .Call(XGBoosterCreate_R, cachelist)
class(handle) <- "xgb.Booster.handle" class(handle) <- "xgb.Booster.handle"
if (length(params) > 0) { if (length(params) > 0) {
xgb.parameters(handle) <- params xgb.parameters(handle) <- params
@@ -48,8 +63,8 @@ is.null.handle <- function(handle) {
return(FALSE) return(FALSE)
} }
# Return a verified to be valid handle out of either xgb.Booster.handle or xgb.Booster # Return a verified to be valid handle out of either xgb.Booster.handle or
# internal utility function # xgb.Booster internal utility function
xgb.get.handle <- function(object) { xgb.get.handle <- function(object) {
if (inherits(object, "xgb.Booster")) { if (inherits(object, "xgb.Booster")) {
handle <- object$handle handle <- object$handle
@@ -96,6 +111,8 @@ xgb.get.handle <- function(object) {
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") #' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#' saveRDS(bst, "xgb.model.rds") #' saveRDS(bst, "xgb.model.rds")
#' #'
#' # Warning: The resulting RDS file is only compatible with the current XGBoost version.
#' # Refer to the section titled "a-compatibility-note-for-saveRDS-save".
#' bst1 <- readRDS("xgb.model.rds") #' bst1 <- readRDS("xgb.model.rds")
#' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds") #' if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
#' # the handle is invalid: #' # the handle is invalid:
@@ -113,9 +130,29 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
if (is.null.handle(object$handle)) { if (is.null.handle(object$handle)) {
object$handle <- xgb.Booster.handle(modelfile = object$raw) object$handle <- xgb.Booster.handle(modelfile = object$raw)
} else { } else {
if (is.null(object$raw) && saveraw) if (is.null(object$raw) && saveraw) {
object$raw <- xgb.save.raw(object$handle) object$raw <- xgb.serialize(object$handle)
} }
}
attrs <- xgb.attributes(object)
if (!is.null(attrs$best_ntreelimit)) {
object$best_ntreelimit <- as.integer(attrs$best_ntreelimit)
}
if (!is.null(attrs$best_iteration)) {
## Convert from 0 based back to 1 based.
object$best_iteration <- as.integer(attrs$best_iteration) + 1
}
if (!is.null(attrs$best_score)) {
object$best_score <- as.numeric(attrs$best_score)
}
if (!is.null(attrs$best_msg)) {
object$best_msg <- attrs$best_msg
}
if (!is.null(attrs$niter)) {
object$niter <- as.integer(attrs$niter)
}
return(object) return(object)
} }
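The block added to `xgb.Booster.complete` restores R-side fields from booster attributes. The sketch below shows the same conversions on a made-up attribute list, assuming attributes come back as character strings (which the `as.integer()` and `as.numeric()` calls above suggest); all values are hypothetical.

```r
# Hypothetical attribute list, shaped like a result of xgb.attributes().
attrs <- list(best_iteration = "9", best_score = "0.0123", niter = "20")

restored <- list(
  # Stored 0-based, exposed 1-based in R.
  best_iteration = as.integer(attrs$best_iteration) + 1,
  best_score = as.numeric(attrs$best_score),
  niter = as.integer(attrs$niter)
)
```

Attributes that are absent stay `NULL` in the list, which is why each assignment in the diff is guarded by `!is.null(...)`.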
@@ -139,6 +176,8 @@ xgb.Booster.complete <- function(object, saveraw = TRUE) {
#' @param reshape whether to reshape the vector of predictions to a matrix form when there are several #' @param reshape whether to reshape the vector of predictions to a matrix form when there are several
#' prediction outputs per case. This option has no effect when either of predleaf, predcontrib, #' prediction outputs per case. This option has no effect when either of predleaf, predcontrib,
#' or predinteraction flags is TRUE. #' or predinteraction flags is TRUE.
#' @param training whether the prediction result is used for training. For the dart booster,
#' predicting in training mode will perform dropout.
#' @param ... Parameters passed to \code{predict.xgb.Booster} #' @param ... Parameters passed to \code{predict.xgb.Booster}
#' #'
#' @details #' @details
@@ -397,7 +436,7 @@ predict.xgb.Booster.handle <- function(object, ...) {
#' That would only matter if attributes need to be set many times. #' That would only matter if attributes need to be set many times.
#' Note, however, that when feeding a handle of an \code{xgb.Booster} object to the attribute setters, #' Note, however, that when feeding a handle of an \code{xgb.Booster} object to the attribute setters,
#' the raw model cache of an \code{xgb.Booster} object would not be automatically updated, #' the raw model cache of an \code{xgb.Booster} object would not be automatically updated,
#' and it would be user's responsibility to call \code{xgb.save.raw} to update it. #' and it would be user's responsibility to call \code{xgb.serialize} to update it.
#' #'
#' The \code{xgb.attributes<-} setter either updates the existing or adds one or several attributes, #' The \code{xgb.attributes<-} setter either updates the existing or adds one or several attributes,
#' but it doesn't delete the other existing attributes. #' but it doesn't delete the other existing attributes.
@@ -456,7 +495,7 @@ xgb.attr <- function(object, name) {
} }
.Call(XGBoosterSetAttr_R, handle, as.character(name[1]), value) .Call(XGBoosterSetAttr_R, handle, as.character(name[1]), value)
if (is(object, 'xgb.Booster') && !is.null(object$raw)) { if (is(object, 'xgb.Booster') && !is.null(object$raw)) {
object$raw <- xgb.save.raw(object$handle) object$raw <- xgb.serialize(object$handle)
} }
object object
} }
@@ -496,11 +535,41 @@ xgb.attributes <- function(object) {
.Call(XGBoosterSetAttr_R, handle, names(a[i]), a[[i]]) .Call(XGBoosterSetAttr_R, handle, names(a[i]), a[[i]])
} }
if (is(object, 'xgb.Booster') && !is.null(object$raw)) { if (is(object, 'xgb.Booster') && !is.null(object$raw)) {
object$raw <- xgb.save.raw(object$handle) object$raw <- xgb.serialize(object$handle)
} }
object object
} }
#' Accessors for model parameters as JSON string.
#'
#' @param object Object of class \code{xgb.Booster}
#' @param value A JSON string.
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' train <- agaricus.train
#'
#' bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
#' eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
#' config <- xgb.config(bst)
#'
#' @rdname xgb.config
#' @export
xgb.config <- function(object) {
handle <- xgb.get.handle(object)
.Call(XGBoosterSaveJsonConfig_R, handle)
}
#' @rdname xgb.config
#' @export
`xgb.config<-` <- function(object, value) {
handle <- xgb.get.handle(object)
.Call(XGBoosterLoadJsonConfig_R, handle, value)
object$raw <- NULL # force renew the raw buffer
object <- xgb.Booster.complete(object)
object
}
#' Accessors for model parameters. #' Accessors for model parameters.
#' #'
#' Only the setter for xgboost parameters is currently implemented. #' Only the setter for xgboost parameters is currently implemented.
@@ -537,7 +606,7 @@ xgb.attributes <- function(object) {
.Call(XGBoosterSetParam_R, handle, names(p[i]), p[[i]]) .Call(XGBoosterSetParam_R, handle, names(p[i]), p[[i]])
} }
if (is(object, 'xgb.Booster') && !is.null(object$raw)) { if (is(object, 'xgb.Booster') && !is.null(object$raw)) {
object$raw <- xgb.save.raw(object$handle) object$raw <- xgb.serialize(object$handle)
} }
object object
} }


@@ -188,9 +188,10 @@ getinfo <- function(object, ...) UseMethod("getinfo")
getinfo.xgb.DMatrix <- function(object, name, ...) { getinfo.xgb.DMatrix <- function(object, name, ...) {
if (typeof(name) != "character" || if (typeof(name) != "character" ||
length(name) != 1 || length(name) != 1 ||
!name %in% c('label', 'weight', 'base_margin', 'nrow')) { !name %in% c('label', 'weight', 'base_margin', 'nrow',
'label_lower_bound', 'label_upper_bound')) {
stop("getinfo: name must be one of the following\n", stop("getinfo: name must be one of the following\n",
" 'label', 'weight', 'base_margin', 'nrow'") " 'label', 'weight', 'base_margin', 'nrow', 'label_lower_bound', 'label_upper_bound'")
} }
if (name != "nrow"){ if (name != "nrow"){
ret <- .Call(XGDMatrixGetInfo_R, object, name) ret <- .Call(XGDMatrixGetInfo_R, object, name)
@@ -243,9 +244,19 @@ setinfo.xgb.DMatrix <- function(object, name, info, ...) {
.Call(XGDMatrixSetInfo_R, object, name, as.numeric(info)) .Call(XGDMatrixSetInfo_R, object, name, as.numeric(info))
return(TRUE) return(TRUE)
} }
if (name == "weight") { if (name == "label_lower_bound") {
if (length(info) != nrow(object)) if (length(info) != nrow(object))
stop("The length of weights must equal to the number of rows in the input data") stop("The length of lower-bound labels must equal to the number of rows in the input data")
.Call(XGDMatrixSetInfo_R, object, name, as.numeric(info))
return(TRUE)
}
if (name == "label_upper_bound") {
if (length(info) != nrow(object))
stop("The length of upper-bound labels must equal to the number of rows in the input data")
.Call(XGDMatrixSetInfo_R, object, name, as.numeric(info))
return(TRUE)
}
if (name == "weight") {
.Call(XGDMatrixSetInfo_R, object, name, as.numeric(info)) .Call(XGDMatrixSetInfo_R, object, name, as.numeric(info))
return(TRUE) return(TRUE)
} }
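The new `label_lower_bound` and `label_upper_bound` fields carry interval labels for censored data. The sketch below builds such bounds in base R, assuming the convention described in the AFT tutorial linked from the `xgb.train` documentation (an uncensored row has equal bounds, a right-censored row has an infinite upper bound); the data are made up.

```r
# Hypothetical survival data: observed time and whether the event was seen.
time <- c(5, 8, 3)
event_observed <- c(TRUE, FALSE, TRUE)

# Uncensored rows: lower == upper. Right-censored rows: upper is +Inf.
lower <- time
upper <- ifelse(event_observed, time, Inf)

# setinfo() rejects bounds whose length differs from the number of rows,
# so both vectors need one entry per observation.
bounds_ok <- length(lower) == length(time) && length(upper) == length(time)

# With an xgb.DMatrix `dtrain` holding these three rows, one would then call:
# setinfo(dtrain, 'label_lower_bound', lower)
# setinfo(dtrain, 'label_upper_bound', upper)
```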


@@ -83,5 +83,5 @@ xgb.create.features <- function(model, data, ...){
check.deprecation(...) check.deprecation(...)
pred_with_leaf <- predict(model, data, predleaf = TRUE) pred_with_leaf <- predict(model, data, predleaf = TRUE)
cols <- lapply(as.data.frame(pred_with_leaf), factor) cols <- lapply(as.data.frame(pred_with_leaf), factor)
cbind(data, sparse.model.matrix( ~ . -1, cols)) cbind(data, sparse.model.matrix(~ . -1, cols)) # nolint
} }


@@ -2,12 +2,15 @@
#' #'
#' The cross validation function of xgboost #' The cross validation function of xgboost
#' #'
#' @param params the list of parameters. Commonly used ones are: #' @param params the list of parameters. The complete list of parameters is
#' available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
#' is a shorter summary:
#' \itemize{ #' \itemize{
#' \item \code{objective} objective function, common ones are #' \item \code{objective} objective function, common ones are
#' \itemize{ #' \itemize{
#' \item \code{reg:squarederror} Regression with squared loss #' \item \code{reg:squarederror} Regression with squared loss.
#' \item \code{binary:logistic} logistic regression for classification #' \item \code{binary:logistic} logistic regression for classification.
#' \item See \code{\link[=xgb.train]{xgb.train}()} for the complete list of objectives.
#' } #' }
#' \item \code{eta} step size of each boosting step #' \item \code{eta} step size of each boosting step
#' \item \code{max_depth} maximum depth of the tree #' \item \code{max_depth} maximum depth of the tree
@@ -76,7 +79,7 @@
#' #'
#' All observations are used for both training and validation. #' All observations are used for both training and validation.
#' #'
#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation} #' Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
#' #'
#' @return #' @return
#' An object of class \code{xgb.cv.synchronous} with the following elements: #' An object of class \code{xgb.cv.synchronous} with the following elements:
@@ -101,7 +104,7 @@
#' (only available with early stopping). #' (only available with early stopping).
#' \item \code{pred} CV prediction values available when \code{prediction} is set. #' \item \code{pred} CV prediction values available when \code{prediction} is set.
#' It is either vector or matrix (see \code{\link{cb.cv.predict}}). #' It is either vector or matrix (see \code{\link{cb.cv.predict}}).
#' \item \code{models} a liost of the CV folds' models. It is only available with the explicit #' \item \code{models} a list of the CV folds' models. It is only available with the explicit
#' setting of the \code{cb.cv.predict(save_models = TRUE)} callback. #' setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
#' } #' }
#' #'
@@ -140,9 +143,9 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
} else if (inherits(data, 'xgb.DMatrix')) { } else if (inherits(data, 'xgb.DMatrix')) {
if (!is.null(label)) if (!is.null(label))
warning("xgb.cv: label will be ignored, since data is of type xgb.DMatrix") warning("xgb.cv: label will be ignored, since data is of type xgb.DMatrix")
cv_label = getinfo(data, 'label') cv_label <- getinfo(data, 'label')
} else { } else {
cv_label = label cv_label <- label
} }
# CV folds # CV folds
@@ -205,8 +208,8 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
basket <- list() basket <- list()
# extract parameters that can affect the relationship b/w #trees and #iterations # extract parameters that can affect the relationship b/w #trees and #iterations
num_class <- max(as.numeric(NVL(params[['num_class']], 1)), 1) num_class <- max(as.numeric(NVL(params[['num_class']], 1)), 1) # nolint
num_parallel_tree <- max(as.numeric(NVL(params[['num_parallel_tree']], 1)), 1) num_parallel_tree <- max(as.numeric(NVL(params[['num_parallel_tree']], 1)), 1) # nolint
# those are fixed for CV (no training continuation) # those are fixed for CV (no training continuation)
begin_iteration <- 1 begin_iteration <- 1
@@ -223,7 +226,7 @@ xgb.cv <- function(params=list(), data, nrounds, nfold, label = NULL, missing =
}) })
msg <- simplify2array(msg) msg <- simplify2array(msg)
bst_evaluation <- rowMeans(msg) bst_evaluation <- rowMeans(msg)
bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2) bst_evaluation_err <- sqrt(rowMeans(msg^2) - bst_evaluation^2) # nolint
for (f in cb$post_iter) f() for (f in cb$post_iter) f()
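The `bst_evaluation_err` line computes a per-metric spread across folds as the square root of the mean of squares minus the squared mean. The sketch below reproduces that arithmetic on a made-up matrix, assuming rows are metrics and columns are folds, as `simplify2array()` followed by `rowMeans()` implies.

```r
# Hypothetical evaluation results: one row per metric, one column per fold.
msg <- matrix(c(0.10, 0.12, 0.14,
                0.20, 0.25, 0.30),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("train-error", "test-error"), NULL))

fold_mean <- rowMeans(msg)
# Population standard deviation: sqrt(E[x^2] - E[x]^2).
fold_sd <- sqrt(rowMeans(msg^2) - fold_mean^2)

# The same quantity computed directly from deviations, for comparison.
direct_sd <- apply(msg, 1, function(x) sqrt(mean((x - mean(x))^2)))
```

This divides by the number of folds, so it is slightly smaller than what `sd()` reports, since `sd()` divides by one less than the number of folds.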


@@ -105,7 +105,7 @@ xgb.ggplot.deepness <- function(model = NULL, which = c("2x1", "max.depth", "med
# internal utility function # internal utility function
multiplot <- function(..., cols = 1) { multiplot <- function(..., cols = 1) {
plots <- list(...) plots <- list(...)
num_plots = length(plots) num_plots <- length(plots)
layout <- matrix(seq(1, cols * ceiling(num_plots / cols)), layout <- matrix(seq(1, cols * ceiling(num_plots / cols)),
ncol = cols, nrow = ceiling(num_plots / cols)) ncol = cols, nrow = ceiling(num_plots / cols))
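The layout matrix in `multiplot` assigns each plot a grid cell, filling column by column. A worked instance with made-up values (five plots in two columns):

```r
num_plots <- 5
cols <- 2

# Same expression as in multiplot(): enough rows to hold every plot.
layout <- matrix(seq(1, cols * ceiling(num_plots / cols)),
                 ncol = cols, nrow = ceiling(num_plots / cols))

# Grid cell (row, column) assigned to plot number 4.
cell_of_plot_4 <- which(layout == 4, arr.ind = TRUE)
```

The grid has six cells for five plots, so the cell numbered 6 is left empty.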


@@ -117,8 +117,7 @@ xgb.importance <- function(feature_names = NULL, model = NULL, trees = NULL,
Weight = weights, Weight = weights,
Class = seq_len(num_class) - 1)[order(Class, -abs(Weight))] Class = seq_len(num_class) - 1)[order(Class, -abs(Weight))]
} }
} else { } else { # tree model
# tree model
result <- xgb.model.dt.tree(feature_names = feature_names, result <- xgb.model.dt.tree(feature_names = feature_names,
text = model_text_dump, text = model_text_dump,
trees = trees)[ trees = trees)[


@@ -0,0 +1,14 @@
#' Load serialised xgboost model from R's raw vector
#'
#' User can generate raw memory buffer by calling xgb.save.raw
#'
#' @param buffer the buffer returned by xgb.save.raw
#'
#' @export
xgb.load.raw <- function(buffer) {
cachelist <- list()
handle <- .Call(XGBoosterCreate_R, cachelist)
.Call(XGBoosterLoadModelFromRaw_R, handle, buffer)
class(handle) <- "xgb.Booster.handle"
return (handle)
}


@@ -13,7 +13,11 @@
#' #'
#' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}} #' Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}}
#' or \code{\link[base]{save}}). However, it would then only be compatible with R, and #' or \code{\link[base]{save}}). However, it would then only be compatible with R, and
#' corresponding R-methods would need to be used to load it. #' corresponding R-methods would need to be used to load it. Moreover, persisting the model with
#' \code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in
#' future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
#' how to persist models in a future-proof way, i.e. to make the model accessible in future
#' releases of XGBoost.
#' #'
#' @seealso #' @seealso
#' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}. #' \code{\link{xgb.load}}, \code{\link{xgb.Booster.complete}}.


@@ -1,5 +1,5 @@
#' Save xgboost model to R's raw vector, #' Save xgboost model to R's raw vector,
#' user can call xgb.load to load the model back from raw vector #' user can call xgb.load.raw to load the model back from raw vector
#' #'
#' Save xgboost model from xgboost or xgb.train #' Save xgboost model from xgboost or xgb.train
#' #'
@@ -13,11 +13,11 @@
#' bst <- xgboost(data = train$data, label = train$label, max_depth = 2, #' bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
#' eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic") #' eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
#' raw <- xgb.save.raw(bst) #' raw <- xgb.save.raw(bst)
#' bst <- xgb.load(raw) #' bst <- xgb.load.raw(raw)
#' pred <- predict(bst, test$data) #' pred <- predict(bst, test$data)
#' #'
#' @export #' @export
xgb.save.raw <- function(model) { xgb.save.raw <- function(model) {
model <- xgb.get.handle(model) handle <- xgb.get.handle(model)
.Call(XGBoosterModelToRaw_R, model) .Call(XGBoosterModelToRaw_R, handle)
} }


@@ -0,0 +1,21 @@
#' Serialize the booster instance into R's raw vector. The serialization method differs
#' from \code{\link{xgb.save.raw}} as the latter one saves only the model but not
#' parameters. This serialization format is not stable across different xgboost versions.
#'
#' @param booster the booster instance
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' data(agaricus.test, package='xgboost')
#' train <- agaricus.train
#' test <- agaricus.test
#' bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
#' eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
#' raw <- xgb.serialize(bst)
#' bst <- xgb.unserialize(raw)
#'
#' @export
xgb.serialize <- function(booster) {
handle <- xgb.get.handle(booster)
.Call(XGBoosterSerializeToBuffer_R, handle)
}


@@ -3,9 +3,9 @@
#' \code{xgb.train} is an advanced interface for training an xgboost model. #' \code{xgb.train} is an advanced interface for training an xgboost model.
#' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}. #' The \code{xgboost} function is a simpler wrapper for \code{xgb.train}.
#' #'
#' @param params the list of parameters. #' @param params the list of parameters. The complete list of parameters is
#' The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}. #' available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
#' Below is a shorter summary: #' is a shorter summary:
#' #'
#' 1. General Parameters #' 1. General Parameters
#' #'
@@ -43,13 +43,23 @@
#' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below: #' \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
#' \itemize{ #' \itemize{
#' \item \code{reg:squarederror} Regression with squared loss (Default). #' \item \code{reg:squarederror} Regression with squared loss (Default).
#' \item \code{reg:squaredlogerror}: regression with squared log loss \eqn{1/2 * (log(pred + 1) - log(label + 1))^2}. All inputs are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.
#' \item \code{reg:logistic} logistic regression. #' \item \code{reg:logistic} logistic regression.
#' \item \code{reg:pseudohubererror}: regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability. #' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation. #' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives. #' \item \code{binary:hinge}: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
#' \item \code{count:poisson}: Poisson regression for count data, output mean of Poisson distribution. \code{max_delta_step} is set to 0.7 by default in Poisson regression (used to safeguard optimization).
#' \item \code{survival:cox}: Cox regression for right-censored survival time data (negative values are considered right-censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function \code{h(t) = h0(t) * HR}).
#' \item \code{survival:aft}: Accelerated failure time model for censored survival time data. See \href{https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html}{Survival Analysis with Accelerated Failure Time} for details.
#' \item \code{aft_loss_distribution}: Probability Density Function used by \code{survival:aft} and \code{aft-nloglik} metric.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}. #' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
#' \item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class. #' \item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss. #' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' \item \code{rank:ndcg}: Use LambdaMART to perform list-wise ranking where \href{https://en.wikipedia.org/wiki/Discounted_cumulative_gain}{Normalized Discounted Cumulative Gain (NDCG)} is maximized.
#' \item \code{rank:map}: Use LambdaMART to perform list-wise ranking where \href{https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision}{Mean Average Precision (MAP)} is maximized.
#' \item \code{reg:gamma}: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be \href{https://en.wikipedia.org/wiki/Gamma_distribution#Applications}{gamma-distributed}.
#' \item \code{reg:tweedie}: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be \href{https://en.wikipedia.org/wiki/Tweedie_distribution#Applications}{Tweedie-distributed}.
#' } #' }
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5 #' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
#' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section. #' \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
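The `reg:squaredlogerror` entry above gives the per-row loss as 1/2 * (log(pred + 1) - log(label + 1))^2. The sketch below evaluates that formula in base R; `squared_log_error` is a hypothetical helper written for illustration, not the library's implementation.

```r
# Hypothetical helper evaluating the documented loss row by row.
squared_log_error <- function(pred, label) {
  # The documentation requires all inputs to be greater than -1.
  stopifnot(all(pred > -1), all(label > -1))
  0.5 * (log1p(pred) - log1p(label))^2
}

# An exact prediction, and one that is off by a factor of ten on the (x + 1) scale.
loss <- squared_log_error(pred = c(0, 9), label = c(0, 99))
```

Because the loss works on the log scale, it penalises relative rather than absolute differences: predicting 9 for a label of 99 costs the same as predicting 99 for a label of 999.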
@@ -120,16 +130,16 @@
#' Note that when using a customized metric, only this single metric can be used. #' Note that when using a customized metric, only this single metric can be used.
#' The following is the list of built-in metrics for which Xgboost provides optimized implementation: #' The following is the list of built-in metrics for which Xgboost provides optimized implementation:
#' \itemize{ #' \itemize{
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} #' \item \code{rmse} root mean square error. \url{https://en.wikipedia.org/wiki/Root_mean_square_error}
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} #' \item \code{logloss} negative log-likelihood. \url{https://en.wikipedia.org/wiki/Log-likelihood}
#' \item \code{mlogloss} multiclass logloss. \url{http://wiki.fast.ai/index.php/Log_Loss} #' \item \code{mlogloss} multiclass logloss. \url{https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html}
#' \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}. #' \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' By default, it uses the 0.5 threshold for predicted values to define negative and positive instances. #' By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
#' Different threshold (e.g., 0.) could be specified as "error@0." #' Different threshold (e.g., 0.) could be specified as "error@0."
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}. #' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation. #' \item \code{auc} Area under the curve. \url{https://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
#' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation. #' \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG} #' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{https://en.wikipedia.org/wiki/NDCG}
#' } #' }
#' #'
#' The following callbacks are automatically created when certain parameters are set: #' The following callbacks are automatically created when certain parameters are set:
@@ -267,7 +277,7 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
} }
# evaluation printing callback # evaluation printing callback
params <- c(params, list(silent = ifelse(verbose > 1, 0, 1))) params <- c(params)
print_every_n <- max(as.integer(print_every_n), 1L) print_every_n <- max(as.integer(print_every_n), 1L)
if (!has.callbacks(callbacks, 'cb.print.evaluation') && if (!has.callbacks(callbacks, 'cb.print.evaluation') &&
verbose) { verbose) {
@@ -291,8 +301,10 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds, callbacks <- add.cb(callbacks, cb.early.stop(early_stopping_rounds,
maximize = maximize, verbose = verbose)) maximize = maximize, verbose = verbose))
} }
# Sort the callbacks into categories # Sort the callbacks into categories
cb <- categorize.callbacks(callbacks) cb <- categorize.callbacks(callbacks)
params['validate_parameters'] <- TRUE
if (!is.null(params[['seed']])) { if (!is.null(params[['seed']])) {
warning("xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.") warning("xgb.train: `seed` is ignored in R package. Use `set.seed()` instead.")
} }
@@ -319,9 +331,6 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
if (is_update && nrounds > niter_init) if (is_update && nrounds > niter_init)
stop("nrounds cannot be larger than ", niter_init, " (nrounds of xgb_model)") stop("nrounds cannot be larger than ", niter_init, " (nrounds of xgb_model)")
# TODO: distributed code
rank <- 0
niter_skip <- ifelse(is_update, 0, niter_init) niter_skip <- ifelse(is_update, 0, niter_init)
begin_iteration <- niter_skip + 1 begin_iteration <- niter_skip + 1
end_iteration <- niter_skip + nrounds end_iteration <- niter_skip + nrounds
@@ -333,7 +342,6 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) xgb.iter.update(bst$handle, dtrain, iteration - 1, obj)
bst_evaluation <- numeric(0)
if (length(watchlist) > 0) if (length(watchlist) > 0)
bst_evaluation <- xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval) bst_evaluation <- xgb.iter.eval(bst$handle, watchlist, iteration - 1, feval)
@@ -348,7 +356,7 @@ xgb.train <- function(params = list(), data, nrounds, watchlist = list(),
bst <- xgb.Booster.complete(bst, saveraw = TRUE) bst <- xgb.Booster.complete(bst, saveraw = TRUE)
# store the total number of boosting iterations # store the total number of boosting iterations
bst$niter = end_iteration bst$niter <- end_iteration
# store the evaluation results # store the evaluation results
if (length(evaluation_log) > 0 && if (length(evaluation_log) > 0 &&
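
As a rough illustration of the `error` metric documented in this file, the base-R sketch below computes `(# wrong cases) / (# all cases)` at a chosen threshold. The helper name `error_at` and the sample vectors are hypothetical and are not part of the xgboost package. Treating predictions strictly greater than the threshold as positive is an assumption that mirrors the `pred > 0.5` idiom used in the demos.

```r
# Hypothetical helper: binary classification error rate at a threshold.
# Assumption: predictions strictly greater than `threshold` count as positive.
error_at <- function(preds, labels, threshold = 0.5) {
  predicted_class <- as.integer(preds > threshold)
  mean(predicted_class != labels)
}

# Made-up predictions and labels, for illustration only.
preds <- c(0.1, 0.4, 0.6, 0.9)
labels <- c(0, 1, 1, 1)

err_default <- error_at(preds, labels)    # default 0.5 threshold
err_low <- error_at(preds, labels, 0.3)   # a lower threshold
print(c(err_default, err_low))
```

Moving the threshold changes which cases count as wrong, which is why the documentation lets a different threshold be passed as part of the metric name.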


@@ -0,0 +1,31 @@
#' Load the instance back from \code{\link{xgb.serialize}}
#'
#' @param buffer the buffer containing booster instance saved by \code{\link{xgb.serialize}}
#'
#' @export
xgb.unserialize <- function(buffer) {
cachelist <- list()
handle <- .Call(XGBoosterCreate_R, cachelist)
tryCatch(
.Call(XGBoosterUnserializeFromBuffer_R, handle, buffer),
error = function(e) {
error_msg <- conditionMessage(e)
m <- regexec("(src[\\\\/]learner.cc:[0-9]+): Check failed: (header == serialisation_header_)",
error_msg, perl = TRUE)
groups <- regmatches(error_msg, m)[[1]]
if (length(groups) == 3) {
warning(paste("The model had been generated by XGBoost version 1.0.0 or earlier and was ",
"loaded from a RDS file. We strongly ADVISE AGAINST using saveRDS() ",
"function, to ensure that your model can be read in current and upcoming ",
"XGBoost releases. Please use xgb.save() instead to preserve models for the ",
"long term. For more details and explanation, see ",
"https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html",
sep = ""))
.Call(XGBoosterLoadModelFromRaw_R, handle, buffer)
} else {
stop(e)
}
})
class(handle) <- "xgb.Booster.handle"
return (handle)
}
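
The `length(groups) == 3` check in the error handler above relies on how `regexec()` and `regmatches()` report capture groups: a successful match yields the full match plus one entry per group, and a failed match yields an empty vector. The sketch below shows this with base R only. Both error messages are made up for illustration and are not real XGBoost output.

```r
# Pattern with two capture groups, mirroring the one used in xgb.unserialize().
pattern <- "(src[\\\\/]learner.cc:[0-9]+): Check failed: (header == serialisation_header_)"

# Hypothetical error messages, invented for this sketch.
msg_match <- "[12:00:00] src/learner.cc:123: Check failed: header == serialisation_header_"
msg_other <- "some unrelated failure"

groups_match <- regmatches(msg_match, regexec(pattern, msg_match, perl = TRUE))[[1]]
groups_other <- regmatches(msg_other, regexec(pattern, msg_other, perl = TRUE))[[1]]

# Full match plus two groups on success; empty character vector otherwise.
print(length(groups_match))
print(length(groups_other))
```

Checking the length is a compact way to tell "the message matched the known legacy-format failure" apart from any other error, which is re-raised unchanged.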

R-package/configure vendored

@@ -613,6 +613,7 @@ infodir
docdir docdir
oldincludedir oldincludedir
includedir includedir
runstatedir
localstatedir localstatedir
sharedstatedir sharedstatedir
sysconfdir sysconfdir
@@ -682,6 +683,7 @@ datadir='${datarootdir}'
sysconfdir='${prefix}/etc' sysconfdir='${prefix}/etc'
sharedstatedir='${prefix}/com' sharedstatedir='${prefix}/com'
localstatedir='${prefix}/var' localstatedir='${prefix}/var'
runstatedir='${localstatedir}/run'
includedir='${prefix}/include' includedir='${prefix}/include'
oldincludedir='/usr/include' oldincludedir='/usr/include'
docdir='${datarootdir}/doc/${PACKAGE_TARNAME}' docdir='${datarootdir}/doc/${PACKAGE_TARNAME}'
@@ -934,6 +936,15 @@ do
| -silent | --silent | --silen | --sile | --sil) | -silent | --silent | --silen | --sile | --sil)
silent=yes ;; silent=yes ;;
-runstatedir | --runstatedir | --runstatedi | --runstated \
| --runstate | --runstat | --runsta | --runst | --runs \
| --run | --ru | --r)
ac_prev=runstatedir ;;
-runstatedir=* | --runstatedir=* | --runstatedi=* | --runstated=* \
| --runstate=* | --runstat=* | --runsta=* | --runst=* | --runs=* \
| --run=* | --ru=* | --r=*)
runstatedir=$ac_optarg ;;
-sbindir | --sbindir | --sbindi | --sbind | --sbin | --sbi | --sb) -sbindir | --sbindir | --sbindi | --sbind | --sbin | --sbi | --sb)
ac_prev=sbindir ;; ac_prev=sbindir ;;
-sbindir=* | --sbindir=* | --sbindi=* | --sbind=* | --sbin=* \ -sbindir=* | --sbindir=* | --sbindi=* | --sbind=* | --sbin=* \
@@ -1071,7 +1082,7 @@ fi
for ac_var in exec_prefix prefix bindir sbindir libexecdir datarootdir \ for ac_var in exec_prefix prefix bindir sbindir libexecdir datarootdir \
datadir sysconfdir sharedstatedir localstatedir includedir \ datadir sysconfdir sharedstatedir localstatedir includedir \
oldincludedir docdir infodir htmldir dvidir pdfdir psdir \ oldincludedir docdir infodir htmldir dvidir pdfdir psdir \
libdir localedir mandir libdir localedir mandir runstatedir
do do
eval ac_val=\$$ac_var eval ac_val=\$$ac_var
# Remove trailing slashes. # Remove trailing slashes.
@@ -1224,6 +1235,7 @@ Fine tuning of the installation directories:
--sysconfdir=DIR read-only single-machine data [PREFIX/etc] --sysconfdir=DIR read-only single-machine data [PREFIX/etc]
--sharedstatedir=DIR modifiable architecture-independent data [PREFIX/com] --sharedstatedir=DIR modifiable architecture-independent data [PREFIX/com]
--localstatedir=DIR modifiable single-machine data [PREFIX/var] --localstatedir=DIR modifiable single-machine data [PREFIX/var]
--runstatedir=DIR modifiable per-process data [LOCALSTATEDIR/run]
--libdir=DIR object code libraries [EPREFIX/lib] --libdir=DIR object code libraries [EPREFIX/lib]
--includedir=DIR C header files [PREFIX/include] --includedir=DIR C header files [PREFIX/include]
--oldincludedir=DIR C header files for non-gcc [/usr/include] --oldincludedir=DIR C header files for non-gcc [/usr/include]
@@ -2698,7 +2710,7 @@ fi
if test `uname -s` = "Darwin" if test `uname -s` = "Darwin"
then then
OPENMP_CXXFLAGS='-Xclang -fopenmp' OPENMP_CXXFLAGS='-Xclang -fopenmp'
OPENMP_LIB='/usr/local/lib/libomp.dylib' OPENMP_LIB='-lomp'
ac_pkg_openmp=no ac_pkg_openmp=no
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether OpenMP will work in a package" >&5 { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether OpenMP will work in a package" >&5
$as_echo_n "checking whether OpenMP will work in a package... " >&6; } $as_echo_n "checking whether OpenMP will work in a package... " >&6; }
@@ -2713,7 +2725,7 @@ main ()
return 0; return 0;
} }
_ACEOF _ACEOF
${CC} -o conftest conftest.c /usr/local/lib/libomp.dylib -Xclang -fopenmp 2>/dev/null && ./conftest && ac_pkg_openmp=yes ${CC} -o conftest conftest.c ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
{ $as_echo "$as_me:${as_lineno-$LINENO}: result: ${ac_pkg_openmp}" >&5 { $as_echo "$as_me:${as_lineno-$LINENO}: result: ${ac_pkg_openmp}" >&5
$as_echo "${ac_pkg_openmp}" >&6; } $as_echo "${ac_pkg_openmp}" >&6; }
if test "${ac_pkg_openmp}" = no; then if test "${ac_pkg_openmp}" = no; then


@@ -1,6 +1,6 @@
### configure.ac -*- Autoconf -*- ### configure.ac -*- Autoconf -*-
AC_PREREQ(2.62) AC_PREREQ(2.69)
AC_INIT([xgboost],[0.6-3],[],[xgboost],[]) AC_INIT([xgboost],[0.6-3],[],[xgboost],[])
@@ -29,11 +29,11 @@ fi
if test `uname -s` = "Darwin" if test `uname -s` = "Darwin"
then then
OPENMP_CXXFLAGS='-Xclang -fopenmp' OPENMP_CXXFLAGS='-Xclang -fopenmp'
OPENMP_LIB='/usr/local/lib/libomp.dylib' OPENMP_LIB='-lomp'
ac_pkg_openmp=no ac_pkg_openmp=no
AC_MSG_CHECKING([whether OpenMP will work in a package]) AC_MSG_CHECKING([whether OpenMP will work in a package])
AC_LANG_CONFTEST([AC_LANG_PROGRAM([[#include <omp.h>]], [[ return (omp_get_max_threads() <= 1); ]])]) AC_LANG_CONFTEST([AC_LANG_PROGRAM([[#include <omp.h>]], [[ return (omp_get_max_threads() <= 1); ]])])
${CC} -o conftest conftest.c /usr/local/lib/libomp.dylib -Xclang -fopenmp 2>/dev/null && ./conftest && ac_pkg_openmp=yes ${CC} -o conftest conftest.c ${OPENMP_LIB} ${OPENMP_CXXFLAGS} 2>/dev/null && ./conftest && ac_pkg_openmp=yes
AC_MSG_RESULT([${ac_pkg_openmp}]) AC_MSG_RESULT([${ac_pkg_openmp}])
if test "${ac_pkg_openmp}" = no; then if test "${ac_pkg_openmp}" = no; then
OPENMP_CXXFLAGS='' OPENMP_CXXFLAGS=''


@@ -61,7 +61,7 @@ pred2 <- predict(bst2, test$data)
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred)))) print(paste("sum(abs(pred2-pred))=", sum(abs(pred2 - pred))))
# save model to R's raw vector # save model to R's raw vector
raw = xgb.save.raw(bst) raw <- xgb.save.raw(bst)
# load binary model to R # load binary model to R
bst3 <- xgb.load(raw) bst3 <- xgb.load(raw)
pred3 <- predict(bst3, test$data) pred3 <- predict(bst3, test$data)
@@ -93,14 +93,14 @@ dtrain2 <- xgb.DMatrix("dtrain.buffer")
bst <- xgb.train(data = dtrain2, max_depth = 2, eta = 1, nrounds = 2, watchlist = watchlist, bst <- xgb.train(data = dtrain2, max_depth = 2, eta = 1, nrounds = 2, watchlist = watchlist,
nthread = 2, objective = "binary:logistic") nthread = 2, objective = "binary:logistic")
# information can be extracted from xgb.DMatrix using getinfo # information can be extracted from xgb.DMatrix using getinfo
label = getinfo(dtest, "label") label <- getinfo(dtest, "label")
pred <- predict(bst, dtest) pred <- predict(bst, dtest)
err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label) err <- as.numeric(sum(as.integer(pred > 0.5) != label)) / length(label)
print(paste("test-error=", err)) print(paste("test-error=", err))
# You can dump the tree you learned using xgb.dump into a text file # You can dump the tree you learned using xgb.dump into a text file
dump_path = file.path(tempdir(), 'dump.raw.txt') dump_path <- file.path(tempdir(), 'dump.raw.txt')
xgb.dump(bst, dump_path, with_stats = T) xgb.dump(bst, dump_path, with_stats = TRUE)
# Finally, you can check which features are the most important. # Finally, you can check which features are the most important.
print("Most important features (look at column Gain):") print("Most important features (look at column Gain):")


@@ -11,7 +11,7 @@ watchlist <- list(eval = dtest, train = dtrain)
# #
print('start running example to start from a initial prediction') print('start running example to start from a initial prediction')
# train xgboost for 1 round # train xgboost for 1 round
param <- list(max_depth=2, eta=1, nthread = 2, silent=1, objective='binary:logistic') param <- list(max_depth = 2, eta = 1, nthread = 2, objective = 'binary:logistic')
bst <- xgb.train(param, dtrain, 1, watchlist) bst <- xgb.train(param, dtrain, 1, watchlist)
# Note: we need the margin value instead of transformed prediction in set_base_margin # Note: we need the margin value instead of transformed prediction in set_base_margin
# do predict with output_margin=TRUE, will always give you margin values before logistic transformation # do predict with output_margin=TRUE, will always give you margin values before logistic transformation


@@ -9,7 +9,7 @@ require(e1071)
# Load Arthritis dataset in memory. # Load Arthritis dataset in memory.
data(Arthritis) data(Arthritis)
# Create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good). # Create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good).
df <- data.table(Arthritis, keep.rownames = F) df <- data.table(Arthritis, keep.rownames = FALSE)
# Let's add some new categorical features to see if it helps. Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features. # Let's add some new categorical features to see if it helps. Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but Tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features.
# For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treats them as independent values. # For the first feature we create groups of age by rounding the real age. Note that we transform it to factor (categorical data) so the algorithm treats them as independent values.


@@ -19,7 +19,7 @@ if (!require(vcd)) {
data(Arthritis) data(Arthritis)
# create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good). # create a copy of the dataset with data.table package (data.table is 100% compliant with R dataframe but its syntax is a lot more consistent and its performance are really good).
df <- data.table(Arthritis, keep.rownames = F) df <- data.table(Arthritis, keep.rownames = FALSE)
# Let's have a look to the data.table # Let's have a look to the data.table
cat("Print the dataset\n") cat("Print the dataset\n")
@@ -52,7 +52,7 @@ print(levels(df[,Treatment]))
# #
# Formulae Improved~.-1 used below means transform all categorical features but column Improved to binary values. # Formulae Improved~.-1 used below means transform all categorical features but column Improved to binary values.
# Column Improved is excluded because it will be our output column, the one we want to predict. # Column Improved is excluded because it will be our output column, the one we want to predict.
sparse_matrix = sparse.model.matrix(Improved~.-1, data = df) sparse_matrix <- sparse.model.matrix(Improved ~ . - 1, data = df)
cat("Encoding of the sparse Matrix\n") cat("Encoding of the sparse Matrix\n")
print(sparse_matrix) print(sparse_matrix)
@@ -61,7 +61,7 @@ print(sparse_matrix)
# 1. Set, for all rows, field in Y column to 0; # 1. Set, for all rows, field in Y column to 0;
# 2. set Y to 1 when Improved == Marked; # 2. set Y to 1 when Improved == Marked;
# 3. Return Y column # 3. Return Y column
output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y] output_vector <- df[, Y := 0][Improved == "Marked", Y := 1][, Y]
# Following is the same process as other demo # Following is the same process as other demo
cat("Learning...\n") cat("Learning...\n")


@@ -6,7 +6,7 @@ dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label) dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
nrounds <- 2 nrounds <- 2
param <- list(max_depth=2, eta=1, silent=1, nthread=2, objective='binary:logistic') param <- list(max_depth = 2, eta = 1, nthread = 2, objective = 'binary:logistic')
cat('running cross validation\n') cat('running cross validation\n')
# do cross validation, this will print result out as # do cross validation, this will print result out as
@@ -40,7 +40,7 @@ evalerror <- function(preds, dtrain) {
return(list(metric = "error", value = err)) return(list(metric = "error", value = err))
} }
param <- list(max_depth=2, eta=1, silent=1, param <- list(max_depth = 2, eta = 1,
objective = logregobj, eval_metric = evalerror) objective = logregobj, eval_metric = evalerror)
# train with customized objective # train with customized objective
xgb.cv(params = param, data = dtrain, nrounds = nrounds, nfold = 5) xgb.cv(params = param, data = dtrain, nrounds = nrounds, nfold = 5)


@@ -31,4 +31,3 @@ bst <- xgb.train(param, dtrain, num_round, watchlist)
ypred <- predict(bst, dtest) ypred <- predict(bst, dtest)
labels <- getinfo(dtest, 'label') labels <- getinfo(dtest, 'label')
cat('error of preds=', mean(as.numeric(ypred > 0.5) != labels), '\n') cat('error of preds=', mean(as.numeric(ypred > 0.5) != labels), '\n')


@@ -5,7 +5,9 @@ set.seed(1024)
# Function to obtain a list of interactions fitted in trees, requires input of maximum depth # Function to obtain a list of interactions fitted in trees, requires input of maximum depth
treeInteractions <- function(input_tree, input_max_depth) { treeInteractions <- function(input_tree, input_max_depth) {
trees <- copy(input_tree) # copy tree input to prevent overwriting ID_merge <- i.id <- i.feature <- NULL # Suppress warning "no visible binding for global variable"
trees <- data.table::copy(input_tree) # copy tree input to prevent overwriting
if (input_max_depth < 2) return(list()) # no interactions if max depth < 2 if (input_max_depth < 2) return(list()) # no interactions if max depth < 2
if (nrow(input_tree) == 1) return(list()) if (nrow(input_tree) == 1) return(list())
@@ -15,22 +17,25 @@ treeInteractions <- function(input_tree, input_max_depth){
parents_left <- trees[!is.na(Split), list(i.id = ID, i.feature = Feature, ID_merge = Yes)] parents_left <- trees[!is.na(Split), list(i.id = ID, i.feature = Feature, ID_merge = Yes)]
parents_right <- trees[!is.na(Split), list(i.id = ID, i.feature = Feature, ID_merge = No)] parents_right <- trees[!is.na(Split), list(i.id = ID, i.feature = Feature, ID_merge = No)]
setorderv(trees, 'ID_merge') data.table::setorderv(trees, 'ID_merge')
setorderv(parents_left, 'ID_merge') data.table::setorderv(parents_left, 'ID_merge')
setorderv(parents_right, 'ID_merge') data.table::setorderv(parents_right, 'ID_merge')
trees <- merge(trees, parents_left, by='ID_merge', all.x=T) trees <- merge(trees, parents_left, by = 'ID_merge', all.x = TRUE)
trees[!is.na(i.id), c(paste0('parent_', i-1), paste0('parent_feat_', i-1)):=list(i.id, i.feature)] trees[!is.na(i.id), c(paste0('parent_', i - 1), paste0('parent_feat_', i - 1))
:= list(i.id, i.feature)]
trees[, c('i.id', 'i.feature') := NULL] trees[, c('i.id', 'i.feature') := NULL]
trees <- merge(trees, parents_right, by='ID_merge', all.x=T) trees <- merge(trees, parents_right, by = 'ID_merge', all.x = TRUE)
trees[!is.na(i.id), c(paste0('parent_', i-1), paste0('parent_feat_', i-1)):=list(i.id, i.feature)] trees[!is.na(i.id), c(paste0('parent_', i - 1), paste0('parent_feat_', i - 1))
:= list(i.id, i.feature)]
trees[, c('i.id', 'i.feature') := NULL] trees[, c('i.id', 'i.feature') := NULL]
} }
# Extract nodes with interactions # Extract nodes with interactions
interaction_trees <- trees[!is.na(Split) & !is.na(parent_1), interaction_trees <- trees[!is.na(Split) & !is.na(parent_1),
c('Feature',paste0('parent_feat_',1:(input_max_depth-1))), with=F] c('Feature', paste0('parent_feat_', 1:(input_max_depth - 1))),
with = FALSE]
interaction_trees_split <- split(interaction_trees, 1:nrow(interaction_trees)) interaction_trees_split <- split(interaction_trees, 1:nrow(interaction_trees))
interaction_list <- lapply(interaction_trees_split, as.character) interaction_list <- lapply(interaction_trees_split, as.character)
@@ -48,13 +53,14 @@ treeInteractions <- function(input_tree, input_max_depth){
# Generate sample data # Generate sample data
x <- list() x <- list()
for (i in 1:10) { for (i in 1:10) {
x[[i]] = i*rnorm(1000, 10) x[[i]] <- i * rnorm(1000, 10)
} }
x <- as.data.table(x) x <- as.data.table(x)
y = -1*x[, rowSums(.SD)] + x[['V1']]*x[['V2']] + x[['V3']]*x[['V4']]*x[['V5']] + rnorm(1000, 0.001) + 3*sin(x[['V7']]) y <- -1 * x[, rowSums(.SD)] + x[['V1']] * x[['V2']] + x[['V3']] * x[['V4']] * x[['V5']] +
rnorm(1000, 0.001) + 3 * sin(x[['V7']])
train = as.matrix(x) train <- as.matrix(x)
# Interaction constraint list (column names form) # Interaction constraint list (column names form)
interaction_list <- list(c('V1', 'V2'), c('V3', 'V4', 'V5')) interaction_list <- list(c('V1', 'V2'), c('V3', 'V4', 'V5'))
@@ -65,38 +71,40 @@ cols2ids <- function(object, col_names) {
names(LUT) <- col_names names(LUT) <- col_names
rapply(object, function(x) LUT[x], classes = "character", how = "replace") rapply(object, function(x) LUT[x], classes = "character", how = "replace")
} }
interaction_list_fid = cols2ids(interaction_list, colnames(train)) interaction_list_fid <- cols2ids(interaction_list, colnames(train))
# Fit model with interaction constraints # Fit model with interaction constraints
bst = xgboost(data = train, label = y, max_depth = 4, bst <- xgboost(data = train, label = y, max_depth = 4,
eta = 0.1, nthread = 2, nrounds = 1000, eta = 0.1, nthread = 2, nrounds = 1000,
interaction_constraints = interaction_list_fid) interaction_constraints = interaction_list_fid)
bst_tree <- xgb.model.dt.tree(colnames(train), bst) bst_tree <- xgb.model.dt.tree(colnames(train), bst)
bst_interactions <- treeInteractions(bst_tree, 4) # interactions constrained to combinations of V1*V2 and V3*V4*V5 bst_interactions <- treeInteractions(bst_tree, 4)
# interactions constrained to combinations of V1*V2 and V3*V4*V5
# Fit model without interaction constraints # Fit model without interaction constraints
bst2 = xgboost(data = train, label = y, max_depth = 4, bst2 <- xgboost(data = train, label = y, max_depth = 4,
eta = 0.1, nthread = 2, nrounds = 1000) eta = 0.1, nthread = 2, nrounds = 1000)
bst2_tree <- xgb.model.dt.tree(colnames(train), bst2) bst2_tree <- xgb.model.dt.tree(colnames(train), bst2)
bst2_interactions <- treeInteractions(bst2_tree, 4) # much more interactions bst2_interactions <- treeInteractions(bst2_tree, 4) # much more interactions
# Fit model with both interaction and monotonicity constraints # Fit model with both interaction and monotonicity constraints
bst3 = xgboost(data = train, label = y, max_depth = 4, bst3 <- xgboost(data = train, label = y, max_depth = 4,
eta = 0.1, nthread = 2, nrounds = 1000, eta = 0.1, nthread = 2, nrounds = 1000,
interaction_constraints = interaction_list_fid, interaction_constraints = interaction_list_fid,
monotone_constraints = c(-1, 0, 0, 0, 0, 0, 0, 0, 0, 0)) monotone_constraints = c(-1, 0, 0, 0, 0, 0, 0, 0, 0, 0))
bst3_tree <- xgb.model.dt.tree(colnames(train), bst3) bst3_tree <- xgb.model.dt.tree(colnames(train), bst3)
bst3_interactions <- treeInteractions(bst3_tree, 4) # interactions still constrained to combinations of V1*V2 and V3*V4*V5 bst3_interactions <- treeInteractions(bst3_tree, 4)
# interactions still constrained to combinations of V1*V2 and V3*V4*V5
# Show monotonic constraints still apply by checking scores after incrementing V1 # Show monotonic constraints still apply by checking scores after incrementing V1
x1 <- sort(unique(x[['V1']])) x1 <- sort(unique(x[['V1']]))
for (i in 1:length(x1)){ for (i in 1:length(x1)){
testdata <- copy(x[, -c('V1')]) testdata <- copy(x[, -c('V1')])
testdata[['V1']] <- x1[i] testdata[['V1']] <- x1[i]
testdata <- testdata[, paste0('V',1:10), with=F] testdata <- testdata[, paste0('V', 1:10), with = FALSE]
pred <- predict(bst3, as.matrix(testdata)) pred <- predict(bst3, as.matrix(testdata))
# Should not print out anything due to monotonic constraints # Should not print out anything due to monotonic constraints
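
The `cols2ids` helper in this demo turns column names inside a nested constraint list into feature indices by way of a named lookup vector and `rapply()`. The sketch below reproduces that idea in base R. The definition of `LUT` as zero-based positions is an assumption, because that line falls outside the hunk shown, and the name `cols2ids_sketch` is hypothetical.

```r
# Sketch of a name-to-index lookup applied to a nested list.
# Assumption: LUT holds zero-based column positions (not visible in the hunk).
cols2ids_sketch <- function(object, col_names) {
  LUT <- seq_along(col_names) - 1
  names(LUT) <- col_names
  # Replace every character vector in the list by its looked-up indices.
  rapply(object, function(x) LUT[x], classes = "character", how = "replace")
}

col_names <- paste0("V", 1:10)
constraints <- list(c("V1", "V2"), c("V3", "V4", "V5"))
ids <- cols2ids_sketch(constraints, col_names)
str(ids)
```

Using `how = "replace"` keeps the list structure intact, so each group of column names becomes a group of indices in the same position.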


@@ -1,7 +1,6 @@
data(mtcars) data(mtcars)
head(mtcars) head(mtcars)
bst = xgboost(data=as.matrix(mtcars[,-11]),label=mtcars[,11], bst <- xgboost(data = as.matrix(mtcars[, -11]), label = mtcars[, 11],
objective = 'count:poisson', nrounds = 5) objective = 'count:poisson', nrounds = 5)
pred = predict(bst,as.matrix(mtcars[,-11])) pred <- predict(bst, as.matrix(mtcars[, -11]))
sqrt(mean((pred - mtcars[, 11]) ^ 2)) sqrt(mean((pred - mtcars[, 11]) ^ 2))


@@ -5,19 +5,19 @@ data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label) dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label) dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
param <- list(max_depth=2, eta=1, silent=1, objective='binary:logistic') param <- list(max_depth = 2, eta = 1, objective = 'binary:logistic')
watchlist <- list(eval = dtest, train = dtrain) watchlist <- list(eval = dtest, train = dtrain)
nrounds = 2 nrounds <- 2
# training the model for two rounds # training the model for two rounds
bst = xgb.train(param, dtrain, nrounds, nthread = 2, watchlist) bst <- xgb.train(param, dtrain, nrounds, nthread = 2, watchlist)
cat('start testing prediction from first n trees\n') cat('start testing prediction from first n trees\n')
labels <- getinfo(dtest, 'label') labels <- getinfo(dtest, 'label')
### predict using first 1 tree ### predict using first 1 tree
ypred1 = predict(bst, dtest, ntreelimit=1) ypred1 <- predict(bst, dtest, ntreelimit = 1)
# by default, we predict using all the trees # by default, we predict using all the trees
ypred2 = predict(bst, dtest) ypred2 <- predict(bst, dtest)
cat('error of ypred1=', mean(as.numeric(ypred1 > 0.5) != labels), '\n') cat('error of ypred1=', mean(as.numeric(ypred1 > 0.5) != labels), '\n')
cat('error of ypred2=', mean(as.numeric(ypred2 > 0.5) != labels), '\n') cat('error of ypred2=', mean(as.numeric(ypred2 > 0.5) != labels), '\n')


@@ -10,18 +10,18 @@ data(agaricus.test, package='xgboost')
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label) dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label) dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)
param <- list(max_depth=2, eta=1, silent=1, objective='binary:logistic') param <- list(max_depth = 2, eta = 1, objective = 'binary:logistic')
nrounds = 4 nrounds <- 4
# training the model for four rounds # training the model for four rounds
bst = xgb.train(params = param, data = dtrain, nrounds = nrounds, nthread = 2) bst <- xgb.train(params = param, data = dtrain, nrounds = nrounds, nthread = 2)
# Model accuracy without new features # Model accuracy without new features
accuracy.before <- sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label) / length(agaricus.test$label) accuracy.before <- (sum((predict(bst, agaricus.test$data) >= 0.5) == agaricus.test$label)
/ length(agaricus.test$label))
# by default, we predict using all the trees # by default, we predict using all the trees
pred_with_leaf <- predict(bst, dtest, predleaf = TRUE)
pred_with_leaf = predict(bst, dtest, predleaf = TRUE)
head(pred_with_leaf) head(pred_with_leaf)
create.new.tree.features <- function(model, original.features){ create.new.tree.features <- function(model, original.features){
@@ -47,7 +47,9 @@ watchlist <- list(train = new.dtrain)
bst <- xgb.train(params = param, data = new.dtrain, nrounds = nrounds, nthread = 2) bst <- xgb.train(params = param, data = new.dtrain, nrounds = nrounds, nthread = 2)
# Model accuracy with new features # Model accuracy with new features
accuracy.after <- sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label) / length(agaricus.test$label) accuracy.after <- (sum((predict(bst, new.dtest) >= 0.5) == agaricus.test$label)
/ length(agaricus.test$label))
# Here the accuracy was already good and is now perfect. # Here the accuracy was already good and is now perfect.
cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now", accuracy.after, "!\n")) cat(paste("The accuracy was", accuracy.before, "before adding leaf features and it is now",
accuracy.after, "!\n"))


@@ -1,14 +1,14 @@
# running all scripts in demo folder # running all scripts in demo folder
demo(basic_walkthrough) demo(basic_walkthrough, package = 'xgboost')
demo(custom_objective) demo(custom_objective, package = 'xgboost')
demo(boost_from_prediction) demo(boost_from_prediction, package = 'xgboost')
demo(predict_first_ntree) demo(predict_first_ntree, package = 'xgboost')
demo(generalized_linear_model) demo(generalized_linear_model, package = 'xgboost')
demo(cross_validation) demo(cross_validation, package = 'xgboost')
demo(create_sparse_matrix) demo(create_sparse_matrix, package = 'xgboost')
demo(predict_leaf_indices) demo(predict_leaf_indices, package = 'xgboost')
demo(early_stopping) demo(early_stopping, package = 'xgboost')
demo(poisson_regression) demo(poisson_regression, package = 'xgboost')
demo(caret_wrapper) demo(caret_wrapper, package = 'xgboost')
demo(tweedie_regression) demo(tweedie_regression, package = 'xgboost')
#demo(gpu_accelerated) # can only run when built with GPU support #demo(gpu_accelerated, package = 'xgboost') # can only run when built with GPU support

R-package/demo/tweedie_regression.R Executable file → Normal file

@@ -13,7 +13,7 @@ exclude <- c('POLICYNO', 'PLCYDATE', 'CLM_FREQ5', 'CLM_AMT5', 'CLM_FLAG', 'IN_Y
# retains the missing values # retains the missing values
# NOTE: this dataset comes ready out of the box # NOTE: this dataset comes ready out of the box
options(na.action = 'na.pass') options(na.action = 'na.pass')
x <- sparse.model.matrix(~ . - 1, data = dt[, -exclude, with = F]) x <- sparse.model.matrix(~ . - 1, data = dt[, -exclude, with = FALSE])
options(na.action = 'na.omit') options(na.action = 'na.omit')
# response # response


@@ -0,0 +1,96 @@
# [description]
# Create a definition file (.def) from a .dll file, using objdump. This
# is used by FindLibR.cmake when building the R package with MSVC.
#
# [usage]
#
# Rscript make-r-def.R something.dll something.def
#
# [references]
# * https://www.cs.colorado.edu/~main/cs1300/doc/mingwfaq.html
args <- commandArgs(trailingOnly = TRUE)
IN_DLL_FILE <- args[[1L]]
OUT_DEF_FILE <- args[[2L]]
DLL_BASE_NAME <- basename(IN_DLL_FILE)
message(sprintf("Creating '%s' from '%s'", OUT_DEF_FILE, IN_DLL_FILE))
# system() will not raise an R exception if the process called
# fails. Wrapping it here to get that behavior.
#
# system() introduces a lot of overhead, at least on Windows,
# so trying processx if it is available
.pipe_shell_command_to_stdout <- function(command, args, out_file) {
has_processx <- suppressMessages({
suppressWarnings({
require("processx") # nolint
})
})
if (has_processx) {
p <- processx::process$new(
command = command
, args = args
, stdout = out_file
, windows_verbatim_args = FALSE
)
invisible(p$wait())
} else {
message(paste0(
"Using system2() to run shell commands. Installing "
, "'processx' with install.packages('processx') might "
, "make this faster."
))
exit_code <- system2(
command = command
, args = shQuote(args)
, stdout = out_file
)
if (exit_code != 0L) {
stop(paste0("Command failed with exit code: ", exit_code))
}
}
return(invisible(NULL))
}
# use objdump to dump all the symbols
OBJDUMP_FILE <- "objdump-out.txt"
.pipe_shell_command_to_stdout(
command = "objdump"
, args = c("-p", IN_DLL_FILE)
, out_file = OBJDUMP_FILE
)
objdump_results <- readLines(OBJDUMP_FILE)
result <- file.remove(OBJDUMP_FILE)
# Only one table in the objdump results matters for our purposes,
# see https://www.cs.colorado.edu/~main/cs1300/doc/mingwfaq.html
start_index <- which(
grepl(
pattern = "[Ordinal/Name Pointer] Table"
, x = objdump_results
, fixed = TRUE
)
)
empty_lines <- which(objdump_results == "")
end_of_table <- empty_lines[empty_lines > start_index][1L]
# Read the contents of the table
exported_symbols <- objdump_results[(start_index + 1L):end_of_table]
exported_symbols <- gsub("\t", "", exported_symbols)
exported_symbols <- gsub(".*\\] ", "", exported_symbols)
exported_symbols <- gsub(" ", "", exported_symbols)
# Write R.def file
writeLines(
text = c(
paste0("LIBRARY \"", DLL_BASE_NAME, "\"")
, "EXPORTS"
, exported_symbols
)
, con = OUT_DEF_FILE
, sep = "\n"
)
message(sprintf("Successfully created '%s'", OUT_DEF_FILE))
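
Reviewer sketch (not part of the commit): the table-parsing step of the script above can be tried without a real DLL. The sample text below is a synthetic stand-in for `objdump -p` output, and the symbol names and the `R.dll` library name are only illustrative assumptions. Unlike the script, this sketch stops one line before the blank separator, so no empty entry ends up in the symbol list.

```r
# Synthetic stand-in for `objdump -p` output (assumption: real output has
# many more sections; only the name-pointer table matters here).
objdump_results <- c(
    "The Export Tables (interpreted .edata section contents)"
    , ""
    , "[Ordinal/Name Pointer] Table"
    , "\t[   0] R_BadLongVector"
    , "\t[   1] R_BaseEnv"
    , "\t[   2] Rf_allocVector"
    , ""
    , "The Function Table (interpreted .pdata section contents)"
)

# Locate the table header and the first blank line that follows it
start_index <- which(
    grepl("[Ordinal/Name Pointer] Table", objdump_results, fixed = TRUE)
)
empty_lines <- which(objdump_results == "")
end_of_table <- empty_lines[empty_lines > start_index][1L]

# Keep only the table rows, excluding the blank separator line
exported_symbols <- objdump_results[(start_index + 1L):(end_of_table - 1L)]

# Strip the tab, the "[ n] " ordinal prefix, and any leftover spaces
exported_symbols <- gsub("\t", "", exported_symbols)
exported_symbols <- gsub(".*\\] ", "", exported_symbols)
exported_symbols <- gsub(" ", "", exported_symbols)

# Same layout the script writes to the .def file
def_lines <- c("LIBRARY \"R.dll\"", "EXPORTS", exported_symbols)
print(def_lines)
```

Passing `def_lines` to `writeLines()` would produce a file with the same shape as the one the script writes.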


@@ -0,0 +1,64 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/utils.R
\name{a-compatibility-note-for-saveRDS-save}
\alias{a-compatibility-note-for-saveRDS-save}
\title{Do not use \code{\link[base]{saveRDS}} or \code{\link[base]{save}} for long-term archival of
models. Instead, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}}.}
\description{
It is a common practice to use the built-in \code{\link[base]{saveRDS}} function (or
\code{\link[base]{save}}) to persist R objects to the disk. While it is possible to persist
\code{xgb.Booster} objects using \code{\link[base]{saveRDS}}, it is not advisable to do so if
the model is to be accessed in the future. If you train a model with the current version of
XGBoost and persist it with \code{\link[base]{saveRDS}}, the model is not guaranteed to be
accessible in later releases of XGBoost. To ensure that your model can be accessed in future
releases of XGBoost, use \code{\link{xgb.save}} or \code{\link{xgb.save.raw}} instead.
}
\details{
Use \code{\link{xgb.save}} to save the XGBoost model as a stand-alone file. You may opt into
the JSON format by specifying the JSON extension. To read the model back, use
\code{\link{xgb.load}}.
Use \code{\link{xgb.save.raw}} to save the XGBoost model as a sequence (vector) of raw bytes
in a future-proof manner. Future releases of XGBoost will be able to read the raw bytes and
re-construct the corresponding model. To read the model back, use \code{\link{xgb.load.raw}}.
The \code{\link{xgb.save.raw}} function is useful if you'd like to persist the XGBoost model
as part of another R object.
Note: Do not use \code{\link{xgb.serialize}} to store models long-term. It persists not only the
model but also internal configurations and parameters, and its format is not stable across
multiple XGBoost versions. Use \code{\link{xgb.serialize}} only for checkpointing.
For more details and explanation about model persistence and archival, consult the page
\url{https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html}.
}
\examples{
data(agaricus.train, package='xgboost')
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
# Save as a stand-alone file; load it with xgb.load()
xgb.save(bst, 'xgb.model')
bst2 <- xgb.load('xgb.model')
# Save as a stand-alone file (JSON); load it with xgb.load()
xgb.save(bst, 'xgb.model.json')
bst2 <- xgb.load('xgb.model.json')
if (file.exists('xgb.model.json')) file.remove('xgb.model.json')
# Save as a raw byte vector; load it with xgb.load.raw()
xgb_bytes <- xgb.save.raw(bst)
bst2 <- xgb.load.raw(xgb_bytes)
# Persist XGBoost model as part of another R object
obj <- list(xgb_model_bytes = xgb.save.raw(bst), description = "My first XGBoost model")
# Persist the R object. Here, saveRDS() is okay, since it doesn't persist
# xgb.Booster directly. What's being persisted is the future-proof byte representation
# as given by xgb.save.raw().
saveRDS(obj, 'my_object.rds')
# Read back the R object
obj2 <- readRDS('my_object.rds')
# Re-construct xgb.Booster object from the bytes
bst2 <- xgb.load.raw(obj2$xgb_model_bytes)
if (file.exists('my_object.rds')) file.remove('my_object.rds')
}


@@ -4,8 +4,10 @@
\name{agaricus.test} \name{agaricus.test}
\alias{agaricus.test} \alias{agaricus.test}
\title{Test part from Mushroom Data Set} \title{Test part from Mushroom Data Set}
\format{A list containing a label vector, and a dgCMatrix object with 1611 \format{
rows and 126 variables} A list containing a label vector, and a dgCMatrix object with 1611
rows and 126 variables
}
\usage{ \usage{
data(agaricus.test) data(agaricus.test)
} }


@@ -4,8 +4,10 @@
\name{agaricus.train} \name{agaricus.train}
\alias{agaricus.train} \alias{agaricus.train}
\title{Training part from Mushroom Data Set} \title{Training part from Mushroom Data Set}
\format{A list containing a label vector, and a dgCMatrix object with 6513 \format{
rows and 127 variables} A list containing a label vector, and a dgCMatrix object with 6513
rows and 127 variables
}
\usage{ \usage{
data(agaricus.train) data(agaricus.train)
} }


@@ -49,6 +49,9 @@ It will use all the trees by default (\code{NULL} value).}
prediction outputs per case. This option has no effect when either of predleaf, predcontrib, prediction outputs per case. This option has no effect when either of predleaf, predcontrib,
or predinteraction flags is TRUE.} or predinteraction flags is TRUE.}
\item{training}{whether the prediction result is used for training. For the dart booster,
prediction during training performs dropout.}
\item{...}{Parameters passed to \code{predict.xgb.Booster}} \item{...}{Parameters passed to \code{predict.xgb.Booster}}
} }
\value{ \value{


@@ -38,6 +38,8 @@ bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label, max_dep
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic") eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
saveRDS(bst, "xgb.model.rds") saveRDS(bst, "xgb.model.rds")
# Warning: The resulting RDS file is only compatible with the current XGBoost version.
# Refer to the section titled "a-compatibility-note-for-saveRDS-save".
bst1 <- readRDS("xgb.model.rds") bst1 <- readRDS("xgb.model.rds")
if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds") if (file.exists("xgb.model.rds")) file.remove("xgb.model.rds")
# the handle is invalid: # the handle is invalid:


@@ -55,7 +55,7 @@ than for \code{xgb.Booster}, since only just a handle (pointer) would need to be
That would only matter if attributes need to be set many times. That would only matter if attributes need to be set many times.
Note, however, that when feeding a handle of an \code{xgb.Booster} object to the attribute setters, Note, however, that when feeding a handle of an \code{xgb.Booster} object to the attribute setters,
the raw model cache of an \code{xgb.Booster} object would not be automatically updated, the raw model cache of an \code{xgb.Booster} object would not be automatically updated,
and it would be user's responsibility to call \code{xgb.save.raw} to update it. and it would be user's responsibility to call \code{xgb.serialize} to update it.
The \code{xgb.attributes<-} setter either updates the existing or adds one or several attributes, The \code{xgb.attributes<-} setter either updates the existing or adds one or several attributes,
but it doesn't delete the other existing attributes. but it doesn't delete the other existing attributes.


@@ -0,0 +1,28 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.Booster.R
\name{xgb.config}
\alias{xgb.config}
\alias{xgb.config<-}
\title{Accessors for model parameters as JSON string.}
\usage{
xgb.config(object)
xgb.config(object) <- value
}
\arguments{
\item{object}{Object of class \code{xgb.Booster}}
\item{value}{A JSON string.}
}
\description{
Accessors for model parameters as JSON string.
}
\examples{
data(agaricus.train, package='xgboost')
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic")
config <- xgb.config(bst)
}


@@ -28,12 +28,15 @@ xgb.cv(
) )
} }
\arguments{ \arguments{
\item{params}{the list of parameters. Commonly used ones are: \item{params}{the list of parameters. The complete list of parameters is
available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
is a shorter summary:
\itemize{ \itemize{
\item \code{objective} objective function, common ones are \item \code{objective} objective function, common ones are
\itemize{ \itemize{
\item \code{reg:squarederror} Regression with squared loss \item \code{reg:squarederror} Regression with squared loss.
\item \code{binary:logistic} logistic regression for classification \item \code{binary:logistic} logistic regression for classification.
\item See \code{\link[=xgb.train]{xgb.train}()} for complete list of objectives.
} }
\item \code{eta} step size of each boosting step \item \code{eta} step size of each boosting step
\item \code{max_depth} maximum depth of the tree \item \code{max_depth} maximum depth of the tree
@@ -135,7 +138,7 @@ An object of class \code{xgb.cv.synchronous} with the following elements:
(only available with early stopping). (only available with early stopping).
\item \code{pred} CV prediction values available when \code{prediction} is set. \item \code{pred} CV prediction values available when \code{prediction} is set.
It is either vector or matrix (see \code{\link{cb.cv.predict}}). It is either vector or matrix (see \code{\link{cb.cv.predict}}).
\item \code{models} a liost of the CV folds' models. It is only available with the explicit \item \code{models} a list of the CV folds' models. It is only available with the explicit
setting of the \code{cb.cv.predict(save_models = TRUE)} callback. setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
} }
} }
@@ -151,7 +154,7 @@ The cross-validation process is then repeated \code{nrounds} times, with each of
All observations are used for both training and validation. All observations are used for both training and validation.
Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation} Adapted from \url{https://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29}
} }
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')


@@ -0,0 +1,14 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.load.raw.R
\name{xgb.load.raw}
\alias{xgb.load.raw}
\title{Load serialised xgboost model from R's raw vector}
\usage{
xgb.load.raw(buffer)
}
\arguments{
\item{buffer}{the buffer returned by xgb.save.raw}
}
\description{
Load a model from a raw memory buffer generated by \code{xgb.save.raw}.
}


@@ -22,7 +22,11 @@ of \code{\link{xgb.train}}.
Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}} Note: a model can also be saved as an R-object (e.g., by using \code{\link[base]{saveRDS}}
or \code{\link[base]{save}}). However, it would then only be compatible with R, and or \code{\link[base]{save}}). However, it would then only be compatible with R, and
corresponding R-methods would need to be used to load it. corresponding R-methods would need to be used to load it. Moreover, persisting the model with
\code{\link[base]{saveRDS}} or \code{\link[base]{save}} will cause compatibility problems in
future versions of XGBoost. Consult \code{\link{a-compatibility-note-for-saveRDS-save}} to learn
how to persist models in a future-proof way, i.e. to make the model accessible in future
releases of XGBoost.
} }
\examples{ \examples{
data(agaricus.train, package='xgboost') data(agaricus.train, package='xgboost')


@@ -3,7 +3,7 @@
\name{xgb.save.raw} \name{xgb.save.raw}
\alias{xgb.save.raw} \alias{xgb.save.raw}
\title{Save xgboost model to R's raw vector, \title{Save xgboost model to R's raw vector,
user can call xgb.load to load the model back from raw vector} user can call xgb.load.raw to load the model back from raw vector}
\usage{ \usage{
xgb.save.raw(model) xgb.save.raw(model)
} }
@@ -21,7 +21,7 @@ test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic") eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
raw <- xgb.save.raw(bst) raw <- xgb.save.raw(bst)
bst <- xgb.load(raw) bst <- xgb.load.raw(raw)
pred <- predict(bst, test$data) pred <- predict(bst, test$data)
} }


@@ -0,0 +1,29 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.serialize.R
\name{xgb.serialize}
\alias{xgb.serialize}
\title{Serialize the booster instance into R's raw vector. The serialization method differs
from \code{\link{xgb.save.raw}} as the latter one saves only the model but not
parameters. This serialization format is not stable across different xgboost versions.}
\usage{
xgb.serialize(booster)
}
\arguments{
\item{booster}{the booster instance}
}
\description{
Serialize the booster instance into R's raw vector. The serialization method differs
from \code{\link{xgb.save.raw}} as the latter one saves only the model but not
parameters. This serialization format is not stable across different xgboost versions.
}
\examples{
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2,objective = "binary:logistic")
raw <- xgb.serialize(bst)
bst <- xgb.unserialize(raw)
}


@@ -42,9 +42,9 @@ xgboost(
) )
} }
\arguments{ \arguments{
\item{params}{the list of parameters. \item{params}{the list of parameters. The complete list of parameters is
The complete list of parameters is available at \url{http://xgboost.readthedocs.io/en/latest/parameter.html}. available in the \href{http://xgboost.readthedocs.io/en/latest/parameter.html}{online documentation}. Below
Below is a shorter summary: is a shorter summary:
1. General Parameters 1. General Parameters
@@ -82,13 +82,23 @@ xgboost(
\item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below: \item \code{objective} specify the learning task and the corresponding learning objective, users can pass a self-defined function to it. The default objective options are below:
\itemize{ \itemize{
\item \code{reg:squarederror} Regression with squared loss (Default). \item \code{reg:squarederror} Regression with squared loss (Default).
\item \code{reg:squaredlogerror}: regression with squared log loss \eqn{1/2 * (log(pred + 1) - log(label + 1))^2}. All inputs are required to be greater than -1. Also, see metric rmsle for possible issue with this objective.
\item \code{reg:logistic} logistic regression. \item \code{reg:logistic} logistic regression.
\item \code{reg:pseudohubererror}: regression with Pseudo Huber loss, a twice differentiable alternative to absolute loss.
\item \code{binary:logistic} logistic regression for binary classification. Output probability. \item \code{binary:logistic} logistic regression for binary classification. Output probability.
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation. \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
\item \code{num_class} set the number of classes. To use only with multiclass objectives. \item \code{binary:hinge}: hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.
\item \code{count:poisson}: poisson regression for count data, output mean of poisson distribution. \code{max_delta_step} is set to 0.7 by default in poisson regression (used to safeguard optimization).
\item \code{survival:cox}: Cox regression for right censored survival time data (negative values are considered right censored). Note that predictions are returned on the hazard ratio scale (i.e., as HR = exp(marginal_prediction) in the proportional hazard function \code{h(t) = h0(t) * HR}).
\item \code{survival:aft}: Accelerated failure time model for censored survival time data. See \href{https://xgboost.readthedocs.io/en/latest/tutorials/aft_survival_analysis.html}{Survival Analysis with Accelerated Failure Time} for details.
\item \code{aft_loss_distribution}: Probability density function used by the \code{survival:aft} objective and the \code{aft-nloglik} metric.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}. \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
\item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class. \item \code{multi:softprob} same as softmax, but prediction outputs a vector of ndata * nclass elements, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss. \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
\item \code{rank:ndcg}: Use LambdaMART to perform list-wise ranking where \href{https://en.wikipedia.org/wiki/Discounted_cumulative_gain}{Normalized Discounted Cumulative Gain (NDCG)} is maximized.
\item \code{rank:map}: Use LambdaMART to perform list-wise ranking where \href{https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision}{Mean Average Precision (MAP)} is maximized.
\item \code{reg:gamma}: gamma regression with log-link. Output is a mean of gamma distribution. It might be useful, e.g., for modeling insurance claims severity, or for any outcome that might be \href{https://en.wikipedia.org/wiki/Gamma_distribution#Applications}{gamma-distributed}.
\item \code{reg:tweedie}: Tweedie regression with log-link. It might be useful, e.g., for modeling total loss in insurance, or for any outcome that might be \href{https://en.wikipedia.org/wiki/Tweedie_distribution#Applications}{Tweedie-distributed}.
} }
\item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5 \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
\item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section. \item \code{eval_metric} evaluation metrics for validation data. Users can pass a self-defined function to it. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
@@ -205,16 +215,16 @@ User may set one or several \code{eval_metric} parameters.
Note that when using a customized metric, only this single metric can be used. Note that when using a customized metric, only this single metric can be used.
The following is the list of built-in metrics for which Xgboost provides optimized implementation: The following is the list of built-in metrics for which Xgboost provides optimized implementation:
\itemize{ \itemize{
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error} \item \code{rmse} root mean square error. \url{https://en.wikipedia.org/wiki/Root_mean_square_error}
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood} \item \code{logloss} negative log-likelihood. \url{https://en.wikipedia.org/wiki/Log-likelihood}
\item \code{mlogloss} multiclass logloss. \url{http://wiki.fast.ai/index.php/Log_Loss} \item \code{mlogloss} multiclass logloss. \url{https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html}
\item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}. \item \code{error} Binary classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
By default, it uses the 0.5 threshold for predicted values to define negative and positive instances. By default, it uses the 0.5 threshold for predicted values to define negative and positive instances.
Different threshold (e.g., 0.) could be specified as "error@0." Different threshold (e.g., 0.) could be specified as "error@0."
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}. \item \code{merror} Multiclass classification error rate. It is calculated as \code{(# wrong cases) / (# all cases)}.
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation. \item \code{auc} Area under the curve. \url{https://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
\item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation. \item \code{aucpr} Area under the PR curve. \url{https://en.wikipedia.org/wiki/Precision_and_recall} for ranking evaluation.
\item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG} \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{https://en.wikipedia.org/wiki/NDCG}
} }
The following callbacks are automatically created when certain parameters are set: The following callbacks are automatically created when certain parameters are set:


@@ -0,0 +1,14 @@
% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/xgb.unserialize.R
\name{xgb.unserialize}
\alias{xgb.unserialize}
\title{Load the instance back from \code{\link{xgb.serialize}}}
\usage{
xgb.unserialize(buffer)
}
\arguments{
\item{buffer}{the buffer containing booster instance saved by \code{\link{xgb.serialize}}}
}
\description{
Load the instance back from \code{\link{xgb.serialize}}
}


@@ -3,7 +3,7 @@ PKGROOT=../../
ENABLE_STD_THREAD=1 ENABLE_STD_THREAD=1
# _*_ mode: Makefile; _*_ # _*_ mode: Makefile; _*_
CXX_STD = CXX11 CXX_STD = CXX14
XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\ XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\
-DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\ -DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\


@@ -15,7 +15,7 @@ xgblib:
cp -r ../../include . cp -r ../../include .
cp -r ../../amalgamation . cp -r ../../amalgamation .
CXX_STD = CXX11 CXX_STD = CXX14
XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\ XGB_RFLAGS = -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0\
-DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\ -DDMLC_ENABLE_STD_THREAD=$(ENABLE_STD_THREAD) -DDMLC_DISABLE_STDIN=1\


@@ -23,6 +23,10 @@ extern SEXP XGBoosterGetAttrNames_R(SEXP);
extern SEXP XGBoosterGetAttr_R(SEXP, SEXP); extern SEXP XGBoosterGetAttr_R(SEXP, SEXP);
extern SEXP XGBoosterLoadModelFromRaw_R(SEXP, SEXP); extern SEXP XGBoosterLoadModelFromRaw_R(SEXP, SEXP);
extern SEXP XGBoosterLoadModel_R(SEXP, SEXP); extern SEXP XGBoosterLoadModel_R(SEXP, SEXP);
extern SEXP XGBoosterSaveJsonConfig_R(SEXP handle);
extern SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value);
extern SEXP XGBoosterSerializeToBuffer_R(SEXP handle);
extern SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw);
extern SEXP XGBoosterModelToRaw_R(SEXP); extern SEXP XGBoosterModelToRaw_R(SEXP);
extern SEXP XGBoosterPredict_R(SEXP, SEXP, SEXP, SEXP, SEXP); extern SEXP XGBoosterPredict_R(SEXP, SEXP, SEXP, SEXP, SEXP);
extern SEXP XGBoosterSaveModel_R(SEXP, SEXP); extern SEXP XGBoosterSaveModel_R(SEXP, SEXP);
@@ -49,6 +53,10 @@ static const R_CallMethodDef CallEntries[] = {
{"XGBoosterGetAttr_R", (DL_FUNC) &XGBoosterGetAttr_R, 2}, {"XGBoosterGetAttr_R", (DL_FUNC) &XGBoosterGetAttr_R, 2},
{"XGBoosterLoadModelFromRaw_R", (DL_FUNC) &XGBoosterLoadModelFromRaw_R, 2}, {"XGBoosterLoadModelFromRaw_R", (DL_FUNC) &XGBoosterLoadModelFromRaw_R, 2},
{"XGBoosterLoadModel_R", (DL_FUNC) &XGBoosterLoadModel_R, 2}, {"XGBoosterLoadModel_R", (DL_FUNC) &XGBoosterLoadModel_R, 2},
{"XGBoosterSaveJsonConfig_R", (DL_FUNC) &XGBoosterSaveJsonConfig_R, 1},
{"XGBoosterLoadJsonConfig_R", (DL_FUNC) &XGBoosterLoadJsonConfig_R, 2},
{"XGBoosterSerializeToBuffer_R", (DL_FUNC) &XGBoosterSerializeToBuffer_R, 1},
{"XGBoosterUnserializeFromBuffer_R", (DL_FUNC) &XGBoosterUnserializeFromBuffer_R, 2},
{"XGBoosterModelToRaw_R", (DL_FUNC) &XGBoosterModelToRaw_R, 1}, {"XGBoosterModelToRaw_R", (DL_FUNC) &XGBoosterModelToRaw_R, 1},
{"XGBoosterPredict_R", (DL_FUNC) &XGBoosterPredict_R, 5}, {"XGBoosterPredict_R", (DL_FUNC) &XGBoosterPredict_R, 5},
{"XGBoosterSaveModel_R", (DL_FUNC) &XGBoosterSaveModel_R, 2}, {"XGBoosterSaveModel_R", (DL_FUNC) &XGBoosterSaveModel_R, 2},


@@ -338,15 +338,6 @@ SEXP XGBoosterSaveModel_R(SEXP handle, SEXP fname) {
return R_NilValue; return R_NilValue;
} }
SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModelFromBuffer(R_ExternalPtrAddr(handle),
RAW(raw),
length(raw)));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterModelToRaw_R(SEXP handle) { SEXP XGBoosterModelToRaw_R(SEXP handle) {
SEXP ret; SEXP ret;
R_API_BEGIN(); R_API_BEGIN();
@@ -362,6 +353,57 @@ SEXP XGBoosterModelToRaw_R(SEXP handle) {
return ret; return ret;
} }
SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadModelFromBuffer(R_ExternalPtrAddr(handle),
RAW(raw),
length(raw)));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterSaveJsonConfig_R(SEXP handle) {
const char* ret;
R_API_BEGIN();
bst_ulong len {0};
CHECK_CALL(XGBoosterSaveJsonConfig(R_ExternalPtrAddr(handle),
&len,
&ret));
R_API_END();
return mkString(ret);
}
SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value) {
R_API_BEGIN();
CHECK_CALL(XGBoosterLoadJsonConfig(R_ExternalPtrAddr(handle), CHAR(asChar(value))));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterSerializeToBuffer_R(SEXP handle) {
SEXP ret;
R_API_BEGIN();
bst_ulong out_len;
const char *raw;
CHECK_CALL(XGBoosterSerializeToBuffer(R_ExternalPtrAddr(handle), &out_len, &raw));
ret = PROTECT(allocVector(RAWSXP, out_len));
if (out_len != 0) {
memcpy(RAW(ret), raw, out_len);
}
R_API_END();
UNPROTECT(1);
return ret;
}
SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw) {
R_API_BEGIN();
CHECK_CALL(XGBoosterUnserializeFromBuffer(R_ExternalPtrAddr(handle),
RAW(raw),
length(raw)));
R_API_END();
return R_NilValue;
}
SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats, SEXP dump_format) { SEXP XGBoosterDumpModel_R(SEXP handle, SEXP fmap, SEXP with_stats, SEXP dump_format) {
SEXP out; SEXP out;
R_API_BEGIN(); R_API_BEGIN();


@@ -182,6 +182,36 @@ XGB_DLL SEXP XGBoosterLoadModelFromRaw_R(SEXP handle, SEXP raw);
*/ */
XGB_DLL SEXP XGBoosterModelToRaw_R(SEXP handle); XGB_DLL SEXP XGBoosterModelToRaw_R(SEXP handle);
/*!
* \brief Save internal parameters as a JSON string
* \param handle handle
* \return JSON string
*/
XGB_DLL SEXP XGBoosterSaveJsonConfig_R(SEXP handle);
/*!
* \brief Load the JSON string returned by XGBoosterSaveJsonConfig_R
* \param handle handle
* \param value JSON string
* \return R_NilValue
*/
XGB_DLL SEXP XGBoosterLoadJsonConfig_R(SEXP handle, SEXP value);
/*!
* \brief Memory snapshot based serialization method. Saves all internal state
* into a buffer.
* \param handle handle to booster
* \return raw byte array
*/
XGB_DLL SEXP XGBoosterSerializeToBuffer_R(SEXP handle);
/*!
* \brief Memory snapshot based serialization method. Loads the buffer returned
* from `XGBoosterSerializeToBuffer'.
* \param handle handle to booster
* \param raw raw byte array holding the serialized booster
* \return R_NilValue
*/
XGB_DLL SEXP XGBoosterUnserializeFromBuffer_R(SEXP handle, SEXP raw);
/*! /*!
* \brief dump model into a string * \brief dump model into a string
* \param handle handle * \param handle handle


@@ -0,0 +1,101 @@
# Script to generate reference models. The reference models are used to test backward compatibility
# of saved model files from XGBoost version 0.90 and 1.0.x.
library(xgboost)
library(Matrix)
source('./generate_models_params.R')
set.seed(0)
metadata <- list(
kRounds = 2,
kRows = 1000,
kCols = 4,
kForests = 2,
kMaxDepth = 2,
kClasses = 3
)
X <- Matrix(data = rnorm(metadata$kRows * metadata$kCols), nrow = metadata$kRows,
ncol = metadata$kCols, sparse = TRUE)
w <- runif(metadata$kRows)
version <- packageVersion('xgboost')
target_dir <- 'models'
save_booster <- function (booster, model_name) {
booster_bin <- function (model_name) {
return (file.path(target_dir, paste('xgboost-', version, '.', model_name, '.bin', sep = '')))
}
booster_json <- function (model_name) {
return (file.path(target_dir, paste('xgboost-', version, '.', model_name, '.json', sep = '')))
}
booster_rds <- function (model_name) {
return (file.path(target_dir, paste('xgboost-', version, '.', model_name, '.rds', sep = '')))
}
xgb.save(booster, booster_bin(model_name))
saveRDS(booster, booster_rds(model_name))
if (version >= '1.0.0') {
xgb.save(booster, booster_json(model_name))
}
}
generate_regression_model <- function () {
print('Regression')
y <- rnorm(metadata$kRows)
data <- xgb.DMatrix(X, label = y)
params <- list(tree_method = 'hist', num_parallel_tree = metadata$kForests,
max_depth = metadata$kMaxDepth)
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, 'reg')
}
generate_logistic_model <- function () {
print('Binary classification with logistic loss')
y <- sample(0:1, size = metadata$kRows, replace = TRUE)
stopifnot(max(y) == 1, min(y) == 0)
data <- xgb.DMatrix(X, label = y, weight = w)
params <- list(tree_method = 'hist', num_parallel_tree = metadata$kForests,
max_depth = metadata$kMaxDepth, objective = 'binary:logistic')
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, 'logit')
}
generate_classification_model <- function () {
print('Multi-class classification')
y <- sample(0:(metadata$kClasses - 1), size = metadata$kRows, replace = TRUE)
stopifnot(max(y) == metadata$kClasses - 1, min(y) == 0)
data <- xgb.DMatrix(X, label = y, weight = w)
params <- list(num_class = metadata$kClasses, tree_method = 'hist',
num_parallel_tree = metadata$kForests, max_depth = metadata$kMaxDepth,
objective = 'multi:softmax')
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, 'cls')
}
generate_ranking_model <- function () {
print('Learning to rank')
y <- sample(0:4, size = metadata$kRows, replace = TRUE)
stopifnot(max(y) == 4, min(y) == 0)
kGroups <- 20
w <- runif(kGroups)
g <- rep(50, times = kGroups)
data <- xgb.DMatrix(X, label = y, group = g)
# setinfo(data, 'weight', w)
# ^^^ does not work in version <= 1.1.0; see https://github.com/dmlc/xgboost/issues/5942
# So call low-level function XGDMatrixSetInfo_R directly. Since this function is not an exported
# symbol, use the triple-colon operator.
.Call(xgboost:::XGDMatrixSetInfo_R, data, 'weight', as.numeric(w))
params <- list(objective = 'rank:ndcg', num_parallel_tree = metadata$kForests,
tree_method = 'hist', max_depth = metadata$kMaxDepth)
booster <- xgb.train(params, data, nrounds = metadata$kRounds)
save_booster(booster, 'ltr')
}
dir.create(target_dir)
invisible(generate_regression_model())
invisible(generate_logistic_model())
invisible(generate_classification_model())
invisible(generate_ranking_model())

@@ -0,0 +1,71 @@
library(lintr)
library(crayon)
my_linters <- list(
absolute_path_linter = lintr::absolute_path_linter,
assignment_linter = lintr::assignment_linter,
closed_curly_linter = lintr::closed_curly_linter,
commas_linter = lintr::commas_linter,
# commented_code_linter = lintr::commented_code_linter,
infix_spaces_linter = lintr::infix_spaces_linter,
line_length_linter = lintr::line_length_linter,
no_tab_linter = lintr::no_tab_linter,
object_usage_linter = lintr::object_usage_linter,
# snake_case_linter = lintr::snake_case_linter,
# multiple_dots_linter = lintr::multiple_dots_linter,
object_length_linter = lintr::object_length_linter,
open_curly_linter = lintr::open_curly_linter,
# single_quotes_linter = lintr::single_quotes_linter,
spaces_inside_linter = lintr::spaces_inside_linter,
spaces_left_parentheses_linter = lintr::spaces_left_parentheses_linter,
trailing_blank_lines_linter = lintr::trailing_blank_lines_linter,
trailing_whitespace_linter = lintr::trailing_whitespace_linter,
true_false = lintr::T_and_F_symbol_linter
)
results <- lapply(
list.files(path = '.', pattern = '\\.[Rr]$', recursive = TRUE),
function (r_file) {
cat(sprintf("Processing %s ...\n", r_file))
list(r_file = r_file,
output = lintr::lint(filename = r_file, linters = my_linters))
})
num_issue <- Reduce(sum, lapply(results, function (e) length(e$output)))
lint2str <- function(lint_entry) {
color <- function(type) {
switch(type,
"warning" = crayon::magenta,
"error" = crayon::red,
"style" = crayon::blue,
crayon::bold
)
}
paste0(
lapply(lint_entry$output,
function (lint_line) {
paste0(
crayon::bold(lint_entry$r_file, ":",
as.character(lint_line$line_number), ":",
as.character(lint_line$column_number), ": ", sep = ""),
color(lint_line$type)(lint_line$type, ": ", sep = ""),
crayon::bold(lint_line$message), "\n",
lint_line$line, "\n",
lintr:::highlight_string(lint_line$message, lint_line$column_number, lint_line$ranges),
"\n",
collapse = "")
}),
collapse = "")
}
if (num_issue > 0) {
cat(sprintf('R linters found %d issues:\n', num_issue))
for (entry in results) {
if (length(entry$output)) {
cat(paste0('**** ', crayon::bold(entry$r_file), '\n'))
cat(paste0(lint2str(entry), collapse = ''))
}
}
quit(save = 'no', status = 1) # Signal error to parent shell
}

@@ -1,4 +1,4 @@
library(testthat) library(testthat)
library(xgboost) library(xgboost)
test_check("xgboost") test_check("xgboost", reporter = ProgressReporter)

@@ -9,12 +9,12 @@ test <- agaricus.test
set.seed(1994) set.seed(1994)
# disable some tests for Win32 # disable some tests for Win32
windows_flag = .Platform$OS.type == "windows" && windows_flag <- .Platform$OS.type == "windows" &&
.Machine$sizeof.pointer != 8 .Machine$sizeof.pointer != 8
solaris_flag = (Sys.info()['sysname'] == "SunOS") solaris_flag <- (Sys.info()['sysname'] == "SunOS")
test_that("train and predict binary classification", { test_that("train and predict binary classification", {
nrounds = 2 nrounds <- 2
expect_output( expect_output(
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = nrounds, objective = "binary:logistic") eta = 1, nthread = 2, nrounds = nrounds, objective = "binary:logistic")
@@ -35,8 +35,42 @@ test_that("train and predict binary classification", {
expect_lt(abs(err_pred1 - err_log), 10e-6) expect_lt(abs(err_pred1 - err_log), 10e-6)
}) })
test_that("parameter validation works", {
p <- list(foo = "bar")
nrounds <- 1
set.seed(1994)
d <- cbind(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10))
y <- d[, "x1"] + d[, "x2"]^2 +
ifelse(d[, "x3"] > .5, d[, "x3"]^2, 2^d[, "x3"]) +
rnorm(10)
dtrain <- xgb.DMatrix(data = d, info = list(label = y))
correct <- function() {
params <- list(max_depth = 2, booster = "dart",
rate_drop = 0.5, one_drop = TRUE,
objective = "reg:squarederror")
xgb.train(params = params, data = dtrain, nrounds = nrounds)
}
expect_silent(correct())
incorrect <- function() {
params <- list(max_depth = 2, booster = "dart",
rate_drop = 0.5, one_drop = TRUE,
objective = "reg:squarederror",
foo = "bar", bar = "foo")
output <- capture.output(
xgb.train(params = params, data = dtrain, nrounds = nrounds))
print(output)
}
expect_output(incorrect(), "bar, foo")
})
test_that("dart prediction works", { test_that("dart prediction works", {
nrounds = 32 nrounds <- 32
set.seed(1994) set.seed(1994)
d <- cbind( d <- cbind(
@@ -68,7 +102,6 @@ test_that("dart prediction works", {
one_drop = TRUE, one_drop = TRUE,
nthread = 1, nthread = 1,
tree_method = "exact", tree_method = "exact",
verbosity = 3,
objective = "reg:squarederror" objective = "reg:squarederror"
), ),
data = dtrain, data = dtrain,
@@ -190,7 +223,7 @@ test_that("use of multiple eval metrics works", {
test_that("training continuation works", { test_that("training continuation works", {
dtrain <- xgb.DMatrix(train$data, label = train$label) dtrain <- xgb.DMatrix(train$data, label = train$label)
watchlist = list(train=dtrain) watchlist <- list(train = dtrain)
param <- list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2) param <- list(objective = "binary:logistic", max_depth = 2, eta = 1, nthread = 2)
# for the reference, use 4 iterations at once: # for the reference, use 4 iterations at once:
@@ -219,6 +252,21 @@ test_that("training continuation works", {
expect_equal(dim(bst2$evaluation_log), c(2, 2)) expect_equal(dim(bst2$evaluation_log), c(2, 2))
}) })
test_that("model serialization works", {
out_path <- "model_serialization"
dtrain <- xgb.DMatrix(train$data, label = train$label)
watchlist <- list(train = dtrain)
param <- list(objective = "binary:logistic")
booster <- xgb.train(param, dtrain, nrounds = 4, watchlist)
raw <- xgb.serialize(booster)
saveRDS(raw, out_path)
raw <- readRDS(out_path)
loaded <- xgb.unserialize(raw)
raw_from_loaded <- xgb.serialize(loaded)
expect_equal(raw, raw_from_loaded)
file.remove(out_path)
})
test_that("xgb.cv works", { test_that("xgb.cv works", {
set.seed(11) set.seed(11)
@@ -290,7 +338,7 @@ test_that("max_delta_step works", {
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label) dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
watchlist <- list(train = dtrain) watchlist <- list(train = dtrain)
param <- list(objective = "binary:logistic", eval_metric = "logloss", max_depth = 2, nthread = 2, eta = 0.5) param <- list(objective = "binary:logistic", eval_metric = "logloss", max_depth = 2, nthread = 2, eta = 0.5)
nrounds = 5 nrounds <- 5
# model with no restriction on max_delta_step # model with no restriction on max_delta_step
bst1 <- xgb.train(param, dtrain, nrounds, watchlist, verbose = 1) bst1 <- xgb.train(param, dtrain, nrounds, watchlist, verbose = 1)
# model with restricted max_delta_step # model with restricted max_delta_step
@@ -312,9 +360,9 @@ test_that("colsample_bytree works", {
dtrain <- xgb.DMatrix(train_x, label = train_y) dtrain <- xgb.DMatrix(train_x, label = train_y)
dtest <- xgb.DMatrix(test_x, label = test_y) dtest <- xgb.DMatrix(test_x, label = test_y)
watchlist <- list(train = dtrain, eval = dtest) watchlist <- list(train = dtrain, eval = dtest)
# Use colsample_bytree = 0.01, so that roughly one out of 100 features is ## Use colsample_bytree = 0.01, so that roughly one out of 100 features is chosen for
# chosen for each tree ## each tree
param <- list(max_depth = 2, eta = 0, silent = 1, nthread = 2, param <- list(max_depth = 2, eta = 0, nthread = 2,
colsample_bytree = 0.01, objective = "binary:logistic", colsample_bytree = 0.01, objective = "binary:logistic",
eval_metric = "auc") eval_metric = "auc")
set.seed(2) set.seed(2)
@@ -324,3 +372,13 @@ test_that("colsample_bytree works", {
# in the 100 trees # in the 100 trees
expect_gte(nrow(xgb.importance(model = bst)), 30) expect_gte(nrow(xgb.importance(model = bst)), 30)
}) })
test_that("Configuration works", {
bst <- xgboost(data = train$data, label = train$label, max_depth = 2,
eta = 1, nthread = 2, nrounds = 2, objective = "binary:logistic",
eval_metric = 'error', eval_metric = 'auc', eval_metric = "logloss")
config <- xgb.config(bst)
xgb.config(bst) <- config
reloaded_config <- xgb.config(bst)
expect_equal(config, reloaded_config)
})

@@ -21,7 +21,7 @@ ltrain <- add.noise(train$label, 0.2)
ltest <- add.noise(test$label, 0.2) ltest <- add.noise(test$label, 0.2)
dtrain <- xgb.DMatrix(train$data, label = ltrain) dtrain <- xgb.DMatrix(train$data, label = ltrain)
dtest <- xgb.DMatrix(test$data, label = ltest) dtest <- xgb.DMatrix(test$data, label = ltest)
watchlist = list(train=dtrain, test=dtest) watchlist <- list(train = dtrain, test = dtest)
err <- function(label, pr) sum((pr > 0.5) != label) / length(label) err <- function(label, pr) sum((pr > 0.5) != label) / length(label)
@@ -184,6 +184,9 @@ test_that("cb.save.model works as expected", {
expect_equal(xgb.ntree(b1), 1) expect_equal(xgb.ntree(b1), 1)
b2 <- xgb.load('xgboost_02.model') b2 <- xgb.load('xgboost_02.model')
expect_equal(xgb.ntree(b2), 2) expect_equal(xgb.ntree(b2), 2)
xgb.config(b2) <- xgb.config(bst)
expect_equal(xgb.config(bst), xgb.config(b2))
expect_equal(bst$raw, b2$raw) expect_equal(bst$raw, b2$raw)
# save_period = 0 saves the last iteration's model # save_period = 0 saves the last iteration's model
@@ -191,6 +194,7 @@ test_that("cb.save.model works as expected", {
save_period = 0) save_period = 0)
expect_true(file.exists('xgboost.model')) expect_true(file.exists('xgboost.model'))
b2 <- xgb.load('xgboost.model') b2 <- xgb.load('xgboost.model')
xgb.config(b2) <- xgb.config(bst)
expect_equal(bst$raw, b2$raw) expect_equal(bst$raw, b2$raw)
for (f in files) if (file.exists(f)) file.remove(f) for (f in files) if (file.exists(f)) file.remove(f)
@@ -218,6 +222,15 @@ test_that("early stopping xgb.train works", {
early_stopping_rounds = 3, maximize = FALSE, verbose = 0) early_stopping_rounds = 3, maximize = FALSE, verbose = 0)
) )
expect_equal(bst$evaluation_log, bst0$evaluation_log) expect_equal(bst$evaluation_log, bst0$evaluation_log)
xgb.save(bst, "model.bin")
loaded <- xgb.load("model.bin")
expect_false(is.null(loaded$best_iteration))
expect_equal(loaded$best_iteration, bst$best_ntreelimit)
expect_equal(loaded$best_ntreelimit, bst$best_ntreelimit)
file.remove("model.bin")
}) })
test_that("early stopping using a specific metric works", { test_that("early stopping using a specific metric works", {
@@ -254,7 +267,7 @@ test_that("early stopping xgb.cv works", {
test_that("prediction in xgb.cv works", { test_that("prediction in xgb.cv works", {
set.seed(11) set.seed(11)
nrounds = 4 nrounds <- 4
cv <- xgb.cv(param, dtrain, nfold = 5, eta = 0.5, nrounds = nrounds, prediction = TRUE, verbose = 0) cv <- xgb.cv(param, dtrain, nfold = 5, eta = 0.5, nrounds = nrounds, prediction = TRUE, verbose = 0)
expect_false(is.null(cv$evaluation_log)) expect_false(is.null(cv$evaluation_log))
expect_false(is.null(cv$pred)) expect_false(is.null(cv$pred))

@@ -20,7 +20,7 @@ logregobj <- function(preds, dtrain) {
evalerror <- function(preds, dtrain) { evalerror <- function(preds, dtrain) {
labels <- getinfo(dtrain, "label") labels <- getinfo(dtrain, "label")
err <- as.numeric(sum(labels != (preds > 0))) / length(labels) err <- as.numeric(sum(labels != (preds > 0.5))) / length(labels)
return(list(metric = "error", value = err)) return(list(metric = "error", value = err))
} }
@@ -31,7 +31,6 @@ num_round <- 2
test_that("custom objective works", { test_that("custom objective works", {
bst <- xgb.train(param, dtrain, num_round, watchlist) bst <- xgb.train(param, dtrain, num_round, watchlist)
expect_equal(class(bst), "xgb.Booster") expect_equal(class(bst), "xgb.Booster")
expect_equal(length(bst$raw), 1100)
expect_false(is.null(bst$evaluation_log)) expect_false(is.null(bst$evaluation_log))
expect_false(is.null(bst$evaluation_log$eval_error)) expect_false(is.null(bst$evaluation_log$eval_error))
expect_lt(bst$evaluation_log[num_round, eval_error], 0.03) expect_lt(bst$evaluation_log[num_round, eval_error], 0.03)
@@ -44,6 +43,13 @@ test_that("custom objective in CV works", {
expect_lt(cv$evaluation_log[num_round, test_error_mean], 0.03) expect_lt(cv$evaluation_log[num_round, test_error_mean], 0.03)
}) })
test_that("custom objective with early stop works", {
bst <- xgb.train(param, dtrain, 10, watchlist)
expect_equal(class(bst), "xgb.Booster")
train_log <- bst$evaluation_log$train_error
expect_true(all(diff(train_log) <= 0))
})
test_that("custom objective using DMatrix attr works", { test_that("custom objective using DMatrix attr works", {
attr(dtrain, 'label') <- getinfo(dtrain, 'label') attr(dtrain, 'label') <- getinfo(dtrain, 'label')
@@ -55,8 +61,28 @@ test_that("custom objective using DMatrix attr works", {
hess <- preds * (1 - preds) hess <- preds * (1 - preds)
return(list(grad = grad, hess = hess)) return(list(grad = grad, hess = hess))
} }
param$objective = logregobjattr param$objective <- logregobjattr
bst <- xgb.train(param, dtrain, num_round, watchlist) bst <- xgb.train(param, dtrain, num_round, watchlist)
expect_equal(class(bst), "xgb.Booster") expect_equal(class(bst), "xgb.Booster")
expect_equal(length(bst$raw), 1100) })
test_that("custom objective with multi-class works", {
data <- as.matrix(iris[, -5])
label <- as.numeric(iris$Species) - 1
dtrain <- xgb.DMatrix(data = data, label = label)
nclasses <- 3
fake_softprob <- function(preds, dtrain) {
expect_true(all(matrix(preds) == 0.5))
grad <- rnorm(dim(as.matrix(preds))[1])
expect_equal(dim(data)[1] * nclasses, dim(as.matrix(preds))[1])
hess <- rnorm(dim(as.matrix(preds))[1])
return (list(grad = grad, hess = hess))
}
fake_merror <- function(preds, dtrain) {
expect_equal(dim(data)[1] * nclasses, dim(as.matrix(preds))[1])
}
param$objective <- fake_softprob
param$eval_metric <- fake_merror
bst <- xgb.train(param, dtrain, 1, num_class = nclasses)
}) })

@@ -50,6 +50,12 @@ test_that("xgb.DMatrix: getinfo & setinfo", {
labels <- getinfo(dtest, 'label') labels <- getinfo(dtest, 'label')
expect_equal(test_label, getinfo(dtest, 'label')) expect_equal(test_label, getinfo(dtest, 'label'))
expect_true(setinfo(dtest, 'label_lower_bound', test_label))
expect_equal(test_label, getinfo(dtest, 'label_lower_bound'))
expect_true(setinfo(dtest, 'label_upper_bound', test_label))
expect_equal(test_label, getinfo(dtest, 'label_upper_bound'))
expect_true(length(getinfo(dtest, 'weight')) == 0) expect_true(length(getinfo(dtest, 'weight')) == 0)
expect_true(length(getinfo(dtest, 'base_margin')) == 0) expect_true(length(getinfo(dtest, 'base_margin')) == 0)

@@ -16,7 +16,7 @@ test_that("gblinear works", {
ERR_UL <- 0.005 # upper limit for the test set error ERR_UL <- 0.005 # upper limit for the test set error
VERB <- 0 # chatterbox switch VERB <- 0 # chatterbox switch
param$updater = 'shotgun' param$updater <- 'shotgun'
bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'shuffle') bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'shuffle')
ypred <- predict(bst, dtest) ypred <- predict(bst, dtest)
expect_equal(length(getinfo(dtest, 'label')), 1611) expect_equal(length(getinfo(dtest, 'label')), 1611)
@@ -29,7 +29,7 @@ test_that("gblinear works", {
expect_equal(dim(h), c(n, ncol(dtrain) + 1)) expect_equal(dim(h), c(n, ncol(dtrain) + 1))
expect_is(h, "matrix") expect_is(h, "matrix")
param$updater = 'coord_descent' param$updater <- 'coord_descent'
bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'cyclic') bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'cyclic')
expect_lt(bst$evaluation_log$eval_error[n], ERR_UL) expect_lt(bst$evaluation_log$eval_error[n], ERR_UL)
@@ -40,7 +40,7 @@ test_that("gblinear works", {
expect_lt(bst$evaluation_log$eval_error[2], ERR_UL) expect_lt(bst$evaluation_log$eval_error[2], ERR_UL)
bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'thrifty', bst <- xgb.train(param, dtrain, n, watchlist, verbose = VERB, feature_selector = 'thrifty',
top_n = 50, callbacks = list(cb.gblinear.history(sparse = TRUE))) top_k = 50, callbacks = list(cb.gblinear.history(sparse = TRUE)))
expect_lt(bst$evaluation_log$eval_error[n], ERR_UL) expect_lt(bst$evaluation_log$eval_error[n], ERR_UL)
h <- xgb.gblinear.history(bst) h <- xgb.gblinear.history(bst)
expect_equal(dim(h), c(n, ncol(dtrain) + 1)) expect_equal(dim(h), c(n, ncol(dtrain) + 1))

@@ -5,18 +5,18 @@ require(data.table)
require(Matrix) require(Matrix)
require(vcd, quietly = TRUE) require(vcd, quietly = TRUE)
float_tolerance = 5e-6 float_tolerance <- 5e-6
# disable some tests for Win32 # disable some tests for 32-bit environment
win32_flag = .Platform$OS.type == "windows" && .Machine$sizeof.pointer != 8 flag_32bit <- .Machine$sizeof.pointer != 8
set.seed(1982) set.seed(1982)
data(Arthritis) data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F) df <- data.table(Arthritis, keep.rownames = FALSE)
df[, AgeDiscret := as.factor(round(Age / 10, 0))] df[, AgeDiscret := as.factor(round(Age / 10, 0))]
df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))] df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))]
df[, ID := NULL] df[, ID := NULL]
sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df) sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df) # nolint
label <- df[, ifelse(Improved == "Marked", 1, 0)] label <- df[, ifelse(Improved == "Marked", 1, 0)]
# binary # binary
@@ -44,17 +44,17 @@ mbst.GLM <- xgboost(data = as.matrix(iris[, -5]), label = mlabel, verbose = 0,
test_that("xgb.dump works", { test_that("xgb.dump works", {
if (!win32_flag) if (!flag_32bit)
expect_length(xgb.dump(bst.Tree), 200) expect_length(xgb.dump(bst.Tree), 200)
dump_file = file.path(tempdir(), 'xgb.model.dump') dump_file <- file.path(tempdir(), 'xgb.model.dump')
expect_true(xgb.dump(bst.Tree, dump_file, with_stats = T)) expect_true(xgb.dump(bst.Tree, dump_file, with_stats = TRUE))
expect_true(file.exists(dump_file)) expect_true(file.exists(dump_file))
expect_gt(file.size(dump_file), 8000) expect_gt(file.size(dump_file), 8000)
# JSON format # JSON format
dmp <- xgb.dump(bst.Tree, dump_format = "json") dmp <- xgb.dump(bst.Tree, dump_format = "json")
expect_length(dmp, 1) expect_length(dmp, 1)
if (!win32_flag) if (!flag_32bit)
expect_length(grep('nodeid', strsplit(dmp, '\n')[[1]]), 188) expect_length(grep('nodeid', strsplit(dmp, '\n')[[1]]), 188)
}) })
@@ -160,7 +160,7 @@ test_that("SHAPs sum to predictions, with or without DART", {
objective = "reg:squarederror", objective = "reg:squarederror",
eval_metric = "rmse"), eval_metric = "rmse"),
if (booster == "dart") if (booster == "dart")
list(rate_drop = .01, one_drop = T)), list(rate_drop = .01, one_drop = TRUE)),
data = d, data = d,
label = y, label = y,
nrounds = nrounds) nrounds = nrounds)
@@ -168,9 +168,9 @@ test_that("SHAPs sum to predictions, with or without DART", {
pr <- function(...) pr <- function(...)
predict(fit, newdata = d, ...) predict(fit, newdata = d, ...)
pred <- pr() pred <- pr()
shap <- pr(predcontrib = T) shap <- pr(predcontrib = TRUE)
shapi <- pr(predinteraction = T) shapi <- pr(predinteraction = TRUE)
tol = 1e-5 tol <- 1e-5
expect_equal(rowSums(shap), pred, tol = tol) expect_equal(rowSums(shap), pred, tol = tol)
expect_equal(apply(shapi, 1, sum), pred, tol = tol) expect_equal(apply(shapi, 1, sum), pred, tol = tol)
@@ -256,7 +256,7 @@ test_that("xgb.model.dt.tree works with and without feature names", {
names.dt.trees <- c("Tree", "Node", "ID", "Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover") names.dt.trees <- c("Tree", "Node", "ID", "Feature", "Split", "Yes", "No", "Missing", "Quality", "Cover")
dt.tree <- xgb.model.dt.tree(feature_names = feature.names, model = bst.Tree) dt.tree <- xgb.model.dt.tree(feature_names = feature.names, model = bst.Tree)
expect_equal(names.dt.trees, names(dt.tree)) expect_equal(names.dt.trees, names(dt.tree))
if (!win32_flag) if (!flag_32bit)
expect_equal(dim(dt.tree), c(188, 10)) expect_equal(dim(dt.tree), c(188, 10))
expect_output(str(dt.tree), 'Feature.*\\"Age\\"') expect_output(str(dt.tree), 'Feature.*\\"Age\\"')
@@ -283,7 +283,7 @@ test_that("xgb.model.dt.tree throws error for gblinear", {
test_that("xgb.importance works with and without feature names", { test_that("xgb.importance works with and without feature names", {
importance.Tree <- xgb.importance(feature_names = feature.names, model = bst.Tree) importance.Tree <- xgb.importance(feature_names = feature.names, model = bst.Tree)
if (!win32_flag) if (!flag_32bit)
expect_equal(dim(importance.Tree), c(7, 4)) expect_equal(dim(importance.Tree), c(7, 4))
expect_equal(colnames(importance.Tree), c("Feature", "Gain", "Cover", "Frequency")) expect_equal(colnames(importance.Tree), c("Feature", "Gain", "Cover", "Frequency"))
expect_output(str(importance.Tree), 'Feature.*\\"Age\\"') expect_output(str(importance.Tree), 'Feature.*\\"Age\\"')

@@ -34,5 +34,22 @@ test_that("interaction constraints for regression", {
expect_true({ expect_true({
test1 & test2 test1 & test2
}, "Interaction Constraint Satisfied") }, "Interaction Constraint Satisfied")
})
test_that("interaction constraints scientific representation", {
rows <- 10
## For numbers of 1e5 and above, R's paste function may use scientific notation (e.g. "1e+05").
## See: https://github.com/dmlc/xgboost/issues/5179
cols <- 1e5 + 10
d <- matrix(rexp(rows, rate = .1), nrow = rows, ncol = cols)
y <- rnorm(rows)
dtrain <- xgb.DMatrix(data = d, info = list(label = y))
inc <- list(c(seq.int(from = 0, to = cols, by = 1)))
with_inc <- xgb.train(data = dtrain, tree_method = 'hist',
interaction_constraints = inc, nrounds = 10)
without_inc <- xgb.train(data = dtrain, tree_method = 'hist', nrounds = 10)
expect_equal(xgb.save.raw(with_inc), xgb.save.raw(without_inc))
}) })

@@ -26,7 +26,7 @@ test_that("predict feature interactions works", {
param <- list(eta = 0.1, max_depth = 4, base_score = mean(y), lambda = 0, nthread = 2) param <- list(eta = 0.1, max_depth = 4, base_score = mean(y), lambda = 0, nthread = 2)
b <- xgb.train(param, dm, 100) b <- xgb.train(param, dm, 100)
pred = predict(b, dm, outputmargin=TRUE) pred <- predict(b, dm, outputmargin = TRUE)
# SHAP contributions: # SHAP contributions:
cont <- predict(b, dm, predcontrib = TRUE) cont <- predict(b, dm, predcontrib = TRUE)
@@ -73,9 +73,9 @@ test_that("predict feature interactions works", {
gt_intr[, 2, 3] <- 1. * X[, 2] * X[, 3] # attribute a HALF of the interaction term to each symmetric element gt_intr[, 2, 3] <- 1. * X[, 2] * X[, 3] # attribute a HALF of the interaction term to each symmetric element
gt_intr[, 3, 2] <- gt_intr[, 2, 3] gt_intr[, 3, 2] <- gt_intr[, 2, 3]
# merge-in the diagonal based on 'ground truth' feature contributions # merge-in the diagonal based on 'ground truth' feature contributions
intr_diag = gt_cont - apply(gt_intr, c(1,2), sum) intr_diag <- gt_cont - apply(gt_intr, c(1, 2), sum)
for (j in seq_len(P)) { for (j in seq_len(P)) {
gt_intr[,j,j] = intr_diag[,j] gt_intr[, j, j] <- intr_diag[, j]
} }
# These should be relatively close: # These should be relatively close:
expect_lt(max(abs(intr - gt_intr)), 0.1) expect_lt(max(abs(intr - gt_intr)), 0.1)
@@ -107,7 +107,7 @@ test_that("SHAP contribution values are not NAN", {
shaps <- as.data.frame(predict(fit, shaps <- as.data.frame(predict(fit,
newdata = as.matrix(subset(d, fold == 1)[, ivs]), newdata = as.matrix(subset(d, fold == 1)[, ivs]),
predcontrib = T)) predcontrib = TRUE))
result <- cbind(shaps, sum = rowSums(shaps), pred = predict(fit, result <- cbind(shaps, sum = rowSums(shaps), pred = predict(fit,
newdata = as.matrix(subset(d, fold == 1)[, ivs]))) newdata = as.matrix(subset(d, fold == 1)[, ivs])))
@@ -119,7 +119,7 @@ test_that("multiclass feature interactions work", {
dm <- xgb.DMatrix(as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1) dm <- xgb.DMatrix(as.matrix(iris[, -5]), label = as.numeric(iris$Species) - 1)
param <- list(eta = 0.1, max_depth = 4, objective = 'multi:softprob', num_class = 3) param <- list(eta = 0.1, max_depth = 4, objective = 'multi:softprob', num_class = 3)
b <- xgb.train(param, dm, 40) b <- xgb.train(param, dm, 40)
pred = predict(b, dm, outputmargin=TRUE) %>% array(c(3, 150)) %>% t pred <- predict(b, dm, outputmargin = TRUE) %>% array(c(3, 150)) %>% t
# SHAP contributions: # SHAP contributions:
cont <- predict(b, dm, predcontrib = TRUE) cont <- predict(b, dm, predcontrib = TRUE)

@@ -1,27 +0,0 @@
context("Code is of high quality and lint free")
test_that("Code Lint", {
skip_on_cran()
skip_on_travis()
skip_if_not_installed("lintr")
my_linters <- list(
absolute_paths_linter=lintr::absolute_paths_linter,
assignment_linter=lintr::assignment_linter,
closed_curly_linter=lintr::closed_curly_linter,
commas_linter=lintr::commas_linter,
# commented_code_linter=lintr::commented_code_linter,
infix_spaces_linter=lintr::infix_spaces_linter,
line_length_linter=lintr::line_length_linter,
no_tab_linter=lintr::no_tab_linter,
object_usage_linter=lintr::object_usage_linter,
# snake_case_linter=lintr::snake_case_linter,
# multiple_dots_linter=lintr::multiple_dots_linter,
object_length_linter=lintr::object_length_linter,
open_curly_linter=lintr::open_curly_linter,
# single_quotes_linter=lintr::single_quotes_linter,
spaces_inside_linter=lintr::spaces_inside_linter,
spaces_left_parentheses_linter=lintr::spaces_left_parentheses_linter,
trailing_blank_lines_linter=lintr::trailing_blank_lines_linter,
trailing_whitespace_linter=lintr::trailing_whitespace_linter
)
# lintr::expect_lint_free(linters=my_linters) # uncomment this if you want to check code quality
})

@@ -0,0 +1,84 @@
require(xgboost)
require(jsonlite)
context("Models from previous versions of XGBoost can be loaded")
metadata <- list(
kRounds = 2,
kRows = 1000,
kCols = 4,
kForests = 2,
kMaxDepth = 2,
kClasses = 3
)
run_model_param_check <- function (config) {
testthat::expect_equal(config$learner$learner_model_param$num_feature, '4')
testthat::expect_equal(config$learner$learner_train_param$booster, 'gbtree')
}
get_num_tree <- function (booster) {
dump <- xgb.dump(booster)
m <- regexec('booster\\[[0-9]+\\]', dump, perl = TRUE)
m <- regmatches(dump, m)
num_tree <- Reduce('+', lapply(m, length))
return (num_tree)
}
run_booster_check <- function (booster, name) {
# If given a handle, we need to call xgb.Booster.complete() prior to using xgb.config().
if (inherits(booster, "xgb.Booster") && xgboost:::is.null.handle(booster$handle)) {
booster <- xgb.Booster.complete(booster)
}
config <- jsonlite::fromJSON(xgb.config(booster))
run_model_param_check(config)
if (name == 'cls') {
testthat::expect_equal(get_num_tree(booster),
metadata$kForests * metadata$kRounds * metadata$kClasses)
testthat::expect_equal(as.numeric(config$learner$learner_model_param$base_score), 0.5)
testthat::expect_equal(config$learner$learner_train_param$objective, 'multi:softmax')
testthat::expect_equal(as.numeric(config$learner$learner_model_param$num_class),
metadata$kClasses)
} else if (name == 'logit') {
testthat::expect_equal(get_num_tree(booster), metadata$kForests * metadata$kRounds)
testthat::expect_equal(as.numeric(config$learner$learner_model_param$num_class), 0)
testthat::expect_equal(config$learner$learner_train_param$objective, 'binary:logistic')
} else if (name == 'ltr') {
testthat::expect_equal(get_num_tree(booster), metadata$kForests * metadata$kRounds)
testthat::expect_equal(config$learner$learner_train_param$objective, 'rank:ndcg')
} else {
testthat::expect_equal(name, 'reg')
testthat::expect_equal(get_num_tree(booster), metadata$kForests * metadata$kRounds)
testthat::expect_equal(as.numeric(config$learner$learner_model_param$base_score), 0.5)
testthat::expect_equal(config$learner$learner_train_param$objective, 'reg:squarederror')
}
}
test_that("Models from previous versions of XGBoost can be loaded", {
bucket <- 'xgboost-ci-jenkins-artifacts'
region <- 'us-west-2'
file_name <- 'xgboost_r_model_compatibility_test.zip'
zipfile <- file.path(getwd(), file_name)
model_dir <- file.path(getwd(), 'models')
download.file(paste('https://', bucket, '.s3-', region, '.amazonaws.com/', file_name, sep = ''),
destfile = zipfile, mode = 'wb')
unzip(zipfile, overwrite = TRUE)
pred_data <- xgb.DMatrix(matrix(c(0, 0, 0, 0), nrow = 1, ncol = 4))
lapply(list.files(model_dir), function (x) {
model_file <- file.path(model_dir, x)
m <- regexec("xgboost-([0-9\\.]+)\\.([a-z]+)\\.[a-z]+", model_file, perl = TRUE)
m <- regmatches(model_file, m)[[1]]
model_xgb_ver <- m[2]
name <- m[3]
if (endsWith(model_file, '.rds')) {
booster <- readRDS(model_file)
} else {
booster <- xgb.load(model_file)
}
predict(booster, newdata = pred_data)
run_booster_check(booster, name)
})
})

@@ -3,22 +3,21 @@ require(xgboost)
context("monotone constraints") context("monotone constraints")
set.seed(1024) set.seed(1024)
x = rnorm(1000, 10) x <- rnorm(1000, 10)
y = -1*x + rnorm(1000, 0.001) + 3*sin(x) y <- -1 * x + rnorm(1000, 0.001) + 3 * sin(x)
train = matrix(x, ncol = 1) train <- matrix(x, ncol = 1)
test_that("monotone constraints for regression", { test_that("monotone constraints for regression", {
bst = xgboost(data = train, label = y, max_depth = 2, bst <- xgboost(data = train, label = y, max_depth = 2,
eta = 0.1, nthread = 2, nrounds = 100, verbose = 0, eta = 0.1, nthread = 2, nrounds = 100, verbose = 0,
monotone_constraints = -1) monotone_constraints = -1)
pred = predict(bst, train) pred <- predict(bst, train)
ind = order(train[,1]) ind <- order(train[, 1])
pred.ord = pred[ind] pred.ord <- pred[ind]
expect_true({ expect_true({
!any(diff(pred.ord) > 0) !any(diff(pred.ord) > 0)
}, "Monotone Constraint Satisfied") }, "Monotone Constraint Satisfied")
}) })

@@ -0,0 +1,51 @@
require(xgboost)
require(Matrix)
context('Learning to rank')
test_that('Test ranking with unweighted data', {
X <- sparseMatrix(i = c(2, 3, 7, 9, 12, 15, 17, 18),
j = c(1, 1, 2, 2, 3, 3, 4, 4),
x = rep(1.0, 8), dims = c(20, 4))
y <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0)
group <- c(5, 5, 5, 5)
dtrain <- xgb.DMatrix(X, label = y, group = group)
params <- list(eta = 1, tree_method = 'exact', objective = 'rank:pairwise', max_depth = 1,
eval_metric = 'auc', eval_metric = 'aucpr')
bst <- xgb.train(params, dtrain, nrounds = 10, watchlist = list(train = dtrain))
# Check if the metric is monotone increasing
expect_true(all(diff(bst$evaluation_log$train_auc) >= 0))
expect_true(all(diff(bst$evaluation_log$train_aucpr) >= 0))
})
test_that('Test ranking with weighted data', {
X <- sparseMatrix(i = c(2, 3, 7, 9, 12, 15, 17, 18),
j = c(1, 1, 2, 2, 3, 3, 4, 4),
x = rep(1.0, 8), dims = c(20, 4))
y <- c(0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0)
group <- c(5, 5, 5, 5)
weight <- c(1.0, 2.0, 3.0, 4.0)
dtrain <- xgb.DMatrix(X, label = y, group = group, weight = weight)
params <- list(eta = 1, tree_method = 'exact', objective = 'rank:pairwise', max_depth = 1,
eval_metric = 'auc', eval_metric = 'aucpr')
bst <- xgb.train(params, dtrain, nrounds = 10, watchlist = list(train = dtrain))
# Check if the metric is monotone increasing
expect_true(all(diff(bst$evaluation_log$train_auc) >= 0))
expect_true(all(diff(bst$evaluation_log$train_aucpr) >= 0))
for (i in 1:10) {
pred <- predict(bst, newdata = dtrain, ntreelimit = i)
# is_sorted[i]: is i-th group correctly sorted by the ranking predictor?
is_sorted <- lapply(seq(1, 20, by = 5),
function (k) {
ind <- order(-pred[k:(k + 4)])
z <- y[ind + (k - 1)]
all(diff(z) <= 0) # Check if z is monotone decreasing
})
# Since we give weights 1, 2, 3, 4 to the four query groups,
# the ranking predictor will first try to correctly sort the last query group
# before correctly sorting other groups.
expect_true(all(diff(as.numeric(is_sorted)) >= 0))
}
})
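
The per-group check in the weighted ranking test above can be sketched in Python as follows. This is only an illustration of the idea, not code from the commit; `group_is_sorted` is a hypothetical helper name, and the sample scores and labels are made up.

```python
def group_is_sorted(preds, labels):
    """Return True if ranking one query group by descending score
    puts its labels in non-increasing order."""
    # Indices of the documents, best-scored first
    order = sorted(range(len(preds)), key=lambda i: -preds[i])
    ranked_labels = [labels[i] for i in order]
    # Non-increasing labels mean every relevant item precedes every irrelevant one
    return all(a >= b for a, b in zip(ranked_labels, ranked_labels[1:]))


# Illustrative values only (not taken from the test data)
print(group_is_sorted([0.9, 0.1, 0.5], [1, 0, 1]))
print(group_is_sorted([0.1, 0.9], [1, 0]))
```

The R test applies the same check to each block of five rows and then asserts that the heavier-weighted groups are sorted no later than the lighter ones.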


@@ -9,10 +9,10 @@ dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)
# Disable flaky tests for 32-bit Windows. # Disable flaky tests for 32-bit Windows.
# See https://github.com/dmlc/xgboost/issues/3720 # See https://github.com/dmlc/xgboost/issues/3720
win32_flag = .Platform$OS.type == "windows" && .Machine$sizeof.pointer != 8 win32_flag <- .Platform$OS.type == "windows" && .Machine$sizeof.pointer != 8
test_that("updating the model works", { test_that("updating the model works", {
watchlist = list(train = dtrain, test = dtest) watchlist <- list(train = dtrain, test = dtest)
# no-subsampling # no-subsampling
p1 <- list(objective = "binary:logistic", max_depth = 2, eta = 0.05, nthread = 2) p1 <- list(objective = "binary:logistic", max_depth = 2, eta = 0.05, nthread = 2)


@@ -57,16 +57,16 @@ To answer the question above we will convert *categorical* variables to `numeric
In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features. In this Vignette we will see how to transform a *dense* `data.frame` (*dense* = few zeroes in the matrix) with *categorical* variables to a very *sparse* matrix (*sparse* = lots of zero in the matrix) of `numeric` features.
The method we are going to see is usually called [one-hot encoding](http://en.wikipedia.org/wiki/One-hot). The method we are going to see is usually called [one-hot encoding](https://en.wikipedia.org/wiki/One-hot).
The first step is to load `Arthritis` dataset in memory and wrap it with `data.table` package. The first step is to load `Arthritis` dataset in memory and wrap it with `data.table` package.
```{r, results='hide'} ```{r, results='hide'}
data(Arthritis) data(Arthritis)
df <- data.table(Arthritis, keep.rownames = F) df <- data.table(Arthritis, keep.rownames = FALSE)
``` ```
> `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](http://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **Xgboost** **R** package use `data.table`. > `data.table` is 100% compliant with **R** `data.frame` but its syntax is more consistent and its performance for large dataset is [best in class](https://stackoverflow.com/questions/21435339/data-table-vs-dplyr-can-one-do-something-well-the-other-cant-or-does-poorly) (`dplyr` from **R** and `Pandas` from **Python** [included](https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping)). Some parts of **Xgboost** **R** package use `data.table`.
The first thing we want to do is to have a look to the first few lines of the `data.table`: The first thing we want to do is to have a look to the first few lines of the `data.table`:
@@ -137,8 +137,8 @@ levels(df[,Treatment])
#### Encoding categorical features #### Encoding categorical features
Next step, we will transform the categorical data to dummy variables. Next step, we will transform the categorical data to dummy variables.
Several encoding methods exist, e.g., [one-hot encoding](http://en.wikipedia.org/wiki/One-hot) is a common approach. Several encoding methods exist, e.g., [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) is a common approach.
We will use the [dummy contrast coding](http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm#dummy) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)). We will use the [dummy contrast coding](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/) which is popular because it produces "full rank" encoding (also see [this blog post by Max Kuhn](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models)).
The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`. The purpose is to transform each value of each *categorical* feature into a *binary* feature `{0, 1}`.
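
As a rough illustration of the encoding this vignette hunk describes, here is a minimal Python sketch (not part of the commit). `one_hot` and `dummy_code` are hypothetical helper names, and the sample values are illustrative. One-hot encoding produces one `{0, 1}` column per level, while dummy contrast coding drops a reference level, which is what makes the result "full rank".

```python
def one_hot(values):
    """Return (levels, rows) with one 0/1 column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


def dummy_code(values, reference=None):
    """Like one_hot, but drop the reference level (default: first sorted level)."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


treatment = ["Placebo", "Treated", "Treated", "Placebo"]  # illustrative values
print(one_hot(treatment))     # (['Placebo', 'Treated'], [[1, 0], [0, 1], [0, 1], [1, 0]])
print(dummy_code(treatment))  # (['Treated'], [[0], [1], [1], [0]])
```

With two levels, the one-hot columns always sum to 1, so one of them is redundant. Dummy coding keeps only the non-reference column.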
@@ -176,7 +176,7 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max_depth = 4,
You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better. You can see some `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well the model explains your data. Lower is better.
A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future). A model which fits too well may [overfit](https://en.wikipedia.org/wiki/Overfitting) (meaning it copy/paste too much the past, and won't be that good to predict the future).
> Here you can see the numbers decrease until line 7 and then increase. > Here you can see the numbers decrease until line 7 and then increase.
> >
@@ -304,7 +304,7 @@ Linear model may not be that smart in this scenario.
Special Note: What about Random Forests™? Special Note: What about Random Forests™?
----------------------------------------- -----------------------------------------
As you may know, [Random Forests™](http://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family. As you may know, [Random Forests™](https://en.wikipedia.org/wiki/Random_forest) algorithm is cousin with boosting and both are part of the [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) family.
Both trains several decision trees for one dataset. The *main* difference is that in Random Forests™, trees are independent and in boosting, the tree `N+1` focus its learning on the loss (<=> what has not been well modeled by the tree `N`). Both trains several decision trees for one dataset. The *main* difference is that in Random Forests™, trees are independent and in boosting, the tree `N+1` focus its learning on the loss (<=> what has not been well modeled by the tree `N`).
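
The boosting half of that comparison can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not XGBoost's algorithm: it uses squared error, one-split stumps, and hypothetical helper names (`fit_stump`, `boost`, `mse`). Each new stump is fitted to the residuals left by the trees before it, whereas a Random Forests™-style ensemble would fit every tree to (a resample of) the original targets independently.

```python
import math


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_stump(xs, targets):
    """Fit the best single-split regression stump under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # candidate thresholds; both sides stay non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, eta=0.5):
    """Tree N+1 is fitted to what trees 1..N have not yet explained."""
    preds = [0.0] * len(xs)
    history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + eta * stump(x) for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return preds, history


xs = [i / 10 for i in range(50)]
ys = [math.sin(x) for x in xs]
_, history = boost(xs, ys)
print(history[0], history[-1])  # training error after the first and last round
```

Because every stump targets the current residuals, the training error cannot go up from one round to the next in this setting. That dependence between trees is the difference the paragraph describes.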


@@ -163,7 +163,7 @@ evalerror <- function(preds, dtrain) {
dtest <- xgb.DMatrix(test$data, label = test$label) dtest <- xgb.DMatrix(test$data, label = test$label)
watchlist <- list(eval = dtest, train = dtrain) watchlist <- list(eval = dtest, train = dtrain)
param <- list(max_depth = 2, eta = 1, silent = 1) param <- list(max_depth = 2, eta = 1)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, logregobj, evalerror, maximize = FALSE) bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, logregobj, evalerror, maximize = FALSE)
@ @


@@ -24,7 +24,7 @@
author = "K. Bache and M. Lichman", author = "K. Bache and M. Lichman",
year = "2013", year = "2013",
title = "{UCI} Machine Learning Repository", title = "{UCI} Machine Learning Repository",
url = "http://archive.ics.uci.edu/ml", url = "http://archive.ics.uci.edu/ml/",
institution = "University of California, Irvine, School of Information and Computer Sciences" institution = "University of California, Irvine, School of Information and Computer Sciences"
} }


@@ -68,7 +68,7 @@ The version 0.4-2 is on CRAN, and you can install it by:
install.packages("xgboost") install.packages("xgboost")
``` ```
Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost) Formerly available versions can be obtained from the CRAN [archive](https://cran.r-project.org/src/contrib/Archive/xgboost/)
## Learning ## Learning
@@ -363,7 +363,7 @@ xgb.plot.importance(importance_matrix = importance_matrix)
You can dump the tree you learned using `xgb.dump` into a text file. You can dump the tree you learned using `xgb.dump` into a text file.
```{r dump, message=T, warning=F} ```{r dump, message=T, warning=F}
xgb.dump(bst, with_stats = T) xgb.dump(bst, with_stats = TRUE)
``` ```
You can plot the trees from your model using ```xgb.plot.tree`` You can plot the trees from your model using ```xgb.plot.tree``
@@ -410,7 +410,7 @@ In some very specific cases, like when you want to pilot **XGBoost** from `caret
```{r saveLoadRBinVectorModel, message=F, warning=F} ```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector # save model to R's raw vector
rawVec <- xgb.save.raw(bst) rawVec <- xgb.serialize(bst)
# print class # print class
print(class(rawVec)) print(class(rawVec))


@@ -27,7 +27,7 @@ License
Contribute to XGBoost Contribute to XGBoost
--------------------- ---------------------
XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone. XGBoost has been developed and used by a group of active community members. Your help is very valuable to make the package better for everyone.
Checkout the [Community Page](https://xgboost.ai/community) Checkout the [Community Page](https://xgboost.ai/community).
Reference Reference
--------- ---------


@@ -14,6 +14,7 @@
#include "../src/metric/elementwise_metric.cc" #include "../src/metric/elementwise_metric.cc"
#include "../src/metric/multiclass_metric.cc" #include "../src/metric/multiclass_metric.cc"
#include "../src/metric/rank_metric.cc" #include "../src/metric/rank_metric.cc"
#include "../src/metric/survival_metric.cc"
// objectives // objectives
#include "../src/objective/objective.cc" #include "../src/objective/objective.cc"
@@ -21,6 +22,7 @@
#include "../src/objective/multiclass_obj.cc" #include "../src/objective/multiclass_obj.cc"
#include "../src/objective/rank_obj.cc" #include "../src/objective/rank_obj.cc"
#include "../src/objective/hinge.cc" #include "../src/objective/hinge.cc"
#include "../src/objective/aft_obj.cc"
// gbms // gbms
#include "../src/gbm/gbm.cc" #include "../src/gbm/gbm.cc"
@@ -31,7 +33,6 @@
// data // data
#include "../src/data/data.cc" #include "../src/data/data.cc"
#include "../src/data/simple_csr_source.cc"
#include "../src/data/simple_dmatrix.cc" #include "../src/data/simple_dmatrix.cc"
#include "../src/data/sparse_page_raw_format.cc" #include "../src/data/sparse_page_raw_format.cc"
#include "../src/data/ellpack_page.cc" #include "../src/data/ellpack_page.cc"
@@ -45,7 +46,7 @@
#include "../src/data/sparse_page_dmatrix.cc" #include "../src/data/sparse_page_dmatrix.cc"
#endif #endif
// tress // trees
#include "../src/tree/param.cc" #include "../src/tree/param.cc"
#include "../src/tree/split_evaluator.cc" #include "../src/tree/split_evaluator.cc"
#include "../src/tree/tree_model.cc" #include "../src/tree/tree_model.cc"
@@ -68,11 +69,13 @@
#include "../src/learner.cc" #include "../src/learner.cc"
#include "../src/logging.cc" #include "../src/logging.cc"
#include "../src/common/common.cc" #include "../src/common/common.cc"
#include "../src/common/charconv.cc"
#include "../src/common/timer.cc" #include "../src/common/timer.cc"
#include "../src/common/host_device_vector.cc" #include "../src/common/host_device_vector.cc"
#include "../src/common/hist_util.cc" #include "../src/common/hist_util.cc"
#include "../src/common/json.cc" #include "../src/common/json.cc"
#include "../src/common/io.cc" #include "../src/common/io.cc"
#include "../src/common/survival_util.cc"
#include "../src/common/version.cc" #include "../src/common/version.cc"
// c_api // c_api


@@ -1,6 +1,4 @@
environment: environment:
R_ARCH: x64
USE_RTOOLS: true
matrix: matrix:
- target: msvc - target: msvc
ver: 2015 ver: 2015
@@ -12,13 +10,6 @@ environment:
configuration: Release configuration: Release
- target: mingw - target: mingw
generator: "Unix Makefiles" generator: "Unix Makefiles"
- target: jvm
- target: rmsvc
ver: 2015
generator: "Visual Studio 14 2015 Win64"
configuration: Release
- target: rmingw
generator: "Unix Makefiles"
#matrix: #matrix:
# fast_finish: true # fast_finish: true
@@ -44,21 +35,9 @@ install:
- if /i "%DO_PYTHON%" == "on" ( - if /i "%DO_PYTHON%" == "on" (
conda config --set always_yes true && conda config --set always_yes true &&
conda update -q conda && conda update -q conda &&
conda install -y numpy scipy pandas matplotlib pytest scikit-learn graphviz python-graphviz conda install -y numpy scipy pandas matplotlib pytest scikit-learn graphviz python-graphviz hypothesis
) )
- set PATH=C:\Miniconda3-x64\Library\bin\graphviz;%PATH% - set PATH=C:\Miniconda3-x64\Library\bin\graphviz;%PATH%
# R: based on https://github.com/krlmlr/r-appveyor
- ps: |
if($env:target -eq 'rmingw' -or $env:target -eq 'rmsvc') {
#$ErrorActionPreference = "Stop"
Invoke-WebRequest https://raw.githubusercontent.com/krlmlr/r-appveyor/master/scripts/appveyor-tool.ps1 -OutFile "$Env:TEMP\appveyor-tool.ps1"
Import-Module "$Env:TEMP\appveyor-tool.ps1"
Bootstrap
$BINARY_DEPS = "c('XML','igraph')"
cmd.exe /c "R.exe -q -e ""install.packages($BINARY_DEPS, repos='$CRAN', type='win.binary')"" 2>&1"
$DEPS = "c('data.table','magrittr','stringi','ggplot2','DiagrammeR','Ckmeans.1d.dp','vcd','testthat','lintr','knitr','rmarkdown')"
cmd.exe /c "R.exe -q -e ""install.packages($DEPS, repos='$CRAN', type='both')"" 2>&1"
}
build_script: build_script:
- cd %APPVEYOR_BUILD_FOLDER% - cd %APPVEYOR_BUILD_FOLDER%
@@ -81,53 +60,12 @@ build_script:
mkdir wheel && mkdir wheel &&
python setup.py bdist_wheel --universal --plat-name win-amd64 -d wheel python setup.py bdist_wheel --universal --plat-name win-amd64 -d wheel
) )
# R package: make + mingw standard CRAN packaging (only x64 for now)
- if /i "%target%" == "rmingw" (
make Rbuild &&
ls -l &&
R.exe CMD INSTALL xgboost*.tar.gz
)
# R package: cmake + VC2015
- if /i "%target%" == "rmsvc" (
mkdir build_rmsvc%ver% &&
cd build_rmsvc%ver% &&
cmake .. -G"%generator%" -DCMAKE_CONFIGURATION_TYPES="Release" -DR_LIB=ON &&
cmake --build . --target install --config Release
)
- if /i "%target%" == "jvm" cd jvm-packages && mvn test -pl :xgboost4j_2.12
test_script: test_script:
- cd %APPVEYOR_BUILD_FOLDER% - cd %APPVEYOR_BUILD_FOLDER%
- if /i "%DO_PYTHON%" == "on" python -m pytest tests/python - if /i "%DO_PYTHON%" == "on" python -m pytest tests/python
# mingw R package: run the R check (which includes unit tests), and also keep the built binary package
- if /i "%target%" == "rmingw" (
set _R_CHECK_CRAN_INCOMING_=FALSE&&
set _R_CHECK_FORCE_SUGGESTS_=FALSE&&
R.exe CMD check xgboost*.tar.gz --no-manual --no-build-vignettes --as-cran --install-args=--build
)
# MSVC R package: run only the unit tests
- if /i "%target%" == "rmsvc" (
cd build_rmsvc%ver%\R-package &&
R.exe -q -e "library(testthat); setwd('tests'); source('testthat.R')"
)
on_failure:
# keep the whole output of R check
- if /i "%target%" == "rmingw" (
7z a failure.zip *.Rcheck\* &&
appveyor PushArtifact failure.zip
)
artifacts: artifacts:
# log from R check
- path: '*.Rcheck\**\*.log'
name: Logs
# source R-package
- path: '\xgboost_*.tar.gz'
name: Bits
# binary R-package
- path: '**\xgboost_*.zip'
name: Bits
# binary Python wheel package # binary Python wheel package
- path: '**\*.whl' - path: '**\*.whl'
name: Bits name: Bits


@@ -1 +1 @@
@xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@-SNAPSHOT @xgboost_VERSION_MAJOR@.@xgboost_VERSION_MINOR@.@xgboost_VERSION_PATCH@


@@ -0,0 +1,34 @@
# Commands to install the R package as a CMake install target
function(check_call)
set(cmd COMMAND)
cmake_parse_arguments(
PARSE_ARGV 0
CALL_ARG "" "" "${cmd}"
)
string(REPLACE ";" " " commands "${CALL_ARG_COMMAND}")
message("Command: ${commands}")
execute_process(COMMAND ${CALL_ARG_COMMAND}
OUTPUT_VARIABLE _out
ERROR_VARIABLE _err
RESULT_VARIABLE _res)
if(NOT "${_res}" EQUAL "0")
message(FATAL_ERROR "out: ${_out}, err: ${_err}, res: ${_res}")
endif()
endfunction()
# Important paths
set(build_dir "@build_dir@")
set(LIBR_EXECUTABLE "@LIBR_EXECUTABLE@")
# Write stub Makevars files (a bare `all:` target) so that R CMD INSTALL does not recompile the sources
file(WRITE "${build_dir}/R-package/src/Makevars" "all:")
file(WRITE "${build_dir}/R-package/src/Makevars.win" "all:")
# Install dependencies
set(XGB_DEPS_SCRIPT
"deps = setdiff(c('data.table', 'magrittr', 'stringi'), rownames(installed.packages())); if(length(deps)>0) install.packages(deps, repo = 'https://cloud.r-project.org/')")
check_call(COMMAND "${LIBR_EXECUTABLE}" -q -e "${XGB_DEPS_SCRIPT}")
# Install the XGBoost R package
check_call(COMMAND "${LIBR_EXECUTABLE}" CMD INSTALL --no-multiarch --build "${build_dir}/R-package")


@@ -0,0 +1,16 @@
# Assembles the R-package files in build_dir;
# if necessary, installs the main R package dependencies;
# runs R CMD INSTALL.
function(setup_rpackage_install_target rlib_target build_dir)
configure_file(${PROJECT_SOURCE_DIR}/cmake/RPackageInstall.cmake.in ${PROJECT_BINARY_DIR}/RPackageInstall.cmake @ONLY)
install(
DIRECTORY "${xgboost_SOURCE_DIR}/R-package"
DESTINATION "${build_dir}"
REGEX "src/*" EXCLUDE
REGEX "R-package/configure" EXCLUDE
)
install(TARGETS ${rlib_target}
LIBRARY DESTINATION "${build_dir}/R-package/src/"
RUNTIME DESTINATION "${build_dir}/R-package/src/")
install(SCRIPT ${PROJECT_BINARY_DIR}/RPackageInstall.cmake)
endfunction()


@@ -65,6 +65,11 @@ function(set_output_directory target dir)
LIBRARY_OUTPUT_DIRECTORY_RELEASE ${dir} LIBRARY_OUTPUT_DIRECTORY_RELEASE ${dir}
LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO ${dir} LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO ${dir}
LIBRARY_OUTPUT_DIRECTORY_MINSIZEREL ${dir} LIBRARY_OUTPUT_DIRECTORY_MINSIZEREL ${dir}
ARCHIVE_OUTPUT_DIRECTORY ${dir}
ARCHIVE_OUTPUT_DIRECTORY_DEBUG ${dir}
ARCHIVE_OUTPUT_DIRECTORY_RELEASE ${dir}
ARCHIVE_OUTPUT_DIRECTORY_RELWITHDEBINFO ${dir}
ARCHIVE_OUTPUT_DIRECTORY_MINSIZEREL ${dir}
) )
endfunction(set_output_directory) endfunction(set_output_directory)
@@ -105,34 +110,9 @@ function(format_gencode_flags flags out)
set(${out} "${${out}}" PARENT_SCOPE) set(${out} "${${out}}" PARENT_SCOPE)
endfunction(format_gencode_flags flags) endfunction(format_gencode_flags flags)
# Assembles the R-package files in build_dir; macro(enable_nvtx target)
# if necessary, installs the main R package dependencies; find_package(NVTX REQUIRED)
# runs R CMD INSTALL. target_include_directories(${target} PRIVATE "${NVTX_INCLUDE_DIR}")
function(setup_rpackage_install_target rlib_target build_dir) target_link_libraries(${target} PRIVATE "${NVTX_LIBRARY}")
# backup cmake_install.cmake target_compile_definitions(${target} PRIVATE -DXGBOOST_USE_NVTX=1)
install(CODE "file(COPY \"${build_dir}/R-package/cmake_install.cmake\" endmacro()
DESTINATION \"${build_dir}/bak\")")
install(CODE "file(REMOVE_RECURSE \"${build_dir}/R-package\")")
install(
DIRECTORY "${xgboost_SOURCE_DIR}/R-package"
DESTINATION "${build_dir}"
REGEX "src/*" EXCLUDE
REGEX "R-package/configure" EXCLUDE
)
install(TARGETS ${rlib_target}
LIBRARY DESTINATION "${build_dir}/R-package/src/"
RUNTIME DESTINATION "${build_dir}/R-package/src/")
install(CODE "file(WRITE \"${build_dir}/R-package/src/Makevars\" \"all:\")")
install(CODE "file(WRITE \"${build_dir}/R-package/src/Makevars.win\" \"all:\")")
set(XGB_DEPS_SCRIPT
"deps = setdiff(c('data.table', 'magrittr', 'stringi'), rownames(installed.packages()));\
if(length(deps)>0) install.packages(deps, repo = 'https://cloud.r-project.org/')")
install(CODE "execute_process(COMMAND \"${LIBR_EXECUTABLE}\" \"-q\" \"-e\" \"${XGB_DEPS_SCRIPT}\")")
install(CODE "execute_process(COMMAND \"${LIBR_EXECUTABLE}\" CMD INSTALL\
\"--no-multiarch\" \"--build\" \"${build_dir}/R-package\")")
# restore cmake_install.cmake
install(CODE "file(RENAME \"${build_dir}/bak/cmake_install.cmake\"
\"${build_dir}/R-package/cmake_install.cmake\")")
endfunction(setup_rpackage_install_target)


@@ -23,7 +23,7 @@
# Windows users might want to change this to their R version: # Windows users might want to change this to their R version:
if(NOT R_VERSION) if(NOT R_VERSION)
set(R_VERSION "3.4.1") set(R_VERSION "4.0.0")
endif() endif()
if(NOT R_ARCH) if(NOT R_ARCH)
if("${CMAKE_SIZEOF_VOID_P}" STREQUAL "4") if("${CMAKE_SIZEOF_VOID_P}" STREQUAL "4")
@@ -37,23 +37,33 @@ endif()
# Creates R.lib and R.def in the build directory for linking with MSVC # Creates R.lib and R.def in the build directory for linking with MSVC
function(create_rlib_for_msvc) function(create_rlib_for_msvc)
# various checks and warnings # various checks and warnings
if(NOT WIN32 OR NOT MSVC) if(NOT WIN32 OR (NOT MSVC AND NOT MINGW))
message(FATAL_ERROR "create_rlib_for_msvc() can only be used with MSVC") message(FATAL_ERROR "create_rlib_for_msvc() can only be used with MSVC or MINGW")
endif() endif()
if(NOT EXISTS "${LIBR_LIB_DIR}") if(NOT EXISTS "${LIBR_LIB_DIR}")
message(FATAL_ERROR "LIBR_LIB_DIR was not set!") message(FATAL_ERROR "LIBR_LIB_DIR was not set!")
endif() endif()
find_program(GENDEF_EXE gendef)
find_program(DLLTOOL_EXE dlltool) find_program(DLLTOOL_EXE dlltool)
if(NOT GENDEF_EXE OR NOT DLLTOOL_EXE) if(NOT DLLTOOL_EXE)
message(FATAL_ERROR "\nEither gendef.exe or dlltool.exe not found!\ message(FATAL_ERROR "\ndlltool.exe not found!\
\nDo you have Rtools installed with its MinGW's bin/ in PATH?") \nDo you have Rtools installed with its MinGW's bin/ in PATH?")
endif() endif()
# extract symbols from R.dll into R.def and R.lib import library # extract symbols from R.dll into R.def and R.lib import library
execute_process(COMMAND gendef get_filename_component(
"-" "${LIBR_LIB_DIR}/R.dll" LIBR_RSCRIPT_EXECUTABLE_DIR
OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/R.def") ${LIBR_EXECUTABLE}
execute_process(COMMAND dlltool DIRECTORY
)
set(LIBR_RSCRIPT_EXECUTABLE "${LIBR_RSCRIPT_EXECUTABLE_DIR}/Rscript")
execute_process(
COMMAND ${LIBR_RSCRIPT_EXECUTABLE}
"${CMAKE_CURRENT_BINARY_DIR}/../../R-package/inst/make-r-def.R"
"${LIBR_LIB_DIR}/R.dll" "${CMAKE_CURRENT_BINARY_DIR}/R.def"
)
execute_process(COMMAND ${DLLTOOL_EXE}
"--input-def" "${CMAKE_CURRENT_BINARY_DIR}/R.def" "--input-def" "${CMAKE_CURRENT_BINARY_DIR}/R.def"
"--output-lib" "${CMAKE_CURRENT_BINARY_DIR}/R.lib") "--output-lib" "${CMAKE_CURRENT_BINARY_DIR}/R.lib")
endfunction(create_rlib_for_msvc) endfunction(create_rlib_for_msvc)
@@ -103,12 +113,12 @@ else()
) )
# ask R for the include dir # ask R for the include dir
execute_process( execute_process(
COMMAND ${LIBR_EXECUTABLE} "--slave" "--no-save" "-e" "cat(R.home('include'))" COMMAND ${LIBR_EXECUTABLE} "--slave" "--vanilla" "-e" "cat(R.home('include'))"
OUTPUT_VARIABLE LIBR_INCLUDE_DIRS OUTPUT_VARIABLE LIBR_INCLUDE_DIRS
) )
# ask R for the lib dir # ask R for the lib dir
execute_process( execute_process(
COMMAND ${LIBR_EXECUTABLE} "--slave" "--no-save" "-e" "cat(R.home('lib'))" COMMAND ${LIBR_EXECUTABLE} "--slave" "--vanilla" "-e" "cat(R.home('lib'))"
OUTPUT_VARIABLE LIBR_LIB_DIR OUTPUT_VARIABLE LIBR_LIB_DIR
) )
@@ -148,7 +158,7 @@ message(STATUS "LIBR_CORE_LIBRARY [${LIBR_CORE_LIBRARY}]")
endif() endif()
if(WIN32 AND MSVC) if((WIN32 AND MSVC) OR (WIN32 AND MINGW))
# create a local R.lib import library for R.dll if it doesn't exist # create a local R.lib import library for R.dll if it doesn't exist
if(NOT EXISTS "${CMAKE_CURRENT_BINARY_DIR}/R.lib") if(NOT EXISTS "${CMAKE_CURRENT_BINARY_DIR}/R.lib")
create_rlib_for_msvc() create_rlib_for_msvc()


@@ -0,0 +1,26 @@
if (NVTX_LIBRARY)
unset(NVTX_LIBRARY CACHE)
endif (NVTX_LIBRARY)
set(NVTX_LIB_NAME nvToolsExt)
find_path(NVTX_INCLUDE_DIR
NAMES nvToolsExt.h
PATHS ${CUDA_HOME}/include ${CUDA_INCLUDE} /usr/local/cuda/include)
find_library(NVTX_LIBRARY
NAMES nvToolsExt
PATHS ${CUDA_HOME}/lib64 /usr/local/cuda/lib64)
message(STATUS "Using nvtx library: ${NVTX_LIBRARY}")
include(FindPackageHandleStandardArgs)
find_package_handle_standard_args(NVTX DEFAULT_MSG
NVTX_INCLUDE_DIR NVTX_LIBRARY)
mark_as_advanced(
NVTX_INCLUDE_DIR
NVTX_LIBRARY
)

cmake/xgboost.pc.in Normal file

@@ -0,0 +1,12 @@
prefix=@CMAKE_INSTALL_PREFIX@
version=@xgboost_VERSION@
exec_prefix=${prefix}/bin
libdir=${prefix}/lib
includedir=${prefix}/include
Name: xgboost
Description: XGBoost - Scalable and Flexible Gradient Boosting.
Version: ${version}
Cflags: -I${includedir}
Libs: -L${libdir} -lxgboost

cub

Submodule cub updated: b20808b1b0...c3cceac115


@@ -16,6 +16,7 @@ Contents
- [Tutorials](#tutorials) - [Tutorials](#tutorials)
- [Usecases](#usecases) - [Usecases](#usecases)
- [Tools using XGBoost](#tools-using-xgboost) - [Tools using XGBoost](#tools-using-xgboost)
- [Integrations with 3rd party software](#integrations-with-3rd-party-software)
- [Awards](#awards) - [Awards](#awards)
- [Windows Binaries](#windows-binaries) - [Windows Binaries](#windows-binaries)
@@ -114,6 +115,7 @@ Please send pull requests if you find ones that are missing here.
- [Complete Guide to Parameter Tuning in XGBoost](http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) by Aarshay Jain - [Complete Guide to Parameter Tuning in XGBoost](http://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) by Aarshay Jain
- [Practical XGBoost in Python online course](http://education.parrotprediction.teachable.com/courses/practical-xgboost-in-python) by Parrot Prediction - [Practical XGBoost in Python online course](http://education.parrotprediction.teachable.com/courses/practical-xgboost-in-python) by Parrot Prediction
- [Spark and XGBoost using Scala](http://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/) by Elena Cuoco - [Spark and XGBoost using Scala](http://www.elenacuoco.com/2016/10/10/scala-spark-xgboost-classification/) by Elena Cuoco
## Usecases ## Usecases
If you have particular usecase of xgboost that you would like to highlight. If you have particular usecase of xgboost that you would like to highlight.
Send a PR to add a one sentence description:) Send a PR to add a one sentence description:)
@@ -126,14 +128,17 @@ Send a PR to add a one sentence description:)
- [Hanjing Su](https://www.52cs.org) from Tencent data platform team: "We use distributed XGBoost for click through prediction in wechat shopping and lookalikes. The problems involve hundreds millions of users and thousands of features. XGBoost is cleanly designed and can be easily integrated into our production environment, reducing our cost in developments." - [Hanjing Su](https://www.52cs.org) from Tencent data platform team: "We use distributed XGBoost for click through prediction in wechat shopping and lookalikes. The problems involve hundreds millions of users and thousands of features. XGBoost is cleanly designed and can be easily integrated into our production environment, reducing our cost in developments."
- [CNevd](https://github.com/CNevd) from autohome.com ad platform team: "Distributed XGBoost is used for click through rate prediction in our display advertising, XGBoost is highly efficient and flexible and can be easily used on our distributed platform, our ctr made a great improvement with hundred millions samples and millions features due to this awesome XGBoost" - [CNevd](https://github.com/CNevd) from autohome.com ad platform team: "Distributed XGBoost is used for click through rate prediction in our display advertising, XGBoost is highly efficient and flexible and can be easily used on our distributed platform, our ctr made a great improvement with hundred millions samples and millions features due to this awesome XGBoost"
## Tools using XGBoost ## Tools using XGBoost
- [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API - [BayesBoost](https://github.com/mpearmain/BayesBoost) - Bayesian Optimization using xgboost and sklearn API
- [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python - [gp_xgboost_gridsearch](https://github.com/vatsan/gp_xgboost_gridsearch) - In-database parallel grid-search for XGBoost on [Greenplum](https://github.com/greenplum-db/gpdb) using PL/Python
- [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. - [tpot](https://github.com/rhiever/tpot) - A Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
## Integrations with 3rd party software
Open source integrations with XGBoost:
* [Neptune.ai](http://neptune.ai/) - Experiment management and collaboration tool for ML/DL/RL specialists. The integration takes the form of an [XGBoost callback](https://docs.neptune.ai/integrations/xgboost.html) that automatically logs training and evaluation metrics, as well as the saved model (booster), a feature importance chart, and visualized trees.
* [Optuna](https://optuna.org/) - An open source hyperparameter optimization framework to automate hyperparameter search. Optuna integrates with XGBoost through the [XGBoostPruningCallback](https://optuna.readthedocs.io/en/stable/reference/integration.html#optuna.integration.XGBoostPruningCallback), which lets users easily prune unpromising trials.
## Awards ## Awards
- [John Chambers Award](http://stat-computing.org/awards/jmc/winners.html) - 2016 Winner: XGBoost R Package, by Tong He (Simon Fraser University) and Tianqi Chen (University of Washington) - [John Chambers Award](http://stat-computing.org/awards/jmc/winners.html) - 2016 Winner: XGBoost R Package, by Tong He (Simon Fraser University) and Tianqi Chen (University of Washington)
- [InfoWorlds 2019 Technology of the Year Award](https://www.infoworld.com/article/3336072/application-development/infoworlds-2019-technology-of-the-year-award-winners.html) - [InfoWorlds 2019 Technology of the Year Award](https://www.infoworld.com/article/3336072/application-development/infoworlds-2019-technology-of-the-year-award-winners.html)


@@ -0,0 +1,56 @@
"""
Demo for survival analysis (regression) using Accelerated Failure Time (AFT) model
"""
import os
from sklearn.model_selection import ShuffleSplit
import pandas as pd
import numpy as np
import xgboost as xgb
# The Veterans' Administration Lung Cancer Trial
# The Statistical Analysis of Failure Time Data by Kalbfleisch J. and Prentice R (1980)
CURRENT_DIR = os.path.dirname(__file__)
df = pd.read_csv(os.path.join(CURRENT_DIR, '../data/veterans_lung_cancer.csv'))
print('Training data:')
print(df)
# Split features and labels
y_lower_bound = df['Survival_label_lower_bound']
y_upper_bound = df['Survival_label_upper_bound']
X = df.drop(['Survival_label_lower_bound', 'Survival_label_upper_bound'], axis=1)
# Split data into training and validation sets
rs = ShuffleSplit(n_splits=2, test_size=.7, random_state=0)
train_index, valid_index = next(rs.split(X))
dtrain = xgb.DMatrix(X.values[train_index, :])
dtrain.set_float_info('label_lower_bound', y_lower_bound[train_index])
dtrain.set_float_info('label_upper_bound', y_upper_bound[train_index])
dvalid = xgb.DMatrix(X.values[valid_index, :])
dvalid.set_float_info('label_lower_bound', y_lower_bound[valid_index])
dvalid.set_float_info('label_upper_bound', y_upper_bound[valid_index])
# Train gradient boosted trees using AFT loss and metric
params = {'verbosity': 0,
          'objective': 'survival:aft',
          'eval_metric': 'aft-nloglik',
          'tree_method': 'hist',
          'learning_rate': 0.05,
          'aft_loss_distribution': 'normal',
          'aft_loss_distribution_scale': 1.20,
          'max_depth': 6,
          'lambda': 0.01,
          'alpha': 0.02}
bst = xgb.train(params, dtrain, num_boost_round=10000,
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=50)
# Run prediction on the validation set
df = pd.DataFrame({'Label (lower bound)': y_lower_bound[valid_index],
                   'Label (upper bound)': y_upper_bound[valid_index],
                   'Predicted label': bst.predict(dvalid)})
print(df)
# Show only data points with right-censored labels
print(df[np.isinf(df['Label (upper bound)'])])
# Save trained model
bst.save_model('aft_model.json')
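
The demo above reads precomputed `Survival_label_lower_bound` and `Survival_label_upper_bound` columns from the CSV. As a minimal sketch of how such interval labels might be derived from the more common (time, event observed) encoding: an uncensored row gets lower = upper = time, while a right-censored row keeps the observed time as its lower bound and uses +inf as its upper bound. The helper name `make_aft_bounds` and the sample values are hypothetical and not part of XGBoost or the dataset.

```python
import numpy as np


def make_aft_bounds(time, event_observed):
    """Hypothetical helper: convert (time, event) pairs into AFT interval labels.

    Uncensored rows get lower == upper == time. Right-censored rows keep the
    observed time as the lower bound and use +inf as the upper bound.
    """
    time = np.asarray(time, dtype=float)
    event_observed = np.asarray(event_observed, dtype=bool)
    lower = time.copy()
    upper = np.where(event_observed, time, np.inf)
    return lower, upper


# Illustrative values only: the third subject is right-censored
lower, upper = make_aft_bounds([72, 411, 25], [True, True, False])
print(lower)
print(upper)
```

The resulting arrays could then be passed to `set_float_info('label_lower_bound', ...)` and `set_float_info('label_upper_bound', ...)` in the same way the demo does.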


@@ -0,0 +1,78 @@
"""
Demo for survival analysis (regression) using Accelerated Failure Time (AFT) model, using Optuna
to tune hyperparameters
"""
from sklearn.model_selection import ShuffleSplit
import pandas as pd
import numpy as np
import xgboost as xgb
import optuna
# The Veterans' Administration Lung Cancer Trial
# The Statistical Analysis of Failure Time Data by Kalbfleisch J. and Prentice R (1980)
df = pd.read_csv('../data/veterans_lung_cancer.csv')
print('Training data:')
print(df)
# Split features and labels
y_lower_bound = df['Survival_label_lower_bound']
y_upper_bound = df['Survival_label_upper_bound']
X = df.drop(['Survival_label_lower_bound', 'Survival_label_upper_bound'], axis=1)
# Split data into training and validation sets
rs = ShuffleSplit(n_splits=2, test_size=.7, random_state=0)
train_index, valid_index = next(rs.split(X))
dtrain = xgb.DMatrix(X.values[train_index, :])
dtrain.set_float_info('label_lower_bound', y_lower_bound[train_index])
dtrain.set_float_info('label_upper_bound', y_upper_bound[train_index])
dvalid = xgb.DMatrix(X.values[valid_index, :])
dvalid.set_float_info('label_lower_bound', y_lower_bound[valid_index])
dvalid.set_float_info('label_upper_bound', y_upper_bound[valid_index])
# Define hyperparameter search space
base_params = {'verbosity': 0,
               'objective': 'survival:aft',
               'eval_metric': 'aft-nloglik',
               'tree_method': 'hist'}  # Hyperparameters common to all trials

def objective(trial):
    params = {'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
              'aft_loss_distribution': trial.suggest_categorical('aft_loss_distribution',
                                                                  ['normal', 'logistic', 'extreme']),
              'aft_loss_distribution_scale': trial.suggest_loguniform('aft_loss_distribution_scale', 0.1, 10.0),
              'max_depth': trial.suggest_int('max_depth', 3, 8),
              'lambda': trial.suggest_loguniform('lambda', 1e-8, 1.0),
              'alpha': trial.suggest_loguniform('alpha', 1e-8, 1.0)}  # Search space
    params.update(base_params)
    pruning_callback = optuna.integration.XGBoostPruningCallback(trial, 'valid-aft-nloglik')
    bst = xgb.train(params, dtrain, num_boost_round=10000,
                    evals=[(dtrain, 'train'), (dvalid, 'valid')],
                    early_stopping_rounds=50, verbose_eval=False, callbacks=[pruning_callback])
    if bst.best_iteration >= 25:
        return bst.best_score
    else:
        return np.inf  # Reject models with < 25 trees
# Run hyperparameter search
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=200)
print('Completed hyperparameter tuning with best aft-nloglik = {}.'.format(study.best_trial.value))
params = {}
params.update(base_params)
params.update(study.best_trial.params)
# Re-run training with the best hyperparameter combination
print('Re-running the best trial... params = {}'.format(params))
bst = xgb.train(params, dtrain, num_boost_round=10000,
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=50)
# Run prediction on the validation set
df = pd.DataFrame({'Label (lower bound)': y_lower_bound[valid_index],
                   'Label (upper bound)': y_upper_bound[valid_index],
                   'Predicted label': bst.predict(dvalid)})
print(df)
# Show only data points with right-censored labels
print(df[np.isinf(df['Label (upper bound)'])])
# Save trained model
bst.save_model('aft_best_model.json')
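
The pruning callback in the demo reports `valid-aft-nloglik` after each boosting round so that Optuna can stop unpromising trials early. The sketch below is a deliberately simplified illustration of a median-based pruning rule for a minimized metric. It is not Optuna's implementation, and the function name `should_prune` and the sample values are hypothetical.

```python
import statistics


def should_prune(current_value, previous_values_at_step):
    """Simplified median rule for a metric that is being minimized.

    Prune when the current trial's metric at this step is worse (larger)
    than the median of what earlier trials reported at the same step.
    """
    if not previous_values_at_step:
        return False  # Nothing to compare against yet
    return current_value > statistics.median(previous_values_at_step)


# Illustrative values: earlier trials reported these losses at the same round
earlier = [0.5, 0.6, 0.7]
print(should_prune(0.9, earlier))
print(should_prune(0.4, earlier))
```

A real pruner typically adds safeguards, such as a minimum number of completed trials and warm-up rounds before any pruning happens.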


@@ -0,0 +1,97 @@
"""
Visual demo for survival analysis (regression) with Accelerated Failure Time (AFT) model.
This demo uses 1D toy data and visualizes how XGBoost fits a tree ensemble. The ensemble model
starts out as a flat line and evolves into a step function in order to account for all ranged
labels.
"""
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 13})
# Function to visualize censored labels
def plot_censored_labels(X, y_lower, y_upper):
    def replace_inf(x, target_value):
        # Note: overwrites infinite entries of x in place, so the caller's array is modified
        x[np.isinf(x)] = target_value
        return x
    plt.plot(X, y_lower, 'o', label='y_lower', color='blue')
    plt.plot(X, y_upper, 'o', label='y_upper', color='fuchsia')
    plt.vlines(X, ymin=replace_inf(y_lower, 0.01), ymax=replace_inf(y_upper, 1000),
               label='Range for y', color='gray')
# Toy data
X = np.array([1, 2, 3, 4, 5]).reshape((-1, 1))
INF = np.inf
y_lower = np.array([ 10, 15, -INF, 30, 100])
y_upper = np.array([INF, INF, 20, 50, INF])
# Visualize toy data
plt.figure(figsize=(5, 4))
plot_censored_labels(X, y_lower, y_upper)
plt.ylim((6, 200))
plt.legend(loc='lower right')
plt.title('Toy data')
plt.xlabel('Input feature')
plt.ylabel('Label')
plt.yscale('log')
plt.tight_layout()
plt.show(block=True)
# Will be used to visualize XGBoost model
grid_pts = np.linspace(0.8, 5.2, 1000).reshape((-1, 1))
# Train AFT model using XGBoost
dmat = xgb.DMatrix(X)
dmat.set_float_info('label_lower_bound', y_lower)
dmat.set_float_info('label_upper_bound', y_upper)
params = {'max_depth': 3, 'objective':'survival:aft', 'min_child_weight': 0}
accuracy_history = []
def plot_intermediate_model_callback(env):
    """Custom callback to plot intermediate models"""
    # Compute y_pred = prediction using the intermediate model, at current boosting iteration
    y_pred = env.model.predict(dmat)
    # "Accuracy" = the percentage of data points whose ranged label (y_lower, y_upper) includes
    # the corresponding predicted label (y_pred)
    acc = np.sum(np.logical_and(y_pred >= y_lower, y_pred <= y_upper)/len(X) * 100)
    accuracy_history.append(acc)
    # Plot ranged labels as well as predictions by the model
    plt.subplot(5, 3, env.iteration + 1)
    plot_censored_labels(X, y_lower, y_upper)
    y_pred_grid_pts = env.model.predict(xgb.DMatrix(grid_pts))
    plt.plot(grid_pts, y_pred_grid_pts, 'r-', label='XGBoost AFT model', linewidth=4)
    plt.title('Iteration {}'.format(env.iteration), x=0.5, y=0.8)
    plt.xlim((0.8, 5.2))
    plt.ylim((1 if np.min(y_pred) < 6 else 6, 200))
    plt.yscale('log')
res = {}
plt.figure(figsize=(12,13))
bst = xgb.train(params, dmat, 15, [(dmat, 'train')], evals_result=res,
                callbacks=[plot_intermediate_model_callback])
plt.tight_layout()
plt.legend(loc='lower center', ncol=4,
           bbox_to_anchor=(0.5, 0),
           bbox_transform=plt.gcf().transFigure)
plt.tight_layout()
# Plot negative log likelihood over boosting iterations
plt.figure(figsize=(8,3))
plt.subplot(1, 2, 1)
plt.plot(res['train']['aft-nloglik'], 'b-o', label='aft-nloglik')
plt.xlabel('# Boosting Iterations')
plt.legend(loc='best')
# Plot "accuracy" over boosting iterations
# "Accuracy" = the percentage of data points whose ranged label (y_lower, y_upper) includes
# the corresponding predicted label (y_pred)
plt.subplot(1, 2, 2)
plt.plot(accuracy_history, 'r-o', label='Accuracy (%)')
plt.xlabel('# Boosting Iterations')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
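
The "accuracy" tracked by the callback above can be checked in isolation. The following is a minimal sketch, assuming the same toy label bounds as the demo. The function name `interval_coverage` and the prediction values are hypothetical and were chosen only to illustrate the computation.

```python
import numpy as np


def interval_coverage(y_pred, y_lower, y_upper):
    """Percentage of predictions that fall inside their [y_lower, y_upper] range."""
    y_pred = np.asarray(y_pred, dtype=float)
    inside = np.logical_and(y_pred >= y_lower, y_pred <= y_upper)
    return inside.mean() * 100


# Toy label bounds as in the demo (before any in-place replacement of infinities)
y_lower = np.array([10, 15, -np.inf, 30, 100])
y_upper = np.array([np.inf, np.inf, 20, 50, np.inf])
# Made-up predictions: points 1, 3 and 5 fall inside their ranges, points 2 and 4 do not
y_pred = np.array([12.0, 14.0, 18.0, 60.0, 150.0])
coverage = interval_coverage(y_pred, y_lower, y_upper)
print('Coverage: {:.1f}%'.format(coverage))
```

Because censored labels are ranges rather than points, this coverage measure only says whether a prediction is consistent with the label, not how close it is to the true survival time.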


@@ -156,7 +156,7 @@ If you want to continue boosting from existing model, say 0002.model, use
 ```
 xgboost will load from 0002.model continue boosting for 2 rounds, and save output to continue.model. However, beware that the training and evaluation data specified in mushroom.conf should not change when you use this function.
 #### Use Multi-Threading
-When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running add ```nthread``` parameter to you configuration.
+When you are working with a large dataset, you may want to take advantage of parallelism. If your compiler supports OpenMP, xgboost is naturally multi-threaded, to set number of parallel running add ```nthread``` parameter to your configuration.
 Eg. ```nthread=10```
 Set nthread to be the number of your real cpu (On Unix, this can be found using ```lscpu```)
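
As a minimal sketch of how the `nthread` line might be generated rather than typed by hand: the helper below is hypothetical and not part of xgboost. It falls back to `os.cpu_count()`, which reports logical CPUs and may therefore exceed the number of physical cores on machines with simultaneous multithreading.

```python
import os


def nthread_config_line(n_threads=None):
    """Hypothetical helper: build an 'nthread=N' line for an xgboost config file."""
    if n_threads is None:
        # os.cpu_count() counts logical CPUs and can return None on some platforms
        n_threads = os.cpu_count() or 1
    return 'nthread={}'.format(n_threads)


print(nthread_config_line(10))
print(nthread_config_line())
```

If the logical count is higher than the physical core count, passing the physical number explicitly (for example, as read from `lscpu`) follows the advice above more closely.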

demo/c-api/.gitignore vendored Normal file

@@ -0,0 +1 @@
c-api-demo


@@ -60,6 +60,10 @@ int main(int argc, char** argv) {
     printf("%s\n", eval_result);
   }
+  bst_ulong num_feature = 0;
+  safe_xgboost(XGBoosterGetNumFeature(booster, &num_feature));
+  printf("num_feature: %llu\n", num_feature);
   // predict
   bst_ulong out_len = 0;
   const float* out_result = NULL;


@@ -22,7 +22,6 @@ def main(client):
     # evaluation metrics.
     output = xgb.dask.train(client,
                             {'verbosity': 1,
-                             'nthread': 1,
                              'tree_method': 'hist'},
                             dtrain,
                             num_boost_round=4, evals=[(dtrain, 'train')])
@@ -37,6 +36,6 @@ def main(client):
 if __name__ == '__main__':
     # or use other clusters for scaling
-    with LocalCluster(n_workers=7, threads_per_worker=1) as cluster:
+    with LocalCluster(n_workers=7, threads_per_worker=4) as cluster:
         with Client(cluster) as client:
             main(client)
