1.5 release note. [skip ci] (#7271)
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
# XGBoost Change Log

This file records the changes in xgboost library in reverse chronological order.

## v1.5.0 (2021 Oct 11)

This release comes with many exciting new features and optimizations, along with some bug fixes. We will describe the experimental categorical data support and the external memory interface independently. Package-specific new features will be listed in their respective sections.

### Development on categorical data support

In version 1.3, XGBoost introduced an experimental feature for handling categorical data natively, without one-hot encoding. XGBoost can fit categorical splits in decision trees. (Currently, the generated splits will be of the form `x \in {v}`, where the input is compared to a single category value. A future version of XGBoost will generate splits that compare the input against a list of multiple category values.)

Most of the other features, including prediction, SHAP value computation, feature importance, and model plotting, were revised to natively handle categorical splits. Also, all Python interfaces, including the native interface with and without quantized `DMatrix`, the scikit-learn interface, and the Dask interface, now accept categorical data with support for a wide range of data structures, including numpy/cupy arrays and cuDF/pandas/modin dataframes. In practice, the following are required for enabling categorical data support during training:

- Use the Python package.
- Use `gpu_hist` to train the model.
- Use the JSON model file format for saving the model.

Once the model is trained, it can be used with most of the features that are available in the Python package. For a quick introduction, see
https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html
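
As a hedged illustration of the checklist above (not part of the original notes), the sketch below shows one way categorical training could be set up, assuming a pandas dataframe with `category` dtype columns; the column names and data are made up, and `enable_categorical` together with `gpu_hist` reflects the requirements listed earlier.

```python
import pandas as pd
import xgboost as xgb

# Hypothetical toy data: one categorical and one numerical feature.
X = pd.DataFrame(
    {
        "color": pd.Categorical(["red", "green", "blue", "green"]),
        "size": [1.0, 2.5, 3.2, 0.7],
    }
)
y = [0, 1, 1, 0]

# Requirements: Python package + `gpu_hist` + categorical support enabled.
clf = xgb.XGBClassifier(tree_method="gpu_hist", enable_categorical=True)
clf.fit(X, y)

# Save with the JSON format so the categorical splits are preserved.
clf.save_model("categorical-model.json")
```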

Related PRs: (#7011, #7001, #7042, #7041, #7047, #7043, #7036, #7054, #7053, #7065, #7213, #7228, #7220, #7221, #7231, #7306)

* Next steps

  - Revise the CPU training algorithm to handle categorical data natively and generate categorical splits
  - Extend the CPU and GPU algorithms to generate categorical splits of the form `x \in S`, where the input is compared with multiple category values. (#7081)

### External memory

This release features a brand-new interface and implementation for external memory (also known as out-of-core training). (#6901, #7064, #7088, #7089, #7087, #7092, #7070, #7216). The new implementation leverages the data iterator interface, which is currently used to create `DeviceQuantileDMatrix`. For a quick introduction, see
https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html#data-iterator. During the development of this new interface, `lz4` compression was removed. (#7076). Please note that external memory support is still experimental and not ready for production use yet. All future development will focus on this new interface, and users are advised to migrate. (You are using the old interface if you are using a URL suffix to use external memory.)
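
To make the new interface concrete, here is a minimal sketch following the pattern in the linked tutorial; the file names, the `load_svmlight_file` loader, and the cache location are illustrative assumptions, not part of the release notes.

```python
import os

import xgboost
from sklearn.datasets import load_svmlight_file


class BatchIterator(xgboost.DataIter):
    """Yields one pre-partitioned batch per call; XGBoost caches pages on disk."""

    def __init__(self, file_paths):
        self._file_paths = file_paths
        self._it = 0
        # cache_prefix tells XGBoost where to place the on-disk cache files.
        super().__init__(cache_prefix=os.path.join(".", "cache"))

    def next(self, input_data):
        # Return 0 when the data is exhausted, 1 to signal one more batch.
        if self._it == len(self._file_paths):
            return 0
        X, y = load_svmlight_file(self._file_paths[self._it])
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        self._it = 0


# Hypothetical pre-partitioned data files.
it = BatchIterator(["part-0.svm", "part-1.svm"])
Xy = xgboost.DMatrix(it)  # pages are streamed from disk during training
booster = xgboost.train({"tree_method": "hist"}, Xy, num_boost_round=10)
```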

### New features in Python package

* Support numpy array interface and all numeric types from numpy in `DMatrix` construction and `inplace_predict` (#6998, #7003). XGBoost no longer makes a data copy when the input is a numpy array view.
* The early stopping callback in Python has a new `min_delta` parameter to control the stopping behavior (#7137); see the sketch after this list.
* The Python package now supports calculating feature scores for the linear model, which is also available in the R package. (#7048)
* The Python interface now supports configuring constraints using feature names instead of feature indices.
* Typehint support for more Python code, including the scikit-learn interface and the rabit module. (#6799, #7240)
* Add tutorial for XGBoost-Ray (#6884)
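
As a small, hedged illustration of the `min_delta` item above: with the snippet below, a round only counts as an improvement when the validation RMSE improves by at least the given threshold; the synthetic data and the threshold value are placeholders.

```python
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=512, n_features=10, random_state=0)
dtrain = xgb.DMatrix(X[:400], y[:400])
dvalid = xgb.DMatrix(X[400:], y[400:])

# Stop when validation RMSE fails to improve by at least 1.0 for 10 rounds.
early_stop = xgb.callback.EarlyStopping(rounds=10, min_delta=1.0)

booster = xgb.train(
    {"objective": "reg:squarederror"},
    dtrain,
    num_boost_round=200,
    evals=[(dvalid, "validation")],
    callbacks=[early_stop],
)
```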

### New features in R package

* In 1.4 we added a new prediction function in the C API, which is used by the Python package. This release revises the R package to use the new prediction function as well. A new parameter `iteration_range` for the predict function is available, which can be used for specifying the range of trees for running prediction. (#6819, #7126)
* The R package now supports the `nthread` parameter in `DMatrix` construction. (#7127)

### New features in JVM packages

* Support GPU dataframe and `DeviceQuantileDMatrix` (#7195). Constructing `DMatrix` with GPU data structures and the interface for quantized `DMatrix` were first introduced in the Python package and are now available in the xgboost4j package.
* JVM packages now support saving and getting early stopping attributes. (#7095) Here is a quick [example](https://github.com/dmlc/xgboost/jvm-packages/xgboost4j-example/src/main/java/ml/dmlc/xgboost4j/java/example/EarlyStopping.java "example") in Java (#7252).

### General new features

* We now have a pre-built binary package for R on Windows with GPU support. (#7185)
* CUDA compute capability 8.6 is now part of the default CMake build configuration, with newly added support for CUDA 11.4. (#7131, #7182, #7254)
* XGBoost can be compiled using the system CUB provided by the CUDA 11.x installation. (#7232)

### Optimizations

The performance for both `hist` and `gpu_hist` has been significantly improved in 1.5 with the following optimizations:

* GPU multi-class model training now supports prediction cache. (#6860)
* GPU histogram building is sped up and the overall training time is 2-3 times faster on large datasets (#7180, #7198). In addition, we removed the parameter `deterministic_histogram`, and now the GPU algorithm is always deterministic.
* CPU hist has an optimized procedure for data sampling (#6922)
* More performance optimization in regression and binary classification objectives on CPU (#7206)
* Tree model dump is now performed in parallel (#7040)

### Breaking changes

* `n_gpus` was deprecated in the 1.0 release and is now removed.
* Feature grouping in the CPU hist tree method, which was disabled long ago, is now removed. (#7018)
* The C API for Quantile DMatrix is changed to be consistent with the new external memory implementation. (#7082)

### Notable general bug fixes

* XGBoost no longer changes the global CUDA device ordinal when `gpu_id` is specified (#6891, #6987)
* Fix the `gamma` negative log-likelihood evaluation metric. (#7275)
* Fix integer value of `verbose_eval` for the `xgboost.cv` function in Python. (#7291)
* Remove an extra sync in CPU hist for dense data, which could lead to incorrect tree node statistics. (#7120, #7128)
* Fix a bug in GPU hist when data size is larger than `UINT32_MAX` with missing values. (#7026)
* Fix a thread safety issue in prediction with the `softmax` objective. (#7104)
* Fix a thread safety issue in CPU SHAP value computation. (#7050) Please note that all prediction functions in Python are thread-safe.
* Fix model slicing. (#7149, #7078)
* Work around a bug in old GCC that can lead to a segfault during construction of DMatrix. (#7161)
* Fix histogram truncation in GPU hist, which can lead to slightly-off results. (#7181)
* Fix loading GPU linear model pickle files on a CPU-only machine. (#7154)
* Check whether the input value is duplicated when the CPU quantile queue is full (#7091)
* Fix parameter loading with training continuation. (#7121)
* Fix the CMake interface for exposing the C library by specifying dependencies. (#7099)
* Callbacks and early stopping are explicitly disabled for the scikit-learn interface random forest estimator. (#7236)
* Fix compilation error on x86 (32-bit machine) (#6964)
* Fix CPU memory usage with extremely sparse datasets (#7255)
* Fix a bug in GPU multi-class AUC implementation with weighted data (#7300)

### Python package

Other than the items mentioned in the previous sections, there are some Python-specific improvements.

* Change development release postfix to `dev` (#6988)
* Fix early stopping behavior with MAPE metric (#7061)
* Fix incorrect feature mismatch error message (#6949)
* Add `predictor` to the scikit-learn constructor. (#7000, #7159)
* Re-enable feature validation in `predict_proba`. (#7177)
* The scikit-learn interface regression estimator can now pass the scikit-learn estimator check and is fully compatible with scikit-learn utilities. `__sklearn_is_fitted__` is implemented as part of the changes (#7130, #7230)
* Conform to the latest pylint. (#7071, #7241)
* Support the latest pandas range index in DMatrix construction. (#7074)
* Fix DMatrix construction from pandas series. (#7243)
* Fix typo and grammatical mistake in error message (#7134)
* [dask] Disable work stealing explicitly for training tasks (#6794)
* [dask] Set dataframe index in predict. (#6944)
* [dask] Fix prediction on df with latest dask. (#6969)
* [dask] Fix dask predict on `DaskDMatrix` with `iteration_range`. (#7005)
* [dask] Disallow importing non-dask estimators from xgboost.dask (#7133)

### R package

Improvements other than new features in the R package:

* Optimization for updating R handles in place (#6903)
* Remove the magrittr dependency. (#6855, #6906, #6928)
* The R package now hides all C++ symbols to avoid conflicts. (#7245)
* Other maintenance, including code cleanups and document updates. (#6863, #6915, #6930, #6966, #6967)

### JVM packages

Improvements other than new features in JVM packages:

* Constructors with an implicit missing value are deprecated due to confusing behaviors. (#7225)
* Reduce scala-compiler and scalatest dependency scopes (#6730)
* Make the Java library loader emit helpful error messages on missing dependencies. (#6926)
* JVM packages now use the Python tracker in XGBoost instead of dmlc. The one in XGBoost is shared between JVM packages and Python Dask and enjoys better maintenance (#7132)
* Fix "key not found: train" error (#6842)
* Fix model loading from stream (#7067)

### General document improvements

* Overhaul the installation documents. (#6877)
* A few demos are added for AFT with dask (#6853), callback with dask (#6995), inference in C (#7151), and `process_type` (#7135).
* Fix the PDF format of the document. (#7143)
* Clarify the behavior of `use_rmm`. (#6808)
* Clarify the prediction function. (#6813)
* Improve the tutorial on feature interactions (#7219)
* Add a small example for the dask sklearn interface. (#6970)
* Update the Python intro. (#7235)
* Some fixes/updates (#6810, #6856, #6935, #6948, #6976, #7084, #7097, #7170, #7173, #7174, #7226, #6979, #6809, #6796)

### Maintenance

* Some refactoring around CPU hist, which led to better performance but is listed under general maintenance tasks:
  - Extract evaluate splits from CPU hist. (#7079)
  - Merge lossguide and depthwise strategies for CPU hist (#7007)
  - Simplify sparse and dense CPU hist kernels (#7029)
  - Extract histogram builder from CPU hist. (#7152)

* Others
  - Fix `gpu_id` with custom objective. (#7015)
  - Fix typos in AUC. (#6795)
  - Use constexpr in `dh::CopyIf`. (#6828)
  - Update dmlc-core. (#6862)
  - Bump version to 1.5.0 snapshot in master. (#6875)
  - Relax shotgun test. (#6900)
  - Guard against index error in prediction. (#6982)
  - Hide symbols in CI build + hide symbols for C and CUDA (#6798)
  - Persist data in dask test. (#7077)
  - Fix typo in arguments of PartitionBuilder::Init (#7113)
  - Fix typo in src/common/hist.cc BuildHistKernel (#7116)
  - Use upstream URI in distributed quantile tests. (#7129)
  - Include cpack (#7160)
  - Remove synchronization in monitor. (#7164)
  - Remove unused code. (#7175)
  - Fix building on CUDA 11.0. (#7187)
  - Better error message for `ncclUnhandledCudaError`. (#7190)
  - Add noexcept to JSON objects. (#7205)
  - Improve wording for warning (#7248)
  - Fix typo in release script. [skip ci] (#7238)
  - Relax shotgun test. (#6918)
  - Relax test for decision stump in distributed environment. (#6919)
  - [dask] Speed up tests (#7020)

### CI

* [CI] Rotate access keys for uploading MacOS artifacts from Travis CI (#7253)
* Reduce Travis environment setup time. (#6912)
* Restore R cache on GitHub Actions. (#6985)
* [CI] Remove stray build artifact to avoid error in artifact packaging (#6994)
* [CI] Move AppVeyor tests to GitHub Actions (#6986)
* Remove AppVeyor badge. [skip ci] (#7035)
* [CI] Configure RAPIDS, dask, modin (#7033)
* Test on s390x. (#7038)
* [CI] Upgrade to CMake 3.14 (#7060)
* [CI] Update R cache. (#7102)
* [CI] Pin libomp to 11.1.0 (#7107)
* [CI] Upgrade build image to CentOS 7 + GCC 8; require CUDA 10.1 and later (#7141)
* [dask] Work around segfault in prediction. (#7112)
* [dask] Remove the workaround for segfault. (#7146)
* [CI] Fix hanging Python setup in Windows CI (#7186)
* [CI] Clean up at the beginning of each task in Windows CI (#7189)
* Fix Travis. (#7237)
### Acknowledgement
* **Contributors**: Adam Pocock (@Craigacp), Jeff H (@JeffHCross), Johan Hansson (@JohanWork), Jose Manuel Llorens (@JoseLlorensRipolles), Benjamin Szőke (@Livius90), @ReeceGoding, @ShvetsKS, Robert Zabel (@ZabelTech), Ali (@ali5h), Andrew Ziem (@az0), Andy Adinets (@canonizer), @david-cortes, Daniel Saxton (@dsaxton), Emil Sadek (@esadek), @farfarawayzyt, Gil Forsyth (@gforsyth), @giladmaya, @graue70, Philip Hyunsu Cho (@hcho3), James Lamb (@jameslamb), José Morales (@jmoralez), Kai Fricke (@krfricke), Christian Lorentzen (@lorentzenchr), Mads R. B. Kristensen (@madsbk), Anton Kostin (@masguit42), Martin Petříček (@mpetricek-corp), @naveenkb, Taewoo Kim (@oOTWK), Viktor Szathmáry (@phraktle), Robert Maynard (@robertmaynard), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), Paul Taylor (@trxcllnt), @vslaykovsky, Bobby Wang (@wbo4958),
* **Reviewers**: Nan Zhu (@CodingCat), Adam Pocock (@Craigacp), Jose Manuel Llorens (@JoseLlorensRipolles), Kodi Arfer (@Kodiologist), Benjamin Szőke (@Livius90), Mark Guryanov (@MarkGuryanov), Rory Mitchell (@RAMitchell), @ReeceGoding, @ShvetsKS, Egor Smirnov (@SmirnovEgorRu), Andrew Ziem (@az0), @candalfigomoro, Andy Adinets (@canonizer), Dante Gama Dessavre (@dantegd), @david-cortes, Daniel Saxton (@dsaxton), @farfarawayzyt, Gil Forsyth (@gforsyth), Harutaka Kawamura (@harupy), Philip Hyunsu Cho (@hcho3), @jakirkham, James Lamb (@jameslamb), José Morales (@jmoralez), James Bourbeau (@jrbourbeau), Christian Lorentzen (@lorentzenchr), Martin Petříček (@mpetricek-corp), Nikolay Petrov (@napetrov), @naveenkb, Viktor Szathmáry (@phraktle), Robin Teuwens (@rteuwens), Yuan Tang (@terrytangyuan), TP Boudreau (@tpboudreau), Jiaming Yuan (@trivialfis), @vkuzmin-uber, Bobby Wang (@wbo4958), William Hicks (@wphicks)

## v1.4.2 (2021.05.13)

This is a patch release for the Python package with the following fixes: