# XGBoost Change Log

This file records the changes in the xgboost library in reverse chronological order.

## v1.6.0 (2022 Apr 16)

After a long period of development, XGBoost v1.6.0 is packed with many new features and
improvements. We summarize them in the following sections, starting with an introduction to
some major new features, then moving on to language-binding-specific changes, including new
features and notable bug fixes for each binding.

### Development of categorical data support

This version of XGBoost features new improvements and full coverage of experimental
categorical data support in the Python and C packages with tree models. `hist`, `approx`,
and `gpu_hist` all now support training with categorical data. Also, partition-based
categorical splits are introduced in this release. This split type was first available in
LightGBM in the context of gradient boosting. The previous XGBoost release supported
one-hot splits, where the splitting criterion is of the form `x \in {c}`, i.e. the
categorical feature `x` is tested against a single candidate. The new release allows for
more expressive conditions: `x \in S`, where the categorical feature `x` is tested against
multiple candidates. Moreover, any of the tree methods (`hist`, `approx`, `gpu_hist`) can
now be used when creating categorical splits. For more information, please see our
tutorial on [categorical
data](https://xgboost.readthedocs.io/en/latest/tutorials/categorical.html), along with
examples linked on that page. (#7380, #7708, #7695, #7330, #7307, #7322, #7705,
#7652, #7592, #7666, #7576, #7569, #7529, #7575, #7393, #7465, #7385, #7371, #7745, #7810)

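As a quick illustration, here is a minimal sketch of training with categorical data
through the Python interface; it follows the linked tutorial, and the toy dataset is made
up for illustration:

```python
import pandas as pd
import xgboost as xgb

# Toy data: pandas categorical columns are picked up natively when
# enable_categorical=True (the support is experimental in 1.6).
X = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "green", "blue", "red", "green"]),
    "size": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
y = [0, 1, 1, 0, 1, 0]

# Any of hist, approx, or gpu_hist can create categorical splits in 1.6.
clf = xgb.XGBClassifier(tree_method="hist", enable_categorical=True)
clf.fit(X, y)
```
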
In the future, we will continue to improve categorical data support with new features and
optimizations. Also, we are looking forward to bringing the feature beyond the Python
binding; contributions and feedback are welcome! Lastly, given its experimental status,
the behavior might be subject to change, especially the default values of related
hyper-parameters.

### Experimental support for multi-output model

XGBoost 1.6 features initial support for multi-output models, which includes
multi-output regression and multi-label classification. Along with this, the XGBoost
classifier has proper support for base margin without the need for the user to flatten
the input. In this initial support, XGBoost builds one model for each target, similar to
the sklearn meta estimator; for more details, please see our [quick
introduction](https://xgboost.readthedocs.io/en/latest/tutorials/multioutput.html).
(#7365, #7736, #7607, #7574, #7521, #7514, #7456, #7453, #7455, #7434, #7429, #7405, #7381)

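A minimal sketch of multi-output regression through the sklearn interface, with random
data for illustration; internally one model is built per target:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((128, 10))
Y = rng.random((128, 3))  # three regression targets

# A 2-dimensional target is accepted directly; no manual flattening needed.
reg = xgb.XGBRegressor(tree_method="hist")
reg.fit(X, Y)
assert reg.predict(X).shape == (128, 3)
```
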
### External memory support

External memory support for both the approx and hist tree methods is considered feature
complete in XGBoost 1.6. Building upon the iterator-based interface introduced in the
previous version, both `hist` and `approx` now iterate over each batch of data during
training and prediction. In previous versions, `hist` concatenated all the batches into
an internal representation, which is removed in this version. As a result, users can
expect higher scalability in terms of data size but might experience lower performance
due to disk IO. (#7531, #7320, #7638, #7372)

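For reference, a hedged sketch of the iterator-based interface: the `DataIter` class and
its `next`/`reset` protocol are part of the public Python API, while `load_batch` and the
file list below are hypothetical stand-ins:

```python
import xgboost as xgb

class BatchIter(xgb.DataIter):
    """Yield data batch by batch instead of loading everything into memory."""

    def __init__(self, files):
        self._files = files
        self._it = 0
        super().__init__(cache_prefix="cache")  # prefix for the on-disk cache

    def next(self, input_data):
        if self._it == len(self._files):
            return 0  # returning 0 signals the end of iteration
        X, y = load_batch(self._files[self._it])  # hypothetical user-defined loader
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        self._it = 0  # rewind for the next pass over the data

Xy = xgb.DMatrix(BatchIter(["part-0.npz", "part-1.npz"]))
booster = xgb.train({"tree_method": "approx"}, Xy)
```
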
### Rewritten approx

The `approx` tree method is rewritten based on the existing `hist` tree method. The
rewrite closes the feature gap between `approx` and `hist` and improves performance. The
behavior of `approx` should now be more aligned with `hist` and `gpu_hist`. Here is a
list of user-visible changes:

- Supports both `max_leaves` and `max_depth`.
- Supports `grow_policy`.
- Supports monotonic constraints.
- Supports feature weights.
- Uses `max_bin` to replace `sketch_eps`.
- Supports categorical data.
- Faster performance on many datasets.
- Improved performance and robustness for distributed training.
- Supports the prediction cache.
- Significantly better performance for external memory when the `depthwise` policy is used.

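As a small example, a sketch of configuring the rewritten `approx` with the replacement
parameters (the values and random data are illustrative):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
Xy = xgb.DMatrix(rng.random((256, 8)), label=rng.random(256))

# max_bin replaces sketch_eps, and grow_policy/max_leaves now work with approx.
params = {
    "tree_method": "approx",
    "max_bin": 256,
    "grow_policy": "lossguide",
    "max_leaves": 32,
}
booster = xgb.train(params, Xy, num_boost_round=10)
```
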
### New serialization format

Based on the existing JSON serialization format, we introduce UBJSON support as a more
efficient alternative. Both formats will be available in the future, and we plan to
gradually [phase out](https://github.com/dmlc/xgboost/issues/7547) support for the old
binary model format. Users can opt into the different formats in the serialization
functions by providing the file extension `json` or `ubj`. Also, the `save_raw` function
in all supported language bindings gains a new parameter for exporting the model in
different formats; available options are `json`, `ubj`, and `deprecated`. See the
documentation for the language binding you are using for details. Lastly, the default
internal serialization format is set to UBJSON, which affects Python pickle and R
RDS. (#7572, #7570, #7358, #7571, #7556, #7549, #7416)

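In the Python binding, for example, a minimal sketch (`raw_format` is the Python
parameter name; other bindings may differ, so check their respective documentation):

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
booster = xgb.train({}, xgb.DMatrix(rng.random((32, 4)), label=rng.random(32)))

booster.save_model("model.json")  # JSON, selected by file extension
booster.save_model("model.ubj")   # UBJSON, selected by file extension
raw = booster.save_raw(raw_format="ubj")  # in-memory export as UBJSON
```
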
### General new features and improvements

Aside from the major new features mentioned above, some others are summarized here:

* Users can now access the build information of the XGBoost binary in the Python and C
  interfaces, as shown in the sketch after this list. (#7399, #7553)
* Auto-configuration of `seed_per_iteration` is removed; distributed training should now
  generate results closer to single-node training when sampling is used. (#7009)
* A new parameter `huber_slope` is introduced for the `Pseudo-Huber` objective.
* During source builds, XGBoost can automatically pick up cub from the system path. (#7579)
* XGBoost now honors the CPU count from CFS, which is usually set in docker
  environments. (#7654, #7704)
* The metric `aucpr` is rewritten for better performance and GPU support. (#7297, #7368)
* Metric calculation is now performed in double precision. (#7364)
* XGBoost no longer mutates the global OpenMP thread limit. (#7537, #7519, #7608, #7590,
  #7589, #7588, #7687)
* The default behavior of `max_leaves` and `max_depth` is now unified. (#7302, #7551)
* The CUDA fat binary is now compressed. (#7601)
* Deterministic results for evaluation metrics and linear models. In previous versions of
  XGBoost, evaluation results might differ slightly between runs due to parallel reduction
  of floating-point values, which is now addressed. (#7362, #7303, #7316, #7349)
* XGBoost now uses double precision for the GPU Hist node sum, which improves the accuracy
  of `gpu_hist`. (#7507)

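A minimal sketch of querying the build information from Python (the exact keys in the
returned dictionary depend on how the binary was built):

```python
import xgboost as xgb

# build_info() reports how the installed binary was compiled,
# e.g. whether CUDA support is enabled.
info = xgb.build_info()
print(info.get("USE_CUDA"))
```
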
### Performance improvements

Most of the performance improvements are integrated into other refactors during feature
development. The `approx` tree method should see significant performance gains for many
datasets, as mentioned in the previous section, while the `hist` tree method also enjoys
improved performance with the removal of the internal `pruner` along with some other
refactoring. Lastly, `gpu_hist` no longer synchronizes the device during training. (#7737)

### General bug fixes

This section lists bug fixes that are not specific to any language binding.

* `num_parallel_tree` is now a model parameter instead of a training hyper-parameter,
  which fixes model IO with random forests. (#7751)
* Fixes in the CMake script for exporting configuration. (#7730)
* XGBoost can now handle unsorted sparse input. This includes text file formats like
  libsvm and scipy sparse matrices where column indices might not be sorted. (#7731)
* Fix the tree param feature type; this affects inputs with a number of columns greater
  than the maximum value of int32. (#7565)
* Fix external memory with `gpu_hist` and subsampling. (#7481)
* Check the number of trees in inplace predict; this avoids a potential segfault when an
  incorrect value for `iteration_range` is provided. (#7409)
* Fix non-stable results in Cox regression. (#7756)

### Changes in the Python package

Other than the changes in Dask, the XGBoost Python package gained some new features and
improvements along with small bug fixes.

* Python 3.7 is required as the lowest supported Python version. (#7682)
* Pre-built binary wheels for Apple Silicon. (#7621, #7612, #7747) Apple Silicon users
  can now run `pip install xgboost` to install XGBoost.
* macOS users no longer need to install `libomp` from Homebrew, as the XGBoost wheel now
  bundles the `libomp.dylib` library.
* There are new parameters for users to specify a custom metric with new behavior.
  XGBoost can now output transformed prediction values when a custom objective is not
  supplied. See our explanation in the
  [tutorial](https://xgboost.readthedocs.io/en/latest/tutorials/custom_metric_obj.html#reverse-link-function)
  for details, and the sketch after this list for a quick example.
* For the sklearn interface, following the estimator guideline from scikit-learn, all
  parameters in `fit` that are not related to input data are moved into the constructor
  and can be set by `set_params`. (#6751, #7420, #7375, #7369)
* The Apache Arrow format is now supported, which can bring better performance to users'
  pipelines. (#7512)
* Pandas nullable types are now supported. (#7760)
* A new function `get_group` is introduced for `DMatrix` to allow users to get the group
  information in the custom objective function. (#7564)
* More training parameters are exposed in the sklearn interface instead of relying on
  `**kwargs`. (#7629)
* A new attribute `feature_names_in_` is defined for all sklearn estimators like
  `XGBRegressor` to follow the convention of sklearn. (#7526)
* More work on Python type hints. (#7432, #7348, #7338, #7513, #7707)
* Support the latest pandas `Index` type. (#7595)
* Fix for the `Feature shape mismatch` error on the s390x platform. (#7715)
* Fix using feature names for constraints with multiple groups. (#7711)
* We clarified the behavior of the callback function when it contains mutable
  states. (#7685)
* Lastly, there are some code cleanups and maintenance work. (#7585, #7426, #7634, #7665,
  #7667, #7377, #7360, #7498, #7438, #7752, #7749, #7751)

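A minimal sketch of the new custom-metric behavior through the sklearn interface: with no
custom objective supplied, the metric callable receives predictions on the transformed
(response) scale. The random data is illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

rng = np.random.default_rng(0)
X, y = rng.random((128, 8)), rng.random(128)

# The metric callable receives (y_true, y_pred); since no custom objective is
# supplied, y_pred is already transformed by the reverse link function.
reg = xgb.XGBRegressor(tree_method="hist", eval_metric=mean_absolute_error)
reg.fit(X, y, eval_set=[(X, y)])
```
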
### Changes in the Dask interface

* The Dask module now supports a user-supplied host IP and port address for the scheduler
  node. Please see the
  [introduction](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html#troubleshooting) and the
  [API document](https://xgboost.readthedocs.io/en/latest/python/python_api.html#optional-dask-configuration)
  for reference, and the sketch after this list. (#7645, #7581)
* Internal `DMatrix` construction in Dask now honors thread configuration. (#7337)
* A fix for the `nthread` configuration when using the Dask sklearn interface. (#7633)
* The Dask interface can now handle empty partitions. An empty partition is different
  from an empty worker: the latter refers to the case when a worker has no partition of
  an input dataset, while the former refers to partitions on a worker that have zero
  rows. (#7644, #7510)
* Scipy sparse matrices are supported as Dask array partitions. (#7457)
* The Dask interface is no longer considered experimental. (#7509)

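A hedged sketch of supplying the scheduler address, following the linked troubleshooting
guide; the Dask config key is taken from that guide, and the address and local cluster
below are purely illustrative:

```python
import dask
import dask.array as da
from dask.distributed import Client
import xgboost as xgb

# The host address the tracker should use can be passed via the Dask config;
# "192.0.2.10" is a placeholder for the scheduler node's IP.
with dask.config.set({"xgboost.scheduler_address": "192.0.2.10"}):
    client = Client()  # a local cluster, for illustration
    X = da.random.random((1000, 10), chunks=(100, 10))
    y = da.random.random(1000, chunks=100)
    dtrain = xgb.dask.DaskDMatrix(client, X, y)
    out = xgb.dask.train(client, {"tree_method": "hist"}, dtrain)
```
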
### Changes in the R package

This section summarizes the new features, improvements, and bug fixes to the R package.

* `load.raw` can optionally construct a booster as the return value. (#7686)
* Fix parsing of decision stumps, which affects both transforming the text representation
  to a data table and plotting. (#7689)
* Implement feature weights. (#7660)
* Some improvements for complying with the CRAN release policy. (#7672, #7661, #7763)
* Support CSR data for predictions. (#7615)
* Documentation updates. (#7263, #7606)
* New maintainer for the CRAN package. (#7691, #7649)
* Handle non-standard installation of the toolchain on macOS. (#7759)

### Changes in JVM-packages

Some new features for JVM-packages are introduced for a more integrated GPU pipeline and
better compatibility with musl-based Linux. Aside from this, we have a few notable bug
fixes.

* Users can specify the tracker IP address for training, which helps run XGBoost in
  restricted network environments. (#7808)
* Add support for detecting musl-based Linux. (#7624)
* Add `DeviceQuantileDMatrix` to the Scala binding. (#7459)
* Add RAPIDS plugin support; now more of the JVM pipeline can be accelerated by
  RAPIDS. (#7491, #7779, #7793, #7806)
* The setters for CPU and GPU are more aligned. (#7692, #7798)
* Control logging for early stopping. (#7326)
* Do not repartition when nWorker = 1. (#7676)
* Fix the prediction issue for `multi:softmax`. (#7694)
* Fix for serialization of custom objective and eval. (#7274)
* Update documentation about the Python tracker. (#7396)
* Remove jackson from the dependencies, which fixes CVE-2020-36518. (#7791)
* Some refactoring to the training pipeline for better compatibility between CPU and
  GPU. (#7440, #7401, #7789, #7784)
* Maintenance work. (#7550, #7335, #7641, #7523, #6792, #4676)

### Deprecation

Other than the changes in the Python package and serialization, we removed some features
deprecated in previous releases. Also, as mentioned in the previous section, we plan to
phase out the old binary format in future releases.

* Remove the old warning from 1.3. (#7279)
* Remove the label encoder deprecated in 1.3. (#7357)
* Remove the old callback deprecated in 1.3. (#7280)
* Pre-built binaries will no longer support deprecated CUDA architectures, including sm35
  and sm50. Users can continue to use these platforms with a source build. (#7767)

### Documentation

This section lists some of the general changes to XGBoost's documentation; for
language-binding-specific changes, please visit the related sections.

* The documentation is overhauled to use the new RTD theme, along with integration of
  Python examples using Sphinx gallery. Also, we replaced most of the hard-coded URLs
  with sphinx references. (#7347, #7346, #7468, #7522, #7530)
* Small updates along with fixes for broken links, typos, etc. (#7684, #7324, #7334,
  #7655, #7628, #7623, #7487, #7532, #7500, #7341, #7648, #7311)
* Update documentation for GPU. [skip ci] (#7403)
* Document the status of RTD hosting. (#7353)
* Update documentation for building from source. (#7664)
* Add a note about the CRAN release. [skip ci] (#7395)

### Maintenance

This is a summary of maintenance work that is not specific to any language binding.

* Add a CMake option to use the /MD runtime. (#7277)
* Add a clang-format configuration. (#7383)
* Code cleanups. (#7539, #7536, #7466, #7499, #7533, #7735, #7722, #7668, #7304, #7293,
  #7321, #7356, #7345, #7387, #7577, #7548, #7469, #7680, #7433, #7398)
* Improved tests with better coverage and the latest dependencies. (#7573, #7446, #7650,
  #7520, #7373, #7723, #7611, #7771)
* Improved automation of the release process. (#7278, #7332, #7470)
* Compiler workarounds. (#7673)
* Change the shebang used in the CLI demo. (#7389)
* Update affiliation. (#7289)

### CI

Some fixes and updates to XGBoost's CI infrastructure. (#7739, #7701, #7382, #7662, #7646,
#7582, #7407, #7417, #7475, #7474, #7479, #7472, #7626)

## v1.5.0 (2021 Oct 11)

This release comes with many exciting new features and optimizations, along with some bug
fixes.