Rewrite approx (#7214)

This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing.

The rewrite has many benefits:
- Support for both `max_leaves` and `max_depth`.
- Support for `grow_policy`.
- Support for mono constraint.
- Support for feature weights.
- Support for easier bin configuration (`max_bin`).
- Support for categorical data.
- Faster performance for most of the datasets. (many times faster)
- Support for prediction cache.
- Significantly better performance for external memory.
- Unites the code base between approx and hist.
This commit is contained in:
Jiaming Yuan
2022-01-10 21:15:05 +08:00
committed by GitHub
parent ed95e77752
commit 001503186c
22 changed files with 635 additions and 264 deletions

View File

@@ -154,7 +154,7 @@ Parameters for Tree Booster
* ``sketch_eps`` [default=0.03]
- Only used for ``tree_method=approx``.
- Only used for ``updater=grow_local_histmaker``.
- This roughly translates into ``O(1 / sketch_eps)`` number of bins.
Compared to directly select number of bins, this comes with theoretical guarantee with sketch accuracy.
- Usually user does not have to tune this.
@@ -238,13 +238,27 @@ Parameters for Tree Booster
list is a group of indices of features that are allowed to interact with each other.
See :doc:`/tutorials/feature_interaction_constraint` for more information.
Additional parameters for ``hist`` and ``gpu_hist`` tree method
================================================================
Additional parameters for ``hist``, ``gpu_hist`` and ``approx`` tree method
===========================================================================
* ``single_precision_histogram``, [default= ``false``]
- Use single precision to build histograms instead of double precision.
Additional parameters for ``approx`` tree method
================================================
* ``max_cat_to_onehot``
.. versionadded:: 1.6
.. note:: The support for this parameter is experimental.
- A threshold for deciding whether XGBoost should use one-hot encoding based split for
categorical data. When number of categories is lesser than the threshold then one-hot
encoding is chosen, otherwise the categories will be partitioned into children nodes.
Only relevant for regression and binary classification with `approx` tree method.
Additional parameters for Dart Booster (``booster=dart``)
=========================================================