[doc] Clarify prediction function. (#6813)
This commit is contained in: parent b1fdb220f4 · commit 0cced530ea
@@ -22,6 +22,7 @@ Contents
   XGBoost User Forum <https://discuss.xgboost.ai>
   GPU support <gpu/index>
   parameter
+  prediction
   treemethod
   Python package <python/index>
   R package <R-package/index>
doc/prediction.rst (new file, 148 lines)
@@ -0,0 +1,148 @@
.. _predict_api:

##########
Prediction
##########

There are a number of prediction functions in XGBoost with various parameters. This
document attempts to clarify some of the confusion around prediction, with a focus on
the Python binding.
******************
Prediction Options
******************

There are a number of different prediction options for the
:py:meth:`xgboost.Booster.predict` method, ranging from ``pred_contribs`` to
``pred_leaf``. The output shape depends on the type of prediction. Also, for a
multi-class classification problem, XGBoost builds one tree for each class, and the
trees for each class are called a "group" of trees, so the output dimension may change
depending on the model used. In the 1.4 release, we added a new parameter called
``strict_shape``; one can set it to ``True`` to indicate that a more restricted output
is desired. Assuming you are using :py:obj:`xgboost.Booster`, here is a list of the
possible returns (a short sketch of these shapes follows the list):
- When using normal prediction with ``strict_shape`` set to ``True``:

  Output is a 2-dim array, with the first dimension as rows and the second as groups.
  For regression/survival/ranking/binary classification this is equivalent to a column
  vector with ``shape[1] == 1``. But for multi-class with ``multi:softprob`` the number
  of columns equals the number of classes. If ``strict_shape`` is set to ``False``,
  XGBoost might output a 1- or 2-dim array.

- When using ``output_margin`` to avoid transformation and ``strict_shape`` is set to
  ``True``:

  Similar to the previous case, output is a 2-dim array, except that ``multi:softmax``
  has output equivalent to that of ``multi:softprob`` due to the dropped
  transformation. If strict shape is set to ``False``, the output can have 1 or 2 dims
  depending on the model used.

- When using ``pred_contribs`` with ``strict_shape`` set to ``True``:

  Output is a 3-dim array with shape ``(rows, groups, columns + 1)``. Whether
  ``approx_contribs`` is used does not change the output shape. If the strict shape
  parameter is not set, the output can be a 2- or 3-dim array depending on whether a
  multi-class model is being used.

- When using ``pred_interactions`` with ``strict_shape`` set to ``True``:

  Output is a 4-dim array with shape ``(rows, groups, columns + 1, columns + 1)``. Like
  the predict contribution case, whether ``approx_contribs`` is used does not change
  the output shape. If strict shape is set to ``False``, it can have 3 or 4 dims
  depending on the underlying model.

- When using ``pred_leaf`` with ``strict_shape`` set to ``True``:

  Output is a 4-dim array with shape ``(n_samples, n_iterations, n_classes,
  n_trees_in_forest)``. ``n_trees_in_forest`` is specified by ``num_parallel_tree``
  during training. When strict shape is set to ``False``, output is a 2-dim array with
  the last 3 dims concatenated into 1. Also, when using the ``apply`` method in the
  scikit-learn interface, this is set to ``False`` by default.
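To make these shapes concrete, here is a minimal sketch; the synthetic 3-class dataset,
feature count and round count are illustrative assumptions only:

.. code-block:: python

    import numpy as np
    import xgboost as xgb

    X = np.random.randn(100, 5)            # 100 rows, 5 features
    y = np.random.randint(0, 3, size=100)  # 3 classes
    Xy = xgb.DMatrix(X, label=y)
    booster = xgb.train(
        {"objective": "multi:softprob", "num_class": 3}, Xy, num_boost_round=4
    )

    # Normal prediction: (rows, groups), where groups == number of classes.
    assert booster.predict(Xy, strict_shape=True).shape == (100, 3)
    # Feature contributions: (rows, groups, columns + 1).
    assert booster.predict(Xy, pred_contribs=True, strict_shape=True).shape == (100, 3, 6)
    # Leaf indices: (n_samples, n_iterations, n_classes, n_trees_in_forest).
    assert booster.predict(Xy, pred_leaf=True, strict_shape=True).shape == (100, 4, 3, 1)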
Other than these prediction types, there is also a parameter called
``iteration_range``, which is similar to model slicing. But instead of actually
splitting the model up into multiple stacks, it simply returns the prediction formed by
the trees within the range. The number of trees created in each iteration equals
:math:`trees_i = num\_class \times num\_parallel\_tree`. So if you are training a
boosted random forest with forest size 4 on a 3-class classification dataset, and want
to use the first 2 iterations of trees for prediction, you need to provide
``iteration_range=(0, 2)``. Then the first :math:`2 \times 3 \times 4` trees will be
used in this prediction.
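Continuing the sketch above, a hedged example of this boosted random forest case
(``num_parallel_tree=4``, 3 classes; the round count is arbitrary):

.. code-block:: python

    # Each iteration grows num_class * num_parallel_tree = 3 * 4 = 12 trees.
    params = {
        "objective": "multi:softprob",
        "num_class": 3,
        "num_parallel_tree": 4,
    }
    forest = xgb.train(params, Xy, num_boost_round=8)
    # Use only the first 2 iterations, i.e. the first 2 * 3 * 4 = 24 trees.
    preds = forest.predict(Xy, iteration_range=(0, 2))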
*********
Predictor
*********

There are 2 predictors in XGBoost (3 if you have the one-api plugin enabled), namely
``cpu_predictor`` and ``gpu_predictor``. The default option is ``auto``, so that
XGBoost can employ some heuristics for saving GPU memory during training. They might
produce slightly different outputs due to floating point errors.
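For example, to pin the predictor explicitly instead of relying on ``auto`` (a minimal
sketch, reusing the ``booster`` from the earlier examples):

.. code-block:: python

    # Select the GPU predictor explicitly; "auto" remains the default.
    booster.set_param({"predictor": "gpu_predictor"})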
***********
Base Margin
***********

There is a training parameter in XGBoost called ``base_score``, and a meta data field
for ``DMatrix`` called ``base_margin`` (which can be set in the ``fit`` method if you
are using the scikit-learn interface). They specify the global bias for the boosted
model, and if the latter is supplied then the former is ignored. ``base_margin`` can be
used to train an XGBoost model based on other models. See the demos on boosting from
predictions.
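A minimal sketch of the boosting-from-predictions idea, assuming a hypothetical
already-trained ``base_booster`` for a binary task and its ``dtrain`` DMatrix:

.. code-block:: python

    # Use the raw (untransformed) output of an existing model as the bias for
    # a new round of training.
    margin = base_booster.predict(dtrain, output_margin=True)
    dtrain.set_base_margin(margin)
    refined = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=4)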
*****************
Staged Prediction
*****************

Using the native interface with ``DMatrix``, prediction can be staged (or cached). For
example, one can first predict on the first 4 trees, then run prediction on 8 trees.
After running the first prediction, the result from the first 4 trees is cached, so
when you run the prediction with 8 trees XGBoost can reuse the result from the previous
prediction. The cache expires automatically upon the next prediction, training or
evaluation if the cached ``DMatrix`` object expires (for example by going out of scope
and being collected by the garbage collector in your language environment).
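A hedged sketch using ``iteration_range``, reusing the 8-round ``forest`` and ``Xy``
from the earlier examples:

.. code-block:: python

    # The first call computes and caches the result of the first 4 iterations;
    # the second call can reuse that cached partial result while the same
    # DMatrix object is alive.
    partial = forest.predict(Xy, iteration_range=(0, 4))
    full = forest.predict(Xy, iteration_range=(0, 8))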
*******************
In-place Prediction
*******************

Traditionally, XGBoost accepts only ``DMatrix`` for prediction; with wrappers like the
scikit-learn interface, the construction happens internally. We added support for
in-place predict to bypass the construction of ``DMatrix``, which is slow and memory
consuming. The new predict function has limited features but is often sufficient for
simple inference tasks. It accepts some commonly found data types in Python like
:py:obj:`numpy.ndarray`, :py:obj:`scipy.sparse.csr_matrix` and :py:obj:`cudf.DataFrame`
instead of :py:obj:`xgboost.DMatrix`. You can call
:py:meth:`xgboost.Booster.inplace_predict` to use it. Be aware that the output of
in-place prediction depends on the input data type: when the input is on GPU, the
output is a :py:obj:`cupy.ndarray`; otherwise a :py:obj:`numpy.ndarray` is returned.
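A minimal sketch with the earlier ``booster`` (the new batch must have the same number
of features as the training data):

.. code-block:: python

    X_new = np.random.randn(10, 5)  # same 5 features as the training data
    # No DMatrix is constructed; CPU input yields a numpy.ndarray.
    preds = booster.inplace_predict(X_new)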
****************
Categorical Data
****************

Aside from users performing their own encoding, XGBoost has experimental support for
categorical data using ``gpu_hist`` and ``gpu_predictor``. No special operation needs
to be done on input test data, since the information about the categories is encoded
into the model during training.
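A hedged sketch, assuming a hypothetical :py:obj:`pandas.DataFrame` ``df`` whose
categorical columns use the ``category`` dtype:

.. code-block:: python

    # Experimental: category information is stored in the model itself, so the
    # test data needs no extra preprocessing.
    Xy_cat = xgb.DMatrix(df, label=y, enable_categorical=True)
    cat_booster = xgb.train({"tree_method": "gpu_hist"}, Xy_cat)
    cat_preds = cat_booster.predict(xgb.DMatrix(df, enable_categorical=True))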
*************
Thread Safety
*************

After the 1.4 release, all prediction functions, including normal ``predict`` with
various parameters like SHAP value computation, as well as ``inplace_predict``, are
thread safe when the underlying booster is ``gbtree`` or ``dart``. This means that as
long as a tree model is used, prediction itself should be thread safe. But the safety
is only guaranteed for prediction: if one tries to train a model in one thread and run
prediction in another using the same model, the behaviour is undefined. This happens
more easily than one might expect; for instance, we might accidentally call
``clf.set_params()`` inside a predict function:
.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor

    import xgboost as xgb

    def predict_fn(clf: xgb.XGBClassifier, X):
        X = preprocess(X)
        # set_params() mutates the shared model object, racing with other threads.
        clf.set_params(predictor="gpu_predictor")  # NOT safe!
        clf.set_params(n_jobs=1)                   # NOT safe!
        return clf.predict_proba(X, iteration_range=(0, 10))

    with ThreadPoolExecutor(max_workers=10) as e:
        e.submit(predict_fn, ...)
python-package/xgboost/core.py
@@ -1616,12 +1616,11 @@ class Booster(object):
     ) -> np.ndarray:
         """Predict with data.
 
-        .. note:: This function is not thread safe except for ``gbtree`` booster.
+        .. note::
 
-          When using booster other than ``gbtree``, predict can only be called from one
-          thread. If you want to run prediction using multiple thread, call
-          :py:meth:`xgboost.Booster.copy` to make copies of model object and then call
-          ``predict()``.
+          See `Prediction
+          <https://xgboost.readthedocs.io/en/latest/tutorials/prediction.html>`_
+          for issues like thread safety and a summary of outputs from this function.
 
         Parameters
         ----------
@@ -1665,8 +1664,11 @@ class Booster(object):
             feature_names are the same.
 
         training :
-            Whether the prediction value is used for training. This can effect
-            `dart` booster, which performs dropouts during training iterations.
+            Whether the prediction value is used for training. This can affect the
+            `dart` booster, which performs dropouts during training iterations but uses
+            all trees for inference. If you want to obtain the result with dropouts, set
+            this parameter to `True`. Also, the parameter is set to true when obtaining
+            predictions for a custom objective function.
 
             .. versionadded:: 1.0.0
@@ -1686,12 +1688,6 @@ class Booster(object):
 
             .. versionadded:: 1.4.0
 
-            .. note:: Using ``predict()`` with DART booster
-
-              If the booster object is DART type, ``predict()`` will not perform
-              dropouts, i.e. all the trees will be evaluated. If you want to
-              obtain result with dropouts, provide `training=True`.
-
         Returns
         -------
         prediction : numpy array
@@ -1916,11 +1912,9 @@ class Booster(object):
         The model is saved in an XGBoost internal format which is universal among the
         various XGBoost interfaces. Auxiliary attributes of the Python Booster object
         (such as feature_names) will not be saved when using binary format. To save those
-        attributes, use JSON instead. See:
-
-        https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
-
-        for more info.
+        attributes, use JSON instead. See: `Model IO
+        <https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html>`_ for more
+        info.
 
         Parameters
         ----------
@@ -1956,11 +1950,9 @@ class Booster(object):
         The model is loaded from XGBoost format which is universal among the various
         XGBoost interfaces. Auxiliary attributes of the Python Booster object (such as
         feature_names) will not be loaded when using binary format. To save those
-        attributes, use JSON instead. See:
-
-        https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html
-
-        for more info.
+        attributes, use JSON instead. See: `Model IO
+        <https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html>`_ for more
+        info.
 
         Parameters
         ----------