[doc] Mention data consistency for categorical features. (#9678)

Jiaming Yuan
2023-10-24 10:11:33 +08:00
committed by GitHub
parent 5e6cb63a56
commit 3ca06ac51e
8 changed files with 293 additions and 96 deletions


@@ -94,11 +94,11 @@ Using native interface
**********************
The ``scikit-learn`` interface is user-friendly, but lacks some features that are only
available in the native interface. For instance, users cannot compute SHAP values
directly. Also, the native interface supports more data types. To use the native
interface with categorical data, we need to pass similar parameters to
:class:`~xgboost.DMatrix` or :py:class:`~xgboost.QuantileDMatrix` and the
:func:`train <xgboost.train>` function. For dataframe input:

.. code:: python
@@ -117,7 +117,6 @@ SHAP value computation:
# categorical features are listed as "c"
print(booster.feature_types)
For other types of input, like ``numpy array``, we can tell XGBoost about the feature
types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatrix>`:
@@ -131,7 +130,31 @@ types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatr
For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface so
:class:`dask.Array <dask.Array>` can also be used for categorical data. Lastly, the
sklearn interface :py:class:`~xgboost.XGBRegressor` has the same parameter.

****************
Data Consistency
****************

XGBoost accepts parameters to indicate which features are considered categorical,
either through the ``dtypes`` of a dataframe or through the ``feature_types``
parameter. However, XGBoost by itself doesn't store information on how the categories
are encoded in the first place. For instance, given an encoding schema that maps music
genres to integer codes:

.. code-block:: python

  {"acoustic": 0, "indie": 1, "blues": 2, "country": 3}

XGBoost doesn't know this mapping from the input and hence cannot store it in the
model. The mapping usually happens in the users' data engineering pipeline, with column
transformers like :py:class:`sklearn.preprocessing.OrdinalEncoder`. To obtain correct
results from XGBoost, users need to keep the pipeline for transforming data consistent
across training and testing data. One should watch out for errors like:

.. code-block:: python

  X_train["genre"] = X_train["genre"].astype("category")
  reg = xgb.XGBRegressor(enable_categorical=True).fit(X_train, y_train)

  # invalid encoding
  X_test["genre"] = X_test["genre"].astype("category")
  reg.predict(X_test)

In the above snippet, the training data and test data are encoded separately, resulting
in two different encoding schemas and invalid prediction results. See
:ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using an ordinal
encoder.

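One way to avoid this mistake, sketched below with a shared :py:class:`pandas.CategoricalDtype` (the genre vocabulary is the illustrative one from above), is to build the categorical dtype once and reuse it for both training and test data, so that both share a single encoding schema:

```python
import pandas as pd

# Build the dtype once, from the training data or a fixed vocabulary.
genre_dtype = pd.CategoricalDtype(
    categories=["acoustic", "indie", "blues", "country"]
)

# Applying the same dtype to both splits yields consistent codes.
train = pd.Series(["indie", "blues", "acoustic"]).astype(genre_dtype)
test = pd.Series(["country", "indie"]).astype(genre_dtype)

print(train.cat.codes.tolist())  # [1, 2, 0]
print(test.cat.codes.tolist())   # [3, 1]
```

Calling ``.astype("category")`` separately on each split, by contrast, infers the categories from whatever values happen to be present, so the codes need not agree between training and test data.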
*************
Miscellaneous