[doc] Mention data consistency for categorical features. (#9678)
@@ -137,7 +137,7 @@ To build and run C++ unit tests enable tests while running CMake:
   ./testxgboost

 Flags like ``USE_CUDA``, ``USE_DMLC_GTEST`` are optional. For more info about how to build
-XGBoost from source, see :doc:`</build>`. One can also run all unit test using ctest tool
+XGBoost from source, see :doc:`/build`. One can also run all unit tests using ctest tool
 which provides higher flexibility. For example:

 .. code-block:: bash
@@ -94,11 +94,11 @@ Using native interface
 **********************

 The ``scikit-learn`` interface is user friendly, but lacks some features that are only
-available in native interface. For instance users cannot compute SHAP value directly or
-use quantized :class:`DMatrix <xgboost.DMatrix>`. Also native interface supports data
-types other than dataframe, like ``numpy/cupy array``. To use the native interface with
-categorical data, we need to pass the similar parameter to :class:`DMatrix
-<xgboost.DMatrix>` and the :func:`train <xgboost.train>` function. For dataframe input:
+available in the native interface. For instance, users cannot compute SHAP values directly.
+Also, the native interface supports more data types. To use the native interface with
+categorical data, we need to pass a similar parameter to :class:`~xgboost.DMatrix` or
+:py:class:`~xgboost.QuantileDMatrix` and the :func:`train <xgboost.train>` function. For
+dataframe input:

 .. code:: python
@@ -117,7 +117,6 @@ SHAP value computation:
   # categorical features are listed as "c"
   print(booster.feature_types)

-
 For other types of input, like ``numpy array``, we can tell XGBoost about the feature
 types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatrix>`:
@@ -131,7 +130,31 @@ types by using the ``feature_types`` parameter in :class:`DMatrix <xgboost.DMatr

 For numerical data, the feature type can be ``"q"`` or ``"float"``, while for categorical
 feature it's specified as ``"c"``. The Dask module in XGBoost has the same interface so
-:class:`dask.Array <dask.Array>` can also be used for categorical data.
+:class:`dask.Array <dask.Array>` can also be used for categorical data. Lastly, the
+sklearn interface :py:class:`~xgboost.XGBRegressor` has the same parameter.
+
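As a rough illustration of the ``feature_types`` convention described above, the sketch below builds such a list in plain Python. The helper ``make_feature_types`` and the column names are hypothetical and are not part of XGBoost; only the ``"q"`` and ``"c"`` codes come from the text.

```python
from typing import Iterable, List, Set


def make_feature_types(columns: Iterable[str], categorical: Set[str]) -> List[str]:
    """Hypothetical helper: "c" for categorical columns, "q" for numerical ones."""
    return ["c" if name in categorical else "q" for name in columns]


# Hypothetical column layout of an array-like input.
columns = ["duration", "genre", "tempo"]
feature_types = make_feature_types(columns, categorical={"genre"})
print(feature_types)

# The list would then accompany array input, for example (not executed here):
#   xgb.DMatrix(X, label=y, feature_types=feature_types, enable_categorical=True)
```

The list must follow the column order of the input array, since array input carries no column names to match against.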
+****************
+Data Consistency
+****************
+
+XGBoost accepts parameters to indicate which features are categorical, either through the ``dtypes`` of a dataframe or through the ``feature_types`` parameter. However, XGBoost itself doesn't store information on how the categories were encoded in the first place. For instance, given an encoding schema that maps music genres to integer codes:
+
+.. code-block:: python
+
+  {"acoustic": 0, "indie": 1, "blues": 2, "country": 3}
+
+XGBoost cannot learn this mapping from the input and hence cannot store it in the model. The mapping usually happens in the user's data engineering pipeline, with column transformers like :py:class:`sklearn.preprocessing.OrdinalEncoder`. To ensure correct results from XGBoost, users need to keep the data transformation pipeline consistent across training and test data. One should watch out for errors like:
+
+.. code-block:: python
+
+  X_train["genre"] = X_train["genre"].astype("category")
+  reg = xgb.XGBRegressor(enable_categorical=True).fit(X_train, y_train)
+
+  # invalid: the test set is encoded independently of the training set
+  X_test["genre"] = X_test["genre"].astype("category")
+  reg.predict(X_test)
+
+In the above snippet, training data and test data are encoded separately, resulting in two different encoding schemas and invalid predictions. See :ref:`sphx_glr_python_examples_cat_pipeline.py` for a worked example using an ordinal encoder.
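To make the failure mode concrete, here is a minimal plain-Python sketch of why separate encoding breaks and how reusing one mapping avoids it. The helpers ``fit_mapping`` and ``encode`` are hypothetical stand-ins for an encoder. They assign codes to the sorted unique values, so the codes differ from the example schema above, and the sample values are made up.

```python
from typing import Dict, List


def fit_mapping(values: List[str]) -> Dict[str, int]:
    """Hypothetical helper: assign integer codes to the sorted unique values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values: List[str], mapping: Dict[str, int]) -> List[int]:
    """Encode values with a fixed mapping; unseen categories raise KeyError."""
    return [mapping[value] for value in values]


train_genres = ["acoustic", "indie", "blues", "country"]
test_genres = ["country", "indie"]

# Learn the mapping once, from the training data only.
train_mapping = fit_mapping(train_genres)

# Wrong: a mapping learned from the test data alone assigns different codes.
separate_codes = encode(test_genres, fit_mapping(test_genres))

# Right: reuse the training mapping for the test data.
consistent_codes = encode(test_genres, train_mapping)

print(train_mapping)
print(separate_codes)
print(consistent_codes)
```

With pandas input, one way to get the same effect is to cast the test column to the training column's dtype, for example ``X_test["genre"].astype(X_train["genre"].dtype)``, so that both frames share one set of categories.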

 *************
 Miscellaneous