Fix inline code blocks in 'spark_estimator.rst' (#8465)
parent 16f96b6cfb
commit 812d577597
@@ -23,7 +23,7 @@ SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost classification
 algorithm based on XGBoost python library, and it can be used in PySpark Pipeline
 and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

-We can create a `SparkXGBRegressor` estimator like:
+We can create a ``SparkXGBRegressor`` estimator like:

 .. code-block:: python

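The estimator-construction snippet itself is collapsed between the hunks in this view. As a minimal sketch of what such a snippet looks like, assuming the ``xgboost.spark`` API (the parameter values here are illustrative, not taken from the document):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor

    # Illustrative parameters; the collapsed snippet in the actual
    # document may use different values.
    xgb_regressor = SparkXGBRegressor(
        features_col="features",
        label_col="label",
        max_depth=5,
    )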
@@ -38,14 +38,14 @@ We can create a `SparkXGBRegressor` estimator like:
 The above snippet creates a spark estimator which can fit on a spark dataset,
 and return a spark model that can transform a spark dataset and generate dataset
 with prediction column. We can set almost all of xgboost sklearn estimator parameters
-as `SparkXGBRegressor` parameters, but some parameter such as `nthread` is forbidden
+as ``SparkXGBRegressor`` parameters, but some parameter such as ``nthread`` is forbidden
 in spark estimator, and some parameters are replaced with pyspark specific parameters
-such as `weight_col`, `validation_indicator_col`, `use_gpu`, for details please see
-`SparkXGBRegressor` doc.
+such as ``weight_col``, ``validation_indicator_col``, ``use_gpu``, for details please see
+``SparkXGBRegressor`` doc.

 The following code snippet shows how to train a spark xgboost regressor model,
 first we need to prepare a training dataset as a spark dataframe contains
-"label" column and "features" column(s), the "features" column(s) must be `pyspark.ml.linalg.Vector`
+"label" column and "features" column(s), the "features" column(s) must be ``pyspark.ml.linalg.Vector``
 type or spark array type or a list of feature column names.

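The training snippet referenced above is likewise collapsed. A minimal sketch under the column conventions just described ("features" as ``pyspark.ml.linalg.Vector``, "label" as the target), assuming an active ``SparkSession`` named ``spark`` and the estimator sketched earlier:

.. code-block:: python

    from pyspark.ml.linalg import Vectors

    # Toy data for illustration only.
    train_spark_dataframe = spark.createDataFrame(
        [(Vectors.dense(1.0, 2.0, 3.0), 0.0),
         (Vectors.dense(4.0, 5.0, 6.0), 1.0)],
        ["features", "label"],
    )
    xgb_regressor_model = xgb_regressor.fit(train_spark_dataframe)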
@@ -56,7 +56,7 @@ type or spark array type or a list of feature column names.

 The following code snippet shows how to predict test data using a spark xgboost regressor model,
 first we need to prepare a test dataset as a spark dataframe contains
-"features" and "label" column, the "features" column must be `pyspark.ml.linalg.Vector`
+"features" and "label" column, the "features" column must be ``pyspark.ml.linalg.Vector``
 type or spark array type.

 .. code-block:: python

@@ -64,16 +64,17 @@ type or spark array type.
 transformed_test_spark_dataframe = xgb_regressor.predict(test_spark_dataframe)


-The above snippet code returns a `transformed_test_spark_dataframe` that contains the input
+The above snippet code returns a ``transformed_test_spark_dataframe`` that contains the input
 dataset columns and an appended column "prediction" representing the prediction results.

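For completeness, a hedged sketch of preparing the test dataframe that feeds the call shown in the hunk above (toy values; an active ``spark`` session is assumed):

.. code-block:: python

    from pyspark.ml.linalg import Vectors

    # Toy test data in the same ("features", "label") layout as training.
    test_spark_dataframe = spark.createDataFrame(
        [(Vectors.dense(1.0, 2.0, 3.0), 0.0)],
        ["features", "label"],
    )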
 SparkXGBClassifier
 ==================

-`SparkXGBClassifier` estimator has similar API with `SparkXGBRegressor`, but it has some
-pyspark classifier specific params, e.g. `raw_prediction_col` and `probability_col` parameters.
-Correspondingly, by default, `SparkXGBClassifierModel` transforming test dataset will
+``SparkXGBClassifier`` estimator has similar API with ``SparkXGBRegressor``, but it has some
+pyspark classifier specific params, e.g. ``raw_prediction_col`` and ``probability_col`` parameters.
+Correspondingly, by default, ``SparkXGBClassifierModel`` transforming test dataset will
 generate result dataset with 3 new columns:

 - "prediction": represents the predicted label.
 - "raw_prediction": represents the output margin values.
 - "probability": represents the prediction probability on each label.
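A minimal sketch of constructing the classifier, assuming the ``xgboost.spark`` API; the output column names are the defaults described in the list above:

.. code-block:: python

    from xgboost.spark import SparkXGBClassifier

    xgb_classifier = SparkXGBClassifier(
        features_col="features",
        label_col="label",
        raw_prediction_col="raw_prediction",
        probability_col="probability",
    )
    # After fit() and transforming a test dataset, the result carries the
    # "prediction", "raw_prediction", and "probability" columns.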
@@ -87,7 +88,7 @@ XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
 efficient training but also utilize their GPUs for the whole PySpark pipeline including
 ETL and inference. In below sections, we will walk through an example of training on a
 PySpark standalone GPU cluster. To get started, first we need to install some additional
-packages, then we can set the `use_gpu` parameter to `True`.
+packages, then we can set the ``use_gpu`` parameter to ``True``.

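A hedged sketch of what enabling GPU training looks like with the ``use_gpu`` parameter named above (``num_workers`` value is illustrative):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor

    # With use_gpu=True, each Spark task running distributed training is
    # expected to be allocated one GPU.
    xgb_regressor = SparkXGBRegressor(
        features_col="features",
        label_col="label",
        use_gpu=True,
        num_workers=2,
    )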
 Prepare the necessary packages
 ==============================

@@ -96,7 +97,7 @@ Aside from the PySpark and XGBoost modules, we also need the `cuDF
 <https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark dataframe. We
 recommend using either Conda or Virtualenv to manage python dependencies for PySpark
 jobs. Please refer to `How to Manage Python Dependencies in PySpark
 <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
 for more details on PySpark dependency management.

 In short, to create a Python environment that can be sent to a remote cluster using
@@ -188,8 +189,8 @@ specification of GPU allocation. We will revisit this command later on.
 Model Persistence
 =================

-Similar to standard PySpark ml estimators, one can persist and reuse the model with `save`
-and `load` methods:
+Similar to standard PySpark ml estimators, one can persist and reuse the model with ``save``
+and ``load`` methods:

 .. code-block:: python

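The collapsed snippet here is a standard PySpark ML persistence round trip. A minimal sketch, assuming ``SparkXGBRegressorModel`` from ``xgboost.spark``, the fitted model from the earlier sketch, and an illustrative path:

.. code-block:: python

    from xgboost.spark import SparkXGBRegressorModel

    # Path is illustrative; any Spark-accessible URI works.
    model_path = "/tmp/sparkxgb_regressor_model"
    xgb_regressor_model.save(model_path)
    loaded_model = SparkXGBRegressorModel.load(model_path)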