[pyspark] User guide doc and tutorials (#8082)
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
66
doc/tutorials/spark_estimator.rst
Normal file
@@ -0,0 +1,66 @@
###############################
Using XGBoost PySpark Estimator
###############################
Starting from version 2.0, xgboost supports pyspark estimator APIs.
The feature is still experimental and not yet ready for production use.

*****************
SparkXGBRegressor
*****************

SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression
algorithm based on the XGBoost Python library, and it can be used in a PySpark Pipeline
and with PySpark ML meta-algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

We can create a `SparkXGBRegressor` estimator like:

.. code-block:: python

  from xgboost.spark import SparkXGBRegressor
  spark_reg_estimator = SparkXGBRegressor(num_workers=2, max_depth=5)

The above snippet creates a Spark estimator that can be fit on a Spark dataset
and returns a Spark model that can transform a Spark dataset and generate a dataset
with a prediction column. Almost all parameters of the XGBoost sklearn estimator can be set
as `SparkXGBRegressor` parameters, but some, such as `nthread`, are forbidden
in the Spark estimator, and others are replaced with PySpark-specific parameters
such as `weight_col`, `validation_indicator_col`, and `use_gpu`. For details, please see
the `SparkXGBRegressor` documentation.

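As a rough illustration of the PySpark-specific parameters mentioned above, the sketch below collects estimator parameters in a dictionary and passes them to `SparkXGBRegressor`. The column names `weight` and `is_validation` and the helper `make_regressor` are hypothetical and only serve the example; check the `SparkXGBRegressor` documentation for the authoritative parameter list.

```python
# Hypothetical column names; adjust them to match your own DataFrame.
regressor_params = {
    "num_workers": 2,
    "max_depth": 5,
    # PySpark-specific parameters that name DataFrame columns.
    "weight_col": "weight",
    "validation_indicator_col": "is_validation",
}


def make_regressor(params):
    """Build a SparkXGBRegressor from a parameter dictionary.

    The import is deferred so that the dictionary above can be inspected
    without xgboost and pyspark being installed.
    """
    from xgboost.spark import SparkXGBRegressor

    return SparkXGBRegressor(**params)
```

Note that `nthread` is deliberately absent from the dictionary, since the text above says it is forbidden in the Spark estimator.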
The following code snippet shows how to train a Spark XGBoost regressor model.
First, prepare a training dataset as a Spark DataFrame that contains
"features" and "label" columns. The "features" column must be of `pyspark.ml.linalg.Vector`
type or Spark array type.

.. code-block:: python

  xgb_regressor_model = spark_reg_estimator.fit(train_spark_dataframe)

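The training DataFrame itself is not shown above. One possible way to build a small one with an array-typed "features" column is sketched below. The toy values and the helper names `make_training_rows`, `build_training_dataframe`, and `main` are made up for illustration, and the sketch assumes a local Spark installation with two worker threads.

```python
def make_training_rows():
    """Return toy (features, label) rows; the values are made up."""
    return [
        ([1.0, 2.0, 3.0], 0.5),
        ([4.0, 5.0, 6.0], 1.5),
        ([7.0, 8.0, 9.0], 2.5),
        ([10.0, 11.0, 12.0], 3.5),
    ]


def build_training_dataframe(spark, rows):
    """Create a DataFrame with an array "features" column and a "label" column."""
    return spark.createDataFrame(
        rows, schema="features array<double>, label double"
    )


def main():
    # Imports are deferred so the helpers above work without Spark installed.
    from pyspark.sql import SparkSession
    from xgboost.spark import SparkXGBRegressor

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    train_spark_dataframe = build_training_dataframe(spark, make_training_rows())

    spark_reg_estimator = SparkXGBRegressor(num_workers=2, max_depth=5)
    xgb_regressor_model = spark_reg_estimator.fit(train_spark_dataframe)

    # The fitted model appends a "prediction" column when transforming data.
    xgb_regressor_model.transform(train_spark_dataframe).show()
    spark.stop()
```

Calling `main()` runs the whole flow. In a real job the DataFrame would normally come from a data source rather than an in-memory list.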
The following code snippet shows how to predict on test data using a Spark XGBoost regressor model.
First, prepare a test dataset as a Spark DataFrame that contains
"features" and "label" columns. The "features" column must be of `pyspark.ml.linalg.Vector`
type or Spark array type.

.. code-block:: python

  transformed_test_spark_dataframe = xgb_regressor_model.transform(test_spark_dataframe)

The above snippet returns a `transformed_test_spark_dataframe` that contains the input
dataset columns and an appended "prediction" column holding the prediction results.

******************
SparkXGBClassifier
******************

The `SparkXGBClassifier` estimator has an API similar to that of `SparkXGBRegressor`, but it has some
PySpark classifier-specific parameters, e.g. `raw_prediction_col` and `probability_col`.
Correspondingly, by default, a `SparkXGBClassifierModel` transforming a test dataset will
generate a result dataset with 3 new columns:

- "prediction": represents the predicted label.
- "rawPrediction": represents the output margin values.
- "probability": represents the prediction probability on each label.

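To make the relationship between the three output columns more concrete, here is a small pure-Python sketch. It is illustrative only and is not how `xgboost.spark` computes the columns internally. It assumes a softmax-style multi-class objective, and the margin values are made up.

```python
import math


def softmax(margins):
    """Convert raw margin values into probabilities that sum to 1."""
    shift = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(probabilities):
    """Return the index of the most probable class as a float label."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return float(best)


# Hypothetical margins for one row of a 3-class problem.
raw_prediction = [0.2, 1.5, -0.3]
probability = softmax(raw_prediction)    # one probability per label
prediction = predict_label(probability)  # label with the highest probability
```

Under these assumptions, the class with the largest margin also has the largest probability, so it becomes the predicted label.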