[pyspark] Cleanup data processing. (#8344)

* Enable additional combinations of ctor parameters.
* Unify procedures for QuantileDMatrix and DMatrix.
Jiaming Yuan
2022-10-18 14:56:23 +08:00
committed by GitHub
parent 521086d56b
commit 3901f5d9db
5 changed files with 68 additions and 55 deletions


@@ -83,10 +83,11 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************
-XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first need
-to install the xgboost and cudf packages. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
+need to install the XGBoost and `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_
+packages. Then you can set the `use_gpu` parameter to `True`.
-Below tutorial will show you how to train a model with XGBoost PySpark GPU on Spark
+The tutorial below demonstrates how to train a model with XGBoost PySpark GPU on a Spark
 standalone cluster.
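As a sketch of the workflow the changed paragraph describes, GPU training is enabled by passing `use_gpu=True` to the PySpark estimator. This assumes a Spark cluster with GPUs and the xgboost and cudf packages installed; the column names and `num_workers` value are illustrative, not taken from this commit:

```python
# Sketch: GPU training with the xgboost.spark estimator.
# Requires a Spark session on a GPU-equipped cluster; "features"/"label"
# column names and num_workers=2 are illustrative placeholders.
from xgboost.spark import SparkXGBClassifier

classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2,   # one training task (and one GPU) per worker
    use_gpu=True,    # the parameter this documentation change covers
)
# model = classifier.fit(train_df)
# predictions = model.transform(test_df)
```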
@@ -138,7 +139,7 @@ in PySpark. Please refer to
 conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
 conda activate xgboost-env
 pip install xgboost
-pip install cudf
+conda install cudf -c rapids -c nvidia -c conda-forge
 conda pack -f -o xgboost-env.tar.gz
@@ -220,3 +221,6 @@ Below is a simple example submit command for enabling GPU acceleration:
 --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
 --archives xgboost-env.tar.gz#environment \
 xgboost_app.py
+When the RAPIDS plugin is enabled, both the JVM RAPIDS plugin and the cuDF Python
+package are required for the acceleration.
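As a hedged illustration of what enabling the RAPIDS plugin typically involves (the jar name, version placeholder, and master URL below are assumptions, not taken from this commit), the submit command gains RAPIDS-specific configuration along these lines:

```shell
# Sketch only: paths and <placeholders> must be filled in for a real cluster.
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --jars rapids-4-spark_2.12-<version>.jar \
  --archives xgboost-env.tar.gz#environment \
  xgboost_app.py
```

The JVM side (`spark.plugins` and the RAPIDS jar) accelerates Spark SQL operations, while the cuDF Python package serves the Python workers; both halves are needed, which is why the added sentence calls them out together.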