[pyspark] Cleanup data processing. (#8344)

* Enable additional combinations of ctor parameters.
* Unify procedures for QuantileDMatrix and DMatrix.
Jiaming Yuan
2022-10-18 14:56:23 +08:00
committed by GitHub
parent 521086d56b
commit 3901f5d9db
5 changed files with 68 additions and 55 deletions


@@ -83,10 +83,11 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************
-XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first need
-to install the xgboost and cudf packages. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
+need to install the XGBoost and `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_
+packages. Then you can set the `use_gpu` parameter to `True`.
-Below tutorial will show you how to train a model with XGBoost PySpark GPU on Spark
+The tutorial below demonstrates how to train a model with XGBoost PySpark GPU on a Spark
 standalone cluster.
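As a sketch of the workflow the changed paragraph describes, GPU training is enabled by passing `use_gpu=True` to the PySpark estimator. This assumes a Spark cluster with GPUs and the xgboost and cudf packages installed; the column names and `num_workers` value are illustrative, not taken from this commit:

```python
# Sketch: GPU training with the xgboost.spark estimator.
# Requires a Spark session on a GPU-equipped cluster; "features"/"label"
# column names and num_workers=2 are illustrative placeholders.
from xgboost.spark import SparkXGBClassifier

classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    num_workers=2,   # one training task (and one GPU) per worker
    use_gpu=True,    # the parameter this documentation change covers
)
# model = classifier.fit(train_df)
# predictions = model.transform(test_df)
```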
@@ -138,7 +139,7 @@ in PySpark. Please refer to
 conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
 conda activate xgboost-env
 pip install xgboost
-pip install cudf
+conda install cudf -c rapids -c nvidia -c conda-forge
 conda pack -f -o xgboost-env.tar.gz
@@ -220,3 +221,6 @@ Below is a simple example submit command for enabling GPU acceleration:
 --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
 --archives xgboost-env.tar.gz#environment \
 xgboost_app.py
+When the RAPIDS plugin is enabled, both the JVM RAPIDS plugin and the cuDF Python
+package are required for the acceleration.
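As a hedged illustration of what enabling the RAPIDS plugin typically involves (the jar name, version placeholder, and master URL below are assumptions, not taken from this commit), the submit command gains RAPIDS-specific configuration along these lines:

```shell
# Sketch only: paths and <placeholders> must be filled in for a real cluster.
spark-submit \
  --master spark://<master-host>:7077 \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --jars rapids-4-spark_2.12-<version>.jar \
  --archives xgboost-env.tar.gz#environment \
  xgboost_app.py
```

The JVM side (`spark.plugins` and the RAPIDS jar) accelerates Spark SQL operations, while the cuDF Python package serves the Python workers; both halves are needed, which is why the added sentence calls them out together.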