[pyspark] Improve tutorial on enabling GPU support. (#8385)

- Quote the Databricks doc on how to manage dependencies.
- Some wording changes.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

commit 7e53189e7c (parent ba9cc43464)

XGBoost PySpark GPU support
***************************

XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
efficient training but also utilize their GPUs for the whole PySpark pipeline including
ETL and inference. In the sections below, we will walk through an example of training on
a PySpark standalone GPU cluster. To get started, first we need to install some
additional packages, then we can set the ``use_gpu`` parameter to ``True``.
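
For instance, enabling GPU-based training is a one-parameter change on the estimator. A
minimal sketch (the full example, including data preparation, follows below):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor

    # use_gpu=True asks each Spark worker to run training on its assigned GPU.
    regressor = SparkXGBRegressor(use_gpu=True)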

Prepare the necessary packages
==============================

Aside from the PySpark and XGBoost modules, we also need the `cuDF
<https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark DataFrames. We
recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
jobs. Please refer to `How to Manage Python Dependencies in PySpark
<https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
for more details on PySpark dependency management.

In short, to create a Python environment that can be sent to a remote cluster using
virtualenv and pip:

.. code-block:: bash

    python -m venv xgboost_env
    source xgboost_env/bin/activate
    pip install pyarrow pandas venv-pack xgboost
    # Install the cuDF wheel; see https://rapids.ai/pip.html#install
    pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
    venv-pack -o xgboost_env.tar.gz

With Conda:

.. code-block:: bash

    conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
    conda activate xgboost_env
    # Use conda to install xgboost once the supported version (1.7) is released on conda-forge.
    pip install xgboost
    conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
    conda pack -f -o xgboost_env.tar.gz

The packed archive ``xgboost_env.tar.gz`` produced by either tool is what we will later
ship to the executors through the ``--archives`` option of ``spark-submit``.

Write your PySpark application
==============================

The snippet below is a small example of training an XGBoost model with PySpark. Notice
that we are using a list of feature names and the additional parameter ``use_gpu``:

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor
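
    # NOTE: the middle of this snippet is a sketch; the dataset paths, the
    # column name "label", and the parameter values below are illustrative
    # assumptions, not the exact contents of the original tutorial.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    train_df = spark.read.parquet("/path/to/train")  # hypothetical dataset paths
    test_df = spark.read.parquet("/path/to/test")

    label_name = "label"
    feature_names = [c for c in train_df.columns if c != label_name]

    # Create the estimator with a list of feature names and enable GPU training.
    regressor = SparkXGBRegressor(
        features_col=feature_names,
        label_col=label_name,
        num_workers=2,
        use_gpu=True,
    )
    model = regressor.fit(train_df)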
    predict_df = model.transform(test_df)
    predict_df.show()

Submit the PySpark application
==============================

We assume that you have already configured your Spark cluster with GPU support; if not,
please refer to `spark standalone configuration with GPU support
<https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

.. code-block:: bash
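
    # The head of this command is a sketch: the two exports point the driver
    # and executors at the packed environment, an assumption based on standard
    # PySpark dependency management rather than something shown in this diff.
    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \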
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --archives xgboost_env.tar.gz#environment \
      xgboost_app.py

The submit command sends the Python environment created by pip or conda along with the
specification of GPU allocation. We will revisit this command later on.

Model Persistence
=================
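
Like other PySpark ML estimators, the trained model can be persisted with the standard
``save``/``load`` methods. A minimal sketch (the path is illustrative and is reused by
the loading snippet further below):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressorModel

    # Save the trained Spark model to a hypothetical path.
    model.save("/tmp/xgboost-pyspark-model")
    # Load it back as a Spark model object.
    model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")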

To export the underlying booster model used by XGBoost:

.. code-block:: python

    import xgboost as xgb

    # the same booster object returned by xgboost.train
    booster: xgb.Booster = model.get_booster()
    booster.predict(...)
    booster.save_model("model.json")  # or model.ubj, depending on your choice of format.

This booster is not only shared by other Python interfaces but also used by all the
XGBoost bindings, including the C, Java, and R packages. Lastly, one can extract the
booster file directly from a saved spark estimator without going through the getter:

.. code-block:: python

    import xgboost as xgb

    bst = xgb.Booster()
    # Load the booster file saved by the previous snippet.
    bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")

Accelerate the whole pipeline for xgboost pyspark
=================================================

With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_, you
can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
pyspark without any Python code change. An example submit command is shown below with
additional spark configurations and dependencies:

.. code-block:: bash
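
    # The head of this command is a sketch mirroring the standalone submit
    # command above; the exports and GPU options here are assumptions.
    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \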
      --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
      --archives xgboost_env.tar.gz#environment \
      xgboost_app.py

When the RAPIDS plugin is enabled, both the JVM plugin and the cuDF Python package are
required. More configuration options, along with details on the plugin, can be found in
the RAPIDS documentation linked above.
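
For instance (an illustrative assumption; consult the RAPIDS documentation for the
authoritative list), the plugin can be tuned with additional ``--conf`` options such as:

.. code-block:: bash

    # Hypothetical extra flags appended to the submit command above.
    --conf spark.rapids.sql.enabled=true \
    --conf spark.rapids.memory.pinnedPool.size=2G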