[pyspark] Improve tutorial on enabling GPU support. (#8385)

- Quote the Databricks doc on how to manage dependencies.
- Some wording changes.

Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>

commit 7e53189e7c (parent ba9cc43464)

XGBoost PySpark GPU support
***************************

XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
efficient training but also utilize their GPUs for the whole PySpark pipeline including
ETL and inference. In the sections below, we will walk through an example of training on
a PySpark standalone GPU cluster. To get started, first we need to install some
additional packages, then we can set the ``use_gpu`` parameter to ``True``.
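
For instance, enabling GPU-based training is a one-parameter change on the estimator. A
minimal sketch (the full example, including data preparation, follows below):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor

    # use_gpu=True asks each Spark worker to run training on its assigned GPU.
    regressor = SparkXGBRegressor(use_gpu=True)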

Prepare the necessary packages
==============================

Aside from the PySpark and XGBoost modules, we also need the `cuDF
<https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark DataFrames. We
recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
jobs. Please refer to `How to Manage Python Dependencies in PySpark
<https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
for more details on PySpark dependency management.

In short, to create a Python environment that can be sent to a remote cluster using
virtualenv and pip:

.. code-block:: bash

    python -m venv xgboost_env
    source xgboost_env/bin/activate
    pip install pyarrow pandas venv-pack xgboost
    # Install the cuDF wheel; see https://rapids.ai/pip.html#install
    pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
    venv-pack -o xgboost_env.tar.gz

With Conda:

.. code-block:: bash

    conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
    conda activate xgboost_env
    # Use conda to install xgboost once the supported version (1.7) is released on conda-forge.
    pip install xgboost
    conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
    conda pack -f -o xgboost_env.tar.gz

The packed archive ``xgboost_env.tar.gz`` produced by either tool is what we will later
ship to the executors through the ``--archives`` option of ``spark-submit``.

Write your PySpark application
==============================

The snippet below is a small example of training an XGBoost model with PySpark. Notice
that we are using a list of feature names and the additional parameter ``use_gpu``:

.. code-block:: python

    from xgboost.spark import SparkXGBRegressor
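
    # NOTE: the middle of this snippet is a sketch; the dataset paths, the
    # column name "label", and the parameter values below are illustrative
    # assumptions, not the exact contents of the original tutorial.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    train_df = spark.read.parquet("/path/to/train")  # hypothetical dataset paths
    test_df = spark.read.parquet("/path/to/test")

    label_name = "label"
    feature_names = [c for c in train_df.columns if c != label_name]

    # Create the estimator with a list of feature names and enable GPU training.
    regressor = SparkXGBRegressor(
        features_col=feature_names,
        label_col=label_name,
        num_workers=2,
        use_gpu=True,
    )
    model = regressor.fit(train_df)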
    predict_df = model.transform(test_df)
    predict_df.show()

Submit the PySpark application
==============================

We assume that you have already configured your Spark cluster with GPU support; if not,
please refer to `spark standalone configuration with GPU support
<https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

.. code-block:: bash
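
    # The head of this command is a sketch: the two exports point the driver
    # and executors at the packed environment, an assumption based on standard
    # PySpark dependency management rather than something shown in this diff.
    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \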
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --archives xgboost_env.tar.gz#environment \
      xgboost_app.py

The submit command sends the Python environment created by pip or conda along with the
specification of GPU allocation. We will revisit this command later on.

Model Persistence
=================
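
Like other PySpark ML estimators, the trained model can be persisted with the standard
``save``/``load`` methods. A minimal sketch (the path is illustrative and is reused by
the loading snippet further below):

.. code-block:: python

    from xgboost.spark import SparkXGBRegressorModel

    # Save the trained Spark model to a hypothetical path.
    model.save("/tmp/xgboost-pyspark-model")
    # Load it back as a Spark model object.
    model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")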

To export the underlying booster model used by XGBoost:

.. code-block:: python

    import xgboost as xgb

    # the same booster object returned by xgboost.train
    booster: xgb.Booster = model.get_booster()
    booster.predict(...)
    booster.save_model("model.json")  # or model.ubj, depending on your choice of format.

This booster is not only shared by other Python interfaces but also used by all the
XGBoost bindings, including the C, Java, and R packages. Lastly, one can extract the
booster file directly from a saved spark estimator without going through the getter:

.. code-block:: python

    import xgboost as xgb

    bst = xgb.Booster()
    # Load the booster file saved by the previous snippet.
    bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")

Accelerate the whole pipeline for xgboost pyspark
=================================================

With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_, you
can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
pyspark without any Python code change. An example submit command is shown below with
additional spark configurations and dependencies:

.. code-block:: bash
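
    # The head of this command is a sketch mirroring the standalone submit
    # command above; the exports and GPU options here are assumptions.
    export PYSPARK_DRIVER_PYTHON=python
    export PYSPARK_PYTHON=./environment/bin/python

    spark-submit \
      --master spark://<master-ip>:7077 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \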
      --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
      --archives xgboost_env.tar.gz#environment \
      xgboost_app.py

When the RAPIDS plugin is enabled, both the JVM plugin and the cuDF Python package are
required. More configuration options, along with details on the plugin, can be found in
the RAPIDS documentation linked above.
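
For instance (an illustrative assumption; consult the RAPIDS documentation for the
authoritative list), the plugin can be tuned with additional ``--conf`` options such as:

.. code-block:: bash

    # Hypothetical extra flags appended to the submit command above.
    --conf spark.rapids.sql.enabled=true \
    --conf spark.rapids.memory.pinnedPool.size=2G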