[Doc] update the tutorial of xgboost4j-spark-gpu (#9752)

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Bobby Wang 2023-11-03 18:19:28 +08:00 committed by GitHub

@@ -18,9 +18,9 @@ Build an ML Application with XGBoost4J-Spark-GPU
 Add XGBoost to Your Project
 ===========================
 
-Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
-:ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
-a dependency for your project. We provide both stable releases and snapshots.
+Before diving into the tutorial on using XGBoost4J-Spark-GPU, refer to
+:ref:`Installation from Maven repository <install_jvm_packages>` for instructions on adding XGBoost4J-Spark-GPU
+as a project dependency. We offer both stable releases and snapshots.
 
 Data Preparation
 ================
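
For instance, with sbt the dependencies might be declared as below (a sketch, not part of this change; the 2.0.1 version number is taken from the spark-submit example later in this commit, so pin whatever release matches your cluster):

.. code-block:: scala

   // build.sbt -- hypothetical; %% appends the Scala binary version (_2.12)
   libraryDependencies ++= Seq(
     "ml.dmlc" %% "xgboost4j-gpu" % "2.0.1",
     "ml.dmlc" %% "xgboost4j-spark-gpu" % "2.0.1"
   )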
@@ -54,7 +54,7 @@ Read Dataset with Spark's Built-In Reader
       .schema(schema)
       .csv(dataPath)
 
-In the first line, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
+First, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
 which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
 defines the schema of the DataFrame wrapping Iris data. With this explicitly set schema, we
 can define the column names as well as their types; otherwise the column names would be
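
For context, here is a sketch of the full snippet this hunk abbreviates; the column names follow the classic Iris layout and are assumptions, not part of this change:

.. code-block:: scala

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

   val spark = SparkSession.builder().getOrCreate()

   val labelName = "class"
   // Explicit schema: column names and types for the Iris CSV, which has no header row.
   val schema = new StructType(Array(
     StructField("sepal length", DoubleType, true),
     StructField("sepal width", DoubleType, true),
     StructField("petal length", DoubleType, true),
     StructField("petal width", DoubleType, true),
     StructField(labelName, StringType, true)))

   val dataPath = "/path/to/iris.data"  // placeholder path
   val df = spark.read
     .schema(schema)
     .csv(dataPath)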
@@ -112,7 +112,7 @@ models. Although we use the Iris dataset in this tutorial to show how we use
 ``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-class classification problem, the
 usage in Regression is very similar to classification.
 
-To train a XGBoost model for classification, we need to claim a XGBoostClassifier first:
+To train an XGBoost model for classification, we need to define an XGBoostClassifier first:
 
 .. code-block:: scala
@@ -130,9 +130,13 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifier first:
       .setFeaturesCol(featuresNames)
       .setLabelCol(labelName)
 
-The ``device`` parameter is for informing XGBoost that CUDA devices should be used instead of CPU. Unlike the single-node mode, GPUs are managed by spark instead of by XGBoost. Therefore, explicitly specified device ordinal like ``cuda:1`` is not support.
+The ``device`` parameter tells XGBoost to use CUDA devices instead of the CPU.
+Unlike the single-node mode, GPUs are managed by Spark instead of by XGBoost. Therefore,
+an explicitly specified device ordinal such as ``cuda:1`` is not supported.
 
-The available parameters for training a XGBoost model can be found in :doc:`here </parameter>`. Similar to the XGBoost4J-Spark package, in addition to the default set of parameters, XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be consistent with Spark's MLlib naming convention.
+The available parameters for training an XGBoost model can be found :doc:`here </parameter>`.
+Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
+XGBoost4J-Spark-GPU also supports the camel-case variants of these parameters, consistent with Spark's MLlib naming convention.
 
 Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
 XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you
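
A sketch putting the setters above in context (the parameter map holds typical multi-class settings and is an assumption, not part of this change):

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   // "device" -> "cuda" selects GPU training; no device ordinal is given,
   // since Spark assigns the GPU to each task.
   val xgbParam = Map(
     "objective" -> "multi:softprob",
     "num_class" -> 3,
     "num_round" -> 100,
     "num_workers" -> 1,
     "device" -> "cuda")

   val featuresNames = schema.fieldNames.filter(_ != labelName)

   val xgbClassifier = new XGBoostClassifier(xgbParam)
     .setFeaturesCol(featuresNames)
     .setLabelCol(labelName)

   // Camel-case variant of a parameter: equivalent to "max_depth" -> 6 in the map.
   xgbClassifier.setMaxDepth(6)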
@@ -211,32 +215,31 @@ and the prediction for each instance.
 Submit the application
 **********************
 
-Heres an example to submit an end-to-end XGBoost-4j-Spark-GPU Spark application to an
-Apache Spark Standalone cluster, assuming the application main class is Iris and the
-application jar is iris-1.0.0.jar
+Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",
+the example below demonstrates how to submit the XGBoost application to an Apache
+Spark Standalone cluster.
 
 .. code-block:: bash
 
-  cudf_version=22.02.0
-  rapids_version=22.02.0
-  xgboost_version=1.6.1
+  rapids_version=23.10.0
+  xgboost_version=2.0.1
   main_class=Iris
   app_jar=iris-1.0.0.jar
 
   spark-submit \
    --master $master \
-   --packages ai.rapids:cudf:${cudf_version},com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
+   --packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
    --conf spark.executor.cores=12 \
-   --conf spark.task.cpus=1 \
+   --conf spark.task.cpus=12 \
    --conf spark.executor.resource.gpu.amount=1 \
-   --conf spark.task.resource.gpu.amount=0.08 \
+   --conf spark.task.resource.gpu.amount=1 \
    --conf spark.rapids.sql.csv.read.double.enabled=true \
    --conf spark.rapids.sql.hasNans=false \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --class ${main_class} \
    ${app_jar}
 
-* First, we need to specify the ``RAPIDS Accelerator, cudf, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
+* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
 * Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
 
 For details about other ``RAPIDS Accelerator`` configurations, please refer to the `configuration guide <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.
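
Note that ``spark.executor.cores=12`` together with ``spark.task.cpus=12`` means exactly one task runs per executor, which matches ``spark.task.resource.gpu.amount=1``: each task owns the executor's whole GPU. For readers who prefer setting these inside the application rather than on the spark-submit command line, a minimal sketch follows (an assumption, not part of this change; ``spark.plugins`` must take effect before the SparkContext starts, so it is set at session creation):

.. code-block:: scala

   import org.apache.spark.sql.SparkSession

   // Session-level equivalents of the --conf flags above.
   val spark = SparkSession.builder()
     .appName("Iris")
     .config("spark.executor.cores", "12")
     .config("spark.task.cpus", "12")
     .config("spark.executor.resource.gpu.amount", "1")
     .config("spark.task.resource.gpu.amount", "1")
     .config("spark.rapids.sql.csv.read.double.enabled", "true")
     .config("spark.rapids.sql.hasNans", "false")
     .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
     .getOrCreate()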