[Doc] update the tutorial of xgboost4j-spark-gpu (#9752)

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>

parent be20df8c23
commit 093b675838
@@ -18,9 +18,9 @@ Build an ML Application with XGBoost4J-Spark-GPU
 Add XGBoost to Your Project
 ===========================
 
-Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
-:ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
-a dependency for your project. We provide both stable releases and snapshots.
+Prior to delving into the tutorial on utilizing XGBoost4J-Spark-GPU, it is advisable to refer to
+:ref:`Installation from Maven repository <install_jvm_packages>` for instructions on adding XGBoost4J-Spark-GPU
+as a project dependency. We offer both stable releases and snapshots for your convenience.
 
 Data Preparation
 ================
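For readers following along with a build tool, the Maven coordinates referenced by this commit (the ``xgboost4j-gpu_2.12`` / ``xgboost4j-spark-gpu_2.12`` artifacts at version 2.0.1, taken from the updated ``spark-submit`` hunk below) could be declared in sbt roughly as follows; this is an illustrative sketch, not part of the commit itself:

```scala
// build.sbt sketch (illustrative; artifact IDs and version taken from this commit's
// --packages line; %% appends the Scala binary version, e.g. _2.12)
libraryDependencies ++= Seq(
  "ml.dmlc" %% "xgboost4j-gpu"       % "2.0.1",
  "ml.dmlc" %% "xgboost4j-spark-gpu" % "2.0.1"
)
```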
@@ -54,7 +54,7 @@ Read Dataset with Spark's Built-In Reader
       .schema(schema)
       .csv(dataPath)
 
-In the first line, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
+At first, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
 which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
 defines the schema of the DataFrame wrapping Iris data. With this explicitly set schema, we
 can define the column names as well as their types; otherwise the column names would be
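The dataset-loading step discussed in this hunk can be pictured with the following sketch. It is not the tutorial's exact code: the column names, ``dataPath`` value, and app name are placeholder assumptions, and running it requires Spark on the classpath.

```scala
// Sketch only: reading the Iris CSV with an explicit schema, as the hunk above
// describes. Column names and dataPath are illustrative placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, DoubleType, StringType}

val spark = SparkSession.builder().appName("Iris").getOrCreate()

// An explicit schema fixes column names and types instead of relying on inference.
val schema = new StructType()
  .add("sepal_length", DoubleType)
  .add("sepal_width", DoubleType)
  .add("petal_length", DoubleType)
  .add("petal_width", DoubleType)
  .add("class", StringType)

val dataPath = "iris.csv" // placeholder path
val df = spark.read.schema(schema).csv(dataPath)
```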
@@ -112,7 +112,7 @@ models. Although we use the Iris dataset in this tutorial to show how we use
 ``XGBoost/XGBoost4J-Spark-GPU`` to resolve a multi-classes classification problem, the
 usage in Regression is very similar to classification.
 
-To train a XGBoost model for classification, we need to claim a XGBoostClassifier first:
+To train a XGBoost model for classification, we need to define a XGBoostClassifier first:
 
 .. code-block:: scala
 
@@ -130,9 +130,13 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifie
       .setFeaturesCol(featuresNames)
       .setLabelCol(labelName)
 
-The ``device`` parameter is for informing XGBoost that CUDA devices should be used instead of CPU. Unlike the single-node mode, GPUs are managed by spark instead of by XGBoost. Therefore, explicitly specified device ordinal like ``cuda:1`` is not support.
+The ``device`` parameter is for informing XGBoost that CUDA devices should be used instead of CPU.
+Unlike the single-node mode, GPUs are managed by spark instead of by XGBoost. Therefore,
+explicitly specified device ordinal like ``cuda:1`` is not support.
 
-The available parameters for training a XGBoost model can be found in :doc:`here </parameter>`. Similar to the XGBoost4J-Spark package, in addition to the default set of parameters, XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be consistent with Spark's MLlib naming convention.
+The available parameters for training a XGBoost model can be found in :doc:`here </parameter>`.
+Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
+XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be consistent with Spark's MLlib naming convention.
 
 Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
 XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you
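The camel-case convention this hunk refers to maps snake_case XGBoost parameter names to Spark MLlib-style names (e.g. ``max_depth`` becomes ``maxDepth``). The conversion can be sketched in plain Scala; the helper name ``toCamelCase`` is mine for illustration, not a library API:

```scala
// Illustrative helper (not a library API): convert an XGBoost parameter name
// such as "max_depth" to the Spark MLlib camel-case form "maxDepth".
def toCamelCase(param: String): String =
  param.split("_").toList match {
    case head :: tail => head + tail.map(_.capitalize).mkString
    case Nil          => param
  }
```

For example, ``toCamelCase("max_depth")`` yields ``"maxDepth"``, while single-word parameters such as ``eta`` are unchanged.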
@@ -211,32 +215,31 @@ and the prediction for each instance.
 Submit the application
 **********************
 
-Here’s an example to submit an end-to-end XGBoost-4j-Spark-GPU Spark application to an
-Apache Spark Standalone cluster, assuming the application main class is Iris and the
-application jar is iris-1.0.0.jar
+Assuming that the application main class is "Iris" and the application jar is "iris-1.0.0.jar",
+provided below is an instance demonstrating how to submit the xgboost application to an Apache
+Spark Standalone cluster.
 
 .. code-block:: bash
 
-  cudf_version=22.02.0
-  rapids_version=22.02.0
-  xgboost_version=1.6.1
+  rapids_version=23.10.0
+  xgboost_version=2.0.1
   main_class=Iris
   app_jar=iris-1.0.0.jar
 
   spark-submit \
   --master $master \
-  --packages ai.rapids:cudf:${cudf_version},com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
+  --packages com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
   --conf spark.executor.cores=12 \
-  --conf spark.task.cpus=1 \
+  --conf spark.task.cpus=12 \
   --conf spark.executor.resource.gpu.amount=1 \
-  --conf spark.task.resource.gpu.amount=0.08 \
+  --conf spark.task.resource.gpu.amount=1 \
   --conf spark.rapids.sql.csv.read.double.enabled=true \
   --conf spark.rapids.sql.hasNans=false \
   --conf spark.plugins=com.nvidia.spark.SQLPlugin \
   --class ${main_class} \
   ${app_jar}
 
-* First, we need to specify the ``RAPIDS Accelerator, cudf, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
+* First, we need to specify the ``RAPIDS Accelerator, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
 * Second, ``RAPIDS Accelerator`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
 
 For details about other ``RAPIDS Accelerator`` other configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.