[backport] Backport jvm changes to 1.6. (#7803)
* [doc] improve xgboost4j-spark-gpu doc [skip ci] (#7793)
  Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
* [jvm-packages] fix evaluation when featuresCols is used (#7798)
  Co-authored-by: Bobby Wang <wbo4958@gmail.com>
  Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
This commit is contained in:
parent 67298ccd03
commit ece4dc457b
@@ -2,8 +2,8 @@
 XGBoost4J-Spark-GPU Tutorial (version 1.6.0+)
 #############################################

-**XGBoost4J-Spark-GPU** is a project aiming to accelerate XGBoost distributed training on Spark from
-end to end with GPUs by leveraging the `Spark-Rapids <https://nvidia.github.io/spark-rapids/>`_ project.
+**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on an Apache Spark cluster from
+end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_ product.

 This tutorial will show you how to use **XGBoost4J-Spark-GPU**.

@@ -15,8 +15,8 @@ This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
 Build an ML Application with XGBoost4J-Spark-GPU
 ************************************************

-Adding XGBoost to Your Project
-==============================
+Add XGBoost to Your Project
+===========================

 Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
 :ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
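As a quick reference, here is a minimal sbt sketch of the dependency. The coordinates below are an assumption based on the usual ``ml.dmlc`` artifacts; confirm the exact group, artifact, and version against the installation page linked above.

.. code-block:: scala

   // build.sbt -- illustrative coordinates only; verify against the installation page
   libraryDependencies ++= Seq(
     "ml.dmlc" %% "xgboost4j-gpu"       % "1.6.0",
     "ml.dmlc" %% "xgboost4j-spark-gpu" % "1.6.0"
   )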
@@ -25,10 +25,10 @@ a dependency for your project. We provide both stable releases and snapshots.
 Data Preparation
 ================

-In this section, we use `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
-showcase how we use Spark to transform raw dataset and make it fit to the data interface of XGBoost.
+In this section, we use the `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
+showcase how we use Apache Spark to transform a raw dataset and make it fit the data interface of XGBoost.

-Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
+The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
 "petal length" and "petal width". In addition, it contains the "class" column, which is essentially the
 label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".

@@ -54,26 +54,26 @@ Read Dataset with Spark's Built-In Reader
   .schema(schema)
   .csv(dataPath)

-At the first line, we create an instance of `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
-which is the entry of any Spark program working with DataFrame. The ``schema`` variable
-defines the schema of DataFrame wrapping Iris data. With this explicitly set schema, we
-can define the columns' name as well as their types; otherwise the column name would be
+In the first line, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_,
+which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
+defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we
+can define the column names as well as their types; otherwise the column names would be
 the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's
-built-in csv reader to load Iris csv file as a DataFrame named ``xgbInput``.
+built-in CSV reader to load the Iris CSV file as a DataFrame named ``xgbInput``.

-Spark also contains many built-in readers for other format. eg ORC, Parquet, Avro, Json.
+Apache Spark also contains many built-in readers for other formats such as ORC, Parquet, Avro, JSON.

 Transform Raw Iris Dataset
 --------------------------

-To make Iris dataset be recognizable to XGBoost, we need to encode String-typed
-label, i.e. "class", to Double-typed label.
+To make the Iris dataset recognizable to XGBoost, we need to encode the String-typed
+label, i.e. "class", to a Double-typed label.

 One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
 `StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
-but it has not been accelerated by Spark-Rapids yet, which means it will fall back
-to CPU to run and cause performance issue. Instead, we use an alternative way to acheive
-the same goal by the following code
+But this feature is not accelerated in the RAPIDS Accelerator, which means it will fall back
+to the CPU. Instead, we use an alternative way to achieve the same goal with the following code:

 .. code-block:: scala

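The full reader snippet is elided from the hunk above; here is a minimal sketch consistent with the visible ``.schema(schema)`` / ``.csv(dataPath)`` tail. The ``dataPath`` value is an assumption for illustration.

.. code-block:: scala

   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

   val spark = SparkSession.builder().getOrCreate()

   // An explicit schema gives meaningful column names and types up front
   val schema = new StructType(Array(
     StructField("sepal length", DoubleType, true),
     StructField("sepal width", DoubleType, true),
     StructField("petal length", DoubleType, true),
     StructField("petal width", DoubleType, true),
     StructField("class", StringType, true)))

   val dataPath = "/path/to/iris.csv"   // assumed location
   val xgbInput = spark.read
     .schema(schema)
     .csv(dataPath)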
@@ -102,7 +102,7 @@ the same goal by the following code
 +------------+-----------+------------+-----------+-----+


-With window operations, we have mapped string column of labels to label indices.
+With window operations, we have mapped the string column of labels to label indices.

 Training
 ========
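The window-operation code itself is elided between the hunks above. As a sketch of one way to do the mapping (the ``dense_rank`` approach is an assumption, not a quote of the elided block):

.. code-block:: scala

   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.dense_rank

   // Rows sharing a "class" value get the same dense rank, so
   // dense_rank() - 1 yields 0-based label indices without StringIndexer.
   val spec = Window.orderBy("class")
   val labeled = xgbInput
     .withColumn("classIndex", (dense_rank().over(spec) - 1).cast("double"))
     .drop("class")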
@@ -133,7 +133,7 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifie
 The available parameters for training a XGBoost model can be found in :doc:`here </parameter>`.
 Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
 XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be
-consistent with Spark's MLLIB naming convention.
+consistent with Spark's MLlib naming convention.

 Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
 XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you can pass
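To make the camel-case point concrete, a hedged sketch follows; the parameter values are illustrative, not taken from the original text.

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   // Illustrative parameter map; tune these for your own workload
   val xgbParam = Map(
     "objective"   -> "multi:softprob",
     "num_class"   -> 3,
     "num_round"   -> 100,
     "tree_method" -> "gpu_hist")

   val xgbClassifier = new XGBoostClassifier(xgbParam)
     .setFeaturesCol(Array("sepal length", "sepal width", "petal length", "petal width"))
     .setLabelCol("classIndex")
     .setMaxDepth(6)   // camel-case setter equivalent of max_depth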
@@ -149,12 +149,11 @@ you can do it through setters in XGBoostClassifer:

 .. note::

-  In contrast to the XGBoost4J-Spark package, which needs to first assemble the numeric
-  feature columns into one column with VectorUDF type by VectorAssembler, the
-  XGBoost4J-Spark-GPU does not require such transformation, it accepts an array of feature
+  In contrast with XGBoost4j-Spark, which accepts both a feature column of VectorUDT type and
+  an array of feature column names, XGBoost4j-Spark-GPU only accepts an array of feature
   column names by ``setFeaturesCol(value: Array[String])``.

-After we set XGBoostClassifier parameters and feature/label columns, we can build a
+After setting XGBoostClassifier parameters and feature/label columns, we can build a
 transformer, XGBoostClassificationModel by fitting XGBoostClassifier with the input
 DataFrame. This ``fit`` operation is essentially the training process and the generated
 model can then be used in other tasks like prediction.
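Continuing the sketch above, the ``fit`` call would look as follows, where ``labeled`` is the DataFrame from the data-preparation sketch:

.. code-block:: scala

   // fit launches the distributed training job and returns the model, an
   // XGBoostClassificationModel usable as a regular Spark ML transformer
   val xgbClassificationModel = xgbClassifier.fit(labeled)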
@@ -166,12 +165,12 @@ model can then be used in other tasks like prediction.
 Prediction
 ==========

-When we get a model, either XGBoostClassificationModel or XGBoostRegressionModel, it takes a DataFrame,
-read the column containing feature vectors, predict for each feature vector, and output a new DataFrame
+When we get a model, either a XGBoostClassificationModel or a XGBoostRegressionModel, it takes a DataFrame as an input,
+reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
 with the following columns by default:

 * XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
-* XGBoostRegressionModel will output prediction label(``predictionCol``).
+* XGBoostRegressionModel will output the prediction label (``predictionCol``).

 .. code-block:: scala

@@ -180,7 +179,7 @@ with the following columns by default:
   results.show()

-With the above code snippet, we get a DataFrame as result, which contains the margin, probability for each class,
-and the prediction for each instance
+With the above code snippet, we get a DataFrame as result, which contains the margin, probability for each class,
+and the prediction for each instance.

 .. code-block:: none

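For completeness, a sketch of the prediction step the two hunks above describe, with ``test`` as an assumed held-out DataFrame sharing the training schema:

.. code-block:: scala

   val results = xgbClassificationModel.transform(test)

   // Default output columns: rawPrediction (margins), probability (per class),
   // and prediction (the chosen label index)
   results.select("classIndex", "rawPrediction", "probability", "prediction").show(false)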
@@ -213,8 +212,9 @@ and the prediction for each instance
 Submit the application
 **********************

-Take submitting the spark job to Spark Standalone cluster as an example, and assuming your application main class
-is ``Iris`` and the application jar is ``iris-1.0.0.jar``
+Here's an example of submitting an end-to-end XGBoost4J-Spark-GPU application to an
+Apache Spark Standalone cluster, assuming the application main class is ``Iris`` and the
+application jar is ``iris-1.0.0.jar``:

 .. code-block:: bash

@@ -237,10 +237,10 @@ is ``Iris`` and the application jar is ``iris-1.0.0.jar``
   --class ${main_class} \
   ${app_jar}

-* First, we need to specify the ``spark-rapids, cudf, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
-* Second, ``spark-rapids`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
+* First, we need to specify the RAPIDS Accelerator, ``cudf``, ``xgboost4j-gpu``, and ``xgboost4j-spark-gpu`` packages by ``--packages``
+* Second, the RAPIDS Accelerator is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``

-For details about ``spark-rapids`` other configurations, please refer to `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.
+For details about other RAPIDS Accelerator configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_ documentation.

-For ``spark-rapids Frequently Asked Questions``, please refer to
+For RAPIDS Accelerator frequently asked questions, please refer to the
 `frequently-asked-questions <https://nvidia.github.io/spark-rapids/docs/FAQ.html#frequently-asked-questions>`_.
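Beyond the command-line flags, the plugin can also be enabled programmatically. A sketch, under the assumptions that the jars are already on the classpath (e.g. via ``--packages``) and that no session exists yet:

.. code-block:: scala

   import org.apache.spark.sql.SparkSession

   // Equivalent of --conf spark.plugins=com.nvidia.spark.SQLPlugin;
   // this must be set before the SparkContext is created
   val spark = SparkSession.builder()
     .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
     .getOrCreate()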
@@ -127,6 +127,11 @@ Now, we have a DataFrame containing only two columns, "features" which contains
 "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Double-typed
 labels. A DataFrame like this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark's training engine directly.

+.. note::
+
+  Since version 1.6.0, there is no need to assemble feature columns. Instead, users can specify an array of
+  feature column names by ``setFeaturesCol(value: Array[String])`` and XGBoost4j-Spark will do the assembling.
+
 Dealing with missing values
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~

@@ -140,9 +140,13 @@ object PreXGBoost extends PreXGBoostProvider {

         val (xgbInput, featuresName) = est.vectorize(dataset)

+        val evalSets = est.getEvalSets(params).transform((_, df) => {
+          val (dfTransformed, _) = est.vectorize(df)
+          dfTransformed
+        })
+
         (PackedParams(col(est.getLabelCol), col(featuresName), weight, baseMargin, group,
-          est.getNumWorkers, est.needDeterministicRepartitioning), est.getEvalSets(params),
-          xgbInput)
+          est.getNumWorkers, est.needDeterministicRepartitioning), evalSets, xgbInput)

       case _ => throw new RuntimeException("Unsupporting " + estimator)
     }
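For readers skimming the fix: ``transform`` on a Scala ``Map`` rebuilds every value while keeping the keys, which is how each evaluation DataFrame now receives the same vectorization as the training input. A standalone sketch of the idiom:

.. code-block:: scala

   // Map#transform keeps the keys and rewrites the values -- the same idiom the
   // fix uses to vectorize every evaluation DataFrame before training
   val sets = Map("eval1" -> 1, "eval2" -> 2)
   val doubled = sets.transform((_, v) => v * 2)  // Map(eval1 -> 2, eval2 -> 4)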
@@ -154,7 +158,8 @@ object PreXGBoost extends PreXGBoostProvider {
         // transform the eval Dataset[_] to RDD[XGBLabeledPoint]
         val evalRDDMap = evalSet.map {
           case (name, dataFrame) => (name,
-            DataUtils.convertDataFrameToXGBLabeledPointRDDs(packedParams, dataFrame).head)
+            DataUtils.convertDataFrameToXGBLabeledPointRDDs(packedParams,
+              dataFrame.asInstanceOf[DataFrame]).head)
         }

         val hasGroup = packedParams.group.map(_ != defaultGroupColumn).getOrElse(false)
@@ -370,6 +370,7 @@ class XGBoostClassifierSuite extends FunSuite with PerTest {
     val xgbClassifier = new XGBoostClassifier(paramMap)
       .setFeaturesCol(featuresName)
       .setLabelCol("label")
+      .setEvalSets(Map("eval" -> xgbInput))

     val model = xgbClassifier.fit(xgbInput)
     assert(model.getFeaturesCols.sameElements(featuresName))
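The regressor suite below gets the same one-line change. For application code, a hedged sketch of the now-working path, where ``trainDf`` and ``evalDf`` are assumed names for the training and held-out DataFrames:

.. code-block:: scala

   // With featuresCols in use, the eval set is vectorized like the training input,
   // so per-iteration "eval" metrics are computed on the correct columns
   val model = new XGBoostClassifier(paramMap)
     .setFeaturesCol(featuresName)          // Array[String] of feature columns
     .setLabelCol("label")
     .setEvalSets(Map("eval" -> evalDf))
     .fit(trainDf)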
@@ -273,6 +273,7 @@ class XGBoostRegressorSuite extends FunSuite with PerTest {
     val xgbClassifier = new XGBoostRegressor(paramMap)
       .setFeaturesCol(featuresName)
       .setLabelCol("label")
+      .setEvalSets(Map("eval" -> xgbInput))

     val model = xgbClassifier.fit(xgbInput)
     assert(model.getFeaturesCols.sameElements(featuresName))