From ece4dc457bfbe7ff9c5918446b9e2b0907b7dfe8 Mon Sep 17 00:00:00 2001
From: Jiaming Yuan 
Date: Wed, 13 Apr 2022 17:35:29 +0800
Subject: [PATCH] [backport] Backport jvm changes to 1.6. (#7803)

* [doc] improve xgboost4j-spark-gpu doc [skip ci] (#7793)

Co-authored-by: Sameer Raheja 

* [jvm-packages] fix evaluation when featuresCols is used (#7798)

Co-authored-by: Bobby Wang 
Co-authored-by: Sameer Raheja 
---
 doc/jvm/xgboost4j_spark_gpu_tutorial.rst | 68 +++++++++----------
 doc/jvm/xgboost4j_spark_tutorial.rst | 5 ++
 .../xgboost4j/scala/spark/PreXGBoost.scala | 11 ++-
 .../scala/spark/XGBoostClassifierSuite.scala | 1 +
 .../scala/spark/XGBoostRegressorSuite.scala | 1 +
 5 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/doc/jvm/xgboost4j_spark_gpu_tutorial.rst b/doc/jvm/xgboost4j_spark_gpu_tutorial.rst
index 60fcbbc35..5af257da0 100644
--- a/doc/jvm/xgboost4j_spark_gpu_tutorial.rst
+++ b/doc/jvm/xgboost4j_spark_gpu_tutorial.rst
@@ -2,8 +2,8 @@
 XGBoost4J-Spark-GPU Tutorial (version 1.6.0+)
 #############################################
 
-**XGBoost4J-Spark-GPU** is a project aiming to accelerate XGBoost distributed training on Spark from
-end to end with GPUs by leveraging the `Spark-Rapids `_ project.
+**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on an Apache Spark cluster from
+end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark `_ product.
 
 This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
 
@@ -15,8 +15,8 @@ This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
 Build an ML Application with XGBoost4J-Spark-GPU
 ************************************************
 
-Adding XGBoost to Your Project
-==============================
+Add XGBoost to Your Project
+===========================
 
 Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
 :ref:`Installation from Maven repository ` in order to add XGBoost4J-Spark-GPU as
@@ -25,10 +25,10 @@ a dependency for your project. We provide both stable releases and snapshots.
 
 Data Preparation
 ================
 
-In this section, we use `Iris `_ dataset as an example to
-showcase how we use Spark to transform raw dataset and make it fit to the data interface of XGBoost.
+In this section, we use the `Iris `_ dataset as an example to
+showcase how we use Apache Spark to transform a raw dataset and make it fit the data interface of XGBoost.
 
-Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
+The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
 "petal length" and "petal width". In addition, it contains the "class" column, which is essentially the
 label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
 
@@ -54,26 +54,26 @@ Read Dataset with Spark's Built-In Reader
       .schema(schema)
       .csv(dataPath)
 
-At the first line, we create an instance of `SparkSession `_
-which is the entry of any Spark program working with DataFrame. The ``schema`` variable
-defines the schema of DataFrame wrapping Iris data. With this explicitly set schema, we
-can define the columns' name as well as their types; otherwise the column name would be
+In the first line, we create an instance of a `SparkSession `_
+which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
+defines the schema of the DataFrame wrapping Iris data. With this explicitly set schema, we
+can define the column names as well as their types; otherwise the column names would be
 the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's
-built-in csv reader to load Iris csv file as a DataFrame named ``xgbInput``.
+built-in CSV reader to load the Iris CSV file as a DataFrame named ``xgbInput``.
+
+Apache Spark also contains many built-in readers for other formats such as ORC, Parquet, Avro, JSON.
 
-Spark also contains many built-in readers for other format. eg ORC, Parquet, Avro, Json.
 
 Transform Raw Iris Dataset
 --------------------------
 
-To make Iris dataset be recognizable to XGBoost, we need to encode String-typed
-label, i.e. "class", to Double-typed label.
+To make the Iris dataset recognizable to XGBoost, we need to encode the String-typed
+label, i.e. "class", to a Double-typed label.
 
 One way to convert the String-typed label to Double is to use Spark's built-in feature
 transformer `StringIndexer `_.
-but it has not been accelerated by Spark-Rapids yet, which means it will fall back
-to CPU to run and cause performance issue. Instead, we use an alternative way to acheive
-the same goal by the following code
+But this feature is not accelerated by the RAPIDS Accelerator yet, which means it will fall back
+to the CPU. Instead, we use an alternative way to achieve the same goal with the following code:
 
 .. code-block:: scala
 
@@ -102,7 +102,7 @@ the same goal by the following code
 +------------+-----------+------------+-----------+-----+
 
 
-With window operations, we have mapped string column of labels to label indices.
+With window operations, we have mapped the string column of labels to label indices.
 
 Training
 ========
@@ -133,7 +133,7 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifie
 The available parameters for training a XGBoost model can be found in :doc:`here `.
 Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
 XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be
-consistent with Spark's MLLIB naming convention.
+consistent with Spark's MLlib naming convention.
 
 Specifically, each parameter in :doc:`this page ` has its equivalent form in
 XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you can pass
@@ -149,12 +149,11 @@ you can do it through setters in XGBoostClassifer:
 
 .. note::
 
-   In contrast to the XGBoost4J-Spark package, which needs to first assemble the numeric
-   feature columns into one column with VectorUDF type by VectorAssembler, the
-   XGBoost4J-Spark-GPU does not require such transformation, it accepts an array of feature
+   In contrast to XGBoost4J-Spark, which accepts both a feature column of VectorUDT type and
+   an array of feature column names, XGBoost4J-Spark-GPU only accepts an array of feature
    column names by ``setFeaturesCol(value: Array[String])``.
 
-After we set XGBoostClassifier parameters and feature/label columns, we can build a
+After setting XGBoostClassifier parameters and feature/label columns, we can build a
 transformer, XGBoostClassificationModel by fitting XGBoostClassifier with the input
 DataFrame. This ``fit`` operation is essentially the training process and the generated
 model can then be used in other tasks like prediction.
@@ -166,12 +165,12 @@
 Prediction
 ==========
 
-When we get a model, either XGBoostClassificationModel or XGBoostRegressionModel, it takes a DataFrame,
-read the column containing feature vectors, predict for each feature vector, and output a new DataFrame
+When we get a model, either an XGBoostClassificationModel or an XGBoostRegressionModel, it takes a DataFrame as input,
+reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
 with the following columns by default:
 
 * XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the
   eventual prediction labels (``predictionCol``) for each possible label.
-* XGBoostRegressionModel will output prediction label(``predictionCol``).
+* XGBoostRegressionModel will output the prediction label (``predictionCol``).
 
 .. code-block:: scala
 
@@ -180,7 +179,7 @@ with the following columns by default:
   results.show()
 
 With the above code snippet, we get a DataFrame as result, which contains the margin, probability for each class,
-and the prediction for each instance
+and the prediction for each instance.
 
 .. code-block:: none
 
@@ -213,8 +212,9 @@ and the prediction for each instance
 Submit the application
 **********************
 
-Take submitting the spark job to Spark Standalone cluster as an example, and assuming your application main class
-is ``Iris`` and the application jar is ``iris-1.0.0.jar``
+Here’s an example of submitting an end-to-end XGBoost4J-Spark-GPU application to an
+Apache Spark Standalone cluster, assuming the application main class is ``Iris`` and the
+application jar is ``iris-1.0.0.jar``:
 
 .. code-block:: bash
 
@@ -237,10 +237,10 @@
       --class ${main_class} \
       ${app_jar}
 
-* First, we need to specify the ``spark-rapids, cudf, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
-* Second, ``spark-rapids`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
+* First, we need to specify the RAPIDS Accelerator, ``cudf``, ``xgboost4j-gpu``, and ``xgboost4j-spark-gpu`` packages by ``--packages``
+* Second, the RAPIDS Accelerator is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
 
-For details about ``spark-rapids`` other configurations, please refer to `configuration `_.
+For details about other RAPIDS Accelerator configurations, please refer to the `configuration `_ documentation.
 
-For ``spark-rapids Frequently Asked Questions``, please refer to
+For RAPIDS Accelerator Frequently Asked Questions, please refer to the
 `frequently-asked-questions `_.
diff --git a/doc/jvm/xgboost4j_spark_tutorial.rst b/doc/jvm/xgboost4j_spark_tutorial.rst
index b46198115..bc0ae9276 100644
--- a/doc/jvm/xgboost4j_spark_tutorial.rst
+++ b/doc/jvm/xgboost4j_spark_tutorial.rst
@@ -127,6 +127,11 @@ Now, we have a DataFrame containing only two columns, "features" which contains
 "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has
 Double-typed labels. A DataFrame like this (containing vector-represented features and numeric
 labels) can be fed to XGBoost4J-Spark's training engine directly.
+.. note::
+
+   There is no need to assemble feature columns from version 1.6.0+. Instead, users can specify an array of
+   feature column names by ``setFeaturesCol(value: Array[String])`` and XGBoost4J-Spark will assemble them internally.
+ Dealing with missing values ~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/PreXGBoost.scala b/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/PreXGBoost.scala index 67deb6979..32fd6938e 100644 --- a/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/PreXGBoost.scala +++ b/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark/PreXGBoost.scala @@ -140,9 +140,13 @@ object PreXGBoost extends PreXGBoostProvider { val (xgbInput, featuresName) = est.vectorize(dataset) + val evalSets = est.getEvalSets(params).transform((_, df) => { + val (dfTransformed, _) = est.vectorize(df) + dfTransformed + }) + (PackedParams(col(est.getLabelCol), col(featuresName), weight, baseMargin, group, - est.getNumWorkers, est.needDeterministicRepartitioning), est.getEvalSets(params), - xgbInput) + est.getNumWorkers, est.needDeterministicRepartitioning), evalSets, xgbInput) case _ => throw new RuntimeException("Unsupporting " + estimator) } @@ -154,7 +158,8 @@ object PreXGBoost extends PreXGBoostProvider { // transform the eval Dataset[_] to RDD[XGBLabeledPoint] val evalRDDMap = evalSet.map { case (name, dataFrame) => (name, - DataUtils.convertDataFrameToXGBLabeledPointRDDs(packedParams, dataFrame).head) + DataUtils.convertDataFrameToXGBLabeledPointRDDs(packedParams, + dataFrame.asInstanceOf[DataFrame]).head) } val hasGroup = packedParams.group.map(_ != defaultGroupColumn).getOrElse(false) diff --git a/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala b/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala index 91f4a4cfa..0fa851f57 100644 --- a/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala +++ b/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostClassifierSuite.scala @@ -370,6 +370,7 @@ class XGBoostClassifierSuite extends FunSuite with PerTest { val xgbClassifier = new XGBoostClassifier(paramMap) .setFeaturesCol(featuresName) .setLabelCol("label") + .setEvalSets(Map("eval" -> xgbInput)) val model = xgbClassifier.fit(xgbInput) assert(model.getFeaturesCols.sameElements(featuresName)) diff --git a/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressorSuite.scala b/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressorSuite.scala index 04e510640..e427c17e3 100644 --- a/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressorSuite.scala +++ b/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostRegressorSuite.scala @@ -273,6 +273,7 @@ class XGBoostRegressorSuite extends FunSuite with PerTest { val xgbClassifier = new XGBoostRegressor(paramMap) .setFeaturesCol(featuresName) .setLabelCol("label") + .setEvalSets(Map("eval" -> xgbInput)) val model = xgbClassifier.fit(xgbInput) assert(model.getFeaturesCols.sameElements(featuresName))
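
For reference, here is a minimal, self-contained sketch of the user-facing pattern that the
``PreXGBoost.scala`` fix and the new test lines above exercise: training with an array of
feature column names combined with a named evaluation set. This sketch is not part of the
patch; the DataFrame variables (``trainDf``, ``evalDf``), the column names, the ``"eval"``
key, and the parameter values are illustrative placeholders only.

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
   import org.apache.spark.sql.DataFrame

   // trainDf and evalDf are placeholder DataFrames sharing the same schema:
   // four numeric feature columns plus a Double-typed "classIndex" label column.
   def trainWithEvalSet(trainDf: DataFrame, evalDf: DataFrame) = {
     val featuresNames = Array("sepal length", "sepal width", "petal length", "petal width")

     val classifier = new XGBoostClassifier(Map(
         "objective" -> "multi:softprob",
         "num_class" -> 3,
         "num_round" -> 100,
         "num_workers" -> 1))
       // Pass the feature columns as an array; no VectorAssembler step is needed.
       .setFeaturesCol(featuresNames)
       .setLabelCol("classIndex")
       // Before this backport, an eval DataFrame supplied this way was not
       // vectorized like the training input, so evaluation failed.
       .setEvalSets(Map("eval" -> evalDf))

     // fit runs the eval set through the same vectorization path added in
     // PreXGBoost.scala above, then trains the model.
     classifier.fit(trainDf)
   }

The ``Map("eval" -> evalDf)`` shape mirrors the ``setEvalSets(Map("eval" -> xgbInput))`` calls
added to the classifier and regressor test suites in this patch.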