[backport] Backport JVM fixes and document update to 1.6 (#7792)

* [jvm-packages] unify setFeaturesCol API for XGBoostRegressor (#7784)

* [jvm-packages] add doc for xgboost4j-spark-gpu (#7779)


Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>

* [jvm-packages] remove the dep of com.fasterxml.jackson (#7791)

* [jvm-packages] xgboost4j-spark should work when featuresCols is specified (#7789)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>
Commit 67298ccd03 (parent 78d231264a), authored by Jiaming Yuan on 2022-04-08 14:18:46 +08:00 and committed via GitHub.
19 changed files with 738 additions and 160 deletions
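
A minimal sketch (column names illustrative) of the unified feature-column API that the diffs below introduce for both the CPU and GPU pipelines:

    import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

    // Both xgboost4j-spark and xgboost4j-spark-gpu now take an array of numeric
    // column names via setFeaturesCol; the GPU-only setFeaturesCols is removed.
    val regressor = new XGBoostRegressor(Map(
        "objective" -> "reg:squarederror",
        "num_round" -> 5,
        "num_workers" -> 1))
      .setFeaturesCol(Array("f1", "f2", "f3"))
      .setLabelCol("label")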


@@ -101,7 +101,7 @@ R
JVM
---
-You can use XGBoost4J in your Java/Scala application by adding XGBoost4J as a dependency:
+* XGBoost4j/XGBoost4j-Spark
.. code-block:: xml
  :caption: Maven
@@ -134,6 +134,39 @@ You can use XGBoost4J in your Java/Scala application by adding XGBoost4J as a dependency:
  "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num"
)
* XGBoost4j-GPU/XGBoost4j-Spark-GPU
.. code-block:: xml
:caption: Maven
<properties>
...
<!-- Specify Scala version in package name -->
<scala.binary.version>2.12</scala.binary.version>
</properties>
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num</version>
</dependency>
</dependencies>
.. code-block:: scala
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num"
)
This will check out the latest stable version from Maven Central.
For the latest release version number, please check the `release page <https://github.com/dmlc/xgboost/releases>`_.
@@ -185,7 +218,7 @@ and Windows.) Download it and run the following commands:
JVM
---
-First add the following Maven repository hosted by the XGBoost project:
+* XGBoost4j/XGBoost4j-Spark
.. code-block:: xml
  :caption: Maven
@@ -234,6 +267,40 @@ Then add XGBoost4J as a dependency:
  "ml.dmlc" %% "xgboost4j-spark" % "latest_version_num-SNAPSHOT"
)
* XGBoost4j-GPU/XGBoost4j-Spark-GPU
.. code-block:: xml
:caption: Maven
<properties>
...
<!-- Specify Scala version in package name -->
<scala.binary.version>2.12</scala.binary.version>
</properties>
<dependencies>
...
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
<dependency>
<groupId>ml.dmlc</groupId>
<artifactId>xgboost4j-spark-gpu_${scala.binary.version}</artifactId>
<version>latest_version_num-SNAPSHOT</version>
</dependency>
</dependencies>
.. code-block:: scala
:caption: sbt
libraryDependencies ++= Seq(
"ml.dmlc" %% "xgboost4j-gpu" % "latest_version_num-SNAPSHOT",
"ml.dmlc" %% "xgboost4j-spark-gpu" % "latest_version_num-SNAPSHOT"
)
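
When using sbt, the snapshot repository must also be registered as a resolver. A minimal sketch, where the URL placeholder stands for the snapshot repository given in the Maven instructions above:

.. code-block:: scala
  :caption: sbt

  // Substitute the XGBoost snapshot repository URL from the Maven section above.
  resolvers += ("XGBoost4J Snapshot Repo" at "<snapshot-repo-url>")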
Look up the ``version`` field in `pom.xml <https://github.com/dmlc/xgboost/blob/master/jvm-packages/pom.xml>`_ to get the correct version number.
The SNAPSHOT JARs are hosted by the XGBoost project. Every commit in the ``master`` branch will automatically trigger generation of a new SNAPSHOT JAR. You can control how often Maven should upgrade your SNAPSHOT installation by specifying ``updatePolicy``. See `here <http://maven.apache.org/pom.html#Repositories>`_ for details.


@@ -35,6 +35,7 @@ Contents
  java_intro
  XGBoost4J-Spark Tutorial <xgboost4j_spark_tutorial>
+  XGBoost4J-Spark-GPU Tutorial <xgboost4j_spark_gpu_tutorial>
  Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
  XGBoost4J Java API <javadocs/index>
  XGBoost4J Scala API <scaladocs/xgboost4j/index>


@@ -0,0 +1,246 @@
#############################################
XGBoost4J-Spark-GPU Tutorial (version 1.6.0+)
#############################################
**XGBoost4J-Spark-GPU** is a project aiming to accelerate distributed XGBoost training on Spark
end to end with GPUs, by leveraging the `Spark-Rapids <https://nvidia.github.io/spark-rapids/>`_ project.
This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
.. contents::
:backlinks: none
:local:
************************************************
Build an ML Application with XGBoost4J-Spark-GPU
************************************************
Adding XGBoost to Your Project
==============================
Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
:ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
a dependency for your project. We provide both stable releases and snapshots.
Data Preparation
================
In this section, we use the `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
show how to use Spark to transform a raw dataset so that it fits the data interface of XGBoost.
The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
"petal length" and "petal width". In addition, it contains the "class" column, which is essentially the
label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
Read Dataset with Spark's Built-In Reader
-----------------------------------------
.. code-block:: scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
val spark = SparkSession.builder().getOrCreate()
val labelName = "class"
val schema = new StructType(Array(
StructField("sepal length", DoubleType, true),
StructField("sepal width", DoubleType, true),
StructField("petal length", DoubleType, true),
StructField("petal width", DoubleType, true),
StructField(labelName, StringType, true)))
val xgbInput = spark.read.option("header", "false")
.schema(schema)
.csv(dataPath)
On the first line, we create an instance of `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_,
which is the entry point of any Spark program working with DataFrames. The ``schema`` variable
defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we
can define the column names as well as their types; otherwise the column names would be
the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's
built-in CSV reader to load the Iris CSV file as a DataFrame named ``xgbInput``.
Spark also contains many built-in readers for other formats, e.g. ORC, Parquet, Avro and JSON.
Transform Raw Iris Dataset
--------------------------
To make the Iris dataset recognizable to XGBoost, we need to encode the String-typed
label, i.e. "class", into a Double-typed label.
One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
`StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_,
but it has not been accelerated by Spark-Rapids yet, which means it will fall back
to the CPU and cause a performance bottleneck. Instead, we achieve
the same goal with the following code:
.. code-block:: scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val spec = Window.orderBy(labelName)
val Array(train, test) = xgbInput
.withColumn("tmpClassName", dense_rank().over(spec) - 1)
.drop(labelName)
.withColumnRenamed("tmpClassName", labelName)
.randomSplit(Array(0.7, 0.3), seed = 1)
train.show(5)
.. code-block:: none
+------------+-----------+------------+-----------+-----+
|sepal length|sepal width|petal length|petal width|class|
+------------+-----------+------------+-----------+-----+
| 4.3| 3.0| 1.1| 0.1| 0|
| 4.4| 2.9| 1.4| 0.2| 0|
| 4.4| 3.0| 1.3| 0.2| 0|
| 4.4| 3.2| 1.3| 0.2| 0|
| 4.6| 3.2| 1.4| 0.2| 0|
+------------+-----------+------------+-----------+-----+
With window operations, we have mapped the string column of labels to label indices.
Training
========
The GPU version of XGBoost-Spark supports both regression and classification
models. Although we use the Iris dataset in this tutorial to show how to use
``XGBoost/XGBoost4J-Spark-GPU`` to solve a multi-class classification problem, the
usage for regression is very similar to classification.
To train an XGBoost model for classification, we need to create an XGBoostClassifier first:
.. code-block:: scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
val xgbParam = Map(
"objective" -> "multi:softprob",
"num_class" -> 3,
"num_round" -> 100,
"tree_method" -> "gpu_hist",
"num_workers" -> 1)
val featuresNames = schema.fieldNames.filter(name => name != labelName)
val xgbClassifier = new XGBoostClassifier(xgbParam)
.setFeaturesCol(featuresNames)
.setLabelCol(labelName)
The available parameters for training an XGBoost model can be found :doc:`here </parameter>`.
Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
XGBoost4J-Spark-GPU also supports the camel-case variants of these parameters, to be
consistent with Spark's MLlib naming convention.
Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you can pass
the parameter just as we did in the above code snippet (as ``max_depth`` wrapped in a Map), or
you can do it through setters in XGBoostClassifier:
.. code-block:: scala
val xgbClassifier = new XGBoostClassifier(xgbParam)
.setFeaturesCol(featuresNames)
.setLabelCol(labelName)
xgbClassifier.setMaxDepth(2)
.. note::

  In contrast to the XGBoost4J-Spark package, which needs the numeric feature
  columns to be assembled into one column of VectorUDT type by VectorAssembler,
  XGBoost4J-Spark-GPU does not require such a transformation: it accepts an array of
  feature column names via ``setFeaturesCol(value: Array[String])``.
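
For contrast, here is a minimal sketch of what the same step looks like in the CPU-oriented
XGBoost4J-Spark flow, where the columns must first be assembled into a single vector column.
It reuses ``featuresNames``, ``xgbParam``, ``train`` and ``labelName`` from above and is only
illustrative:

.. code-block:: scala

  import org.apache.spark.ml.feature.VectorAssembler

  // Assemble the numeric feature columns into one VectorUDT column.
  val assembler = new VectorAssembler()
    .setInputCols(featuresNames)
    .setOutputCol("features")

  val assembledTrain = assembler.transform(train).select("features", labelName)

  // The estimator is then pointed at the single vector column.
  val cpuStyleClassifier = new XGBoostClassifier(xgbParam)
    .setFeaturesCol("features")
    .setLabelCol(labelName)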
After we set the XGBoostClassifier parameters and feature/label columns, we can build a
transformer, XGBoostClassificationModel, by fitting XGBoostClassifier with the input
DataFrame. This ``fit`` operation is essentially the training process, and the generated
model can then be used in other tasks like prediction.
.. code-block:: scala
val xgbClassificationModel = xgbClassifier.fit(train)
Prediction
==========
When we get a model, either an XGBoostClassificationModel or an XGBoostRegressionModel, it takes a DataFrame,
reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
with the following columns by default:
* XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities (``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
* XGBoostRegressionModel will output the prediction label (``predictionCol``).
.. code-block:: scala
val xgbClassificationModel = xgbClassifier.fit(train)
val results = xgbClassificationModel.transform(test)
results.show()
With the above code snippet, we get a DataFrame as the result, which contains the margin, the probability
for each class, and the prediction for each instance:
.. code-block:: none
+------------+-----------+------------------+-------------------+-----+--------------------+--------------------+----------+
|sepal length|sepal width| petal length| petal width|class| rawPrediction| probability|prediction|
+------------+-----------+------------------+-------------------+-----+--------------------+--------------------+----------+
| 4.5| 2.3| 1.3|0.30000000000000004| 0|[3.16666603088378...|[0.98853939771652...| 0.0|
| 4.6| 3.1| 1.5| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 4.8| 3.1| 1.6| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 4.8| 3.4| 1.6| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 4.8| 3.4|1.9000000000000001| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 4.9| 2.4| 3.3| 1.0| 1|[-2.1498908996582...|[0.00596602633595...| 1.0|
| 4.9| 2.5| 4.5| 1.7| 2|[-2.1498908996582...|[0.00596602633595...| 1.0|
| 5.0| 3.5| 1.3|0.30000000000000004| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.1| 2.5| 3.0| 1.1| 1|[3.16666603088378...|[0.98853939771652...| 0.0|
| 5.1| 3.3| 1.7| 0.5| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.1| 3.5| 1.4| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.1| 3.8| 1.6| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.2| 3.4| 1.4| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.2| 3.5| 1.5| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.2| 4.1| 1.5| 0.1| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.4| 3.9| 1.7| 0.4| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.5| 2.4| 3.8| 1.1| 1|[-2.1498908996582...|[0.00596602633595...| 1.0|
| 5.5| 4.2| 1.4| 0.2| 0|[3.25857257843017...|[0.98969423770904...| 0.0|
| 5.7| 2.5| 5.0| 2.0| 2|[-2.1498908996582...|[0.00280966912396...| 2.0|
| 5.7| 3.0| 4.2| 1.2| 1|[-2.1498908996582...|[0.00643939292058...| 1.0|
+------------+-----------+------------------+-------------------+-----+--------------------+--------------------+----------+
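
To get a quick quality number for the test split, the ``prediction`` column can be fed into
Spark's built-in evaluator. A minimal sketch; the evaluator settings are illustrative and not
part of the original example:

.. code-block:: scala

  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

  // Score the predictions produced by the GPU-trained model above.
  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol(labelName)
    .setPredictionCol("prediction")
    .setMetricName("accuracy")

  val accuracy = evaluator.evaluate(results)
  println(s"Test accuracy: $accuracy")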
**********************
Submit the application
**********************
Take submitting the Spark job to a Spark Standalone cluster as an example, assuming your application main class
is ``Iris`` and the application jar is ``iris-1.0.0.jar``:
.. code-block:: bash
cudf_version=22.02.0
rapids_version=22.02.0
xgboost_version=1.6.0
main_class=Iris
app_jar=iris-1.0.0.jar
spark-submit \
--master $master \
--packages ai.rapids:cudf:${cudf_version},com.nvidia:rapids-4-spark_2.12:${rapids_version},ml.dmlc:xgboost4j-gpu_2.12:${xgboost_version},ml.dmlc:xgboost4j-spark-gpu_2.12:${xgboost_version} \
--conf spark.executor.cores=12 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.08 \
--conf spark.rapids.sql.csv.read.double.enabled=true \
--conf spark.rapids.sql.hasNans=false \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--class ${main_class} \
${app_jar}
* First, we need to specify the ``spark-rapids``, ``cudf``, ``xgboost4j-gpu`` and ``xgboost4j-spark-gpu`` packages via ``--packages``.
* Second, ``spark-rapids`` is a Spark plugin, so we need to enable it by setting ``spark.plugins=com.nvidia.spark.SQLPlugin``.
For details about other ``spark-rapids`` configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_ page.
For ``spark-rapids`` frequently asked questions, please refer to the
`FAQ <https://nvidia.github.io/spark-rapids/docs/FAQ.html#frequently-asked-questions>`_.


@@ -14,6 +14,7 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for more
  Distributed XGBoost with AWS YARN <aws_yarn>
  kubernetes
  Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
+  Distributed XGBoost with XGBoost4J-Spark-GPU <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_gpu_tutorial.html>
  dask
  ray
  dart


@@ -20,11 +20,6 @@
   <classifier>${cudf.classifier}</classifier>
   <scope>provided</scope>
 </dependency>
-<dependency>
-  <groupId>com.fasterxml.jackson.core</groupId>
-  <artifactId>jackson-databind</artifactId>
-  <version>2.10.5.1</version>
-</dependency>
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2021 by Contributors
+ Copyright (c) 2021-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -16,15 +16,7 @@
 package ml.dmlc.xgboost4j.gpu.java;
-import java.io.ByteArrayOutputStream;
-import java.io.IOException;
-import com.fasterxml.jackson.core.JsonFactory;
-import com.fasterxml.jackson.core.JsonGenerator;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import com.fasterxml.jackson.databind.node.ArrayNode;
-import com.fasterxml.jackson.databind.node.JsonNodeFactory;
-import com.fasterxml.jackson.databind.node.ObjectNode;
+import java.util.ArrayList;
 /**
  * Cudf utilities to build cuda array interface against {@link CudfColumn}
@@ -42,58 +34,64 @@ class CudfUtils {
   // Helper class to build array interface string
   private static class Builder {
-    private JsonNodeFactory nodeFactory = new JsonNodeFactory(false);
-    private ArrayNode rootArrayNode = nodeFactory.arrayNode();
+    private ArrayList<String> colArrayInterfaces = new ArrayList<String>();

     private Builder add(CudfColumn... columns) {
       if (columns == null || columns.length <= 0) {
         throw new IllegalArgumentException("At least one ColumnData is required.");
       }
       for (CudfColumn cd : columns) {
-        rootArrayNode.add(buildColumnObject(cd));
+        colArrayInterfaces.add(buildColumnObject(cd));
       }
       return this;
     }

     private String build() {
-      try {
-        ByteArrayOutputStream bos = new ByteArrayOutputStream();
-        JsonGenerator jsonGen = new JsonFactory().createGenerator(bos);
-        new ObjectMapper().writeTree(jsonGen, rootArrayNode);
-        return bos.toString();
-      } catch (IOException ie) {
-        ie.printStackTrace();
-        throw new RuntimeException("Failed to build array interface. Error: " + ie);
-      }
+      StringBuilder builder = new StringBuilder();
+      builder.append("[");
+      for (int i = 0; i < colArrayInterfaces.size(); i++) {
+        builder.append(colArrayInterfaces.get(i));
+        if (i != colArrayInterfaces.size() - 1) {
+          builder.append(",");
+        }
+      }
+      builder.append("]");
+      return builder.toString();
     }

-    private ObjectNode buildColumnObject(CudfColumn column) {
+    /** build the whole column information including data and valid info */
+    private String buildColumnObject(CudfColumn column) {
       if (column.getDataPtr() == 0) {
         throw new IllegalArgumentException("Empty column data is NOT accepted!");
       }
       if (column.getTypeStr() == null || column.getTypeStr().isEmpty()) {
         throw new IllegalArgumentException("Empty type string is NOT accepted!");
       }
-      ObjectNode colDataObj = buildMetaObject(column.getDataPtr(), column.getShape(),
-          column.getTypeStr());
+      StringBuilder builder = new StringBuilder();
+      String colData = buildMetaObject(column.getDataPtr(), column.getShape(),
+          column.getTypeStr());
+      builder.append("{");
+      builder.append(colData);
       if (column.getValidPtr() != 0 && column.getNullCount() != 0) {
-        ObjectNode validObj = buildMetaObject(column.getValidPtr(), column.getShape(), "<t1");
-        colDataObj.set("mask", validObj);
+        String validString = buildMetaObject(column.getValidPtr(), column.getShape(), "<t1");
+        builder.append(",\"mask\":");
+        builder.append("{");
+        builder.append(validString);
+        builder.append("}");
       }
-      return colDataObj;
+      builder.append("}");
+      return builder.toString();
     }

-    private ObjectNode buildMetaObject(long ptr, long shape, final String typeStr) {
-      ObjectNode objNode = nodeFactory.objectNode();
-      ArrayNode shapeNode = objNode.putArray("shape");
-      shapeNode.add(shape);
-      ArrayNode dataNode = objNode.putArray("data");
-      dataNode.add(ptr)
-          .add(false);
-      objNode.put("typestr", typeStr)
-          .put("version", 1);
-      return objNode;
+    /** build the base information of a column */
+    private String buildMetaObject(long ptr, long shape, final String typeStr) {
+      StringBuilder builder = new StringBuilder();
+      builder.append("\"shape\":[" + shape + "],");
+      builder.append("\"data\":[" + ptr + "," + "false" + "],");
+      builder.append("\"typestr\":\"" + typeStr + "\",");
+      builder.append("\"version\":" + 1);
+      return builder.toString();
     }
   }
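
For reference, a sketch of the CUDA array interface string the rewritten Builder emits for one float32 column with a validity mask; pointer and length values are illustrative:

    // One column object inside the top-level JSON array:
    val example: String =
      "[{\"shape\":[150],\"data\":[139876543,false],\"typestr\":\"<f4\",\"version\":1," +
      "\"mask\":{\"shape\":[150],\"data\":[139876999,false],\"typestr\":\"<t1\",\"version\":1}}]"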


@@ -112,7 +112,7 @@ private[spark] object GpuUtils {
    val msg = if (fitting) "train" else "transform"
    // feature columns
    require(featureNames.nonEmpty, s"Gpu $msg requires features columns. " +
-      "please refer to setFeaturesCols!")
+      "please refer to `setFeaturesCol(value: Array[String])`!")
    featureNames.foreach(fn => checkNumericType(schema, fn))
    if (fitting) {
      require(labelName.nonEmpty, "label column is not set.")


@@ -126,7 +126,7 @@ class GpuXGBoostClassifierSuite extends GpuTestSuite {
    val vectorAssembler = new VectorAssembler()
      .setHandleInvalid("keep")
-      .setInputCols(featureNames.toArray)
+      .setInputCols(featureNames)
      .setOutputCol("features")
    val trainingDf = vectorAssembler.transform(rawInput).select("features", labelName)
@@ -147,12 +147,12 @@ class GpuXGBoostClassifierSuite extends GpuTestSuite {
      .csv(dataPath).randomSplit(Array(0.7, 0.3), seed = 1)
    // Since CPU model does not know the information about the features cols that GPU transform
-    // pipeline requires. End user needs to setFeaturesCols in the model manually
-    val thrown = intercept[IllegalArgumentException](cpuModel
+    // pipeline requires. End user needs to setFeaturesCol(features: Array[String]) in the model
+    // manually
+    val thrown = intercept[NoSuchElementException](cpuModel
      .transform(testDf)
      .collect())
-    assert(thrown.getMessage.contains("Gpu transform requires features columns. " +
-      "please refer to setFeaturesCols"))
+    assert(thrown.getMessage.contains("Failed to find a default value for featuresCols"))

    val left = cpuModel
      .setFeaturesCol(featureNames)
@@ -195,17 +195,16 @@ class GpuXGBoostClassifierSuite extends GpuTestSuite {
    val featureColName = "feature_col"
    val vectorAssembler = new VectorAssembler()
      .setHandleInvalid("keep")
-      .setInputCols(featureNames.toArray)
+      .setInputCols(featureNames)
      .setOutputCol(featureColName)
    val testDf = vectorAssembler.transform(rawInput).select(featureColName, labelName)
    // Since GPU model does not know the information about the features col name that CPU
    // transform pipeline requires. End user needs to setFeaturesCol in the model manually
-    val thrown = intercept[IllegalArgumentException](
+    intercept[IllegalArgumentException](
      gpuModel
        .transform(testDf)
        .collect())
-    assert(thrown.getMessage.contains("features does not exist"))

    val left = gpuModel
      .setFeaturesCol(featureColName)


@@ -108,12 +108,15 @@ class GpuXGBoostGeneralSuite extends GpuTestSuite {
    val trainingDf = trainingData.toDF(allColumnNames: _*)
    val xgbParam = Map("eta" -> 0.1f, "max_depth" -> 2, "objective" -> "multi:softprob",
      "num_class" -> 3, "num_round" -> 5, "num_workers" -> 1, "tree_method" -> "gpu_hist")
-    val thrown = intercept[IllegalArgumentException] {
+    // GPU train requires featuresCols. If not specified,
+    // then NoSuchElementException will be thrown
+    val thrown = intercept[NoSuchElementException] {
      new XGBoostClassifier(xgbParam)
        .setLabelCol(labelName)
        .fit(trainingDf)
    }
-    assert(thrown.getMessage.contains("Gpu train requires features columns."))
+    assert(thrown.getMessage.contains("Failed to find a default value for featuresCols"))

    val thrown1 = intercept[IllegalArgumentException] {
      new XGBoostClassifier(xgbParam)


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2021 by Contributors
+ Copyright (c) 2021-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -86,7 +86,7 @@ class GpuXGBoostRegressorSuite extends GpuTestSuite {
      .csv(getResourcePath("/rank.train.csv")).randomSplit(Array(0.7, 0.3), seed = 1)
    val classifier = new XGBoostRegressor(xgbParam)
-      .setFeaturesCols(featureNames)
+      .setFeaturesCol(featureNames)
      .setLabelCol(labelName)
      .setTreeMethod("gpu_hist")
    (classifier.fit(rawInput), testDf)
@@ -122,7 +122,7 @@ class GpuXGBoostRegressorSuite extends GpuTestSuite {
    val vectorAssembler = new VectorAssembler()
      .setHandleInvalid("keep")
-      .setInputCols(featureNames.toArray)
+      .setInputCols(featureNames)
      .setOutputCol("features")
    val trainingDf = vectorAssembler.transform(rawInput).select("features", labelName)
@@ -143,20 +143,20 @@ class GpuXGBoostRegressorSuite extends GpuTestSuite {
      .csv(getResourcePath("/rank.train.csv")).randomSplit(Array(0.7, 0.3), seed = 1)
    // Since CPU model does not know the information about the features cols that GPU transform
-    // pipeline requires. End user needs to setFeaturesCols in the model manually
-    val thrown = intercept[IllegalArgumentException](cpuModel
+    // pipeline requires. End user needs to setFeaturesCol(features: Array[String]) in the model
+    // manually
+    val thrown = intercept[NoSuchElementException](cpuModel
      .transform(testDf)
      .collect())
-    assert(thrown.getMessage.contains("Gpu transform requires features columns. " +
-      "please refer to setFeaturesCols"))
+    assert(thrown.getMessage.contains("Failed to find a default value for featuresCols"))

    val left = cpuModel
-      .setFeaturesCols(featureNames)
+      .setFeaturesCol(featureNames)
      .transform(testDf)
      .collect()
    val right = cpuModelFromFile
-      .setFeaturesCols(featureNames)
+      .setFeaturesCol(featureNames)
      .transform(testDf)
      .collect()
@@ -173,7 +173,7 @@ class GpuXGBoostRegressorSuite extends GpuTestSuite {
      .csv(getResourcePath("/rank.train.csv")).randomSplit(Array(0.7, 0.3), seed = 1)
    val classifier = new XGBoostRegressor(xgbParam)
-      .setFeaturesCols(featureNames)
+      .setFeaturesCol(featureNames)
      .setLabelCol(labelName)
      .setTreeMethod("gpu_hist")
    classifier.fit(rawInput)
@@ -191,17 +191,16 @@ class GpuXGBoostRegressorSuite extends GpuTestSuite {
    val featureColName = "feature_col"
    val vectorAssembler = new VectorAssembler()
      .setHandleInvalid("keep")
-      .setInputCols(featureNames.toArray)
+      .setInputCols(featureNames)
      .setOutputCol(featureColName)
    val testDf = vectorAssembler.transform(rawInput).select(featureColName, labelName)
    // Since GPU model does not know the information about the features col name that CPU
    // transform pipeline requires. End user needs to setFeaturesCol in the model manually
-    val thrown = intercept[IllegalArgumentException](
+    intercept[IllegalArgumentException](
      gpuModel
        .transform(testDf)
        .collect())
-    assert(thrown.getMessage.contains("features does not exist"))

    val left = gpuModel
      .setFeaturesCol(featureColName)


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2021 by Contributors
+ Copyright (c) 2021-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -35,8 +35,10 @@
import org.apache.commons.logging.LogFactory
import org.apache.spark.TaskContext
import org.apache.spark.broadcast.Broadcast
+import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.{Estimator, Model, PipelineStage}
import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.linalg.xgboost.XGBoostSchemaUtils
import org.apache.spark.sql.types.{ArrayType, FloatType, StructField, StructType}
import org.apache.spark.storage.StorageLevel
@@ -112,7 +114,7 @@ object PreXGBoost extends PreXGBoostProvider {
      return optionProvider.get.buildDatasetToRDD(estimator, dataset, params)
    }
-    val (packedParams, evalSet) = estimator match {
+    val (packedParams, evalSet, xgbInput) = estimator match {
      case est: XGBoostEstimatorCommon =>
        // get weight column, if weight is not defined, default to lit(1.0)
        val weight = if (!est.isDefined(est.weightCol) || est.getWeightCol.isEmpty) {
@@ -136,15 +138,18 @@
        }
-        (PackedParams(col(est.getLabelCol), col(est.getFeaturesCol), weight, baseMargin, group,
-          est.getNumWorkers, est.needDeterministicRepartitioning), est.getEvalSets(params))
+        val (xgbInput, featuresName) = est.vectorize(dataset)
+        (PackedParams(col(est.getLabelCol), col(featuresName), weight, baseMargin, group,
+          est.getNumWorkers, est.needDeterministicRepartitioning), est.getEvalSets(params),
+          xgbInput)
      case _ => throw new RuntimeException("Unsupporting " + estimator)
    }
    // transform the training Dataset[_] to RDD[XGBLabeledPoint]
    val trainingSet: RDD[XGBLabeledPoint] = DataUtils.convertDataFrameToXGBLabeledPointRDDs(
-      packedParams, dataset.asInstanceOf[DataFrame]).head
+      packedParams, xgbInput.asInstanceOf[DataFrame]).head
    // transform the eval Dataset[_] to RDD[XGBLabeledPoint]
    val evalRDDMap = evalSet.map {
@@ -184,11 +189,11 @@
    }
    /** get the necessary parameters */
-    val (booster, inferBatchSize, featuresCol, useExternalMemory, missing, allowNonZeroForMissing,
-      predictFunc, schema) =
+    val (booster, inferBatchSize, xgbInput, featuresCol, useExternalMemory, missing,
+      allowNonZeroForMissing, predictFunc, schema) =
      model match {
        case m: XGBoostClassificationModel =>
+          val (xgbInput, featuresName) = m.vectorize(dataset)
          // predict and turn to Row
          val predictFunc =
            (broadcastBooster: Broadcast[Booster], dm: DMatrix, originalRowItr: Iterator[Row]) => {
@@ -199,7 +204,7 @@
          }
          // prepare the final Schema
-          var schema = StructType(dataset.schema.fields ++
+          var schema = StructType(xgbInput.schema.fields ++
            Seq(StructField(name = XGBoostClassificationModel._rawPredictionCol, dataType =
              ArrayType(FloatType, containsNull = false), nullable = false)) ++
            Seq(StructField(name = XGBoostClassificationModel._probabilityCol, dataType =
@@ -214,11 +219,12 @@
              ArrayType(FloatType, containsNull = false), nullable = false))
          }
-          (m._booster, m.getInferBatchSize, m.getFeaturesCol, m.getUseExternalMemory, m.getMissing,
-            m.getAllowNonZeroForMissingValue, predictFunc, schema)
+          (m._booster, m.getInferBatchSize, xgbInput, featuresName, m.getUseExternalMemory,
+            m.getMissing, m.getAllowNonZeroForMissingValue, predictFunc, schema)
        case m: XGBoostRegressionModel =>
          // predict and turn to Row
+          val (xgbInput, featuresName) = m.vectorize(dataset)
          val predictFunc =
            (broadcastBooster: Broadcast[Booster], dm: DMatrix, originalRowItr: Iterator[Row]) => {
              val Array(rawPredictionItr, predLeafItr, predContribItr) =
@@ -227,7 +233,7 @@
          }
          // prepare the final Schema
-          var schema = StructType(dataset.schema.fields ++
+          var schema = StructType(xgbInput.schema.fields ++
            Seq(StructField(name = XGBoostRegressionModel._originalPredictionCol, dataType =
              ArrayType(FloatType, containsNull = false), nullable = false)))
@@ -240,14 +246,14 @@
              ArrayType(FloatType, containsNull = false), nullable = false))
          }
-          (m._booster, m.getInferBatchSize, m.getFeaturesCol, m.getUseExternalMemory, m.getMissing,
-            m.getAllowNonZeroForMissingValue, predictFunc, schema)
+          (m._booster, m.getInferBatchSize, xgbInput, featuresName, m.getUseExternalMemory,
+            m.getMissing, m.getAllowNonZeroForMissingValue, predictFunc, schema)
      }

-    val bBooster = dataset.sparkSession.sparkContext.broadcast(booster)
-    val appName = dataset.sparkSession.sparkContext.appName
+    val bBooster = xgbInput.sparkSession.sparkContext.broadcast(booster)
+    val appName = xgbInput.sparkSession.sparkContext.appName

-    val resultRDD = dataset.asInstanceOf[Dataset[Row]].rdd.mapPartitions { rowIterator =>
+    val resultRDD = xgbInput.asInstanceOf[Dataset[Row]].rdd.mapPartitions { rowIterator =>
      new AbstractIterator[Row] {
        private var batchCnt = 0
@@ -295,7 +301,7 @@
    }
    bBooster.unpersist(blocking = false)
-    dataset.sparkSession.createDataFrame(resultRDD, schema)
+    xgbInput.sparkSession.createDataFrame(resultRDD, schema)
  }


@@ -144,13 +144,6 @@ class XGBoostClassifier (
  def setSinglePrecisionHistogram(value: Boolean): this.type =
    set(singlePrecisionHistogram, value)
-  /**
-   * This API is only used in GPU train pipeline of xgboost4j-spark-gpu, which requires
-   * all feature columns must be numeric types.
-   */
-  def setFeaturesCol(value: Array[String]): this.type =
-    set(featuresCols, value)
  // called at the start of fit/train when 'eval_metric' is not defined
  private def setupDefaultEvalMetric(): String = {
    require(isDefined(objective), "Users must set \'objective\' via xgboostParams.")
@@ -165,7 +158,12 @@
  // Callback from PreXGBoost
  private[spark] def transformSchemaInternal(schema: StructType): StructType = {
-    super.transformSchema(schema)
+    if (isFeaturesColSet(schema)) {
+      // User has vectorized the features into VectorUDT.
+      super.transformSchema(schema)
+    } else {
+      transformSchemaWithFeaturesCols(true, schema)
+    }
  }
  override def transformSchema(schema: StructType): StructType = {
@@ -260,13 +258,6 @@ class XGBoostClassificationModel private[ml](
  def setInferBatchSize(value: Int): this.type = set(inferBatchSize, value)
-  /**
-   * This API is only used in GPU train pipeline of xgboost4j-spark-gpu, which requires
-   * all feature columns must be numeric types.
-   */
-  def setFeaturesCol(value: Array[String]): this.type =
-    set(featuresCols, value)
  /**
   * Single instance prediction.
   * Note: The performance is not ideal, use it carefully!
@@ -359,7 +350,12 @@
  }
  private[spark] def transformSchemaInternal(schema: StructType): StructType = {
-    super.transformSchema(schema)
+    if (isFeaturesColSet(schema)) {
+      // User has vectorized the features into VectorUDT.
+      super.transformSchema(schema)
+    } else {
+      transformSchemaWithFeaturesCols(false, schema)
+    }
  }
  override def transformSchema(schema: StructType): StructType = {
@@ -385,8 +381,6 @@
      Vectors.dense(rawPredictions)
    }
    if ($(rawPredictionCol).nonEmpty) {
      outputData = outputData
        .withColumn(getRawPredictionCol, rawPredictionUDF(col(_rawPredictionCol)))


@@ -146,13 +146,6 @@ class XGBoostRegressor (
  def setSinglePrecisionHistogram(value: Boolean): this.type =
    set(singlePrecisionHistogram, value)
-  /**
-   * This API is only used in GPU train pipeline of xgboost4j-spark-gpu, which requires
-   * all feature columns must be numeric types.
-   */
-  def setFeaturesCols(value: Array[String]): this.type =
-    set(featuresCols, value)
  // called at the start of fit/train when 'eval_metric' is not defined
  private def setupDefaultEvalMetric(): String = {
    require(isDefined(objective), "Users must set \'objective\' via xgboostParams.")
@@ -164,7 +157,12 @@
  }
  private[spark] def transformSchemaInternal(schema: StructType): StructType = {
-    super.transformSchema(schema)
+    if (isFeaturesColSet(schema)) {
+      // User has vectorized the features into VectorUDT.
+      super.transformSchema(schema)
+    } else {
+      transformSchemaWithFeaturesCols(false, schema)
+    }
  }
  override def transformSchema(schema: StructType): StructType = {
@@ -253,13 +251,6 @@ class XGBoostRegressionModel private[ml] (
  def setInferBatchSize(value: Int): this.type = set(inferBatchSize, value)
-  /**
-   * This API is only used in GPU train pipeline of xgboost4j-spark-gpu, which requires
-   * all feature columns must be numeric types.
-   */
-  def setFeaturesCols(value: Array[String]): this.type =
-    set(featuresCols, value)
  /**
   * Single instance prediction.
   * Note: The performance is not ideal, use it carefully!
@@ -331,7 +322,12 @@
  }
  private[spark] def transformSchemaInternal(schema: StructType): StructType = {
-    super.transformSchema(schema)
+    if (isFeaturesColSet(schema)) {
+      // User has vectorized the features into VectorUDT.
+      super.transformSchema(schema)
+    } else {
+      transformSchemaWithFeaturesCols(false, schema)
+    }
  }
  override def transformSchema(schema: StructType): StructType = {


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2014,2021 by Contributors
+ Copyright (c) 2014-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -247,6 +247,27 @@ trait HasNumClass extends Params {
  final def getNumClass: Int = $(numClass)
}
/**
* Trait for shared param featuresCols.
*/
trait HasFeaturesCols extends Params {
/**
* Param for the names of feature columns.
* @group param
*/
final val featuresCols: StringArrayParam = new StringArrayParam(this, "featuresCols",
"an array of feature column names.")
/** @group getParam */
final def getFeaturesCols: Array[String] = $(featuresCols)
/** Check if featuresCols is valid */
def isFeaturesColsValid: Boolean = {
isDefined(featuresCols) && $(featuresCols) != Array.empty
}
}
private[spark] trait ParamMapFuncs extends Params {
  def XGBoost2MLlibParams(xgboostParams: Map[String, Any]): Unit = {


@@ -1,34 +0,0 @@
/*
Copyright (c) 2021-2022 by Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package ml.dmlc.xgboost4j.scala.spark.params
import org.apache.spark.ml.param.{Params, StringArrayParam}
trait GpuParams extends Params {
/**
* Param for the names of feature columns for GPU pipeline.
* @group param
*/
final val featuresCols: StringArrayParam = new StringArrayParam(this, "featuresCols",
"an array of feature column names for GPU pipeline.")
setDefault(featuresCols, Array.empty[String])
/** @group getParam */
final def getFeaturesCols: Array[String] = $(featuresCols)
}


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2014,2021 by Contributors
+ Copyright (c) 2014-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -16,16 +16,101 @@
package ml.dmlc.xgboost4j.scala.spark.params
-import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasWeightCol}
+import org.apache.spark.ml.feature.VectorAssembler
+import org.apache.spark.ml.linalg.xgboost.XGBoostSchemaUtils
+import org.apache.spark.ml.param.{Param, ParamValidators}
+import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasHandleInvalid, HasLabelCol, HasWeightCol}
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.types.StructType
private[scala] sealed trait XGBoostEstimatorCommon extends GeneralParams with LearningTaskParams
  with BoosterParams with RabitParams with ParamMapFuncs with NonParamVariables with HasWeightCol
  with HasBaseMarginCol with HasLeafPredictionCol with HasContribPredictionCol with HasFeaturesCol
-  with HasLabelCol with GpuParams {
+  with HasLabelCol with HasFeaturesCols with HasHandleInvalid {
  def needDeterministicRepartitioning: Boolean = {
    getCheckpointPath != null && getCheckpointPath.nonEmpty && getCheckpointInterval > 0
  }
/**
* Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
* invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the
* output). Column lengths are taken from the size of ML Attribute Group, which can be set using
* `VectorSizeHint` in a pipeline before `VectorAssembler`. Column lengths can also be inferred
* from first rows of the data since it is safe to do so but only in case of 'error' or 'skip'.
* Default: "error"
* @group param
*/
override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
"""Param for how to handle invalid data (NULL and NaN values). Options are 'skip' (filter out
|rows with invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN
|in the output). Column lengths are taken from the size of ML Attribute Group, which can be
|set using `VectorSizeHint` in a pipeline before `VectorAssembler`. Column lengths can also
|be inferred from first rows of the data since it is safe to do so but only in case of 'error'
|or 'skip'.""".stripMargin.replaceAll("\n", " "),
ParamValidators.inArray(Array("skip", "error", "keep")))
setDefault(handleInvalid, "error")
/**
* Specify an array of feature column names which must be numeric types.
*/
def setFeaturesCol(value: Array[String]): this.type = set(featuresCols, value)
/** Set the handleInvalid for VectorAssembler */
def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
/**
* Check if schema has a field named with the value of "featuresCol" param and it's data type
* must be VectorUDT
*/
def isFeaturesColSet(schema: StructType): Boolean = {
schema.fieldNames.contains(getFeaturesCol) &&
XGBoostSchemaUtils.isVectorUDFType(schema(getFeaturesCol).dataType)
}
/** check the features columns type */
def transformSchemaWithFeaturesCols(fit: Boolean, schema: StructType): StructType = {
if (isFeaturesColsValid) {
if (fit) {
XGBoostSchemaUtils.checkNumericType(schema, $(labelCol))
}
$(featuresCols).foreach(feature =>
XGBoostSchemaUtils.checkFeatureColumnType(schema(feature).dataType))
schema
} else {
throw new IllegalArgumentException("featuresCol or featuresCols must be specified")
}
}
/**
* Vectorize the features columns if necessary.
*
* @param input the input dataset
* @return (output dataset and the feature column name)
*/
def vectorize(input: Dataset[_]): (Dataset[_], String) = {
val schema = input.schema
if (isFeaturesColSet(schema)) {
// Dataset already has vectorized.
(input, getFeaturesCol)
} else if (isFeaturesColsValid) {
val featuresName = if (!schema.fieldNames.contains(getFeaturesCol)) {
getFeaturesCol
} else {
"features_" + uid
}
val vectorAssembler = new VectorAssembler()
.setHandleInvalid($(handleInvalid))
.setInputCols(getFeaturesCols)
.setOutputCol(featuresName)
(vectorAssembler.transform(input).select(featuresName, getLabelCol), featuresName)
} else {
// never reach here, since transformSchema will take care of the case
// that featuresCols is invalid
(input, getFeaturesCol)
}
}
}

private[scala] trait XGBoostClassifierParams extends XGBoostEstimatorCommon with HasNumClass


@@ -0,0 +1,51 @@
/*
Copyright (c) 2022 by Contributors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package org.apache.spark.ml.linalg.xgboost
import org.apache.spark.sql.types.{BooleanType, DataType, NumericType, StructType}
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.util.SchemaUtils
object XGBoostSchemaUtils {
/** check if the dataType is VectorUDT */
def isVectorUDFType(dataType: DataType): Boolean = {
dataType match {
case _: VectorUDT => true
case _ => false
}
}
/** The feature columns will be vectorized by VectorAssembler first, which only
* supports Numeric, Boolean and VectorUDT types */
def checkFeatureColumnType(dataType: DataType): Unit = {
dataType match {
case _: NumericType | BooleanType =>
case _: VectorUDT =>
case d => throw new UnsupportedOperationException(s"featuresCols only supports Numeric, " +
s"boolean and VectorUDT types, found: ${d}")
}
}
def checkNumericType(
schema: StructType,
colName: String,
msg: String = ""): Unit = {
SchemaUtils.checkNumericType(schema, colName, msg)
}
}


@@ -23,6 +23,7 @@ import org.apache.spark.sql._
import org.scalatest.FunSuite
import org.apache.spark.Partitioner
+import org.apache.spark.ml.feature.VectorAssembler

class XGBoostClassifierSuite extends FunSuite with PerTest {
@@ -316,4 +317,77 @@ class XGBoostClassifierSuite extends FunSuite with PerTest {
    xgb.fit(repartitioned)
  }
test("featuresCols with features column can work") {
val spark = ss
import spark.implicits._
val xgbInput = Seq(
(Vectors.dense(1.0, 7.0), true, 10.1, 100.2, 0),
(Vectors.dense(2.0, 20.0), false, 2.1, 2.2, 1))
.toDF("f1", "f2", "f3", "features", "label")
val paramMap = Map("eta" -> "1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "binary:logistic", "num_round" -> 5, "num_workers" -> 1)
val featuresName = Array("f1", "f2", "f3", "features")
val xgbClassifier = new XGBoostClassifier(paramMap)
.setFeaturesCol(featuresName)
.setLabelCol("label")
val model = xgbClassifier.fit(xgbInput)
assert(model.getFeaturesCols.sameElements(featuresName))
val df = model.transform(xgbInput)
assert(df.schema.fieldNames.contains("features_" + model.uid))
df.show()
val newFeatureName = "features_new"
// transform also can work for vectorized dataset
val vectorizedInput = new VectorAssembler()
.setInputCols(featuresName)
.setOutputCol(newFeatureName)
.transform(xgbInput)
.select(newFeatureName, "label")
val df1 = model
.setFeaturesCol(newFeatureName)
.transform(vectorizedInput)
assert(df1.schema.fieldNames.contains(newFeatureName))
df1.show()
}
test("featuresCols without features column can work") {
val spark = ss
import spark.implicits._
val xgbInput = Seq(
(Vectors.dense(1.0, 7.0), true, 10.1, 100.2, 0),
(Vectors.dense(2.0, 20.0), false, 2.1, 2.2, 1))
.toDF("f1", "f2", "f3", "f4", "label")
val paramMap = Map("eta" -> "1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "binary:logistic", "num_round" -> 5, "num_workers" -> 1)
val featuresName = Array("f1", "f2", "f3", "f4")
val xgbClassifier = new XGBoostClassifier(paramMap)
.setFeaturesCol(featuresName)
.setLabelCol("label")
val model = xgbClassifier.fit(xgbInput)
assert(model.getFeaturesCols.sameElements(featuresName))
// transform should work for the dataset which includes the feature column names.
val df = model.transform(xgbInput)
assert(df.schema.fieldNames.contains("features"))
df.show()
// transform also can work for vectorized dataset
val vectorizedInput = new VectorAssembler()
.setInputCols(featuresName)
.setOutputCol("features")
.transform(xgbInput)
.select("features", "label")
val df1 = model.transform(vectorizedInput)
df1.show()
}
}


@@ -1,5 +1,5 @@
 /*
- Copyright (c) 2014 by Contributors
+ Copyright (c) 2014-2022 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
@@ -17,12 +17,15 @@
package ml.dmlc.xgboost4j.scala.spark
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => ScalaXGBoost}
-import org.apache.spark.ml.linalg.Vector
+import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._
import org.scalatest.FunSuite
+import org.apache.spark.ml.feature.VectorAssembler

class XGBoostRegressorSuite extends FunSuite with PerTest {
  protected val treeMethod: String = "auto"
@@ -216,4 +219,77 @@ class XGBoostRegressorSuite extends FunSuite with PerTest {
    assert(resultDF.columns.contains("predictLeaf"))
    assert(resultDF.columns.contains("predictContrib"))
  }
test("featuresCols with features column can work") {
val spark = ss
import spark.implicits._
val xgbInput = Seq(
(Vectors.dense(1.0, 7.0), true, 10.1, 100.2, 0),
(Vectors.dense(2.0, 20.0), false, 2.1, 2.2, 1))
.toDF("f1", "f2", "f3", "features", "label")
val paramMap = Map("eta" -> "1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "reg:squarederror", "num_round" -> 5, "num_workers" -> 1)
val featuresName = Array("f1", "f2", "f3", "features")
val xgbClassifier = new XGBoostRegressor(paramMap)
.setFeaturesCol(featuresName)
.setLabelCol("label")
val model = xgbClassifier.fit(xgbInput)
assert(model.getFeaturesCols.sameElements(featuresName))
val df = model.transform(xgbInput)
assert(df.schema.fieldNames.contains("features_" + model.uid))
df.show()
val newFeatureName = "features_new"
// transform also can work for vectorized dataset
val vectorizedInput = new VectorAssembler()
.setInputCols(featuresName)
.setOutputCol(newFeatureName)
.transform(xgbInput)
.select(newFeatureName, "label")
val df1 = model
.setFeaturesCol(newFeatureName)
.transform(vectorizedInput)
assert(df1.schema.fieldNames.contains(newFeatureName))
df1.show()
}
test("featuresCols without features column can work") {
val spark = ss
import spark.implicits._
val xgbInput = Seq(
(Vectors.dense(1.0, 7.0), true, 10.1, 100.2, 0),
(Vectors.dense(2.0, 20.0), false, 2.1, 2.2, 1))
.toDF("f1", "f2", "f3", "f4", "label")
val paramMap = Map("eta" -> "1", "max_depth" -> "6", "silent" -> "1",
"objective" -> "reg:squarederror", "num_round" -> 5, "num_workers" -> 1)
val featuresName = Array("f1", "f2", "f3", "f4")
val xgbClassifier = new XGBoostRegressor(paramMap)
.setFeaturesCol(featuresName)
.setLabelCol("label")
val model = xgbClassifier.fit(xgbInput)
assert(model.getFeaturesCols.sameElements(featuresName))
// transform should work for the dataset which includes the feature column names.
val df = model.transform(xgbInput)
assert(df.schema.fieldNames.contains("features"))
df.show()
// transform also can work for vectorized dataset
val vectorizedInput = new VectorAssembler()
.setInputCols(featuresName)
.setOutputCol("features")
.transform(xgbInput)
.select("features", "label")
val df1 = model.transform(vectorizedInput)
df1.show()
}
}