[jvm-packages] Tutorial of XGBoost4J-Spark (#3534)

* add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * add new * update doc * finish Gang Scheduling * more * intro * Add sections: Prediction, Model persistence and ML pipeline. * Add XGBoost4j-Spark MLlib pipeline example * partial finished version * finish the doc * adjust code * fix the doc * use rst * Convert XGBoost4J-Spark tutorial to reST * Bring XGBoost4J up to date * add note about using hdfs * remove duplicate file * fix descriptions * update doc * Wrap HDFS/S3 export support as a note * update * wrap indexing_mode example in code block
2018-08-03 21:17:50 -07:00
parent 34dc9155ab
commit 31d1baba3d
8 changed files with 761 additions and 323 deletions
--- a/doc/jvm/index.rst
+++ b/doc/jvm/index.rst
@@ -145,8 +145,10 @@ Contents
 ********
 .. toctree::
  :maxdepth: 2
-  Java Overview Tutorial <java_intro>
+  java_intro
  XGBoost4J-Spark Tutorial <xgboost4j_spark_tutorial>
  Code Examples <https://github.com/dmlc/xgboost/tree/master/jvm-packages/xgboost4j-example>
  XGBoost4J Java API <http://dmlc.ml/docs/javadocs/index.html>
  XGBoost4J Scala API <http://dmlc.ml/docs/scaladocs/xgboost4j/index.html>
--- a/doc/jvm/java_intro.rst
+++ b/doc/jvm/java_intro.rst
@@ -1,28 +1,28 @@
-##################
+##############################
-XGBoost4J Java API
+Getting Started with XGBoost4J
-##################
+##############################
 This tutorial introduces Java API for XGBoost.
 **************
 Data Interface
 **************
-Like the XGBoost python module, XGBoost4J uses ``DMatrix`` to handle data,
+Like the XGBoost python module, XGBoost4J uses DMatrix to handle data,
-libsvm txt format file, sparse matrix in CSR/CSC format, and dense matrix is
+LIBSVM txt format file, sparse matrix in CSR/CSC format, and dense matrix is
 supported.
-* The first step is to import ``DMatrix``:
+* The first step is to import DMatrix:
  .. code-block:: java
-    import org.dmlc.xgboost4j.DMatrix;
+    import org.dmlc.xgboost4j.java.DMatrix;
-* Use ``DMatrix`` constructor to load data from a libsvm text format file:
+* Use DMatrix constructor to load data from a libsvm text format file:
  .. code-block:: java
    DMatrix dmat = new DMatrix("train.svm.txt");
-* Pass arrays to ``DMatrix`` constructor to load from sparse matrix.
+* Pass arrays to DMatrix constructor to load from sparse matrix.
  Suppose we have a sparse matrix
@@ -78,47 +78,31 @@ supported.
 ******************
 Setting Parameters
 ******************
-* In XGBoost4J any ``Iterable<Entry<String, Object>>`` object could be used as parameters.
+Parameters are specified as a Map:
-* To set parameters, for non-multiple value params, you can simply use entrySet of an Map:
+.. code-block:: java
-  .. code-block:: java
+  Map<String, Object> params = new HashMap<String, Object>() {
-
+    {
-    Map<String, Object> paramMap = new HashMap<>() {
+      put("eta", 1.0);
-      {
+      put("max_depth", 2);
-        put("eta", 1.0);
+      put("silent", 1);
-        put("max_depth", 2);
+      put("objective", "binary:logistic");
-        put("silent", 1);
+      put("eval_metric", "logloss");
-        put("objective", "binary:logistic");
+    }
-        put("eval_metric", "logloss");
+  };
      }
    };
    Iterable<Entry<String, Object>> params = paramMap.entrySet();
 * for the situation that multiple values with same param key, List<Entry<String, Object>> would be a good choice, e.g. :
  .. code-block:: java
    List<Entry<String, Object>> params = new ArrayList<Entry<String, Object>>() {
        {
            add(new SimpleEntry<String, Object>("eta", 1.0));
            add(new SimpleEntry<String, Object>("max_depth", 2.0));
            add(new SimpleEntry<String, Object>("silent", 1));
            add(new SimpleEntry<String, Object>("objective", "binary:logistic"));
        }
    };
 **************
 Training Model
 **************
 With parameters and data, you are able to train a booster model.
-* Import ``Trainer`` and ``Booster``:
+* Import Booster and XGBoost:
  .. code-block:: java
-    import org.dmlc.xgboost4j.Booster;
+    import org.dmlc.xgboost4j.java.Booster;
-    import org.dmlc.xgboost4j.util.Trainer;
+    import org.dmlc.xgboost4j.java.XGBoost;
 * Training
@@ -126,13 +110,13 @@ With parameters and data, you are able to train a booster model.
    DMatrix trainMat = new DMatrix("train.svm.txt");
    DMatrix validMat = new DMatrix("valid.svm.txt");
-    //specify a watchList to see the performance
+    // Specify a watchList to see the performance
-    //any Iterable<Entry<String, DMatrix>> object could be used as watchList
+    // Any Iterable<Entry<String, DMatrix>> object could be used as watchList
-    List<Entry<String, DMatrix>> watchs =  new ArrayList<>();
+    List<Entry<String, DMatrix>> watches = new ArrayList<>();
-    watchs.add(new SimpleEntry<>("train", trainMat));
+    watches.add(new SimpleEntry<>("train", trainMat));
-    watchs.add(new SimpleEntry<>("test", testMat));
+    watches.add(new SimpleEntry<>("test", validMat));
-    int round = 2;
+    int nround = 2;
-    Booster booster = Trainer.train(params, trainMat, round, watchs, null, null);
+    Booster booster = XGBoost.train(trainMat, params, nround, watches, null, null);
 * Saving model
@@ -142,25 +126,19 @@ With parameters and data, you are able to train a booster model.
    booster.saveModel("model.bin");
-* Dump Model and Feature Map
+* Generating model dump with feature map
  .. code-block:: java
-    booster.dumpModel("modelInfo.txt", false)
+    String[] model_dump = booster.getModelDump(null, false);
-    //dump with featureMap
+    // dump with feature map
-    booster.dumpModel("modelInfo.txt", "featureMap.txt", false)
+    String[] model_dump_with_feature_map = booster.getModelDump("featureMap.txt", false);
 * Load a model
  .. code-block:: java
-    Params param = new Params() {
+    Booster booster = XGBoost.loadModel("model.bin");
      {
        put("silent", 1);
        put("nthread", 6);
      }
    };
    Booster booster = new Booster(param, "model.bin");
 **********
 Prediction
@@ -170,8 +148,8 @@ After training and loading a model, you can use it to make prediction for other
 .. code-block:: java
  DMatrix dtest = new DMatrix("test.svm.txt");
-  //predict
+  // predict
  float[][] predicts = booster.predict(dtest);
-  //predict leaf
+  // predict leaf
-  float[][] leafPredicts = booster.predict(dtest, 0, true);
+  float[][] leafPredicts = booster.predictLeaf(dtest, 0);
--- a/doc/jvm/xgboost4j_spark_tutorial.rst
+++ b/doc/jvm/xgboost4j_spark_tutorial.rst
@@ -0,0 +1,509 @@
 #######################################
 XGBoost4J-Spark Tutorial (version 0.8+)
 #######################################
 **XGBoost4J-Spark** is a project aiming to seamlessly integrate XGBoost and Apache Spark by fitting XGBoost to Apache Spark's MLlib framework. With the integration, users can not only use the high-performance algorithm implementation of XGBoost, but also leverage the powerful data processing engine of Spark for:
 * Feature Engineering: feature extraction, transformation, dimensionality reduction, and selection, etc.
 * Pipelines: constructing, evaluating, and tuning ML Pipelines
 * Persistence: persist and load machine learning models and even whole Pipelines
 This tutorial covers the end-to-end process of building a machine learning pipeline with XGBoost4J-Spark. We will discuss:
 * Using Spark to preprocess data to fit XGBoost/XGBoost4J-Spark's data interface
 * Training an XGBoost model with XGBoost4J-Spark
 * Serving XGBoost model (prediction) with Spark
 * Building a Machine Learning Pipeline with XGBoost4J-Spark
 * Running XGBoost4J-Spark in Production
 .. contents::
  :backlinks: none
  :local:
 ********************************************
 Build an ML Application with XGBoost4J-Spark
 ********************************************
 Refer to XGBoost4J-Spark Dependency
 ===================================
 Before we tour how to use XGBoost4J-Spark, here is a brief introduction to building a machine learning application with it. The first thing you need to do is to reference the dependency published in Maven Central.
 You can add the following dependency in your ``pom.xml``.
 .. code-block:: xml
  <dependency>
    <groupId>ml.dmlc</groupId>
    <artifactId>xgboost4j-spark</artifactId>
    <version>latest_version_num</version>
  </dependency>
 For the latest release version number, please check `here <https://github.com/dmlc/xgboost/releases>`_.
 We also publish features that will be included in the coming release as snapshot versions. To access these features, you can add a dependency on the snapshot artifacts. We publish snapshot versions in a GitHub-based repo, so you can add the following repo to ``pom.xml``:
 .. code-block:: xml
  <repository>
    <id>XGBoost4J-Spark Snapshot Repo</id>
    <name>XGBoost4J-Spark Snapshot Repo</name>
    <url>https://raw.githubusercontent.com/CodingCat/xgboost/maven-repo/</url>
  </repository>
 and then refer to the snapshot dependency by adding:
 .. code-block:: xml
  <dependency>
      <groupId>ml.dmlc</groupId>
       <artifactId>xgboost4j-spark</artifactId>
      <version>next_version_num-SNAPSHOT</version>
  </dependency>
 Data Preparation
 ================
 As mentioned above, XGBoost4J-Spark seamlessly integrates Spark and XGBoost. The integration enables
 users to apply various types of transformation over the training/test datasets with the convenient
 and powerful data processing framework, Spark.
 In this section, we use the `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
 showcase how we use Spark to transform a raw dataset and make it fit the data interface of XGBoost.
 The Iris dataset is shipped in CSV format. Each instance contains 4 features: "sepal length", "sepal width",
 "petal length" and "petal width". In addition, it contains the "class" column, which is essentially the label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".
 Read Dataset with Spark's Built-In Reader
 -----------------------------------------
 The first thing in data transformation is to load the dataset as Spark's structured data abstraction, DataFrame.
 .. code-block:: scala
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
  val spark = SparkSession.builder().getOrCreate()
  val schema = new StructType(Array(
    StructField("sepal length", DoubleType, true),
    StructField("sepal width", DoubleType, true),
    StructField("petal length", DoubleType, true),
    StructField("petal width", DoubleType, true),
    StructField("class", StringType, true)))
  val rawInput = spark.read.schema(schema).csv("input_path")
 First, we create an instance of `SparkSession <http://spark.apache.org/docs/latest/sql-programming-guide.html#starting-point-sparksession>`_, which is the entry point of any Spark program working with DataFrames. The ``schema`` variable defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we define the columns' names as well as their types; otherwise the column names would be the defaults derived by Spark, such as ``_c0``, etc. Finally, we use Spark's built-in CSV reader to load the Iris CSV file as a DataFrame named ``rawInput``.
 Spark also contains many built-in readers for other formats. The latest version of Spark supports CSV, JSON, Parquet, and LIBSVM.
 Transform Raw Iris Dataset
 --------------------------
 To make the Iris dataset recognizable to XGBoost, we need to:
 1. Transform String-typed label, i.e. "class", to Double-typed label.
 2. Assemble the feature columns as a vector to fit to the data interface of Spark ML framework.
 To convert String-typed label to Double, we can use Spark's built-in feature transformer `StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
 .. code-block:: scala
  import org.apache.spark.ml.feature.StringIndexer
  val stringIndexer = new StringIndexer().
    setInputCol("class").
    setOutputCol("classIndex").
    fit(rawInput)
  val labelTransformed = stringIndexer.transform(rawInput).drop("class")
 With a newly created StringIndexer instance:
 1. We set the input column, i.e. the column containing the String-typed label.
 2. We set the output column, i.e. the column that will contain the Double-typed label.
 3. We ``fit`` the StringIndexer with our input DataFrame ``rawInput``, so that Spark can collect the information it needs, such as the set of distinct label values.
 Now we have a fitted StringIndexer that is ready to be applied to our input DataFrame. To execute its transformation logic, we ``transform`` the input DataFrame ``rawInput``. To keep the DataFrame concise,
 we drop the column "class" and keep only the feature columns and the transformed Double-typed label column (the last line of the above code snippet).
 ``fit`` and ``transform`` are two key operations in MLlib. Basically, ``fit`` produces a "transformer", e.g. a fitted StringIndexer, and each transformer applies its ``transform`` method to a DataFrame to add new column(s) containing transformed features/labels, prediction results, etc. More details about ``fit`` and ``transform`` can be found `here <http://spark.apache.org/docs/latest/ml-pipeline.html#pipeline-components>`_.
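 To make the ``fit``/``transform`` split concrete, the following is a minimal pure-Python sketch of what a label indexer does. It assumes the frequency-descending ordering that StringIndexer uses by default (the most frequent label gets index 0.0). The function names ``fit_label_index`` and ``transform_labels`` are hypothetical and not part of Spark, and ties are broken alphabetically here only to keep the sketch deterministic.

```python
from collections import Counter


def fit_label_index(labels):
    """'fit': learn a label -> index mapping, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (an
    # assumption made for determinism, not a statement about Spark internals).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform_labels(labels, mapping):
    """'transform': apply the learned mapping to every row."""
    return [mapping[label] for label in labels]


# Hypothetical "class" column with three distinct values.
classes = ["Iris-setosa", "Iris-virginica", "Iris-virginica",
           "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
mapping = fit_label_index(classes)
class_index = transform_labels(classes, mapping)
print(mapping)
print(class_index)
```

 The mapping is learned once from the data and can then be applied to any DataFrame with the same label values, which is why Spark separates the two operations.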
 Similarly, we can use another transformer, `VectorAssembler <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler>`_, to assemble feature columns "sepal length", "sepal width", "petal length" and "petal width" as a vector.
 .. code-block:: scala
  import org.apache.spark.ml.feature.VectorAssembler
  val vectorAssembler = new VectorAssembler().
    setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
    setOutputCol("features")
  val xgbInput = vectorAssembler.transform(labelTransformed).select("features", "classIndex")
 Now, we have a DataFrame containing only two columns, "features" which contains vector-represented
 "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Double-typed
 labels. A DataFrame like this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark's training engine directly.
 Training
 ========
 XGBoost supports both regression and classification. While we use the Iris dataset in this tutorial to show how to use XGBoost/XGBoost4J-Spark to solve a multi-class classification problem, the usage for regression is very similar.
 To train an XGBoost model for classification, we need to create an XGBoostClassifier first:
 .. code-block:: scala
  import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
  val xgbParam = Map("eta" -> 0.1f,
        "max_depth" -> 2,
        "objective" -> "multi:softprob",
        "num_class" -> 3,
        "num_round" -> 100,
        "num_workers" -> 2)
  val xgbClassifier = new XGBoostClassifier(xgbParam).
        setFeaturesCol("features").
        setLabelCol("classIndex")
 The available parameters for training an XGBoost model can be found :doc:`here </parameter>`. In XGBoost4J-Spark, we support not only the default set of parameters but also the camel-case variants of these parameters, to stay consistent with Spark's MLlib parameters.
 Specifically, each parameter in :doc:`this page </parameter>` has its
 equivalent form in XGBoost4J-Spark with camel case. For example, to set ``max_depth`` for each tree, you can pass the parameter just as we did in the above code snippet (as ``max_depth`` wrapped in a Map), or you can do it through setters in XGBoostClassifier:
 .. code-block:: scala
  val xgbClassifier = new XGBoostClassifier().
    setFeaturesCol("features").
    setLabelCol("classIndex")
  xgbClassifier.setMaxDepth(2)
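 The naming rule described above can be sketched in a few lines of Python. This only illustrates the snake_case to camelCase convention; the helper names ``to_camel_case`` and ``setter_name`` are hypothetical, and the derived setter names should be checked against the XGBoost4J-Spark API documentation before use.

```python
def to_camel_case(name):
    """Convert a snake_case parameter name such as 'max_depth' to 'maxDepth'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def setter_name(name):
    """Derive the conventional setter name, e.g. 'max_depth' -> 'setMaxDepth'."""
    camel = to_camel_case(name)
    return "set" + camel[0].upper() + camel[1:]


for param in ["max_depth", "num_round", "num_workers", "eta"]:
    print(param, "->", to_camel_case(param), "/", setter_name(param))
```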
 After we set XGBoostClassifier parameters and feature/label column, we can build a transformer, XGBoostClassificationModel by fitting XGBoostClassifier with the input DataFrame. This ``fit`` operation is essentially the training process and the generated model can then be used in prediction.
 .. code-block:: scala
  val xgbClassificationModel = xgbClassifier.fit(xgbInput)
 Prediction
 ==========
 XGBoost4J-Spark supports two ways of model serving: batch prediction and single-instance prediction.
 Batch Prediction
 ----------------
 When we get a model, either XGBoostClassificationModel or XGBoostRegressionModel, it takes a DataFrame, reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame with the following columns by default:
 * XGBoostClassificationModel outputs the margins (``rawPredictionCol``) and probabilities (``probabilityCol``) for each possible label, plus the final prediction label (``predictionCol``).
 * XGBoostRegressionModel outputs the prediction (``predictionCol``).
 Batch prediction expects the user to pass the test set in the form of a DataFrame. XGBoost4J-Spark starts an XGBoost worker for each partition of the DataFrame for parallel prediction and generates prediction results for the whole DataFrame in a batch.
 .. code-block:: scala
  val xgbClassificationModel = xgbClassifier.fit(xgbInput)
  val results = xgbClassificationModel.transform(testSet)
 With the above code snippet, we get a DataFrame, ``results``, containing the margin and probability for each class and the prediction for each instance:
 .. code-block:: none
  +-----------------+----------+--------------------+--------------------+----------+
  |         features|classIndex|       rawPrediction|         probability|prediction|
  +-----------------+----------+--------------------+--------------------+----------+
  |[5.1,3.5,1.4,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
  |[4.9,3.0,1.4,0.2]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
  |[4.7,3.2,1.3,0.2]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
  |[4.6,3.1,1.5,0.2]|       0.0|[3.45569849014282...|[0.99636095762252...|       0.0|
  |[5.0,3.6,1.4,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
  |[5.4,3.9,1.7,0.4]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
  |[4.6,3.4,1.4,0.3]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
  |[5.0,3.4,1.5,0.2]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
  |[4.4,2.9,1.4,0.2]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
  |[4.9,3.1,1.5,0.1]|       0.0|[3.45569849014282...|[0.99636095762252...|       0.0|
  |[5.4,3.7,1.5,0.2]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
  |[4.8,3.4,1.6,0.2]|       0.0|[3.45569849014282...|[0.99643349647521...|       0.0|
  |[4.8,3.0,1.4,0.1]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
  |[4.3,3.0,1.1,0.1]|       0.0|[3.45569849014282...|[0.99618089199066...|       0.0|
  |[5.8,4.0,1.2,0.2]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
  |[5.7,4.4,1.5,0.4]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
  |[5.4,3.9,1.3,0.4]|       0.0|[3.45569849014282...|[0.99428516626358...|       0.0|
  |[5.1,3.5,1.4,0.3]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
  |[5.7,3.8,1.7,0.3]|       0.0|[3.45569849014282...|[0.97809928655624...|       0.0|
  |[5.1,3.8,1.5,0.3]|       0.0|[3.45569849014282...|[0.99579632282257...|       0.0|
  +-----------------+----------+--------------------+--------------------+----------+
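 The relation between the three output columns can be sketched as follows, assuming the ``multi:softprob`` objective used in this tutorial: the probabilities are the softmax of the raw margins, and the prediction is the index of the most probable class. The margin values below are hypothetical and are not taken from the table above.

```python
import math


def softmax(margins):
    """Turn raw per-class margins into probabilities that sum to 1."""
    shift = max(margins)  # subtract the maximum for numerical stability
    exps = [math.exp(m - shift) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(probabilities):
    """The predicted label is the index of the most probable class."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return float(best)


raw_prediction = [3.0, -1.5, -2.0]  # hypothetical margins for one row
probability = softmax(raw_prediction)
prediction = predict_class(probability)
print(probability, prediction)
```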
 Single instance prediction
 --------------------------
 XGBoostClassificationModel and XGBoostRegressionModel also support prediction on a single instance.
 The model accepts a single Vector as features and outputs the prediction label.
 However, single-instance prediction carries a high per-call overhead inside XGBoost, so use it carefully!
 .. code-block:: scala
  import org.apache.spark.ml.linalg.Vector
  val features = xgbInput.head().getAs[Vector]("features")
  val result = xgbClassificationModel.predict(features)
 Model Persistence
 =================
 Model and pipeline persistence
 ------------------------------
 A data scientist produces an ML model and hands it over to an engineering team for deployment in a production environment. Conversely, a trained model may be used by data scientists, for example as a baseline, throughout the process of data exploration. So it's important to support model persistence to make the models available across usage scenarios and programming languages.
 XGBoost4J-Spark supports saving and loading XGBoostClassifier/XGBoostClassificationModel and XGBoostRegressor/XGBoostRegressionModel. It also supports saving and loading an ML pipeline which includes these estimators and models.
 We can save the XGBoostClassificationModel to file system:
 .. code-block:: scala
  val xgbClassificationModelPath = "/tmp/xgbClassificationModel"
  xgbClassificationModel.write.overwrite().save(xgbClassificationModelPath)
 and then load the model in another session:
 .. code-block:: scala
  import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
  val xgbClassificationModel2 = XGBoostClassificationModel.load(xgbClassificationModelPath)
  xgbClassificationModel2.transform(xgbInput)
 For saving and loading an ML pipeline, please refer to the next section.
 Interact with Other Bindings of XGBoost
 ---------------------------------------
 After we train a model with XGBoost4J-Spark on a massive dataset, sometimes we want to serve the model on a single machine or integrate it with other single-node libraries for further processing. XGBoost4J-Spark supports exporting the model to the local file system:
 .. code-block:: scala
  val nativeModelPath = "/tmp/nativeModel"
  xgbClassificationModel.nativeBooster.saveModel(nativeModelPath)
 Then we can load this model with single node Python XGBoost:
 .. code-block:: python
  import xgboost as xgb
  bst = xgb.Booster({'nthread': 4})
  bst.load_model("/tmp/nativeModel")
 .. note:: Using HDFS and S3 for exporting the models with nativeBooster.saveModel()
  When interacting with other language bindings, XGBoost also supports saving-models-to and loading-models-from file systems other than the local one. You can use HDFS and S3 by prefixing the path with ``hdfs://`` and ``s3://`` respectively. However, for this capability, you must do **one** of the following:
  1. Build XGBoost4J-Spark with the steps described `here <https://xgboost.readthedocs.io/en/latest/jvm/index.html#installation-from-source>`_, but with the `USE_HDFS <https://github.com/dmlc/xgboost/blob/e939192978a0c152ad7b49b744630e99d54cffa8/jvm-packages/create_jni.py#L18>`_ switch (or USE_S3, etc. in the same place) turned on. With this approach, you can reuse the above code example by replacing "nativeModelPath" with an HDFS path.
     - However, if you build with USE_HDFS, etc. you have to ensure that the involved shared object file, e.g. libhdfs.so, is put in the LIBRARY_PATH of your cluster. To avoid the complicated cluster environment configuration, choose the other option.
  2. Use bindings of HDFS, S3, etc. to pass model files around. Here are the steps (taking HDFS as an example):
     - Create a new file with
       .. code-block:: scala
         val outputStream = fs.create(new org.apache.hadoop.fs.Path("hdfs_path"))
       where "fs" is an instance of `org.apache.hadoop.fs.FileSystem <https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html>`_ class in Hadoop.
     - Pass the returned OutputStream in the first step to nativeBooster.saveModel():
       .. code-block:: scala
         xgbClassificationModel.nativeBooster.saveModel(outputStream)
     - Download file in other languages from HDFS and load with the pre-built (without the requirement of libhdfs.so) version of XGBoost. (The function "download_from_hdfs" is a helper function to be implemented by the user)
       .. code-block:: python
         import xgboost as xgb
         bst = xgb.Booster({'nthread': 4})
         local_path = download_from_hdfs("hdfs_path")
         bst.load_model(local_path)
 .. note:: Consistency issue between XGBoost4J-Spark and other bindings
  There is a consistency issue between XGBoost4J-Spark and other language bindings of XGBoost.
  When users use Spark to load training/test data in LIBSVM format with the following code snippet:
  .. code-block:: scala
    spark.read.format("libsvm").load("trainingset_libsvm")
  Spark assumes that the dataset uses 1-based indexing (feature indices starting with 1). However, when you do prediction with other bindings of XGBoost (e.g. the Python API of XGBoost), XGBoost assumes by default that the dataset uses 0-based indexing (feature indices starting with 0). This creates a pitfall for users who train a model with Spark but predict with a dataset in the same format in other bindings of XGBoost. The solution is to transform the dataset to 0-based indexing before you predict with, for example, the Python API, or to append ``?indexing_mode=1`` to your file path when loading with DMatrix. For example in Python:
  .. code-block:: python
    xgb.DMatrix('test.libsvm?indexing_mode=1')
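 The other option mentioned in the note, rewriting the file to 0-based indexing, can be sketched in plain Python. This is a minimal illustration that assumes well-formed ``label index:value ...`` lines; the helper name ``to_zero_based`` and the sample rows are hypothetical.

```python
def to_zero_based(line):
    """Shift every feature index in one LIBSVM line from 1-based to 0-based."""
    label, *pairs = line.split()
    shifted = []
    for pair in pairs:
        index, value = pair.split(":", 1)
        if int(index) < 1:
            raise ValueError("expected 1-based feature indices, got " + index)
        shifted.append(f"{int(index) - 1}:{value}")
    return " ".join([label] + shifted)


# Hypothetical rows as Spark would expect them (indices start at 1).
one_based = ["0 1:5.1 2:3.5 3:1.4 4:0.2", "2 1:6.3 3:6.0 4:2.5"]
zero_based = [to_zero_based(line) for line in one_based]
print(zero_based)
```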
 *******************************************
 Building a ML Pipeline with XGBoost4J-Spark
 *******************************************
 Basic ML Pipeline
 =================
 A Spark ML pipeline can combine multiple algorithms or functions into a single pipeline.
 It covers everything from feature extraction, transformation and selection to model training and prediction.
 XGBoost4J-Spark makes it feasible to embed XGBoost into such a pipeline seamlessly.
 The following example shows how to build such a pipeline consisting of Spark MLlib feature transformer
 and XGBoostClassifier estimator.
 We still use `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset and the ``rawInput`` DataFrame.
 First we need to split the dataset into training and test dataset.
 .. code-block:: scala
  val Array(training, test) = rawInput.randomSplit(Array(0.8, 0.2), 123)
 Then we build the ML pipeline, which includes 4 stages:
 * Assemble all features into a single vector column.
 * Convert the string label to an indexed double label.
 * Use XGBoostClassifier to train the classification model.
 * Convert the indexed double label back to the original string label.
 We have shown the first three steps in the earlier sections, and the last step is finished with a new transformer `IndexToString <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.IndexToString>`_:
 .. code-block:: scala
  val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("realLabel")
      .setLabels(stringIndexer.labels)
 We need to organize these steps as a Pipeline in the Spark ML framework and fit the whole pipeline to get a PipelineModel:
 .. code-block:: scala
  import org.apache.spark.ml.feature._
  import org.apache.spark.ml.Pipeline
  val pipeline = new Pipeline()
      .setStages(Array(vectorAssembler, stringIndexer, xgbClassifier, labelConverter))
  val model = pipeline.fit(training)
 After we get the PipelineModel, we can make prediction on the test dataset and evaluate the model accuracy.
 .. code-block:: scala
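  import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
  val prediction = model.transform(test)
  // The evaluator defaults to the "label" column and the F1 metric, so set both explicitly
  val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("classIndex")
      .setMetricName("accuracy")
  val accuracy = evaluator.evaluate(prediction)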
 Pipeline with Hyper-parameter Tuning
 ====================================
 The most critical operation to maximize the power of XGBoost is to select the optimal parameters for the model. Tuning parameters manually is a tedious and labor-intensive process. With the latest version of XGBoost4J-Spark, we can utilize Spark's model selection tools to automate this process.
 The following example shows a code snippet that uses CrossValidator and MulticlassClassificationEvaluator
 to search the optimal combination of two XGBoost parameters, ``max_depth`` and ``eta``. (See :doc:`/parameter`.)
 The model producing the maximum accuracy defined by MulticlassClassificationEvaluator is selected and used to generate the prediction for the test set.
 .. code-block:: scala
  import org.apache.spark.ml.tuning._
  import org.apache.spark.ml.PipelineModel
  import ml.dmlc.xgboost4j.scala.spark.XGBoostClassificationModel
  val paramGrid = new ParamGridBuilder()
      .addGrid(xgbClassifier.maxDepth, Array(3, 8))
      .addGrid(xgbClassifier.eta, Array(0.2, 0.6))
      .build()
  val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
  val cvModel = cv.fit(training)
  val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(2)
      .asInstanceOf[XGBoostClassificationModel]
  bestModel.extractParamMap()
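 To get a feel for the cost of this search, the following Python sketch expands the same grid by hand. It assumes the usual cross-validation procedure: every parameter combination is trained once per fold, and the best combination is then refit on the full training set. The variable names are illustrative only.

```python
from itertools import product

# Candidate values copied from the grid above.
grid = {"maxDepth": [3, 8], "eta": [0.2, 0.6]}
num_folds = 3

# Grid expansion: one parameter map per combination of candidate values.
names = list(grid)
param_maps = [dict(zip(names, values)) for values in product(*grid.values())]

# Each combination is trained once per fold; the best one is refit afterwards.
fits_during_search = len(param_maps) * num_folds
total_fits = fits_during_search + 1

for param_map in param_maps:
    print(param_map)
print("fits during search:", fits_during_search, "total:", total_fits)
```

 Because the number of training runs grows multiplicatively with each added parameter, it usually pays to keep the grid small when each XGBoost run is expensive.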
 *********************************
 Run XGBoost4J-Spark in Production
 *********************************
 XGBoost4J-Spark is one of the most important steps toward bringing XGBoost to production environments more easily. In this section, we introduce three key features for running XGBoost4J-Spark in production.
 Parallel/Distributed Training
 =============================
 The massive size of the training dataset is one of the most significant characteristics of a production environment. To ensure that training in XGBoost scales with the data size, XGBoost4J-Spark bridges the distributed/parallel processing framework of Spark and the parallel/distributed training mechanism of XGBoost.
 In XGBoost4J-Spark, each XGBoost worker is wrapped by a Spark task, and the training dataset in Spark's memory space is fed to the XGBoost workers in a way that is transparent to the user.
 In the code snippet where we build XGBoostClassifier, we set the parameter ``num_workers`` (or ``numWorkers``).
 This parameter controls how many parallel workers we want to have when training an XGBoostClassificationModel.
 .. note:: Regarding OpenMP optimization
  By default, we allocate one core to each XGBoost worker. Therefore, the OpenMP optimization within each XGBoost worker does not take effect and the parallelization of training is achieved
  by running multiple workers (i.e. Spark tasks) at the same time.
  If you do want OpenMP optimization, you have to
  1. set ``nthread`` to a value larger than 1 when creating XGBoostClassifier/XGBoostRegressor
  2. set ``spark.task.cpus`` in Spark to the same value as ``nthread``
 Gang Scheduling
 ===============
 XGBoost uses the `AllReduce <http://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/>`_
 algorithm to synchronize the stats, e.g. histogram values, of each worker during training. Therefore XGBoost4J-Spark requires that all ``nthread * numWorkers`` cores be available before training runs.
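 The resource requirement can be written down as a small Python check. This is only an arithmetic sketch of the rule stated above; the helper names ``required_cores`` and ``can_start`` are hypothetical and do not exist in XGBoost4J-Spark.

```python
def required_cores(nthread, num_workers):
    """Cores that must be free at the same time before training can start."""
    if nthread < 1 or num_workers < 1:
        raise ValueError("nthread and num_workers must be positive")
    return nthread * num_workers


def can_start(available_cores, nthread, num_workers):
    """All workers must be schedulable at once; a partial set is not enough."""
    return available_cores >= required_cores(nthread, num_workers)


# Example: 2 workers with 4 OpenMP threads each (spark.task.cpus would be 4).
needed = required_cores(nthread=4, num_workers=2)
print(needed, can_start(8, 4, 2), can_start(7, 4, 2))
```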
 In a production environment where many users share the same cluster, it's hard to guarantee that your XGBoost4J-Spark application can get all requested resources for every run. By default, the communication layer in XGBoost blocks the whole application while it waits for more resources to become available. This usually wastes resources, because the application holds on to the resources it already has while trying to claim more. Additionally, this usually happens silently and does not come to the attention of users.
 XGBoost4J-Spark allows the user to set up a timeout threshold for claiming resources from the cluster. If the application cannot get enough resources within this time period, the application fails instead of wasting resources by hanging for a long time. To enable this feature, you can set it on XGBoostClassifier/XGBoostRegressor:
 .. code-block:: scala
  xgbClassifier.setTimeoutRequestWorkers(60000L)
 or pass in ``timeout_request_workers`` in ``xgbParam`` when building XGBoostClassifier:
 .. code-block:: scala
  val xgbParam = Map("eta" -> 0.1f,
     "max_depth" -> 2,
     "objective" -> "multi:softprob",
     "num_class" -> 3,
     "num_round" -> 100,
     "num_workers" -> 2,
     "timeout_request_workers" -> 60000L)
  val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")
 If XGBoost4J-Spark cannot get enough resources to run two XGBoost workers, the application fails. Users can set up an external mechanism to monitor the application's status and be notified in that case.
 Checkpoint During Training
 ==========================
 Transient failures are also common in production environments. To simplify the design of XGBoost,
 we stop training if any of the distributed workers fails. However, if training fails after running for a long time, a lot of resources are wasted.
 We support creating checkpoints during training to make recovery from failure more efficient. To enable this feature, set the number of iterations between checkpoints with ``setCheckpointInterval`` and the location of checkpoints with ``setCheckpointPath``:
 .. code-block:: scala
  xgbClassifier.setCheckpointInterval(2)
  xgbClassifier.setCheckpointPath("/checkpoints_path")
 An equivalent way is to pass in parameters in XGBoostClassifier's constructor:
 .. code-block:: scala
  val xgbParam = Map("eta" -> 0.1f,
     "max_depth" -> 2,
     "objective" -> "multi:softprob",
     "num_class" -> 3,
     "num_round" -> 100,
     "num_workers" -> 2,
     "checkpoint_path" -> "/checkpoints_path",
     "checkpoint_interval" -> 2)
  val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")
 If training fails during these 100 rounds, the next run starts by reading the latest checkpoint file in ``/checkpoints_path`` and resumes from the iteration at which that checkpoint was built, continuing until the next failure or until the specified 100 rounds are complete.
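 As a rough illustration of how much work a checkpoint saves, the sketch below models the resume point under the simplifying assumption that a checkpoint is written after every ``checkpoint_interval`` completed rounds. ``CheckpointMath`` and its methods are hypothetical helpers, not part of XGBoost4J-Spark, and the real checkpoint schedule may differ in detail:
 .. code-block:: scala
  object CheckpointMath {
    // Last round covered by a checkpoint (assumption: one checkpoint
    // after every `interval` completed rounds).
    def lastCheckpointedRound(completedRounds: Int, interval: Int): Int = {
      require(interval > 0, "interval must be positive")
      (completedRounds / interval) * interval
    }
    // Rounds still to be trained after resuming from the latest checkpoint.
    def remainingRounds(totalRounds: Int, completedRounds: Int, interval: Int): Int =
      totalRounds - lastCheckpointedRound(completedRounds, interval)
    def main(args: Array[String]): Unit = {
      // Failure after 37 of 100 rounds with an interval of 2:
      // resume from round 36, leaving 64 rounds to train.
      println(remainingRounds(totalRounds = 100, completedRounds = 37, interval = 2))
    }
  }
 Under this model a smaller interval loses fewer rounds on failure, at the cost of writing checkpoints more often.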
--- a/doc/tutorials/index.rst
+++ b/doc/tutorials/index.rst
@@ -10,7 +10,8 @@ See `Awesome XGBoost <https://github.com/dmlc/xgboost/tree/master/demo>`_ for mo
  :caption: Contents:
  model
-  aws_yarn
+  Distributed XGBoost with AWS YARN <aws_yarn>
  Distributed XGBoost with XGBoost4J-Spark <https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html>
  dart
  monotonic
  input_format
--- a/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkMLlibPipeline.scala
+++ b/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkMLlibPipeline.scala
@@ -0,0 +1,131 @@
 /*
 Copyright (c) 2014 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
 package ml.dmlc.xgboost4j.scala.example.spark
 import org.apache.spark.ml.{Pipeline, PipelineModel}
 import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
 import org.apache.spark.ml.feature._
 import org.apache.spark.ml.tuning._
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.types._
 import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}
 // this example works with Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris)
 object SparkMLlibPipeline {
  def main(args: Array[String]): Unit = {
     if (args.length != 3) {
      println("Usage: SparkMLlibPipeline input_path native_model_path pipeline_model_path")
      sys.exit(1)
    }
    val inputPath = args(0)
    val nativeModelPath = args(1)
    val pipelineModelPath = args(2)
    val spark = SparkSession
      .builder()
      .appName("XGBoost4J-Spark Pipeline Example")
      .getOrCreate()
    // Load dataset
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
    val rawInput = spark.read.schema(schema).csv(inputPath)
    // Split training and test dataset
    val Array(training, test) = rawInput.randomSplit(Array(0.8, 0.2), 123)
    // Build ML pipeline, it includes 4 stages:
    // 1, Assemble all features into a single vector column.
    // 2, From string label to indexed double label.
    // 3, Use XGBoostClassifier to train classification model.
    // 4, Convert indexed double label back to original string label.
    val assembler = new VectorAssembler()
      .setInputCols(Array("sepal length", "sepal width", "petal length", "petal width"))
      .setOutputCol("features")
    val labelIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(training)
    val booster = new XGBoostClassifier(
      Map("eta" -> 0.1f,
        "max_depth" -> 2,
        "objective" -> "multi:softprob",
        "num_class" -> 3,
        "num_round" -> 100,
        "num_workers" -> 2
      )
    )
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("realLabel")
      .setLabels(labelIndexer.labels)
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, booster, labelConverter))
    val model = pipeline.fit(training)
    // Batch prediction
    val prediction = model.transform(test)
    prediction.show(false)
    // Model evaluation
    val evaluator = new MulticlassClassificationEvaluator()
    val accuracy = evaluator.evaluate(prediction)
    println("The model accuracy is : " + accuracy)
    // Tune model using cross validation
    val paramGrid = new ParamGridBuilder()
      .addGrid(booster.maxDepth, Array(3, 8))
      .addGrid(booster.eta, Array(0.2, 0.6))
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
    val cvModel = cv.fit(training)
    val bestModel = cvModel.bestModel.asInstanceOf[PipelineModel].stages(2)
      .asInstanceOf[XGBoostClassificationModel]
    println("The params of best XGBoostClassification model : " +
      bestModel.extractParamMap())
    println("The training summary of best XGBoostClassificationModel : " +
      bestModel.summary)
    // Export the XGBoostClassificationModel as local XGBoost model,
    // then you can load it back in local Python environment.
    bestModel.nativeBooster.saveModel(nativeModelPath)
    // ML pipeline persistence
    model.write.overwrite().save(pipelineModelPath)
     // Load a saved model and use it for serving
    val model2 = PipelineModel.load(pipelineModelPath)
    model2.transform(test).show(false)
  }
 }
--- a/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkModelTuningTool.scala
+++ b/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkModelTuningTool.scala
@@ -1,206 +0,0 @@
 /*
 Copyright (c) 2014 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
 package ml.dmlc.xgboost4j.scala.example.spark
 import scala.collection.mutable
 import scala.collection.mutable.ListBuffer
 import scala.io.Source
 import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
 import org.apache.spark.ml.Pipeline
 import org.apache.spark.ml.evaluation.RegressionEvaluator
 import org.apache.spark.ml.feature.{VectorAssembler, StringIndexer}
 import org.apache.spark.ml.tuning._
 import org.apache.spark.sql.{Dataset, DataFrame, SparkSession}
 case class SalesRecord(storeId: Int, daysOfWeek: Int, date: String, sales: Int, customers: Int,
                       open: Int, promo: Int, stateHoliday: String, schoolHoliday: String)
 case class Store(storeId: Int, storeType: String, assortment: String, competitionDistance: Int,
                 competitionOpenSinceMonth: Int, competitionOpenSinceYear: Int, promo2: Int,
                 promo2SinceWeek: Int, promo2SinceYear: Int, promoInterval: String)
 object SparkModelTuningTool {
  private def parseStoreFile(storeFilePath: String): List[Store] = {
    var isHeader = true
    val storeInstances = new ListBuffer[Store]
    for (line <- Source.fromFile(storeFilePath).getLines()) {
      if (isHeader) {
        isHeader = false
      } else {
        try {
          val strArray = line.split(",")
          if (strArray.length == 10) {
            val Array(storeIdStr, storeTypeStr, assortmentStr, competitionDistanceStr,
            competitionOpenSinceMonthStr, competitionOpenSinceYearStr, promo2Str,
            promo2SinceWeekStr, promo2SinceYearStr, promoIntervalStr) = line.split(",")
            storeInstances += Store(storeIdStr.toInt, storeTypeStr, assortmentStr,
              if (competitionDistanceStr == "") -1 else competitionDistanceStr.toInt,
              if (competitionOpenSinceMonthStr == "" ) -1 else competitionOpenSinceMonthStr.toInt,
              if (competitionOpenSinceYearStr == "" ) -1 else competitionOpenSinceYearStr.toInt,
              promo2Str.toInt,
              if (promo2Str == "0") -1 else promo2SinceWeekStr.toInt,
              if (promo2Str == "0") -1 else promo2SinceYearStr.toInt,
              promoIntervalStr.replace("\"", ""))
          } else {
            val Array(storeIdStr, storeTypeStr, assortmentStr, competitionDistanceStr,
            competitionOpenSinceMonthStr, competitionOpenSinceYearStr, promo2Str,
            promo2SinceWeekStr, promo2SinceYearStr, firstMonth, secondMonth, thirdMonth,
            forthMonth) = line.split(",")
            storeInstances += Store(storeIdStr.toInt, storeTypeStr, assortmentStr,
              if (competitionDistanceStr == "") -1 else competitionDistanceStr.toInt,
              if (competitionOpenSinceMonthStr == "" ) -1 else competitionOpenSinceMonthStr.toInt,
              if (competitionOpenSinceYearStr == "" ) -1 else competitionOpenSinceYearStr.toInt,
              promo2Str.toInt,
              if (promo2Str == "0") -1 else promo2SinceWeekStr.toInt,
              if (promo2Str == "0") -1 else promo2SinceYearStr.toInt,
              firstMonth.replace("\"", "") + "," + secondMonth + "," + thirdMonth + "," +
                forthMonth.replace("\"", ""))
          }
        } catch {
          case e: Exception =>
            e.printStackTrace()
            sys.exit(1)
        }
      }
    }
    storeInstances.toList
  }
  private def parseTrainingFile(trainingPath: String): List[SalesRecord] = {
    var isHeader = true
    val records = new ListBuffer[SalesRecord]
    for (line <- Source.fromFile(trainingPath).getLines()) {
      if (isHeader) {
        isHeader = false
      } else {
        val Array(storeIdStr, daysOfWeekStr, dateStr, salesStr, customerStr, openStr, promoStr,
        stateHolidayStr, schoolHolidayStr) = line.split(",")
        val salesRecord = SalesRecord(storeIdStr.toInt, daysOfWeekStr.toInt, dateStr,
          salesStr.toInt, customerStr.toInt, openStr.toInt, promoStr.toInt, stateHolidayStr,
          schoolHolidayStr)
        records += salesRecord
      }
    }
    records.toList
  }
  private def featureEngineering(ds: DataFrame): DataFrame = {
    import org.apache.spark.sql.functions._
    import ds.sparkSession.implicits._
    val stateHolidayIndexer = new StringIndexer()
      .setInputCol("stateHoliday")
      .setOutputCol("stateHolidayIndex")
    val schoolHolidayIndexer = new StringIndexer()
      .setInputCol("schoolHoliday")
      .setOutputCol("schoolHolidayIndex")
    val storeTypeIndexer = new StringIndexer()
      .setInputCol("storeType")
      .setOutputCol("storeTypeIndex")
    val assortmentIndexer = new StringIndexer()
      .setInputCol("assortment")
      .setOutputCol("assortmentIndex")
    val promoInterval = new StringIndexer()
      .setInputCol("promoInterval")
      .setOutputCol("promoIntervalIndex")
    val filteredDS = ds.filter($"sales" > 0).filter($"open" > 0)
    // parse date
    val dsWithDayCol =
      filteredDS.withColumn("day", udf((dateStr: String) =>
        dateStr.split("-")(2).toInt).apply(col("date")))
    val dsWithMonthCol =
      dsWithDayCol.withColumn("month", udf((dateStr: String) =>
        dateStr.split("-")(1).toInt).apply(col("date")))
    val dsWithYearCol =
      dsWithMonthCol.withColumn("year", udf((dateStr: String) =>
        dateStr.split("-")(0).toInt).apply(col("date")))
    val dsWithLogSales = dsWithYearCol.withColumn("logSales",
      udf((sales: Int) => math.log(sales)).apply(col("sales")))
    // fill with mean values
    val meanCompetitionDistance = dsWithLogSales.select(avg("competitionDistance")).first()(0).
      asInstanceOf[Double]
    println("====" + meanCompetitionDistance)
    val finalDS = dsWithLogSales.withColumn("transformedCompetitionDistance",
      udf((distance: Int) => if (distance > 0) distance.toDouble else meanCompetitionDistance).
        apply(col("competitionDistance")))
    val vectorAssembler = new VectorAssembler()
      .setInputCols(Array("storeId", "daysOfWeek", "promo", "competitionDistance", "promo2", "day",
        "month", "year", "transformedCompetitionDistance", "stateHolidayIndex",
        "schoolHolidayIndex", "storeTypeIndex", "assortmentIndex", "promoIntervalIndex"))
      .setOutputCol("features")
    val pipeline = new Pipeline().setStages(
      Array(stateHolidayIndexer, schoolHolidayIndexer, storeTypeIndexer, assortmentIndexer,
        promoInterval, vectorAssembler))
    pipeline.fit(finalDS).transform(finalDS).
      drop("stateHoliday", "schoolHoliday", "storeType", "assortment", "promoInterval", "sales",
        "promo2SinceWeek", "customers", "promoInterval", "competitionOpenSinceYear",
        "competitionOpenSinceMonth", "promo2SinceYear", "competitionDistance", "date")
  }
  private def crossValidation(
      xgboostParam: Map[String, Any],
      trainingData: Dataset[_]): TrainValidationSplitModel = {
    val xgbEstimator = new XGBoostRegressor(xgboostParam).setFeaturesCol("features").
      setLabelCol("logSales")
    val paramGrid = new ParamGridBuilder()
      .addGrid(xgbEstimator.numRound, Array(20, 50))
      .addGrid(xgbEstimator.eta, Array(0.1, 0.4))
      .build()
    val tv = new TrainValidationSplit()
      .setEstimator(xgbEstimator)
      .setEvaluator(new RegressionEvaluator().setLabelCol("logSales"))
      .setEstimatorParamMaps(paramGrid)
       .setTrainRatio(0.8)  // 80% for training, 20% for validation
    tv.fit(trainingData)
  }
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("rosseman").getOrCreate()
    import sparkSession.implicits._
    // parse training file to data frame
    val trainingPath = args(0)
    val allSalesRecords = parseTrainingFile(trainingPath)
    // create dataset
    val salesRecordsDF = allSalesRecords.toDF
    // parse store file to data frame
    val storeFilePath = args(1)
    val allStores = parseStoreFile(storeFilePath)
    val storesDS = allStores.toDF()
    val fullDataset = salesRecordsDF.join(storesDS, "storeId")
    val featureEngineeredDF = featureEngineering(fullDataset)
    // prediction
    val params = new mutable.HashMap[String, Any]()
    params += "eta" -> 0.1
    params += "max_depth" -> 6
    params += "silent" -> 1
    params += "ntreelimit" -> 1000
    params += "objective" -> "reg:linear"
    params += "subsample" -> 0.8
    params += "num_round" -> 100
    val bestModel = crossValidation(params.toMap, featureEngineeredDF)
  }
 }
--- a/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkTraining.scala
+++ b/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkTraining.scala
@@ -0,0 +1,78 @@
 /*
 Copyright (c) 2014 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
 package ml.dmlc.xgboost4j.scala.example.spark
 import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
 import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
 // this example works with Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris)
 object SparkTraining {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      // scalastyle:off
      println("Usage: program input_path")
      sys.exit(1)
    }
    val spark = SparkSession.builder().getOrCreate()
    val inputPath = args(0)
    val schema = new StructType(Array(
      StructField("sepal length", DoubleType, true),
      StructField("sepal width", DoubleType, true),
      StructField("petal length", DoubleType, true),
      StructField("petal width", DoubleType, true),
      StructField("class", StringType, true)))
     val rawInput = spark.read.schema(schema).csv(inputPath)
    // transform class to index to make xgboost happy
    val stringIndexer = new StringIndexer()
      .setInputCol("class")
      .setOutputCol("classIndex")
      .fit(rawInput)
    val labelTransformed = stringIndexer.transform(rawInput).drop("class")
    // compose all feature columns as vector
    val vectorAssembler = new VectorAssembler().
      setInputCols(Array("sepal length", "sepal width", "petal length", "petal width")).
      setOutputCol("features")
    val xgbInput = vectorAssembler.transform(labelTransformed).select("features",
      "classIndex")
    /**
      * set "timeout_request_workers" -> 60000L to make this application fail if it cannot get enough
      * resources for 2 workers within 60000 ms
      *
      * set "checkpoint_path" -> "/checkpoints" and "checkpoint_interval" -> 2 to save a checkpoint
      * every two iterations
     */
    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "multi:softprob",
      "num_class" -> 3,
      "num_round" -> 100,
      "num_workers" -> 2)
    val xgbClassifier = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("classIndex")
    val xgbClassificationModel = xgbClassifier.fit(xgbInput)
    val results = xgbClassificationModel.transform(xgbInput)
    results.show()
  }
 }
--- a/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkWithDataFrame.scala
+++ b/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkWithDataFrame.scala
@@ -1,55 +0,0 @@
 /*
 Copyright (c) 2014 by Contributors
 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at
 http://www.apache.org/licenses/LICENSE-2.0
 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 */
 package ml.dmlc.xgboost4j.scala.example.spark
 import ml.dmlc.xgboost4j.scala.Booster
 import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
 import org.apache.spark.sql.SparkSession
 import org.apache.spark.SparkConf
 object SparkWithDataFrame {
  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println(
        "usage: program num_of_rounds num_workers training_path test_path")
      sys.exit(1)
    }
    // create SparkSession
    val sparkConf = new SparkConf().setAppName("XGBoost-spark-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    sparkConf.registerKryoClasses(Array(classOf[Booster]))
    // val sqlContext = new SQLContext(new SparkContext(sparkConf))
    val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    // create training and testing dataframes
    val numRound = args(0).toInt
    val inputTrainPath = args(2)
    val inputTestPath = args(3)
    // build dataset
    val trainDF = sparkSession.sqlContext.read.format("libsvm").load(inputTrainPath)
    val testDF = sparkSession.sqlContext.read.format("libsvm").load(inputTestPath)
    // start training
    val paramMap = List(
      "eta" -> 0.1f,
      "max_depth" -> 2,
      "objective" -> "binary:logistic",
      "num_round" -> numRound,
      "num_workers" -> args(1).toInt).toMap
    val xgboostModel = new XGBoostClassifier(paramMap).fit(trainDF)
    // xgboost-spark appends the column containing prediction results
    xgboostModel.transform(testDF).show()
  }
 }