[jvm-packages] Implemented early stopping (#2710)

* Allowed subsampling test from the training data frame/RDD The implementation requires storing 1 - trainTestRatio points in memory to make the sampling work. An alternative approach would be to construct the full DMatrix and then slice it deterministically into train/test. The peak memory consumption of such scenario, however, is twice the dataset size. * Removed duplication from 'XGBoost.train' Scala callers can (and should) use names to supply a subset of parameters. Method overloading is not required. * Reuse XGBoost seed parameter to stabilize train/test splitting * Added early stopping support to non-distributed XGBoost Closes #1544 * Added early-stopping to distributed XGBoost * Moved construction of 'watches' into a separate method This commit also fixes the handling of 'baseMargin' which previously was not added to the validation matrix. * Addressed review comments
2017-09-29 21:06:22 +02:00
parent 74db9757b3
commit 69c3b78a29
15 changed files with 191 additions and 91 deletions
--- a/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostDFSuite.scala
+++ b/jvm-packages/xgboost4j-spark/src/test/scala/ml/dmlc/xgboost4j/scala/spark/XGBoostDFSuite.scala
@@ -18,6 +18,7 @@ package ml.dmlc.xgboost4j.scala.spark

 import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost => ScalaXGBoost}
 import ml.dmlc.xgboost4j.{LabeledPoint => XGBLabeledPoint}
+
 import org.apache.spark.ml.linalg.DenseVector
 import org.apache.spark.ml.param.ParamMap
 import org.apache.spark.sql._
@@ -201,7 +202,8 @@ class XGBoostDFSuite extends FunSuite with PerTest {
    val trainingDfWithMargin = trainingDf.withColumn("margin", functions.rand())
    val testRDD = sc.parallelize(Classification.test.map(_.features))
    val paramMap = Map("eta" -> "1", "max_depth" -> "6", "silent" -> "1",
-      "objective" -> "binary:logistic", "baseMarginCol" -> "margin")
+      "objective" -> "binary:logistic", "baseMarginCol" -> "margin",
+      "testTrainSplit" -> 0.5)

    def trainPredict(df: Dataset[_]): Array[Float] = {
      XGBoost.trainWithDataFrame(df, paramMap, round = 1, nWorkers = numWorkers)