[jvm-packages] Implemented early stopping (#2710)

* Allowed subsampling test from the training data frame/RDD

  The implementation requires storing 1 - trainTestRatio points in memory
  to make the sampling work.

  An alternative approach would be to construct the full DMatrix and then
  slice it deterministically into train/test. The peak memory consumption
  of such scenario, however, is twice the dataset size.

* Removed duplication from 'XGBoost.train'

  Scala callers can (and should) use names to supply a subset of
  parameters. Method overloading is not required.

* Reuse XGBoost seed parameter to stabilize train/test splitting

* Added early stopping support to non-distributed XGBoost

  Closes #1544

* Added early-stopping to distributed XGBoost

* Moved construction of 'watches' into a separate method

  This commit also fixes the handling of 'baseMargin' which previously
  was not added to the validation matrix.

* Addressed review comments
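
For readers unfamiliar with the technique, here is a minimal, library-free sketch of the early-stopping rule the message refers to. It is not the code from this commit. The names `EarlyStoppingSketch` and `roundsBeforeStop` are hypothetical, and the sketch assumes that the evaluation metric is minimized and that a value of 0 disables early stopping (the diff only shows 0 as the default). The parameter lookup mirrors the `paramMap.get("numEarlyStoppingRounds")` idiom used in the Flink change.

```scala
object EarlyStoppingSketch {

  // Mirrors the lookup in the diff: a missing key falls back to 0.
  // Treating 0 as "disabled" is an assumption of this sketch.
  def earlyStoppingRounds(params: Map[String, Any]): Int =
    params.get("numEarlyStoppingRounds").map(_.toString.toInt).getOrElse(0)

  // Hypothetical helper: given the validation metric observed after each
  // boosting round (lower is better, by assumption), return how many rounds
  // run before training halts.
  def roundsBeforeStop(metrics: Seq[Double], earlyStoppingRound: Int): Int = {
    require(earlyStoppingRound >= 0, "earlyStoppingRound must be non-negative")
    var best = Double.PositiveInfinity
    var bestRound = -1
    var round = 0
    var stopped = false
    while (round < metrics.length && !stopped) {
      if (metrics(round) < best) {
        // New best score: remember it and reset the patience window.
        best = metrics(round)
        bestRound = round
      } else if (earlyStoppingRound > 0 && round - bestRound >= earlyStoppingRound) {
        // No improvement for `earlyStoppingRound` consecutive rounds.
        stopped = true
      }
      round += 1
    }
    round
  }
}
```

With a patience of 2, a metric sequence that stops improving after its second value halts two rounds later, even if a better value would have appeared afterwards. That is the usual trade-off of early stopping: fewer rounds in exchange for possibly missing a late improvement.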
2017-09-29 21:06:22 +02:00
parent 74db9757b3
commit 69c3b78a29
15 changed files with 191 additions and 91 deletions
--- a/jvm-packages/xgboost4j-flink/src/main/scala/ml/dmlc/xgboost4j/scala/flink/XGBoost.scala
+++ b/jvm-packages/xgboost4j-flink/src/main/scala/ml/dmlc/xgboost4j/scala/flink/XGBoost.scala
@@ -55,7 +55,10 @@ object XGBoost {
      val trainMat = new DMatrix(dataIter, null)
      val watches = List("train" -> trainMat).toMap
      val round = 2
-      val booster = XGBoostScala.train(trainMat, paramMap, round, watches, null, null)
+      val numEarlyStoppingRounds = paramMap.get("numEarlyStoppingRounds")
+          .map(_.toString.toInt).getOrElse(0)
+      val booster = XGBoostScala.train(trainMat, paramMap, round, watches,
+        earlyStoppingRound = numEarlyStoppingRounds)
      Rabit.shutdown()
      collector.collect(new XGBoostModel(booster))
    }