[jvm-packages] Implemented early stopping (#2710)

* Allowed subsampling test from the training data frame/RDD

The implementation keeps the held-out test points, roughly a
(1 - trainTestRatio) fraction of the data, in memory to make the
sampling work.

An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of that approach, however, is twice the dataset size.
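
The sampling trade-off described above can be illustrated with a minimal sketch. It is not the code from this commit: `splitTrainTest` is a hypothetical helper, and the assumption is that training points are streamed through while only the held-out points are buffered.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper for illustration only (not part of XGBoost4J).
// Training points pass through lazily; held-out points are buffered in memory,
// so the buffer holds roughly a (1 - trainTestRatio) fraction of the data.
def splitTrainTest[A](
    points: Iterator[A],
    trainTestRatio: Double,
    seed: Long): (Iterator[A], ArrayBuffer[A]) = {
  val rng = new Random(seed) // a fixed seed makes the split reproducible
  val testBuffer = ArrayBuffer.empty[A]
  val trainIter = points.filter { p =>
    val keepForTraining = rng.nextDouble() <= trainTestRatio
    if (!keepForTraining) testBuffer += p
    keepForTraining
  }
  (trainIter, testBuffer)
}

val (train, test) = splitTrainTest((1 to 1000).iterator, trainTestRatio = 0.8, seed = 42L)
// The test buffer is only filled while the training iterator is consumed.
val trainPoints = train.toVector
```

Because the filter is lazy, the test buffer is complete only after the training iterator has been fully consumed.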

* Removed duplication from 'XGBoost.train'

Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.
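
As a rough illustration of the point about named parameters, the sketch below uses one method with default values in place of several overloads. The signature is hypothetical and much simpler than the real 'XGBoost.train'; only the parameter name `earlyStoppingRound` is taken from the diff in this commit.

```scala
// Hypothetical signature for illustration; the real 'XGBoost.train' takes different types.
def train(
    data: Seq[Double],
    params: Map[String, Any],
    round: Int,
    watches: Map[String, Seq[Double]] = Map.empty,
    earlyStoppingRound: Int = 0): String =
  s"rounds=$round watches=${watches.size} earlyStoppingRound=$earlyStoppingRound"

// The caller skips 'watches' and names only the parameters it wants to set.
val summary = train(Seq(1.0, 2.0), Map("eta" -> 0.1), round = 2, earlyStoppingRound = 5)
```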

* Reused the XGBoost seed parameter to stabilize train/test splitting

* Added early stopping support to non-distributed XGBoost

Closes #1544
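
For readers unfamiliar with the technique, here is a generic sketch of early stopping. It assumes a lower validation error is better and that a value of 0 disables the check; `roundsBeforeStopping` is a hypothetical function, and the exact rule used by XGBoost may differ.

```scala
// Generic early-stopping rule, not the XGBoost4J implementation.
// Returns how many rounds are run before training stops.
def roundsBeforeStopping(validationErrors: Seq[Double], earlyStoppingRound: Int): Int = {
  var best = Double.PositiveInfinity
  var bestRound = -1
  var round = 0
  var stopped = false
  while (round < validationErrors.length && !stopped) {
    val err = validationErrors(round)
    if (err < best) {
      // New best score: remember it and the round it occurred in.
      best = err
      bestRound = round
    } else if (earlyStoppingRound > 0 && round - bestRound >= earlyStoppingRound) {
      // No improvement for 'earlyStoppingRound' consecutive rounds.
      stopped = true
    }
    round += 1
  }
  round
}

val errors = Seq(0.5, 0.4, 0.41, 0.42, 0.43, 0.3)
val ranWithStopping = roundsBeforeStopping(errors, earlyStoppingRound = 2)
val ranWithoutStopping = roundsBeforeStopping(errors, earlyStoppingRound = 0)
```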

* Added early-stopping to distributed XGBoost

* Moved construction of 'watches' into a separate method

This commit also fixes the handling of 'baseMargin', which was
previously not added to the validation matrix.

* Addressed review comments
Sergei Lebedev
2017-09-29 21:06:22 +02:00
committed by Nan Zhu
parent 74db9757b3
commit 69c3b78a29
15 changed files with 191 additions and 91 deletions


@@ -55,7 +55,10 @@ object XGBoost {
 val trainMat = new DMatrix(dataIter, null)
 val watches = List("train" -> trainMat).toMap
 val round = 2
-val booster = XGBoostScala.train(trainMat, paramMap, round, watches, null, null)
+val numEarlyStoppingRounds = paramMap.get("numEarlyStoppingRounds")
+  .map(_.toString.toInt).getOrElse(0)
+val booster = XGBoostScala.train(trainMat, paramMap, round, watches,
+  earlyStoppingRound = numEarlyStoppingRounds)
 Rabit.shutdown()
 collector.collect(new XGBoostModel(booster))
 }
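
The parameter lookup in the hunk above can be tried in isolation. The sketch below assumes `paramMap` is a `Map[String, Any]`; the wrapper name `earlyStoppingRounds` is hypothetical, while the key and the expression are taken from the diff.

```scala
// Hypothetical wrapper around the lookup shown in the diff.
// Assumes the parameter map has type Map[String, Any].
def earlyStoppingRounds(paramMap: Map[String, Any]): Int =
  paramMap.get("numEarlyStoppingRounds")
    .map(_.toString.toInt) // accepts both Int values and numeric strings
    .getOrElse(0)          // 0 is the fallback when the key is missing

val fromString = earlyStoppingRounds(Map("numEarlyStoppingRounds" -> "10"))
val fromInt = earlyStoppingRounds(Map("numEarlyStoppingRounds" -> 5))
val missing = earlyStoppingRounds(Map("eta" -> 0.1))
```

Note that a value that does not parse as an integer, such as `5.0` or `"abc"`, makes `toInt` throw a `NumberFormatException`.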