* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* disable booster setup in spark
* check in parameter conversion
* fix compilation issue
* update exception type
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* maven central release
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update 0.80
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* static glibc glibc++
* update to build with glib 2.12
* remove unsupported flags
* update version number
* remove properties
* remove unnecessary command
* update poms
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* change version of jvm to keep consistent with other pkgs
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update default spark version to 2.3
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add back libsvm notes
* rank_metric: add AUC-PR
Implementation of the AUC-PR calculation for weighted data, proposed by Keilwagen, Grosse and Grau (https://doi.org/10.1371/journal.pone.0092209)
* rank_metric: fix lint warnings
* Implement tests for AUC-PR and fix implementation
* add aucpr to documentation for other languages
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* tiny fix for empty partition in predict
* further fix
* [jvm-packages] Prevent dispose being called twice when finalize
* Convert SIGSEGV to XGBoostError
* Avoid creating a new SBooster with the same JBooster
* Address CR Comments
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* fix the pattern in dev script and version mismatch
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* add dev script to update version and update versions
* add back train method but mark as deprecated
* add back train method but mark as deprecated
* fix scalastyle error
* fix scalastyle error
* update resource files
* Update SparkParallelismTracker.scala
* remove xgboost-tracker.properties
* [jvm-packages] Train Booster from an existing model
* Align Scala API with Java API
* Existing model should not load rabit checkpoint
* Address minor comments
* Implement saving temporary boosters and loading previous booster
* Add more unit tests for loadPrevBooster
* Add params to XGBoostEstimator
* (1) Move repartition out of the temp model saving loop (2) Address CR comments
* Catch a corner case of training next model with fewer rounds
* Address comments
* Refactor newly added methods into TmpBoosterManager
* Add two files which is missing in previous commit
* Rename TmpBooster to checkpoint
* [jvm-packages] Fixed test/train persistence
Prior to this patch both data sets were persisted in the same directory,
i.e. the test data replaced the training one which led to
* training on less data (since usually test < train) and
* test loss being exactly equal to the training loss.
Closes#2945.
* Cleanup file cache after the training
* Addressed review comments
* [jvm-packages] Exposed train-time evaluation metrics
They are accessible via 'XGBoostModel.summary'. The summary is not
serialized with the model and is only available after the training.
* Addressed review comments
* Extracted model-related tests into 'XGBoostModelSuite'
* Added tests for copying the 'XGBoostModel'
* [jvm-packages] Fixed a subtle bug in train/test split
Iterator.partition (naturally) assumes that the predicate is deterministic
but this is not the case for
r.nextDouble() <= trainTestRatio
therefore sometimes the DMatrix(...) call got a NoSuchElementException
and crashed the JVM due to lack of exception handling in
XGBoost4jCallbackDataIterNext.
* Make sure train/test objectives are different
In the refactor to add base margins, #2532, all of the labels were lost
when creating the dmatrix. This became obvious as metrics like ndcg
always returned 1.0 regardless of the results.
Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55
Training a model with the experimental rank:ndcg objective incorrectly
returns a Classification model. Adjust the classification check to
not recognize rank:* objectives as classification.
While writing tests for isClassificationTask also turned up that
obj_type -> regression was incorrectly identified as a classification
task so the function was slightly adjusted to pass the new tests.
* Add SparkParallelismTracker to prevent job from hanging
* Code review comments
* Code Review Comments
* Fix unit tests
* Changes and unit test to catch the corner case.
* Update documentations
* Small improvements
* cancalAllJobs is problematic with scalatest. Remove it
* Code Review Comments
* Check number of executor cores beforehand, and throw exeception if any core is lost.
* Address CR Comments
* Add missing class
* Fix flaky unit test
* Address CR comments
* Remove redundant param for TaskFailedListener
* Allowed subsampling test from the training data frame/RDD
The implementation requires storing 1 - trainTestRatio points in memory
to make the sampling work.
An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such scenario, however, is twice the dataset size.
* Removed duplication from 'XGBoost.train'
Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.
* Reuse XGBoost seed parameter to stabilize train/test splitting
* Added early stopping support to non-distributed XGBoost
Closes#1544
* Added early-stopping to distributed XGBoost
* Moved construction of 'watches' into a separate method
This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.
* Addressed review comments
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala
This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.
I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs
Note that DataFrame-based and Flink APIs are not affected by this change.
* Removed baseMargin argument in favour of the LabeledPoint field
* Do a single pass over the partition in buildDistributedBoosters
Note that there is no formal guarantee that
val repartitioned = rdd.repartition(42)
repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }
would do a single shuffle, but in practice it seems to be always the case.
* Exposed baseMargin in DataFrame-based API
* Addressed review comments
* Pass baseMargin to XGBoost.trainWithDataFrame via params
* Reverted MLLabeledPoint in Spark APIs
As discussed, baseMargin would only be supported for DataFrame-based APIs.
* Cleaned up baseMargin tests
- Removed RDD-based test, since the option is no longer exposed via
public APIs
- Changed DataFrame-based one to check that adding a margin actually
affects the prediction
* Pleased Scalastyle
* Addressed more review comments
* Pleased scalastyle again
* Fixed XGBoost.fromBaseMarginsToArray
which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
* Deduplicated DataFrame creation in XGBoostDFSuite
* Extracted dermatology.data into MultiClassification
* Moved cache cleaning to SharedSparkContext
Cache files are prefixed with appName therefore this seems to be just the
place to delete them.
* Removed redundant JMatrix calls in xgboost4j-spark
* Slightly more readable buildDenseRDD in XGBoostGeneralSuite
* Generalized train/test DataFrame construction in XGBoostDFSuite
* Changed SharedSparkContext to setup a new context per-test
Hence the new name: PerTestSparkSession :)
* Fused Utils into PerTestSparkSession
* Whitespace fix in XGBoostDFSuite
* Ensure SparkSession is always eagerly created in PerTestSparkSession
* Renamed PerTestSparkSession->PerTest
because it was doing slightly more than creating/stopping the session.