* add back train method but mark as deprecated
* fix scalastyle error
* maven central release
* [jvm-packages] XGBoost Spark integration refactor. (#3313)
* XGBoost Spark integration refactor.
* Make corresponding update for xgboost4j-example
* Address comments.
* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)
* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib
* Fix extra space.
* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)
* XGBoost Spark supports ranking with group data.
* Use Iterator.duplicate to prevent OOM.
* Update CheckpointManagerSuite.scala
* Resolve conflicts
* update 0.80
* static glibc glibc++
* update to build with glibc 2.12
* remove unsupported flags
* update version number
* remove properties
* remove unnecessary command
* update poms
* change version of jvm to keep consistent with other pkgs
* update default spark version to 2.3
* add back libsvm notes
* rank_metric: add AUC-PR
Implementation of the AUC-PR calculation for weighted data, proposed by Keilwagen, Grosse and Grau (https://doi.org/10.1371/journal.pone.0092209)
* rank_metric: fix lint warnings
* Implement tests for AUC-PR and fix implementation
* add aucpr to documentation for other languages
* tiny fix for empty partition in predict
* further fix
* [jvm-packages] Prevent dispose being called twice when finalize
* Convert SIGSEGV to XGBoostError
* Avoid creating a new SBooster with the same JBooster
* Address CR Comments
* fix the pattern in dev script and version mismatch
* add dev script to update version and update versions
* update resource files
* Update SparkParallelismTracker.scala
* remove xgboost-tracker.properties
* [jvm-packages] Train Booster from an existing model
* Align Scala API with Java API
* Existing model should not load rabit checkpoint
* Address minor comments
* Implement saving temporary boosters and loading previous booster
* Add more unit tests for loadPrevBooster
* Add params to XGBoostEstimator
* (1) Move repartition out of the temp model saving loop (2) Address CR comments
* Catch a corner case of training next model with fewer rounds
* Address comments
* Refactor newly added methods into TmpBoosterManager
* Add two files which were missing in the previous commit
* Rename TmpBooster to checkpoint
* [jvm-packages] Fixed test/train persistence
Prior to this patch both data sets were persisted in the same directory,
i.e. the test data replaced the training one which led to
* training on less data (since usually test < train) and
* test loss being exactly equal to the training loss.
Closes #2945.
* Cleanup file cache after the training
* Addressed review comments
* [jvm-packages] Exposed train-time evaluation metrics
They are accessible via 'XGBoostModel.summary'. The summary is not
serialized with the model and is only available after the training.
* Addressed review comments
* Extracted model-related tests into 'XGBoostModelSuite'
* Added tests for copying the 'XGBoostModel'
* [jvm-packages] Fixed a subtle bug in train/test split
Iterator.partition (naturally) assumes that the predicate is deterministic
but this is not the case for
r.nextDouble() <= trainTestRatio
therefore sometimes the DMatrix(...) call got a NoSuchElementException
and crashed the JVM due to lack of exception handling in
XGBoost4jCallbackDataIterNext.
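A minimal sketch of one way to make the split deterministic, not necessarily the fix applied in this patch: draw the random number once per element, store the result next to the element, and partition on the stored flag. The names DeterministicSplit, split and seed are hypothetical.

```scala
import scala.util.Random

object DeterministicSplit {
  /** Splits `it` into (train, test) using one random draw per element. */
  def split[A](
      it: Iterator[A],
      trainTestRatio: Double,
      seed: Long): (Iterator[A], Iterator[A]) = {
    val r = new Random(seed)
    // The draw happens here, exactly once per element, so the predicate
    // passed to partition below always gives the same answer for an element.
    val tagged = it.map(x => (x, r.nextDouble() <= trainTestRatio))
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }
}
```

Every input element ends up in exactly one of the two iterators, regardless of how often the partition predicate is evaluated.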
* Make sure train/test objectives are different
In the refactor to add base margins, #2532, all of the labels were lost
when creating the dmatrix. This became obvious as metrics like ndcg
always returned 1.0 regardless of the results.
Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55
Training a model with the experimental rank:ndcg objective incorrectly
returns a Classification model. Adjust the classification check to
not recognize rank:* objectives as classification.
Writing tests for isClassificationTask also revealed that
obj_type -> regression was incorrectly identified as a classification
task, so the function was adjusted slightly to pass the new tests.
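The adjusted check could look roughly like the sketch below. It is a hypothetical re-creation based only on the description above; the real isClassificationTask may use different names and cover more objectives.

```scala
object ObjectiveCheckSketch {
  // Assumption: the objective is read from "objective", falling back to
  // "obj_type". Regression and ranking objectives are not classification.
  def isClassificationTask(params: Map[String, Any]): Boolean = {
    val objective =
      params.get("objective").orElse(params.get("obj_type")).map(_.toString)
    objective.exists { obj =>
      obj != "regression" && !obj.startsWith("reg:") && !obj.startsWith("rank:")
    }
  }
}
```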
* Add SparkParallelismTracker to prevent job from hanging
* Code review comments
* Code Review Comments
* Fix unit tests
* Changes and unit test to catch the corner case.
* Update documentation
* Small improvements
* cancelAllJobs is problematic with scalatest. Remove it
* Code Review Comments
* Check number of executor cores beforehand, and throw an exception if any core is lost.
* Address CR Comments
* Add missing class
* Fix flaky unit test
* Address CR comments
* Remove redundant param for TaskFailedListener
* Allowed subsampling test from the training data frame/RDD
The implementation requires storing 1 - trainTestRatio points in memory
to make the sampling work.
An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such a scenario, however, is twice the dataset size.
* Removed duplication from 'XGBoost.train'
Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.
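For illustration, a single method with default arguments can replace a family of overloads. The signature below is hypothetical and is not the actual XGBoost.train.

```scala
object NamedArgsSketch {
  // Hypothetical parameters; only `params` and `round` are required.
  def train(
      params: Map[String, Any],
      round: Int,
      nWorkers: Int = 1,
      useExternalMemory: Boolean = false,
      missing: Float = Float.NaN): String =
    s"round=$round nWorkers=$nWorkers useExternalMemory=$useExternalMemory"
}
```

A caller overrides only what it needs, e.g. NamedArgsSketch.train(Map("eta" -> 0.1), round = 10, useExternalMemory = true).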
* Reuse XGBoost seed parameter to stabilize train/test splitting
* Added early stopping support to non-distributed XGBoost
Closes #1544
* Added early-stopping to distributed XGBoost
* Moved construction of 'watches' into a separate method
This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.
* Addressed review comments
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala
This makes it easy to integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. An alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
though the conversion in this PR implies a public API change.
I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that the group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
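A sketch of the idea, with field names and defaults that are assumptions for illustration rather than a copy of ml.dmlc.xgboost4j.LabeledPoint: default arguments cover the optional data, so no overloaded factory methods are needed, and the group travels with each point.

```scala
// Hypothetical stand-in for the Scala LabeledPoint described above.
final case class LabeledPointSketch(
    label: Float,
    indices: Array[Int],      // null for a dense vector in this sketch
    values: Array[Float],
    weight: Float = 1f,       // optional per-point weight
    group: Int = -1,          // optional per-point group id
    baseMargin: Float = Float.NaN)

object LabeledPointSketchDemo {
  // Dense point with defaults for weight, group and baseMargin.
  val dense: LabeledPointSketch = LabeledPointSketch(1f, null, Array(0.5f, 0.2f))
  // Same point with extra data attached, without any extra factory method.
  val weighted: LabeledPointSketch = dense.copy(weight = 2f, group = 3)
}
```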
* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs
Note that DataFrame-based and Flink APIs are not affected by this change.
* Removed baseMargin argument in favour of the LabeledPoint field
* Do a single pass over the partition in buildDistributedBoosters
Note that there is no formal guarantee that
val repartitioned = rdd.repartition(42)
repartitioned.zipPartitions(repartitioned.map(_ + 1)) { (it1, it2) => ... }
would do a single shuffle, but in practice it seems to be always the case.
* Exposed baseMargin in DataFrame-based API
* Addressed review comments
* Pass baseMargin to XGBoost.trainWithDataFrame via params
* Reverted MLLabeledPoint in Spark APIs
As discussed, baseMargin would only be supported for DataFrame-based APIs.
* Cleaned up baseMargin tests
- Removed RDD-based test, since the option is no longer exposed via
public APIs
- Changed DataFrame-based one to check that adding a margin actually
affects the prediction
* Pleased Scalastyle
* Addressed more review comments
* Pleased scalastyle again
* Fixed XGBoost.fromBaseMarginsToArray
which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
* Deduplicated DataFrame creation in XGBoostDFSuite
* Extracted dermatology.data into MultiClassification
* Moved cache cleaning to SharedSparkContext
Cache files are prefixed with appName therefore this seems to be just the
place to delete them.
* Removed redundant JMatrix calls in xgboost4j-spark
* Slightly more readable buildDenseRDD in XGBoostGeneralSuite
* Generalized train/test DataFrame construction in XGBoostDFSuite
* Changed SharedSparkContext to setup a new context per-test
Hence the new name: PerTestSparkSession :)
* Fused Utils into PerTestSparkSession
* Whitespace fix in XGBoostDFSuite
* Ensure SparkSession is always eagerly created in PerTestSparkSession
* Renamed PerTestSparkSession->PerTest
because it was doing slightly more than creating/stopping the session.
* [jvm-packages] Deduplicated train/test data access in tests
All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.
* Inlined Utils.buildTrainingRDD
The default number of partitions for local mode is equal to the number
of available CPUs.
* Replaced dataset names with problem types
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
* Disabled excessive Spark logging in tests
* Fixed a signature of XGBoostModel.predict
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.
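The change in shape can be sketched with plain Scala collections standing in for RDD partitions (an assumption made to keep the example free of Spark). The doubling function is a placeholder for the model.

```scala
object PredictShapeSketch {
  // Placeholder for a real prediction function.
  private def score(x: Float): Float = x * 2

  // Old contract: one array of predictions per partition.
  def perPartition(parts: Seq[Seq[Float]]): Seq[Array[Float]] =
    parts.map(_.map(score).toArray)

  // New contract: one prediction per input element, nesting removed.
  def perElement(parts: Seq[Seq[Float]]): Seq[Float] =
    parts.flatMap(_.map(score))
}
```

With the second form the output has as many elements as the input, so inputs and predictions can be matched one to one.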
* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints
* Inlined XGBoost.repartitionData
An if is more explicit than an opaque method name.
* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel
* Check the input dimension in DMatrix.setBaseMargin
Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?
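A minimal sketch of such a check, assuming the number of rows is known on the JVM side. The helper name checkedBaseMargin and the rowNum parameter are hypothetical.

```scala
object BaseMarginCheckSketch {
  // Rejects a base margin whose length does not match the row count,
  // before anything is handed to native code.
  def checkedBaseMargin(baseMargin: Array[Float], rowNum: Long): Array[Float] = {
    require(
      baseMargin.length.toLong == rowNum,
      s"base margin has ${baseMargin.length} entries, expected $rowNum")
    baseMargin
  }
}
```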
* Reduced nesting in XGBoost.buildDistributedBoosters
* Ensured consistent naming of the params map
* Cleaned up DataBatch to make it easier to comprehend
* Made scalastyle happy
* Added baseMargin to XGBoost.train and trainWithRDD
* Deprecated XGBoost.train
It is ambiguous and works only for RDDs.
* Addressed review comments
* Revert "Fixed a signature of XGBoostModel.predict"
This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.
* Addressed more review comments
* Fixed NullPointerException in buildDistributedBoosters
* Exposed prediction feature contribution on the Java side
* was not supplying the newly added argument
* Exposed from Scala-side as well
* formatting (keep declaration in one line unless exceeding 100 chars)
* [jvm-packages] Ensure the native library is loaded once
Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.
Note also that XGBoostJNI no longer suppresses an IOException if one
occurs in initXGBoost.
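The same idea in a Scala sketch: an object's initializer runs once, on first access, so callers do not need to ask a loader explicitly. NativeHolder and loadCount are hypothetical stand-ins, and no native library is loaded here.

```scala
import java.util.concurrent.atomic.AtomicInteger

object LoadOnceSketch {
  // Counts how many times the (simulated) native initialization ran.
  val loadCount = new AtomicInteger(0)

  // Stand-in for a JNI holder class such as XGBoostJNI.
  object NativeHolder {
    // Initializer body: executed once, when NativeHolder is first used.
    loadCount.incrementAndGet()

    def version(): String = "stub"
  }
}
```

Calling LoadOnceSketch.NativeHolder.version() any number of times leaves loadCount at 1.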
* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI
There was no reason for having a separate class.