* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala
This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.
I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs
Note that DataFrame-based and Flink APIs are not affected by this change.
* Removed baseMargin argument in favour of the LabeledPoint field
* Do a single pass over the partition in buildDistributedBoosters
Note that there is no formal guarantee that
val repartitioned = rdd.repartition(42)
repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }
would do a single shuffle, but in practice it seems to be always the case.
* Exposed baseMargin in DataFrame-based API
* Addressed review comments
* Pass baseMargin to XGBoost.trainWithDataFrame via params
* Reverted MLLabeledPoint in Spark APIs
As discussed, baseMargin would only be supported for DataFrame-based APIs.
* Cleaned up baseMargin tests
- Removed RDD-based test, since the option is no longer exposed via
public APIs
- Changed DataFrame-based one to check that adding a margin actually
affects the prediction
* Pleased Scalastyle
* Addressed more review comments
* Pleased scalastyle again
* Fixed XGBoost.fromBaseMarginsToArray
which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
* Disabled excessive Spark logging in tests
* Fixed a singature of XGBoostModel.predict
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.
* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints
* Inlined XGBoost.repartitionData
An if is more explicit than an opaque method name.
* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel
* Check the input dimension in DMatrix.setBaseMargin
Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?
* Reduced nesting in XGBoost.buildDistributedBoosters
* Ensured consistent naming of the params map
* Cleaned up DataBatch to make it easier to comprehend
* Made scalastyle happy
* Added baseMargin to XGBoost.train and trainWithRDD
* Deprecated XGBoost.train
It is ambiguous and work only for RDDs.
* Addressed review comments
* Revert "Fixed a singature of XGBoostModel.predict"
This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.
* Addressed more review comments
* Fixed NullPointerException in buildDistributedBoosters
* Exposed prediction feature contribution on the Java side
* was not supplying the newly added argument
* Exposed from Scala-side as well
* formatting (keep declaration in one line unless exceeding 100 chars)
* [jvm-packages] Ensure the native library is loaded once
Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.
Note also, that now XGBoostJNI would NOT suppress an IOException if it
occured in initXGBoost.
* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI
There was no reason for having a separate class.
* [jvm-packages] Fixed compilation on Windows
* [jvm-packages] Build the JNI bindings on Appveyor
* [jvm-packages] Build & test on OS X
* [jvm-packages] Re-applied the CMake build changes reverted by #2395
* Fixed Appveyor JVM build
* Muted Maven on Travis
* Don't link with libawt
* "linux2"->"linux"
Python2.x and 3.X use slightly different values for ``sys.platform``.
* [jvm-packages] Fixed JNI_OnLoad overload
It does not compile on Windows without proper export flags.
* [jvm-packages] Use JNI types directly where appropriate
* Removed lib hack from CMake build
Prior to this commit the CMake build use hardcoded lib prefix for
libxgboost and libxgboost4j. Unfortunatelly this did not play well with
Windows, which does not use the lib- prefix.
* [jvm-packages] Added libxgboost4j to CMake build
* [jvm-packages] Wired CMake build into create_jni.sh
* User newer CMake version on Travis
* Lowered CMake version constraints
* Fixed various quirks in the new CMake build
* add back train method but mark as deprecated
* fix scalastyle error
* first commit in scala binding for fast histo
* java test
* add missed scala tests
* spark training
* add back train method but mark as deprecated
* fix scalastyle error
* local change
* first commit in scala binding for fast histo
* local change
* fix df frame test
* [jvm-packages] Scala implementation of the Rabit tracker.
A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).
* [jvm-packages] Updated Akka dependency in pom.xml.
* Refactored the RabitTracker directory structure.
* Fixed premature stopping of connection handler.
Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
* Added interface IRabitTracker so that user can switch implementations.
* Default timeout duration changes.
* Dependency for Akka tests.
* Removed the main function of RabitTracker.
* A skeleton for testing Akka-based Rabit tracker.
* waitFor() in RabitTracker no longer throws exceptions.
* Completed unit test for the 'start' command of Rabit tracker.
* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)
* Fixed the default timeout duration.
* Use Java container to avoid serialization issues due to intermediate wrappers.
* Added tests for Allreduce/model training using Scala Rabit tracker.
* Added spill-over unit test for the Scala Rabit tracker.
* Fixed a typo.
* Overhaul of RabitTracker interface per code review.
- Removed methods start() waitFor() (no arguments) from IRabitTracker.
- The timeout in start(timeout) is now worker connection timeout, as tcp
socket binding timeout is less intuitive.
- Dropped time unit from start(...) and waitFor(...) methods; the default
time unit is millisecond.
- Moved random port number generation into the RabitTrackerHandler.
- Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.
* More code refactoring and comments.
* Unified timeout constants. Readable tracker status code.
* Add comments to indicate that allReduce is for tests only. Removed all other variants.
* Removed unused imports.
* Simplified signatures of training methods.
- Moved TrackerConf into parameter map.
- Changed GeneralParams so that TrackerConf becomes a standalone parameter.
- Updated test cases accordingly.
* Changed monitoring strategies.
* Reverted monitoring changes.
* Update test case for Rabit AllReduce.
* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.
* More comprehensive test cases for exception handling and worker connection timeout.
* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.
* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.
* Reverted scalastyle-config changes.
* Visibility scope change. Interface tweaks.
* Use match pattern to handle tracker_conf parameter.
* Minor clarification in JNI code.
* Clearer intent in match pattern to suppress warnings.
* Removed Future from constructor. Block in start() and waitFor() instead.
* Revert inadvertent comment changes.
* Removed debugging information.
* Updated test cases that are a bit finicky.
* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
* Fixed BufferUnderFlow bug in decoding tracker 'print' command.
* Merge conflicts resolution.
* [jvm-packages] Scala implementation of the Rabit tracker.
A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).
* [jvm-packages] Updated Akka dependency in pom.xml.
* Refactored the RabitTracker directory structure.
* Fixed premature stopping of connection handler.
Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
* Added interface IRabitTracker so that user can switch implementations.
* Default timeout duration changes.
* Dependency for Akka tests.
* Removed the main function of RabitTracker.
* A skeleton for testing Akka-based Rabit tracker.
* waitFor() in RabitTracker no longer throws exceptions.
* Completed unit test for the 'start' command of Rabit tracker.
* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)
* Fixed the default timeout duration.
* Use Java container to avoid serialization issues due to intermediate wrappers.
* Added tests for Allreduce/model training using Scala Rabit tracker.
* Added spill-over unit test for the Scala Rabit tracker.
* Fixed a typo.
* Overhaul of RabitTracker interface per code review.
- Removed methods start() waitFor() (no arguments) from IRabitTracker.
- The timeout in start(timeout) is now worker connection timeout, as tcp
socket binding timeout is less intuitive.
- Dropped time unit from start(...) and waitFor(...) methods; the default
time unit is millisecond.
- Moved random port number generation into the RabitTrackerHandler.
- Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.
* More code refactoring and comments.
* Unified timeout constants. Readable tracker status code.
* Add comments to indicate that allReduce is for tests only. Removed all other variants.
* Removed unused imports.
* Simplified signatures of training methods.
- Moved TrackerConf into parameter map.
- Changed GeneralParams so that TrackerConf becomes a standalone parameter.
- Updated test cases accordingly.
* Changed monitoring strategies.
* Reverted monitoring changes.
* Update test case for Rabit AllReduce.
* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.
* More comprehensive test cases for exception handling and worker connection timeout.
* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.
* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.
* Reverted scalastyle-config changes.
* Visibility scope change. Interface tweaks.
* Use match pattern to handle tracker_conf parameter.
* Minor clarification in JNI code.
* Clearer intent in match pattern to suppress warnings.
* Removed Future from constructor. Block in start() and waitFor() instead.
* Revert inadvertent comment changes.
* Removed debugging information.
* Updated test cases that are a bit finicky.
* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
* Changes for Mingw64 compilation to ensure long is a consistent size.
Mainly impacts the Java API which would not compile, but there may be
silent errors on Windows with large datasets before this patch (as long
is 32-bits when compiled with mingw64 even in 64-bit mode).
* Adding ifdefs to ensure it still compiles on MacOS
* Makefile and create_jni.bat changes for Windows.
* Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast
* Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md
* create dmatrix with specified missing value
* update dmlc-core
* support for predict method in spark package
repartitioning
work around
* add more elements to work around training set empty partition issue