348 Commits

Author SHA1 Message Date
Sergei Lebedev
66874f5777 [jvm-packages] Deduplicated train/test data access in tests (#2507)
* [jvm-packages] Deduplicated train/test data access in tests

All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.

* Inlined Utils.buildTrainingRDD

The default number of partitions for local mode is equal to the number
of available CPUs.

* Replaced dataset names with problem types
2017-07-12 09:13:55 -07:00
Rory Mitchell
e939192978 Cmake improvements (#2487)
* Cmake improvements
* Add google test to cmake
2017-07-06 18:05:11 +12:00
Sergei Lebedev
8ceeb32bad Fixed a signature of XGBoostModel.predict (#2476)
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
2017-07-02 21:42:46 -07:00
Sergei Lebedev
d535340459 [jvm-packages] Exposed baseMargin (#2450)
* Disabled excessive Spark logging in tests

* Fixed a singature of XGBoostModel.predict

Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.

* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints

* Inlined XGBoost.repartitionData

An if is more explicit than an opaque method name.

* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel

* Check the input dimension in DMatrix.setBaseMargin

Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?

* Reduced nesting in XGBoost.buildDistributedBoosters

* Ensured consistent naming of the params map

* Cleaned up DataBatch to make it easier to comprehend

* Made scalastyle happy

* Added baseMargin to XGBoost.train and trainWithRDD

* Deprecated XGBoost.train

It is ambiguous and work only for RDDs.

* Addressed review comments

* Revert "Fixed a singature of XGBoostModel.predict"

This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.

* Addressed more review comments

* Fixed NullPointerException in buildDistributedBoosters
2017-06-30 08:27:24 -07:00
Edi Bice
2911597f3d [jvm-packages] Expose prediction feature contribution on the Java side (#2441)
* Exposed prediction feature contribution on the Java side

* was not supplying the newly added argument

* Exposed from Scala-side as well

* formatting (keep declaration in one line unless exceeding 100 chars)
2017-06-28 13:34:51 -07:00
Sergei Lebedev
91e778c6db [jvm-packages] JNI Cosmetics (#2448)
* [jvm-packages] Ensure the native library is loaded once

Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.

Note also, that now XGBoostJNI would NOT suppress an IOException if it
occured in initXGBoost.

* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI

There was no reason for having a separate class.
2017-06-23 11:49:30 -07:00
ebernhardson
169c983b5f [jvm-packages] Release dmatrix when no longer needed (#2436)
When using xgboost4j-spark I had executors getting killed much more
often than i would expect by yarn for overrunning their memory limits,
based on the memoryOverhead provided. It looks like a significant
amount of this is because dmatrix's were being created but not released,
because they were only released when the GC decided it was time to
cleanup the references.

Rather than waiting for the GC, relesae the DMatrix's when we know
they are no longer necessary.
2017-06-22 09:20:55 -07:00
Sergei Lebedev
2cb51f7097 [jvm-packages] Another pack of build/CI improvements (#2422)
* [jvm-packages] Fixed compilation on Windows

* [jvm-packages] Build the JNI bindings on Appveyor

* [jvm-packages] Build & test on OS X

* [jvm-packages] Re-applied the CMake build changes reverted by #2395

* Fixed Appveyor JVM build

* Muted Maven on Travis

* Don't link with libawt

* "linux2"->"linux"

Python2.x and 3.X use slightly different values for ``sys.platform``.
2017-06-21 12:28:35 -07:00
Sergei Lebedev
0db37c05bd [jvm-packages] Deterministically XGBoost training on exception (#2405)
Previously the code relied on the tracker process being terminated
by the OS, which was not the case on Windows.

Closes #2394
2017-06-12 20:19:28 -07:00
PSEUDOTENSOR / Jonathan McKinney
41efe32aa5 [GPU-Plugin] Multi-GPU for grow_gpu_hist histogram method using NVIDIA NCCL. (#2395) 2017-06-12 05:06:08 +12:00
Sergei Lebedev
3820ab6a0b [jvm-packages] Minor improvements to the CMake build (#2379)
* [jvm-packages] Fixed JNI_OnLoad overload

It does not compile on Windows without proper export flags.

* [jvm-packages] Use JNI types directly where appropriate

* Removed lib hack from CMake build

Prior to this commit the CMake build use hardcoded lib prefix for
libxgboost and libxgboost4j. Unfortunatelly this did not play well with
Windows, which does not use the lib- prefix.
2017-06-09 08:25:09 -07:00
Sergei Lebedev
37c27ab8e8 [jvm-packages] Replaced create_jni.{bat,sh} with a Python version (#2383)
* [jvm-packages] Replaced create_jni.{bat,sh} with a Python version

This allows to have a single script for all platforms.

* [jvm-packages] Added all configuration options to create_jni.py
2017-06-07 21:55:45 -07:00
Sergei Lebedev
2d9052bc7d libxgboost4j is now part of the CMake build (#2373)
* [jvm-packages] Added libxgboost4j to CMake build

* [jvm-packages] Wired CMake build into create_jni.sh

* User newer CMake version on Travis

* Lowered CMake version constraints

* Fixed various quirks in the new CMake build
2017-06-03 17:14:51 -07:00
Sergei Lebedev
97abfc487a [jvm-packages] Fixed checkstyle excludes on Windows (#2370)
XGBoostJNI.java was not excluded on Windows, probably because the path
specified in 'checkstyle-suppressions.xml' used UNIX file separators.
2017-06-02 10:14:13 -07:00
Sergei Lebedev
433269c335 Minor improvements to xgboost/jvm-packages build (#2356)
* Specified 'exec-maven-plugin' version

* Changed 'create_jni.sh' to fail on error

and also report each of the executed commands, which makes it easier
to debug.
2017-05-30 17:51:27 +02:00
Nan Zhu
a607f697e3 [jvm-packages] Disable fast histo for spark (#2296)
* add back train method but mark as deprecated

* fix scalastyle error

* disable fast histogram in xgboost4j-spark temporarily
2017-05-15 20:43:16 -07:00
Sergei Lebedev
e62be19c70 Removed 'flink.suffix' and added 'flink.version' (#2277)
The former was just Scala binary tag, and the latter was hardcoded in
the 'xgboost4j-flink' POM.
2017-05-10 08:42:40 -07:00
Nan Zhu
428453f7d6 [jvm-packages] fix the persistence of XGBoostEstimator (#2265)
* add back train method but mark as deprecated

* fix scalastyle error

* fix the persistence of XGBoostEstimator

* test persistence of a complete pipeline

* fix compilation issue

* do not allow persist custom_eval and custom_obj

* fix the failed tesl
2017-05-08 21:58:06 -07:00
ebernhardson
197a9eacc5 [jvm-packages] Expose json dumps to scala (#2247)
* Add parameter passthru of format on Booster.getModelDump
2017-05-02 17:41:27 -07:00
ebernhardson
ccccf8a015 [jvm-packages] Accept groupData in spark model eval (#2244)
* Support model evaluation for ranking tasks by accepting
 groupData in XGBoostModel.eval
2017-05-02 10:03:20 -07:00
ebernhardson
d3b866e3fd [jvm-packages] Expose json formatted booster dumps (#2233) (#2234)
* Change Booster dump from XGBoosterDumpModel to XGBoosterDumpModelEx

Allows exposing multiple formatting options of model dumping.
2017-04-29 20:23:09 -07:00
Nan Zhu
392aa6d1d3 [jvm-packages] make XGBoostModel hold BoosterParams as well (#2214)
* add back train method but mark as deprecated

* fix scalastyle error

* make XGBoostModel hold BoosterParams as well
2017-04-21 08:12:50 -07:00
Nan Zhu
a837fa9620 [jvm-packages] rdds containing boosters should be cleaned once we got boosters to driver (#2183) 2017-04-11 06:12:49 -07:00
Nan Zhu
f08077606c [jvm-packages] Clean external cache (#2181)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* small fix for cleanExternalCache
2017-04-10 07:49:58 -07:00
Nan Zhu
8d8cbcc6db [jvm-packages] fixed several issues in unit tests (#2173)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix several issues in tests
2017-04-06 06:25:23 -07:00
cloverrose
288f309434 [jvm-packages] call setGroup for ranking task (#2066)
* [jvm-packages] call setGroup for ranking task

* passing groupData through xgBoostConfMap

* fix original comment position

* make groupData param

* remove groupData variable, use xgBoostConfMap directly

* set default groupData value

* add use groupData tests

* reduce rank-demo size

* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex

* add DF use groupData test

* remove unused varable
2017-03-06 15:45:06 -08:00
geoHeil
cf6b173bd7 [jvm-packages] Spark pipeline persistence (#1906)
[jvm-packages] Spark pipeline persistence
2017-03-05 18:35:37 -08:00
Xin Yin
5b54b9437c Fixed Exception handling for fragmented Rabit 'print' tracker command. Fixed unit test. (#2081) 2017-03-05 13:40:59 -08:00
Nan Zhu
ab13fd72bd [jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966)
* add back train method but mark as deprecated

* fix scalastyle error

* first commit in scala binding for fast histo

* java test

* add missed scala tests

* spark training

* add back train method but mark as deprecated

* fix scalastyle error

* local change

* first commit in scala binding for fast histo

* local change

* fix df frame test
2017-03-04 15:37:24 -08:00
Nan Zhu
ac30a0aff5 [jvm-packages][spark]Preserve num classes (#2068)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* bump spark version to 2.1

* preserve num_class issues

* fix failed test cases

* rivising

* add multi class test
2017-03-04 14:14:31 -08:00
hlsc
a92093388d [jvm-packages] fix bug doing rabit call after finalize (#2079)
[jvm-packages]fix bug doing rabit call after finalize
2017-03-02 16:46:57 -08:00
Nan Zhu
63aec12a13 [jvm-packages] Bump spark to 2.1 (#2046) 2017-02-19 08:29:35 -08:00
Nan Zhu
185fe1d645 [jvm-packages] use ML's para system to build the passed-in params to XGBoost (#2043)
* add back train method but mark as deprecated

* fix scalastyle error

* use ML's para system to build the passed-in params to XGBoost

* clean
2017-02-18 11:56:27 -08:00
DougM
acce11d3f4 fix MLlib CrossValidator issues (wrong default value configuration) #1941 (#2042) 2017-02-18 08:10:47 -08:00
Xin Yin
4fb7fdb240 [jvm-packages] Fixed java.nio.BufferUnderFlow issue in Scala Rabit tracker. (#1993)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.

* Fixed BufferUnderFlow bug in decoding tracker 'print' command.

* Merge conflicts resolution.
2017-02-04 10:20:39 -08:00
geoHeil
2250b9b6d2 [jvm-packages] try setting default profile (#1891)
* try setting default profile

* add spark pipeline persistence

* access spark session

* copy paste sparks default parameter reader

* remove unnecessary parameters, only change xml

* remove unnecessary changes 2
2017-01-31 08:33:51 -08:00
Ruimin Wang
d9584ab82e refactor duplicate evaluation implementation (#1852) 2016-12-08 20:33:40 -08:00
Xin Yin
e7fbc8591f [jvm-packages] Scala implementation of the Rabit tracker. (#1612)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
2016-12-07 06:35:42 -08:00
Alexey Grigorev
80e70c56b9 [jvm-packages] xgboost4j: publishing sources along with bins (#1797)
* xgboost4j: publishing sources along with bins

* description about building maven artifacts

* publishing scala source to local m2 as well
2016-11-21 15:02:57 -05:00
Ruimin Wang
d80cec3384 [jvm-pacakges] the first parameter in getModelDump should be featuremap path not model path (#1788)
* fix the model dump in xgboost4j example

* Modify the dump model part of scala version

* add the forgotten modelInfos
2016-11-21 08:52:26 -05:00
Nan Zhu
965091c4bb [jvm-packages] update methods in test cases to be consistent (#1780)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* update methods in test cases to be consistent

* add blank lines

* fix
2016-11-20 22:49:18 -05:00
XianXing Zhang
ce708c8e7f [jvm-packages] Leverage the Spark ml API to read DataFrame from files in LibSVM format. (#1785) 2016-11-20 21:28:03 -05:00
Nan Zhu
5217e53156 stylistic fix (#1789)
* stylistic fix

* try multiple repos

* fix

* fix
2016-11-19 22:03:10 -05:00
joandre
91b75f9b41 Fix a small typo in GeneralParams class. Change customEval parameter name from "custom_obj" to "custom_eval". (#1741) 2016-11-06 12:44:49 -05:00
Nan Zhu
6082184cd1 [jvm-packages] update API docs (#1713)
* add back train method but mark as deprecated

* fix scalastyle error

* update java doc

* update
2016-10-27 18:53:22 -07:00
Nan Zhu
d321375df5 [jvm-packages] Fix mis configure of nthread (#1709)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix mis configuration
2016-10-27 12:10:35 -04:00
Nan Zhu
f12074d355 [jvm-packages] release blog (#1706) 2016-10-26 21:35:42 -04:00
Nan Zhu
f801c22710 [jvm-packages] change class to object in examples (#1703)
* change class to object in examples

* fix compilation error
2016-10-26 14:54:56 -04:00
Nan Zhu
016ab89484 [jvm-packages] Parameter tuning tool for XGBoost (#1664) 2016-10-23 16:58:18 -04:00
Adam Pocock
445029bb82 [jvm-packages] XGBoost4j Windows fixes (#1639)
* Changes for Mingw64 compilation to ensure long is a consistent size.

Mainly impacts the Java API which would not compile, but there may be
silent errors on Windows with large datasets before this patch (as long
is 32-bits when compiled with mingw64 even in 64-bit mode).

* Adding ifdefs to ensure it still compiles on MacOS

* Makefile and create_jni.bat changes for Windows.

* Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast

* Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md
2016-10-18 08:35:25 -04:00