127 Commits

Author SHA1 Message Date
Nan Zhu
392aa6d1d3 [jvm-packages] make XGBoostModel hold BoosterParams as well (#2214)
* add back train method but mark as deprecated

* fix scalastyle error

* make XGBoostModel hold BoosterParams as well
2017-04-21 08:12:50 -07:00
Nan Zhu
a837fa9620 [jvm-packages] rdds containing boosters should be cleaned once we got boosters to driver (#2183) 2017-04-11 06:12:49 -07:00
Nan Zhu
f08077606c [jvm-packages] Clean external cache (#2181)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* small fix for cleanExternalCache
2017-04-10 07:49:58 -07:00
Nan Zhu
8d8cbcc6db [jvm-packages] fixed several issues in unit tests (#2173)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix several issues in tests
2017-04-06 06:25:23 -07:00
cloverrose
288f309434 [jvm-packages] call setGroup for ranking task (#2066)
* [jvm-packages] call setGroup for ranking task

* passing groupData through xgBoostConfMap

* fix original comment position

* make groupData param

* remove groupData variable, use xgBoostConfMap directly

* set default groupData value

* add use groupData tests

* reduce rank-demo size

* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex

* add DF use groupData test

* remove unused varable
2017-03-06 15:45:06 -08:00
geoHeil
cf6b173bd7 [jvm-packages] Spark pipeline persistence (#1906)
[jvm-packages] Spark pipeline persistence
2017-03-05 18:35:37 -08:00
Xin Yin
5b54b9437c Fixed Exception handling for fragmented Rabit 'print' tracker command. Fixed unit test. (#2081) 2017-03-05 13:40:59 -08:00
Nan Zhu
ab13fd72bd [jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966)
* add back train method but mark as deprecated

* fix scalastyle error

* first commit in scala binding for fast histo

* java test

* add missed scala tests

* spark training

* add back train method but mark as deprecated

* fix scalastyle error

* local change

* first commit in scala binding for fast histo

* local change

* fix df frame test
2017-03-04 15:37:24 -08:00
Nan Zhu
ac30a0aff5 [jvm-packages][spark]Preserve num classes (#2068)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* bump spark version to 2.1

* preserve num_class issues

* fix failed test cases

* rivising

* add multi class test
2017-03-04 14:14:31 -08:00
hlsc
a92093388d [jvm-packages] fix bug doing rabit call after finalize (#2079)
[jvm-packages]fix bug doing rabit call after finalize
2017-03-02 16:46:57 -08:00
Nan Zhu
63aec12a13 [jvm-packages] Bump spark to 2.1 (#2046) 2017-02-19 08:29:35 -08:00
Nan Zhu
185fe1d645 [jvm-packages] use ML's para system to build the passed-in params to XGBoost (#2043)
* add back train method but mark as deprecated

* fix scalastyle error

* use ML's para system to build the passed-in params to XGBoost

* clean
2017-02-18 11:56:27 -08:00
DougM
acce11d3f4 fix MLlib CrossValidator issues (wrong default value configuration) #1941 (#2042) 2017-02-18 08:10:47 -08:00
Xin Yin
4fb7fdb240 [jvm-packages] Fixed java.nio.BufferUnderFlow issue in Scala Rabit tracker. (#1993)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.

* Fixed BufferUnderFlow bug in decoding tracker 'print' command.

* Merge conflicts resolution.
2017-02-04 10:20:39 -08:00
geoHeil
2250b9b6d2 [jvm-packages] try setting default profile (#1891)
* try setting default profile

* add spark pipeline persistence

* access spark session

* copy paste sparks default parameter reader

* remove unnecessary parameters, only change xml

* remove unnecessary changes 2
2017-01-31 08:33:51 -08:00
Ruimin Wang
d9584ab82e refactor duplicate evaluation implementation (#1852) 2016-12-08 20:33:40 -08:00
Xin Yin
e7fbc8591f [jvm-packages] Scala implementation of the Rabit tracker. (#1612)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
2016-12-07 06:35:42 -08:00
Alexey Grigorev
80e70c56b9 [jvm-packages] xgboost4j: publishing sources along with bins (#1797)
* xgboost4j: publishing sources along with bins

* description about building maven artifacts

* publishing scala source to local m2 as well
2016-11-21 15:02:57 -05:00
Ruimin Wang
d80cec3384 [jvm-pacakges] the first parameter in getModelDump should be featuremap path not model path (#1788)
* fix the model dump in xgboost4j example

* Modify the dump model part of scala version

* add the forgotten modelInfos
2016-11-21 08:52:26 -05:00
Nan Zhu
965091c4bb [jvm-packages] update methods in test cases to be consistent (#1780)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* update methods in test cases to be consistent

* add blank lines

* fix
2016-11-20 22:49:18 -05:00
XianXing Zhang
ce708c8e7f [jvm-packages] Leverage the Spark ml API to read DataFrame from files in LibSVM format. (#1785) 2016-11-20 21:28:03 -05:00
Nan Zhu
5217e53156 stylistic fix (#1789)
* stylistic fix

* try multiple repos

* fix

* fix
2016-11-19 22:03:10 -05:00
joandre
91b75f9b41 Fix a small typo in GeneralParams class. Change customEval parameter name from "custom_obj" to "custom_eval". (#1741) 2016-11-06 12:44:49 -05:00
Nan Zhu
6082184cd1 [jvm-packages] update API docs (#1713)
* add back train method but mark as deprecated

* fix scalastyle error

* update java doc

* update
2016-10-27 18:53:22 -07:00
Nan Zhu
d321375df5 [jvm-packages] Fix mis configure of nthread (#1709)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix mis configuration
2016-10-27 12:10:35 -04:00
Nan Zhu
f12074d355 [jvm-packages] release blog (#1706) 2016-10-26 21:35:42 -04:00
Nan Zhu
f801c22710 [jvm-packages] change class to object in examples (#1703)
* change class to object in examples

* fix compilation error
2016-10-26 14:54:56 -04:00
Nan Zhu
016ab89484 [jvm-packages] Parameter tuning tool for XGBoost (#1664) 2016-10-23 16:58:18 -04:00
Adam Pocock
445029bb82 [jvm-packages] XGBoost4j Windows fixes (#1639)
* Changes for Mingw64 compilation to ensure long is a consistent size.

Mainly impacts the Java API which would not compile, but there may be
silent errors on Windows with large datasets before this patch (as long
is 32-bits when compiled with mingw64 even in 64-bit mode).

* Adding ifdefs to ensure it still compiles on MacOS

* Makefile and create_jni.bat changes for Windows.

* Switching XGDMatrixCreateFromCSREx JNI call to use size_t cast

* Fixing lint error, adding profile switching to jvm-packages build to make create-jni.bat get called, adding myself to Contributors.Md
2016-10-18 08:35:25 -04:00
Nan Zhu
f5c776f64f [jvm-packages] add apache maven repo url and bump up default spark version to 2.0.1 (#1650)
* add apache maven repo url and bump up default spark version to 2.0.1
2016-10-13 08:55:03 -04:00
Nan Zhu
813a53882a [jvm-packages] deprecate Flaky test (#1662)
* deprecate flaky test
2016-10-13 07:21:24 -04:00
Nan Zhu
1673bcbe7e [jvm-packages] separate classification and regression model and integrate with ML package (#1608) 2016-09-30 11:49:03 -04:00
Nan Zhu
37bc122c90 [jvm-packages] Robust dmatrix creation (#1613)
* add back train method but mark as deprecated

* robust matrix creation in jvm
2016-09-26 13:35:04 -04:00
reg.zhuce
3ee145b8dc [jvm-packages] IndexOutOfBoundsException (#1589)
ml.dmlc.xgboost4j.scala.spark.XGBoost.scala:51

values is empty when we meet it at first time, so values(0) throw an IndexOutOfBoundsException.
It should be  dVector.values(i) instead of values(i).
2016-09-20 09:13:47 -04:00
Xin Yin
7245145712 [jvm-packages] Fixed the sanity check for parameter 'nthread' against 'spark.task.cpus'. (#1582) 2016-09-16 11:31:35 -04:00
Nan Zhu
4ad648e856 [jvm-packages] predictLeaf with Dataframe (#1576)
* add back train method but mark as deprecated

* predictLeaf with Dataset

* fix

* fix
2016-09-15 06:15:47 -04:00
Nan Zhu
bb388cbb31 default eval func (#1574) 2016-09-14 13:26:16 -04:00
Nan Zhu
fb02797e2a [jvm-packages] Integration with Spark Dataframe/Dataset (#1559)
* bump up to scala 2.11

* framework of data frame integration

* test consistency between RDD and DataFrame

* order preservation

* test order preservation

* example code and fix makefile

* improve type checking

* improve APIs

* user docs

* work around travis CI's limitation on log length

* adjust test structure

* integrate with Spark -1 .x

* spark 2.x integration

* remove spark 1.x implementation but provide instructions on how to downgrade
2016-09-11 15:02:58 -04:00
Nan Zhu
6dabdd33e3 [jvm-packages] bump to next version (#1535)
* bump to next version

* fix

* fix
2016-09-01 12:18:21 -04:00
Nan Zhu
7fb3fbf577 impose shuffle when creating training RDD (#1531) 2016-08-31 07:34:10 -04:00
Nan Zhu
3f198b9fef [jvm-packages] allow training with missing values in xgboost-spark (#1525)
* allow training with missing values in xgboost-spark

* fix compilation error

* fix bug
2016-08-29 21:45:49 -04:00
Nan Zhu
74db1e8867 [jvm-packages] remove APIs with DMatrix from xgboost-spark (#1519)
* test consistency of prediction functions between DMatrix and RDD

* remove APIs with DMatrix from xgboost-spark

* fix compilation error in xgboost4j-example

* fix test cases
2016-08-28 21:25:49 -04:00
Nan Zhu
6d65aae091 [jvm-packages] test consistency of prediction functions with DMatrix and RDD (#1518)
* test consistency of prediction functions between DMatrix and RDD

* fix the failed test cases
2016-08-28 20:27:03 -04:00
Nan Zhu
d7f79255ec improve test of save/load model (#1515) 2016-08-27 17:16:22 -04:00
Nan Zhu
dc1125eb56 evaluation with RDD data (#1492) 2016-08-20 18:31:10 -04:00
Nan Zhu
582ee63e34 enable train multiple models by distinguishing stage IDs (#1493) 2016-08-20 16:37:07 -04:00
Nan Zhu
70432cac5b make IEvaluation serializable (#1487) 2016-08-19 13:12:39 -04:00
Fangzhou
a8adf16228 fix bug: doing rabit call after finalize in spark prediction phase (#1420) 2016-07-28 23:11:20 -05:00
Earthson Lu
d29edc677c fix #1377 spark-mllib scope: default => provided (#1381) 2016-07-20 23:10:49 -04:00
convexquad
313764b3be Expose predictLeaf functionality in Scala XGBoostModel (#1351) 2016-07-12 06:55:24 -04:00