37 Commits

Author SHA1 Message Date
Nan Zhu
428453f7d6 [jvm-packages] fix the persistence of XGBoostEstimator (#2265)
* add back train method but mark as deprecated

* fix scalastyle error

* fix the persistence of XGBoostEstimator

* test persistence of a complete pipeline

* fix compilation issue

* do not allow persist custom_eval and custom_obj

* fix the failed tesl
2017-05-08 21:58:06 -07:00
ebernhardson
ccccf8a015 [jvm-packages] Accept groupData in spark model eval (#2244)
* Support model evaluation for ranking tasks by accepting
 groupData in XGBoostModel.eval
2017-05-02 10:03:20 -07:00
Nan Zhu
392aa6d1d3 [jvm-packages] make XGBoostModel hold BoosterParams as well (#2214)
* add back train method but mark as deprecated

* fix scalastyle error

* make XGBoostModel hold BoosterParams as well
2017-04-21 08:12:50 -07:00
Nan Zhu
f08077606c [jvm-packages] Clean external cache (#2181)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* small fix for cleanExternalCache
2017-04-10 07:49:58 -07:00
Nan Zhu
8d8cbcc6db [jvm-packages] fixed several issues in unit tests (#2173)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* fix several issues in tests
2017-04-06 06:25:23 -07:00
cloverrose
288f309434 [jvm-packages] call setGroup for ranking task (#2066)
* [jvm-packages] call setGroup for ranking task

* passing groupData through xgBoostConfMap

* fix original comment position

* make groupData param

* remove groupData variable, use xgBoostConfMap directly

* set default groupData value

* add use groupData tests

* reduce rank-demo size

* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex

* add DF use groupData test

* remove unused varable
2017-03-06 15:45:06 -08:00
geoHeil
cf6b173bd7 [jvm-packages] Spark pipeline persistence (#1906)
[jvm-packages] Spark pipeline persistence
2017-03-05 18:35:37 -08:00
Nan Zhu
ab13fd72bd [jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966)
* add back train method but mark as deprecated

* fix scalastyle error

* first commit in scala binding for fast histo

* java test

* add missed scala tests

* spark training

* add back train method but mark as deprecated

* fix scalastyle error

* local change

* first commit in scala binding for fast histo

* local change

* fix df frame test
2017-03-04 15:37:24 -08:00
Nan Zhu
ac30a0aff5 [jvm-packages][spark]Preserve num classes (#2068)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* bump spark version to 2.1

* preserve num_class issues

* fix failed test cases

* rivising

* add multi class test
2017-03-04 14:14:31 -08:00
Nan Zhu
185fe1d645 [jvm-packages] use ML's para system to build the passed-in params to XGBoost (#2043)
* add back train method but mark as deprecated

* fix scalastyle error

* use ML's para system to build the passed-in params to XGBoost

* clean
2017-02-18 11:56:27 -08:00
Xin Yin
e7fbc8591f [jvm-packages] Scala implementation of the Rabit tracker. (#1612)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
2016-12-07 06:35:42 -08:00
Nan Zhu
965091c4bb [jvm-packages] update methods in test cases to be consistent (#1780)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* update methods in test cases to be consistent

* add blank lines

* fix
2016-11-20 22:49:18 -05:00
Nan Zhu
016ab89484 [jvm-packages] Parameter tuning tool for XGBoost (#1664) 2016-10-23 16:58:18 -04:00
Nan Zhu
813a53882a [jvm-packages] deprecate Flaky test (#1662)
* deprecate flaky test
2016-10-13 07:21:24 -04:00
Nan Zhu
1673bcbe7e [jvm-packages] separate classification and regression model and integrate with ML package (#1608) 2016-09-30 11:49:03 -04:00
Nan Zhu
4ad648e856 [jvm-packages] predictLeaf with Dataframe (#1576)
* add back train method but mark as deprecated

* predictLeaf with Dataset

* fix

* fix
2016-09-15 06:15:47 -04:00
Nan Zhu
bb388cbb31 default eval func (#1574) 2016-09-14 13:26:16 -04:00
Nan Zhu
fb02797e2a [jvm-packages] Integration with Spark Dataframe/Dataset (#1559)
* bump up to scala 2.11

* framework of data frame integration

* test consistency between RDD and DataFrame

* order preservation

* test order preservation

* example code and fix makefile

* improve type checking

* improve APIs

* user docs

* work around travis CI's limitation on log length

* adjust test structure

* integrate with Spark -1 .x

* spark 2.x integration

* remove spark 1.x implementation but provide instructions on how to downgrade
2016-09-11 15:02:58 -04:00
Nan Zhu
3f198b9fef [jvm-packages] allow training with missing values in xgboost-spark (#1525)
* allow training with missing values in xgboost-spark

* fix compilation error

* fix bug
2016-08-29 21:45:49 -04:00
Nan Zhu
74db1e8867 [jvm-packages] remove APIs with DMatrix from xgboost-spark (#1519)
* test consistency of prediction functions between DMatrix and RDD

* remove APIs with DMatrix from xgboost-spark

* fix compilation error in xgboost4j-example

* fix test cases
2016-08-28 21:25:49 -04:00
Nan Zhu
6d65aae091 [jvm-packages] test consistency of prediction functions with DMatrix and RDD (#1518)
* test consistency of prediction functions between DMatrix and RDD

* fix the failed test cases
2016-08-28 20:27:03 -04:00
Nan Zhu
d7f79255ec improve test of save/load model (#1515) 2016-08-27 17:16:22 -04:00
Nan Zhu
bd5b07873e [jvm-packages] create dmatrix with specified missing value (#1272)
* create dmatrix with specified missing value

* update dmlc-core

* support for predict method in spark package

repartitioning

work around

* add more elements to work around training set empty partition issue
2016-06-21 17:35:17 -04:00
Nan Zhu
c85b9012c6 [jvm-packages] xgboost4j-spark external memory (#1219)
* implement external memory support for XGBoost4J

* remove extra space

* enable external memory for prediction

* update doc
2016-05-22 14:01:28 -04:00
CodingCat
d8535313eb allow empty partitions 2016-03-23 12:30:06 -04:00
CodingCat
55ab1c6a22 adjust numWorkers for test 2016-03-18 10:34:36 -04:00
CodingCat
f2ef958ebb support kryo serialization 2016-03-13 11:55:14 -04:00
CodingCat
16b9e92328 force the user to set number of workers 2016-03-12 13:33:57 -05:00
CodingCat
43d7a85bc9 change the API name since we support not only HDFS and local file system 2016-03-11 10:05:32 -05:00
CodingCat
d47df5c1d8 allow the user to specify the worker number and avoid unnecessary shuffle 2016-03-10 06:58:30 -05:00
CodingCat
e0a3f1c000 nthread no larger than spark.task.cpus 2016-03-10 05:51:07 -05:00
CodingCat
909c6af330 add test resources manually 2016-03-08 18:43:30 -05:00
CodingCat
fa03aaeb63 revise current API 2016-03-08 17:18:55 -05:00
CodingCat
6499422e90 fix the merge 2016-03-06 15:22:05 -05:00
CodingCat
130ca7b00c test case for XGBoostSpark 2016-03-05 19:41:26 -05:00
CodingCat
f0647ec76d test resources 2016-03-05 18:18:07 -05:00
CodingCat
5c1af13f84 distributed in RDD 2016-03-05 17:50:40 -05:00