30 Commits

Author SHA1 Message Date
Nan Zhu
ae3bb9c2d5 Distributed Fast Histogram Algorithm (#4011)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* init

* allow hist algo

* more changes

* temp

* update

* remove hist sync

* update rabit

* change hist size

* change the histogram

* update kfactor

* sync per node stats

* temp

* update

* final

* code clean

* update rabit

* more cleanup

* fix errors

* fix failed tests

* enforce c++11

* fix lint issue

* broadcast subsampled feature correctly

* revert some changes

* fix lint issue

* enable monotone and interaction constraints

* don't specify default for monotone and interactions

* update docs
2019-02-05 05:12:53 -08:00
Shayak Banerjee
431c850c03 [jvm-packages] Updates to Java Booster to support other feature importance measures (#3801)
* Updates to Booster to support other feature importances

* Add returns for Java methods

* Pass Scala style checks

* Pass Java style checks

* Fix indents

* Use class instead of enum

* Return map string double

* A no longer broken build, thanks to mvn package local build

* Add a unit test to increase code coverage back

* Address code review on main code

* Add more unit tests for different feature importance scores

* Address more CR
2019-01-02 01:13:14 -08:00
Nan Zhu
c055a32609 [jvm-packages] support multiple validation datasets in Spark (#3910)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* wrap iterators

* enable copartition training and validationset

* add parameters

* converge code path and have init unit test

* enable multi evals for ranking

* unit test and doc

* update example

* fix early stopping

* address the offline comments

* update doc

* test eval metrics

* fix compilation issue

* fix example
2018-12-17 21:03:57 -08:00
Nan Zhu
9c4ff50e83 [jvm-packages] Fix early stopping condition (#3928)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* update version

* 0.82

* fix early stopping condition

* remove unused

* update comments

* update comments

* update test
2018-11-24 00:18:07 -08:00
Nan Zhu
4ae225a08d [Blocking][jvm-packages] fix the early stopping feature (#3808)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* temp

* add method for classifier and regressor

* update tutorial

* address the comments

* update
2018-10-23 14:53:13 -07:00
zengxy
9e73087324 [jvm-packages] support specified feature names when getModelDump and getFeatureScore (#3733)
* [jvm-packages] support specified feature names for jvm when get ModelDump and get FeatureScore (#3725)

* typo and style fix
2018-10-04 09:05:42 -07:00
Yun Ni
65fb4e3f5c [jvm-packages] Prevent dispose being called on unfinalized JBooster (#3005)
* [jvm-packages] Prevent dispose being called twice when finalize

* Convert SIGSEGV to XGBoostError

* Avoid creating a new SBooster with the same JBooster

* Address CR Comments
2018-01-06 09:46:52 -08:00
Yun Ni
9004ca03ca [jvm-packages] Saving models into a tmp folder every few rounds (#2964)
* [jvm-packages] Train Booster from an existing model

* Align Scala API with Java API

* Existing model should not load rabit checkpoint

* Address minor comments

* Implement saving temporary boosters and loading previous booster

* Add more unit tests for loadPrevBooster

* Add params to XGBoostEstimator

* (1) Move repartition out of the temp model saving loop (2) Address CR comments

* Catch a corner case of training next model with fewer rounds

* Address comments

* Refactor newly added methods into TmpBoosterManager

* Add two files which were missing in the previous commit

* Rename TmpBooster to checkpoint
2017-12-29 08:36:41 -08:00
Sergei Lebedev
69c3b78a29 [jvm-packages] Implemented early stopping (#2710)
* Allowed subsampling test from the training data frame/RDD

The implementation requires storing 1 - trainTestRatio points in memory
to make the sampling work.

An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such a scenario, however, is twice the dataset size.
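
The trade-off described above can be illustrated with a minimal sketch. This is not the code from this PR: `TrainTestSplit.split` is a hypothetical helper, and it assumes the split is a seeded per-point coin flip. Training points stream through lazily, while only the held-out points (roughly a 1 - trainTestRatio fraction) are buffered in memory.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical helper, not taken from the PR: one pass over `points`
// yields a lazy training iterator and an in-memory test buffer.
object TrainTestSplit {
  def split[T](
      points: Iterator[T],
      trainTestRatio: Double,
      seed: Long): (Iterator[T], ArrayBuffer[T]) = {
    require(trainTestRatio >= 0.0 && trainTestRatio <= 1.0)
    val rng = new Random(seed) // fixed seed makes the split reproducible
    val test = ArrayBuffer.empty[T]
    // Points that lose the coin flip are buffered as a side effect;
    // the rest flow to the consumer without being stored.
    val train = points.filter { p =>
      val keep = rng.nextDouble() < trainTestRatio
      if (!keep) test += p
      keep
    }
    (train, test)
  }
}
```

Note that in this sketch the test buffer is complete only after the training iterator has been fully consumed.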

* Removed duplication from 'XGBoost.train'

Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.

* Reuse XGBoost seed parameter to stabilize train/test splitting

* Added early stopping support to non-distributed XGBoost

Closes #1544

* Added early-stopping to distributed XGBoost

* Moved construction of 'watches' into a separate method

This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.

* Addressed review comments
2017-09-29 12:06:22 -07:00
Sergei Lebedev
771a95aec6 [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532)
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala

This makes it easy to integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. An alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
though the conversion in this PR implies a public API change.

I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.

Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.

* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs

Note that DataFrame-based and Flink APIs are not affected by this change.

* Removed baseMargin argument in favour of the LabeledPoint field

* Do a single pass over the partition in buildDistributedBoosters

Note that there is no formal guarantee that

    val repartitioned = rdd.repartition(42)
    repartitioned.zipPartitions(repartitioned.map(_ + 1)) { (it1, it2) => ... }

would do a single shuffle, but in practice it seems to be always the case.

* Exposed baseMargin in DataFrame-based API

* Addressed review comments

* Pass baseMargin to XGBoost.trainWithDataFrame via params

* Reverted MLLabeledPoint in Spark APIs

As discussed, baseMargin would only be supported for DataFrame-based APIs.

* Cleaned up baseMargin tests

- Removed RDD-based test, since the option is no longer exposed via
  public APIs
- Changed DataFrame-based one to check that adding a margin actually
  affects the prediction

* Pleased Scalastyle

* Addressed more review comments

* Pleased scalastyle again

* Fixed XGBoost.fromBaseMarginsToArray

which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
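
A minimal sketch of the intended behaviour, under the assumption that NaN is the sentinel for "no margin supplied". `BaseMargins.toArrayOption` is a hypothetical name and signature, not the actual `XGBoost.fromBaseMarginsToArray`.

```scala
// Hypothetical stand-in for the fixed behaviour: returns None when no
// point carries a margin (empty input or all NaN), otherwise the array.
object BaseMargins {
  def toArrayOption(margins: Iterator[Float]): Option[Array[Float]] = {
    val collected = margins.toArray
    if (collected.isEmpty || collected.forall(_.isNaN)) None
    else Some(collected)
  }
}
```

Returning an `Option` lets the caller skip setting a base margin altogether instead of passing an array of NaNs downstream.
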
2017-08-10 14:29:26 -07:00
Philip Cho
03e213c7cd Fix documentation for a misspelled parameter (#2569) 2017-08-02 21:50:09 +12:00
ebernhardson
d3b866e3fd [jvm-packages] Expose json formatted booster dumps (#2233) (#2234)
* Change Booster dump from XGBoosterDumpModel to XGBoosterDumpModelEx

Allows exposing multiple formatting options of model dumping.
2017-04-29 20:23:09 -07:00
Xin Yin
5b54b9437c Fixed Exception handling for fragmented Rabit 'print' tracker command. Fixed unit test. (#2081) 2017-03-05 13:40:59 -08:00
Nan Zhu
ab13fd72bd [jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966)
* add back train method but mark as deprecated

* fix scalastyle error

* first commit in scala binding for fast histo

* java test

* add missed scala tests

* spark training

* add back train method but mark as deprecated

* fix scalastyle error

* local change

* first commit in scala binding for fast histo

* local change

* fix df frame test
2017-03-04 15:37:24 -08:00
Xin Yin
4fb7fdb240 [jvm-packages] Fixed java.nio.BufferUnderFlow issue in Scala Rabit tracker. (#1993)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangeable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.
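
The idea of an explicit waiting state can be sketched without Akka as a plain transition function. Only the state name `AwaitingPortNumber` comes from this commit; the other states, the events, and `ConnectionHandler.next` are hypothetical names used for illustration.

```scala
sealed trait ConnState
case object AwaitingCommand extends ConnState    // hypothetical earlier state
case object AwaitingPortNumber extends ConnState // state named in this commit
case object Closed extends ConnState             // hypothetical terminal state

sealed trait Event
case object CommandHandled extends Event
final case class PortReceived(port: Int) extends Event

object ConnectionHandler {
  // The connection is closed only after the worker's port has arrived;
  // any other (state, event) pair leaves the state unchanged.
  def next(state: ConnState, event: Event): ConnState = (state, event) match {
    case (AwaitingCommand, CommandHandled)     => AwaitingPortNumber
    case (AwaitingPortNumber, PortReceived(_)) => Closed
    case (s, _)                                => s
  }
}
```

Because `Closed` is reachable only from `AwaitingPortNumber`, the handler in this sketch cannot tear down the connection before the worker has sent its port.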

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed the no-argument methods start() and waitFor() from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.

* Fixed BufferUnderFlow bug in decoding tracker 'print' command.

* Merge conflicts resolution.
2017-02-04 10:20:39 -08:00
Xin Yin
e7fbc8591f [jvm-packages] Scala implementation of the Rabit tracker. (#1612)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangeable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed the no-argument methods start() and waitFor() from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.
2016-12-07 06:35:42 -08:00
Nan Zhu
37bc122c90 [jvm-packages] Robust dmatrix creation (#1613)
* add back train method but mark as deprecated

* robust matrix creation in jvm
2016-09-26 13:35:04 -04:00
Nan Zhu
bd5b07873e [jvm-packages] create dmatrix with specified missing value (#1272)
* create dmatrix with specified missing value

* update dmlc-core

* support for predict method in spark package

repartitioning

work around

* add more elements to work around training set empty partition issue
2016-06-21 17:35:17 -04:00
CodingCat
400b1faecc adjust the API signature as well as the docs 2016-03-11 15:22:44 -05:00
CodingCat
fa03aaeb63 revise current API 2016-03-08 17:18:55 -05:00
tqchen
56f7a414d1 [JVM] Refactor, add filesys API 2016-03-06 11:33:48 -08:00
tqchen
514df14baf [JVM] Add LabeledPoint read support
fix
2016-03-05 13:36:33 -08:00
tqchen
e8560c7909 [refactor] move java package to namespace java 2016-03-05 14:04:13 -05:00
tqchen
ae969a0e69 [refactor] move java package to namespace java 2016-03-05 14:00:04 -05:00
CodingCat
1540773340 sketch of xgboost-spark
chooseBestBooster shall be in Boosters

remove tracker.py

rename XGBoost

remove cross-validation
2016-03-05 08:44:55 -05:00
tqchen
86871d4be9 [JVM] Add Iterator loading API 2016-03-04 17:37:46 -08:00
CodingCat
e3dc67c6a0 apply google-java-style indentation and impose import orders.... 2016-03-03 12:59:18 -05:00
CodingCat
5e309f1ce8 add test cases for Scala API 2016-03-02 15:24:13 -05:00
CodingCat
f8fff6c6fc rename files/packages 2016-03-01 23:48:35 -05:00
CodingCat
3b246c2420 re-structure Java API, add Scala API and consolidate the names of Java/Scala API 2016-03-01 20:53:41 -05:00