xgboost

Author	SHA1	Message	Date
Nan Zhu	f66731181f	Update 0.8 version num (#3358 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * update 0.80	2018-06-02 07:06:01 -07:00
Nan Zhu	e1f57b4417	[jvm-packages] scripts to cross-build and deploy artifacts to github (#3276 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * cross building files * update * build with docker * remove * temp * update build script * update pom * update * update version * upload build * fix path * update README.md * fix compiler version to 4.8.5	2018-04-28 07:41:30 -07:00
Nan Zhu	25b2919c44	[jvm-packages] change version of jvm to keep consistent with other pkgs (#3253 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * change version of jvm to keep consistent with other pkgs	2018-04-19 20:48:50 -07:00
Nan Zhu	9747ea2acb	[jvm-packages] fix the pattern in dev script and version mismatch (#3009 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * fix the pattern in dev script and version mismatch	2018-01-06 06:59:38 -08:00
Nan Zhu	14c6392381	[jvm-packages] add dev script to update version and update versions (#2998 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * add dev script to update version and update versions	2018-01-01 21:28:53 -08:00
Sergei Lebedev	69c3b78a29	[jvm-packages] Implemented early stopping (#2710 ) * Allowed subsampling test from the training data frame/RDD The implementation requires storing 1 - trainTestRatio points in memory to make the sampling work. An alternative approach would be to construct the full DMatrix and then slice it deterministically into train/test. The peak memory consumption of such scenario, however, is twice the dataset size. * Removed duplication from 'XGBoost.train' Scala callers can (and should) use names to supply a subset of parameters. Method overloading is not required. * Reuse XGBoost seed parameter to stabilize train/test splitting * Added early stopping support to non-distributed XGBoost Closes #1544 * Added early-stopping to distributed XGBoost * Moved construction of 'watches' into a separate method This commit also fixes the handling of 'baseMargin' which previously was not added to the validation matrix. * Addressed review comments	2017-09-29 12:06:22 -07:00
Sergei Lebedev	771a95aec6	[jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532 ) * Converted ml.dmlc.xgboost4j.LabeledPoint to Scala This allows to easily integrate LabeledPoint with Spark DataFrame APIs, which support encoding/decoding case classes out of the box. Alternative solution would be to keep LabeledPoint in Java and make it a Bean by generating boilerplate getters/setters. I have decided against that, even thought the conversion in this PR implies a public API change. I also had to remove the factory methods fromSparseVector and fromDenseVector because a) they would need to be duplicated to support overloaded calls with extra data (e.g. weight); and b) Scala would expose them via mangled $.MODULE$ which looks ugly in Java. Additionally, this commit makes it possible to switch to LabeledPoint in all public APIs and effectively to pass initial margin/group as part of the point. This seems to be the only reliable way of implementing distributed learning with these data. Note that group size format used by single-node XGBoost is not compatible with that scenario, since the partition split could divide a group into two chunks. * Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs Note that DataFrame-based and Flink APIs are not affected by this change. * Removed baseMargin argument in favour of the LabeledPoint field * Do a single pass over the partition in buildDistributedBoosters Note that there is no formal guarantee that val repartitioned = rdd.repartition(42) repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... } would do a single shuffle, but in practice it seems to be always the case. * Exposed baseMargin in DataFrame-based API * Addressed review comments * Pass baseMargin to XGBoost.trainWithDataFrame via params * Reverted MLLabeledPoint in Spark APIs As discussed, baseMargin would only be supported for DataFrame-based APIs. * Cleaned up baseMargin tests - Removed RDD-based test, since the option is no longer exposed via public APIs - Changed DataFrame-based one to check that adding a margin actually affects the prediction * Pleased Scalastyle * Addressed more review comments * Pleased scalastyle again * Fixed XGBoost.fromBaseMarginsToArray which always returned an array of NaNs even if base margin was not specified. Surprisingly this only failed a few tests.	2017-08-10 14:29:26 -07:00
Sergei Lebedev	e62be19c70	Removed 'flink.suffix' and added 'flink.version' (#2277 ) The former was just Scala binary tag, and the latter was hardcoded in the 'xgboost4j-flink' POM.	2017-05-10 08:42:40 -07:00
Xin Yin	e7fbc8591f	[jvm-packages] Scala implementation of the Rabit tracker. (#1612 ) * [jvm-packages] Scala implementation of the Rabit tracker. A Scala implementation of RabitTracker that is interface-interchangable with the Java implementation, ported from `tracker.py` in the [dmlc-core project](https://github.com/dmlc/dmlc-core). * [jvm-packages] Updated Akka dependency in pom.xml. * Refactored the RabitTracker directory structure. * Fixed premature stopping of connection handler. Added a new finite state "AwaitingPortNumber" to explicitly wait for the worker to send the port, and close the connection. Stopping the actor prematurely sends a TCP RST to the worker, causing the worker to crash on AssertionError. * Added interface IRabitTracker so that user can switch implementations. * Default timeout duration changes. * Dependency for Akka tests. * Removed the main function of RabitTracker. * A skeleton for testing Akka-based Rabit tracker. * waitFor() in RabitTracker no longer throws exceptions. * Completed unit test for the 'start' command of Rabit tracker. * Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.) * Fixed the default timeout duration. * Use Java container to avoid serialization issues due to intermediate wrappers. * Added tests for Allreduce/model training using Scala Rabit tracker. * Added spill-over unit test for the Scala Rabit tracker. * Fixed a typo. * Overhaul of RabitTracker interface per code review. - Removed methods start() waitFor() (no arguments) from IRabitTracker. - The timeout in start(timeout) is now worker connection timeout, as tcp socket binding timeout is less intuitive. - Dropped time unit from start(...) and waitFor(...) methods; the default time unit is millisecond. - Moved random port number generation into the RabitTrackerHandler. - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit. * More code refactoring and comments. * Unified timeout constants. Readable tracker status code. * Add comments to indicate that allReduce is for tests only. Removed all other variants. * Removed unused imports. * Simplified signatures of training methods. - Moved TrackerConf into parameter map. - Changed GeneralParams so that TrackerConf becomes a standalone parameter. - Updated test cases accordingly. * Changed monitoring strategies. * Reverted monitoring changes. * Update test case for Rabit AllReduce. * Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers. * More comprehensive test cases for exception handling and worker connection timeout. * Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case. * Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code. * Reverted scalastyle-config changes. * Visibility scope change. Interface tweaks. * Use match pattern to handle tracker_conf parameter. * Minor clarification in JNI code. * Clearer intent in match pattern to suppress warnings. * Removed Future from constructor. Block in start() and waitFor() instead. * Revert inadvertent comment changes. * Removed debugging information. * Updated test cases that are a bit finicky. * Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.	2016-12-07 06:35:42 -08:00
Nan Zhu	fb02797e2a	[jvm-packages] Integration with Spark Dataframe/Dataset (#1559 ) * bump up to scala 2.11 * framework of data frame integration * test consistency between RDD and DataFrame * order preservation * test order preservation * example code and fix makefile * improve type checking * improve APIs * user docs * work around travis CI's limitation on log length * adjust test structure * integrate with Spark -1 .x * spark 2.x integration * remove spark 1.x implementation but provide instructions on how to downgrade	2016-09-11 15:02:58 -04:00
Nan Zhu	6dabdd33e3	[jvm-packages] bump to next version (#1535 ) * bump to next version * fix * fix	2016-09-01 12:18:21 -04:00
tqchen	90f7220736	[FLINK] remove nWorker from API	2016-03-14 16:18:35 -07:00
CodingCat	16b9e92328	force the user to set number of workers	2016-03-12 13:33:57 -05:00
CodingCat	400b1faecc	adjust the API signature as well as the docs	2016-03-11 15:22:44 -05:00
CodingCat	ab68a0ccc7	fix examples	2016-03-11 13:57:03 -05:00
CodingCat	aca0096b33	more updates for Flink more fix	2016-03-11 10:15:49 -05:00
tqchen	c05c5bc7bc	[DOC-JVM] Refactor JVM docs	2016-03-06 20:42:01 -08:00
tqchen	56f7a414d1	[JVM] Refactor, add filesys API	2016-03-06 11:33:48 -08:00
tqchen	99dc311f6d	[FLINK] Make runnable flink	2016-03-05 17:59:22 -08:00
tqchen	ac8e950227	Add Labeled Point, minor fix build	2016-03-05 12:12:43 -08:00
tqchen	e8560c7909	[refactor] move java package to namespace java	2016-03-05 14:04:13 -05:00
tqchen	ae969a0e69	[refactor] move java package to namespace java	2016-03-05 14:00:04 -05:00
tqchen	81dbf564a4	[Flink] Check	2016-03-05 13:57:26 -05:00

23 Commits