xgboost

Author	SHA1	Message	Date
Nan Zhu	f66731181f	Update 0.8 version num (#3358) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * update 0.80	2018-06-02 07:06:01 -07:00
Nan Zhu	e1f57b4417	[jvm-packages] scripts to cross-build and deploy artifacts to github (#3276) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * cross building files * update * build with docker * remove * temp * update build script * update pom * update * update version * upload build * fix path * update README.md * fix compiler version to 4.8.5	2018-04-28 07:41:30 -07:00
Yanbo Liang	4850f67b85	Fix broken link for xgboost-spark example. (#3275)	2018-04-26 06:45:01 -07:00
Nan Zhu	25b2919c44	[jvm-packages] change version of jvm to keep consistent with other pkgs (#3253) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * change version of jvm to keep consistent with other pkgs	2018-04-19 20:48:50 -07:00
Nan Zhu	14c6392381	[jvm-packages] add dev script to update version and update versions (#2998) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * add dev script to update version and update versions	2018-01-01 21:28:53 -08:00
Sergei Lebedev	69c3b78a29	[jvm-packages] Implemented early stopping (#2710) * Allowed subsampling test from the training data frame/RDD The implementation requires storing 1 - trainTestRatio points in memory to make the sampling work. An alternative approach would be to construct the full DMatrix and then slice it deterministically into train/test. The peak memory consumption of such a scenario, however, is twice the dataset size. * Removed duplication from 'XGBoost.train' Scala callers can (and should) use names to supply a subset of parameters. Method overloading is not required. * Reuse XGBoost seed parameter to stabilize train/test splitting * Added early stopping support to non-distributed XGBoost Closes #1544 * Added early-stopping to distributed XGBoost * Moved construction of 'watches' into a separate method This commit also fixes the handling of 'baseMargin' which previously was not added to the validation matrix. * Addressed review comments	2017-09-29 12:06:22 -07:00
Sergei Lebedev	771a95aec6	[jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532) * Converted ml.dmlc.xgboost4j.LabeledPoint to Scala This makes it easy to integrate LabeledPoint with Spark DataFrame APIs, which support encoding/decoding case classes out of the box. An alternative solution would be to keep LabeledPoint in Java and make it a Bean by generating boilerplate getters/setters. I have decided against that, even though the conversion in this PR implies a public API change. I also had to remove the factory methods fromSparseVector and fromDenseVector because a) they would need to be duplicated to support overloaded calls with extra data (e.g. weight); and b) Scala would expose them via mangled $.MODULE$ which looks ugly in Java. Additionally, this commit makes it possible to switch to LabeledPoint in all public APIs and effectively to pass initial margin/group as part of the point. This seems to be the only reliable way of implementing distributed learning with these data. Note that group size format used by single-node XGBoost is not compatible with that scenario, since the partition split could divide a group into two chunks. * Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs Note that DataFrame-based and Flink APIs are not affected by this change. * Removed baseMargin argument in favour of the LabeledPoint field * Do a single pass over the partition in buildDistributedBoosters Note that there is no formal guarantee that val repartitioned = rdd.repartition(42) repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... } would do a single shuffle, but in practice it seems to be always the case. * Exposed baseMargin in DataFrame-based API * Addressed review comments * Pass baseMargin to XGBoost.trainWithDataFrame via params * Reverted MLLabeledPoint in Spark APIs As discussed, baseMargin would only be supported for DataFrame-based APIs. * Cleaned up baseMargin tests - Removed RDD-based test, since the option is no longer exposed via public APIs - Changed DataFrame-based one to check that adding a margin actually affects the prediction * Pleased Scalastyle * Addressed more review comments * Pleased scalastyle again * Fixed XGBoost.fromBaseMarginsToArray which always returned an array of NaNs even if base margin was not specified. Surprisingly this only failed a few tests.	2017-08-10 14:29:26 -07:00
Sergei Lebedev	d535340459	[jvm-packages] Exposed baseMargin (#2450) * Disabled excessive Spark logging in tests * Fixed a signature of XGBoostModel.predict Prior to this commit XGBoostModel.predict produced an RDD with an array of predictions for each partition, effectively changing the shape wrt the input RDD. A more natural contract for prediction API is that given an RDD it returns a new RDD with the same number of elements. This allows the users to easily match inputs with predictions. This commit removes one layer of nesting in XGBoostModel.predict output. Even though the change is clearly non-backward compatible, I still think it is well justified. * Removed boxing in XGBoost.fromDenseToSparseLabeledPoints * Inlined XGBoost.repartitionData An if is more explicit than an opaque method name. * Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel * Check the input dimension in DMatrix.setBaseMargin Prior to this commit providing an array of incorrect dimensions would have resulted in memory corruption. Maybe backport this to C++? * Reduced nesting in XGBoost.buildDistributedBoosters * Ensured consistent naming of the params map * Cleaned up DataBatch to make it easier to comprehend * Made scalastyle happy * Added baseMargin to XGBoost.train and trainWithRDD * Deprecated XGBoost.train It is ambiguous and works only for RDDs. * Addressed review comments * Revert "Fixed a signature of XGBoostModel.predict" This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4. * Addressed more review comments * Fixed NullPointerException in buildDistributedBoosters	2017-06-30 08:27:24 -07:00
Xin Yin	e7fbc8591f	[jvm-packages] Scala implementation of the Rabit tracker. (#1612) * [jvm-packages] Scala implementation of the Rabit tracker. A Scala implementation of RabitTracker that is interface-interchangeable with the Java implementation, ported from `tracker.py` in the [dmlc-core project](https://github.com/dmlc/dmlc-core). * [jvm-packages] Updated Akka dependency in pom.xml. * Refactored the RabitTracker directory structure. * Fixed premature stopping of connection handler. Added a new finite state "AwaitingPortNumber" to explicitly wait for the worker to send the port, and close the connection. Stopping the actor prematurely sends a TCP RST to the worker, causing the worker to crash on AssertionError. * Added interface IRabitTracker so that user can switch implementations. * Default timeout duration changes. * Dependency for Akka tests. * Removed the main function of RabitTracker. * A skeleton for testing Akka-based Rabit tracker. * waitFor() in RabitTracker no longer throws exceptions. * Completed unit test for the 'start' command of Rabit tracker. * Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.) * Fixed the default timeout duration. * Use Java container to avoid serialization issues due to intermediate wrappers. * Added tests for Allreduce/model training using Scala Rabit tracker. * Added spill-over unit test for the Scala Rabit tracker. * Fixed a typo. * Overhaul of RabitTracker interface per code review. - Removed methods start() and waitFor() (no arguments) from IRabitTracker. - The timeout in start(timeout) is now worker connection timeout, as tcp socket binding timeout is less intuitive. - Dropped time unit from start(...) and waitFor(...) methods; the default time unit is millisecond. - Moved random port number generation into the RabitTrackerHandler. - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit. * More code refactoring and comments. * Unified timeout constants. Readable tracker status code. * Add comments to indicate that allReduce is for tests only. Removed all other variants. * Removed unused imports. * Simplified signatures of training methods. - Moved TrackerConf into parameter map. - Changed GeneralParams so that TrackerConf becomes a standalone parameter. - Updated test cases accordingly. * Changed monitoring strategies. * Reverted monitoring changes. * Update test case for Rabit AllReduce. * Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers. * More comprehensive test cases for exception handling and worker connection timeout. * Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case. * Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code. * Reverted scalastyle-config changes. * Visibility scope change. Interface tweaks. * Use match pattern to handle tracker_conf parameter. * Minor clarification in JNI code. * Clearer intent in match pattern to suppress warnings. * Removed Future from constructor. Block in start() and waitFor() instead. * Revert inadvertent comment changes. * Removed debugging information. * Updated test cases that are a bit finicky. * Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.	2016-12-07 06:35:42 -08:00
Ruimin Wang	d80cec3384	[jvm-packages] the first parameter in getModelDump should be featuremap path not model path (#1788) * fix the model dump in xgboost4j example * Modify the dump model part of scala version * add the forgotten modelInfos	2016-11-21 08:52:26 -05:00
XianXing Zhang	ce708c8e7f	[jvm-packages] Leverage the Spark ml API to read DataFrame from files in LibSVM format. (#1785)	2016-11-20 21:28:03 -05:00
Nan Zhu	6082184cd1	[jvm-packages] update API docs (#1713) * add back train method but mark as deprecated * fix scalastyle error * update java doc * update	2016-10-27 18:53:22 -07:00
Nan Zhu	f12074d355	[jvm-packages] release blog (#1706)	2016-10-26 21:35:42 -04:00
Nan Zhu	f801c22710	[jvm-packages] change class to object in examples (#1703) * change class to object in examples * fix compilation error	2016-10-26 14:54:56 -04:00
Nan Zhu	016ab89484	[jvm-packages] Parameter tuning tool for XGBoost (#1664)	2016-10-23 16:58:18 -04:00
Nan Zhu	1673bcbe7e	[jvm-packages] separate classification and regression model and integrate with ML package (#1608)	2016-09-30 11:49:03 -04:00
Nan Zhu	fb02797e2a	[jvm-packages] Integration with Spark Dataframe/Dataset (#1559) * bump up to scala 2.11 * framework of data frame integration * test consistency between RDD and DataFrame * order preservation * test order preservation * example code and fix makefile * improve type checking * improve APIs * user docs * work around travis CI's limitation on log length * adjust test structure * integrate with Spark 1.x * spark 2.x integration * remove spark 1.x implementation but provide instructions on how to downgrade	2016-09-11 15:02:58 -04:00
Nan Zhu	6dabdd33e3	[jvm-packages] bump to next version (#1535) * bump to next version * fix * fix	2016-09-01 12:18:21 -04:00
Nan Zhu	74db1e8867	[jvm-packages] remove APIs with DMatrix from xgboost-spark (#1519) * test consistency of prediction functions between DMatrix and RDD * remove APIs with DMatrix from xgboost-spark * fix compilation error in xgboost4j-example * fix test cases	2016-08-28 21:25:49 -04:00
Earthson Lu	d29edc677c	fix #1377 spark-mllib scope: default => provided (#1381)	2016-07-20 23:10:49 -04:00
Rahul	f14c160f4f	[jvm-packages][xgboost4j-spark][Minor] Move sparkContext dependency from the XGBoostModel (#1335) * Move sparkContext dependency from the XGBoostModel * Update Spark example to declare SparkContext as implicit	2016-07-08 06:43:33 -04:00
Nan Zhu	c85b9012c6	[jvm-packages] xgboost4j-spark external memory (#1219) * implement external memory support for XGBoost4J * remove extra space * enable external memory for prediction * update doc	2016-05-22 14:01:28 -04:00
tqchen	90f7220736	[FLINK] remove nWorker from API	2016-03-14 16:18:35 -07:00
CodingCat	f2ef958ebb	support kryo serialization	2016-03-13 11:55:14 -04:00
CodingCat	16b9e92328	force the user to set number of workers	2016-03-12 13:33:57 -05:00
CodingCat	400b1faecc	adjust the API signature as well as the docs	2016-03-11 15:22:44 -05:00
CodingCat	ab68a0ccc7	fix examples	2016-03-11 13:57:03 -05:00
CodingCat	aca0096b33	more updates for Flink more fix	2016-03-11 10:15:49 -05:00
CodingCat	43d7a85bc9	change the API name since we support not only HDFS and local file system	2016-03-11 10:05:32 -05:00
CodingCat	4e86c8c866	fix typo in README	2016-03-09 17:22:19 -05:00
CodingCat	7e30ada8c1	update README	2016-03-09 13:05:08 -05:00
CodingCat	c9830cd8b1	remove spark/flink examples	2016-03-09 12:31:35 -05:00
CodingCat	8cfa752fa0	add scala examples	2016-03-09 12:31:35 -05:00
CodingCat	a08cc8aad4	allow the user to define how many workers they need	2016-03-08 18:46:53 -05:00
CodingCat	fa03aaeb63	revise current API	2016-03-08 17:18:55 -05:00
tqchen	435a0425b9	[Spark] Refactor train, predict, add save	2016-03-06 21:51:08 -08:00
tqchen	c05c5bc7bc	[DOC-JVM] Refactor JVM docs	2016-03-06 20:42:01 -08:00

37 Commits