xgboost

Author	SHA1	Message	Date
Sergei Lebedev	8e141427aa	[jvm-packages] Exposed train-time evaluation metrics (#2836 ) * [jvm-packages] Exposed train-time evaluation metrics They are accessible via 'XGBoostModel.summary'. The summary is not serialized with the model and is only available after the training. * Addressed review comments * Extracted model-related tests into 'XGBoostModelSuite' * Added tests for copying the 'XGBoostModel' * [jvm-packages] Fixed a subtle bug in train/test split Iterator.partition (naturally) assumes that the predicate is deterministic but this is not the case for r.nextDouble() <= trainTestRatio therefore sometimes the DMatrix(...) call got a NoSuchElementException and crashed the JVM due to lack of exception handling in XGBoost4jCallbackDataIterNext. * Make sure train/test objectives are different	2017-11-20 22:21:54 +01:00
ebernhardson	78d0bd6c9d	[jvm-packages] Repair spark model eval (#2841 ) In the refactor to add base margins, #2532, all of the labels were lost when creating the dmatrix. This became obvious as metrics like ndcg always returned 1.0 regardless of the results. Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55	2017-11-04 23:28:47 +01:00
Seth Hendrickson	a8f670d247	[jvm-packages] Add some documentation to xgboost4j-spark plus minor style edits (#2823 ) * add scala docs to several methods * indentation * license formatting * clarify distributed boosters * address some review comments * reduce doc lengths * change method name, clarify doc * reset make config * delete most comments * more review feedback	2017-11-02 13:16:02 -07:00
ebernhardson	46f2b820f1	[jvm-packages] Objectives starting with rank: are never classification (#2837 ) Training a model with the experimental rank:ndcg objective incorrectly returns a Classification model. Adjust the classification check to not recognize rank:* objectives as classification. While writing tests for isClassificationTask also turned up that obj_type -> regression was incorrectly identified as a classification task so the function was slightly adjusted to pass the new tests.	2017-10-30 17:36:03 +01:00
Yun Ni	b678e1711d	[jvm-packages] Add SparkParallelismTracker to prevent job from hanging (#2697 ) * Add SparkParallelismTracker to prevent job from hanging * Code review comments * Code Review Comments * Fix unit tests * Changes and unit test to catch the corner case. * Update documentations * Small improvements * cancelAllJobs is problematic with scalatest. Remove it * Code Review Comments * Check number of executor cores beforehand, and throw exception if any core is lost. * Address CR Comments * Add missing class * Fix flaky unit test * Address CR comments * Remove redundant param for TaskFailedListener	2017-10-16 20:18:47 -07:00
Sergei Lebedev	69c3b78a29	[jvm-packages] Implemented early stopping (#2710 ) * Allowed subsampling test from the training data frame/RDD The implementation requires storing 1 - trainTestRatio points in memory to make the sampling work. An alternative approach would be to construct the full DMatrix and then slice it deterministically into train/test. The peak memory consumption of such scenario, however, is twice the dataset size. * Removed duplication from 'XGBoost.train' Scala callers can (and should) use names to supply a subset of parameters. Method overloading is not required. * Reuse XGBoost seed parameter to stabilize train/test splitting * Added early stopping support to non-distributed XGBoost Closes #1544 * Added early-stopping to distributed XGBoost * Moved construction of 'watches' into a separate method This commit also fixes the handling of 'baseMargin' which previously was not added to the validation matrix. * Addressed review comments	2017-09-29 12:06:22 -07:00
Sergei Lebedev	d570337262	[jvm-packages] (xgboost-spark) preserving num_class across save & load (#2742 ) * [bugfix] (xgboost-spark) preserving num_class across save & load * add testcase for save & load of multiclass model	2017-09-24 16:03:30 +02:00
Sergei Lebedev	39adba51c5	Fixed compilation on Scala 2.10 (#2629 )	2017-08-28 10:59:39 -07:00
Yun Ni	a00157543d	Support instance weights for xgboost4j-spark (#2642 ) * Support instance weights for xgboost4j-spark * Use 0.001 instead of 0 for weights * Address CR comments	2017-08-28 09:03:20 -07:00
Sergei Lebedev	771a95aec6	[jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532 ) * Converted ml.dmlc.xgboost4j.LabeledPoint to Scala This makes it easy to integrate LabeledPoint with Spark DataFrame APIs, which support encoding/decoding case classes out of the box. Alternative solution would be to keep LabeledPoint in Java and make it a Bean by generating boilerplate getters/setters. I have decided against that, even though the conversion in this PR implies a public API change. I also had to remove the factory methods fromSparseVector and fromDenseVector because a) they would need to be duplicated to support overloaded calls with extra data (e.g. weight); and b) Scala would expose them via mangled $.MODULE$ which looks ugly in Java. Additionally, this commit makes it possible to switch to LabeledPoint in all public APIs and effectively to pass initial margin/group as part of the point. This seems to be the only reliable way of implementing distributed learning with these data. Note that group size format used by single-node XGBoost is not compatible with that scenario, since the partition split could divide a group into two chunks. * Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs Note that DataFrame-based and Flink APIs are not affected by this change. * Removed baseMargin argument in favour of the LabeledPoint field * Do a single pass over the partition in buildDistributedBoosters Note that there is no formal guarantee that val repartitioned = rdd.repartition(42) repartitioned.zipPartitions(repartitioned.map(_ + 1)) { (it1, it2) => ... } would do a single shuffle, but in practice it seems to be always the case. * Exposed baseMargin in DataFrame-based API * Addressed review comments * Pass baseMargin to XGBoost.trainWithDataFrame via params * Reverted MLLabeledPoint in Spark APIs As discussed, baseMargin would only be supported for DataFrame-based APIs. * Cleaned up baseMargin tests - Removed RDD-based test, since the option is no longer exposed via public APIs - Changed DataFrame-based one to check that adding a margin actually affects the prediction * Pleased Scalastyle * Addressed more review comments * Pleased scalastyle again * Fixed XGBoost.fromBaseMarginsToArray which always returned an array of NaNs even if base margin was not specified. Surprisingly this only failed a few tests.	2017-08-10 14:29:26 -07:00
Sergei Lebedev	4eb255262f	[jvm-packages] More brooming in tests (#2517 ) * Deduplicated DataFrame creation in XGBoostDFSuite * Extracted dermatology.data into MultiClassification * Moved cache cleaning to SharedSparkContext Cache files are prefixed with appName therefore this seems to be just the place to delete them. * Removed redundant JMatrix calls in xgboost4j-spark * Slightly more readable buildDenseRDD in XGBoostGeneralSuite * Generalized train/test DataFrame construction in XGBoostDFSuite * Changed SharedSparkContext to setup a new context per-test Hence the new name: PerTestSparkSession :) * Fused Utils into PerTestSparkSession * Whitespace fix in XGBoostDFSuite * Ensure SparkSession is always eagerly created in PerTestSparkSession * Renamed PerTestSparkSession->PerTest because it was doing slightly more than creating/stopping the session.	2017-07-18 13:08:48 -07:00
Sergei Lebedev	66874f5777	[jvm-packages] Deduplicated train/test data access in tests (#2507 ) * [jvm-packages] Deduplicated train/test data access in tests All datasets are now available via a unified API, e.g. Agaricus.test. The only exception is the dermatology data which requires parsing a CSV file. * Inlined Utils.buildTrainingRDD The default number of partitions for local mode is equal to the number of available CPUs. * Replaced dataset names with problem types	2017-07-12 09:13:55 -07:00
Sergei Lebedev	8ceeb32bad	Fixed a signature of XGBoostModel.predict (#2476 ) Prior to this commit XGBoostModel.predict produced an RDD with an array of predictions for each partition, effectively changing the shape wrt the input RDD. A more natural contract for prediction API is that given an RDD it returns a new RDD with the same number of elements. This allows the users to easily match inputs with predictions. This commit removes one layer of nesting in XGBoostModel.predict output. Even though the change is clearly non-backward compatible, I still think it is well justified. See discussion in 06bd5dca for motivation.	2017-07-02 21:42:46 -07:00
Sergei Lebedev	d535340459	[jvm-packages] Exposed baseMargin (#2450 ) * Disabled excessive Spark logging in tests * Fixed a signature of XGBoostModel.predict Prior to this commit XGBoostModel.predict produced an RDD with an array of predictions for each partition, effectively changing the shape wrt the input RDD. A more natural contract for prediction API is that given an RDD it returns a new RDD with the same number of elements. This allows the users to easily match inputs with predictions. This commit removes one layer of nesting in XGBoostModel.predict output. Even though the change is clearly non-backward compatible, I still think it is well justified. * Removed boxing in XGBoost.fromDenseToSparseLabeledPoints * Inlined XGBoost.repartitionData An if is more explicit than an opaque method name. * Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel * Check the input dimension in DMatrix.setBaseMargin Prior to this commit providing an array of incorrect dimensions would have resulted in memory corruption. Maybe backport this to C++? * Reduced nesting in XGBoost.buildDistributedBoosters * Ensured consistent naming of the params map * Cleaned up DataBatch to make it easier to comprehend * Made scalastyle happy * Added baseMargin to XGBoost.train and trainWithRDD * Deprecated XGBoost.train It is ambiguous and works only for RDDs. * Addressed review comments * Revert "Fixed a singature of XGBoostModel.predict" This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4. * Addressed more review comments * Fixed NullPointerException in buildDistributedBoosters	2017-06-30 08:27:24 -07:00
ebernhardson	169c983b5f	[jvm-packages] Release dmatrix when no longer needed (#2436 ) When using xgboost4j-spark I had executors getting killed by yarn much more often than I would expect for overrunning their memory limits, based on the memoryOverhead provided. It looks like a significant amount of this is because DMatrix instances were being created but not released, because they were only released when the GC decided it was time to clean up the references. Rather than waiting for the GC, release the DMatrix instances when we know they are no longer necessary.	2017-06-22 09:20:55 -07:00
Sergei Lebedev	0db37c05bd	[jvm-packages] Deterministically XGBoost training on exception (#2405 ) Previously the code relied on the tracker process being terminated by the OS, which was not the case on Windows. Closes #2394	2017-06-12 20:19:28 -07:00
Nan Zhu	a607f697e3	[jvm-packages] Disable fast histo for spark (#2296 ) * add back train method but mark as deprecated * fix scalastyle error * disable fast histogram in xgboost4j-spark temporarily	2017-05-15 20:43:16 -07:00
Nan Zhu	428453f7d6	[jvm-packages] fix the persistence of XGBoostEstimator (#2265 ) * add back train method but mark as deprecated * fix scalastyle error * fix the persistence of XGBoostEstimator * test persistence of a complete pipeline * fix compilation issue * do not allow persist custom_eval and custom_obj * fix the failed test	2017-05-08 21:58:06 -07:00
ebernhardson	ccccf8a015	[jvm-packages] Accept groupData in spark model eval (#2244 ) * Support model evaluation for ranking tasks by accepting groupData in XGBoostModel.eval	2017-05-02 10:03:20 -07:00
Nan Zhu	392aa6d1d3	[jvm-packages] make XGBoostModel hold BoosterParams as well (#2214 ) * add back train method but mark as deprecated * fix scalastyle error * make XGBoostModel hold BoosterParams as well	2017-04-21 08:12:50 -07:00
Nan Zhu	a837fa9620	[jvm-packages] rdds containing boosters should be cleaned once we got boosters to driver (#2183 )	2017-04-11 06:12:49 -07:00
Nan Zhu	f08077606c	[jvm-packages] Clean external cache (#2181 ) * add back train method but mark as deprecated * fix scalastyle error * change class to object in examples * fix compilation error * small fix for cleanExternalCache	2017-04-10 07:49:58 -07:00
Nan Zhu	8d8cbcc6db	[jvm-packages] fixed several issues in unit tests (#2173 ) * add back train method but mark as deprecated * fix scalastyle error * change class to object in examples * fix compilation error * fix several issues in tests	2017-04-06 06:25:23 -07:00
cloverrose	288f309434	[jvm-packages] call setGroup for ranking task (#2066 ) * [jvm-packages] call setGroup for ranking task * passing groupData through xgBoostConfMap * fix original comment position * make groupData param * remove groupData variable, use xgBoostConfMap directly * set default groupData value * add use groupData tests * reduce rank-demo size * use TaskContext.getPartitionId() instead of mapPartitionsWithIndex * add DF use groupData test * remove unused variable	2017-03-06 15:45:06 -08:00
geoHeil	cf6b173bd7	[jvm-packages] Spark pipeline persistence (#1906 ) [jvm-packages] Spark pipeline persistence	2017-03-05 18:35:37 -08:00
Nan Zhu	ab13fd72bd	[jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966 ) * add back train method but mark as deprecated * fix scalastyle error * first commit in scala binding for fast histo * java test * add missed scala tests * spark training * add back train method but mark as deprecated * fix scalastyle error * local change * first commit in scala binding for fast histo * local change * fix df frame test	2017-03-04 15:37:24 -08:00
Nan Zhu	ac30a0aff5	[jvm-packages][spark]Preserve num classes (#2068 ) * add back train method but mark as deprecated * fix scalastyle error * change class to object in examples * fix compilation error * bump spark version to 2.1 * preserve num_class issues * fix failed test cases * revising * add multi class test	2017-03-04 14:14:31 -08:00
hlsc	a92093388d	[jvm-packages] fix bug doing rabit call after finalize (#2079 ) [jvm-packages]fix bug doing rabit call after finalize	2017-03-02 16:46:57 -08:00
Nan Zhu	185fe1d645	[jvm-packages] use ML's para system to build the passed-in params to XGBoost (#2043 ) * add back train method but mark as deprecated * fix scalastyle error * use ML's para system to build the passed-in params to XGBoost * clean	2017-02-18 11:56:27 -08:00
DougM	acce11d3f4	fix MLlib CrossValidator issues (wrong default value configuration) #1941 (#2042 )	2017-02-18 08:10:47 -08:00
Ruimin Wang	d9584ab82e	refactor duplicate evaluation implementation (#1852 )	2016-12-08 20:33:40 -08:00
Xin Yin	e7fbc8591f	[jvm-packages] Scala implementation of the Rabit tracker. (#1612 ) * [jvm-packages] Scala implementation of the Rabit tracker. A Scala implementation of RabitTracker that is interface-interchangeable with the Java implementation, ported from `tracker.py` in the [dmlc-core project](https://github.com/dmlc/dmlc-core). * [jvm-packages] Updated Akka dependency in pom.xml. * Refactored the RabitTracker directory structure. * Fixed premature stopping of connection handler. Added a new finite state "AwaitingPortNumber" to explicitly wait for the worker to send the port, and close the connection. Stopping the actor prematurely sends a TCP RST to the worker, causing the worker to crash on AssertionError. * Added interface IRabitTracker so that user can switch implementations. * Default timeout duration changes. * Dependency for Akka tests. * Removed the main function of RabitTracker. * A skeleton for testing Akka-based Rabit tracker. * waitFor() in RabitTracker no longer throws exceptions. * Completed unit test for the 'start' command of Rabit tracker. * Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.) * Fixed the default timeout duration. * Use Java container to avoid serialization issues due to intermediate wrappers. * Added tests for Allreduce/model training using Scala Rabit tracker. * Added spill-over unit test for the Scala Rabit tracker. * Fixed a typo. * Overhaul of RabitTracker interface per code review. - Removed methods start() waitFor() (no arguments) from IRabitTracker. - The timeout in start(timeout) is now worker connection timeout, as tcp socket binding timeout is less intuitive. - Dropped time unit from start(...) and waitFor(...) methods; the default time unit is millisecond. - Moved random port number generation into the RabitTrackerHandler. - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit. * More code refactoring and comments. * Unified timeout constants. Readable tracker status code. * Add comments to indicate that allReduce is for tests only. Removed all other variants. * Removed unused imports. * Simplified signatures of training methods. - Moved TrackerConf into parameter map. - Changed GeneralParams so that TrackerConf becomes a standalone parameter. - Updated test cases accordingly. * Changed monitoring strategies. * Reverted monitoring changes. * Update test case for Rabit AllReduce. * Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers. * More comprehensive test cases for exception handling and worker connection timeout. * Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case. * Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code. * Reverted scalastyle-config changes. * Visibility scope change. Interface tweaks. * Use match pattern to handle tracker_conf parameter. * Minor clarification in JNI code. * Clearer intent in match pattern to suppress warnings. * Removed Future from constructor. Block in start() and waitFor() instead. * Revert inadvertent comment changes. * Removed debugging information. * Updated test cases that are a bit finicky. * Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.	2016-12-07 06:35:42 -08:00
Nan Zhu	965091c4bb	[jvm-packages] update methods in test cases to be consistent (#1780 ) * add back train method but mark as deprecated * fix scalastyle error * change class to object in examples * fix compilation error * update methods in test cases to be consistent * add blank lines * fix	2016-11-20 22:49:18 -05:00
joandre	91b75f9b41	Fix a small typo in GeneralParams class. Change customEval parameter name from "custom_obj" to "custom_eval". (#1741 )	2016-11-06 12:44:49 -05:00
Nan Zhu	6082184cd1	[jvm-packages] update API docs (#1713 ) * add back train method but mark as deprecated * fix scalastyle error * update java doc * update	2016-10-27 18:53:22 -07:00
Nan Zhu	d321375df5	[jvm-packages] Fix mis configure of nthread (#1709 ) * add back train method but mark as deprecated * fix scalastyle error * change class to object in examples * fix compilation error * fix mis configuration	2016-10-27 12:10:35 -04:00
Nan Zhu	016ab89484	[jvm-packages] Parameter tuning tool for XGBoost (#1664 )	2016-10-23 16:58:18 -04:00
Nan Zhu	813a53882a	[jvm-packages] deprecate Flaky test (#1662 ) * deprecate flaky test	2016-10-13 07:21:24 -04:00
Nan Zhu	1673bcbe7e	[jvm-packages] separate classification and regression model and integrate with ML package (#1608 )	2016-09-30 11:49:03 -04:00
reg.zhuce	3ee145b8dc	[jvm-packages] IndexOutOfBoundsException (#1589 ) ml.dmlc.xgboost4j.scala.spark.XGBoost.scala:51 values is empty when we meet it at first time, so values(0) throw an IndexOutOfBoundsException. It should be dVector.values(i) instead of values(i).	2016-09-20 09:13:47 -04:00
Xin Yin	7245145712	[jvm-packages] Fixed the sanity check for parameter 'nthread' against 'spark.task.cpus'. (#1582 )	2016-09-16 11:31:35 -04:00
Nan Zhu	4ad648e856	[jvm-packages] predictLeaf with Dataframe (#1576 ) * add back train method but mark as deprecated * predictLeaf with Dataset * fix * fix	2016-09-15 06:15:47 -04:00
Nan Zhu	bb388cbb31	default eval func (#1574 )	2016-09-14 13:26:16 -04:00
Nan Zhu	fb02797e2a	[jvm-packages] Integration with Spark Dataframe/Dataset (#1559 ) * bump up to scala 2.11 * framework of data frame integration * test consistency between RDD and DataFrame * order preservation * test order preservation * example code and fix makefile * improve type checking * improve APIs * user docs * work around travis CI's limitation on log length * adjust test structure * integrate with Spark -1 .x * spark 2.x integration * remove spark 1.x implementation but provide instructions on how to downgrade	2016-09-11 15:02:58 -04:00
Nan Zhu	6dabdd33e3	[jvm-packages] bump to next version (#1535 ) * bump to next version * fix * fix	2016-09-01 12:18:21 -04:00
Nan Zhu	7fb3fbf577	impose shuffle when creating training RDD (#1531 )	2016-08-31 07:34:10 -04:00
Nan Zhu	3f198b9fef	[jvm-packages] allow training with missing values in xgboost-spark (#1525 ) * allow training with missing values in xgboost-spark * fix compilation error * fix bug	2016-08-29 21:45:49 -04:00
Nan Zhu	74db1e8867	[jvm-packages] remove APIs with DMatrix from xgboost-spark (#1519 ) * test consistency of prediction functions between DMatrix and RDD * remove APIs with DMatrix from xgboost-spark * fix compilation error in xgboost4j-example * fix test cases	2016-08-28 21:25:49 -04:00
Nan Zhu	6d65aae091	[jvm-packages] test consistency of prediction functions with DMatrix and RDD (#1518 ) * test consistency of prediction functions between DMatrix and RDD * fix the failed test cases	2016-08-28 20:27:03 -04:00
Nan Zhu	d7f79255ec	improve test of save/load model (#1515 )	2016-08-27 17:16:22 -04:00

243 Commits