xgboost

Author	SHA1	Message	Date
Shaochen Shi	71197d1dfa	[jvm-packages] Fix wrong method name `setAllowZeroForMissingValue`. (#5740 ) * Allow non-zero for missing value when training. * Fix wrong method names. * Add a unit test * Move the getter/setter unit test to MissingValueHandlingSuite Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>	2020-08-01 17:16:42 -07:00
Jiaming Yuan	75b8c22b0b	Fix prediction heuristic (#5955 ) * Relax check for prediction. * Relax test in spark test. * Add tests in C++.	2020-07-29 19:24:07 +08:00
Bobby Wang	8943eb4314	[BLOCKING] [jvm-packages] add gpu_hist and enable gpu scheduling (#5171 ) * [jvm-packages] add gpu_hist tree method * change updater hist to grow_quantile_histmaker * add gpu scheduling * pass correct parameters to xgboost library * remove debug info * add use.cuda for pom * add CI for gpu_hist for jvm * add gpu unit tests * use gpu node to build jvm * use nvidia-docker * Add CLI interface to create_jni.py using argparse Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>	2020-07-26 21:53:24 -07:00
Philip Hyunsu Cho	487ab0ce73	[BLOCKING] Handle empty rows in data iterators correctly (#5929 ) * [jvm-packages] Handle empty rows in data iterators correctly * Fix clang-tidy error * last empty row * Add comments [skip ci] Co-authored-by: Nan Zhu <nanzhu@uber.com>	2020-07-25 13:46:19 -07:00
Bobby Wang	9f85e92602	[jvm-packages] update spark dependency to 3.0.0 (#5836 )	2020-07-12 20:58:30 -07:00
Bobby Wang	ad826e913f	[jvm-packages]add feature size for LabelPoint and DataBatch (#5303 ) * fix type error * Validate number of features. * resolve comments * add feature size for LabelPoint and DataBatch * pass the feature size to native * move feature size validating tests into a separate suite * resolve comments Co-authored-by: fis <jm.yuan@outlook.com>	2020-04-07 16:49:52 -07:00
Nan Zhu	d7b45fbcaf	[jvm-packages] do not use multiple jobs to make checkpoints (#5082 ) * temp * temp * tep * address the comments * fix stylistic issues * fix * external checkpoint	2020-02-01 19:36:39 -08:00
Philip Hyunsu Cho	37fdfa03f8	[jvm-packages] Comply with scala style convention + fix broken unit test (#5134 ) * Fix scala style check * fix messed unit test	2019-12-18 17:26:58 -08:00
cpfarrell	bc9d88259f	[jvm-packages] Allow for bypassing spark missing value check (#4805 ) * Allow for bypassing spark missing value check * Update documentation for dealing with missing values in spark xgboost	2019-12-18 10:48:20 -08:00
Chen Qin	b29b8c2f34	[jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4966 ) * [phase 1] expose sets of rabit configurations to spark layer * add back mutable import * disable ring_mincount till https://github.com/dmlc/rabit/pull/106d * Revert "disable ring_mincount till https://github.com/dmlc/rabit/pull/106d" This reverts commit 65e95a98e24f5eb53c6ba9ef9b2379524258984d. * apply latest rabit * fix build error * apply https://github.com/dmlc/xgboost/pull/4880 * downgrade cmake in rabit * point to rabit with DMLC_ROOT fix * relative path of rabit install prefix * split rabit parameters to another trait * misc * misc * Delete .classpath * Delete .classpath * Delete .classpath * Update XGBoostClassifier.scala * Update XGBoostRegressor.scala * Update GeneralParams.scala * Update GeneralParams.scala * Update GeneralParams.scala * Update GeneralParams.scala * Delete .classpath * Update RabitParams.scala * Update .gitignore * Update .gitignore * apply rabitParams to training * use string as rabit parameter value type * cleanup * add rabitEnv check * point to dmlc/rabit * per feedback * update private scope * misc * update rabit * add rabit_timtout, fix failing test. * split tests * allow build jvm with rabit mock * pass mock failures to rabit with test * add mock error and graceful handle rabit assertion error test * split mvn test * remove sign for test * update rabit * build jvm_packages with rabit mock * point back to dmlc/rabit * per feedback, update scala header * cleanup pom * per feedback * try fix lint * fix lint * per feedback, remove bootstrap_cache * per feedback 2 * try replace dev profile with passing mvn property * fix build error * remove mvn property and replace with env setting to build test jar * per feedback * revert copyright headlines, point to dmlc/rabit * revert python lint * remove multiple failure test case as retry is not enabled in spark * Update core.py * Update core.py * per feedback, style fix	2019-11-01 14:21:19 -07:00
Jiaming Yuan	010b8f1428	Revert "[jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4876 )" (#4965 ) This reverts commit 86ed01c4bbecef66e1bc4d02fb13116bd6130fae.	2019-10-18 14:02:35 -07:00
Chen Qin	86ed01c4bb	[jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4876 ) * Expose sets of rabit configurations to spark layer	2019-10-18 15:07:31 -04:00
Nan Zhu	fc8c9b0521	[jvm-packages] enable deterministic repartitioning when checkpoint is enabled (#4807 ) * do reparititoning in DataUtil * keep previous behavior of partitioning without checkpoint * deterministic repartitioning * change	2019-09-19 15:21:05 -07:00
Nan Zhu	7b5cbcc846	[jvm-packages] cleaning checkpoint file after a successful training (#4754 ) * cleaning checkpoint file after a successful file * address comments	2019-08-14 10:57:47 -07:00
Oleksandr Pryimak	b68de018b8	[jvm-packages] jvm test should clean up after themselfs (#4706 )	2019-08-04 14:09:11 -07:00
Nan Zhu	01b0c9047c	[jvm-packages] allowing chaining prediction (#4667 ) * add test for chaining prediction * update rabit * Update XGBoostGeneralSuite.scala	2019-07-17 08:50:27 -07:00
Rong Ou	30204b50fe	fix spark tests on machines with many cores (#4634 )	2019-07-07 16:02:56 -07:00
Nan Zhu	abffbe014e	[jvm-packages] delete all constraints from spark layer about obj and eval metrics and handle error in jvm layer (#4560 ) * temp * prediction part * remove supported* * add for test * fix param name * add rabit * update rabit * return value of rabit init * eliminate compilation warnings * update rabit * shutdown * update rabit again * check sparkcontext shutdown * fix logic * sleep * fix tests * test with relaxed threshold * create new thread each time * stop for job quitting * udpate rabit * update rabit * update rabit * update git modules	2019-06-27 08:47:37 -07:00
Xu Xiao	797ba8e72d	[jvm-packages] fix compatibility problem of spark version (#4411 ) * fix compatibility problem of spark version on MissingValueHandlingSuite.scala * call setHandleInvalid by runtime reflection	2019-04-30 09:13:05 -07:00
Nan Zhu	253fdd8a42	[jvm-packages] fix the split of input (#4417 )	2019-04-29 18:52:40 -07:00
Nan Zhu	37dc82c3ff	[jvm-packages] allow partial evaluation of dataframe before prediction (#4407 ) * allow partial evaluation of dataframe before prediction * resume spark test * comments * Run unit tests after building JVM packages	2019-04-26 21:02:40 -07:00
Nan Zhu	995698b0cb	[BREAKING][jvm-packages] fix the non-zero missing value handling (#4349 ) * fix the nan and non-zero missing value handling * fix nan handling part * add missing value * Update MissingValueHandlingSuite.scala * Update MissingValueHandlingSuite.scala * stylistic fix	2019-04-26 11:10:33 -07:00
Xu Xiao	2d875ec019	[BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction (#4388 ) * [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBooostModel.transformInternal * apply minibatch in prediction * an iterator-compatible minibatch prediction * regressor impl * continuous working on mini-batch prediction of xgboost4j-spark * Update Booster.java	2019-04-26 11:09:20 -07:00
Nan Zhu	65db8d0626	[jvm-packages] support spark 2.4 and compatibility test with previous xgboost version (#4377 ) * bump spark version * keep float.nan * handle brokenly changed name/value * add test * add model files * add model files * update doc	2019-04-17 11:33:13 -07:00
Nan Zhu	ad4de0d718	[jvm-packages] handle NaN as missing value explicitly (#4309 ) * handle nan * handle nan explicitly * make code better and handle sparse vector in spark * Update XGBoostGeneralSuite.scala	2019-03-30 19:34:26 +08:00
Nan Zhu	359ed9c5bc	[jvm-packages] add configuration flag to control whether to cache transformed training set (#4268 ) * control whether to cache data * uncache	2019-03-18 10:13:28 +08:00
Jiaming Yuan	29a1356669	Deprecate `reg:linear' in favor of `reg:squarederror'. (#4267 ) * Deprecate `reg:linear' in favor of `reg:squarederror'. * Replace the use of `reg:linear'. * Replace the use of `silent`.	2019-03-17 17:55:04 +08:00
Nan Zhu	c18a3660fa	Separate Depthwidth and Lossguide growing policy in fast histogram (#4102 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * fix scalastyle error * fix scalastyle error * init * more changes * temp * update * udpate rabit * change the histogram * update kfactor * sync per node stats * temp * update * final * code clean * update rabit * more cleanup * fix errors * fix failed tests * enforce c++11 * broadcast subsampled feature correctly * init col * temp * col sampling * fix histmastrix init * fix col sampling * remove cout * fix out of bound access * fix core dump remove core dump file * disbale test temporarily * update * add fid * print perf data * update * revert some changes * temp * temp * pass all tests * bring back some tests * recover some changes * fix lint issue * enable monotone and interaction constraints * don't specify default for monotone and interactions * recover column init part * more recovery * fix core dumps * code clean * revert some changes * fix test compilation issue * fix lint issue * resolve compilation issue * fix issues of lint caused by rebase * fix stylistic changes and change variable names * use regtree internal function * modularize depth width * address the comments * fix failed tests * wrap perf timers with class * fix lint * fix num_leaves count * fix indention * Update src/tree/updater_quantile_hist.cc Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * Update src/tree/updater_quantile_hist.h Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * Update src/tree/updater_quantile_hist.cc Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * Update src/tree/updater_quantile_hist.cc Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * Update src/tree/updater_quantile_hist.cc Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * Update src/tree/updater_quantile_hist.h Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com> * merge * fix compilation	2019-02-13 12:56:19 -08:00
Nan Zhu	3320a52192	[jvm-packages] force use per-group weights in spark layer (#4118 )	2019-02-10 05:38:03 +08:00
Nan Zhu	ae3bb9c2d5	Distributed Fast Histogram Algorithm (#4011 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * fix scalastyle error * fix scalastyle error * init * allow hist algo * more changes * temp * update * remove hist sync * udpate rabit * change hist size * change the histogram * update kfactor * sync per node stats * temp * update * final * code clean * update rabit * more cleanup * fix errors * fix failed tests * enforce c++11 * fix lint issue * broadcast subsampled feature correctly * revert some changes * fix lint issue * enable monotone and interaction constraints * don't specify default for monotone and interactions * update docs	2019-02-05 05:12:53 -08:00
Nan Zhu	773ddbcfcb	[BLOCKING] fix the issue with infrequent feature (#4045 ) * fix the issue with infrequent feature * handle exception * use only 2 workers * address the comments	2019-01-06 16:01:03 -08:00
Nan Zhu	c055a32609	[jvm-packages]support multiple validation datasets in Spark (#3910 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * fix scalastyle error * fix scalastyle error * wrap iterators * enable copartition training and validationset * add parameters * converge code path and have init unit test * enable multi evals for ranking * unit test and doc * update example * fix early stopping * address the offline comments * udpate doc * test eval metrics * fix compilation issue * fix example	2018-12-17 21:03:57 -08:00
Huafeng Wang	42cac4a30b	[jvm-packages] Fix vector size of 'rawPredictionCol' in XGBoostClassificationModel (#3932 ) * Fix vector size of 'rawPredictionCol' in XGBoostClassificationModel * Fix UT	2018-11-23 21:09:43 -08:00
Nan Zhu	aa48b7e903	[jvm-packages][refactor] refactor XGBoost.scala (spark) (#3904 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * fix scalastyle error * fix scalastyle error * wrap iterators * remove unused code * refactor * fix typo	2018-11-15 20:38:28 -08:00
weitian	9504f411c1	[jvm-packages] For training data with group, empty RDD partition threw exception (#3749 ) (#3750 )	2018-10-09 09:03:22 -07:00
weitian	efc4f85505	[jvm-packages] Fix #3489 : Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives (#3654 )	2018-10-03 08:43:55 -07:00
Michael Mui	20a9e716bd	[jvm-packages] Fix "obj_type" error to enable custom objectives and evaluations (#3646 ) credits to @mmui	2018-09-14 12:06:33 -07:00
Nan Zhu	d1e75d615e	[jvm-packages] Remove copy paste error in test suite (#3692 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * remove copy paste error	2018-09-11 13:08:36 -07:00
Joseph Bradley	14a8b96476	[jvm-packages] xgboost-spark warning when Spark encryption is turned on (#3667 ) * added test, commented out right now * reinstated test * added fix for checking encryption settings * fix by using RDD conf * fix compilation * renamed conf * use SparkSession if available * fix message * nop * code review fixes	2018-09-10 14:21:01 -07:00
Nan Zhu	1c08b3b2ea	[jvm-packages] enable predictLeaf/predictContrib/treeLimit in 0.8 (#3532 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * partial finish * no test * add test cases * add test cases * address comments * add test for regressor * fix typo	2018-08-07 14:01:18 -07:00
Nan Zhu	6cf97b4eae	[jvm-packages] consider spark.task.cpus when controlling parallelism (#3530 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * consider spark.task.cpus when controlling parallelism * fix bug * fix conf setup * calculate requestedCores within ParallelismController * enforce spark.task.cpus = 1 * unify unit test case framework * enable spark ui	2018-07-31 06:19:45 -07:00
Yanbo Liang	2f8764955c	[JVM-packages] Support single instance prediction. (#3464 ) * Support single instance prediction. * Address comments.	2018-07-12 14:17:53 -07:00
Yanbo Liang	2c4359e914	[jvm-packages] XGBoost Spark integration refactor (#3387 ) * add back train method but mark as deprecated * add back train method but mark as deprecated * fix scalastyle error * fix scalastyle error * [jvm-packages] XGBoost Spark integration refactor. (#3313) * XGBoost Spark integration refactor. * Make corresponding update for xgboost4j-example * Address comments. * [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326) * Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib * Fix extra space. * [jvm-packages] XGBoost Spark supports ranking with group data. (#3369) * XGBoost Spark supports ranking with group data. * Use Iterator.duplicate to prevent OOM. * Update CheckpointManagerSuite.scala * Resolve conflicts	2018-06-18 15:39:18 -07:00
Bruce Qu	578a0c7ddb	params confusion fixed (#3386 )	2018-06-15 13:17:35 -07:00
Sergei Lebedev	8f6aadd4b7	[jvm-packages] Fixed CheckpointManagerSuite for Scala 2.10 (#3332 ) As before, the compilation error is caused by mixing positional and labelled arguments.	2018-05-19 18:28:11 -07:00
Yun Ni	3f3f54bcad	[jvm-packages] Update docs and unify the terminology (#3024 ) * [jvm-packages] Move cache files to tmp dir and delete on exit * [jvm-packages] Update docs and unify terminology * Address CR Comments	2018-01-16 17:16:55 +01:00
Yun Ni	9004ca03ca	[jvm-packages] Saving models into a tmp folder every a few rounds (#2964 ) * [jvm-packages] Train Booster from an existing model * Align Scala API with Java API * Existing model should not load rabit checkpoint * Address minor comments * Implement saving temporary boosters and loading previous booster * Add more unit tests for loadPrevBooster * Add params to XGBoostEstimator * (1) Move repartition out of the temp model saving loop (2) Address CR comments * Catch a corner case of training next model with fewer rounds * Address comments * Refactor newly added methods into TmpBoosterManager * Add two files which is missing in previous commit * Rename TmpBooster to checkpoint	2017-12-29 08:36:41 -08:00
Sergei Lebedev	7c6673cb9e	[jvm-packages] Fixed test/train persistence (#2949 ) * [jvm-packages] Fixed test/train persistence Prior to this patch both data sets were persisted in the same directory, i.e. the test data replaced the training one which led to * training on less data (since usually test < train) and * test loss being exactly equal to the training loss. Closes #2945. * Cleanup file cache after the training * Addressed review comments	2017-12-19 07:11:48 -08:00
Sergei Lebedev	8e141427aa	[jvm-packages] Exposed train-time evaluation metrics (#2836 ) * [jvm-packages] Exposed train-time evaluation metrics They are accessible via 'XGBoostModel.summary'. The summary is not serialized with the model and is only available after the training. * Addressed review comments * Extracted model-related tests into 'XGBoostModelSuite' * Added tests for copying the 'XGBoostModel' * [jvm-packages] Fixed a subtle bug in train/test split Iterator.partition (naturally) assumes that the predicate is deterministic but this is not the case for r.nextDouble() <= trainTestRatio therefore sometimes the DMatrix(...) call got a NoSuchElementException and crashed the JVM due to lack of exception handling in XGBoost4jCallbackDataIterNext. * Make sure train/test objectives are different	2017-11-20 22:21:54 +01:00
ebernhardson	78d0bd6c9d	[jvm-packages] Repair spark model eval (#2841 ) In the refactor to add base margins, #2532, all of the labels were lost when creating the dmatrix. This became obvious as metrics like ndcg always returned 1.0 regardless of the results. Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55	2017-11-04 23:28:47 +01:00

99 Commits