234 Commits

Author SHA1 Message Date
Bobby Wang
2d83b2ad8f
[jvm-packages] add hostIp and python exec for rabit tracker (#7808) 2022-04-15 16:28:43 +08:00
Bobby Wang
3f536b5308
[jvm-packages] fix evaluation when featuresCols is used (#7798) 2022-04-13 12:52:50 +08:00
Bobby Wang
118192f116
[jvm-packages] xgboost4j-spark should work when featuresCols is specified (#7789) 2022-04-08 13:21:04 +08:00
Bobby Wang
2454407f3a
[jvm-packages] unify setFeaturesCol API for XGBoostRegressor (#7784) 2022-04-05 13:35:33 +08:00
Jiaming Yuan
522636cb52
Bump version. (#7769) 2022-03-31 06:33:22 +08:00
Bobby Wang
89aa8ddf52
[jvm-packages] fix the prediction issue for multi:softmax (#7694) 2022-02-24 01:09:45 +08:00
Bobby Wang
e3e6de5ed9
[jvm-packages] unify the set features API (#7692)
xgboost4j-spark provides two sets of APIs for setting features: one for CPU and another for GPU, which may cause confusion.

This PR removes the GPU API and adds an overloaded CPU function setFeaturesCol that accepts an Array[String] parameter (a usage sketch follows below).
2022-02-23 03:37:25 +08:00
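A minimal usage sketch of the unified setter described in the commit above. The overload signature and the column names are assumptions based on the commit message, not code taken from the PR itself.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// Classic single-vector-column setter (unchanged)
val fromVector = new XGBoostClassifier(Map("objective" -> "binary:logistic"))
  .setLabelCol("label")
  .setFeaturesCol("features")

// Overloaded setter accepting individual feature column names,
// replacing the former GPU-only API (column names are illustrative)
val fromColumns = new XGBoostClassifier(Map("objective" -> "binary:logistic"))
  .setLabelCol("label")
  .setFeaturesCol(Array("f0", "f1", "f2"))
```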
Jiaming Yuan
ac7a36367c
[jvm-packages] Implement new save_raw in jvm-packages. (#7570)
* New `toByteArray` that accepts a parameter for format.
2022-01-19 16:00:14 +08:00
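A hedged sketch of the new serialization call mentioned above; the "json" format string is an assumption based on the formats XGBoost supports elsewhere, so check the Booster API for the exact accepted values.

```scala
import ml.dmlc.xgboost4j.scala.Booster

def serialize(booster: Booster): Array[Byte] = {
  // Legacy call: the library picks the (binary) format
  val legacy: Array[Byte] = booster.toByteArray
  // New overload: the caller names the format explicitly (assumed value)
  val json: Array[Byte] = booster.toByteArray("json")
  json
}
```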
Jiaming Yuan
001503186c
Rewrite approx (#7214)
This PR rewrites the approx tree method to use codebase from hist for better performance and code sharing.

The rewrite has many benefits:
- Support for both `max_leaves` and `max_depth`.
- Support for `grow_policy`.
- Support for monotonic constraints.
- Support for feature weights.
- Support for easier bin configuration (`max_bin`).
- Support for categorical data.
- Faster performance on most datasets (often many times faster).
- Support for prediction cache.
- Significantly better performance for external memory.
- Unifies the code base between approx and hist. (A configuration sketch follows below.)
2022-01-10 21:15:05 +08:00
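The rewrite itself lives in the C++ core, but the options it enables can be exercised from xgboost4j-spark via the params map. A sketch under that assumption, with illustrative values:

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val regressor = new XGBoostRegressor(Map(
  "tree_method" -> "approx",
  "grow_policy" -> "lossguide", // now honoured by approx
  "max_leaves"  -> 64,          // leaf-wise growth limit
  "max_depth"   -> 0,           // let max_leaves drive growth
  "max_bin"     -> 128          // simpler bin configuration
))
```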
Bobby Wang
e8c1eb99e4
[jvm-package] Clean up the legacy gpu support tests (#7523) 2021-12-21 09:15:51 +08:00
Bobby Wang
24e25802a7
[jvm-packages] Add Rapids plugin support (#7491)
* Add GPU pre-processing pipeline.
2021-12-17 13:11:12 +08:00
Bobby Wang
7cfb310eb4
Rework transform (#7440)
Extract the common part of the transform code from XGBoostClassifier
and XGBoostRegressor.
2021-11-18 15:48:57 +08:00
Bobby Wang
cb685607b2
[jvm-packages] Rework the train pipeline (#7401)
1. Add PreXGBoost to build RDD[Watches] from Dataset
2. Feed RDD[Watches] built from PreXGBoost to XGBoost to train
2021-11-10 17:51:38 +08:00
Bobby Wang
b81ebbef62
[jvm-packages] Fix json4s binary compatibility issue (#7376)
Spark 3.2 depends on json4s 3.7.0-M11, which changes the signatures of some
implicit functions. As a result, xgboost4j built against Spark 3.0/3.1
fails when saving the model.
2021-10-30 03:20:57 +08:00
nicovdijk
31a307cf6b
[XGBoost4J-Spark] Serialization for custom objective and eval (#7274)
* added type hints to custom_obj and custom_eval for Spark persistence


Co-authored-by: Bobby Wang <wbo4958@gmail.com>
2021-10-21 16:22:23 +08:00
Bobby Wang
4fd149b3a2
[jvm-packages] update checkstyle (#7335)
* [jvm-packages] update scalastyle

1. bump scalastyle-maven-plugin and maven-checkstyle-plugin to latest
2. remove unused imports

* fix code style check
2021-10-18 18:42:01 +08:00
Jiaming Yuan
f7caac2563
Bump version to 1.6.0 in master. (#7259) 2021-10-07 16:09:26 +08:00
Jiaming Yuan
146549260a
Bump version to 1.5.0 snapshot in master. (#6875) 2021-04-22 01:53:44 +08:00
Bobby Wang
2c684ffd32
[jvm-packages] fix "key not found: train" issue (#6842)
* [jvm-packages] fix "key not found: train" issue

* fix bug
2021-04-18 23:28:39 -07:00
Bobby Wang
49c22c23b4
[jvm-packages] fix early stopping not working even without a custom_eval setting (#6738)
* [jvm-packages] fix early stopping not working even without a custom_eval setting

* remove debug info

* resolve comment
2021-03-06 20:19:40 -08:00
Bobby Wang
9d2832a3a3
fix the potential issue that TaskFailedListener's callback won't be called (#6612)
There is a possibility that onJobStart of TaskFailedListener won't be called if
the job is submitted before another thread calls addSparkListener (a sketch of the required ordering follows below).

Details can be found at https://github.com/dmlc/xgboost/pull/6019#issuecomment-760937628
2021-01-21 14:20:32 +08:00
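A hedged illustration of the ordering constraint behind the race described above: a listener added with addSparkListener only observes jobs submitted after registration. The listener body and the job are illustrative, not the library's code.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("listener-ordering").getOrCreate()

val tracker = new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"observed job ${jobStart.jobId}")
}

// Register first ...
spark.sparkContext.addSparkListener(tracker)
// ... then submit work; a job started before registration would be missed,
// which is exactly the window the fix has to close.
spark.range(0, 1000).count()
```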
Philip Hyunsu Cho
0d483cb7c1
Bump version to 1.4.0 snapshot in master (#6486) 2020-12-10 07:38:08 -08:00
zhang_jf
cc581b3b6b
Misleading exception information: no such param of "allow_non_zero_missing" (#6418) 2020-11-20 19:33:34 +08:00
Nan Zhu
4d1d5d4010
[jvm-packages] fix potential unit test suites aborted issue (#6373)
* fix race condition

* code cleaning

rm pom.xml-e

* clean again

* fix compilation issue

* recover

* avoid using getOrCreate

* interrupt zombie threads

* safe guard

* fix deadlock

* Update SparkParallelismTracker.scala
2020-11-17 10:59:26 -08:00
Jiaming Yuan
d61b628bf5
Remove RABIT CMake targets. (#6275)
* Now it's built as part of libxgboost.
* Set correct C API error in RABIT initialization and finalization.
* Remove redundant message.
* Guard the tracker print C API.
2020-10-27 01:30:20 +08:00
Jiaming Yuan
b5c2a47b20
Drop single point model recovery (#6262)
* Pass rabit params in JVM package.
* Implement timeout using poll timeout parameter.
* Remove OOB data check.
2020-10-21 15:27:03 +08:00
Christian Lorentzen
cf4f019ed6
[Breaking] Change default evaluation metric for classification to logloss / mlogloss (#6183)
* Change DefaultEvalMetric of classification from error to logloss

* Change default binary metric in plugin/example/custom_obj.cc

* Set old error metric in python tests

* Set old error metric in R tests

* Fix missed eval metrics and typos in R tests

* Fix setting eval_metric twice in R tests

* Add warning for empty eval_metric for classification

* Fix Dask tests

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-10-02 12:06:47 -07:00
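For users affected by this breaking change, a sketch of pinning the old metric explicitly from xgboost4j-spark; passing eval_metric through the params map is assumed to behave as before.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val classifier = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  // Default is now logloss (mlogloss for multi-class); request the old
  // classification error metric explicitly if you still want it.
  "eval_metric" -> "error"
))
```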
Philip Hyunsu Cho
33577ef5d3
Add MAPE metric (#6119) 2020-09-14 18:45:27 -07:00
Bobby Wang
0e2d5669f6
[jvm-packages] cancel job instead of killing SparkContext (#6019)
* cancel job instead of killing SparkContext

This PR changes the default behavior of killing the SparkContext. Instead, it
cancels the affected jobs when a task fails, which means the SparkContext
stays alive even when some exceptions happen. (A sketch of the control parameter follows below.)

* add a parameter to control if killing SparkContext

* cancel the jobs the failed task belongs to

* remove the jobId from the map when one job failed.

* resolve comments
2020-09-02 14:20:59 -07:00
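A sketch of the switch added by this commit. The parameter key below is an assumption inferred from the commit description; consult GeneralParams for the exact name and default.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val classifier = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  "num_workers" -> 4,
  // Assumed key: when false, only the jobs owning the failed task are
  // cancelled and the SparkContext stays alive.
  "kill_spark_context_on_worker_failure" -> false
))
```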
Anthony D'Amato
ada964f16e
Clean the way deterministic partitioning is computed (#6033)
We propose to use only the row hash code to compute the partition key; adding the feature-value hash code does not bring more value and would make the computation slower. Even though collisions occur at about 0.2% with MurmurHash3, this is bearable for partitioning and has no impact on data balancing. (A sketch of the key computation follows below.)
2020-08-30 14:38:23 -07:00
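An illustrative sketch (not the library's implementation) of a partition key derived solely from the row's hash code, as the commit describes:

```scala
import org.apache.spark.sql.Row

def partitionKey(row: Row, numPartitions: Int): Int = {
  // Row.hashCode is deterministic for identical content, which is all the
  // repartitioning needs; feature-value hashes are deliberately left out.
  val mod = row.hashCode() % numPartitions
  if (mod < 0) mod + numPartitions else mod
}
```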
FelixYBW
3a990433f9
set maxBins to 256. Align with c code in src/tree/param.h (#6066) 2020-08-28 15:06:11 +03:00
Philip Hyunsu Cho
b3193052b3
Bump version to 1.3.0 snapshot in master (#6052) 2020-08-23 17:13:46 -07:00
Anthony D'Amato
f58e41bad8
Fix deterministic partitioning with dataset containing Double.NaN (#5996)
The functions featureValueOfSparseVector and featureValueOfDenseVector could return Float.NaN if the input vector contained any missing values. This made the partition key computation fail, and most of the vectors would end up in the same partition. We fix this by avoiding the NaN and simply using the row hash code in this case.
We added a test to ensure that the repartitioning is now uniform on an input dataset containing missing values, by checking that the variance of the partition sizes is below a certain threshold.

Signed-off-by: Anthony D'Amato <anthony.damato@hotmail.fr>
2020-08-18 18:55:37 -07:00
Jiaming Yuan
f93f1c03fc
Rabit update. (#5978)
* Remove parameter on JVM Packages.
2020-08-11 09:17:32 +08:00
Shaochen Shi
71197d1dfa
[jvm-packages] Fix wrong method name setAllowZeroForMissingValue. (#5740)
* Allow non-zero for missing value when training.

* Fix wrong method names.

* Add a unit test

* Move the getter/setter unit test to MissingValueHandlingSuite

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-08-01 17:16:42 -07:00
Jiaming Yuan
75b8c22b0b
Fix prediction heuristic (#5955)
* Relax check for prediction.
* Relax test in spark test.
* Add tests in C++.
2020-07-29 19:24:07 +08:00
Bobby Wang
8943eb4314
[BLOCKING] [jvm-packages] add gpu_hist and enable gpu scheduling (#5171)
* [jvm-packages] add gpu_hist tree method

* change updater hist to grow_quantile_histmaker

* add gpu scheduling

* pass correct parameters to xgboost library

* remove debug info

* add use.cuda for pom

* add CI for gpu_hist for jvm

* add gpu unit tests

* use gpu node to build jvm

* use nvidia-docker

* Add CLI interface to create_jni.py using argparse

Co-authored-by: Hyunsu Cho <chohyu01@cs.washington.edu>
2020-07-26 21:53:24 -07:00
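A sketch of using the GPU path this commit adds. The tree_method value comes from the bullet list above; the Spark resource settings are the standard Spark 3 GPU-scheduling configs and are assumed to be what "enable gpu scheduling" refers to.

```scala
import org.apache.spark.sql.SparkSession
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val spark = SparkSession.builder()
  .config("spark.executor.resource.gpu.amount", "1") // one GPU per executor
  .config("spark.task.resource.gpu.amount", "1")     // one GPU per task
  .getOrCreate()

val classifier = new XGBoostClassifier(Map(
  "objective"   -> "binary:logistic",
  "tree_method" -> "gpu_hist", // GPU histogram tree method added here
  "num_workers" -> 2
))
```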
Philip Hyunsu Cho
487ab0ce73
[BLOCKING] Handle empty rows in data iterators correctly (#5929)
* [jvm-packages] Handle empty rows in data iterators correctly

* Fix clang-tidy error

* last empty row

* Add comments [skip ci]

Co-authored-by: Nan Zhu <nanzhu@uber.com>
2020-07-25 13:46:19 -07:00
Bobby Wang
9f85e92602
[jvm-packages] update spark dependency to 3.0.0 (#5836) 2020-07-12 20:58:30 -07:00
Zhang Zhang
1813804e36
Add new parameter singlePrecisionHistogram to xgboost4j-spark (#5811)
Expose the existing 'singlePrecisionHistogram' param to the Spark layer.
2020-07-08 16:29:35 -07:00
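A sketch of turning on the newly exposed parameter; supplying it through the params map is assumed to be equivalent to the Spark-layer setter this commit adds.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val regressor = new XGBoostRegressor(Map(
  "tree_method" -> "hist",
  // Build histograms with 32-bit floats, trading a little accuracy
  // for speed and memory.
  "single_precision_histogram" -> true
))
```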
Philip Hyunsu Cho
073b625bde
Bump version to 1.2.0 snapshot in master (#5733) 2020-05-31 00:11:34 -07:00
Bobby Wang
ad826e913f
[jvm-packages] add feature size for LabelPoint and DataBatch (#5303)
* fix type error

* Validate number of features.

* resolve comments

* add feature size for LabelPoint and DataBatch

* pass the feature size to native

* move feature size validating tests into a separate suite

* resolve comments

Co-authored-by: fis <jm.yuan@outlook.com>
2020-04-07 16:49:52 -07:00
Philip Hyunsu Cho
7ac7e8778f
Port patches from 1.0.0 branch (#5336)
* Remove f-string, since it's not supported by Python 3.5 (#5330)

* Remove f-string, since it's not supported by Python 3.5

* Add Python 3.5 to CI, to ensure compatibility

* Remove duplicated matplotlib

* Show deprecation notice for Python 3.5

* Fix lint

* Fix lint

* Fix a unit test that mistook MINOR ver for PATCH ver

* Enforce only major version in JSON model schema

* Bump version to 1.1.0-SNAPSHOT
2020-02-21 13:13:21 -08:00
Nan Zhu
d7b45fbcaf
[jvm-packages] do not use multiple jobs to make checkpoints (#5082)
* temp

* temp

* tep

* address the comments

* fix stylistic issues

* fix

* external checkpoint
2020-02-01 19:36:39 -08:00
Philip Hyunsu Cho
37fdfa03f8
[jvm-packages] Comply with scala style convention + fix broken unit test (#5134)
* Fix scala style check

* fix messed unit test
2019-12-18 17:26:58 -08:00
cpfarrell
bc9d88259f [jvm-packages] Allow for bypassing spark missing value check (#4805)
* Allow for bypassing spark missing value check

* Update documentation for dealing with missing values in spark xgboost
2019-12-18 10:48:20 -08:00
Chen Qin
b29b8c2f34 [jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4966)
* [phase 1] expose sets of rabit configurations to spark layer

* add back mutable import

* disable ring_mincount till https://github.com/dmlc/rabit/pull/106d

* Revert "disable ring_mincount till https://github.com/dmlc/rabit/pull/106d"

This reverts commit 65e95a98e24f5eb53c6ba9ef9b2379524258984d.

* apply latest rabit

* fix build error

* apply https://github.com/dmlc/xgboost/pull/4880

* downgrade cmake in rabit

* point to rabit with DMLC_ROOT fix

* relative path of rabit install prefix

* split rabit parameters to another trait

* misc

* misc

* Delete .classpath

* Delete .classpath

* Delete .classpath

* Update XGBoostClassifier.scala

* Update XGBoostRegressor.scala

* Update GeneralParams.scala

* Update GeneralParams.scala

* Update GeneralParams.scala

* Update GeneralParams.scala

* Delete .classpath

* Update RabitParams.scala

* Update .gitignore

* Update .gitignore

* apply rabitParams to training

* use string as rabit parameter value type

* cleanup

* add rabitEnv check

* point to dmlc/rabit

* per feedback

* update private scope

* misc

* update rabit

* add rabit_timeout, fix failing test.

* split tests

* allow build jvm with rabit mock

* pass mock failures to rabit with test

* add mock error and graceful handle rabit assertion error test

* split mvn test

* remove sign for test

* update rabit

* build jvm_packages with rabit mock

* point back to dmlc/rabit

* per feedback, update scala header

* cleanup pom

* per feedback

* try fix lint

* fix lint

* per feedback, remove bootstrap_cache

* per feedback 2

* try replace dev profile with passing mvn property

* fix build error

* remove mvn property and replace with env setting to build test jar

* per feedback

* revert copyright headlines, point to dmlc/rabit

* revert python lint

* remove multiple failure test case as retry is not enabled in spark

* Update core.py

* Update core.py

* per feedback, style fix
2019-11-01 14:21:19 -07:00
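A hedged sketch of what exposing Rabit configurations to the Spark layer looks like from user code; the key and value are assumptions based on the rabit_timeout bullet above, so verify against RabitParams.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

val classifier = new XGBoostClassifier(Map(
  "objective"     -> "binary:logistic",
  "num_workers"   -> 4,
  // Assumed key/semantics: how long Rabit waits on an unresponsive worker
  // before failing the job.
  "rabit_timeout" -> 3600
))
```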
Jiaming Yuan
010b8f1428 Revert "[jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4876)" (#4965)
This reverts commit 86ed01c4bbecef66e1bc4d02fb13116bd6130fae.
2019-10-18 14:02:35 -07:00
Chen Qin
86ed01c4bb [jvm-packages] update rabit, surface new changes to spark, add parity and failure tests (#4876)
* Expose sets of rabit configurations to spark layer
2019-10-18 15:07:31 -04:00
Liangcai Li
82ee2317e8 Add case for LongParam. (#4885)
To support specifying a long parameter as a String, the same as other basic
types such as Int and Double (a sketch follows below).
2019-09-25 05:41:53 -07:00
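A sketch of what this change enables: long-typed parameters may be supplied as strings in the params map like the other basic types. "seed" is used here purely as an illustrative long-typed parameter.

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor

val regressor = new XGBoostRegressor(Map(
  // A long-typed parameter given as a String is now parsed the same way
  // Int and Double parameters are.
  "seed" -> "12345"
))
```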