84 Commits

Author SHA1 Message Date
Nan Zhu
01b0c9047c
[jvm-packages] allowing chaining prediction (#4667)
* add test for chaining prediction

* update rabit

* Update XGBoostGeneralSuite.scala
2019-07-17 08:50:27 -07:00
Rong Ou
30204b50fe fix spark tests on machines with many cores (#4634) 2019-07-07 16:02:56 -07:00
Nan Zhu
abffbe014e
[jvm-packages] delete all constraints from spark layer about obj and eval metrics and handle error in jvm layer (#4560)
* temp

* prediction part

* remove supported*

* add for test

* fix param name

* add rabit

* update rabit

* return value of rabit init

* eliminate compilation warnings

* update rabit

* shutdown

* update rabit again

* check sparkcontext shutdown

* fix logic

* sleep

* fix tests

* test with relaxed threshold

* create new thread each time

* stop for job quitting

* udpate rabit

* update rabit

* update rabit

* update git modules
2019-06-27 08:47:37 -07:00
Xu Xiao
797ba8e72d [jvm-packages] fix compatibility problem of spark version (#4411)
* fix compatibility problem of spark version on MissingValueHandlingSuite.scala

* call setHandleInvalid by runtime reflection
2019-04-30 09:13:05 -07:00
Nan Zhu
253fdd8a42
[jvm-packages] fix the split of input (#4417) 2019-04-29 18:52:40 -07:00
Nan Zhu
37dc82c3ff
[jvm-packages] allow partial evaluation of dataframe before prediction (#4407)
* allow partial evaluation of dataframe before prediction

* resume spark test

* comments

* Run unit tests after building JVM packages
2019-04-26 21:02:40 -07:00
Nan Zhu
995698b0cb
[BREAKING][jvm-packages] fix the non-zero missing value handling (#4349)
* fix the nan and non-zero missing value handling

* fix nan handling part

* add missing value

* Update MissingValueHandlingSuite.scala

* Update MissingValueHandlingSuite.scala

* stylistic fix
2019-04-26 11:10:33 -07:00
Xu Xiao
2d875ec019 [BLOCKING][jvm-packages] fix non-deterministic order within a partition (in the case of an upstream shuffle) on prediction (#4388)
* [jvm-packages][hot-fix] fix column mismatch caused by zip actions at XGBooostModel.transformInternal

* apply minibatch in prediction

* an iterator-compatible minibatch prediction

* regressor impl

* continuous working on mini-batch prediction of xgboost4j-spark

* Update Booster.java
2019-04-26 11:09:20 -07:00
Nan Zhu
65db8d0626
[jvm-packages] support spark 2.4 and compatibility test with previous xgboost version (#4377)
* bump spark version

* keep float.nan

* handle brokenly changed name/value

* add test

* add model files

* add model files

* update doc
2019-04-17 11:33:13 -07:00
Nan Zhu
ad4de0d718
[jvm-packages] handle NaN as missing value explicitly (#4309)
* handle nan

* handle nan explicitly

* make code better and handle sparse vector in spark

* Update XGBoostGeneralSuite.scala
2019-03-30 19:34:26 +08:00
Nan Zhu
359ed9c5bc
[jvm-packages] add configuration flag to control whether to cache transformed training set (#4268)
* control whether to cache data

* uncache
2019-03-18 10:13:28 +08:00
Jiaming Yuan
29a1356669
Deprecate reg:linear' in favor of reg:squarederror'. (#4267)
* Deprecate `reg:linear' in favor of `reg:squarederror'.
* Replace the use of `reg:linear'.
* Replace the use of `silent`.
2019-03-17 17:55:04 +08:00
Nan Zhu
c18a3660fa
Separate Depthwidth and Lossguide growing policy in fast histogram (#4102)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* init

* more changes

* temp

* update

* udpate rabit

* change the histogram

* update kfactor

* sync per node stats

* temp

* update

* final

* code clean

* update rabit

* more cleanup

* fix errors

* fix failed tests

* enforce c++11

* broadcast subsampled feature correctly

* init col

* temp

* col sampling

* fix histmastrix init

* fix col sampling

* remove cout

* fix out of bound access

* fix core dump

remove core dump file

* disbale test temporarily

* update

* add fid

* print perf data

* update

* revert some changes

* temp

* temp

* pass all tests

* bring back some tests

* recover some changes

* fix lint issue

* enable monotone and interaction constraints

* don't specify default for monotone and interactions

* recover column init part

* more recovery

* fix core dumps

* code clean

* revert some changes

* fix test compilation issue

* fix lint issue

* resolve compilation issue

* fix issues of lint caused by rebase

* fix stylistic changes and change variable names

* use regtree internal function

* modularize depth width

* address the comments

* fix failed tests

* wrap perf timers with class

* fix lint

* fix num_leaves count

* fix indention

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.h

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.h

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* merge

* fix compilation
2019-02-13 12:56:19 -08:00
Nan Zhu
3320a52192 [jvm-packages] force use per-group weights in spark layer (#4118) 2019-02-10 05:38:03 +08:00
Nan Zhu
ae3bb9c2d5
Distributed Fast Histogram Algorithm (#4011)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* init

* allow hist algo

* more changes

* temp

* update

* remove hist sync

* udpate rabit

* change hist size

* change the histogram

* update kfactor

* sync per node stats

* temp

* update

* final

* code clean

* update rabit

* more cleanup

* fix errors

* fix failed tests

* enforce c++11

* fix lint issue

* broadcast subsampled feature correctly

* revert some changes

* fix lint issue

* enable monotone and interaction constraints

* don't specify default for monotone and interactions

* update docs
2019-02-05 05:12:53 -08:00
Nan Zhu
773ddbcfcb
[BLOCKING] fix the issue with infrequent feature (#4045)
* fix the issue with infrequent feature

* handle exception

* use only 2 workers

* address the comments
2019-01-06 16:01:03 -08:00
Nan Zhu
c055a32609
[jvm-packages]support multiple validation datasets in Spark (#3910)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* wrap iterators

* enable copartition training and validationset

* add parameters

* converge code path and have init unit test

* enable multi evals for ranking

* unit test and doc

* update example

* fix early stopping

* address the offline comments

* udpate doc

* test eval metrics

* fix compilation issue

* fix example
2018-12-17 21:03:57 -08:00
Huafeng Wang
42cac4a30b [jvm-packages] Fix vector size of 'rawPredictionCol' in XGBoostClassificationModel (#3932)
* Fix vector size of 'rawPredictionCol' in XGBoostClassificationModel

* Fix UT
2018-11-23 21:09:43 -08:00
Nan Zhu
aa48b7e903
[jvm-packages][refactor] refactor XGBoost.scala (spark) (#3904)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* wrap iterators

* remove unused code

* refactor

* fix typo
2018-11-15 20:38:28 -08:00
weitian
9504f411c1 [jvm-packages] For training data with group, empty RDD partition threw exception (#3749) (#3750) 2018-10-09 09:03:22 -07:00
weitian
efc4f85505 [jvm-packages] Fix #3489: Spark repartitionForData can potentially shuffle all data and lose ordering required for ranking objectives (#3654) 2018-10-03 08:43:55 -07:00
Michael Mui
20a9e716bd [jvm-packages] Fix "obj_type" error to enable custom objectives and evaluations (#3646)
credits to @mmui
2018-09-14 12:06:33 -07:00
Nan Zhu
d1e75d615e
[jvm-packages] Remove copy paste error in test suite (#3692)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* remove copy paste error
2018-09-11 13:08:36 -07:00
Joseph Bradley
14a8b96476 [jvm-packages] xgboost-spark warning when Spark encryption is turned on (#3667)
* added test, commented out right now

* reinstated test

* added fix for checking encryption settings

* fix by using RDD conf

* fix compilation

* renamed conf

* use SparkSession if available

* fix message

* nop

* code review fixes
2018-09-10 14:21:01 -07:00
Nan Zhu
1c08b3b2ea
[jvm-packages] enable predictLeaf/predictContrib/treeLimit in 0.8 (#3532)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* partial finish

* no test

* add test cases

* add test cases

* address comments

* add test for regressor

* fix typo
2018-08-07 14:01:18 -07:00
Nan Zhu
6cf97b4eae
[jvm-packages] consider spark.task.cpus when controlling parallelism (#3530)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* consider spark.task.cpus when controlling parallelism

* fix bug

* fix conf setup

* calculate requestedCores within ParallelismController

* enforce spark.task.cpus = 1

* unify unit test case framework

* enable spark ui
2018-07-31 06:19:45 -07:00
Yanbo Liang
2f8764955c [JVM-packages] Support single instance prediction. (#3464)
* Support single instance prediction.

* Address comments.
2018-07-12 14:17:53 -07:00
Yanbo Liang
2c4359e914 [jvm-packages] XGBoost Spark integration refactor (#3387)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* [jvm-packages] XGBoost Spark integration refactor. (#3313)

* XGBoost Spark integration refactor.

* Make corresponding update for xgboost4j-example

* Address comments.

* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)

* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib

* Fix extra space.

* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)

* XGBoost Spark supports ranking with group data.

* Use Iterator.duplicate to prevent OOM.

* Update CheckpointManagerSuite.scala

* Resolve conflicts
2018-06-18 15:39:18 -07:00
Bruce Qu
578a0c7ddb params confusion fixed (#3386) 2018-06-15 13:17:35 -07:00
Sergei Lebedev
8f6aadd4b7 [jvm-packages] Fixed CheckpointManagerSuite for Scala 2.10 (#3332)
As before, the compilation error is caused by mixing positional and
labelled arguments.
2018-05-19 18:28:11 -07:00
Yun Ni
3f3f54bcad [jvm-packages] Update docs and unify the terminology (#3024)
* [jvm-packages] Move cache files to tmp dir and delete on exit

* [jvm-packages] Update docs and unify terminology

* Address CR Comments
2018-01-16 17:16:55 +01:00
Yun Ni
9004ca03ca [jvm-packages] Saving models into a tmp folder every a few rounds (#2964)
* [jvm-packages] Train Booster from an existing model

* Align Scala API with Java API

* Existing model should not load rabit checkpoint

* Address minor comments

* Implement saving temporary boosters and loading previous booster

* Add more unit tests for loadPrevBooster

* Add params to XGBoostEstimator

* (1) Move repartition out of the temp model saving loop (2) Address CR comments

* Catch a corner case of training next model with fewer rounds

* Address comments

* Refactor newly added methods into TmpBoosterManager

* Add two files which is missing in previous commit

* Rename TmpBooster to checkpoint
2017-12-29 08:36:41 -08:00
Sergei Lebedev
7c6673cb9e [jvm-packages] Fixed test/train persistence (#2949)
* [jvm-packages] Fixed test/train persistence

Prior to this patch both data sets were persisted in the same directory,
i.e. the test data replaced the training one which led to

* training on less data (since usually test < train) and
* test loss being exactly equal to the training loss.

Closes #2945.

* Cleanup file cache after the training

* Addressed review comments
2017-12-19 07:11:48 -08:00
Sergei Lebedev
8e141427aa
[jvm-packages] Exposed train-time evaluation metrics (#2836)
* [jvm-packages] Exposed train-time evaluation metrics

They are accessible via 'XGBoostModel.summary'. The summary is not
serialized with the model and is only available after the training.

* Addressed review comments

* Extracted model-related tests into 'XGBoostModelSuite'

* Added tests for copying the 'XGBoostModel'

* [jvm-packages] Fixed a subtle bug in train/test split

Iterator.partition (naturally) assumes that the predicate is deterministic
but this is not the case for

    r.nextDouble() <= trainTestRatio

therefore sometimes the DMatrix(...) call got a NoSuchElementException
and crashed the JVM due to lack of exception handling in
XGBoost4jCallbackDataIterNext.

* Make sure train/test objectives are different
2017-11-20 22:21:54 +01:00
ebernhardson
78d0bd6c9d [jvm-packages] Repair spark model eval (#2841)
In the refactor to add base margins, #2532, all of the labels were lost
when creating the dmatrix. This became obvious as metrics like ndcg
always returned 1.0 regardless of the results.

Change-Id: I88be047e1c108afba4784bd3d892bfc9edeabe55
2017-11-04 23:28:47 +01:00
ebernhardson
46f2b820f1 [jvm-packages] Objectives starting with rank: are never classification (#2837)
Training a model with the experimental rank:ndcg objective incorrectly
returns a Classification model. Adjust the classification check to
not recognize rank:* objectives as classification.

While writing tests for isClassificationTask also turned up that
obj_type -> regression was incorrectly identified as a classification
task so the function was slightly adjusted to pass the new tests.
2017-10-30 17:36:03 +01:00
Yun Ni
b678e1711d [jvm-packages] Add SparkParallelismTracker to prevent job from hanging (#2697)
* Add SparkParallelismTracker to prevent job from hanging

* Code review comments

* Code Review Comments

* Fix unit tests

* Changes and unit test to catch the corner case.

* Update documentations

* Small improvements

* cancalAllJobs is problematic with scalatest. Remove it

* Code Review Comments

* Check number of executor cores beforehand, and throw exeception if any core is lost.

* Address CR Comments

* Add missing class

* Fix flaky unit test

* Address CR comments

* Remove redundant param for TaskFailedListener
2017-10-16 20:18:47 -07:00
Sergei Lebedev
69c3b78a29 [jvm-packages] Implemented early stopping (#2710)
* Allowed subsampling test from the training data frame/RDD

The implementation requires storing 1 - trainTestRatio points in memory
to make the sampling work.

An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such scenario, however, is twice the dataset size.

* Removed duplication from 'XGBoost.train'

Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.

* Reuse XGBoost seed parameter to stabilize train/test splitting

* Added early stopping support to non-distributed XGBoost

Closes #1544

* Added early-stopping to distributed XGBoost

* Moved construction of 'watches' into a separate method

This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.

* Addressed review comments
2017-09-29 12:06:22 -07:00
Sergei Lebedev
d570337262 [jvm-packages] (xgboost-spark) preserving num_class across save & load (#2742)
* [bugfix] (xgboost-spark) preserving num_class across save & load

* add testcase for save & load of multiclass model
2017-09-24 16:03:30 +02:00
Sergei Lebedev
39adba51c5 Fixed compilation on Scala 2.10 (#2629) 2017-08-28 10:59:39 -07:00
Yun Ni
a00157543d Support instance weights for xgboost4j-spark (#2642)
* Support instance weights for xgboost4j-spark

* Use 0.001 instead of 0 for weights

* Address CR comments
2017-08-28 09:03:20 -07:00
Sergei Lebedev
771a95aec6 [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532)
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala

This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.

I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.

Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.

* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs

Note that DataFrame-based and Flink APIs are not affected by this change.

* Removed baseMargin argument in favour of the LabeledPoint field

* Do a single pass over the partition in buildDistributedBoosters

Note that there is no formal guarantee that

    val repartitioned = rdd.repartition(42)
    repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }

would do a single shuffle, but in practice it seems to be always the case.

* Exposed baseMargin in DataFrame-based API

* Addressed review comments

* Pass baseMargin to XGBoost.trainWithDataFrame via params

* Reverted MLLabeledPoint in Spark APIs

As discussed, baseMargin would only be supported for DataFrame-based APIs.

* Cleaned up baseMargin tests

- Removed RDD-based test, since the option is no longer exposed via
  public APIs
- Changed DataFrame-based one to check that adding a margin actually
  affects the prediction

* Pleased Scalastyle

* Addressed more review comments

* Pleased scalastyle again

* Fixed XGBoost.fromBaseMarginsToArray

which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
2017-08-10 14:29:26 -07:00
Sergei Lebedev
4eb255262f [jvm-packages] More brooming in tests (#2517)
* Deduplicated DataFrame creation in XGBoostDFSuite

* Extracted dermatology.data into MultiClassification

* Moved cache cleaning to SharedSparkContext

Cache files are prefixed with appName therefore this seems to be just the
place to delete them.

* Removed redundant JMatrix calls in xgboost4j-spark

* Slightly more readable buildDenseRDD in XGBoostGeneralSuite

* Generalized train/test DataFrame construction in XGBoostDFSuite

* Changed SharedSparkContext to setup a new context per-test

Hence the new name: PerTestSparkSession :)

* Fused Utils into PerTestSparkSession

* Whitespace fix in XGBoostDFSuite

* Ensure SparkSession is always eagerly created in PerTestSparkSession

* Renamed PerTestSparkSession->PerTest

because it was doing slightly more than creating/stopping the session.
2017-07-18 13:08:48 -07:00
Sergei Lebedev
66874f5777 [jvm-packages] Deduplicated train/test data access in tests (#2507)
* [jvm-packages] Deduplicated train/test data access in tests

All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.

* Inlined Utils.buildTrainingRDD

The default number of partitions for local mode is equal to the number
of available CPUs.

* Replaced dataset names with problem types
2017-07-12 09:13:55 -07:00
Sergei Lebedev
8ceeb32bad Fixed a signature of XGBoostModel.predict (#2476)
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
2017-07-02 21:42:46 -07:00
Sergei Lebedev
d535340459 [jvm-packages] Exposed baseMargin (#2450)
* Disabled excessive Spark logging in tests

* Fixed a singature of XGBoostModel.predict

Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.

* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints

* Inlined XGBoost.repartitionData

An if is more explicit than an opaque method name.

* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel

* Check the input dimension in DMatrix.setBaseMargin

Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?

* Reduced nesting in XGBoost.buildDistributedBoosters

* Ensured consistent naming of the params map

* Cleaned up DataBatch to make it easier to comprehend

* Made scalastyle happy

* Added baseMargin to XGBoost.train and trainWithRDD

* Deprecated XGBoost.train

It is ambiguous and work only for RDDs.

* Addressed review comments

* Revert "Fixed a singature of XGBoostModel.predict"

This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.

* Addressed more review comments

* Fixed NullPointerException in buildDistributedBoosters
2017-06-30 08:27:24 -07:00
Nan Zhu
a607f697e3 [jvm-packages] Disable fast histo for spark (#2296)
* add back train method but mark as deprecated

* fix scalastyle error

* disable fast histogram in xgboost4j-spark temporarily
2017-05-15 20:43:16 -07:00
Nan Zhu
428453f7d6 [jvm-packages] fix the persistence of XGBoostEstimator (#2265)
* add back train method but mark as deprecated

* fix scalastyle error

* fix the persistence of XGBoostEstimator

* test persistence of a complete pipeline

* fix compilation issue

* do not allow persist custom_eval and custom_obj

* fix the failed tesl
2017-05-08 21:58:06 -07:00
ebernhardson
ccccf8a015 [jvm-packages] Accept groupData in spark model eval (#2244)
* Support model evaluation for ranking tasks by accepting
 groupData in XGBoostModel.eval
2017-05-02 10:03:20 -07:00
Nan Zhu
392aa6d1d3 [jvm-packages] make XGBoostModel hold BoosterParams as well (#2214)
* add back train method but mark as deprecated

* fix scalastyle error

* make XGBoostModel hold BoosterParams as well
2017-04-21 08:12:50 -07:00