204 Commits

Author SHA1 Message Date
Nan Zhu
37dc82c3ff
[jvm-packages] allow partial evaluation of dataframe before prediction (#4407)
* allow partial evaluation of dataframe before prediction

* resume spark test

* comments

* Run unit tests after building JVM packages
2019-04-26 21:02:40 -07:00
Philip Hyunsu Cho
ea850ecd20
[CI] Refactor Jenkins CI pipeline + migrate all Linux tests to Jenkins (#4401)
* All Linux tests are now in Jenkins CI
* Tests are now de-coupled from builds. We can now build XGBoost with one version of CUDA/JDK and test it with another version of CUDA/JDK
* Builds (compilation) are significantly faster because 1) They use C5 instances with faster CPU cores; and 2) build environment setup is cached using Docker containers
2019-04-26 18:39:12 -07:00
Nan Zhu
995698b0cb
[BREAKING][jvm-packages] fix the non-zero missing value handling (#4349)
* fix the nan and non-zero missing value handling

* fix nan handling part

* add missing value

* Update MissingValueHandlingSuite.scala

* Update MissingValueHandlingSuite.scala

* stylistic fix
2019-04-26 11:10:33 -07:00
Jiaming Yuan
207f058711 Refactor CMake scripts. (#4323)
* Refactor CMake scripts.

* Remove CMake CUDA wrapper.
* Bump CMake version for CUDA.
* Use CMake to handle Doxygen.
* Split up CMakeList.
* Export install target.
* Use modern CMake.
* Remove build.sh
* Workaround for gpu_hist test.
* Use cmake 3.12.

* Revert machine.conf.

* Move CLI test to gpu.

* Small cleanup.

* Support using XGBoost as submodule.

* Fix windows

* Fix cpp tests on Windows

* Remove duplicated find_package.
2019-04-15 10:08:12 -07:00
Adam Pocock
a448a8320c [jvm-packages] Fixing the NativeLibLoader on Java 9+ (#4351)
The old NativeLibLoader had a short-circuit load path which modified
java.library.path and attempted to load the xgboost library from outside
the jar first, falling back to loading the library from inside the jar.
This path is a no-op every time when using XGBoost outside of it's
source tree. Additionally it triggers an illegal reflective access
warning in the module system in 9, 10, and 11.

On Java 12 the ClassLoader fields are not accessible via reflection
(separately from the illegal reflective acces warning), and so it fails
in a way that isn't caught by the code which falls back to loading the
library from inside the jar.

This commit removes that code path and always loads the xgboost library
from inside the jar file as it's a valid technique across multiple JVM
implementations and works with all versions of Java.
2019-04-10 12:41:44 -07:00
Xu Xiao
60a9af567c [jvm-packages] Add methods operating attributes of booster in jvm package, which follow API design in python package. (#4336) 2019-04-08 11:00:35 -07:00
Rong Ou
7ea5b772fb do not filter shared library files (#4303) 2019-03-28 19:40:54 +08:00
Harry Braviner
b374e0a7ab [jvm-packages] Allow supression of Rabit output in Booster::train in xgboost4j (#4262)
* Make train in xgboost4j respect print params

Previously no setting in params argument of Booster::train would prevent
the Rabit.trackerPrint call. This can fill up a lot of screen space in
the case that many folds are being trained.
* Setting "silent" in this map to "true", "True", a non-zero integer, or
  a string that can be parsed to such an int will prevent printing.
* Setting "verbose_eval" to "False" or "false" will prevent printing.
* Setting "verbose_eval" to an int (or a String parseable to an int) n
  will result in printing every n steps, or no printing is n is zero.

This is to match the python behaviour described here:
https://www.kaggle.com/c/rossmann-store-sales/discussion/17499

* Fixed 'slient' typo in xgboost4j test

* private access on two methods
2019-03-21 18:25:12 +08:00
Nan Zhu
45c89a6792
[jvm-packages] logging version number (#4271)
* print version number

* add property file
2019-03-21 18:24:29 +08:00
Christopher Suchanek
ac3d03089b [jvm-packages] remove shutdown of handler shutdown (#4224) 2019-03-06 19:32:43 -08:00
Nan Zhu
5f34078fba
[jvm-packages] bump version for master (#4209)
* update version

* bump version
2019-03-04 23:12:24 -08:00
Yanbo Liang
9fefa2128d [jvm-packages] Fix early stop with xgboost4j-spark (#4176)
* Fix early stop with xgboost4j-spark

* Update XGBoost.java

* Update XGBoost.java

* Update XGBoost.java

To use -Float.MAX_VALUE as the lower bound, in case there is positive metric.

* Only update best score if the current score is better (no update when equal)

* Update xgboost-spark tutorial to fix early stopping docs.
2019-03-01 13:02:57 -08:00
Nan Zhu
1b7405f688
[jvm-packages] fix comments in objectiveTrait (#4174) 2019-02-22 00:32:13 -08:00
Nan Zhu
c18a3660fa
Separate Depthwidth and Lossguide growing policy in fast histogram (#4102)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* init

* more changes

* temp

* update

* udpate rabit

* change the histogram

* update kfactor

* sync per node stats

* temp

* update

* final

* code clean

* update rabit

* more cleanup

* fix errors

* fix failed tests

* enforce c++11

* broadcast subsampled feature correctly

* init col

* temp

* col sampling

* fix histmastrix init

* fix col sampling

* remove cout

* fix out of bound access

* fix core dump

remove core dump file

* disbale test temporarily

* update

* add fid

* print perf data

* update

* revert some changes

* temp

* temp

* pass all tests

* bring back some tests

* recover some changes

* fix lint issue

* enable monotone and interaction constraints

* don't specify default for monotone and interactions

* recover column init part

* more recovery

* fix core dumps

* code clean

* revert some changes

* fix test compilation issue

* fix lint issue

* resolve compilation issue

* fix issues of lint caused by rebase

* fix stylistic changes and change variable names

* use regtree internal function

* modularize depth width

* address the comments

* fix failed tests

* wrap perf timers with class

* fix lint

* fix num_leaves count

* fix indention

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.h

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.cc

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* Update src/tree/updater_quantile_hist.h

Co-Authored-By: CodingCat <CodingCat@users.noreply.github.com>

* merge

* fix compilation
2019-02-13 12:56:19 -08:00
Nan Zhu
ae3bb9c2d5
Distributed Fast Histogram Algorithm (#4011)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* init

* allow hist algo

* more changes

* temp

* update

* remove hist sync

* udpate rabit

* change hist size

* change the histogram

* update kfactor

* sync per node stats

* temp

* update

* final

* code clean

* update rabit

* more cleanup

* fix errors

* fix failed tests

* enforce c++11

* fix lint issue

* broadcast subsampled feature correctly

* revert some changes

* fix lint issue

* enable monotone and interaction constraints

* don't specify default for monotone and interactions

* update docs
2019-02-05 05:12:53 -08:00
Shayak Banerjee
431c850c03 [jvm-packages] Updates to Java Booster to support other feature importance measures (#3801)
* Updates to Booster to support other feature importances

* Add returns for Java methods

* Pass Scala style checks

* Pass Java style checks

* Fix indents

* Use class instead of enum

* Return map string double

* A no longer broken build, thanks to mvn package local build

* Add a unit test to increase code coverage back

* Address code review on main code

* Add more unit tests for different feature importance scores

* Address more CR
2019-01-02 01:13:14 -08:00
Nan Zhu
c055a32609
[jvm-packages]support multiple validation datasets in Spark (#3910)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* wrap iterators

* enable copartition training and validationset

* add parameters

* converge code path and have init unit test

* enable multi evals for ranking

* unit test and doc

* update example

* fix early stopping

* address the offline comments

* udpate doc

* test eval metrics

* fix compilation issue

* fix example
2018-12-17 21:03:57 -08:00
Nan Zhu
9c4ff50e83
[jvm-packages]Fix early stopping condition (#3928)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* update version

* 0.82

* fix early stopping condition

* remove unused

* update comments

* udpate comments

* update test
2018-11-24 00:18:07 -08:00
Nan Zhu
dc2bfbfde1
[jvm-packages] update version to 0.82-SNAPSHOT (#3920)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* update version

* 0.82
2018-11-18 16:47:48 -08:00
Philip Hyunsu Cho
78ec77fa97
Release 0.81 version (#3864)
* Release 0.81 version

* Update NEWS.md
2018-11-04 05:49:11 -08:00
Matthew Tovbin
d81fedb955 [jvm-packages] RabitTracker for Scala: allow specifying host ip from the xgboost-tracker.properties file (#3833) 2018-10-26 22:01:36 -07:00
Nan Zhu
4ae225a08d
[Blocking][jvm-packages] fix the early stopping feature (#3808)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* fix scalastyle error

* temp

* add method for classifier and regressor

* update tutorial

* address the comments

* update
2018-10-23 14:53:13 -07:00
zengxy
9e73087324 [jvm-packages] support specified feature names when getModelDump and getFeatureScore (#3733)
* [jvm-packages] support specified feature names for jvm when get ModelDump and get FeatureScore (#3725)

* typo and style fix
2018-10-04 09:05:42 -07:00
Philip Hyunsu Cho
6288f6d563 Update JVM packages version to 0.81-SNAPSHOT (#3584) 2018-08-13 10:17:52 -07:00
Philip Hyunsu Cho
96826a3515
Release version 0.80 (#3541)
* Up versions

* Write release note for 0.80
2018-08-13 01:38:37 -07:00
Matthew Tovbin
bad76048d1 Eliminate use of System.out + proper error logging (#3572) 2018-08-09 10:06:17 -07:00
Nan Zhu
1c08b3b2ea
[jvm-packages] enable predictLeaf/predictContrib/treeLimit in 0.8 (#3532)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* partial finish

* no test

* add test cases

* add test cases

* address comments

* add test for regressor

* fix typo
2018-08-07 14:01:18 -07:00
Yun Ni
30d10ab035 Convert handle == nullptr from SegFault to user-friendly error. (#3021)
* Convert SegFault to user-friendly error.

* Apply the change to DMatrix API as well
2018-06-29 06:30:26 +00:00
Yanbo Liang
2c4359e914 [jvm-packages] XGBoost Spark integration refactor (#3387)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* [jvm-packages] XGBoost Spark integration refactor. (#3313)

* XGBoost Spark integration refactor.

* Make corresponding update for xgboost4j-example

* Address comments.

* [jvm-packages] Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib (#3326)

* Refactor XGBoost-Spark params to make it compatible with both XGBoost and Spark MLLib

* Fix extra space.

* [jvm-packages] XGBoost Spark supports ranking with group data. (#3369)

* XGBoost Spark supports ranking with group data.

* Use Iterator.duplicate to prevent OOM.

* Update CheckpointManagerSuite.scala

* Resolve conflicts
2018-06-18 15:39:18 -07:00
Nan Zhu
f66731181f
Update 0.8 version num (#3358)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* update 0.80
2018-06-02 07:06:01 -07:00
Nan Zhu
49b9f39818
[jvm-packages] update xgboost4j cross build script to be compatible with older glibc (#3307)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* static glibc glibc++

* update to build with glib 2.12

* remove unsupported flags

* update version number

* remove properties

* remove unnecessary command

* update poms
2018-05-10 06:39:44 -07:00
Nan Zhu
e1f57b4417
[jvm-packages] scripts to cross-build and deploy artifacts to github (#3276)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* cross building files

* update

* build with docker

* remove

* temp

* update build script

* update pom

* update

* update version

* upload build

* fix path

* update README.md

* fix compiler version to 4.8.5
2018-04-28 07:41:30 -07:00
Nan Zhu
25b2919c44
[jvm-packages] change version of jvm to keep consistent with other pkgs (#3253)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* change version of jvm to keep consistent with other pkgs
2018-04-19 20:48:50 -07:00
Yun Ni
740eba42f7 [jvm-packages] Add back the overriden finalize() method for SBooster (#3011)
* Convert SIGSEGV to XGBoostError

* Address CR Comments

* Address CR Comments
2018-01-06 14:07:37 -08:00
Yun Ni
65fb4e3f5c [jvm-packages] Prevent dispose being called on unfinalized JBooster (#3005)
* [jvm-packages] Prevent dispose being called twice when finalize

* Convert SIGSEGV to XGBoostError

* Avoid creating a new SBooster with the same JBooster

* Address CR Comments
2018-01-06 09:46:52 -08:00
Nan Zhu
9747ea2acb
[jvm-packages] fix the pattern in dev script and version mismatch (#3009)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* fix the pattern in dev script and version mismatch
2018-01-06 06:59:38 -08:00
Nan Zhu
14c6392381
[jvm-packages] add dev script to update version and update versions (#2998)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* add dev script to update version and update versions
2018-01-01 21:28:53 -08:00
Yun Ni
9004ca03ca [jvm-packages] Saving models into a tmp folder every a few rounds (#2964)
* [jvm-packages] Train Booster from an existing model

* Align Scala API with Java API

* Existing model should not load rabit checkpoint

* Address minor comments

* Implement saving temporary boosters and loading previous booster

* Add more unit tests for loadPrevBooster

* Add params to XGBoostEstimator

* (1) Move repartition out of the temp model saving loop (2) Address CR comments

* Catch a corner case of training next model with fewer rounds

* Address comments

* Refactor newly added methods into TmpBoosterManager

* Add two files which is missing in previous commit

* Rename TmpBooster to checkpoint
2017-12-29 08:36:41 -08:00
avinocur
0ad20f8fe0 Parameterize host-ip to pass to tracker.py (#2831) 2017-11-29 11:14:34 -08:00
Sergei Lebedev
8e141427aa
[jvm-packages] Exposed train-time evaluation metrics (#2836)
* [jvm-packages] Exposed train-time evaluation metrics

They are accessible via 'XGBoostModel.summary'. The summary is not
serialized with the model and is only available after the training.

* Addressed review comments

* Extracted model-related tests into 'XGBoostModelSuite'

* Added tests for copying the 'XGBoostModel'

* [jvm-packages] Fixed a subtle bug in train/test split

Iterator.partition (naturally) assumes that the predicate is deterministic
but this is not the case for

    r.nextDouble() <= trainTestRatio

therefore sometimes the DMatrix(...) call got a NoSuchElementException
and crashed the JVM due to lack of exception handling in
XGBoost4jCallbackDataIterNext.

* Make sure train/test objectives are different
2017-11-20 22:21:54 +01:00
Seth Hendrickson
a8f670d247 [jvm-packages] Add some documentation to xgboost4j-spark plus minor style edits (#2823)
* add scala docs to several methods

* indentation

* license formatting

* clarify distributed boosters

* address some review comments

* reduce doc lengths

* change method name, clarify  doc

* reset make config

* delete most comments

* more review feedback
2017-11-02 13:16:02 -07:00
Sergei Lebedev
69c3b78a29 [jvm-packages] Implemented early stopping (#2710)
* Allowed subsampling test from the training data frame/RDD

The implementation requires storing 1 - trainTestRatio points in memory
to make the sampling work.

An alternative approach would be to construct the full DMatrix and then
slice it deterministically into train/test. The peak memory consumption
of such scenario, however, is twice the dataset size.

* Removed duplication from 'XGBoost.train'

Scala callers can (and should) use names to supply a subset of
parameters. Method overloading is not required.

* Reuse XGBoost seed parameter to stabilize train/test splitting

* Added early stopping support to non-distributed XGBoost

Closes #1544

* Added early-stopping to distributed XGBoost

* Moved construction of 'watches' into a separate method

This commit also fixes the handling of 'baseMargin' which previously
was not added to the validation matrix.

* Addressed review comments
2017-09-29 12:06:22 -07:00
Mahmoud Rawas
a7ce4d2462 Returning back LabeledPoint into public, in referece to the discussion in : https://github.com/dmlc/xgboost/pull/2532#discussion_r137172759 (#2677) 2017-09-10 20:45:43 -07:00
Yun Ni
a00157543d Support instance weights for xgboost4j-spark (#2642)
* Support instance weights for xgboost4j-spark

* Use 0.001 instead of 0 for weights

* Address CR comments
2017-08-28 09:03:20 -07:00
Sergei Lebedev
771a95aec6 [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532)
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala

This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.

I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.

Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.

* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs

Note that DataFrame-based and Flink APIs are not affected by this change.

* Removed baseMargin argument in favour of the LabeledPoint field

* Do a single pass over the partition in buildDistributedBoosters

Note that there is no formal guarantee that

    val repartitioned = rdd.repartition(42)
    repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }

would do a single shuffle, but in practice it seems to be always the case.

* Exposed baseMargin in DataFrame-based API

* Addressed review comments

* Pass baseMargin to XGBoost.trainWithDataFrame via params

* Reverted MLLabeledPoint in Spark APIs

As discussed, baseMargin would only be supported for DataFrame-based APIs.

* Cleaned up baseMargin tests

- Removed RDD-based test, since the option is no longer exposed via
  public APIs
- Changed DataFrame-based one to check that adding a margin actually
  affects the prediction

* Pleased Scalastyle

* Addressed more review comments

* Pleased scalastyle again

* Fixed XGBoost.fromBaseMarginsToArray

which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
2017-08-10 14:29:26 -07:00
Philip Cho
03e213c7cd Fix documentation for a misspelled parameter (#2569) 2017-08-02 21:50:09 +12:00
Sergei Lebedev
d535340459 [jvm-packages] Exposed baseMargin (#2450)
* Disabled excessive Spark logging in tests

* Fixed a singature of XGBoostModel.predict

Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.

* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints

* Inlined XGBoost.repartitionData

An if is more explicit than an opaque method name.

* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel

* Check the input dimension in DMatrix.setBaseMargin

Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?

* Reduced nesting in XGBoost.buildDistributedBoosters

* Ensured consistent naming of the params map

* Cleaned up DataBatch to make it easier to comprehend

* Made scalastyle happy

* Added baseMargin to XGBoost.train and trainWithRDD

* Deprecated XGBoost.train

It is ambiguous and work only for RDDs.

* Addressed review comments

* Revert "Fixed a singature of XGBoostModel.predict"

This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.

* Addressed more review comments

* Fixed NullPointerException in buildDistributedBoosters
2017-06-30 08:27:24 -07:00
Edi Bice
2911597f3d [jvm-packages] Expose prediction feature contribution on the Java side (#2441)
* Exposed prediction feature contribution on the Java side

* was not supplying the newly added argument

* Exposed from Scala-side as well

* formatting (keep declaration in one line unless exceeding 100 chars)
2017-06-28 13:34:51 -07:00
Sergei Lebedev
91e778c6db [jvm-packages] JNI Cosmetics (#2448)
* [jvm-packages] Ensure the native library is loaded once

Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.

Note also, that now XGBoostJNI would NOT suppress an IOException if it
occured in initXGBoost.

* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI

There was no reason for having a separate class.
2017-06-23 11:49:30 -07:00
Sergei Lebedev
2cb51f7097 [jvm-packages] Another pack of build/CI improvements (#2422)
* [jvm-packages] Fixed compilation on Windows

* [jvm-packages] Build the JNI bindings on Appveyor

* [jvm-packages] Build & test on OS X

* [jvm-packages] Re-applied the CMake build changes reverted by #2395

* Fixed Appveyor JVM build

* Muted Maven on Travis

* Don't link with libawt

* "linux2"->"linux"

Python2.x and 3.X use slightly different values for ``sys.platform``.
2017-06-21 12:28:35 -07:00