3112 Commits

Author SHA1 Message Date
Sergei Lebedev
771a95aec6 [jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532)
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala

This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.

I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.

Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.

* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs

Note that DataFrame-based and Flink APIs are not affected by this change.

* Removed baseMargin argument in favour of the LabeledPoint field

* Do a single pass over the partition in buildDistributedBoosters

Note that there is no formal guarantee that

    val repartitioned = rdd.repartition(42)
    repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }

would do a single shuffle, but in practice it seems to be always the case.

* Exposed baseMargin in DataFrame-based API

* Addressed review comments

* Pass baseMargin to XGBoost.trainWithDataFrame via params

* Reverted MLLabeledPoint in Spark APIs

As discussed, baseMargin would only be supported for DataFrame-based APIs.

* Cleaned up baseMargin tests

- Removed RDD-based test, since the option is no longer exposed via
  public APIs
- Changed DataFrame-based one to check that adding a margin actually
  affects the prediction

* Pleased Scalastyle

* Addressed more review comments

* Pleased scalastyle again

* Fixed XGBoost.fromBaseMarginsToArray

which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
2017-08-10 14:29:26 -07:00
PSEUDOTENSOR / Jonathan McKinney
c1104f7d0a [GPU-Plugin] Add throw of asserts and added compute compatibility error check. (#2565)
* [GPU-Plugin] Added compute compatibility error check, added verbose timing
2017-08-10 16:07:07 +12:00
René Scheibe
75ea07b847 Fix parameter documentation inconsistencies (#2584)
* fix indentation - otherwise list items are rendered incorrectly
* consistency: no spaces inside square brackets
2017-08-07 19:07:10 +02:00
René Scheibe
a0c5bde024 Fix typo in sklearn documentation (#2580) 2017-08-07 19:06:11 +02:00
Vadim Khotilovich
2b3a4318c5 Several fixes (#2572)
* repared serialization after update process; fixes #2545

* non-stratified folds in python could omit some data instances

* Makefile: fixes for older makes on windows; clean R-package too

* make cub to be a shallow submodule

* improve $(MAKE) recovery
2017-08-06 13:03:50 -05:00
Philip Cho
70b65a282c Use jQuery 2.2.4 (#2581) 2017-08-05 15:37:38 -07:00
Rory Mitchell
eda9e180f0 [GPU-Plugin] Various fixes (#2579)
* Fix test large

* Add check for max_depth 0

* Update readme

* Add LBS specialisation for dense data

* Add bst_gpair_precise

* Temporarily disable accuracy tests on test_large.py

* Solve unused variable compiler warning

* Fix max_bin > 1024 error
2017-08-05 22:16:23 +12:00
Philip Cho
03e213c7cd Fix documentation for a misspelled parameter (#2569) 2017-08-02 21:50:09 +12:00
Rory Mitchell
0e06d1805d [WIP] Extract prediction into separate interface (#2531)
* [WIP] Extract prediction into separate interface

* Add copyright, fix linter errors

* Add predictor to amalgamation

* Fix documentation

* Move prediction cache into predictor, add GBTreeModel

* Updated predictor doc comments
2017-07-28 17:01:03 -07:00
Vadim Khotilovich
00eda28b3c MinGW: shared library prefix and appveyor CI (#2539)
* for MinGW, drop the 'lib' prefix from shared library name

* fix defines for 'g++ 4.8 or higher' to include g++ >= 5

* fix compile warnings

* [Appveyor] add MinGW with python; remove redundant jobs

* [Appveyor] also do python build for one of msvc jobs
2017-07-25 01:06:47 -05:00
Sergei Lebedev
d41dc078b6 [jvm-packages] Mentioned CMake in the docs (#2529) 2017-07-23 21:57:31 -07:00
Qiang Kou (KK)
4f3539b913 To compile on ARM cpu (#2513) 2017-07-21 21:16:30 -07:00
PSEUDOTENSOR / Jonathan McKinney
6b375f6ad8 Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530)
* Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays for python interface.
2017-07-21 14:43:17 +12:00
Rory Mitchell
56550ff3f1 Fix pylint (#2537) 2017-07-21 11:41:56 +12:00
Sergei Lebedev
4eb255262f [jvm-packages] More brooming in tests (#2517)
* Deduplicated DataFrame creation in XGBoostDFSuite

* Extracted dermatology.data into MultiClassification

* Moved cache cleaning to SharedSparkContext

Cache files are prefixed with appName therefore this seems to be just the
place to delete them.

* Removed redundant JMatrix calls in xgboost4j-spark

* Slightly more readable buildDenseRDD in XGBoostGeneralSuite

* Generalized train/test DataFrame construction in XGBoostDFSuite

* Changed SharedSparkContext to setup a new context per-test

Hence the new name: PerTestSparkSession :)

* Fused Utils into PerTestSparkSession

* Whitespace fix in XGBoostDFSuite

* Ensure SparkSession is always eagerly created in PerTestSparkSession

* Renamed PerTestSparkSession->PerTest

because it was doing slightly more than creating/stopping the session.
2017-07-18 13:08:48 -07:00
PSEUDOTENSOR / Jonathan McKinney
ca7fc9fda3 [GPU-Plugin] Fix gpu_hist to allow matrices with more than just 2^{32} elements. Also fixed CPU hist algorithm. (#2518) 2017-07-18 11:19:27 +12:00
Rory Mitchell
c85bf9859e [GPU-Plugin] Improved load balancing search (#2521) 2017-07-17 11:50:57 +12:00
Michal Malohlava
33ee7d1615 [BUILD] Dockerfile and Jenkinsfile revisited (#2514)
Includes:
  - Dockerfile changes
    - Dockerfile clean up
    - Fix execution privileges of files used from Dockerfile.
    - New Dockerfile entrypoint to replace with_user script
    - Defined a placeholders for CPU testing (script and Dockerfile)
  - Jenkinsfile
    - Jenkins file milestone defined
    - Single source code checkout and propagation via stash/unstash
    - Bash needs to be explicitly used in launching make build, since we need
access to environment
    - Jenkinsfile build factory for cmake and make style of jobs
    - Archivation of artifacts (*.so, *.whl, *.egg) produced by cmake build

Missing:
  - CPU testing
  - Python3 env build and testing
2017-07-13 17:51:47 +12:00
Sergei Lebedev
66874f5777 [jvm-packages] Deduplicated train/test data access in tests (#2507)
* [jvm-packages] Deduplicated train/test data access in tests

All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.

* Inlined Utils.buildTrainingRDD

The default number of partitions for local mode is equal to the number
of available CPUs.

* Replaced dataset names with problem types
2017-07-12 09:13:55 -07:00
Rory Mitchell
530f01e21c [GPU-Plugin] Add load balancing search to gpu_hist. Add compressed iterator. (#2504) 2017-07-11 22:36:39 +12:00
Philip Cho
64c8f6fa6d Use old parallel algorithm for histogram construction by default (#2501)
It has been reported that new parallel algorithm (#2493) results in excessive
message usage (see issue #2326). Until issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
2017-07-10 09:35:48 -07:00
Jeff Macaluso
be1f76a06a Fixed Spacing (#2498)
Fixed spacing under "Model Complexity" section
2017-07-08 09:17:45 -07:00
Vadim Khotilovich
7350085955 Fix broken make on windows (#2499)
* fix Makefile for make on windows

* clean up compilation warnings

* fix for `no file name for include` make warning
2017-07-08 09:17:31 -07:00
Philip Cho
ba820847f9 Patch to improve multithreaded performance scaling (#2493)
* Patch to improve multithreaded performance scaling

Change parallel strategy for histogram construction.
Instead of partitioning data rows among multiple threads, partition feature
columns instead. Useful heuristics for assigning partitions have been adopted
from LightGBM project.

* Add missing header to satisfy MSVC

* Restore max_bin and related parameters to TrainParam

* Fix lint error

* inline functions do not require static keyword

* Feature grouping algorithm accepting FastHistParam

Feature grouping algorithm accepts many parameters (3+), and it gets annoying to
pass them one by one. Instead, simply pass the reference to FastHistParam. The
definition of FastHistParam has been moved to a separate header file to
accomodate this change.
2017-07-07 08:25:07 -07:00
Rory Mitchell
6bfc472bec Update nccl (#2494) 2017-07-07 12:36:26 +12:00
Qiang Kou (KK)
e7530bdffc Not use -msse2 on power or arm arch. close #2446 (#2475) 2017-07-06 20:06:55 -04:00
69guitar1015
9091493250 Update bosch.py (#2482)
- fix deprecated expression on StratifiedKFold
- use range instead of xrange
2017-07-06 20:05:09 -04:00
Rory Mitchell
e939192978 Cmake improvements (#2487)
* Cmake improvements
* Add google test to cmake
2017-07-06 18:05:11 +12:00
Sergei Lebedev
8ceeb32bad Fixed a signature of XGBoostModel.predict (#2476)
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
2017-07-02 21:42:46 -07:00
Rory Mitchell
ed8bc4521e [GPU-Plugin] Resolve double compilation issue (#2479) 2017-07-03 13:29:10 +12:00
Rory Mitchell
5f1b0bb386 [GPU-Plugin] Unify gpu_gpair/bst_gpair. Refactor. (#2477) 2017-07-01 17:31:13 +12:00
Sergei Lebedev
d535340459 [jvm-packages] Exposed baseMargin (#2450)
* Disabled excessive Spark logging in tests

* Fixed a singature of XGBoostModel.predict

Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.

This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.

* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints

* Inlined XGBoost.repartitionData

An if is more explicit than an opaque method name.

* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel

* Check the input dimension in DMatrix.setBaseMargin

Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?

* Reduced nesting in XGBoost.buildDistributedBoosters

* Ensured consistent naming of the params map

* Cleaned up DataBatch to make it easier to comprehend

* Made scalastyle happy

* Added baseMargin to XGBoost.train and trainWithRDD

* Deprecated XGBoost.train

It is ambiguous and work only for RDDs.

* Addressed review comments

* Revert "Fixed a singature of XGBoostModel.predict"

This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.

* Addressed more review comments

* Fixed NullPointerException in buildDistributedBoosters
2017-06-30 08:27:24 -07:00
PSEUDOTENSOR / Jonathan McKinney
6b287177c8 [GPU-Plugin] Multi-GPU gpu_id bug fixes for grow_gpu_hist and grow_gpu methods, and additional documentation for the gpu plugin. (#2463) 2017-06-30 20:04:17 +12:00
Yaguang
91dae84a00 Update URL for "Multiclass logloss". (#2469)
The original URL shows 404 Error.
2017-06-30 08:06:09 +02:00
Rory Mitchell
48f3003302 [GPU-Plugin] Change GPU plugin to use tree_method parameter, bump cmake version to 3.5 for GPU plugin, add compute architecture 3.5, remove unused cmake files (#2455) 2017-06-29 16:19:45 +12:00
Sergei Lebedev
88488fdbb9 Fixed shared library loading in the Python package (#2461)
* Fixed DLL name on Windows in ``xgboost.libpath``

* Added support for OS X to ``xgboost.libpath``

* Use .dylib for shared library on OS X

This does not affect the JNI library, because it is not trully
cross-platform in the Makefile-build anyway.
2017-06-29 11:50:50 +12:00
Edi Bice
2911597f3d [jvm-packages] Expose prediction feature contribution on the Java side (#2441)
* Exposed prediction feature contribution on the Java side

* was not supplying the newly added argument

* Exposed from Scala-side as well

* formatting (keep declaration in one line unless exceeding 100 chars)
2017-06-28 13:34:51 -07:00
Sergei Lebedev
d01a31088b [jvm-packages] Test xgboost4j on Windows (#2451) 2017-06-26 11:19:18 -07:00
Zex Li
9bcbaa8869 Add build failure message (#2397)
* Add build failure message

* quit on error
2017-06-25 22:32:11 -04:00
Ryuichi Yamamoto
70ba492eb7 doc: Fix broken links in contribute.md (#2435) 2017-06-25 22:31:14 -04:00
Sergei Lebedev
91e778c6db [jvm-packages] JNI Cosmetics (#2448)
* [jvm-packages] Ensure the native library is loaded once

Previously any class using XGBoostJNI queried NativeLibLoader to make
sure the native library is loaded. This commit moves the initXGBoost
call to XGBoostJNI, effectively delegating the initialization to the class
loader.

Note also, that now XGBoostJNI would NOT suppress an IOException if it
occured in initXGBoost.

* [jvm-packages] Fused JNIErrorHandle with XGBoostJNI

There was no reason for having a separate class.
2017-06-23 11:49:30 -07:00
Rory Mitchell
0e48f87529 [GPU-Plugin] Make node_idx type 32 bit for hist algo. Set default n_gpus to 1. (#2445) 2017-06-23 18:26:45 +12:00
ebernhardson
169c983b5f [jvm-packages] Release dmatrix when no longer needed (#2436)
When using xgboost4j-spark I had executors getting killed much more
often than i would expect by yarn for overrunning their memory limits,
based on the memoryOverhead provided. It looks like a significant
amount of this is because dmatrix's were being created but not released,
because they were only released when the GC decided it was time to
cleanup the references.

Rather than waiting for the GC, relesae the DMatrix's when we know
they are no longer necessary.
2017-06-22 09:20:55 -07:00
Rory Mitchell
1899f9e744 [GPU-Plugin] Add basic continuous integration for GPU plugin. (#2431) 2017-06-22 10:15:28 -04:00
Sergei Lebedev
2cb51f7097 [jvm-packages] Another pack of build/CI improvements (#2422)
* [jvm-packages] Fixed compilation on Windows

* [jvm-packages] Build the JNI bindings on Appveyor

* [jvm-packages] Build & test on OS X

* [jvm-packages] Re-applied the CMake build changes reverted by #2395

* Fixed Appveyor JVM build

* Muted Maven on Travis

* Don't link with libawt

* "linux2"->"linux"

Python2.x and 3.X use slightly different values for ``sys.platform``.
2017-06-21 12:28:35 -07:00
Alfredo Cambera
46b9889cc5 Update build_trouble_shooting.md (#2430)
I had to fight with my linux box for a day to find the solution to the problem. I hope than this may help other users to save some time.
2017-06-20 21:36:10 -07:00
Pierre PACI
ee3d680e89 Fix Typo in documentation (#2416)
The objective section was missing a space and thus all the bullet were are the same level.
2017-06-17 09:22:59 -07:00
Bernie Gray
cd7659937b [R] many minor changes to increase the robustness of the R code (#2404)
* many minor changes to increase robustness of R code

* fixing which mistake in xgb.model.dt.tree.R and a few cosmetics
2017-06-15 22:56:23 -05:00
Sergei Lebedev
0db37c05bd [jvm-packages] Deterministically XGBoost training on exception (#2405)
Previously the code relied on the tracker process being terminated
by the OS, which was not the case on Windows.

Closes #2394
2017-06-12 20:19:28 -07:00
Thejaswi
34dfe2f6de [GPU-Plugin] Support for building to specific GPU architectures (#2390)
* Support for builing gpu-plugins to specific GPU architectures
1. Option GPU_COMPUTE_VER exposed from both Makefile and CMakeLists.txt
2. updater_gpu documentation updated accordingly

* Re-introduced GPU_COMPUTE_VER option in the cmake flow.
This seems to fix the compile-time, rdc=true and copy-constructor related
errors seen and discussed in PR #2390.
2017-06-13 09:51:38 +12:00