* Removal of redundant code/files.
* Removal of exact namespace in GPU plugin
* Revert double precision histograms to single precision for performance on Maxwell/Kepler
* Converted ml.dmlc.xgboost4j.LabeledPoint to Scala
This allows to easily integrate LabeledPoint with Spark DataFrame APIs,
which support encoding/decoding case classes out of the box. Alternative
solution would be to keep LabeledPoint in Java and make it a Bean by
generating boilerplate getters/setters. I have decided against that, even
thought the conversion in this PR implies a public API change.
I also had to remove the factory methods fromSparseVector and
fromDenseVector because a) they would need to be duplicated to support
overloaded calls with extra data (e.g. weight); and b) Scala would expose
them via mangled $.MODULE$ which looks ugly in Java.
Additionally, this commit makes it possible to switch to LabeledPoint in
all public APIs and effectively to pass initial margin/group as part of
the point. This seems to be the only reliable way of implementing distributed
learning with these data. Note that group size format used by single-node
XGBoost is not compatible with that scenario, since the partition split
could divide a group into two chunks.
* Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs
Note that DataFrame-based and Flink APIs are not affected by this change.
* Removed baseMargin argument in favour of the LabeledPoint field
* Do a single pass over the partition in buildDistributedBoosters
Note that there is no formal guarantee that
val repartitioned = rdd.repartition(42)
repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... }
would do a single shuffle, but in practice it seems to be always the case.
* Exposed baseMargin in DataFrame-based API
* Addressed review comments
* Pass baseMargin to XGBoost.trainWithDataFrame via params
* Reverted MLLabeledPoint in Spark APIs
As discussed, baseMargin would only be supported for DataFrame-based APIs.
* Cleaned up baseMargin tests
- Removed RDD-based test, since the option is no longer exposed via
public APIs
- Changed DataFrame-based one to check that adding a margin actually
affects the prediction
* Pleased Scalastyle
* Addressed more review comments
* Pleased scalastyle again
* Fixed XGBoost.fromBaseMarginsToArray
which always returned an array of NaNs even if base margin was not
specified. Surprisingly this only failed a few tests.
* repared serialization after update process; fixes#2545
* non-stratified folds in python could omit some data instances
* Makefile: fixes for older makes on windows; clean R-package too
* make cub to be a shallow submodule
* improve $(MAKE) recovery
* for MinGW, drop the 'lib' prefix from shared library name
* fix defines for 'g++ 4.8 or higher' to include g++ >= 5
* fix compile warnings
* [Appveyor] add MinGW with python; remove redundant jobs
* [Appveyor] also do python build for one of msvc jobs
* Deduplicated DataFrame creation in XGBoostDFSuite
* Extracted dermatology.data into MultiClassification
* Moved cache cleaning to SharedSparkContext
Cache files are prefixed with appName therefore this seems to be just the
place to delete them.
* Removed redundant JMatrix calls in xgboost4j-spark
* Slightly more readable buildDenseRDD in XGBoostGeneralSuite
* Generalized train/test DataFrame construction in XGBoostDFSuite
* Changed SharedSparkContext to setup a new context per-test
Hence the new name: PerTestSparkSession :)
* Fused Utils into PerTestSparkSession
* Whitespace fix in XGBoostDFSuite
* Ensure SparkSession is always eagerly created in PerTestSparkSession
* Renamed PerTestSparkSession->PerTest
because it was doing slightly more than creating/stopping the session.
Includes:
- Dockerfile changes
- Dockerfile clean up
- Fix execution privileges of files used from Dockerfile.
- New Dockerfile entrypoint to replace with_user script
- Defined a placeholders for CPU testing (script and Dockerfile)
- Jenkinsfile
- Jenkins file milestone defined
- Single source code checkout and propagation via stash/unstash
- Bash needs to be explicitly used in launching make build, since we need
access to environment
- Jenkinsfile build factory for cmake and make style of jobs
- Archivation of artifacts (*.so, *.whl, *.egg) produced by cmake build
Missing:
- CPU testing
- Python3 env build and testing
* [jvm-packages] Deduplicated train/test data access in tests
All datasets are now available via a unified API, e.g. Agaricus.test.
The only exception is the dermatology data which requires parsing a
CSV file.
* Inlined Utils.buildTrainingRDD
The default number of partitions for local mode is equal to the number
of available CPUs.
* Replaced dataset names with problem types
It has been reported that new parallel algorithm (#2493) results in excessive
message usage (see issue #2326). Until issues are resolved, XGBoost should use
the old parallel algorithm by default. The user would have to specify
`enable_feature_grouping=1` manually to enable the new algorithm.
* Patch to improve multithreaded performance scaling
Change parallel strategy for histogram construction.
Instead of partitioning data rows among multiple threads, partition feature
columns instead. Useful heuristics for assigning partitions have been adopted
from LightGBM project.
* Add missing header to satisfy MSVC
* Restore max_bin and related parameters to TrainParam
* Fix lint error
* inline functions do not require static keyword
* Feature grouping algorithm accepting FastHistParam
Feature grouping algorithm accepts many parameters (3+), and it gets annoying to
pass them one by one. Instead, simply pass the reference to FastHistParam. The
definition of FastHistParam has been moved to a separate header file to
accomodate this change.
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified. See discussion in 06bd5dca for motivation.
* Disabled excessive Spark logging in tests
* Fixed a singature of XGBoostModel.predict
Prior to this commit XGBoostModel.predict produced an RDD with
an array of predictions for each partition, effectively changing
the shape wrt the input RDD. A more natural contract for prediction
API is that given an RDD it returns a new RDD with the same number
of elements. This allows the users to easily match inputs with
predictions.
This commit removes one layer of nesting in XGBoostModel.predict output.
Even though the change is clearly non-backward compatible, I still
think it is well justified.
* Removed boxing in XGBoost.fromDenseToSparseLabeledPoints
* Inlined XGBoost.repartitionData
An if is more explicit than an opaque method name.
* Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel
* Check the input dimension in DMatrix.setBaseMargin
Prior to this commit providing an array of incorrect dimensions would
have resulted in memory corruption. Maybe backport this to C++?
* Reduced nesting in XGBoost.buildDistributedBoosters
* Ensured consistent naming of the params map
* Cleaned up DataBatch to make it easier to comprehend
* Made scalastyle happy
* Added baseMargin to XGBoost.train and trainWithRDD
* Deprecated XGBoost.train
It is ambiguous and work only for RDDs.
* Addressed review comments
* Revert "Fixed a singature of XGBoostModel.predict"
This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4.
* Addressed more review comments
* Fixed NullPointerException in buildDistributedBoosters