xgboost

Author	SHA1	Message	Date
Rory Mitchell	15267eedf2	[GPU-Plugin] Major refactor 2 (#2664 ) * Change cmake option * Move source files * Move google tests * Move python tests * Move benchmarks * Move documentation * Remove makefile support * Fix test run * Move GPU tests	2017-09-08 09:57:16 +12:00
Yun Ni	8244f6f120	Use Sudo-enabled VM which has 7.5GB memory (#2680 )	2017-09-07 08:36:37 -07:00
Yun Ni	f04bde05fd	Add Coverage Report for Java and Python (#2667 ) * Add coverage report for java * Add coverage report for python * Increase memory for JVM unit tests * Increase memory for JVM unit tests	2017-09-05 14:46:51 -07:00
SimonAB	2e9d06443e	Add show_values option to feature importances plot (#2351 ) Adding an option to remove the values from the features importances plot in Python.	2017-08-31 12:26:54 -05:00
PSEUDOTENSOR / Jonathan McKinney	0664298bb2	Update sklearn API to pass along n_jobs to DMatrix creation (#2658 )	2017-08-31 15:24:59 +12:00
Rory Mitchell	19a53814ce	[GPU-Plugin] Major refactor (#2644 ) * Removal of redundant code/files. * Removal of exact namespace in GPU plugin * Revert double precision histograms to single precision for performance on Maxwell/Kepler	2017-08-30 10:53:52 +12:00
Sergei Lebedev	39adba51c5	Fixed compilation on Scala 2.10 (#2629 )	2017-08-28 10:59:39 -07:00
Yun Ni	a00157543d	Support instance weights for xgboost4j-spark (#2642 ) * Support instance weights for xgboost4j-spark * Use 0.001 instead of 0 for weights * Address CR comments	2017-08-28 09:03:20 -07:00
Evan Culver	ba16475c3a	Fix past participle tense in docs (#2637 )	2017-08-25 14:16:57 +02:00
Rory Mitchell	70071fc38c	Fix demo typo (#2632 )	2017-08-23 17:21:51 +02:00
Boris Kostenko	cd366ecb4b	fix build in case of spaces in path to make (#2619 )	2017-08-23 02:29:33 -03:00
Rory Mitchell	332b26df95	Update GPU acceleration demo (#2617 ) * Update GPU acceleration demo * Fix parameter formatting	2017-08-19 21:27:48 +12:00
Rory Mitchell	5661a67d20	Add parallel sort for MSVC (#2609 )	2017-08-17 17:14:39 +12:00
Rory Mitchell	ef23e424f1	[GPU-Plugin] Add GPU accelerated prediction (#2593 ) * [GPU-Plugin] Add GPU accelerated prediction * Improve allocation message * Update documentation * Resolve linker error for predictor * Add unit tests	2017-08-16 12:31:59 +12:00
Rory Mitchell	71e5e622b1	Update cub submodule again (fixes GPU build) (#2599 )	2017-08-13 22:14:40 +12:00
Rory Mitchell	ac2d0d0ac5	Updated cub submodule reference (#2597 )	2017-08-12 23:00:56 -07:00
Vadim Khotilovich	e04e2fbe2c	revert shallow submodule for cub (#2591 )	2017-08-11 20:19:04 -07:00
Sergei Lebedev	771a95aec6	[jvm-packages] Added baseMargin to ml.dmlc.xgboost4j.LabeledPoint (#2532 ) * Converted ml.dmlc.xgboost4j.LabeledPoint to Scala This allows to easily integrate LabeledPoint with Spark DataFrame APIs, which support encoding/decoding case classes out of the box. Alternative solution would be to keep LabeledPoint in Java and make it a Bean by generating boilerplate getters/setters. I have decided against that, even thought the conversion in this PR implies a public API change. I also had to remove the factory methods fromSparseVector and fromDenseVector because a) they would need to be duplicated to support overloaded calls with extra data (e.g. weight); and b) Scala would expose them via mangled $.MODULE$ which looks ugly in Java. Additionally, this commit makes it possible to switch to LabeledPoint in all public APIs and effectively to pass initial margin/group as part of the point. This seems to be the only reliable way of implementing distributed learning with these data. Note that group size format used by single-node XGBoost is not compatible with that scenario, since the partition split could divide a group into two chunks. * Switched to ml.dmlc.xgboost4j.LabeledPoint in RDD-based public APIs Note that DataFrame-based and Flink APIs are not affected by this change. * Removed baseMargin argument in favour of the LabeledPoint field * Do a single pass over the partition in buildDistributedBoosters Note that there is no formal guarantee that val repartitioned = rdd.repartition(42) repartitioned.zipPartitions(repartitioned.map(_ + 1)) { it1, it2, => ... } would do a single shuffle, but in practice it seems to be always the case. * Exposed baseMargin in DataFrame-based API * Addressed review comments * Pass baseMargin to XGBoost.trainWithDataFrame via params * Reverted MLLabeledPoint in Spark APIs As discussed, baseMargin would only be supported for DataFrame-based APIs. * Cleaned up baseMargin tests - Removed RDD-based test, since the option is no longer exposed via public APIs - Changed DataFrame-based one to check that adding a margin actually affects the prediction * Pleased Scalastyle * Addressed more review comments * Pleased scalastyle again * Fixed XGBoost.fromBaseMarginsToArray which always returned an array of NaNs even if base margin was not specified. Surprisingly this only failed a few tests.	2017-08-10 14:29:26 -07:00
PSEUDOTENSOR / Jonathan McKinney	c1104f7d0a	[GPU-Plugin] Add throw of asserts and added compute compatibility error check. (#2565 ) * [GPU-Plugin] Added compute compatibility error check, added verbose timing	2017-08-10 16:07:07 +12:00
René Scheibe	75ea07b847	Fix parameter documentation inconsistencies (#2584 ) * fix indentation - otherwise list items are rendered incorrectly * consistency: no spaces inside square brackets	2017-08-07 19:07:10 +02:00
René Scheibe	a0c5bde024	Fix typo in sklearn documentation (#2580 )	2017-08-07 19:06:11 +02:00
Vadim Khotilovich	2b3a4318c5	Several fixes (#2572 ) * repared serialization after update process; fixes #2545 * non-stratified folds in python could omit some data instances * Makefile: fixes for older makes on windows; clean R-package too * make cub to be a shallow submodule * improve $(MAKE) recovery	2017-08-06 13:03:50 -05:00
Philip Cho	70b65a282c	Use jQuery 2.2.4 (#2581 )	2017-08-05 15:37:38 -07:00
Rory Mitchell	eda9e180f0	[GPU-Plugin] Various fixes (#2579 ) * Fix test large * Add check for max_depth 0 * Update readme * Add LBS specialisation for dense data * Add bst_gpair_precise * Temporarily disable accuracy tests on test_large.py * Solve unused variable compiler warning * Fix max_bin > 1024 error	2017-08-05 22:16:23 +12:00
Philip Cho	03e213c7cd	Fix documentation for a misspelled parameter (#2569 )	2017-08-02 21:50:09 +12:00
Rory Mitchell	0e06d1805d	[WIP] Extract prediction into separate interface (#2531 ) * [WIP] Extract prediction into separate interface * Add copyright, fix linter errors * Add predictor to amalgamation * Fix documentation * Move prediction cache into predictor, add GBTreeModel * Updated predictor doc comments	2017-07-28 17:01:03 -07:00
Vadim Khotilovich	00eda28b3c	MinGW: shared library prefix and appveyor CI (#2539 ) * for MinGW, drop the 'lib' prefix from shared library name * fix defines for 'g++ 4.8 or higher' to include g++ >= 5 * fix compile warnings * [Appveyor] add MinGW with python; remove redundant jobs * [Appveyor] also do python build for one of msvc jobs	2017-07-25 01:06:47 -05:00
Sergei Lebedev	d41dc078b6	[jvm-packages] Mentioned CMake in the docs (#2529 )	2017-07-23 21:57:31 -07:00
Qiang Kou (KK)	4f3539b913	To compile on ARM cpu (#2513 )	2017-07-21 21:16:30 -07:00
PSEUDOTENSOR / Jonathan McKinney	6b375f6ad8	Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530 ) * Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays for python interface.	2017-07-21 14:43:17 +12:00
Rory Mitchell	56550ff3f1	Fix pylint (#2537 )	2017-07-21 11:41:56 +12:00
Sergei Lebedev	4eb255262f	[jvm-packages] More brooming in tests (#2517 ) * Deduplicated DataFrame creation in XGBoostDFSuite * Extracted dermatology.data into MultiClassification * Moved cache cleaning to SharedSparkContext Cache files are prefixed with appName therefore this seems to be just the place to delete them. * Removed redundant JMatrix calls in xgboost4j-spark * Slightly more readable buildDenseRDD in XGBoostGeneralSuite * Generalized train/test DataFrame construction in XGBoostDFSuite * Changed SharedSparkContext to setup a new context per-test Hence the new name: PerTestSparkSession :) * Fused Utils into PerTestSparkSession * Whitespace fix in XGBoostDFSuite * Ensure SparkSession is always eagerly created in PerTestSparkSession * Renamed PerTestSparkSession->PerTest because it was doing slightly more than creating/stopping the session.	2017-07-18 13:08:48 -07:00
PSEUDOTENSOR / Jonathan McKinney	ca7fc9fda3	[GPU-Plugin] Fix gpu_hist to allow matrices with more than just 2^{32} elements. Also fixed CPU hist algorithm. (#2518 )	2017-07-18 11:19:27 +12:00
Rory Mitchell	c85bf9859e	[GPU-Plugin] Improved load balancing search (#2521 )	2017-07-17 11:50:57 +12:00
Michal Malohlava	33ee7d1615	[BUILD] Dockerfile and Jenkinsfile revisited (#2514 ) Includes: - Dockerfile changes - Dockerfile clean up - Fix execution privileges of files used from Dockerfile. - New Dockerfile entrypoint to replace with_user script - Defined a placeholders for CPU testing (script and Dockerfile) - Jenkinsfile - Jenkins file milestone defined - Single source code checkout and propagation via stash/unstash - Bash needs to be explicitly used in launching make build, since we need access to environment - Jenkinsfile build factory for cmake and make style of jobs - Archivation of artifacts (.so, .whl, *.egg) produced by cmake build Missing: - CPU testing - Python3 env build and testing	2017-07-13 17:51:47 +12:00
Sergei Lebedev	66874f5777	[jvm-packages] Deduplicated train/test data access in tests (#2507 ) * [jvm-packages] Deduplicated train/test data access in tests All datasets are now available via a unified API, e.g. Agaricus.test. The only exception is the dermatology data which requires parsing a CSV file. * Inlined Utils.buildTrainingRDD The default number of partitions for local mode is equal to the number of available CPUs. * Replaced dataset names with problem types	2017-07-12 09:13:55 -07:00
Rory Mitchell	530f01e21c	[GPU-Plugin] Add load balancing search to gpu_hist. Add compressed iterator. (#2504 )	2017-07-11 22:36:39 +12:00
Philip Cho	64c8f6fa6d	Use old parallel algorithm for histogram construction by default (#2501 ) It has been reported that new parallel algorithm (#2493) results in excessive message usage (see issue #2326). Until issues are resolved, XGBoost should use the old parallel algorithm by default. The user would have to specify `enable_feature_grouping=1` manually to enable the new algorithm.	2017-07-10 09:35:48 -07:00
Jeff Macaluso	be1f76a06a	Fixed Spacing (#2498 ) Fixed spacing under "Model Complexity" section	2017-07-08 09:17:45 -07:00
Vadim Khotilovich	7350085955	Fix broken make on windows (#2499 ) * fix Makefile for make on windows * clean up compilation warnings * fix for `no file name for include` make warning	2017-07-08 09:17:31 -07:00
Philip Cho	ba820847f9	Patch to improve multithreaded performance scaling (#2493 ) * Patch to improve multithreaded performance scaling Change parallel strategy for histogram construction. Instead of partitioning data rows among multiple threads, partition feature columns instead. Useful heuristics for assigning partitions have been adopted from LightGBM project. * Add missing header to satisfy MSVC * Restore max_bin and related parameters to TrainParam * Fix lint error * inline functions do not require static keyword * Feature grouping algorithm accepting FastHistParam Feature grouping algorithm accepts many parameters (3+), and it gets annoying to pass them one by one. Instead, simply pass the reference to FastHistParam. The definition of FastHistParam has been moved to a separate header file to accomodate this change.	2017-07-07 08:25:07 -07:00
Rory Mitchell	6bfc472bec	Update nccl (#2494 )	2017-07-07 12:36:26 +12:00
Qiang Kou (KK)	e7530bdffc	Not use -msse2 on power or arm arch. close #2446 (#2475 )	2017-07-06 20:06:55 -04:00
69guitar1015	9091493250	Update bosch.py (#2482 ) - fix deprecated expression on StratifiedKFold - use range instead of xrange	2017-07-06 20:05:09 -04:00
Rory Mitchell	e939192978	Cmake improvements (#2487 ) * Cmake improvements * Add google test to cmake	2017-07-06 18:05:11 +12:00
Sergei Lebedev	8ceeb32bad	Fixed a signature of XGBoostModel.predict (#2476 ) Prior to this commit XGBoostModel.predict produced an RDD with an array of predictions for each partition, effectively changing the shape wrt the input RDD. A more natural contract for prediction API is that given an RDD it returns a new RDD with the same number of elements. This allows the users to easily match inputs with predictions. This commit removes one layer of nesting in XGBoostModel.predict output. Even though the change is clearly non-backward compatible, I still think it is well justified. See discussion in 06bd5dca for motivation.	2017-07-02 21:42:46 -07:00
Rory Mitchell	ed8bc4521e	[GPU-Plugin] Resolve double compilation issue (#2479 )	2017-07-03 13:29:10 +12:00
Rory Mitchell	5f1b0bb386	[GPU-Plugin] Unify gpu_gpair/bst_gpair. Refactor. (#2477 )	2017-07-01 17:31:13 +12:00
Sergei Lebedev	d535340459	[jvm-packages] Exposed baseMargin (#2450 ) * Disabled excessive Spark logging in tests * Fixed a singature of XGBoostModel.predict Prior to this commit XGBoostModel.predict produced an RDD with an array of predictions for each partition, effectively changing the shape wrt the input RDD. A more natural contract for prediction API is that given an RDD it returns a new RDD with the same number of elements. This allows the users to easily match inputs with predictions. This commit removes one layer of nesting in XGBoostModel.predict output. Even though the change is clearly non-backward compatible, I still think it is well justified. * Removed boxing in XGBoost.fromDenseToSparseLabeledPoints * Inlined XGBoost.repartitionData An if is more explicit than an opaque method name. * Moved XGBoost.convertBoosterToXGBoostModel to XGBoostModel * Check the input dimension in DMatrix.setBaseMargin Prior to this commit providing an array of incorrect dimensions would have resulted in memory corruption. Maybe backport this to C++? * Reduced nesting in XGBoost.buildDistributedBoosters * Ensured consistent naming of the params map * Cleaned up DataBatch to make it easier to comprehend * Made scalastyle happy * Added baseMargin to XGBoost.train and trainWithRDD * Deprecated XGBoost.train It is ambiguous and work only for RDDs. * Addressed review comments * Revert "Fixed a singature of XGBoostModel.predict" This reverts commit 06bd5dcae7780265dd57e93ed7d4135f4e78f9b4. * Addressed more review comments * Fixed NullPointerException in buildDistributedBoosters	2017-06-30 08:27:24 -07:00
PSEUDOTENSOR / Jonathan McKinney	6b287177c8	[GPU-Plugin] Multi-GPU gpu_id bug fixes for grow_gpu_hist and grow_gpu methods, and additional documentation for the gpu plugin. (#2463 )	2017-06-30 20:04:17 +12:00

3129 Commits