3366 Commits

Author SHA1 Message Date
Henry Gouk
69454d9487 Implementation of hinge loss for binary classification (#3477) 2018-08-07 10:06:42 +12:00
Philip Hyunsu Cho
44811f2330
Fix #3485, #3540: Don't use dropout for predicting test sets (#3556)
* Fix #3485, #3540: Don't use dropout for predicting test sets

Dropout (for DART) should only be used at training time.

* Add regression test
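
The rule stated above (dropout only at training time) can be illustrated with a toy sketch. This is not XGBoost's DART code: `dart_predict` and its arguments are hypothetical names, and real DART also rescales tree weights, which is omitted here.

```python
import random


def dart_predict(tree_outputs, rate_drop, training, rng=None):
    """Sum per-tree outputs; trees are dropped only when training is True.

    Hypothetical helper for illustration; weight normalization is omitted.
    """
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for out in tree_outputs:
        # Dropout is skipped entirely at prediction time.
        if training and rng.random() < rate_drop:
            continue
        total += out
    return total
```

With `training=False` every tree contributes, so predictions on a test set are deterministic regardless of `rate_drop`.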
2018-08-05 10:17:21 -07:00
Philip Hyunsu Cho
109473dae2
Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows (#3553)
* Fix #3545: XGDMatrixCreateFromCSCEx silently discards empty trailing rows

Description: The bug is triggered when

1. The data matrix has empty rows at the bottom. More precisely, rows
   `n-k+1`, `n-k+2`, ..., `n` of the matrix have missing values in all
   dimensions (`n` is the number of instances, `k` the number of trailing empty rows).
2. The data matrix is given in Compressed Sparse Column (CSC) format.

Diagnosis: When the CSC matrix is converted to Compressed Sparse Row (CSR)
format (the common format used for DMatrix), the trailing empty rows
are silently ignored. More specifically, the row pointer (`offset`) of the
newly created CSR matrix does not account for these rows.

Fix: Modify the row pointer.

* Add regression test
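
The diagnosis above can be sketched in plain Python. This is an illustrative conversion, not the code in `XGDMatrixCreateFromCSCEx`; the function name and argument names are assumptions. The key point is that the row pointer is sized from the declared row count rather than from the largest row index seen.

```python
def csc_to_csr(col_ptr, row_idx, values, num_row):
    """Convert CSC arrays to CSR arrays, keeping empty trailing rows."""
    num_col = len(col_ptr) - 1
    counts = [0] * num_row
    for r in row_idx:
        counts[r] += 1
    # The row pointer always has num_row + 1 entries, so an empty row
    # (including a trailing one) simply repeats the previous offset.
    offset = [0] * (num_row + 1)
    for r in range(num_row):
        offset[r + 1] = offset[r] + counts[r]
    col_out = [0] * len(values)
    val_out = [0.0] * len(values)
    next_slot = offset[:-1]  # next free position in each row
    for c in range(num_col):
        for k in range(col_ptr[c], col_ptr[c + 1]):
            r = row_idx[k]
            col_out[next_slot[r]] = c
            val_out[next_slot[r]] = values[k]
            next_slot[r] += 1
    return offset, col_out, val_out
```

For a 4-row matrix whose last two rows are empty, the returned `offset` still has five entries, so the row count survives the conversion.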
2018-08-05 10:15:42 -07:00
Philip Hyunsu Cho
8c633d1ca3
Fix #3505: Prevent undefined behavior due to incorrectly sized base_margin (#3555)
The base margin must have length `[num_class] * [number of data points]`.
Otherwise, the array holding prediction results will be only partially
initialized, causing undefined behavior.

Fix: check the length of the base margin. If the length is not correct,
use the global bias (`base_score`) instead. Warn the user about the
substitution.
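
A minimal sketch of the check described above, in Python rather than the C++ of the actual fix. The function name `initial_margin` and its arguments are hypothetical, and staying silent when no base margin is supplied is an assumption.

```python
import warnings


def initial_margin(base_margin, num_class, num_rows, base_score):
    """Return the starting margin, falling back to base_score on a size mismatch."""
    expected = num_class * num_rows
    if base_margin is not None and len(base_margin) == expected:
        return list(base_margin)
    if base_margin:
        # A margin was supplied but has the wrong length: warn, then substitute.
        warnings.warn(
            "base_margin has length %d, expected %d; using base_score instead"
            % (len(base_margin), expected)
        )
    return [base_score] * expected
```

Either way the returned list is fully initialized, which is the property the fix is after.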
2018-08-05 10:14:07 -07:00
Philip Hyunsu Cho
4a429a7c4f Add reg:tweedie to supported objectives in XGBoost4J-Spark (#3552) 2018-08-05 07:42:59 -07:00
Philip Hyunsu Cho
7fefd6865d
Fix #3402: wrong fid crashes distributed algorithm (#3535)
* Fix #3402: wrong fid crashes distributed algorithm

The bug was introduced by the recent DMatrix refactor (#3301). It was partially
fixed by #3408 but the example in #3402 was still failing. The example in #3402
will succeed after this fix is applied.

* Explicitly specify "this" to prevent compile error

* Add regression test

* Add distributed test to Travis matrix

* Install kubernetes Python package as dependency of dmlc tracker

* Add Python dependencies

* Add compile step

* Reduce size of regression test case

* Further reduce size of test
2018-08-04 19:20:04 -07:00
Nan Zhu
31d1baba3d [jvm-packages] Tutorial of XGBoost4J-Spark (#3534)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* add new

* update doc

* finish Gang Scheduling

* more

* intro

* Add sections: Prediction, Model persistence and ML pipeline.

* Add XGBoost4j-Spark MLlib pipeline example

* partial finished version

* finish the doc

* adjust code

* fix the doc

* use rst

* Convert XGBoost4J-Spark tutorial to reST

* Bring XGBoost4J up to date

* add note about using hdfs

* remove duplicate file

* fix descriptions

* update doc

* Wrap HDFS/S3 export support as a note

* update

* wrap indexing_mode example in code block
2018-08-03 21:17:50 -07:00
trivialfis
34dc9155ab Use __CUDA__ macro with __NVCC__. (#3539)
* __CUDA__ is defined in clang. Making the change won't make clang
compile xgboost, but syntax checking from clang is at least partially
working.
2018-08-02 22:04:23 +12:00
Philip Hyunsu Cho
70026655b0
Clarify supported OSes for XGBoost4J published JARs (#3547) 2018-08-01 19:51:44 -07:00
Philip Hyunsu Cho
437b368b1f
Update dmlc-core submodule (#3546)
This brings many goodies, including:

* Ability to specify delimiter and weight_column for CSV files:
```python
dtrain = xgboost.DMatrix('train.csv?format=csv&label_column=0&weight_column=1&delimiter= ')
```
* Ability to choose between 0-based and 1-based indexing for LIBSVM/LIBFM files:
```python
dtrain = xgboost.DMatrix('train.libsvm?indexing_mode=1')    # use 1-based indexing
dtest = xgboost.DMatrix('test.libsvm')                      # use 0-based indexing (default)
dtest2 = xgboost.DMatrix('test2.libsvm?indexing_mode=-1')  # use heuristic to detect 0-based / 1-based
```
* Fix a bug in float parsing (issue dmlc/dmlc-core#440)
2018-08-01 15:15:40 -07:00
Nan Zhu
6cf97b4eae
[jvm-packages] consider spark.task.cpus when controlling parallelism (#3530)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* consider spark.task.cpus when controlling parallelism

* fix bug

* fix conf setup

* calculate requestedCores within ParallelismController

* enforce spark.task.cpus = 1

* unify unit test case framework

* enable spark ui
2018-07-31 06:19:45 -07:00
trivialfis
860263f814 Enable building with sanitizers. (#3525) 2018-07-31 17:25:47 +12:00
Nan Zhu
b546321c83
[jvm-packages] the current version of xgboost does not consider missing value in prediction (#3529)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* consider missing value in prediction

* handle single prediction instance

* fix type conversion
2018-07-30 14:16:24 -07:00
wenduowang
3b62e75f2e Fix bug of using list(x) function when x is string (#3432)
* Fix bug of using list(x) function when x is string

list('abcdcba') = ['a', 'b', 'c', 'd', 'c', 'b', 'a']

* Allow feature_names/feature_types to be of any type

If feature_names/feature_types is iterable, e.g. tuple, list, then convert the value to list, except for string; otherwise construct a list with a single value

* Delete excess whitespace

* Fix whitespace to pass lint
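
The conversion rule described above can be sketched as follows. `as_list` is a hypothetical helper name, not the function in the Python package.

```python
def as_list(value):
    """Normalize feature_names/feature_types-style input to a list."""
    # A string is iterable, but list('abc') would split it into characters.
    if isinstance(value, str):
        return [value]
    try:
        return list(value)  # tuple, list, or any other iterable
    except TypeError:
        return [value]  # non-iterable single value
```

A single feature name such as `'abc'` then stays one element instead of becoming `['a', 'b', 'c']`.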
2018-07-30 07:36:34 -07:00
jqmp
dd07c25d12 Fix typo in ElasticNet threshold function (#3527) 2018-07-30 14:08:14 +12:00
Philip Hyunsu Cho
2bb9b9d3db
Fix typo in parameter.rst, gblinear section (#3518) 2018-07-28 18:58:15 -07:00
Nan Zhu
b5178d3d99
[jvm-packages] a better explanation about the inconsistent issue (#3524) 2018-07-28 17:34:39 -07:00
hlsc
5850a2558a fix DMatrix load_row_split bug (#3431) 2018-07-28 17:21:30 -07:00
trivialfis
8973f2cb0e Fix building dmlc-core from xgboost. (#3522)
Move building dmlc-core before adding DMLC_LOG_CUSTOMIZE.

Fix #3520.
2018-07-28 10:35:11 -07:00
Uddeshya Singh
3363b9142e Update faq.rst (#3521)
Just fixing a minor typo
2018-07-28 10:34:14 -07:00
Rory Mitchell
07ff52d54c
Dynamically allocate GPU histogram memory (#3519)
* Expand histogram memory dynamically to prevent large allocations for large tree depths (e.g. > 15)

* Remove GPU memory allocation messages. These are misleading as a large number of allocations are now dynamic.

* Fix appveyor R test
2018-07-28 21:22:41 +12:00
Brandon Greenwell
b5fad42da2 Issue warning when requesting bivariate plotting (#3516) 2018-07-27 16:15:37 -07:00
Philip Hyunsu Cho
8a5209c55e
Fix model saving for 'count:poisson': max_delta_step as Booster attribute (#3515)
* Save max_delta_step as an extra attribute of Booster

Fixes #3509 and #3026, where `max_delta_step` parameter gets lost during serialization.

* fix lint

* Use camel case for global constant

* disable local variable case in clang-tidy
2018-07-27 09:55:54 -07:00
Andy Adinets
cc6a5a3666 Added finding quantiles on GPU. (#3393)
* Added finding quantiles on GPU.

- this includes datasets where weights are assigned to data rows
- as the quantiles found by the new algorithm are not the same
  as those found by the old one, test thresholds in
    tests/python-gpu/test_gpu_updaters.py have been adjusted.

* Adjustments and improved testing for finding quantiles on the GPU.

- added C++ tests for the DeviceSketch() function
- reduced one of the thresholds in test_gpu_updaters.py
- adjusted the cuts found by the find_cuts_k kernel
2018-07-27 14:03:16 +12:00
Nan Zhu
e2f09db77a
[jvm-packages] minor fix for parameter name in example (#3507) 2018-07-25 19:57:40 -07:00
Rory Mitchell
a725272e19
Correct mistake from dmatrix refactor (#3408) 2018-07-24 15:03:36 +12:00
jqmp
e9a97e0d88 Add total_gain and total_cover importance measures (#3498)
Add `'total_gain'` and `'total_cover'` as possible `importance_type`
arguments to `Booster.get_score` in the Python package.

`get_score` already accepts a `'gain'` argument, which returns each
feature's average gain over all of its splits.  `'total_gain'` does the
same, but returns a total rather than an average.  This seems more
intuitively meaningful, and also matches the behavior of the R package's
`xgb.importance` function.

I also added an analogous `'total_cover'` command for consistency.

This should resolve #3484.
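
The difference between the average and total measures can be shown with a small standalone sketch. It does not call `Booster.get_score`; the `importance` function and the `(feature, gain)` input format are assumptions made for illustration.

```python
from collections import defaultdict


def importance(splits, importance_type="gain"):
    """Aggregate per-split gains; splits is an iterable of (feature, gain) pairs."""
    total = defaultdict(float)
    count = defaultdict(int)
    for feature, gain in splits:
        total[feature] += gain
        count[feature] += 1
    if importance_type == "total_gain":
        return dict(total)
    if importance_type == "gain":
        # Average gain over all splits that use the feature.
        return {f: total[f] / count[f] for f in total}
    raise ValueError("unknown importance_type: %r" % importance_type)
```

A feature used in many moderately useful splits can tie with a rarely used one under `'gain'` yet rank higher under `'total_gain'`.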
2018-07-23 00:30:55 -07:00
KOLANICH
a1505de631 Added configuration for python into .editorconfig (#3494)
* Added configuration for python into .editorconfig

* Fixed forgotten change in the number of spaces
2018-07-23 00:24:10 -07:00
KOLANICH
a393d44c5d Improved library loading a bit (#3481)
* Improved library loading a bit

* Fixed indentation.

* Fixes according to the discussion

* Moved the comment to a separate line.
* specified exception type
2018-07-20 16:03:44 -07:00
Philip Hyunsu Cho
8e90b60c4d
Fix relpath in setup.py on Windows (#3493)
* Fix relpath in setup.py on Windows

Fixes #3480.

* Use only one lib file; use 4 space indent
2018-07-20 12:28:08 -07:00
Philip Hyunsu Cho
05b089405d
Doc modernization (#3474)
* Change doc build to reST exclusively

* Rewrite Intro doc in reST; create toctree

* Update parameter and contribute

* Convert tutorials to reST

* Convert Python tutorials to reST

* Convert CLI and Julia docs to reST

* Enable markdown for R vignettes

* Done migrating to reST

* Add guzzle_sphinx_theme to requirements

* Add breathe to requirements

* Fix search bar

* Add link to user forum
2018-07-19 14:22:16 -07:00
Yanbo Liang
c004cea788 Expose setCustomObj & setCustomEval for XGBoostClassifier & XGBoostRegressor. (#3486) 2018-07-17 21:16:51 -07:00
KOLANICH
b6dcbf0e07 Added .editorconfig (#3478) 2018-07-17 20:05:55 -07:00
Rory Mitchell
0f145a0365
Resolve GPU bug on large files (#3472)
Remove calls to thrust copy, fix indexing bug
2018-07-16 20:43:45 +12:00
Rory Mitchell
1b59316444
Updates for GPU CI tests (#3467)
* Fail GPU CI after test failure

* Fix GPU linear tests

* Reduced number of GPU tests to speed up CI

* Remove static allocations of device memory

* Resolve illegal memory access for updater_fast_hist.cc

* Fix broken r tests dependency

* Update python install documentation for GPU
2018-07-16 18:05:53 +12:00
Henry Gouk
a13e29ece1 Add LASSO (#3429)
* Allow multiple split constraints

* Replace RidgePenalty with ElasticNet

* Add test for checking Ridge, LASSO, and Elastic Net are implemented
2018-07-15 16:38:26 +12:00
Yanbo Liang
2f8764955c [JVM-packages] Support single instance prediction. (#3464)
* Support single instance prediction.

* Address comments.
2018-07-12 14:17:53 -07:00
Thejaswi
2200939416 Upgrading to NCCL2 (#3404)
* Upgrading to NCCL2

* Part II of the NCCL2 upgrade

 - Doc updates to build with nccl2
 - Dockerfile.gpu update for a correct CI build with nccl2
 - Updated FindNccl package to have env-var NCCL_ROOT to take precedence

* Upgrading to v9.2 for CI workflow, since it has the nccl2 binaries available

* Added NCCL2 license + copy the nccl binaries into /usr location for the FindNccl module to find

* Set LD_LIBRARY_PATH variable to pick nccl2 binary at runtime

* Need the nccl2 library download instructions inside Dockerfile.release as well

* Use NCCL2 as a static library
2018-07-10 00:42:15 -07:00
Thejaswi
a6331925d2 Upgrade cuda version to 9.2 for CI workflows (#3460)
- Needed by #3404
 - as v9.1 doesn't have an NCCL2 release
2018-07-08 23:04:51 -07:00
Philip Hyunsu Cho
b40959042c
Document 0.72.1 version (#3458) 2018-07-08 15:42:09 -07:00
kodonnell
6bed54ac39 python sklearn api: defaulting to best_ntree_limit if defined, otherwise current behaviour (#3445)
* python sklearn api: defaulting to best_ntree_limit if defined, otherwise current behaviour

* Fix whitespace
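
The defaulting rule in the commit title can be sketched like this. `resolve_ntree_limit` and the two stand-in classes are hypothetical, and treating `0` as the previous "use all trees" behaviour is an assumption.

```python
def resolve_ntree_limit(model, ntree_limit=None):
    """Pick the tree limit to use for prediction."""
    if ntree_limit is None:
        # Use best_ntree_limit when the model defines it; 0 is assumed
        # here to mean "use all trees", i.e. the previous behaviour.
        return getattr(model, "best_ntree_limit", 0)
    return ntree_limit


class EarlyStopped:
    """Stand-in for a model trained with early stopping."""
    best_ntree_limit = 25


class Plain:
    """Stand-in for a model trained without early stopping."""
```

An explicit `ntree_limit` argument still takes precedence over the stored value.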
2018-07-08 14:35:52 -07:00
ngoyal2707
cb017d0c9a [jvm-packages] removed old group_data from spark api (#3451) 2018-07-07 22:21:01 -07:00
Nan Zhu
aa90e5c6ce
[jvm-packages] disable booster setup for xgboost4j-spark (#3456)
* add back train method but mark as deprecated

* add back train method but mark as deprecated

* fix scalastyle error

* fix scalastyle error

* disable booster setup in spark

* check in parameter conversion

* fix compilation issue

* update exception type
2018-07-07 21:57:24 -07:00
Philip Hyunsu Cho
66e74d2223 Fix get_uint_info() (#3442)
* Add regression test
2018-07-05 20:06:59 -07:00
Philip Hyunsu Cho
48d6e68690
Add callback interface to re-direct console output (#3438)
* Add callback interface to re-direct console output

* Exempt TrackerLogger from custom logging

* Fix lint
2018-07-05 11:32:30 -07:00
Philip Hyunsu Cho
45bf4fbffb
Add a notice for binary PyPI wheel (#3443) 2018-07-05 08:28:43 -07:00
Tianqi Chen
01aff45f26
Update README.md 2018-07-04 13:09:32 -07:00
Tianqi Chen
e62639c59b
[DOCS] Update link to readme (#3437) 2018-07-04 12:24:33 -07:00
Yanbo Liang
aec6299c49 [jvm-packages] Expose nativeBooster for XGBoostClassificationModel and XGBoostRegressionModel. (#3428) 2018-07-01 15:06:16 -07:00
Nikita Titov
295252249e fixed MinGW missed dll (#3430) 2018-07-01 16:43:33 +00:00