3050 Commits

Author SHA1 Message Date
Yang Zhang
cc012dac68 Updated sklearn_parallel.py for soon-to-be-deprecated modules (#2134) 2017-03-22 16:18:15 -05:00
Yang Zhang
f6f5003f79 Updated sklearn_examples.py for soon-to-be-deprecated modules (#2117) 2017-03-21 20:07:27 -07:00
Zhiquan
e65564ba59 Update rank_obj.cc (#2126)
typo: PairwieRankObj -> PairwiseRankObj
2017-03-21 20:06:16 -07:00
Matthew R. Becker
95b7dbb1ea ENH add gcc/g++ before clang for macs (#2125)
* ENH add gcc/g++ before clang for macs - will default to clang anyways and supports separate gcc installs

* BUG missed a ) - :(
2017-03-21 20:05:09 -07:00
Tianqi Chen
dc2fb978e1 new thread local requires xcode8 2017-03-17 09:40:34 -07:00
Icyblade Dai
301540f1d9 fix DeprecationWarning on sklearn.cross_validation (#2075)
* fix DeprecationWarning on sklearn.cross_validation

* fix syntax

* fix kfold n_split issue

* fix mistype

* fix n_splits multiple value issue

* split should pass a iterable

* use np.arange instead of xrange, py3 compatibility
2017-03-17 08:38:22 -05:00
Tianqi Chen
d581a3d0e7 [UPDATE] Update rabit and threadlocal (#2114)
* [UPDATE] Update rabit and threadlocal

* minor fix to make build system happy

* upgrade requirement to g++4.8

* upgrade dmlc-core

* update travis
2017-03-16 18:48:37 -07:00
Luckick
b0c972aa4d Typo Issue (#2100)
Contruct to Construct
2017-03-16 10:38:25 -07:00
Oleg Sofrygin
9d19e13ed0 adding a copy of base_margin to slice, fixes a bug where base_margin was notcopied during cross-validation (#2007) 2017-03-16 10:36:57 -07:00
Liam Huang
3a2b8332a6 bugfix: when metric's name contains - (#2090)
When metric's name contains `-`, Python will complain about insufficient arguments to unpack.
2017-03-16 10:36:39 -07:00
ZhouYong
fee1181803 fix online prediction function in learner.h (#2010)
I use the online prediction function(`inline void Predict(const SparseBatch::Inst &inst, ... ) const;`), the results obtained are different from the results of the batch prediction function(`  virtual void Predict(DMatrix* data, ...) const = 0`). After the investigation found that the online prediction function using the `base_score_` parameters, and the batch prediction function is not used in this parameter. It is found that the `base_score_` values are different when the same model file is loaded many times.

```
1st times:base_score_: 6.69023e-21
2nd times:base_score_: -3.7668e+19
3rd times:base_score_: 5.40507e+07
```
 Online prediction results are affected by `base_score_` parameters. After deleting the if condition(`if (out_preds->size() == 1)`) , the online prediction is consistent with the batch prediction results, and the xgboost prediction results are consistent with python version.  Therefore, it is likely that the online prediction function is bug
2017-03-16 10:35:52 -07:00
Matthew R. Becker
4a63f4ab43 BUG make sure to specify no openmp for some mac osx builds properly (#2095) 2017-03-10 18:36:15 -08:00
Shaform
15456c7882 Remove deprecated prefix bst: (#2091) 2017-03-09 09:06:37 -08:00
Holger Peters
95510b9667 Inform setuptools that this is a binary package (#1996)
* Inform setuptools that this is a binary package that needs platform-tags in wheel names.

This fixes issue #1995 .

* PEP8 Formatting

* Add docstring
2017-03-07 09:26:04 -06:00
cloverrose
288f309434 [jvm-packages] call setGroup for ranking task (#2066)
* [jvm-packages] call setGroup for ranking task

* passing groupData through xgBoostConfMap

* fix original comment position

* make groupData param

* remove groupData variable, use xgBoostConfMap directly

* set default groupData value

* add use groupData tests

* reduce rank-demo size

* use TaskContext.getPartitionId() instead of mapPartitionsWithIndex

* add DF use groupData test

* remove unused varable
2017-03-06 15:45:06 -08:00
geoHeil
cf6b173bd7 [jvm-packages] Spark pipeline persistence (#1906)
[jvm-packages] Spark pipeline persistence
2017-03-05 18:35:37 -08:00
Xin Yin
5b54b9437c Fixed Exception handling for fragmented Rabit 'print' tracker command. Fixed unit test. (#2081) 2017-03-05 13:40:59 -08:00
Nan Zhu
ab13fd72bd [jvm-packages] Scala/Java interface for Fast Histogram Algorithm (#1966)
* add back train method but mark as deprecated

* fix scalastyle error

* first commit in scala binding for fast histo

* java test

* add missed scala tests

* spark training

* add back train method but mark as deprecated

* fix scalastyle error

* local change

* first commit in scala binding for fast histo

* local change

* fix df frame test
2017-03-04 15:37:24 -08:00
Nan Zhu
ac30a0aff5 [jvm-packages][spark]Preserve num classes (#2068)
* add back train method but mark as deprecated

* fix scalastyle error

* change class to object in examples

* fix compilation error

* bump spark version to 2.1

* preserve num_class issues

* fix failed test cases

* rivising

* add multi class test
2017-03-04 14:14:31 -08:00
hlsc
a92093388d [jvm-packages] fix bug doing rabit call after finalize (#2079)
[jvm-packages]fix bug doing rabit call after finalize
2017-03-02 16:46:57 -08:00
Tianqi Chen
fd19b7a188 Automatically remove nan from input data when it is sparse. (#2062)
* [DATALoad] Automatically remove Nan when load from sparse matrix

* add log
2017-02-25 08:59:17 -08:00
moqiguzhu
5d093a7f4c in caret settings, if you want do 10*10 cross validation, you need to set repeats=10, number=10 and method=repeatedcv, (#2061)
if you set method=cv, actually just one 10-fold cross validation will be run; fixes #2055
2017-02-25 09:16:19 -05:00
Eric Liu
7927031ffe print_evaluation callback output on last iteration (#2036)
verbose_eval docs claim it will log the last iteration (http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train). this is also consistent w/the behavior from 0.4. not a huge deal but I found it handy to see the last iter's result b/c my period is usually large.

this doesn't address logging the last stage found by early_stopping (as noted in docs) as I'm not sure how to do that.
2017-02-24 23:06:48 -05:00
Vadim Khotilovich
b4d97d3cb8 R maintenance Feb2017 (#2045)
* [R] better argument check in xgb.DMatrix; fixes #1480

* [R] showsd was a dummy; fixes #2044

* [R] better categorical encoding explanation in vignette; fixes #1989

* [R] new roxygen version docs update
2017-02-20 10:02:40 -08:00
Nan Zhu
63aec12a13 [jvm-packages] Bump spark to 2.1 (#2046) 2017-02-19 08:29:35 -08:00
Nan Zhu
185fe1d645 [jvm-packages] use ML's para system to build the passed-in params to XGBoost (#2043)
* add back train method but mark as deprecated

* fix scalastyle error

* use ML's para system to build the passed-in params to XGBoost

* clean
2017-02-18 11:56:27 -08:00
DougM
acce11d3f4 fix MLlib CrossValidator issues (wrong default value configuration) #1941 (#2042) 2017-02-18 08:10:47 -08:00
Theodore Vasiloudis
9fb46e2c5e [trivial] Fix typo in Poisson metric name. (#2026) 2017-02-09 09:32:06 -08:00
ANtlord
f054d812dc Fix typo in Python Package Introduction (#2023)
Fixed #2016
2017-02-08 23:35:13 -05:00
Xin Yin
4fb7fdb240 [jvm-packages] Fixed java.nio.BufferUnderFlow issue in Scala Rabit tracker. (#1993)
* [jvm-packages] Scala implementation of the Rabit tracker.

A Scala implementation of RabitTracker that is interface-interchangable with the
Java implementation, ported from `tracker.py` in the
[dmlc-core project](https://github.com/dmlc/dmlc-core).

* [jvm-packages] Updated Akka dependency in pom.xml.

* Refactored the RabitTracker directory structure.

* Fixed premature stopping of connection handler.

Added a new finite state "AwaitingPortNumber" to explicitly wait for the
worker to send the port, and close the connection. Stopping the actor
prematurely sends a TCP RST to the worker, causing the worker to crash
on AssertionError.

* Added interface IRabitTracker so that user can switch implementations.

* Default timeout duration changes.

* Dependency for Akka tests.

* Removed the main function of RabitTracker.

* A skeleton for testing Akka-based Rabit tracker.

* waitFor() in RabitTracker no longer throws exceptions.

* Completed unit test for the 'start' command of Rabit tracker.

* Preliminary support for Rabit Allreduce via JNI (no prepare function support yet.)

* Fixed the default timeout duration.

* Use Java container to avoid serialization issues due to intermediate wrappers.

* Added tests for Allreduce/model training using Scala Rabit tracker.

* Added spill-over unit test for the Scala Rabit tracker.

* Fixed a typo.

* Overhaul of RabitTracker interface per code review.

  - Removed methods start() waitFor() (no arguments) from IRabitTracker.
  - The timeout in start(timeout) is now worker connection timeout, as tcp
    socket binding timeout is less intuitive.
  - Dropped time unit from start(...) and waitFor(...) methods; the default
    time unit is millisecond.
  - Moved random port number generation into the RabitTrackerHandler.
  - Moved all Rabit-related classes to package ml.dmlc.xgboost4j.scala.rabit.

* More code refactoring and comments.

* Unified timeout constants. Readable tracker status code.

* Add comments to indicate that allReduce is for tests only. Removed all other variants.

* Removed unused imports.

* Simplified signatures of training methods.

 - Moved TrackerConf into parameter map.
 - Changed GeneralParams so that TrackerConf becomes a standalone parameter.
 - Updated test cases accordingly.

* Changed monitoring strategies.

* Reverted monitoring changes.

* Update test case for Rabit AllReduce.

* Mix in UncaughtExceptionHandler into IRabitTracker to prevent tracker from hanging due to exceptions thrown by workers.

* More comprehensive test cases for exception handling and worker connection timeout.

* Handle executor loss due to unknown cause: the newly spawned executor will attempt to connect to the tracker. Interrupt tracker in such case.

* Per code-review, removed training timeout from TrackerConf. Timeout logic must be implemented explicitly and externally in the driver code.

* Reverted scalastyle-config changes.

* Visibility scope change. Interface tweaks.

* Use match pattern to handle tracker_conf parameter.

* Minor clarification in JNI code.

* Clearer intent in match pattern to suppress warnings.

* Removed Future from constructor. Block in start() and waitFor() instead.

* Revert inadvertent comment changes.

* Removed debugging information.

* Updated test cases that are a bit finicky.

* Added comments on the reasoning behind the unit tests for testing Rabit tracker robustness.

* Fixed BufferUnderFlow bug in decoding tracker 'print' command.

* Merge conflicts resolution.
2017-02-04 10:20:39 -08:00
geoHeil
2250b9b6d2 [jvm-packages] try setting default profile (#1891)
* try setting default profile

* add spark pipeline persistence

* access spark session

* copy paste sparks default parameter reader

* remove unnecessary parameters, only change xml

* remove unnecessary changes 2
2017-01-31 08:33:51 -08:00
yexu15
179b384e39 A fix regarding the compatibility with python 2.6 (#1981)
* A fix regarding the compatibility with python 2.6

the syntax of {n: self.attr(n) for n in attr_names} is illegal in python 2.6

* Update core.py

add a space after comma
2017-01-29 20:18:28 -08:00
Philip Cho
5d74578095 Disallow multiple roots for tree_method=hist (#1979)
As discussed in issue #1978, tree_method=hist ignores the parameter
param.num_roots; it simply assumes that the tree has only one root. In
particular, when InitData() method initializes row_set_collection_, it simply
assigns all rows to node 0, the value that's hard-coded.

For now, the updater will simply fail when num_roots exceeds 1. I will revise
the updater soon to support multiple roots.
2017-01-21 12:02:29 -08:00
Srivatsan Ramanujam
036ee55fe0 adding sample weights for XGBRegressor (was this forgotten?) (#1874) 2017-01-21 11:58:03 -08:00
Vadim Khotilovich
2b5b96d760 [R] various R code maintenance (#1964)
* [R] xgb.save must work when handle in nil but raw exists

* [R] print.xgb.Booster should still print other info when handle is nil

* [R] rename internal function xgb.Booster to xgb.Booster.handle to make its intent clear

* [R] rename xgb.Booster.check to xgb.Booster.complete and make it visible; more docs

* [R] storing evaluation_log should depend only on watchlist, not on verbose

* [R] reduce the excessive chattiness of unit tests

* [R] only disable some tests in windows when it's not 64-bit

* [R] clean-up xgb.DMatrix

* [R] test xgb.DMatrix loading from libsvm text file

* [R] store feature_names in xgb.Booster, use them from utility functions

* [R] remove non-functional co-occurence computation from xgb.importance

* [R] verbose=0 is enough without a callback

* [R] added forgotten xgb.Booster.complete.Rd; cran check fixes

* [R] update installation instructions
2017-01-21 11:22:46 -08:00
wxchan
a073a2c3d4 fix ylim with max_num_features in python plot_importance (#1974) 2017-01-18 11:59:50 -08:00
Félix MIKAELIAN
a7d2833766 added the max_features parameter to the plot_importance function. (#1963)
* added the max_features parameter to the plot_importance function.

* renamed max_features parameter to max_num_features for better understanding

* removed unwanted character in docstring
2017-01-16 14:49:47 -08:00
Philip Cho
49ff7c1649 Rename parameter in fast_hist to disambiguate (#1962) 2017-01-13 11:35:55 -08:00
Philip Cho
aeb4e76118 Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` to configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its element may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lacks support for a few C++11 features, e.g.
  alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
  initializer and remove alignas(*).
* Versions of MSVC older than 2015 does not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which uses `using` to declate type aliases). So always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assign both 0's and 1's to the same single bin.

Why? gmat only the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
2017-01-13 09:25:55 -08:00
Luckick
ef8d92fc52 Validation Typo (#1949)
change valudation to validation
2017-01-09 10:40:43 -08:00
Andrey Tereskin
cfb9b11aa4 Make lib path relatrive to fix setup error #1932 (#1947) 2017-01-09 10:40:24 -08:00
Vadim Khotilovich
87e897f428 [R] fix #1903 (#1929) 2017-01-06 13:16:37 -08:00
Vadim Khotilovich
d7406e07f3 [R] xgb.plot.tree fixes (#1939)
* [R] a few fixes and improvements to xgb.plot.tree

* [R] deprecate n_first_tree replace with trees; fix types in xgb.model.dt.tree
2017-01-06 11:09:51 -08:00
Vadim Khotilovich
d23ea5ca7d An option for doing binomial+1 or epsilon-dropout from DART paper (#1922)
* An option for doing binomial+1 or epsilon-dropout from DART paper

* use callback-based discrete_distribution to make MSVC2013 happy
2017-01-05 16:23:22 -08:00
Tong He
ce84af7923 0.6-4 submission (#1935) 2017-01-04 23:31:05 -08:00
Muneyuki Noguchi
8b827425b2 Fix comment in cross_validation.py (#1923)
cv() doesn't output std_value because show_stdv is set to False.
2017-01-02 09:40:41 -05:00
Kyle Willett
7e07b2b93d Correcting small typos in documentation. (#1901) 2016-12-31 20:47:51 +08:00
Tong He
f5c85836bf [R] Increase the version number, date and required R version (#1920)
* remove unnecessary line
2016-12-30 21:29:26 -08:00
Qiang Kou (KK)
7948d1c799 disable openmp on solaris (#1912) 2016-12-28 11:32:56 -08:00
adamist521
119763bc49 cross_validation is included in model_selection module since sklearn 0.18 (#1908) 2016-12-26 04:11:56 -05:00