108 Commits

Author SHA1 Message Date
Philip Cho
aeb4e76118 Histogram Optimized Tree Grower (#1940)
* Support histogram-based algorithm + multiple tree growing strategy

* Add a brand new updater to support histogram-based algorithm, which buckets
  continuous features into discrete bins to speed up training. To use it, set
  `tree_method = fast_hist` to configuration.
* Support multiple tree growing strategies. For now, two policies are supported:
  * `grow_policy=depthwise` (default):  favor splitting at nodes closest to the
    root, i.e. grow depth-wise.
  * `grow_policy=lossguide`: favor splitting at nodes with highest loss change
* Improve single-threaded performance
  * Unroll critical loops
  * Introduce specialized code for dense data (i.e. no missing values)
* Additional training parameters: `max_leaves`, `max_bin`, `grow_policy`, `verbose`

* Adding a small test for hist method

* Fix memory error in row_set.h

When std::vector is resized, a reference to one of its element may become
stale. Any such reference must be updated as well.

* Resolve cross-platform compilation issues

* Versions of g++ older than 4.8 lacks support for a few C++11 features, e.g.
  alignas(*) and new initializer syntax. To support g++ 4.6, use pre-C++11
  initializer and remove alignas(*).
* Versions of MSVC older than 2015 does not support alignas(*). To support
  MSVC 2012, remove alignas(*).
* For g++ 4.8 and newer, alignas(*) is enabled for performance benefits.
* Some old compilers (MSVC 2012, g++ 4.6) do not support template aliases
  (which uses `using` to declate type aliases). So always use `typedef`.

* Fix a host of CI issues

* Remove dependency for libz on osx
* Fix heading for hist_util
* Fix minor style issues
* Add missing #include
* Remove extraneous logging

* Enable tree_method=hist in R

* Renaming HistMaker to GHistBuilder to avoid confusion

* Fix R integration

* Respond to style comments

* Consistent tie-breaking for priority queue using timestamps

* Last-minute style fixes

* Fix issuecomment-271977647

The way we quantize data is broken. The agaricus data consists of all
categorical values. When NAs are converted into 0's,
`HistCutMatrix::Init` assign both 0's and 1's to the same single bin.

Why? gmat only the smallest value (0) and an upper bound (2), which is twice
the maximum value (1). Add the maximum value itself to gmat to fix the issue.

* Fix issuecomment-272266358

* Remove padding from cut values for the continuous case
* For categorical/ordinal values, use midpoints as bin boundaries to be safe

* Fix CI issue -- do not use xrange(*)

* Fix corner case in quantile sketch

Signed-off-by: Philip Cho <chohyu01@cs.washington.edu>

* Adding a test for an edge case in quantile sketcher

max_bin=2 used to cause an exception.

* Fix fast_hist test

The test used to require a strictly increasing Test AUC for all examples.
One of them exhibits a small blip in Test AUC before achieving a Test AUC
of 1. (See bottom.)

Solution: do not require monotonic increase for this particular example.

[0] train-auc:0.99989 test-auc:0.999497
[1] train-auc:1 test-auc:0.999749
[2] train-auc:1 test-auc:0.999749
[3] train-auc:1 test-auc:0.999749
[4] train-auc:1 test-auc:0.999749
[5] train-auc:1 test-auc:0.999497
[6] train-auc:1 test-auc:1
[7] train-auc:1 test-auc:1
[8] train-auc:1 test-auc:1
[9] train-auc:1 test-auc:1
2017-01-13 09:25:55 -08:00
jokari69
fb0fc0c580 option to shuffle data in mknfolds (#1459)
* option to shuffle data in mknfolds

* removed possibility to run as stand alone test

* split function def in 2 lines for lint

* option to shuffle data in mknfolds

* removed possibility to run as stand alone test

* split function def in 2 lines for lint
2016-12-23 07:53:30 +08:00
AbdealiJK
47ba2de7d4 tests/cpp: Add tests for multiclass_metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
a7e20555a3 tests/cpp: Add tests for rank_metrics.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
4a2ef130a7 tests/cpp: Add test for elementwise_metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
03abd47f49 tests/cpp: Add tests for Metric RMSE 2016-12-04 11:25:57 -08:00
AbdealiJK
582c373274 tests/cpp: Add tests for metric.cc 2016-12-04 11:25:57 -08:00
AbdealiJK
cc859420ba tests/cpp: Add tests for TweedieRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
fa865564f6 tests/cpp: Add tests for GammaRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
401e4b5220 tests/cpp: Add tests for PoissonRegression 2016-12-04 11:25:57 -08:00
AbdealiJK
d41aab4f61 tests/cpp: Add tests for regression_obj.cc
Test the objective functions in regression_obj.cc

tests/cpp: Add tests for objective.cc and RegLossObj
2016-12-04 11:25:57 -08:00
AbdealiJK
fd99d39372 tests/cpp: Add tests for SplitEntry 2016-12-04 11:25:57 -08:00
AbdealiJK
62e3468603 tests/cpp: Add tests for param.h 2016-12-04 11:25:57 -08:00
AbdealiJK
d6407c3746 tests/cpp: Add tests for SparsePageDMatrix
The SparsePageDMatrix or external memory DMatrix reads data from the
file IO rather than load it into RAM.
2016-12-04 11:25:57 -08:00
AbdealiJK
c3629c91d3 tests/cpp: Add tests for SimpleCSRSource
Test the binary format saved and read by a SimpleDMatrix, which is
internally the SimpleCSRSource.
2016-12-04 11:25:57 -08:00
AbdealiJK
be0f55d563 tests/cpp: Add tests for SimpleDMatrix 2016-12-04 11:25:57 -08:00
AbdealiJK
ef7fe06cf8 tests/cpp/test_metainfo: Add tests to save and load 2016-12-04 11:25:57 -08:00
AbdealiJK
8eb69e0677 travis: Add code coverage on success
Update the code coverage of the project on codecov for easy viewing.

Also the gcov on travis uses a different version which cannot
find the directory of the given files, and it needs to be specified
in the -o flag. Hence now we loop over the list of files and
run them independently.
2016-12-04 11:25:57 -08:00
AbdealiJK
61a9b3a49e travis: Run CPP tests 2016-12-04 11:25:57 -08:00
AbdealiJK
006f9e0760 Makefile: Add CPP code coverage 2016-12-04 11:25:57 -08:00
AbdealiJK
1f2ad36bad Add make commands for tests
This adds the make commands required to build and run tests.
2016-12-04 11:25:57 -08:00
Yuan (Terry) Tang
090b37e85d Bumped up err assert in glm test (#1792) 2016-11-20 18:23:19 -06:00
Tianqi Chen
060a0ac396 Update setup.sh 2016-11-19 17:57:47 -08:00
Tianqi Chen
aa841ee58d Update setup.sh 2016-11-19 17:56:36 -08:00
AbdealiJK
b94fcab4dc Add dump_format=json option (#1726)
* Add format to the params accepted by DumpModel

Currently, only the test format is supported when trying to dump
a model. The plan is to add more such formats like JSON which are
easy to read and/or parse by machines. And to make the interface
for this even more generic to allow other formats to be added.

Hence, we make some modifications to make these function generic
and accept a new parameter "format" which signifies the format of
the dump to be created.

* Fix typos and errors in docs

* plugin: Mention all the register macros available

Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.

* sparce_page_source: Use same arg name in .h and .cc

* gbm: Add JSON dump

The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.

The JSON file has an array, each item is a JSON object for the tree.
For gblinear:
 - The item is the bias and weights vectors
For gbtree:
 - The item is the root node. The root node has a attribute "children"
   which holds the children nodes. This happens recursively.

* core.py: Add arg dump_format for get_dump()
2016-11-04 09:55:25 -07:00
Yuan (Terry) Tang
63829d656c Fix mknfold using new StratifiedKFold API (#1660) 2016-10-12 14:43:37 -07:00
Yuan (Terry) Tang
a64fd74421 Fix wrong expected feature types (#1646) 2016-10-08 21:16:29 -07:00
Vadim Khotilovich
693ddb860e More robust DMatrix creation from a sparse matrix (#1606)
* [CORE] DMatrix from sparse w/ explicit #col #row; safer arg types

* [python-package] c-api change for _init_from_csr _init_from_csc

* fix spaces

* [R-package] adopt the new XGDMatrixCreateFromCSCEx interface

* [CORE] redirect old sparse creators to new ones
2016-09-25 10:01:22 -07:00
RAMitchell
93196eb811 cmake build system (#1314)
* Changed c api to compile under MSVC

* Include functional.h header for MSVC

* Add cmake build
2016-07-02 19:07:35 -07:00
Yoshinori Nakano
949d1e3027 add Dart booster (#1220) 2016-06-08 14:04:01 -07:00
Vadim Khotilovich
9a48a40cf1 Fixes for multiple and default metric (#1239)
* fix multiple evaluation metrics

* create DefaultEvalMetric only when really necessary

* py test for #1239

* make travis happy
2016-06-04 22:17:35 -07:00
tqchen
149589c583 [PYTHON] Refactor trainnig API to use callback 2016-05-19 21:31:23 -07:00
Alistair Johnson
6750c8b743 Added other feature importances in python package (#1135)
* added new function to calculate other feature importances

* added capability to plot other feature importance measures

* changed plotting default to fscore

* added info on importance_type to boilerplate comment

* updated text of error statement

* added self module name to fix call

* added unit test for feature importances

* style fixes
2016-05-02 12:25:24 -05:00
sinhrks
9da2f3e613 DOC/TST: Fix Python sklearn dep 2016-05-01 17:27:43 +09:00
Faron
ad3f49e881 [py] eta decay bugfix 2016-04-30 15:51:57 +02:00
sinhrks
6bab164d80 Bug mixing DMatrix's with and without feature names 2016-04-30 14:42:57 +09:00
Faron
cf607e2448 [py] split value histograms 2016-04-28 20:26:21 +02:00
sinhrks
c55cc809e5 BUG: XGBClassifier.feature_importances_ raises ValueError if input is pandas DataFrame 2016-04-27 21:50:03 +09:00
sinhrks
8fc2456c87 Enable flake8 2016-04-24 17:32:31 +09:00
CodingCat
a31a978471 run native lib building command from maven 2016-03-16 16:47:08 -04:00
tqchen
ec2fb5bc48 Fix multi-class loading 2016-03-10 19:22:26 -08:00
tqchen
ced6d45e01 Update rabit 2016-03-02 20:53:34 -08:00
CodingCat
f8fff6c6fc rename files/packages 2016-03-01 23:48:35 -05:00
CodingCat
55e36893cd add style check for java and scala code 2016-03-01 20:53:50 -05:00
CodingCat
3b246c2420 re-structure Java API, add Scala API and consolidate the names of Java/Scala API 2016-03-01 20:53:41 -05:00
tqchen
ecb3a271be [PYTHON-DIST] Distributed xgboost python training API. 2016-02-29 16:54:13 -08:00
terrytangyuan
803a6fe474 Separate dependencies and lightweight test env for Python 2016-02-28 20:11:10 -06:00
tqchen
4a16b729fc [PYTHON] Simplify training logic, update rabit lib 2016-02-28 13:20:55 -08:00
tqchen
90bc7f8f6b [TEST] Fix travis test when reading hdfs 2016-02-27 18:15:32 -08:00
Tianqi Chen
758a77de9c Fix testcase after update and allow hdfs load 2016-02-26 17:04:51 -08:00