114 Commits

Author SHA1 Message Date
ngoyal2707
902ecbade8 added python doc string for nthreads to dmatrix (#3363) 2018-06-08 14:16:30 +12:00
Kristian Gampong
a510e68dda Add validate_features option for Booster predict (#3323)
* Add validate_features option for Booster predict

* Fix trailing whitespace in docstring
2018-05-29 11:40:49 -07:00
Rory Mitchell
a185ddfe03
Implement GPU accelerated coordinate descent algorithm (#3178)
* Implement GPU accelerated coordinate descent algorithm. 

* Exclude external memory tests for GPU
2018-04-20 14:56:35 +12:00
Philip Hyunsu Cho
32ea70c1c9
Documenting CSV loading into DMatrix (#3137)
* Support CSV file in DMatrix

We'd just need to expose the CSV parser in dmlc-core to the Python wrapper

* Revert extra code; document existing CSV support

CSV support is already there but undocumented

* Add notice about categorical features
2018-02-28 18:41:10 -08:00
Scott Lundberg
d878c36c84 Add SHAP interaction effects, fix minor bug, and add cox loss (#3043)
* Add interaction effects and cox loss

* Minimize whitespace changes

* Cox loss now no longer needs a pre-sorted dataset.

* Address code review comments

* Remove mem check, rename to pred_interactions, include bias

* Make lint happy

* More lint fixes

* Fix cox loss indexing

* Fix main effects and tests

* Fix lint

* Use half interaction values on the off-diagonals

* Fix lint again
2018-02-07 20:38:01 -06:00
Yuchao Dai
eedca8c8ec fix the typo in core.py (#2978) 2017-12-25 21:08:27 -08:00
Sam O
602b34ab91 Fix performance of c_array in python core.py (#2786) 2017-11-29 11:12:49 -08:00
Rory Mitchell
16c63f30d0
Fix MultiIndex detection (breaks for latest pandas==0.21.0). (#2872) 2017-11-11 11:12:23 +13:00
Scott Lundberg
78c4188cec SHAP values for feature contributions (#2438)
* SHAP values for feature contributions

* Fix commenting error

* New polynomial time SHAP value estimation algorithm

* Update API to support SHAP values

* Fix merge conflicts with updates in master

* Correct submodule hashes

* Fix variable sized stack allocation

* Make lint happy

* Add docs

* Fix typo

* Adjust tolerances

* Remove unneeded def

* Fixed cpp test setup

* Updated R API and cleaned up

* Fixed test typo
2017-10-12 12:35:51 -07:00
Philip Cho
31ad40b963 Make __del__ method idempotent (#2627)
Addresses Issue #2533.
2017-09-27 03:03:55 -04:00
Icyblade Dai
0e85b30fdd Fix issue 2670 (#2671)
* fix issue 2670

* add python<3.6 compatibility

* fix Index

* fix Index/MultiIndex

* fix lint

* fix W0622

really nonsense

* fix lambda

* Trigger Travis

* add test for MultiIndex

* remove tailing whitespace
2017-09-19 15:49:41 -04:00
PSEUDOTENSOR / Jonathan McKinney
6b375f6ad8 Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation (#2530)
* Multi-threaded XGDMatrixCreateFromMat for faster DMatrix creation from numpy arrays for python interface.
2017-07-21 14:43:17 +12:00
Jakub Zakrzewski
ed6384ecbf [Python] Use appropriate integer types when calling native code. (#2361)
Don't use implicit conversions to c_int, which incidentally happen to work
on (some) 64-bit platforms, but:
* may lead to truncation of the input value to a 32-bit signed int,
* cause segfaults on some 32-bit architectures (tested on Ubuntu ARM,
  but is also the likely cause of issue #1707).

Also, when passing references use explicit 64-bit integers, where needed,
instead of c_ulong, which is not guaranteed to be this large.
2017-06-02 10:16:54 -07:00
Juang, Yi-Lin
6776292951 Minor cleanup (#2342)
* Clean up demo of multiclass classification

* Remove extra space
2017-05-26 09:40:41 -04:00
Maurus Cuelenaere
6bd1869026 Add prediction of feature contributions (#2003)
* Add prediction of feature contributions

This implements the idea described at http://blog.datadive.net/interpreting-random-forests/
which tries to give insight in how a prediction is composed of its feature contributions
and a bias.

* Support multi-class models

* Calculate learning_rate per-tree instead of using the one from the first tree

* Do not rely on node.base_weight * learning_rate having the same value as the node mean value (aka leaf value, if it were a leaf); instead calculate them (lazily) on-the-fly

* Add simple test for contributions feature

* Check against param.num_nodes instead of checking for non-zero length

* Loop over all roots instead of only the first
2017-05-14 00:58:10 -05:00
yexu15
179b384e39 A fix regarding the compatibility with python 2.6 (#1981)
* A fix regarding the compatibility with python 2.6

the syntax of {n: self.attr(n) for n in attr_names} is illegal in python 2.6

* Update core.py

add a space after comma
2017-01-29 20:18:28 -08:00
AbdealiJK
6f16f0ef58 Use bst_float consistently throughout (#1824)
* Fix various typos

* Add override to functions that are overridden

gcc gives warnings about functions that are being overridden by not
being marked as oveirridden. This fixes it.

* Use bst_float consistently

Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.

In some cases, when due to additions and so on the value can
take a larger value, double is used.

This ensures that type conversions are minimal and reduces loss of
precision.
2016-11-30 10:02:10 -08:00
Zhongxiao Ma
55bfc29942 keep builtin evaluations while using customized evaluation function (#1624)
* keep builtin evaluations while using customized evaluation function

* fix concat bytes to str
2016-11-10 12:40:48 -08:00
AbdealiJK
b94fcab4dc Add dump_format=json option (#1726)
* Add format to the params accepted by DumpModel

Currently, only the test format is supported when trying to dump
a model. The plan is to add more such formats like JSON which are
easy to read and/or parse by machines. And to make the interface
for this even more generic to allow other formats to be added.

Hence, we make some modifications to make these function generic
and accept a new parameter "format" which signifies the format of
the dump to be created.

* Fix typos and errors in docs

* plugin: Mention all the register macros available

Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.

* sparce_page_source: Use same arg name in .h and .cc

* gbm: Add JSON dump

The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.

The JSON file has an array, each item is a JSON object for the tree.
For gblinear:
 - The item is the bias and weights vectors
For gbtree:
 - The item is the root node. The root node has a attribute "children"
   which holds the children nodes. This happens recursively.

* core.py: Add arg dump_format for get_dump()
2016-11-04 09:55:25 -07:00
Eric Liu
9b2e41340b make DMatrix._init_from_npy2d only copy data when necessary (#1637)
* make DMatrix._init_from_npy2d only copy data when necessary

When creating DMatrix from a 2d ndarray, it can unnecessarily copy the input data. This can be problematic when the data is already very large--running out of memory. The copy is temporary (going out of scope at the end of this function) but it still adds to peak memory usage.

``numpy.array`` copies its input no matter what by default. By adding ``copy=False``, it will only do so when necessary. Since XGDMatrixCreateFromMat is readonly on the input buffer, this copy is not needed.

Also added comments explaining when a copy can happen (if data ordering/layout is wrong or if type is not 32-bit float).

* remove whitespace
2016-10-20 09:30:52 -07:00
Vadim Khotilovich
693ddb860e More robust DMatrix creation from a sparse matrix (#1606)
* [CORE] DMatrix from sparse w/ explicit #col #row; safer arg types

* [python-package] c-api change for _init_from_csr _init_from_csc

* fix spaces

* [R-package] adopt the new XGDMatrixCreateFromCSCEx interface

* [CORE] redirect old sparse creators to new ones
2016-09-25 10:01:22 -07:00
tqchen
149589c583 [PYTHON] Refactor trainnig API to use callback 2016-05-19 21:31:23 -07:00
Vadim Khotilovich
ffed95eec0 py: replace attr_names() with attributes() 2016-05-15 22:04:38 -05:00
Vadim Khotilovich
a13a3a4d76 attr_names for python interface; attribute deletion via set_attr 2016-05-15 02:05:10 -05:00
Shayne Kang
bf24d6ae98 fix VisibleDeprecationWarning 2016-05-08 01:44:04 +09:00
Alistair Johnson
6750c8b743 Added other feature importances in python package (#1135)
* added new function to calculate other feature importances

* added capability to plot other feature importance measures

* changed plotting default to fscore

* added info on importance_type to boilerplate comment

* updated text of error statement

* added self module name to fix call

* added unit test for feature importances

* style fixes
2016-05-02 12:25:24 -05:00
sinhrks
6bab164d80 Bug mixing DMatrix's with and without feature names 2016-04-30 14:42:57 +09:00
Faron
cf607e2448 [py] split value histograms 2016-04-28 20:26:21 +02:00
sinhrks
8fc2456c87 Enable flake8 2016-04-24 17:32:31 +09:00
Julian Quick
bbb9ce1641 Verbose message: which fields have impropper data types
A more verbose error message letting the user know which fields have impropper data types
2016-03-22 14:13:29 -06:00
Julian Quick
2cd109fb98 a more verbose field mismatch error message
This error message can be hard to understand when there are several fields, as shown in the example below. This improves the error message, letting the user know which fields were unexpected or missing.

    import xgboost as xgb
    import pandas as pd
    train = pd.DataFrame({'a':[1], 'b':[2], 'c':[3], 'd':[4], 'f':[2], 'g':2, 'etc etc etc':[11]})
    dtrain = xgb.DMatrix(train.drop('d', axis=1), train.d)
    test = pd.DataFrame({'a':[1], 'b':[2], 'c':[1], 'd':[4], 'e':[2], 'f':[2], 'g':2, 'etc etc etc':[11]})
    dtest = xgb.DMatrix(test)
    modl = xgb.train({}, dtrain)
    modl.predict(dtest)
    
    
    # ValueError: feature_names mismatch: [u'a', u'b', u'c', u'etc etc etc', u'f', u'g'] [u'a', u'b', u'c', u'd', u'e', u'etc etc etc', u'f', u'g']
2016-03-17 18:13:30 -06:00
tqchen
ecb3a271be [PYTHON-DIST] Distributed xgboost python training API. 2016-02-29 16:54:13 -08:00
tqchen
a71ba04109 [DIST] Add Distributed XGBoost on AWS Tutorial 2016-02-25 21:51:37 -08:00
Maxim Grechkin
f5e96eba72 Make missing handling consistent with sklearn's portion of the python package 2016-01-28 14:16:11 -08:00
Kai Luo
d9e50fd7f3 __copy__ calls __deepcopy__ with an argument 2016-01-20 19:57:20 +08:00
Kai Luo
5cd765e935 fix signature of __deepcopy__ method 2016-01-20 17:18:11 +08:00
terrytangyuan
0eb6240fd0 Fixed all lint errors 2015-12-11 18:46:15 -06:00
terrytangyuan
a7e79e089b fix lint errors in core 2015-12-11 18:37:13 -06:00
sinhrks
25c4fbd0cb Cleanup pandas support 2015-11-13 06:55:04 +09:00
antonymayi
8c7b18daed python 2.6 compatibility tweak
replacing set literal {} with set() for python 2.6 compatibility (plus reformatting the line)
2015-11-10 14:50:54 +01:00
Yuan (Terry) Tang
1dd96b6cdc Merge pull request #597 from JohanManders/python-pandas-dtypes
Python pandas dtypes
2015-11-09 18:08:41 -06:00
FrozenFingerz
b59018aa05 python: multiple eval_metrics changes
- allows feval to return a list of tuples (name, error/score value)
- changed behavior for multiple eval_metrics in conjunction with
early_stopping: Instead of raising an error, the last passed evel_metric
(or last entry in return value of feval) is used for early stopping
- allows list of eval_metrics in dict-typed params
- unittest for new features / behavior

documentation updated

- example for assigning a list to 'eval_metric'
- note about early stopping on last passed eval metric

- info msg for used eval metric added
2015-11-08 11:23:54 +01:00
Johan Manders
5f0f8749d9 Cleaned up some code 2015-11-04 18:05:47 +01:00
Johan Manders
f9e1b2b7b7 Added back feature names 2015-11-03 21:26:11 +01:00
Johan Manders
96f221e0d0 Merge pull request #5 from dmlc/master
Update to latest version
2015-11-03 20:37:20 +01:00
Preston Parry
b3bb54da73 fixes typo in error message 2015-10-27 23:34:50 -07:00
Johan Manders
7c79c9ac3a Bool gets mapped to i instead of int 2015-10-19 17:36:57 +02:00
Johan Manders
9bbc3901ee More Pandas dtypes and more flexible variable naming
- Pandas DataFrame supports more dtypes than 'int64', 'float64' and 'bool', therefor added a bunch of extra dtypes for the data variable.
- From now on the label variable can be a Pandas DataFrame with the same dtypes as the data variable.
- If label is a Pandas DataFrame will be converted to float.
- If no feature_types is set, the data dtypes will be converted to 'int' or 'float'.
- The feature_names may contain every character except [, ] or <
2015-10-17 15:13:42 +02:00
sinhrks
dbcb4c8729 Support non-str column names 2015-10-04 13:30:01 +09:00
sinhrks
b943becc61 python DMatrix now accepts pandas DataFrame 2015-10-01 22:52:32 +09:00