* option to shuffle data in mknfolds
* removed possibility to run as stand alone test
* split function def in 2 lines for lint
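A minimal sketch of what a shuffle option for fold creation could look like. This is not the actual mknfolds code: `make_folds`, its arguments and the round-robin assignment are assumptions made for illustration.

```python
import random


def make_folds(n_rows, nfold, shuffle=True, seed=0):
    """Split row indices 0..n_rows-1 into nfold (train, test) index pairs.

    Hypothetical helper: names and defaults are illustrative only.
    """
    idx = list(range(n_rows))
    if shuffle:
        # Shuffle with a private RNG so the result is reproducible per seed.
        random.Random(seed).shuffle(idx)
    # Deal indices round-robin so fold sizes differ by at most one.
    test_folds = [idx[k::nfold] for k in range(nfold)]
    folds = []
    for k in range(nfold):
        test = test_folds[k]
        train = [i for j, fold in enumerate(test_folds) if j != k for i in fold]
        folds.append((train, test))
    return folds


for train, test in make_folds(10, 3, shuffle=True, seed=1):
    print(sorted(test), len(train))
```

Without shuffling, rows are assigned to folds purely by position, which can bias the folds when the input is sorted (for example by label).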
* Fix various typos
* Add override to functions that are overridden
gcc warns about functions that override a base-class function but are not
marked as override. This fixes it.
* Use bst_float consistently
Use bst_float for all the variables that involve weight,
leaf value, gradient, hessian, gain, loss_chg, predictions,
base_margin, feature values.
In some cases, where additions and similar operations can make the
value grow larger, double is used.
This ensures that type conversions are minimal and reduces loss of
precision.
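A rough illustration of why an accumulator may need double even when the individual values are single precision. It is written in Python rather than C++; `to_float32` and `accumulate` are hypothetical helpers that emulate a 32-bit accumulator by rounding after every addition.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (a C double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, single_precision):
    """Sum values, optionally rounding the running total to float32."""
    total = 0.0
    for v in values:
        total += v
        if single_precision:
            total = to_float32(total)
    return total


# 10000 values that are each exactly representable as a 32-bit float.
values = [to_float32(0.1)] * 10000
reference = math.fsum(values)  # correctly rounded sum

err32 = abs(accumulate(values, True) - reference)
err64 = abs(accumulate(values, False) - reference)
print("float32 accumulator error:", err32)
print("double accumulator error: ", err64)
```

The inputs are identical in both runs, so the difference in error comes only from the width of the accumulator.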
* Allow using learning_rates parameter when doing CV
- Create a new `callback_cv` method working when called from `xgb.cv()`
- Rename existing `callback` into `callback_train` and make it the default callback
- Get the logic out of the callbacks and place it into a common helper
* Add a learning_rates parameter to cv()
* lint
* remove caller explicit reference
* callback is aware of its calling context
* remove caller argument
* remove learning_rates param
* restore learning_rates for training, but deprecated
* lint
* lint line too long
* quick example for predefined callbacks
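The "common helper" idea above can be sketched as follows. This is an assumption-laden illustration, not the code from this change: `resolve_learning_rate` is a hypothetical name, and the sketch assumes learning_rates is either a list with one entry per round or a function of (current round, total rounds).

```python
def resolve_learning_rate(learning_rates, iteration, n_rounds):
    """Return the learning rate to use for the given boosting round.

    Hypothetical helper shared by the train and cv callbacks.
    """
    if callable(learning_rates):
        return learning_rates(iteration, n_rounds)
    if len(learning_rates) != n_rounds:
        raise ValueError("learning_rates must have one entry per boosting round")
    return learning_rates[iteration]


def callback_train(learning_rates, iteration, n_rounds, params):
    """Training-side callback: update a single parameter dict."""
    params["learning_rate"] = resolve_learning_rate(learning_rates, iteration, n_rounds)


def callback_cv(learning_rates, iteration, n_rounds, fold_params):
    """CV-side callback: apply the same rate to the parameters of every fold."""
    rate = resolve_learning_rate(learning_rates, iteration, n_rounds)
    for params in fold_params:
        params["learning_rate"] = rate


params = {}
callback_train([0.3, 0.2, 0.1], 1, 3, params)
print(params)
```

Keeping the schedule logic in one place means the two callbacks differ only in what they update.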
* Add format to the params accepted by DumpModel
Currently, only the text format is supported when trying to dump
a model. The plan is to add more formats, such as JSON, which are
easy for machines to read and parse, and to make the interface
generic enough that other formats can be added.
Hence, we modify these functions to be generic
and to accept a new parameter "format", which specifies the format of
the dump to be created.
* Fix typos and errors in docs
* plugin: Mention all the register macros available
Document the register macros currently available to the plugin
writers so they know what exactly can be extended using hooks.
* sparse_page_source: Use same arg name in .h and .cc
* gbm: Add JSON dump
The dump_format argument can be used to specify what type
of dump file should be created. Add functionality to dump
gblinear and gbtree into a JSON file.
The JSON file has an array, each item is a JSON object for the tree.
For gblinear:
- The item is the bias and weights vectors
For gbtree:
- The item is the root node. The root node has an attribute "children"
which holds the child nodes. This happens recursively.
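A small sketch of how a consumer might walk the recursive gbtree structure described above. Only the "children" attribute comes from this description; the other field names ("nodeid", "split", "leaf") and the sample values are assumptions made for the example.

```python
import json

# Hand-written sample in the assumed shape: an array with one object per tree.
dump = json.loads("""
[
  {"nodeid": 0, "split": "f0", "children": [
      {"nodeid": 1, "leaf": 0.5},
      {"nodeid": 2, "split": "f1", "children": [
          {"nodeid": 3, "leaf": -0.2},
          {"nodeid": 4, "leaf": 0.1}
      ]}
  ]}
]
""")


def collect_leaves(node):
    """Return leaf values in left-to-right order by recursing into children."""
    if "children" not in node:
        return [node["leaf"]]
    leaves = []
    for child in node["children"]:
        leaves.extend(collect_leaves(child))
    return leaves


def tree_depth(node):
    """Depth of the tree below node; a leaf has depth 0."""
    if "children" not in node:
        return 0
    return 1 + max(tree_depth(child) for child in node["children"])


for tree in dump:
    print(collect_leaves(tree), tree_depth(tree))
```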
* core.py: Add arg dump_format for get_dump()
* make DMatrix._init_from_npy2d only copy data when necessary
When creating a DMatrix from a 2d ndarray, the input data can be copied unnecessarily. This is problematic when the data is already very large, because the extra copy can exhaust memory. The copy is temporary (it goes out of scope at the end of this function) but it still adds to peak memory usage.
``numpy.array`` copies its input no matter what by default. By adding ``copy=False``, it will only do so when necessary. Since XGDMatrixCreateFromMat is readonly on the input buffer, this copy is not needed.
Also added comments explaining when a copy can happen (if data ordering/layout is wrong or if type is not 32-bit float).
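The copy-only-when-necessary behaviour can be checked with a short sketch. It uses `np.ascontiguousarray` as a stand-in for the call in this change, because the meaning of `copy=False` differs between NumPy versions (in NumPy 2 it raises if a copy is required). `as_float32_matrix` is a hypothetical helper name.

```python
import numpy as np


def as_float32_matrix(data):
    """Return a C-ordered float32 array, copying only if required."""
    return np.ascontiguousarray(data, dtype=np.float32)


already_ok = np.zeros((3, 2), dtype=np.float32)   # right dtype and layout
wrong_dtype = np.zeros((3, 2), dtype=np.float64)  # needs a dtype conversion
wrong_order = np.asfortranarray(already_ok)       # needs a layout change

# shares_memory tells us whether the result still points at the input buffer.
print(np.shares_memory(already_ok, as_float32_matrix(already_ok)))
print(np.shares_memory(wrong_dtype, as_float32_matrix(wrong_dtype)))
print(np.shares_memory(wrong_order, as_float32_matrix(wrong_order)))
```

Only the first case reuses the input buffer; the other two need a conversion and therefore a copy.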
* remove whitespace
* Fix 1439
* Fix python wrapper: when an eval set name contains '-', the early_stop maximize variable can't be set to True properly
Change-Id: Ib0595afd4ae7b445a84c00a3a8faeccc506c6d13
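A hypothetical illustration of this kind of bug, not the wrapper's actual code. It assumes evaluation labels have the form "<eval set name>-<metric>" and that metrics such as auc, map and ndcg are to be maximized; both are assumptions of this sketch.

```python
# Assumed set of metrics where larger is better.
MAXIMIZE_METRICS = ("auc", "map", "ndcg")


def should_maximize_naive(label):
    """Takes the second '-' separated token; breaks if the set name has a '-'."""
    metric = label.split("-")[1]
    return any(metric.startswith(m) for m in MAXIMIZE_METRICS)


def should_maximize(label):
    """Takes everything after the last '-', so the set name may contain '-'."""
    metric = label.rsplit("-", 1)[-1]
    return any(metric.startswith(m) for m in MAXIMIZE_METRICS)


print(should_maximize_naive("eval-set-auc"))
print(should_maximize("eval-set-auc"))
```

Splitting from the right assumes the metric name itself contains no '-'; a metric name with a '-' in it would need different handling.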
* add scikit-learn v0.18 compatibility
import KFold & StratifiedKFold from sklearn.model_selection instead of sklearn.cross_validation
* change DeprecationWarning to ImportError
A failed import raises ImportError, not DeprecationWarning, so it should work the other way around.
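The try-the-new-location-first pattern can be sketched generically. `import_first` is a hypothetical helper; for the case above the candidates would be "sklearn.model_selection" followed by "sklearn.cross_validation". The demo uses a standard library module so it runs without scikit-learn installed.

```python
import importlib


def import_first(module_names):
    """Import and return the first module in module_names that is available."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Not available under this name; try the next candidate.
            continue
    raise ImportError("none of these modules could be imported: %s"
                      % ", ".join(module_names))


# The first name does not exist, so the helper falls back to the second.
module = import_first(["no_such_module_xyz", "json"])
print(module.__name__)
```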
Currently xgboost can only be installed by running:
python setup.py install
Now it can be packaged (in binary form) as a wheel and installed like:
pip install xgboost-0.6-py2-none-any.whl
distutils and wheel install `data_files` differently than setuptools.
setuptools will install the `data_files` in the package directory whereas the
others install it in `sys.prefix`. By adding `sys.prefix` to the list of
directories to check for the shared library, xgboost can now be distributed as
a wheel.
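A minimal sketch of a library lookup that also checks `sys.prefix`. The directory layout, the file name "libxgboost.so" and both function names are assumptions for illustration, not the package's actual lookup code.

```python
import os
import sys


def candidate_lib_dirs(package_dir):
    """Directories to search for the shared library, in priority order."""
    return [
        package_dir,                          # setuptools: next to the package
        os.path.join(package_dir, "lib"),     # assumed in-tree build location
        os.path.join(sys.prefix, "xgboost"),  # distutils/wheel: under sys.prefix
    ]


def find_lib(package_dir, lib_name="libxgboost.so"):
    """Return the first existing library path, or None if none is found."""
    for directory in candidate_lib_dirs(package_dir):
        path = os.path.join(directory, lib_name)
        if os.path.isfile(path):
            return path
    return None


print(candidate_lib_dirs(os.getcwd()))
```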
* force gcc-5 or clang-omp for Mac OS, prepare for pip pack
* add sklearn dep, make -j4
* finalize PyPI submission
* revert to Xcode clang for passing build #1468
* force to clang, try to solve cmake travis error
* remove sklearn dependency
* added new function to calculate other feature importances
* added capability to plot other feature importance measures
* changed plotting default to fscore
* added info on importance_type to boilerplate comment
* updated text of error statement
* added self module name to fix call
* added unit test for feature importances
* style fixes
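A rough sketch of how different importance measures can be derived from the same set of splits. The input format (one `(feature, gain)` pair per split) and the function name are assumptions; "weight" is used here for the split count (the fscore) and "gain" for the average gain per split.

```python
from collections import defaultdict


def feature_importance(splits, importance_type="weight"):
    """Compute a per-feature importance from (feature, gain) split records."""
    counts = defaultdict(int)
    gains = defaultdict(float)
    for feature, gain in splits:
        counts[feature] += 1
        gains[feature] += gain
    if importance_type == "weight":
        # Number of times each feature is used to split.
        return dict(counts)
    if importance_type == "gain":
        # Average gain over the splits that use each feature.
        return {f: gains[f] / counts[f] for f in counts}
    raise ValueError("unknown importance_type: %r" % (importance_type,))


splits = [("f0", 4.0), ("f1", 1.0), ("f0", 2.0)]
print(feature_importance(splits, "weight"))
print(feature_importance(splits, "gain"))
```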
The feature_names mismatch error message can be hard to understand when there are several fields, as shown in the example below. This improves the error message, letting the user know which fields were unexpected or missing.
import xgboost as xgb
import pandas as pd
train = pd.DataFrame({'a':[1], 'b':[2], 'c':[3], 'd':[4], 'f':[2], 'g':2, 'etc etc etc':[11]})
dtrain = xgb.DMatrix(train.drop('d', axis=1), train.d)
test = pd.DataFrame({'a':[1], 'b':[2], 'c':[1], 'd':[4], 'e':[2], 'f':[2], 'g':2, 'etc etc etc':[11]})
dtest = xgb.DMatrix(test)
modl = xgb.train({}, dtrain)
modl.predict(dtest)
# ValueError: feature_names mismatch: [u'a', u'b', u'c', u'etc etc etc', u'f', u'g'] [u'a', u'b', u'c', u'd', u'e', u'etc etc etc', u'f', u'g']
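The improved message can be sketched with two set differences. The function names and the exact wording of the message are assumptions for illustration, not the text emitted by the library.

```python
def diff_feature_names(expected, received):
    """Return (missing, unexpected) field names, each sorted."""
    expected_set, received_set = set(expected), set(received)
    missing = sorted(expected_set - received_set)
    unexpected = sorted(received_set - expected_set)
    return missing, unexpected


def mismatch_message(expected, received):
    """Build an error message that names the offending fields."""
    missing, unexpected = diff_feature_names(expected, received)
    lines = ["feature_names mismatch: %s %s" % (expected, received)]
    if missing:
        lines.append("missing from input data: " + ", ".join(missing))
    if unexpected:
        lines.append("not seen in training data: " + ", ".join(unexpected))
    return "\n".join(lines)


# Field names taken from the example above.
trained_on = ["a", "b", "c", "etc etc etc", "f", "g"]
predicting_on = ["a", "b", "c", "d", "e", "etc etc etc", "f", "g"]
print(mismatch_message(trained_on, predicting_on))
```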