- allows feval to return a list of tuples (name, error/score value) - changed behavior for multiple eval_metrics in conjunction with early_stopping: Instead of raising an error, the last passed evel_metric (or last entry in return value of feval) is used for early stopping - allows list of eval_metrics in dict-typed params - unittest for new features / behavior documentation updated - example for assigning a list to 'eval_metric' - note about early stopping on last passed eval metric - info msg for used eval metric added
166 lines
4.9 KiB
Markdown
166 lines
4.9 KiB
Markdown
Python Package Introduction
|
|
===========================
|
|
This document gives a basic walkthrough of xgboost python package.
|
|
|
|
***List of other Helpful Links***
|
|
* [Python walkthrough code collections](https://github.com/tqchen/xgboost/blob/master/demo/guide-python)
|
|
* [Python API Reference](python_api.rst)
|
|
|
|
Install XGBoost
|
|
---------------
|
|
To install XGBoost, do the following steps:
|
|
|
|
* You need to run `make` in the root directory of the project
|
|
* In the `python-package` directory run
|
|
```shell
|
|
python setup.py install
|
|
```
|
|
|
|
```python
|
|
import xgboost as xgb
|
|
```
|
|
|
|
Data Interface
|
|
--------------
|
|
The XGBoost python module is able to load data from:
|
|
- libsvm txt format file
|
|
- Numpy 2D array, and
|
|
- xgboost binary buffer file.
|
|
|
|
The data will be store in a ```DMatrix``` object.
|
|
|
|
* To load a libsvm text file or a XGBoost binary file into ```DMatrix```, the command is:
|
|
```python
|
|
dtrain = xgb.DMatrix('train.svm.txt')
|
|
dtest = xgb.DMatrix('test.svm.buffer')
|
|
```
|
|
* To load a numpy array into ```DMatrix```, the command is:
|
|
```python
|
|
data = np.random.rand(5,10) # 5 entities, each contains 10 features
|
|
label = np.random.randint(2, size=5) # binary target
|
|
dtrain = xgb.DMatrix( data, label=label)
|
|
```
|
|
* To load a scpiy.sparse array into ```DMatrix```, the command is:
|
|
```python
|
|
csr = scipy.sparse.csr_matrix((dat, (row, col)))
|
|
dtrain = xgb.DMatrix(csr)
|
|
```
|
|
* Saving ```DMatrix``` into XGBoost binary file will make loading faster in next time:
|
|
```python
|
|
dtrain = xgb.DMatrix('train.svm.txt')
|
|
dtrain.save_binary("train.buffer")
|
|
```
|
|
* To handle missing value in ```DMatrix```, you can initialize the ```DMatrix``` by specifying missing values:
|
|
```python
|
|
dtrain = xgb.DMatrix(data, label=label, missing = -999.0)
|
|
```
|
|
* Weight can be set when needed:
|
|
```python
|
|
w = np.random.rand(5, 1)
|
|
dtrain = xgb.DMatrix(data, label=label, missing = -999.0, weight=w)
|
|
```
|
|
|
|
Setting Parameters
|
|
------------------
|
|
XGBoost use list of pair to save [parameters](../parameter.md). Eg
|
|
* Booster parameters
|
|
```python
|
|
param = {'bst:max_depth':2, 'bst:eta':1, 'silent':1, 'objective':'binary:logistic' }
|
|
param['nthread'] = 4
|
|
param['eval_metric'] = 'auc'
|
|
```
|
|
* You can also specify multiple eval metrics:
|
|
```python
|
|
param['eval_metric'] = ['auc', 'ams@0']
|
|
|
|
# alternativly:
|
|
# plst = param.items()
|
|
# plst += [('eval_metric', 'ams@0')]
|
|
```
|
|
|
|
* Specify validations set to watch performance
|
|
```python
|
|
evallist = [(dtest,'eval'), (dtrain,'train')]
|
|
```
|
|
|
|
Training
|
|
--------
|
|
|
|
With parameter list and data, you are able to train a model.
|
|
* Training
|
|
```python
|
|
num_round = 10
|
|
bst = xgb.train( plst, dtrain, num_round, evallist )
|
|
```
|
|
* Saving model
|
|
After training, you can save model and dump it out.
|
|
```python
|
|
bst.save_model('0001.model')
|
|
```
|
|
* Dump Model and Feature Map
|
|
You can dump the model to txt and review the meaning of model
|
|
```python
|
|
# dump model
|
|
bst.dump_model('dump.raw.txt')
|
|
# dump model with feature map
|
|
bst.dump_model('dump.raw.txt','featmap.txt')
|
|
```
|
|
* Loading model
|
|
After you save your model, you can load model file at anytime by using
|
|
```python
|
|
bst = xgb.Booster({'nthread':4}) #init model
|
|
bst.load_model("model.bin") # load data
|
|
```
|
|
|
|
Early Stopping
|
|
--------------
|
|
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds.
|
|
Early stopping requires at least one set in `evals`. If there's more than one, it will use the last.
|
|
|
|
`train(..., evals=evals, early_stopping_rounds=10)`
|
|
|
|
The model will train until the validation score stops improving. Validation error needs to decrease at least every `early_stopping_rounds` to continue training.
|
|
|
|
If early stopping occurs, the model will have two additional fields: `bst.best_score` and `bst.best_iteration`. Note that `train()` will return a model from the last iteration, not the best one.
|
|
|
|
This works with both metrics to minimize (RMSE, log loss, etc.) and to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric the last one in `param['eval_metric']` is used for early stopping.
|
|
|
|
Prediction
|
|
----------
|
|
After you training/loading a model and preparing the data, you can start to do prediction.
|
|
```python
|
|
# 7 entities, each contains 10 features
|
|
data = np.random.rand(7, 10)
|
|
dtest = xgb.DMatrix(data)
|
|
ypred = bst.predict(xgmat)
|
|
```
|
|
|
|
If early stopping is enabled during training, you can predict with the best iteration.
|
|
```python
|
|
ypred = bst.predict(xgmat,ntree_limit=bst.best_iteration)
|
|
```
|
|
|
|
Plotting
|
|
--------
|
|
|
|
You can use plotting module to plot importance and output tree.
|
|
|
|
To plot importance, use ``plot_importance``. This function requires ``matplotlib`` to be installed.
|
|
|
|
```python
|
|
xgb.plot_importance(bst)
|
|
```
|
|
|
|
To output tree via ``matplotlib``, use ``plot_tree`` specifying ordinal number of the target tree.
|
|
This function requires ``graphviz`` and ``matplotlib``.
|
|
|
|
```python
|
|
xgb.plot_tree(bst, num_trees=2)
|
|
```
|
|
|
|
When you use ``IPython``, you can use ``to_graphviz`` function which converts the target tree to ``graphviz`` instance. ``graphviz`` instance is automatically rendered on ``IPython``.
|
|
|
|
```python
|
|
xgb.to_graphviz(bst, num_trees=2)
|
|
```
|