Merge remote-tracking branch 'dmlc/master'

El Potaeto
2015-08-05 12:07:41 +02:00
62 changed files with 1802 additions and 834 deletions

demo/.gitignore

@@ -1 +1,2 @@
*.libsvm
*.libsvm
*.pkl


@@ -1,14 +1,14 @@
XGBoost Examples
====
This folder contains all the code examples using xgboost.
XGBoost Code Examples
=====================
This folder contains all the code examples using xgboost.
* Contribution of examples, benchmarks is more than welcome!
* If you like to share how you use xgboost to solve your problem, send a pull request:)
Features Walkthrough
====
This is a list of short codes introducing different functionalities of xgboost and its wrapper.
* Basic walkthrough of wrappers
--------------------
This is a list of short codes introducing different functionalities of xgboost packages.
* Basic walkthrough of packages
[python](guide-python/basic_walkthrough.py)
[R](../R-package/demo/basic_walkthrough.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/basic_walkthrough.jl)
@@ -20,24 +20,24 @@ This is a list of short codes introducing different functionalities of xgboost a
[python](guide-python/boost_from_prediction.py)
[R](../R-package/demo/boost_from_prediction.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
* Predicting using first n trees
* Predicting using first n trees
[python](guide-python/predict_first_ntree.py)
[R](../R-package/demo/boost_from_prediction.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/boost_from_prediction.jl)
* Generalized Linear Model
[python](guide-python/generalized_linear_model.py)
[R](../R-package/demo/generalized_linear_model.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/generalized_linear_model.jl)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/generalized_linear_model.jl)
* Cross validation
[python](guide-python/cross_validation.py)
[R](../R-package/demo/cross_validation.R)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/cross_validation.jl)
[Julia](https://github.com/antinucleon/XGBoost.jl/blob/master/demo/cross_validation.jl)
* Predicting leaf indices
[python](guide-python/predict_leaf_indices.py)
[R](../R-package/demo/predict_leaf_indices.R)
Basic Examples by Tasks
====
-----------------------
Most of examples in this section are based on CLI or python version.
However, the parameter settings can be applied to all versions
* [Binary classification](binary_classification)
@@ -46,7 +46,7 @@ However, the parameter settings can be applied to all versions
* [Learning to Rank](rank)
Benchmarks
====
----------
* [Starter script for Kaggle Higgs Boson](kaggle-higgs)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)
* [Kaggle Tradeshift winning solution by daxiongshu](https://github.com/daxiongshu/kaggle-tradeshift-winning-solution)


@@ -1,6 +1,6 @@
XGBoost Python Feature Walkthrough
====
* [Basic walkthrough of wrappers](basic_walkthrough.py)
==================================
* [Basic walkthrough of wrappers](basic_walkthrough.py)
* [Customize loss function, and evaluation metric](custom_objective.py)
* [Boosting from existing prediction](boost_from_prediction.py)
* [Predicting using first n trees](predict_first_ntree.py)


@@ -8,7 +8,7 @@ import pickle
import xgboost as xgb
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.cross_validation import KFold, train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_iris, load_digits, load_boston
@@ -65,3 +65,13 @@ print("Pickling sklearn API models")
pickle.dump(clf, open("best_boston.pkl", "wb"))
clf2 = pickle.load(open("best_boston.pkl", "rb"))
print(np.allclose(clf.predict(X), clf2.predict(X)))
# Early-stopping
X = digits['data']
y = digits['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = xgb.XGBClassifier()
clf.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="auc",
        eval_set=[(X_test, y_test)])


@@ -1,7 +1,7 @@
---
title: "Understanding XGBoost Model on Otto Dataset"
author: "Michaël Benesty"
output:
output:
  rmarkdown::html_vignette:
    css: ../../R-package/vignettes/vignette.css
    number_sections: yes
@@ -54,7 +54,7 @@ test[1:6,1:5, with =F]
Each *column* represents a feature measured by an `integer`. Each *row* is an **Otto** product.
Obviously the first column (`ID`) doesn't contain any useful information.
Obviously the first column (`ID`) doesn't contain any useful information.
To let the algorithm focus on real stuff, we will delete it.
@@ -124,7 +124,7 @@ param <- list("objective" = "multi:softprob",
cv.nround <- 5
cv.nfold <- 3
bst.cv = xgb.cv(param=param, data = trainMatrix, label = y,
bst.cv = xgb.cv(param=param, data = trainMatrix, label = y,
                nfold = cv.nfold, nrounds = cv.nround)
```
> As we can see the error rate is low on the test dataset (for a 5mn trained model).
@@ -144,7 +144,7 @@ Feature importance
So far, we have built a model made of **`r nround`** trees.
To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).
To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).
Each division operation is called a *split*.
@@ -158,7 +158,7 @@ In the same way, in Boosting we try to optimize the missclassification at each r
The improvement brought by each *split* can be measured, it is the *gain*.
Each *split* is done on one feature only at one value.
Each *split* is done on one feature only at one value.
Let's see what the model looks like.
@@ -168,7 +168,7 @@ model[1:10]
```
> For convenience, we are displaying the first 10 lines of the model only.
Clearly, it is not easy to understand what it means.
Clearly, it is not easy to understand what it means.
Basically each line represents a *branch*, there is the *tree* ID, the feature ID, the point where it *splits*, and information regarding the next *branches* (left, right, when the row for this feature is N/A).
@@ -217,7 +217,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)
We are just displaying the first two trees here.
On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the interaction between features is complicated.
Besides, **XGBoost** generates `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
Going deeper