Merge pull request #283 from pommedeterresautee/master

OTTO Rmarkdown
This commit is contained in:
Tianqi Chen 2015-05-03 09:09:49 -07:00
commit a8d059902d
26 changed files with 246 additions and 28 deletions

View File

@ -1,3 +1,5 @@
\.o$
\.so$
\.dll$
^.*\.Rproj$
^\.Rproj\.user$

View File

@ -21,7 +21,7 @@ VignetteBuilder: knitr
Suggests:
knitr,
ggplot2 (>= 1.0.0),
DiagrammeR (>= 0.4),
DiagrammeR (>= 0.6),
Ckmeans.1d.dp (>= 3.3.1),
vcd (>= 1.3)
Depends:

View File

@ -1,4 +1,4 @@
# Generated by roxygen2 (4.1.0): do not edit by hand
# Generated by roxygen2 (4.1.1): do not edit by hand
export(getinfo)
export(setinfo)

View File

@ -42,8 +42,9 @@
#' \item \code{reg:logistic} logistic regression.
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
#' \item \code{num_class} set the number of classes. To use only with multiclass objectives.
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
#' }
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
@ -84,7 +85,7 @@
#' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.
#' \item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
#' \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
#' }
#'
#' Full list of parameters is available in the Wiki \url{https://github.com/dmlc/xgboost/wiki/Parameters}.

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgboost.R
\docType{data}
\name{agaricus.test}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgboost.R
\docType{data}
\name{agaricus.train}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/getinfo.xgb.DMatrix.R
\docType{methods}
\name{getinfo}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/nrow.xgb.DMatrix.R
\docType{methods}
\name{nrow,xgb.DMatrix-method}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.R
\docType{methods}
\name{predict,xgb.Booster-method}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/predict.xgb.Booster.handle.R
\docType{methods}
\name{predict,xgb.Booster.handle-method}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/setinfo.xgb.DMatrix.R
\docType{methods}
\name{setinfo}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/slice.xgb.DMatrix.R
\docType{methods}
\name{slice}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.DMatrix.R
\name{xgb.DMatrix}
\alias{xgb.DMatrix}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.DMatrix.save.R
\name{xgb.DMatrix.save}
\alias{xgb.DMatrix.save}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.cv.R
\name{xgb.cv}
\alias{xgb.cv}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.dump.R
\name{xgb.dump}
\alias{xgb.dump}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.importance.R
\name{xgb.importance}
\alias{xgb.importance}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.load.R
\name{xgb.load}
\alias{xgb.load}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.model.dt.tree.R
\name{xgb.model.dt.tree}
\alias{xgb.model.dt.tree}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.plot.importance.R
\name{xgb.plot.importance}
\alias{xgb.plot.importance}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.plot.tree.R
\name{xgb.plot.tree}
\alias{xgb.plot.tree}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.save.R
\name{xgb.save}
\alias{xgb.save}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.save.raw.R
\name{xgb.save.raw}
\alias{xgb.save.raw}

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgb.train.R
\name{xgb.train}
\alias{xgb.train}
@ -48,8 +48,9 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item \code{reg:logistic} logistic regression.
\item \code{binary:logistic} logistic regression for binary classification. Output probability.
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
\item \code{num_class} set the number of classes. To use only with multiclass objectives.
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class - 1}.
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
}
\item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5

View File

@ -1,4 +1,4 @@
% Generated by roxygen2 (4.1.0): do not edit by hand
% Generated by roxygen2 (4.1.1): do not edit by hand
% Please edit documentation in R/xgboost.R
\name{xgboost}
\alias{xgboost}

View File

@ -0,0 +1,214 @@
---
title: "Understanding XGBoost model using only embedded model"
author: "Michaël Benesty"
output: html_document
---
Introduction
============
According to the **Kaggle** forum, XGBoost seems to be one of the most used tools for predicting the class of products from the **OTTO** dataset.
**XGBoost** is an implementation of the famous gradient boosting algorithm described by Friedman in XYZ. This model is often described as a *blackbox*: it works well, but it is not trivial to understand how. Indeed, the model is made of hundreds (thousands?) of decision trees. You may wonder how a human could possibly get a general view of such a model.
The purpose of this RMarkdown document is to demonstrate how we can leverage the functions already implemented in the **XGBoost R** package for that purpose. Of course, everything shown below can be applied to whatever dataset you have to manipulate at work or elsewhere!
First we will train a model on the **OTTO** dataset, then we will generate two visualisations to get a clue of what is important to the model, and finally we will see how we can leverage this information.
Preparation of the data
=======================
This part is based on the tutorial posted on the [**OTTO Kaggle** forum](**LINK HERE**).
First, let's load the packages and the dataset.
```{r loading}
require(xgboost)
require(methods)
require(data.table)
require(magrittr)
train <- fread('data/train.csv', header = TRUE, stringsAsFactors = FALSE)
test <- fread('data/test.csv', header = TRUE, stringsAsFactors = FALSE)
```
> `magrittr` and `data.table` are here to make the code cleaner and faster.
Let's see what is in this dataset.
```{r explore}
# Train dataset dimensions
dim(train)
# Training content
train[1:6, 1:5, with = FALSE]
# Test dataset dimensions
dim(test)
# Test content
test[1:6, 1:5, with = FALSE]
```
> We only display the first 6 rows and first 5 columns for convenience.
Each column represents a feature measured by an integer. Each row is a product.
Obviously the first column (`ID`) doesn't contain any useful information.
To let the algorithm focus on the actual features, we will delete this column.
```{r clean, results='hide'}
# Delete ID column in training dataset
train[, id := NULL]
# Delete ID column in testing dataset
test[, id := NULL]
```
According to the `OTTO` challenge description, this is a multiclass classification problem. We need to extract the labels (here the names of the different classes) from the dataset. We only have two files (test and training); it seems logical that the training file contains the classes we are looking for. Usually the labels are in the first or the last column. Let's check the content of the last column.
```{r searchLabel}
# Check the content of the last column
train[1:6, ncol(train), with = FALSE]
# Save the name of the last column
nameLastCol <- names(train)[ncol(train)]
```
The classes are provided as character strings in the `ncol(train)`th column, whose name is stored in `nameLastCol`. As you may know, **XGBoost** doesn't support anything other than numbers, so we will convert the classes to integers. Moreover, according to the documentation, they should start at 0.
For that purpose, we will:
* extract the target column
* remove "Class_" from each class name
* convert to integers
* subtract 1 from the new value
```{r classToIntegers}
# Convert classes to integers starting at 0
y <- train[, nameLastCol, with = FALSE][[1]] %>% gsub('Class_', '', .) %>% {as.integer(.) - 1}
# Display the first 5 converted labels
y[1:5]
```
We remove the label column from the training dataset, otherwise XGBoost would use it to guess the labels!
```{r deleteCols, results='hide'}
train[, nameLastCol := NULL, with = FALSE]
```
`data.table` is an awesome implementation of data.frame, but unfortunately it is not a format natively supported by XGBoost. We need to convert both datasets (training and test) into numeric matrix format.
```{r convertToNumericMatrix}
trainMatrix <- train[, lapply(.SD, as.numeric)] %>% as.matrix
testMatrix <- test[, lapply(.SD, as.numeric)] %>% as.matrix
```
Model training
==============
Before training, we will use cross-validation to estimate our error rate.
Basically, XGBoost will divide the training data into `nfold` parts, hold the first part out and use it as test data while training on the rest, then put the first part back, hold the second part out, train again, and so on...
Look at the function documentation for more information.
```{r crossValidation}
numberOfClasses <- max(y)
param <- list("objective" = "multi:softprob",
              "eval_metric" = "mlogloss",
              # labels are 0-based, so there are max(y) + 1 classes
              "num_class" = numberOfClasses + 1)
cv.nround <- 50
cv.nfold <- 3
bst.cv = xgb.cv(param=param, data = trainMatrix, label = y,
nfold = cv.nfold, nrounds = cv.nround)
```
> As we can see, the error rate on the held-out folds is low (for a model trained in about 5 minutes).
Finally, we are ready to train the real model!
```{r modelTraining}
nround = 50
bst = xgboost(param=param, data = trainMatrix, label = y, nrounds=nround)
```
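The model is now trained. As a quick side note (this step is not part of the original tutorial), the sketch below shows how the `multi:softprob` output described in the parameter documentation is organized: `predict` returns one long vector of probabilities that can be reshaped into an `ndata * nclass` matrix. The column count and row-wise layout are assumptions based on that description.
```{r predictSketch}
# Sketch: reshape the flat probability vector returned by multi:softprob
pred <- predict(bst, testMatrix)
# One row per product, one column per class (byrow follows the documented layout)
predMatrix <- matrix(pred, ncol = numberOfClasses + 1, byrow = TRUE)
dim(predMatrix)
# Most probable class for the first products (1-based column index)
head(max.col(predMatrix))
```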
Model understanding
===================
Feature importance
------------------
So far, we have built a model made of `nround` trees.
To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, each observation is an **OTTO** product described by its features).
Each division operation is called a *split*.
Each group at each division level is called a branch and the deepest level is called a **leaf**.
In the final model, these leaves are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of **OTTO** product only (of course it is not fully true, but that's what we try to achieve in a minimum of splits).
**Not all splits are equally important**. Basically the first split of a tree will have more impact on the purity than, for instance, the deepest split. Intuitively, we understand that the first split does most of the work, and the following splits focus on smaller parts of the dataset which have been misclassified by the first splits.
In the same way, in Boosting we try to optimize the misclassification at each round (it is called the **loss**). So the first tree will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous trees.
The improvement brought by each split can be measured; it is the **gain**.
Each split is done on one feature only, at one value.
Let's see what the model looks like.
```{r modelDump}
model <- xgb.dump(bst, with.stats = TRUE)
model[1:10]
```
> For convenience, we are displaying the first 10 lines of the model only.
Clearly, it is not easy to understand what it means.
Basically, each line represents a branch: there is the tree ID, the feature ID, the point where it splits, and information regarding the next branches (left, right, and which one to follow when the value of this feature is N/A for a row).
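Still, since the dump is plain text, we can already extract a few simple facts from it with base R. The sketch below is not part of the original tutorial; it only counts leaves and crude split frequencies in the `model` vector produced above (the package itself provides better tools, as shown next).
```{r dumpCounts}
# Leaf lines contain "leaf="; split lines contain a test such as "[f12<3.5]"
sum(grepl("leaf=", model))
# Crude view of which feature IDs are used most often as split points
splitFeatures <- regmatches(model, regexpr("f[0-9]+<", model))
head(sort(table(splitFeatures), decreasing = TRUE))
```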
Fortunately, XGBoost offers a better representation: **feature importance**.
Feature importance is obtained by averaging the gain of each feature over all the splits and all the trees.
Then we can use the function `xgb.plot.importance`.
```{r importanceFeature}
# Get the feature real names
names <- dimnames(trainMatrix)[[2]]
# Compute feature importance matrix
importance_matrix <- xgb.importance(names, model = bst)
# Nice graph
xgb.plot.importance(importance_matrix[1:10,])
```
> To make the plot understandable, we first extract the real column names from the training matrix.
Interpretation
--------------
In the feature importance plot above, we can see the 10 most important features.
This function gives a color to each bar: a K-means clustering is applied to group the features by importance (a small sketch of this idea follows below).
From here you can take several actions. For instance you can remove the least important features (feature selection), or go deeper into the interactions between the most important features and the labels.
Or you can just reason about why these features are so important (in the **OTTO** challenge we can't go that way because there is not enough information about what the features represent).
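The sketch below illustrates that clustering idea by hand. It is an assumption about the principle, not the exact code used inside `xgb.plot.importance`; it relies on the `Ckmeans.1d.dp` package (listed in the package Suggests) and on the `Gain` column returned by `xgb.importance`.
```{r importanceClusters}
# One-dimensional clustering of the per-feature gains into importance groups
require(Ckmeans.1d.dp)
clusters <- Ckmeans.1d.dp(importance_matrix$Gain, k = c(1, 10))
# How many features fall into each importance cluster
table(clusters$cluster)
```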
Tree graph
----------
Feature importance gives you information about feature weights, but not about the interactions between features.
The **XGBoost R** package has another useful function for that.
```{r treeGraph, dpi=300, fig.align='left'}
xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 1)
```
We are only displaying the first tree here.
On simple models the first trees may be enough. Here, it may not be the case.