\documentclass{article}
\usepackage{natbib}
\usepackage{graphics}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}
% \VignetteIndexEntry{xgboost}
\begin{document}
<<foo,include=FALSE,echo=FALSE>>=
options(keep.source = TRUE, width = 60)
foo <- packageDescription("xgboost")
@
\title{xgboost Package Example (Version \Sexpr{foo$Version})}
\author{Tong He}
\maketitle
\section{Introduction}
This document is an introduction to using the \verb@xgboost@ package in R.
\verb@xgboost@ is short for eXtreme Gradient Boosting (Tree). It is an efficient
and scalable implementation of the gradient boosting framework \citep{gbm}.
It supports regression and classification on several types of input data.
It has the following features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically perform parallel computation on
Windows and Linux, using OpenMP. It is generally over 10 times faster than
\verb@gbm@.}
\item{Input Type: }{\verb@xgboost@ takes several types of input data:}
\begin{itemize}
\item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
\item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
\item{Data File: }{Local data files}
\item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
\end{itemize}
\item{Sparsity: }{\verb@xgboost@ accepts sparse input for both the tree booster
and the linear booster (see the sketch after this list).}
\item{Customization: }{\verb@xgboost@ supports customized objective and
evaluation functions.}
\item{Performance: }{\verb@xgboost@ achieves strong predictive performance on a
variety of datasets, as evidenced by its rising popularity in Kaggle
competitions.}
\end{enumerate}
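As a brief illustration of the sparse input support, the following sketch (not
part of the original examples, and not evaluated here) converts the \verb@iris@
features to a \verb@dgCMatrix@ and trains on it; the call mirrors the
dense-matrix example in the next section.
<<Sparse input sketch, eval=FALSE>>=
## minimal sketch: the iris features are dense, but the same call
## works for a genuinely sparse dgCMatrix
require(Matrix)
sparse.mat <- Matrix(as.matrix(iris[, 1:4]), sparse = TRUE)
bst.sparse <- xgboost(sparse.mat, as.numeric(iris[, 5]), nrounds = 5)
@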
\section{Example with iris}
In this section, we illustrate the basic usage of \verb@xgboost@.
<<Training and prediction with iris>>=
library(xgboost)
data(iris)
bst <- xgboost(data = as.matrix(iris[, 1:4]), label = as.numeric(iris[, 5]),
               nrounds = 5)
xgb.save(bst, 'model.save')
bst <- xgb.load('model.save')
pred <- predict(bst, as.matrix(iris[, 1:4]))
@
\verb@xgboost@ is the main function for training a \verb@Booster@, i.e. a model,
and \verb@predict@ makes predictions with it.
Here we save the model to a local binary file and load it back when needed.
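As a quick sanity check (not part of the original example), we can compare the
predictions of the reloaded model with the numeric training labels; the default
objective here is plain regression, so \verb@pred@ is a numeric vector.
<<Check predictions>>=
# root mean squared error of the predictions on the training data
sqrt(mean((pred - as.numeric(iris[, 5]))^2))
@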
The binary file does not let us inspect the trees inside. However, another
function dumps the model in plain text:
<<Dump Model>>=
xgb.dump(bst, 'model.dump')
@
The output looks like the following:
\begin{verbatim}
booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
1:leaf=0.147059
2:[f3<1.65] yes=3,no=4,missing=3
3:leaf=0.464151
4:leaf=0.722449
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
1:leaf=0.103806
2:[f2<4.85] yes=3,no=4,missing=3
3:leaf=0.316341
4:leaf=0.510365
\end{verbatim}
It is important to know about \verb@xgboost@'s own data type, \verb@xgb.DMatrix@,
which speeds up \verb@xgboost@.
We can construct an \verb@xgb.DMatrix@ object as follows:
<<xgb.DMatrix>>=
iris.mat <- as.matrix(iris[, 1:4])
iris.label <- as.numeric(iris[, 5])
diris <- xgb.DMatrix(iris.mat, label = iris.label)
class(diris)
getinfo(diris, 'label')
@
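An \verb@xgb.DMatrix@ can carry additional information besides the label. As a
small illustration (the uniform weights here are hypothetical, purely for
demonstration), \verb@setinfo@ attaches a per-row weight vector to the matrix:
<<setinfo sketch, eval=FALSE>>=
# attach case weights to the xgb.DMatrix; here every row gets weight 1
setinfo(diris, 'weight', rep(1, nrow(iris)))
getinfo(diris, 'weight')
@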
We can also save the matrix to a binary file and then load it back simply with
\verb@xgb.DMatrix@:
<<save model>>=
xgb.DMatrix.save(diris, 'iris.xgb.DMatrix')
diris <- xgb.DMatrix('iris.xgb.DMatrix')
@
\section{Advanced Examples}
The function \verb@xgboost@ is deliberately simple, with fewer parameters, in
order to be R-friendly. The core training function is wrapped in \verb@xgb.train@.
It is more flexible than \verb@xgboost@, but it requires users to read the
documentation a bit more carefully.
\verb@xgb.train@ only accepts an \verb@xgb.DMatrix@ object as its input, while
supporting additional features such as customized objective and evaluation
functions.
<<Customized loss function>>=
# customized objective: logistic loss, returning gradient and hessian
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}
# customized evaluation: root mean squared error
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((preds - labels)^2))
  return(list(metric = "RMSE", value = err))
}
dtest <- slice(diris, 1:100)
watchlist <- list(eval = dtest, train = diris)
param <- list(max_depth = 2, eta = 1, silent = 1)
bst <- xgb.train(param, diris, nround = 2, watchlist, logregobj, evalerror)
@
A customized objective function must return both the gradient and the
second-order gradient (hessian) of the loss with respect to the predictions.
We also have \verb@slice@ for row extraction, which is useful in
cross-validation.
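For example, the following sketch (with an arbitrary, purely illustrative
choice of hold-out rows) uses \verb@slice@ to build a simple hold-out split of
the iris data and monitors both parts during training:
<<Manual split with slice, eval=FALSE>>=
# hold out every third row for evaluation; train on the rest
eval.idx <- seq(1, nrow(iris), by = 3)
train.idx <- setdiff(seq_len(nrow(iris)), eval.idx)
dtrain <- slice(diris, train.idx)
deval <- slice(diris, eval.idx)
bst.split <- xgb.train(param, dtrain, nround = 2,
                       list(eval = deval, train = dtrain),
                       logregobj, evalerror)
@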
For a walkthrough demo, please see \verb@R-package/demo/demo.R@ for further
details.
\section{The Higgs Boson competition}
We have made a demo for \href{http://www.kaggle.com/c/higgs-boson}{the Higgs
Boson Machine Learning Challenge}.
Here are the instructions for making a submission:
\begin{enumerate}
\item Download the \href{http://www.kaggle.com/c/higgs-boson/data}{datasets}
and extract them to \verb@data/@.
\item Run scripts under \verb@xgboost/demo/kaggle-higgs/@:
\href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-train.R}{higgs-train.R}
and \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.R}{higgs-pred.R}.
The computation will take less than a minute on an Intel i7 CPU.
\item Go to the \href{http://www.kaggle.com/c/higgs-boson/submissions/attach}{submission page}
and submit your result.
\end{enumerate}
We provide \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/speedtest.R}{a script}
to compare the training time of \verb@gbm@ and \verb@xgboost@ on the Higgs dataset.
The training set contains 350000 records and 30 features.
\verb@xgboost@ can automatically perform parallel computation. On a machine with
an Intel i7-4700MQ CPU and 24GB of memory, we found that \verb@xgboost@ takes
about 35 seconds, which is about 20 times faster than \verb@gbm@. When we limited
\verb@xgboost@ to use only one thread, it was still about twice as fast as \verb@gbm@.
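The number of threads is controlled with the \verb@nthread@ parameter; a minimal
sketch (not part of the timing script, shown here on the iris data only for
illustration) of a single-threaded run looks like this:
<<Single thread sketch, eval=FALSE>>=
# restrict xgboost to a single thread via the nthread parameter
bst.single <- xgboost(as.matrix(iris[, 1:4]), as.numeric(iris[, 5]),
                      nrounds = 5, nthread = 1)
@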
Meanwhile, the result from \verb@xgboost@ reaches
\href{http://www.kaggle.com/c/higgs-boson/details/evaluation}{3.60@AMS} with a
single model. This result stands in the
\href{http://www.kaggle.com/c/higgs-boson/leaderboard}{top 30\%} of the
competition.
\begin{thebibliography}{}
\bibitem[Friedman(2001)]{gbm}
Friedman, Jerome H. (2001).
\newblock Greedy function approximation: a gradient boosting machine.
\newblock \emph{Annals of Statistics}, 29(5):1189--1232.
\bibitem[Friedman et al.(2000)]{logitboost}
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani (2000).
\newblock Additive logistic regression: a statistical view of boosting (with
discussion and a rejoinder by the authors).
\newblock \emph{Annals of Statistics}, 28(2):337--407.
\end{thebibliography}
\end{document}