\documentclass{article}
\RequirePackage{url}
\usepackage{hyperref}
\RequirePackage{amsmath}
\RequirePackage{natbib}
\RequirePackage[a4paper,lmargin={1.25in},rmargin={1.25in},tmargin={1in},bmargin={1in}]{geometry}

\makeatletter
% \VignetteIndexEntry{xgboost: eXtreme Gradient Boosting}
%\VignetteKeywords{xgboost, gbm, gradient boosting machines}
%\VignettePackage{xgboost}
% \VignetteEngine{knitr::knitr}
\makeatother

\begin{document}
%\SweaveOpts{concordance=TRUE}

<<knitropts,echo=FALSE,message=FALSE>>=
if (require('knitr')) opts_chunk$set(fig.width = 5, fig.height = 5, fig.align = 'center', tidy = FALSE, warning = FALSE, cache = TRUE)
@
%
<<prelim,echo=FALSE>>=
xgboost.version = '0.3-0'
@
%

\begin{center}
\vspace*{6\baselineskip}
\rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt}
\rule{\textwidth}{0.4pt}\\[2\baselineskip]
{\LARGE \textbf{xgboost: eXtreme Gradient Boosting}}\\[1.2\baselineskip]
\rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt}
\rule{\textwidth}{1.6pt}\\[2\baselineskip]
{\Large Tianqi Chen, Tong He}\\[\baselineskip]
{\large Package Version: \Sexpr{xgboost.version}}\\[\baselineskip]
{\large \today}\par
\vfill
\begin{figure}[h]
\centering
\includegraphics[width=0.4\textwidth]{fig/sfu-logo.pdf}
\end{figure}
\end{center}

\thispagestyle{empty}

\clearpage

\setcounter{page}{1}

\section{Introduction}

This is an introductory document for using the \verb@xgboost@ package in R.

\verb@xgboost@ is short for eXtreme Gradient Boosting package. It is an efficient
and scalable implementation of the gradient boosting framework by \citet{friedman2001greedy}.
The package includes an efficient linear model solver and a tree learning
algorithm. It supports various objective functions, including regression,
classification and ranking. The package is designed to be extensible, so that
users can also easily define their own objectives. It has several features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
Windows and Linux, with OpenMP. It is generally over 10 times faster than
\verb@gbm@.}
\item{Input Type: }{\verb@xgboost@ takes several types of input data (see the
sketch after this list):}
\begin{itemize}
\item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
\item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
\item{Data File: }{Local data files}
\item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
\end{itemize}
\item{Sparsity: }{\verb@xgboost@ accepts sparse input for both the tree booster
and the linear booster, and is optimized for sparse input.}
\item{Customization: }{\verb@xgboost@ supports customized objective and
evaluation functions.}
\item{Performance: }{\verb@xgboost@ has better performance on several different
datasets.}
\end{enumerate}
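
As a quick preview of the sparse-matrix input, the following sketch converts the
iris features to a \verb@Matrix::dgCMatrix@ and trains on it directly. This is a
minimal sketch: it assumes the \verb@Matrix@ package is available, and the iris
data is introduced properly in the next section.
<<Sparse input sketch>>=
library(xgboost)
require(Matrix)
data(iris)
# store the dense iris features in compressed sparse column format (dgCMatrix)
iris.sparse <- Matrix(as.matrix(iris[, 1:4]), sparse = TRUE)
bst.sparse <- xgboost(iris.sparse, as.numeric(iris[, 5]), nrounds = 5)
@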

\section{Example with iris}

In this section, we will illustrate some common uses of \verb@xgboost@.

<<Training and prediction with iris>>=
library(xgboost)
data(iris)
bst <- xgboost(as.matrix(iris[, 1:4]), as.numeric(iris[, 5]),
               nrounds = 5)
xgb.save(bst, 'model.save')
bst <- xgb.load('model.save')
pred <- predict(bst, as.matrix(iris[, 1:4]))
@

\verb@xgboost@ is the main function to train a \verb@Booster@, i.e. a model.
\verb@predict@ performs prediction with the model.

Here we can save the model to a local binary file, and load it back when needed.
We can't inspect the trees inside this binary file. However, we have another
function to save the model in plain text.

<<Dump Model>>=
xgb.dump(bst, 'model.dump')
@

The output looks like the following, where \verb@f2@ denotes the third feature
(features are indexed from 0):

\begin{verbatim}
booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
  1:leaf=0.147059
  2:[f3<1.65] yes=3,no=4,missing=3
    3:leaf=0.464151
    4:leaf=0.722449
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
  1:leaf=0.103806
  2:[f2<4.85] yes=3,no=4,missing=3
    3:leaf=0.316341
    4:leaf=0.510365
\end{verbatim}

It is important to know \verb@xgboost@'s own data type: \verb@xgb.DMatrix@.
It speeds up \verb@xgboost@, and is needed for advanced features such as
training from an initial prediction value or weighted training instances.

We can use \verb@xgb.DMatrix@ to construct an \verb@xgb.DMatrix@ object:
<<xgb.DMatrix>>=
iris.mat <- as.matrix(iris[, 1:4])
iris.label <- as.numeric(iris[, 5])
diris <- xgb.DMatrix(iris.mat, label = iris.label)
class(diris)
getinfo(diris, 'label')
@
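
For weighted training, for example, per-instance weights can be attached to the
\verb@xgb.DMatrix@ before training. The sketch below is only an illustration:
the weights are arbitrary, and it assumes the \verb@setinfo@ function accepts
the \verb@'weight'@ field.
<<Weighted DMatrix sketch>>=
iris.weights <- rep(c(1, 2), length.out = nrow(iris.mat))  # arbitrary weights
setinfo(diris, 'weight', iris.weights)
@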

We can also save the matrix to a binary file, and then load it back simply with
\verb@xgb.DMatrix@:
<<save model>>=
xgb.DMatrix.save(diris, 'iris.xgb.DMatrix')
diris <- xgb.DMatrix('iris.xgb.DMatrix')
@

\section{Advanced Examples}

The function \verb@xgboost@ is a simple function with fewer parameters, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It
is more flexible than \verb@xgboost@, but it requires users to read the
documentation a bit more carefully.

\verb@xgb.train@ only accepts an \verb@xgb.DMatrix@ object as its input, but it
supports advanced features such as custom objective and evaluation functions.

<<Customized loss function>>=
# Customized objective: logistic loss.
# It must return the gradient and the second-order gradient (hessian)
# of the loss with respect to the raw prediction.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1/(1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}

# Customized evaluation metric: root mean squared error.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((preds - labels)^2))
  return(list(metric = "RMSE", value = err))
}

dtest <- slice(diris, 1:100)
watchlist <- list(eval = dtest, train = diris)
param <- list(max_depth = 2, eta = 1, silent = 1)

bst <- xgb.train(param, diris, nround = 2, watchlist, logregobj, evalerror)
@

The customized objective function must return both the gradient and the
second-order gradient of the loss with respect to the prediction.
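
For the logistic objective used above, with raw prediction $\hat{y}$ and
$p = 1/(1 + e^{-\hat{y}})$, the negative log-likelihood of a label $y$ is
$\ell = -\left[y \log p + (1 - y)\log(1 - p)\right]$, and differentiating with
respect to $\hat{y}$ gives
\begin{align*}
\frac{\partial \ell}{\partial \hat{y}} &= p - y, &
\frac{\partial^2 \ell}{\partial \hat{y}^2} &= p(1 - p),
\end{align*}
which are exactly the \verb@grad@ and \verb@hess@ vectors computed in
\verb@logregobj@.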

We also have \verb@slice@ for extracting rows from an \verb@xgb.DMatrix@ object.
It is useful in cross-validation.
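
For example, a minimal sketch of carving out one fold for cross-validation with
\verb@slice@ (the fold indices here are arbitrary and only for illustration):
<<Slice folds sketch>>=
fold.idx <- seq(1, nrow(iris.mat), by = 5)   # every 5th row as a held-out fold
dvalid <- slice(diris, fold.idx)
dtrain.fold <- slice(diris, setdiff(seq_len(nrow(iris.mat)), fold.idx))
@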

For a walkthrough demo, please see \verb@R-package/demo/demo.R@ for further
details.

\section{The Higgs Boson competition}

We have made a demo for \href{http://www.kaggle.com/c/higgs-boson}{the Higgs
Boson Machine Learning Challenge}.

Here are the instructions to make a submission:
\begin{enumerate}
\item Download the \href{http://www.kaggle.com/c/higgs-boson/data}{datasets}
and extract them to \verb@data/@.
\item Run scripts under \verb@xgboost/demo/kaggle-higgs/@:
\href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-train.R}{higgs-train.R}
and \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.R}{higgs-pred.R}.
The computation will take less than a minute on an Intel i7.
\item Go to the \href{http://www.kaggle.com/c/higgs-boson/submissions/attach}{submission page}
and submit your result.
\end{enumerate}

We provide \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/speedtest.R}{a script}
to compare the time cost on the Higgs dataset between \verb@gbm@ and \verb@xgboost@.
The training set contains 350000 records and 30 features.

\verb@xgboost@ can automatically do parallel computation. On a machine with an
Intel i7-4700MQ and 24GB of memory, we found that \verb@xgboost@ takes about 35
seconds, which is about 20 times faster than \verb@gbm@. When we limited
\verb@xgboost@ to use only one thread, it was still about two times faster than
\verb@gbm@.
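
The number of threads can be restricted through the \verb@nthread@ parameter. A
minimal sketch, reusing the objects from the iris examples above:
<<Single thread sketch,eval=FALSE>>=
param <- list(max_depth = 2, eta = 1, silent = 1, nthread = 1)
bst <- xgb.train(param, diris, nround = 2, watchlist, logregobj, evalerror)
@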

Meanwhile, the result from \verb@xgboost@ reaches an
\href{http://www.kaggle.com/c/higgs-boson/details/evaluation}{AMS score of 3.60}
with a single model. This result stands in the
\href{http://www.kaggle.com/c/higgs-boson/leaderboard}{top 30\%} of the
competition.

\bibliographystyle{jss}
\nocite{*} % list uncited references
\bibliography{xgboost}

\end{document}