[R] On-demand serialization + standardization of attributes (#9924)

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
david-cortes
2024-01-10 22:08:42 +01:00
committed by GitHub
parent 01c4711556
commit d3a8d284ab
64 changed files with 1773 additions and 1281 deletions


@@ -1,223 +0,0 @@
\documentclass{article}
\RequirePackage{url}
\usepackage{hyperref}
\RequirePackage{amsmath}
\RequirePackage{natbib}
\RequirePackage[a4paper,lmargin={1.25in},rmargin={1.25in},tmargin={1in},bmargin={1in}]{geometry}
\makeatletter
% \VignetteIndexEntry{xgboost: eXtreme Gradient Boosting}
%\VignetteKeywords{xgboost, gbm, gradient boosting machines}
%\VignettePackage{xgboost}
% \VignetteEngine{knitr::knitr}
\makeatother
\begin{document}
%\SweaveOpts{concordance=TRUE}
<<knitropts,echo=FALSE,message=FALSE>>=
if (require('knitr')) opts_chunk$set(fig.width = 5, fig.height = 5, fig.align = 'center', tidy = FALSE, warning = FALSE, cache = TRUE)
@
%
<<prelim,echo=FALSE>>=
xgboost.version <- packageDescription("xgboost")$Version
@
%
\begin{center}
\vspace*{6\baselineskip}
\rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt}
\rule{\textwidth}{0.4pt}\\[2\baselineskip]
{\LARGE \textbf{xgboost: eXtreme Gradient Boosting}}\\[1.2\baselineskip]
\rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt}
\rule{\textwidth}{1.6pt}\\[2\baselineskip]
{\Large Tianqi Chen, Tong He}\\[\baselineskip]
{\large Package Version: \Sexpr{xgboost.version}}\\[\baselineskip]
{\large \today}\par
\vfill
\end{center}
\thispagestyle{empty}
\clearpage
\setcounter{page}{1}
\section{Introduction}
This document introduces the \verb@xgboost@ package for R.
\verb@xgboost@ is short for eXtreme Gradient Boosting. It is an efficient
and scalable implementation of the gradient boosting framework \citep{friedman2001greedy, friedman2000additive}.
The package includes an efficient linear model solver and a tree learning algorithm.
It supports various objective functions, including regression, classification
and ranking. The package is designed to be extensible, so users can also define their own objectives easily. It has several features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
Windows and Linux, with OpenMP. It is generally over 10 times faster than
\verb@gbm@.}
\item{Input Type: }{\verb@xgboost@ takes several types of input data:}
\begin{itemize}
\item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
\item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
\item{Data File: }{Local data files}
\item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
\end{itemize}
\item{Sparsity: }{\verb@xgboost@ accepts sparse input for both tree booster
and linear booster, and is optimized for sparse input.}
\item{Customization: }{\verb@xgboost@ supports customized objective functions
and evaluation functions.}
\item{Performance: }{\verb@xgboost@ has better performance on several different
datasets.}
\end{enumerate}
\section{Example with Mushroom data}
In this section, we illustrate some common usage of \verb@xgboost@. The
Mushroom data comes from the UCI Machine Learning Repository \citep{Bache+Lichman:2013}.
<<Training and prediction with Mushroom data>>=
library(xgboost)
data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test
bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1,
nrounds = 2, objective = "binary:logistic", nthread = 2)
xgb.save(bst, 'model.save')
bst = xgb.load('model.save')
xgb.parameters(bst) <- list(nthread = 2)
pred <- predict(bst, test$data)
@
\verb@xgboost@ is the main function for training a \verb@Booster@, i.e. a model.
\verb@predict@ makes predictions with the model.
Here we save the model to a local binary file and load it back when needed.
The binary file cannot be inspected directly, but \verb@xgb.dump@ writes the
model out as plain text.
<<Dump Model>>=
xgb.dump(bst, 'model.dump')
@
The output looks like
\begin{verbatim}
booster[0]:
0:[f28<1.00001] yes=1,no=2,missing=2
1:[f108<1.00001] yes=3,no=4,missing=4
3:leaf=1.85965
4:leaf=-1.94071
2:[f55<1.00001] yes=5,no=6,missing=6
5:leaf=-1.70044
6:leaf=1.71218
booster[1]:
0:[f59<1.00001] yes=1,no=2,missing=2
1:leaf=-6.23624
2:[f28<1.00001] yes=3,no=4,missing=4
3:leaf=-0.96853
4:leaf=0.784718
\end{verbatim}
It is important to know \verb@xgboost@'s own data type, \verb@xgb.DMatrix@.
It speeds up \verb@xgboost@ and is required for advanced features such as
training from an initial prediction value and weighting training instances.
The \verb@xgb.DMatrix@ function constructs an \verb@xgb.DMatrix@ object:
<<xgb.DMatrix>>=
dtrain <- xgb.DMatrix(train$data, label = train$label, nthread = 2)
class(dtrain)
head(getinfo(dtrain,'label'))
@
We can also save the matrix to a binary file and load it back simply with
\verb@xgb.DMatrix@:
<<save DMatrix>>=
xgb.DMatrix.save(dtrain, 'xgb.DMatrix')
dtrain = xgb.DMatrix('xgb.DMatrix')
@
\section{Advanced Examples}
The function \verb@xgboost@ is a simple interface with fewer parameters, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It is more flexible than \verb@xgboost@, but it requires users to read the documentation a bit more carefully.
\verb@xgb.train@ only accepts an \verb@xgb.DMatrix@ object as its input, and it supports advanced features such as custom objective and evaluation functions.
<<Customized loss function>>=
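# Custom objective: logistic loss on the raw margin scores.
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1 / (1 + exp(-preds))  # sigmoid: margin -> probability
  grad <- preds - labels          # first-order gradient
  hess <- preds * (1 - preds)     # second-order gradient
  return(list(grad = grad, hess = hess))
}
# Custom evaluation metric: root mean squared error.
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((preds - labels)^2))
  return(list(metric = "RMSE", value = err))
}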
dtest <- xgb.DMatrix(test$data, label = test$label, nthread = 2)
watchlist <- list(eval = dtest, train = dtrain)
param <- list(max_depth = 2, eta = 1, nthread = 2)
bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, logregobj, evalerror, maximize = FALSE)
@
A customized objective function must return both the gradient and the
second-order gradient.
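As a sketch of where these two quantities come from in \verb@logregobj@ above,
assume the objective is the logistic loss applied to the raw score $x_i$ of
instance $i$ with label $y_i \in \{0, 1\}$. The symbols $x_i$, $y_i$, $p_i$ and
$\ell$ are notation introduced here for illustration only:
\begin{align*}
p_i &= \frac{1}{1 + e^{-x_i}}, \\
\ell(y_i, x_i) &= -\bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr], \\
\frac{\partial \ell}{\partial x_i} &= p_i - y_i, \\
\frac{\partial^2 \ell}{\partial x_i^2} &= p_i \, (1 - p_i).
\end{align*}
Under this assumption, the last two lines correspond to \verb@grad@ and
\verb@hess@ in the code.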
We also have \verb@slice@ for row extraction, which is useful in
cross-validation.
For a walkthrough, please see the demos in \verb@R-package/demo/@.
\section{The Higgs Boson competition}
We have made a demo for \href{http://www.kaggle.com/c/higgs-boson}{the Higgs
Boson Machine Learning Challenge}.
Here are the instructions to make a submission:
\begin{enumerate}
\item Download the \href{http://www.kaggle.com/c/higgs-boson/data}{datasets}
and extract them to \verb@data/@.
\item Run scripts under \verb@xgboost/demo/kaggle-higgs/@:
\href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-train.R}{higgs-train.R}
and \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.R}{higgs-pred.R}.
The computation will take less than a minute on an Intel i7.
\item Go to the \href{http://www.kaggle.com/c/higgs-boson/submissions/attach}{submission page}
and submit your result.
\end{enumerate}
We provide \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/speedtest.R}{a script}
to compare the training time of \verb@gbm@ and \verb@xgboost@ on the Higgs dataset.
The training set contains 350000 records and 30 features.
\verb@xgboost@ can automatically do parallel computation. On a machine with an Intel
i7-4700MQ and 24GB of memory, we found that \verb@xgboost@ takes about 35 seconds, which is about 20 times faster
than \verb@gbm@. When we limited \verb@xgboost@ to a single thread, it was
still about two times faster than \verb@gbm@.
Meanwhile, the result from \verb@xgboost@ reaches
\href{http://www.kaggle.com/c/higgs-boson/details/evaluation}{3.60@AMS} with a
single model. This result stands in the
\href{http://www.kaggle.com/c/higgs-boson/leaderboard}{top 30\%} of the
competition.
\bibliographystyle{jss}
\nocite{*} % list uncited references
\bibliography{xgboost}
\end{document}
<<Temp file cleaning, include=FALSE>>=
file.remove("xgb.DMatrix")
file.remove("model.dump")
file.remove("model.save")
@


@@ -107,7 +107,7 @@ train <- agaricus.train
test <- agaricus.test
```
> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is out of the purpose of this article, however `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).
> In the real world, it would be up to you to make this division between `train` and `test` data.
Each variable is a `list` containing two things, `label` and `data`:
@@ -155,11 +155,13 @@ We will train decision tree model using the following parameters:
bstSparse <- xgboost(
data = train$data
, label = train$label
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, objective = "binary:logistic"
)
```
@@ -175,11 +177,13 @@ Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R**
bstDense <- xgboost(
data = as.matrix(train$data),
label = train$label,
max_depth = 2,
eta = 1,
nthread = 2,
nrounds = 2,
objective = "binary:logistic"
params = list(
max_depth = 2,
eta = 1,
nthread = 2,
objective = "binary:logistic"
),
nrounds = 2
)
```
@@ -191,11 +195,13 @@ bstDense <- xgboost(
dtrain <- xgb.DMatrix(data = train$data, label = train$label, nthread = 2)
bstDMatrix <- xgboost(
data = dtrain,
max_depth = 2,
eta = 1,
nthread = 2,
nrounds = 2,
objective = "binary:logistic"
params = list(
max_depth = 2,
eta = 1,
nthread = 2,
objective = "binary:logistic"
),
nrounds = 2
)
```
@@ -209,11 +215,13 @@ One of the simplest way to see the training progress is to set the `verbose` opt
# verbose = 0, no message
bst <- xgboost(
data = dtrain
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, objective = "binary:logistic"
, verbose = 0
)
```
@@ -222,11 +230,13 @@ bst <- xgboost(
# verbose = 1, print evaluation metric
bst <- xgboost(
data = dtrain
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, objective = "binary:logistic"
, verbose = 1
)
```
@@ -235,11 +245,13 @@ bst <- xgboost(
# verbose = 2, also print information about tree
bst <- xgboost(
data = dtrain
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, objective = "binary:logistic"
, verbose = 2
)
```
@@ -336,12 +348,14 @@ watchlist <- list(train = dtrain, test = dtest)
bst <- xgb.train(
data = dtrain
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, watchlist = watchlist
, objective = "binary:logistic"
)
```
@@ -349,7 +363,7 @@ bst <- xgb.train(
Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.
If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).
If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix.
For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.
@@ -357,13 +371,15 @@ For a better understanding of the learning progression, you may want to have som
bst <- xgb.train(
data = dtrain
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
eta = 1
, nthread = 2
, objective = "binary:logistic"
, eval_metric = "error"
, eval_metric = "logloss"
)
, nrounds = 2
, watchlist = watchlist
, eval_metric = "error"
, eval_metric = "logloss"
, objective = "binary:logistic"
)
```
@@ -377,14 +393,15 @@ Until now, all the learnings we have performed were based on boosting trees. **X
```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(
data = dtrain
, booster = "gblinear"
, max_depth = 2
, nthread = 2
, params = list(
booster = "gblinear"
, nthread = 2
, objective = "binary:logistic"
, eval_metric = "error"
, eval_metric = "logloss"
)
, nrounds = 2
, watchlist = watchlist
, eval_metric = "error"
, eval_metric = "logloss"
, objective = "binary:logistic"
)
```
@@ -406,12 +423,14 @@ xgb.DMatrix.save(dtrain, fname)
dtrain2 <- xgb.DMatrix(fname)
bst <- xgb.train(
data = dtrain2
, max_depth = 2
, eta = 1
, nthread = 2
, params = list(
max_depth = 2
, eta = 1
, nthread = 2
, objective = "binary:logistic"
)
, nrounds = 2
, watchlist = watchlist
, objective = "binary:logistic"
)
```
@@ -492,17 +511,17 @@ file.remove(fname)
> result is `0`? We are good!
In some very specific cases, like when you want to pilot **XGBoost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it.
In some very specific cases, you will want to save the model as a *R* binary vector. See below how to do it.
```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
rawVec <- xgb.serialize(bst)
rawVec <- xgb.save.raw(bst)
# print class
print(class(rawVec))
# load binary model to R
bst3 <- xgb.load(rawVec)
bst3 <- xgb.load.raw(rawVec)
xgb.parameters(bst3) <- list(nthread = 2)
pred3 <- predict(bst3, test$data)


@@ -53,11 +53,10 @@ labels <- c(1, 1, 1,
data <- data.frame(dates = dates, labels = labels)
bst <- xgb.train(
data = xgb.DMatrix(as.matrix(data$dates), label = labels),
data = xgb.DMatrix(as.matrix(data$dates), label = labels, missing = NA),
nthread = 2,
nrounds = 1,
objective = "binary:logistic",
missing = NA,
max_depth = 1
)
```