[R] On-demand serialization + standardization of attributes (#9924)

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
2024-01-10 22:08:42 +01:00
parent 01c4711556
commit d3a8d284ab
64 changed files with 1773 additions and 1281 deletions
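
The vignette hunks in this commit move booster parameters into a `params` list and switch the raw-vector round trip to `xgb.save.raw` / `xgb.load.raw`. The following is a minimal sketch of that usage, not taken from the commit itself. It assumes an xgboost R build that includes this change, so that `xgb.load.raw` returns a usable booster, and the variable names (`raw_vec`, `bst_restored`) are illustrative only.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label, nthread = 2)
dtest <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label, nthread = 2)

# Booster parameters are grouped in `params` instead of being passed
# as loose top-level arguments.
bst <- xgb.train(
  params = list(
    max_depth = 2,
    eta = 1,
    nthread = 2,
    objective = "binary:logistic"
  ),
  data = dtrain,
  nrounds = 2
)

# Serialize the model to an R raw vector, then restore it.
# Assumption: xgb.load.raw() hands back a booster that predict() accepts.
raw_vec <- xgb.save.raw(bst)
bst_restored <- xgb.load.raw(raw_vec)

# Compare predictions from the original and the restored model.
pred_original <- predict(bst, dtest)
pred_restored <- predict(bst_restored, dtest)
print(all.equal(pred_original, pred_restored))
```

On older releases `xgb.load.raw` may return a bare handle rather than a booster, in which case the final `predict` call would not work as written.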
--- a/R-package/vignettes/xgboost.Rnw
+++ b/R-package/vignettes/xgboost.Rnw
@@ -1,223 +0,0 @@
-\documentclass{article}
-\RequirePackage{url}
-\usepackage{hyperref}
-\RequirePackage{amsmath}
-\RequirePackage{natbib}
-\RequirePackage[a4paper,lmargin={1.25in},rmargin={1.25in},tmargin={1in},bmargin={1in}]{geometry}
-
-\makeatletter
-% \VignetteIndexEntry{xgboost: eXtreme Gradient Boosting}
-%\VignetteKeywords{xgboost, gbm, gradient boosting machines}
-%\VignettePackage{xgboost}
-% \VignetteEngine{knitr::knitr}
-\makeatother
-
-\begin{document}
-%\SweaveOpts{concordance=TRUE}
-
-<<knitropts,echo=FALSE,message=FALSE>>=
-if (require('knitr')) opts_chunk$set(fig.width = 5, fig.height = 5, fig.align = 'center', tidy = FALSE, warning = FALSE, cache = TRUE)
-@
-
-%
-<<prelim,echo=FALSE>>=
-xgboost.version <- packageDescription("xgboost")$Version
-
-@
-%
-
-    \begin{center}
-    \vspace*{6\baselineskip}
-    \rule{\textwidth}{1.6pt}\vspace*{-\baselineskip}\vspace*{2pt}
-    \rule{\textwidth}{0.4pt}\\[2\baselineskip]
-    {\LARGE \textbf{xgboost: eXtreme Gradient Boosting}}\\[1.2\baselineskip]
-    \rule{\textwidth}{0.4pt}\vspace*{-\baselineskip}\vspace{3.2pt}
-    \rule{\textwidth}{1.6pt}\\[2\baselineskip]
-    {\Large Tianqi Chen, Tong He}\\[\baselineskip]
-    {\large Package Version: \Sexpr{xgboost.version}}\\[\baselineskip]
-    {\large \today}\par
-    \vfill
-    \end{center}
-
-\thispagestyle{empty}
-
-\clearpage
-
-\setcounter{page}{1}
-
-\section{Introduction}
-
-This is an introductory document of using the \verb@xgboost@ package in R.
-
-\verb@xgboost@ is short for eXtreme Gradient Boosting package. It is an efficient
- and scalable implementation of gradient boosting framework by \citep{friedman2001greedy} \citep{friedman2000additive}.
-The package includes efficient linear model solver and tree learning algorithm.
-It supports various objective functions, including regression, classification
-and ranking. The package is made to be extendible, so that users are also allowed to define their own objectives easily. It has several features:
-\begin{enumerate}
-    \item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
-    Windows and Linux, with openmp. It is generally over 10 times faster than
-    \verb@gbm@.}
-    \item{Input Type: }{\verb@xgboost@ takes several types of input data:}
-    \begin{itemize}
-        \item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
-        \item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
-        \item{Data File: }{Local data files}
-        \item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
-    \end{itemize}
-    \item{Sparsity: }{\verb@xgboost@ accepts sparse input for both tree booster
-    and linear booster, and is optimized for sparse input.}
-    \item{Customization: }{\verb@xgboost@ supports customized objective function
-    and evaluation function}
-    \item{Performance: }{\verb@xgboost@ has better performance on several different
-    datasets.}
-\end{enumerate}
-
-
-\section{Example with Mushroom data}
-
-In this section, we will illustrate some common usage of \verb@xgboost@. The
-Mushroom data is cited from UCI Machine Learning Repository. \citep{Bache+Lichman:2013}
-
-<<Training and prediction with iris>>=
-library(xgboost)
-data(agaricus.train, package='xgboost')
-data(agaricus.test, package='xgboost')
-train <- agaricus.train
-test <- agaricus.test
-bst <- xgboost(data = train$data, label = train$label, max_depth = 2, eta = 1,
-               nrounds = 2, objective = "binary:logistic", nthread = 2)
-xgb.save(bst, 'model.save')
-bst = xgb.load('model.save')
-xgb.parameters(bst) <- list(nthread = 2)
-pred <- predict(bst, test$data)
-@
-
-\verb@xgboost@ is the main function to train a \verb@Booster@, i.e. a model.
-\verb@predict@ does prediction on the model.
-
-Here we can save the model to a binary local file, and load it when needed.
-We can't inspect the trees inside. However we have another function to save the
-model in plain text.
-<<Dump Model>>=
-xgb.dump(bst, 'model.dump')
-@
-
-The output looks like
-
-\begin{verbatim}
-booster[0]:
-0:[f28<1.00001] yes=1,no=2,missing=2
-  1:[f108<1.00001] yes=3,no=4,missing=4
-    3:leaf=1.85965
-    4:leaf=-1.94071
-  2:[f55<1.00001] yes=5,no=6,missing=6
-    5:leaf=-1.70044
-    6:leaf=1.71218
-booster[1]:
-0:[f59<1.00001] yes=1,no=2,missing=2
-  1:leaf=-6.23624
-  2:[f28<1.00001] yes=3,no=4,missing=4
-    3:leaf=-0.96853
-    4:leaf=0.784718
-\end{verbatim}
-
-It is important to know \verb@xgboost@'s own data type: \verb@xgb.DMatrix@.
-It speeds up \verb@xgboost@, and is needed for advanced features such as
-training from initial prediction value, weighted training instance.
-
-We can use \verb@xgb.DMatrix@ to construct an \verb@xgb.DMatrix@ object:
-<<xgb.DMatrix>>=
-dtrain <- xgb.DMatrix(train$data, label = train$label, nthread = 2)
-class(dtrain)
-head(getinfo(dtrain,'label'))
-@
-
-We can also save the matrix to a binary file. Then load it simply with
-\verb@xgb.DMatrix@
-<<save model>>=
-xgb.DMatrix.save(dtrain, 'xgb.DMatrix')
-dtrain = xgb.DMatrix('xgb.DMatrix')
-@
-
-\section{Advanced Examples}
-
-The function \verb@xgboost@ is a simple function with less parameter, in order
-to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It is more flexible than \verb@xgboost@, but it requires users to read the document a bit more carefully.
-
-\verb@xgb.train@ only accept a \verb@xgb.DMatrix@ object as its input, while it supports advanced features as custom objective and evaluation functions.
-
-<<Customized loss function>>=
-logregobj <- function(preds, dtrain) {
-   labels <- getinfo(dtrain, "label")
-   preds <- 1/(1 + exp(-preds))
-   grad <- preds - labels
-   hess <- preds * (1 - preds)
-   return(list(grad = grad, hess = hess))
-}
-
-evalerror <- function(preds, dtrain) {
-  labels <- getinfo(dtrain, "label")
-  err <- sqrt(mean((preds-labels)^2))
-  return(list(metric = "MSE", value = err))
-}
-
-dtest <- xgb.DMatrix(test$data, label = test$label, nthread = 2)
-watchlist <- list(eval = dtest, train = dtrain)
-param <- list(max_depth = 2, eta = 1, nthread = 2)
-
-bst <- xgb.train(param, dtrain, nrounds = 2, watchlist, logregobj, evalerror, maximize = FALSE)
-@
-
-The gradient and second order gradient is required for the output of customized
-objective function.
-
-We also have \verb@slice@ for row extraction. It is useful in
-cross-validation.
-
-For a walkthrough demo, please see \verb@R-package/demo/@ for further
-details.
-
-\section{The Higgs Boson competition}
-
-We have made a demo for \href{http://www.kaggle.com/c/higgs-boson}{the Higgs
-Boson Machine Learning Challenge}.
-
-Here are the instructions to make a submission
-\begin{enumerate}
-    \item Download the \href{http://www.kaggle.com/c/higgs-boson/data}{datasets}
-    and extract them to \verb@data/@.
-    \item Run scripts under \verb@xgboost/demo/kaggle-higgs/@:
-    \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-train.R}{higgs-train.R}
-    and \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.R}{higgs-pred.R}.
-    The computation will take less than a minute on Intel i7.
-    \item Go to the \href{http://www.kaggle.com/c/higgs-boson/submissions/attach}{submission page}
-    and submit your result.
-\end{enumerate}
-
-We provide \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/speedtest.R}{a script}
-to compare the time cost on the higgs dataset with \verb@gbm@ and \verb@xgboost@.
-The training set contains 350000 records and 30 features.
-
-\verb@xgboost@ can automatically do parallel computation. On a machine with Intel
-i7-4700MQ and 24GB memories, we found that \verb@xgboost@ costs about 35 seconds, which is about 20 times faster
-than \verb@gbm@. When we limited \verb@xgboost@ to use only one thread, it was
-still about two times faster than \verb@gbm@.
-
-Meanwhile, the result from \verb@xgboost@ reaches
-\href{http://www.kaggle.com/c/higgs-boson/details/evaluation}{3.60@AMS} with a
-single model. This results stands in the
-\href{http://www.kaggle.com/c/higgs-boson/leaderboard}{top 30\%} of the
-competition.
-
-\bibliographystyle{jss}
-\nocite{*} % list uncited references
-\bibliography{xgboost}
-
-\end{document}
-
-<<Temp file cleaning, include=FALSE>>=
-file.remove("xgb.DMatrix")
-file.remove("model.dump")
-file.remove("model.save")
-@
--- a/R-package/vignettes/xgboostPresentation.Rmd
+++ b/R-package/vignettes/xgboostPresentation.Rmd
@@ -107,7 +107,7 @@ train <- agaricus.train
 test <- agaricus.test
 ```

-> In the real world, it would be up to you to make this division between `train` and `test` data. The way to do it is out of the purpose of this article, however `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).
+> In the real world, it would be up to you to make this division between `train` and `test` data.

 Each variable is a `list` containing two things, `label` and `data`:

@@ -155,11 +155,13 @@ We will train decision tree model using the following parameters:
 bstSparse <- xgboost(
    data = train$data
    , label = train$label
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
-    , objective = "binary:logistic"
 )
 ```

@@ -175,11 +177,13 @@ Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic **R**
 bstDense <- xgboost(
    data = as.matrix(train$data),
    label = train$label,
-    max_depth = 2,
-    eta = 1,
-    nthread = 2,
-    nrounds = 2,
-    objective = "binary:logistic"
+    params = list(
+        max_depth = 2,
+        eta = 1,
+        nthread = 2,
+        objective = "binary:logistic"
+    ),
+    nrounds = 2
 )
 ```

@@ -191,11 +195,13 @@ bstDense <- xgboost(
 dtrain <- xgb.DMatrix(data = train$data, label = train$label, nthread = 2)
 bstDMatrix <- xgboost(
    data = dtrain,
-    max_depth = 2,
-    eta = 1,
-    nthread = 2,
-    nrounds = 2,
-    objective = "binary:logistic"
+    params = list(
+        max_depth = 2,
+        eta = 1,
+        nthread = 2,
+        objective = "binary:logistic"
+    ),
+    nrounds = 2
 )
 ```

@@ -209,11 +215,13 @@ One of the simplest way to see the training progress is to set the `verbose` opt
 # verbose = 0, no message
 bst <- xgboost(
    data = dtrain
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
-    , objective = "binary:logistic"
    , verbose = 0
 )
 ```
@@ -222,11 +230,13 @@ bst <- xgboost(
 # verbose = 1, print evaluation metric
 bst <- xgboost(
    data = dtrain
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
-    , objective = "binary:logistic"
    , verbose = 1
 )
 ```
@@ -235,11 +245,13 @@ bst <- xgboost(
 # verbose = 2, also print information about tree
 bst <- xgboost(
    data = dtrain
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
-    , objective = "binary:logistic"
    , verbose = 2
 )
 ```
@@ -336,12 +348,14 @@ watchlist <- list(train = dtrain, test = dtest)

 bst <- xgb.train(
    data = dtrain
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
    , watchlist = watchlist
-    , objective = "binary:logistic"
 )
 ```

@@ -349,7 +363,7 @@ bst <- xgb.train(

 Both training and test error related metrics are very similar, and in some way, it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

-If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix. Again, `caret` package may [help](http://topepo.github.io/caret/data-splitting.html).
+If with your own dataset you have not such results, you should think about how you divided your dataset in training and test. May be there is something to fix.

 For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

@@ -357,13 +371,15 @@ For a better understanding of the learning progression, you may want to have som
 bst <- xgb.train(
    data = dtrain
    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+        , eval_metric = "error"
+        , eval_metric = "logloss"
+    )
    , nrounds = 2
    , watchlist = watchlist
-    , eval_metric = "error"
-    , eval_metric = "logloss"
-    , objective = "binary:logistic"
 )
 ```

@@ -377,14 +393,15 @@ Until now, all the learnings we have performed were based on boosting trees. **X
 ```{r linearBoosting, message=F, warning=F}
 bst <- xgb.train(
    data = dtrain
-    , booster = "gblinear"
-    , max_depth = 2
-    , nthread = 2
+    , params = list(
+        booster = "gblinear"
+        , nthread = 2
+        , objective = "binary:logistic"
+        , eval_metric = "error"
+        , eval_metric = "logloss"
+    )
    , nrounds = 2
    , watchlist = watchlist
-    , eval_metric = "error"
-    , eval_metric = "logloss"
-    , objective = "binary:logistic"
 )
 ```

@@ -406,12 +423,14 @@ xgb.DMatrix.save(dtrain, fname)
 dtrain2 <- xgb.DMatrix(fname)
 bst <- xgb.train(
    data = dtrain2
-    , max_depth = 2
-    , eta = 1
-    , nthread = 2
+    , params = list(
+        max_depth = 2
+        , eta = 1
+        , nthread = 2
+        , objective = "binary:logistic"
+    )
    , nrounds = 2
    , watchlist = watchlist
-    , objective = "binary:logistic"
 )
 ```

@@ -492,17 +511,17 @@ file.remove(fname)

 > result is `0`? We are good!

-In some very specific cases, like when you want to pilot **XGBoost** from `caret` package, you will want to save the model as a *R* binary vector. See below how to do it.
+In some very specific cases, you will want to save the model as a *R* binary vector. See below how to do it.

 ```{r saveLoadRBinVectorModel, message=F, warning=F}
 # save model to R's raw vector
-rawVec <- xgb.serialize(bst)
+rawVec <- xgb.save.raw(bst)

 # print class
 print(class(rawVec))

 # load binary model to R
-bst3 <- xgb.load(rawVec)
+bst3 <- xgb.load.raw(rawVec)
 xgb.parameters(bst3) <- list(nthread = 2)
 pred3 <- predict(bst3, test$data)

--- a/R-package/vignettes/xgboostfromJSON.Rmd
+++ b/R-package/vignettes/xgboostfromJSON.Rmd
@@ -53,11 +53,10 @@ labels <- c(1, 1, 1,
 data <- data.frame(dates = dates, labels = labels)

 bst <- xgb.train(
-  data = xgb.DMatrix(as.matrix(data$dates), label = labels),
+  data = xgb.DMatrix(as.matrix(data$dates), label = labels, missing = NA),
  nthread = 2,
  nrounds = 1,
  objective = "binary:logistic",
-  missing = NA,
  max_depth = 1
 )
 ```