add vignette

parent 086433da0d
commit 5f510c683b
@@ -13,7 +13,7 @@ setClass('xgb.DMatrix')
#' data(iris)
#' iris[,5] <- as.numeric(iris[,5])
#' dtrain <- xgb.DMatrix(as.matrix(iris[,1:4]), label=iris[,5])
-#' dsub <- slice(dtrain, c(1,2,3))
+#' dsub <- slice(dtrain, 1:3)
#' @export
#'
slice <- function(object, ...){
@@ -44,8 +44,8 @@
#' @examples
#' data(iris)
#' iris[,5] <- as.numeric(iris[,5])
-#' dtrain = xgb.DMatrix(as.matrix(iris[,1:4]), label=iris[,5])
-#' dtest = dtrain
+#' dtrain <- xgb.DMatrix(as.matrix(iris[,1:4]), label=iris[,5])
+#' dtest <- dtrain
#' watchlist <- list(eval = dtest, train = dtrain)
#' param <- list(max_depth = 2, eta = 1, silent = 1)
#' logregobj <- function(preds, dtrain) {
R-package/inst/doc/xgboost.Rnw (new file, 174 lines)
@@ -0,0 +1,174 @@
\documentclass{article}

\usepackage{natbib}
\usepackage{graphics}
\usepackage{amsmath}
\usepackage{hyperref}
\usepackage{indentfirst}
\usepackage[utf8]{inputenc}

\DeclareMathOperator{\var}{var}
\DeclareMathOperator{\cov}{cov}

% \VignetteIndexEntry{xgboost Example}

\begin{document}

<<foo,include=FALSE,echo=FALSE>>=
options(keep.source = TRUE, width = 60)
foo <- packageDescription("xgboost")
@

\title{xgboost Package Example (Version \Sexpr{foo$Version})}
\author{Tong He}
\maketitle
\section{Introduction}

This is an example of using the \verb@xgboost@ package in R.

\verb@xgboost@ is short for eXtreme Gradient Boosting (Tree). It supports
regression and classification analysis on different types of input datasets.

Compared to \verb@gbm@ in R, it has several features:
\begin{enumerate}
\item{Speed: }{\verb@xgboost@ can automatically do parallel computation on
Windows and Linux, with OpenMP.}
\item{Input Type: }{\verb@xgboost@ takes several types of input data (see the
sketch after this list):}
\begin{itemize}
\item{Dense Matrix: }{R's dense matrix, i.e. \verb@matrix@}
\item{Sparse Matrix: }{R's sparse matrix \verb@Matrix::dgCMatrix@}
\item{Data File: }{Local data files}
\item{xgb.DMatrix: }{\verb@xgboost@'s own class. Recommended.}
\end{itemize}
\item{Penalization: }{\verb@xgboost@ supports penalization in
$L_0$, $L_1$ and $L_2$}
\item{Customization: }{\verb@xgboost@ supports customized objective and
evaluation functions}
\item{Performance: }{\verb@xgboost@ has better performance on several different
datasets; its rising popularity in various Kaggle competitions is evidence of
this.}
\end{enumerate}
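As a quick sketch of the sparse input path (this chunk is not evaluated and
assumes the \verb@Matrix@ package is available; the dense path is demonstrated
in the next section), a \verb@dgCMatrix@ can be passed to \verb@xgboost@ in the
same way as a dense matrix:

<<Sparse matrix input, eval=FALSE>>=
library(xgboost)
library(Matrix)
data(iris)
# convert the dense feature matrix into a sparse dgCMatrix
iris.sparse <- Matrix(as.matrix(iris[,1:4]), sparse = TRUE)
bst.sparse <- xgboost(iris.sparse, as.numeric(iris[,5]), nrounds = 5)
@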
\section{Example with iris}

In this section, we will illustrate some common usage of \verb@xgboost@.

<<Training and prediction with iris>>=
library(xgboost)
data(iris)
bst <- xgboost(as.matrix(iris[,1:4]), as.numeric(iris[,5]),
               nrounds = 5)
xgb.save(bst, 'model.save')
bst <- xgb.load('model.save')
pred <- predict(bst, as.matrix(iris[,1:4]))
hist(pred)
@

\verb@xgboost@ is the main function to train a \verb@Booster@, i.e. a model.
\verb@predict@ does prediction on the model.

Here we can save the model to a binary local file, and load it when needed.
We can't inspect the trees inside. However, we have another function to save the
model in plain text.
<<Dump Model>>=
xgb.dump(bst, 'model.dump')
@

The output looks like this (features \verb@f0@--\verb@f3@ are the four feature
columns of the iris data, indexed from zero):

\begin{verbatim}
booster[0]:
0:[f2<2.45] yes=1,no=2,missing=1
1:leaf=0.147059
2:[f3<1.65] yes=3,no=4,missing=3
3:leaf=0.464151
4:leaf=0.722449
booster[1]:
0:[f2<2.45] yes=1,no=2,missing=1
1:leaf=0.103806
2:[f2<4.85] yes=3,no=4,missing=3
3:leaf=0.316341
4:leaf=0.510365
\end{verbatim}
It is important to know \verb@xgboost@'s own data type: \verb@xgb.DMatrix@.
It speeds up \verb@xgboost@.

We can use \verb@xgb.DMatrix@ to construct an \verb@xgb.DMatrix@ object:
<<xgb.DMatrix>>=
iris.mat <- as.matrix(iris[,1:4])
iris.label <- as.numeric(iris[,5])
diris <- xgb.DMatrix(iris.mat, label = iris.label)
class(diris)
getinfo(diris, 'label')
@

We can also save the matrix to a binary file, and then load it simply with
\verb@xgb.DMatrix@:
<<save DMatrix>>=
xgb.DMatrix.save(diris, 'iris.xgb.DMatrix')
diris <- xgb.DMatrix('iris.xgb.DMatrix')
@
\section{Advanced Examples}

The function \verb@xgboost@ is a simple function with fewer parameters, in order
to be R-friendly. The core training function is wrapped in \verb@xgb.train@. It
is more flexible than \verb@xgboost@, but it requires users to read the
documentation a bit more carefully.

\verb@xgb.train@ only accepts an \verb@xgb.DMatrix@ object as its input, but it
supports additional features such as custom objective and evaluation functions.

<<Customized loss function>>=
logregobj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  preds <- 1/(1 + exp(-preds))
  grad <- preds - labels
  hess <- preds * (1 - preds)
  return(list(grad = grad, hess = hess))
}

evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- sqrt(mean((preds-labels)^2))
  return(list(metric = "MSE", value = err))
}

dtest <- slice(diris, 1:100)
watchlist <- list(eval = dtest, train = diris)
param <- list(max_depth = 2, eta = 1, silent = 1)

bst <- xgb.train(param, diris, nround = 2, watchlist, logregobj, evalerror)
@
The gradient and the second-order gradient (hessian) are required as the output
of a customized objective function.
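For the logistic objective used above, with raw prediction $x$, probability
$p = 1/(1+e^{-x})$ and label $y$, the loss is
$\ell(x) = -\left[y\log p + (1-y)\log(1-p)\right]$, so that
\begin{equation*}
\frac{\partial \ell}{\partial x} = p - y, \qquad
\frac{\partial^2 \ell}{\partial x^2} = p(1-p),
\end{equation*}
which is exactly what \verb@logregobj@ returns as \verb@grad@ and \verb@hess@.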
We also have \verb@slice@ for row extraction. It is useful in
cross-validation.
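For example, a minimal holdout split could look like the following sketch (the
fold indices and object names here are only illustrative, and the chunk is not
evaluated):

<<Holdout split with slice, eval=FALSE>>=
test.idx <- seq(1, nrow(iris), by = 3)   # every third row goes to the holdout fold
dtest.cv <- slice(diris, test.idx)
dtrain.cv <- slice(diris, setdiff(1:nrow(iris), test.idx))
@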
\section{The Higgs Boson competition}

We have made a demo for \href{http://www.kaggle.com/c/higgs-boson}{the Higgs
Boson Machine Learning Challenge}.

Our result reaches 3.60 with a single model. This result stands in the top 30\%
of the competition.

Here are the instructions to make a submission:
\begin{enumerate}
\item Download the \href{http://www.kaggle.com/c/higgs-boson/data}{datasets}
and extract them to \verb@data/@.
\item Run the scripts under \verb@xgboost/demo/kaggle-higgs/@:
\href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-train.R}{higgs-train.R}
and \href{https://github.com/tqchen/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.R}{higgs-pred.R}.
The computation will take less than a minute on an Intel i7.
\item Go to the \href{http://www.kaggle.com/c/higgs-boson/submissions/attach}{submission page}
and submit your result.
\end{enumerate}

\end{document}
@@ -10,6 +10,7 @@ This script will achieve about 3.600 AMS score in public leadboard. To get start
cd ../..
make
```

2. Put training.csv test.csv on folder './data' (you can create a symbolic link)

3. Run ./run.sh
@@ -21,5 +22,5 @@ speedtest.py compares xgboost's speed on this dataset with sklearn.GBM

Using R module
=====
-* Alternatively, you can run using R, higgs-train.R and higgs-pred.R
+* Alternatively, you can run using R, higgs-train.R and higgs-pred.R.