Merge pull request #199 from pommedeterresautee/master

Cross validation documentation improvement
Tong He 2015-03-18 11:14:36 -07:00
commit 8025b338a8
3 changed files with 33 additions and 27 deletions

@@ -25,12 +25,12 @@
#' \item \code{nthread} number of threads used in training; if not set, all threads are used
#' }
#'
#' See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
#' further details. See also demo/ for walkthrough example in R.
#' @param data takes an \code{xgb.DMatrix} as the input.
#' See \link{xgb.train} for further details.
#' See also demo/ for walkthrough example in R.
#' @param data takes an \code{xgb.DMatrix} or \code{Matrix} as the input.
#' @param nrounds the max number of iterations
#' @param nfold number of folds used
#' @param label option field, when data is Matrix
#' @param nfold the original dataset is randomly partitioned into \code{nfold} equal size subsamples.
#' @param label optional field, used when data is a \code{Matrix}
#' @param missing only used when the input is a dense matrix; pick a float value that represents
#' the missing value. Some datasets use 0 or another extreme value to represent missing values
#' (a short usage sketch follows these parameter entries).
#' @param prediction A logical value indicating whether to return the prediction vector.
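To ground the `data`, `label`, and `missing` parameters described above, here is a minimal hedged sketch in R. The toy data, the `-999` sentinel, and passing a plain dense matrix directly to `xgb.cv` are assumptions made for illustration; whether a base matrix is accepted may depend on the package version.

```r
# Hedged sketch: dense matrix input with an explicit missing-value code.
# The -999 sentinel and the random toy data are illustrative assumptions.
library(xgboost)
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)   # 100 observations, 10 features
x[sample(length(x), 50)] <- -999           # pretend -999 marks missing entries
y <- rbinom(100, 1, 0.5)                   # binary labels supplied via 'label'
cv <- xgb.cv(data = x, label = y, missing = -999,
             nfold = 5, nrounds = 3, nthread = 2,
             objective = "binary:logistic")
```

Here `label` is only needed because the input is not an `xgb.DMatrix`; a DMatrix already carries its label, as the example further down shows.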
@@ -56,18 +56,21 @@
#' @return A \code{data.table} with the mean and standard deviation of each evaluation metric for the training and test sets.
#'
#' @details
#' This is the cross validation function for xgboost
#'
#' Parallelization is automatically enabled if OpenMP is present.
#' The number of threads can also be specified manually via the "nthread" parameter.
#' The original sample is randomly partitioned into \code{nfold} equal size subsamples.
#'
#' This function only accepts an \code{xgb.DMatrix} object as the input.
#' Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
#'
#' The cross-validation process is then repeated \code{nfold} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
#'
#' All observations are used for both training and validation.
#'
#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
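As a reading aid for the partitioning described above, here is a hedged sketch in plain R of what an nfold split amounts to. This is illustrative only, not xgboost's internal fold code; the fold count is a placeholder and the agaricus data is borrowed from the example below.

```r
# Illustrative only: the k-fold split described above, done by hand.
data(agaricus.train, package = "xgboost")
n       <- length(agaricus.train$label)                 # number of observations
nfold   <- 5
fold_id <- sample(rep(seq_len(nfold), length.out = n))  # random, near-equal folds
# For fold k: rows with fold_id == k form the validation set and the rest the
# training set, so every observation is validated exactly once and used for
# training nfold - 1 times.
table(fold_id)
```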
#'
#' @examples
#' data(agaricus.train, package='xgboost')
#' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
#' history <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse", "auc"),
#' "max.depth"=3, "eta"=1, "objective"="binary:logistic")
#' max.depth = 3, eta = 1, objective = "binary:logistic")
#' print(history)
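As a hedged follow-up to the example above, one way to use the returned data.table is to look for the best round. The exact column names (e.g. `test.rmse.mean`) are an assumption about this version's output, hence the guard on `names(history)`.

```r
# Hedged follow-up: locate the round with the best mean test metric.
# Column names such as "test.rmse.mean" are assumed; check names(history) first.
print(names(history))
if ("test.rmse.mean" %in% names(history)) {
  best_round <- which.min(history$test.rmse.mean)
  cat("lowest mean test RMSE at round", best_round, "\n")
}
```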
#' @export
#'

@@ -21,16 +21,16 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
\item \code{nthread} number of threads used in training; if not set, all threads are used
}
See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
further details. See also demo/ for walkthrough example in R.}
See \link{xgb.train} for further details.
See also demo/ for walkthrough example in R.}
\item{data}{takes an \code{xgb.DMatrix} as the input.}
\item{data}{takes an \code{xgb.DMatrix} or \code{Matrix} as the input.}
\item{nrounds}{the max number of iterations}
\item{nfold}{number of folds used}
\item{nfold}{the original dataset is randomly partitioned into \code{nfold} equal size subsamples.}
\item{label}{option field, when data is Matrix}
\item{label}{optional field, used when data is a \code{Matrix}}
\item{missing}{only used when the input is a dense matrix; pick a float value that represents
the missing value. Some datasets use 0 or another extreme value to represent missing values.}
@@ -68,18 +68,21 @@ A \code{data.table} with each mean and standard deviation stat for training set
The cross validation function of xgboost
}
\details{
This is the cross validation function for xgboost
The original sample is randomly partitioned into \code{nfold} equal size subsamples.
Parallelization is automatically enabled if OpenMP is present.
The number of threads can also be specified manually via the "nthread" parameter.
Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
This function only accepts an \code{xgb.DMatrix} object as the input.
The cross-validation process is then repeated \code{nfold} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
All observations are used for both training and validation.
Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
}
\examples{
data(agaricus.train, package='xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
history <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse", "auc"),
"max.depth"=3, "eta"=1, "objective"="binary:logistic")
max.depth = 3, eta = 1, objective = "binary:logistic")
print(history)
}

@@ -1,5 +1,5 @@
xgboost: eXtreme Gradient Boosting
======
XGBoost: eXtreme Gradient Boosting
==================================
An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.
It implements machine learning algorithms under the gradient boosting framework, including the generalized linear model and gradient boosted regression trees (GBDT). XGBoost can also be distributed and scaled to even larger data.
@@ -23,7 +23,7 @@ Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washin
* The model presented is used by xgboost for boosted trees
What's New
=====
==========
* [Distributed XGBoost now runs on YARN](multi-node/hadoop)!
* [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes, sharing your experience on xgboost
* [Distributed XGBoost](multi-node) is now available!!
@@ -37,7 +37,7 @@ What's New
* Thanks to Tong He, the new [R package](R-package) is available
Features
======
========
* Sparse feature format:
- Sparse feature format allows easy handling of missing values and improves computation efficiency (see the sketch after this list).
* Push the limit on single machine:
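To illustrate the sparse-feature bullet above, here is a hedged R sketch of feeding a sparse matrix straight into an `xgb.DMatrix`; the agaricus data ships with the R package, and this mirrors the `xgb.cv` example earlier in this change set.

```r
# Hedged sketch: sparse input is consumed directly, without densifying it.
library(xgboost)
data(agaricus.train, package = "xgboost")
class(agaricus.train$data)        # "dgCMatrix": only non-zero entries are stored
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
# dtrain can now be passed to xgb.train or xgb.cv exactly as in the R examples above
```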
@@ -74,7 +74,7 @@ Build
Then run ```bash build.sh``` normally.
Version
======
=======
* This version is xgboost-0.3; the code has been refactored from 0.2x to be cleaner and more flexible
* This version of xgboost is not compatible with 0.2x, due to the large amount of changes in the code structure
- This means the model and buffer files of previous versions cannot be loaded in xgboost-0.3
@@ -82,6 +82,6 @@ Version
* Change log in [CHANGES.md](CHANGES.md)
XGBoost in Graphlab Create
======
==========================
* XGBoost is adopted as part of the boosted tree toolkit in Graphlab Create (GLC). Graphlab Create is a powerful Python toolkit that allows you to do data manipulation, graph processing, hyper-parameter search, and visualization of terabyte-scale data in one framework. Try Graphlab Create at http://graphlab.com/products/create/quick-start-guide.html
* Nice blog post by Jay Gu on using GLC boosted trees to solve the Kaggle bike sharing challenge: http://blog.graphlab.com/using-gradient-boosted-trees-to-predict-bike-sharing-demand