Merge pull request #199 from pommedeterresautee/master

Cross validation documentation improvement
Tong He 2015-03-18 11:14:36 -07:00
commit 8025b338a8
3 changed files with 33 additions and 27 deletions

View File

@@ -25,12 +25,12 @@
 #' \item \code{nthread} number of thread used in training, if not set, all threads are used
 #' }
 #'
-#' See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
-#' further details. See also demo/ for walkthrough example in R.
-#' @param data takes an \code{xgb.DMatrix} as the input.
+#' See \link{xgb.train} for further details.
+#' See also demo/ for walkthrough example in R.
+#' @param data takes an \code{xgb.DMatrix} or \code{Matrix} as the input.
 #' @param nrounds the max number of iterations
-#' @param nfold number of folds used
-#' @param label option field, when data is Matrix
+#' @param nfold the original dataset is randomly partitioned into \code{nfold} equal size subsamples.
+#' @param label option field, when data is \code{Matrix}
 #' @param missing Missing is only used when input is dense matrix, pick a float
 #' value that represents missing value. Sometime a data use 0 or other extreme value to represents missing values.
 #' @param prediction A logical value indicating whether to return the prediction vector.
@@ -56,18 +56,21 @@
 #' @return A \code{data.table} with each mean and standard deviation stat for training set and test set.
 #'
 #' @details
-#' This is the cross validation function for xgboost
+#' The original sample is randomly partitioned into \code{nfold} equal size subsamples.
 #'
-#' Parallelization is automatically enabled if OpenMP is present.
-#' Number of threads can also be manually specified via "nthread" parameter.
+#' Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
 #'
-#' This function only accepts an \code{xgb.DMatrix} object as the input.
+#' The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
+#'
+#' All observations are used for both training and validation.
+#'
+#' Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
 #'
 #' @examples
 #' data(agaricus.train, package='xgboost')
 #' dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
 #' history <- xgb.cv(data = dtrain, nround=3, nthread = 2, nfold = 5, metrics=list("rmse","auc"),
-#' "max.depth"=3, "eta"=1, "objective"="binary:logistic")
+#' max.depth =3, eta = 1, objective = "binary:logistic")
 #' print(history)
 #' @export
 #'
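
The new details text added above describes plain k-fold partitioning. As a rough illustration of that partitioning step only (a sketch in base R written for this note, not code from this commit or from the xgboost package; the names `n`, `nfold` and `folds` are illustrative), the fold assignment amounts to:

```r
# Shuffle the row indices once, then split them into nfold roughly equal groups;
# each group serves as validation data exactly once while the remaining
# nfold - 1 groups form the training data, so every observation is used
# for both training and validation.
set.seed(1)
n       <- 1000                                          # any number of rows
nfold   <- 5
fold_id <- sample(rep(seq_len(nfold), length.out = n))   # random fold label per row
folds   <- split(seq_len(n), fold_id)                    # nfold index vectors
sapply(folds, length)                                     # roughly equal size subsamples
```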

View File

@@ -21,16 +21,16 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL,
 \item \code{nthread} number of thread used in training, if not set, all threads are used
 }
-See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
-further details. See also demo/ for walkthrough example in R.}
-\item{data}{takes an \code{xgb.DMatrix} as the input.}
+See \link{xgb.train} for further details.
+See also demo/ for walkthrough example in R.}
+\item{data}{takes an \code{xgb.DMatrix} or \code{Matrix} as the input.}
 \item{nrounds}{the max number of iterations}
-\item{nfold}{number of folds used}
-\item{label}{option field, when data is Matrix}
+\item{nfold}{the original dataset is randomly partitioned into \code{nfold} equal size subsamples.}
+\item{label}{option field, when data is \code{Matrix}}
 \item{missing}{Missing is only used when input is dense matrix, pick a float
 value that represents missing value. Sometime a data use 0 or other extreme value to represents missing values.}
@@ -68,18 +68,21 @@ A \code{data.table} with each mean and standard deviation stat for training set
 The cross valudation function of xgboost
 }
 \details{
-This is the cross validation function for xgboost
-Parallelization is automatically enabled if OpenMP is present.
-Number of threads can also be manually specified via "nthread" parameter.
-This function only accepts an \code{xgb.DMatrix} object as the input.
+The original sample is randomly partitioned into \code{nfold} equal size subsamples.
+Of the \code{nfold} subsamples, a single subsample is retained as the validation data for testing the model, and the remaining \code{nfold - 1} subsamples are used as training data.
+The cross-validation process is then repeated \code{nrounds} times, with each of the \code{nfold} subsamples used exactly once as the validation data.
+All observations are used for both training and validation.
+Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%29#k-fold_cross-validation}
 }
 \examples{
 data(agaricus.train, package='xgboost')
 dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
 history <- xgb.cv(data = dtrain, nround=3, nthread = 2, nfold = 5, metrics=list("rmse","auc"),
-"max.depth"=3, "eta"=1, "objective"="binary:logistic")
+max.depth =3, eta = 1, objective = "binary:logistic")
 print(history)
 }
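
The updated parameter entries above state that `data` may be an `xgb.DMatrix` or a `Matrix`, that `label` is only consulted when `data` is a `Matrix`, and that `missing` names the value standing in for missing entries in a dense matrix. A hedged sketch of that Matrix-input call pattern (assuming the `xgb.cv` interface documented in this commit; this snippet is not part of the diff):

```r
library(xgboost)
data(agaricus.train, package = 'xgboost')

# Pass the design matrix directly instead of building an xgb.DMatrix first:
# `label` is used here because `data` is a Matrix; for a dense matrix,
# `missing` would flag the placeholder value for missing entries.
history <- xgb.cv(data    = agaricus.train$data,   # sparse dgCMatrix input
                  label   = agaricus.train$label,
                  nrounds = 3, nfold = 5, nthread = 2,
                  metrics = list("rmse", "auc"),
                  max.depth = 3, eta = 1, objective = "binary:logistic")
print(history)
```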

View File

@@ -1,5 +1,5 @@
-xgboost: eXtreme Gradient Boosting
-======
+XGBoost: eXtreme Gradient Boosting
+==================================
 An optimized general purpose gradient boosting library. The library is parallelized, and also provides an optimized distributed version.
 It implements machine learning algorithm under gradient boosting framework, including generalized linear model and gradient boosted regression tree (GBDT). XGBoost can also also distributed and scale to even larger data.
@@ -23,7 +23,7 @@ Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washin
 * The model presented is used by xgboost for boosted trees
 What's New
-=====
+==========
 * [Distributed XGBoost now runs on YARN](multi-node/hadoop)!
 * [xgboost user group](https://groups.google.com/forum/#!forum/xgboost-user/) for tracking changes, sharing your experience on xgboost
 * [Distributed XGBoost](multi-node) is now available!!
@@ -37,7 +37,7 @@ What's New
 * Thanks to Tong He, the new [R package](R-package) is available
 Features
-======
+========
 * Sparse feature format:
   - Sparse feature format allows easy handling of missing values, and improve computation efficiency.
 * Push the limit on single machine:
@@ -74,7 +74,7 @@ Build
 Then run ```bash build.sh``` normally.
 Version
-======
+=======
 * This version xgboost-0.3, the code has been refactored from 0.2x to be cleaner and more flexibility
 * This version of xgboost is not compatible with 0.2x, due to huge amount of changes in code structure
   - This means the model and buffer file of previous version can not be loaded in xgboost-3.0
@@ -82,6 +82,6 @@ Version
 * Change log in [CHANGES.md](CHANGES.md)
 XGBoost in Graphlab Create
-======
+==========================
 * XGBoost is adopted as part of boosted tree toolkit in Graphlab Create (GLC). Graphlab Create is a powerful python toolkit that allows you to data manipulation, graph processing, hyper-parameter search, and visualization of TeraBytes scale data in one framework. Try the Graphlab Create in http://graphlab.com/products/create/quick-start-guide.html
 * Nice blogpost by Jay Gu using GLC boosted tree to solve kaggle bike sharing challenge: http://blog.graphlab.com/using-gradient-boosted-trees-to-predict-bike-sharing-demand