[R] docs update - callbacks and parameter style

2016-06-27 01:59:58 -05:00
parent e9eb34fabc
commit a0aa305268
28 changed files with 564 additions and 162 deletions
--- a/R-package/man/xgb.cv.Rd
+++ b/R-package/man/xgb.cv.Rd
@@ -7,7 +7,7 @@
 xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
  prediction = FALSE, showsd = TRUE, metrics = list(), obj = NULL,
  feval = NULL, stratified = TRUE, folds = NULL, verbose = TRUE,
-  print.every.n = 1L, early.stop.round = NULL, maximize = NULL,
+  print_every_n = 1L, early_stopping_rounds = NULL, maximize = NULL,
  callbacks = list(), ...)
 }
 \arguments{
@@ -19,11 +19,11 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,
    \item \code{binary:logistic} logistic regression for classification
  }
  \item \code{eta} step size of each boosting step
-  \item \code{max.depth} maximum depth of the tree
+  \item \code{max_depth} maximum depth of the tree
  \item \code{nthread} number of thread used in training, if not set, all threads are used
 }

-  See \link{xgb.train} for further details.
+  See \code{\link{xgb.train}} for further details.
  See also demo/ for walkthrough example in R.}

 \item{data}{takes an \code{xgb.DMatrix} or \code{Matrix} as the input.}
@@ -32,14 +32,16 @@ xgb.cv(params = list(), data, nrounds, nfold, label = NULL, missing = NA,

 \item{nfold}{the original dataset is randomly partitioned into \code{nfold} equal size subsamples.}

-\item{label}{option field, when data is \code{Matrix}}
+\item{label}{vector of response values. Should be provided only when data is a \code{Matrix} rather than an \code{xgb.DMatrix}.}

-\item{missing}{Missing is only used when input is dense matrix, pick a float
-value that represents missing value. Sometime a data use 0 or other extreme value to represents missing values.}
+\item{missing}{is only used when input is a dense matrix. By default it is set to NA, which means
+that NA values are treated as 'missing' by the algorithm.
+Sometimes 0 or another extreme value is used to represent missing values.}

-\item{prediction}{A logical value indicating whether to return the prediction vector.}
+\item{prediction}{A logical value indicating whether to return the test fold predictions 
+from each CV model. This parameter engages the \code{\link{cb.cv.predict}} callback.}

-\item{showsd}{\code{boolean}, whether show standard deviation of cross validation}
+\item{showsd}{\code{boolean}, whether to show standard deviation of cross validation}

 \item{metrics, }{list of evaluation metrics to be used in cross validation,
  when it is not specified, the evaluation metric is chosen according to objective function.
@@ -59,34 +61,61 @@ gradient with given prediction and dtrain.}
 \code{list(metric='metric-name', value='metric-value')} with given 
 prediction and dtrain.}

-\item{stratified}{\code{boolean} whether sampling of folds should be stratified by the values of labels in \code{data}}
+\item{stratified}{a \code{boolean} indicating whether sampling of folds should be stratified 
+by the values of outcome labels.}

-\item{folds}{\code{list} provides a possibility of using a list of pre-defined CV folds (each element must be a vector of fold's indices).
-If folds are supplied, the nfold and stratified parameters would be ignored.}
+\item{folds}{a \code{list} of pre-defined CV folds
+(each element must be a vector of a test fold's indices). When folds are supplied,
+the \code{nfold} and \code{stratified} parameters are ignored.}

 \item{verbose}{\code{boolean}, print the statistics during the process}

-\item{print.every.n}{Print every N progress messages when \code{verbose>0}. Default is 1 which means all messages are printed.}
+\item{print_every_n}{Print evaluation messages for every n-th iteration when \code{verbose>0}.
+Default is 1, which means all messages are printed. This parameter is passed to the
+\code{\link{cb.print.evaluation}} callback.}

-\item{early.stop.round}{If \code{NULL}, the early stopping function is not triggered. 
+\item{early_stopping_rounds}{If \code{NULL}, the early stopping function is not triggered. 
 If set to an integer \code{k}, training with a validation set will stop if the performance 
-doesn't improve for \code{k} rounds.}
+doesn't improve for \code{k} rounds.
+Setting this parameter engages the \code{\link{cb.early.stop}} callback.}

-\item{maximize}{If \code{feval} and \code{early.stop.round} are set, then \code{maximize} must be set as well.
-\code{maximize=TRUE} means the larger the evaluation score the better.}
+\item{maximize}{If \code{feval} and \code{early_stopping_rounds} are set,
+then this parameter must be set as well.
+When it is \code{TRUE}, it means the larger the evaluation score the better.
+This parameter is passed to the \code{\link{cb.early.stop}} callback.}
+
+\item{callbacks}{a list of callback functions to perform various tasks during boosting.
+See \code{\link{callbacks}}. Some of the callbacks are automatically created depending on the
+parameters' values. Users can provide either existing or their own callback methods in order
+to customize the training process.}

 \item{...}{other parameters to pass to \code{params}.}
 }
 \value{
-TODO: update this...
-
-If \code{prediction = TRUE}, a list with the following elements is returned:
+An object of class \code{xgb.cv.synchronous} with the following elements:
 \itemize{
-  \item \code{dt} a \code{data.table} with each mean and standard deviation stat for training set and test set
-  \item \code{pred} an array or matrix (for multiclass classification) with predictions for each CV-fold for the model having been trained on the data in all other folds.
+  \item \code{call} a function call.
+  \item \code{params} parameters that were passed to the xgboost library. Note that it does not 
+        capture parameters changed by the \code{\link{cb.reset.parameters}} callback.
+  \item \code{callbacks} callback functions that were either automatically assigned or 
+        explicitly passed.
+  \item \code{evaluation_log} evaluation history stored as a \code{data.table} with the
+        first column corresponding to iteration number and the rest corresponding to the 
+        CV-based evaluation means and standard deviations for the training and test CV-sets.
+        It is created by the \code{\link{cb.evaluation.log}} callback.
+  \item \code{niter} number of boosting iterations.
+  \item \code{folds} the list of CV folds' indices - either those passed through the \code{folds} 
+        parameter or randomly generated.
+  \item \code{best_iteration} iteration number with the best evaluation metric value
+        (only available with early stopping).
+  \item \code{best_ntreelimit} the \code{ntreelimit} value corresponding to the best iteration, 
+        which can then be used in the \code{predict} method
+        (only available with early stopping).
+  \item \code{pred} CV prediction values available when \code{prediction} is set. 
+        It is either a vector or a matrix (see \code{\link{cb.cv.predict}}).
+  \item \code{models} a list of the CV folds' models. It is only available with the explicit
+        setting of the \code{cb.cv.predict(save_models = TRUE)} callback.
 }
-
-If \code{prediction = FALSE}, just a \code{data.table} with each mean and standard deviation stat for training set and test set is returned.
 }
 \description{
 The cross validation function of xgboost
@@ -105,9 +134,10 @@ Adapted from \url{http://en.wikipedia.org/wiki/Cross-validation_\%28statistics\%
 \examples{
 data(agaricus.train, package='xgboost')
 dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
-history <- xgb.cv(data = dtrain, nround=3, nthread = 2, nfold = 5, metrics=list("rmse","auc"),
-                  max.depth =3, eta = 1, objective = "binary:logistic")
-print(history)
+cv <- xgb.cv(data = dtrain, nrounds = 3, nthread = 2, nfold = 5, metrics = list("rmse", "auc"),
+             max_depth = 3, eta = 1, objective = "binary:logistic")
+print(cv)
+print(cv, verbose = TRUE)

 }
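
The renamed arguments and the new return object described in this diff can be exercised together. The following is a minimal sketch, assuming an xgboost R package built at or after this commit; the choice of `nrounds = 20`, the `auc` metric, and the `params` variable name are illustrative assumptions, not part of the commit.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# Booster parameters use the underscore style (max_depth, not max.depth).
params <- list(objective = "binary:logistic", max_depth = 3, eta = 1, nthread = 2)

cv <- xgb.cv(params = params, data = dtrain, nrounds = 20, nfold = 5,
             metrics = list("auc"),
             print_every_n = 5,          # renamed from print.every.n
             early_stopping_rounds = 3,  # renamed from early.stop.round
             maximize = TRUE,            # a larger AUC is better
             verbose = TRUE)

# The result is an object rather than a bare data.table;
# the per-iteration CV statistics live in evaluation_log.
print(cv$evaluation_log)

# Per the documentation above, this element is only available with early stopping.
print(cv$best_iteration)
```

Setting `early_stopping_rounds` is what engages the `cb.early.stop` callback according to the updated argument descriptions, so no callback needs to be passed explicitly in this sketch.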