From be6bd3859d4083eb89a12f266fc6ee1118aa8a8d Mon Sep 17 00:00:00 2001
From: El Potaeto
Date: Sun, 29 Mar 2015 01:52:26 +0100
Subject: [PATCH] Add Random Forest parameter (num_parallel_tree) in function doc + example in Vignette.

---
 R-package/R/xgb.train.R                  |  1 +
 R-package/man/xgb.train.Rd               |  1 +
 R-package/vignettes/discoverYourData.Rmd | 15 +++++++++++++++
 3 files changed, 17 insertions(+)

diff --git a/R-package/R/xgb.train.R b/R-package/R/xgb.train.R
index d5cf5cbde..1444964e5 100644
--- a/R-package/R/xgb.train.R
+++ b/R-package/R/xgb.train.R
@@ -22,6 +22,7 @@
 #' \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
 #' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
 #' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+#' \item \code{num_parallel_tree} number of trees to grow per round. Useful for testing Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
 #' }
 #'
 #' 2.2. Parameter for Linear Booster
diff --git a/R-package/man/xgb.train.Rd b/R-package/man/xgb.train.Rd
index d56f0b84e..91e21b50c 100644
--- a/R-package/man/xgb.train.Rd
+++ b/R-package/man/xgb.train.Rd
@@ -28,6 +28,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
 \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
 \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
 \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+\item \code{num_parallel_tree} number of trees to grow per round. Useful for testing Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
 }

 2.2. Parameter for Linear Booster
diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index 49d5bf0cd..9419a13ae 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -313,4 +313,19 @@ However, in Random Forests™ this random choice will be done for each tree, bec
 In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature have an important role in the link between the observations and the label. 
 It is still up to you to search for the correlated features to the one detected as important if you need to know all of them.
+If you want to try the Random Forests™ algorithm, you can tweak Xgboost parameters! For instance, to build a model of 1000 trees, with a 0.5 sampling factor on both rows and columns:
+
+```{r, warning=FALSE, message=FALSE}
+data(agaricus.train, package='xgboost')
+data(agaricus.test, package='xgboost')
+train <- agaricus.train
+test <- agaricus.test
+
+# Random Forest™ - 1000 trees
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree = 0.5, nround = 1, objective = "binary:logistic")
+
+# Boosting - 3 rounds
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")
+```
+
 
 > [**Random Forests™**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
\ No newline at end of file
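
A note on the vignette example above (not part of the patch itself): both calls store their result in the same `bst` object, so the two approaches are never compared. Below is a minimal sketch of how they could be evaluated on the held-out agaricus test set; the names `bst_rf`, `bst_gbt` and the `error_rate` helper are illustrative additions, not from the patch.

```r
library(xgboost)

# Same data as in the vignette chunk
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test

# Random Forest™ style: a single round growing 1000 parallel trees,
# each built on a random half of the rows and half of the columns
bst_rf <- xgboost(data = train$data, label = train$label, max.depth = 4,
                  num_parallel_tree = 1000, subsample = 0.5,
                  colsample_bytree = 0.5, nround = 1,
                  objective = "binary:logistic")

# Boosting: 3 sequential rounds, one tree per round
bst_gbt <- xgboost(data = train$data, label = train$label, max.depth = 4,
                   nround = 3, objective = "binary:logistic")

# Share of misclassified test observations for each model
error_rate <- function(model) {
  pred <- predict(model, test$data)
  mean(as.numeric(pred > 0.5) != test$label)
}
error_rate(bst_rf)
error_rate(bst_gbt)
```

Printing the two error rates side by side makes the trade-off between the single-round forest and the boosted trees directly visible.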