Add Random Forest parameter (num_parallel_tree) in function doc + example in Vignette.

2015-03-29 01:52:26 +01:00 · be6bd3859d
commit be6bd3859d
parent 7d0ac3a3dd
3 changed files with 17 additions and 0 deletions
--- a/R-package/R/xgb.train.R
+++ b/R-package/R/xgb.train.R
@@ -22,6 +22,7 @@
 #'   \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
 #'   \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
 #'   \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+#'   \item \code{num_parallel_tree} number of trees to grow per round. Useful to test Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
 #' }
 #' 
 #' 2.2. Parameter for Linear Booster
--- a/R-package/man/xgb.train.Rd
+++ b/R-package/man/xgb.train.Rd
@@ -28,6 +28,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
  \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
  \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
  \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+  \item \code{num_parallel_tree} number of trees to grow per round. Useful to test Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
 }

 2.2. Parameter for Linear Booster
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -313,4 +313,19 @@ However, in Random Forests™ this random choice will be done for each tree, bec

 In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature have an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them.

+If you want to try the Random Forests™ algorithm, you can tweak XGBoost parameters. For instance, to build a model with 1000 trees, sampling half of the rows and half of the columns for each tree:
+
+```{r, warning=FALSE, message=FALSE}
+data(agaricus.train, package='xgboost')
+data(agaricus.test, package='xgboost')
+train <- agaricus.train
+test <- agaricus.test
+
+# Random Forest™ - 1000 trees
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree = 0.5, nrounds = 1, objective = "binary:logistic")
+
+# Boosting - 3 rounds
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nrounds = 3, objective = "binary:logistic")
+```
+
 > [**Random Forests™**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
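
The vignette example in this commit trains both models into the same `bst` variable, so only the boosted model survives. The following is a minimal sketch, not part of the commit, of how the two settings could be kept apart and compared on the bundled test set. It assumes `xgb.DMatrix`, `xgb.train` and `predict` behave as in the package documentation; the names `rf_model`, `boost_model` and `test_error` are hypothetical, and the forest uses 100 trees rather than 1000 only to keep the run short.

```r
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest <- xgb.DMatrix(data = agaricus.test$data, label = agaricus.test$label)

# Random-forest-style setting: a single round that grows many trees,
# each on a random half of the rows and a random half of the columns.
rf_params <- list(objective = "binary:logistic", max_depth = 4,
                  num_parallel_tree = 100, subsample = 0.5,
                  colsample_bytree = 0.5)
rf_model <- xgb.train(params = rf_params, data = dtrain, nrounds = 1, verbose = 0)

# Boosting setting: one tree per round, three rounds, no subsampling.
boost_params <- list(objective = "binary:logistic", max_depth = 4)
boost_model <- xgb.train(params = boost_params, data = dtrain, nrounds = 3, verbose = 0)

# Hypothetical helper: share of test rows misclassified at a 0.5 threshold.
test_error <- function(model) {
  pred <- predict(model, dtest)
  mean(as.numeric(pred > 0.5) != agaricus.test$label)
}

rf_error <- test_error(rf_model)
boost_error <- test_error(boost_model)
cat("random forest test error:", rf_error, "\n")
cat("boosting test error:", boost_error, "\n")
```

Going through `xgb.train` with an explicit `params` list keeps the parameter documented in this commit (`num_parallel_tree`) next to the sampling parameters it is meant to be combined with.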