Merge pull request #241 from pommedeterresautee/master
Add experimental RF parameter documentation
commit a596d11ed1
.gitignore
@@ -47,6 +47,7 @@ Debug
.Rproj.user
*.cpage.col
*.cpage
*.Rproj
xgboost
xgboost.mpi
xgboost.mock
@@ -22,6 +22,7 @@
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than \code{min_child_weight}, the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. Default: 1
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
#' \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
#' }
#'
#' 2.2. Parameter for Linear Booster
@@ -28,6 +28,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
\item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than \code{min_child_weight}, the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be. Default: 1
\item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. Default: 1
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
\item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nrounds = 1} accordingly). Default: 1
}

2.2. Parameter for Linear Booster
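To make the interplay of these parameters concrete, here is a minimal, hedged sketch of passing them through `params` to `xgb.train` (the `agaricus.train` data ships with the xgboost R package; the specific values are illustrative, not recommendations):

```r
require(xgboost)

# Load the bundled demo data and wrap it in an xgb.DMatrix
data(agaricus.train, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# Tree-booster parameters described above; num_parallel_tree is experimental
params <- list(objective = "binary:logistic",
               min_child_weight = 1,      # default; larger = more conservative
               subsample = 0.5,           # row sampling per tree
               colsample_bytree = 0.5,    # column sampling per tree
               num_parallel_tree = 1000)  # grow 1000 trees in the single round

# nrounds = 1 turns the ensemble into a Random Forest™-style model
bst <- xgb.train(params = params, data = dtrain, nrounds = 1)
```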
@@ -313,4 +313,25 @@ However, in Random Forests™ this random choice will be done for each tree, bec
In boosting, when a specific link between feature and outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature plays an important role in the link between the observations and the label. It is still up to you to search for the features correlated to the one detected as important, if you need to know all of them.
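If you want to check where the importance actually landed, a hedged sketch using `xgb.importance` from the xgboost R package (assuming a recent package version and a booster `bst` fitted on `train$data`):

```r
# Rank features by importance for a fitted booster `bst`
importance <- xgb.importance(feature_names = colnames(train$data), model = bst)
head(importance)  # with correlated features, expect one of them to dominate
```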
If you want to try the Random Forests™ algorithm, you can tweak Xgboost parameters!

**Warning**: this is still an experimental parameter.

For instance, to compute a model with 1000 trees, sampling rows and columns with a 0.5 ratio:
```{r, warning=FALSE, message=FALSE}
require(xgboost)

data(agaricus.train, package='xgboost')
data(agaricus.test, package='xgboost')
train <- agaricus.train
test <- agaricus.test

# Random Forest™ - 1000 trees
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree = 0.5, nround = 1, objective = "binary:logistic")

# Boosting - 3 rounds
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")
```

> Note that the parameter `nround` is set to `1` for the Random Forest™ model.

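As a quick follow-up (a hedged sketch, not part of the original vignette): since `test` is already loaded above, you can compare the fits on held-out data with the standard `predict` method:

```r
# Evaluate the last fitted model `bst` on the held-out test set;
# predict() returns probabilities for objective = "binary:logistic"
pred <- predict(bst, test$data)
err  <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test error:", err))
```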
> [**Random Forests™**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
README.md
@@ -5,25 +5,27 @@ It implements machine learning algorithm under gradient boosting framework, incl

Contributors: https://github.com/dmlc/xgboost/graphs/contributors

Tutorial and Documentation: https://github.com/dmlc/xgboost/wiki

Issues Tracker: [https://github.com/dmlc/xgboost/issues](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+label%3Aquestion)

Issues Tracker: [https://github.com/dmlc/xgboost/issues](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+label%3Aquestion) for bug reports and other issues

Please join [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/) to ask usage questions and share your experience on xgboost.

Please join [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/) to ask questions and share your experience on xgboost.

Examples Code: [Learning to use xgboost by examples](demo)

Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R](https://www.youtube.com/watch?v=Og7CGAfSr_Y)

Distributed Version: [Distributed XGBoost](multi-node)

Notes on the Code: [Code Guide](src)

Tutorial and Documentation: https://github.com/dmlc/xgboost/wiki

Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R](https://www.youtube.com/watch?v=Og7CGAfSr_Y)

Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
* This slide deck was made by Tianqi Chen to introduce gradient boosting from a statistical view.
* It presents boosted tree learning as a formal functional-space optimization of a defined objective.
* The model presented is used by xgboost for boosted trees.

Presentation of a real use case of XGBoost to prepare tax audits in France: [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
What's New
==========
* XGBoost now supports HDFS and S3
@@ -22,4 +22,3 @@ devtools::install_github('tqchen/xgboost',subdir='R-package')
Windows users may need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
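For completeness, a hedged sketch of the full installation from an R session (assuming `devtools` itself is not yet installed; the `install_github` call is the one shown in the hunk above):

```r
install.packages('devtools')  # one-time setup
devtools::install_github('tqchen/xgboost', subdir = 'R-package')
```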