diff --git a/.gitignore b/.gitignore
index 8b2c65f62..9fd1e0f72 100644
--- a/.gitignore
+++ b/.gitignore
@@ -47,6 +47,7 @@ Debug
 .Rproj.user
 *.cpage.col
 *.cpage
+*.Rproj
 xgboost
 xgboost.mpi
 xgboost.mock
diff --git a/R-package/R/xgb.train.R b/R-package/R/xgb.train.R
index d5cf5cbde..20908863f 100644
--- a/R-package/R/xgb.train.R
+++ b/R-package/R/xgb.train.R
@@ -22,6 +22,7 @@
 #' \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
 #' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
 #' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+#' \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nround = 1} accordingly). Default: 1
 #' }
 #'
 #' 2.2. Parameter for Linear Booster
diff --git a/R-package/man/xgb.train.Rd b/R-package/man/xgb.train.Rd
index d56f0b84e..3f93b3989 100644
--- a/R-package/man/xgb.train.Rd
+++ b/R-package/man/xgb.train.Rd
@@ -28,6 +28,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
  \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
  \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
  \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
+ \item \code{num_parallel_tree} Experimental parameter. Number of trees to grow per round. Useful for testing Random Forest through XGBoost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{nround = 1} accordingly). Default: 1
 }
 
 2.2. Parameter for Linear Booster
diff --git a/R-package/vignettes/discoverYourData.Rmd b/R-package/vignettes/discoverYourData.Rmd
index 49d5bf0cd..fa780ee94 100644
--- a/R-package/vignettes/discoverYourData.Rmd
+++ b/R-package/vignettes/discoverYourData.Rmd
@@ -313,4 +313,25 @@ However, in Random Forests™ this random choice will be done for each tree, bec
 In boosting, when a specific link between feature and outcome have been learned by the algorithm, it will try to not refocus on it (in theory it is what happens, reality is not always that simple). Therefore, all the importance will be on feature `A` or on feature `B` (but not both). You will know that one feature have an important role in the link between the observations and the label. It is still up to you to search for the correlated features to the one detected as important if you need to know all of them.
 
+If you want to try the Random Forests™ algorithm, you can tweak XGBoost parameters!
+
+**Warning**: this is still an experimental parameter.
+
+For instance, to compute a model with 1000 trees, with a 0.5 factor on sampling rows and columns:
+
+```{r, warning=FALSE, message=FALSE}
+data(agaricus.train, package='xgboost')
+data(agaricus.test, package='xgboost')
+train <- agaricus.train
+test <- agaricus.test
+
+# Random Forest™ - 1000 trees
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree = 0.5, nround = 1, objective = "binary:logistic")
+
+# Boosting - 3 rounds
+bst <- xgboost(data = train$data, label = train$label, max.depth = 4, nround = 3, objective = "binary:logistic")
+```
+
+> Note that the parameter `nround` is set to `1` for the Random Forest™ model.
+
 > [**Random Forests™**](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_papers.htm) is a trademark of Leo Breiman and Adele Cutler and is licensed exclusively to Salford Systems for the commercial release of the software.
\ No newline at end of file
diff --git a/README.md b/README.md
index 4acdc03dc..ec03ca336 100644
--- a/README.md
+++ b/README.md
@@ -5,25 +5,27 @@ It implements machine learning algorithm under gradient boosting framework, incl
 Contributors: https://github.com/dmlc/xgboost/graphs/contributors
 
-Turorial and Documentation: https://github.com/dmlc/xgboost/wiki
+Issues Tracker: [https://github.com/dmlc/xgboost/issues](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+label%3Aquestion)
 
-Issues Tracker: [https://github.com/dmlc/xgboost/issues](https://github.com/dmlc/xgboost/issues?q=is%3Aissue+label%3Aquestion) for bugreport and other issues
-
-Please join [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/) to ask usage questions and share your experience on xgboost.
+Please join [XGBoost User Group](https://groups.google.com/forum/#!forum/xgboost-user/) to ask questions and share your experience on xgboost.
 
 Examples Code: [Learning to use xgboost by examples](demo)
 
-Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
-
 Distributed Version: [Distributed XGBoost](multi-node)
 
 Notes on the Code: [Code Guide](src)
 
+Tutorial and Documentation: https://github.com/dmlc/xgboost/wiki
+
+Video tutorial: [Better Optimization with Repeated Cross Validation and the XGBoost model - Machine Learning with R](https://www.youtube.com/watch?v=Og7CGAfSr_Y)
+
 Learning about the model: [Introduction to Boosted Trees](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)
 * This slide is made by Tianqi Chen to introduce gradient boosting in a statistical view.
 * It present boosted tree learning as formal functional space optimization of defined objective.
 * The model presented is used by xgboost for boosted trees
 
+Presentation of a real use case of XGBoost to prepare tax audits in France: [Feature Importance Analysis with XGBoost in Tax audit](http://fr.slideshare.net/MichaelBENESTY/feature-importance-analysis-with-xgboost-in-tax-audit)
+
 What's New
 ==========
 * XGBoost now support HDFS and S3
diff --git a/demo/kaggle-otto/README.MD b/demo/kaggle-otto/README.MD
index 0c7bd45a4..94e422a13 100644
--- a/demo/kaggle-otto/README.MD
+++ b/demo/kaggle-otto/README.MD
@@ -22,4 +22,3 @@ devtools::install_github('tqchen/xgboost',subdir='R-package')
 
 Windows users may need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
-
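As a quick sanity check of the `num_parallel_tree` behaviour added above, the tree count of the resulting model can be inspected. This is only a sketch, assuming a build of the xgboost R package that includes this patch; it relies on the fact that each tree in the `xgb.dump` text output starts with a `booster[i]` header.

```r
# Sketch only: assumes the patched xgboost R package with the
# experimental num_parallel_tree parameter is installed.
library(xgboost)

data(agaricus.train, package = 'xgboost')
train <- agaricus.train

# One round, 10 parallel trees: a small Random Forest™-style ensemble.
bst <- xgboost(data = train$data, label = train$label, max.depth = 4,
               num_parallel_tree = 10, subsample = 0.5,
               colsample_bytree = 0.5, nround = 1,
               objective = "binary:logistic")

# Each tree contributes one "booster[i]" header to the text dump,
# so the ensemble should hold nround * num_parallel_tree trees.
dump <- xgb.dump(bst)
n_trees <- sum(grepl("^booster", dump))
```

If the parameter is wired through correctly, `n_trees` should equal `nround * num_parallel_tree` (here 10), whereas plain boosting with `num_parallel_tree = 1` would yield `nround` trees.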