From e92d384a6a1440c89276045f1655ab87619dc3ef Mon Sep 17 00:00:00 2001
From: pommedeterresautee
Date: Fri, 8 May 2015 16:29:29 +0200
Subject: [PATCH 1/2] small change in the wording of Otto R markdown

---
 .../kaggle-otto/understandingXGBoostModel.Rmd | 24 +++++++++----------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/demo/kaggle-otto/understandingXGBoostModel.Rmd b/demo/kaggle-otto/understandingXGBoostModel.Rmd
index cf755909a..428d0faf0 100644
--- a/demo/kaggle-otto/understandingXGBoostModel.Rmd
+++ b/demo/kaggle-otto/understandingXGBoostModel.Rmd
@@ -42,13 +42,13 @@ Let's explore the dataset.
 dim(train)

 # Training content
-train[1:6,1:5, with =F]
+train[1:6, 1:5, with =F]

 # Test dataset dimensions
 dim(train)

 # Test content
-test[1:6,1:5, with =F]
+test[1:6, 1:5, with =F]
 ```

 > We only display the 6 first rows and 5 first columns for convenience
@@ -107,7 +107,7 @@ testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix
 Model training
 ==============

-Before the learning we will use the cross validation to evaluate the our error rate.
+Before training, we will use cross-validation to evaluate our error rate.

 Basically **XGBoost** will divide the training data in `nfold` parts, then **XGBoost** will retain the first part and use it as the test data. Then it will reintegrate the first part to the training dataset and retain the second part, do a training and so on...
@@ -144,21 +144,21 @@ Feature importance

 So far, we have built a model made of **`r nround`** trees.

-To build a tree, the dataset is divided recursively several times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).
+To build a *tree*, the dataset is divided recursively `max.depth` times. At the end of the process, you get groups of observations (here, these observations are properties regarding **Otto** products).

 Each division operation is called a *split*.

-Each group at each division level is called a branch and the deepest level is called a **leaf**.
+Each group at each division level is called a *branch* and the deepest level is called a *leaf*.

 In the final model, these leafs are supposed to be as pure as possible for each tree, meaning in our case that each leaf should be made of one class of **Otto** product only (of course it is not true, but that's what we try to achieve in a minimum of splits).

 **Not all splits are equally important**. Basically the first split of a tree will have more impact on the purity that, for instance, the deepest split. Intuitively, we understand that the first split makes most of the work, and the following splits focus on smaller parts of the dataset which have been missclassified by the first tree.

-In the same way, in Boosting we try to optimize the missclassification at each round (it is called the **loss**). So the first tree will do the big work and the following trees will focus on the remaining, on the parts not correctly learned by the previous trees.
+In the same way, in Boosting we try to optimize the misclassification at each round (it is called the *loss*). So the first tree will do most of the work and the following trees will focus on what remains, the parts not correctly learned by the previous trees.

-The improvement brought by each split can be measured, it is the **gain**.
+The improvement brought by each split can be measured; it is called the *gain*.

-Each split is done on one feature only at one value.
+Each split is done on one feature only at one specific value.

 Let's see what the model looks like.

@@ -189,7 +189,7 @@ importance_matrix <- xgb.importance(names, model = bst)
 xgb.plot.importance(importance_matrix[1:10,])
 ```

-> To make it understandable we first extract the column names from the `Matrix`.
+> To make the graph understandable, we first extract the column names from the `Matrix`.

 Interpretation
 --------------
@@ -198,9 +198,9 @@ In the feature importance above, we can see the first 10 most important features

 This function gives a color to each bar. Basically a K-means clustering is applied to group each feature by importance.

-From here you can take several actions. For instance you can remove the less important feature (feature selection process), or go deeper in the interaction between the most important features and labels.
+From here you can take several actions. For instance, you can remove the less important features (feature selection process), or go deeper into the interaction between the most important features and the labels.

-Or you can just reason about why these features are so importat (in **Otto** challenge we can't go this way because there is not enough information).
+Or you can try to guess why these features are so important (in the **Otto** challenge we can't go this way because there is not enough information).

 Tree graph
 ----------
@@ -216,7 +216,7 @@ xgb.plot.tree(feature_names = names, model = bst, n_first_tree = 2)

 We are just displaying the first two trees here. On simple models the first two trees may be enough. Here, it might not be the case. We can see from the size of the trees that the intersaction between features is complicated.

-Besides, **XGBoost** generate `k` trees at each round for a `k`-classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.
+Besides, **XGBoost** generates `K` trees at each round for a `K`-class classification problem. Therefore the two trees illustrated here are trying to classify data into different classes.

 Going deeper
 ============

From 11ba651a07d6ff573bf8e6b6125fe5079d9480d9 Mon Sep 17 00:00:00 2001
From: pommedeterresautee
Date: Fri, 8 May 2015 16:59:29 +0200
Subject: [PATCH 2/2] Regularization parameters documentation improvement

---
 R-package/R/xgb.train.R    | 4 ++--
 R-package/man/xgb.train.Rd | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/R-package/R/xgb.train.R b/R-package/R/xgb.train.R
index a99740f64..d75659737 100644
--- a/R-package/R/xgb.train.R
+++ b/R-package/R/xgb.train.R
@@ -16,11 +16,11 @@
 #' 2.1. Parameter for Tree Booster
 #'
 #' \itemize{
-#' \item \code{eta} step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinkage the feature weights to make the boosting process more conservative. Default: 0.3
+#' \item \code{eta} controls the learning rate: it scales the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} makes the model more robust to overfitting but slower to compute. Default: 0.3
 #' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
 #' \item \code{max_depth} maximum depth of a tree. Default: 6
 #' \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
-#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
+#' \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, and this will prevent overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter together with \code{eta} and to increase \code{nround}. Default: 1
 #' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
 #' \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
 #' }
diff --git a/R-package/man/xgb.train.Rd b/R-package/man/xgb.train.Rd
index 1bd243d60..a24f337f9 100644
--- a/R-package/man/xgb.train.Rd
+++ b/R-package/man/xgb.train.Rd
@@ -22,11 +22,11 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
 2.1. Parameter for Tree Booster
 \itemize{
- \item \code{eta} step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinkage the feature weights to make the boosting process more conservative. Default: 0.3
+ \item \code{eta} controls the learning rate: it scales the contribution of each tree by a factor of \code{0 < eta < 1} when it is added to the current approximation. Used to prevent overfitting by making the boosting process more conservative. A lower value for \code{eta} implies a larger value for \code{nrounds}: a low \code{eta} makes the model more robust to overfitting but slower to compute. Default: 0.3
 \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
 \item \code{max_depth} maximum depth of a tree. Default: 6
 \item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
- \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
+ \item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, and this will prevent overfitting. It also makes computation shorter (because there is less data to analyse). It is advised to use this parameter together with \code{eta} and to increase \code{nround}. Default: 1
 \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
 \item \code{num_parallel_tree} Experimental parameter. number of trees to grow per round. Useful to test Random Forest through Xgboost (set \code{colsample_bytree < 1}, \code{subsample < 1} and \code{round = 1}) accordingly. Default: 1
 }
@@ -49,7 +49,7 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
 \item \code{binary:logistic} logistic regression for binary classification. Output probability.
 \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
 \item \code{num_class} set the number of classes. To use only with multiclass objectives.
- \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is a number and should be from 0 \code{tonum_class}
+ \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective. Class is represented by a number and should be from 0 to \code{num_class}.
 \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probabilities of each data point belonging to each class.
 \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
 }
@@ -98,7 +98,7 @@ Number of threads can also be manually specified via \code{nthread} parameter.
 \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
 \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
 \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
- \item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
+ \item \code{ndcg} Normalized Discounted Cumulative Gain (for ranking task). \url{http://en.wikipedia.org/wiki/NDCG}
 }

 Full list of parameters is available in the Wiki \url{https://github.com/dmlc/xgboost/wiki/Parameters}.
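
Below is a minimal sketch of how the parameters documented in the second patch (`eta`, `subsample`, `nrounds`, `num_class`) fit together in a cross-validation run. The synthetic data, the chosen values and the object names (`x`, `y`, `dtrain`, `param`, `cv`) are illustrative assumptions only, not part of the patches.

```r
# Minimal sketch: a low eta combined with row subsampling, compensated by more
# rounds, and evaluated with cross-validation. Data and values are illustrative.
require(xgboost)

set.seed(1)
x <- matrix(rnorm(500 * 10), nrow = 500)
colnames(x) <- paste0("feat_", 1:10)
y <- sample(0:8, 500, replace = TRUE)          # 9 classes, as in the Otto data

dtrain <- xgb.DMatrix(data = x, label = y)

param <- list(objective   = "multi:softprob",  # per-class probabilities
              num_class   = 9,
              eval_metric = "merror",          # multiclass error rate
              eta         = 0.1,               # smaller steps: more conservative boosting
              subsample   = 0.5,               # each tree sees half of the rows
              max_depth   = 6)

# A lower eta (and subsample < 1) is usually paired with a larger nrounds.
cv <- xgb.cv(params = param, data = dtrain, nrounds = 100, nfold = 5)
```

Lowering `eta` while increasing `nrounds` trades computation time for a boosting process that is more robust to overfitting, which is the trade-off the updated `eta` and `subsample` entries describe.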
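
The *gain* defined in the first patch (the measured improvement brought by each split) is also what `xgb.importance` aggregates per feature. The following continues the sketch above, with the same caveat that the objects and values are illustrative assumptions.

```r
# Sketch of reading per-feature gain after training (continues the objects above).
bst <- xgb.train(params = param, data = dtrain, nrounds = 50)

importance <- xgb.importance(feature_names = colnames(x), model = bst)
head(importance)    # columns: Feature, Gain, Cover, Frequency
                    # Gain sums the loss reduction brought by each feature's splits

xgb.plot.importance(importance[1:10, ])
```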