Merge pull request #175 from pommedeterresautee/master
Markdown vignette can be compiled as a package vignette (use devtools) + improve vignette text
Commit fe7651fe53

.gitignore (vendored)
@ -55,3 +55,6 @@ rabit
.Rbuildignore
R-package.Rproj

R-package/inst
R-package/src

@ -3,7 +3,7 @@ Type: Package
Title: eXtreme Gradient Boosting
Version: 0.3-3
Date: 2014-12-28
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>, Michaël Benesty <michael@benesty.fr>
Author: Tianqi Chen <tianqi.tchen@gmail.com>, Tong He <hetong007@gmail.com>, Michael Benesty <michael@benesty.fr>
Maintainer: Tong He <hetong007@gmail.com>
Description: This package is an R wrapper of xgboost, which is short for eXtreme
    Gradient Boosting. It is an efficient and scalable implementation of
@ -29,4 +29,4 @@ Imports:
    stringr (>= 0.6.2),
    DiagrammeR (>= 0.4),
    ggplot2 (>= 1.0.0),
    Ckmeans.1d.dp (>= 3.3.0)
    Ckmeans.1d.dp (>= 3.3.1)

@ -1,21 +1,54 @@
|
||||
#' eXtreme Gradient Boosting Training
|
||||
#'
|
||||
#' The training function of xgboost
|
||||
#' An advanced interface for training xgboost model. Look at \code{\link{xgboost}} function for a simpler interface.
|
||||
#'
|
||||
#' @param params the list of parameters. Commonly used ones are:
|
||||
#' @param params the list of parameters.
|
||||
#'
|
||||
#' 1. General Parameters
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{objective} objective function, common ones are
|
||||
#' \itemize{
|
||||
#' \item \code{reg:linear} linear regression
|
||||
#' \item \code{binary:logistic} logistic regression for classification
|
||||
#' }
|
||||
#' \item \code{eta} step size of each boosting step
|
||||
#' \item \code{max.depth} maximum depth of the tree
|
||||
#' \item \code{nthread} number of thread used in training, if not set, all threads are used
|
||||
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}
|
||||
#' \item \code{silent} 0 means printing running messages, 1 means silent mode. Default: 0
|
||||
#' }
|
||||
#'
|
||||
#' See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
|
||||
#' further details. See also demo/ for walkthrough example in R.
|
||||
#'
|
||||
#' 2. Booster Parameters
|
||||
#'
|
||||
#' 2.1. Parameter for Tree Booster
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{eta} step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. Default: 1
|
||||
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
|
||||
#' }
|
||||
#'
|
||||
#' 2.2. Parameter for Linear Booster
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{lambda} L2 regularization term on weights. Default: 0
|
||||
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
|
||||
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
|
||||
#' }
|
||||
#'
|
||||
#' 3. Task Parameters
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
|
||||
#' \itemize{
|
||||
#' \item \code{reg:linear} linear regression (Default).
|
||||
#' \item \code{reg:logistic} logistic regression.
|
||||
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
|
||||
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
|
||||
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
|
||||
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
|
||||
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
|
||||
#' }
|
||||
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
|
||||
#' \item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
|
||||
#' }
|
||||
#'
|
||||
#' @param data takes an \code{xgb.DMatrix} as the input.
|
||||
#' @param nrounds the max number of iterations
|
||||
#' @param watchlist what information should be printed when \code{verbose=1} or
|
||||
@ -35,15 +68,27 @@
|
||||
#' @param ... other parameters to pass to \code{params}.
|
||||
#'
|
||||
#' @details
|
||||
#' This is the training function for xgboost.
|
||||
#' This is the training function for \code{xgboost}.
|
||||
#'
|
||||
#' It supports advanced features such as \code{watchlist}, customized objective function (\code{feval}),
|
||||
#' therefore it is more flexible than \code{\link{xgboost}} function.
|
||||
#'
|
||||
#' Parallelization is automatically enabled if OpenMP is present.
|
||||
#' Number of threads can also be manually specified via "nthread" parameter.
|
||||
#' Parallelization is automatically enabled if \code{OpenMP} is present.
|
||||
#' Number of threads can also be manually specified via \code{nthread} parameter.
|
||||
#'
|
||||
#' This function only accepts an \code{xgb.DMatrix} object as the input.
|
||||
#' It supports advanced features such as watchlist, customized objective function,
|
||||
#' therefore it is more flexible than \code{\link{xgboost}}.
|
||||
#' The \code{eval_metric} parameter (not listed above) is set automatically by Xgboost but can be overridden via this parameter. Below is the list of the different metrics optimized by Xgboost, to help you understand how it works internally or to use them with the \code{watchlist} parameter.
|
||||
#' \itemize{
|
||||
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
|
||||
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
|
||||
#' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
|
||||
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
|
||||
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.
|
||||
#' \item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
|
||||
#' }
|
||||
#'
|
||||
#' Full list of parameters is available in the Wiki \url{https://github.com/tqchen/xgboost/wiki/Parameters}.
|
||||
#'
|
||||
#' This function only accepts an \code{\link{xgb.DMatrix}} object as the input.
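For illustration only, a minimal sketch of such an advanced call, using the agaricus demo data shipped with the package and a made-up custom error metric passed through `feval`:

```r
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(agaricus.test$data, label = agaricus.test$label)

# a made-up evaluation metric: plain classification error, assuming the
# predictions passed in are probabilities
evalerror <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- mean(as.numeric(preds > 0.5) != labels)
  list(metric = "custom-error", value = err)
}

bst <- xgb.train(params = list(max.depth = 2, eta = 1, objective = "binary:logistic"),
                 data = dtrain, nrounds = 2,
                 watchlist = list(train = dtrain, test = dtest),
                 feval = evalerror)
```

The watchlist makes xgboost report the metric on both datasets at every round, which is how you would monitor overfitting.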
|
||||
#'
|
||||
#' @examples
|
||||
#' data(agaricus.train, package='xgboost')
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
#' eXtreme Gradient Boosting (Tree) library
|
||||
#'
|
||||
#' A simple interface for xgboost in R
|
||||
#' A simple interface for training xgboost model. Look at \code{\link{xgb.train}} function for a more advanced interface.
|
||||
#'
|
||||
#' @param data takes \code{matrix}, \code{dgCMatrix}, local data file or
|
||||
#' \code{xgb.DMatrix}.
|
||||
@ -8,50 +8,21 @@
|
||||
#' if data is local data file or \code{xgb.DMatrix}.
|
||||
#' @param params the list of parameters.
|
||||
#'
|
||||
#' 1. General Parameters
|
||||
#'
|
||||
#' Commonly used ones are:
|
||||
#' \itemize{
|
||||
#' \item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}
|
||||
#' \item \code{silent} 0 means printing running messages, 1 means silent mode. Default: 0
|
||||
#' }
|
||||
#'
|
||||
#' 2. Booster Parameters
|
||||
#'
|
||||
#' 2.1. Parameter for Tree Booster
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{eta} step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta actually shrinks the feature weights to make the boosting process more conservative. Default: 0.3
#' \item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. The larger, the more conservative the algorithm will be.
#' \item \code{max_depth} maximum depth of a tree. Default: 6
#' \item \code{min_child_weight} minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger, the more conservative the algorithm will be. Default: 1
#' \item \code{subsample} subsample ratio of the training instances. Setting it to 0.5 means that xgboost randomly collects half of the data instances to grow trees, which prevents overfitting. Default: 1
|
||||
#' \item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
|
||||
#' }
|
||||
#'
|
||||
#' 2.2. Parameter for Linear Booster
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{lambda} L2 regularization term on weights. Default: 0
|
||||
#' \item \code{lambda_bias} L2 regularization term on bias. Default: 0
|
||||
#' \item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
|
||||
#' }
|
||||
#'
|
||||
#' 3. Task Parameters
|
||||
#'
|
||||
#' \itemize{
|
||||
#' \item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
|
||||
#' \item \code{objective} objective function, common ones are
|
||||
#' \itemize{
|
||||
#' \item \code{reg:linear} linear regression (Default).
|
||||
#' \item \code{reg:logistic} logistic regression.
|
||||
#' \item \code{binary:logistic} logistic regression for binary classification. Output probability.
|
||||
#' \item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
|
||||
#' \item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
|
||||
#' \item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
|
||||
#' \item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
|
||||
#' \item \code{reg:linear} linear regression
|
||||
#' \item \code{binary:logistic} logistic regression for classification
|
||||
#' }
|
||||
#' \item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
|
||||
#' \item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
|
||||
#' \item \code{eta} step size of each boosting step
|
||||
#' \item \code{max.depth} maximum depth of the tree
|
||||
#' \item \code{nthread} number of thread used in training, if not set, all threads are used
|
||||
#' }
|
||||
#'
|
||||
#' Look at \code{\link{xgb.train}} for a more complete list of parameters or \url{https://github.com/tqchen/xgboost/wiki/Parameters} for the full list.
|
||||
#'
|
||||
#' See also \code{demo/} for walkthrough example in R.
|
||||
#'
|
||||
#' @param nrounds the max number of iterations
|
||||
#' @param verbose If 0, xgboost will stay silent. If 1, xgboost will print
|
||||
@ -62,22 +33,11 @@
|
||||
#' @param ... other parameters to pass to \code{params}.
|
||||
#'
|
||||
#' @details
|
||||
#' This is the modeling function for xgboost.
|
||||
#' This is the modeling function for Xgboost.
|
||||
#'
|
||||
#' Parallelization is automatically enabled if OpenMP is present.
|
||||
#' Number of threads can also be manually specified via "nthread" parameter.
|
||||
#' Parallelization is automatically enabled if \code{OpenMP} is present.
|
||||
#'
|
||||
#' \code{eval_metric} is set automatically by xgboost but can be overridden via this parameter. Below is the list of the different metrics optimized by xgboost, to help you understand how it works internally. It should not be overridden unless you have a real reason to do so.
|
||||
#' \itemize{
|
||||
#' \item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
|
||||
#' \item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
|
||||
#' \item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
|
||||
#' \item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
|
||||
#' \item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_curve} for ranking evaluation.
|
||||
#' \item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
|
||||
#' }
|
||||
#'
|
||||
#' More parameters are available in the Wiki \url{https://github.com/tqchen/xgboost/wiki/Parameters}.
|
||||
#' Number of threads can also be manually specified via \code{nthread} parameter.
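A minimal sketch of the simple interface (using the agaricus demo data shipped with the package; the parameter values are illustrative only), where extras such as `nthread` and `eval_metric` are forwarded through `...`:

```r
data(agaricus.train, package = 'xgboost')

# nthread and eval_metric are passed on to params through ...
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               max.depth = 2, eta = 1, nround = 2, nthread = 2,
               objective = "binary:logistic", eval_metric = "logloss")
```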
|
||||
#'
|
||||
#' @examples
|
||||
#' data(agaricus.train, package='xgboost')
|
||||
|
||||
@ -8,20 +8,52 @@ xgb.train(params = list(), data, nrounds, watchlist = list(), obj = NULL,
|
||||
feval = NULL, verbose = 1, ...)
|
||||
}
|
||||
\arguments{
|
||||
\item{params}{the list of parameters. Commonly used ones are:
|
||||
\item{params}{the list of parameters.
|
||||
|
||||
1. General Parameters
|
||||
|
||||
\itemize{
|
||||
\item \code{objective} objective function, common ones are
|
||||
\itemize{
|
||||
\item \code{reg:linear} linear regression
|
||||
\item \code{binary:logistic} logistic regression for classification
|
||||
}
|
||||
\item \code{eta} step size of each boosting step
|
||||
\item \code{max.depth} maximum depth of the tree
|
||||
\item \code{nthread} number of thread used in training, if not set, all threads are used
|
||||
\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}
|
||||
\item \code{silent} 0 means printing running messages, 1 means silent mode. Default: 0
|
||||
}
|
||||
|
||||
See \url{https://github.com/tqchen/xgboost/wiki/Parameters} for
|
||||
further details. See also demo/ for walkthrough example in R.}
|
||||
2. Booster Parameters
|
||||
|
||||
2.1. Parameter for Tree Booster
|
||||
|
||||
\itemize{
|
||||
\item \code{eta} step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinkage the feature weights to make the boosting process more conservative. Default: 0.3
|
||||
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
|
||||
\item \code{max_depth} maximum depth of a tree. Default: 6
|
||||
\item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
|
||||
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
|
||||
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
|
||||
}
|
||||
|
||||
2.2. Parameter for Linear Booster
|
||||
|
||||
\itemize{
|
||||
\item \code{lambda} L2 regularization term on weights. Default: 0
|
||||
\item \code{lambda_bias} L2 regularization term on bias. Default: 0
|
||||
\item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
|
||||
}
|
||||
|
||||
3. Task Parameters
|
||||
|
||||
\itemize{
|
||||
\item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
|
||||
\itemize{
|
||||
\item \code{reg:linear} linear regression (Default).
|
||||
\item \code{reg:logistic} logistic regression.
|
||||
\item \code{binary:logistic} logistic regression for binary classification. Output probability.
|
||||
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
|
||||
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
|
||||
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
|
||||
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
|
||||
}
|
||||
\item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
|
||||
\item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
|
||||
}}
|
||||
|
||||
\item{data}{takes an \code{xgb.DMatrix} as the input.}
|
||||
|
||||
@ -46,17 +78,30 @@ prediction and dtrain,}
|
||||
\item{...}{other parameters to pass to \code{params}.}
|
||||
}
|
||||
\description{
|
||||
The training function of xgboost
|
||||
An advanced interface for training xgboost model. Look at \code{\link{xgboost}} function for a simpler interface.
|
||||
}
|
||||
\details{
|
||||
This is the training function for xgboost.
|
||||
This is the training function for \code{xgboost}.
|
||||
|
||||
Parallelization is automatically enabled if OpenMP is present.
|
||||
Number of threads can also be manually specified via "nthread" parameter.
|
||||
It supports advanced features such as \code{watchlist}, customized objective function (\code{feval}),
|
||||
therefore it is more flexible than \code{\link{xgboost}} function.
|
||||
|
||||
This function only accepts an \code{xgb.DMatrix} object as the input.
|
||||
It supports advanced features such as watchlist, customized objective function,
|
||||
therefore it is more flexible than \code{\link{xgboost}}.
|
||||
Parallelization is automatically enabled if \code{OpenMP} is present.
|
||||
Number of threads can also be manually specified via \code{nthread} parameter.
|
||||
|
||||
\code{eval_metric} parameter (not listed above) is set automatically by Xgboost but can be overriden by parameter. Below is provided the list of different metric optimized by Xgboost to help you to understand how it works inside or to use them with the \code{watchlist} parameter.
|
||||
\itemize{
|
||||
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
|
||||
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
|
||||
\item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
|
||||
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
|
||||
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
|
||||
\item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
|
||||
}
|
||||
|
||||
Full list of parameters is available in the Wiki \url{https://github.com/tqchen/xgboost/wiki/Parameters}.
|
||||
|
||||
This function only accepts an \code{\link{xgb.DMatrix}} object as the input.
|
||||
}
|
||||
\examples{
|
||||
data(agaricus.train, package='xgboost')
|
||||
|
||||
@ -19,50 +19,21 @@ value that represents missing value. Sometimes a data use 0 or other extreme val
|
||||
|
||||
\item{params}{the list of parameters.
|
||||
|
||||
1. General Parameters
|
||||
|
||||
Commonly used ones are:
|
||||
\itemize{
|
||||
\item \code{booster} which booster to use, can be \code{gbtree} or \code{gblinear}. Default: \code{gbtree}
|
||||
\item \code{silent} 0 means printing running messages, 1 means silent mode. Default: 0
|
||||
}
|
||||
|
||||
2. Booster Parameters
|
||||
|
||||
2.1. Parameter for Tree Booster
|
||||
|
||||
\itemize{
|
||||
\item \code{eta} step size shrinkage used in update to prevents overfitting. After each boosting step, we can directly get the weights of new features. and eta actually shrinkage the feature weights to make the boosting process more conservative. Default: 0.3
|
||||
\item \code{gamma} minimum loss reduction required to make a further partition on a leaf node of the tree. the larger, the more conservative the algorithm will be.
|
||||
\item \code{max_depth} maximum depth of a tree. Default: 6
|
||||
\item \code{min_child_weight} minimum sum of instance weight(hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, then the building process will give up further partitioning. In linear regression mode, this simply corresponds to minimum number of instances needed to be in each node. The larger, the more conservative the algorithm will be. Default: 1
|
||||
\item \code{subsample} subsample ratio of the training instance. Setting it to 0.5 means that xgboost randomly collected half of the data instances to grow trees and this will prevent overfitting. Default: 1
|
||||
\item \code{colsample_bytree} subsample ratio of columns when constructing each tree. Default: 1
|
||||
}
|
||||
|
||||
2.2. Parameter for Linear Booster
|
||||
|
||||
\itemize{
|
||||
\item \code{lambda} L2 regularization term on weights. Default: 0
|
||||
\item \code{lambda_bias} L2 regularization term on bias. Default: 0
|
||||
\item \code{alpha} L1 regularization term on weights. (there is no L1 reg on bias because it is not important). Default: 0
|
||||
}
|
||||
|
||||
3. Task Parameters
|
||||
|
||||
\itemize{
|
||||
\item \code{objective} specify the learning task and the corresponding learning objective, and the objective options are below:
|
||||
\item \code{objective} objective function, common ones are
|
||||
\itemize{
|
||||
\item \code{reg:linear} linear regression (Default).
|
||||
\item \code{reg:logistic} logistic regression.
|
||||
\item \code{binary:logistic} logistic regression for binary classification. Output probability.
|
||||
\item \code{binary:logitraw} logistic regression for binary classification, output score before logistic transformation.
|
||||
\item \code{multi:softmax} set xgboost to do multiclass classification using the softmax objective, you also need to set num_class(number of classes).
|
||||
\item \code{multi:softprob} same as softmax, but output a vector of ndata * nclass, which can be further reshaped to ndata, nclass matrix. The result contains predicted probability of each data point belonging to each class.
|
||||
\item \code{rank:pairwise} set xgboost to do ranking task by minimizing the pairwise loss.
|
||||
\item \code{reg:linear} linear regression
|
||||
\item \code{binary:logistic} logistic regression for classification
|
||||
}
|
||||
\item \code{base_score} the initial prediction score of all instances, global bias. Default: 0.5
|
||||
\item \code{eval_metric} evaluation metrics for validation data. Default: metric will be assigned according to objective(rmse for regression, and error for classification, mean average precision for ranking). List is provided in detail section.
|
||||
}}
|
||||
\item \code{eta} step size of each boosting step
|
||||
\item \code{max.depth} maximum depth of the tree
|
||||
\item \code{nthread} number of thread used in training, if not set, all threads are used
|
||||
}
|
||||
|
||||
Look at \code{\link{xgb.train}} for a more complete list of parameters or \url{https://github.com/tqchen/xgboost/wiki/Parameters} for the full list.
|
||||
|
||||
See also \code{demo/} for walkthrough example in R.}
|
||||
|
||||
\item{nrounds}{the max number of iterations}
|
||||
|
||||
@ -73,25 +44,14 @@ performance and construction progress information}
|
||||
\item{...}{other parameters to pass to \code{params}.}
|
||||
}
|
||||
\description{
|
||||
A simple interface for xgboost in R
|
||||
A simple interface for training xgboost model. Look at \code{\link{xgb.train}} function for a more advanced interface.
|
||||
}
|
||||
\details{
|
||||
This is the modeling function for xgboost.
|
||||
This is the modeling function for Xgboost.
|
||||
|
||||
Parallelization is automatically enabled if OpenMP is present.
|
||||
Number of threads can also be manually specified via "nthread" parameter.
|
||||
Parallelization is automatically enabled if \code{OpenMP} is present.
|
||||
|
||||
\code{eval_metric} is set automatically by xgboost but can be overriden by parameter. Below is provided the list of different metric optimized by xgboost to help you to understand how it works inside. It should not be overriden until you have a real reason to do so.
|
||||
\itemize{
|
||||
\item \code{rmse} root mean square error. \url{http://en.wikipedia.org/wiki/Root_mean_square_error}
|
||||
\item \code{logloss} negative log-likelihood. \url{http://en.wikipedia.org/wiki/Log-likelihood}
|
||||
\item \code{error} Binary classification error rate. It is calculated as \code{(wrong cases) / (all cases)}. For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
|
||||
\item \code{merror} Multiclass classification error rate. It is calculated as \code{(wrong cases) / (all cases)}.
|
||||
\item \code{auc} Area under the curve. \url{http://en.wikipedia.org/wiki/Receiver_operating_characteristic#'Area_under_curve} for ranking evaluation.
|
||||
\item \code{ndcg} Normalized Discounted Cumulative Gain. \url{http://en.wikipedia.org/wiki/NDCG}
|
||||
}
|
||||
|
||||
More parameters are available in the Wiki \url{https://github.com/tqchen/xgboost/wiki/Parameters}.
|
||||
Number of threads can also be manually specified via \code{nthread} parameter.
|
||||
}
|
||||
\examples{
|
||||
data(agaricus.train, package='xgboost')
|
||||
|
||||
@ -1,10 +1,15 @@
|
||||
---
|
||||
title: "Understand your dataset with Xgboost"
|
||||
output:
|
||||
html_document:
|
||||
output:
|
||||
rmarkdown::html_vignette:
|
||||
css: vignette.css
|
||||
number_sections: yes
|
||||
toc: yes
|
||||
author: Tianqi Chen, Tong He, Michaël Benesty
|
||||
vignette: >
|
||||
%\VignetteIndexEntry{Discover your data}
|
||||
%\VignetteEngine{knitr::rmarkdown}
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
Introduction
|
||||
@ -12,9 +17,7 @@ Introduction
|
||||
|
||||
The purpose of this Vignette is to show you how to use **Xgboost** to discover and better understand your own dataset.
|
||||
|
||||
You may know **Xgboost** as a state of the art tool to build some kind of Machine learning models. It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competition.
|
||||
|
||||
During these competition, the purpose is to make prediction. This Vignette is not about showing you how to predict anything. The purpose of this document is to explain *how to use **Xgboost** to understand the link between the features of your data and an outcome*.
|
||||
This Vignette is not about showing you how to predict anything (see [Xgboost presentation](www.somewhere.org)). The purpose of this document is to explain how to use **Xgboost** to understand the *link* between the *features* of your data and an *outcome*.
|
||||
|
||||
For the purpose of this tutorial we will first load the required packages.
|
||||
|
||||
@ -22,27 +25,27 @@ For the purpose of this tutorial we will first load the required packages.
|
||||
require(xgboost)
|
||||
require(Matrix)
|
||||
require(data.table)
|
||||
if (!require(vcd)) install.packages('vcd')
|
||||
if (!require('vcd')) install.packages('vcd')
|
||||
```
|
||||
> **VCD** is used for one of its embedded dataset only (and not for its own functions).
|
||||
> The **VCD** package is used for one of its embedded datasets only (and not for its own functions).
|
||||
|
||||
Preparation of the dataset
|
||||
==========================
|
||||
|
||||
According to its documentation, **Xgboost** works only on `numeric` variables.
|
||||
**Xgboost** works only on `numeric` variables.
|
||||
|
||||
Sometimes the dataset we have to work on contains *categorical* data.

A *categorical* variable is one which has a fixed number of different values. For example, if for each observation a variable called *Colour* can have only *red*, *blue* or *green* as value, it is a *categorical* variable.
|
||||
|
||||
> In **R**, *categorical* variable is called `factor`.
|
||||
> In *R*, *categorical* variable is called `factor`.
|
||||
> Type `?factor` in console for more information.
|
||||
|
||||
In this demo we will see how to transform a dense dataframe with *categorical* variables to a sparse matrix before analyzing it in **Xgboost**.
|
||||
In this demo we will see how to transform a dense dataframe (dense = few zeroes in the matrix) with *categorical* variables to a very sparse matrix (sparse = lots of zeroes in the matrix) of `numeric` features before analyzing these data in **Xgboost**.
|
||||
|
||||
The method we are going to see is usually called [one hot encoding](http://en.wikipedia.org/wiki/One-hot).
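As a tiny illustration of the idea (toy data, not the Arthritis dataset used below), a 3-level factor becomes three binary columns:

```{r, eval=FALSE}
# toy example: one-hot encode a 3-level factor
colour <- factor(c("red", "blue", "green", "red"))
model.matrix(~ colour - 1)
```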
|
||||
|
||||
The first step is to load Arthritis dataset in memory and create a copy of the dataset with `data.table` package (`data.table` is 100% compliant with **R** dataframe but its syntax is a lot more consistent and its performance are really good).
|
||||
The first step is to load the Arthritis dataset in memory and wrap it with the `data.table` package (`data.table` is 100% compliant with *R* dataframe but its syntax is a lot more consistent and its performance is really good).
|
||||
|
||||
```{r, results='hide'}
|
||||
data(Arthritis)
|
||||
@ -65,16 +68,17 @@ str(df)
|
||||
> An `ordinal` variable is a categorical variable with values which can be ordered
|
||||
> Here: `None` > `Some` > `Marked`.
|
||||
|
||||
Let's add some new categorical features to see if it helps.
|
||||
Let's add some new *categorical* features to see if it helps.
|
||||
|
||||
Of course these feature are highly correlated to the Age feature. Usually it's not a good thing in ML, but tree algorithms (including boosted trees) are able to select the best features, even in case of highly correlated features.
|
||||
Of course these features are highly correlated with the Age feature. Usually it's not a good thing in Machine Learning. Fortunately, tree algorithms (including boosted trees) are very robust in this specific case.
|
||||
|
||||
```{r}
|
||||
df[,AgeDiscret:= as.factor(round(Age/10,0))][1:10]
|
||||
```
|
||||
|
||||
> For the first feature we create groups of age by rounding the real age.
|
||||
> Note that we transform it to `factor` so the algorithm treat them as independant values.
|
||||
> For the first feature we create groups of age by rounding the real age.
|
||||
> Note that we transform it to `factor` so the algorithm treats these age groups as independent values.
> Therefore, 20 is not closer to 30 than to 60. To make it short, the distance between ages is lost in this transformation.
|
||||
|
||||
Following is an even stronger simplification of the real age with an arbitrary split at 30 years old. I chose this value **based on nothing**. We will see later if simplifying the information based on arbitrary values is a good strategy (I am sure you already have an idea of how well it will work!).
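A minimal sketch of what such a split could look like (the column name `AgeCat` and the labels are made up for the illustration), using the same `data.table` syntax as above:

```{r, eval=FALSE}
# arbitrary binary split of Age at 30 years old
df[, AgeCat := as.factor(ifelse(Age > 30, "Old", "Young"))][1:10]
```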
|
||||
|
||||
@ -97,11 +101,10 @@ print(levels(df[,Treatment]))
|
||||
|
||||
Next step, we will transform the categorical data to dummy variables.
|
||||
This is the [one hot encoding](http://en.wikipedia.org/wiki/One-hot) part.
|
||||
The purpose is to transform each value of each *categorical* feature in a binary feature.
|
||||
|
||||
For example, the column Treatment will be replaced by two columns, Placebo, and Treated. Each of them will be *binary*. For example an observation which had the value Placebo in column Treatment before the transformation will have, after the transformation, the value 1 in the new column Placebo and the value 0 in the new column Treated.
|
||||
The purpose is to transform each value of each *categorical* feature into a binary feature `{0, 1}`.
|
||||
|
||||
> Formulae `Improved~.-1` used below means transform all *categorical* features but column Improved to binary values.
|
||||
For example, the column Treatment will be replaced by two columns, Placebo and Treated. Each of them will be *binary*. Therefore, an observation which has the value Placebo in column Treatment before the transformation will have, after the transformation, the value `1` in the new column Placebo and the value `0` in the new column Treated.
|
||||
|
||||
Column Improved is excluded because it will be our output column, the one we want to predict.
|
||||
|
||||
@ -110,10 +113,12 @@ sparse_matrix <- sparse.model.matrix(Improved~.-1, data = df)
|
||||
print(sparse_matrix[1:10,])
|
||||
```
|
||||
|
||||
Create the output vector (not as a sparse `Matrix`):
|
||||
> Formulae `Improved~.-1` used above means transform all *categorical* features but column Improved to binary values.
|
||||
|
||||
1. Set, for all rows, field in Y column to 0;
|
||||
2. set Y to 1 when Improved == Marked;
|
||||
Create the output `numeric` vector (not as a sparse `Matrix`):
|
||||
|
||||
1. Set, for all rows, field in Y column to `0`;
|
||||
2. set Y to `1` when Improved == Marked;
|
||||
3. Return Y column.
|
||||
|
||||
```{r}
|
||||
@ -123,7 +128,7 @@ output_vector = df[,Y:=0][Improved == "Marked",Y:=1][,Y]
|
||||
Build the model
|
||||
===============
|
||||
|
||||
The code below is very usual. For more information, you can look at the documentation of `xgboost()` function.
|
||||
The code below is very usual. For more information, you can look at the documentation of `xgboost` function (or to the vignette [Xgboost presentation](www.somewhere.org)).
|
||||
|
||||
```{r}
|
||||
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
|
||||
@ -131,7 +136,7 @@ bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
|
||||
|
||||
```
|
||||
|
||||
You can see plenty of `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well your model explains your data. Lower is better.
|
||||
You can see plenty of `train-error: 0.XXXXX` lines followed by a number. It decreases. Each line shows how well your model explains your data. Lower is better.
|
||||
|
||||
A model which fits too well may [overfit](http://en.wikipedia.org/wiki/Overfitting) (meaning it copies the past too closely and is not that good at predicting the future).
|
||||
|
||||
@ -145,7 +150,7 @@ Feature importance
|
||||
Measure feature importance
|
||||
--------------------------
|
||||
|
||||
In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the values of the feature (because one binary column == one value of one *categorical* feature)
|
||||
In the code below, `sparse_matrix@Dimnames[[2]]` represents the column names of the sparse matrix. These names are the original values of the feature (remember, one binary column == one value of one *categorical* feature).
|
||||
|
||||
```{r}
|
||||
importance <- xgb.importance(sparse_matrix@Dimnames[[2]], model = bst)
|
||||
@ -155,9 +160,9 @@ print(importance)
|
||||
> The column `Gain` provides the information we are looking for.
> As you can see, features are ranked by `Gain`.
|
||||
|
||||
`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch there was some wrongly classified elements, after adding the split on this feature, there are two new branches, and each of these branch is more accurate (one branch saying if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite, both new branch being more accurate than the one before the insertion of the feature).
|
||||
`Gain` is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before adding a new split on a feature X to the branch, there were some wrongly classified elements; after adding the split on this feature, there are two new branches, and each of these branches is more accurate (one branch saying that if your observation is on this branch then it should be classified as 1, and the other branch saying the exact opposite, both new branches being more accurate than the one before the split).
|
||||
|
||||
`Cover` measure the relative quantity of observations concerned by a feature.
|
||||
`Cover` measures the relative quantity of observations concerned by a feature.
|
||||
|
||||
`Frequence` is a simpler measure than `Gain`. It just counts the number of times a feature is used in all generated trees. You should not use it (unless you know why you want to use it).
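If you prefer looking at the numbers directly, here is a quick sketch (assuming the `importance` table computed above, whose columns include `Feature` and `Gain`):

```{r, eval=FALSE}
# keep only the feature names and their Gain, ordered by decreasing Gain
importance[order(-Gain), list(Feature, Gain)]
```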
|
||||
|
||||
@ -166,13 +171,13 @@ Plotting the feature importance
|
||||
|
||||
All these things are nice, but it would be even better to plot the results. Fortunately, such a function already exists.
|
||||
|
||||
```{r}
|
||||
```{r, fig.width=8, fig.height=5, fig.align='center'}
|
||||
xgb.plot.importance(importance_matrix = importance)
|
||||
```
|
||||
|
||||
Feature have been automatically divided in 2 clusters: the interesting features... and the others.
|
||||
Features have automatically been divided into 2 clusters: the interesting features... and the others.
|
||||
|
||||
> Depending of the case you may have more than two clusters.
|
||||
> Depending on the dataset and the learning parameters you may have more than two clusters.
> The default is to limit them to 10, but you can increase this limit. Look at the function documentation for more information.
|
||||
|
||||
According to the plot above, the most important feature in this dataset to predict if the treatment will work is:
|
||||
@ -182,7 +187,7 @@ According to the plot above, the most important feature in this dataset to predi
|
||||
* the sex is third but already included in the not interesting feature ;
|
||||
* then we see our generated features (AgeDiscret). We can see that their contribution is very low.
|
||||
|
||||
Does these results make sense?
|
||||
Do these results make sense?
|
||||
------------------------------
|
||||
|
||||
Let's check some **Chi2** between each of these features and the outcome.
|
||||
@ -208,7 +213,11 @@ c2 <- chisq.test(df$AgeCat, df$Y)
|
||||
print(c2)
|
||||
```
|
||||
|
||||
The perfectly random split I did between young and old at 30 years old have a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect as may be in my mind > 30 years is being old (I am 32 and starting feeling old, this may explain that), but for the illness we are studying, the age to be vulnerable is not the same. Don't let your *gut* lower the quality of your model. In *data science* expression, there is the word *science* :-)
|
||||
The perfectly random split I did between young and old at 30 years old has a low correlation of **`r round(c2$statistic, 2)`**. It's a result we may expect, as maybe in my mind being > 30 years old means being old (I am 32 and starting to feel old, this may explain it), but for the illness we are studying, the age of vulnerability is not the same.

Moral of the story: don't let your *gut* lower the quality of your model.
|
||||
|
||||
In *data science* expression, there is the word *science* :-)
|
||||
|
||||
Conclusion
|
||||
==========
|
||||
@ -232,12 +241,12 @@ Special Note: What about Random forest?
|
||||
|
||||
As you may know, the [Random Forest](http://en.wikipedia.org/wiki/Random_forest) algorithm is a cousin of boosting and both are part of the [ensemble learning](http://en.wikipedia.org/wiki/Ensemble_learning) family.
|
||||
|
||||
Both trains several decision trees for one dataset. The *main* difference is that in Random Forest, trees are independant and in boosting tree N+1 focus its learning on what has no been well modeled by tree N (and so on...).
|
||||
Both train several decision trees for one dataset. The *main* difference is that in Random Forest the trees are independent, while in boosting tree N+1 focuses its learning on the loss (= what has not been well modeled by tree N).
|
||||
|
||||
This difference have an impact on a corner case in feature importance analysis: the *correlated features*.
|
||||
This difference has an impact on feature importance analysis: the case of *correlated features*.
|
||||
|
||||
Imagine two perfectly correlated features, feature `A` and feature `B`. For one specific tree, if the algorithm needs one of them, it will choose randomly (true in both boosting and Random Forest).
|
||||
|
||||
However, in Random Forest this choice will be done plenty of times, because trees are independant. So the **importance** of a specific feature is diluted among features `A` and `B`. So you won't easily know they are important to predict what you want to predict.
|
||||
However, in Random Forest this random choice will be made for each tree, because each tree is independent from the others. Therefore, approximately, depending on your parameters, 50% of the trees will choose feature `A` and the other 50% will choose feature `B`. So the **importance** of the information contained in `A` and `B` (which is the same, because they are perfectly correlated) is diluted between `A` and `B`. So you won't easily know this information is important for predicting what you want to predict! It is even worse when you have 10 correlated features...
|
||||
|
||||
In boosting, when as aspect of your dataset have been learned by the algorithm, there is no more need to refocus on it. Therefore, all the importace will be on `A` or `B`. You will know that one of them is important, it is up to you to search for correlated features.
|
||||
In boosting, when a specific link between a feature and the outcome has been learned by the algorithm, it will try not to refocus on it (in theory that is what happens; reality is never that simple). Therefore, all the importance will be on `A` or on `B`. You will know that one feature has an important role in the link between your dataset and the outcome. It is still up to you to search for the features correlated to the one detected as important, if you need all of them.
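A quick way to see this behaviour in practice is sketched below (it uses the agaricus demo data shipped with the package; the duplicated column and its name are made up for the illustration):

```{r, eval=FALSE}
require(Matrix)
data(agaricus.train, package='xgboost')

# duplicate the first column so that two features are perfectly correlated
dup <- agaricus.train$data[, 1, drop = FALSE]
colnames(dup) <- "duplicated.feature"
data.dup <- cbind(agaricus.train$data, dup)

bst.dup <- xgboost(data = data.dup, label = agaricus.train$label, max.depth = 2,
                   eta = 1, nround = 2, objective = "binary:logistic")

# for a feature the model actually uses, the importance should land on only one of the two copies
xgb.importance(colnames(data.dup), model = bst.dup)
```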
|
||||
|
||||
@ -24,9 +24,11 @@ body{
|
||||
/ color: white;
|
||||
|
||||
line-height: 1;
|
||||
max-width: 960px;
|
||||
max-width: 800px;
|
||||
padding: 20px;
|
||||
font-size: 17px;
|
||||
text-align: justify;
|
||||
text-justify: inter-word;
|
||||
}
|
||||
|
||||
|
||||
@ -34,7 +36,7 @@ p {
|
||||
line-height: 150%;
|
||||
/ max-width: 540px;
|
||||
max-width: 960px;
|
||||
font-weight: 400;
|
||||
font-weight: 400;
|
||||
/ color: #333333
|
||||
}
|
||||
|
||||
@ -90,10 +92,12 @@ a {
|
||||
padding: 0;
|
||||
vertical-align: baseline;
|
||||
}
|
||||
|
||||
a:hover {
|
||||
text-decoration: blink;
|
||||
color: green;
|
||||
}
|
||||
|
||||
a:visited {
|
||||
color: gray;
|
||||
}
|
||||
@ -110,48 +114,66 @@ ul {
|
||||
li {
|
||||
line-height:150%
|
||||
}
|
||||
|
||||
li ul, li ul {
|
||||
margin-left: 24px;
|
||||
}
|
||||
|
||||
pre {
|
||||
padding: 0px 24px;
|
||||
padding: 0px 10px;
|
||||
max-width: 800px;
|
||||
white-space: pre-wrap;
|
||||
}
|
||||
|
||||
code {
|
||||
font-family: Consolas, Monaco, Andale Mono, monospace;
|
||||
line-height: 1.5;
|
||||
font-size: 15px;
|
||||
background: #F0F0F0;
|
||||
border-radius: 4px;
|
||||
padding: 5px;
|
||||
display: inline-block;
|
||||
max-width: 800px;
|
||||
white-space: pre-wrap;
|
||||
}
|
||||
|
||||
code.r, code.cpp {
|
||||
display: block;
|
||||
word-wrap: break-word;
|
||||
background: #F8F8F8;
|
||||
border: 1px solid #606AAA;
|
||||
}
|
||||
|
||||
aside {
|
||||
display: block;
|
||||
float: right;
|
||||
width: 390px;
|
||||
}
|
||||
|
||||
blockquote {
|
||||
font-size:14px;
|
||||
border-left:.5em solid #606AAA;
|
||||
background: #f5f5f5;
|
||||
color:#bfbfbf;
|
||||
padding: 5px;
|
||||
margin-left:25px;
|
||||
background: #F8F8F8;
|
||||
padding-left: 1em;
|
||||
margin-left:10px;
|
||||
max-width: 500px;
|
||||
}
|
||||
|
||||
blockquote cite {
|
||||
font-size:14px;
|
||||
line-height:20px;
|
||||
line-height:10px;
|
||||
color:#bfbfbf;
|
||||
}
|
||||
|
||||
blockquote cite:before {
|
||||
content: '\2014 \00A0';
|
||||
/content: '\2014 \00A0';
|
||||
}
|
||||
|
||||
blockquote p {
|
||||
color: #666;
|
||||
}
|
||||
hr {
|
||||
/ width: 540px;
|
||||
/ width: 540px;
|
||||
text-align: left;
|
||||
margin: 0 auto 0 0;
|
||||
color: #999;
|
||||
|
||||
@ -214,3 +214,8 @@ competition.
|
||||
|
||||
\end{document}
|
||||
|
||||
<<Temp file cleaning, include=FALSE>>=
|
||||
file.remove("xgb.DMatrix")
|
||||
file.remove("model.dump")
|
||||
file.remove("model.save")
|
||||
@
|
||||
@ -1,27 +1,73 @@
|
||||
---
|
||||
title: "Xgboost presentation"
|
||||
output:
|
||||
html_document:
|
||||
output:
|
||||
rmarkdown::html_vignette:
|
||||
css: vignette.css
|
||||
number_sections: yes
|
||||
toc: yes
|
||||
bibliography: xgboost.bib
|
||||
author: Tianqi Chen, Tong He, Michaël Benesty
|
||||
vignette: >
|
||||
%\VignetteIndexEntry{Xgboost presentation}
|
||||
%\VignetteEngine{knitr::rmarkdown}
|
||||
\usepackage[utf8]{inputenc}
|
||||
---
|
||||
|
||||
Introduction
|
||||
============
|
||||
|
||||
This is an introductory document for using the `xgboost` package in *R*.
|
||||
|
||||
**Xgboost** is short for e**X**treme **G**radient **B**oosting package.
|
||||
|
||||
It is an efficient and scalable implementation of gradient boosting framework by @friedman2001greedy.
|
||||
|
||||
The package includes efficient *linear model* solver and *tree learning* algorithm. It supports various objective functions, including *regression*, *classification* and *ranking*. The package is made to be extendible, so that users are also allowed to define their own objectives easily.
|
||||
|
||||
It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competitions.
|
||||
|
||||
It has several features:
|
||||
|
||||
* Speed: it can automatically do parallel computation on *Windows* and *Linux*, with *OpenMP*. It is generally over 10 times faster than the classical `gbm`.
|
||||
* Input Type: it takes several types of input data:
|
||||
* *Dense* Matrix: *R*'s *dense* matrix, i.e. `matrix` ;
|
||||
* *Sparse* Matrix: *R*'s *sparse* matrix, i.e. `Matrix::dgCMatrix` ;
|
||||
* Data File: local data files ;
|
||||
* `xgb.DMatrix`: its own class (recommended).
|
||||
* Sparsity: it accepts *sparse* input for both *tree booster* and *linear booster*, and is optimized for *sparse* input ;
|
||||
* Customization: it supports customized objective function and evaluation function ;
|
||||
* Performance: it has better performance on several different datasets.
|
||||
|
||||
The purpose of this Vignette is to show you how to use **Xgboost** to make prediction from a model based on your own dataset.
|
||||
|
||||
You may know **Xgboost** as a state of the art tool to build some kind of Machine learning models. It has been [used](https://github.com/tqchen/xgboost) to win several [Kaggle](http://www.kaggle.com) competition.
|
||||
Installation
|
||||
============
|
||||
|
||||
For the purpose of this tutorial we will first load the required packages.
|
||||
The first step is of course to install the package.
|
||||
|
||||
For up-to-date version (which is *highly* recommended), install from Github:
|
||||
|
||||
```{r installGithub, eval=FALSE}
|
||||
devtools::install_github('tqchen/xgboost',subdir='R-package')
|
||||
```
|
||||
|
||||
> *Windows* users will need to install [RTools](http://cran.r-project.org/bin/windows/Rtools/) first.
|
||||
|
||||
For stable version on CRAN, run:
|
||||
|
||||
```{r installCran, eval=FALSE}
|
||||
install.packages('xgboost')
|
||||
```
|
||||
|
||||
For the purpose of this tutorial we will load **Xgboost** package.
|
||||
|
||||
```{r libLoading, results='hold', message=F, warning=F}
|
||||
require(xgboost)
|
||||
require(methods)
|
||||
```
|
||||
|
||||
In this example, we are aiming to predict whether a mushroom can be eated.
|
||||
In this example, we are aiming to predict whether a mushroom can be eaten or not (yeah I know, as in many tutorials, the example data are exactly what you will work on in your everyday life :-).
|
||||
|
||||
Mushroom data is cited from UCI Machine Learning Repository. @Bache+Lichman:2013.
|
||||
|
||||
Learning
|
||||
========
|
||||
@ -29,11 +75,14 @@ Learning
|
||||
Dataset loading
|
||||
---------------
|
||||
|
||||
We load the `agaricus` datasets and link it to variables.
|
||||
We will load the `agaricus` datasets embedded in the package and link them to variables.
|
||||
|
||||
The dataset is already separated in `train` and `test` data.
|
||||
The datasets are already separated in `train` and `test` data:
|
||||
|
||||
As their names imply, the train part will be used to build the model and the test part to check how well our model works. Without separation we would test the model on data the algorithm have already seen, as you may imagine, it's not the best methodology to check the performance of a prediction (would it even be a prediction?).
|
||||
* As their names imply, the `train` part will be used to build the model ;
|
||||
* the `test` part will be used to assess the quality of our model.
|
||||
|
||||
Without dividing the dataset we would test the model on data the algorithm has already seen. As you may imagine, it's not the best methodology to check the performance of a prediction (can it even be called a *prediction*?).
|
||||
|
||||
```{r datasetLoading, results='hold', message=F, warning=F}
|
||||
data(agaricus.train, package='xgboost')
|
||||
@ -42,113 +91,168 @@ train <- agaricus.train
|
||||
test <- agaricus.test
|
||||
```
|
||||
|
||||
> Each variable is a S3 object containing both label and data.
|
||||
> In the real world, it would be up to you to make this division between `train` and `test` data. The way you should do it is out of the scope of this article; however the `caret` package may [help](http://topepo.github.io/caret/splitting.html).
|
||||
|
||||
> In the real world, it would be up to you to make this division between `train` and `test` data.
|
||||
Each variable is a `list` containing both label and data.
|
||||
```{r dataList, message=F, warning=F}
|
||||
str(train)
|
||||
```
|
||||
|
||||
The loaded data is stored in `dgCMatrix` which is a **sparse matrix** type.
|
||||
Let's discover the dimensionality of our datasets.
|
||||
|
||||
Label is a `numeric` vector in `{0,1}`.
|
||||
```{r dataSize, message=F, warning=F}
|
||||
dim(train$data)
|
||||
dim(test$data)
|
||||
```
|
||||
|
||||
Clearly, we have a small dataset here; however **Xgboost** can manage huge ones very efficiently.
|
||||
|
||||
The loaded `data` is stored in a `dgCMatrix`, which is a *sparse* matrix type, and `label` is a `numeric` vector in `{0,1}`.
|
||||
|
||||
```{r dataClass, message=F, warning=F}
|
||||
class(train$data)[1]
|
||||
class(train$label)
|
||||
```
|
||||
|
||||
Basic Training using XGBoost
|
||||
`label` is the outcome of our dataset, meaning it is the binary *classification* we want to predict on future data.
|
||||
|
||||
Basic Training using Xgboost
|
||||
----------------------------
|
||||
|
||||
The most critical part of the process is the training.
|
||||
The most critical part of the process is the training one.
|
||||
|
||||
We are using the train data. Both `data` and `label` are in each data (explained above). To access to the field of a `S3` object we use the `$` character in **R**.
|
||||
We are using the `train` data. As explained above, both `data` and `label` are in a variable.
|
||||
|
||||
> label is the outcome of our dataset. It is the classification we want to predict. For these data we already have it, but when our model is built, that is this column we want to guess.
|
||||
|
||||
In sparse matrix, cells which contains `0` are not encoded. Therefore, in a dataset where there are plenty of `0`, dataset size is optimized. It is very usual to have such dataset. **Xgboost** can manage both dense and sparse matrix.
|
||||
In a *sparse* matrix, cells containing `0` are not encoded. Therefore, in a dataset with plenty of `0`s, memory size is optimized. It is very common to have such a dataset. **Xgboost** can manage both *dense* and *sparse* matrices.
|
||||
|
||||
```{r trainingSparse, message=F, warning=F}
|
||||
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
Alternatively, you can put your dataset in a dense matrix, i.e. a basic R-matrix.
|
||||
> To access the value of a variable in a `list`, use the `$` character followed by its name.
|
||||
|
||||
Alternatively, you can put your dataset in a *dense* matrix, i.e. a basic *R* matrix.
|
||||
|
||||
```{r trainingDense, message=F, warning=F}
|
||||
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2,
|
||||
objective = "binary:logistic")
|
||||
bstDense <- xgboost(data = as.matrix(train$data), label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
|
||||
```
|
||||
|
||||
Above, data and label are not stored together.
|
||||
|
||||
**Xgboost** offer a way to group them in a `xgb.DMatrix`. You can even add other meta data. It will be usefull for the most advanced features.
|
||||
**Xgboost** offers a way to group them in an `xgb.DMatrix`. You can even add other metadata in it. This will be useful for the more advanced features we will discover later.
|
||||
|
||||

```{r trainingDmatrix, message=F, warning=F}
dtrain <- xgb.DMatrix(data = train$data, label = train$label)
bstDMatrix <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")
```
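
As an aside (not covered by the original walkthrough), one kind of metadata you might attach is a vector of per-row weights; the sketch below assumes the `setinfo`/`getinfo` accessors of the **xgboost** R package and a made-up weighting scheme:

```r
# Hedged sketch: attach per-row weights to the DMatrix as extra metadata.
weights <- rep(1, nrow(train$data))
weights[1:100] <- 2                  # hypothetical: give more weight to the first 100 rows
setinfo(dtrain, "weight", weights)   # store the metadata in the DMatrix

head(getinfo(dtrain, "weight"))      # read it back to check
```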

**Xgboost** has plenty of features to help you see how the learning progresses internally. The purpose is to help you set the best parameters, which is key to the quality of the model you are building.

One of the simplest ways to see the training progress is to set the `verbose` option.

```{r trainingVerbose0, message=T, warning=F}
# verbose = 0, no message
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 0)
```

```{r trainingVerbose1, message=T, warning=F}
# verbose = 1, print evaluation metric
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 1)
```

```{r trainingVerbose2, message=T, warning=F}
# verbose = 2, also print information about tree
bst <- xgboost(data = dtrain, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic", verbose = 2)
```

Basic prediction using Xgboost
------------------------------

The main use of **Xgboost** is to predict data. For that purpose we will use the `test` dataset.

```{r predicting, message=F, warning=F}
pred <- predict(bst, test$data)

# size of the prediction vector
print(length(pred))

# limit display of predictions to the first 10
print(pred[1:10])
```

The only thing **Xgboost** does is a regression, yet we are facing a classification problem. If we think about these regression results, they are simply the probabilities of being classified as `1`.

Therefore, we will set the following rule: if the probability for a given observation is `> 0.5`, then the observation is classified as `1`, and as `0` otherwise.

```{r predictingTest, message=F, warning=F}
err <- mean(as.numeric(pred > 0.5) != test$label)
print(paste("test-error=", err))
```

> You can put data in a `Matrix`, a `sparseMatrix`, or an `xgb.DMatrix`.

> We remind you that the algorithm has never seen the `test` data before.

Here we have just computed a simple metric, the average error:

* `as.numeric(pred > 0.5)` applies our rule that when the probability (== prediction == regression) is over `0.5` the observation is classified as `1`, and as `0` otherwise ;
* `probabilityVectorPreviouslyComputed != test$label` computes the vector of errors between the true labels and the computed predictions ;
* `mean(vectorOfErrors)` computes the average error itself (see the expanded version just below).
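
The same computation, expanded into the named intermediate steps used in the list above (an illustrative restatement of the chunk above, not new functionality):

```r
# Step-by-step version of the one-liner computed in the chunk above.
prediction     <- as.numeric(pred > 0.5)     # apply the 0.5 threshold
vectorOfErrors <- prediction != test$label   # TRUE where the model is wrong
err            <- mean(vectorOfErrors)       # share of wrong predictions
print(paste("test-error=", err))
```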

The most important thing to remember is that **to perform a binary classification, you basically perform a regression and then apply a threshold**.

This metric is **`r round(err, 2)`** and is pretty low: our yummy mushroom model works well!

Multiclass classification works in a very similar way.
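
As a hedged sketch only (the data, labels and `num_class = 3` below are placeholders, not objects from this vignette), the call would differ from the binary case only in the `objective` and `num_class` parameters:

```r
# Hypothetical multiclass example: labels must be integers in 0..(num_class - 1).
# myData and myMultiClassLabel do not exist in this vignette; replace them with your own objects.
bstMulti <- xgboost(data = myData, label = myMultiClassLabel,
                    max.depth = 2, eta = 1, nround = 2,
                    objective = "multi:softmax", num_class = 3)
```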

Save and load models
--------------------

Maybe your dataset is big and it takes time to train a model on it? Maybe you are not a big fan of losing time redoing the same task again and again? In these cases, you will want to save your model and load it back when required.

Fortunately, **Xgboost** implements such functions.

```{r saveModel, message=F, warning=F}
# save model to binary local file
xgb.save(bst, "xgboost.model")
```

> The `xgb.save` function should return `r TRUE` if everything goes well, and crashes otherwise.

An interesting test to see how identical our saved model is to the original one would be to compare the two predictions.
```{r loadModel, message=F, warning=F}
# load binary model to R
bst2 <- xgb.load("xgboost.model")
pred2 <- predict(bst2, test$data)

# pred2 should be identical to pred
# And now the test
print(paste("sum(abs(pred2-pred))=", sum(abs(pred2-pred))))
```

```{r clean, include=FALSE}
# delete the created model
file.remove("./xgboost.model")
```

> result is `0`? We are good!

In some very specific cases, for instance when you want to pilot **Xgboost** from the `caret` package, you will want to save the model as an *R* `binary` vector. See below how to do it.

```{r saveLoadRBinVectorModel, message=F, warning=F}
# save model to R's raw vector
rawVec <- xgb.save.raw(bst)

# print class
print(class(rawVec))

# load binary model to R
bst3 <- xgb.load(rawVec)
pred3 <- predict(bst3, test$data)

# pred3 should be identical to pred
print(paste("sum(abs(pred3-pred))=", sum(abs(pred3-pred))))
```

> Again `0`? It seems that **Xgboost** works pretty well!

Advanced features
=================

@ -166,34 +270,46 @@
```{r}
dtrain <- xgb.DMatrix(data = train$data, label=train$label)
dtest <- xgb.DMatrix(data = test$data, label=test$label)
```

Measure learning progress with xgb.train
----------------------------------------

Both `xgboost` (simple) and `xgb.train` (advanced) functions train models.

One of the special features of `xgb.train` is the capacity to follow the progress of the learning after each round. Because of the way boosting works, there is a moment when having too many rounds leads to overfitting. You can see this feature as a cousin of the cross-validation method. The following techniques will help you to avoid overfitting, and to optimize training time by stopping it as soon as possible.

One way to measure progress in the learning of a model is to provide **Xgboost** with a second dataset, already classified. Therefore it can learn on the first dataset and test its model on the second one. Some metrics are measured after each round during the learning.

For the purpose of this example, we use the `watchlist` parameter. It is a list of `xgb.DMatrix` objects, each of them tagged with a name.

```{r watchlist, message=F, warning=F}
watchlist <- list(train=dtrain, test=dtest)

bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

**Xgboost** has computed at each round the same average error metric as seen above (we set `nround` to 2, which is why we have two lines of metrics here). Obviously, the `train-error` number is related to the training dataset (the one the algorithm learns from) and the `test-error` number to the test dataset.

Both training- and test-error metrics are very similar, and in some way it makes sense: what we have learned from the training dataset matches the observations from the test dataset.

If you do not get such results with your own dataset, you should think about how you divided it into training and test sets. Maybe there is something to fix. Again, the `caret` package may [help](http://topepo.github.io/caret/splitting.html); a sketch follows below.
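
For illustration only (this split is not part of the original walkthrough, and `myLabels` is a placeholder for your own outcome vector), a class-balanced split can be drawn with the `caret` function documented at the link above:

```r
# Hedged sketch: build a class-balanced train/test split with caret.
library(caret)

set.seed(123)
inTrain     <- createDataPartition(y = myLabels, p = 0.75, list = FALSE)  # 75% of rows for training
trainLabels <- myLabels[inTrain]
testLabels  <- myLabels[-inTrain]
```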

For a better understanding of the learning progression, you may want to have some specific metric or even use multiple evaluation metrics.

```{r watchlist2, message=F, warning=F}
bst <- xgb.train(data=dtrain, max.depth=2, eta=1, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

> `eval.metric` allows us to monitor two new metrics for each round, logloss and error.

Until now, all the learning we have performed was based on boosted trees. **Xgboost** implements a second algorithm, based on linear boosting. The only difference from the previous command is the `booster = "gblinear"` parameter (and the removal of the `eta` parameter).

```{r linearBoosting, message=F, warning=F}
bst <- xgb.train(data=dtrain, booster = "gblinear", max.depth=2, nround=2, watchlist=watchlist, eval.metric = "error", eval.metric = "logloss", objective = "binary:logistic")
```

In this specific case, linear boosting gets slightly better performance metrics than the decision-tree-based algorithm. In simple cases this will happen because there is nothing better than a linear algorithm to catch a linear link. However, decision trees are much better at catching a non-linear link between predictors and outcome. Check both implementations with your own dataset to get an idea of what to use.

Manipulating xgb.DMatrix
------------------------

@ -205,8 +321,11 @@ Like saving models, `xgb.DMatrix` object (which groups both dataset and outcome)
```{r}
xgb.DMatrix.save(dtrain, "dtrain.buffer")
# to load it in, simply call xgb.DMatrix
dtrain2 <- xgb.DMatrix("dtrain.buffer")
bst <- xgb.train(data=dtrain2, max.depth=2, eta=1, nround=2, watchlist=watchlist, objective = "binary:logistic")
```

```{r DMatrixDel, include=FALSE}
file.remove("dtrain.buffer")
```

### Information extraction

@ -229,13 +348,7 @@
You can dump the tree you learned using `xgb.dump` into a text file.

```{r}
xgb.dump(bst, with.stats = T)
```

> If you provide a path to the `fname` parameter, you can save the trees to your hard drive.
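
For example (a minimal hedged one-liner; the file name is arbitrary):

```r
# Write the text dump of the trees to disk instead of printing it.
xgb.dump(bst, fname = "model.dump.txt", with.stats = TRUE)
```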

Finally, you can check which features are the most important.

```{r featureImportance, message=T, warning=F}
importance_matrix <- xgb.importance(feature_names = train$data@Dimnames[[2]], model = bst)
print(importance_matrix)
xgb.plot.importance(importance_matrix)
```

References
==========

8
build.sh

@ -1,8 +1,8 @@
#!/bin/bash
# This is a simple script to make xgboost in MAC and Linux
# Basically, it first tries to make with OpenMP; if that fails, it disables OpenMP and makes again.
# This will automatically make xgboost for MAC users who don't have OpenMP support.
# In most cases, typing make will give what you want.

# download rabit